In computing, a roundoff error, also called rounding error, is the difference between the result produced by a given algorithm using exact arithmetic and the result produced by the same algorithm using finite-precision, rounded arithmetic.[1] Rounding errors are due to inexactness in the representation of real numbers and the arithmetic operations done with them. This is a form of quantization error. When using approximation equations or algorithms, especially when using finitely many digits to represent real numbers (which in theory have infinitely many digits), one of the goals of numerical analysis is to estimate computation errors. Computation errors, also called numerical errors, include both truncation errors and roundoff errors.
When a sequence of calculations with an input involving roundoff error is made, errors may accumulate, sometimes dominating the calculation. In ill-conditioned problems, significant error may accumulate.
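A minimal Python sketch of such accumulation, assuming IEEE 754 binary64 arithmetic (the summand 0.1 is only an illustrative choice):

```python
# 0.1 has no finite binary expansion, so the stored value is slightly
# off, and each addition rounds again; the errors accumulate.
total = 0.0
for _ in range(10):
    total += 0.1

print(total)         # 0.9999999999999999
print(total == 1.0)  # False
```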
In short, there are two major facets of roundoff errors involved in numerical calculations:[2]

- the ability of computers to represent both magnitude and precision of numbers is inherently limited;
- certain numerical manipulations are highly sensitive to roundoff errors, both for mathematical reasons and because of the way computers perform arithmetic operations.
The error introduced by attempting to represent a number using a finite string of digits is a form of roundoff error called representation error.[3] Here are some examples of representation error in decimal representations:
Notation | Representation | Approximation | Error
---|---|---|---
$1/7$ | 0.142 857 142 857... | 0.142 857 | 0.000 000 142 857...
$\ln 2$ | 0.693 147 180 559 945 309 41... | 0.693 147 | 0.000 000 180 559 945 309 41...
$\log_{10} 2$ | 0.301 029 995 663 981 195 21... | 0.3010 | 0.000 029 995 663 981 195 21...
$\sqrt[3]{2}$ | 1.259 921 049 894 873 164 76... | 1.25992 | 0.000 001 049 894 873 164 76...
$\sqrt{2}$ | 1.414 213 562 373 095 048 80... | 1.41421 | 0.000 003 562 373 095 048 80...
$e$ | 2.718 281 828 459 045 235 36... | 2.718 281 828 459 045 | 0.000 000 000 000 000 235 36...
$\pi$ | 3.141 592 653 589 793 238 46... | 3.141 592 653 589 793 | 0.000 000 000 000 000 238 46...
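The same effect can be inspected in Python, a small sketch assuming IEEE 754 binary64 doubles: `Decimal(x)` prints the exact value the machine actually stores.

```python
from decimal import Decimal

# The double nearest to 0.1 is not exactly 0.1; printing it through
# Decimal reveals the full representation error.
print(Decimal(0.1))
# 0.1000000000000000055511151231257827021181583404541015625
```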
Increasing the number of digits allowed in a representation reduces the magnitude of possible roundoff errors, but any representation limited to finitely many digits will still cause some degree of roundoff error for uncountably many real numbers. Additional digits used for intermediary steps of a calculation are known as guard digits.[4]
Rounding multiple times can cause error to accumulate.[5] For example, if 9.945309 is rounded to two decimal places (9.95), then rounded again to one decimal place (10.0), the total error is 0.054691. Rounding 9.945309 to one decimal place (9.9) in a single step introduces less error (0.045309). This can occur, for example, when software performs arithmetic in x86 80-bit floating-point and then rounds the result to IEEE 754 binary64 floating-point.
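The double-rounding effect can be reproduced with Python's decimal module, a sketch in which ROUND_HALF_UP is chosen to match the grade-school rounding used above:

```python
from decimal import Decimal, ROUND_HALF_UP

x = Decimal("9.945309")

# Round twice: first to two decimal places, then to one.
twice = x.quantize(Decimal("0.01"), rounding=ROUND_HALF_UP)
twice = twice.quantize(Decimal("0.1"), rounding=ROUND_HALF_UP)

# Round once, directly to one decimal place.
once = x.quantize(Decimal("0.1"), rounding=ROUND_HALF_UP)

print(twice, abs(twice - x))  # 10.0 0.054691
print(once, abs(once - x))    # 9.9 0.045309
```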
Compared with the fixed-point number system, the floating-point number system is more efficient in representing real numbers, so it is widely used in modern computers. While the real numbers $\mathbb{R}$ are infinite and continuous, a floating-point number system $F$ is finite and discrete, so representation error, which leads to roundoff error, occurs under the floating-point number system.

A floating-point number system $F$ is characterized by $4$ integers:

- $\beta$: the base or radix;
- $p$: the precision;
- $[L, U]$: the exponent range, where $L$ is the lower bound and $U$ is the upper bound.
Any $x \in F$ has the form

$$x = \pm (d_0.d_1 d_2 \ldots d_{p-1})_\beta \times \beta^E,$$

where each $d_i$ is an integer with $0 \leq d_i \leq \beta - 1$ for $i = 0, 1, \ldots, p-1$, and $E$ is an integer with $L \leq E \leq U$.

A floating-point number system is normalized if the leading digit $d_0$ is always nonzero unless the number is zero. Since the significand is $d_0.d_1 d_2 \ldots d_{p-1}$, the significand of a nonzero number in a normalized system satisfies $1 \leq \text{significand} < \beta$. Thus, the normalized form of a nonzero IEEE binary floating-point number is $\pm 1.bb\ldots b \times 2^E$, where $b \in \{0, 1\}$. Because $F$ is finite and discrete, a real number $x$ generally cannot be represented exactly and is replaced by a nearby floating-point number $fl(x)$. The total number of normalized floating-point numbers is

$$2(\beta - 1)\beta^{p-1}(U - L + 1) + 1,$$

where $2$ counts the choice of sign, $(\beta - 1)$ counts the possible choices of the leading digit, $\beta^{p-1}$ counts the possible choices of the remaining digits, $U - L + 1$ counts the possible choices of the exponent, and $1$ counts the case when the number is $0$.
In the IEEE standard the base is binary, i.e. $\beta = 2$, and normalization is used. The IEEE standard stores the sign, exponent, and significand in separate fields of a floating-point word, each of which has a fixed width (number of bits):
Precision | Sign (bits) | Exponent (bits) | Trailing significand field (bits)
---|---|---|---
Single | 1 | 8 | 23
Double | 1 | 11 | 52
Machine epsilon can be used to measure the level of roundoff error in the floating-point number system. Here are two different definitions.[1]
- The machine epsilon, denoted $\epsilon_{mach}$, is the maximum possible absolute relative error in representing a nonzero real number $x$ in a floating-point number system:
$$\epsilon_{mach} = \max_{x} \frac{|x - fl(x)|}{|x|}.$$
- The machine epsilon, denoted $\epsilon_{mach}$, is the smallest number $\epsilon$ such that $fl(1 + \epsilon) > 1$. Thus, $fl(1 + \delta) = fl(1) = 1$ whenever $|\delta| < \epsilon_{mach}$.
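A small Python sketch of the second definition, assuming binary64: halve a power-of-two candidate until adding it to 1 no longer changes the result. The loop stops at $2^{-52}$, which Python exposes as `sys.float_info.epsilon`.

```python
import sys

# Find the smallest power of two eps such that fl(1 + eps) > 1,
# by halving until the next candidate is absorbed into 1.0.
eps = 1.0
while 1.0 + eps / 2 > 1.0:
    eps /= 2

print(eps)                             # 2.220446049250313e-16 == 2**-52
print(eps == sys.float_info.epsilon)   # True
```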
There are two common rounding rules, round-by-chop and round-to-nearest. The IEEE standard uses round-to-nearest.
- Round-by-chop: the base-$\beta$ expansion of $x$ is truncated after the $(p-1)$-th digit. This rounding rule is biased because it always moves the result toward zero.
- Round-to-nearest: $fl(x)$ is set to the nearest floating-point number to $x$; when there is a tie, the floating-point number whose last stored digit is even is used. In the IEEE standard, where the base $\beta$ is $2$, this means that on a tie the result is rounded so that its last digit is $0$. This rule is more accurate but computationally more expensive, and it is unbiased because ties are not systematically rounded in one direction.

The following example, rounding $x$ to two significant digits, compares the two rules:
x | Round-by-chop | Roundoff error | Round-to-nearest | Roundoff error
---|---|---|---|---
1.649 | 1.6 | 0.049 | 1.6 | 0.049
1.650 | 1.6 | 0.050 | 1.6 | 0.050
1.651 | 1.6 | 0.051 | 1.7 | -0.049
1.699 | 1.6 | 0.099 | 1.7 | -0.001
1.749 | 1.7 | 0.049 | 1.7 | 0.049
1.750 | 1.7 | 0.050 | 1.8 | -0.050
Suppose round-to-nearest is used with IEEE double precision, and consider $(9.4)_{10} = (1001.\overline{0110})_2$, whose binary expansion repeats forever and therefore must be rounded to $52$ bits after the binary point in normalized form. The representation $fl(9.4)$ is obtained by discarding the infinite tail $0.\overline{1100} \times 2^{-52} \times 2^3 = 0.4 \times 2^{-48}$ and, since the first discarded bit is a $1$ followed by further nonzero bits, adding $1 \times 2^{-52} \times 2^3 = 2^{-49}$ in the rounding-up step. Then

$$fl(9.4) = 9.4 - 0.4 \times 2^{-48} + 2^{-49} = 9.4 + (0.2)_{10} \times 2^{-49}.$$

Thus, the roundoff error is $(0.2 \times 2^{-49})_{10} \approx 3.55 \times 10^{-16}$.
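This can be verified exactly in Python, a sketch using the standard decimal and fractions modules:

```python
from decimal import Decimal
from fractions import Fraction

# Exact value of the binary64 double nearest to 9.4:
print(Decimal(9.4))
# 9.4000000000000003552713678800500929355621337890625

# The representation error equals 0.2 * 2**-49 exactly.
err = Fraction(9.4) - Fraction(94, 10)
print(err == Fraction(2, 10) * Fraction(1, 2**49))  # True
```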
The machine epsilon $\epsilon_{mach}$ depends on the rounding rule in use. If round-by-chop is used, $\epsilon_{mach} = \beta^{1-p}$, while if round-to-nearest is used, $\epsilon_{mach} = \frac{1}{2}\beta^{1-p}$.
The bound for round-by-chop can be derived as follows. Let $x = d_0.d_1 d_2 \ldots d_{p-1} d_p d_{p+1} \ldots \times \beta^n \in \mathbb{R}$, where $n \in [L, U]$, be a positive real number, and let $fl(x) = d_0.d_1 d_2 \ldots d_{p-1} \times \beta^n$ be its round-by-chop representation. Chopping discards the tail $d_p.d_{p+1}\ldots \times \beta^{n-p}$. Since $d_0 \neq 0$, the significand of $x$ is at least $1$, while the discarded tail satisfies $d_p.d_{p+1}\ldots < (\beta-1).(\beta-1)\overline{(\beta-1)} = \beta$. Therefore

$$\frac{|x - fl(x)|}{|x|} \leq \frac{\beta}{1} \times \beta^{-p} = \beta^{1-p},$$

so $\epsilon_{mach} = \beta^{1-p}$ for round-by-chop.
Some numbers can be represented exactly by floating-point numbers; they are called machine numbers. Even so, performing floating-point arithmetic on machine numbers may lead to roundoff error in the final result.
Machine addition consists of lining up the decimal points of the two numbers to be added, adding them, and then storing the result again as a floating-point number. The addition itself can be done in higher precision but the result must be rounded back to the specified precision, which may lead to roundoff error.[1]
For example, consider adding $1$ and $2^{-53}$ in IEEE double precision, which has a $53$-bit significand ($p = 53$):

$$\begin{aligned} 1.\underbrace{00\ldots0}_{\text{52 bits}} \times 2^0 + 1.\underbrace{00\ldots0}_{\text{52 bits}} \times 2^{-53} &= 1.\underbrace{00\ldots0}_{\text{52 bits}} \times 2^0 + 0.\underbrace{00\ldots0}_{\text{52 bits}}1 \times 2^0 \\ &= 1.\underbrace{00\ldots0}_{\text{52 bits}}1 \times 2^0. \end{aligned}$$

The exact sum needs $53$ bits after the binary point, one more than can be stored, and it lies exactly halfway between the two nearest machine numbers, so round-to-nearest (ties to even) saves it as $1.\underbrace{00\ldots0}_{\text{52 bits}} \times 2^0$. Therefore, $1 + 2^{-53}$ evaluates to $1$ in IEEE double precision, and the roundoff error is $2^{-53}$.
This example shows that roundoff error can be introduced when adding a large number and a small number. The shifting of the decimal points in the significands to make the exponents match causes the loss of some of the less significant digits. The loss of precision may be described as absorption.[6]
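This absorption can be checked in Python (a sketch, assuming binary64):

```python
# 2**-53 is absorbed: the exact sum is halfway between 1 and the
# next double, and round-to-nearest (ties to even) returns 1.0.
print(1.0 + 2.0**-53 == 1.0)  # True
print(1.0 + 2.0**-52 == 1.0)  # False: 2**-52 is one ulp of 1.0
```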
Note that the addition of two floating-point numbers can produce roundoff error when their sum is an order of magnitude greater than that of the larger of the two.
For example, suppose $\beta = 10$ and $p = 2$, so that $fl(62) = 6.2 \times 10^1$ and $fl(41) = 4.1 \times 10^1$ are machine numbers. Exactly, $62 + 41 = 103$, but the sum requires three significant digits and must be stored as $fl(103) = 1.0 \times 10^2$, giving a roundoff error of $103 - fl(103) = 3$.
This kind of error can occur alongside an absorption error in a single operation.
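Python's decimal module can emulate this toy $\beta = 10$, $p = 2$ system directly, a sketch:

```python
from decimal import Decimal, getcontext

getcontext().prec = 2             # emulate beta = 10, p = 2
print(Decimal(62) + Decimal(41))  # 1.0E+2: the exact sum 103 is rounded
```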
In general, the product of two p-digit significands contains up to 2p digits, so the result might not fit in the significand.[1] Thus roundoff error will be involved in the result.
For example, take $\beta = 10$ and $p = 2$ with round-by-chop, and consider the machine numbers $fl(77) = 7.7 \times 10^1$ and $fl(88) = 8.8 \times 10^1$. The exact product is $77 \times 88 = 6776$, which does not fit in a $2$-digit significand, so the stored result is $fl(6776) = 6.7 \times 10^3$ and the roundoff error is $6776 - fl(6776) = 6776 - 6.7 \times 10^3 = 76$.
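In the same emulated system, with ROUND_DOWN standing in for round-by-chop (a sketch):

```python
from decimal import Decimal, getcontext, ROUND_DOWN

ctx = getcontext()
ctx.prec = 2
ctx.rounding = ROUND_DOWN         # round-by-chop
print(Decimal(77) * Decimal(88))  # 6.7E+3; the exact product is 6776
```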
In general, the quotient of two $p$-digit significands may contain more than $p$ digits, so roundoff error will be involved in the result. For example, with $\beta = 10$, $p = 2$, and round-by-chop: $1/3 = 0.333\ldots$, so $fl(1/3) = fl(0.333\ldots) = 3.3 \times 10^{-1}$, and the roundoff error is $0.333\ldots - 3.3 \times 10^{-1} = 0.00333\ldots$.
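The same toy system reproduces the division example (a sketch):

```python
from decimal import Decimal, getcontext, ROUND_DOWN

ctx = getcontext()
ctx.prec = 2
ctx.rounding = ROUND_DOWN        # round-by-chop
print(Decimal(1) / Decimal(3))   # 0.33; the tail 0.00333... is lost
```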
Absorption also applies to subtraction. For example, consider subtracting $2^{-60}$ from $1$ in IEEE double precision. Aligning the binary points would require more bits than the significand can hold, so the small operand is absorbed and the difference is stored as $\underbrace{1.00\ldots0}_{\text{53 bits}} \times 2^0$; that is, $fl(1 - 2^{-60}) = 1$, and the roundoff error is $-2^{-60}$.
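One line of Python confirms the absorption (a sketch, assuming binary64):

```python
# 2**-60 is far below one ulp of 1.0 (2**-52), so it is absorbed.
print(1.0 - 2.0**-60 == 1.0)  # True
```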
The subtraction of two nearly equal numbers is called subtractive cancellation.[1] When the leading digits are cancelled, the result may be too small to be represented exactly and it will just be represented as $0$.

For example, let $|\epsilon| < \epsilon_{mach}$ and consider the difference $(1+\epsilon) - (1-\epsilon)$. Exactly,

$$(1+\epsilon) - (1-\epsilon) = 1 + \epsilon - 1 + \epsilon = 2\epsilon,$$

but since neither $1+\epsilon$ nor $1-\epsilon$ can be distinguished from $1$ in the floating-point system,

$$fl((1+\epsilon) - (1-\epsilon)) = fl(1+\epsilon) - fl(1-\epsilon) = 1 - 1 = 0.$$

The exact answer $2\epsilon$ is nonzero whenever $\epsilon$ is, yet the computed answer is $0$: all significance is lost.
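A Python sketch of this complete loss of significance, taking $\epsilon = 2^{-60}$, which is below binary64's machine epsilon:

```python
eps = 2.0**-60       # |eps| < machine epsilon (2**-52)

exact = 2 * eps      # the true difference, 2**-59
computed = (1.0 + eps) - (1.0 - eps)

print(exact)     # 1.734723475976807e-18
print(computed)  # 0.0
```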
Even with a somewhat larger $\epsilon$, the result can still be significantly unreliable, because the rightmost digits of a floating-point number are the most uncertain. For example,

$$1.99999 \times 10^2 - 1.99998 \times 10^2 = 0.00001 \times 10^2 = 1 \times 10^{-5} \times 10^2 = 1 \times 10^{-3}.$$

The result $1 \times 10^{-3}$ has only one significant digit, and that digit comes entirely from the least reliable digits of the operands.
This is closely related to the phenomenon of catastrophic cancellation, in which the two numbers are known to be approximations.
Errors can be magnified or accumulated when a sequence of calculations is applied to an initial input with roundoff error due to inexact representation.
An algorithm or numerical process is called stable if small changes in the input only produce small changes in the output, and unstable if large changes in the output are produced.[7] For example, the computation of $f(x) = \sqrt{1+x} - 1$ using the direct formula is unstable near $x = 0$: $\sqrt{1+x}$ is then close to $1$, and the subtraction cancels most significant digits. The algebraically equivalent form

$$f(x) = \frac{x}{\sqrt{1+x} + 1}$$

avoids the subtraction of nearly equal numbers and is stable near $x = 0$.
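A Python sketch comparing the two formulas for a small input (the exact printed digits vary with the platform's math library, so the comments only indicate the qualitative behavior):

```python
import math

x = 1e-12

# Direct formula: sqrt(1 + x) is very close to 1, so the subtraction
# cancels most of the significant digits of the result.
naive = math.sqrt(1 + x) - 1

# Rewritten form avoids the subtraction and keeps full precision.
stable = x / (math.sqrt(1 + x) + 1)

print(naive, stable)  # the two values agree only in their first few digits
```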
Even if a stable algorithm is used, the solution to a problem may still be inaccurate due to the accumulation of roundoff error when the problem itself is ill-conditioned.
The condition number of a problem is the ratio of the relative change in the solution to the relative change in the input.[1] A problem is well-conditioned if small relative changes in input result in small relative changes in the solution. Otherwise, the problem is ill-conditioned.[1] In other words, a problem is ill-conditioned if its condition number is "much larger" than 1.
The condition number is introduced as a measure of the roundoff errors that can result when solving ill-conditioned problems.[2]