4-6 ERRORS IN FLOATING POINT COMPUTATIONS
******************************************

Excerpt from The Art of Computer Programming by Donald E. Knuth:
----------------------------------------------------------------
"Floating-point computation is by nature inexact, and it is not
 difficult to misuse it so that the computed answers consist almost
 entirely of 'noise'.  One of the principal problems of numerical
 analysis is to determine how accurate the results of certain
 numerical methods will be; a 'credibility gap' problem is involved
 here: we don't know how much of the computer's answers to believe.
 Novice computer users solve this problem by implicitly trusting in
 the computer as an infallible authority; they tend to believe all
 digits of a printed answer are significant.  Disillusioned computer
 users have just the opposite approach; they are constantly afraid
 their answers are almost meaningless."


Properties of floating-point arithmetic
---------------------------------------
Real numbers have a lot of 'good' properties; which of them are
retained by floating-point arithmetic?

  Topological                       Validity
  -----------                       -----------------------------------
  Connectivity                      no    All points are isolated
  Completeness                      no    No converging sequences

  Field axioms:
  -------------
  Closure under addition            no    We may have Overflow/Underflow
  Associativity of addition         no    (a + b) + c .NE. a + (b + c)
  Additive commutativity            yes   a + b .EQ. b + a
  Unique zero element               yes   a + 0 .EQ. a
  Unique additive inverse           yes   a + (-a) .EQ. 0
  Closure under multiplication      no    We may have Overflow/Underflow
  Associativity of multiplication   no    (a * b) * c .NE. a * (b * c)
  Multiplicative commutativity      yes   a * b .EQ. b * a
  Unique unit element               yes   a * 1 .EQ. a
  Unique multiplicative inverse     yes   a * (1/a) .EQ. 1
  Distributivity                    no    a * (b + c) .NE. (a * b) + (a * c)

  Ordered field axioms:
  ---------------------
  Completeness                      yes
  Transitivity                      yes   (a .GE. b) .AND. (b .GE. c)
                                          implies (a .GE. c)
  Density                           no
  Translation invariance            yes
  Scale invariance
  Triangle inequality
  Archimedes' axiom                 yes

We see that some of the basic properties of real numbers are missing.
Some properties that are usually derived from the basic axioms are
still true:

  (x * y) .EQ. 0.0               IS EQUIVALENT TO
                                 (x .EQ. 0.0) .OR. (y .EQ. 0.0)

  (x .LE. y) .AND. (z .GT. 0.0)  IMPLIES
                                 (x * z) .LE. (y * z)

  (x .EQ. y)                     IS EQUIVALENT TO
                                 (x - y) .EQ. 0.0   (If using denormals)

When you are doing floating-point arithmetic, bear in mind that you
are dealing with a discrete number system that "quantizes" every
value appearing in the calculation, mangling everything in its way.


A side remark
-------------
Division is non-associative; simple counter-examples are:

  1  =  (8 / 4) / 2   .NE.   8 / (4 / 2)  =  4
  4  =  (8 / 4) * 2   .NE.   8 / (4 * 2)  =  1

Be sure to supply parentheses to force the right order of evaluation!
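
The non-associativity of addition is just as easy to demonstrate with
a small program.  The following sketch is added here only as an
illustration (the program name ASSOC and the constants are arbitrary
choices, and it assumes ordinary single-precision arithmetic):

      PROGRAM ASSOC
C     ------------------------------------------------------------------
      REAL A, B, C
C     ------------------------------------------------------------------
      A = 1.0E20
      B = -1.0E20
      C = 1.0
C     The two groupings are mathematically equal, but the second one
C     loses C completely, because C is absorbed when added to B.
      WRITE (*,*) ' (A + B) + C = ', (A + B) + C
      WRITE (*,*) ' A + (B + C) = ', A + (B + C)
C     ------------------------------------------------------------------
      END

Both expressions are mathematically equal to 1.0; the first grouping
prints 1.0, while the second prints 0.0.
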
Sources of errors
-----------------
This is a schematic classification based on Dahlquist/Bjork:

A) Errors in the input data - measurement errors, errors introduced
   by the conversion of decimal data to binary, roundoff errors.

B) Roundoff errors during the computation.

C) Truncation errors - some approximation is inevitable when
   computing quantities involving limits and other infinite processes
   on a computer:

   1) Summing only a finite number of terms of an infinite series
   2) Discretization errors, e.g. approximating a derivative/integral
      by a difference quotient/finite sum
   3) Approximating functions by polynomials or linear functions

D) Simplifications in the mathematical model of the problem.

E) Human mistakes and machine malfunctions.


A bit of error analysis
-----------------------
We can visualize errors with intervals.  A number X that is known
with absolute error dX can be represented by the interval
(X - dX, X + dX), with a positive dX.

Addition of intervals is given by:

  (X-dX, X+dX) + (Y-dY, Y+dY)  =  (X+Y - (dX+dY), X+Y + (dX+dY))

Similarly, subtraction of intervals is given by:

  (X-dX, X+dX) - (Y-dY, Y+dY)  =  (X-Y - (dX+dY), X-Y + (dX+dY))

A good measure of accuracy is the ABSOLUTE RELATIVE ERROR (ARE).
If X is represented by the interval (X-dX, X+dX), the absolute
relative error is:

                  dX
   ARE(X)  =    ------
                abs(X)

The ARE is always positive since dX is.

The absolute relative error of addition is:

                   dX + dY
   ARE(X + Y) =  ----------
                 abs(X + Y)

which can be written as:

                   abs(X)       dX        abs(Y)       dY
   ARE(X + Y) =  ---------- * ------  +  ---------- * ------
                 abs(X + Y)   abs(X)     abs(X + Y)   abs(Y)

                 abs(X) * ARE(X)  +  abs(Y) * ARE(Y)
   ARE(X + Y) =  ------------------------------------
                             abs(X + Y)

In a similar way:

                 abs(X) * ARE(X)  +  abs(Y) * ARE(Y)
   ARE(X - Y) =  ------------------------------------
                             abs(X - Y)

Note that these formulas don't take into account the errors
introduced by the addition/subtraction operations themselves; they
are purely error-analytic.  To get the total error you have to add a
term whose magnitude will be estimated in the next section:

   err(X + Y)  =  ARE(X + Y) + round(X + Y)

Other error measures used in error analysis are the relative error,
ULP (Units in the Last Place), and the number of significant digits.


Estimating floating-point roundoff errors
-----------------------------------------
The IEEE floating-point representation of a real number X is:

   X = ((-1) ** S) * 1.ffff...fff * (2 ** e)
                       |        |
                       +--------+
                      t binary digits

We have t binary digits instead of an infinite series; that is the
cause of the roundoff error.  An upper bound for the roundoff error
is the maximum value the truncated tail could have:

   Upper-Bound  <=  (2 ** (-t)) * (2 ** e)  =  2 ** (e - t)

The (2 ** (-t)) term is the maximal value the tail could have; it is
just the sum of an infinite geometric series composed of binary 1's.

Note that this is an upper bound for the roundoff error made in ONE
rounding operation.  Since almost every arithmetic operation involves
rounding, this bound will get propagated and multiplied.

The significand 1.ffff...fff lies in the range [1,2), and so abs(X)
is of the same order of magnitude as (2 ** e).  We can denote this by
abs(X) ~ (2 ** e).  The maximum absolute relative error of one
rounding is therefore approximately:

                2 ** (e - t)
   round(X)  =  ------------   ~   2 ** (-t)
                   abs(X)
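
This bound is easy to observe directly.  The following sketch is
added here for illustration (the program name MCHEPS is arbitrary,
DO WHILE is a Fortran 90 construct that most compilers accept, and a
compiler that evaluates the test in extended precision may report a
smaller value).  It halves EPS until adding it to 1.0 no longer makes
a difference:

      PROGRAM MCHEPS
C     ------------------------------------------------------------------
      REAL EPS
C     ------------------------------------------------------------------
      EPS = 1.0
C     Halve EPS until 1.0 + EPS/2 rounds back to 1.0; the last EPS
C     that still makes a difference is of the order of 2 ** (-t).
      DO WHILE (1.0 + EPS / 2.0 .GT. 1.0)
         EPS = EPS / 2.0
      ENDDO
C     ------------------------------------------------------------------
      WRITE (*,*) ' ESTIMATED ROUNDOFF UNIT: ', EPS
C     ------------------------------------------------------------------
      END

On an IEEE single-precision machine the printed value should be about
1.2E-07, i.e. 2 ** (-23), corresponding to t = 23 fraction bits.
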
Relative error in addition and subtraction
------------------------------------------
On machines that don't use guard digits, e.g. Crays (which also have
problems with division), the relative error introduced by the
operation of addition (even if the operands have no errors) can be as
large as 1.0, i.e. a relative error of 100% !

Machines with one guard digit will have at most a small relative
error of 2 * (2 ** (-t)) due to the addition operation.  See the
article by David Goldberg for a proof of these results.

Even on "good" machines, when you add numbers carrying rounding
errors from previous calculations (an unavoidable situation), things
may get very bad.  This is due to an error-analytic fact that has
nothing to do with the way machines perform addition:

                 abs(X) * ARE(X)  +  abs(Y) * ARE(Y)
   ARE(X + Y) =  ------------------------------------
                             abs(X + Y)

To simplify the argument, suppose ARE(X) is equal to ARE(Y), and
denote the common value by ARE(X|Y):

                 abs(X) + abs(Y)
   ARE(X + Y) =  ---------------  *  ARE(X|Y)
                   abs(X + Y)

Note that the total error associated with the addition should also
include the rounding error made by machine-adding X and Y.

If X and Y have the same sign, we get ARE(X|Y) again.  However, if X
and Y have opposite signs and similar magnitudes, the value of the
expression can be VERY LARGE, since the denominator is small.  It
can't get arbitrarily large, since X and Y belong to a discrete
number system and so can't have arbitrarily close (but unequal)
absolute values, but that is a small consolation.

The amplification of previous roundoff errors by subtracting nearly
equal numbers is sometimes called "catastrophic cancellation".  If no
previous roundoff errors are present (an unlikely situation) we have
"benign cancellation".  The strange terminology was probably invented
when floating-point arithmetic was less understood.

+----------------------------------------------------------------+
| Addition of two floating-point numbers with similar magnitudes  |
| and opposite signs may cause a LARGE PRECISION LOSS.            |
|                                                                  |
| Subtraction of two floating-point numbers with similar          |
| magnitudes and equal signs may cause a LARGE PRECISION LOSS.    |
+----------------------------------------------------------------+
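
The effect is easy to reproduce.  The following sketch is added here
for illustration (the program name CANCEL, the function 1 - COS(X)
and the sampled values of X are arbitrary choices).  It computes
1 - COS(X) once by direct subtraction of two nearly equal numbers,
and once by the mathematically equivalent form 2 * SIN(X/2) ** 2,
which subtracts nothing:

      PROGRAM CANCEL
C     ------------------------------------------------------------------
      INTEGER I
      REAL X, DIRECT, SAFE
C     ------------------------------------------------------------------
      X = 1.0
      DO I = 1, 8
C        Direct form: COS(X) is close to 1.0, so the subtraction
C        cancels most of the significant digits.
         DIRECT = 1.0 - COS(X)
C        Rewritten form: no subtraction of nearly equal numbers.
         SAFE = 2.0 * SIN(X / 2.0) ** 2
         WRITE (*,*) X, DIRECT, SAFE
         X = X / 10.0
      ENDDO
C     ------------------------------------------------------------------
      END

For the smallest values of X the direct form loses all of its
significant digits (in single precision it may even print exactly
0.0), while the rewritten form stays accurate.
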
A good example - computing a derivative
---------------------------------------
A good example is provided by the numerical computation of a
derivative, which involves the subtraction of two very close numbers.
The subtraction generates a relative error that increases as the step
size (h) decreases.  On the other hand, the truncation error of the
formula decreases with decreasing h.  These two opposing "effects"
produce a minimum for the total error, located well above the
precision threshold.

      PROGRAM NUMDRV
      INTEGER I
      REAL X, H, DYDX, TRUE
C     ------------------------------------------------------------------
      X = 0.1
      H = 10.0
      TRUE = COS(X)
C     ------------------------------------------------------------------
      DO I = 1, 15
         DYDX = (SIN(X + H) - SIN(X - H)) / (2.0 * H)
         WRITE (*,'(1X,I3,E12.2,F17.7,F17.5)')
     &         I, H, DYDX, 100.0 * (DYDX - TRUE) / TRUE
         H = H / 10.0
      ENDDO
C     ------------------------------------------------------------------
      WRITE (*,*) ' *** ANALYTICAL DERIVATIVE IS: ', TRUE
C     ------------------------------------------------------------------
      END

Following are the results on a classic computer (VAX).  The second
column is the step size, the third is the computed value, and the
fourth is the relative error in percent:

     1    0.10E+02      -0.0541303      -105.44022
     2    0.10E+01       0.8372672       -15.85290
     3    0.10E+00       0.9933466        -0.16659
     4    0.10E-01       0.9949874        -0.00168
     5    0.10E-02       0.9950065         0.00023
     6    0.10E-03       0.9949879        -0.00164
     7    0.10E-04       0.9950252         0.00211
     8    0.10E-05       0.9946526        -0.03533
     9    0.10E-06       0.9313227        -6.40012
    10    0.10E-07       0.7450581       -25.12010
    11    0.10E-08       0.0000000      -100.00000
    12    0.10E-09       0.0000000      -100.00000
    13    0.10E-10       0.0000000      -100.00000
    14    0.10E-11       0.0000000      -100.00000
    15    0.10E-12       0.0000000      -100.00000

   *** Analytical derivative is:  0.9950042

The results can be divided into 4 ranges:

    1 -  2   The step 'h' is just too large; truncation error dominates
    3 -  8   Useful region; the optimal value of 'h' is in the middle
    9 - 10   Onset of severe numerical problems near the precision
             threshold
   11 - 15   Computation is completely trashed


Different magnitudes problem
----------------------------
A small program will illustrate another effect of roundoff errors:

      PROGRAM FLTADD
C     ------------------------------------------------------------------
      INTEGER I
      REAL X
C     ------------------------------------------------------------------
      X = 1.0
      DO I = 1, 20
         WRITE (*,*) ' 1.0 + ', X, ' = ', 1.0 + X
         X = X / 10.0
      ENDDO
C     ------------------------------------------------------------------
      END

We see that addition (subtraction) of two floating-point numbers with
very different magnitudes can be very inaccurate; the smaller number
may effectively be treated as zero.  Whenever the difference between
the binary exponents is larger than the number of bits in the
mantissa, the smaller number gets shifted out completely upon
addition (subtraction).

For example, when summing the terms of a sequence of numbers that
begins with some large terms and then goes down, we may get
'effective truncation'.  The addition of the small terms to the large
partial sum may leave it unchanged:

      PROGRAM TRNCSR
      INTEGER I
      REAL SUM, AN
C     ------------------------------------------------------------------
C     Statement function giving the terms of the series
      AN(I) = 1000.0 * EXP(-REAL(I))
C     ------------------------------------------------------------------
      SUM = 0.0
C     ------------------------------------------------------------------
      DO I = 1, 21
         SUM = SUM + AN(I)
         WRITE (*,*) I, AN(I), SUM
      ENDDO
C     ------------------------------------------------------------------
      END

The output on a classic machine (VAX):

     1   367.8795        367.8795
     2   135.3353        503.2148
     3   49.78707        553.0018
     4   18.31564        571.3174
     5   6.737947        578.0554
     6   2.478752        580.5342
     7   0.9118820       581.4460
     8   0.3354626       581.7815
     9   0.1234098       581.9049
    10   4.5399930E-02   581.9503
    11   1.6701700E-02   581.9670
    12   6.1442126E-03   581.9732
    13   2.2603294E-03   581.9755
    14   8.3152874E-04   581.9763
    15   3.0590233E-04   581.9766
    16   1.1253518E-04   581.9767
    17   4.1399377E-05   581.9768
    18   1.5229979E-05   581.9768
    19   5.6027966E-06   581.9768
    20   2.0611537E-06   581.9768
    21   7.5825608E-07   581.9768

We can see that the series is effectively truncated while its terms
are still fairly large, roughly two orders of magnitude above the
machine precision.

Another example is the following "identity":

        1 - (1 - e)
   1 = -------------  #  0        (e is a small enough number)
        (1 - 1) + e

where '=' denotes the exact value and '#' denotes the result of a
floating-point computation.

A more subtle problem occurs when the smaller number doesn't get
"nullified" but just gets rounded off with a large error.

+-----------------------------------------------------------+
| When adding (subtracting) several floating-point numbers,  |
| group them according to relative size, so that as many     |
| operations as possible are performed between numbers of    |
| similar magnitude.                                         |
+-----------------------------------------------------------+
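
A simple experiment shows why the ordering matters.  The following
sketch is added here for illustration (the program name SUMORD, the
series 1/I and the number of terms are arbitrary choices).  It sums
the same one million terms once from the largest term down and once
from the smallest term up:

      PROGRAM SUMORD
C     ------------------------------------------------------------------
      INTEGER I
      REAL FWD, BWD
C     ------------------------------------------------------------------
C     Largest terms first: the small terms are added to an already
C     large partial sum, and much of their contribution is lost.
      FWD = 0.0
      DO I = 1, 1000000
         FWD = FWD + 1.0 / REAL(I)
      ENDDO
C     Smallest terms first: the small terms accumulate among
C     themselves before they meet the large ones.
      BWD = 0.0
      DO I = 1000000, 1, -1
         BWD = BWD + 1.0 / REAL(I)
      ENDDO
C     ------------------------------------------------------------------
      WRITE (*,*) ' LARGE-TO-SMALL SUM: ', FWD
      WRITE (*,*) ' SMALL-TO-LARGE SUM: ', BWD
C     ------------------------------------------------------------------
      END

In single precision the two results should differ noticeably; the
small-to-large order is the one closer to the true value (about
14.3927).
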
Underflow problem
-----------------
Many computers replace an underflowed result with zero by default; it
seems very plausible to replace a number that is smaller than your
minimal representable number with zero.  However, if we replace the
number X with 0.0, we create a relative error of:

                      X - 0.0
   Relative Error  =  -------  =  1.0        (A 100% error!)
                         X

The following small program computes two MATHEMATICALLY IDENTICAL
expressions; the results are illuminating.  Replace the constant
'TINY' by 0.3E-038 on old DEC machines, and by 0.1E-044 on new IEEE
workstations.

      PROGRAM UNDRFL
C     ------------------------------------------------------------------
      REAL
     *     A, B, C, D, X,
     *     LEFT, RIGHT
C     ------------------------------------------------------------------
      A = 0.5
      B = TINY
      C = 1.0
      D = TINY
      X = TINY
C     ------------------------------------------------------------------
      LEFT  = (A * X + B) / (C * X + D)
      RIGHT = (A + (B / X)) / (C + (D / X))
      WRITE (*,*) ' X = ', X, ' LEFT = ', LEFT, ' RIGHT = ', RIGHT
C     ------------------------------------------------------------------
      END

A difference of 50% is a little unexpected, isn't it?

If the FPU underflows 'gracefully' (using 'denormalized' numbers),
the errors develop and become large only many orders of magnitude
below the 'least usable number'.

+----------------------------------------+
| Enable underflow checking if possible; |
| if not, check manually for underflow.  |
+----------------------------------------+
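
To watch gradual underflow at work, the following sketch may help
(added here for illustration; the program name GRDUND is arbitrary,
and the starting value assumes IEEE single precision).  It repeatedly
halves a small number until it reaches zero:

      PROGRAM GRDUND
C     ------------------------------------------------------------------
      INTEGER I
      REAL X
C     ------------------------------------------------------------------
C     Start just above the smallest normalized IEEE single-precision
C     number (about 1.2E-38) and keep halving; with gradual underflow
C     the value passes through the denormalized range, losing
C     significant digits, before it finally becomes 0.0.
      X = 1.0E-37
      DO I = 1, 35
         WRITE (*,*) I, X
         X = X / 2.0
      ENDDO
C     ------------------------------------------------------------------
      END

On a machine that flushes underflowed results to zero, the printed
values jump to 0.0 as soon as they leave the normalized range.
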
Error accumulation
------------------
Roundoff errors are more or less random.  The following example
program is a crude model that gives some insight into the
accumulation of random quantities:

      PROGRAM CUMERR
C     ------------------------------------------------------------------
      INTEGER
     *     i,
     *     seed
C     ------------------------------------------------------------------
      REAL
     *     x,
     *     sum
C     ------------------------------------------------------------------
      seed = 123456789
      sum  = 0.0
C     ------------------------------------------------------------------
      DO i = 1, 1000
         x = 2.0 * (RAN(seed) - 0.5)
         WRITE (*,*) x
         sum = sum + x
         IF (MOD(i,10) .EQ. 0) WRITE (*,*) '          ', sum
      END DO
C     ------------------------------------------------------------------
      END

RAN() is a random number generator (a VAX Fortran extension).  A
random sequence uniformly distributed in [-1, 1) is generated and
summed, and the sum is displayed every 10 iterations.

Examining the program's output, you will see that the partial sums
are not zero; they oscillate irregularly.  Most of the partial sums
are contained in the interval [-15, 15] (this may depend on the
random generator and the seed value), because long 'unbalanced'
sub-sequences are relatively rare.  The conclusion is that random
errors tend to cancel on a large scale, and accumulate on a small
scale.

However, it is easy to give an example in which the random errors DO
accumulate:

      PROGRAM ACCRND
C     ------------------------------------------------------------------
      INTEGER
     *     I
C     ------------------------------------------------------------------
      REAL
     *     X, TMP
C     ------------------------------------------------------------------
      X = 0.1
      WRITE (*,*) 'DIRECT COMPUTATION: ', 1000000.0 * X
C     ------------------------------------------------------------------
      TMP = 0.0
      DO I = 1, 1000000
         TMP = TMP + X
      ENDDO
      WRITE (*,*) 'LOOP COMPUTATION: ', TMP
C     ------------------------------------------------------------------
      END

The error accumulates to about 1%, and could accumulate further if
the number of loop iterations were made larger.  A more realistic
example is:

      PROGRAM NUMINT
C     ------------------------------------------------------------------
      INTEGER
     *     i, j
C     ------------------------------------------------------------------
      REAL
     *     interval,
     *     sum
C     ------------------------------------------------------------------
      DO i = 10, 2000, 20
         sum = 0.0
         interval = 3.1415926 / REAL(i)
         DO j = 1, i
            sum = sum + interval * SIN(j * interval)
         END DO
         WRITE (*,*) i, sum
      END DO
C     ------------------------------------------------------------------
      END

Here we do a crude numerical integration of SIN() over the range
[0, Pi].  The result should be 2.0 = (COS(0) - COS(Pi)), and the
computed value converges quite closely to it.  However, when the
number of intervals is increased further (higher i), the result
develops a slight oscillation due to roundoff errors.
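
Returning to the ACCRND example above: a simple check on where its
error comes from is to repeat the accumulation with a DOUBLE
PRECISION running sum.  The following sketch is added here for
comparison (the program name ACCDP is arbitrary; it is not one of the
original examples):

      PROGRAM ACCDP
C     ------------------------------------------------------------------
      INTEGER I
      REAL X
      DOUBLE PRECISION TMP
C     ------------------------------------------------------------------
      X = 0.1
C     Accumulate the same single-precision term in a double-precision
C     running sum; the accumulation error of the REAL loop disappears,
C     and what remains is essentially the error made when 0.1 was
C     stored as a single-precision constant.
      TMP = 0.0D0
      DO I = 1, 1000000
         TMP = TMP + DBLE(X)
      ENDDO
      WRITE (*,*) 'DOUBLE PRECISION LOOP: ', TMP
C     ------------------------------------------------------------------
      END

The double-precision loop should come out very close to 100000.0,
with only a small deviation left over from representing 0.1 in REAL.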