Home | Projects | Notes > Computer Architecture & Organization > Floating-Point Arithmetic and the Programmer
Floating-point number in the computer is an approximation not the true representation of the number.
Floating-point number addition is not associative. (The order the numbers are added mkae a difference!)
xxxxxxxxxx51(-1.5 x 10^38 + 1.5 x 10^38) + 1.0 = 12-1.5 x 10^38 + (1.5 x 10^38 + 1.0) = 03 -----------------4 Floating-point number format cannot express '1.0' portion5 of the addition result, so it gets truncated!If you are doing floating-point number calculations, check your input numbers and your answers.
Especially if you are using a parallel machine that dynamically assigns calculations to processors.
Similar issues can happen when you try to assign a number with more precision to another with less precision. (This happens regardless of the data type.)
The calculation used for error propagation with computer calculations are the same formula as for errors in other scientific fields.
xxxxxxxxxx71X, Y : Real values2X', Y': Computer representations of the real values3Rx, Ry: Difference between the real value and the represented values4 (errors)5
6X' = X + Rx7Y' = Y + RyRelative errors can be calculated by:
xxxxxxxxxx21Rx/X * 100 (%)2Ry/Y * 100 (%)If we do this calculation X * Y, in reality what is calculated is:
xxxxxxxxxx61X' * Y' (where X' = X + Rx and Y' = Y + Ry)2= (X + Rx)(Y + Ry)3= X*Y + X*RY + Y*Rx + Rx*Ry4= X*Y + X*RY + Y*Rx (Rx*Ry is small enough to ignore) 5 -----------6 Error Relative error can be calculated by:
xxxxxxxxxx11(X*RY + Y*Rx) / (X*Y) * 100 (%)The addition error propagations are much smaller than the multiplication error propagations.
To calculate the error propagation for a polynomial we can take the first derivative of the function and put in the value to be calculated and multiply by the estimated error.
Example:
xxxxxxxxxx11f(x) = 2x^3 + 4x^2 + 3x + 2 To determine the estimated error for when x = 2:
xxxxxxxxxx51f'(x) = 6x^2 + 8x + 3 2f'(2) = 6*4 + 8*2 + 3 3 = 43*Rx4 -----5 ErrorWe can do this because
Rx^2or greater does not contribute much to the overall error calculation.The math behind it is the Taylor series expansion.
Taylor series expansion is used to calculate functions such as sqrt(x) and sin(x).
As you expand more and more, you should be getting the more accurate value but remember that there will also be more propagations of errors from the calculation of each term.
Also, more calculations means more time!

Another way to make this calculation faster and less prone to errors, look-up tables stored in ROM can be used.
Instead of doing a calculation, all there is required is a memory access. Memory access time turned out to be faster than doing all these calculations.
Linear interpolation can then be used to better approximate values in between the stored values.
Example:
xxxxxxxxxx21cos(30) = 0.8660254037844392cos(35) = 0.819152044288992If we need to calculate cos(32):
(Note that cos() is not a linear function (not a straight line) but we use a straight line to get an estimation of the actual value.)
xxxxxxxxxx71(0.8660 - 0.8192) / 5 = 0.0094 2 ------3 Value for every degree over this range 30-354
50.8660 - 0.0094*2 = 0.84726 ------7 Linear interpolated value of cos(32)The actual value is cos(32) = 0.848048096156426. It is not bad if all you really need is two significant digits and speed is more important the accuracy.
Chebyshev series
Better behave than Taylor series.
Errors are distributed more evenly.
Fewer terms are required to converge the series.
US currency
Change the units from dollars to cents.
It will be defined as an integer instead of a floating point.
Just remember to change the format before it is displayed so that it meets the convention.
Distance
Change the units so that integer calculations can be used.
e.g., Tenths of miles - 35.4 miles, store 354.
e.g., Thousands of miles - Assuming the least 3 significant digits are not important and would still fit in 32-bit unsigned number. So, 45698734 would become 45698.
e.g., Little easier with the metric system since everything is already on power of 10. So, instead of keeping the units in kilometers you can do it in meters.
If there are still not enough digits, then you can build your own
If calculation speed and accurate representation are needed, then this type of approach works well. The drawback is keeping track of the units, especially if there are calculations that take place. Make sure you are using compatible units.
e.g., You have keeping a distance in tenths of feet but to use