1. Introduction
Computer memory has a finite capacity. Real numbers in $\mathbb{R}$, however, do not in general have a finite representation. For example, consider representing a rational number such as $\frac{1}{3}$ as a decimal value: it is $0.3333\ldots$ – it is impossible to do so exactly! Since only a finite number of digits can be stored in computer memory, the reals have to be approximated in some fashion (rounded or truncated) when represented on a computer.
In this tutorial, we’ll go over the basic ideas of floating-point representation and learn the limits of floating-point accuracy when doing practical numerical computing.
2. Rounding and Chopping
There are two distinguishable ways of rounding off a real number $x$ to a given number $t$ of decimals.

In chopping (also called truncation), we simply leave off all decimals to the right of the $t$th digit. For example, $0.2397$ truncated to $3$ decimal places yields $0.239$.
In rounding to nearest, we choose the number with $t$ decimals which is closest to $x$. For example, consider rounding off $0.2397$ to $3$ decimal places. There are two possible candidates: $0.239$ and $0.240$. On the real number line, $0.240$ is at a distance of $0.0003$ from $0.2397$, whilst $0.239$ is at a distance of $0.0007$. So, $0.240$ is the nearest and we round to $0.240$.

Intuitively, we are comparing the fractional part of the number to the right of the $t$th digit, which is $0.7$ (in units of the $t$th decimal place), with $\frac{1}{2}$. As $0.7 > \frac{1}{2}$, we incremented the $t$th digit, which is $9$, by $1$ (the carry gives $0.240$).

What if we wished to round off to $1$ decimal place? Observe that the fractional part to the right of the first digit is $0.397$. As $0.397 < \frac{1}{2}$, we leave the first digit unchanged. So, the result is $0.2$, which is indeed nearest to $0.2397$.
In general, let $r$ be the fractional part of the number to the right of the $t$th digit (after the decimal point), expressed in units of the $t$th decimal place. If $r > \frac{1}{2}$, we increment the $t$th digit. If $r < \frac{1}{2}$, we leave the $t$th digit unchanged.

In the case of a tie ($r = \frac{1}{2}$), when $x$ is equidistant from two $t$-digit decimal numbers, we raise the $t$th decimal if it is odd or leave it unchanged if it is even. In this way, the error in rounding off a decimal number is positive or negative equally often.
Let’s see some more examples of rounding and chopping to make sure this sinks in. Assume that we are interested in rounding off all quantities to $3$ decimal places:

| $x$ | Rounding | Chopping |
|---|---|---|
| 0.2397 | 0.240 | 0.239 |
| 0.23750 | 0.238 | 0.237 |
| 0.23650 | 0.236 | 0.236 |
| 0.23652 | 0.237 | 0.236 |
| 3.14159 | 3.142 | 3.141 |
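The two rules in the table above are straightforward to reproduce with Python’s decimal module. This is a minimal sketch (the helper names are ours, and we pass the inputs as strings so the decimal values stay exact):

```python
from decimal import Decimal, ROUND_DOWN, ROUND_HALF_EVEN

def chop(x: str, places: int) -> Decimal:
    """Chopping: drop everything to the right of the t-th decimal."""
    q = Decimal(1).scaleb(-places)  # e.g. places=3 -> Decimal('0.001')
    return Decimal(x).quantize(q, rounding=ROUND_DOWN)

def round_nearest(x: str, places: int) -> Decimal:
    """Rounding to nearest, with ties resolved toward the even digit."""
    q = Decimal(1).scaleb(-places)
    return Decimal(x).quantize(q, rounding=ROUND_HALF_EVEN)

for s in ["0.2397", "0.23750", "0.23650", "0.23652", "3.14159"]:
    print(s, round_nearest(s, 3), chop(s, 3))
# 0.2397  0.240 0.239
# 0.23750 0.238 0.237
# 0.23650 0.236 0.236
# 0.23652 0.237 0.236
# 3.14159 3.142 3.141
```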
The difference between chopping and rounding has real-world implications! The Vancouver Stock Exchange index began trading in 1982 with a base value of $1000.000$. Although the underlying stocks were performing decently, the index had drifted down to the low 500s by the end of 1983. A computer program re-calculated the value of the index thousands of times each day, and the program used chopping instead of rounding to the nearest. A correctly rounded calculation gave a value of $1098.892$.
3. Computer Number Systems
In our daily life, we represent numbers using the decimal number system with base $10$. In the decimal system, we’re aware that if a digit $d$ stands $n$ digits to the right of the decimal point, the value it contributes is $d \cdot 10^{-n}$. For example, the sequence of digits $452.367$ means:

$$4 \cdot 10^2 + 5 \cdot 10^1 + 2 \cdot 10^0 + 3 \cdot 10^{-1} + 6 \cdot 10^{-2} + 7 \cdot 10^{-3}$$
In fact, any integer $\beta \ge 2$ (or even $\beta \le -2$) can be used as a base. Analogously, every real number $a$ has a unique representation of the form:

$$a = \pm \left( d_n \beta^n + d_{n-1} \beta^{n-1} + \dots + d_1 \beta + d_0 + d_{-1} \beta^{-1} + d_{-2} \beta^{-2} + \dots \right)$$

or compactly $a = \pm (d_n d_{n-1} \dots d_0 . d_{-1} d_{-2} \dots)_\beta$, where the coefficients $d_i$, the digits in the system with base $\beta$, are non-negative integers such that $0 \le d_i \le \beta - 1$.
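To make the positional rule concrete, here is a minimal Python sketch (the helper name `value_in_base` is ours, purely for illustration) that evaluates a digit string by summing the $d \cdot \beta^{k}$ terms:

```python
def value_in_base(digits: str, beta: int) -> float:
    """Evaluate a digit string like '452.367' in base `beta`."""
    int_part, _, frac_part = digits.partition(".")
    value = 0.0
    # Digits left of the point contribute d * beta**k for k = 0, 1, 2, ...
    for k, d in enumerate(reversed(int_part)):
        value += int(d) * beta ** k
    # Digits right of the point contribute d * beta**(-n) for n = 1, 2, ...
    for n, d in enumerate(frac_part, start=1):
        value += int(d) * beta ** (-n)
    return value

print(value_in_base("452.367", 10))  # ~452.367
print(value_in_base("11111010", 2))  # 250.0
print(value_in_base("0.00101", 2))   # 0.15625
```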
3.1. Conversion Algorithm Between Two Number Systems
Consider the problem of conversion between two number systems with different bases. For the sake of concreteness, let’s try to convert $(250)_{10}$ to the binary format. We may write:

$$250 = 1 \cdot 2^7 + 1 \cdot 2^6 + 1 \cdot 2^5 + 1 \cdot 2^4 + 1 \cdot 2^3 + 0 \cdot 2^2 + 1 \cdot 2^1 + 0 \cdot 2^0$$

We can pull out a factor of $2$ from each term except the last, and equivalently write:

$$250 = 2 \left( 1 \cdot 2^6 + 1 \cdot 2^5 + 1 \cdot 2^4 + 1 \cdot 2^3 + 1 \cdot 2^2 + 0 \cdot 2^1 + 1 \cdot 2^0 \right) + 0$$
Intuitively, therefore, if we were to divide $250$ by $2$, the expression in brackets, let’s call it $q_1$, is the quotient and $0$ is the remainder of the division. Similarly, since $q_1 = 125 = 2 \left( 1 \cdot 2^5 + 1 \cdot 2^4 + 1 \cdot 2^3 + 1 \cdot 2^2 + 1 \cdot 2^1 + 0 \cdot 2^0 \right) + 1$, division by $2$ would return the expression in the brackets as the quotient; call it $q_2$, and $1$ as the remainder.
In general, if $a$ is an integer given in base $10$ and we want to determine its representation in a number system with base $\beta$, we perform successive divisions of $a$ by $\beta$: set $q_0 = a$ and compute

$$q_k = \beta \cdot q_{k+1} + d_k, \qquad k = 0, 1, 2, \ldots$$

where $q_{k+1}$ is the quotient and $d_k$ is the remainder in the division of $q_k$ by $\beta$. The process stops when the quotient becomes $0$, and the remainders read in reverse order give the digits of the representation.
Let’s look at the result of applying the algorithm to $a = 250$ with $\beta = 2$:

| $k$ | $q_k$ | $q_{k+1} = \lfloor q_k / 2 \rfloor$ | $d_k$ |
|---|---|---|---|
| 0 | 250 | 250/2 = 125 | 0 |
| 1 | 125 | 125/2 = 62 | 1 |
| 2 | 62 | 62/2 = 31 | 0 |
| 3 | 31 | 31/2 = 15 | 1 |
| 4 | 15 | 15/2 = 7 | 1 |
| 5 | 7 | 7/2 = 3 | 1 |
| 6 | 3 | 3/2 = 1 | 1 |
| 7 | 1 | 1/2 = 0 | 1 |
Therefore, $(250)_{10} = (11111010)_2$.
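Here is a minimal Python sketch of the successive-division algorithm (the helper name `int_to_base` is our own):

```python
def int_to_base(a: int, beta: int) -> str:
    """Convert a non-negative integer `a` to its digit string in base `beta`
    by repeated division, collecting the remainders."""
    if a == 0:
        return "0"
    digits = []
    while a > 0:
        a, d = divmod(a, beta)  # quotient q_{k+1} and remainder d_k
        digits.append(str(d))
    # Remainders come out least-significant first, so reverse them.
    return "".join(reversed(digits))

print(int_to_base(250, 2))  # 11111010
```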
3.2. Converting Fractions to Another Number System
If the real number $a$ is not an integer, we write it as $a = b + c$, where $b$ is the integer part and

$$c = (0.d_{-1} d_{-2} d_{-3} \ldots)_\beta$$

is the fractional part, where the digits $d_{-1}, d_{-2}, d_{-3}, \ldots$ are to be determined.

Observe that multiplying both sides by the base $\beta$ yields

$$\beta c = d_{-1} + (0.d_{-2} d_{-3} \ldots)_\beta,$$

an integer and a fractional portion. The integer portion is precisely $d_{-1}$ – the first digit of $c$ represented in base $\beta$. Consecutive digits are obtained as the integer parts when successively multiplying the fractional parts by $\beta$.
In general, if a fraction $c$ must be converted to another number system with base $\beta$, we perform successive multiplications of $c$ by $\beta$: set $c_0 = c$ and compute

$$\beta c_k = d_{-(k+1)} + c_{k+1}, \qquad k = 0, 1, 2, \ldots$$

where $d_{-(k+1)}$ is the integer part and $c_{k+1}$ is the fractional part of the product.
Let’s look at an example of converting $(0.15625)_{10}$ to the binary number system:

| $k$ | $c_k$ | $2 c_k$ | $d_{-(k+1)}$ | $c_{k+1}$ |
|---|---|---|---|---|
| 0 | 0.15625 | 0.3125 | 0 | 0.3125 |
| 1 | 0.3125 | 0.625 | 0 | 0.625 |
| 2 | 0.625 | 1.25 | 1 | 0.25 |
| 3 | 0.25 | 0.50 | 0 | 0.50 |
| 4 | 0.50 | 1.00 | 1 | 0.00 |
Therefore, $(0.15625)_{10} = (0.00101)_2$ in the binary system.
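Analogously, here is a minimal Python sketch of the successive-multiplication algorithm (the helper name `frac_to_base` and the digit cap are our own choices):

```python
def frac_to_base(c: float, beta: int, max_digits: int = 20) -> str:
    """Convert a fraction 0 <= c < 1 to its digit string in base `beta`
    by repeated multiplication, collecting the integer parts."""
    digits = []
    for _ in range(max_digits):
        if c == 0.0:
            break
        c *= beta
        d = int(c)       # integer part is the next digit
        digits.append(str(d))
        c -= d           # keep only the fractional part for the next step
    return "0." + "".join(digits)

print(frac_to_base(0.15625, 2))  # 0.00101
print(frac_to_base(0.1, 2))      # 0.00011001100110011001 (the expansion of 0.1 never terminates)
```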
3.3. Fixed and Floating-Point Representation
Computers are equipped to handle pieces of information of a fixed size called a word. Typical word lengths are 32-bits or 64-bits.
Early computers made numerical calculations in a fixed-point number system. That is, real numbers were represented using a fixed number of binary digits in the fractional part. If the word length of the computer is $s + 1$ bits (including the sign bit), then only numbers inside a bounded interval $I$ are permitted. This limitation is problematic, since, for example, even when $x \in I$ and $y \in I$, it is possible that $x \cdot y$ or $x / y$ is outside the bounds of the interval $I$.
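To make the limitation concrete, here is a small Python sketch of a toy fixed-point format we made up purely for illustration – an 8-bit signed word with 4 fractional bits – showing a product that leaves the representable range:

```python
# Toy fixed-point format (illustrative only): 8-bit signed word, 4 fractional bits.
# A real number x is stored as the integer round(x * 2**4), which must fit in [-128, 127].
FRAC_BITS = 4
SCALE = 2 ** FRAC_BITS           # 16
LO, HI = -(2 ** 7), 2 ** 7 - 1   # storable integers

def to_fixed(x: float) -> int:
    q = round(x * SCALE)
    if not LO <= q <= HI:
        raise OverflowError(f"{x} is outside the fixed-point range")
    return q

x, y = 3.5, 3.5                  # both representable: 3.5 * 16 = 56
print(to_fixed(x), to_fixed(y))  # 56 56
print(to_fixed(x * y))           # 3.5 * 3.5 = 12.25 -> 196: OverflowError
```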
In a floating-point number system, a real number $a$ is represented as:

$$a = \pm m \cdot \beta^{e}, \qquad m = (0.d_1 d_2 d_3 \ldots)_\beta$$

The fractional part $m$ of the number is called the mantissa or significand, $e$ is called the exponent and $\beta$ is the base. It’s clear that we can always arrange $d_1 \neq 0$, because if $d_1 = 0$, then we can decrease the exponent by $1$ and shift the decimal point one place to the right.

The mantissa $m$ and the exponent $e$ are limited by the fixed word length in computers: $m$ is rounded off to a number $\bar{m}$ with $t$ digits, and the exponent $e$ lies in a certain range $e_{\min} \le e \le e_{\max}$.
Thus, we can only represent floating-point numbers of the form:

$$\bar{a} = \pm \bar{m} \cdot \beta^{e}, \qquad \bar{m} = (0.d_1 d_2 \ldots d_t)_\beta, \quad d_1 \neq 0, \quad e_{\min} \le e \le e_{\max}$$
A floating-point number system is completely characterized by the base $\beta$, the precision $t$, and the numbers $e_{\min}$ and $e_{\max}$. Since $d_1 \neq 0$, the set $F = F(\beta, t, e_{\min}, e_{\max})$ contains, including the number $0$,

$$2 (\beta - 1) \beta^{t-1} (e_{\max} - e_{\min} + 1) + 1$$

numbers. Intuitively, there are $2$ choices for the sign. The digit $d_1$ can be chosen from the set $\{1, 2, \ldots, \beta - 1\}$, in $\beta - 1$ ways. Each of the successive $t - 1$ digits can be chosen from $\{0, 1, \ldots, \beta - 1\}$, in $\beta$ ways. The exponent can be chosen from $e_{\max} - e_{\min} + 1$ numbers. By the multiplication rule, there are $2(\beta - 1)\beta^{t-1}(e_{\max} - e_{\min} + 1)$ distinguishable nonzero numbers. Including the number $0$ gives us the expression above.
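The counting argument is easy to check in code. A one-line Python helper (our own, for illustration) evaluates the formula for any parameter choice:

```python
def fp_count(beta: int, t: int, e_min: int, e_max: int) -> int:
    """Number of elements of F(beta, t, e_min, e_max), including 0."""
    return 2 * (beta - 1) * beta ** (t - 1) * (e_max - e_min + 1) + 1

print(fp_count(2, 3, -1, 2))   # 33  (the toy system of the next subsection)
print(fp_count(10, 4, -9, 9))  # 342001
```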
3.4. Why Are Floating-Point Numbers Inaccurate?
Consider a toy floating-point number system, say $F(\beta = 2, t = 3, e_{\min} = -1, e_{\max} = 2)$. The set $F$ contains exactly $33$ numbers. The positive numbers in the set $F$ are:

$$0.25,\ 0.3125,\ 0.375,\ 0.4375,\ 0.5,\ 0.625,\ 0.75,\ 0.875,\ 1.0,\ 1.25,\ 1.5,\ 1.75,\ 2.0,\ 2.5,\ 3.0,\ 3.5$$

It is apparent that not all real numbers, for instance $0.3$, are present in $F$. Moreover, not all floating-point numbers are equally spaced; the spacing jumps by a factor of $2$ at each power of $2$.
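A short Python sketch (the helper `enumerate_positive` is our own) that enumerates the positive members of this toy system makes the uneven spacing and the count of $33$ easy to verify:

```python
from itertools import product

def enumerate_positive(beta: int, t: int, e_min: int, e_max: int) -> list[float]:
    """All positive numbers of F(beta, t, e_min, e_max) with normalized t-digit mantissas."""
    values = set()
    for e in range(e_min, e_max + 1):
        for digits in product(range(beta), repeat=t):
            if digits[0] == 0:  # normalization: leading digit must be nonzero
                continue
            m = sum(d * beta ** -(i + 1) for i, d in enumerate(digits))
            values.add(m * beta ** e)
    return sorted(values)

pos = enumerate_positive(2, 3, -1, 2)
print(pos)               # [0.25, 0.3125, 0.375, ..., 3.0, 3.5]
print(len(pos) * 2 + 1)  # 33, matching the counting formula
```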
The spacing of the floating-point numbers is characterized by the machine epsilon $\epsilon_M$, which is the distance from $1$ to the next largest floating-point number.
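For IEEE double precision the machine epsilon is $2^{-52}$; in Python it can be read off directly and confirmed as the gap just above $1.0$:

```python
import math
import sys

eps = sys.float_info.epsilon
print(eps)                   # 2.220446049250313e-16, i.e. 2**-52
print(math.ulp(1.0) == eps)  # True: the gap between 1.0 and the next double is eps
print(1.0 + eps > 1.0)       # True
print(1.0 + eps / 4 == 1.0)  # True: too small a step to reach the next double
```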
If a real number $x$ is in the range of the floating-point system, the obvious thing to do is to round off $x$ to $fl(x)$, where $fl(x)$ is the floating-point number in $F$ that is closest to $x$. It should already be clear that representing $x$ by $fl(x)$ introduces an error.
One interesting question is: how large is this error? It would be great if we could guarantee that the round-off error at no point exceeds a certain amount. Everybody loves a guarantee! In other words, we seek an upper bound for the relative error:

$$\left| \frac{fl(x) - x}{x} \right|$$
Recall, from the earlier discussion, that when rounding to $t$ decimals, we leave the $t$th decimal unchanged if the part $r$ of the number to the right of the $t$th decimal is smaller than half a unit in the $t$th place, else we raise the $t$th decimal by $1$. Here, we are working with the generic base $\beta$ instead of the decimal base $10$. Consequently, the round-off error in the mantissa is bounded by:

$$|\bar{m} - m| \le \frac{1}{2} \beta^{-t}$$
The relative round-off error in $x$ is thus bounded by:

$$\left| \frac{fl(x) - x}{x} \right| = \frac{|\bar{m} - m| \cdot \beta^{e}}{m \cdot \beta^{e}} \le \frac{\frac{1}{2}\beta^{-t}}{\beta^{-1}} = \frac{1}{2} \beta^{1-t}$$

since the normalized mantissa satisfies $m \ge \beta^{-1}$.
Modern floating-point standards such as IEEE 754 guarantee this upper bound. This upper bound is called the rounding unit (or unit roundoff), denoted by $u = \frac{1}{2} \beta^{1-t}$.
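We can spot-check this guarantee for IEEE double precision (where $\beta = 2$ and $t = 53$, so $u = 2^{-53}$) with a short Python sketch; the sample values below are arbitrary:

```python
from decimal import Decimal

u = 2.0 ** -53  # rounding unit for IEEE double precision

for s in ["0.1", "0.3", "2.675", "123456.789", "3.14159265358979"]:
    exact = Decimal(s)                       # the real number we asked for
    stored = Decimal(float(s))               # exact value of the nearest double
    rel_err = abs((stored - exact) / exact)  # relative representation error
    print(s, float(rel_err) <= u)            # True for every sample
```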
4. IEEE Floating-point Standard in a Nutshell
Actual computer hardware implementations of floating-point systems differ from the toy system we just designed. Most current computers conform to the IEEE 754 standard for binary floating-point arithmetic. There are two main basic formats – single and double precision – requiring 32-bit and 64-bit storage respectively.
In single precision, a floating-point number is stored as the sign $s$ ($1$ bit), the exponent $e$ ($8$ bits), and the mantissa $m$ ($23$ bits).

In double precision, $11$ bits are allocated to the exponent $e$, whereas $52$ bits are allocated to the mantissa $m$. The exponent is stored in biased form, as $e + 1023$.
The value of the floating-point number $a$ in the normal case is:

$$a = (-1)^s \, (1.m)_2 \cdot 2^{e}$$

Note that the digit before the binary point is always $1$, similar to the scientific notation we studied in high school. In that way, $1$ bit is gained for the mantissa.
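To see the three fields directly, we can reinterpret a double’s 64 bits as an integer in Python; the helper `fields` below is our own, written just for inspection:

```python
import struct

def fields(x: float) -> tuple[int, int, int]:
    """Return (sign, biased exponent, mantissa) of a double's 64-bit pattern."""
    bits = struct.unpack(">Q", struct.pack(">d", x))[0]
    sign = bits >> 63
    biased_exp = (bits >> 52) & 0x7FF  # 11 exponent bits
    mantissa = bits & ((1 << 52) - 1)  # 52 mantissa bits
    return sign, biased_exp, mantissa

s, be, m = fields(0.1)
print(s, be, be - 1023)        # 0 1019 -4  (0.1 = 1.6 * 2**-4, as examined next)
print(format(m, "052b")[-8:])  # 10011010   (last bits of the rounded mantissa)
```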
4.1. Quirks in Floating-Point Arithmetic
Consider the following comparison:
(0.10 + 0.20) == 0.30
The result of this logical comparison is false. This abrupt behavior might tempt us to conclude that the floating-point system is broken. However, let’s take a deeper look at what’s going on.
Let’s put $0.1$ in the double-precision format. Because $0.1$ is positive, the sign bit $s = 0$.

$0.1$ in base-$2$ scientific notation can be written as $(1.m)_2 \times 2^{e}$. This means we must factor $0.1$ into a number in the range $[1, 2)$ and a power of $2$. If we divide $0.1$ by different powers of $2$, we get:

$$\frac{0.1}{2^{-1}} = 0.2, \qquad \frac{0.1}{2^{-2}} = 0.4, \qquad \frac{0.1}{2^{-3}} = 0.8, \qquad \frac{0.1}{2^{-4}} = 1.6$$
Therefore, $0.1 = 1.6 \times 2^{-4}$, so $e = -4$. The exponent is stored as $-4 + 1023 = 1019$, so the bit pattern in the exponent part is $01111111011$.
The mantissa is the fractional part, $0.6$, in binary form. Successive multiplication by $2$ quickly yields $0.6 = (0.1001\,1001\,1001\ldots)_2$, with the block $1001$ repeating forever. However, the double-precision format allocates $52$ bits to the mantissa, so we must round off $0.6$ to $52$ binary digits. The fractional part after the $52$nd digit, $(0.1001\,1001\ldots)_2 = 0.6$, exceeds $\frac{1}{2}$. So, we raise the $52$nd digit by $1$. Thus, the rounded mantissa is

$$\underbrace{1001\,1001\,\ldots\,1001}_{48 \text{ bits}}\,1010.$$
Finally, we put the binary strings in the correct order. So, $0.1$ in the IEEE double-precision format is:

$$0 \;\big|\; 01111111011 \;\big|\; \underbrace{1001\,1001\,\ldots\,1001}_{48 \text{ bits}}\,1010$$

This machine number is approximately $0.10000000000000000555$ in base $10$. In a similar fashion, the closest machine number to $0.2$ in the floating-point system is approximately $0.20000000000000001110$. Therefore, the sum $0.1 + 0.2$ is computed as approximately $0.30000000000000001665$, which is then itself rounded to the nearest machine number, approximately $0.30000000000000004441$. On the right-hand side, the closest machine number to $0.3$ is approximately $0.29999999999999998890$. So, the two sides of the comparison are different machine numbers, and the test evaluates to false.
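These expansions are easy to verify in Python by printing the stored values to 20 decimal places; the hexadecimal representation of $0.1$ also shows the mantissa ending in the rounded-up digit `a` (binary `1010`):

```python
print(f"{0.1:.20f}")        # 0.10000000000000000555
print(f"{0.2:.20f}")        # 0.20000000000000001110
print(f"{0.1 + 0.2:.20f}")  # 0.30000000000000004441
print(f"{0.3:.20f}")        # 0.29999999999999998890
print((0.1).hex())          # 0x1.999999999999ap-4  (52-bit mantissa rounded up)
print(0.1 + 0.2 == 0.3)     # False
```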
To summarize, algebraically equivalent statements are not necessarily numerically equivalent.
5. Conclusion
In this article, we’ve learned how the rules of chopping and rounding to the nearest work. We developed an algorithm for conversion between number systems. We also learnt that any floating-point system is characterized by a quadruple $(\beta, t, e_{\min}, e_{\max})$. For example, the IEEE double-precision format is given by $(2, 53, -1022, 1023)$. Because computers have finite memory capacity, the set of machine numbers $F$ is only a subset of the real numbers $\mathbb{R}$. The spacing between machine numbers is, in technical speak, called the machine epsilon.
Further, any real number is always rounded off to the nearest machine number on a computer. The IEEE standards guarantee that the relative roundoff error never exceeds the rounding unit $u$. Moreover, as programmers, we need to take proper care when doing floating-point operations.