1. Introduction

Computer memory has a finite capacity. Real numbers in \mathbf{R}, however, do not in general have a finite representation. For example, consider representing the rational number \frac{4}{3} as a decimal value: it is (1.333\overline{3})_{10} – it is impossible to write it down exactly with finitely many digits! Since only a finite number of digits can be stored in computer memory, the reals have to be approximated in some fashion (rounded or truncated) when represented on a computer.

In this tutorial, we’ll go over the basic ideas of floating-point representation and learn the limits of floating-point accuracy when doing practical numerical computing.

2. Rounding and Chopping

There are two distinguishable ways of rounding off a real number x to a given number t of decimals.

In chopping, we simply leave off all decimals to the right of the tth digit. For example, 56.555 truncated to t=1 decimal place yields 56.5.

In rounding to nearest, we choose the number with t decimals that is closest to x. For example, consider rounding off 56.555 to t=1 decimal place. There are two candidates: 56.5 and 56.6. On the real number line, 56.5 is at a distance of |56.555 - 56.5|=0.55 \times 10^{-1} from 56.555, whilst 56.6 is at a distance of |56.6 - 56.555|=0.45 \times 10^{-1}. So, 56.6 is nearer, and we round to 56.6.

Intuitively, we are comparing the part of the number to the right of the tth decimal, which is 0.055 = 0.55 \times 10^{-1}, with 0.5 \times 10^{-1}. As 0.55 \times 10^{-1} > 0.5 \times 10^{-1}, we increment the tth decimal, which is 5, by 1.

What if we wished to round off 56.549 to t=1 decimal place? Observe that the part to the right of the first decimal is 0.049 = 0.49 \times 10^{-1}. As 0.49 \times 10^{-1} < 0.5 \times 10^{-1}, we leave the tth decimal unchanged. So, the result is 56.5, which is indeed nearest to 56.549.

In general, let p be the part of the number to the right of the tth decimal. If p>0.5 \times 10^{-t}, we increment the tth decimal by 1. If p<0.5 \times 10^{-t}, we leave the tth decimal unchanged.

In the case of a tie, when x is equidistant to two decimal t-digit numbers, we raise the tth decimal if it is odd or leave it unchanged if it is even. In this way, the error in rounding off a decimal number is positive or negative equally often.

Let’s look at a few more examples of rounding and chopping to make sure this sinks in; we’ll also verify them with a short code sketch right after the table. Assume that we’re rounding off all quantities to t=3 decimal places:

x | Rounding | Chopping
0.2397 | 0.240 | 0.239
0.23750 | 0.238 | 0.237
0.23650 | 0.236 | 0.236
0.23652 | 0.237 | 0.236
\pi \approx 3.1415926 | 3.142 | 3.141
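
As a quick check, here’s a minimal Python sketch that reproduces the table above using the standard decimal module, which works in base 10 and therefore represents these values exactly; the helper names chop and round_nearest are our own:

    from decimal import Decimal, ROUND_DOWN, ROUND_HALF_EVEN

    def chop(x: str, t: int) -> Decimal:
        """Chop (truncate) x to t decimal places."""
        return Decimal(x).quantize(Decimal(1).scaleb(-t), rounding=ROUND_DOWN)

    def round_nearest(x: str, t: int) -> Decimal:
        """Round x to t decimal places, breaking ties towards an even last digit."""
        return Decimal(x).quantize(Decimal(1).scaleb(-t), rounding=ROUND_HALF_EVEN)

    for x in ["0.2397", "0.23750", "0.23650", "0.23652", "3.1415926"]:
        print(x, round_nearest(x, 3), chop(x, 3))
    # 0.2397    0.240 0.239
    # 0.23750   0.238 0.237
    # 0.23650   0.236 0.236
    # 0.23652   0.237 0.236
    # 3.1415926 3.142 3.141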

The difference between chopping and rounding has real-world implications! The Vancouver Stock Exchange index began trading in 1982 with a base value of 1000.00. Although the underlying stocks were performing decently, the index had fallen into the low 500s by the end of 1983. The culprit: a computer program recalculated the value of the index thousands of times each day, and it used chopping instead of rounding to nearest, so the truncation error accumulated. A recalculation with proper rounding gave a value of 1082.00.

3. Computer Number Systems

In our daily life, we represent numbers using the decimal number system with base 10. In the decimal system, we know that if a digit d_{-k} stands k places to the right of the decimal point, the value it contributes is d_{-k} \cdot 10^{-k}. For example, the sequence of digits 4711.303 means:

    \begin{align*} 4 \cdot 10^3 + 7 \cdot  10^2 + 1 \cdot 10^1 + 1\cdot 10^0 + 3 \cdot 10^{-1} + 0 \cdot 10^{-2} + 3 \cdot 10^{-3} \end{align*}

In fact, any integer \beta \geq 2 (or \beta \leq -2) can be used as a base. Analogously, every real number x \in \mathbf{R} has a representation of the form:

    \begin{align*} x = d_n \beta^n + d_{n-1}\beta^{n-1} + \ldots + d_1 \beta^{1} + d_0 \beta^0 + d_{-1}\beta^{-1} + d_{-2} \beta^{-2} + \ldots \end{align*}

or compactly d_{n}\ldots d_1 d_0.d_{-1} d_{-2} \ldots, where the coefficients d_i, the digits in the system with base \beta, are integers such that 0 \leq d_i < \beta.

3.1. Conversion Algorithm Between Two Number Systems

Consider the problem of conversion between two number systems with different bases. For the sake of concreteness, let’s try to convert (250)_{10} to the binary format. We may write:

    \begin{align*} 250 = d_7 \cdot 2^7 + d_6\cdot 2^6 + d_5 \cdot 2^5 + d_4\cdot 2^4 + d_3\cdot 2^3 + d_2\cdot 2^2 + d_1\cdot 2^1 + d_0 \end{align*}

We can pull out a factor of 2 from each term, except the last, and equivalently write:

    \begin{align*} 250 = 2 \times (d_7 \cdot 2^6 + d_6\cdot 2^5 + d_5 \cdot 2^4 + d_4\cdot 2^3 + d_3\cdot 2^2 + d_2\cdot 2^1 + d_1) + d_0 \end{align*}

Intuitively, therefore, if we were to divide q_0=250 by 2, the expression in brackets, let’s call it q_1, is the quotient and d_0 is the remainder of the division. Similarly, since q_1 = 2 \times (d_7 \cdot 2^5 + d_6\cdot 2^4 + d_5 \cdot 2^3 + d_4\cdot 2^2 + d_3\cdot 2^1 + d_2) + d_1, division by 2 would return the expression in the brackets as the quotient; call it q_2, and d_1 as the remainder.

In general, if a is an integer given in base \alpha and we want to determine its representation in a number system with base \beta, we perform successive divisions of a by \beta: set q_0 = a and

    \begin{align*} q_k = \beta \times q_{k+1} + d_k, \quad k = 0,1,2,\ldots \end{align*}

Here, q_{k+1} is the quotient and d_k is the remainder of the division of q_k by \beta; we stop as soon as the quotient becomes 0.

Let’s look at the result of applying the algorithm to (250)_{10}:

k | q_k | q_{k+1} | d_k
0 | 250 | 250/2 = 125 | 0
1 | 125 | 125/2 = 62 | 1
2 | 62 | 62/2 = 31 | 0
3 | 31 | 31/2 = 15 | 1
4 | 15 | 15/2 = 7 | 1
5 | 7 | 7/2 = 3 | 1
6 | 3 | 3/2 = 1 | 1
7 | 1 | 1/2 = 0 | 1

Therefore, (250)_{10} = (d_7 d_6 d_5 d_4 d_3 d_2 d_1 d_0)_2 = (1111\,1010)_2.
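
Here’s a short Python sketch of the successive-division algorithm; the function name int_to_base is our own, and we assume \beta \leq 10 so that single characters suffice for the digits:

    def int_to_base(a: int, beta: int) -> str:
        """Convert a non-negative integer a to its digit string in base beta
        by successive division, collecting the remainders d_k."""
        if a == 0:
            return "0"
        digits = []
        q = a
        while q > 0:
            q, d = divmod(q, beta)  # quotient q_{k+1} and remainder d_k
            digits.append(str(d))
        # The remainders appear least-significant digit first, so reverse them.
        return "".join(reversed(digits))

    print(int_to_base(250, 2))  # 11111010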

3.2. Converting Fractions to Another Number System

If the real number a is not an integer, we write it as a = b + c, where b is the integer part and

    \begin{align*} c = c_{-1}\beta^{-1} + c_{-2}\beta^{-2} + c_{-3}\beta^{-3} + \ldots \end{align*}

is the fractional part, where c_{-1},c_{-2},c_{-3},\ldots are to be determined.

Observe that multiplying both sides by the base \beta yields,

    \begin{align*} c \cdot \beta = \underbrace{c_{-1}}_{\text{Integer}} + \underbrace{c_{-2}\beta^{-1} + c_{-3}\beta^{-2} + \ldots}_{\text{Fractional part}} \end{align*}

an integer and a fractional portion. The integer portion is precisely c_{-1} – the first digit of c represented in base \beta. Subsequent digits are obtained as the integer parts when we repeatedly multiply the remaining fractional part by \beta.

In general, if a fraction c must be converted to another number system with base \beta, we perform successive multiplications by \beta: set p_0 = c and

    \begin{align*} p_{k} \cdot \beta = c_{k-1} + p_{k-1}, \quad k = 0,-1,-2,\ldots \end{align*}

where c_{k-1} is the integer part of the product and p_{k-1} is its fractional part.

Let’s look at an example of converting (0.15625)_{10} to binary number system:

k | p_k | p_k \cdot \beta | c_{k-1} | p_{k-1}
0 | 0.15625 | 0.15625 \times 2 = 0.3125 | 0 | 0.3125
-1 | 0.3125 | 0.3125 \times 2 = 0.625 | 0 | 0.625
-2 | 0.625 | 0.625 \times 2 = 1.25 | 1 | 0.25
-3 | 0.25 | 0.25 \times 2 = 0.50 | 0 | 0.50
-4 | 0.50 | 0.50 \times 2 = 1.00 | 1 | 0.00

Therefore, (0.15625)_{10} = (0.c_{-1} c_{-2} c_{-3} c_{-4} c_{-5})_2 = (0.00101)_2 in the binary system.
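
The successive-multiplication algorithm translates to Python just as directly; frac_to_base is again our own helper, and we cap the number of emitted digits because most fractions don’t terminate:

    def frac_to_base(c: float, beta: int, max_digits: int = 10) -> str:
        """Convert a fraction 0 <= c < 1 to base beta by successive
        multiplication, collecting the integer parts c_{-1}, c_{-2}, ..."""
        digits = []
        p = c
        while p > 0 and len(digits) < max_digits:
            p *= beta
            d = int(p)      # the integer part is the next digit
            digits.append(str(d))
            p -= d          # keep only the fractional part
        return "0." + "".join(digits)

    print(frac_to_base(0.15625, 2))  # 0.00101 -- terminates exactly
    print(frac_to_base(0.1, 2))      # 0.0001100110 -- 0.1 has no finite binary expansion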

3.3. Fixed and Floating-Point Representation

Computers are equipped to handle pieces of information of a fixed size called a word. Typical word lengths are 32 or 64 bits.

Early computers performed numerical calculations in a fixed-point number system. That is, real numbers were represented using a fixed number t of binary digits in the fractional part. If the word length of the computer is s+1 bits (including the sign bit), then only numbers in the bounded interval I=[-2^{s-t},2^{s-t}] are permitted. This limitation is problematic: for example, even when x \in I and y \in I, the difference x - y can fall outside the interval I.

In a floating-point number system, a real number a is represented as:

    \begin{align*} a = \pm m \cdot \beta^e, \quad m = (0.d_1 d_2 d_3 \ldots)_\beta, \quad e \text{ an integer } \end{align*}

The fractional part m of the number is called the mantissa or significand, e is called the exponent, and \beta is the base. We may assume d_1 \neq 0 (a normalized representation): if d_1 = 0, we can always shift the point one place to the right and decrease the exponent by 1.

The mantissa m and the exponent e are limited by the fixed word length in computers. m is rounded off to a number \overline{m} with t digits and the exponent e lies in a certain range.

Thus, we can only represent floating-point numbers of the form:

    \begin{align*} a = \pm \overline{m} \cdot \beta^e, \quad \overline{m} = (0.d_1 d_2 d_3 \ldots d_t)_\beta, \quad e_{min} \leq e \leq e_{max}, \quad e \in \mathbf{Z} \end{align*}

A floating-point number system F is completely characterized by the base \beta, the precision t, and the numbers e_{min} and e_{max}. Since d_1 \neq 0, the set F contains, including the number 0,

    \begin{align*} 2 (\beta - 1) \beta^{t-1}(e_{max} - e_{min} + 1) + 1 \end{align*}

numbers. Intuitively, there are 2 choices for the sign. The digit d_1 can be chosen from the set \{1,2,\ldots,\beta - 1\}. Each of the successive t-1 digits can be chosen from \{0,1,2,\ldots,\beta - 1\}. The exponent can be chosen from e_{max} - e_{min} + 1 numbers. By the multiplication rule, there are 2 (\beta - 1) \beta^{t-1}(e_{max} - e_{min} + 1) distinguishable numbers. Including the number 0 gives us the expression above.

3.4. Why Are Floating-Point Numbers Inaccurate?

Consider a toy floating-point number system F(\beta = 2, t = 3, e_{min}=-1, e_{max}=2). The set F contains exactly 2 \cdot 16 + 1 = 33 numbers. The positive numbers in the set F are shown below:

(0.d_1 d_2 d_3)_2 | 2^e | Decimal | (0.d_1 d_2 d_3)_2 | 2^e | Decimal
(0.100)_2 | 2^{-1} | 0.25 | (0.100)_2 | 2^1 | 1.00
(0.101)_2 | 2^{-1} | 0.3125 | (0.101)_2 | 2^1 | 1.25
(0.110)_2 | 2^{-1} | 0.375 | (0.110)_2 | 2^1 | 1.50
(0.111)_2 | 2^{-1} | 0.4375 | (0.111)_2 | 2^1 | 1.75
(0.100)_2 | 2^0 | 0.50 | (0.100)_2 | 2^2 | 2.00
(0.101)_2 | 2^0 | 0.625 | (0.101)_2 | 2^2 | 2.50
(0.110)_2 | 2^0 | 0.75 | (0.110)_2 | 2^2 | 3.00
(0.111)_2 | 2^0 | 0.875 | (0.111)_2 | 2^2 | 3.50

It is apparent that not all real numbers, for instance in \mathbf{[1,2]}, are present in \mathbf{F}. Moreover, floating-point numbers are not equally spaced: the spacing jumps by a factor of \beta at each power of \beta.
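
We can enumerate the positive numbers of this toy system programmatically and verify both the count and the uneven spacing; this is a minimal sketch for the particular system F(\beta=2, t=3, e_{min}=-1, e_{max}=2):

    # Enumerate the positive numbers of the toy system F(beta=2, t=3, e_min=-1, e_max=2)
    # and check the counting formula from the previous subsection.
    beta, t, e_min, e_max = 2, 3, -1, 2

    positives = sorted(
        (d1 / beta + d2 / beta**2 + d3 / beta**3) * beta**e
        for d1 in range(1, beta)   # normalized: d_1 != 0
        for d2 in range(beta)
        for d3 in range(beta)
        for e in range(e_min, e_max + 1)
    )
    print(len(positives))          # 16 positive numbers
    print(positives)               # [0.25, 0.3125, ..., 3.0, 3.5] -- note the uneven spacing
    print(2 * (beta - 1) * beta ** (t - 1) * (e_max - e_min + 1) + 1)  # 33 numbers in F overall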

The spacing of the floating-point numbers is characterized by the machine epsilon \mathbf{\epsilon_M}, which is the distance from 1.0 to the next larger floating-point number. In our toy system, the number after 1.0 is 1.25, so \epsilon_M = 0.25 = \beta^{1-t}.

If a real number x is in the range of the floating-point system, the obvious thing to do is to round off x to \overline{x}, where \overline{x}=fl(x) is the floating-point number in F that is closest to x. It should already be clear that representing \mathbf{x} by \mathbf{fl(x)} generally introduces an error.

One interesting question is, how large is this error? It would be great if we could guarantee, that the round-off error at no point exceeds a certain amount. Everybody loves a guarantee! In other words, we seek an upper bound for the relative error:

    \begin{align*} \frac{|fl(x) - x|}{|x|} \end{align*}

Recall from the earlier discussion that, when rounding m to t digits, we leave the tth digit unchanged if the part p of the number to the right of the tth digit is smaller than \frac{1}{2} \times 10^{-t}, and raise the tth digit by 1 otherwise. Here, we are working with a generic base \beta instead of the decimal base 10. Consequently, the round-off error in the mantissa is bounded by:

    \begin{align*} |\overline{m} - m| \leq \frac{1}{2}\cdot \beta^{-t} \end{align*}

The relative round-off error in x is thus bounded by:

    \begin{align*} \frac{|fl(x) - x|}{|x|} &= \frac{|\overline{m} - m| \cdot \beta^e}{m \cdot \beta^e}\\ &\leq \frac{\frac{1}{2}\cdot \beta^{-t} \cdot \beta^e}{(0.1)_\beta \beta^e} \quad \quad \{ m \geq (0.1)_\beta \} \\ &= \frac{1}{2} \cdot \beta^{-t + 1} \end{align*}

Modern floating-point standards, such as IEEE 754, guarantee this upper bound. It is called the rounding unit (or unit round-off), denoted by \mathbf{u}.
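
For the IEEE double-precision numbers we’ll meet in the next section (and which Python floats use), we can inspect both the machine epsilon and the rounding unit directly; the sketch below assumes Python 3.9+ for math.ulp:

    import math
    import sys
    from decimal import Decimal

    eps = sys.float_info.epsilon   # machine epsilon: gap between 1.0 and the next larger double
    print(eps == 2.0 ** -52)       # True
    print(math.ulp(1.0) == eps)    # True

    u = eps / 2                    # rounding unit for IEEE doubles: u = 2**-53
    rel_err = abs(Decimal(0.1) - Decimal("0.1")) / Decimal("0.1")
    print(rel_err <= Decimal(u))   # True: the relative round-off error stays within the bound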

4. IEEE Floating-point Standard in a Nutshell

Actual computer hardware implementations of floating-point systems differ from the toy system we just designed. Most current computers conform to the IEEE 754 standard for binary floating-point arithmetic. There are two basic formats – single and double precision – requiring 32 bits and 64 bits of storage, respectively.

In single precision, a floating-point number a is stored as the sign s (1 bit), the exponent e (8 bits), and the mantissa m (23 bits).

In double precision, 11 bits are allocated to the exponent e, whereas 52 bits are allocated to the mantissa m. The exponent is stored in biased form as b=e+1023.


The value v of the floating-point number a in the normal case is:

    \begin{align*} v = (-1)^s (1.m)_2 \cdot 2^e, \quad \quad e_{min} \leq e \leq e_{max} \end{align*}

Note that the digit before the binary point is always 1, similar to the scientific notation we studied in high school. Since this leading 1 need not be stored explicitly, one extra bit of precision is gained for the mantissa.
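
To see this layout concretely, here’s a small Python sketch that unpacks a Python float (an IEEE double) into its three fields; the helper double_bits and the example value -6.25 are our own:

    import struct

    def double_bits(x: float):
        """Split a Python float (an IEEE double) into its sign bit,
        biased exponent b = e + 1023, and 52-bit fraction."""
        bits = int.from_bytes(struct.pack(">d", x), "big")  # raw 64-bit pattern
        sign = bits >> 63
        biased_exponent = (bits >> 52) & 0x7FF   # 11 bits
        fraction = bits & ((1 << 52) - 1)        # 52 bits
        return sign, biased_exponent, fraction

    # -6.25 = -(1.1001)_2 * 2^2, so e = 2 and the fraction bits start with 1001
    s, b, m = double_bits(-6.25)
    print(s, b, b - 1023)   # 1 1025 2
    print(f"{m:052b}")      # 1001000...0 (52 bits)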

4.1. Quirks in Floating-Point Arithmetic

Consider the following comparison:

(0.10 + 0.20) == 0.30

The result of this comparison is false. This behavior isn’t a sign that floating-point arithmetic is broken – it’s an expected consequence of the rounding we’ve just described. Let’s take a deeper look at what’s going on.

Let’s put 0.10 in the double-precision format. Because 0.10 is positive, the sign bit s=0.

0.10 in base-2 scientific notation can be written as (-1)^0 (1 + m) 2^{e}. This means we must factor 0.10 into a number (1 + m) in the range 1 \leq 1 + m < 2 (that is, 0 \leq m < 1) and a power of 2. If we divide 0.10 by different powers of 2, we get:

    \begin{align*} 0.10 / 2^{-1} &= 0.20\\ 0.10 / 2^{-2} &= 0.40\\ 0.10 / 2^{-3} &= 0.80\\ 0.10 / 2^{-4} &= 1.60 \end{align*}

Therefore, 0.10 = 1.60 \times 2^{-4}. The exponent is stored as e + 1023=-4 + 1023 = 1019, so the bit pattern in the exponent part is (011\, 1111\, 1011)_2.

The mantissa m is the fractional part 0.60, written in binary. Successive multiplication by 2 quickly yields m = (0.1001\, 1001\, 1001\, 1001\, 1001\, 1001\, 1001\, 1001\, 1001\, 1001\, 1001\, 1001\, 1001 \ldots)_2. However, the double-precision format allocates t=52 bits to the mantissa, so we must round off m to 52 bits. The part p beyond the 52nd bit, (0.1001 \ldots)_2 \times 2^{-52}, exceeds \frac{1}{2} \beta^{-t} = (0.1)_2 \times 2^{-52}. So, we round up by adding 1 in the 52nd place. Thus, the rounded mantissa is \overline{m}= (0.1001\, 1001\, 1001\, 1001\, 1001\, 1001\, 1001\, 1001\, 1001\, 1001\, 1001\, 1001\, 1010)_2.

Finally, we put the binary strings in the correct order. So, 0.10 in the IEEE double-precision format is:

    \begin{align*} \underbrace{0}_{s}\,\underbrace{011\, 1111\, 1011}_{e+1023}\,\underbrace{1001\, 1001\, 1001\, 1001\, 1001\, 1001\, 1001\, 1001\, 1001\, 1001\, 1001\, 1001\, 1010}_{\overline{m}} \end{align*}

This machine number is approximately 0.10000000000000000555 in base 10. In a similar fashion, the closest machine number to 0.2 in the floating-point system is approximately 0.2000000000000000111. Therefore, fl(0.1) + fl(0.2) > 0.30. On the right-hand side, the closest machine number to 0.3 is approximately 0.29999999999999998889, so fl(0.3) < 0.30.
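
We can reproduce these machine numbers in Python, where Decimal(x) displays the exact value of the double closest to x; the practical takeaway is to compare floats with a tolerance rather than with ==:

    import math
    from decimal import Decimal

    # Decimal(x) shows the exact value of the double nearest to x
    print(Decimal(0.1))   # 0.1000000000000000055511151231257827...
    print(Decimal(0.2))   # 0.2000000000000000111022302462515654...
    print(Decimal(0.3))   # 0.2999999999999999888977697537484345...

    print(0.1 + 0.2 == 0.3)              # False
    print(0.1 + 0.2)                     # 0.30000000000000004
    print(math.isclose(0.1 + 0.2, 0.3))  # True: compare with a tolerance instead of ==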

To summarize, algebraically equivalent statements are not necessarily numerically equivalent.

5. Conclusion

In this article, we’ve learned how the rules of chopping and rounding to nearest work. We developed an algorithm for conversion between number systems. We also learnt that any floating-point system is characterized by a quadruple F(\beta,t,e_{min},e_{max}); for example, the IEEE double-precision format is F(\beta=2,t=52,e_{min}=-1022,e_{max}=1023). Because computers have finite memory capacity, the set of machine numbers \mathbf{M} is only a subset of the real numbers \mathbf{R}, and the spacing of the machine numbers around 1.0 is measured by the machine epsilon.

Further, any real number x within the range of the system is rounded off to the nearest machine number on a computer, and the IEEE standard guarantees that the relative round-off error never exceeds the rounding unit \mathbf{u}. As programmers, we therefore need to take proper care when comparing and accumulating floating-point values.