1. Introduction
Computer memory has a finite capacity. Real numbers in $\mathbb{R}$, however, do not in general have a finite representation. For example, consider representing a rational number such as $\frac{1}{3}$ as a decimal value: it is $0.3333\ldots$ – it is impossible to do so exactly! Since only a finite number of digits can be stored in computer memory, the reals have to be approximated in some fashion (rounded or truncated) when represented on a computer.
In this tutorial, we’ll go over the basic ideas of floating-point representation and learn the limits of floating-point accuracy when doing practical numerical computing.
2. Rounding and Chopping
There are two distinguishable ways of rounding off a real number $x$ to a given number $t$ of decimals.

In chopping (also called truncation), we simply leave off all decimals to the right of the $t$th digit. For example, $0.2397$ truncated to $3$ decimal places yields $0.239$.
In rounding to nearest, we choose the number with $t$ decimals which is closest to $x$. For example, consider rounding off $0.2397$ to $3$ decimal places. There are two possible candidates: $0.239$ and $0.240$. On the real number line, $0.240$ is at a distance of $0.0003$ from $0.2397$, whilst $0.239$ is at a distance of $0.0007$. So, $0.240$ is the nearest and we round to $0.240$.

Intuitively, we are comparing the fractional part of the number to the right of the $t$th digit, which is $0.7$ (in units of the $t$th decimal place), with $\frac{1}{2}$. As $0.7 > \frac{1}{2}$, we incremented the $t$th digit, which is $9$, by $1$ (the carry gives $0.240$).

What if we wished to round off to $1$ decimal place? Observe that the fractional part to the right of the first digit is $0.397$. As $0.397 < \frac{1}{2}$, we leave the first digit unchanged. So, the result is $0.2$, which is indeed nearest to $0.2397$.
In general, let $r$ be the fractional part of the number to the right of the $t$th digit (after the decimal point), expressed in units of the $t$th decimal place. If $r > \frac{1}{2}$, we increment the $t$th digit. If $r < \frac{1}{2}$, we leave the $t$th digit unchanged.

In the case of a tie ($r = \frac{1}{2}$), when $x$ is equidistant from two $t$-digit decimal numbers, we raise the $t$th decimal if it is odd or leave it unchanged if it is even. In this way, the error in rounding off a decimal number is positive or negative equally often.
Let’s see some more examples of rounding and chopping to make sure this sinks in. Assume that we are interested in rounding off all quantities to $3$ decimal places:

| $x$ | Rounding | Chopping |
|---|---|---|
| 0.2397 | 0.240 | 0.239 |
| 0.23750 | 0.238 | 0.237 |
| 0.23650 | 0.236 | 0.236 |
| 0.23652 | 0.237 | 0.236 |
| 3.14159 | 3.142 | 3.141 |
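The two rules in the table above are straightforward to reproduce with Python’s decimal module. This is a minimal sketch (the helper names are ours, and we pass the inputs as strings so the decimal values stay exact):

```python
from decimal import Decimal, ROUND_DOWN, ROUND_HALF_EVEN

def chop(x: str, places: int) -> Decimal:
    """Chopping: drop everything to the right of the t-th decimal."""
    q = Decimal(1).scaleb(-places)  # e.g. places=3 -> Decimal('0.001')
    return Decimal(x).quantize(q, rounding=ROUND_DOWN)

def round_nearest(x: str, places: int) -> Decimal:
    """Rounding to nearest, with ties resolved toward the even digit."""
    q = Decimal(1).scaleb(-places)
    return Decimal(x).quantize(q, rounding=ROUND_HALF_EVEN)

for s in ["0.2397", "0.23750", "0.23650", "0.23652", "3.14159"]:
    print(s, round_nearest(s, 3), chop(s, 3))
# 0.2397  0.240 0.239
# 0.23750 0.238 0.237
# 0.23650 0.236 0.236
# 0.23652 0.237 0.236
# 3.14159 3.142 3.141
```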
The difference between chopping and rounding has real-world implications! The Vancouver Stock Exchange index began trading in 1982 with a base value of $1000.000$. Although the underlying stocks were performing decently, the index had drifted down to the low 500s by the end of 1983. A computer program re-calculated the value of the index thousands of times each day, and the program used chopping instead of rounding to the nearest. A correctly rounded calculation gave a value of $1098.892$.
3. Computer Number Systems
In our daily life, we represent numbers using the decimal number system with base $10$. In the decimal system, we’re aware that if a digit $d$ stands $n$ digits to the right of the decimal point, the value it contributes is $d \cdot 10^{-n}$. For example, the sequence of digits $452.367$ means:

$$4 \cdot 10^2 + 5 \cdot 10^1 + 2 \cdot 10^0 + 3 \cdot 10^{-1} + 6 \cdot 10^{-2} + 7 \cdot 10^{-3}$$
In fact, any integer $\beta \ge 2$ (or even $\beta \le -2$) can be used as a base. Analogously, every real number $a$ has a unique representation of the form:

$$a = \pm \left( d_n \beta^n + d_{n-1} \beta^{n-1} + \dots + d_1 \beta + d_0 + d_{-1} \beta^{-1} + d_{-2} \beta^{-2} + \dots \right)$$

or compactly $a = \pm (d_n d_{n-1} \dots d_0 . d_{-1} d_{-2} \dots)_\beta$, where the coefficients $d_i$, the digits in the system with base $\beta$, are non-negative integers such that $0 \le d_i \le \beta - 1$.
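To make the positional rule concrete, here is a minimal Python sketch (the helper name `value_in_base` is ours, purely for illustration) that evaluates a digit string by summing the $d \cdot \beta^{k}$ terms:

```python
def value_in_base(digits: str, beta: int) -> float:
    """Evaluate a digit string like '452.367' in base `beta`."""
    int_part, _, frac_part = digits.partition(".")
    value = 0.0
    # Digits left of the point contribute d * beta**k for k = 0, 1, 2, ...
    for k, d in enumerate(reversed(int_part)):
        value += int(d) * beta ** k
    # Digits right of the point contribute d * beta**(-n) for n = 1, 2, ...
    for n, d in enumerate(frac_part, start=1):
        value += int(d) * beta ** (-n)
    return value

print(value_in_base("452.367", 10))  # ~452.367
print(value_in_base("11111010", 2))  # 250.0
print(value_in_base("0.00101", 2))   # 0.15625
```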
3.1. Conversion Algorithm Between Two Number Systems
Consider the problem of conversion between two number systems with different bases. For the sake of concreteness, let’s try to convert $(250)_{10}$ to the binary format. We may write:

$$250 = 1 \cdot 2^7 + 1 \cdot 2^6 + 1 \cdot 2^5 + 1 \cdot 2^4 + 1 \cdot 2^3 + 0 \cdot 2^2 + 1 \cdot 2^1 + 0 \cdot 2^0$$

We can pull out a factor of $2$ from each term except the last, and equivalently write:

$$250 = 2 \left( 1 \cdot 2^6 + 1 \cdot 2^5 + 1 \cdot 2^4 + 1 \cdot 2^3 + 1 \cdot 2^2 + 0 \cdot 2^1 + 1 \cdot 2^0 \right) + 0$$
Intuitively, therefore, if we were to divide $250$ by $2$, the expression in brackets, let’s call it $q_1$, is the quotient and $0$ is the remainder of the division. Similarly, since $q_1 = 125 = 2 \left( 1 \cdot 2^5 + 1 \cdot 2^4 + 1 \cdot 2^3 + 1 \cdot 2^2 + 1 \cdot 2^1 + 0 \cdot 2^0 \right) + 1$, division by $2$ would return the expression in the brackets as the quotient; call it $q_2$, and $1$ as the remainder.
In general, if $a$ is an integer given in base $10$ and we want to determine its representation in a number system with base $\beta$, we perform successive divisions of $a$ by $\beta$: set $q_0 = a$ and compute

$$q_k = \beta \cdot q_{k+1} + d_k, \qquad k = 0, 1, 2, \ldots$$

where $q_{k+1}$ is the quotient and $d_k$ is the remainder in the division of $q_k$ by $\beta$. The process stops when the quotient becomes $0$, and the remainders read in reverse order give the digits of the representation.
Let’s look at the result of applying the algorithm to $a = 250$ with $\beta = 2$:

| $k$ | $q_k$ | $q_{k+1} = \lfloor q_k / 2 \rfloor$ | $d_k$ |
|---|---|---|---|
| 0 | 250 | 250/2 = 125 | 0 |
| 1 | 125 | 125/2 = 62 | 1 |
| 2 | 62 | 62/2 = 31 | 0 |
| 3 | 31 | 31/2 = 15 | 1 |
| 4 | 15 | 15/2 = 7 | 1 |
| 5 | 7 | 7/2 = 3 | 1 |
| 6 | 3 | 3/2 = 1 | 1 |
| 7 | 1 | 1/2 = 0 | 1 |
Therefore, $(250)_{10} = (11111010)_2$.
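Here is a minimal Python sketch of the successive-division algorithm (the helper name `int_to_base` is our own):

```python
def int_to_base(a: int, beta: int) -> str:
    """Convert a non-negative integer `a` to its digit string in base `beta`
    by repeated division, collecting the remainders."""
    if a == 0:
        return "0"
    digits = []
    while a > 0:
        a, d = divmod(a, beta)  # quotient q_{k+1} and remainder d_k
        digits.append(str(d))
    # Remainders come out least-significant first, so reverse them.
    return "".join(reversed(digits))

print(int_to_base(250, 2))  # 11111010
```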
3.2. Converting Fractions to Another Number System
If the real number $a$ is not an integer, we write it as $a = b + c$, where $b$ is the integer part and

$$c = (0.d_{-1} d_{-2} d_{-3} \ldots)_\beta$$

is the fractional part, where the digits $d_{-1}, d_{-2}, d_{-3}, \ldots$ are to be determined.

Observe that multiplying both sides by the base $\beta$ yields

$$\beta c = d_{-1} + (0.d_{-2} d_{-3} \ldots)_\beta,$$

an integer and a fractional portion. The integer portion is precisely $d_{-1}$ – the first digit of $c$ represented in base $\beta$. Consecutive digits are obtained as the integer parts when successively multiplying the fractional parts by $\beta$.
In general, if a fraction $c$ must be converted to another number system with base $\beta$, we perform successive multiplications of $c$ by $\beta$: set $c_0 = c$ and compute

$$\beta c_k = d_{-(k+1)} + c_{k+1}, \qquad k = 0, 1, 2, \ldots$$

where $d_{-(k+1)}$ is the integer part and $c_{k+1}$ is the fractional part of the product.
Let’s look at an example of converting $(0.15625)_{10}$ to the binary number system:

| $k$ | $c_k$ | $2 c_k$ | $d_{-(k+1)}$ | $c_{k+1}$ |
|---|---|---|---|---|
| 0 | 0.15625 | 0.3125 | 0 | 0.3125 |
| 1 | 0.3125 | 0.625 | 0 | 0.625 |
| 2 | 0.625 | 1.25 | 1 | 0.25 |
| 3 | 0.25 | 0.50 | 0 | 0.50 |
| 4 | 0.50 | 1.00 | 1 | 0.00 |
Therefore, $(0.15625)_{10} = (0.00101)_2$ in the binary system.
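Analogously, here is a minimal Python sketch of the successive-multiplication algorithm (the helper name `frac_to_base` and the digit cap are our own choices):

```python
def frac_to_base(c: float, beta: int, max_digits: int = 20) -> str:
    """Convert a fraction 0 <= c < 1 to its digit string in base `beta`
    by repeated multiplication, collecting the integer parts."""
    digits = []
    for _ in range(max_digits):
        if c == 0.0:
            break
        c *= beta
        d = int(c)       # integer part is the next digit
        digits.append(str(d))
        c -= d           # keep only the fractional part for the next step
    return "0." + "".join(digits)

print(frac_to_base(0.15625, 2))  # 0.00101
print(frac_to_base(0.1, 2))      # 0.00011001100110011001 (the expansion of 0.1 never terminates)
```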
3.3. Fixed and Floating-Point Representation
Computers are equipped to handle pieces of information of a fixed size called a word. Typical word lengths are 32-bits or 64-bits.
Early computers made numerical calculations in a fixed-point number system. That is, real numbers were represented using a fixed number of binary digits in the fractional part. If the word length of the computer is $s + 1$ bits (including the sign bit), then only numbers inside a bounded interval $I$ are permitted. This limitation is problematic, since, for example, even when $x \in I$ and $y \in I$, it is possible that $x \cdot y$ or $x / y$ is outside the bounds of the interval $I$.
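To make the limitation concrete, here is a small Python sketch of a toy fixed-point format we made up purely for illustration – an 8-bit signed word with 4 fractional bits – showing a product that leaves the representable range:

```python
# Toy fixed-point format (illustrative only): 8-bit signed word, 4 fractional bits.
# A real number x is stored as the integer round(x * 2**4), which must fit in [-128, 127].
FRAC_BITS = 4
SCALE = 2 ** FRAC_BITS           # 16
LO, HI = -(2 ** 7), 2 ** 7 - 1   # storable integers

def to_fixed(x: float) -> int:
    q = round(x * SCALE)
    if not LO <= q <= HI:
        raise OverflowError(f"{x} is outside the fixed-point range")
    return q

x, y = 3.5, 3.5                  # both representable: 3.5 * 16 = 56
print(to_fixed(x), to_fixed(y))  # 56 56
print(to_fixed(x * y))           # 3.5 * 3.5 = 12.25 -> 196: OverflowError
```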
In a floating-point number system, a real number $a$ is represented as:

$$a = \pm m \cdot \beta^{e}, \qquad m = (0.d_1 d_2 d_3 \ldots)_\beta$$

The fractional part $m$ of the number is called the mantissa or significand, $e$ is called the exponent and $\beta$ is the base. It’s clear that we can always arrange $d_1 \neq 0$, because if $d_1 = 0$, then we can decrease the exponent by $1$ and shift the decimal point one place to the right.

The mantissa $m$ and the exponent $e$ are limited by the fixed word length in computers: $m$ is rounded off to a number $\bar{m}$ with $t$ digits, and the exponent $e$ lies in a certain range $e_{\min} \le e \le e_{\max}$.
Thus, we can only represent floating-point numbers of the form:

$$\bar{a} = \pm \bar{m} \cdot \beta^{e}, \qquad \bar{m} = (0.d_1 d_2 \ldots d_t)_\beta, \quad d_1 \neq 0, \quad e_{\min} \le e \le e_{\max}$$
A floating-point number system is completely characterized by the base $\beta$, the precision $t$, and the numbers $e_{\min}$ and $e_{\max}$. Since $d_1 \neq 0$, the set $F = F(\beta, t, e_{\min}, e_{\max})$ contains, including the number $0$,

$$2 (\beta - 1) \beta^{t-1} (e_{\max} - e_{\min} + 1) + 1$$

numbers. Intuitively, there are $2$ choices for the sign. The digit $d_1$ can be chosen from the set $\{1, 2, \ldots, \beta - 1\}$, in $\beta - 1$ ways. Each of the successive $t - 1$ digits can be chosen from $\{0, 1, \ldots, \beta - 1\}$, in $\beta$ ways. The exponent can be chosen from $e_{\max} - e_{\min} + 1$ numbers. By the multiplication rule, there are $2(\beta - 1)\beta^{t-1}(e_{\max} - e_{\min} + 1)$ distinguishable nonzero numbers. Including the number $0$ gives us the expression above.
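The counting argument is easy to check in code. A one-line Python helper (our own, for illustration) evaluates the formula for any parameter choice:

```python
def fp_count(beta: int, t: int, e_min: int, e_max: int) -> int:
    """Number of elements of F(beta, t, e_min, e_max), including 0."""
    return 2 * (beta - 1) * beta ** (t - 1) * (e_max - e_min + 1) + 1

print(fp_count(2, 3, -1, 2))   # 33  (the toy system of the next subsection)
print(fp_count(10, 4, -9, 9))  # 342001
```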
3.4. Why Are Floating-Point Numbers Inaccurate?
Consider a toy floating-point number system, say $F(\beta = 2, t = 3, e_{\min} = -1, e_{\max} = 2)$. The set $F$ contains exactly $33$ numbers. The positive numbers in the set $F$ are:

$$0.25,\ 0.3125,\ 0.375,\ 0.4375,\ 0.5,\ 0.625,\ 0.75,\ 0.875,\ 1.0,\ 1.25,\ 1.5,\ 1.75,\ 2.0,\ 2.5,\ 3.0,\ 3.5$$

It is apparent that not all real numbers, for instance $0.3$, are present in $F$. Moreover, not all floating-point numbers are equally spaced; the spacing jumps by a factor of $2$ at each power of $2$.
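A short Python sketch (the helper `enumerate_positive` is our own) that enumerates the positive members of this toy system makes the uneven spacing and the count of $33$ easy to verify:

```python
from itertools import product

def enumerate_positive(beta: int, t: int, e_min: int, e_max: int) -> list[float]:
    """All positive numbers of F(beta, t, e_min, e_max) with normalized t-digit mantissas."""
    values = set()
    for e in range(e_min, e_max + 1):
        for digits in product(range(beta), repeat=t):
            if digits[0] == 0:  # normalization: leading digit must be nonzero
                continue
            m = sum(d * beta ** -(i + 1) for i, d in enumerate(digits))
            values.add(m * beta ** e)
    return sorted(values)

pos = enumerate_positive(2, 3, -1, 2)
print(pos)               # [0.25, 0.3125, 0.375, ..., 3.0, 3.5]
print(len(pos) * 2 + 1)  # 33, matching the counting formula
```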
The spacing of the floating-point numbers is characterized by the machine epsilon $\epsilon_M$, which is the distance from $1$ to the next largest floating-point number.
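For IEEE double precision the machine epsilon is $2^{-52}$; in Python it can be read off directly and confirmed as the gap just above $1.0$:

```python
import math
import sys

eps = sys.float_info.epsilon
print(eps)                   # 2.220446049250313e-16, i.e. 2**-52
print(math.ulp(1.0) == eps)  # True: the gap between 1.0 and the next double is eps
print(1.0 + eps > 1.0)       # True
print(1.0 + eps / 4 == 1.0)  # True: too small a step to reach the next double
```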
If a real number $x$ is in the range of the floating-point system, the obvious thing to do is to round off $x$ to $fl(x)$, where $fl(x)$ is the floating-point number in $F$ that is closest to $x$. It should already be clear that representing $x$ by $fl(x)$ introduces an error.
One interesting question is: how large is this error? It would be great if we could guarantee that the round-off error at no point exceeds a certain amount. Everybody loves a guarantee! In other words, we seek an upper bound for the relative error:

$$\left| \frac{fl(x) - x}{x} \right|$$
Recall, from the earlier discussion, that when rounding to $t$ decimals, we leave the $t$th decimal unchanged if the part $r$ of the number to the right of the $t$th decimal is smaller than half a unit in the $t$th place, else we raise the $t$th decimal by $1$. Here, we are working with the generic base $\beta$ instead of the decimal base $10$. Consequently, the round-off error in the mantissa is bounded by:

$$|\bar{m} - m| \le \frac{1}{2} \beta^{-t}$$
The relative round-off error in $x$ is thus bounded by:

$$\left| \frac{fl(x) - x}{x} \right| = \frac{|\bar{m} - m| \cdot \beta^{e}}{m \cdot \beta^{e}} \le \frac{\frac{1}{2}\beta^{-t}}{\beta^{-1}} = \frac{1}{2} \beta^{1-t}$$

since the normalized mantissa satisfies $m \ge \beta^{-1}$.
Modern floating-point standards such as IEEE 754 guarantee this upper bound. This upper bound is called the rounding unit (or unit roundoff), denoted by $u = \frac{1}{2} \beta^{1-t}$.
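We can spot-check this guarantee for IEEE double precision (where $\beta = 2$ and $t = 53$, so $u = 2^{-53}$) with a short Python sketch; the sample values below are arbitrary:

```python
from decimal import Decimal

u = 2.0 ** -53  # rounding unit for IEEE double precision

for s in ["0.1", "0.3", "2.675", "123456.789", "3.14159265358979"]:
    exact = Decimal(s)                       # the real number we asked for
    stored = Decimal(float(s))               # exact value of the nearest double
    rel_err = abs((stored - exact) / exact)  # relative representation error
    print(s, float(rel_err) <= u)            # True for every sample
```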
4. IEEE Floating-point Standard in a Nutshell
Actual computer hardware implementations of floating-point systems differ from the toy system we just designed. Most current computers conform to the IEEE 754 standard for binary floating-point arithmetic. There are two main basic formats – single and double precision – requiring 32-bit and 64-bit storage respectively.
In single precision, a floating-point number is stored as the sign $s$ ($1$ bit), the exponent $e$ ($8$ bits), and the mantissa $m$ ($23$ bits).

In double precision, $11$ bits are allocated to the exponent $e$, whereas $52$ bits are allocated to the mantissa $m$. The exponent is stored in biased form, as $e + 1023$.
The value of the floating-point number $a$ in the normal case is:

$$a = (-1)^s \, (1.m)_2 \cdot 2^{e}$$

Note that the digit before the binary point is always $1$, similar to the scientific notation we studied in high school. In that way, $1$ bit is gained for the mantissa.
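To see the three fields directly, we can reinterpret a double’s 64 bits as an integer in Python; the helper `fields` below is our own, written just for inspection:

```python
import struct

def fields(x: float) -> tuple[int, int, int]:
    """Return (sign, biased exponent, mantissa) of a double's 64-bit pattern."""
    bits = struct.unpack(">Q", struct.pack(">d", x))[0]
    sign = bits >> 63
    biased_exp = (bits >> 52) & 0x7FF  # 11 exponent bits
    mantissa = bits & ((1 << 52) - 1)  # 52 mantissa bits
    return sign, biased_exp, mantissa

s, be, m = fields(0.1)
print(s, be, be - 1023)        # 0 1019 -4  (0.1 = 1.6 * 2**-4, as examined next)
print(format(m, "052b")[-8:])  # 10011010   (last bits of the rounded mantissa)
```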
4.1. Quirks in Floating-Point Arithmetic
Consider the following comparison:
(0.10 + 0.20) == 0.30
The result of this logical comparison is false. This abrupt behavior might tempt us to conclude that the floating-point system is broken. However, let’s take a deeper look at what’s going on.
Let’s put $0.1$ in the double-precision format. Because $0.1$ is positive, the sign bit $s = 0$.

$0.1$ in base-$2$ scientific notation can be written as $(1.m)_2 \times 2^{e}$. This means we must factor $0.1$ into a number in the range $[1, 2)$ and a power of $2$. If we divide $0.1$ by different powers of $2$, we get:

$$\frac{0.1}{2^{-1}} = 0.2, \qquad \frac{0.1}{2^{-2}} = 0.4, \qquad \frac{0.1}{2^{-3}} = 0.8, \qquad \frac{0.1}{2^{-4}} = 1.6$$
Therefore, $0.1 = 1.6 \times 2^{-4}$, so $e = -4$. The exponent is stored as $-4 + 1023 = 1019$, so the bit pattern in the exponent part is $01111111011$.
The mantissa is the fractional part, $0.6$, in binary form. Successive multiplication by $2$ quickly yields $0.6 = (0.1001\,1001\,1001\ldots)_2$, with the block $1001$ repeating forever. However, the double-precision format allocates $52$ bits to the mantissa, so we must round off $0.6$ to $52$ binary digits. The fractional part after the $52$nd digit, $(0.1001\,1001\ldots)_2 = 0.6$, exceeds $\frac{1}{2}$. So, we raise the $52$nd digit by $1$. Thus, the rounded mantissa is

$$\underbrace{1001\,1001\,\ldots\,1001}_{48 \text{ bits}}\,1010.$$
Finally, we put the binary strings in the correct order. So, $0.1$ in the IEEE double-precision format is:

$$0 \;\big|\; 01111111011 \;\big|\; \underbrace{1001\,1001\,\ldots\,1001}_{48 \text{ bits}}\,1010$$

This machine number is approximately $0.10000000000000000555$ in base $10$. In a similar fashion, the closest machine number to $0.2$ in the floating-point system is approximately $0.20000000000000001110$. Therefore, the sum $0.1 + 0.2$ is computed as approximately $0.30000000000000001665$, which is then itself rounded to the nearest machine number, approximately $0.30000000000000004441$. On the right-hand side, the closest machine number to $0.3$ is approximately $0.29999999999999998890$. So, the two sides of the comparison are different machine numbers, and the test evaluates to false.
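These expansions are easy to verify in Python by printing the stored values to 20 decimal places; the hexadecimal representation of $0.1$ also shows the mantissa ending in the rounded-up digit `a` (binary `1010`):

```python
print(f"{0.1:.20f}")        # 0.10000000000000000555
print(f"{0.2:.20f}")        # 0.20000000000000001110
print(f"{0.1 + 0.2:.20f}")  # 0.30000000000000004441
print(f"{0.3:.20f}")        # 0.29999999999999998890
print((0.1).hex())          # 0x1.999999999999ap-4  (52-bit mantissa rounded up)
print(0.1 + 0.2 == 0.3)     # False
```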
To summarize, algebraically equivalent statements are not necessarily numerically equivalent.
5. Conclusion
In this article, we’ve learned how the rules of chopping and rounding to the nearest work. We developed an algorithm for conversion between number systems. We also learnt that any floating-point system is characterized by a quadruple $(\beta, t, e_{\min}, e_{\max})$. For example, the IEEE double-precision format is given by $(2, 53, -1022, 1023)$. Because computers have finite memory capacity, the set of machine numbers $F$ is only a subset of the real numbers $\mathbb{R}$. The spacing between machine numbers is, in technical speak, called the machine epsilon.
Further, any real number is always rounded off to the nearest machine number on a computer. The IEEE standards guarantee that the relative roundoff error never exceeds the rounding unit $u$. Moreover, as programmers, we need to take proper care when doing floating-point operations.