Floating-Point Numbers Representation

Understand floating-point representation in computer systems. Learn about precision, rounding errors, and how numbers are stored and processed. See how floating-point numbers are represented in binary formats.

Format

In computing, a real number is stored as a fixed-size sequence of bits, most commonly 32 or 64 bits long.

The first bit is the sign, followed by the exponent bits. The remaining bits hold the significand (mantissa). Floating-point formats and arithmetic are defined by the IEEE 754 standard.

Single-precision values (the float data type in C) are stored as 4 bytes, or 32 bits: 1 sign bit, an 8-bit exponent, and a 23-bit significand.

Double-precision values (the double data type in C) are stored as 8 bytes, or 64 bits: 1 sign bit, an 11-bit exponent, and a 52-bit significand.
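The single-precision layout described above can be inspected directly. The following is a minimal sketch, assuming a JavaScript runtime with DataView; float32Fields and toBinary are hypothetical helper names introduced here for illustration only.

```javascript
// float32Fields is a hypothetical helper, not a built-in.
// It splits a number into its IEEE 754 single-precision fields.
function float32Fields(value) {
  const view = new DataView(new ArrayBuffer(4));
  view.setFloat32(0, value);          // rounds the JS double to float32
  const bits = view.getUint32(0);     // the same 4 bytes read as an integer
  return {
    sign: bits >>> 31,                // 1 bit
    exponent: (bits >>> 23) & 0xff,   // 8 bits, in stored (biased) form
    significand: bits & 0x7fffff,     // 23 bits, without the implicit leading 1
  };
}

// Hypothetical formatting helper: zero-padded binary string.
function toBinary(n, width) {
  return n.toString(2).padStart(width, "0");
}

const f = float32Fields(-2.0);
console.log(f.sign, toBinary(f.exponent, 8), toBinary(f.significand, 23));
// 1 10000000 00000000000000000000000
```

For -2.0 the sign bit is 1, the stored exponent is 128, and all significand bits are zero. The same approach should work for double precision with setFloat64 and an 11-bit and 52-bit split, although the 64-bit pattern no longer fits in a single 32-bit integer.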

Other formats include 2-byte (half precision), 10-byte (double-extended precision), and 16-byte (quadruple precision). The single-precision format is described in more detail in the Wikipedia article on the single-precision floating-point format.

Exponent

Exponents are powers of two. The exponent is stored in biased form: to get the actual exponent, subtract 127 (single precision) or 1023 (double precision) from the stored value. Note: exponent fields of all zeros and all ones are reserved for special values (zero and subnormal numbers, and infinities and NaN, respectively).

For example, suppose the stored single-precision exponent is 123. The actual exponent is 123 - 127 = -4, so the exponent part encodes the value $2^{-4} = 1/2^4 = 0.0625$.
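The bias arithmetic from this example can be sketched in a few lines. This is an illustration only: unbiasedExponent and BIAS are hypothetical names, and returning null for the reserved patterns is an assumption made for this sketch.

```javascript
// Bias values from the text: 127 for single precision, 1023 for double precision.
const BIAS = { single: 127, double: 1023 };

// unbiasedExponent is a hypothetical helper used only in this sketch.
function unbiasedExponent(stored, precision = "single") {
  const bits = precision === "single" ? 8 : 11;
  const allOnes = 2 ** bits - 1;      // 255 or 2047
  if (stored === 0 || stored === allOnes) {
    return null;                      // reserved for special numbers
  }
  return stored - BIAS[precision];
}

const e = unbiasedExponent(123);      // 123 - 127
console.log(e, 2 ** e);               // -4 0.0625
```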

Mantissa

The mantissa is stored as a binary fraction of the form 1.NNN..., but the leading 1 is not stored; it is assumed. For example, the number 1.625 is represented as $1.625 \times 2^0$, and only the fractional part 0.625 is stored, in binary. The stored bits are 1010000 00000000 00000000, which is $1/2^1 + 0/2^2 + 1/2^3 + 0/2^4 + \dots + 0/2^{23} = 1/2 + 1/8 = 0.625$.
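The sum over the stored bits can be checked with a short sketch. fractionFromBits is a hypothetical helper name; it assumes the bits are given as a string, most significant bit first, with optional spaces.

```javascript
// fractionFromBits is a hypothetical helper: bit i (counting from 1)
// contributes 1 / 2^i to the fraction when it is set.
function fractionFromBits(bitString) {
  const bits = bitString.replace(/\s/g, "");
  let fraction = 0;
  for (let i = 0; i < bits.length; i++) {
    if (bits[i] === "1") fraction += 2 ** -(i + 1);
  }
  return fraction;
}

const stored = "1010000 00000000 00000000";  // the 23 stored bits for 1.625
const fraction = fractionFromBits(stored);   // 1/2 + 1/8
const value = (1 + fraction) * 2 ** 0;       // restore the assumed 1, exponent 0
console.log(fraction, value);                // 0.625 1.625
```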

How to Deal with Inaccuracies in Real Numbers Caused by Binary Representation?

If the range of real-number values is known and exact decimal results matter, consider using decimal data types such as Decimal in Python, BigDecimal in Java, or the bignumber.js library in JavaScript.

Avoid comparing real numbers using the equality operator, such as if (x == 0.3). Instead, compare the difference between the numbers with a small threshold, like this:

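const DELTA = 0.001; // tolerance; pick a value that suits the magnitude of your data
const x = 0.1 + 0.2; // example value: 0.30000000000000004, not exactly 0.3
if (Math.abs(x - 0.3) <= DELTA) console.log("x is close enough to 0.3");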
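A fixed threshold such as 0.001 suits values of roughly known magnitude. When magnitudes vary, one common alternative is to scale the tolerance with the operands. The sketch below is one possible approach, not a standard API: nearlyEqual is a hypothetical helper, and the default tolerances are assumptions to be tuned per application.

```javascript
// nearlyEqual is a hypothetical helper; relTol and absTol are assumed defaults.
function nearlyEqual(a, b, relTol = 1e-9, absTol = 1e-12) {
  const diff = Math.abs(a - b);
  const scale = Math.max(Math.abs(a), Math.abs(b));
  // absTol covers comparisons near zero, relTol covers large magnitudes.
  return diff <= Math.max(absTol, relTol * scale);
}

const total = 0.1 + 0.2;
console.log(total === 0.3);            // false
console.log(nearlyEqual(total, 0.3));  // true
```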

Representation and rounding errors accumulate when many real numbers are summed. Algorithms such as Kahan summation reduce the accumulated error but do not eliminate it.
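As an illustration, here is a minimal sketch of Kahan (compensated) summation next to a plain loop. The function names naiveSum and kahanSum are chosen for this example only.

```javascript
// Plain left-to-right summation.
function naiveSum(values) {
  let sum = 0;
  for (const v of values) sum += v;
  return sum;
}

// Kahan summation: carries a correction term for the low-order bits
// that are lost in each addition.
function kahanSum(values) {
  let sum = 0;
  let compensation = 0;
  for (const v of values) {
    const y = v - compensation;       // apply the correction from the last step
    const t = sum + y;                // low-order bits of y may be lost here
    compensation = (t - sum) - y;     // recover what was lost (with opposite sign)
    sum = t;
  }
  return sum;
}

const values = new Array(10).fill(0.1);
console.log(naiveSum(values));  // 0.9999999999999999
console.log(kahanSum(values));  // 1
```

In this small case the compensated sum happens to land exactly on 1. In general the result is still rounded, only with a much smaller accumulated error.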
