MAD 3401
IEEE Notes - Sect. 2
Generic IEEE-style floating-point representation

 ====================

It is useful pedagogically (for discussion, homework, and exams) to have a simple, compact, but still concrete version of the IEEE representation. Theoretical treatments, such as that in the ACM article I referenced above, describe arithmetic in symbolic terms without reference to an actual format. Although useful, I think that kind of approach is more abstract than we need in this course.

I will define an IEEE-style representation by writing something like

s eeeeee fffffffff

to specify how the bits (16 in this case) are allocated: a leading sign bit, s; Ne bits (6 in this case) for the biased exponent; and the remaining Nf bits (9 in this case) for the fractional part of the mantissa.
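As a sketch of how such a layout works, the three fields can be separated by shifting and masking. The helper below is hypothetical (it is not part of these notes); it splits a 16-bit pattern according to the s/eeeeee/fffffffff layout, with Ne and Nf as parameters so other allocations can be tried.

```python
# Hypothetical helper: split a bit pattern into the three fields of
# the "s eeeeee fffffffff" layout (1 sign bit, Ne exponent bits,
# Nf fraction bits).
def split_fields(bits, ne=6, nf=9):
    """Return (sign, biased_exponent, fraction) as integers."""
    sign = (bits >> (ne + nf)) & 0x1             # leading sign bit
    biased_exp = (bits >> nf) & ((1 << ne) - 1)  # next Ne bits
    fraction = bits & ((1 << nf) - 1)            # remaining Nf bits
    return sign, biased_exp, fraction

# Pattern 0 100000 100000000 -> sign 0, biased exponent 32, fraction 256
print(split_fields(0b0100000100000000))
```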

Note that there is no rule governing the number of exponent bits one chooses to use. With a fixed number of bits, there is a tradeoff between the size of the exponent (and thus the largest and smallest numbers that can be represented) and the size of the mantissa (and thus the precision of the numbers that can be represented). The single (32-bit) and double (64-bit) IEEE standards make choices that were considered by the committee members to be an optimal compromise. Some computer manufacturers reached a different conclusion and arrange their floating-point formats according to a different convention. You should always know what arithmetic model your computer uses.

For a number written in normalized form as ±1.fff... x 2^p, we store the biased exponent bias+p, where bias = 01...1 (binary) = 2^{Ne-1} - 1 for Ne exponent bits.

The upper limit on the power p is U = bias; the lower limit is L = - (bias-1).
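These formulas are easy to check with a few lines of code. The function name below is my own (not from the notes); the Ne = 6 case is the 16-bit example above, and the Ne = 8 case reproduces the familiar limits of the IEEE single format.

```python
# Exponent limits from the bit count alone.
def exponent_limits(ne):
    """Return (bias, U, L) for an Ne-bit biased exponent field."""
    bias = 2 ** (ne - 1) - 1    # bias = 01...1 in binary
    return bias, bias, -(bias - 1)

print(exponent_limits(6))  # 16-bit example: (31, 31, -30)
print(exponent_limits(8))  # IEEE single:    (127, 127, -126)
```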

Since there is a hidden bit along with the Nf bits in the fractional part of the mantissa, there is a total of Nf+1 bits of precision. Since the mantissa is rounded, the magnitude of the relative error is bounded by 2^{-(Nf+1)}.
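The error bound can be tested numerically. The sketch below (my own, using Python's math.frexp to get the mantissa and exponent) rounds a value to Nf+1 significant bits and checks the observed relative error against the bound 2^{-(Nf+1)} for the Nf = 9 example.

```python
import math

# Round x = m * 2**p to Nf+1 significant bits and compare the
# relative error with the bound 2**-(Nf+1).  Nf = 9 as in the
# 16-bit example format.
nf = 9
x = math.pi
m, p = math.frexp(x)               # x = m * 2**p with 0.5 <= m < 1
rounded = round(m * 2 ** (nf + 1)) * 2 ** (p - nf - 1)
rel_err = abs(rounded - x) / x
print(rel_err <= 2 ** -(nf + 1))   # True: error is within the bound
```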

(Another way to look at precision: the rounded mantissa is accurate to one part in M = 2^{Nf+1}.)

The largest number we can store before overflow occurs is 1.111...111 x 2^{U}.
This is exactly (2 - 2^{-Nf}) * 2^{U}, which is approximately 2^{U+1}. Overflow is to Infinity.

The smallest positive number we can store before underflow occurs is 1.000...000 x 2^{L}. Remember that L is negative. Underflow is to zero.
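Plugging the 16-bit example (Ne = 6, so U = 31 and L = -30; Nf = 9) into these formulas gives concrete limits. The snippet below is plain arithmetic, not library code.

```python
# Range of the 16-bit example format (Ne = 6, Nf = 9).
nf, U, L = 9, 31, -30
largest = (2 - 2 ** -nf) * 2 ** U   # 1.111...111 x 2**U
smallest = 1.0 * 2 ** L             # 1.000...000 x 2**L
print(largest)    # 4290772992.0, just under 2**32
print(smallest)   # 2**-30, about 9.3e-10
```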

The 32-bit (see section #3) and 64-bit (see section #5) cases defined in the actual IEEE 754 standard will serve as examples of how this works.

 ====================

Next: Section 3: 32-bit IEEE representation

 ====================

This material is © Copyright 1996, by James Carr. FSU students enrolled in MAD-3401 have permission to make personal copies of this document for use when studying. Other academic users may link to this page but may not copy or redistribute the material without the author's permission.
