IEEE Notes - Sect. 2

Generic IEEE-style floating-point representation

It is useful pedagogically (for discussion, homework, and exams) to have a simple, compact, but still concrete version of the IEEE representation. Theoretical treatments, such as the one in the ACM article I referenced above, discuss arithmetic in symbolic terms without reference to an actual format. Although useful, I think that kind of approach is more abstract than we need in this course.

I will define an IEEE-style representation by writing something like

`s eeeeee fffffffff`

to specify that the bits (16 in this case) are allocated so that
there is a leading sign bit, `s`,
**Ne** bits (6 in this case) are devoted to the biased exponent,
while the remaining **Nf** bits (9 in this case) are used for
the fractional part of the mantissa.
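The layout above can be made concrete with a small sketch. This is not part of the notes: the helper name `encode` and the use of Python's `math.frexp` are my own choices, and the sketch ignores rounding-mode subtleties, subnormals, and Inf/NaN.

```python
import math

NE, NF = 6, 9                # exponent and fraction bit counts in the example
BIAS = (1 << (NE - 1)) - 1   # 2^(Ne-1) - 1 = 31 (defined later in this section)

def encode(x):
    """Pack x into the 16-bit 's eeeeee fffffffff' layout (normal numbers only)."""
    s = 0 if x >= 0 else 1
    m, p = math.frexp(abs(x))        # abs(x) = m * 2**p with 0.5 <= m < 1
    m, p = m * 2, p - 1              # renormalize so 1.0 <= m < 2
    e = p + BIAS                     # biased exponent
    f = round((m - 1) * (1 << NF))   # Nf fractional bits, rounded
    return f"{s} {e:0{NE}b} {f:0{NF}b}"

print(encode(1.0))    # 0 011111 000000000
print(encode(-2.5))   # 1 100000 010000000
```

Note that the hidden leading 1 of the mantissa is dropped before packing, which is why only the fractional bits appear in the bit string.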

**Note** that there is no rule governing the number of exponent
bits one chooses to use. With a fixed number of bits,
**there is a tradeoff** between the size of the
exponent (and thus the largest and smallest numbers that
can be represented) and the size of the mantissa (and thus
the precision of the numbers that can be represented).
The single (32-bit)
and double (64-bit) IEEE standards
make choices that the committee members considered
an optimal compromise. Some computer manufacturers
reached a different conclusion and arranged their floating
point formats according to a different convention. You should
always know what arithmetic model is being used by the
computer you are using.

We store *bias*+**p**, where **p** is the power of 2 in the
normalized representation and *bias* = 01...1 (binary) =
2^{**Ne**-1} - 1 for **Ne** exponent bits.

The upper limit on the power **p** is **U** = *bias*;
the lower limit is **L** = - (*bias*-1).
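For the 16-bit example (**Ne** = 6), these formulas work out as follows; this is just a direct check of the definitions above, not anything new.

```python
NE = 6                        # exponent bits in the 16-bit example
bias = (1 << (NE - 1)) - 1    # 01...1 in binary = 2^(Ne-1) - 1
U = bias                      # upper limit on the power p
L = -(bias - 1)               # lower limit on the power p
print(bias, U, L)             # 31 31 -30
```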

Since there is a hidden bit along with the **Nf** bits
in the fractional part of the mantissa, there is a total of
**Nf+1** bits of precision. Since the mantissa is rounded,
the magnitude of the relative error is bounded by 2^{-(**Nf+1**)}.

(The largest possible integer mantissa, M = 2^{**Nf+1**} - 1
(all **Nf**+1 bits set), provides another way to look at precision.)
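The relative-error bound can be checked empirically. The sketch below (my own, not from the notes) rounds random normalized mantissas to **Nf** fractional bits and confirms the worst relative error stays within 2^{-(**Nf**+1)}.

```python
import random

NF = 9                        # fraction bits in the 16-bit example
bound = 2.0 ** -(NF + 1)      # claimed relative-error bound, 2^-10

random.seed(0)
worst = 0.0
for _ in range(10_000):
    m = random.uniform(1.0, 2.0)            # a normalized mantissa
    rounded = round(m * 2**NF) / 2**NF      # keep Nf fractional bits
    worst = max(worst, abs(rounded - m) / m)

print(f"worst relative error {worst:.3e} <= bound {bound:.3e}")
```

The bound holds because the absolute rounding error is at most half a unit in the last place, 2^{-(**Nf**+1)}, and the mantissa being divided by is at least 1.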

The largest number we can store before overflow occurs
is 1.111...111 x 2^{**U**}.

This is exactly (2 - 2^{-**Nf**}) * 2^{**U**},
which is approximately 2^{**U+1**}.
Overflow is to **Inf**inity.

The smallest number we can store before underflow occurs
is 1.000...000 x 2^{**L**}.
Remember that **L** is negative. Underflow is to zero.
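Evaluating these thresholds for the 16-bit example (**Nf** = 9, **U** = 31, **L** = -30) shows how close the exact overflow limit is to the 2^{**U+1**} approximation:

```python
NF, U, L = 9, 31, -30
largest = (2 - 2.0**-NF) * 2.0**U    # exact: 1.111...111 x 2^U
approx  = 2.0**(U + 1)               # the stated approximation
smallest = 2.0**L                    # 1.000...000 x 2^L
print(largest)    # 4290772992.0
print(approx)     # 4294967296.0
print(smallest)   # about 9.3e-10
```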

The 32-bit (see section #3) and 64-bit (see section #5) cases defined in the actual IEEE 754 standard will serve as examples of how this works.

**Next:**
Section 3: 32-bit IEEE representation

This material is © Copyright 1996, by James Carr. FSU students enrolled in MAD-3401 have permission to make personal copies of this document for use when studying. Other academic users may link to this page but may not copy or redistribute the material without the author's permission.
