MAD 3401
IEEE Floating-Point Notes

These days, a significant fraction of numerical work is done on workstations that conform to IEEE (Institute of Electrical and Electronics Engineers) Standard 754, which specifies how floating-point numbers are stored and how arithmetic is done with them on electronic digital computers. In particular, the workstations available in the CS majors lab and in the Math Department lab are IEEE-compliant machines. That is why I use the IEEE 754 floating-point representation as a specific example of the subject matter presented in Chapter 2 of the textbook.

These notes provide an electronic web-textbook for review of what is covered in my class lectures. The Chapter 2 problems I assign, particularly the additional problems that I have created, illustrate applications of this material. © Copyright 1996.


Reference material

The article "What Every Computer Scientist Should Know About Floating-Point Arithmetic" by David Goldberg provides a tutorial on floating-point numbers and arithmetic, particularly as it affects computer system designers. It appeared in ACM Computing Surveys, Vol. 23, No. 1 (March 1991); that journal is filed under 510.5 C7386 in the FSU Dirac Science Library.

Another suitable reference on the storage of floating-point numbers according to the IEEE standard is Section 4.4 [page 103] of A Programmer's View of Computer Architecture by Goodman and Miller, the current textbook for COP-3400.


1. Floating-point numbers

We will work with normalized floating-point numbers of the form x = sign * 1.fffff * 2^p, where 1.fffff is the mantissa (the leading 1 before the binary point makes the number "normalized") and p is the exponent.
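Any nonzero number can be put into this form by pulling out the sign and shifting the binary point until the mantissa lies between 1 and 2. A minimal sketch in Python (the function name `normalize` is mine, not part of any standard library) makes the decomposition concrete:

```python
import math

def normalize(x):
    """Decompose nonzero x into (sign, significand, exponent) so that
    x = sign * significand * 2**exponent with 1 <= significand < 2."""
    sign = -1.0 if x < 0 else 1.0
    m, e = math.frexp(abs(x))     # abs(x) = m * 2**e with 0.5 <= m < 1
    return sign, 2.0 * m, e - 1   # shift so 1 <= significand < 2

# example: 10.0 = +1.25 * 2**3
print(normalize(10.0))   # (1.0, 1.25, 3)
```

Note that `math.frexp` returns the mantissa in [0.5, 1), so one doubling and an exponent adjustment give the normalized form used in these notes.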

2. Generic IEEE-style floating-point representation

Most examples and problems will use a generalization of the formal IEEE 754 specification that can be applied to a word size of some arbitrary (but small) number of bits. This uses a sign bit, a biased exponent stored in Ne bits (with bias 2^(Ne-1) - 1, as in IEEE 754), and Nf bits for the fractional part of a normalized and rounded mantissa.
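The encoding procedure for this generic format can be sketched as follows. This is an illustrative sketch, not the standard's normative algorithm: the function name `encode` is mine, and it assumes the bias 2^(Ne-1) - 1 and round-to-nearest behavior described above, ignoring overflow, zero, and subnormals.

```python
import math

def encode(x, Ne, Nf):
    """Encode nonzero x as (sign bit, biased exponent, fraction bits)
    in a generic IEEE-style format with Ne exponent bits and Nf
    fraction bits.  Bias = 2**(Ne-1) - 1, as in IEEE 754."""
    bias = 2**(Ne - 1) - 1
    sign = 1 if x < 0 else 0
    m, e = math.frexp(abs(x))           # abs(x) = m * 2**e, 0.5 <= m < 1
    significand, p = 2.0 * m, e - 1     # normalized: 1 <= significand < 2
    frac = round((significand - 1.0) * 2**Nf)   # round fraction to Nf bits
    if frac == 2**Nf:                   # rounding carried into the hidden bit
        frac, p = 0, p + 1
    return sign, p + bias, frac

# toy 8-bit format: 1 sign bit, Ne = 3, Nf = 4; 5.5 = 1.011 * 2**2 (binary)
print(encode(5.5, 3, 4))   # (0, 5, 6): exponent 2 + bias 3, fraction 0110
```

Working small examples like this by hand, then checking them against such a sketch, is exactly the skill the Chapter 2 problems exercise.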

3. IEEE 32-bit floating-point representation

Single precision (32-bit) floating-point numbers use Ne = 8 and Nf = 23, which gives about 7 decimal digits of precision and allows storage of numbers whose magnitudes range from roughly 10^{-38} to about 10^{38}.
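On an IEEE-compliant machine you can inspect these three fields directly by packing a value into its 32-bit pattern. A short sketch (the function name `float32_fields` is mine):

```python
import struct

def float32_fields(x):
    """Pack x as an IEEE 754 single and pull out its three bit fields."""
    (bits,) = struct.unpack(">I", struct.pack(">f", x))
    sign = bits >> 31
    exponent = (bits >> 23) & 0xFF   # 8-bit biased exponent, bias 127
    fraction = bits & 0x7FFFFF       # 23-bit fraction
    return sign, exponent, fraction

# 1.0 is stored as biased exponent 127 (= 0 + 127) and all-zero fraction
print(float32_fields(1.0))    # (0, 127, 0)
print(float32_fields(-2.0))   # (1, 128, 0)
```

The bias for Ne = 8 is 2^7 - 1 = 127, so a stored exponent of 128 means a true exponent of 1, as the -2.0 example shows.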

4. Representing Special Values

The IEEE 754 standard uses a special representation for zero. It also provides representations for +Inf and -Inf, which are used when numbers overflow, and for NaN (Not a Number), which is used when an illegal operation (like 0/0) is encountered.
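These special values have recognizable bit patterns in the 32-bit format: zero is all zero bits, while Inf and NaN both use an all-ones exponent field, distinguished by whether the fraction is zero (Inf) or nonzero (NaN). A sketch (the helper name `f32_bits` is mine):

```python
import math
import struct

def f32_bits(x):
    """Raw 32-bit pattern of x stored as an IEEE 754 single."""
    (bits,) = struct.unpack(">I", struct.pack(">f", x))
    return bits

print(hex(f32_bits(0.0)))            # 0x0 : all bits zero
print(hex(f32_bits(float("inf"))))   # 0x7f800000 : exponent all ones, fraction 0
print(hex(f32_bits(float("nan"))))   # exponent all ones, fraction nonzero
print(math.isnan(float("inf") - float("inf")))  # True: Inf - Inf yields NaN
```

The last line illustrates how NaN arises from an operation with no well-defined result, just as 0/0 would.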

5. IEEE 64-bit floating-point representation

Double precision (64-bit) floating-point numbers use Ne = 11 and Nf = 52, which gives about 15 decimal digits of precision and allows storage of numbers whose magnitudes range from roughly 10^{-308} to about 10^{308}.
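Since Python floats are IEEE 754 doubles on conforming hardware, the stated figures can be read off directly from the interpreter:

```python
import sys

print(sys.float_info.mant_dig)  # 53 = Nf + 1 (the hidden bit counts)
print(sys.float_info.dig)       # 15 decimal digits of precision
print(sys.float_info.max)       # about 1.8e308, the largest double
print(sys.float_info.min)       # about 2.2e-308, smallest normalized double
```

Note that `mant_dig` reports 53 rather than 52 because it counts the implicit leading 1 of the normalized mantissa along with the Nf = 52 stored fraction bits.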


This material is © Copyright 1996, by James Carr. FSU students enrolled in MAD-3401 have permission to make personal copies of this document for use when studying. Other academic users may link to this page but may not copy or redistribute the material without the author's permission.
