MAD 3401 IEEE Notes - Sect. 1: Floating-point numbers

The general form for a floating-point number x is
x = s * Mx * B^p
where s is the sign, Mx is the (normalized) mantissa, B is the base (usually written as beta rather than B, and also known as the radix), and p is an integer power. The representation of these numbers in a digital computer restricts p to some range, say [L, U], determined by the number of exponent bits Ne, while the precision of the mantissa Mx is limited to Nf stored bits (Nf+1 significant bits if a "hidden" bit is used).
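As a rough illustration of this decomposition, Python's math.frexp splits a float into a mantissa and a power of two (it happens to normalize the mantissa to [0.5, 1), i.e. the 0.1fffff convention mentioned below, rather than the IEEE 1.fffff convention); the value -6.25 used here is just an arbitrary example:

```python
import math

# Decompose x = s * Mx * 2^p. math.frexp returns (m, p) with x == m * 2**p
# and 0.5 <= |m| < 1, so the sign rides along in m.
x = -6.25
m, p = math.frexp(x)
s = -1 if m < 0 else 1
Mx = abs(m)
print(s, Mx, p)        # -1 0.78125 3, since -6.25 = -0.78125 * 2**3
```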

Many conventions for the choice of B and the normalization of Mx exist; several common ones are described on pages 26 and 30 of our text. Most computer systems today, other than IBM mainframes and their clones, use B = 2. The normalization of the mantissa Mx is chosen to be 1.fffff for the IEEE standard, although you will find systems that use 0.1fffff.
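The IEEE layout with its hidden leading 1 can be inspected directly by reinterpreting a double's bytes as an integer. This sketch assumes the standard 64-bit IEEE 754 double format (1 sign bit, 11 exponent bits biased by 1023, 52 fraction bits); the function name fields is just a label chosen here:

```python
import struct

def fields(x):
    """Split an IEEE 754 double into (sign, unbiased exponent, fraction bits)."""
    bits = struct.unpack(">Q", struct.pack(">d", x))[0]
    sign = bits >> 63
    exponent = (bits >> 52) & 0x7FF          # stored with a bias of 1023
    fraction = bits & ((1 << 52) - 1)        # the f bits; the leading 1 is hidden
    return sign, exponent - 1023, fraction

# 1.0 is stored as 1.000...0 * 2^0: the hidden leading 1 never appears,
# so the fraction field is all zeros.
print(fields(1.0))     # (0, 0, 0)
sign, exp, frac = fields(-6.25)
print(sign, exp)       # 1 2, since -6.25 = -1.5625 * 2**2
```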

Since we only have a certain number of bits for the mantissa, the question arises of what to do with the bits that cannot be stored. We have two choices: we can chop them off (discard them completely) or we can round, increasing the stored part by one unit in its last place if the first discarded bit is 1 and leaving it alone if that bit is 0. This is just like the rounding you do in decimal numbers, only simpler. For example, if we have 1.01010101 and need to store it with only 3 places after the binary (radix) point, chopping gives 1.010 while rounding gives 1.011. Note that the IEEE standard uses rounding; its default mode is round-to-nearest, with ties going to the value whose last bit is even.
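The chopping-versus-rounding example above can be sketched numerically. These helper names (chop, round_to_k) are made up for illustration, and for simplicity ties are rounded up rather than to even as the IEEE default would do:

```python
def chop(x, k):
    """Keep k bits after the binary point by discarding the rest (x >= 0)."""
    return int(x * 2**k) / 2**k       # int() truncates toward zero

def round_to_k(x, k):
    """Round x to k bits after the binary point (nearest; ties round up)."""
    return int(x * 2**k + 0.5) / 2**k

x = 0b101010101 / 2**8                # the binary number 1.01010101
print(chop(x, 3))                     # 1.25  = 1.010 in binary
print(round_to_k(x, 3))               # 1.375 = 1.011 in binary
```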

To summarize:
The IEEE standard uses B = 2 and works from a number written as
x = s * 1.ffff...ff * 2^p

where the mantissa is a binary number whose leading 1 will be "hidden" when the number is stored. The mantissa is rounded to the required number of bits.

Next: Section 2: Generic representation

This material is © Copyright 1996, by James Carr. FSU students enrolled in MAD-3401 have permission to make personal copies of this document for use when studying. Other academic users may link to this page but may not copy or redistribute the material without the author's permission.
where the mantissa is a binary number whose leading 1 will be "hidden" when the number is stored. The mantissa is rounded to the required number of bits. Next: Section 2: Generic representation This material is © Copyright 1996, by James Carr. FSU students enrolled in MAD-3401 have permission to make personal copies of this document for use when studying. Other academic users may link to this page but may not copy or redistribute the material without the author's permission.