EVERYTHING YOU ALWAYS WANTED TO KNOW ABOUT MATH COPROCESSORS

This document has been created to provide the net.community with some
detailed information about mathematical coprocessors for the Intel 80x86 CPU
family. It may also help to answer some of the FAQs (frequently asked
questions) about this topic. The primary focus of this document is on 80387-
compatible chips, but there is also some information on the other chips in
the 80x87 family and the Weitek family of coprocessors. Care was taken to
make the information included as accurate as possible. If you think you have
discovered erroneous information in this text, or think that a certain detail
needs to be clarified, or want to suggest additions, feel free to contact me
at:

         S_JUFFA@IRAVCL.IRA.UKA.DE

         or at my SnailMail address:

         Norbert Juffa
         Wielandtstr. 14
         7500 Karlsruhe 1
         Germany


This is the fifth version of this document (dated 01-13-93) and I'd like
to thank those who have helped to improve it by commenting on the previous
versions:

         Fred Dunlap (cyrix!fred@texsun.Central.Sun.COM), Peter Forsberg
         (peter@vnet.ibm.com), Richard Krehbiel (richk@grevyn.com), Arto
         Viitanen (av@cs.uta.fi), Jerry Whelan (guru@stasi.bradley.edu),
         Eric Johnson (johnson%camax01@uunet.UU.NET), Warren Ferguson
         (ferguson@seas.smu.edu), Bengt Ask (f89ba@efd.lth.se), Thomas Hoberg
         (tmh@prosun.first.gmd.de), Nhuan Doduc (ndoduc@framentec.fr), John
         Levine (johnl@iecc.cambridge.ma.us), David Hough (dgh@validgh.com),
         Duncan Murdoch (dmurdoch@mast.QueensU.CA), Benjamin Eitan
         (benny.iil.intel.com)

A very special thanks goes to David Ruggiero (osiris@halcyon.halcyon.com),
who did a great job editing and formatting this article. Thanks David!


Contents of this document
-------------------------

1)  What are math coprocessors?
2)  How PC programs use a math coprocessor
3)  Which applications benefit from a math coprocessor
4)  Potential performance gains with a math coprocessor
5)  How various math coprocessors work
6)  Coprocessor emulator software
7)  Installing a math coprocessor
8)  Detailed description and specifications for all available math
    coprocessor chips
9)  Finding out which coprocessor you have (the COMPTEST program)
10) Current coprocessor prices and purchasing advice
11) The coprocessor benchmark programs (performance comparisons of
    available math coprocessors using various CPUs)
12) Clock-cycle timings for each coprocessor instruction
13) Accuracy tests and IEEE-754 conformance for various coprocessors
14) Accuracy of transcendental function calculations for various coprocessors
15) Compatibility tests with Intel's 387DX / the SMDIAG program
16) References (literature)
17) Addresses of manufacturers of math coprocessors
18) Appendix A: Test programs for partial compatibility and accuracy checks
19) Appendix B: Benchmark programs TRNSFORM and PEAKFLOP



===========================
What are math coprocessors?
===========================

A coprocessor in the traditional sense is a processor, separate from the main
CPU, that extends the capabilities of a CPU in a transparent manner. This
means that from the program's (and programmer's) point of view, the CPU and
coprocessor together look like a single, unified machine.

The 80x87 family of math coprocessors (also known as MCPs [Math
CoProcessors], NDPs [Numerical Data Processors], NPXs [Numerical Processor
eXtensions], or FPUs [Floating-Point Units], or simply "math chips") are
typical examples of such coprocessors. The 80x86 CPUs, with the exception of
the 80486 (which has a built-in FPU), can only handle 8, 16, or 32 bit
integers as their basic data types. However, many PC-based applications
require the use of not only integers, but floating-point numbers. Simply put,
the use of floating-point numbers enables a binary representation of not only
integers, but also fractional values over a wide range. A common application
of floating-point numbers is in scientific applications, where very small
(e.g., Planck's constant) and very large numbers (e.g., speed of light) must
be accurately expressed. But floating-point numbers are also useful for
business applications such as computing interest, and in the geometric
calculations inherent in CAD/CAM processing.

Because the instruction sets of all 80x86 CPUs directly support only integers
and calculations upon integers, floating-point numbers and operations on them
must be programmed indirectly by using series of CPU integer instructions.
This means that computations involving floating-point numbers are far
slower than ordinary integer calculations. And this is where the 80x87
coprocessors come in: adding an 80x87 to an 80x86-based system augments the
CPU architecture with eight floating-point registers, five additional data
types and over 70 additional instructions, all designed to deal directly with
floating-point numbers as a basic data type. This removes the 'penalty' for
floating-point computations, and greatly increases overall system performance
for applications which depend heavily on these calculations.

In addition to being able to quickly execute load/store operations on
floating-point numbers, the 80x87 coprocessors can directly perform all the
basic arithmetic operations on them. Besides "knowing" how to add, subtract,
multiply and divide floating-point numbers, they can also operate on them to
perform comparisons, square roots, transcendental functions (such as logarithms
and sine/cosine/tangent), and compute their absolute value and remainder.

Like most things in life, floating-point arithmetic has been standardized.
The relevant standard (to which I will refer quite often in this document) is
the "IEEE-754 Standard for Binary Floating-Point Arithmetic" [10,11]. The
standard specifies numeric formats, value sets and how the basic arithmetic
(+,-,*,/,sqrt, remainder) has to work. All the coprocessors covered in this
document claim full or at least partial compliance with the IEEE-754
standard.



=================================================
How PC programs use 80x87 and Weitek coprocessors
=================================================

The basic data type used by all 80x87 coprocessors is an 80-bit long
floating-point number. This data type (called "temporary real" or "double
extended precision") can directly represent numbers which range in size
between 3.36*10^-4932 and 1.19*10^4932 (3.65*10^-4951 to 1.19*10^4932
including denormal numbers) where '^' denotes the power operator. (For those
familiar with floating-point formats, this format has 64 mantissa bits, 15
exponent bits and 1 sign bit, for a total of 80 bits.) This format provides
a precision of about 19 decimal places. 80x87s can also handle additional
data types that are converted to or from the internal format when they are
loaded into or stored from the coprocessor. These include 16-bit, 32-bit, and
64-bit integers as well as a BCD (binary coded decimal) data type that
occupies 10 bytes and holds 18 decimal digits.
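
The limits quoted above can be reproduced from the bit counts alone. The
following Python sketch is only an illustration: it assumes the usual double
extended layout (an exponent bias of 16383, with the all-ones exponent
reserved for infinities and NaNs), and the names used are made up for this
example.

```python
from decimal import Decimal
import math

MANTISSA_BITS = 64                       # explicit integer bit + 63 fraction bits
EXPONENT_BITS = 15
BIAS = 2 ** (EXPONENT_BITS - 1) - 1      # 16383 (assumed standard bias)


def sci(x):
    """Format a Decimal with three significant digits, e.g. 1.19E+4932."""
    return format(x, ".2E")


two = Decimal(2)

# Largest finite value: all mantissa bits set, largest non-reserved exponent.
largest = (two - two ** -(MANTISSA_BITS - 1)) * two ** BIAS

# Smallest normalized value: mantissa exactly 1, smallest normal exponent.
smallest_normal = two ** (1 - BIAS)

# Smallest denormal: only the lowest mantissa bit set.
smallest_denormal = two ** (1 - BIAS - (MANTISSA_BITS - 1))

# Number of decimal digits carried by a 64-bit mantissa.
decimal_digits = MANTISSA_BITS * math.log10(2)

print(sci(largest), sci(smallest_normal), sci(smallest_denormal))
print(round(decimal_digits, 1))
```

The decimal module is used here because these magnitudes lie far outside the
range of an ordinary double-precision variable.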

The 80x87 also supports two additional floating-point types. The short real
data type (also called "single-precision") has 32 bits that split into 23
mantissa bits, 8 exponent bits and a sign bit. By using the "hidden bit"
technique, the effective length of the mantissa is increased to 24 bits. (The
hidden bit technique exploits the fact that for normalized floating-point
numbers, the mantissa m always is in the range 1 <= m < 2. Since the first
mantissa bit represents the integer part of the mantissa, it is always set
for normalized numbers, and therefore need not be stored, as it is guaranteed
to always be 1.) The IEEE single-precision format provides a precision of
about 6-7 decimal places and can represent numbers between 1.17*10^-38 and
3.40*10^38 (1.40*10^-45 to 3.40*10^38 including denormal numbers). The long
real, or double-precision, data type has 64 bits, consisting of 52 mantissa
bits, 11 exponent bits, and the sign bit. It provides 15-16 decimal digits of
precision and can handle numbers from 2.22*10^-308 to 1.79*10^308 (4.94*10^-
324 to 1.79*10^308 including denormal numbers). (This format also uses the
hidden bit technique to provide effectively 53 mantissa bits.)
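
To make the hidden bit technique more concrete, here is a small Python sketch
that splits a single-precision number into its three fields and rebuilds the
value from them. It assumes the standard IEEE-754 exponent bias of 127 for
this format; the function name is hypothetical and chosen only for this
illustration.

```python
import struct


def decode_single(x):
    """Return (sign, exponent, fraction, value) for an IEEE single."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    sign = bits >> 31
    exponent = (bits >> 23) & 0xFF       # 8 exponent bits, biased by 127
    fraction = bits & 0x7FFFFF           # the 23 stored mantissa bits
    if exponent == 0xFF:
        raise ValueError("infinity or NaN, not handled in this sketch")
    if exponent == 0:
        # Denormal (or zero): no hidden bit, exponent fixed at -126.
        value = (fraction / 2**23) * 2.0**-126
    else:
        # Normalized: prepend the hidden 1, giving 24 effective mantissa bits.
        value = (1 + fraction / 2**23) * 2.0 ** (exponent - 127)
    return sign, exponent, fraction, (-value if sign else value)


print(decode_single(1.0))
print(decode_single(-0.15625))
print(decode_single(2.0**-149))          # smallest single-precision denormal
```

Note that the stored fraction of 1.0 is zero: the leading 1 of the mantissa
exists only implicitly.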

The eight registers in the 80x87 are organized in a stack-like manner which
takes some time getting used to if one programs the coprocessor directly in
assembly language. However, nowadays the compilers or interpreters for most
high level languages (HLLs) give a programmer easy access to the
coprocessor's data types and instructions, so there is not much need to deal
directly with the rather unusual architecture of the 80x87.
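
For readers who have never seen a stack-organized register file, the
following toy model in Python may help. It is only a sketch: it keeps the
eight values and a top-of-stack pointer, but ignores tag bits, exceptions and
stack overflow, and it models just a handful of instructions (FLD, FADDP,
FMULP, FSTP). The class itself is made up for this illustration.

```python
class X87Stack:
    """Toy model of the eight-register 80x87 stack (values only)."""

    def __init__(self):
        self.regs = [0.0] * 8
        self.top = 0                      # physical index of ST(0)

    def st(self, i=0):
        """Read ST(i), i.e. the i-th register counted from the top."""
        return self.regs[(self.top + i) % 8]

    def fld(self, value):
        # A load first decrements the top pointer, then writes the new ST(0).
        self.top = (self.top - 1) % 8
        self.regs[self.top] = value

    def _combine_and_pop(self, op):
        # ST(1) <- ST(1) op ST(0), then pop so the result becomes ST(0).
        st1 = (self.top + 1) % 8
        self.regs[st1] = op(self.regs[st1], self.regs[self.top])
        self.top = st1

    def faddp(self):
        self._combine_and_pop(lambda a, b: a + b)

    def fmulp(self):
        self._combine_and_pop(lambda a, b: a * b)

    def fstp(self):
        # Store-and-pop: hand back ST(0) and free the register.
        value = self.regs[self.top]
        self.top = (self.top + 1) % 8
        return value


# Evaluate (2 + 3) * 4 the way a compiler might schedule it.
fpu = X87Stack()
fpu.fld(2.0)
fpu.fld(3.0)
fpu.faddp()
fpu.fld(4.0)
fpu.fmulp()
result = fpu.fstp()
print(result)
```

Operands are never named by a fixed register number; they are always
addressed relative to the current top of the stack, which is what makes
hand-written 80x87 code awkward at first.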


The architecture of the Weitek chips differs significantly from the 80x87.
Strictly speaking, the Weitek Abacus 3167 and 4167 are not coprocessors in
that they do not transparently extend the CPU architecture; rather, they
could be described as highly-specialized, memory-mapped IO devices. But as
the term "coprocessor" has been traditionally used for these chips, they will
be referred to as such here.

The Weitek coprocessors have a RISC-like architecture which has been tuned
for maximum performance. Only a small instruction set has been implemented in
the chip, but each instruction executes at a very high speed (usually only a
few clock cycles each). Instructions available include load/store, add,
subtract, subtract reverse, multiply, multiply and negate, multiply and
accumulate, multiply and take absolute value, divide reverse, negate,
absolute value, compare/test, convert fix/float, and square root. In contrast
to the 80x87 family, the Weitek Abacus does not support a double extended
format, has no built-in transcendental functions, and does not support
denormals. The resources required to implement such features have instead
been devoted to implement the basic arithmetic operations as fast as
possible.

While the 80x87 coprocessors perform all internal calculations in double
extended precision and therefore have about the same performance for single
and double-precision calculations, the Weitek features explicit single and
double-precision operations. For applications that require only single-
precision operations, the Weitek can therefore provide very high performance,
as single-precision operations are about twice as fast as their double-
precision counterparts. Also, since the Weitek Abacus has more registers than
the 80x87 coprocessors (31 versus 8), values can be kept in registers more
often and have to be loaded from memory less frequently. This also leads to
performance gains.

The Weitek's register file consists of 31 32-bit registers, each one capable
of holding an IEEE single-precision number. Pairs of consecutive single-
precision registers can also be used as 64-bit IEEE double-precision
registers; thus there are 15 double-precision registers. The Weitek register
file is organized conventionally, like the register file of the 80386, and
does not use the special stack-like organization of the 80x87 coprocessors.

To the main CPU, the Weitek Abacus appears as a 64 KB block of memory
starting at physical address 0C0000000h. Each address in this range
corresponds to a coprocessor instruction. Accessing a specified memory
location within this block with a MOV instruction causes the corresponding
Weitek instruction to be executed. (The instructions have been cleverly
assigned to memory locations in such a way that loads to consecutive
coprocessor registers can make use of the 386/486 MOVS string instruction.)
This memory-mapped interface is much faster than the IO-oriented protocol
that is used to couple the CPU to an 80287 or 80387 coprocessor. The Weitek's
memory block can actually be assigned to any logical address using the MMU
(memory management unit) in the 386/486's protected and virtual modes. This
also means that the Weitek Abacus *cannot* be used in the real mode of those
processors, since its physical starting address (0C0000000h) is not within
the 1 MByte address range and the MMU is inoperable in real mode. However,
DOS programs can make use of the Weitek by using a DOS extender or a memory
manager (such as QEMM or EMM386) that runs in protected/virtual mode itself
and can therefore map the Weitek's memory block to any desired location in
the 1 MByte address range.

Typically the FS segment register is then set up to point to the Weitek's
memory block. On the 80486, this technique has severe drawbacks, as using the
FS: prefix takes an additional clock cycle, thereby nearly halving the
performance of the 4167. Most DOS-based compilers exhibit this problem, so
the only way around it is to code in assembly language [75]. The Weitek
Abacus 3167 and 4167 are also supported by the UNIX operating system [33].



==========================================================
Which application programs benefit from a math coprocessor
==========================================================

According to the Intel 387DX User's Guide, there are more than 2100
commercial programs that can make use of a 387-compatible coprocessor. Every
program that uses floating-point arithmetic somewhere and contains the
instructions to support an 80x87 or Weitek chip can gain speed by installing
one. However, the speedup will vary from program to program (and even within
the same program) depending on how computation-intensive the program or
operation within the program is. Typical applications that benefit from the
use of a math coprocessor are:

   - CAD programs (AutoCAD, VersaCAD, GenericCAD)
   - Spreadsheet programs (Lotus 1-2-3, Excel, Quattro, Wingz)
   - Business graphics programs (Arts&Letters, Freedom of Press, Freelance)
   - Mathematical analysis and statistical programs (Mathematica, TKSolver,
       SPSS/PC, Statgraphics)
   - Database programs (dBase IV, FoxBase, Paradox, Revelation)

Note that for spreadsheets and databases, a coprocessor only helps if some
kind of floating-point computation is performed; this is true more often for
spreadsheets than for databases. Also note that the speed of many programs
depends quite heavily on factors such as the speed of the graphics adapter
(CAD) or the disk performance (databases), so the computational performance
is only a (small) part of the total performance of the application. There are
some programs that won't run without a coprocessor, among them AutoCAD (R10
and later) and Mathematica.

Most GUIs (graphical user interfaces) such as Microsoft Windows or the OS/2
Presentation Manager do *not* gain additional speed from using a
*mathematical* coprocessor, since their graphics operations only use integer
arithmetic [71]. They *will* benefit from a graphics board with a graphics
"coprocessor" that speeds up certain common graphics operations such as
BitBlt or line drawing. A few GUIs used on PCs, such as X-Windows, use a
certain amount of floating-point operations for operations such as arc
drawing. However, the use of floating-point operations in X-Windows seems to
have decreased significantly in versions after X11R3, so the overall
performance impact of a coprocessor is small [72]. Applications running under
any GUI may take advantage of a math coprocessor, of course (for example,
Microsoft Excel running under Windows).

While support for 80x87 coprocessors is very common in application programs,
the Weitek Abacus coprocessors do not enjoy such widespread support. Due to
their higher price, only a few high-end PCs have been equipped with Weitek
coprocessors. Some machines, such as IBM's PS/2 series, do not even have
sockets to accommodate them. Therefore, most of the programs that support
these coprocessors are also high-end products, like AutoCAD and VersaCAD-386.



==============================================
Potential performance gains with a coprocessor
==============================================

The Intel Math Coprocessor Utilities Disk that accompanies the Intel 387DX
coprocessor has a demonstration program that shows the speedup of certain
application programs when run with the Intel coprocessor versus a system with
no coprocessor:

         Application       Time w/o 387   Time w/387    Speedup

         Arts&Letters         87.0 sec      34.8 sec     150%
         Quattro Pro           8.0 sec       4.0 sec     100%
         Wingz                17.9 sec       9.1 sec      97%
         Mathematica         420.2 sec     337.0 sec      25%


         The following table is an excerpt from [70]:

         Application        Time w/o 387   Time w/387  Speedup

         Corel Draw          471.0 sec     416.0 sec      13%
         Freedom Of Press    163.0 sec      77.0 sec     112%
         Lotus 1-2-3         257.0 sec      43.0 sec     498%


         The following table is an excerpt from [25]:

         Application        Time w/o 387   Time w/387  Speedup

         Design CAD, Test1    98.1 sec      50.0 sec      96%
         Design CAD, Test2    75.3 sec      35.0 sec     115%
         Excel, Test 1         9.2 sec       6.8 sec      35%
         Excel, Test 2        12.6 sec       9.3 sec      35%

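The Speedup column in these tables appears to be the ratio of the two
execution times minus one, expressed as a percentage (so 100% means "twice as
fast"). A minimal Python sketch of that calculation follows; the function
name is my own, and the formula is inferred from the table rows rather than
taken from the cited sources.

```python
def speedup_percent(time_without, time_with):
    """Speedup of the run with coprocessor over the run without, in percent."""
    return (time_without / time_with - 1.0) * 100.0


# Rows from the Intel demonstration program table above.
rows = [
    ("Quattro Pro", 8.0, 4.0),
    ("Wingz", 17.9, 9.1),
    ("Mathematica", 420.2, 337.0),
]
for name, t_without, t_with in rows:
    print(f"{name:12s} {speedup_percent(t_without, t_with):5.0f}%")
```
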

Note that coprocessor performance also depends on the motherboard, or more
specifically, on the chipset used on the motherboard. In [34] and [35],
identically configured motherboards using different 386 chipsets were tested.
Among other tests, a coprocessor benchmark based on a fractal computation was
run and its execution time recorded. The following tables, which show how
coprocessor performance varies with the chipset, have been copied from these
articles in abridged form:

                  Cyrix                                   Cyrix
    chip set      387+                 chip set           83D87

    Opti, 40 MHz  24.57 sec   97.0%    PC-Chips, 33 MHz  26.97 sec   93.0%
    Elite,40 MHz  24.46 sec   97.4%    UMC,      33 MHz  27.69 sec   90.5%
    ACT,  40 MHz  23.84 sec  100.0%    Headland, 33 MHz  25.08 sec  100.0%
    Forex,40 MHz  23.84 sec  100.0%    Eteq,     33 MHz  27.38 sec   91.6%


This shows that performance of the same coprocessor can vary by up to ~10%
depending on the chipset used on your board, at least for 386 motherboards
(similar numbers for 286, 386SX, and 486 are, unfortunately, not available).
The benchmarks for this article were run on a motherboard with the Forex chip
set, one of the fastest 386 chip sets available, and not only with respect to
floating-point performance [35].



==================================
How various math coprocessors work
==================================

In any 80x86 system with an 80x87 math coprocessor, CPU instructions and
coprocessor instructions are executed concurrently. This means that the CPU
can execute CPU instructions while the coprocessor executes a coprocessor
instruction at the same time. The concurrency is restricted somewhat by the
fact that the CPU has to aid the coprocessor in certain operations. As the
CPU and the coprocessor are fed from the same instruction stream and both
may operate on the same data, there has to be a synchronizing mechanism
between the CPU and the coprocessor.


The 8087
--------
In 8086/8088 systems with 8087 coprocessors, both chips look at every opcode
coming in from the bus. To do this, both chips have the same BIU (bus
interface unit) and the 8086 BIU sends the status signals of its prefetch
queue to the 8087 BIU. This ensures that both processors always decode the
same instructions in parallel. Since all coprocessor instructions start with
the bit pattern 11011, it is easy for the 8087 to ignore all other
instructions. Likewise the CPU ignores all coprocessor instructions, unless
they access memory. In this case, the CPU computes the address of the LSB
(least significant byte) of the memory operand and does a dummy read. The
8087 then takes the data from the data bus. If more than one memory access is
needed to load a memory operand, the 8087 requests the bus from the CPU,
generates the consecutive addresses of the operand's bytes and fetches them
from the data bus. After completing the operation, the 8087 hands bus control
back to the CPU. Since 8087 and CPU are hooked up to the same synchronous
bus, they must run at the same speed. This means that with the 8087, only
synchronous operation of CPU and coprocessor is possible.
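
The test by which coprocessor instructions are recognized (the upper five
bits of the first opcode byte being 11011) is easy to express in code. The
Python fragment below is merely an illustration of that bit test, not a
description of the 8087's actual decoding logic; the function name is
hypothetical.

```python
def is_coprocessor_opcode(first_byte):
    """True if the upper five bits of the opcode byte are 11011."""
    return (first_byte >> 3) == 0b11011


# Collect every byte value that passes the test.
escape_bytes = [b for b in range(256) if is_coprocessor_opcode(b)]
print([f"{b:02X}" for b in escape_bytes])
```

Exactly eight byte values pass the test, namely D8h through DFh.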

Another 8087 coprocessor instruction can only be started if the previous one
has been completed in the NEU (numerical execution unit) of the 8087. To
prevent the 8086 from decoding a new coprocessor instruction while the 8087
is still executing the previous coprocessor instruction, a coding mechanism
is employed:  All 8087-capable compilers and assemblers automatically
generate a WAIT instruction before each coprocessor instruction. The WAIT
instruction tests the CPU's /TEST pin and suspends execution until its input
becomes "LOW". In all 8086/8087 systems, the 8086 /TEST pin is connected to
the 8087 BUSY pin. As long as the NEU executes a coprocessor instruction, it
forces its BUSY pin "HIGH"; thus, the WAIT opcode preceding the coprocessor
instruction stops the CPU until any still-executing coprocessor instruction
has finished.

The same synchronization is used before the CPU accesses data that was
written by the coprocessor. A WAIT instruction after any coprocessor
instruction that writes to memory causes the CPU to stop until the
coprocessor has completed transfer of the data to memory, after which the CPU
can safely access it.


The 80287
---------
The 80287 coprocessor-CPU interface is totally different from the 8087
design. Since the 80286 implements memory protection via an MMU based on
segmentation, it would have been much too expensive to duplicate the whole
memory protection logic on the coprocessor, which an interface solution
similar to the 8087 would have required. Instead, in an 80286/80287 system,
the CPU fetches and stores all opcodes and operands for the coprocessor.
Information is then passed through the CPU ports F8h-FFh. (As these ports are
accessible under program control, care must be taken in user programs not to
accidentally perform write operations to them, as this could corrupt data in
the math coprocessor.)

The 8086/8087 combination can be characterized as a cooperation of partners
with equal rights, while the 80286/287 is more a master-slave relationship.
This makes synchronization easier, since the complete instruction and data
flow of the coprocessor goes through the CPU. Before executing most
coprocessor instructions, the 80286 tests its /BUSY pin, which is tied to the
287 coprocessor and signals if the 80287 is still executing a previous
coprocessor instruction or has encountered an exception. The 80286 then waits
until the /BUSY signal goes to "low" before loading the next coprocessor
instruction into the 80287. Therefore, a WAIT instruction before every
coprocessor instruction is not required. These WAITs are permissible, but not
necessary, in 80287 programs. The second form of WAIT synchronization (after
the coprocessor has written a memory operand) *is* still necessary on 286/287
systems.

The execution unit of the 80287 is practically identical to that of the 8087;
that is, nearly all coprocessor instructions execute in the same number of
clock cycles on both coprocessors. However, due to the additional overhead of
the 80287's CPU/coprocessor interface (at least ~40 clock cycles), an 8 MHz
80286/80287 combination can have lower floating-point performance than an
8086/8087 system running at the same speed. Additionally, older 286 boards
were often configured to run the coprocessor at only 2/3 the speed of the
CPU, making use of the ability of the 80287 to run asynchronously: The 80287
has a CKM pin that causes the incoming system clock to be divided by three
for the coprocessor if it is tied to ground. The 80286 always divides the
system clock by two internally, hence the final ratio of 2/3. However, when
the CKM (ClocK Mode) pin is tied high on the 80287, it does not divide the
CLK input. This feature has been exploited by the makers of coprocessor speed
sockets. These sockets tie CKM high and supply their own CLK signal with a
built-in oscillator, thereby allowing the 80287 or compatible to run at a
much higher speed than the CPU. With an IIT or Cyrix 287 one can have a 20
MHz coprocessor running with an 8 MHz 80286! Note, however, that the
floating-point performance of such a configuration does not scale linearly
with the coprocessor clock, since all the data has to be passed through the
much slower CPU. If the coprocessor executes mostly simple instructions (such
as addition and multiplication), doubling the coprocessor clock to 20 MHz in
a 10 MHz system does not show any performance increase at all [24].
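
The clock arithmetic described above can be written down in a few lines of
Python. This is only a sketch of the divider ratios mentioned in the text;
the 16 MHz system clock and the function names are assumptions chosen for
the example.

```python
from fractions import Fraction


def cpu_clock(system_clk_mhz):
    """The 80286 always divides the system clock by two."""
    return Fraction(system_clk_mhz) / 2


def fpu_clock(clk_in_mhz, ckm_high):
    """The 80287 divides CLK by three unless CKM is tied high."""
    return Fraction(clk_in_mhz) / (1 if ckm_high else 3)


system_clk = 16                                   # MHz, example value
cpu = cpu_clock(system_clk)                       # 8 MHz
fpu = fpu_clock(system_clk, ckm_high=False)       # 16/3 MHz, about 5.33
ratio = fpu / cpu                                 # 2/3

# Speed socket: CKM tied high, CLK supplied by a separate 20 MHz oscillator.
socket_fpu = fpu_clock(20, ckm_high=True)

print(float(cpu), float(fpu), ratio, float(socket_fpu))
```
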

The Intel 80287XL, the Cyrix 82S87, and the IIT 2C87 contain the internals of
a 387 coprocessor, but are pin-compatible with the original 287. These chips
divide the system clock by two internally, as opposed to three in the
original 80287. Since the 80286 also divides the system clock by two, they
usually run synchronously with respect to the CPU, although they can also be
run asynchronously.


The 80387
---------
The coprocessor interface in 80386/80387 systems is very similar to the one
found in 286/287 systems. However, to prevent corruption of the coprocessor's
contents by programming errors, the IO ports 800000F8h-800000FFh are used,
which are not accessible to programs. The CPU/coprocessor interface has been
optimized and uses full 32-bit transfers; the interface overhead has been
reduced to about 14-20 clock cycles. For some operations on the 387 'clones'
that take less than about 16 clock cycles to complete, this overhead
effectively limits the execution rate of coprocessor instructions. The only
sensible solution to provide even higher floating-point performance was to
integrate the CPU and coprocessor functionality onto the same chip, which
is exactly what Intel did with the 80486 CPU. The FPU in the 486 also benefits
from the instruction pipelining and from the on-chip cache.



=====================
Coprocessor emulators
=====================

In the absence of a coprocessor, floating-point calculations are often
performed by a software package that simulates its operations. Such a program
is called a coprocessor emulator. Simulating the coprocessor has the
advantage for application programs that identical code can be generated for
use with either the coprocessor or the emulator, so that it's possible to
write programs that run on any system without regard to whether a coprocessor
is present or not. Whether the program will use an actual coprocessor or
software emulating it can easily be determined at run-time by detecting the
presence or absence of the coprocessor chip.

Two approaches to interface an 80x87 emulator to programs are common. The
first method makes use of the fact that all coprocessor instructions start
with the same five bit pattern 11011. Thus the first byte of a coprocessor
instruction will be in the range D8-DF hexadecimal. In addition, coprocessor
instructions usually are preceded by a WAIT instruction (opcode 9Bh) which is
one byte long (the reason for doing this has been described in the previous
chapter dealing with the operating details of the 80x87). One common approach
is to replace the WAIT instruction and the first byte of the coprocessor
instruction with one out of eight interrupt instructions; the remaining bytes
of the coprocessor instruction are left unchanged. Interrupts 34 to 3B
hexadecimal are used for this emulation technique. (Note that the sequences
9B D8 ... 9B DF can be easily converted to the interrupt instructions CD 34
... CD 3B by simple addition and subtraction of constants.) The compiler or
assembler initially produces code that contains these appropriate interrupt
calls instead of the coprocessor instructions. If a hardware coprocessor is
detected at run-time, the emulator interrupts point to a short routine that
converts the interrupt calls back to coprocessor instructions (yes, this
is known as "self-modifying code"). If no coprocessor is found the interrupts
point to the emulation package, which examines the byte(s) following the
interrupt instruction to determine which floating-point operation to perform.
This method is used by many compilers, including those from Microsoft and
Borland. It works with every 80x86 CPU from the 8086/8088 on.
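
The byte arithmetic behind this first method is shown in the following Python
sketch. It only demonstrates the conversion of a single instruction at a
known offset in both directions; it is not taken from any actual compiler
run-time library, and the function names are invented for this example.
FSQRT (D9 FA), preceded by WAIT, serves as the test case.

```python
WAIT = 0x9B       # opcode of the WAIT instruction
INT = 0xCD        # opcode of the INT n instruction


def to_interrupt(code, offset):
    """Rewrite WAIT + ESC byte (9B D8..DF) as INT 34h..3Bh (CD 34..3B)."""
    if code[offset] != WAIT or not 0xD8 <= code[offset + 1] <= 0xDF:
        raise ValueError("no WAIT + coprocessor instruction at this offset")
    code[offset] = INT                # 9Bh + 32h = CDh
    code[offset + 1] -= 0xA4          # D8h - A4h = 34h ... DFh - A4h = 3Bh


def to_coprocessor(code, offset):
    """Inverse fix-up, as done at run-time when a real coprocessor exists."""
    if code[offset] != INT or not 0x34 <= code[offset + 1] <= 0x3B:
        raise ValueError("no emulator interrupt at this offset")
    code[offset] = WAIT               # CDh - 32h = 9Bh
    code[offset + 1] += 0xA4          # 34h + A4h = D8h ... 3Bh + A4h = DFh


code = bytearray.fromhex("9bd9fa")    # WAIT, FSQRT
to_interrupt(code, 0)
emulated = code.hex()
to_coprocessor(code, 0)
restored = code.hex()
print(emulated, restored)
```

Only the first two bytes change; the remaining bytes of the instruction stay
in place, so the emulator can still decode the requested operation from
them.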

The second method to interface an emulator is only available on 286/386/486
machines. If the emulation bit in the machine status word of these processors
is set, the processors will generate an interrupt 7 whenever a coprocessor
instruction is encountered. The vector for this interrupt will have been set
up to point at an emulation package that decodes the instruction and performs
the desired operation. This approach has the advantage that the emulator
doesn't have to be included in the program code, but can be loaded once (as a
TSR or device driver) and then used by every program that requires a
coprocessor. Emulation via interrupt 7 is transparent, which means that
programs containing coprocessor instructions execute just as if a coprocessor
were present, only slower. This approach is taken by the public domain EM87
emulator, the shareware program Q387, and the commercial Franke387 emulator,
for example. Even programs that require a coprocessor to run, such as
AutoCAD, are 'fooled' into believing that a coprocessor is present by
emulators using INT 7.

Operating systems such as OS/2 2.0 and Windows 3.1 provide coprocessor
emulations using INT 7 automatically if they do not find a coprocessor to be
installed. The emulator in Windows doesn't seem to be very fast, as people
who have ported their Turbo Pascal programs from the TP 6.0 DOS compiler
(using the emulation built into the TP 6.0 run-time library) to the TPW 1.5
Windows compiler (using MS Windows' emulator) have noticed. Slowdowns of as
much as a factor of five have been reported [79].

The size of the emulator used by TP 6.0 is about 9.5 KB, while EM87 occupies
about 15.8 KB as a TSR, and Franke387 uses about 13.4 KB as a device driver.
Note that Franke387 and especially EM87 model a real coprocessor much more
closely than Turbo Pascal's emulator does. In particular, EM87 supports
denormal numbers, precision control, and rounding control. The emulator in TP
6.0 does not implement these features. The version of Franke387 tested (V2.4)
supports denormals in single and double-precision, but not double extended
precision, and it supports precision control, but not rounding control.
The recently introduced shareware program Q387 only runs on 386, 386SX, 486SX
and compatible processors. The program loads completely into extended memory
and uses about 330 KB. To enable INT 7 trapping to a service routine in
extended memory it needs to run with a memory manager (e.g. EMM386, QEMM,
or 386MAX). The huge size of the program stems from the fact that it was
solely optimized for speed, assuming that extended memory is a cheap resource.
Presumably it uses large tables to speed computations. Intel's E80287 program
is supposed to be a 100% exact emulation of the 80287 coprocessor [44]. Note
that the more closely a real coprocessor is modelled by the emulator, the
slower the emulator runs and the larger the code for the emulator gets.


         Relative execution times of coprocessor vs. software emulators
         for selected coprocessor instructions

                        Intel 387DX    TP 6.0 Emulator   EM87 Emulator

         FADD ST, ST(0)       1              26                104
         FDIV [DWord]         1              22                136
         FXAM                 1              10                 73
         FYL2X                1              33                102
         FPATAN               1              36                110
         F2XM1                1              38                110



         The following table is an excerpt from [44]:

                        Intel 80287  Intel E80287 Emulator

         FADD ST, ST(0)       1              42
         FDIV [DWord]         1             266
         FXAM                 1             139
         FYL2X                1              99
         FPATAN               1             153
         F2XM1                1              41



         The following has been adapted from [43] and merged with my own
         data:

                        Intel 8087  TP 6.0 Emul. (8086)  Intel Emul. (8086)

         FADD ST, ST(0)       1              20                 94
         FDIV [DWord]         1              22                 82
         FPTAN                1              18                144
         F2XM1                1               6                171
         FSQRT                1              44                544



One of the reasons emulators are so slow is that they are often designed to
run with every CPU from the 8086/8088 on upwards. This is the case with the
emulators built into the compiler libraries of Turbo Pascal 6.0 (also used by
Turbo C/C++) and Microsoft C 6.0 (probably also used in other Microsoft
products), and it is also true of the public-domain EM87 emulator. By using
code that can run on an 8086/8088, these emulators
forego the speed advantage offered by the additional instructions and
architectural enhancements (such as 32-bit registers) of the more advanced
Intel 80x86 processors. A notable exception to this is the Franke387
emulator, a commercial emulator that is also sold as shareware. It uses 386-
specific 32-bit code and only runs on 386/386SX/486SX computers.

Besides being slow, coprocessor emulators have other drawbacks when compared
with real coprocessors. Most of the emulators do not support the additional
instructions that the 387-compatible coprocessors offer over the 80287.
Often, some of the low-level stack-manipulating instructions like FDECSTP are
not emulated. For example, [76] lists the coprocessor instructions not
emulated by Microsoft's emulator (included in the MS-C and MS-FORTRAN
libraries) as follows:

         FCOS         FRSTOR      FSINCOS      FXTRACT
         FDECSTP      FSAVE       FUCOM
         FINCSTP      FSETPM      FUCOMP
         FPREM1       FSIN        FUCOMPP

Additionally, some parts of the coprocessor architecture, like the status
register, are often emulated only partially or not at all. Some emulators do not
conform to the IEEE-754 standard in their implementation of the basic
arithmetic functions, while the hardware coprocessors do. Also, they
sometimes lack the support for denormals (a special class of floating-point
numbers) although it is required by the standard. Not all the 80x87 emulators
support rounding control and precision control, also features required by
IEEE-754. Most of these omissions are aimed at making the emulator faster and
smaller. Because of the performance gap and these other shortcomings of
coprocessor emulators, a real coprocessor is a must for anybody planning to
do some serious computations. (At today's prices, this shouldn't pose much of
a problem to anybody!)
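
To make the notion of denormals a bit more concrete, here is a small sketch
in Python. It is not taken from any of the emulators or test programs
discussed here, and it assumes that the host machine uses IEEE-754 double
precision for Python's float type. The helper name is_denormal is made up
for this illustration. On a conforming implementation, results below the
smallest normalized number underflow gradually instead of jumping straight
to zero; an emulator without denormal support would typically deliver zero
in the first two cases.

```python
import math
import sys

# Smallest positive *normalized* double (2**-1022 on IEEE-754 systems).
smallest_normal = sys.float_info.min

# Halving it underflows gradually: the result is a denormal, not zero.
half_of_smallest_normal = smallest_normal / 2.0

# Smallest positive denormal double, 2**-1074.
smallest_denormal = math.ldexp(1.0, -1074)


def is_denormal(x):
    """True if x is nonzero but smaller in magnitude than any normal double."""
    return x != 0.0 and abs(x) < sys.float_info.min


print(is_denormal(half_of_smallest_normal))  # True
print(is_denormal(smallest_denormal))        # True
print(smallest_denormal / 2.0 == 0.0)        # True: only now is the result zero
```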

Nhuan Doduc (ndoduc@framentec.fr) has tested a number of standalone
coprocessor emulators for PCs, among them the two emulators, EM87 and
Franke387 V2.4, already mentioned. He found Franke387 to be the best in terms
of reliability, speed, and accuracy.



=============================
Installing a math coprocessor
=============================

Usually, installing a coprocessor doesn't pose much of a problem, as every
coprocessor comes with installation instructions and a diagnostic disk that
lets you check its correct operation after installation. In addition, the
user manuals of most computers have a section on coprocessor installation.

1)   Make sure to buy the right coprocessor for your system. An 8087 works
     together with 8086, 8088, V20, and V30 CPUs. An 80287, 287XL or
     compatible works with an 80286 CPU. (There are also some old 386
     motherboards that accept an 80287 coprocessor, but they usually also
     provide a socket for the 387; given today's pricing, it makes no sense
     not to get a 387 for these systems.) An 80387, 387DX or compatible
     coprocessor is for 386-based systems, as is the Intel RapidCAD. 387
     coprocessors also work with the Cyrix 486DLC CPU (which, despite its
     name, does not include an FPU). Similarly, the 387SX or compatible
     coprocessors go into systems whose CPU is a 386SX or Cyrix 486SLC.

     The Weitek Abacus 3167 works with a 386 CPU but requires a 121-pin EMC
     socket in the system; this is *not* the same socket used by an 80387 or
     compatible chip, and some computers, such as IBM's PS/2s, don't have
     this socket. The Weitek Abacus 4167 works together with the 486 and
     requires a special 142-pin socket to be present.

2)   Always install a coprocessor that's rated at the same clock speed as the
     CPU. For example, in a 40 MHz 386 system using an AMD Am386-40, install
     a coprocessor rated for 40 MHz such as a Cyrix 83D87-40, C&T 38700DX-40,
     IIT 3C87-40, or ULSI 83C87-40. Running a coprocessor above its specified
     frequency rating may cause it to produce false results, which you might
     fail to recognize as such. (I have personally experienced this problem
     with a Cyrix 83D87-33 that I tried to push to 40 MHz. It passed all the
     diagnostic benchmarks on the Cyrix diagnostic disk and the tests of some
     commercial system test programs. However, I found it to fail the
     Whetstone and Linpack benchmarks, which include accuracy checks.)
     Although there is usually no problem with overheating when pushing a
     coprocessor over the specified maximum frequency rating, be warned that
     operation of a coprocessor above the maximum ratings stated by the
     manufacturer may make its operation unreliable.

     Some 386 boards allow the coprocessor to be clocked differently than the
     CPU. This is called "asynchronous operation" and allows you, for
     example, to run the coprocessor at 33 MHz while the CPU runs at 40 MHz.
     Of the currently available math coprocessors, only the Intel 80387 and
     387DX support asynchronous operation. The 387-compatible "clones" from
     Cyrix, C&T, IIT and ULSI always run at the full speed of the CPU, even
     if you have set up your motherboard for asynchronous operation.

3)   Once you've got the correct coprocessor for your system you can start
     the actual installation process. Turn off the computer's power switch
     and unplug the power cord from the wall outlet, remove the case, and
     locate the math coprocessor socket. This socket is always located right
     next to the main CPU, which can be identified by the printing on top of
     the chip. (It's also usually one of the biggest chips on the board.) The
     8087 and 80287 DIL sockets are rectangular sockets with 20 pin holes on
     each of the longer sides. The 387SX PLCC socket is a square socket that
     has 17 vertical connector strips on the 'wall' of each side. The 387 PGA
     socket is square and has two rows of pin holes on each side. The EMC
     socket for the Weitek 3167 is similar but has three rows of holes on
     each side. The PGA socket for the Weitek 4167 is also square with three
     rows of holes on each side. If you can't find the math coprocessor
     socket, consult your owner's manual, your computer dealer, or a
     knowledgeable friend.

     If you are installing the Intel RapidCAD chipset in a 386 system, you
     will have to remove the 386 CPU first. Intel provides an easy-to-use
     chip extractor and a storage box for the 386 chip for this purpose. Just
     follow the instructions in the RapidCAD installation manual.

     On many systems, the motherboard is supported only at a small number of
     points. Since considerable force is required to insert a pin grid chip
     like the 80387, RapidCAD, or Weitek Abacus 3167 into its socket, the
     board may bend quite a lot due to the insertion pressure. This could
     cause cracks in the board's conductive traces that may render it
     intermittently or completely inoperable. Damage done to the board in
     this way is usually not covered by the computer's warranty! Therefore,
     it may be a good idea to first check how much the board bends by
     pressing on the math coprocessor socket with your finger. If you find it
     to bend easily, try to put something under the board directly beneath
     the coprocessor socket. If this is impossible, as it is in many desktop
     cases, consider removing the whole motherboard from the case, and
     placing it on a hard, flat surface free of static electricity. (You will
     also have to do this if your system's CPU and coprocessor socket are on
     a separate card rather than on the motherboard, as is typical in many
     modular systems.)

     Be sure you are properly grounded before you remove the coprocessor from
     its antistatic box, as even a tiny jolt of static electricity can ruin
     the coprocessor. Make sure you do not touch the pins on the bottom of
     the chip.

     Check the pins and make sure none are bent; if some are, you can
     *carefully* straighten them with needle-nose pliers or tweezers.

4)   Match the coprocessor's orientation with the orientation of the socket.
     Correct orientation of the coprocessor is absolutely essential, because
     if you insert it the wrong way it may be damaged.

     8087 and 287 coprocessors have a notch on one of the shorter sides of
     their rectangular DIL package that should be matched with the notch of
     the coprocessor socket. Usually the 286 CPU and the 287 coprocessor are
     placed alongside each other and both have the same orientation (that
     is, their respective notches point in the same direction). 387SX
     coprocessors feature a white dot or similar mark that matches with some
     sort of marking on the socket. 387 coprocessors have a bevelled corner
     that is also marked with a white dot or similar marking. This should be
     matched with the bevelled or otherwise marked corner of the socket. If
     your system has only a large EMC socket and you are installing a 387 in
     it, you will leave one row of pin holes free on each side of the chip.

     Once you have found the correct orientation, place the chip over the
     socket and make sure all pins are correctly aligned with their
     respective holes. Press firmly and evenly on the chip -- you may have to
     press hard to seat the coprocessor all the way. Again, make sure your
     motherboard does not bend more than slightly under the insertion
     pressure. For 8087, 287, and 387 coprocessors it is normal that the
     coprocessor does not go all the way in; about one millimeter (1/25 inch)
     of space is usually left between the socket and the bottom of the
     coprocessor chip. (This allows the insertion of an extraction device
     should it become necessary to remove the chip. Note that the
     construction of the 387SX's PLCC socket makes it next-to-impossible to
     remove the coprocessor once fully inserted, as the top of the chip is
     level with the socket's 'walls'.)

5)   Check your computer's manual for the proper position of any jumpers or
     switches that need to be set to tell the system it now has a coprocessor
     (and possibly, which kind it has). Put the cover back on the system
     unit, reconnect the power, and turn on your computer. Depending on your
     system's BIOS, you may now have to run a setup or configuration program
     to enable the coprocessor. Finally, run the programs supplied on the
     diagnostic disk (included with your coprocessor) to check for its
     correct operation.
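
The selection rules from steps 1 and 2 can be condensed into a small lookup
table. The following Python sketch is only a summary of the pairings named
above. The dictionary keys, the label strings, and the function name
check_choice are made up for this illustration, and "compatible" chips from
other vendors would have to be added by hand.

```python
# CPU -> coprocessor types named for it in step 1. The strings are labels
# chosen for this sketch, not official part numbers.
COMPATIBLE = {
    "8086":   ["8087"],
    "8088":   ["8087"],
    "V20":    ["8087"],
    "V30":    ["8087"],
    "80286":  ["80287", "287XL"],
    "386DX":  ["80387", "387DX", "RapidCAD", "Weitek 3167"],  # 3167 needs EMC socket
    "486DLC": ["80387", "387DX"],
    "386SX":  ["387SX"],
    "486SLC": ["387SX"],
    "486":    ["Weitek 4167"],                                # needs 142-pin socket
}


def check_choice(cpu, cpu_mhz, copro, copro_mhz):
    """Return a list of problems with a planned CPU/coprocessor pairing."""
    problems = []
    if copro not in COMPATIBLE.get(cpu, []):
        problems.append(f"{copro} is not listed for the {cpu}")
    # Step 2: the coprocessor should be rated for the CPU's clock speed.
    if copro_mhz < cpu_mhz:
        problems.append(
            f"{copro} is rated for {copro_mhz} MHz, but the CPU runs at {cpu_mhz} MHz"
        )
    return problems


print(check_choice("386DX", 33, "387DX", 33))  # []
print(check_choice("386DX", 40, "387DX", 33))  # one problem: clock rating
```

The clock check deliberately ignores asynchronous operation, which only the
Intel 80387 and 387DX support and which depends on the motherboard.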



=================================================================
Descriptions of available coprocessors, CPU+FPU (as of 01-11-93):
=================================================================

Intel 8087

     [43] This was the first coprocessor that Intel made available for the
     80x86 family. It was introduced in 1980, before the IEEE-754 standard
     for floating-point arithmetic was finally released in 1985, and
     therefore is not fully compatible with that standard. It complements
     the 8088 and 8086 CPUs and can also be interfaced to the 80188 and
     80186 processors.

     The 8087 is implemented using NMOS. It comes in a 40-pin CERDIP (ceramic
     dual inline package). It is available in 5 MHz, 8 MHz (8087-2), and 10
     MHz (8087-1) versions. Power consumption is rated at max. 2400 mW [42].

     A neat trick to enhance the processing power of the 8087 for
     computations that use only the basic arithmetic operations (+,-,*,/) and
     do not require high precision is to set the precision control to single-
     precision. This gives one a performance increase of up to 20%. For
     details about programming the precision control, see program PCtrl in
     appendix A.
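
     As a rough illustration of what setting the precision control
     involves, the following Python sketch computes a new control word
     value. It only does the bit manipulation: the precision control field
     occupies bits 8 and 9 of the 80x87 control word, and on a real system
     the value would be read with FSTCW and written back with FLDCW. The
     constant and function names are made up for this sketch, and the
     starting value 0x037F (the 387's default after FINIT) is just an
     example.

```python
PC_MASK     = 0x0300  # bits 8-9 of the control word: precision control
PC_SINGLE   = 0x0000  # 24-bit significand (single-precision)
PC_DOUBLE   = 0x0200  # 53-bit significand (double-precision)
PC_EXTENDED = 0x0300  # 64-bit significand (double extended precision)


def set_precision(control_word, pc_bits):
    """Return the 16-bit control word with its precision field replaced."""
    return (control_word & ~PC_MASK & 0xFFFF) | pc_bits


cw = 0x037F  # example value: extended precision, all exceptions masked
print(hex(set_precision(cw, PC_SINGLE)))  # 0x7f
print(hex(set_precision(cw, PC_DOUBLE)))  # 0x27f
```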

     With the help of an additional chip, the 8087 can in theory be
     interfaced to an 80186 CPU [36]. The 80186 was used in some PCs (e.g.
     from Philips, Siemens) in the 1982/1983 time frame, but with IBM's
     introduction of the 80286-based AT in 1984, it soon lost all
     significance for the PC market.


Intel 80187

     The 80187 is a rather new coprocessor designed to support the 80C186
     embedded controller (a CMOS version of the 80186 CPU; see above). It was
     introduced in 1989 and implements the complete 80387 instruction set. It
     is available in a 40 pin CERDIP (ceramic dual inline package) and a 44
     pin PLCC (plastic leaded chip carrier) for 12.5 and 16 MHz operation.
     Power consumption is rated at max. 675 mW for the 12.5 MHz version and
     max. 780 mW for the 16 MHz version [37].


Intel 80287

     [44] This is the original Intel coprocessor for the 80286, introduced in
     1983. It uses the same internal execution unit as the 8087 and therefore
     has the same speed (actually, it is sometimes slower due to additional
     overhead in CPU-coprocessor communication). As with the 8087, it does
     not provide full compatibility with the IEEE-754 floating point standard
     released in 1985.

     The 80287 was manufactured in NMOS technology, and is packaged in a 40-
     pin CERDIP (ceramic dual inline package). There are 6 MHz, 8 MHz, and 10
     MHz versions. Power consumption can be estimated to be the same as that
     for the 8087, which is 2400 mW max.

     The 80287 has been replaced in the Intel 80x87 family with its faster
     successor, the CMOS-based Intel 287XL, which was introduced in 1990 (see
     below). There may still be a few of the old 80287 chips on the market,
     however.


Intel 80287XL

     This chip is Intel's second-generation 287, first introduced in 1990.
     Since it is based on the 80387 coprocessor core, it features full IEEE
     754 compatibility and faster instruction execution. Intel claims about
     50% faster operation than the 80287 for typical benchmark tests such as
     Whetstone [45]. A comparison with benchmark results for the AMD 80C287,
     which is identical to the Intel 80287, supports this claim [1]: the
     Intel 287XL performed 66% faster than the AMD 80C287 on a fractal
     benchmark and 66% faster on the Whetstone benchmark in these tests.
     Whetstone results from [46] show the Intel 287XL at 12.5 MHz performing
     552 kWhets/sec as opposed to 289 kWhets/sec for the AMD 80C287, a 91%
     performance increase. A benchmark using the MathPak program showed the
     Intel 287XL to be 59% faster than the Intel 80287 (6.9 sec. vs. 11.0
     sec.) [26]. Since the 287XL has all the additional instructions and
     enhancements of a 387, most software automatically identifies it as an
     80387-compatible coprocessor and therefore can make use of extra 387-
     only features, such as the FSIN and FCOS instructions.
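
     The percentages quoted above come from two kinds of figures: rates,
     where a higher number is better (kWhets/sec), and run times, where a
     lower number is better (seconds). The short Python sketch below shows
     how both are converted into a percentage gain. The function names are
     made up for this illustration.

```python
def gain_from_rates(new_rate, old_rate):
    """Percent gain when higher numbers are better (e.g. kWhets/sec)."""
    return (new_rate / old_rate - 1.0) * 100.0


def gain_from_times(new_time, old_time):
    """Percent gain when lower numbers are better (e.g. seconds)."""
    return (old_time / new_time - 1.0) * 100.0


# Whetstone: 287XL at 552 kWhets/sec vs. AMD 80C287 at 289 kWhets/sec
print(round(gain_from_rates(552, 289)))   # 91

# MathPak: 287XL needs 6.9 sec, 80287 needs 11.0 sec
print(round(gain_from_times(6.9, 11.0)))  # 59
```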

     The 287XL is manufactured in CMOS and therefore uses much less power
     than the older NMOS-based 80287. At 12.5 MHz, the power consumption is
     rated at max. 675 mW, about 1/4 of the 80287 power consumption. The
     287XL is available in either a 40-pin CERDIP (ceramic dual inline
     package) or a 44 pin PLCC (plastic leaded chip carrier). (This latter
     version is called the 287XLT and intended mainly for laptop use.) The
     287XL is rated for speeds of up to 12.5 MHz.


AMD 80C287

     This chip, manufactured by Advanced Micro Devices (AMD), is an exact
     clone of the old Intel 80287, and was first brought to market by AMD in
     1989. It contains the original microcode of the 80287 and is therefore
     100% compatible with it. However, as the name indicates, the 80C287 is
     manufactured in CMOS and therefore uses less power than an equivalent
     Intel 80287. At 12.5 MHz, its power consumption is rated at max. 625 mW,
     slightly less than that of the Intel 80287XL [27]. There is also
     another version called AMD 80EC287 that uses an 'intelligent' power save
     feature to reduce the power consumption below 80C287 levels. Tests at
     10.7 MHz show typical power consumption for the 80EC287 to be at 30 mW,
     compared to 150 mW for the AMD 80C287, 300 mW for the Intel 287XL and
     1500 mW for the Intel 80287 [57]. The 80EC287 is therefore ideally
     suited for low power laptop systems.

     The AMD 80C287 is available in speeds of 10, 12, and 16 MHz. (I have
     only seen it being offered in 10 MHz and 12 MHz versions, however.) At
     about US$ 50, it is currently the cheapest coprocessor available. Note
     that it provides less performance than the newer Intel 287XL (see
     above). The AMD 80C287 is available in 40 pin ceramic and plastic DIPs
     (dual inline package) and as 44 pin PLCC (plastic leaded chip carrier).

     Due to recent legal battles with Intel over the right to use the 287
     microcode, which AMD lost, AMD may have to discontinue this product
     (disclaimer: I am not a legal expert).


Cyrix 82S87

     This 80287-compatible chip was developed from the Cyrix 83D87 (Cyrix's
     80387 'clone') and has been available since 1991. It complies completely
     with the IEEE-754 standard for floating-point arithmetic and features
     nearly total compatibility with Intel's coprocessors, including
     implementation of the full Intel 80387 instruction set. It implements
     the transcendental functions with the same degree of accuracy and the
     superior speed of the Cyrix 83D87. This makes the Cyrix 82S87 the
     fastest [1] and most accurate 287 compatible coprocessor available.
     Documentation by Cyrix [46] rates the 82S87 at 730 kWhets/sec for a 12.5
     MHz system, while the Intel 287XL performs only 552 kWhets/sec. 82S87
     chips manufactured after 1991 use the internals of the Cyrix 387+, which
     succeeds the original 83D87 [73].

     The 82S87 is a fully static CMOS design with very low power requirements
     that can run at speeds of 6 to 20 MHz. Cyrix documentation shows the
     82S87 to consume about the same amount of power as the AMD 80C287 (see
     above). The 82S87 comes in a 40 pin DIP or a 44 pin PLCC (plastic leaded
     chip carrier) compatible with the pinout of the Intel 287XLT and
     ideally suited for laptop use.


IIT 2C87

     This chip was the first 80287 clone available, introduced to the market
     in 1989. It has about the same speed as the Intel 287XL [1]. The 2C87
     implements the full 80387 instruction set [38]. Tests I ran on the 3C87
     seem to indicate that it is not fully compatible with the IEEE-754
     standard for floating-point arithmetic (see below for details), so it
     can be assumed that the 2C87 also fails these tests (as it presumably
     uses the same core as the 3C87).

     The IIT 2C87 provides extra functions not available on any other 287
     chip [38]. It has 24 user-accessible floating-point registers organized
     into three register banks. Additional instructions (FSBP0, FSBP1, FSBP2)
     allow switching from one bank to another. (Transfers between registers
     in different banks are not supported, however, so this feature by itself
     is of limited usefulness. Also, there seems to be only one status
     register (containing the stack top pointer), so it has to be manually
     loaded and stored when switching between banks with a different number
     of registers in use [40]). The register bank's main purpose is to aid
     the fourth additional instruction the 2C87 has (F4X4), which does a full
     multiply of a 4x4 matrix by a 4x1 vector, an operation common in 3D-
     graphics applications [39]. The built-in matrix multiply speeds this
     operation up by a factor of 6 to 8 when compared to a programmed
     solution according to the manufacturer [38]. Tests show the speed-up to
     be indeed in this range [40]. For the 3C87, I measured the execution
     time of F4X4 to be about 280 clock cycles; the execution time on the
     2C87 should be somewhat larger - I estimate it to be around 310 clock
     cycles due to the higher CPU-NDP communication overhead in instruction
     execution in 286/287 systems (~45-50 clock cycles) compared with 386/387
     systems (~16-20 clock cycles). As desirable as the F4X4 instruction may
     seem, however, there are very few applications that make use of it when
     an IIT coprocessor is detected at run time (among them Schroff
     Development's Silver Screen and Evolution Computing's Fast-CAD 3-D
     [25]).
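
     One plausible way to arrive at the estimate of about 310 clock cycles
     is to take the time measured on the 3C87, subtract the 386/387
     communication overhead, and add the 286/287 overhead. The Python
     sketch below does this with the figures quoted above. It is only an
     illustration of the arithmetic, not a measurement, and the names are
     made up for this sketch.

```python
F4X4_CYCLES_3C87 = 280        # measured on the 3C87 (see text)
OVERHEAD_386_387 = (16, 20)   # CPU-NDP communication overhead, clock cycles
OVERHEAD_286_287 = (45, 50)


def estimate_2c87_cycles():
    """Return a (low, high) estimate for F4X4 on the 2C87 in clock cycles."""
    low = F4X4_CYCLES_3C87 - OVERHEAD_386_387[1] + OVERHEAD_286_287[0]
    high = F4X4_CYCLES_3C87 - OVERHEAD_386_387[0] + OVERHEAD_286_287[1]
    return low, high


print(estimate_2c87_cycles())  # (305, 314)
```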

     The 2C87 is available for speeds of up to 20 MHz. It is implemented in
     an advanced CMOS process and has therefore a low power consumption of
     typically about 500 mW [38].


Intel 80387

     This chip was the first generation of coprocessors designed specifically
     for the Intel 80386 CPU. It was introduced in 1986, about one year after
     the 80386 was brought to market. Early 386 systems were therefore
     equipped with both an 80287 and an 80387 socket. The 80386 does work with
     an 80287, but the numerical performance is hardly adequate for such a
     system.

     The 80387 has itself since been superseded by the Intel 387DX, which was
     quietly introduced in 1989 (see below). You might still find an 80387
     when acquiring an older 386 machine, though. The old 80387 is about 20%
     slower than the newer 387DX.

     The 80387 is packaged in a 68-pin ceramic PGA, and was manufactured
     using Intel's older 1.5 micron CHMOS III technology, giving it moderate
     power requirements. Power consumption at 16 MHz is max. 1250 mW (750 mW
     typical), at 20 MHz max. 1550 mW (950 mW typical), and at 25 MHz max.
     1950 mW (1250 mW typical) [60].


Intel 387DX

     The 387DX is the second-generation Intel 387; it was quietly introduced
     to replace the original 80387 in 1989. This version is done in a more
     advanced CMOS process which enables the coprocessor to run at a maximum
     frequency of 33 MHz (the 80387 was limited to a maximum frequency of 25
     MHz). The 387DX is also about 20% faster than the 80387 on the average
     for the same clock frequency. For a 386/387 system operating at 29 MHz
     the Whetstone benchmark (compiled with the highly optimizing Metaware
     High-C V1.6) runs at 2377 kWhetstones/sec for the 80387 and at 2693
     kWhetstones/sec for the 387DX, a 13% increase. In a fractal calculation
     programmed in assembly language, the 387DX performance was 28% higher
     than the performance of the 80387. The transcendental functions have
     also sped up from the 80387 to the 387DX. In the Savage benchmark
     (again, compiled with Metaware High-C V1.6 and running on a 29 MHz
     system), the 80387 evaluated 77600 function calls/second, while the
     387DX evaluated 97800 function calls/second, a 26% increase [7]. Some
     instructions have been sped up a lot more than the average 20%. For
     example, the performance of the FBSTP instruction has increased by a
     factor of 3.64.

     The Intel 387DX and its predecessor, the 80387, are the only 387
     coprocessors that support asynchronous operation of CPU and coprocessor.
     The 387 consists of a bus interface unit and a numerical execution unit.
     The bus interface unit always runs at the speed of the CPU clock
     (CPUCLK2). If the CKM (ClocK Mode) pin of the 387 is strapped to Vcc,
     the numerical execution unit runs at the same speed as the bus interface
     unit. If CKM is tied to ground, the numerical execution unit runs at the
     speed provided by the NUMCLK2 input. The ratio of NUMCLK2 (coprocessor
     clock) to CPUCLK2 (CPU clock) must lie within the range 10:16 to 14:10.
     For example, for a 20 MHz 386, the Intel 387DX could be clocked from
     12.5 MHz to 28 MHz via the NUMCLK2 input. (On the Cyrix 83D87, Cyrix
     387+, ULSI 83C87, and the IIT 387, the CKM pin is not connected. These
     coprocessors are therefore not capable of asynchronous operation and
     always run at the speed of the CPU.)
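
     The permitted clock range follows directly from the two ratio limits.
     The following Python sketch reproduces the 20 MHz example from the
     text. The function names are made up for this illustration, and the
     sketch works with the nominal MHz figures rather than the doubled
     CLK2 frequencies, which leaves the ratio unchanged.

```python
from fractions import Fraction

RATIO_MIN = Fraction(10, 16)  # lowest allowed NUMCLK2 : CPUCLK2
RATIO_MAX = Fraction(14, 10)  # highest allowed NUMCLK2 : CPUCLK2


def numclk_range(cpu_mhz):
    """Return (lowest, highest) coprocessor clock in MHz for a given CPU clock."""
    return float(cpu_mhz * RATIO_MIN), float(cpu_mhz * RATIO_MAX)


def ratio_ok(cpu_mhz, fpu_mhz):
    """True if the coprocessor/CPU clock ratio lies within the allowed range."""
    return RATIO_MIN <= Fraction(fpu_mhz) / Fraction(cpu_mhz) <= RATIO_MAX


print(numclk_range(20))   # (12.5, 28.0)
print(ratio_ok(40, 33))   # True
```

     Note that the chip's own maximum frequency rating still applies, so
     the upper end of the computed range is not always usable.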

     The Intel 387DX is manufactured using Intel's advanced low power CHMOS
     IV technology. Power consumption at 20 MHz is max. 900 mW (525 mW
     typical), at 25 MHz max. 1050 mW (625 mW typical), and at 33 MHz max.
     1250 mW (750 mW typical) [59].


Intel 387SX

     This is the coprocessor paired with the Intel 386SX CPU. The 386SX is an
     Intel 80386 with a 16-bit, rather than 32-bit, data path. This reduces
     (somewhat) the costs to build a 386SX system as compared to a full 32-
     bit design required by a 386DX. (The 386SX's main *marketing* purpose
     was to replace the 80286 CPU, which was being sold more cheaply by other
     manufacturers [such as AMD], and which Intel subsequently stopped
     producing.) Due to the 16-bit data path, the 386SX is slower than the
     386DX and offers about the same speed as an 80286 at the same clock
     frequency for 16-bit applications. But as the 386SX is a complete 80386
     internally, it can also run 32-bit applications and supports the
     virtual 8086 mode (used, for example, by Windows' 386 enhanced mode).

     The 387SX has all the features of the Intel 80387, including support
     for asynchronous operation of CPU and coprocessor (see Intel 387DX
     information, above). Due to the 16-bit data path between the CPU and the
     coprocessor, the 387SX is a bit slower than an 80387 operating at the
     same frequency. In addition, the 387SX is based on the core of the
     original 80387, which executes instructions more slowly than the
     second-generation 387DX.

     The 387SX comes in a 68-pin PLCC (plastic leaded chip carrier) package
     and is available in 16 MHz and 20 MHz versions. (Coprocessors for faster
     386SX systems based on the Am386SX CPU are available from IIT, Cyrix,
     and ULSI.) Power consumption for the 387SX at 16 MHz is max. 1250 mW
     (740 mW typical); for the 20 MHz version it is max. 1500 mW (1000 mW
     typical) [62].


Intel 387SL

     This coprocessor is designed for use in systems that contain an Intel
     386SL as the CPU. The 386SL is directly derived from the 386SX. It is a
     static CHMOS IV design with very low power requirements that is intended
     to be used in notebook and laptop computers. It features an integrated
     cache controller, a programmable memory controller, and hardware support
     for expanded memory according to the LIM EMS 4.0 standard. The 387SL,
     introduced in early 1992, has been designed to accompany the 386SL in
     machines with low power consumption, where it takes the place of the
     387SX. It features advanced power-saving mechanisms. It is based on
     the 387DX core, rather than on the older and slower 80387 core (which is
     used by the 387SX).


IIT 3C87

     This IIT chip was introduced in 1989, about the same time as the Cyrix
     83D87. Both coprocessors are faster than Intel's 387DX coprocessor. The
     IIT 3C87 also provides extra functions not available on any other 387
     chip [38]. It has 24 user-accessible floating-point registers organized
     into three register banks. Three additional instructions (FSBP0, FSBP1,
     FSBP2) allow switching from one bank to another. (Transfers between
     registers in different banks are not supported, however, so this feature
     by itself is of limited usefulness. Also, there seems to be only one
     status register [containing the stack top pointer], so it has to be
     manually loaded and stored when switching between banks with a different
     number of registers in use [40]). The register bank's main purpose is to
     aid the fourth additional instruction the 3C87 has (F4X4), which does a
     full multiply of a 4x4 matrix by a 4x1 vector, an operation common in
     3D-graphics applications [39]. The built-in matrix multiply speeds this
     operation up by a factor of 6 to 8 when compared to a programmed
     solution according to the manufacturer [38]. Tests show the speed-up to
     be indeed in this range [40]. I measured the F4X4 to execute in about
     280 clock cycles, during which time it executes 16 multiplications and
     12 additions. The built-in matrix multiply speeds up the matrix-by-
     vector multiply by a factor of 3 compared with a programmed solution
     according to IIT [39]. The results for my own TRNSFORM benchmark support
     this claim (see results below), showing a performance increase by a
     factor of about 2.5. This makes matrix multiplies on the IIT 3C87 nearly
     as fast as on an Intel 486 at the same clock frequency. As desirable as
     the F4X4 instruction may seem, however, there are very few applications
     that make use of it when an IIT coprocessor is detected at run time
     (among them Schroff Development's Silver Screen and Evolution
     Computing's Fast-CAD 3-D [25]).
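
     For readers who want to see the operation that F4X4 replaces, here is
     a "programmed solution" sketched in Python. It is not IIT's algorithm,
     just the textbook matrix-by-vector product, with counters added to
     show where the 16 multiplications and 12 additions come from. The
     function name and the sample matrix are made up for this illustration.

```python
def mat4_times_vec4(m, v):
    """Multiply a 4x4 matrix (list of 4 rows) by a 4-element vector.

    Returns (result, multiplications, additions).
    """
    result, muls, adds = [], 0, 0
    for row in m:
        acc = row[0] * v[0]          # first product of the row
        muls += 1
        for a, x in zip(row[1:], v[1:]):
            acc += a * x             # three more products, three additions
            muls += 1
            adds += 1
        result.append(acc)
    return result, muls, adds


# A translation by (5, 6, 7) in homogeneous coordinates, applied to (1, 2, 3)
translate = [[1, 0, 0, 5],
             [0, 1, 0, 6],
             [0, 0, 1, 7],
             [0, 0, 0, 1]]
print(mat4_times_vec4(translate, [1, 2, 3, 1]))  # ([6, 8, 10, 1], 16, 12)
```

     With 28 arithmetic operations in about 280 clock cycles, F4X4 averages
     roughly 10 clock cycles per operation.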

     These IIT-specific instructions also work correctly when using a Chips &
     Technologies 38600DX or a Cyrix 486DLC CPU, which are both marketed as
     faster replacements for the Intel 386DX CPU.

     Tests I ran with the IEEETEST program show that the 3C87 is not fully
     compatible with the IEEE-754 standard for floating-point arithmetic,
     although the manufacturer claims otherwise. It is indeed possible that
     the reported errors are due to personal interpretations of the standard
     by the program's author that have been incorporated into IEEETEST and
     that the standard also supports the different interpretation chosen by
     IIT. On the other hand, the IEEE test vectors incorporated into IEEETEST
     have become somewhat of an industry standard [66] and Intel's 387, 486,
     and RapidCAD chips pass the test without a single failure, so the fact
     that the IIT 3C87 fails some of the tests indicates that it is not fully
     compatible with the Intel 387 coprocessor. My tests also show that the
     IIT 3C87 does not support denormals for the double extended format. It
     is not entirely clear whether the IEEE standard mandates support for
     extended precision denormals, as the IEEE-754 document explicitly only
     mentions single and double-precision denormals. Missing support for
     denormals is not a critical issue for most applications, but there are
     some programs for which support of denormals is at the very least quite
     helpful [41]. In any case, failure of the 3C87 to support extended
     precision denormal numbers does represent an incompatibility with the
     Intel 387 and 486 chips.

     The 3C87 is implemented in an advanced CMOS process and has low power
     requirements, typically about 600 mW. Like the 387 'clones' from Cyrix
     and ULSI, the 3C87 does not support asynchronous operation of the CPU
     and the coprocessor, but always runs at the full speed of the CPU. It is
     available in 16, 20, 25, 33, and 40 MHz versions.


IIT 3C87SX

     This is the version of the IIT 3C87 that is intended for use with
     Intel's 386SX or AMD's Am386SX CPU, and is functionally equivalent to
     the IIT 3C87. Due to the 16-bit data path between the CPU and the
     coprocessor in a 386S