Reconfigurable systems make it possible to create extremely
high-performance implementations for many different types of applications.
While techniques such as logic emulation provide a new tool specifically
for logic designers, many other FPGA-based systems serve as
high-performance replacements for standard computers for everything from
embedded systems to supercomputers.
The creators of these implementations are often software programmers,
not hardware designers. However, if these systems hope to be usable by
software programmers, they must be able to translate applications
described in standard software programming languages into FPGA
realizations. Thus, mapping tools that can synthesize hardware
implementations from C, C++, Fortran or assembly language descriptions
must be developed.
Although there are ways to transform specifications written in
hardware-description languages into electronic circuits, translating a
standard software program into hardware presents extra challenges. HDLs
focus mainly on constructs and semantics that can be efficiently
translated into hardware (though even these languages allow the creation
of nonsynthesizable specifications).
Software programming languages have no such restrictions. For example,
hardware is inherently parallel and HDLs have an execution model that
easily expresses concurrency. Most standard software languages, by
contrast, have a sequential execution model, with instructions executing
one after another.
This means that either a hardware implementation of a software program is
restricted to sequential operation, yielding an extremely inefficient
circuit, or the mapping software must parallelize an inherently sequential
specification. Also, some operations commonly found in software programs
are relatively expensive to implement in hardware, including
multiplication and variable-length shifts, as well as floating-point
operations.
Although hardware can be synthesized to support these operations,
software that makes extensive use of these operations will result in
extremely large designs. Finally, software algorithms operate on
standard-size data values, using 8-, 16-, 32- or 64-bit values even for
operations that could easily fit into smaller bit-widths. Using
wider-than-necessary operands forces the circuit's data paths to be wider
as well, increasing hardware costs. Thus, because we are using a language
designed for specifying software programs to create hardware
implementations, the translation software faces a mapping process more
complex than for standard HDLs.
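To make the width problem concrete, here is a minimal C sketch (variable
names invented): a counter that never exceeds 9 needs only a 4-bit data
path in hardware, yet the natural declaration reserves a full machine
word. A bitfield is one of the few ways C can even state the narrower
width, and it covers only storage, not intermediate results, so a
translator must recover such widths through analysis or user annotations.

    #include <stdio.h>

    struct narrow {
        unsigned count : 4;   /* 4-bit storage: the width the hardware needs */
    };

    int main(void) {
        unsigned int count32 = 0;    /* what the C type system hands us */
        struct narrow n = { 0 };

        for (int i = 0; i < 10; i++) {   /* the counter never exceeds 9 */
            count32++;
            n.count++;
        }
        printf("32-bit counter: %u, 4-bit field: %u\n",
               count32, (unsigned)n.count);
        return 0;
    }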
Code translators
Many research projects have
developed methods for translating code in C, C++, Ada, Occam,
data-parallel C, Smalltalk, assembly and Java, as well as special HDLs,
into FPGA realizations. These systems typically take software programs
written in a subset of the programming language, translate the data
computations into hardware operations and insert multiplexing, latches and
control state machines to recreate the control flow.
But software programming languages contain constructs that are difficult
to handle efficiently in FPGA logic, so these translation techniques often
restrict which language constructs may appear in the code to be
translated. Most do not allow multiplication, division or floating-point
operations; some ban the use of structures, pointers or arrays, eliminate
recursion, or do not support function calls or control-flow constructs
such as case, do-while and loops without fixed iteration counts.
Some techniques, which are intended primarily to compile only short
code sequences, may restrict data structures to only bit vectors, or not
support memory accesses at all. Other techniques extend C++ for use as an
HDL.
It is relatively simple to translate straight-line code from software
languages into hardware. Expressions in the code have direct
implementations in hardware that can compute the correct result (they
must, since any processor the language is meant to run on must have
hardware to execute them). Variables could be stored in registers, with the
result from each instruction latched immediately and one instruction
executing at a time. However, this loses the inherent concurrency of
hardware. We can instead combine multiple instructions, latching the
results only at the end of the overall computation.
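As a minimal sketch (names invented), both fragments below compute the
same value, but the second corresponds to a single combinational circuit
whose output is latched once, rather than three registers loaded one per
instruction:

    #include <stdio.h>

    int main(void) {
        int b = 3, c = 4, d = 5;

        /* One-instruction-at-a-time view: each intermediate result
           would be latched in its own register. */
        int t1 = b + c;
        int t2 = t1 << 1;
        int a_seq = t2 - d;

        /* Combined view: the same dataflow collapsed into one
           combinational expression, latched only at the end. */
        int a_comb = ((b + c) << 1) - d;

        printf("%d %d\n", a_seq, a_comb);   /* both print 9 */
        return 0;
    }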
Renaming is a standard compiler technique for making such code
parallel, a variant of which is contained in the Transmogrifier C compiler
developed at the University of Toronto. The compiler moves sequentially
through a straight-line code sequence, remembering the logic used to
create the value of each assignment. A variable whose value comes from
outside this piece of code draws its value from the register for that
variable. A variable that has already been the target of an assignment
within the sequence is replaced with the output of the logic that computed
its value earlier in the sequence.
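The effect of renaming can be sketched directly in C (names invented).
Giving each assignment a fresh name exposes the dataflow: every use of a
variable connects to the logic that last produced it, and only the final
value needs a register at the end of the block:

    #include <stdio.h>

    int main(void) {
        int a = 2, b = 3;

        /* Original sequence: x is assigned twice. */
        int x = a + b;
        int y = x * x;
        x = y - a;

        /* After renaming: each assignment gets a fresh name, so x1 and
           y1 are just internal wires; only x2 is registered. */
        int x1 = a + b;       /* first definition of x */
        int y1 = x1 * x1;
        int x2 = y1 - a;      /* second definition of x */

        printf("%d %d\n", x, x2);   /* both print 23 */
        return 0;
    }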
For complex control flow, such as loops that execute a variable number
of times or "if" statements containing function calls, control-state
machines must be implemented. The code is broken into blocks of
straight-line code. Each of these code segments is represented by a
separate state in the control state machine. The straight-line code is
converted into logic functions with the series of operations collapsed
into a combinational circuit. Variables whose scope spans code segments
are held in registers. The input to the register is a mux that combines
the logic for assignment statements from different code sequences.
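A C model of such a controller, sketched for a simple accumulation loop
(names invented): each case of the switch plays the role of one control
state, one trip through the outer loop stands in for one clock cycle, and
i and acc model the registers whose scope spans the code segments:

    #include <stdio.h>

    enum state { S_TEST, S_BODY, S_DONE };

    int main(void) {
        int i = 0, acc = 0, n = 5;     /* register contents */
        enum state s = S_TEST;

        while (s != S_DONE) {          /* one iteration = one clock cycle */
            switch (s) {
            case S_TEST:               /* evaluate the loop condition */
                s = (i < n) ? S_BODY : S_DONE;
                break;
            case S_BODY:               /* loop body collapsed into one
                                          combinational segment */
                acc += i;
                i += 1;
                s = S_TEST;
                break;
            case S_DONE:
                break;
            }
        }
        printf("acc = %d\n", acc);     /* 0+1+2+3+4 = 10 */
        return 0;
    }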
Once the controller for the hardware is constructed, techniques can be
used to simplify this state machine. States can be combined sequentially
or in parallel, allowing greater concurrency in the hardware as well as
minimizing the hardware cost of the controller. However, an even more
important matter is the simplification of the data path, something that
hasn't yet been dealt with adequately. In the construction given above
every operation in the source code generates unique hardware. For simple
computations this is fine.
However, complex operations such as multiplication and division will
be scattered throughout the source code, implying that a huge amount of
hardware will be needed to implement all of these computations. But each
multiplier will be used only within the state corresponding to that
portion of the source code; otherwise, it will sit idle. The circuit's
size could be greatly reduced if those hardware multipliers were reused in
different places in the code. A single hardware multiplier can be used for
many separate multiplication operations from the source code, as long as
each occurs in different states.
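This sharing can be sketched in C (all names invented): one multiply
routine stands in for the single hardware multiplier, the switch models
the operand muxes that the state machine controls, and the final
assignment routes the product to the register belonging to the current
state:

    #include <stdio.h>

    /* The one shared multiplier; operands arrive through muxes. */
    static int shared_mul(int a, int b) { return a * b; }

    int main(void) {
        int x = 3, y = 4, p = 0, q = 0;

        for (int state = 0; state < 2; state++) {
            int op_a, op_b;
            switch (state) {                    /* operand muxes */
            case 0:  op_a = x; op_b = y; break; /* state 0: p = x * y */
            default: op_a = p; op_b = p; break; /* state 1: q = p * p */
            }
            int prod = shared_mul(op_a, op_b);
            if (state == 0) p = prod;           /* result demux */
            else            q = prod;
        }
        printf("p = %d, q = %d\n", p, q);       /* p = 12, q = 144 */
        return 0;
    }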
There are several different systems that convert code sequences in
software programming languages into hardware realizations. While they all
support the translation of control flow and data path operations into
circuits, they differ on the amount of code they convert and the model of
operation they assume.
Perhaps the most straightforward is the Transmogrifier C system. It
takes a complete program written in C and translates it into a circuit
that can be implemented directly in the configurable logic blocks of
Xilinx FPGAs. Special pragma (compiler directive) statements are included
to declare external inputs and outputs, including the assignment of those
communications to specific FPGA pins. The resulting hardware
implementation is expected to be a complete, self-contained system
implementing the entire desired behavior.
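The directives below are invented for illustration; they are not
Transmogrifier C's actual syntax, but they convey the flavor of declaring
external I/O and pin bindings inside otherwise ordinary C:

    #include <stdio.h>

    /* Hypothetical directives -- not the tool's real pragma names. */
    #pragma fpga_input  sensor    /* say, bound to pins P10-P17 */
    #pragma fpga_output alarm     /* say, bound to pin P20 */

    int sensor;
    int alarm;

    void step(void) {             /* the entire behavior, destined for
                                     the FPGA fabric */
        alarm = (sensor > 200);
    }

    int main(void) {              /* host-side test harness only */
        sensor = 250;
        step();
        printf("alarm = %d\n", alarm);   /* prints 1 */
        return 0;
    }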
Critical assumption
Most of the systems for
translating software programs into hardware algorithms assume that only
the most time-critical portions of the code are mapped. Those systems use
the FPGA, or FPGAs, as a coprocessor to a standard CPU. The processor
implements most of the program, handling many of the operations that are
necessary to implement complex algorithms, but which contribute little to
the total computation time. The truly time-critical portions of the
algorithm are translated into hardware, using the FPGA to implement the
small fraction of the total code complexity that accounts for most of the
overall run-time. In that way the strengths of both FPGAs and standard
processors are combined into a single system. Processors can easily
implement a large variety of operations by working through a complex
series of instructions stored in high-density memory chips.
Mapping all those instructions into FPGA logic means that the complete
functionality of the entire program must be available simultaneously,
using a huge amount of circuit space. However, we can implement only the
most frequently used portions of code inside the FPGA, achieving a
significant performance boost with only a small amount of hardware. During
the execution of the program, the processor executes the software code
until it hits a portion of the code implemented inside the FPGA
coprocessor. The processor then transfers the inputs to the function to
the FPGA coprocessor and tells it to begin computing the correct
subroutine.
Once the FPGA has computed the function, the results are transferred
back to the processor, which continues with the rest of the software code.
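A rough sketch of that handshake in C, with a hypothetical register
layout: a real system would expose these registers at a fixed bus address,
and the configured hardware, not a local function, would produce the
result.

    #include <stdio.h>
    #include <stdint.h>

    struct fpga_regs {            /* hypothetical coprocessor interface */
        uint32_t arg0, arg1;      /* operands written by the host */
        uint32_t start;           /* host sets to 1 to begin */
        uint32_t done;            /* coprocessor sets to 1 when finished */
        uint32_t result;
    };

    /* Stand-in for the circuit configured onto the FPGA. */
    static void fpga_model(struct fpga_regs *r) {
        r->result = r->arg0 * r->arg0 + r->arg1;   /* the "hot" kernel */
        r->done = 1;
    }

    int main(void) {
        struct fpga_regs regs = { 0 };

        regs.arg0 = 7;            /* host transfers the inputs... */
        regs.arg1 = 5;
        regs.start = 1;           /* ...and tells the FPGA to begin */
        fpga_model(&regs);

        while (!regs.done)        /* wait for completion */
            ;
        printf("result = %u\n", regs.result);   /* prints 54 */
        return 0;
    }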
An added benefit of the coprocessor model is that the software-to-hardware
compiler does not have to support all operations from the software
programming language, since complex functions such as multiplication or
memory loads can be handled instead by the host processor.
Limited portions
This does limit the portions of the code that can be
translated into hardware. However, a system that converts the complete
program into hardware must either convert those operations into FPGA
realizations, yielding much larger area requirements, or ban those
constructs from the source code, limiting the types of operations and
algorithms that can be supported. Systems such as the Nimble compiler
developed at Synopsys provide a middle ground, making effective use of
both FPGA and CPU components for different portions of a given
computation.
Although compiling only the critical regions of a software algorithm
can reduce the area requirements and avoid hardware-inefficient
operations, it does introduce problems unique to those types of systems.
One problem is that some mechanism must be introduced for communicating
operands and results between the processor and the coprocessor. For
systems like Harvard's Prisc, which view the FPGA as merely a mechanism
for increasing the instruction set of the host processor, instructions are
restricted to reading from two source registers and writing one result
register, just like any other instruction on the processor.
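Sketched as a C function (the function itself is an arbitrary example),
the Prisc-style constraint simply says that whatever circuit is loaded
must have the shape of an ordinary ALU operation: two source words in, one
result word out:

    #include <stdio.h>
    #include <stdint.h>

    /* Any circuit fitting the (rs1, rs2) -> rd shape qualifies; here,
       the Hamming distance of the two source registers. */
    static uint32_t pfu_exec(uint32_t rs1, uint32_t rs2) {
        uint32_t x = rs1 ^ rs2, bits = 0;
        while (x) {
            bits += x & 1;
            x >>= 1;
        }
        return bits;
    }

    int main(void) {
        printf("%u\n", pfu_exec(0xF0u, 0x0Fu));   /* prints 8 */
        return 0;
    }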
However, other systems couple the FPGA much less tightly and require
explicit communication protocols between processor and coprocessor. In
most of these systems the
communication mechanism puts a hard limit on the amount of information
that can be communicated and thus the amount of the computation that can
be migrated to hardware. For example, if only two input words and a single
output word are allowed, there is obviously only so much useful
computation that can be performed in most circumstances.
A second important concern with compiling only a portion of a program
into hardware is determining which portions of the code to map.
Obviously, the code that gets mapped to the hardware needs to contain a
large portion of the run-time of the overall algorithm if the designer
hopes to achieve significant performance improvements.
However, it is difficult to determine strictly from the source code
where the critical portions of a program are. In general, the solutions
are either to profile the execution of the software on a sample input set
to find the most frequently executed code sequences, or to have the user
pick the code sequences by hand.
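On Unix-like systems, the profiling route commonly looks like the
following (file names are placeholders):

    cc -pg -O2 app.c -o app     # build with profiling instrumentation
    ./app < sample_input        # run on a representative input set
    gprof app gmon.out          # flat profile: hottest functions first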
But simply identifying the most often executed code sequences may not
yield the best speedups. Specifically, some code sequences will achieve
higher performance improvements than others when mapped to hardware. Also,
some code sequences may be too complex to map into hardware, not fitting
within the logic resources present in the FPGA. In general, greater
automation of this decision process is necessary.