SGen

SGen is a generator capable of producing efficient hardware designs for a variety of signal processing transforms. These designs operate on streaming data, meaning that the dataset is divided into several chunks that are processed during several cycles, thus allowing a reduced use of resources. The size of these chunks is called the streaming width. As an example, the figures below represent three discrete Fourier transforms on $8$ elements, with a streaming width of $8$ (no streaming), $4$ and $2$.

Iterative Cooley-Tukey FFT on 8 points, streamed with a streaming width of 4.

Iterative Cooley-Tukey FFT on 8 points, streamed with a streaming width of 2.

The generator outputs a Verilog file that can be used for FPGAs.

Using SGen

There are several ways of using SGen:

You can use the following web interface to download one of the designs we have pre-generated. Note that only a subset of the design parameters offered by SGen is present.
Transformation i
Transformation
Designs that can be downloaded here perform one of the following transforms:

Discrete/Fast Fourier transforms (DFT/FFT),

Walsh-Hadamard transforms (WHT), or

permutations.
Loading... Streaming width i
Streaming width
This parameter controls the number of elements that are processed every cycle. It corresponds to the number of input ports, and the number of output ports. The number of cycles to input (resp. output) the dataset is the tranform size divided by the streaming width.
Loading...

Get it! (ZIP file)
You can download the source code of this project on github and compile it using Scala 3 (Dotty). It works on any operating system, and doesn’t require any external library. It can optionnaly use Xilinx Vivado to run simulations on the generated designs or to measure their performance after place-and-route. It contains pre-generated FloPoco operators for a variety of floating-point representations, and can use FloPoCo for more exotic arithmetic representations (linux only).

Assuming that git and SBT are installed, the following commands in a Windows or Linux console will generate a streaming Walsh-Hadamard transform on 8 points:

git clone https://github.com/fserre/sgen.git
cd sgen
sbt "run -n 3 wht"

Technical highlights

The generator receives as input the desired transform, and its parameters (size or streaming width for instance). The desired design goes trhrough a pipeline that consists of four intermediate representation, where different levels of optimization take place:

an algorithm level representation, SPL,
a streaming-block level representation, Streaming-SPL,
an acyclic streaming IR, and
an RTL level IR.

Each of these representation use a domain specific languages (DSL) introduced below, and explained in details in [1] and [3].

SPL

The transform chosen by the user is first expressed in SPL [8], a language capable of representing linear algorithms through the use of matrices, and operators on these matrices. A set of rewriting rules simplifies this expression until it can be used for the subsequent stage. This process is independent of the streaming width, and of the arithmetic representation that will be used in the final design.

Keeping the example introduced earlier, the algorithm to compute a DFT on $2^3 = 8$ can be derived using iteratively a Cooley-Tukey Fast Fourier tranform rewriting. It yields (after simplifications) the following architecture:

This dataflow is represented internally with the following SPL formula (note that matrices order is reversed compared to the dataflow as an input vector would be evaluated from right to left):

$\pi\begin{pmatrix}. & . & 1\\1 & . & .\\. & 1 & .\end{pmatrix} \cdot \left( I_{2^{2}}\otimes DFT_2\right ) \cdot E(3,1,0) \cdot \pi\begin{pmatrix}. & . & 1\\. & 1 & .\\1 & . & .\end{pmatrix} \cdot \left( I_{2^{2}}\otimes DFT_2\right ) \cdot E(3,1,1) \cdot \pi\begin{pmatrix}1 & . & .\\. & . & 1\\. & 1 & .\end{pmatrix} \cdot \left( I_{2^{2}}\otimes DFT_2\right ) \cdot \pi\begin{pmatrix}. & . & 1\\. & 1 & .\\1 & . & .\end{pmatrix}.$

Here, $I_{2^{2}}\otimes DFT_2$ represents an array of butterflies, $E$ a multiplication by constants, and $\pi$ a permutation.

Given a streaming width, this SPL formula is then transformed (folded) into a streaming-SPL formula, using the methods described in [2] and [5], using the algorithms of [6].

Streaming-SPL

This DSL represents linear transforms as well, but includes the corresponding streaming width. It shares similarities with the Hardware-SPL introduced in [8], but offers a finer representation of streaming permutations by introducing new nodes (array of switches or multiplexers, temporal permutations). More details on these streaming blocks can be found in [2].

After the folding from SPL, the design is optimized using a set of rewriting rules (for example, in the case of compact designs, some permutation elements can be unrolled outside of a streaming product to increase throughput).

Following our previous example, folding our FFT on $8$ elements with a streaming width of $4$ yields in the streaming-SPL the following formula:

$$\pi\begin{pmatrix}. & 1\\1 & .\end{pmatrix} \cdot SA\begin{pmatrix}1\end{pmatrix} \cdot TP\left(\begin{pmatrix}1\end{pmatrix},\begin{pmatrix}. & 1\end{pmatrix}\right) \cdot SA\begin{pmatrix}1\end{pmatrix} \cdot \left( I_{2^{2}}\otimes DFT_2\right ) \cdot E^{3,1}_{0} \cdot SA\begin{pmatrix}1\end{pmatrix} \cdot TP\left(\begin{pmatrix}1\end{pmatrix},\begin{pmatrix}. & 1\end{pmatrix}\right) \cdot SA\begin{pmatrix}1\end{pmatrix} \cdot $$ $$\left( I_{2^{2}}\otimes DFT_2\right ) \cdot E^{3,1}_{1} \cdot \pi\begin{pmatrix}. & 1\\1 & .\end{pmatrix} \cdot \left( I_{2^{2}}\otimes DFT_2\right ) \cdot SA\begin{pmatrix}1\end{pmatrix} \cdot TP\left(\begin{pmatrix}1\end{pmatrix},\begin{pmatrix}. & 1\end{pmatrix}\right) \cdot SA\begin{pmatrix}1\end{pmatrix} $$

This corresponds to the following dataflow:

Iterative Cooley-Tukey FFT on 8 points folded with a streaming width of 4

Here, 2x2 switches are represented as gray boxes ($SA$ in the formula), and blue squares represent a temporal permutation ($TP$ in the formula) implemented using shift registers. Larger designs can as well use RAM banks instead, as it is the case with a streaming width of 2:

Iterative Cooley-Tukey FFT on 8 points folded with a streaming width of 2

Once this step is completed, (acyclic) nodes are converted into an acyclic streaming IR, as described in the next section (In our example, all nodes are acyclic. Nodes that have cycles (streaming product) are directly implemented in the final RTL graph instead).

Acyclic Streaming IR

In a third step, the design (or more precisely, parts of the design that do not contain cycles) is represented in an acyclic streaming IR. This is a directed acyclic graph where nodes, signals, represent the different operations performed (e.g. multiplexers, adders, multipliers, etc…). It is similar to the final RTL Graph representation, but abstracts away the hardware arithmetic representation; it uses software datatypes (e.g. double precision floating point) to propagate constants.

In addition, absolute timing doesn’t appear at this stage (only local relative timing for RAM banks / shift registers) thus allowing easier arithmetic simplifications and rewritings.

Back to our example: the following graph corresponds to the $DFT_8$ design with a streaming width of $4$. Note that constants and some other source operators (timers, counters) are duplicated in this graph for better view.

When the simplifications at this step are done, the graph is translated into an RTL graph in two steps:

Signals are translated to the desired hardware arithmetic representation, and the software datatype is erased.
Signals are synchronised: they receive an absolute timing (i.e. the number of cycles of advance they should have compared to the design outputs). The latency of the design (in cycles) is computed, registers and shift registers are added between nodes, and a token-based synchronisation system is generated.

RTL Graph

As a final step, the design is represented as an RTL Graph. In this DAG, nodes correspond to the operators of a hardware description language.

The figure below shows the RTL Graph of our $DFT_8$, streamed on $4$ ports, and that uses a $2 \times (8, 8)$-fixed-point representation. Note that constants are duplicated, and that the reset signal is not represented for better view.

Compared to the Acyclic Streaming IR, one can note that

registers and shift registers are now present,
constants show the decimal representation that their binary value would have if read as a two’s complement signed integer.

This design is then unparsed as a Verilog code.

Generator implementation

Our generator is implemented in Scala 3 (Dotty), and implements its DSLs using some of the concepts that were introduced in LMS/SpiralS. As an example here, we consider the implementation of the acyclic streaming IR.

All the different kind of signals inherit from the class Sig: [actual code]

abstract class Sig[T: HW]:
  final val hw = HW[T]

Here, T represents the underlying numeric datatype of the signal as it would be represented in software (e.g. Int, Double), and HW a context bound indicating how T should be translated into hardware.

Below are some concrete examples of signals, which respectively represent an immediate constant (class Const [actual code]) and the bitwise and of two signals (class And [actual code]):

case class Const[T: HW](value: T) extends Sig[T]
case class And(lhs: Sig[Int], rhs: Sig[Int]) extends Sig(using lhs.hw)

The hardware datatype HW contains all the informations needed to implement a given numerical representation into hardware. They extend the class HW: [actual code]

abstract class HW[T: Numeric](val size: Int):
  def bitsOf(const: T): BigInt
  def valueOf(const: BigInt): T
  def add(terms: Seq[Sig[T]]): Sig[T]
  def minus(lhs: Sig[T], rhs: Sig[T]): Sig[T]
  ...

The following classes represent some of these hardware datatypes: [actual code]

case class Unsigned(_size: Int) extends HW[Int](_size):
  ...
  
case class FixedPoint(wE: Int, wF: Int) extends HW[Double](wE + wF):
  ...
  
case class FloPoCo(wE: Int, wF: Int) extends HW[Double](wE + wF + 2):
  ...

Methods/operators that only exist on a specific datatype are implemented using extensions. Here, the operator x & y is implemented only for signals representing an integer: [actual code]

extension (lhs: Sig[Int]) def &(rhs: Sig[Int]) = And(lhs, rhs)

Smart constructors allow to propagate constants without having to ever create a new signal: [actual code]

object And:
  def apply(lhs: Sig[Int], rhs: Sig[Int]): Sig[Int] =
    require(lhs.hw == rhs.hw, "The two operands do not have the same hardware representation.")
    given HW[Int] = lhs.hw
    (lhs, rhs) match 
      case (Const(lhs), Const(rhs)) => Const(lvalue & rvalue)
      case _ => new And(lhs, rhs)

This DSL allows the nodes of the Streaming-SPL IR to be easily expressed in the acyclic streaming IR. For example, the node $E$ containing the twiddle factors for the Cooley-Tukey FFT can be written as: [actual code]

def implement(inputs: Seq[Sig[Complex[Double]]]): Seq[Sig[Complex[Double]]] = (0 until K).map(p => //for each input wire inputs(p),
  val twiddles: Vector[Complex[Double]] = Vector.tabulate(T)(c => coef((c * K) + p)) // create the appropriate vector of twiddles
  val control = Timer(T) // Timer that runs for T cycles, where T = 2^t
  val twiddle = ROM(twiddles, control)(hw) // ROM containing the twiddles, and controlled by the timer
  inputs(p) * twiddle) // Multiplication

Note that this code is agnostic of the final hardware arithmetic representation.

Command-line interface

A command line consists of a list parameters followed by the desired transform.

Parameters

The following parameters can be used:

-n n Logarithm of the size of the transform. As an example, -n 3 means that the transform operates on 2^3=8 points. This parameter is required.
-k k Logarithm of the streaming width of the implementation. -k 2 means that the resulting implementation will have 2^2=4 input ports and 4 output ports, and will perform a transform every 2^(n-k) cycles. In case where this parameter is not specified, the implementation is not folded, i.e. the implementation will have one input port and one output port for each data point, and will perform one transform per cycle.
-r r Logarithm of the radix (for DFTs and WHTs). This parameter specifies the size of the base transform used in the algorithm. r must divide n, and, for compact designs (dftcompact and whtcompact), be smaller or equal to k. It is ignored by tranforms not requiring it (permutations). If this parameter is not specified, the highest possible radix is used.
-o filename Name of the output file.
-benchmark Adds a benchmark module in the generated design.
-rtlgraph Produces a DOT graph representing the design.
-dualramcontrol Uses dual-control for memory (read and write addresses are computed independently). This option yields designs that use more resources than with single RAM control (default), but that have more flexible timing constraints (see the description in generated files). This option is automatically enabled for compact designs.
-singleported Uses single-ported memory (read and write addresses are the same). This option has the same constraints as single RAM control (default), but may have a higher latency.
-zip Creates a zip file containing the design and its dependencies (e.g. FloPoCo modules).
-hw repr Hardware arithmetic representation of the input data. repr can be one of the following:
- char Signed integer of 8 bits. Equivalent of signed 8.
- short Signed integer of 16 bits. Equivalent of signed 16.
- int Signed integer of 32 bits. Equivalent of signed 32.
- long Signed integer of 64 bits. Equivalent of signed 64.
- uchar Unsigned integer of 8 bits. Equivalent of unsigned 8.
- ushort Unsigned integer of 16 bits. Equivalent of unsigned 16.
- uint Unsigned integer of 32 bits. Equivalent of unsigned 32.
- ulong Unsigned integer of 64 bits. Equivalent of unsigned 64.
- float Simple precision floating point (32 bits). Equivalent of ieee754 8 23.
- double Double precision floating point (64 bits). Equivalent of ieee754 11 52.
- half Half precision floating point (16 bits). Equivalent of ieee754 5 10.
- minifloat Minifloat of 8 bits. Equivalent of ieee754 4 3.
- bfloat16 bfloat16 floating point . Equivalent of ieee754 8 7.
- unsigned size Unsigned integer of size bits.
- signed size Signed integer of size bits. Equivalent of fixedpoint size 0.
- fixedpoint integer fractional Signed fixed-point representation with a integer part of integer bits and a fractional part of fractional bits.
- flopoco wE wF FloPoCo floating point representation with an exponent size of wE bits, and a mantissa of wF bits. The resulting design will depend on the corresponding FloPoCo generated arithmetic operators, which must be placed in the flopoco folder. In the case where the corresponding vhdl file is not present, SGen provides the command line to generate it. Custom options (e.g. frequency or target) can be used.
- ieee754 wE wF IEEE754 floating point representation with an exponent size of wE bits, and a mantissa of wF bits. Arithmetic operations are performed using FloPoCo. Note that, unless otherwise specified when generating FloPoCo operators, denormal numbers are flushed to zero.
- complex repr Cartesian complex number. Represented by the concatenation of its coordinates, each represented using repr arithmetic representation.

Transforms

Supported transforms are the following: ### Streaming linear permutations Linear permutations can be implemented using the lp command:

# generates a bit-reversal permutation on 32 points, streamed on 2^2=4 ports.
sbt "run -n 5 -k 2 lp bitrev"

# generates a streaming datapath that performs a bit-reversal permutation on 8 points on the first dataset, and a "half-reversal" on the second dataset on 2 ports
sbt "run -n 3 -k 1 lp bitrev 100110111"

The command lp takes as parameter the invertible bit-matrix representing the linear permutation (see this publication for details) in row-major order. Alternatively, the matrix can be replaced by the keyword bitrev or identity.

Several bit-matrices can be listed (seperated by a space) to generate a datapath performing several permutations. In this case, the first permutation will be applied to the first dataset entering, the second one on the second dataset, …

The resulting implementation supports full throughput, meaning that no delay is required between two datasets.

Fourier Transforms (full throughput)

Fourier transforms (with full-throughput, i.e. without delay between datasets) can be generated using the dft command:

# generates a Fourier transform on 16 points, streamed on 4 ports, with fixed-point complex datatype with a mantissa of 8 bits and an exponent of 8 bits.
sbt "run -n 4 -k 2 -hw complex fixedpoint 8 8 dft"

Fourier Transforms (compact design)

Fourier transforms (with an architecture that reuses several times the same hardware) can be generated using the dftcompact command:

# generates a Fourier transform on 1024 points, streamed on 8 ports, with fixed-point complex datatype with a mantissa of 8 bits and an exponent of 8 bits.
sbt "run -n 10 -k 3 -hw complex fixedpoint 8 8 dftcompact"

RAM control

In the case of a streaming design (n > k), memory modules may need to be used. In this case, SGen allows to choose the control strategy used for this module:

Dual RAM control: read and write addresses are computed independently. This offers the highest flexibility (a dataset can be input at any time after the previous one), but this uses more resources. It is automatically used for compact designs (dftcompact), but can be enabled for other designs using the -dualRAMcontrol parameter.
Single RAM control: write address is the same as the read address, delayed by a constant time. This uses less resources, but it has less flexibility: a dataset must be input either immediately after the previous one, or wait that the previous dataset is completely out. This is the default mode (except for compact designs).
Single-ported RAM: write and read addresses are the same. This has the same constraints as Single RAM control, but may have a higher latency.

Contact

Feel free to report any bug, issue, or suggestion to François Serre.

References

François Serre and Markus Püschel
DSL-Based Hardware Generation with Scala: Example Fast Fourier Transforms and Sorting Networks
ACM Transactions on Reconfigurable Technology and Systems (TRETS), Vol. 13, Issue 1, No. 1, 2019
François Serre and Markus Püschel
Memory-Efficient Fast Fourier Transform on Streaming Data by Fusing Permutations
Proc. International Symposium on Field-Programmable Gate Arrays (FPGA), pp. 219-228, 2018
François Serre and Markus Püschel
A DSL-Based FFT Hardware Generator in Scala
Proc. International Conference on Field Programmable Logic and Applications (FPL), pp. 315-322, 2018
François Serre and Markus Püschel
Optimal Streamed Linear Permutations
Proc. Symposium on Computer Arithmetic (ARITH), pp. 60-61, 2017
François Serre, Thomas Holenstein and Markus Püschel
Optimal Circuits for Streamed Linear Permutations using RAM
Proc. International Symposium on Field-Programmable Gate Arrays (FPGA), pp. 215-223, 2016
François Serre and Markus Püschel
Generalizing block LU factorization: A lower-upper-lower block triangular decomposition with minimal off-diagonal ranks
Linear Algebra and its Applications, Vol. 509, pp. 114-142, 2016
Peter A. Milder
Spiral DFT/FFT IP Core Generator
Peter A. Milder, Franz Franchetti, James C. Hoe, and Markus Püschel
Computer Generation of Hardware for Linear Digital Signal Processing Transforms
ACM Transactions on Design Automation of Electronic Systems (TODAES), Vol. 17, No. 2, 2012