Clover: 4-bit Quantized Linear Algebra
Clover is a new library for efficient computation on low-precision data, providing the mathematical routines required by fundamental methods in optimization and sparse recovery. Our library faithfully implements variants of stochastic quantization, supports 4-bit and 8-bit quantized data formats, and demonstrates that 4-bit can be implemented efficiently using Intel AVX2, despite the lack of native support for this data format.
Clover also supports the 16-bit half-precision and 32-bit single-precision IEEE 754 formats that are natively supported by Intel processors. Experimental results with dot product, matrix-vector multiplication, gradient descent (GD), and iterative hard thresholding (IHT) illustrate the attainable speedups, in many cases close to linear in the reduction of precision, due to reduced data movement.
This library was implemented in collaboration with Tyler Michael Smith as a research project at the Department of Computer Science at ETH Zurich, supervised by Dan Alistarh and Markus Püschel. The experimental results of this work were published at SIPS 2018 and are accessible in the preprint version here.
Clover Data Structure
Quantization is a lossy compression technique that maps continuous values to a finite set of fixed bit-width numbers. For a given vector $v$ of size $n$ and a precision of $b$ bits, we first derive a factor that scales the vector elements into the representable range:
$$ s_v = \frac{2^{b-1} - 1}{\max_{i \in [1, n]} |v_i|}. $$
The scaled values are then quantized stochastically:
$$ v_i \rightarrow \left\lfloor v_i \cdot s_v + \mu \right\rfloor, $$
where $\mu$ is drawn uniformly from the interval $[0, 1)$. With this, a quantized array consists of one scaling factor and an array of quantized $b$-bit values.
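For illustration, a minimal scalar sketch of this scheme (function name and structure are ours, not Clover's API) could look like:

```cpp
#include <algorithm>
#include <cmath>
#include <cstdint>
#include <random>
#include <vector>

// Illustrative sketch of stochastic quantization (not Clover's API).
// Computes the scale s_v for a b-bit format, then quantizes each
// element with a uniformly drawn offset mu in [0, 1).
float quantize_stochastic(const std::vector<float> &v,
                          std::vector<int8_t> &out, int b,
                          std::mt19937 &gen) {
    float amax = 0.0f;
    for (float x : v) amax = std::max(amax, std::fabs(x));
    const float max_q = float((1 << (b - 1)) - 1);   // 7 for b = 4
    const float s_v = amax > 0.0f ? max_q / amax : 1.0f;
    std::uniform_real_distribution<float> mu(0.0f, 1.0f);
    out.resize(v.size());
    for (size_t i = 0; i < v.size(); ++i) {
        float q = std::floor(v[i] * s_v + mu(gen));
        q = std::min(std::max(q, -max_q), max_q);    // clamp to range
        out[i] = (int8_t) q;
    }
    return s_v;
}
```

Note that in Clover the scaling factors are stored per block rather than per vector, and the division implied by $s_v$ is folded into the restore path.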
Functionality for quantized 4-bit linear algebra operations is provided by the CloverVector4 and CloverMatrix4 containers. CloverVector4 contains a vector represented in a quantized 4-bit format. The values and the scaling factors are stored in separate contiguous blocks of memory, such that subvectors of length 64 share the same scale.

An illustration of the memory layout of a CloverVector4 container is shown above. Each value in the vector is stored as a two's complement number in a 4-bit nibble. The x86 architecture uses byte addressing, so we address the nibbles in software. We accomplish this explicitly in the application logic, storing two consecutive nibbles in a byte. As a result, the values in a CloverVector4 of length n are represented as an array of n/2 bytes.
CloverVector8 shares the same design principles. The CloverVector16 and CloverVector32 containers do not have scales, and their data layout is not organized into blocks.
A simplified C++ representation of the CloverVector4 container is shown below:
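A sketch consistent with the description above (member names are illustrative; the actual class also manages allocation, alignment, and SIMD access) is:

```cpp
#include <cstdint>

// Simplified sketch of the CloverVector4 container.
struct CloverVector4 {
    uint64_t length;   // number of 4-bit elements
    int8_t  *values;   // length/2 bytes, two nibbles per byte
    float   *scales;   // one scale per block of 64 elements
};
```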
The 4-bit values stored in the values array are in the range $[-7, 7]$; each element at position pos is therefore obtained by unpacking the value, multiplying it with the corresponding scale, and dividing the result by 7.0f. The get routine is built around the helper _mm_srai_epi8_ss, which performs a right bit shift, extending in sign bits, operating on 8-bit integers.
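A scalar model of such a get routine (our own sketch of the logic, not the library's vectorized code; we assume the even-indexed element sits in the high nibble) might be:

```cpp
#include <cstdint>

// Scalar model of a 4-bit get: extract the nibble at `pos` from the
// packed byte array, sign-extend it, apply the block scale, and divide
// by 7.0f (the maximum 4-bit magnitude). Layout assumptions (two
// nibbles per byte, 64 values per scale block) follow the text above.
float clover_get4(const int8_t *values, const float *scales, uint64_t pos) {
    int8_t byte = values[pos >> 1];
    // Move the selected nibble into the high 4 bits, then shift back
    // arithmetically to sign-extend (the role of _mm_srai_epi8_ss).
    int8_t shifted = (pos & 1) ? (int8_t)(byte << 4) : byte;
    int8_t q = (int8_t)(shifted >> 4);
    return q * scales[pos >> 6] / 7.0f;
}
```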
Fast Packing and Unpacking of 4-bit Values
The x86 architecture does not support 4-bit integer arithmetic. Thus we unpack the values, representing them as larger (i.e., 8-, 16- or 32-bit) integers for computation. Then, for efficient storage and data movement, we pack these larger integers back, representing them as 4-bit integers stored within a larger entity.
The goal of unpacking from a 32-bit integer (right part of the figure below) is as follows: we start with a 32-bit entity that stores multiple nibbles inside of it, and we wish to extract a single 4-bit nibble and represent it as a 32-bit integer. This can be done with two bit shifts:
(1) a logical left shift is used to shift the nibble so that it occupies the highest-order 4 bits of the 32-bit entity;
(2) an arithmetic right shift is used to shift the nibble to the lowest-order 4 bits of the 32-bit entity.
The arithmetic right shift sign-extends, filling the high-order 28 bits with the sign bit of the nibble, yielding a 32-bit integer with the same value as the two's complement 4-bit value.
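In code, this unpacking scheme for the k-th nibble of a 32-bit word (a sketch, with k = 0 denoting the lowest nibble) is:

```cpp
#include <cstdint>

// Unpack nibble k (0 = lowest 4 bits) of a 32-bit word into a
// sign-extended 32-bit integer using the two-shift scheme:
// (1) logical left shift puts the nibble into bits 31..28,
// (2) arithmetic right shift by 28 moves it to bits 3..0,
//     replicating its sign bit into the upper 28 bits.
int32_t unpack_nibble(uint32_t word, int k) {
    uint32_t hi = word << (28 - 4 * k);  // step (1): logical left shift
    return (int32_t)hi >> 28;            // step (2): arithmetic right shift
}
```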
The goal of packing (left part of the figure above) is to revert the unpacking operation. Two bit shifts can be used to place the lowest-order 4 bits of a 32-bit integer anywhere within a 32-bit entity:
(1) a logical left shift is used to shift the nibble so that it occupies the highest-order 4 bits of the 32-bit entity;
(2) a logical right shift is used to shift the nibble to its target position within the 32-bit entity.
The first shift sets the bits lower-ordered than the nibble to zero, and the second sets the bits higher-ordered than the nibble to zero. A bitwise OR operation can then be used to store up to eight nibbles in the 32-bit entity.
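A matching sketch of the packing scheme, placing the low nibble of a value at position k and combining eight of them with OR:

```cpp
#include <cstdint>

// Pack the lowest 4 bits of `value` into nibble position k of a 32-bit
// word: (1) a logical left shift by 28 isolates the nibble in the top
// bits (zeroing everything below it), (2) a logical right shift moves
// it to its target position (zeroing everything above it).
uint32_t pack_nibble(int32_t value, int k) {
    uint32_t hi = (uint32_t)value << 28;   // step (1)
    return hi >> (28 - 4 * k);             // step (2)
}

// Eight 4-bit values can then be combined with bitwise OR.
uint32_t pack8(const int32_t *v) {
    uint32_t word = 0;
    for (int k = 0; k < 8; ++k)
        word |= pack_nibble(v[k], k);
    return word;
}
```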
Dot Product in Clover
Taking into consideration the fast packing and unpacking, as well as the structure of the Clover containers, let us illustrate how we can write a fast dot product routine using AVX2. Assume that we would like to compute the dot product of two vectors U and V.
We denote the quantized values and the scales of the vector U, and analogously those of the vector V, with dedicated local variables.
Now, let us set up the constants and initialize the accumulator for the result:
Since values are stored in blocks of 64 elements, we traverse each block using an iterator i, loading the quantized 4-bit values and their corresponding block scales:
The dot product routine requires all values to be multiplied pointwise before the result is reduced by summation. Therefore, once we have the scales, we can multiply them, while dividing each by a factor of 7.0f. However, division instructions are expensive, so instead we multiply with the reciprocal of 49.0f (stored in clover_mm256_rcp_49_ps):
The two loads from the two value arrays result in two AVX registers that hold 64 4-bit values each. As we cannot operate on 4-bit values natively, we convert them into 8-bit integers. Given 8-bit shifts, we could apply the fast packing/unpacking method to extract the low and high bits of each 8-bit chunk. However, AVX2 provides us with 16-bit shifts only.
To deal with that, we shift left by 4 bits and then perform bitwise AND operations, such that the low 4 bits of each 8-bit chunk are placed in its high 4 bits. To avoid dirty bits, we apply the previously defined mask in clover_mm256_1st_bit_set_epi8 to extract both the high and the low bits of qu and qv:
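Per byte, the net effect of this shift-and-mask trick can be modeled in scalar code (an illustration of the semantics, not the intrinsic sequence; the 0xF0 constant stands in for the mask):

```cpp
#include <cstdint>

// Model of extracting both nibbles of each byte such that each
// extracted 4-bit value ends up in the HIGH 4 bits of an 8-bit chunk
// (i.e. scaled by 16). The mask 0xF0 plays the role of the per-byte
// mask, clearing the dirty low bits left behind by the 16-bit shift.
void extract_nibbles(uint8_t byte, int8_t &hi, int8_t &lo) {
    hi = (int8_t)(byte & 0xF0);          // high nibble, already in place
    lo = (int8_t)((byte << 4) & 0xF0);   // low nibble moved up, masked
}
```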
At this point, we have two variables for qu and two variables for qv, each containing 32 elements. qu_hi holds the high 4 bits of each byte and qu_lo holds the low 4 bits. For each of these four variables, the extracted 4-bit values reside in the high 4 bits of their 8-bit chunk, i.e., they are all multiplied by $2^4 = 16$.
Now, we would like to multiply the corresponding high/low bits of the first vector with the corresponding high/low bits of the second vector. Unfortunately, AVX2 does not provide a vector instruction that can multiply signed 8-bit integers. Instead, it provides us with the vpmaddubsw instruction, which we can use to perform this computation:
The definition of this instruction is given as: Vertically multiply each unsigned 8bit integer from a with the corresponding signed 8bit integer from b, producing intermediate signed 16bit integers. Horizontally add adjacent pairs of intermediate signed 16bit integers, and pack the saturated results in dst.
Therefore, in order to benefit from this instruction, we need to make sure that the left operand is unsigned and that the sign is accumulated into the right operand:
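A scalar model of one 16-bit lane of vpmaddubsw, together with the sign-transfer trick (our own sketch; the vectorized code would rely on byte-wise absolute-value and sign instructions):

```cpp
#include <cstdint>
#include <cstdlib>

// Scalar model of one 16-bit output lane of vpmaddubsw:
// a0, a1 are treated as UNSIGNED 8-bit, b0, b1 as SIGNED 8-bit;
// the two products are summed into a saturating signed 16-bit result.
int16_t maddubs(uint8_t a0, uint8_t a1, int8_t b0, int8_t b1) {
    int32_t sum = (int32_t)a0 * b0 + (int32_t)a1 * b1;
    if (sum > 32767) sum = 32767;       // saturation, as in the ISA
    if (sum < -32768) sum = -32768;
    return (int16_t)sum;
}

// To multiply two SIGNED byte pairs with it, feed |u| as the unsigned
// operand and transfer u's sign onto v.
int16_t signed_mul_via_maddubs(int8_t u0, int8_t u1, int8_t v0, int8_t v1) {
    int8_t sv0 = (int8_t)(u0 < 0 ? -v0 : (u0 == 0 ? 0 : v0));
    int8_t sv1 = (int8_t)(u1 < 0 ? -v1 : (u1 == 0 ? 0 : v1));
    return maddubs((uint8_t)std::abs(u0), (uint8_t)std::abs(u1), sv0, sv1);
}
```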
At this point, we have computed the multiplication of corresponding elements and added neighbouring pairs, obtaining the results in 16-bit chunks. As the 8-bit chunks were already offset by a factor of $2^4 = 16$, each product carries a factor of $2^8 = 256$, and the 16-bit chunks contain values in the range $[-25088, 25088]$, i.e., $\pm 2 \cdot 112^2$. To continue, we shift the values right, extending in sign bits, effectively dividing the 16-bit chunks by a factor of $2^8$ (making sure that the values are in the range $[-98, 98]$):
Now we can simply add the two 16-bit results using the 16-bit addition provided by AVX2:
Once we have a single variable that represents the dot product of the block, the last thing we need is to multiply this value with the corresponding scale. To achieve that, we need to reduce the 16-bit chunks to 32-bit integers, which can then be converted to floats. We can perform this widening reduction in a single instruction using the vpmaddwd instruction.
The definition of this instruction is given as: Multiply packed signed 16-bit integers in a and b, producing intermediate signed 32-bit integers. Horizontally add adjacent pairs of intermediate 32-bit integers, and pack the results in dst.
To use it, we define clover_mm256_epi16 such that each 16-bit chunk is set to 1:
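A scalar model of one 32-bit lane of vpmaddwd; with both multiplier pairs set to 1, it degenerates to a widening horizontal add of adjacent 16-bit values:

```cpp
#include <cstdint>

// One 32-bit output lane of vpmaddwd: multiply two signed 16-bit pairs
// and add the products. With b0 = b1 = 1 this reduces to the horizontal
// sum of adjacent 16-bit values, widened to 32 bits.
int32_t maddwd(int16_t a0, int16_t a1, int16_t b0, int16_t b1) {
    return (int32_t)a0 * b0 + (int32_t)a1 * b1;
}
```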
Finally, we perform the last multiplication in the block iteration with the corresponding scale and accumulate it into the result. For that we can use an FMA instruction:
Once all blocks are iterated, the last step is to horizontally add the lanes of the accumulator and return the result.
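Putting the steps together, the arithmetic of one 64-element block can be modeled in scalar code (a reference sketch of the math, not the AVX2 kernel; layout assumptions as above):

```cpp
#include <cstdint>

// Scalar reference for the per-block dot product: dequantize each pair
// of 4-bit values, multiply, and accumulate with the combined scale
// su * sv / 49 (49 = 7 * 7, applied as a reciprocal in the SIMD code).
float block_dot4(const int8_t *qu, const int8_t *qv, float su, float sv) {
    float acc = 0.0f;
    for (int i = 0; i < 32; ++i) {            // 32 bytes = 64 nibbles
        int8_t u_hi = (int8_t)(qu[i]) >> 4;          // high nibble
        int8_t u_lo = (int8_t)(qu[i] << 4) >> 4;     // low nibble
        int8_t v_hi = (int8_t)(qv[i]) >> 4;
        int8_t v_lo = (int8_t)(qv[i] << 4) >> 4;
        acc += (float)(u_hi * v_hi + u_lo * v_lo);
    }
    return acc * (su * sv / 49.0f);
}
```

The trailing factor su * sv / 49.0f corresponds to the reciprocal multiplication with clover_mm256_rcp_49_ps in the vectorized routine.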
Clover Supported Routines
Apart from the dot routine, each container implements the following routines:

- .quantize: converts a 32-bit vector into a 4/8/16-bit quantized vector.
- .restore: converts the quantized vector into a 32-bit vector.
- .dot: dot product routine.
- .scaleAndAdd: scale-and-add routine (equivalent to AXPY).
- .threshold(K): thresholding routine that drops all but the highest K elements in magnitude.
The matrix container, namely CloverMatrix4, is represented similarly. The values and scales are stored as separate contiguous blocks of memory, both in row-major order. Submatrices of size 64 × 64 share the same scale. The length of each vector is padded to a multiple of 128 elements, and the row and column sizes of the matrix are also padded to multiples of 128. Similar design decisions are made for CloverMatrix8. CloverMatrix16 and CloverMatrix32 do not contain scales. The following routines are available:
- .quantize: converts a 32-bit matrix into a 4/8/16-bit quantized matrix.
- .restore: converts the quantized matrix into a 32-bit matrix.
- .mvm: matrix-vector multiplication routine.
- .transpose: routine to transpose the matrix.
Each of these routines is implemented using SIMD code as much as possible. For each routine we provide a scalar implementation with the same name carrying the _scalar suffix, for testing and validation purposes. Where possible, we parallelize the code using OpenMP and provide the same routines in threaded versions carrying the _parallel suffix.
Performance Evaluation
We evaluated our implementation on an Intel Xeon CPU E3-1285L v3 3.10 GHz Haswell with 32 GB of RAM and 25.6 GB/s bandwidth to main memory, running Debian GNU/Linux 8 (jessie), kernel 3.16.43-2+deb8u3. We use the Intel ICC compiler 17.0.0, Intel IPP 2017.0.0 (r52494), and Intel MKL 2017.0.0 (Build 20160801). We use RDTSC to measure the cycle count for each test, performing 15 repetitions with warm cache and using the median as the result. To avoid the effects of frequency scaling and resource sharing on the measurements, Turbo Boost and Hyper-Threading are disabled.
For each routine we derive a pseudo flop count, using the number of additions and multiplications
required to perform the mathematical operations, and report our results in flops per cycle (F/C).
The flop count for dot product and scale and add is 2 * n
and for MVM it is 2 * n * n
.
Dot Product. The figure below shows the performance profile of the dot product. When the data fits in the L1 and L2 caches, the 32-bit version is much faster than 4-bit because the entire computation is done using native instructions, without bit unpacking. Once the data exceeds the L3 cache, 4-bit is fastest, since the dot product is memory bound: data movement from RAM becomes the bottleneck. The speedup is up to 6x over the 32-bit version.
Scale and add. Similar to the dot product, the plots below show that the 32-bit and 16-bit implementations are faster than 4-bit and 8-bit within cache, for the same reasons. However, even outside the L3 cache, 4- and 8-bit are only able to match the 32-bit implementation, due to their higher overhead, which is reflected in the low bandwidth used. As a result, parallelization yields near-linear speedup for 4-bit and almost none for 32-bit, making 4-bit about 3x faster.
MVM. In the figure below we compare our sequential and parallel implementations of MVM for each data type, including the mixed 4-/8-bit MVM. In the sequential case, for problems that do not fit in cache, we see that pure 4-bit is about 4.6x faster than 32-bit but uses only one third of the available bandwidth, and the mixed 4-/8-bit MVM is noticeably slower. However, once parallelized, all versions exhaust the available bandwidth and thus reach a speedup linear in the precision reduction. The mixed 4-/8-bit MVM is now as fast as the 4-bit version, since the bottleneck is loading the matrix.