ARITH 2024 Accepted Papers with Abstracts

Ping Tang. An Open-Source RISC-V Vector Math Library
Abstract: RISC-V is an open, royalty-free instruction set architecture (ISA) under the governance of RISC-V International. The ISA consists of a base instruction set and many modular extensions, among which the vector extension, which supports data-parallel execution, was recently ratified. Supporting the goal of an open ISA to accelerate innovation, we developed an open-source double-precision vector math library as a highly relevant addition to the high-performance computing (HPC) ecosystem. This paper introduces the RISC-V vector ISA, the computational instructions we found most useful, the software structure of the library, and the various numerical techniques employed to make it accurate and efficient.
Martin Langhammer, Bogdan Pasca and Igor Kucherenko. Multiplier Architecture with a Carry-Based Partial Product Encoding
Abstract: Multipliers have always been an important component of computer architecture, but the increasing relevance of Artificial Intelligence (AI) has brought about a massive increase in the number of multipliers on all compute platforms. At the same time, multiplier use for signal processing has also increased unabated.

Multiplier architectures have not changed appreciably in recent years. In this paper, we introduce a new technique for calculating partial products, which can be used with known compression tree and adder combinations. We demonstrate the efficiency of our new multiplier by reporting results from 800MHz to 2GHz in a current 7nm production library, and comparing to the well-known modified Booth radix-4 and radix-8 architectures.
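For readers unfamiliar with the comparison baseline, the textbook radix-4 modified Booth recoding can be sketched in a few lines of Python. This is a behavioral model of the classic scheme only (not the paper's carry-based partial product encoding), assuming a two's-complement multiplier of even width n.

```python
def booth4_multiply(x, y, n=8):
    """Multiply x by the signed n-bit integer y via radix-4 Booth recoding.

    Each recoded digit d_i lies in {-2, -1, 0, +1, +2}, so only n/2
    partial products d_i * x * 4^i are generated and summed.
    """
    assert n % 2 == 0 and -(1 << (n - 1)) <= y < (1 << (n - 1))
    # b[i] = bit i of y; Python's arithmetic right shift sign-extends for us
    b = [(y >> i) & 1 for i in range(n + 1)]
    acc = 0
    for i in range(n // 2):
        b_prev = b[2 * i - 1] if i > 0 else 0       # b[-1] is defined as 0
        d = -2 * b[2 * i + 1] + b[2 * i] + b_prev   # Booth digit
        acc += d * x * 4 ** i                       # partial product, weight 4^i
    return acc
```

An exhaustive check at a small width confirms the recoding reproduces exact products, which is the property any alternative partial-product encoding must preserve.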
Samuel Coward, Theo Drane, Emiliano Morini and George Constantinides. Combining Power and Arithmetic Optimization via Datapath Rewriting
Abstract: Industrial datapath designers consider dynamic power consumption to be a key metric. Arithmetic circuits contribute a major component of total chip power consumption and are therefore a common target for power optimization. While arithmetic circuit area and dynamic power consumption are often correlated, there is also a tradeoff to consider, as additional gates can be added to explicitly reduce arithmetic circuit activity and hence reduce power consumption. In this work, we consider two forms of power optimization and their interaction: circuit area reduction via arithmetic optimization, and the elimination of redundant computations using both data and clock gating. By encoding both these classes of optimization as local rewrites of expressions, our tool flow can simultaneously explore them, uncovering new opportunities for power saving through arithmetic rewrites using the e-graph data structure. Since power consumption is highly dependent upon the workload performed by the circuit, our tool flow facilitates a data dependent design paradigm, where an implementation is automatically tailored to particular contexts of data activity.
We develop an automated RTL-to-RTL optimization framework, ROVER, that takes circuit input stimuli and generates power-efficient architectures. We evaluate its effectiveness on both open-source arithmetic benchmarks and benchmarks derived from Intel production examples. The tool reduces total power consumption by up to 33.9%.
David Du Pont, Jonas Bertels, Furkan Turan, Michiel Van Beirendonck and Ingrid Verbauwhede. Hardware Acceleration of the Prime-Factor and Rader NTT for BGV Fully Homomorphic Encryption
Abstract: Fully Homomorphic Encryption (FHE) enables computation on encrypted data, holding immense potential for enhancing data privacy and security in various applications. Presently, FHE adoption is hindered by slow computation times, caused by data being encrypted into large polynomials. Optimized FHE libraries and hardware acceleration are emerging to tackle this performance bottleneck. Often, these libraries implement the Number Theoretic Transform (NTT) algorithm for efficient polynomial multiplication. Existing implementations mostly focus on the case where the polynomials are defined over a power-of-two cyclotomic ring, which allows the use of the simpler Cooley-Tukey NTT. However, generalized cyclotomics have several benefits in the BGV FHE scheme, including more SIMD plaintext slots and a simpler bootstrapping algorithm.

We present a hardware architecture for the NTT targeting generalized cyclotomics within the context of the BGV FHE scheme. We explore different non-power-of-two NTT algorithms, including the Prime-Factor, Rader, and Bluestein NTTs. Our most efficient architecture targets the 21845-th cyclotomic polynomial --- a practical parameter for BGV --- with ideal properties for use with a combination of the Prime-Factor and Rader algorithms. The design achieves high throughput with optimized resource utilization, by leveraging parallel processing, pipelining, and reusing processing elements. Compared to Wu et al.'s VLSI architecture of the Bluestein NTT, our approach showcases 2x to 5x improved throughput and area efficiency. Simulation and implementation results on an AMD Alveo U250 FPGA demonstrate the feasibility of the proposed hardware design for FHE.
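For readers unfamiliar with non-power-of-two NTTs, a direct O(n^2) transform over a prime field illustrates the setting. The toy Python sketch below uses length n = 15 = 3·5 (the kind of smooth odd length a prime-factor decomposition splits) over the illustrative modulus p = 31, with omega = 9 of multiplicative order 15; these parameters are our choices, not the paper's, and the hardware algorithms are far more refined.

```python
def ntt(a, omega, p):
    """Direct O(n^2) number theoretic transform of a over GF(p)."""
    n = len(a)
    return [sum(a[j] * pow(omega, i * j, p) for j in range(n)) % p
            for i in range(n)]

def intt(A, omega, p):
    """Inverse NTT: forward transform with omega^-1, scaled by n^-1 mod p."""
    n = len(A)
    inv_n = pow(n, -1, p)
    return [(x * inv_n) % p for x in ntt(A, pow(omega, -1, p), p)]

# n = 15 divides p - 1 = 30, so GF(31) contains an element of order 15.
P, OMEGA, N = 31, 9, 15

def cyclic_mul(a, b):
    """Polynomial product mod (x^15 - 1, p) via pointwise NTT-domain multiply."""
    A, B = ntt(a, OMEGA, P), ntt(b, OMEGA, P)
    return intt([x * y % P for x, y in zip(A, B)], OMEGA, P)
```

The prime-factor and Rader algorithms the paper implements reorganize exactly this transform to cut the O(n^2) work down for composite and prime lengths, respectively.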
Mantas Mikaitis. MATLAB Simulator of Level-Index Arithmetic
Abstract: Level-index arithmetic appeared in the 1980s. One of its principal purposes was to abolish the issues caused by underflow and overflow in floating point. However, level-index arithmetic does not expand the set of representable numbers; rather, it spaces out the numbers of large magnitude even more than floating-point representations do, pushing the infinities further away at the cost of very large gaps between numbers. This tradeoff did not seem to convince hardware manufacturers. We revisit the 40-year-old idea by presenting a custom-precision simulator in MATLAB. This toolbox is useful for exploring the performance of level-index arithmetic in research projects, such as using 8-bit and 16-bit level-index representations in machine learning algorithms, where narrow bit-widths are desired but overflow/underflow in floating-point representations of equivalent width causes difficulties.
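The level-index idea can be made concrete in a few lines. The following Python sketch of the generalized logarithm/exponential pair is our illustration (unrelated to the MATLAB toolbox's actual interface) and covers only X >= 0, ignoring quantization of the index.

```python
import math

def li_encode(X):
    """Level-index image of X >= 0: apply ln until the value drops below 1;
    the level l counts the applications, the index f is what remains."""
    l = 0
    while X >= 1.0:
        X = math.log(X)
        l += 1
    return l + X          # level-index number x = l + f, with 0 <= f < 1

def li_decode(x):
    """Inverse map: l-fold exponential of the fractional index f."""
    l = int(x)
    v = x - l
    for _ in range(l):
        v = math.exp(v)
    return v
```

Even X = 1e300 encodes with level 4, which illustrates the abstract's point: overflow is pushed very far away, while the gaps between large representable values grow enormously.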
Joris van der Hoeven and Fredrik Johansson. Fast multiple precision $\exp(x)$ with precomputations
Abstract: What is the most efficient way to compute the exponential function when allowing for the precomputation of lookup tables? In this paper we study this question as a function of the working precision and analyze both classical and asymptotically fast approaches. We present new complexity results, discuss efficient parameter choices and point out improvements that lead to speedups over existing implementations.
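To fix ideas, the classical table-plus-polynomial approach that such implementations build on can be sketched in double precision. The table size (2^8 entries) and the cubic polynomial below are illustrative choices of ours, not the paper's parameters or error analysis.

```python
import math

W = 8                                                  # index width: 256 entries
TABLE = [2.0 ** (j / 2 ** W) for j in range(2 ** W)]   # precomputed 2^(j/256)

def exp_with_table(x):
    """exp(x) = 2^m * 2^(j/256) * exp(r), with a table lookup for 2^(j/256)
    and a short polynomial for exp(r), where |r| < ln(2)/256."""
    t = x / math.log(2.0)                  # exp(x) = 2^t
    m = math.floor(t)                      # integer part: final scaling by 2^m
    u = t - m                              # u in [0, 1)
    j = int(u * 2 ** W)                    # table index
    r = (u - j / 2 ** W) * math.log(2.0)   # tiny residual argument
    poly = 1.0 + r * (1.0 + r * (0.5 + r / 6.0))   # cubic Taylor for exp(r)
    return math.ldexp(TABLE[j] * poly, m)
```

The paper's question is precisely how to choose such parameters (table size, polynomial or asymptotically fast evaluation of the residual) as the working precision grows.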
Hui Chen, Lianghua Quan and Weiqiang Liu. HGH-CORDIC: A High-Radix Generalized Hyperbolic COordinate Rotation Digital Computer
Abstract: In this paper, we propose a high-radix generalized hyperbolic coordinate rotation digital computer (HGH-CORDIC), which can not only compute logarithmic and exponential functions with any fixed base but also reduce the number of iterations compared with the traditional CORDIC. First, we propose the general iteration formulas for HGH-CORDIC. Then we demonstrate its important convergence property and selection criteria, and illustrate them with the most commonly used example. Through software simulation, we further verify the correctness of the theory. Finally, we analyze how to implement it efficiently in hardware. Compared with the state-of-the-art work, HGH-CORDIC can reduce the number of iterations by more than 50% with the same accuracy.
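For reference, the traditional radix-2 hyperbolic CORDIC that HGH-CORDIC generalizes is sketched below in Python. Recall that iterations 4, 13, 40, ... must be repeated to guarantee convergence; the per-iteration cost this implies is exactly what a high-radix variant attacks. This is the textbook algorithm, not the paper's.

```python
import math

def hyperbolic_cordic(t, n=30):
    """Rotation-mode hyperbolic CORDIC: returns (cosh t, sinh t).
    Valid for |t| up to about 1.11 (the sum of the micro-rotation angles)."""
    # shift sequence 1, 2, 3, 4, 4, 5, ..., 13, 13, ... (4, 13, 40, ... repeated)
    seq, nxt, i = [], 4, 1
    while len(seq) < n:
        seq.append(i)
        if i == nxt:
            seq.append(i)            # repeated iteration for convergence
            nxt = 3 * nxt + 1
        i += 1
    seq = seq[:n]
    K = math.prod(math.sqrt(1.0 - 4.0 ** -i) for i in seq)   # aggregate gain
    x, y, z = 1.0, 0.0, t
    for i in seq:
        d = 1.0 if z >= 0.0 else -1.0          # rotation direction
        x, y = x + d * y * 2.0 ** -i, y + d * x * 2.0 ** -i  # shift-add step
        z -= d * math.atanh(2.0 ** -i)         # consume the micro-angle
    return x / K, y / K
```

exp(t) then follows as cosh(t) + sinh(t), and logarithms/exponentials in other fixed bases reduce to this via constant factors, which is the direction the paper develops with high-radix generalized iterations.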
Denis Arzelier, Florent Bréhard, Mioara Joldes and Marc Mezzarobba. Rounding Error Analysis of an Orbital Collision Probability Evaluation Algorithm
Abstract: We present an error analysis of an algorithm due to Serra et al. (Journal of Guidance Control and Dynamics, 2016) for computing the orbital collision probability in the short term encounter model. The algorithm reduces the numerical computation of the collision probability to that of the sum of a series whose coefficients are produced by a linear recurrence relation, and is specifically designed to avoid cancellation issues in the evaluation of the sum. The authors derived a bound on the method error arising from the truncation of the series, and observed experimentally that the computation of the terms of the sum is numerically stable, but did not study the evaluation error. Here we give a rigorous bound on the accumulated rounding error when Serra et al.'s algorithm is implemented in floating-point arithmetic.
For a unit round-off u and a truncation order N, the bound is of the form (N + A) u + o(u), where A is an explicit constant depending on the problem parameters and o(u) stands for explicitly bounded terms that are small compared to u. Our analysis is based on the observation that the generating series of the errors affecting each individual term is solution to a perturbed form of a differential equation satisfied by the Laplace transform of a function related to the collision probability.
Tom Hubrecht, Claude-Pierre Jeannerod and Jean-Michel Muller. Useful applications of correctly-rounded operators of the form ab+cd+e
Abstract: We show that the availability of fused arithmetic operators that evaluate expressions of the form ab+cd (FD2 instruction) or ab+cd+e (FD2A instruction) in floating-point arithmetic with one final rounding only would significantly facilitate many calculations that are hard to perform with high accuracy at small cost using only the traditional operations +, -, *, /, sqrt, and fused multiply-add (FMA).
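The accuracy gap can be demonstrated in software by modeling the proposed FD2 behavior with exact rational arithmetic followed by a single conversion to double. This reference model is our illustration, not the authors' proposed hardware; CPython's Fraction-to-float conversion is correctly rounded to nearest, which is what makes it usable as an oracle.

```python
from fractions import Fraction

def fd2(a, b, c, d):
    """ab + cd with one final rounding: exact rational evaluation, then a
    single correctly rounded conversion back to binary64."""
    return float(Fraction(a) * Fraction(b) + Fraction(c) * Fraction(d))

# Cancellation example: naive evaluation loses everything, FD2 is exact.
a = 1.0 + 2.0 ** -52
b = 1.0 - 2.0 ** -52
naive = a * b + 1.0 * (-1.0)   # a*b rounds to 1.0, so the sum collapses to 0.0
fused = fd2(a, b, 1.0, -1.0)   # correctly rounded result: -2^-104
```

Exactly this kind of ab + cd cancellation (complex products, determinants, compensated algorithms) is what the proposed FD2/FD2A instructions would make cheap and accurate.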
Raul Murillo, Alberto Antonio Del Barrio García and Guillermo Botella. Square Root Unit with Minimum Iterations for Posit Arithmetic
Abstract: In this paper, we introduce a novel implementation of a square root algorithm specifically tailored for posit arithmetic. Unlike traditional methods, the proposed approach capitalizes on the inherent flexibility of posits, which lack fixed-length fields, to optimize square root computations. By accurately estimating the minimum number of required fraction bits, our algorithm substantially reduces the recurrence iterations without sacrificing accuracy. Implemented across standard 16-bit, 32-bit, and 64-bit posit formats, our units showcase a significant latency reduction in different applications with only a marginal increase in resource utilization. Comparative analysis against previous pipelined designs underscores the area efficiency of our proposed solutions. This research significantly contributes to the advancement of posit-based arithmetic units, presenting promising opportunities for improving computational system efficiency.
Vassil Dimitrov, Richard Ford, Laurent Imbert, Arjuna Madanayake, Nilan Udayanga and Will Wray. Multiple-base Logarithmic Quantization and Application in Reduced Precision AI Computations
Abstract: Over the last few years, the power of logarithmic quantizations and computations has been recognized as a useful tool in optimizing the performance of large ML models. In this article, we provide results that demonstrate significantly better quantization signal-to-noise ratio performance with multiple-base logarithmic number systems (MDLNS) in comparison with floating-point quantizations that use the same number of bits. On the hardware level, we present details of our Xilinx VCU-128 FPGA design for dot-product and matrix-vector computations. The MDLNS matrix-vector design significantly outperforms equivalent fixed-point binary designs in terms of area (A) and time (T) complexity and power consumption, as evidenced by a 4x scaling of the AT^2 VLSI performance metric and a 57% increase in computational throughput per watt compared to fixed-point arithmetic.
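The flavor of multiple-base logarithmic quantization can be conveyed with a toy single-digit, two-base (2 and 3) quantizer in Python. Real MDLNS representations are multi-digit with carefully chosen bases and exponent ranges, so the function below is only an illustrative sketch under our own assumed parameters.

```python
import math

def two_base_quantize(x, a_bits=4, b_bits=3):
    """Quantize x > 0 to the nearest value 2^a * 3^b, with a and b drawn
    from small signed ranges, minimizing the error in the log domain."""
    A = range(-(1 << (a_bits - 1)), 1 << (a_bits - 1))
    B = range(-(1 << (b_bits - 1)), 1 << (b_bits - 1))
    return min((2.0 ** a * 3.0 ** b for a in A for b in B),
               key=lambda v: abs(math.log(v / x)))
```

Because log2(3) is irrational, the values 2^a * 3^b fill the gaps between powers of two, which is the source of the favorable quantization SNR the paper reports relative to same-width floating point.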
Décio Luiz Gazzoni Filho, Guilherme Brandão, Gora Adj, Arwa Alblooshi, Isaac Canales-Martínez, Jorge Chávez-Saab and Julio López. PQC-AMX: accelerating Saber and FrodoKEM on the Apple M1 and M3 SoCs
Abstract: As CPU performance can no longer keep up with the dramatic growth rates of past decades, CPU architects are looking to domain-specific architectures to accelerate certain tasks. A recent trend is the introduction of matrix-multiplication accelerators to CPUs by manufacturers such as IBM, Intel and ARM, some of which have not yet launched commercially. Apple's systems-on-chip (SoCs) for its mobile phones, tablets and personal computers include a proprietary, undocumented CPU-coupled matrix multiplication coprocessor called AMX. In this paper, we leverage AMX to accelerate the post-quantum lattice-based cryptosystems Saber and FrodoKEM, and benchmark their performance on the Apple M1 and M3 SoCs. We propose a variant of the Toeplitz Matrix-Vector Product algorithm for polynomial multiplication, which sets new speed records for Saber using AMX (up to 13% for the main KEM operations, and 151% for matrix-vector multiplication of polynomials). For FrodoKEM, we set new speed records with our AMX implementation (up to 21% for the main KEM operations, and 124% for matrix multiplication, with even greater improvements for 4x-batching). These speedups are relative to our optimized NEON implementation, also presented here, which improves upon the state-of-the-art implementation for ARMv8 CPUs.
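The Toeplitz view of polynomial multiplication that underlies a Toeplitz Matrix-Vector Product (TMVP) approach can be illustrated in a few lines of Python. This is the textbook correspondence only, not the paper's optimized AMX kernel.

```python
def toeplitz_poly_mul(a, b):
    """Multiply polynomials a and b (coefficient lists, lowest degree first)
    by applying the Toeplitz matrix T[i][j] = a[i-j] to the vector of b's
    coefficients; the result is the convolution of a and b."""
    n, m = len(a), len(b)
    T = [[a[i - j] if 0 <= i - j < n else 0 for j in range(m)]
         for i in range(n + m - 1)]
    return [sum(T[i][j] * b[j] for j in range(m)) for i in range(n + m - 1)]
```

A matrix-multiplication engine such as AMX can evaluate products of this shape directly, which is why recasting polynomial multiplication as a TMVP pays off on such hardware.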
Theo Drane, Samuel Coward, Mertcan Temel and Joe Leslie-Hurd. On the Systematic Creation of Faithfully Rounded Commutative Truncated Booth Multipliers
Abstract: In many instances of fixed-point multiplication, a full precision result is not required. Instead it is sufficient to return a faithfully rounded result. Faithful rounding permits the machine representable number either immediately above or below the full precision result, if the latter is not exactly representable. Multipliers which take full advantage of this freedom can be implemented using less circuit area and consuming less power. The most common implementations internally truncate the partial product array. However, truncation applied to the most common of multiplier architectures, namely Booth architectures, results in non-commutative implementations. The industrial adoption of truncated multipliers is limited by the absence of formal verification of such implementations, since exhaustive simulation is typically infeasible. We present a commutative truncated Booth multiplier architecture and derive closed-form necessary and sufficient conditions for faithful rounding. We also provide the bit-vectors giving rise to the worst-case error. We present a formal verification methodology based on ACL2 which scales up to 42-bit multipliers. We synthesize a range of commutative faithfully rounded multipliers and show that truncated Booth implementations are up to 31% smaller than externally truncated multipliers.
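Faithful rounding of a truncated multiplier can be checked exhaustively at toy sizes. The Python model below drops the low partial-product columns of a plain AND-array multiplier and adds a simple, pessimistic constant compensation; the paper's architecture is Booth-based and commutative, and derives tight necessary and sufficient conditions rather than this crude bound.

```python
def truncated_multiply(x, y, n, k, comp):
    """n x n unsigned multiply returning the top n product bits.
    Partial-product bits in the k least significant columns are never
    formed; a fixed compensation constant comp is added instead."""
    acc = comp
    for i in range(n):
        for j in range(n):
            if i + j >= k and (x >> i) & 1 and (y >> j) & 1:
                acc += 1 << (i + j)
    return acc >> n                     # drop the low n bits of the result

# Exhaustive faithfulness check at n = 6 with the k = 4 low columns dropped;
# comp equals the maximum total weight of the dropped bits (a safe choice).
n, k = 6, 4
comp = sum((c + 1) << c for c in range(k))
worst = max(abs(truncated_multiply(x, y, n, k, comp) - x * y / 2 ** n)
            for x in range(1 << n) for y in range(1 << n))
assert worst < 1.0   # faithfully rounded: always within one output ulp
```

For Booth architectures the dropped terms are signed, which is what breaks commutativity under naive truncation and motivates the paper's construction.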
Ziying Cui, Ke Chen, Bi Wu, Chenggang Yan, Yu Gong and Weiqiang Liu. A Time Efficient Comprehensive Model of Approximate Multipliers for Design Space Exploration
Abstract: Multipliers play an essential role in various data processing applications and have garnered significant attention in approximate computing (AxC) for their energy-efficient features. However, formulating a precise error model for approximate data processing algorithms in conjunction with hardware metrics presents a challenge, leading to substantial time consumption in design space exploration. This paper introduces a novel analytical model for approximate multipliers that takes input patterns into account. The model furnishes accurate error metrics, along with high-precision hardware metrics, for various approximate multiplier configurations, and remains accurate under variations in input data distribution. The proposed error model reduces the runtime by an average factor of 120.85 and, in some instances, by as much as 2,500 times, compared with simulation-based methods. Design space exploration is performed on a 9×9 convolution circuit, revealing a comparable Pareto-optimal set and substantial reductions of up to 79.46% in the Power-Delay-Product (PDP) and 71.98% in area compared to the accurate counterpart. Additionally, the result of the Gaussian Blur application experiment demonstrates a 68.59% reduction in PDP and a 56.21% reduction in area, all while maintaining a PSNR of 30 dB.
Zabihollah Ahmadpour, Ghassem Jaberipur and Jeong-A Lee. Montgomery Modular Multiplication via Single-Base Residue Number Systems
Abstract: Montgomery modular multiplication (MMM) in residue number systems (RNS) uses a base-extension (BE) technique to avoid division, which is hard, slow and costly in RNS. BE is somewhat less costly and faster than reverse conversion via the Chinese remainder theorem (CRT) and the reduction-factor method; however, it must be applied twice in sequence, once for each of the two equally large bases. In this work, we modify the conventional RNS-MMM algorithm by replacing the two sequential BE steps with three parallel CRT-like operations of the same complexity as BE. For the reduction factors, we use a special case of Kawamura's algorithm that yields a definitive result. The proposed RNS-MMM method allows squaring the working dynamic range, or halving the bit-width of the balanced residue channels. Moreover, the common practice of dynamically changing the working moduli set in security and cryptographic applications becomes less critical owing to the doubled pool of available moduli. The proposed circuits are simulated, tested and synthesized with Synopsys Design Compiler on TSMC 65-nm technology, showing 69% less delay and 28% less area-time product at the cost of 14% more energy consumption, relative to the most relevant reference work.
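As background, the division-free Montgomery reduction that the RNS variant emulates is easily stated over plain integers. This is the classic single-modulus REDC in Python, not the paper's RNS-MMM; the modulus and radix below are illustrative choices.

```python
def redc(T, N, R, N_neg_inv):
    """Montgomery reduction: computes T * R^{-1} mod N using only
    multiplications, reductions mod R, an exact division by R (shifts,
    for R a power of two) and one conditional subtraction."""
    m = ((T % R) * N_neg_inv) % R       # chosen so that T + m*N = 0 (mod R)
    t = (T + m * N) // R                # exact division by R
    return t - N if t >= N else t

N = 97                                  # odd modulus (illustrative)
R = 1 << 7                              # radix R = 128 > N, gcd(R, N) = 1
N_NEG_INV = (-pow(N, -1, R)) % R        # precomputed -N^{-1} mod R

def mont_mul(a_bar, b_bar):
    """Multiply two values kept in Montgomery form (x_bar = x*R mod N)."""
    return redc(a_bar * b_bar, N, R, N_NEG_INV)
```

In an RNS implementation the mod-R and div-R steps are no longer cheap shifts and become the base-extension operations; reorganizing that cost is precisely what the paper's three parallel CRT-like operations achieve.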
Zeynep Kaya and Mario Garrido. Novel Access Patterns based on Overlapping Loading and Processing Times to Reduce Latency and Increase Throughput in Memory-based FFTs
Abstract: This paper presents novel access patterns for P-parallel N-point radix-2 memory-based fast Fourier transform (FFT) architectures. This work aims to reduce latency and increase throughput by changing the data order and/or choosing different points in the architecture at which data are input and output. In this way, we can eliminate the loading time and/or the time to collect the output data, which reduces latency and increases throughput. Likewise, the architectures use the same permutation circuits for each iteration, which simplifies the circuit. In addition to improvements in latency and throughput with different access patterns for memory-based FFTs, this work also offers bit-reversed or natural input/output alternatives.
Mikael Henriksson, Theodor Lindberg and Oscar Gustafsson. APyTypes: Algorithmic Data Types in Python for Efficient Simulation of Finite Word-Length Effects
Abstract: A new Python library, APyTypes, suitable for simulating finite word-length effects is presented. The library supports configurable fixed-point and floating-point types, both scalars and multi-dimensional arrays, and is implemented in C++. The underlying design principles are introduced and examples show how the library can be used. We argue that the library has significant advantages, especially from a hardware design perspective, over existing libraries. Finally, some directions for further work are outlined.
Vincent Lefèvre. An Emacs-Cairo Scrolling Bug due to Floating-Point Inaccuracy
Abstract: We study a bug that we found in the GNU Emacs text editor when built against the Cairo graphics library. We analyze both the Emacs code and the Cairo code, and we suggest what can be done to avoid unexpected results. This involves a particular case with a computation that can be reduced to the equivalent floating-point expression ((1/s)·b)·s, where s and b are small positive integers such that b < s and the basic operations are rounded to nearest. The analysis takes into account the values of s and b that can occur in practice, and the suggestions for workarounds must avoid handling this particular case in a separate branch or breaking the structure of the Cairo library (so that just returning b is not possible).
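The phenomenon is easy to reproduce in any language with IEEE-754 binary64 arithmetic. The short Python scan below is our illustration, independent of the Emacs/Cairo code: it finds integer pairs (s, b) with b < s for which the round-to-nearest evaluation of ((1/s)·b)·s fails to return b.

```python
# Find pairs (s, b), b < s, where ((1/s) * b) * s != b in binary64
# round-to-nearest arithmetic: each of the three operations rounds,
# and the accumulated error can cross a rounding boundary.
failures = []
for s in range(2, 500):
    inv = 1.0 / s                      # rounded reciprocal of s
    for b in range(1, s):
        if (inv * b) * s != float(b):
            failures.append((s, b))
```

The paper's analysis pins down which (s, b) actually occur in the Emacs/Cairo scrolling path and how to avoid the unexpected result without a separate branch for this case.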
David Lutz, Anisha Saini, Mairin Kroes, Thomas Elmer and Harsha Valsaraju. Fused FP8 4-Way Dot Product with Scaling and FP32 Accumulation
Abstract: Training of neural networks has traditionally used 16-bit and 32-bit formats. The computational requirements for training AI/ML models are becoming large and costly. Hence, it has become imperative to further reduce the size of the data elements (activations, weights, etc.) to better manage memory footprint, compute resources and power consumption. Reduced-precision formats are becoming increasingly popular in academia and industry. Active research is trending towards FP8, to determine whether acceptable model performance can be achieved using 8-bit floating-point numbers instead of FP16 and FP32. This paper presents hardware implementations of fused multi-way reduced-precision floating-point multiply-accumulate with a single rounding, resulting in power- and area-efficient characteristics. We propose two different micro-architectures implementing novel design techniques for computing a fused FP8 DOT4 accumulating to higher-precision FP32, with scaling to adjust the dynamic range. Our first design, dot product with late accumulation, computes the fused FP8 DOT4 by calculating the dot product in the first two cycles, expanding the products to fixed-point format, and using another two cycles for the accumulation operation. This design allows the reuse of an existing FP arithmetic unit. Our second design, dot product with early accumulation, is implemented as a standalone FP8 datapath computing the products and accumulation in the first two cycles and using another two cycles for normalization and a single rounding operation. This design aligns the addends (products and accumulator) to an “Anchor” for efficient, arithmetically fused, N-way FP dot product computation. Furthermore, we synthesized the two proposed designs in a 5nm technology node and compared their implementation costs.
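A behavioral Python model helps make the datapath concrete: quantize the inputs to an E4M3-style significand, take the exact sum of products (exact in double precision here, since the quantized significands are only 4 bits wide and the accumulator is small), apply the scale, and round once to FP32. The format handling below is deliberately simplified (normals only, no NaN or saturation) and is our sketch, not the proposed micro-architecture.

```python
import math
import struct

def to_fp8_e4m3(x):
    """Round x to the nearest value with a 4-bit significand (1 implicit +
    3 stored bits); exponent handling is simplified (normals only)."""
    if x == 0.0:
        return 0.0
    m, e = math.frexp(abs(x))            # abs(x) = m * 2^e, m in [0.5, 1)
    return math.copysign(math.ldexp(round(m * 16) / 16, e), x)

def fp8_dot4_fp32(a, b, acc=0.0, scale=1.0):
    """Fused 4-way FP8 dot product with scaling, FP32 accumulation and a
    single final rounding (modeled by one double -> float32 conversion)."""
    exact = acc + scale * sum(to_fp8_e4m3(x) * to_fp8_e4m3(y)
                              for x, y in zip(a, b))
    return struct.unpack('f', struct.pack('f', exact))[0]   # round once
```

The single final rounding is the defining property of a fused dot product: intermediate products and partial sums are never individually rounded, unlike a chain of separate FP8 multiply-adds.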
Andreas Boettcher and Martin Kumm. Small Logic-based Multipliers with Incomplete Sub-Multipliers for FPGAs
Abstract: There is a recent trend in artificial intelligence (AI) inference towards lower-precision data formats, down to 8 bits and less. As multiplication is the most complex operation in typical inference tasks, there is a large demand for efficient small multipliers. The large DSP blocks have limitations when implementing many small multipliers efficiently. Hence, this work proposes a solution for better logic-based multipliers that is especially beneficial for small multipliers. Our work is based on the multiplier tiling method, in which a multiplier is assembled from several sub-multiplier tiles. The key observation we made is that these sub-multipliers do not necessarily have to perform a complete (rectangular) NxK multiplication; more efficient sub-multipliers are possible that are incomplete (non-rectangular). This work first identifies efficient incomplete irregular sub-multipliers and then demonstrates improvements over state-of-the-art designs. It is shown that optimal solutions can be found using integer linear programming (ILP), which are evaluated in FPGA synthesis experiments.
Micaela Serôdio, João Lopes, Jose Sousa, Horácio Neto and Mário Vestias. PT-Float: A Floating-Point Unit with Dynamically Varying Exponent and Fraction Sizes
Abstract: This paper introduces Precision Trade-Off Floats (PT-Floats). The format uses the concept of Tapered Precision (TP), introduced by R. Morris in 1971 and later exploited in the Unum formats, most notably by the Posit number system. The idea is to trade off exponent bits against significand bits according to the application domain. In general, numbers with an absolute value close to 1 are given more significand bits and fewer exponent bits, while the other numbers trade significand bits for exponent bits. This way, near-unit values are kept at high precision, while the other values span an extensive dynamic range. The central idea in this work is to use a hidden bit in the exponent representation to prevent redundant representations, while avoiding anything akin to Posit regime bits, which unnecessarily balloon the dynamic range and sacrifice too many bits that could be used for precision. Our experimental results show that this format reduces hardware cost and energy consumption for IoT applications.