An Open-Source RISC-V Vector Math Library

Multiplier Architecture with a Carry-Based Partial Product Encoding

Multiplier architectures have not changed appreciably in recent years. In this paper, we introduce a new technique for calculating partial products, which can be used with known compression tree and adder combinations. We demonstrate the efficiency of our new multiplier by reporting results from 800 MHz to 2 GHz in a current 7nm production library, and by comparing against the well-known modified Booth radix-4 and radix-8 architectures.
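For context on the baseline named above, the following is a minimal sketch of textbook modified Booth radix-4 recoding. It does not show the carry-based encoding proposed in the paper, which the abstract does not describe. The function names `booth_radix4_digits` and `booth_multiply` are illustrative assumptions.

```python
def booth_radix4_digits(y: int, n: int) -> list[int]:
    """Recode a signed n-bit multiplier y into radix-4 Booth digits in {-2, ..., 2}.

    Digit i is y[2i] + y[2i-1] - 2*y[2i+1], with y[-1] taken as 0.
    """
    if n % 2:
        n += 1  # sign-extend to an even width
    y_bits = y & ((1 << n) - 1)  # two's-complement bit pattern of y
    digits = []
    prev = 0  # implicit bit y[-1]
    for i in range(0, n, 2):
        b0 = (y_bits >> i) & 1
        b1 = (y_bits >> (i + 1)) & 1
        digits.append(b0 + prev - 2 * b1)
        prev = b1
    return digits


def booth_multiply(x: int, y: int, n: int) -> int:
    """Sum the partial products d_i * x * 4**i selected by the Booth digits of y."""
    return sum(d * x * 4**i for i, d in enumerate(booth_radix4_digits(y, n)))
```

Each digit selects one of the multiples 0, ±x, ±2x, so an n-bit multiplier produces about n/2 partial products instead of n. For example, `booth_radix4_digits(7, 4)` gives `[-1, 2]`, since 7 = -1 + 2·4.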

Combining Power and Arithmetic Optimization via Datapath Rewriting

We develop an automated RTL-to-RTL optimization framework, ROVER, that takes circuit input stimuli and generates power-efficient architectures. We evaluate its effectiveness on both open-source arithmetic benchmarks and benchmarks derived from Intel production examples. The tool is able to reduce the total power consumption by up to 33.9%.

Hardware Acceleration of the Prime-Factor and Rader NTT for BGV Fully Homomorphic Encryption

We present a hardware architecture for the NTT targeting generalized cyclotomics within the context of the BGV FHE scheme. We explore different non-power-of-two NTT algorithms, including the Prime-Factor, Rader, and Bluestein NTTs. Our most efficient architecture targets the 21845-th cyclotomic polynomial (a practical parameter for BGV), which has ideal properties for use with a combination of the Prime-Factor and Rader algorithms. The design achieves high throughput with optimized resource utilization by leveraging parallel processing, pipelining, and reuse of processing elements. Compared to Wu et al.'s VLSI architecture of the Bluestein NTT, our approach achieves 2x to 5x better throughput and area efficiency. Simulation and implementation results on an AMD Alveo U250 FPGA demonstrate the feasibility of the proposed hardware design for FHE.
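As a software illustration of the Prime-Factor idea only, the sketch below splits a length-N NTT with N = N1·N2 and gcd(N1, N2) = 1 into small NTTs with no twiddle factors between stages, using the standard Good-Thomas index maps. It is not the paper's hardware design: the short sub-transforms here use a naive O(N²) NTT where the paper applies Rader's algorithm, and the toy parameters (N = 15, p = 31) and all function names are assumptions chosen for readability.

```python
def naive_ntt(x, w, p):
    """Direct O(N^2) NTT of x modulo p, with w an element of order len(x)."""
    n_pts = len(x)
    return [sum(x[n] * pow(w, n * k, p) for n in range(n_pts)) % p
            for k in range(n_pts)]


def find_root(n_pts, p):
    """Search for an element of exact multiplicative order n_pts modulo prime p."""
    for g in range(2, p):
        w = pow(g, (p - 1) // n_pts, p)
        if all(pow(w, d, p) != 1 for d in range(1, n_pts)):
            return w
    raise ValueError("no element of the requested order")


def prime_factor_ntt(x, n1_len, n2_len, w, p):
    """Good-Thomas NTT of length n1_len * n2_len (coprime factors)."""
    n_pts = n1_len * n2_len
    w1, w2 = pow(w, n2_len, p), pow(w, n1_len, p)  # orders n1_len and n2_len
    # Input map: n = (N2*n1 + N1*n2) mod N.
    grid = [[x[(n2_len * a + n1_len * b) % n_pts] for b in range(n2_len)]
            for a in range(n1_len)]
    rows = [naive_ntt(row, w2, p) for row in grid]            # length-N2 NTTs
    cols = [naive_ntt([rows[a][k2] for a in range(n1_len)], w1, p)
            for k2 in range(n2_len)]                           # length-N1 NTTs
    # Output map (CRT): k = k1 mod N1 and k = k2 mod N2.
    e1 = n2_len * pow(n2_len, -1, n1_len)
    e2 = n1_len * pow(n1_len, -1, n2_len)
    out = [0] * n_pts
    for k1 in range(n1_len):
        for k2 in range(n2_len):
            out[(k1 * e1 + k2 * e2) % n_pts] = cols[k2][k1]
    return out
```

A quick check is to compare `prime_factor_ntt(x, 3, 5, w, 31)` against `naive_ntt(x, w, 31)` for `w = find_root(15, 31)`. The same decomposition applies to 21845 = 5 · 17 · 257, whose factors are pairwise coprime primes.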

MATLAB Simulator of Level-Index Arithmetic

Fast multiple precision $\exp(x)$ with precomputations

HGH-CORDIC: A High-Radix Generalized Hyperbolic COordinate Rotation Digital Computer

Rounding Error Analysis of an Orbital Collision Probability Evaluation Algorithm

For a unit round-off u and a truncation order N, the bound is of the form (N + A) u + o(u), where A is an explicit constant depending on the problem parameters and o(u) stands for explicitly bounded terms that are small compared to u. Our analysis is based on the observation that the generating series of the errors affecting each individual term is a solution to a perturbed form of a differential equation satisfied by the Laplace transform of a function related to the collision probability.

Useful applications of correctly-rounded operators of the form ab+cd+e

Square Root Unit with Minimum Iterations for Posit Arithmetic

Multiple-base Logarithmic Quantization and Application in Reduced Precision AI Computations

PQC-AMX: accelerating Saber and FrodoKEM on the Apple M1 and M3 SoCs

On the Systematic Creation of Faithfully Rounded Commutative Truncated Booth Multipliers

below the full precision result, if the latter is not exactly representable. Multipliers which take full advantage of this freedom can be implemented using less circuit area and consuming less power. The most common implementations internally truncate the partial product array. However, truncation applied to the most common of multiplier architectures, namely Booth architectures, results in non-commutative implementations. The industrial adoption of truncated multipliers is limited by the absence of formal verification of such implementations, since exhaustive simulation is typically infeasible. We present a commutative truncated Booth multiplier architecture and derive closed-form necessary and sufficient conditions for faithful rounding. We also provide the bit-vectors giving rise to the worst-case error. We present a formal verification methodology based on ACL2 which scales up to 42-bit multipliers. We synthesize a range of commutative faithfully rounded multipliers and show that truncated Booth implementations are up to 31% smaller than externally truncated multipliers.
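To make the faithful rounding criterion concrete, here is a minimal sketch that checks it exhaustively for a tiny unsigned array multiplier with internally truncated partial product columns. It is not the Booth architecture or the closed-form conditions of the paper; the plain AND-array, the parameters `t` (number of discarded low columns) and `c` (compensation constant), and the function names are all assumptions for illustration. Exhaustive checking of this kind is only practical at very small widths, which is the motivation for formal verification given above.

```python
def is_faithful(exact: int, result: int, drop: int) -> bool:
    """True if result is the floor or ceiling of exact / 2**drop.

    When exact is a multiple of 2**drop, floor and ceiling coincide, so
    only the exact quotient is accepted.
    """
    lo = exact >> drop
    hi = (exact + (1 << drop) - 1) >> drop
    return result in (lo, hi)


def truncated_product(x: int, y: int, n: int, t: int, c: int) -> int:
    """n-bit result of an n x n unsigned array multiplier.

    Partial product bits in columns below t are discarded, the constant c
    is added as compensation, and the low n bits of the sum are dropped.
    """
    kept = 0
    for i in range(n):
        for j in range(n):
            if i + j >= t:
                kept += (((x >> i) & 1) & ((y >> j) & 1)) << (i + j)
    return (kept + c) >> n


def all_faithful(n: int, t: int, c: int) -> bool:
    """Exhaustively test faithful rounding over all pairs of n-bit inputs."""
    return all(
        is_faithful(x * y, truncated_product(x, y, n, t, c), n)
        for x in range(1 << n)
        for y in range(1 << n)
    )
```

With `t = 0` and `c = 0` the multiplier keeps every partial product bit and simply floors, so `all_faithful(4, 0, 0)` holds. Discarding all four low columns without compensation, `all_faithful(4, 4, 0)`, fails: for x = y = 15 the result is 11, while the exact value 225/16 lies between 14 and 15.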

A Time Efficient Comprehensive Model of Approximate Multipliers for Design Space Exploration

Montgomery Modular Multiplication via Single-Base Residue Number Systems

Novel Access Patterns based on Overlapping Loading and Processing Times to Reduce Latency and Increase Throughput in Memory-based FFTs

APyTypes: Algorithmic Data Types in Python for Efficient Simulation of Finite Word-Length Effects

An Emacs-Cairo Scrolling Bug due to Floating-Point Inaccuracy

Fused FP8 4-Way Dot Product with Scaling and FP32 Accumulation

Small Logic-based Multipliers with Incomplete Sub-Multipliers for FPGAs

As the multiplication operation is the most complex operation in typical inference tasks, there is a large demand for efficient small multipliers. The large DSP-blocks have limitations implementing many small multipliers efficiently. Hence, this work proposes a solution for better logic-based multipliers that is especially beneficial for small multipliers. Our work is based on the multiplier tiling method in which a multiplier is designed out of several sub-multiplier tiles. The key observation we made is that these sub-multipliers do not necessarily have to perform a complete (rectangular) NxK multiplication and more efficient sub-multipliers are possible that are incomplete (non-rectangular). This proposal first seeks to identify efficient incomplete irregular sub-multipliers and then demonstrates improvements over state-of-the-art designs. It is shown that optimal solutions can be found using integer linear programming (ILP), which are evaluated in FPGA synthesis experiments.
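The tiling idea can be sketched in software as follows: the product of two unsigned integers is the sum of all bit products x_i · y_j · 2^(i+j), so any partition of the (i, j) grid into tiles yields the product when the tile outputs are added, whether or not each tile is rectangular. The tile shapes below (a 2x2 square plus an L-shaped remainder for a 3x3 multiplier) are hypothetical examples, not shapes taken from the paper, and the function names are assumptions.

```python
def tile_value(x: int, y: int, cells) -> int:
    """Weighted sum of the bit products x_i * y_j covered by one tile."""
    return sum((((x >> i) & 1) & ((y >> j) & 1)) << (i + j) for i, j in cells)


def multiply_from_tiles(x: int, y: int, tiles) -> int:
    """Add the outputs of all sub-multiplier tiles."""
    return sum(tile_value(x, y, cells) for cells in tiles)


def covers_exactly(tiles, n: int, k: int) -> bool:
    """Check that the tiles partition the n x k grid of bit products."""
    seen = [cell for cells in tiles for cell in cells]
    full_grid = {(i, j) for i in range(n) for j in range(k)}
    return len(seen) == len(set(seen)) and set(seen) == full_grid


# Hypothetical tiling of a 3x3 multiplier: one complete 2x2 tile and one
# incomplete (L-shaped) tile covering the remaining five bit products.
SQUARE_TILE = [(0, 0), (0, 1), (1, 0), (1, 1)]
L_SHAPED_TILE = [(2, 0), (2, 1), (2, 2), (0, 2), (1, 2)]
EXAMPLE_TILES = [SQUARE_TILE, L_SHAPED_TILE]
```

With this tiling, `multiply_from_tiles(x, y, EXAMPLE_TILES)` equals `x * y` for all 3-bit inputs. Which shapes map efficiently onto FPGA logic, and how to pick an optimal cover, is the optimization problem the abstract assigns to ILP and is not modeled here.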

PT-Float: A Floating-Point Unit with Dynamically Varying Exponent and Fraction Sizes

significand/mantissa bits and fewer exponent bits, and the other numbers trade off significand bits with exponent bits. This way, near-unit values are kept at high precision, while the other values span an extensive dynamic range. The central idea in this work is to use a hidden bit in the exponent representation to prevent redundant representations, while avoiding anything akin to Posit regime bits, which unnecessarily balloon the dynamic range and lose too many bits that could be used for precision. According to our experimental results, this format reduces the amount of hardware and energy consumption for IoT applications.