Design: A single data-receive module accepts one trial of data at a time, where a trial is a contiguous sequence of 8-bit samples. The module sums N such trials element-wise: the samples at the same position in each of the N trials are added together, and the resulting summed samples, again stored contiguously, form a single summed trial. That summed trial is then sent to the next module for further processing.

Fig. 1: Trial 1[SAMPLE1-1, SAMPLE1-2, SAMPLE1-3, ...] + Trial 2[SAMPLE2-1, SAMPLE2-2, SAMPLE2-3, ...] = Summed Trial[(SAMPLE1-1+SAMPLE2-1), (SAMPLE1-2+SAMPLE2-2), (SAMPLE1-3+SAMPLE2-3), ...]
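To make the arithmetic in Fig. 1 concrete, here is a minimal behavioural model in Python (not RTL). The names `sum_trials` and `sum_width` are made up for illustration, and the sketch assumes the samples are unsigned, which the description above does not state. It also shows how wide each summed sample has to be to avoid overflow, which is at most 8 + ceil(log2(N)) bits for N trials.

```python
SAMPLE_BITS = 8  # sample width from the design description


def sum_width(n_trials, sample_bits=SAMPLE_BITS):
    """Bits needed to hold the sum of n_trials unsigned samples without overflow."""
    max_sample = (1 << sample_bits) - 1
    return (n_trials * max_sample).bit_length()


def sum_trials(trials):
    """Element-wise sum across trials: one summed sample per sample position."""
    return [sum(column) for column in zip(*trials)]


if __name__ == "__main__":
    trials = [[1, 2, 3], [10, 20, 30]]   # two short example trials
    print(sum_trials(trials))            # summed trial
    print(sum_width(len(trials)))        # width of each summed sample
```

The width matters for resource usage: every extra bit in the accumulator costs one more LUT and carry stage per adder, so sizing the sum to exactly what N requires avoids paying for unused bits.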

Currently, my RTL uses a for loop inside a generate block to instantiate the adders needed to add samples across trials. The adder is my own module and uses just the simple '+' operator, so the synthesis tool (Vivado) decides which primitives to use.

I am looking for techniques that use the fewest CLBs and logic resources to perform this addition, whether through optimizations in my RTL, direct instantiation of primitives, or something else. Any suggestions would be greatly appreciated. Thanks!

1 Answer

Synthesis tools, including XST (Xilinx Synthesis Technology), use very efficient algorithms to implement arithmetic operations. However, that doesn't mean the tools are unbeatable.

A useful technique for summing N numbers is "column compression". It applies to FPGAs as well as ASICs. Two well-known compression schemes, both originally developed for summing the partial products in multiplier circuits, are the "Wallace tree" and the "Dadda tree". Here is a study on multipliers, which appears to be publicly available.
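The core idea can be sketched with a small Python model. This is a simplified, word-level carry-save reduction in the spirit of a Wallace tree, not a faithful Wallace or Dadda column schedule, and the function names are hypothetical. It assumes non-negative integer operands. Each 3:2 step behaves like a row of full adders: three operands go in, a sum word and a carry word come out, and no carry propagates along the word until the single final addition.

```python
def compress_3_to_2(a, b, c):
    """One layer of full adders applied to every bit column at once.

    Returns (sum_word, carry_word) with a + b + c == sum_word + carry_word.
    """
    sum_word = a ^ b ^ c                               # full-adder sum bit per column
    carry_word = ((a & b) | (a & c) | (b & c)) << 1    # carry moves up one column
    return sum_word, carry_word


def column_compress_sum(values):
    """Reduce operands three at a time until two remain, then add them once."""
    operands = list(values)
    while len(operands) > 2:
        reduced = []
        full_groups = len(operands) // 3 * 3
        # Each group of three operands becomes two.
        for i in range(0, full_groups, 3):
            reduced.extend(compress_3_to_2(*operands[i:i + 3]))
        # Leftover operands pass through to the next layer unchanged.
        reduced.extend(operands[full_groups:])
        operands = reduced
    return sum(operands)  # the only carry-propagate addition


if __name__ == "__main__":
    samples = [200, 255, 17, 3, 128, 90, 255]   # one sample position, 7 trials
    print(column_compress_sum(samples), sum(samples))
```

In hardware, the appeal is that the compression layers have no long carry path; only the last two-operand addition needs a full carry-propagate adder.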

Although column compression on an ASIC mostly uses half and full adders, on an FPGA the basic building block is the look-up table (LUT), and the design has to take that into account. It's possible to construct a compression tree by instantiating LUT primitives directly. There is a manual here for Xilinx 7-series FPGAs that includes instantiation examples in VHDL and Verilog.

There is a catch on FPGAs, however. Because dedicated carry chains run through the slices to implement fast adders, a hand-built tree of LUTs is in most cases less efficient than ordinary adders mapped onto those carry chains.

In contrast to such a low-level design, a binary tree of + operators is probably the easiest approach, and its performance is usually acceptable.
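As a rough illustration of what a binary tree of + operators computes, here is a small Python model (a behavioural sketch only; `adder_tree_sum` is a made-up name). It pairs up operands stage by stage, which for N operands takes N - 1 two-input adders arranged in ceil(log2(N)) stages. In RTL, each stage would be a generate loop of `+` operations whose result is one bit wider than its inputs.

```python
def adder_tree_sum(values):
    """Sum values with a balanced tree of two-input adders.

    Returns (total, levels), where levels is the number of adder stages.
    """
    level = list(values)
    levels = 0
    while len(level) > 1:
        # Pair up neighbours; each pair is one two-input adder in this stage.
        next_level = [level[i] + level[i + 1] for i in range(0, len(level) - 1, 2)]
        # An odd operand out is passed to the next stage unchanged.
        if len(level) % 2:
            next_level.append(level[-1])
        level = next_level
        levels += 1
    total = level[0] if level else 0
    return total, levels


if __name__ == "__main__":
    print(adder_tree_sum([1, 2, 3, 4, 5]))
```

Compared with a linear chain of adders, the tree uses the same number of adders but has a much shorter critical path, and registering each stage gives a straightforward pipeline if timing requires it.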