TY - GEN
T1 - FPU generator for design space exploration
AU - Galal, Sameh
AU - Shacham, Ofer
AU - Brunhaver, John S.
AU - Pu, Jing
AU - Vassiliev, Artem
AU - Horowitz, Mark
PY - 2013/8/13
Y1 - 2013/8/13
N2 - FPUs have been a topic of research for almost a century, leading to thousands of papers and books. Each advance focuses on the virtues of some specific new technique. This paper compares the energy efficiency of both throughput-optimized and latency-sensitive designs, each employing an array of optimization techniques, through a fair "apples to apples" methodology. This comparison required us to build many optimized FP units. We accomplished this by creating a highly parameterized FP generator, hierarchically encompassing lower-level generators for summation trees, Booth encoders, adders, etc. Having constructed this generator, we quickly relearned a number of low-level issues that are critical and are often the most neglected by papers. By exploring cascade and fused multiply-add architectures across a variety of bit widths, summation trees, Booth encoders, pipelining techniques, and pipe depths, we found that for most throughput-based designs, a Booth-3 fused multiply-add architecture with a Wallace combining tree is optimal. For latency designs, we found that Booth-2 cascade multiply-add architectures are better. As we describe in the paper, Wallace is not always the optimal combining network due to wire delay and track count, and the precise way the CSAs are connected in the tree can make a larger difference than the type of tree used.
AB - FPUs have been a topic of research for almost a century, leading to thousands of papers and books. Each advance focuses on the virtues of some specific new technique. This paper compares the energy efficiency of both throughput-optimized and latency-sensitive designs, each employing an array of optimization techniques, through a fair "apples to apples" methodology. This comparison required us to build many optimized FP units. We accomplished this by creating a highly parameterized FP generator, hierarchically encompassing lower-level generators for summation trees, Booth encoders, adders, etc. Having constructed this generator, we quickly relearned a number of low-level issues that are critical and are often the most neglected by papers. By exploring cascade and fused multiply-add architectures across a variety of bit widths, summation trees, Booth encoders, pipelining techniques, and pipe depths, we found that for most throughput-based designs, a Booth-3 fused multiply-add architecture with a Wallace combining tree is optimal. For latency designs, we found that Booth-2 cascade multiply-add architectures are better. As we describe in the paper, Wallace is not always the optimal combining network due to wire delay and track count, and the precise way the CSAs are connected in the tree can make a larger difference than the type of tree used.
KW - fused multiply-add
KW - floating point
KW - multipliers
KW - power efficiency
UR - http://www.scopus.com/inward/record.url?scp=84881255791&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=84881255791&partnerID=8YFLogxK
U2 - 10.1109/ARITH.2013.27
DO - 10.1109/ARITH.2013.27
M3 - Conference contribution
AN - SCOPUS:84881255791
SN - 9780769549576
T3 - Proceedings - Symposium on Computer Arithmetic
SP - 25
EP - 34
BT - Proceedings - 2013 IEEE 21st Symposium on Computer Arithmetic, ARITH 2013
T2 - 21st Symposium on Computer Arithmetic, ARITH 2013
Y2 - 7 April 2013 through 10 April 2013
ER -