FPU generator for design space exploration

Sameh Galal, Ofer Shacham, John Brunhaver, Jing Pu, Artem Vassiliev, Mark Horowitz

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

8 Citations (Scopus)

Abstract

FPUs have been a topic of research for almost a century, leading to thousands of papers and books. Each advance focuses on the virtues of some specific new technique. This paper compares the energy efficiency of both throughput-optimized and latency-sensitive designs, each employing an array of optimization techniques, through a fair "apples to apples" methodology. This comparison required us to build many optimized FP units. We accomplished this by creating a highly parameterized FP generator, hierarchically encompassing lower-level generators for summation trees, Booth encoders, adders, etc. Having constructed this generator, we quickly relearned a number of low-level issues that are critical and are often the most neglected in papers. By exploring cascade and fused multiply-add architectures across a variety of bit widths, summation trees, Booth encoders, pipelining techniques, and pipe depths, we found that for most throughput-based designs, a Booth-3 fused multiply-add architecture with a Wallace combining tree is optimal. For latency designs, we found that Booth-2 cascade multiply-add architectures are better. As we describe in the paper, Wallace is not always the optimal combining network due to wire delay and track count, and the precise way the carry-save adders (CSAs) are connected in the tree can make a larger difference than the type of tree used.
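To make the size of the design space concrete, the sketch below enumerates a few of the axes named in the abstract (architecture, bit width, Booth encoding) and estimates, for each point, how many partial-product rows the multiplier produces and how many levels of 3:2 carry-save adders a Wallace-style reduction needs. It is an illustrative toy model, not the authors' generator: the names DesignPoint, partial_products, csa_levels and enumerate_design_space are hypothetical, the row count assumes a plain unsigned Booth recoding, and the architecture field is carried along without affecting the counts.

```python
from dataclasses import dataclass
from itertools import product
from math import ceil


@dataclass(frozen=True)
class DesignPoint:
    """One hypothetical configuration; field names are illustrative only."""
    architecture: str      # "fused" or "cascade" multiply-add
    significand_bits: int  # 24 (single) or 53 (double), hidden bit included
    booth_bits: int        # 2 for Booth-2 (radix-4), 3 for Booth-3 (radix-8)


def partial_products(significand_bits: int, booth_bits: int) -> int:
    """Rows produced by Booth recoding of an unsigned multiplier operand."""
    # One extra leading zero bit keeps the unsigned operand non-negative
    # after recoding, hence the "+ 1".
    return ceil((significand_bits + 1) / booth_bits)


def csa_levels(rows: int) -> int:
    """Levels of 3:2 carry-save adders needed to reduce `rows` rows to 2."""
    levels = 0
    while rows > 2:
        # Every full group of three rows becomes two; leftovers pass through.
        rows = 2 * (rows // 3) + rows % 3
        levels += 1
    return levels


def enumerate_design_space():
    """Yield (point, partial-product rows, CSA levels) for each combination."""
    for arch, bits, k in product(("fused", "cascade"), (24, 53), (2, 3)):
        rows = partial_products(bits, k)
        yield DesignPoint(arch, bits, k), rows, csa_levels(rows)


if __name__ == "__main__":
    for point, rows, levels in enumerate_design_space():
        print(point, "rows:", rows, "CSA levels:", levels)
```

Under these assumptions a 53-bit significand gives 27 rows with Booth-2 and 18 rows with Booth-3. Row count and tree depth are only structural proxies; they say nothing about wire delay, track count, the cost of forming the 3x multiple that Booth-3 requires, or energy, which are the effects the paper evaluates.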

Original language: English (US)
Title of host publication: Proceedings - Symposium on Computer Arithmetic
Pages: 25-34
Number of pages: 10
DOI: https://doi.org/10.1109/ARITH.2013.27
State: Published - 2013
Externally published: Yes
Event: 21st Symposium on Computer Arithmetic, ARITH 2013 - Austin, TX, United States
Duration: Apr 7 2013 to Apr 10 2013



Keywords

  • floating point
  • Fused multiply add
  • multipliers
  • power efficiency

ASJC Scopus subject areas

  • Hardware and Architecture
  • Software
  • Theoretical Computer Science

Cite this

Galal, S., Shacham, O., Brunhaver, J., Pu, J., Vassiliev, A., & Horowitz, M. (2013). FPU generator for design space exploration. In Proceedings - Symposium on Computer Arithmetic (pp. 25-34). [6545888] https://doi.org/10.1109/ARITH.2013.27

@inproceedings{1b2b3af41c5d422aa8de9970a112f782,
title = "FPU generator for design space exploration",
abstract = "FPUs have been a topic of research for almost a century, leading to thousands of papers and books. Each advance focuses on the virtues of some specific new technique. This paper compares the energy efficiency of both throughput-optimized and latency-sensitive designs, each employing an array of optimization techniques, through a fair {"}apples to apples{"} methodology. This comparison required us to build many optimized FP units. We accomplished this by creating a highly parameterized FP generator, hierarchically encompassing lower-level generators for summation trees, Booth encoders, adders, etc. Having constructed this generator, we quickly relearned a number of low-level issues that are critical and are often the most neglected in papers. By exploring cascade and fused multiply-add architectures across a variety of bit widths, summation trees, Booth encoders, pipelining techniques, and pipe depths, we found that for most throughput-based designs, a Booth-3 fused multiply-add architecture with a Wallace combining tree is optimal. For latency designs, we found that Booth-2 cascade multiply-add architectures are better. As we describe in the paper, Wallace is not always the optimal combining network due to wire delay and track count, and the precise way the carry-save adders (CSAs) are connected in the tree can make a larger difference than the type of tree used.",
keywords = "floating point, Fused multiply add, multipliers, power efficiency",
author = "Sameh Galal and Ofer Shacham and John Brunhaver and Jing Pu and Artem Vassiliev and Mark Horowitz",
year = "2013",
doi = "10.1109/ARITH.2013.27",
language = "English (US)",
isbn = "9780769549576",
pages = "25--34",
booktitle = "Proceedings - Symposium on Computer Arithmetic",

}
