TY - GEN
T1 - Parallelization techniques for implementing trellis algorithms on graphics processors
AU - Zheng, Q.
AU - Chen, Y.
AU - Dreslinski, R.
AU - Chakrabarti, Chaitali
AU - Anastasopoulos, A.
AU - Mahlke, S.
AU - Mudge, T.
PY - 2013
Y1 - 2013
N2 - In this paper, we study different schemes to parallelize trellis algorithms for efficient implementation on a GPU. We consider parallelization schemes at the packet-level, subblock-level and trellis-level to increase the number of threads in a GPU implementation. At the trellis-level, we consider state-level, forward-backward traversal and branch-metric parallelism. To evaluate the performance of the different schemes, an LTE uplink Turbo decoder is implemented on an NVIDIA GTX470 GPU. Tradeoffs between throughput, latency and bit error rate are presented. Our most balanced configuration is simultaneously processing multiple subblocks in a packet in conjunction with recovery schemes and trellis-level parallelism, which can achieve a throughput of 19.65 Mbps with a latency of 0.56 ms at bit error rate of 10-5 for 1.3 dB channel SNR. We also show how different combinations of parallelization schemes can be used to satisfy systems with widely varying requirements of throughput, latency and bit error rate.
AB - In this paper, we study different schemes to parallelize trellis algorithms for efficient implementation on a GPU. We consider parallelization schemes at the packet-level, subblock-level and trellis-level to increase the number of threads in a GPU implementation. At the trellis-level, we consider state-level, forward-backward traversal and branch-metric parallelism. To evaluate the performance of the different schemes, an LTE uplink Turbo decoder is implemented on an NVIDIA GTX470 GPU. Tradeoffs between throughput, latency and bit error rate are presented. Our most balanced configuration is simultaneously processing multiple subblocks in a packet in conjunction with recovery schemes and trellis-level parallelism, which can achieve a throughput of 19.65 Mbps with a latency of 0.56 ms at bit error rate of 10-5 for 1.3 dB channel SNR. We also show how different combinations of parallelization schemes can be used to satisfy systems with widely varying requirements of throughput, latency and bit error rate.
UR - http://www.scopus.com/inward/record.url?scp=84883326393&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=84883326393&partnerID=8YFLogxK
U2 - 10.1109/ISCAS.2013.6572072
DO - 10.1109/ISCAS.2013.6572072
M3 - Conference contribution
AN - SCOPUS:84883326393
SN - 9781467357609
T3 - Proceedings - IEEE International Symposium on Circuits and Systems
SP - 1220
EP - 1223
BT - 2013 IEEE International Symposium on Circuits and Systems, ISCAS 2013
T2 - 2013 IEEE International Symposium on Circuits and Systems, ISCAS 2013
Y2 - 19 May 2013 through 23 May 2013
ER -