A Novel Procedure for Implementing a Turbo Decoder on a GPU with Coalesced Memory Access

Ahn, Heungseop; Choi, Seungwon

doi:10.1587/transfun.E100.A.1188

Authors: Ahn, Heungseop; Choi, Seungwon

Issue Date: May-2017

Publisher: IEICE-INST ELECTRONICS INFORMATION COMMUNICATIONS ENG

Keywords: GPU; CUDA; turbo decoder; coalesced memory access; SDR

Citation: IEICE TRANSACTIONS ON FUNDAMENTALS OF ELECTRONICS COMMUNICATIONS AND COMPUTER SCIENCES, v.E100A, no.5, pp.1188 - 1196

Indexed: SCIE; SCOPUS

Journal Title: IEICE TRANSACTIONS ON FUNDAMENTALS OF ELECTRONICS COMMUNICATIONS AND COMPUTER SCIENCES

Volume: E100A

Number: 5

Start Page: 1188

End Page: 1196

URI: https://scholarworks.bwise.kr/hanyang/handle/2021.sw.hanyang/20348

DOI: 10.1587/transfun.E100.A.1188

ISSN: 0916-8508

Abstract: The sub-blocking algorithm is known as a core component in implementing a turbo decoder on a Graphics Processing Unit (GPU), because it allows as many GPU cores as possible to be used for parallel processing. However, even though the sub-blocking algorithm allows a large number of threads in a given GPU to process a large number of sub-blocks in parallel, each thread must access the global memory with strided addresses, which results in uncoalesced memory access. Because uncoalesced memory access causes many unnecessary memory transactions, the memory bandwidth efficiency drops significantly, possibly to as low as 1/8 in the case of a Long Term Evolution (LTE) turbo decoder, depending upon the compute capability of the GPU. In this paper, we present a novel method for converting uncoalesced memory access into coalesced access in a way that completely recovers the memory bandwidth efficiency to 100% without additional overhead. Our experimental tests, performed with NVIDIA's GeForce GTX 780 Ti GPU, show that the proposed method can enhance the throughput by nearly 30% compared with a conventional turbo decoder that suffers from uncoalesced memory access. In these tests, the throughput of the proposed method was observed to be 51.4 Mbps when the number of iterations and the number of sub-blocks were set to 6 and 32, respectively, which far exceeds the performance of previous works that implemented the Max-Log-MAP algorithm.
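
The 1/8 figure quoted in the abstract can be reproduced with a simple counting model of memory transactions. The sketch below is illustrative only and is not the authors' procedure, which the abstract does not describe. It assumes a code block of K = 6144 values (the largest LTE turbo code block), 4-byte values, 32-byte memory segments, and one warp of 32 threads with one thread per sub-block. The interleaved layout `k * P + t` is a hypothetical example of a coalesced arrangement, and the names `segments_touched` and `bandwidth_efficiency` are invented for this illustration.

```python
def segments_touched(byte_addresses, segment_bytes=32):
    """Count the distinct memory segments covered by one warp-wide read."""
    return len({addr // segment_bytes for addr in byte_addresses})


def bandwidth_efficiency(index_of, num_threads=32, elem_bytes=4, segment_bytes=32):
    """Useful bytes divided by fetched bytes for a single warp-wide read.

    index_of(t) returns the element index that thread t reads.
    All sizes are assumptions of this model, not values from the paper.
    """
    addresses = [index_of(t) * elem_bytes for t in range(num_threads)]
    useful = num_threads * elem_bytes
    fetched = segments_touched(addresses, segment_bytes) * segment_bytes
    return useful / fetched


K = 6144      # assumed code block length (largest LTE turbo code block)
P = 32        # number of sub-blocks, one thread per sub-block
L = K // P    # sub-block length: 192 values
k = 5         # an arbitrary step inside each sub-block

# Sub-block-major layout: thread t reads element t * L + k (stride of L values).
strided = bandwidth_efficiency(lambda t: t * L + k)

# Hypothetical interleaved layout: thread t reads element k * P + t (contiguous).
interleaved = bandwidth_efficiency(lambda t: k * P + t)

print(f"strided:     {strided:.3f}")      # 0.125, i.e. 1/8
print(f"interleaved: {interleaved:.3f}")  # 1.000
```

Under these assumptions, each strided read lands in its own 32-byte segment, so the warp fetches 1024 bytes to use 128. In the interleaved layout, the same 128 bytes fit in four adjacent segments.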


Appears in Collections: College of Engineering (Seoul) > Department of Electronic Engineering (Seoul) > 1. Journal Articles
