FCCM Main Page › Forums › Poster Session 4 – Applications and Architectures › High-Performance Parallel Radix Sort on FPGA
This topic has 3 replies, 4 voices, and was last updated 3 weeks, 4 days ago by bashar.romanous.
April 8, 2020 at 6:25 am · #1196 · Ken Eguro (Keymaster)
High-Performance Parallel Radix Sort on FPGA – Link for PDF
Bashar Romanous (University of California, Riverside), Mohammadreza Rezvani (University of California, Riverside), Junjie Huang (University of California, Riverside), Daniel Wong (University of California, Riverside), Evangelos E. Papalexakis (University of California, Riverside), Vassilis J. Tsotras (University of California, Riverside), and Walid Najjar (University of California, Riverside)
May 5, 2020 at 5:56 am · #1594 · davidsidler (Participant)
Hey there,
Interesting work. How many radix bits can you support? Do you have to do multiple passes over the data for large data sets or high cardinality?
How does this compare to the CPU?
On the CPU you benefit from write-combining through the use of SIMD stores; are you implementing a similar mechanism on the FPGA? If so, at what threshold are you writing to DRAM?
At a higher level, how does your approach compare to sorting networks in existing work?
David
May 5, 2020 at 4:01 pm · #1607 · Dustin Richmond (Keymaster)
Thank you for posting your question!
May 6, 2020 at 3:43 am · #1647 · bashar.romanous (Participant)
Hello David,
Thank you for your interest in our work!
How many radix bits can you support?
We support a radix of 16 bits on the Xilinx UltraScale+ XCVU7P chip, but a chip with larger BRAM/URAM, such as the Xilinx UltraScale+ XCVU13P, would make a radix of 20 bits possible.
Do you have to do multiple passes over the data for large data sets or high cardinality?
The number of iterations is determined not by the dataset size but by the radix. For example, a radix of 16 bits with 80-bit keys means that 80/16 = 5 iterations are needed. The cardinality of the dataset does not affect the sorting time.
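To make the iteration count concrete, here is a minimal software sketch of an LSD radix sort (this is an illustrative model, not our FPGA implementation): the pass count depends only on the key width and the radix, never on the number of records.

```python
import math

def radix_sort(keys, key_bits=80, radix_bits=16):
    """LSD radix sort: one counting/distribution pass per radix_bits digit.

    The number of passes is ceil(key_bits / radix_bits),
    e.g. 80-bit keys with a 16-bit radix need 80/16 = 5 passes,
    independent of len(keys) or the keys' cardinality.
    """
    passes = math.ceil(key_bits / radix_bits)
    mask = (1 << radix_bits) - 1
    for p in range(passes):
        shift = p * radix_bits
        # Distribute records into 2^radix_bits buckets by the current digit.
        buckets = [[] for _ in range(1 << radix_bits)]
        for k in keys:
            buckets[(k >> shift) & mask].append(k)
        # Concatenating buckets in order keeps the sort stable across passes.
        keys = [k for b in buckets for k in b]
    return keys
```

Because each pass scatters records into buckets keyed by one digit, duplicate keys cost no extra passes, which is why cardinality does not affect the sorting time.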
How does this compare to the CPU?
CPU performance peaks and then degrades slowly as the dataset size grows, since cache misses increase and the dataset competes with the OS for memory resources. The FPGA throughput remains constant for large datasets. However, the very high frequency of CPU cores and their large memory bandwidth make a comparison in terms of raw throughput difficult. Hence, the figures were normalized by bandwidth, reporting throughput in terms of sorted records per cycle.
On the CPU you benefit from write-combining through the use of SIMD stores; are you implementing a similar mechanism on the FPGA? If so, at what threshold are you writing to DRAM?
We didn’t implement a similar mechanism, as there is no spatial locality in the writes. The writes are not contiguous and take place in a scattered fashion.
At a more high-level, how does your approach compare to sorting networks in existing work?
The time complexity of a distribution-based sort such as radix sort is O(N*B), where N is the dataset size and B is the number of iterations, i.e., the ratio of the key size to the radix. For a comparison-based sort, as used in sorting networks, the time complexity is O(N*log(N)).
Sorting networks are also limited in their scalability: for a larger dataset you have to run multiple iterations on the sorting network, which requires merging the sorted sub-datasets at the end.
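The complexity comparison above can be sketched numerically (a rough model with hypothetical helper names; it counts only the rounds of per-record work, ignoring constant factors): radix sort's pass count is fixed by the key size and radix, while a comparison-based sort's round count grows with log2(N).

```python
import math

def radix_passes(key_bits: int, radix_bits: int) -> int:
    """Passes for a distribution (radix) sort: B = ceil(key_bits / radix_bits)."""
    return math.ceil(key_bits / radix_bits)

def comparison_rounds(n: int) -> int:
    """Rounds of per-record work for a comparison-based sort: ceil(log2 n)."""
    return math.ceil(math.log2(n))

# 80-bit keys with a 16-bit radix: always 5 passes, regardless of N.
for n in (10**6, 10**9):
    print(n, radix_passes(80, 16), comparison_rounds(n))
# → 1000000 5 20
# → 1000000000 5 30
```

Under this model, growing the dataset from one million to one billion records leaves the radix pass count at 5 while the comparison-sort round count rises from 20 to 30, which is the scalability gap the answer describes.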
Kindest regards,
Bashar