FCCM Main Page › Forums › Poster Session 4 – Applications and Architectures › High-Performance Parallel Radix Sort on FPGA
This topic has 3 replies, 4 voices, and was last updated 3 weeks, 4 days ago by bashar.romanous.
April 8, 2020 at 6:25 am · #1196 · Ken Eguro (Keymaster)
High-Performance Parallel Radix Sort on FPGA – Link for PDF
Bashar Romanous (University of California, Riverside), Mohammadreza Rezvani (University of California, Riverside), Junjie Huang (University of California, Riverside), Daniel Wong (University of California, Riverside), Evangelos E. Papalexakis (University of California, Riverside), Vassilis J. Tsotras (University of California, Riverside), and Walid Najjar (University of California, Riverside)
May 5, 2020 at 5:56 am · #1594 · davidsidler (Participant)
Hey there,
Interesting work. How many radix bits can you support? Do you have to do multiple passes over the data for large data sets or high cardinality?
How does this compare to the CPU?
On the CPU you benefit from write-combining through the use of SIMD stores; are you implementing a similar mechanism on the FPGA? If so, at what threshold are you writing to DRAM?
At a higher level, how does your approach compare to sorting networks in existing work?
David
May 5, 2020 at 4:01 pm · #1607 · Dustin Richmond (Keymaster)
Thank you for posting your question!
May 6, 2020 at 3:43 am · #1647 · bashar.romanous (Participant)
Hello David,
Thank you for your interest in our work!
How many radix bits can you support?
We support a radix of 16 bits on the Xilinx UltraScale+ XCVU7P chip, but a chip with larger BRAM/URAM, such as the Xilinx UltraScale+ XCVU13P, would make a radix of 20 bits possible.
Do you have to do multiple passes over the data for large data sets or high cardinality?
The number of iterations is determined not by the dataset size but by the radix. For example, a radix of 16 bits with 80-bit keys means that 80/16 = 5 iterations are needed. The cardinality of the dataset does not affect the sorting time.
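To make the iteration count concrete, here is a minimal software sketch of an LSD radix sort (this is an illustrative model, not our FPGA implementation): the pass count depends only on the key width and the radix, never on the number of records.

```python
import math

def radix_sort(keys, key_bits=80, radix_bits=16):
    """LSD radix sort: one counting/distribution pass per radix_bits digit.

    The number of passes is ceil(key_bits / radix_bits),
    e.g. 80-bit keys with a 16-bit radix need 80/16 = 5 passes,
    independent of len(keys) or the keys' cardinality.
    """
    passes = math.ceil(key_bits / radix_bits)
    mask = (1 << radix_bits) - 1
    for p in range(passes):
        shift = p * radix_bits
        # Distribute records into 2^radix_bits buckets by the current digit.
        buckets = [[] for _ in range(1 << radix_bits)]
        for k in keys:
            buckets[(k >> shift) & mask].append(k)
        # Concatenating buckets in order keeps the sort stable across passes.
        keys = [k for b in buckets for k in b]
    return keys
```

Because each pass scatters records into buckets keyed by one digit, duplicate keys cost no extra passes, which is why cardinality does not affect the sorting time.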
How does this compare to the CPU?
CPU performance peaks and then degrades slowly as the dataset size grows, since cache misses increase and the dataset competes with the OS for memory resources. The FPGA throughput remains constant for large datasets. However, the very high frequency of CPU cores and their large memory bandwidth make a comparison in terms of raw throughput difficult. Hence, the figures were normalized by bandwidth, reporting throughput in terms of sorted records per cycle.
On the CPU you benefit from write-combining through the use of SIMD stores; are you implementing a similar mechanism on the FPGA? If so, at what threshold are you writing to DRAM?
We didn’t implement a similar mechanism, as there is no spatial locality in the writes. The writes are not contiguous and take place in a scattered fashion.
At a more high-level, how does your approach compare to sorting networks in existing work?
The time complexity of a distribution-based sort such as radix sort is O(N*B), where N is the dataset size and B is the number of iterations, i.e., the ratio of the key size to the radix. For a comparison-based sort, as used in sorting networks, the time complexity is O(N*log(N)).
Sorting networks are also limited in their scalability: for a larger dataset you have to run multiple iterations on the sorting network, which requires merging the sorted sub-datasets at the end.
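The complexity comparison above can be sketched numerically (a rough model with hypothetical helper names; it counts only the rounds of per-record work, ignoring constant factors): radix sort's pass count is fixed by the key size and radix, while a comparison-based sort's round count grows with log2(N).

```python
import math

def radix_passes(key_bits: int, radix_bits: int) -> int:
    """Passes for a distribution (radix) sort: B = ceil(key_bits / radix_bits)."""
    return math.ceil(key_bits / radix_bits)

def comparison_rounds(n: int) -> int:
    """Rounds of per-record work for a comparison-based sort: ceil(log2 n)."""
    return math.ceil(math.log2(n))

# 80-bit keys with a 16-bit radix: always 5 passes, regardless of N.
for n in (10**6, 10**9):
    print(n, radix_passes(80, 16), comparison_rounds(n))
# → 1000000 5 20
# → 1000000000 5 30
```

Under this model, growing the dataset from one million to one billion records leaves the radix pass count at 5 while the comparison-sort round count rises from 20 to 30, which is the scalability gap the answer describes.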
Kindest regards,
Bashar