Hey there,
Interesting work. How many radix bits can you support? Do you have to do multiple passes over the data for large data sets or high cardinality?
How does this compare to the CPU?
On the CPU you benefit from write-combining through the use of SIMD stores. Are you implementing a similar mechanism on the FPGA? If so, at what threshold are you writing to DRAM?
At a higher level, how does your approach compare to sorting networks in existing work?
David