Hello David,
Thank you for your interest in our work!
How many radix bits can you support?
We support a radix of 16 bits on the Xilinx UltraScale+ XCVU7P chip, but a chip with more BRAM/URAM, such as the Xilinx UltraScale+ XCVU13P, would make a radix of 20 bits possible.
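For intuition, here is a back-of-the-envelope sketch (plain C, not the FPGA design) of why the radix is bounded by on-chip memory: each pass keeps one partition buffer per radix value, so the buffer count grows as 2^radix. The bytes-per-bucket figure below is a made-up placeholder; the real sizing depends on the design.

```c
#include <stdio.h>

int main(void)
{
    /* Hypothetical per-bucket buffer size, for illustration only. */
    unsigned long bytes_per_bucket = 32;

    for (unsigned bits = 16; bits <= 20; bits += 4) {
        unsigned long buckets = 1ul << bits;   /* 2^radix partitions */
        printf("%u-bit radix: %lu buckets, ~%lu KiB of on-chip buffers\n",
               bits, buckets, buckets * bytes_per_bucket / 1024);
    }
    return 0;
}
```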
Do you have to do multiple passes over the data for large data sets or high cardinality?
The number of iterations is determined not by the dataset size but by the radix: the number of passes is the key size divided by the radix. For example, 80-bit keys with a 16-bit radix need 80/16 = 5 iterations. The cardinality of the dataset does not affect the sorting time.
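To make the pass count concrete, here is a minimal software sketch of LSD radix sort in C. It is an illustration of the algorithm only, not our FPGA implementation; it assumes 64-bit keys for simplicity (64/16 = 4 passes, whereas the 80-bit keys above need 5).

```c
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define RADIX_BITS 16
#define BUCKETS    (1u << RADIX_BITS)   /* 65536 partitions per pass */

void radix_sort_u64(uint64_t *keys, size_t n)
{
    uint64_t *tmp   = malloc(n * sizeof *tmp);
    size_t   *count = malloc(BUCKETS * sizeof *count);
    int passes = 64 / RADIX_BITS;       /* key bits / radix bits */

    for (int p = 0; p < passes; p++) {
        int shift = p * RADIX_BITS;
        memset(count, 0, BUCKETS * sizeof *count);

        /* Histogram pass: key cardinality never matters,
           only the number of passes does. */
        for (size_t i = 0; i < n; i++)
            count[(keys[i] >> shift) & (BUCKETS - 1)]++;

        /* Exclusive prefix sum gives each bucket's start offset. */
        size_t sum = 0;
        for (size_t b = 0; b < BUCKETS; b++) {
            size_t c = count[b];
            count[b] = sum;
            sum += c;
        }

        /* Scatter pass: writes land all over the output buffer. */
        for (size_t i = 0; i < n; i++)
            tmp[count[(keys[i] >> shift) & (BUCKETS - 1)]++] = keys[i];

        memcpy(keys, tmp, n * sizeof *keys);
    }
    free(tmp);
    free(count);
}
```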
How does this compare to the CPU?
CPU performance peaks and then degrades slowly as the dataset grows, since cache misses increase and the dataset competes with the OS for memory resources. The FPGA throughput stays constant for large datasets. However, the very high clock frequency of CPU cores and their large memory bandwidth make a comparison in raw throughput difficult. Hence, the figures were normalized by bandwidth, and throughput is reported as sorted records per cycle.
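As a purely illustrative example of the frequency normalization (all numbers below are made up, not measurements from the paper):

```c
#include <stdio.h>

/* Dividing raw throughput by clock frequency yields the
   sorted-records-per-cycle metric used for the comparison. */
int main(void)
{
    double cpu_hz   = 3.0e9;   /* assumed 3.0 GHz CPU core      */
    double fpga_hz  = 2.0e8;   /* assumed 200 MHz FPGA design   */
    double cpu_rps  = 6.0e8;   /* assumed records/s on the CPU  */
    double fpga_rps = 4.0e8;   /* assumed records/s on the FPGA */

    printf("CPU : %.2f records/cycle\n", cpu_rps / cpu_hz);
    printf("FPGA: %.2f records/cycle\n", fpga_rps / fpga_hz);
    return 0;
}
```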
On the CPU you benefit from write-combining through the use of SIMD stores; are you implementing a similar mechanism on the FPGA? If so, at what threshold are you writing to DRAM?
We didn't implement a similar mechanism because there is no spatial locality in the writes: they are not contiguous and take place in a scattered fashion, as in the scatter pass of the sketch above.
At a higher level, how does your approach compare to sorting networks in existing work?
The time complexity of a distribution-based sort such as radix sort is O(N*B), where N is the dataset size and B is the number of iterations, i.e., the ratio of the key size to the radix. For comparison-based sorts, as used in sorting networks, the time complexity is O(N log N).
Sorting networks are also limited in scalability: for a larger dataset you have to run multiple batches through the sorting network and then merge the sorted sub-datasets at the end. A rough per-record work comparison is sketched below.
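Here is a back-of-the-envelope comparison of the per-record work, assuming the 80-bit keys and 16-bit radix from above and a hypothetical dataset size:

```c
#include <math.h>
#include <stdio.h>

int main(void)
{
    double n = 1e9;                       /* hypothetical dataset size */
    double radix_passes   = 80.0 / 16.0;  /* B = key bits / radix = 5  */
    double compare_factor = log2(n);      /* ~29.9 for N = 10^9        */

    printf("radix sort work      ~ %.1f * N\n", radix_passes);
    printf("comparison sort work ~ %.1f * N\n", compare_factor);
    return 0;
}
```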
Kindest regards,
Bashar