Interesting paper, and highly topical!
I have a couple of questions, though:
– What functionality does Shuhai provide over the traffic generators that Xilinx provides, for example, with the HBM example design?
– In your paper, you compared DDR4 and HBM latency and studied the switch. Were the latency differences caused by fundamental architectural differences between DDR4 and HBM, by the Xilinx crossbar added in HBM, or by something else?
– Did you study non-strided accesses? If so, what conclusions did you draw from them? Was there an “optimal” access size for maximum throughput?
Thanks!