Hi Dustin,
Thanks for your interest in our paper.
1. We provide a parameterized strided memory access pattern, which is able to cover the performance characteristics of HBM. You are right that the HBM example design provides a couple of other memory access patterns, which are orthogonal to ours. We will explore more patterns in future work.
2. We guess that HBM’s latency is much higher because of the high-speed serial link between the HBM chip and the FPGA die, which needs extra cycles for parallel-serial-parallel conversion. In fact, we measured HBM’s latency with the switch disabled, and HBM still has higher latency.
3. Not yet; we focus on the strided access pattern in our paper. Yes, the burst size is critical to achieving higher memory throughput. A larger burst size can tolerate a larger stride, as shown in Figure 6: when the burst size is 256 bytes, the stride=4K case still achieves the same throughput as sequential access.
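
To make the strided pattern concrete, here is a minimal sketch of how the start addresses of successive read transactions could be generated. The function name, the parameter names, and the wrap-around within a working set are assumptions made for illustration, not the exact implementation in our design. The sketch also assumes that "4K" means a stride of 4096 bytes.

```python
def strided_addresses(base, stride, working_set, count):
    """Return the start address of each of `count` read transactions.

    Hypothetical model: transaction i starts `stride` bytes after
    transaction i-1 and wraps around inside `working_set` bytes.
    """
    return [base + (i * stride) % working_set for i in range(count)]


BURST = 256  # bytes moved per transaction (burst size)

# Sequential access: the stride equals the burst size, so bursts are contiguous.
sequential = strided_addresses(base=0, stride=BURST, working_set=1 << 20, count=4)

# Strided access: each burst starts 4096 bytes after the previous one.
strided = strided_addresses(base=0, stride=4096, working_set=1 << 20, count=4)

print(sequential)  # [0, 256, 512, 768]
print(strided)     # [0, 4096, 8192, 12288]
```

Both cases move the same number of bytes per transaction; only the distance between consecutive start addresses changes.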
Thanks,
Zeke