Shuhai: Benchmarking High Bandwidth Memory on FPGAs – The 28th IEEE International Symposium on Field-Programmable Custom Computing Machines

Author

Posts
- April 8, 2020 at 5:57 am #1163
  
  Ken Eguro
  Keymaster
  
  Shuhai: Benchmarking High Bandwidth Memory on FPGAs – Link for PDF
  Zeke Wang (Zhejiang University), Hongjing Huang (Zhejiang University), Jie Zhang (Zhejiang University), and Gustavo Alonso (ETH Zurich)
- May 4, 2020 at 1:49 pm #1652
  
  wzk6_3_8
  Participant
  
  Thanks for dropping by and listening to my presentation on Shuhai.
  Feel free to ask any questions; I will get back to you as soon as I can.
  
  If you would like to use our tool, Shuhai, to benchmark memory such as HBM and DDR4, please check out the source code at https://github.com/RC4ML/Shuhai.
  
  Zeke
- May 5, 2020 at 8:54 pm #1634
  
  Dustin Richmond
  Keymaster
  
  Interesting paper, and highly topical!
  
  I have a couple of questions, though:
  – What functionality does Shuhai provide over the traffic generators that Xilinx provides, for example, with the HBM example design?
  – In your paper, you compared DDR4 latency and HBM latency, and you studied the switch. Were the latency differences caused by fundamental DDR4/HBM architecture differences, by the added Xilinx crossbar in HBM, or by something else?
  – Did you study non-strided accesses? If so, what conclusion could you draw from those? Was there an “optimal” access size to get maximum throughput?
  
  Thanks!
- May 6, 2020 at 1:43 pm #1651
  
  wzk6_3_8
  Participant
  
  Hi Dustin,
  
  Thanks for your interest in our paper.
  
  1. We provide a parameterized strided memory access pattern, which is able to cover the performance characteristics of HBM. You are right that the HBM example design provides a couple of memory access patterns; these are orthogonal to ours. We will explore more patterns in future work.
  
  2. Our guess is that HBM’s latency is much higher because of the high-speed serial link between the HBM chip and the FPGA die: it takes extra cycles to perform the parallel-serial-parallel conversion. In fact, we also measured HBM’s latency with the switch disabled, and HBM still had higher latency.
  
  3. Not yet; we focus on the strided access pattern in our paper. Yes, the burst size is critical to achieving higher memory throughput. A larger burst size can tolerate a larger stride, as shown in Figure 6. With a burst size of 256 bytes, a stride of 4K still achieves the same throughput as sequential access.
  
  Thanks,
  Zeke
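
For readers who want a concrete picture of the parameterized strided pattern discussed in this thread, here is a minimal Python sketch of the address sequence such a pattern might produce. It is not taken from the Shuhai source code, which targets FPGA hardware. The function and parameter names (`strided_addresses`, `touched_fraction`, `base`, `stride`, `working_set`, `count`) are hypothetical, and the wrap-around at the working-set boundary is an assumption made for illustration.

```python
from typing import Iterator


def strided_addresses(base: int, stride: int, working_set: int, count: int) -> Iterator[int]:
    """Yield the start address of each burst in a wrap-around strided pattern.

    All names are hypothetical; all sizes are in bytes.
    """
    if stride <= 0 or working_set <= 0 or count < 0:
        raise ValueError("stride and working_set must be positive, count non-negative")
    for i in range(count):
        # Assumption: the offset wraps once it reaches the working-set size.
        yield base + (i * stride) % working_set


def touched_fraction(burst: int, stride: int) -> float:
    """Return the fraction of the address range actually transferred."""
    if burst <= 0 or stride <= 0:
        raise ValueError("burst and stride must be positive")
    # A burst longer than the stride overlaps the next one, so cap at 1.0.
    return min(burst, stride) / stride


if __name__ == "__main__":
    # Burst start addresses for a 4 KiB stride over a 16 KiB working set.
    for addr in strided_addresses(base=0, stride=4096, working_set=16384, count=6):
        print(hex(addr))
    # Share of the address range covered by 256-byte bursts at a 4 KiB stride.
    print(touched_fraction(burst=256, stride=4096))
```

Setting `stride` equal to the burst size gives the sequential case. With a 256-byte burst and a 4K stride, each burst touches only 1/16 of the address range it steps over, which is the configuration reported above as still matching sequential throughput.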