Proposing a Fast and Scalable Systolic Array for Matrix Multiplication

FCCM Main Page › Forums › Poster Session 1 – Arithmetic and Security › Proposing a Fast and Scalable Systolic Array for Matrix Multiplication

  • This topic has 2 replies, 3 voices, and was last updated 6 days ago by Bahar.
Viewing 2 reply threads
  • Author
    Posts
    • #1176
      Ken Eguro
      Keymaster

      Proposing a Fast and Scalable Systolic Array for Matrix Multiplication
      Link for PDF
      Bahar Asgari (Georgia Institute of Technology), Ramyad Hadidi (Georgia Institute of Technology), and Hyesoon Kim (Georgia Institute of Technology)

    • #1717
      fjhormigo
      Participant

      Your work seems quite interesting, but I don’t have enough information to evaluate your proposal properly.
      – You account for latency based on the number of cycles only, but how does your proposal affect the clock frequency compared to the other methods, and hence the latency in seconds?
      – Another key issue is how your proposal affects throughput compared to the others; does it also reduce throughput? Said another way, what is the initiation interval (the number of cycles between two consecutive calculations) for the three different approaches?
      – Is there any significant change in resource utilization (number of LUTs, DSPs)?
      – Which sizes of matrices are you using in your benchmarks? Could you provide more information on how you compute the speedup and energy consumption?
      Thank you very much.

    • #1719
      Bahar
      Participant

      Hello!
      Many thanks for showing interest in our work. In the following we answer your questions in order:

      – Our proposed structure does not affect the clock frequency. In other words, increasing or decreasing the clock frequency impacts all three designs equally (ours and the two previous systolic arrays), by either removing positive slack or increasing the number of cycles.
      – The maximum attainable throughput (GByte/sec x Ops/Byte) is determined by the depth of the systolic array (Ops/Byte, i.e., the reuse rate) and the memory bandwidth. Therefore, the throughput of same-size systolic arrays (ours and the two previous designs) connected to the same memory system (hence the same memory bandwidth) is the same, and is higher than that of CPUs and GPUs. Regarding your question about the cycles between consecutive calculations, please note that our design and the TPU-style systolic array benefit from overlapping the load and process phases and reusing the preloaded matrix, while the non-stationary systolic array does not.
      – Regardless of their interconnections, all implemented systolic arrays (ours and the two previous systolic arrays) are similar in the total number of multipliers, adders, and registers (as they all store a value in their PEs, either an operand or a partial output). As a result, even though our design uses slightly fewer FFs and LUTs because of its multiplier-plus-adder-tree architecture, we do not see significant differences in resource utilization.
      – Our benchmarks include VGGS, VGG16, AlexNet, CifarNet, and ResNet50, consisting of various-size matrices with dimensions between 16 and 50,176. The reported speedup and energy consumption are for performing only the matrix multiplications for inference using the mentioned set of DNNs.
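      The throughput bound in the second answer can be illustrated with a small roofline-style calculation: attainable throughput is memory bandwidth (GB/s) times arithmetic intensity (Ops/Byte), which for a systolic array grows with its depth. The bandwidth, depth, and peak numbers below are illustrative assumptions, not figures from the paper:

      ```python
      # Roofline-style sketch of the throughput bound discussed above.
      # Attainable throughput (GOps/s) is capped by memory bandwidth
      # (GB/s) times arithmetic intensity (Ops/Byte); for a systolic
      # array, the intensity scales with its depth (the reuse rate).

      def attainable_throughput(bandwidth_gbps, ops_per_byte, peak_gops=None):
          """Max throughput in GOps/s: bandwidth (GB/s) x reuse (Ops/Byte).

          `peak_gops`, if given, caps the result at the compute roof.
          """
          bound = bandwidth_gbps * ops_per_byte
          return min(bound, peak_gops) if peak_gops is not None else bound

      # Illustrative numbers (assumptions, not from the paper):
      bw = 16.0  # GB/s of memory bandwidth
      for depth in (8, 32, 128):
          gops = attainable_throughput(bw, depth)
          print(f"depth={depth:4d}: {gops:8.1f} GOps/s")
      ```

      Note that two same-size arrays attached to the same memory system get identical bounds from this formula, which is the point made above: the interconnection style changes cycle counts and load/process overlap, not the bandwidth-imposed ceiling.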

      Please let us know if further clarifications are required.
      Thanks!
