Proposing a Fast and Scalable Systolic Array for Matrix Multiplication

FCCM Main Page › Forums › Poster Session 1 – Arithmetic and Security › Proposing a Fast and Scalable Systolic Array for Matrix Multiplication

  • This topic has 2 replies, 3 voices, and was last updated 6 days ago by Bahar.
Viewing 2 reply threads
  • Author
    Posts
    • #1176
      Ken Eguro
      Keymaster

      Proposing a Fast and Scalable Systolic Array for Matrix Multiplication
      Link for PDF
      Bahar Asgari (Georgia Institute of Technology), Ramyad Hadidi (Georgia Institute of Technology), and Hyesoon Kim (Georgia Institute of Technology)

    • #1717
      fjhormigo
      Participant

      Your work seems quite interesting, but I don’t have enough information to evaluate your proposal properly.
      – You account for latency based on the number of cycles only, but how does your proposal affect the clock frequency compared to the other methods, and hence the latency in seconds?
      – Another key issue is how your proposal affects throughput compared to the others; does it also reduce throughput? Said another way, what is the initiation interval (the number of cycles between two consecutive calculations) for the three different approaches?
      – Is there any significant change in resource utilization (number of LUTs, DSPs)?
      – Which sizes of matrices are you using in your benchmarks? Could you provide more information on how you compute the speedup and energy consumption?
      Thank you very much.

    • #1719
      Bahar
      Participant

      Hello!
      Many thanks for showing interest in our work. In the following we answer your questions in order:

      – Our proposed structure does not affect the clock frequency. In other words, increasing or decreasing the clock frequency impacts all three designs equally (ours and the two previous systolic arrays), by either removing positive slack or increasing the number of cycles.
      – The maximum attainable throughput (GByte/sec x Ops/Byte) is determined by the depth of the systolic array (Ops/Byte, i.e., the reuse rate) and the memory bandwidth. Therefore, the throughput of same-size systolic arrays (ours and the two previous designs) connected to the same memory system (hence the same memory bandwidth) is the same, and is higher than that of CPUs and GPUs. Regarding your question about the cycles between consecutive calculations, please note that our design and the TPU-style systolic array benefit from overlapping the load and process phases and reusing the preloaded matrix, while the non-stationary systolic array does not.
      – Regardless of their interconnections, all implemented systolic arrays (ours and the two previous systolic arrays) are similar in the total number of multipliers, adders, and registers (as they all store a value in their PEs, either an operand or a partial output). As a result, even though our design uses slightly fewer FFs and LUTs because of its multiplier-plus-adder-tree architecture, we do not see significant differences in resource utilization.
      – Our benchmarks include VGGS, VGG16, AlexNet, CifarNet, and ResNet50, consisting of various-size matrices with dimensions between 16 and 50,176. The reported speedup and energy consumption are for performing only the matrix multiplications for inference using the mentioned set of DNNs.
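      The throughput bound in the second answer can be illustrated with a small roofline-style calculation: attainable throughput is memory bandwidth (GB/s) times arithmetic intensity (Ops/Byte), which for a systolic array grows with its depth. The bandwidth, depth, and peak numbers below are illustrative assumptions, not figures from the paper:

      ```python
      # Roofline-style sketch of the throughput bound discussed above.
      # Attainable throughput (GOps/s) is capped by memory bandwidth
      # (GB/s) times arithmetic intensity (Ops/Byte); for a systolic
      # array, the intensity scales with its depth (the reuse rate).

      def attainable_throughput(bandwidth_gbps, ops_per_byte, peak_gops=None):
          """Max throughput in GOps/s: bandwidth (GB/s) x reuse (Ops/Byte).

          `peak_gops`, if given, caps the result at the compute roof.
          """
          bound = bandwidth_gbps * ops_per_byte
          return min(bound, peak_gops) if peak_gops is not None else bound

      # Illustrative numbers (assumptions, not from the paper):
      bw = 16.0  # GB/s of memory bandwidth
      for depth in (8, 32, 128):
          gops = attainable_throughput(bw, depth)
          print(f"depth={depth:4d}: {gops:8.1f} GOps/s")
      ```

      Note that two same-size arrays attached to the same memory system get identical bounds from this formula, which is the point made above: the interconnection style changes cycle counts and load/process overlap, not the bandwidth-imposed ceiling.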

      Please let us know if further clarifications are required.
      Thanks!
