Your work seems quite interesting, but I don’t have enough information to evaluate your proposal properly.
-You account for latency in number of cycles only. How does your proposal affect the clock frequency compared to the other methods, and therefore the latency in seconds?
-Another key issue is how your proposal affects throughput compared to the others: does it also reduce throughput? Said in a different way, what is the initiation interval (the number of cycles between two consecutive calculations) for the three different approaches?
-Is there any significant change in resource utilization (number of LUTs, DSPs)?
-Which sizes of matrices are you using in your benchmarks? Could you provide more information on how you compute the speed-up and energy consumption?
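
To make the first two questions concrete, here is a minimal sketch of the comparison I have in mind. The design names and all numbers are made up for illustration and are not taken from your work. It only assumes that latency in seconds is the cycle count divided by the clock frequency, and that steady-state throughput is the clock frequency divided by the initiation interval.

```python
def latency_seconds(cycles, f_clk_hz):
    """Wall-clock latency of one calculation: cycles / clock frequency."""
    return cycles / f_clk_hz


def throughput_per_second(ii_cycles, f_clk_hz):
    """Steady-state calculations per second: clock frequency / initiation interval."""
    return f_clk_hz / ii_cycles


# Hypothetical figures, chosen only to show why cycle counts alone can mislead.
designs = {
    "baseline": {"cycles": 100, "ii": 10, "f_clk_hz": 200e6},
    "proposed": {"cycles": 60, "ii": 20, "f_clk_hz": 100e6},
}

for name, d in designs.items():
    lat = latency_seconds(d["cycles"], d["f_clk_hz"])
    thr = throughput_per_second(d["ii"], d["f_clk_hz"])
    print(f"{name}: latency = {lat * 1e6:.2f} us, throughput = {thr / 1e6:.1f} M results/s")
```

With these illustrative numbers, the design with fewer cycles is slower in seconds and has lower throughput, which is why the cycle count alone is not enough for me to judge.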
Thank you very much.