Hardware-efficient matrix multiplication core optimization for edge AI on FPGA
DOI:
https://doi.org/10.54939/1859-1043.j.mst.IITE.2025.123-130
Keywords:
IP cores; Matrix multiplication; FPGA-CNN; MAC; Vivado-Vitis.
Abstract
This paper presents an optimization approach for matrix multiplication IP cores on FPGA by transforming convolution operations into matrix multiplications. The proposed method leverages parallel computation combined with simultaneous data loading within the same processing cycle, thereby reducing memory requirements and computational latency. Furthermore, casting the output data from 64-bit to 32-bit halves the output buffer, yielding significant hardware resource savings. Simulation results on ModelSim and Vivado-Vitis demonstrate that the design achieves higher computational efficiency and more effective resource utilization than traditional implementations while maintaining stable processing time. This work contributes to the design of CNN inference accelerators on FPGA for edge AI applications, where resource constraints and power consumption are critical factors.
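The abstract names two concrete techniques without showing them: the im2col transformation that turns convolution into a matrix multiplication, and a MAC loop that accumulates in 64 bits but stores 32-bit outputs. The following is a minimal C++ sketch of both ideas for reference; the function names, the int16 operand width, the single-channel valid-padding convolution, and the placement of HLS directives are illustrative assumptions, not details taken from the paper.

#include <cstdint>
#include <vector>

// im2col: unroll every KxK input window into one row of matrix A, so a
// valid-mode convolution with a KxK kernel becomes C = A * B, where each
// column of B is one flattened kernel.
std::vector<int16_t> im2col(const std::vector<int16_t>& in, int H, int W, int K) {
    const int oh = H - K + 1, ow = W - K + 1;   // output feature-map size
    std::vector<int16_t> A(static_cast<size_t>(oh) * ow * K * K);
    for (int r = 0; r < oh; ++r)
        for (int c = 0; c < ow; ++c)
            for (int kr = 0; kr < K; ++kr)
                for (int kc = 0; kc < K; ++kc)
                    A[((r * ow + c) * K + kr) * K + kc] = in[(r + kr) * W + (c + kc)];
    return A;
}

// MAC kernel for C = A * B with A (M x N), B (N x P): accumulate each dot
// product in 64 bits to avoid overflow, then cast the finished sum down to
// 32 bits before storing it, halving the output buffer.
void matmul_mac(const std::vector<int16_t>& A, const std::vector<int16_t>& B,
                std::vector<int32_t>& C, int M, int N, int P) {
    for (int i = 0; i < M; ++i)
        for (int j = 0; j < P; ++j) {
            int64_t acc = 0;                     // wide accumulator
            // In an HLS flow this inner loop is the candidate for unrolling
            // and pipelining so operands are fetched and multiplied in the
            // same cycle (e.g. #pragma HLS PIPELINE in Vitis HLS; assumed here).
            for (int k = 0; k < N; ++k)
                acc += static_cast<int64_t>(A[i * N + k]) * B[k * P + j];
            C[i * P + j] = static_cast<int32_t>(acc);  // 64- to 32-bit cast
        }
}

In a Vitis HLS flow, partitioning A and B across on-chip memories would let several operands load in the same cycle they are consumed, which is the parallel-load behavior the abstract describes; the exact directives and partitioning factors would depend on the target device.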
