Abstract:
In recent years, deep neural networks (DNNs) have gained wide acceptance for computer vision applications because of their high classification accuracy and versatility. The Convolutional Neural Network (CNN) is one of the most popular DNN architectures and is widely adopted for image, speech, and video recognition. However, the extensive computation and large memory requirements of CNNs pose a bottleneck to their application. Field Programmable Gate Arrays (FPGAs) are considered suitable hardware platforms for deploying CNNs with low power requirements. This paper focuses on the design and implementation of a hardware accelerator that performs the convolution product (matrix-matrix multiplication). We use two optimization techniques to achieve energy efficiency. First, the dataflow of the convolution phase is rescheduled to reduce undesired on-chip memory accesses. Second, efficiency is further enhanced by reducing the internal parallelism of the structure as much as possible. Our architecture is implemented on the Xilinx ZCU104 evaluation board. The implemented design attains 98.1 GOPS/Joule and 32.77 GOPS/Joule for 8-bit and 16-bit data widths, respectively.