DeepBurning is an end-to-end neural network acceleration design tool that generates both a customized neural network model and a neural processing unit (NPU) for a specialized learning task on FPGAs. An overview of DeepBurning is shown in Figure 1. Given only the dataset of the target application and high-level design constraints, such as the total resource budget, it produces a unified, optimized acceleration solution for a typical heterogeneous CPU+FPGA architecture that can be deployed immediately. Application developers can therefore focus on the application itself rather than on neural network model design or low-level accelerator parameter tuning. In particular, we propose an efficient co-design AutoML search framework named YOSO that optimizes the neural network architecture and the NPU parameters at the same time. Note that DeepBurning relies on a pre-built NPU template that allows flexible configuration and customization. The template is expected to be developed by skilled hardware designers to ensure an efficient hardware implementation.
DeepBurning is under active development. The major components, including YOSO and NPU compilation, are already in use, while automatic NPU generation from the pre-built template still requires considerable manual adjustment. We will release it online once it is ready. Currently, users can only compile neural network models to a specific NPU configuration.
# Key features
Given high-level design constraints, YOSO can be used to search for the optimized neural network architecture and NPU configuration.
Neural network models described in Prototxt can be compiled to instructions and then deployed on the pre-built NPU. Currently, we provide only some pre-compiled neural networks; a free online compiler will be offered later.
A typical NPU with a 2D array computing architecture is provided as a netlist. Its architecture is shown in Figure 2. It includes a 128 KB I/O buffer that can be allocated dynamically between input and output and supports data prefetching to hide external memory access overhead. It covers a large number of operations commonly used in neural networks, along with relevant image processing operations, and therefore supports more than 30 neural networks. The supported operations and neural network models are listed in Table 1.
The generated accelerators and drivers can be used on Xilinx Zynq 7000 devices. In particular, the design has been verified on the ZC706 and MZ7100 boards. The corresponding Linux kernel and root file system are also provided.
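To make the idea of searching the network architecture and the NPU configuration together more concrete, here is a minimal sketch of a joint search under a resource budget. It is not YOSO's algorithm, which this page does not describe: it uses plain random search, and every name in it (`Candidate`, `resource_cost`, `score`, `co_search`), every knob, and both cost models are hypothetical stand-ins chosen only for illustration.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    """One point in the joint design space (all knobs are hypothetical)."""
    depth: int    # number of network layers
    width: int    # channels per layer
    pe_rows: int  # rows of the 2D compute array
    pe_cols: int  # columns of the 2D compute array


def resource_cost(c: Candidate) -> int:
    # Toy stand-in: resource use grows with the size of the compute array.
    return c.pe_rows * c.pe_cols


def score(c: Candidate) -> float:
    # Toy stand-in for "accuracy minus a latency penalty".
    accuracy_proxy = 1.0 - 1.0 / (c.depth * c.width)
    work = c.depth * c.width * c.width
    latency_proxy = work / (c.pe_rows * c.pe_cols)
    return accuracy_proxy - 0.001 * latency_proxy


def co_search(budget: int, trials: int = 200, seed: int = 0):
    """Randomly sample network and NPU knobs together; keep the best
    candidate whose resource cost fits the budget, or None if none fits."""
    rng = random.Random(seed)
    best = None
    for _ in range(trials):
        c = Candidate(
            depth=rng.choice([4, 8, 16]),
            width=rng.choice([16, 32, 64]),
            pe_rows=rng.choice([4, 8, 16, 32]),
            pe_cols=rng.choice([4, 8, 16, 32]),
        )
        if resource_cost(c) > budget:
            continue  # violates the high-level design constraint
        if best is None or score(c) > score(best):
            best = c
    return best


if __name__ == "__main__":
    print(co_search(budget=256))
```

The point of the sketch is only that both sets of parameters are sampled and scored in the same loop, so a network choice is never evaluated against a fixed accelerator or the other way around.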
Table 1: Supported operations and neural network models.

| Neural network operations | General computing operations | Neural network models |
|---|---|---|
| Convolution, deconvolution, 3D convolution, grouped convolution, full connection, Softmax, elementwise, concat, reorganization, batch normalization, pooling (average, max), activation functions (ReLU, PReLU, Leaky ReLU, tanh, Sigmoid, …) | Matrix-matrix multiplication, matrix-vector multiplication, dot product, cosine distance, feature scaling | GoogleNet, DenseNet, VGG, ResNet, MobileNet, SqueezeNet, DCGAN, LSTM, MTCNN, Hourglass, … |
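The 128 KB I/O buffer mentioned in the feature list is shared between input and output data. As a rough illustration of what dynamic allocation could look like, the sketch below checks whether an input tile and an output tile fit in the buffer together and hands any spare space to the input side, where it could hold prefetched data. The NPU's actual allocation policy is not described here, so the function `split_io_buffer` and its policy are assumptions.

```python
IO_BUFFER_BYTES = 128 * 1024  # buffer size stated in the feature list


def split_io_buffer(input_bytes, output_bytes, total=IO_BUFFER_BYTES):
    """Return (input_share, output_share) in bytes, or None if the two
    tiles do not fit in the buffer together.

    Hypothetical policy: the output gets exactly what it needs and all
    remaining space goes to the input side (room for prefetching).
    """
    if input_bytes < 0 or output_bytes < 0:
        raise ValueError("tile sizes must be non-negative")
    if input_bytes + output_bytes > total:
        return None  # the layer would have to be tiled further
    return total - output_bytes, output_bytes


if __name__ == "__main__":
    # A 64 KB input tile and a 32 KB output tile fit together.
    print(split_io_buffer(64 * 1024, 32 * 1024))
    # A 56 x 56 x 64 tile at one byte per value (an assumed data width)
    # is larger than the whole buffer, so no split exists.
    print(split_io_buffer(56 * 56 * 64, 1024))
```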
# Performance evaluation
We measure the performance and the FPGA resource consumption on an MZ7100 board, which includes a Zynq 7100 FPGA chip. The NPU kernel runs at 100 MHz and can be optimized to run at up to 200 MHz. The measured frame rate (fps) on ImageNet is shown in Table 2, and the total FPGA resource overhead is presented in Table 3.
Table 2: Measured frame rate at two NPU clock frequencies.

| Neural network model | Fps (100 MHz) | Fps (200 MHz) |
|---|---|---|
| ResNet18 | 5 | 10 |
| YOLO v2 | 2.5 | 4.5 |
| MTCNN+Facenet | 2 | 4.2 |
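The frame rates above can be turned into per-frame latency and into the speedup gained from doubling the clock. The short script below does that arithmetic on the table values. The cycles-per-frame figure assumes that frame time is governed entirely by the NPU kernel clock, which is a simplification, and the helper names are illustrative.

```python
# Frames per second copied from the table above:
# model -> (fps at 100 MHz, fps at 200 MHz).
MEASURED_FPS = {
    "ResNet18": (5.0, 10.0),
    "YOLO v2": (2.5, 4.5),
    "MTCNN+Facenet": (2.0, 4.2),
}


def frame_latency_ms(fps):
    """Time spent on one frame, in milliseconds."""
    return 1000.0 / fps


def cycles_per_frame(fps, clock_hz):
    """Clock cycles per frame, assuming the kernel clock sets the pace."""
    return clock_hz / fps


def summarize(table=MEASURED_FPS):
    rows = {}
    for model, (fps_100, fps_200) in table.items():
        rows[model] = {
            "latency_ms_100": frame_latency_ms(fps_100),
            "latency_ms_200": frame_latency_ms(fps_200),
            "mcycles_100": cycles_per_frame(fps_100, 100e6) / 1e6,
            "speedup": fps_200 / fps_100,
        }
    return rows


if __name__ == "__main__":
    for model, row in summarize().items():
        print(model, row)
```

With the table values, the speedup from 100 MHz to 200 MHz works out to 2.0 for ResNet18, 1.8 for YOLO v2, and 2.1 for MTCNN+Facenet. The published frame rates are rounded, so these ratios are approximate.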
# Demo video
We also present two application videos in which DeepBurning is used to generate the acceleration solution on the MZ7100 board.
Object detection: The input images are captured by a camera and processed on the NPU deployed on the FPGA. The images are selected from ImageNet and displayed on a screen by another computer.
DCGAN-based face generation: The faces are generated with DCGAN, a typical generative neural network.
Prof. Ying Wang (firstname.lastname@example.org)