XpulpNN: Enabling Energy Efficient and Flexible Inference of Quantized Neural Networks on RISC-V based IoT End Nodes
ABSTRACT:
Strongly quantized fixed-point arithmetic is now considered a well-established solution to deploy Convolutional Neural Networks (CNNs) on limited-memory, low-power IoT end-nodes. Such a trend is challenging due to the lack of support for low-bitwidth fixed-point instructions in the Instruction Set Architecture (ISA) of state-of-the-art embedded Microcontrollers (MCUs), which are mainly based on closed ISAs such as ARM Thumb-2 and the associated Helium extensions. Emerging open-source ISAs such as RISC-V provide a flexible way to address this challenge. This work introduces lightweight extensions to the RISC-V ISA to boost the efficiency of heavily Quantized Neural Network (QNN) inference on microcontroller-class cores. By extending the ISA with nibble (4-bit) and crumb (2-bit) SIMD instructions, we show near-linear speedup with respect to higher-precision integer computation on the key kernels of QNN computation. We also propose a custom execution paradigm for SIMD sum-of-dot-product operations, which fuses a dot product with a load operation and achieves up to a 1.64× improvement in peak MAC/cycle compared to a standard execution scenario. To further push efficiency, we integrate the extended RISC-V core in a parallel cluster of 8 processors, obtaining near-linear improvement with respect to a single-core architecture. To evaluate the proposed extensions, we fully implement the cluster of processors in GF22FDX technology. QNN convolution kernels on a parallel cluster implementing the proposed extensions run 6× and 8× faster when considering 4- and 2-bit data operands, respectively, compared to a baseline processing cluster supporting only 8-bit SIMD instructions. With a peak of 2.22 TOPS/W, the proposed solution achieves efficiency levels comparable with dedicated DNN inference accelerators, and up to three orders of magnitude better than state-of-the-art ARM Cortex-M based microcontroller systems such as the low-end STM32L4 MCU and the high-end STM32H7 MCU.
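The near-linear speedup at lower precision follows from how many multiply-accumulate (MAC) operations fit into one 32-bit SIMD operation: 4 for byte, 8 for nibble, and 16 for crumb operands. The C sketch below is only a hedged software model of the nibble sum-of-dot-product arithmetic; the function name and the explicit loop are illustrative, whereas in XpulpNN the equivalent computation is performed by a single SIMD instruction on packed register operands.

#include <stdint.h>

/* Illustrative software model (not the actual XpulpNN instruction) of a
 * nibble (4-bit) SIMD sum-of-dot-product: eight signed nibbles are packed
 * into each 32-bit word, so one operation performs 8 MACs, versus 4 MACs
 * for an 8-bit sdotp and 16 for the crumb (2-bit) variant. */
static inline int32_t sdotp_nibble(uint32_t a, uint32_t b, int32_t acc)
{
    for (int i = 0; i < 8; i++) {
        int32_t ai = (int32_t)((a >> (4 * i)) & 0xF);
        int32_t bi = (int32_t)((b >> (4 * i)) & 0xF);
        ai = (ai ^ 0x8) - 0x8;   /* sign-extend from 4 bits */
        bi = (bi ^ 0x8) - 0x8;
        acc += ai * bi;          /* one MAC per packed element */
    }
    return acc;
}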
EXISTING SYSTEM:
• The cloud-edge continuum computing paradigm relies on the possibility of local processing at the edge of the IoT whenever it is convenient for reasons of energy efficiency, reliability, or data security.
• Vector-oriented hardware acceleration has gained renewed interest to support artificial intelligence (AI) applications such as convolutional networks or classification algorithms.
• We implemented such an exploration addressing the execution of the VGG-16 deep convolutional neural network inference, widely known for its image recognition performance as well as for its high computing power and storage demands.
• The VGG-16 execution is composed of consecutive layers with different computational characteristics.
DISADVANTAGE:
• In this work, we tackle this problem by proposing a set of lightweight domain-specific extensions to the RISC-V ISA, namely XpulpNN, targeting specifically the computing requirements of low-bitwidth QNNs, with support for sub-byte SIMD operations (8-, 4-, and 2-bit).
• This reduces the memory traffic, allows a higher degree of flexibility for data reuse (we are not limited by the compiler scheduler in how long we can keep an operand in the GP-RF), and solves the problem of using two different registers to encode the same address.
• The drawback of the nnsdotp is that the encoding of the new instruction is more complex.
• Anytime the C&U instruction is issued in the EX stage, the Dotp-Unit fetches its first operand (the weight element, in the case of the PULP-NN MatMul) from the NN-RF; see the sketch after this list.
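As a hedged illustration of the fused execution described above, the function below models one nnsdotp-style step in C, reusing the sdotp_nibble model from the earlier sketch. The fusion of the memory access with the dot product is what the hardware performs in a single issued instruction; here it is only emulated.

/* Hypothetical functional model of one fused "dot-product + load" step:
 * a single issued instruction accumulates the packed products of the
 * current activation word with a weight word held in the dedicated
 * NN register file (NN-RF), and at the same time loads the next packed
 * activation word with pointer post-increment, so the MatMul inner loop
 * spends no separate instructions on the activation load and the
 * address update. */
static inline int32_t nnsdotp_step(int32_t acc,
                                   uint32_t act,            /* current activation word       */
                                   uint32_t w_nnrf,         /* weight operand from NN-RF     */
                                   const uint32_t **a_ptr,  /* activation stream pointer     */
                                   uint32_t *next_act)      /* next word, loaded in parallel */
{
    acc = sdotp_nibble(act, w_nnrf, acc);  /* 8 MACs on 4-bit operands        */
    *next_act = **a_ptr;                   /* load fused with the dot product */
    (*a_ptr)++;                            /* pointer post-increment          */
    return acc;
}

In a baseline core the same work requires at least an explicit load, an address update, and an sdotp per activation word, which is where the up to 1.64× MAC/cycle improvement reported in the abstract comes from.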
PROPOSED SYSTEM:
• We present a comprehensive investigation of the performance and power efficiency achievable by configurable vector acceleration subsystems, providing evidence of both the high potential of the proposed microarchitecture and the advantage of hardware customization in total transparency to the software program.
• The coprocessor architecture proposed in this work is general-purpose in nature, being based on vector operations, and can be tailored to support a given computation kernel in the most efficient way.
• As opposed to this view, the proposed hardware architecture study is independent of technology assumptions, such as the supply voltage, and addresses any physical implementation, particularly soft-cores on commercial FPGA devices, with a view to exploiting application-driven configurability.
ADVANTAGE:
• We integrate the extended core in an eight-core parallel ultra-low-power (PULP) computing cluster, showing that the performance of QNN kernels improves almost linearly with respect to single-core execution; a parallelization sketch is shown after this list.
• Dedicated accelerators are top-in-class in terms of performance and energy efficiency on QNN workloads.
• The high performance and energy efficiency achieved by these accelerators are counterbalanced by their poor flexibility, which makes the end-to-end deployment of real-sized DNNs harder.
• Thus, the reduction of numerical precision for CNN models plays a key role in achieving good performance and energy efficiency.
• These few extra instructions do not affect performance, since they lie outside the critical loop.
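As a rough illustration of the almost-linear multi-core scaling mentioned above, the sketch below splits the output channels of a QNN layer across the eight cluster cores. get_core_id() and barrier() are assumed placeholder names, not the actual PULP runtime API, and the loop body stands in for the sub-byte MatMul kernel.

extern int  get_core_id(void);   /* placeholder: returns the core index 0..7  */
extern void barrier(void);       /* placeholder: cluster-wide synchronization */

#define NUM_CORES 8

/* Each core processes a contiguous chunk of output channels, so the work,
 * and hence the speedup, scales almost linearly with the number of cores. */
void qnn_layer_parallel(int out_channels)
{
    int core_id = get_core_id();
    int chunk   = (out_channels + NUM_CORES - 1) / NUM_CORES;
    int start   = core_id * chunk;
    int stop    = (start + chunk > out_channels) ? out_channels : start + chunk;

    for (int ch = start; ch < stop; ch++) {
        /* run the sub-byte (4-/2-bit) MatMul kernel for output channel ch */
    }
    barrier();   /* wait for all cores before moving to the next layer */
}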