DORY: Automatic End-to-End Deployment of Real-World DNNs on Low-Cost IoT MCUs

Abstract

The deployment of Deep Neural Networks (DNNs) on end-nodes at the extreme edge of the Internet-of-Things is a critical enabler of pervasive Deep Learning-enhanced applications. Low-cost MCU-based end-nodes have limited on-chip memory and often replace caches with scratchpads to reduce area overhead and increase energy efficiency, which requires explicit DMA-based memory transfers between different levels of the memory hierarchy. Mapping modern DNNs on these systems requires aggressive topology-dependent tiling and double-buffering. In this work, we propose DORY (Deployment Oriented to memoRY), an automatic tool to deploy DNNs on low-cost MCUs with typically less than 1 MB of on-chip SRAM. DORY abstracts tiling as a Constraint Programming (CP) problem: it maximizes L1 memory utilization under the topological constraints imposed by each DNN layer. Then, it generates ANSI C code to orchestrate off- and on-chip transfers and computation phases. Furthermore, to maximize speed, DORY augments the CP formulation with heuristics promoting performance-effective tile sizes. As a case study for DORY, we target GreenWaves Technologies GAP-8, one of the most advanced parallel ultra-low power MCU-class devices on the market. On this device, DORY achieves up to 2.5× better MAC/cycle than the GreenWaves proprietary software solution and 18.1× better than the state-of-the-art result on an STM32-H743 MCU on single layers. Using our tool, GAP-8 can perform end-to-end inference of a 1.0-MobileNet-128 network consuming just 63 pJ/MAC on average at 4.3 fps, 15.4× better than an STM32-H743. We release all our developments (the DORY framework, the optimized backend kernels, and the related heuristics) as open-source software.
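The tiling idea described in the abstract can be illustrated with a minimal sketch. It is not DORY's actual solver: the exhaustive search stands in for the CP engine, and the tensor layout (8-bit data, tiling only over output channels and output rows), the function names, and the "multiple of 4 output channels" bonus are all assumptions made for illustration. The sketch picks the tile of a convolutional layer that uses the most L1 memory while leaving room for double-buffering.

```python
from itertools import product


def tile_bytes(c_in, w_in, w_out, k, stride, c_out_t, h_out_t):
    """Bytes needed in L1 for one tile, assuming 8-bit tensors (hypothetical layout)."""
    # Input rows required to produce h_out_t output rows of a k x k convolution.
    h_in_t = (h_out_t - 1) * stride + k
    in_bytes = c_in * h_in_t * w_in          # full input channels, full width
    out_bytes = c_out_t * h_out_t * w_out    # tiled output channels and rows
    weight_bytes = c_out_t * c_in * k * k    # weights for the tiled output channels
    return in_bytes + out_bytes + weight_bytes


def best_tile(c_in, c_out, h_out, w_in, w_out, k, stride, l1_budget):
    """Exhaustively search tile sizes; return (c_out_t, h_out_t, used_bytes) or None."""
    best = None
    best_score = -1
    for c_out_t, h_out_t in product(range(1, c_out + 1), range(1, h_out + 1)):
        # Double-buffering keeps two copies of every tile buffer in L1.
        used = 2 * tile_bytes(c_in, w_in, w_out, k, stride, c_out_t, h_out_t)
        if used > l1_budget:
            continue  # memory constraint violated
        # Assumed heuristic: reward output-channel tiles that are multiples of 4.
        bonus = l1_budget // 10 if c_out_t % 4 == 0 else 0
        score = used + bonus
        if score > best_score:
            best_score = score
            best = (c_out_t, h_out_t, used)
    return best


if __name__ == "__main__":
    # Hypothetical 3x3 layer: 32 -> 64 channels, 32x32 output, 64 kB of L1.
    print(best_tile(c_in=32, c_out=64, h_out=32, w_in=34, w_out=32,
                    k=3, stride=1, l1_budget=64 * 1024))
```

A real constraint solver would replace the nested loop and could handle more tiling dimensions; the brute-force version is only meant to make the objective and the memory constraint concrete.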

Existing System

• Artificial intelligence-powered pocket-sized air robots have the potential to revolutionize the Internet-of-Things ecosystem, acting as autonomous, unobtrusive, and ubiquitous smart sensors.
• Nano-sized UAVs, with a sub-ten-centimeter form factor and a few tens of grams in weight, have the potential to enable exciting use cases that are out of reach for bulkier aircraft.
• We present our dataset augmentation methodology, which maximizes the model’s generalization capability with synthetic pitch, photometric, optical, and geometric enhancements.
• Moving to nano-scale UAVs, we focus on those that employ novel deep learning-based algorithms.

Disadvantages

• We also investigate the impact of memory dimensions on the network execution time.
• The GAP-8 “cluster” is composed of eight 4-stage, in-order, single-issue pipeline RI5CY [38] cores, implementing the RISC-V RV32IMCXpulpV2 Instruction Set Architecture (ISA).
• New tools such as TFLite Micro and the Larq Compute Engine (LCE) offer a model-agnostic deployment framework and overcome these problems.
• The Solver relies on a 2-step engine, which solves the L3-L2 constrained tiling problem first and the L2-L1 one afterwards.
• Therefore, the load of the internal tiles and the asynchronous I/O DMA load of the following layer’s weights often impact performance.
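The effect of DMA loads on execution time, mentioned in the last point above, can be made concrete with a simplified timing model. This is a sketch under stated assumptions, not a description of the generated code: the function names and the per-tile cycle counts are hypothetical, output write-back is ignored, and a single DMA channel is assumed to load tile i+1 while tile i is being computed.

```python
def serial_cycles(dma, compute):
    """Total cycles when each tile is loaded and then computed, with no overlap."""
    return sum(d + c for d, c in zip(dma, compute))


def double_buffered_cycles(dma, compute):
    """Total cycles when the load of tile i+1 overlaps the computation of tile i."""
    if not dma:
        return 0
    total = dma[0]  # the first load cannot be hidden behind any computation
    for i in range(len(compute)):
        next_load = dma[i + 1] if i + 1 < len(dma) else 0
        # Each step lasts as long as the slower of the two overlapped activities.
        total += max(compute[i], next_load)
    return total


if __name__ == "__main__":
    # Hypothetical per-tile costs in cycles.
    compute_bound = ([100, 100, 100], [300, 300, 300])
    transfer_bound = ([300, 300, 300], [100, 100, 100])
    for dma, compute in (compute_bound, transfer_bound):
        print(serial_cycles(dma, compute), double_buffered_cycles(dma, compute))
```

In the compute-bound case every load after the first is hidden; in the transfer-bound case the DMA time dominates and overlapping helps far less, which is why tile sizes that balance the two matter.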

Proposed System

• Our work demonstrates that deep learning models for robotic perception, trained and deployed with the proposed methodology, can afford extreme complexity reduction.
• Prior work proposes a model-based reinforcement learning (RL) policy to control a pocket-size quadcopter.
• Parallel ultra-low power (PULP) processing is a recently proposed paradigm that is gaining industrial and academic traction in response to the heightened need for performance and energy efficiency in low-power edge devices.
• The proposed neural network is inspired by the original Proximity network, where the same task was addressed with a different ResNet-based topology and robotic platform.

Advantages

• In particular, the scarce availability of memory constitutes a real Deep Learning Memory Wall: a fundamental limitation to the maximum performance of an embedded DNN compute system.
• However, “unlocking” such a system’s theoretical performance often requires carefully managed data movement by means of cache locking or explicit DMA transfers.
• We evaluate the performance and energy efficiency of the deployed networks produced by DORY on GWT GAP-8, considering both single layers and end-to-end networks.
• The GWT AutoTiler directly tackles the data-movement and tile-sizing challenge to optimize memory access, reaching state-of-the-art performance on the execution of many networks.
