MODULARIZING DEEP LEARNING VIA PAIRWISE LEARNING WITH KERNELS

Abstract

By redefining the conventional notions of layers, we present an alternative view of finitely wide, fully trainable deep neural networks as stacked linear models in feature spaces, leading to a kernel machine interpretation. Based on this construction, we then propose a provably optimal modular learning framework for classification that does not require between-module backpropagation. This modular approach brings new insights into the label requirement of deep learning (DL). It leverages only implicit pairwise labels (weak supervision) when learning the hidden modules. When training the output module, on the other hand, it requires full supervision but achieves high label efficiency, needing as few as ten randomly selected labeled examples (one from each class) to achieve 94.88% accuracy on CIFAR-10 using a ResNet-18 backbone. Moreover, modular training enables fully modularized DL workflows, which in turn simplify the design and implementation of pipelines and improve the maintainability and reusability of models. To showcase the advantages of such a modularized workflow, we describe a simple yet reliable method for estimating the reusability of pretrained modules as well as task transferability in a transfer learning setting. At practically no computational overhead, it precisely described the task space structure of 15 binary classification tasks from CIFAR-10.

Existing System

• In the existing system, we extended their theoretical framework, explicitly established the connections between neural networks (NNs) and kernel machines (KMs), and proposed a fully modular training approach with a proven strong optimality guarantee.
• ... kernels to imitate the computations performed by infinitely wide networks in expectation. Shankar et al. [4] proposed kernels that are equivalent to expectations of finite-width random networks.
• ... proposed an a posteriori method that analyzes a trained network as modules in order to extract useful information. Most works in this direction, however, achieved only partial ...

Disadvantages

• When performance is unsatisfactory, it is practically impossible to trace the source of the problem to a particular design choice.
• Moreover, this method can be extended to measure task transferability, a central problem in transfer learning, continual/lifelong learning, and multitask learning. Unlike many existing methods, our approach requires no training and is task-agnostic, flexible, and completely data-driven.
• Underlying this practical problem is a theoretical issue that is central to many important research domains, including transfer learning, continual/lifelong learning, metalearning, and multitask learning.

Proposed System

We propose a provably optimal modular learning framework for classification that does not require between-module backpropagation. This modular approach brings new insights into the label requirement of deep learning (DL). It leverages only implicit pairwise labels (weak supervision) when learning the hidden modules. When training the output module, on the other hand, it requires full supervision but achieves high label efficiency, needing as few as ten randomly selected labeled examples (one from each class) to achieve 94.88% accuracy on CIFAR-10 using a ResNet-18 backbone.
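To make the idea of implicit pairwise labels concrete, the following is a minimal sketch of how same-class/different-class information could be derived from a batch and used to score a hidden representation. The cosine kernel, the alignment score, and all function names here are illustrative assumptions; the source does not specify the exact objective used for the hidden modules.

```python
import numpy as np


def pairwise_labels(y):
    """Return an (n, n) matrix: +1 where two examples share a class, -1 otherwise.

    Only the same/different relation is kept; the class identities are discarded.
    """
    y = np.asarray(y)
    same = y[:, None] == y[None, :]
    return np.where(same, 1.0, -1.0)


def cosine_kernel(features):
    """Cosine-similarity kernel matrix of row-wise feature vectors (assumed kernel)."""
    f = np.asarray(features, dtype=float)
    norms = np.linalg.norm(f, axis=1, keepdims=True)
    f = f / np.clip(norms, 1e-12, None)  # guard against zero-norm rows
    return f @ f.T


def kernel_alignment(k, target):
    """Normalized Frobenius inner product between two matrices, in [-1, 1]."""
    return float(np.sum(k * target) / (np.linalg.norm(k) * np.linalg.norm(target)))


def pairwise_score(features, y):
    """Score a hidden representation using pairwise labels only (hypothetical helper)."""
    return kernel_alignment(cosine_kernel(features), pairwise_labels(y))


if __name__ == "__main__":
    y = [0, 0, 1, 1]
    separated = [[1.0, 0.0], [1.0, 0.1], [-1.0, 0.0], [-1.0, 0.1]]
    mixed = [[1.0, 0.0], [-1.0, 0.0], [1.0, 0.0], [-1.0, 0.0]]
    print("separated:", pairwise_score(separated, y))
    print("mixed:", pairwise_score(mixed, y))
```

Under these assumptions, a representation that maps same-class examples close together and different-class examples far apart receives a higher score. Because the score needs no gradient steps, the same kind of quantity could in principle be computed on the outputs of a frozen pretrained module to compare candidates for reuse, though the source does not confirm that this is the estimator it uses.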

Advantages

• This definition is often used as an identity and is referred to as the “kernel trick” in some more modern texts.
• Along this line of research, some more recent methods include H-score [51] and the Bregman-correntropy conditional divergence [52], the latter of which uses the correntropy functional [53] and the Bregman matrix divergence [54] to quantify divergence between mappings.
• To facilitate fair comparisons, end-to-end and modular training operate on the same backbone network. For all results, we used stochastic gradient descent as the optimizer with batch size 128. For each module in the modular method as well as the end-to-end baseline, we trained with annealing ...
• This type of result is reminiscent of the “ideal bounds” for representati...
• The momentum was set to 0.9 throughout. For data preprocessing, we used simple mean subtraction followed by division by standard deviation.
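The preprocessing and optimizer settings listed above can be sketched as follows. This is an illustration under stated assumptions: per-channel statistics, the momentum update convention (velocity accumulates the gradient, then the parameter moves by the learning rate times the velocity), and the learning rate value are not given in the source, and the function names are hypothetical.

```python
import numpy as np

BATCH_SIZE = 128  # from the source
MOMENTUM = 0.9    # from the source


def standardize(images, mean=None, std=None):
    """Mean subtraction followed by division by standard deviation.

    images: array of shape (n, height, width, channels).
    Statistics are computed per channel (an assumption) unless supplied,
    so that training-set statistics can be reused on test data.
    """
    x = np.asarray(images, dtype=np.float64)
    if mean is None:
        mean = x.mean(axis=(0, 1, 2))
    if std is None:
        std = x.std(axis=(0, 1, 2))
    return (x - mean) / np.maximum(std, 1e-8), mean, std


def sgd_momentum_step(param, grad, velocity, lr, momentum=MOMENTUM):
    """One SGD-with-momentum update (assumed convention: v = m*v + g; p = p - lr*v)."""
    velocity = momentum * velocity + grad
    return param - lr * velocity, velocity


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    batch = rng.uniform(0, 255, size=(BATCH_SIZE, 32, 32, 3))
    normalized, mean, std = standardize(batch)
    print(normalized.mean(axis=(0, 1, 2)), normalized.std(axis=(0, 1, 2)))

    # lr=0.1 is a placeholder; the annealing schedule is truncated in the source.
    p, v = sgd_momentum_step(param=1.0, grad=2.0, velocity=0.0, lr=0.1)
    print(p, v)
```

Returning the mean and standard deviation alongside the normalized data lets the same statistics be applied to held-out images, which keeps the end-to-end baseline and the modular method on identical inputs.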
