Efficient Neural Architecture Search via Parameter Sharing
Abstract
In ENAS, a controller discovers neural network architectures by searching for an optimal subgraph within a large computational graph
The controller is trained with policy gradient (PG) to select a subgraph that maximizes the expected reward on a validation set
Sharing parameters among child models lets ENAS deliver strong performance while using far fewer GPU-hours than existing automatic model design approaches
Introduction
In NAS, an RNN controller is trained in a loop: the controller first samples a candidate architecture, i.e. a child model, and then trains it to convergence to measure its performance on the task of interest. The controller then uses the performance as a guiding signal to find more promising architectures
NAS uses 450 GPUs for 3-4 days
ENAS forces all child models to share weights, eschewing training each child model from scratch to convergence
ENAS needs a single Nvidia GTX 1080Ti GPU and less than 16 hours
Method
All architectures that NAS ends up iterating over can be viewed as sub-graphs of a larger graph
Representing NAS’s search space using a single DAG
An architecture can be realized by taking a subgraph of the DAG; a minimal sampling sketch follows the list below
- nodes represent the local computations
- edges represent the flow of information
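A minimal sketch of this subgraph view, under my own toy encoding (not the paper's data structures): each node picks one incoming edge from an earlier node and one local computation, and the resulting list of decisions is one architecture. The operation names are simply the four activations mentioned later in these notes.

```python
# Toy encoding of an ENAS-style subgraph: node i picks one earlier node to read
# from (an edge of the DAG) and one local computation to apply at that node.
import random

OPS = ["tanh", "relu", "identity", "sigmoid"]   # candidate local computations

def sample_subgraph(num_nodes):
    """Return [(source_node, op), ...]; node 0 is the cell input."""
    arch = []
    for i in range(1, num_nodes + 1):
        source = random.randrange(i)     # edge: flow of information from an earlier node
        op = random.choice(OPS)          # node: the local computation
        arch.append((source, op))
    return arch

print(sample_subgraph(4))   # e.g. [(0, 'relu'), (1, 'tanh'), (0, 'identity'), (2, 'relu')]
```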
Designing Recurrent Cells
The discussion of ENAS begins with an example that illustrates how to design a recurrent cell from a specified DAG and a controller
ENAS’s controller is an RNN that decides:
- which edges are activated
- which computations are performed at each node in the DAG
NAS fixes the topology of its architectures as a binary tree and only learns the operations at each node of the tree
ENAS designs both the topology and the operations in the RNN cells
For each pair of nodes $j < \ell$, there is an independent parameter matrix $W_{\ell,j}^{(h)}$
All recurrent cells in the search space share the same set of parameters
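A minimal sketch of that sharing idea, heavily simplified (no input embedding, no gating, and only the last node is read out, none of which matches the real cell exactly): a single bank holds one matrix per possible edge $j \to \ell$, and every sampled cell indexes into the same bank, so no child model owns private weights.

```python
# Shared parameter bank for recurrent cells: W[l][j] plays the role of
# W_{l,j}^{(h)} and is reused by every sampled cell that connects node j to node l.
import numpy as np

rng = np.random.default_rng(0)
N, H = 4, 16                      # nodes in the cell, hidden size (toy values)
ACT = {"tanh": np.tanh,
       "relu": lambda x: np.maximum(x, 0),
       "identity": lambda x: x,
       "sigmoid": lambda x: 1 / (1 + np.exp(-x))}

W = {l: {j: rng.normal(scale=0.1, size=(H, H)) for j in range(l)}
     for l in range(1, N + 1)}

def run_cell(arch, h0):
    """arch[l-1] = (j, op): node l reads node j's output and applies activation op."""
    h = {0: h0}
    for l, (j, op) in enumerate(arch, start=1):
        h[l] = ACT[op](W[l][j] @ h[j])
    return h[len(arch)]           # simplified readout; the real cell averages unused nodes

h0 = rng.normal(size=H)
print(run_cell([(0, "relu"), (1, "tanh"), (0, "sigmoid"), (2, "identity")], h0)[:4])
```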
The search space includes an exponential number of configurations: with $N$ nodes and 4 activation functions (tanh, ReLU, identity, sigmoid), the search space has $4^N \times N!$ configurations
(Each node chooses one of the 4 activation functions, and the $N$ nodes choose independently, giving $4^N$ combinations. Node 1 can only connect to the input; node 2 chooses between the input and node 1; node 3 chooses among the input and nodes 1-2; and so on, giving $N!$ possible connection patterns. In total this yields $4^N \times N!$ distinct DAGs.)
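As a quick sanity check of the count, assuming $N = 12$ nodes, which is the value I recall the paper using for Penn Treebank:

```python
# Recurrent-cell search-space size for an assumed N = 12 nodes.
import math

N = 12
print(f"{float(4**N * math.factorial(N)):.2e}")   # ~8.04e+15, on the order of 10^15 cells
```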
Training ENAS and Deriving Architectures
The controller network is an LSTM with 100 hidden units
There are two sets of trainable parameters, learned in two interleaving phases:
- First phase: trains the shared parameters of the child models, denoted by $\omega$
- Second phase: trains the parameters of the controller LSTM, denoted by $\theta$
Training the shared parameters $\omega$ of the child models
Fix the controller’s policy $\pi(m;\theta)$ and perform SGD on $\omega$ to minimize the expected loss $\mathbb{E}_{m\sim\pi(m;\theta)}[\mathcal{L}(m;\omega)]$
M=1 works fine, i.e. we can update $\omega$ using the gradients from any single model m sampled from $\pi(m;\theta)$
Train $\omega$ during an entire pass through the training data
Training the controller parameters $\theta$
Fix $\omega$ and update $\theta$ with REINFORCE, aiming to maximize the expected reward $\mathbb{E}_{m\sim\pi(m;\theta)}[R(m,\omega)]$, where the reward is computed on the validation set
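A minimal, runnable sketch of the interleaved training loop, assuming a toy stand-in search space: the "architectures" are just K candidate rows of a shared weight bank `omega` applied to synthetic regression data, and the controller is a plain softmax distribution over them rather than an LSTM. None of these specifics come from the paper; the sketch only mirrors the two alternating phases.

```python
# Toy ENAS-style alternation: phase 1 trains shared weights omega on one sampled
# architecture (M = 1); phase 2 updates controller logits theta with REINFORCE.
import numpy as np

rng = np.random.default_rng(0)
K, D = 4, 8                               # candidate "architectures", input dim
X = rng.normal(size=(256, D))             # toy training inputs
y = X @ rng.normal(size=D)                # toy regression targets

omega = rng.normal(size=(K, D)) * 0.1     # shared child-model parameters
theta = np.zeros(K)                       # controller parameters (softmax logits)
baseline = 0.0                            # moving-average baseline for REINFORCE

def sample_arch():
    p = np.exp(theta - theta.max())
    p = p / p.sum()
    return rng.choice(K, p=p), p

for step in range(500):
    # Phase 1: fix the controller, update omega by SGD on one sampled model.
    m, _ = sample_arch()
    pred = X @ omega[m]
    grad = 2 * X.T @ (pred - y) / len(X)
    omega[m] -= 0.05 * grad

    # Phase 2: fix omega, update theta with REINFORCE.
    m, p = sample_arch()
    reward = -np.mean((X @ omega[m] - y) ** 2)   # validation reward stand-in (toy data)
    baseline = 0.95 * baseline + 0.05 * reward
    grad_logp = -p
    grad_logp[m] += 1.0                          # d log p(m) / d theta
    theta += 0.1 * (reward - baseline) * grad_logp

print("controller probabilities:", np.round(np.exp(theta) / np.exp(theta).sum(), 3))
```

The only point of the sketch is the alternation: $\omega$ is updated with the gradient from a single sampled architecture, while $\theta$ is updated with a REINFORCE estimate against a moving-average baseline.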
Deriving Architecture
We first sample several models from the trained policy $\pi(m;\theta)$. For each sampled model, we compute its reward on a single minibatch sampled from the validation set. We then take only the model with the highest reward and re-train it from scratch.
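Continuing the toy names from the training sketch above (so this is only illustrative, not the paper's code), the derivation step might look like:

```python
# Sample several architectures from the trained controller, score each on a
# single "validation" minibatch (here just a slice of the toy data), and keep
# the best one; in ENAS this winner is then re-trained from scratch.
candidates = [sample_arch()[0] for _ in range(10)]
rewards = [-np.mean((X[:32] @ omega[m] - y[:32]) ** 2) for m in candidates]
best = candidates[int(np.argmax(rewards))]
print("architecture chosen for re-training from scratch:", best)
```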
Designing Convolutional Networks
At each decision block, the controller decides:
- what previous nodes to connect to
- what computation operation to use
6 operations are available to the controller:
- convolutions with filter sizes 3×3 and 5×5
- depthwise-separable convolutions with filter sizes 3×3 and 5×5
- max pooling and average pooling with kernel size 3×3
If we sample a network of $L$ layers, there are $6^L \times 2^{L(L-1)/2}$ networks in the search space
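A minimal sketch of where this count comes from, using my own encoding and operation labels; the depth $L = 12$ is the value I recall the paper using for the CIFAR-10 macro search, so treat it as an assumption. Each layer independently picks one of the 6 operations and a binary skip-connection decision for every earlier layer.

```python
# Sampling one macro architecture: 6 operation choices per layer, and 2^i
# possible skip-connection patterns at layer i, hence 6^L * 2^(L(L-1)/2) in total.
import random

OPS = ["conv3x3", "conv5x5", "sep_conv3x3", "sep_conv5x5", "max_pool3x3", "avg_pool3x3"]

def sample_macro(L):
    layers = []
    for i in range(L):
        op = random.choice(OPS)                            # 6 choices per layer
        skips = [random.random() < 0.5 for _ in range(i)]  # 2^i skip patterns
        layers.append((op, skips))
    return layers

L = 12
print(sample_macro(4))
print(f"search-space size: {float(6**L * 2**(L * (L - 1) // 2)):.1e}")  # ~1.6e+29
```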
Designing Convolutional Cells
Rather than designing the entire convolutional network, one can design smaller modules and then connect them together to form a network
Reduction cell:
- sample a computational graph from the search space
- apply all operations with a stride of 2
A reduction cell thus reduces the spatial dimensions of its input by a factor of 2
(Here, 5 denotes the 5 available operations (identity, 3×3 depthwise-separable convolution, 5×5 depthwise-separable convolution, 3×3 max pooling, 3×3 average pooling). Following the earlier reasoning: the 2 input nodes are fixed, and the remaining $B-2$ nodes admit $(B-2)!$ connection patterns, giving $5 \times (B-2)!$ possibilities; since every node in the block takes 2 inputs, chosen independently, this yields $(5 \times (B-2)!)^2$ structures for the ordinary convolutional cell. The reduction cell, i.e. the same block with stride-2 operations, likewise contributes $(5 \times (B-2)!)^2$ structures, so the final search space contains $(5 \times (B-2)!)^4$ network structures.)
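Plugging in $B = 7$ nodes per cell (the value I recall from the paper's experiments, so an assumption here) reproduces a figure of roughly $1.3 \times 10^{11}$:

```python
# Cell search-space size for an assumed B = 7 nodes per cell.
import math

B = 7
print(f"{float((5 * math.factorial(B - 2)) ** 4):.2e}")   # ~1.30e+11 configurations
```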
See also: https://blog.csdn.net/qq_14845119/article/details/84070640
Experiments
Training details for Penn Treebank
Augment the simple transformations between nodes in the constructed recurrent cell with highway connections, e.g. gated elementwise multiplication
Using a large learning rate whilst clipping the gradient norm at a small threshold makes the updates on $\omega$ more stable
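A minimal PyTorch sketch of that recipe; the learning rate 20.0 and clipping threshold 0.25 are illustrative values, not necessarily the ones used in the paper, and the model and data are placeholders.

```python
# Large learning rate + small gradient-norm clip on a toy LSTM step.
import torch

model = torch.nn.LSTM(input_size=32, hidden_size=32)
opt = torch.optim.SGD(model.parameters(), lr=20.0)     # deliberately large lr

x = torch.randn(10, 4, 32)                             # (seq_len, batch, input_size)
out, _ = model(x)
loss = out.pow(2).mean()                               # placeholder loss
loss.backward()
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=0.25)  # clip before stepping
opt.step()
```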
Training details for CIFAR-10
The importance of ENAS
Related work and discussion
Conclusion
See also:
https://www.zybuluo.com/Team/note/1445001
https://towardsdatascience.com/illustrated-efficient-neural-architecture-search-5f7387f9fb6