Efficient Neural Architecture Search via Parameter Sharing
Abstract
In ENAS, a controller discovers neural network architectures by searching for an optimal subgraph within a large computational graph
The controller is trained with policy gradient (PG) to select a subgraph that maximizes the expected reward on a validation set
Sharing parameters among child models lets ENAS deliver strong performance while using far fewer GPU-hours than existing automatic model design approaches
Introduction
In NAS, an RNN controller is trained in a loop: the controller first samples a candidate architecture, i.e. a child model, and then trains it to convergence to measure its performance on the task of interest. The controller then uses the performance as a guiding signal to find more promising architectures
NAS uses 450 GPUs for 3-4 days
ENAS forces all child models to share weights, eschewing training each child model from scratch to convergence
ENAS needs a single Nvidia GTX 1080Ti GPU and less than 16 hours
Method
All architectures that NAS ends up iterating over can be viewed as sub-graphs of a larger graph
Representing NAS’s search space using a single DAG
An architecture can be realized by taking a subgraph of the DAG; a minimal sampling sketch follows the list below
- nodes represent the local computations
- edges represent the flow of information
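A minimal sketch of this subgraph view, under my own toy encoding (not the paper's data structures): each node picks one incoming edge from an earlier node and one local computation, and the resulting list of decisions is one architecture. The operation names are simply the four activations mentioned later in these notes.

```python
# Toy encoding of an ENAS-style subgraph: node i picks one earlier node to read
# from (an edge of the DAG) and one local computation to apply at that node.
import random

OPS = ["tanh", "relu", "identity", "sigmoid"]   # candidate local computations

def sample_subgraph(num_nodes):
    """Return [(source_node, op), ...]; node 0 is the cell input."""
    arch = []
    for i in range(1, num_nodes + 1):
        source = random.randrange(i)     # edge: flow of information from an earlier node
        op = random.choice(OPS)          # node: the local computation
        arch.append((source, op))
    return arch

print(sample_subgraph(4))   # e.g. [(0, 'relu'), (1, 'tanh'), (0, 'identity'), (2, 'relu')]
```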
Designing Recurrent Cells
The discussion of ENAS begins with an example that illustrates how to design a recurrent cell from a specified DAG and a controller
ENAS’s controller is an RNN that decides:
- which edges are activated
- which computations are performed at each node in the DAG
NAS fixes the topology of its architectures as a binary tree and only learns the operations at each node of the tree
ENAS designs both the topology and the operations in the RNN cells
For each pair of nodes $j < \ell$, there is an independent parameter matrix $W_{\ell,j}^{(h)}$
All recurrent cells in the search space share the same set of parameters
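A minimal sketch of that sharing idea, heavily simplified (no input embedding, no gating, and only the last node is read out, none of which matches the real cell exactly): a single bank holds one matrix per possible edge $j \to \ell$, and every sampled cell indexes into the same bank, so no child model owns private weights.

```python
# Shared parameter bank for recurrent cells: W[l][j] plays the role of
# W_{l,j}^{(h)} and is reused by every sampled cell that connects node j to node l.
import numpy as np

rng = np.random.default_rng(0)
N, H = 4, 16                      # nodes in the cell, hidden size (toy values)
ACT = {"tanh": np.tanh,
       "relu": lambda x: np.maximum(x, 0),
       "identity": lambda x: x,
       "sigmoid": lambda x: 1 / (1 + np.exp(-x))}

W = {l: {j: rng.normal(scale=0.1, size=(H, H)) for j in range(l)}
     for l in range(1, N + 1)}

def run_cell(arch, h0):
    """arch[l-1] = (j, op): node l reads node j's output and applies activation op."""
    h = {0: h0}
    for l, (j, op) in enumerate(arch, start=1):
        h[l] = ACT[op](W[l][j] @ h[j])
    return h[len(arch)]           # simplified readout; the real cell averages unused nodes

h0 = rng.normal(size=H)
print(run_cell([(0, "relu"), (1, "tanh"), (0, "sigmoid"), (2, "identity")], h0)[:4])
```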
The search space includes an exponential number of configurations: with $N$ nodes and 4 activation functions (tanh, ReLU, identity, sigmoid), the search space has $4^N \times N!$ configurations
(Each node chooses one of the 4 activation functions, and the $N$ nodes choose independently, giving $4^N$ combinations. Node 1 can only connect to the input; node 2 chooses between the input and node 1; node 3 chooses among the input and nodes 1-2; and so on, giving $N!$ possible connection patterns. In total this yields $4^N \times N!$ distinct DAGs.)
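As a quick sanity check of the count, assuming $N = 12$ nodes, which is the value I recall the paper using for Penn Treebank:

```python
# Recurrent-cell search-space size for an assumed N = 12 nodes.
import math

N = 12
print(f"{float(4**N * math.factorial(N)):.2e}")   # ~8.04e+15, on the order of 10^15 cells
```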
Training ENAS and Deriving Architectures
The controller network is an LSTM with 100 hidden units
There are two sets of trainable parameters, learned in two interleaving phases:
- First phase: trains the shared parameters of the child models, denoted by $\omega$
- Second phase: trains the parameters of the controller LSTM, denoted by $\theta$
Training the shared parameters $\omega$ of the child models
Fix the controller’s policy $\pi(m;\theta)$ and perform SGD on $\omega$ to minimize the expected loss $\mathbb{E}_{m\sim\pi(m;\theta)}[\mathcal{L}(m;\omega)]$
M=1 works fine, i.e. we can update $\omega$ using the gradients from any single model m sampled from $\pi(m;\theta)$
Train $\omega$ during an entire pass through the training data
Training the controller parameters $\theta$
Fix $\omega$ and update $\theta$ with REINFORCE, aiming to maximize the expected reward $\mathbb{E}_{m\sim\pi(m;\theta)}[R(m,\omega)]$, where the reward is computed on the validation set
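A minimal, runnable sketch of the interleaved training loop, assuming a toy stand-in search space: the "architectures" are just K candidate rows of a shared weight bank `omega` applied to synthetic regression data, and the controller is a plain softmax distribution over them rather than an LSTM. None of these specifics come from the paper; the sketch only mirrors the two alternating phases.

```python
# Toy ENAS-style alternation: phase 1 trains shared weights omega on one sampled
# architecture (M = 1); phase 2 updates controller logits theta with REINFORCE.
import numpy as np

rng = np.random.default_rng(0)
K, D = 4, 8                               # candidate "architectures", input dim
X = rng.normal(size=(256, D))             # toy training inputs
y = X @ rng.normal(size=D)                # toy regression targets

omega = rng.normal(size=(K, D)) * 0.1     # shared child-model parameters
theta = np.zeros(K)                       # controller parameters (softmax logits)
baseline = 0.0                            # moving-average baseline for REINFORCE

def sample_arch():
    p = np.exp(theta - theta.max())
    p = p / p.sum()
    return rng.choice(K, p=p), p

for step in range(500):
    # Phase 1: fix the controller, update omega by SGD on one sampled model.
    m, _ = sample_arch()
    pred = X @ omega[m]
    grad = 2 * X.T @ (pred - y) / len(X)
    omega[m] -= 0.05 * grad

    # Phase 2: fix omega, update theta with REINFORCE.
    m, p = sample_arch()
    reward = -np.mean((X @ omega[m] - y) ** 2)   # validation reward stand-in (toy data)
    baseline = 0.95 * baseline + 0.05 * reward
    grad_logp = -p
    grad_logp[m] += 1.0                          # d log p(m) / d theta
    theta += 0.1 * (reward - baseline) * grad_logp

print("controller probabilities:", np.round(np.exp(theta) / np.exp(theta).sum(), 3))
```

The only point of the sketch is the alternation: $\omega$ is updated with the gradient from a single sampled architecture, while $\theta$ is updated with a REINFORCE estimate against a moving-average baseline.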
Deriving Architecture
We first sample several models from the trained policy $\pi(m;\theta)$. For each sampled model, we compute its reward on a single minibatch sampled from the validation set. We then take only the model with the highest reward and re-train it from scratch.
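Continuing the toy names from the training sketch above (so this is only illustrative, not the paper's code), the derivation step might look like:

```python
# Sample several architectures from the trained controller, score each on a
# single "validation" minibatch (here just a slice of the toy data), and keep
# the best one; in ENAS this winner is then re-trained from scratch.
candidates = [sample_arch()[0] for _ in range(10)]
rewards = [-np.mean((X[:32] @ omega[m] - y[:32]) ** 2) for m in candidates]
best = candidates[int(np.argmax(rewards))]
print("architecture chosen for re-training from scratch:", best)
```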
Designing Convolutional Networks
At each decision block, the controller decides:
- what previous nodes to connect to
- what computation operation to use
6 operations are available to the controller:
- convolutions with filter sizes 3×3 and 5×5
- depthwise-separable convolutions with filter sizes 3×3 and 5×5
- max pooling and average pooling with kernel size 3×3
If we sample a network of $L$ layers, there are $6^L \times 2^{L(L-1)/2}$ networks in the search space
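A minimal sketch of where this count comes from, using my own encoding and operation labels; the depth $L = 12$ is the value I recall the paper using for the CIFAR-10 macro search, so treat it as an assumption. Each layer independently picks one of the 6 operations and a binary skip-connection decision for every earlier layer.

```python
# Sampling one macro architecture: 6 operation choices per layer, and 2^i
# possible skip-connection patterns at layer i, hence 6^L * 2^(L(L-1)/2) in total.
import random

OPS = ["conv3x3", "conv5x5", "sep_conv3x3", "sep_conv5x5", "max_pool3x3", "avg_pool3x3"]

def sample_macro(L):
    layers = []
    for i in range(L):
        op = random.choice(OPS)                            # 6 choices per layer
        skips = [random.random() < 0.5 for _ in range(i)]  # 2^i skip patterns
        layers.append((op, skips))
    return layers

L = 12
print(sample_macro(4))
print(f"search-space size: {float(6**L * 2**(L * (L - 1) // 2)):.1e}")  # ~1.6e+29
```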
Designing Convolutional Cells
Rather than designing the entire convolutional network, one can design smaller modules and then connect them together to form a network
Reduction cell:
- sample a computational graph from the search space
- apply all operations with a stride of 2
A reduction cell thus reduces the spatial dimensions of its input by a factor of 2
(Here, 5 denotes the 5 available operations (identity, 3×3 depthwise-separable convolution, 5×5 depthwise-separable convolution, 3×3 max pooling, 3×3 average pooling). Following the earlier reasoning: the 2 input nodes are fixed, and the remaining $B-2$ nodes admit $(B-2)!$ connection patterns, giving $5 \times (B-2)!$ possibilities; since every node in the block takes 2 inputs, chosen independently, this yields $(5 \times (B-2)!)^2$ structures for the ordinary convolutional cell. The reduction cell, i.e. the same block with stride-2 operations, likewise contributes $(5 \times (B-2)!)^2$ structures, so the final search space contains $(5 \times (B-2)!)^4$ network structures.)
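Plugging in $B = 7$ nodes per cell (the value I recall from the paper's experiments, so an assumption here) reproduces a figure of roughly $1.3 \times 10^{11}$:

```python
# Cell search-space size for an assumed B = 7 nodes per cell.
import math

B = 7
print(f"{float((5 * math.factorial(B - 2)) ** 4):.2e}")   # ~1.30e+11 configurations
```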
See also: https://blog.csdn.net/qq_14845119/article/details/84070640
Experiments
Training details for Penn Treebank
Augment the simple transformations between nodes in the constructed recurrent cell with highway connections, e.g. gated elementwise multiplication
Using a large learning rate whilst clipping the gradient norm at a small threshold makes the updates on $\omega$ more stable
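A minimal PyTorch sketch of that recipe; the learning rate 20.0 and clipping threshold 0.25 are illustrative values, not necessarily the ones used in the paper, and the model and data are placeholders.

```python
# Large learning rate + small gradient-norm clip on a toy LSTM step.
import torch

model = torch.nn.LSTM(input_size=32, hidden_size=32)
opt = torch.optim.SGD(model.parameters(), lr=20.0)     # deliberately large lr

x = torch.randn(10, 4, 32)                             # (seq_len, batch, input_size)
out, _ = model(x)
loss = out.pow(2).mean()                               # placeholder loss
loss.backward()
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=0.25)  # clip before stepping
opt.step()
```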
Training details for CIFAR-10
The importance of ENAS
Related work and discussion
Conclusion
See also:
https://www.zybuluo.com/Team/note/1445001
https://towardsdatascience.com/illustrated-efficient-neural-architecture-search-5f7387f9fb6