Neural Architecture Search with Reinforcement Learning

[TOC]

Abstract

Use a recurrent network to generate model descriptions of neural networks, and train this RNN with reinforcement learning to maximize the expected accuracy of the generated architectures on a validation set

Introduction

A gradient-based method for finding good architectures

The work is based on the observation that the structure and connectivity of a neural network can typically be specified by a variable-length string. Use a recurrent network, the controller, to generate such a string

Reward: the accuracy that the generated architecture achieves on a validation set

The controller learns to assign higher probabilities to architectures that achieve high accuracies

Hyperparameter optimization

Existing work only searches models in a fixed-length space, and often works better when supplied with a good initial model.

  • Bayesian optimization methods can search non-fixed-length architectures, but are less general and less flexible than this approach

  • Modern neuro-evolution algorithms

    less practical at a large scale

    slow or require many heuristics to work well

The controller in NAS is auto-regressive, which means it predicts hyperparameters one at a time, conditioned on previous predictions

Method learns directly from the reward signal without any supervised bootstrapping

The idea of learning to learn or meta-learning

Methods

Generate model descriptions with a controller recurrent neural network

The controller that generates the architectural hyperparameters of neural networks is an RNN

Use the controller to generate these hyperparameters as a sequence of tokens

AutoML-basic framework

Every prediction is carried out by a softmax classifier and then fed into the next time step as input

Generating an architecture stops if the number of layers exceeds a certain value

At convergence, the accuracy of the network on a held-out validation set is recorded

The parameters of the controller RNN: $\theta_c$
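
As a concrete illustration, below is a minimal sketch (not the authors' implementation) of such a controller: a small LSTM that emits one architectural hyperparameter per step through a softmax head, feeds each prediction back in as the next input, and stops after a fixed number of layers. The `Controller` class, the choice lists, and all sizes are illustrative assumptions.

```python
import torch
import torch.nn as nn
from torch.distributions import Categorical

# Hypothetical per-layer choices; the real search space is described below.
CHOICES = {
    "filter_height": [1, 3, 5, 7],
    "filter_width":  [1, 3, 5, 7],
    "num_filters":   [24, 36, 48, 64],
}

class Controller(nn.Module):
    def __init__(self, hidden=35, num_layers_to_generate=6):
        super().__init__()
        self.hidden = hidden
        self.num_layers_to_generate = num_layers_to_generate
        vocab = max(len(v) for v in CHOICES.values())
        self.embed = nn.Embedding(vocab + 1, hidden)   # +1 for a start token
        self.cell = nn.LSTMCell(hidden, hidden)
        # One softmax head per kind of hyperparameter.
        self.heads = nn.ModuleDict(
            {k: nn.Linear(hidden, len(v)) for k, v in CHOICES.items()})

    def sample(self):
        """Autoregressively sample one architecture; return tokens and total log-prob."""
        h = torch.zeros(1, self.hidden)
        c = torch.zeros(1, self.hidden)
        inp = torch.zeros(1, dtype=torch.long)          # start token
        tokens, log_probs = [], []
        for _ in range(self.num_layers_to_generate):
            for name, values in CHOICES.items():        # one prediction per time step
                h, c = self.cell(self.embed(inp), (h, c))
                dist = Categorical(logits=self.heads[name](h))
                a = dist.sample()
                tokens.append((name, values[a.item()]))
                log_probs.append(dist.log_prob(a))
                inp = a                                  # feed the prediction to the next step
        return tokens, torch.stack(log_probs).sum()
```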

Training with REINFORCE

The list of tokens is treated as a list of actions $a_{1:T}$

$R$: reward signal, the accuracy on a held-out dataset

Policy gradient objective:
$$
J(\theta_c)=E_{P(a_{1:T};\theta_c)}[R]
$$

$$
\nabla_{\theta_c} J(\theta_c)=\sum_{t=1}^T E_{P(a_{1:T};\theta_c)}\left[\nabla_{\theta_c}\log P(a_t|a_{(t-1):1};\theta_c)\,R\right]
$$

$$
\frac{1}{m}\sum_{k=1}^m\sum_{t=1}^T\nabla_{\theta_c}\log P(a_t|a_{(t-1):1};\theta_c)\,R_k
$$

m: number of different architectures sampled in one batch

T: number of hyperparameters the controller has to predict to design a neural network architecture

Baseline function: reduces the variance of the estimate while keeping it unbiased
$$
\frac{1}{m}\sum_{k=1}^m\sum_{t=1}^T\nabla_{\theta_c}\log P(a_t|a_{(t-1):1};\theta_c)(R_k-b)
$$
The baseline $b$ is an exponential moving average of the previous architecture accuracies
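
A minimal sketch (again, not the paper's code) of this REINFORCE update with the moving-average baseline, reusing the `Controller` sketch above; `get_validation_accuracy` is a hypothetical helper that builds and trains the child network described by the sampled tokens and returns its held-out accuracy.

```python
import torch

controller = Controller()
optimizer = torch.optim.Adam(controller.parameters(), lr=6e-4)
baseline, ema_decay = None, 0.95        # decay value is illustrative
m = 8                                   # architectures sampled per controller update

for step in range(1000):
    loss = 0.0
    for _ in range(m):
        tokens, sum_log_prob = controller.sample()
        # Hypothetical helper: trains the child model and returns its validation accuracy.
        R = get_validation_accuracy(tokens)
        baseline = R if baseline is None else ema_decay * baseline + (1 - ema_decay) * R
        # REINFORCE: maximize E[R]  <=>  minimize -(R - b) * sum_t log P(a_t | a_(t-1):1)
        loss = loss - (R - baseline) * sum_log_prob
    loss = loss / m
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```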

Accelerate training with parallelism and asynchronous updates

Distributed training and asynchronous parameter updates are used to speed up the learning process of the controller

Each controller replica samples m different child architectures that are trained in parallel

Each controller then computes gradients from the rewards of that minibatch of m architectures at convergence and sends them to the parameter server, which updates the weights shared across all controller replicas

Convergence of each child network is reached when its training exceeds a certain number of epochs

NAS-parameter server
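
A toy, single-machine sketch of what one controller replica does in this scheme, assuming the `Controller` and baseline from the sketches above; `train_child_until_epoch_limit` is a stub standing in for real child-network training, and in the paper the resulting gradients are sent to a shared parameter server instead of being applied locally.

```python
import random
from concurrent.futures import ThreadPoolExecutor

def train_child_until_epoch_limit(tokens, epochs=50):
    # Stub: real code would build the child network described by `tokens`,
    # train it for `epochs` epochs, and return its validation accuracy.
    return random.random()

def replica_step(controller, optimizer, baseline, m=8):
    samples = [controller.sample() for _ in range(m)]
    with ThreadPoolExecutor(max_workers=m) as pool:      # children trained in parallel
        rewards = list(pool.map(lambda s: train_child_until_epoch_limit(s[0]), samples))
    loss = sum(-(R - baseline) * logp for (_, logp), R in zip(samples, rewards)) / m
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()   # the paper would instead push this gradient to the parameter
                       # server, which updates the weights of all controller replicas
```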

Increase architecture complexity with skip connections and other layer types

Widen the search space by allowing skip connections and branching layers

Built upon a set-selection attention mechanism (anchor points)

At layer N, add an anchor point with N-1 content-based sigmoids to indicate which previous layers should be connected

Each sigmoid is a function of the current hidden state of the controller and the hidden states at the previous N-1 anchor points
$$
P(\text{layer } j \text{ is an input to layer } i)=\mathrm{sigmoid}\left(v^{T}\tanh(W_{prev}h_j+W_{curr}h_i)\right)
$$
Then sample from these sigmoids to decide which previous layers are used as inputs to the current layer

$W_{prev}, W_{curr}, v$ are trainable parameters

If one layer has many input layers, all input layers are concatenated in the depth dimension. Skip connections can cause “compilation failures”: one layer may not be compatible with another, or a layer may have no input or output. Employ three simple techniques: (1) if a layer has no incoming layers, the image is used as its input; (2) at the final layer, all layer outputs that have not been connected are concatenated before being sent to the classifier; (3) if the input layers to be concatenated have different sizes, the smaller layers are padded with zeros.
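
A small sketch of how these content-based sigmoids can be computed and sampled, following the formula above; the tensor shapes and the standalone `Linear` layers standing in for $W_{prev}$, $W_{curr}$, $v$ are illustrative assumptions.

```python
import torch

hidden = 35
W_prev = torch.nn.Linear(hidden, hidden, bias=False)
W_curr = torch.nn.Linear(hidden, hidden, bias=False)
v = torch.nn.Linear(hidden, 1, bias=False)

def sample_skip_connections(prev_anchor_states, h_current):
    """Decide, for each earlier layer j, whether it feeds the current layer i.

    prev_anchor_states: hidden states h_j at the N-1 previous anchor points (each 1 x hidden)
    h_current:          controller hidden state h_i at the current anchor point (1 x hidden)
    """
    connections = []
    for h_j in prev_anchor_states:
        p = torch.sigmoid(v(torch.tanh(W_prev(h_j) + W_curr(h_current))))
        connections.append(bool(torch.bernoulli(p)))  # sample each sigmoid independently
    return connections

# Example: decide which of 3 previous layers feed into the 4th layer.
skips = sample_skip_connections([torch.randn(1, hidden) for _ in range(3)],
                                torch.randn(1, hidden))
```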

Generate recurrent cell architectures

The computation of basic RNN and LSTM cells can be generalized as a tree of steps that takes $x_t$ and $h_{t-1}$ as inputs and produces $h_t$ as the final output

Index the nodes of the tree in an order so that the controller RNN can visit each node one by one and label the needed hyperparameters (a combination method and an activation function per node)
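
A toy sketch of how a sampled cell with a tree of two leaf nodes could be assembled; the weight matrices, the choice lists, and the example predictions below are illustrative (this is not the cell found in the paper, and the extra cell state $c_t$ used there is omitted for brevity).

```python
import torch

COMBINE = {"add": lambda a, b: a + b, "elem_mult": lambda a, b: a * b}
ACTIVATE = {"tanh": torch.tanh, "relu": torch.relu, "sigmoid": torch.sigmoid}

def sampled_cell(x_t, h_prev, params, choices):
    Wx0, Wh0, Wx1, Wh1 = params
    # Leaf nodes 0 and 1 each merge x_t and h_{t-1} with a predicted
    # combination method and activation function.
    a0 = ACTIVATE[choices["act_0"]](COMBINE[choices["comb_0"]](Wx0 @ x_t, Wh0 @ h_prev))
    a1 = ACTIVATE[choices["act_1"]](COMBINE[choices["comb_1"]](Wx1 @ x_t, Wh1 @ h_prev))
    # The root node merges the two leaves to produce h_t.
    return ACTIVATE[choices["act_2"]](COMBINE[choices["comb_2"]](a0, a1))

# Example predictions the controller might have emitted for this tree.
choices = {"comb_0": "add",       "act_0": "tanh",
           "comb_1": "elem_mult", "act_1": "relu",
           "comb_2": "elem_mult", "act_2": "sigmoid"}
x_t, h_prev = torch.randn(10), torch.randn(20)
params = (torch.randn(20, 10), torch.randn(20, 20),
          torch.randn(20, 10), torch.randn(20, 20))
h_t = sampled_cell(x_t, h_prev, params, choices)
```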

Experiments and results

Learning convolutional architectures for CIFAR-10

Dataset
Search space

Convolutional layers with rectified linear units as non-linearities, batch normalization, and skip connections between layers

Training details

Controller: two-layer LSTM with 35 hidden units on each layer

Adam, lr = 0.0006

Parameter server shards S = 20, controller replicas K = 100, child replicas m = 8

800 networks being trained on 800 GPUs concurrently at any time

Once the controller RNN samples an architecture, a child model is constructed and trained for 50 epochs. The reward used for updating the controller is the maximum validation accuracy of the last 5 epochs cubed
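
In code, that reward computation amounts to something like the snippet below, where `val_acc` is assumed to be the child model's per-epoch validation-accuracy history.

```python
# Reward for the controller update: the maximum validation accuracy
# over the last 5 epochs, cubed.
reward = max(val_acc[-5:]) ** 3
```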

During training, use a schedule that increases the number of layers in the child networks as training progresses

After finding the architecture that achieves the best validation accuracy, run a small grid search over learning rate, weight decay, batch norm epsilon, and what epoch to decay the learning rate

Learning recurrent cells for Penn TreeBank

Control Experiment 1: Adding more functions in the search space

E.g., add max function to combination functions; add sin to activation functions

Control Experiment 2: Comparison against random search

Conclusion

Code for running the models found by the controller will be released at https://github.com/tensorflow/models

The RNN cell found using this method has been released into TensorFlow under the name NASCell


Also see: https://www.cnblogs.com/marsggbo/p/9347678.html