Neural Architecture Optimization

Neural Architecture Optimization, NIPS 2018

Abstract

Proposes a simple and efficient method for automatic neural architecture design based on continuous optimization

3 components:

  • an encoder embeds/maps neural network architectures into a continuous space
  • a predictor takes the continuous representation of a network as input and predicts its accuracy
  • a decoder maps a continuous representation of a network back to its architecture

competitive results on CIFAR-10 and PTB

the discovered architectures transfer successfully to other tasks (e.g., CIFAR-100 and WikiText-2)

limited computational resources: less than 10 GPU hours when combined with weight sharing

Introduction

RL:

the choice of a component of the architecture is regarded as an action

a sequence of actions defines an architecture of a neural network, whose dev set accuracy is used as the reward

EA:

search is performed through mutations and re-combinations of architectural components, where architectures with better performance are picked to continue the evolution

Both RL and EA essentially perform search within the discrete architecture space

However, directly searching for the best architecture in a discrete space is inefficient, since the search space grows exponentially with the number of choices

NAO instead optimizes network architectures by mapping them into a continuous vector space (network embeddings) and conducting optimization in this continuous space via a gradient-based method

NAO

  • Core of NAO is an encoder model responsible for mapping a neural network architecture into a continuous representation, on top of which a regression model approximates the final performance

    similar to a performance predictor

  • the decoder is an LSTM model equipped with an attention mechanism that makes exact recovery of the discrete architecture easy

NAO adopts the weight sharing mechanism from ENAS to reduce the large complexity of the parameter space of child models

Most recent work parallel to this is DARTS:

  • Both NAO and DARTS conduct continuous optimization via a gradient-based method
  • Difference in what the continuous space represents:
    • DARTS: mixture weights
    • NAO: embeddings of neural architectures
  • Difference in how the best architecture is derived from the continuous space:
    • DARTS: argmax of the mixture weights
    • NAO: a decoder that exactly recovers the discrete architecture

BO:

the effectiveness of GP heavily relies on the choice of covariance function $K(x,x')$, which essentially models the similarity between two architectures $x$ and $x'$

One needs to put extra effort into designing a good $K(x,x')$ in the context of architecture design, which brings additional manual work while the performance may still be unsatisfactory

In comparison, NAO does not build on the GP setup; the authors empirically find that their simpler and more intuitive model works much better in practice

Approach

Architecture space

For searching CNN architecture

CNN architecture is hierarchical in that a cell is stacked for a certain number of times (denoted as N) to form the final CNN architecture

Goal is to design a cell

A cell is a convolutional neural network containing B nodes

Each node contains two branches, with 11 candidate operations to choose from; for each node:

  • decide which two previous nodes are used as the inputs to its two branches
  • decide the operation to apply in each of its two branches

For searching RNN architectures, each node decides:

  • a previous node to take as its input
  • the activation function to apply

Use a sequence consisting of discrete string tokens to describe a CNN or RNN architecture

E.g., CNN cell: node-2 conv 3x3 node-1 max-pooling 3x3
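
A minimal sketch (Python) of this token-sequence representation; the vocabulary and token names below are illustrative assumptions, not the paper's exact token set:

```python
# Hypothetical vocabulary: maps architecture tokens to integer ids,
# which is what the encoder LSTM consumes.
VOCAB = ["<pad>", "node-2", "node-1", "node-0", "node-3",
         "conv 3x3", "conv 1x1", "max-pooling 3x3", "avg-pooling 3x3", "identity"]
TOKEN_TO_ID = {tok: i for i, tok in enumerate(VOCAB)}

def encode_cell(tokens):
    """Map a cell described by string tokens to integer ids."""
    return [TOKEN_TO_ID[t] for t in tokens]

# Example cell node from above: two branches reading from node-2 (conv 3x3)
# and node-1 (max-pooling 3x3).
cell = ["node-2", "conv 3x3", "node-1", "max-pooling 3x3"]
print(encode_cell(cell))  # [1, 5, 2, 7]
```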

Components of Neural Architecture Optimization

  • encoder
  • performance predictor
  • decoder

Encoder

takes the string sequence describing an architecture as input, maps it into a continuous space $\mathcal{E}$

encoder:
$$
E : \mathcal{X} \rightarrow \mathcal{E}\\
e_{x}=E(x)
$$
use a single-layer LSTM as the basic model of the encoder

the hidden states of the LSTM are used as the continuous representation of $x$
$$
e_{x}=\left\{h_{1}, h_{2}, \cdots, h_{T}\right\} \in \mathcal{R}^{T \times d}
$$
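
A minimal PyTorch sketch of such an encoder; the embedding and hidden sizes are assumptions, not the paper's exact hyperparameters:

```python
import torch
import torch.nn as nn

class Encoder(nn.Module):
    """Single-layer LSTM over the architecture token sequence; the per-step
    hidden states {h_1, ..., h_T} form the continuous representation e_x."""

    def __init__(self, vocab_size, emb_dim=32, hidden_dim=96):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, emb_dim)
        self.lstm = nn.LSTM(emb_dim, hidden_dim, num_layers=1, batch_first=True)

    def forward(self, tokens):            # tokens: (batch, T) int64
        emb = self.embedding(tokens)      # (batch, T, emb_dim)
        e_x, _ = self.lstm(emb)           # (batch, T, hidden_dim) = {h_1..h_T}
        return e_x

# usage: 4 architectures, 8 tokens each
enc = Encoder(vocab_size=10)
e_x = enc(torch.randint(0, 10, (4, 8)))
print(e_x.shape)                          # torch.Size([4, 8, 96])
```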

Performance predictor

$$
f : \mathcal{E} \rightarrow \mathcal{R}^{+}
$$

map the continuous representation $e_x$ of an architecture $x$ into its performance $s_x$ measured by dev set accuracy
$$
e_{x}=\left\{h_{1}, \cdots, h_{T}\right\} \text{ is mean-pooled to obtain } \overline{e}_{x}=\frac{1}{T} \sum_{t=1}^{T} h_{t}
$$
then maps $\overline{e}_{x}$ to a scalar value using a feed-forward network as the predicted performance

training data: pairs $(x, s_x)$

minimize the least square regression loss: $\left(s_{x}-f(E(x))\right)^{2}$
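
A sketch of the predictor $f$ and its regression loss, following the mean-pooling described above; the layer sizes and the sigmoid output are assumptions:

```python
import torch
import torch.nn as nn

class PerformancePredictor(nn.Module):
    """Mean-pool the encoder states over time, then a small feed-forward
    network outputs a scalar predicted accuracy."""

    def __init__(self, hidden_dim=96):
        super().__init__()
        self.ff = nn.Sequential(
            nn.Linear(hidden_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, 1), nn.Sigmoid(),  # accuracy in (0, 1)
        )

    def forward(self, e_x):                 # e_x: (batch, T, hidden_dim)
        e_bar = e_x.mean(dim=1)             # (batch, hidden_dim), i.e. mean-pooled e_x
        return self.ff(e_bar).squeeze(-1)   # (batch,)

# least-squares regression loss against measured dev accuracies s_x
f = PerformancePredictor()
e_x, s_x = torch.randn(4, 8, 96), torch.rand(4)
loss_pp = ((s_x - f(e_x)) ** 2).sum()
```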

To guarantee the permutation invariance of the architecture embedding: if $x_1$ and $x_2$ are symmetric ($x_2$ is formed by swapping the two branches within a node of $x_1$), their embeddings should be close and produce the same performance prediction scores

to achieve this, adopt a simple data augmentation approach inspired by data augmentation methods in computer vision

for each pair $(x_1, s_{x_1})$, add an additional pair $(x_2, s_{x_1})$ where $x_1$ and $x_2$ are symmetric, and use both pairs to train the encoder and performance predictor

found that doing so brings a non-trivial gain
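
A sketch of this symmetry augmentation, assuming a flat serialization of [previous node, op, previous node, op] per cell node (the exact token layout is an assumption):

```python
def swap_branches(tokens, node_idx, tokens_per_node=4):
    """Return a symmetric copy of `tokens` with the two branches of
    node `node_idx` swapped; it describes the same architecture, so it is
    added as an extra training pair with the same accuracy s_x."""
    t = list(tokens)
    base = node_idx * tokens_per_node
    t[base:base + 2], t[base + 2:base + 4] = t[base + 2:base + 4], t[base:base + 2]
    return t

x1 = ["node-2", "conv 3x3", "node-1", "max-pooling 3x3"]
x2 = swap_branches(x1, node_idx=0)
print(x2)  # ['node-1', 'max-pooling 3x3', 'node-2', 'conv 3x3']
# train the encoder and predictor on both (x1, s_x) and (x2, s_x)
```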

Decoder

$$
D : \mathcal{E} \rightarrow \mathcal{X}\\
x=D\left(e_{x}\right)
$$

D: an LSTM model

an attention mechanism is leveraged to make decoding easier: at each timestep $r$, it outputs a context vector $ctx_r$ that combines all encoder outputs $\{h_t\}_{t=1}^{T}$

the decoder induces a factorized distribution on $x$:
$$
P_{D}\left(x | e_{x}\right)=\prod_{r=1}^{T} P_{D}\left(x_{r} | e_{x}, x_{<r}\right)
$$

$$
P_{D}\left(x_{r} | e_{x}, x_{<r}\right)=\frac{\exp \left(W_{x_{r}}\left[s_{r}, c t x_{r}\right]\right)}{\sum_{x^{\prime} \in V_{r}} \exp \left(W_{x^{\prime}}\left[s_{r}, c t x_{r}\right]\right)}
$$

==need to check attention again==
$$
\text{maximize }\log P_{D}(x | E(x))=\sum_{r=1}^{T} \log P_{D}\left(x_{r} | E(x), x_{<r}\right)
$$
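
A sketch of such an attention decoder; dot-product attention, teacher forcing with a zero start token, and the layer sizes are all assumptions, since the notes only state that an attention mechanism is used:

```python
import torch
import torch.nn as nn

class Decoder(nn.Module):
    """At step r, combine the decoder state s_r with a context vector ctx_r
    over the encoder outputs {h_t} and produce a distribution over tokens."""

    def __init__(self, vocab_size, emb_dim=32, hidden_dim=96):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, emb_dim)
        self.cell = nn.LSTMCell(emb_dim, hidden_dim)
        self.out = nn.Linear(2 * hidden_dim, vocab_size)    # W [s_r, ctx_r]

    def forward(self, e_x, targets):             # e_x: (B, T, H); targets: (B, T)
        B, T, H = e_x.shape
        s, c = e_x.new_zeros(B, H), e_x.new_zeros(B, H)
        logits = []
        for r in range(T):
            prev = targets[:, r - 1] if r > 0 else targets.new_zeros(B)
            s, c = self.cell(self.embedding(prev), (s, c))  # condition on x_{<r}
            attn = torch.softmax((e_x * s.unsqueeze(1)).sum(-1), dim=-1)  # (B, T)
            ctx = (attn.unsqueeze(-1) * e_x).sum(1)                       # (B, H)
            logits.append(self.out(torch.cat([s, ctx], dim=-1)))
        return torch.stack(logits, dim=1)         # (B, T, vocab)

# reconstruction loss -log P_D(x | E(x)) via cross entropy
dec = Decoder(vocab_size=10)
e_x, x = torch.randn(4, 8, 96), torch.randint(0, 10, (4, 8))
loss_rec = nn.functional.cross_entropy(dec(e_x, x).flatten(0, 1), x.flatten())
```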

Training and inference

Jointly train the encoder $E$, performance predictor $f$ and decoder $D$ by minimizing the combination of performance prediction loss $L_{pp}$ and structure reconstruction loss $L_{rec}$
$$
L=\lambda L_{p p}+(1-\lambda) L_{r e c}=\lambda \sum_{x \in X}\left(s_{x}-f(E(x))\right)^{2}-(1-\lambda) \sum_{x \in X} \log P_{D}(x | E(x))
$$
==why is there no separate encoder loss?==

the performance prediction loss acts as a regularizer that prevents the encoder from being optimized into a trivial state in which the decoder simply copies tokens; previous work typically avoids this by adding noise when encoding $x$
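
A minimal sketch of combining the two losses, with `loss_pp` and `loss_rec` computed as in the predictor and decoder sketches above and `lam` standing for the trade-off weight $\lambda$:

```python
import torch

def nao_loss(loss_pp, loss_rec, lam=0.5):
    """Joint objective L = lambda * L_pp + (1 - lambda) * L_rec, where
    L_rec is already the negative log-likelihood of the true sequence."""
    return lam * loss_pp + (1.0 - lam) * loss_rec

# toy usage with pre-computed batch losses
print(nao_loss(torch.tensor(0.12), torch.tensor(2.3)))  # tensor(1.2100)
```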

Once both the encoder and decoder are optimized to convergence, the inference process that searches for better architectures is performed in the continuous space $\mathcal{E}$

a better continuous representation $e_{x'}$ is obtained by moving along the gradient direction of $f$:
$$
h_{t}^{\prime}=h_{t}+\eta \frac{\partial f}{\partial h_{t}}, \quad e_{x^{\prime}}=\left\{h_{1}^{\prime}, \cdots, h_{T}^{\prime}\right\}
$$
Afterwards, feed $e_{x'}$ into the decoder to obtain a new architecture $x'$, assumed to have better performance

call the original architecture $x$ the 'seed' architecture and iterate this process for several rounds
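
A sketch of this gradient move on the embedding; the predictor `f` used below is a stand-in linear model and the step size `eta` is an arbitrary value, both assumptions for illustration:

```python
import torch

def gradient_step(e_x, pred_fn, eta=1e-2):
    """h'_t = h_t + eta * df/dh_t: perturb the seed architecture's embedding
    along the gradient of the performance predictor, then hand the result
    to the decoder to recover a (hopefully better) architecture x'."""
    e_x = e_x.detach().requires_grad_(True)
    pred_fn(e_x).sum().backward()          # gradients w.r.t. every h_t
    return (e_x + eta * e_x.grad).detach()

# usage with any differentiable predictor, here a stand-in linear model
f = torch.nn.Sequential(torch.nn.Flatten(), torch.nn.Linear(8 * 96, 1))
e_x = torch.randn(4, 8, 96)                # embeddings of 4 seed architectures
e_x_new = gradient_step(e_x, f)            # decode e_x_new to obtain x'
```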

Combination with weight sharing

Different from NAO, which tries to reduce the huge computational cost brought by the search algorithm, weight sharing aims to ease the huge complexity brought by the massive number of child models via the one-shot model setup

Experiments

Results on CIFAR-10 classification

After the best cell architecture is found, build a larger network by stacking the cell 6 times and enlarging the filter size, and train it on the whole training set

among 600 randomly sampled architectures, randomly choose 50 of them as $X_{test}$

To evaluate $f$:

$acc_f = \frac{\sum_{x_{1} \in X_{\text {test}}, x_{2} \in X_{\text {test}}} \mathbb{1}_{f\left(E\left(x_{1}\right)\right) \geq f\left(E\left(x_{2}\right)\right)} \mathbb{1}_{s_{x_{1}} \geq s_{x_{2}}}}{\sum_{x_{1} \in X_{\text {test}}, x_{2} \in X_{\text {test}}} 1}$

To evaluate the decoder $D$: compute the average Hamming distance, denoted as $\text{Dist}$
$$
\text{Dist}_{D}=\frac{1}{|X_{\text{test}}|} \sum_{x \in X_{\text{test}}} \operatorname{Dist}(D(E(x)), x)
$$
Hamming distance: the number of positions at which two sequences differ (the number of 1s in a XOR b)
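
A sketch of both evaluation metrics on toy data; `pairwise_accuracy` implements the pairwise formula above literally (summing over all ordered pairs, including x1 = x2), and `hamming` counts differing token positions:

```python
import itertools

def pairwise_accuracy(pred, true):
    """Fraction of ordered pairs (x1, x2) for which f ranks x1 >= x2
    and the true accuracies also satisfy s_x1 >= s_x2."""
    pairs = list(itertools.product(range(len(pred)), repeat=2))
    hits = sum((pred[i] >= pred[j]) and (true[i] >= true[j]) for i, j in pairs)
    return hits / len(pairs)

def hamming(a, b):
    """Number of token positions at which two architecture sequences differ."""
    return sum(t1 != t2 for t1, t2 in zip(a, b))

print(pairwise_accuracy([0.91, 0.93, 0.90], [0.92, 0.94, 0.89]))  # ~0.667
print(hamming(["node-2", "conv 3x3"], ["node-2", "conv 1x1"]))    # 1
```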

Result:

the decoder $D$ is powerful in that it can almost exactly recover the network architecture: the average Hamming distance is less than 0.5, i.e., on average the decoded sequence differs from the original by fewer than 0.5 tokens

Inspect whether the gradient update really helps to generate better architecture representations that are further decoded to architectures via $D$

compare:

  • the mean of the real performance values: $\frac{1}{\left|X_{\text{eval}}\right|} \sum_{x \in X_{\text{eval}}} s_{x}$
  • the mean of the predicted values: $\frac{1}{\left|X_{\text{eval}}\right|} \sum_{x \in X_{\text{eval}}} f(E(x))$

The gap turns out to be small

Transferring the discovered architecture to CIFAR-100

NAO-result.png

Results of Language Modeling on PTB

Transferring the discovered architecture to WikiText-2

Conclusion

Future work

  • try other methods to further improve the performance of the discovered architectures, e.g., mixture of softmaxes for language modeling
  • apply NAO to discover better architectures for more applications such as Neural Machine Translation
  • design better neural models from the view of teaching and learning to learn