Random Search and Reproducibility for Neural Architecture Search

Abstract

Two points:

  1. Evaluate both random search with early-stopping and a novel random search with weight-sharing algorithm. Results show that random search with early-stopping is a competitive NAS baseline.
  2. Explore the existing reproducibility issues of published NAS results.

Introduction

Three fundamental issues with the current state of NAS research:

  • Inadequate baselines
  • Complex methods
  • Lack of reproducibility

Inadequate baselines

Existing comparisons between novel NAS methods and standard hyperparameter optimization methods are inadequate

Without benchmarking against leading hyperparameter optimization baselines, it is difficult to quantify the performance gains provided by specialized NAS methods

Complex Methods

Novel NAS methods differ along many dimensions, including complicated training procedures, architecture transformations, and modeling assumptions

It’s unclear what NAS components are necessary to achieve a competitive empirical result

Random search is itself a core component of standard hyperparameter optimization

Lack of reproducibility

  • “exact reproducibility”, whether it is possible to reproduce explicitly reported experimental results
  • “broad reproducibility”, the degree to which the reported experimental results are themselves robust and generalizable

Published NAS results often fail on one or both counts on account of some combination of missing model evaluation code, architecture search code, random seeds used for search and evaluation, and/or undocumented hyperparameter tuning

Contributions

  1. Provide a new perspective on the gap between traditional hyperparameter optimization and leading NAS methods

    Evaluate a general hyperparameter optimization method combining random search with early-stopping

  2. Identify a small subset of NAS components that are sufficient for achieving good empirical results

    Construct a simple algorithm from the ground up, starting from vanilla random search; properly tuned random search with weight-sharing is competitive with much more complicated methods when using similar computational budgets

    Meta-hyperparameters: batch size, number of epochs, network size, and number of evaluated architectures

  3. Open-source all of the code, random seeds, and documentation necessary to reproduce the experiments

Background

Hyperparameter optimization has three components, each of which can have NAS-specific approaches:

  1. Search space

    Search spaces can include continuous or discrete hyperparameters, in a structured or unstructured fashion

    NAS search spaces are commonly represented as a DAG

    Cell blocks are repeated in some way via a preset or learned meta-architecture to form a larger architecture

    The paper designs a random search NAS algorithm for such a cell-block search space

  2. Search Method

    Random search, the most basic approach

    Bayesian approaches based on Gaussian processes

    Gradient-based approaches are generally only applicable to continuous search spaces

    Tree-based Bayesian methods, evolutionary strategies, and random search are more flexible and can be applied to any search space

  3. Evaluation method

    The evaluation method measures the quality of a configuration, e.g., its predictive accuracy on a validation set

    Partial training methods exploit early-stopping to speed up the evaluation process at the cost of noisy estimates of configuration quality

    Many of these methods center around sharing and reuse:

    • network morphisms build upon previously trained architectures
    • hypernetworks and performance prediction encode information from previously seen architectures
    • weight-sharing methods use a single set of weights for all possible architectures

Additional context for the current state of NAS research

Inadequate baselines

We choose to use a simple method combining random search with early-stopping, called ASHA, to provide a competitive baseline for standard hyperparameter optimization
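
As a rough illustration of the early-stopping idea behind ASHA, here is a minimal sketch of synchronous successive halving (ASHA performs the same promotions asynchronously across workers). The function names and defaults are hypothetical, not the authors' implementation.

```python
def successive_halving(sample_config, partial_train, num_configs=27,
                       min_epochs=1, eta=3):
    """Minimal synchronous successive halving, the idea underlying ASHA.

    sample_config: () -> configuration sampled uniformly at random
    partial_train: (config, epochs) -> validation score (higher is better)
    """
    configs = [sample_config() for _ in range(num_configs)]
    epochs = min_epochs
    while len(configs) > 1:
        # Partially train every surviving configuration with the current budget.
        scored = sorted(((partial_train(c, epochs), c) for c in configs),
                        key=lambda sc: sc[0], reverse=True)
        # Keep the top 1/eta configurations and give them eta times more epochs.
        configs = [c for _, c in scored[: max(1, len(scored) // eta)]]
        epochs *= eta
    return configs[0]
```

With eta = 3 and 27 starting configurations, each round keeps a third of the candidates and triples their training budget, so most of the compute is spent on the most promising configurations.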

Complex Methods

  • Evolutionary approaches need to define a set of possible mutations to apply to different architectures
  • Bayesian optimization approaches rely on specially designed kernels
  • Gradient-based methods transform the discrete architecture search problem into a continuous optimization problem
  • Reinforcement learning methods train an RNN controller to generate good architectures

Since methods sometimes use different search spaces and evaluation methods, it is difficult to isolate the impact of individual NAS components

To simplify the search process and help isolate important components of NAS, we use random search to sample architectures from the search space

Considering training time and performance, we use random search with weight-sharing as our starting point for a simple and efficient NAS method

This approach was inspired by prior work showing that random search, combined with a well-trained set of shared weights, can successfully differentiate good architectures from poor-performing ones. That prior work required several modifications to stabilize training (e.g., a tunable path dropout schedule over edges of the search DAG and a specialized ghost batch normalization scheme)

Lack of reproducibility

  • Architecture search code
  • Model evaluation code
  • Hyperparameter tuning documentation
  • Random seeds

DARTS is particularly commendable in acknowledging its dependence on random initialization, prompting the use of multiple runs to select the best architecture

Our work goes one step further and evaluates the broad reproducibility of our results with another set of random seeds

Methodology

Our algorithm is designed for an arbitrary search space with a DAG representation

Use the same search space as that considered by DARTS; the recurrent cell has N = 8 nodes and 4 operations: tanh, ReLU, sigmoid, and identity

Apply random search in the following manner (a sampling sketch follows this list):

  1. For each node in the DAG, determine what decisions must be made
  2. For each decision, identify the possible choices for the given node
  3. Finally, moving from node to node, we sample uniformly from the set of possible choices for each decision that needs to be made
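
Below is a minimal sketch of this sampling procedure for a DARTS-style recurrent cell, assuming the only decisions at each node are which earlier node to take as input and which operation to apply. The N = 8 nodes and four operations come from the search space above; the code itself is illustrative, not the authors' implementation.

```python
import random

N_NODES = 8
OPS = ["tanh", "relu", "sigmoid", "identity"]

def sample_recurrent_cell(rng=random):
    """Uniformly sample one architecture: for each node, pick one
    predecessor (the input edge) and one operation to apply to it."""
    arch = []
    for node in range(1, N_NODES + 1):
        predecessor = rng.randrange(node)  # any earlier node, 0..node-1
        operation = rng.choice(OPS)
        arch.append((predecessor, operation))
    return arch

# Example output: [(0, 'tanh'), (1, 'relu'), (0, 'identity'), ...]
print(sample_recurrent_cell())
```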

Shared weights are updated by selecting a single architecture for a given minibatch and back-propagating through the network with only the edges and operations indicated by that architecture activated

After training, we use the trained shared weights to evaluate the performance of a number of randomly sampled architectures on a separate held-out dataset
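
A hedged PyTorch-style sketch of this training and evaluation loop is shown below, assuming a `shared_model(x, arch)` interface that routes the input only through the edges and operations of the sampled architecture; that interface and the classification loss are assumptions made for illustration.

```python
import torch

def train_shared_weights(shared_model, train_loader, optimizer,
                         sample_arch, epochs, grad_clip=5.0):
    """One architecture per minibatch: only the weights on the active
    edges/operations receive gradients for that update."""
    loss_fn = torch.nn.CrossEntropyLoss()
    for _ in range(epochs):
        for x, y in train_loader:
            arch = sample_arch()
            optimizer.zero_grad()
            loss = loss_fn(shared_model(x, arch), y)
            loss.backward()
            # The max gradient norm is one of the meta-hyperparameters discussed below.
            torch.nn.utils.clip_grad_norm_(shared_model.parameters(), grad_clip)
            optimizer.step()

def rank_architectures(shared_model, val_loader, sample_arch, n_archs=2000):
    """Score randomly sampled architectures with the frozen shared weights
    and return the best one (lowest held-out loss)."""
    loss_fn = torch.nn.CrossEntropyLoss()
    best = None
    with torch.no_grad():
        for _ in range(n_archs):
            arch = sample_arch()
            loss = sum(loss_fn(shared_model(x, arch), y).item()
                       for x, y in val_loader) / len(val_loader)
            if best is None or loss < best[0]:
                best = (loss, arch)
    return best
```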

Relevant Meta-Hyperparameters

The following meta-hyperparameters impact the behavior of our search algorithm, both in terms of search quality and computational cost (a configuration sketch follows the list):

  1. Training epochs

    Training with more architectures should help the shared weights generalize better to the likely unseen architectures in the evaluation step

    More epochs increase the computational time required for architecture search

  2. Batch size

    Decreasing the batch size increases the number of minibatch updates but at the cost of noisier gradient updates

    This may necessitate adjusting other meta-hyperparameters to account for the noisier gradient estimates

  3. Network size

    Increasing the search network size increases the dimension of the shared weights

    This should boost performance since a larger search network can store more information about different architectures

    Larger search networks also require more GPU memory

  4. Number of evaluated architectures

    Increasing the number of architectures that we evaluate using the shared weights allows for more exploration in the architecture search space

  5. Gradient clipping

    The maximum gradient norm bounds the size of each shared-weights update; later experiments reduce it to account for training with discrete architectures
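
The meta-hyperparameters above can be summarized as a single configuration object. The defaults shown are illustrative values within the ranges discussed in this document, not the authors' exact settings.

```python
from dataclasses import dataclass

@dataclass
class SearchConfig:
    """Meta-hyperparameters of random search with weight-sharing."""
    epochs: int = 50             # more epochs: better shared weights, more compute
    batch_size: int = 64         # smaller batches: more, but noisier, updates
    grad_clip: float = 1.0       # max gradient norm for shared-weight updates
    proxyless: bool = False      # search with the larger proxyless network or not
    num_eval_archs: int = 2000   # architectures scored with the shared weights
```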

Memory footprint

Train the shared weights using a single architecture at a time, loading into GPU memory only the weights associated with the operations and edges that the architecture activates

The memory footprint of our random search with weight-sharing can be reduced to that of a single model

Larger “proxyless” models are usually used in the final architecture evaluation step instead of the smaller proxy models used in the search step; the reduced memory footprint makes it possible to search directly over such proxyless models
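
A sketch of this memory-saving idea, assuming the shared weights are stored as a dictionary of candidate operations per edge; only the modules used by the sampled architecture are moved to the GPU for a given update. The interface and names are hypothetical.

```python
import torch

def activate_architecture(shared_ops, arch, device="cuda"):
    """Keep only the sampled architecture's operations on the GPU.

    shared_ops: dict mapping (edge, op_name) -> torch.nn.Module
    arch:       list of (edge, op_name) pairs for the sampled architecture
    """
    active = set(arch)
    for key, module in shared_ops.items():
        # Active operations go to the GPU; everything else stays on the CPU.
        module.to(device if key in active else "cpu")
```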

Experiments

Three stages

  1. Perform architecture search for a cell block on a cheaper search task
  2. Evaluate the best architecture from the first stage by retraining from scratch a larger network formed from multiple cell blocks of that architecture
  3. Perform the full evaluation of the best architecture from the second stage by training for more epochs or with more seeds

Architecture search with partial training is performed directly on the stage-2 network, and the best architecture is then selected for stage-3 evaluation
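
The three-stage protocol can be summarized schematically as follows; `search`, `partial_train`, and `full_eval` are placeholders for the stage-specific routines, not functions from the released code.

```python
def nas_pipeline(search, partial_train, full_eval):
    """Stage 1: cheap architecture search on the proxy task.
    Stage 2: partially train the best architecture(s) in a larger network.
    Stage 3: full evaluation (more epochs and/or seeds) of the stage-2 winner."""
    candidates = search()                       # stage 1 (proxy network)
    best = max(candidates, key=partial_train)   # stage 2 (proxyless network)
    return full_eval(best)                      # stage 3
```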

PTB benchmark

The network in the first stage is referred to as the proxy network, and the network in the later stages as the proxyless network

Final Search Results

  1. Evaluate the ASHA baseline

    ASHA evaluated over 300 architectures

    Result demonstrates that the gap between SOTA NAS methods and standard hyperparameter optimization approaches on the PTB benchmark is significantly smaller than that suggested by existing comparisons to random search

  2. Evaluate random search with weight-sharing with tuned meta-hyperparameters

    achieving SOTA perplexity compared to previous NAS approaches

    Manually designed architectures are competitive with RNN cells designed by NAS methods on this benchmark

    The manually designed LSTM with a mixture of experts in the softmax layer (MoS) outperforms automatically designed cells

  3. Examine the reproducibility of the NAS methods

Impact of meta-hyperparameters

Perform 4 separate trials of each version of random search with weight-sharing

In stage 1, train the shared weights and then use them to evaluate 2000 randomly sampled architectures

In stage 2, select the best of those 2000 architectures, according to the shared weights, and train it from scratch using the proxyless network for 300 epochs

Adjusting the following meta-hyperparameters:

  • In stage 1:
    • gradient clipping
    • batch size
    • network size

4 Random configurations:

  • Random 1: using the same setup as DARTS

  • Random 2: decrease the maximum gradient norm to account for discrete architectures, so that gradient updates are not as large in each direction

  • Random 3: decrease batch size from 256 to 64 in order to increase the number of architectures used to train the shared weights

  • Random 4: train the larger proxyless network architecture with shared weights, increasing the number of parameters in the model

The gap between the stage-2 and stage-3 results stems from the fact that we did not perform any additional hyperparameter tuning in stage 3

Investigating Reproducibility

Examine the stage2 intermediate results

Even partial training for 300 epochs does not recover the correct ranking; evaluation using shared weights further obscures the signal

Overall, the experiments demonstrate high variance in the stage-2 intermediate results across trials, along with issues related to differing convergence rates for different architectures

CIFAR-10 Benchmark

Final Search Results

These results suggest that the gap between SOTA NAS methods and standard hyperparameter optimization is much smaller than previously reported

Evaluate random search with weight-sharing with tuned meta-hyperparameters

Random search with weight-sharing can also directly search over larger proxyless networks since it trains using discrete architectures

We hypothesize that using a proxyless network and applying random search with weight-sharing to the same search space as ProxylessNAS would further improve our results; this is left to future work

The final results are quite similar across independent runs for both DARTS and random search with weight-sharing

Impact of meta-hyperparameters

The following meta-hyperparameters affect both the training of the shared weights and the evaluation of architectures using these trained weights:

  • number of training epochs
  • gradient clipping
  • number of architectures evaluated using the shared weights
  • network size

5 Random configurations:

  1. Start by training the shared weights with the proxy network used by DARTS and default values
  2. Increase the number of training epochs from 50 to 150, which increases the number of architectures used to update the shared weights
  3. Reduce the maximum gradient norm from 5 to 1 to adjust for training discrete architectures
  4. Increase the number of epochs for training the proxy network with shared weights to 300 and increase the number of architectures evaluated using the shared weights to 11k
  5. Increase the proxy network size to be as large as possible given available GPU memory

Similar to the PTB benchmark, the best setting for random search was Random (5), which has a larger network size

Investigating Reproducibility

Both DARTS and Random (5) are broadly reproducible on this benchmark

The ranking is unstable between 100 and 600 epochs in two significant cases (reproduced DARTS and Random (5) Run 2), which motivated our strategy of training the final architectures across trials to 600 epochs in order to select the best architecture for final evaluation across 10 seeds

Computational Cost

There is a trade-off between computational cost and the quality of the signal we get per evaluated architecture

  1. Full training

  2. Partial training:

    9 minutes per architecture for PTB benchmark and 19 minutes for CIFAR-10 benchmark

  3. Weight-sharing

    It is difficult to quantify the equivalent number of architectures evaluated by DARTS and random search with weight-sharing

    For random search with weight sharing, this is a tunable meta-hyperparameter and the quality of the performance estimates we receive can be noisy

    As a rough estimate, this corresponds to 0.2 minutes per architecture for the PTB benchmark and 0.8 minutes for the CIFAR-10 benchmark

In contrast, we were able to achieve nearly competitive performance with the default settings of ASHA using roughly the same total computation as that needed by DARTS and random search with weight-sharing
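
A back-of-the-envelope comparison on the PTB benchmark, using the per-architecture figures quoted above (roughly 300 architectures for ASHA and 2000 scored with the shared weights); treat these as order-of-magnitude estimates only.

```python
# Approximate per-architecture evaluation costs quoted above (minutes).
asha_cost = 300 * 9      # ~300 architectures via partial training
ws_cost = 2000 * 0.2     # ~2000 architectures scored with shared weights

print(f"ASHA / partial training : ~{asha_cost / 60:.0f} GPU-hours")  # ~45
print(f"weight sharing          : ~{ws_cost / 60:.0f} GPU-hours")    # ~7
```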

Available Code

The released code is deterministic conditioned on a fixed random seed
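
A common recipe for making runs deterministic given a fixed seed looks roughly like the sketch below; the exact flags required depend on the framework version, and this is not the authors' released code.

```python
import random

import numpy as np
import torch

def set_seed(seed: int) -> None:
    """Seed every source of randomness so a run is reproducible."""
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)
    # cuDNN must also be forced into deterministic mode.
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = False
```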

Conclusion

  1. Better baselines that accurately quantify the performance gains of NAS methods

  2. Ablation studies that isolate the impact of individual NAS components

  3. Reproducible results that engender confidence and foster scientific progress

    Consequently, we conclude that either significantly more computational resources need to be devoted to evaluating NAS methods, or more computationally tractable benchmarks need to be developed to lower the barrier for performing adequate empirical evaluations


Key point:

In stage (1), we train the shared weights and use them to evaluate a given number of randomly sampled architectures on a held-out validation set. In stage (2), we select the best architecture, according to the shared weights, to train from scratch using the proxyless network.