Random Search and Reproducibility for Neural Architecture Search

Abstract

Two points:

  1. Evaluate both random search with early-stopping and a novel random search with weight-sharing algorithm. Results show that random search with early-stopping is a competitive NAS baseline.
  2. Explore the existing reproducibility issues of published NAS results.

Introduction

Three fundamental issues with the current state of NAS research:

  • Inadequate baselines
  • Complex methods
  • Lack of reproducibility

Inadequate baselines

Existing comparisons between novel NAS methods and standard hyperparameter optimization methods are inadequate

Without benchmarking against leading hyperparameter optimization baselines, it is difficult to quantify the performance gains provided by specialized NAS methods

Complex Methods

Novel NAS methods differ along many dimensions, including complicated training procedures, architecture transformations, and modeling assumptions

It’s unclear what NAS components are necessary to achieve a competitive empirical result

Random search is itself a core component of standard hyperparameter optimization

Lack of reproducibility

  • “exact reproducibility”, whether it is possible to reproduce explicitly reported experimental results
  • “broad reproducibility”, the degree to which the reported experimental results are themselves robust and generalizable

Published NAS results often fail on one or both counts on account of some combination of missing model evaluation code, architecture search code, random seeds used for search and evaluation, and/or undocumented hyperparameter tuning

Contributions

  1. Provide a new perspective on the gap between traditional hyperparameter optimization and leading NAS methods

    Evaluate a general hyperparameter optimization method combining random search with early-stopping

  2. Identify a small subset of NAS components that are sufficient for achieving good empirical results

    Construct a simple algorithm from the ground up, starting from vanilla random search; properly tuned random search with weight-sharing is competitive with much more complicated methods when using similar computational budgets

    Meta-hyperparameters: batch size, number of epochs, network size, and number of evaluated architectures

  3. Open-source all of the code, random seeds, and documentation necessary to reproduce the experiments

Background

Hyperparameter optimization has three components, each of which can have NAS-specific approaches:

  1. Search space

    Search spaces can include continuous or discrete hyperparameters, in a structured or unstructured fashion

    NAS search spaces are commonly represented as a DAG

    Cell blocks are repeated in some way via a preset or learned meta-architecture to form a larger architecture

    The paper designs a random search NAS algorithm for such a cell-block search space

  2. Search Method

    Random search, the most basic approach

    Bayesian approaches based on Gaussian processes

    Gradient-based approaches are generally only applicable to continuous search spaces

    Tree-based Bayesian methods, evolutionary strategies, and random search are more flexible and can be applied to any search space

  3. Evaluation method

    The evaluation method measures the quality of a configuration, e.g., its predictive accuracy on a validation set

    Partial training methods exploit early-stopping to speed up the evaluation process at the cost of noisy estimates of configuration quality

    Many of these methods center around sharing and reuse:

    • network morphisms build upon previously trained architectures
    • hypernetworks and performance prediction encode information from previously seen architectures
    • weight-sharing methods use a single set of weights for all possible architectures

Additional context for the current state of NAS research

Inadequate baselines

We choose to use a simple method combining random search with early-stopping, called ASHA, to provide a competitive baseline for standard hyperparameter optimization
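
As a rough illustration of the early-stopping idea behind ASHA, here is a minimal sketch of synchronous successive halving (ASHA performs the same promotions asynchronously across workers). The function names and defaults are hypothetical, not the authors' implementation.

```python
def successive_halving(sample_config, partial_train, num_configs=27,
                       min_epochs=1, eta=3):
    """Minimal synchronous successive halving, the idea underlying ASHA.

    sample_config: () -> configuration sampled uniformly at random
    partial_train: (config, epochs) -> validation score (higher is better)
    """
    configs = [sample_config() for _ in range(num_configs)]
    epochs = min_epochs
    while len(configs) > 1:
        # Partially train every surviving configuration with the current budget.
        scored = sorted(((partial_train(c, epochs), c) for c in configs),
                        key=lambda sc: sc[0], reverse=True)
        # Keep the top 1/eta configurations and give them eta times more epochs.
        configs = [c for _, c in scored[: max(1, len(scored) // eta)]]
        epochs *= eta
    return configs[0]
```

With eta = 3 and 27 starting configurations, each round keeps a third of the candidates and triples their training budget, so most of the compute is spent on the most promising configurations.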

Complex Methods

  • Evolutionary approaches need to define a set of possible mutations to apply to different architectures
  • Bayesian optimization approaches rely on specially designed kernels
  • Gradient-based methods transform the discrete architecture search problem into a continuous optimization problem
  • Reinforcement learning methods train an RNN controller to generate good architectures

Since methods sometimes use different search spaces and evaluation methods, it is difficult to isolate the impact of individual NAS components

To simplify the search process and help isolate important components of NAS, we use random search to sample architectures from the search space

Considering training time and performance, we use random search with weight-sharing as our starting point for a simple and efficient NAS method

This approach was inspired by prior work showing that random search, combined with a well-trained set of shared weights, can successfully differentiate good architectures from poor-performing ones. That prior work required several modifications to stabilize training (e.g., a tunable path dropout schedule over edges of the search DAG and a specialized ghost batch normalization scheme)

Lack of reproducibility

  • Architecture search code
  • Model evaluation code
  • Hyperparameter tuning documentation
  • Random seeds

DARTS is particularly commendable in acknowledging its dependence on random initialization, prompting the use of multiple runs to select the best architecture

Our work goes one step further and evaluates the broad reproducibility of our results with another set of random seeds

Methodology

Our algorithm is designed for an arbitrary search space with a DAG representation

Use the same search space as that considered by DARTS; the recurrent cell has N = 8 nodes and 4 operations: tanh, ReLU, sigmoid, and identity

Apply random search in the following manner (a sampling sketch follows this list):

  1. For each node in the DAG, determine what decisions must be made
  2. For each decision, identify the possible choices for the given node
  3. Finally, moving from node to node, we sample uniformly from the set of possible choices for each decision that needs to be made
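
Below is a minimal sketch of this sampling procedure for a DARTS-style recurrent cell, assuming the only decisions at each node are which earlier node to take as input and which operation to apply. The N = 8 nodes and four operations come from the search space above; the code itself is illustrative, not the authors' implementation.

```python
import random

N_NODES = 8
OPS = ["tanh", "relu", "sigmoid", "identity"]

def sample_recurrent_cell(rng=random):
    """Uniformly sample one architecture: for each node, pick one
    predecessor (the input edge) and one operation to apply to it."""
    arch = []
    for node in range(1, N_NODES + 1):
        predecessor = rng.randrange(node)  # any earlier node, 0..node-1
        operation = rng.choice(OPS)
        arch.append((predecessor, operation))
    return arch

# Example output: [(0, 'tanh'), (1, 'relu'), (0, 'identity'), ...]
print(sample_recurrent_cell())
```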

Shared weights are updated by selecting a single architecture for a given minibatch and back-propagating through the network with only the edges and operations indicated by that architecture activated

After training, we use the trained shared weights to evaluate the performance of a number of randomly sampled architectures on a separate held-out dataset
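
A hedged PyTorch-style sketch of this training and evaluation loop is shown below, assuming a `shared_model(x, arch)` interface that routes the input only through the edges and operations of the sampled architecture; that interface and the classification loss are assumptions made for illustration.

```python
import torch

def train_shared_weights(shared_model, train_loader, optimizer,
                         sample_arch, epochs, grad_clip=5.0):
    """One architecture per minibatch: only the weights on the active
    edges/operations receive gradients for that update."""
    loss_fn = torch.nn.CrossEntropyLoss()
    for _ in range(epochs):
        for x, y in train_loader:
            arch = sample_arch()
            optimizer.zero_grad()
            loss = loss_fn(shared_model(x, arch), y)
            loss.backward()
            # The max gradient norm is one of the meta-hyperparameters discussed below.
            torch.nn.utils.clip_grad_norm_(shared_model.parameters(), grad_clip)
            optimizer.step()

def rank_architectures(shared_model, val_loader, sample_arch, n_archs=2000):
    """Score randomly sampled architectures with the frozen shared weights
    and return the best one (lowest held-out loss)."""
    loss_fn = torch.nn.CrossEntropyLoss()
    best = None
    with torch.no_grad():
        for _ in range(n_archs):
            arch = sample_arch()
            loss = sum(loss_fn(shared_model(x, arch), y).item()
                       for x, y in val_loader) / len(val_loader)
            if best is None or loss < best[0]:
                best = (loss, arch)
    return best
```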

Relevant Meta-Hyperparameters

The following meta-hyperparameters impact the behavior of our search algorithm, both in terms of search quality and computational cost (a configuration sketch follows the list):

  1. Training epochs

    Training with more architectures should help the shared weights generalize better to the likely unseen architectures in the evaluation step

    More epochs increase the computational time required for architecture search

  2. Batch size

    Decreasing the batch size increases the number of minibatch updates but at the cost of noisier gradient updates

    This may necessitate adjusting other meta-hyperparameters to account for the noisier gradient estimates

  3. Network size

    Increasing the search network size increases the dimension of the shared weights

    This should boost performance since a larger search network can store more information about different architectures

    Larger search networks also require more GPU memory

  4. Number of evaluated architectures

    Increasing the number of architectures that we evaluate using the shared weights allows for more exploration in the architecture search space

  5. Gradient clipping

    The maximum gradient norm bounds the size of each shared-weights update; later experiments reduce it to account for training with discrete architectures
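
The meta-hyperparameters above can be summarized as a single configuration object. The defaults shown are illustrative values within the ranges discussed in this document, not the authors' exact settings.

```python
from dataclasses import dataclass

@dataclass
class SearchConfig:
    """Meta-hyperparameters of random search with weight-sharing."""
    epochs: int = 50             # more epochs: better shared weights, more compute
    batch_size: int = 64         # smaller batches: more, but noisier, updates
    grad_clip: float = 1.0       # max gradient norm for shared-weight updates
    proxyless: bool = False      # search with the larger proxyless network or not
    num_eval_archs: int = 2000   # architectures scored with the shared weights
```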

Memory footprint

Train the shared weights using a single architecture at a time, loading into GPU memory only the weights associated with the operations and edges that the architecture activates

The memory footprint of our random search with weight-sharing can be reduced to that of a single model

Larger “proxyless” models are usually used in the final architecture evaluation step instead of the smaller proxy models used in the search step; the reduced memory footprint makes it possible to search directly over such proxyless models
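
A sketch of this memory-saving idea, assuming the shared weights are stored as a dictionary of candidate operations per edge; only the modules used by the sampled architecture are moved to the GPU for a given update. The interface and names are hypothetical.

```python
import torch

def activate_architecture(shared_ops, arch, device="cuda"):
    """Keep only the sampled architecture's operations on the GPU.

    shared_ops: dict mapping (edge, op_name) -> torch.nn.Module
    arch:       list of (edge, op_name) pairs for the sampled architecture
    """
    active = set(arch)
    for key, module in shared_ops.items():
        # Active operations go to the GPU; everything else stays on the CPU.
        module.to(device if key in active else "cpu")
```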

Experiments

Three stages

  1. Perform architecture search for a cell block on a cheaper search task
  2. Evaluate the best architecture from the first stage by retraining from scratch a larger network formed from multiple cell blocks of that architecture
  3. Perform the full evaluation of the best architecture from the second stage by training for more epochs or with more seeds

Architecture search with partial training is performed directly on the stage-2 network, and the best architecture is then selected for stage-3 evaluation
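
The three-stage protocol can be summarized schematically as follows; `search`, `partial_train`, and `full_eval` are placeholders for the stage-specific routines, not functions from the released code.

```python
def nas_pipeline(search, partial_train, full_eval):
    """Stage 1: cheap architecture search on the proxy task.
    Stage 2: partially train the best architecture(s) in a larger network.
    Stage 3: full evaluation (more epochs and/or seeds) of the stage-2 winner."""
    candidates = search()                       # stage 1 (proxy network)
    best = max(candidates, key=partial_train)   # stage 2 (proxyless network)
    return full_eval(best)                      # stage 3
```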

PTB benchmark

The network in the first stage is referred to as the proxy network, and the network in the later stages as the proxyless network

Final Search Results

  1. Evaluate the ASHA baseline

    ASHA evaluated over 300 architectures

    Result demonstrates that the gap between SOTA NAS methods and standard hyperparameter optimization approaches on the PTB benchmark is significantly smaller than that suggested by existing comparisons to random search

  2. Evaluate random search with weight-sharing with tuned meta-hyperparameters

    achieving SOTA perplexity compared to previous NAS approaches

    Manually designed architectures are competitive with RNN cells designed by NAS methods on this benchmark

    The manually designed LSTM with a mixture of experts in the softmax layer (MoS) outperforms automatically designed cells

  3. Examine the reproducibility of the NAS methods

Impact of meta-hyperparameters

Perform 4 separate trials of each version of random search with weight-sharing

In stage 1, train the shared weights and then use them to evaluate 2000 randomly sampled architectures

In stage 2, select the best of those 2000 architectures, according to the shared weights, and train it from scratch using the proxyless network for 300 epochs

Adjusting the following meta-hyperparameters:

  • In stage 1:
    • gradient clipping
    • batch size
    • network size

4 Random configurations:

  • Random 1: using the same setup as DARTS

  • Random 2: decrease the maximum gradient norm to account for discrete architectures, so that gradient updates are not as large in each direction

  • Random 3: decrease batch size from 256 to 64 in order to increase the number of architectures used to train the shared weights

  • Random 4: train the larger proxyless network architecture with shared weights, increasing the number of parameters in the model

The gap between the stage-2 and stage-3 results stems from the fact that we did not perform any additional hyperparameter tuning in stage 3

Investigating Reproducibility

Examine the stage2 intermediate results

Even partial training for 300 epochs does not recover the correct ranking; evaluation using shared weights further obscures the signal

Overall, the experiments demonstrate high variance in the stage-2 intermediate results across trials, along with issues related to differing convergence rates for different architectures

CIFAR-10 Benchmark

Final Search Results

These results suggest that the gap between SOTA NAS methods and standard hyperparameter optimization is much smaller than previously reported

Evaluate random search with weight-sharing with tuned meta-hyperparameters

Random search with weight-sharing can also directly search over larger proxyless networks since it trains using discrete architectures

We hypothesize that using a proxyless network and applying random search with weight-sharing to the same search space as ProxylessNAS would further improve our results; this is left to future work

The final results are quite similar across independent runs for both DARTS and random search with weight-sharing

Impact of meta-hyperparameters

The following meta-hyperparameters affect both the training of the shared weights and the evaluation of architectures using these trained weights:

  • number of training epochs
  • gradient clipping
  • number of architectures evaluated using the shared weights
  • network size

5 Random configurations:

  1. Start by training the shared weights with the proxy network used by DARTS and default values
  2. Increase the number of training epochs from 50 to 150, which increases the number of architectures used to update the shared weights
  3. Reduce the maximum gradient norm from 5 to 1 to adjust for training discrete architectures
  4. Increase the number of epochs for training the proxy network with shared weights to 300 and increase the number of architectures evaluated using the shared weights to 11k
  5. Increase the proxy network size to be as large as possible given available GPU memory

Similar to the PTB benchmark, the best setting for random search was Random (5), which has a larger network size

Investigating Reproducibility

Both DARTS and Random (5) are broadly reproducible on this benchmark

The ranking is unstable between 100 and 600 epochs in two significant cases (reproduced DARTS and Random (5) Run 2), which motivated our strategy of training the final architectures across trials to 600 epochs in order to select the best architecture for final evaluation across 10 seeds

Computational Cost

There is a trade-off between computational cost and the quality of the signal we get per evaluated architecture

  1. Full training

  2. Partial training:

    9 minutes per architecture for PTB benchmark and 19 minutes for CIFAR-10 benchmark

  3. Weight-sharing

    It is difficult to quantify the equivalent number of architectures evaluated by DARTS and random search with weight-sharing

    For random search with weight sharing, this is a tunable meta-hyperparameter and the quality of the performance estimates we receive can be noisy

    As a rough estimate, this corresponds to 0.2 minutes per architecture for the PTB benchmark and 0.8 minutes for the CIFAR-10 benchmark

In contrast, we were able to achieve nearly competitive performance with the default settings of ASHA using roughly the same total computation as that needed by DARTS and random search with weight-sharing
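
A back-of-the-envelope comparison on the PTB benchmark, using the per-architecture figures quoted above (roughly 300 architectures for ASHA and 2000 scored with the shared weights); treat these as order-of-magnitude estimates only.

```python
# Approximate per-architecture evaluation costs quoted above (minutes).
asha_cost = 300 * 9      # ~300 architectures via partial training
ws_cost = 2000 * 0.2     # ~2000 architectures scored with shared weights

print(f"ASHA / partial training : ~{asha_cost / 60:.0f} GPU-hours")  # ~45
print(f"weight sharing          : ~{ws_cost / 60:.0f} GPU-hours")    # ~7
```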

Available Code

The released code is deterministic conditioned on a fixed random seed
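
A common recipe for making runs deterministic given a fixed seed looks roughly like the sketch below; the exact flags required depend on the framework version, and this is not the authors' released code.

```python
import random

import numpy as np
import torch

def set_seed(seed: int) -> None:
    """Seed every source of randomness so a run is reproducible."""
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)
    # cuDNN must also be forced into deterministic mode.
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = False
```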

Conclusion

  1. Better baselines that accurately quantify the performance gains of NAS methods

  2. Ablation studies that isolate the impact of individual NAS components

  3. Reproducible results that engender confidence and foster scientific progress

    Consequently, we conclude that either significantly more computational resources need to be devoted to evaluating NAS methods, or more computationally tractable benchmarks need to be developed to lower the barrier for performing adequate empirical evaluations


Key point:

In stage (1), we train the shared weights and use them to evaluate a given number of randomly sampled architectures on a held-out validation set. In stage (2), we select the best architecture, according to the shared weights, to train from scratch using the proxyless network.