Large-Scale Evolution of Image Classifiers

Abstract

Use novel and intuitive mutation operators that navigate large search spaces

Stress that no human participation is required once evolution starts and that the output is a fully-trained model

Repeatability of results, variability in the outcomes, computational requirements

Introduction

The field generally perceives evolutionary algorithms as incapable of matching the accuracies of hand-designed models

It is possible to evolve such competitive models today, given enough computational power

Special emphasis on the simplicity of the algorithm, “one-shot” technique

The output is a fully trained neural network requiring no post-processing, and the algorithm has few impactful meta-parameters

Starting from poor-performing models with no conv layers, the algorithm must evolve complex conv neural networks while navigating a fairly unrestricted search space

Study the dependence on the meta-parameters

Neuro-evolution

  • NEAT algorithm
  • Crossover: recombining two models into one; fitness sharing: a strategy to promote diversity
  • direct encoding: in NEAT, every node and every connection is stored in the DNA
  • indirect encoding: CPPN, evolution of repeating features at different scales
  • back-propagation for optimizing neural network weights

The above studies create neural networks that are small in comparison to the typical modern architectures used for image classification. These datasets (e.g. CIFAR) require large models to achieve high accuracy

  • Reinforcement learning, q-learning
  • grid search, random search

Our approach:

  • Explore large model-architecture search spaces starting with basic initial conditions to avoid priming the system with information about known good strategies for the specific dataset at hand
  • a simplified graph as our DNA, transformed to a full neural network graph for training and evolution
  • mutations act above the level of single nodes: one mutation can insert a whole layer, and layers can also be removed
  • layer parameters are also mutable, without a predefined set of possible values to choose from
  • use back-propagation to optimize the weights, which can be inherited across mutations

Methods

Evolutionary Algorithm

Evolve a population of models

  • Each model or individual – a trained architecture
  • A model’s accuracy on the validation set – the individual’s quality or fitness
  • (each evolutionary step) a computer, the worker, chooses two individuals at random from the population and compares their fitnesses
  • The worst of the pair is killed; the best of the pair becomes a parent for reproduction
  • The worker creates a copy of the parent and modifies this copy – the child – by applying a mutation
  • After creating the child, the worker trains it, evaluates it on the validation set, and puts it back into the population, where it becomes alive
  • Evolution proceeds through repeated pairwise competitions of random individuals
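The tournament loop above can be sketched in a few lines of Python; `mutate` and `train_and_eval` are hypothetical stand-ins for the paper's mutation and training procedures:

```python
import random

def evolve_step(population, mutate, train_and_eval):
    """One worker step: pairwise tournament, then reproduction by mutation.

    population: list of (dna, fitness) pairs.
    mutate / train_and_eval: stand-ins for the paper's mutation and
    training procedures (assumptions, not the original code).
    """
    a, b = random.sample(range(len(population)), 2)
    # The worst of the pair is killed; the best becomes the parent.
    loser = a if population[a][1] < population[b][1] else b
    winner = b if loser == a else a
    parent_dna, _ = population[winner]
    child_dna = mutate(parent_dna)                  # copy + single mutation
    child_fitness = train_and_eval(child_dna)       # validation accuracy
    population[loser] = (child_dna, child_fitness)  # child becomes alive
    return population
```

Calling `evolve_step` repeatedly from many workers implements the repeated pairwise competitions.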

Require considerable computation

Develop a massively-parallel, lock-free infrastructure

  • Many workers operate asynchronously on different computers and do not communicate directly with each other
  • The population is stored on a shared file-system containing directories that represent the individuals
  • Operations on individuals are represented as atomic renames of their directories
  • Workers may conflict; the affected worker simply gives up and tries again
  • Population of 1000 individuals; the number of workers is 1/4 of the population size
  • dead individuals’ directories are frequently garbage-collected
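The lock-free coordination can be sketched with an atomic directory rename; the directory naming scheme here is an assumption for illustration:

```python
import os
import tempfile

def try_claim(individual_dir, worker_id):
    """Attempt to claim an individual by atomically renaming its directory.

    On POSIX filesystems os.rename is atomic, so at most one worker
    succeeds; losers catch the error, give up, and try another pair.
    (Sketch of the idea only; the directory layout is an assumption.)
    """
    claimed = individual_dir + ".claimed_by_" + worker_id
    try:
        os.rename(individual_dir, claimed)
        return claimed          # this worker won the race
    except OSError:
        return None             # another worker got there first
```

A worker that fails the rename simply picks another random pair and retries, so no locks are needed.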

Encoding and mutations

Individual architectures are encoded as a graph referred to as the DNA

vertices represent rank-3 tensors or activations

graph’s edges represent identity connections or conv and contain the mutable numerical parameters defining the conv’s properties

When multiple edges are incident on a vertex, resolution is done by choosing one of the incoming edges as the primary one

The activations coming from the non-primary edges are reshaped through zeroth-order interpolation in the case of the size and through truncation/padding in the case of the number of channels
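A minimal sketch of this shape resolution, assuming rank-3 activations laid out as (height, width, channels):

```python
import numpy as np

def match_to_primary(x, target_hw, target_c):
    """Reshape a non-primary activation to the primary edge's shape.

    Spatial size: zeroth-order (nearest-neighbour) interpolation.
    Channels: truncation or zero-padding.
    x: rank-3 array (height, width, channels). A sketch, not the paper's code.
    """
    h, w, c = x.shape
    th, tw = target_hw
    # Zeroth-order interpolation = index with nearest source coordinates.
    rows = np.arange(th) * h // th
    cols = np.arange(tw) * w // tw
    x = x[rows][:, cols]
    if c > target_c:                      # truncate extra channels
        x = x[:, :, :target_c]
    elif c < target_c:                    # zero-pad missing channels
        pad = np.zeros((th, tw, target_c - c), dtype=x.dtype)
        x = np.concatenate([x, pad], axis=2)
    return x
```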

Mutations are chosen from a predetermined set:

  • alter learning rate
  • identity, keep training
  • reset weights
  • insert conv (at a random location, may apply bn and relu activation or none at random)
  • remove conv
  • alter stride
  • alter number of channels (of random conv)
  • filter size (horizontal or vertical at random, on random conv, odd values only)
  • insert one to one (inserts a one-to-one/identity connection)
  • add skip (identity between random layers)
  • remove random skip

A mutation that acts on a numerical parameter chooses the new value at random around the existing value, e.g. between half and twice the original value. All sampling is from uniform distributions
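The numerical-parameter sampling can be written down directly; the half-to-twice factors are the example given above:

```python
import random

def mutate_numerical(value, lo_factor=0.5, hi_factor=2.0):
    """Sample a new value uniformly around the existing one.

    E.g. a new learning rate is drawn between half and twice the
    original value (the factors mirror the paper's example; sketch only).
    """
    return random.uniform(lo_factor * value, hi_factor * value)
```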

The dense and unbounded nature of the parameters results in the exploration of a truly large set of possible architectures

Initial conditions

Each initial individual is just a single-layer model with no conv

The experimenter contributes mostly through the choice of mutations that demarcate a search space

The use of poor initial conditions and a large search space limits the experimenter’s impact

Training and validation

CIFAR-10 and CIFAR-100

Computation cost

Identify the basic TensorFlow operations used by our model training and validation

Estimate the theoretical number of floating-point operations (FLOPs) required

Map from TF operation to FLOPs

For each individual, consider the TF operations in its architecture over one batch: $F_t$ FLOPs during training and $F_v$ FLOPs during validation; $N_t$ and $N_v$ are the number of training and validation batches. An individual’s cost is $F_t N_t + F_v N_v$. The cost of an experiment is then the sum of the costs of all its individuals
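The cost bookkeeping is simple arithmetic; a sketch (the tuple input format is an assumption — the paper derives $F_t$ and $F_v$ from TensorFlow ops):

```python
def individual_cost(f_t, f_v, n_t, n_v):
    """FLOPs for one individual: F_t*N_t (training) + F_v*N_v (validation)."""
    return f_t * n_t + f_v * n_v

def experiment_cost(individuals):
    """Experiment cost = sum of the costs of all its individuals.

    individuals: iterable of (F_t, F_v, N_t, N_v) tuples.
    """
    return sum(individual_cost(*ind) for ind in individuals)
```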

We intend our FLOPs measurement as a coarse estimate only

Weight inheritance

Architectures are trained to completion within an evolution experiment; otherwise, we would be forced to retrain the best model at the end, possibly having to explore its hyper-parameters

Training a large model to completion is prohibitively slow for evolution

To resolve this dilemma, children are allowed to inherit their parents’ weights whenever possible
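Weight inheritance can be sketched as a shape-matching copy; the dict-of-arrays layout is an assumption for illustration, not the paper's implementation:

```python
import numpy as np

def inherit_weights(parent_weights, child_shapes):
    """Copy a parent's trained weights into the child where shapes match.

    Layers untouched by the mutation keep their trained weights; layers
    whose shape changed are freshly initialized.
    """
    child = {}
    for name, shape in child_shapes.items():
        w = parent_weights.get(name)
        if w is not None and w.shape == shape:
            child[name] = w.copy()                         # inherited
        else:
            child[name] = np.random.normal(0, 0.1, shape)  # fresh init
    return child
```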

Reporting Methodology

To avoid overfitting, neither evolutionary nor neural network training ever sees the test set

“The best model” means the model with the highest validation accuracy

Test accuracy

This applies not only to the choice of the best individual within an experiment, but also to the choice of the best experiment

Experiments and results

Following questions:

  • Can a simple one-shot evolutionary process start from trivial initial conditions and yield fully trained models that rival hand-designed architectures?
  • What are the variability in outcomes, the parallelizability, and the computation cost of the method?
  • Can an algorithm designed while iterating on CIFAR-10 be applied, without any changes at all, to CIFAR-100 and still produce competitive models?

Q1: yes.

The architectures discovered turn out to be surprisingly simple

Q2: over 250 workers in parallel

Computation cost issue: we attempt to minimize the overhead by avoiding unnecessary disk access operations, to no avail; too much overhead remains, spent on a combination of neural network setup, data augmentation, and training step initialization

Weight inheritance is important in the process

Q3: yes, both are competitive with modern hand-designed networks

Analysis

Meta-parameters

Not all experiments reach the highest possible value; some populations get “trapped” at inferior local optima

This entrapment is affected by two important meta-parameters (parameters that are not optimized by the algorithm): population size and the number of training steps per individual

Effect of population size

Larger populations explore the space of models more thoroughly

Population of size 2 can get trapped at very low fitness values

Super-fit individual (an individual such that any one architecture mutation reduces its fitness)

If the super-fit individual wins once, it will win every time: it competes against its own child and wins again

The cycle repeats forever and the population stays trapped
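A toy simulation of this entrapment, under the assumption that every mutation of the super-fit individual strictly reduces fitness:

```python
import random

def simulate_trapped(steps=100, seed=0):
    """Toy model of entrapment with population size 2.

    Assume a super-fit individual: every architecture mutation of it has
    strictly lower fitness. The super-fit parent then wins every
    tournament against its own child, and the population never improves.
    """
    rng = random.Random(seed)
    population = [1.0, 0.5]              # fitness values only
    for _ in range(steps):
        a, b = rng.sample([0, 1], 2)
        winner = a if population[a] >= population[b] else b
        loser = b if winner == a else a
        # Child = mutated copy of the winner; mutation always hurts here.
        population[loser] = population[winner] - rng.uniform(0.1, 0.3)
    return max(population)
```

The maximum fitness never rises above the super-fit individual's value, illustrating why tiny populations can stall.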

Often, larger populations are better at handling local optima, at least beyond a size threshold

Effect of number of training steps

Accuracy increases with the number of training steps T for each individual

Larger T means an individual needs to undergo fewer identity mutations to reach a given level of training

Escaping local optima

While increasing the population size or the number of steps can prevent a trapped population from forming, we can also free an already-trapped population

Increasing the mutation rate or resetting all the weights of a population works well but is quite costly

Recombination

  • A child could inherit structure from one parent and mutation probabilities from another

    The goal was to allow individuals that progressed well due to good mutation choices to quickly propagate such choices to others

  • Recombining the trained weights from two parents

  • Recombining the structure so that the child fused the architectures of both parents side-by-side, generating wide models fast

While none of these approaches improved our recombination-free results, further study seems warranted

Conclusion

In this paper:

  • neuro-evolution is capable of constructing large, accurate networks
  • neuro-evolution can do this starting from trivial initial conditions
  • the process, once started, needs no experimenter participation
  • the process yields fully trained models
  • did not focus on reducing computation costs

Supplementary Materials

Methods details

each worker runs an outer loop that is responsible for selecting a pair of random individuals from the population

Becoming a parent, or killing the lower-fitness individual, may be skipped in order to keep the population size close to a set-point

Code analysis:

  1. Evolve population:
  • Select two random individuals from the population with random.sample()
  • For each individual of the pair, sync changes from other workers via the file-system and load everything else; ensure the individual is fully trained
  • Select by fitness (accuracy) with sort()
  • If the population is not too small, kill the worst of the pair
  • If the population is not too large, reproduce the best of the pair
  2. DNA

The encoding for an individual is represented by a serializable DNA class instance containing all information except for the trained weights

this encoding is a directed, acyclic graph where edges represent convolutions and vertices represent nonlinearities

dna_proto is a protocol buffer used to restore the DNA state from disk

  • DNA class

    • to_proto(), returns the DNA in protocol buffer form
    • add_edge(dna, from_vertex_id, to_vertex_id, edge_type, edge_id), add an edge to the DNA graph, ensuring internal consistency
  • Vertex class

    • edges_in, edges_out
    • type of activations: linear, bn_relu
    • some parts of the graph can be prevented from being acted on by mutations, using boolean flags
  • Edge class
    • from_vertex, to_vertex
    • type: conv
    • control the depth
    • control the shape of conv filters
    • control the strides of the conv
    • if inputs conflict, determine which input takes precedence in deciding the resolved depth or scale
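The three classes above can be sketched with dataclasses; field names follow the notes, but defaults and types are assumptions:

```python
from dataclasses import dataclass, field

@dataclass
class Vertex:
    vertex_id: int
    edges_in: list = field(default_factory=list)
    edges_out: list = field(default_factory=list)
    nonlinearity: str = "linear"      # "linear" or "bn_relu"
    mutable: bool = True              # flags can shield parts of the graph

@dataclass
class Edge:
    edge_id: int
    from_vertex: int
    to_vertex: int
    edge_type: str = "conv"           # "conv" or "identity"
    depth: int = 8                    # number of output channels
    filter_h: int = 3                 # conv filter shape
    filter_w: int = 3
    stride: int = 1
    precedence: int = 0               # resolves conflicting depths/scales

@dataclass
class DNA:
    vertices: dict = field(default_factory=dict)   # id -> Vertex
    edges: dict = field(default_factory=dict)      # id -> Edge

    def add_edge(self, from_vertex_id, to_vertex_id, edge_type, edge_id):
        """Add an edge, keeping vertex adjacency lists consistent."""
        edge = Edge(edge_id, from_vertex_id, to_vertex_id, edge_type)
        self.edges[edge_id] = edge
        self.vertices[from_vertex_id].edges_out.append(edge_id)
        self.vertices[to_vertex_id].edges_in.append(edge_id)
        return edge
```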

Mutations act on DNA instances. E.g.:

  • AlterLearningRateMutation
    • copy
    • random uniform
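A sketch of this mutation, assuming a dict-based DNA for brevity (the real class operates on the DNA graph instance):

```python
import copy
import random

def alter_learning_rate_mutation(dna, rng=random):
    """Sketch of AlterLearningRateMutation: copy the DNA, then perturb the
    learning rate uniformly around its current value (the half-to-twice
    range mirrors the paper's example; the dict DNA is an assumption)."""
    child = copy.deepcopy(dna)        # never mutate the parent in place
    lr = child["learning_rate"]
    child["learning_rate"] = rng.uniform(0.5 * lr, 2.0 * lr)
    return child
```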

Many mutations modify the structure. Mutations to insert and excise vertex-edge pairs build up the main convolutional backbone, while mutations to add and remove edges handle the skip connections

E.g., AddEdgeMutation can add a skip connection between random vertices

To evaluate an individual’s fitness, its DNA is unfolded into a TF model by the Model class

  • compute vertex nonlinearity
  • compute edge connection

FLOPs estimation

  • Run one iteration of training and collect run metadata. This metadata is used to determine which nodes were actually executed, as well as their argument shapes

  • compute analytical FLOPs for all nodes in the graph with tfprof_logger._get_logged_ops

  • Determine which nodes were executed during one training step by looking at elapsed execution time of each node

    run_metadata.step_stats.dev_stats

  • compute FLOPs of executed nodes, sum

Declare how to compute FLOPs for each operation type present

  • unary math operations: square, square root, log, negation, element-wise inverse, softmax, l2 norm
  • binary element wise operations: addition, subtraction, multiplication, division, minimum, maximum, power, squared difference, comparison operations
  • reduction operations: mean, sum, argmax, argmin
  • conv, average pooling, max pooling
  • matrix multiplication

E.g., for element-wise addition operation type:

from tensorflow.python.framework import graph_util, ops
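A pure-Python sketch of the per-op declaration idea (the actual code registers statistics functions with TensorFlow via these imports; the per-element costs below are illustrative assumptions):

```python
from math import prod

# FLOPs per output element, by operation type (illustrative subset;
# the softmax cost is a rough assumption, not a measured value).
FLOPS_PER_ELEMENT = {
    "add": 1, "sub": 1, "mul": 1, "div": 1,   # binary element-wise
    "square": 1, "neg": 1, "softmax": 5,      # unary operations
}

def op_flops(op_type, output_shape):
    """Analytical FLOPs of one op = per-element cost * output size."""
    return FLOPS_PER_ELEMENT[op_type] * prod(output_shape)
```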

Local optima and weight resetting

The identity mutation offers a mechanism for populations to get trapped in local optima

Simultaneously reset all the weights in a population that had plateaued

The simultaneous reset should put all the individuals on the same footing, so individuals that had accidentally trained more no longer have an unfair advantage

The results: populations suffer a temporary degradation in fitness immediately after the reset, as the individuals need to retrain. Later, the populations end up reaching higher optima

Across 10 experiments, we find that three successive resets tend to cause improvement

Drawback of weight inheritance


Future work may improve:

  • recombination
  • reducing computation costs
  • combine neuro-evolution with other automatic architecture discovery methods

Notes on: http://aixpaper.com/view/largescale_evolution_of_image_classifiers