Large-Scale Evolution of Image Classifiers
Abstract
Use novel and intuitive mutation operators that navigate large search spaces
Stress that no human participation is required once evolution starts and that the output is a fully-trained model
Repeatability of results, variability in the outcomes, computational requirements
Introduction
Evolutionary algorithms are generally perceived as incapable of matching the accuracies of hand-designed models
It is possible to evolve such competitive models today, given enough computational power
Special emphasis on the simplicity of the algorithm, “one-shot” technique
Fully trained neural network requiring no post-processing, has few impactful meta-parameters
Starting from poor-performing models with no convolutions, the algorithm must evolve complex convolutional neural networks while navigating a fairly unrestricted search space
Study the dependence on the meta-parameters
Related work
Neuro-evolution
- NEAT algorithm
- Recombination: merging two models into one; fitness sharing: a strategy to promote diversity
- direct encoding: in NEAT, every node and every connection is stored in the DNA
- indirect encoding: CPPN, evolution of repeating features at different scales
- back-propagation for optimizing neural network weights
The above studies create neural networks that are small compared to the typical modern architectures used for image classification. These datasets (e.g. CIFAR) require large models to achieve high accuracy
- Reinforcement learning, q-learning
- grid search, random search
Our approach:
- Explore large model-architecture search spaces starting with basic initial conditions to avoid priming the system with information about known good strategies for the specific dataset at hand
- a simplified graph as our DNA, transformed to a full neural network graph for training and evolution
- mutations act on more than single nodes: one mutation can insert a whole layer, and layers can also be removed
- layers' parameters are also mutable, without a predefined set of possible values to choose from
- use back-propagation to optimize the weights, which can be inherited across mutations
Methods
Evolutionary Algorithm
Evolve a population of models
- Each model or individual – trained architecture
- Model’s accuracy on validation set –individual’s quality or fitness
- (each evolutionary step) a computer, the worker, chooses two individuals at random from the population and compares their fitnesses
- Worst of the pair is killed; best of the pair becomes the parent for reproduction
- The worker creates a copy of the parent and modifies this copy – the child – by applying a mutation
- After creating the child, the worker trains it, evaluates it on the validation set, and puts it back into the population; the child becomes alive
- Using repeated pairwise competitions of random individuals
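The pairwise-tournament step above can be sketched as follows; the helper names `train_and_eval` and `mutate` are hypothetical stand-ins for the paper's training and mutation machinery:

```python
import random

def evolution_step(population, train_and_eval, mutate):
    """One evolutionary step, as a single worker would perform it (sketch).

    `population` is a list of (dna, fitness) pairs; `train_and_eval`
    and `mutate` are assumed helpers, not the paper's actual code.
    """
    i, j = random.sample(range(len(population)), 2)
    winner, loser = (i, j) if population[i][1] >= population[j][1] else (j, i)
    parent_dna, _ = population[winner]       # best of the pair becomes the parent
    del population[loser]                    # worst of the pair is killed
    child_dna = mutate(parent_dna)           # copy the parent, apply one mutation
    fitness = train_and_eval(child_dna)      # train child, measure validation accuracy
    population.append((child_dna, fitness))  # child joins the population, alive
```

Repeating this step across many workers yields the population-level tournament the notes describe.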
Requires considerable computation
Develop a massively-parallel, lock-free infrastructure
- Many workers operate asynchronously on different computers and do not communicate directly with each other
- The population is stored on a shared file-system, which contains directories that represent the individuals
- Operations on the individuals are represented as atomic renames of their directories
- Workers may conflict; the affected worker simply gives up and tries again
- population of ~1000 individuals; the number of workers is roughly 1/4 of the population size
- dead individuals’ directories are frequently garbage-collected
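The lock-free coordination rests on the atomicity of renames; a minimal sketch (directory layout and naming are illustrative, not the paper's):

```python
import os

def try_claim(individual_dir):
    """Attempt to claim an individual by atomically renaming its directory.

    os.rename is atomic on POSIX filesystems, so at most one worker's
    rename succeeds; a worker whose rename fails simply gives up and
    samples another pair, as described in the notes. The ".claimed"
    suffix is an assumption for illustration.
    """
    try:
        os.rename(individual_dir, individual_dir + ".claimed")
        return True   # this worker now owns the individual
    except OSError:
        return False  # another worker won the race; try again elsewhere
```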
Encoding and mutations
Individual architectures are encoded as a graph, referred to as the DNA
vertices represent rank-3 tensors or activations
graph’s edges represent identity connections or conv and contain the mutable numerical parameters defining the conv’s properties
When multiple edges are incident on a vertex, resolution is done by choosing one of the incoming edges as the primary one
The activations coming from the non-primary edges are reshaped through zeroth-order interpolation in the case of the size and through truncation/padding in the case of the number of channels
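The reshaping of non-primary activations can be sketched in NumPy; the helper name and exact resampling arithmetic are our assumptions, since the notes only specify zeroth-order interpolation for size and truncation/padding for channels:

```python
import numpy as np

def match_primary(x, height, width, channels):
    """Reshape a non-primary activation to the primary edge's shape (sketch).

    `x` is a rank-3 tensor (H, W, C). Spatial dimensions are matched by
    zeroth-order (nearest-neighbour) resampling; the channel dimension
    is matched by truncation or zero-padding.
    """
    h, w, c = x.shape
    # nearest-neighbour resampling of the spatial dimensions
    rows = np.arange(height) * h // height
    cols = np.arange(width) * w // width
    x = x[rows][:, cols]
    # truncate or zero-pad the channel dimension
    if c >= channels:
        x = x[:, :, :channels]
    else:
        x = np.pad(x, ((0, 0), (0, 0), (0, channels - c)))
    return x
```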
Predetermined set of mutations:
- alter learning rate
- identity, keep training
- reset weights
- insert conv (at a random location, may apply bn and relu activation or none at random)
- remove conv
- alter stride
- alter number of channels (of random conv)
- filter size (horizontal or vertical at random, on random conv, odd values only)
- insert one to one (inserts a one-to-one/identity connection)
- add skip (identity between random layers)
- remove random skip
A mutation that acts on a numerical parameter chooses the new value at random around the existing value; all sampling is from uniform distributions, e.g. from half to twice the original value
The dense and unbounded nature of the parameters results in the exploration of a truly large set of possible architectures
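A sketch of such a numerical mutation, assuming the half-to-twice uniform range mentioned above (function and parameter names are ours):

```python
import random

def mutate_scalar(value, lo_factor=0.5, hi_factor=2.0):
    """Sample a new value for a numerical parameter (sketch).

    Samples uniformly in a dense, unbounded-over-time range around the
    current value, e.g. from half to twice the original; repeated
    application can reach any positive value.
    """
    return random.uniform(lo_factor * value, hi_factor * value)
```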
Initial conditions
Each initial individual constitutes just a single-layer model with no convolutions
The experimenter contributes mostly through the choice of mutations that demarcate a search space
The use of poor initial conditions and a large search space limits the experimenter's impact
Training and validation
Datasets: CIFAR-10, CIFAR-100
Computation cost
Identify the basic TensorFlow operations used by our model training and validation
Estimate the theoretical number of floating-point operations (FLOPs) required
Map from TF operation to FLOPs
For each individual, let $F_t$ be the FLOPs incurred by the TF operations in its architecture over one training batch and $F_v$ the FLOPs over one validation batch, with $N_t$ and $N_v$ the numbers of training and validation batches. The individual's cost is $F_t N_t + F_v N_v$. The cost of an experiment is then the sum of the costs of all its individuals
We intend our FLOPs measurement as a coarse estimate only
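The cost bookkeeping above can be written out directly (function names are ours):

```python
def individual_cost(f_t, n_t, f_v, n_v):
    """Cost of one individual in FLOPs: training plus validation.

    f_t / f_v: FLOPs per training / validation batch;
    n_t / n_v: number of training / validation batches.
    """
    return f_t * n_t + f_v * n_v

def experiment_cost(individuals):
    """An experiment's cost is the sum over all its individuals."""
    return sum(individual_cost(*ind) for ind in individuals)
```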
Weight inheritance
We want architectures to be fully trained within the evolution experiment; otherwise we would be forced to retrain the best model at the end, possibly having to re-explore its hyper-parameters
Training a large model to completion is prohibitively slow for evolution
To resolve the dilemma, allow the children to inherit the parameters’ weights whenever possible
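A minimal sketch of shape-based weight inheritance, assuming weights are kept as name-to-array maps; the naming scheme and the shape-matching rule are our illustration of "whenever possible":

```python
import numpy as np

def inherit_weights(parent_weights, child_shapes, init):
    """Copy parent weights into the child wherever shapes allow (sketch).

    `parent_weights` maps layer names to arrays; `child_shapes` maps the
    child's layer names to the shapes it needs; `init` creates fresh
    weights for layers the mutation changed. All names are illustrative.
    """
    child = {}
    for name, shape in child_shapes.items():
        w = parent_weights.get(name)
        if w is not None and w.shape == shape:
            child[name] = w.copy()     # inherited: keeps accumulated training
        else:
            child[name] = init(shape)  # new or altered layer: fresh weights
    return child
```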
Reporting Methodology
To avoid overfitting, neither the evolutionary search nor the neural-network training ever sees the test set
"The best model" means the model with the highest validation accuracy
Test accuracy: applies not only to the choice of the best individual within an experiment, but also to the choice of the best experiment
Experiments and results
Following questions:
- Can a simple one-shot evolutionary process start from trivial initial conditions and yield fully trained models that rival hand-designed architectures?
- What are the variability in outcomes, the parallelizability, and the computation cost of the method?
- Can an algorithm designed iterating on CIFAR-10 be applied, without any changes at all, to CIFAR-100 and still produce competitive models?
Q1: yes.
The architectures discovered turn out to be surprisingly simple
Q2: over 250 workers run in parallel
Computation-cost issue: we attempted to minimize overhead by avoiding unnecessary disk-access operations, to no avail; too much overhead remains, spent on a combination of neural-network setup, data augmentation, and training-step initialization
Weight inheritance is important in the process
Q3: yes, both are competitive with modern hand-designed networks
Analysis
Meta-parameters
Not all experiments reach the highest possible accuracy; some populations get "trapped" at inferior local optima
This entrapment is affected by two important meta-parameters (parameters that are not optimized by the algorithm): population size and the number of training steps per individual
Effect of population size
Larger populations explore the space of models more thoroughly
A population of size 2 can get trapped at very low fitness values
Super-fit individual (an individual such that any one architecture mutation reduces its fitness)
If the super-fit individual wins once, it will win every time: it competes against its own children and wins again
The cycle repeats forever and the population is trapped
Larger populations are often better at handling local optima, at least beyond a size threshold
Effect of number of training steps
Accuracy increases with number of training steps T for each individual
Larger T means an individual needs to undergo fewer identity mutations to reach a given level of training
Escaping local optima
While increasing the population size or the number of training steps can prevent a trapped population from forming, we can also free an already-trapped population
Increasing the mutation rate or resetting all the weights of a population work well but are quite costly
Recombination
A child could inherit structure from one parent and mutation probabilities from another
The goal was to allow individuals that progressed well due to good mutation choices to quickly propagate such choices to others
Recombining the trained weights from two parents
Recombining the structure so that the child fused the architectures of both parents side-by-side, generating wide models fast
While none of these approaches improved our recombination-free results, further study seems warranted
Conclusion
In this paper:
- neuro-evolution is capable of constructing large, accurate networks
- neuro-evolution can do this starting from trivial initial conditions
- once started, the process needs no experimenter participation
- the process yields fully trained models
- did not focus on reducing computation costs
Supplementary Materials
Methods details
each worker runs an outer loop that is responsible for selecting a pair of random individuals from the population
Becoming a parent, or killing the lower-fitness individual, may be skipped in order to keep the population size close to a set-point
Code analysis:
- Evolve population:
- Select two random individuals from the population
random.sample()
- for each individual pair, sync changes from other workers via the file-system, then load everything else; ensure the individuals are fully trained
- select by fitness (accuracy)
sort()
- If the population is not too small, kill the worst of the pair
- If the population is not too large, reproduce the best of the pair
- DNA
The encoding for an individual is represented by a serializable DNA class instance containing all information except for the trained weights
this encoding is a directed, acyclic graph where edges represent convolutions and vertices represent nonlinearities
dna_proto
is a protocol buffer used to restore the DNA state from disk
DNA class
to_proto()
, returns the DNA in protocol buffer form
add_edge(dna, from_vertex_id, to_vertex_id, edge_type, edge_id)
, adds an edge to the DNA graph, ensuring internal consistency
Vertex class
- edges_in, edges_out
- type of activations: linear, bn_relu
- some parts of the graph can be prevented from being acted on by mutations, using boolean flags
- Edge class
- from_vertex, to_vertex
- type: conv
- control the depth
- control the shape of conv filters
- control the strides of the conv
- if there are conflicts in inputs, determine which input takes precedence in deciding the resolved depth or scale
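A stripped-down sketch of how the DNA, Vertex, and Edge classes described above might fit together; only the fields and the `add_edge` signature listed in these notes come from the source, everything else is an assumption:

```python
class Vertex:
    """A nonlinearity in the graph: 'linear' or 'bn_relu' (sketch)."""
    def __init__(self, vertex_id, vertex_type="bn_relu"):
        self.vertex_id = vertex_id
        self.type = vertex_type
        self.edges_in, self.edges_out = [], []

class Edge:
    """A connection between vertices: 'conv' (with mutable parameters) or 'identity'."""
    def __init__(self, edge_id, from_vertex, to_vertex, edge_type="conv"):
        self.edge_id = edge_id
        self.from_vertex, self.to_vertex = from_vertex, to_vertex
        self.type = edge_type

class DNA:
    """The serializable encoding of an architecture (no trained weights)."""
    def __init__(self):
        self.vertices, self.edges = {}, {}

def add_edge(dna, from_vertex_id, to_vertex_id, edge_type, edge_id):
    """Add an edge and keep both endpoints' adjacency lists consistent."""
    edge = Edge(edge_id, from_vertex_id, to_vertex_id, edge_type)
    dna.edges[edge_id] = edge
    dna.vertices[from_vertex_id].edges_out.append(edge_id)
    dna.vertices[to_vertex_id].edges_in.append(edge_id)
    return edge
```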
Mutations act on a DNA instance. E.g.:
- AlterLearningRateMutation
- copy
- random uniform
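A sketch of what AlterLearningRateMutation might look like, following the copy-then-uniform-sample pattern noted above; the attribute name `learning_rate` and the half-to-twice range are assumptions:

```python
import copy
import random

class AlterLearningRateMutation:
    """Mutation that copies the DNA and perturbs its learning rate (sketch)."""

    def mutate(self, dna):
        child = copy.deepcopy(dna)  # never modify the parent in place
        old = dna.learning_rate
        # sample the new value uniformly around the old one
        child.learning_rate = random.uniform(0.5 * old, 2.0 * old)
        return child
```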
Many mutations modify the structure: mutations that insert and excise vertex-edge pairs build up the main convolutional column, while mutations that add and remove edges handle the skip connections
E.g., AddEdgeMutation can add a skip connection between random vertices
To evaluate an individual’s fitness, its DNA is unfolded into a TF model by the Model class
- compute vertex nonlinearity
- compute edge connection
FLOPs estimation
Run one iteration of training and collect run metadata. This metadata will be used to determine the nodes which were actually executed as well as their argument shapes
compute analytical FLOPs for all nodes in the graph
tfprof_logger._get_logged_ops
Determine which nodes were executed during one training step by looking at elapsed execution time of each node
run_metadata.step_stats.dev_stats
compute FLOPs of executed nodes, sum
Declare how to compute FLOPs for each operation type present
- unary math operations: square, square root, log, negation, element-wise inverse, softmax, l2 norm
- binary element wise operations: addition, subtraction, multiplication, division, minimum, maximum, power, squared difference, comparison operations
- reduction operations: mean, sum, argmax, argmin
- conv, average pooling, max pooling
- matrix multiplication
E.g., for the element-wise addition operation type:
from tensorflow.python.framework import graph_util, ops
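For element-wise addition the count is simply one add per output element; a plain-Python stand-in for such a per-op FLOPs declaration (broadcasting details ignored, function names ours):

```python
from functools import reduce
from operator import mul

def _num_elements(shape):
    """Total number of elements in a tensor of the given shape."""
    return reduce(mul, shape, 1)

def add_flops(output_shape):
    """FLOPs for an element-wise addition: one add per output element.

    A simplified stand-in for the per-operation FLOPs declarations the
    notes allude to; the real code registers these against TF op types.
    """
    return _num_elements(output_shape)
```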
Local optima and weight resetting
The identity mutation offers a mechanism for populations to get trapped in local optima
Simultaneously reset all the weights in a population that has plateaued
The simultaneous reset should put all individuals on the same footing, so individuals that had accidentally trained more no longer have an unfair advantage
The results: populations suffer a temporary degradation in fitness immediately after the reset, as the individuals need to retrain; later, the populations end up reaching higher optima
Across 10 experiments, we find that three successive resets tend to cause improvement
Drawback of weight inheritance
Further improvements may include:
- recombination
- reducing computation costs
- combine neuro-evolution with other automatic architecture discovery methods
Notes on: http://aixpaper.com/view/largescale_evolution_of_image_classifiers