Designing Neural Network Architectures using Reinforcement Learning, ICLR 2017
Abstract
MetaQNN, a meta-modeling algorithm based on reinforcement learning
The learning agent is trained to sequentially choose CNN layers using Q-learning with an $\epsilon$-greedy exploration strategy and experience replay
Introduction
- sequentially picking layers of a CNN model (overall loop sketched after this list)
- begins with random exploration and slowly shifts to exploiting its findings, selecting higher-performing models via the $\epsilon$-greedy strategy
- reward: validation accuracy
- experience replay
- suited for transfer learning tasks
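A bird's-eye sketch of this loop, assuming hypothetical callables `epsilon_schedule`, `sample_architecture`, `train_and_evaluate`, and `q_update_replay` (not the authors' code; the Q update, state space, and action constraints are sketched in later sections):

```python
# Bird's-eye view of the MetaQNN loop described in the bullets above (illustrative sketch).
def metaqnn(num_models, epsilon_schedule, sample_architecture, train_and_evaluate, q_update_replay):
    q_table, replay = {}, {}
    for i in range(num_models):
        eps = epsilon_schedule(i)                    # anneals from 1.0 (explore) toward 0.1 (exploit)
        arch = sample_architecture(q_table, eps)     # epsilon-greedy, layer-by-layer architecture choice
        if arch not in replay:                       # repeated topologies reuse their cached accuracy
            replay[arch] = train_and_evaluate(arch)  # reward = validation accuracy
        q_update_replay(q_table, replay)             # experience-replay Q-value updates
    return q_table, replay
```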
Related work
Designing neural network architectures
- NEAT algorithm: Evolving neural networks through augmenting topologies
- screening methods from genetics: A high-throughput screening approach to discovering good forms of biologically inspired visual representation
- sidestep the architecture selection process: Convolutional neural fabrics
- Bayesian optimization:
- Speeding up automatic hyperparameter optimization of deep neural networks by extrapolation of learning curves
	- Algorithms for hyper-parameter optimization
Reinforcement learning
- game-playing agents
- robotic control
- over-exploration can lead to slow convergence times
- over-exploitation can lead to convergence to local optima
Background
Q-learning
MDP in a finite-horizon environment
constraining the environment to be finite-horizon ensures that the agent deterministically terminates in a finite number of time steps
Discrete and finite state space: $S$, action space: $U$
stochastic transitions: $p(s_j|s_i, u)$
reward: $r_t(s, u, s')$
Maximize the total expected reward over all possible trajectories
$$
R_{\mathcal{T}_{i}}=\sum_{\left(s, u, s^{\prime}\right) \in \mathcal{T}_{i}} \mathbb{E}_{r | s, u, s^{\prime}}\left[r | s, u, s^{\prime}\right]
$$
define the maximum total expected reward from state $s_i$ when taking action $u$ to be $Q^*(s_i, u)$, the action-value function
Bellman equation:
$$
Q^{*}\left(s_{i}, u\right)=\mathbb{E}_{s_{j} | s_{i}, u}\left[\mathbb{E}_{r | s_{i}, u, s_{j}}\left[r | s_{i}, u, s_{j}\right]+\gamma \max _{u^{\prime} \in \mathcal{U}\left(s_{j}\right)} Q^{*}\left(s_{j}, u^{\prime}\right)\right]
$$
an iterative update:
$$
Q_{t+1}\left(s_{i}, u\right)=(1-\alpha) Q_{t}\left(s_{i}, u\right)+\alpha\left[r_{t}+\gamma \max _{u^{\prime} \in \mathcal{U}\left(s_{j}\right)} Q_{t}\left(s_{j}, u^{\prime}\right)\right]
$$
- $\alpha$: the Q-learning rate, determines the weight given to new information over old information
- $\gamma$: the discount factor, determines the weight given to short-term rewards over future rewards
- model-free, without ever explicitly constructing an estimate of environmental dynamics
- off policy
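A minimal sketch of this tabular update on a dict-backed Q-table; the state/action labels in the usage example and the default value of 0.5 (an assumed initialization) are illustrative:

```python
# Tabular Q-learning update: Q(s,u) <- (1-alpha)*Q(s,u) + alpha*(r + gamma*max_u' Q(s',u')),
# stored in a plain dict keyed by (state, action) pairs.
def q_update(q, s, u, r, s_next, next_actions, alpha=0.01, gamma=1.0, default=0.5):
    """next_actions is U(s_next); it is empty when s_next is terminal."""
    best_next = max((q.get((s_next, a), default) for a in next_actions), default=0.0)
    q[(s, u)] = (1 - alpha) * q.get((s, u), default) + alpha * (r + gamma * best_next)

# Example transition (hypothetical state/action labels):
q = {}
q_update(q, s="C_depth1", u="P_depth2", r=0.0, s_next="P_depth2",
         next_actions=["FC_depth3", "SM_depth3"])
```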
with probability $\epsilon$, a random action is taken
with probability $1-\epsilon$, the greedy action $\arg\max_{u \in \mathcal{U}(s_i)} Q_t(s_i, u)$ is taken
$\epsilon = 1$, exploration; $\epsilon = 0$, exploitation
When the exploration cost is large, it is beneficial to use experience replay
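A sketch of $\epsilon$-greedy selection and replay, assuming the `q_update` helper above is passed in and that the reward (validation accuracy) is granted only on a path's final transition:

```python
import random

def select_action(q, state, actions, epsilon, default=0.5):
    """Epsilon-greedy: a random action with probability epsilon, otherwise argmax_u Q(state, u)."""
    if random.random() < epsilon:
        return random.choice(actions)
    return max(actions, key=lambda u: q.get((state, u), default))

def replay(q, memory, q_update, num_samples=100):
    """Experience replay: re-apply the Q update to transitions from randomly sampled past paths.

    `memory` holds (path, reward) pairs, where a path is a list of
    (state, action, next_state, next_actions) transitions and the reward
    is granted only on the final transition.
    """
    for path, reward in random.sample(memory, min(num_samples, len(memory))):
        for i, (s, u, s_next, next_actions) in enumerate(path):
            r = reward if i == len(path) - 1 else 0.0
            q_update(q, s, u, r, s_next, next_actions)
```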
Designing Neural Network Architectures with Q-learning
the task of training a learning agent to sequentially choose neural network layers
model the layer selection process as an MDP, with the assumption that a layer that performs well in one network should also perform well in another network
The CNN architecture defined by the agent's path is trained on the chosen learning problem, and the agent receives a reward equal to the validation accuracy
The state space
Each state is defined as a tuple of all relevant layer parameters
5 different types of layers:
- convolution (C)
- pooling (P)
- fully connected (FC)
- global average pooling (GAP)
- softmax (SM)
define the relevant parameters for each layer type and discretize them
Also include layer depth in the state, and specify a maximum number of layers the agent may select before terminating
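A hypothetical state encoding as a tuple of discretized parameters (field names are illustrative; the paper's table of layer parameters defines the exact set and allowed values per layer type):

```python
from collections import namedtuple

# Hypothetical state encoding: a tuple of discretized layer parameters.
LayerState = namedtuple("LayerState", [
    "layer_type",   # one of "C", "P", "FC", "GAP", "SM"
    "layer_depth",  # position of the layer in the network (bounded by the maximum depth)
    "filter_size",  # discretized receptive field size (C / P layers)
    "num_units",    # discretized number of filters (C) or neurons (FC)
    "rsize_bin",    # representation size bucket, see the next subsection
])

s = LayerState(layer_type="C", layer_depth=1, filter_size=3, num_units=64, rsize_bin=1)
```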
Parameter: Representation size
Pooling and convolution layers may lead the agent on a trajectory where the intermediate signal representation is reduced to a size too small for further processing
Add a representation size (R-size) parameter to the state, with sizes binned into three discrete buckets
Restrict actions from states with R-size $n$ to layers that have a receptive field size less than or equal to $n$
However, binning adds uncertainty to the state transitions
E.g., bin 1: $[8,\infty)$, bin 2: $(0,7]$
R-size: 18, R-size bin: 1 –P(2,2)–> R-size: 9, R-size bin: 1
R-size: 14, R-size bin: 1 –P(2,2)–> R-size: 7, R-size bin: 2
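A toy two-bucket version of this binning, reproducing the example transitions:

```python
# Toy two-bucket binning from the example above: bin 1 for sizes >= 8, bin 2 otherwise.
def rsize_bin(n):
    return 1 if n >= 8 else 2

# The same 2x2 (stride-2) pooling action from the same bin can land in different bins,
# depending on the true representation size hidden by the binning:
for size in (18, 14):
    pooled = size // 2
    print(f"{size} (bin {rsize_bin(size)}) -> {pooled} (bin {rsize_bin(pooled)})")
# prints: 18 (bin 1) -> 9 (bin 1)
#         14 (bin 1) -> 7 (bin 2)
```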
The action space
Allow:
- agent to terminate a path at any point
- transitions from a state with layer depth $i$ only to a state with layer depth $i+1$
Limit the number of fully connected layers to a maximum of two, to avoid too many learnable parameters
convolution may transition to a state with any other layer type
pooling may transition to a state with any other layer type other than pooling, because consecutive pooling layers are equivalent to a single, larger pooling layer
only states with representation size in bins $(8,4]$ and $(4,1]$ may transition to an FC layer
A majority of these constraints are in place to enable faster convergence
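A partial sketch of these constraints as an allowed-action filter (illustrative only, not the paper's full transition table; the post-FC restriction is an assumption):

```python
# Partial sketch of the transition constraints above.
# Bins 2 and 3 stand for representation sizes in (8,4] and (4,1].
def allowed_layer_types(prev_type, num_fc, rsize_bin, at_max_depth):
    if at_max_depth:
        return ["SM"]                                   # must terminate at the depth limit
    allowed = ["C", "GAP", "SM"]                        # termination is allowed at any point
    if prev_type != "P":
        allowed.append("P")                             # no consecutive pooling layers
    if num_fc < 2 and rsize_bin in (2, 3):              # at most two FC layers, small R-size only
        allowed.append("FC")
    if prev_type == "FC":                               # assumption: after FC, only FC/SM follow
        allowed = [t for t in allowed if t in ("FC", "SM")]
    return allowed

print(allowed_layer_types(prev_type="P", num_fc=0, rsize_bin=2, at_max_depth=False))
# ['C', 'GAP', 'SM', 'FC']
```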
Q-learning training procedure
- Q-learning rate $(\alpha)$ set to 0.01
- discount factor $(\gamma)$ set to 1
- decrease $\epsilon$ from 1 to 0.1 in steps over the course of training
maintain a replay dictionary:
- the network topology
- prediction performance on a validation set
After each model is sampled and trained, the agent randomly samples 100 models from the replay dictionary and applies the Q-value update to each transition of each sampled model
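A sketch of this replay step, assuming a hypothetical helper `apply_q_updates_along_path(q_table, topology, accuracy)` that runs the update over every transition of a stored path (accuracy as the reward on the final transition):

```python
import random

replay_dictionary = {}   # network topology (hashable description) -> validation accuracy

def record_and_replay(topology, accuracy, q_table, apply_q_updates_along_path, num_replay=100):
    # Cache the newly trained model; if a topology was sampled before, the paper reuses
    # its cached accuracy instead of retraining it.
    replay_dictionary[topology] = accuracy
    sampled = random.sample(list(replay_dictionary.items()),
                            min(num_replay, len(replay_dictionary)))
    for topo, acc in sampled:
        apply_q_updates_along_path(q_table, topo, acc)
```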
Experiment details
Result
- model-selection analysis
- prediction performance
- transfer learning ability
Concluding remarks
Future:
In our current implementation, we use the same set of hyperparameters to train all network topologies during the Q-learning phase and further finetune the hyperparameters for top models selected by the MetaQNN agent. However, our approach could be combined with hyperparameter optimization methods to further automate the network design process. Moreover, we constrict the state-action space using coarse, discrete bins to accelerate convergence. It would be possible to move to larger state-action spaces using methods for Q-function approximation (Bertsekas, 2015; Mnih et al., 2015).