Designing Neural Network Architectures using Reinforcement Learning

Designing Neural Network Architectures using Reinforcement Learning, ICLR 2017

Abstract

MetaQNN, a meta-modeling algorithm based on reinforcement learning

Learning agent is trained to sequentially choose CNN layers using Q-learning with an $\epsilon$-greedy exploration strategy and experience replay

Introduction

  • sequentially picking layers of a CNN model
  • begins with random exploration and slowly starts to exploit its findings to select higher-performing models via the $\epsilon$-greedy strategy
  • reward: validation accuracy
  • experience replay
  • suited for transfer learning tasks

Designing neural network architectures

  • NEAT algorithm: Evolving neural networks through augmenting topologies
  • screening methods in genetic: A high-throughput screening approach to discovering good forms of biologically inspired visual representation
  • sidestep the architecture selection process: Convolutional neural fabrics
  • Bayesian optimization:
    • Speeding up automatic hyperparameter optimization of deep neural networks by extrapolation of learning curves
    • Algorithms for hyper-parameter optimization

Reinforcement learning

  • game-playing agents
  • robotic control
  • over-exploration can lead to slow convergence times
  • over-exploitation can lead to convergence to local optima

Background

Q-learning

MDP in a finite-horizon environment

Constraining the environment to be finite-horizon ensures that the agent will deterministically terminate in a finite number of time steps

  • Discrete and finite state space: $\mathcal{S}$, action space: $\mathcal{U}$

  • stochastic transitions: $p(s_j|s_i, u)$

  • reward: $r_t(s, u, s')$

Maximize the total expected reward over all possible trajectories
$$
R_{\mathcal{T}_{i}}=\sum_{\left(s, u, s^{\prime}\right) \in \mathcal{T}_{i}} \mathbb{E}_{r | s, u, s^{\prime}}\left[r | s, u, s^{\prime}\right]
$$
The maximum total expected reward is denoted $Q^*(s_i,u)$, the action-value function

Bellman equation:
$$
Q^{*}\left(s_{i}, u\right)=\mathbb{E}_{s_{j} | s_{i}, u}\left[\mathbb{E}_{r | s_{i}, u, s_{j}}\left[r | s_{i}, u, s_{j}\right]+\gamma \max _{u^{\prime} \in \mathcal{U}\left(s_{j}\right)} Q^{*}\left(s_{j}, u^{\prime}\right)\right]
$$
an iterative update:
$$
Q_{t+1}\left(s_{i}, u\right)=(1-\alpha) Q_{t}\left(s_{i}, u\right)+\alpha\left[r_{t}+\gamma \max _{u^{\prime} \in \mathcal{U}\left(s_{j}\right)} Q_{t}\left(s_{j}, u^{\prime}\right)\right]
$$

  • $\alpha$, the Q-learning rate, determines the weight given to new information over old information
  • $\gamma$, the discount factor, determines the weight given to short-term rewards over future rewards
  • model-free, without ever explicitly constructing an estimate of environmental dynamics
  • off-policy: Q-values can be learned from transitions generated by any exploration policy
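
A minimal sketch of the tabular update above, assuming a dictionary-backed Q-table; the names `q_table` and `q_update` are illustrative, not from the paper.

```python
from collections import defaultdict

ALPHA = 0.01   # Q-learning rate: weight given to new information over old
GAMMA = 1.0    # discount factor: weight given to future rewards

q_table = defaultdict(float)   # Q(s, u), initialized to 0 for unseen pairs

def q_update(s_i, u, reward, s_j, actions_from_s_j):
    """Apply Q_{t+1}(s_i, u) = (1 - alpha) * Q_t(s_i, u)
    + alpha * (r_t + gamma * max_{u'} Q_t(s_j, u'))."""
    best_next = max((q_table[(s_j, u2)] for u2 in actions_from_s_j), default=0.0)
    q_table[(s_i, u)] = (1 - ALPHA) * q_table[(s_i, u)] + ALPHA * (reward + GAMMA * best_next)
```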

With probability $\epsilon$, a random action is taken

With probability $1-\epsilon$, the greedy action $\arg\max _{u \in \mathcal{U}\left(s_{i}\right)} Q_{t}\left(s_{i}, u\right)$ is taken

$\epsilon = 1$, exploration; $\epsilon = 0$, exploitation
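
A small sketch of the $\epsilon$-greedy rule, reusing the hypothetical `q_table` from the previous snippet.

```python
import random

def epsilon_greedy(state, actions, q_table, epsilon):
    """With probability epsilon take a random action (explore),
    otherwise take the highest-valued action (exploit)."""
    if random.random() < epsilon:
        return random.choice(actions)
    return max(actions, key=lambda u: q_table[(state, u)])
```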

When the exploration cost is large, it is beneficial to use experience replay

Designing Neural Network Architectures with Q-learning

the task of training a learning agent to sequentially choose neural network layers

Model the layer selection process as an MDP, with the assumption that a layer that performs well in one network should also perform well in another network

The CNN architecture defined by the agent’s path is trained on the chosen learning problem, and the agent is given a reward equal to the validation accuracy

The state space

Each state is defined as a tuple of all relevant layer parameters

5 different types of layers:

  • convolution (C)
  • pooling (P)
  • fully connected (FC)
  • global average pooling (GAP)
  • softmax (SM)

The relevant parameters for each layer type are enumerated and discretized

The layer depth is also part of the state; a maximum number of layers the agent may select before terminating is specified
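
A hypothetical encoding of a state as such a tuple; the field names and parameter groupings are assumptions for illustration, not the paper's implementation.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class LayerState:
    layer_type: str    # one of 'C', 'P', 'FC', 'GAP', 'SM'
    layer_depth: int   # position in the network, bounded by the maximum depth
    params: tuple      # discretized layer parameters, e.g. (n_filters, kernel_size, stride) for 'C'
    rsize_bin: int     # representation-size bucket, 1 = largest (see the next subsection)
```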

Parameter: Representation size

Pooling and convolution layers may lead the agent on a trajectory where the intermediate signal representation gets reduced to a size that is too small for further processing

Add a representation size (R-size) parameter to each state, and restrict actions from a state with R-size $n$ to those with receptive field size less than or equal to $n$

To keep the state space tractable, the representation sizes are binned into three discrete buckets

However, binning adds uncertainty to the state transitions

E.g., with two bins, bin 1: $[8,\infty)$ and bin 2: $(0,7]$:

R-size: 18, R-size bin: 1 --P(2,2)--> R-size: 9, R-size bin: 1

R-size: 14, R-size bin: 1 --P(2,2)--> R-size: 7, R-size bin: 2
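
A sketch of the two-bin example above, showing why binning makes the transition stochastic from the agent's point of view; `rsize_bin` and `after_pool` are illustrative helpers.

```python
def rsize_bin(rep_size):
    """Two-bin example: bin 1 for sizes >= 8, bin 2 for sizes in (0, 7]."""
    return 1 if rep_size >= 8 else 2

def after_pool(rep_size, pool=2, stride=2):
    """Representation size after a P(2,2) pooling layer (no padding)."""
    return (rep_size - pool) // stride + 1

# Both start in bin 1, but end in different bins after the same action:
print(rsize_bin(18), rsize_bin(after_pool(18)))   # 1 1  (18 -> 9, stays in bin 1)
print(rsize_bin(14), rsize_bin(after_pool(14)))   # 1 2  (14 -> 7, moves to bin 2)
```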

The action space

Allow:

  • agent to terminate a path at any point
  • transitions from a state with layer depth $i$ to a state with layer depth $i+1$

Limit the number of fully connected layers to at most two, to avoid too many learnable parameters

A convolution layer may transition to a state with any other layer type

A pooling layer may transition to a state with any layer type other than pooling, because consecutive pooling layers are equivalent to a single, larger pooling layer

Only states with representation size in bins $(8,4]$ or $(4,1]$ may transition to an FC layer

A majority of these constraints are in place to enable faster convergence
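
A rough sketch of these constraints as a transition function over the hypothetical `LayerState` above; the exact rules here approximate the list and are not the paper's code.

```python
def allowed_next_types(state, n_fc_so_far, max_depth):
    """Layer types the agent may choose from `state` under the constraints above.
    Each chosen layer moves the agent from depth i to depth i+1."""
    if state.layer_depth >= max_depth:
        return ['SM']                        # must terminate at the maximum depth
    if state.layer_type == 'GAP':
        return ['SM']                        # global average pooling is followed by the softmax classifier
    types = ['C', 'GAP', 'SM']               # termination is allowed at any point
    if state.layer_type != 'P':
        types.append('P')                    # no consecutive pooling layers
    if n_fc_so_far < 2 and state.rsize_bin in (2, 3):
        types.append('FC')                   # FC only from the two smallest R-size bins, at most two FC layers
    return types
```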

Q-learning training procedure

  • Q-learning rate $(\alpha)$ set to 0.01
  • discount factor $(\gamma)$ set to 1
  • $\epsilon$ decreased from 1.0 to 0.1 in steps

maintain a replay dictionary:

  • the network topology
  • prediction performance on a validation set

After each model is sampled and trained, the agent randomly samples 100 models from the replay dictionary and applies the Q-value update to each of them
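
A condensed sketch of this outer loop; `sample_model`, `train_model`, and `q_update_path` are placeholder callables, and the epsilon schedule and number of models per epsilon are left to the caller. Only the replay dictionary and the 100 replayed models per step come from the description above.

```python
import random

def run_agent(epsilon_schedule, models_per_eps, sample_model, train_model, q_update_path):
    """Outer loop: anneal epsilon, cache trained topologies in a replay dictionary,
    and replay 100 stored models after each newly sampled one."""
    replay = {}                                        # network topology -> validation accuracy
    for epsilon in epsilon_schedule:                   # e.g. decreasing from 1.0 to 0.1
        for _ in range(models_per_eps):
            topology = sample_model(epsilon)           # epsilon-greedy walk through the state space
            if topology not in replay:                 # only train topologies not seen before
                replay[topology] = train_model(topology)
            q_update_path(topology, replay[topology])  # Q-update along the sampled path
            for old in random.sample(list(replay), min(100, len(replay))):
                q_update_path(old, replay[old])        # experience replay
    return replay
```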

Experiment details

Results

  • model-selection analysis
  • prediction performance
  • transfer learning ability

NAS-Qlearning result

Concluding remarks

Future:

In our current implementation, we use the same set of hyperparameters to train all network topologies during the Q-learning phase and further finetune the hyperparameters for top models selected by the MetaQNN agent. However, our approach could be combined with hyperparameter optimization methods to further automate the network design process. Moreover, we constrict the state-action space using coarse, discrete bins to accelerate convergence. It would be possible to move to larger state-action spaces using methods for Q-function approximation (Bertsekas, 2015; Mnih et al., 2015)