Neural Architecture Search with Bayesian Optimisation and Optimal Transport

Abstract

Bayesian Optimisation refers to a class of methods for global optimisation of a function $f$ which is only accessible via point evaluations

Typically used in settings where $f$ is expensive to evaluate

A common use case for BO in ML is model selection

Conventional BO methods have focused on Euclidean and categorical domains, which, in the context of model selection, only permit tuning scalar hyper-parameters of machine learning algorithms

NASBOT, a Gaussian process based BO framework for NAS

Develop a distance metric in the space of neural network architectures which can be computed efficiently via an optimal transport program

Introduction

BO considers a noisy black-box function $f$ with the goal of finding its optimum in some domain $\mathcal{X}$

BO uses a Bayesian model for $f$ to infer function values at unexplored regions and guide the selection of points for future evaluations

The quintessential use case for BO in ML is model selection. E.g., select the regularisation parameter $\lambda$ and kernel width $h$ for an SVM. We can set this up as a zeroth-order optimisation problem where the domain is the two-dimensional space of $(\lambda, h)$ values, and each function evaluation trains the SVM on a training set and computes the accuracy on a validation set.
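
To make this concrete, a minimal sketch of such a zeroth-order objective, assuming scikit-learn is available; the mapping $C = 1/\lambda$, $\gamma = 1/(2h^2)$ is an illustrative assumption, not something specified in the paper:

```python
# A toy version of the SVM model-selection objective described above.
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score

def validation_accuracy(lam, h, X_train, y_train, X_val, y_val):
    # Larger lambda = stronger regularisation, so C = 1/lambda; the RBF
    # kernel width h enters through gamma = 1/(2 h^2) (assumed mapping).
    model = SVC(C=1.0 / lam, gamma=1.0 / (2.0 * h ** 2))
    model.fit(X_train, y_train)                           # train on the training set
    return accuracy_score(y_val, model.predict(X_val))    # accuracy on the validation set
```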

In its most unadorned form, BO:

  • starts at time 0 with a GP prior for $f$
  • at time $t$, incorporates results of evaluations $1,\dots,t-1$ in the form of a posterior for $f$
  • then uses this posterior to construct an acquisition function $\varphi_t$, where $\varphi_t(x)$ is a measure of the value of evaluating $f$ at $x$ at time $t$ if our goal is to maximise $f$

It chooses to evaluate $f$ at the maximiser of the acquisition, $x_{t}=\operatorname{argmax}_{x \in \mathcal{X}} \varphi_{t}(x)$
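
As a rough sketch of this loop over a finite candidate set (the `acquisition` callable standing in for $\varphi_t$ and the candidate set are placeholders, not the paper's implementation):

```python
# Minimal BO loop sketch: pick the acquisition maximiser, evaluate f, repeat.
def bayesian_optimisation(f, candidates, acquisition, n_iters):
    X_obs, Y_obs = [], []
    for t in range(n_iters):
        # acquisition(x, X_obs, Y_obs) plays the role of phi_t(x): it is built
        # from the posterior GP given evaluations 1..t-1 (the prior when t = 0).
        x_t = max(candidates, key=lambda x: acquisition(x, X_obs, Y_obs))
        X_obs.append(x_t)
        Y_obs.append(f(x_t))  # the expensive evaluation
    best = max(range(len(Y_obs)), key=lambda i: Y_obs[i])
    return X_obs[best], Y_obs[best]
```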

Two key ingredients:

  • quantify the similarity between two points $x, x'$ in the domain in the form of a kernel $\kappa\left(x, x^{\prime}\right)$, allowing us to reason about the unevaluated value $f(x')$ when we have already evaluated $f(x)$

    in Euclidean spaces, we can use the Gaussian, Laplacian, or Matérn kernels

  • a method to maximize $\varphi_t(x)$

    via off-the-shelf branch-and-bound or gradient-based methods

This is not the case when $x$ is a neural network architecture; we need to:

  • Quantify (dis)similarity between two networks
  • Traverse the space of such networks to optimise the acquisition function

Contributions:

  • Develop a (pseudo-)distance for neural network architectures called OTMANN (Optimal Transport Metrics for Architectures of Neural Networks) that can be computed efficiently via an optimal transport program
  • A BO framework for optimising functions on neural network architectures called NASBOT (Neural Architecture Search with Bayesian Optimisation and Optimal Transport), including an evolutionary algorithm to optimise the acquisition
  • NASBOT outperforms other baselines on model selection tasks for MLPs and CNNs

Related Work:

  • evolutionary algorithm

    • provide a simple mechanism to explore the space of architectures by making a sequence of changes to networks that have already been evaluated

    • not ideally suited for optimising functions that are expensive to evaluate

  • reinforcement learning (RL)

    • in architecture search there is no explicit need to maintain a notion of state and solve credit assignment
    • RL is a fundamentally more difficult problem than optimisation

Set up

$f$ is the performance on a validation set after $x$ is trained on the training set. If $x_{\star}=\operatorname{argmax}_{\mathcal{X}} f(x)$ and $x_t$ is the architecture evaluated at time $t$, we want $f\left(x_{\star}\right)-\max _{t \leq n} f\left(x_{t}\right)$ to vanish fast as the number of evaluations $n \rightarrow \infty$

A brief review of Gaussian process based Bayesian optimisation

GP:

  • mean function $\mu : \mathcal{X} \rightarrow \mathbb{R}$
  • (covariance) kernel function: $\kappa : \mathcal{X}^{2} \rightarrow \mathbb{R}$

Given $n$ observations $\mathcal{D}_{n}=\left\{\left(x_{i}, y_{i}\right)\right\}_{i=1}^{n}$, where $x_{i} \in \mathcal{X}$, $y_{i}=f\left(x_{i}\right)+\epsilon_{i} \in \mathbb{R}$, and $\epsilon_{i} \sim \mathcal{N}\left(0, \eta^{2}\right)$, the posterior process $f|\mathcal{D}_n$ is also a GP with mean $\mu_n$ and covariance $\kappa_n$
$$
k_{i}=\kappa\left(x, x_{i}\right), k_{i}^{\prime}=\kappa\left(x^{\prime}, x_{i}\right)
$$

$$
K_{i, j}=\kappa\left(x_{i}, x_{j}\right)
$$

Then, with $Y=\left(y_{1}, \ldots, y_{n}\right)^{\top}$, $\mu_{n}, \kappa_{n}$ can be computed via:
$$
\mu_{n}(x)=k^{\top}\left(K+\eta^{2} I\right)^{-1} Y\\
\quad \kappa_{n}\left(x, x^{\prime}\right)=\kappa\left(x, x^{\prime}\right)-k^{\top}\left(K+\eta^{2} I\right)^{-1} k^{\prime}
$$
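
A minimal numpy sketch of these two formulas, assuming a zero prior mean and a generic `kernel` callable (both assumptions for illustration):

```python
import numpy as np

# Direct transcription of the posterior equations above for a zero-mean GP.
# `kernel(A, B)` is assumed to return the matrix of pairwise kernel values.
def gp_posterior(X_train, Y, X_test, kernel, eta):
    K = kernel(X_train, X_train)                      # K_ij = kappa(x_i, x_j)
    K_s = kernel(X_test, X_train)                     # each row is k^T for one test point
    K_ss = kernel(X_test, X_test)
    A = K + eta ** 2 * np.eye(len(X_train))           # K + eta^2 I
    mu_n = K_s @ np.linalg.solve(A, Y)                # posterior mean
    kappa_n = K_ss - K_s @ np.linalg.solve(A, K_s.T)  # posterior covariance
    return mu_n, kappa_n
```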

At time $t$, we have already evaluated $f$ at points $\left\{x_{i}\right\}_{i=1}^{t-1}$ and obtained observations $\left\{y_{i}\right\}_{i=1}^{t-1}$

To determine the next point for evaluation $x_t$, we first use the posterior GP to define an acquisition function $\varphi_t$, which measures the utility of evaluating $f$ at any $x$ according to the posterior

Then maximize the acquisition function: $x_{t}=\operatorname{argmax}_{\mathcal{X}} \varphi_{t}(x)$, and evaluate $f$ at $x_t$

Expected improvement acquisition:
$$
\varphi_{t}(x)=\mathbb{E}\left[\max \left\{0, f(x)-\tau_{t-1}\right\} \mid\left\{\left(x_{i}, y_{i}\right)\right\}_{i=1}^{t-1}\right]
$$
measures the expected improvement over the current maximum value according to the posterior GP
$$
\tau_{t-1}=\max _{i \leq t-1} f\left(x_{i}\right)
$$
$\tau_{t-1}$ denotes the current best value
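
Under the Gaussian posterior this expectation has a standard closed form; a small sketch, taking the posterior mean and standard deviation at each candidate as given:

```python
import numpy as np
from scipy.stats import norm

# Closed-form EI under a Gaussian posterior: with posterior mean mu(x),
# std sigma(x) and incumbent tau, EI(x) = (mu - tau) Phi(z) + sigma phi(z),
# where z = (mu - tau) / sigma. The notes above only state the definition
# as an expectation; this closed form is the standard one.
def expected_improvement(mu, sigma, tau):
    mu = np.asarray(mu, dtype=float)
    sigma = np.asarray(sigma, dtype=float)
    z = (mu - tau) / np.maximum(sigma, 1e-12)   # guard against sigma = 0
    return (mu - tau) * norm.cdf(z) + sigma * norm.pdf(z)
```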

GP/BO in the context of architecture search

$\kappa\left(x, x^{\prime}\right)$ is a measure of similarity between $x$ and $x’$

If $\kappa\left(x, x^{\prime}\right)$ is large, then $f(x)$ and $f(x')$ are highly correlated

In BO, when selecting the next point, we balance between exploitation, choosing points that we believe will have high $f$ value, and exploration, choosing points that we do not know much about, so that we do not get stuck at a bad optimum

A mathematical formalism for neural networks

View a neural network as a graph whose vertices are the layers of the network

$\mathcal{G}=(\mathcal{L}, \mathcal{E})$, a set of layers $\mathcal{L}$ and directed edges $\mathcal{E}$

An edge $(u, v) \in \mathcal{E}$ is an ordered pair of layers

A layer label $\ell \ell(u)$ denotes the type of operations performed at the layer; $\ell u(u)$ denotes the number of computational units in the layer

Each network has decision layers which are used to obtain the predictions of the network

  • input layer $\ell \ell(u_{ip})=ip$
  • output layer $\ell \ell(u_{op})=op$
  • processing layers
  • decision layer

The output of each layer is fed to each of its children. When a layer has multiple parents, the inputs are concatenated

Neural networks are also characterised by the values of the weights/parameters between layers. In architecture search, we do not consider these weights, since they are determined by training once the architecture is fixed

Distance:

Our eventual goal is a kernel for GP

Given a distance $d$, we will aim for $\kappa\left(x, x^{\prime}\right)=e^{-\beta d\left(x, x^{\prime}\right)^{p}}$

E.g., when $d$ is the L2 norm, $p=1,2$ correspond to the Laplacian and Gaussian kernels respectively
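
A one-line sketch of this mapping, applied entry-wise to a pairwise distance matrix (the parameter names are placeholders):

```python
import numpy as np

# kappa = exp(-beta * d^p); p = 1 and p = 2 give the Laplacian- and
# Gaussian-style kernels mentioned above.
def distance_to_kernel(D, beta=1.0, p=1):
    return np.exp(-beta * np.asarray(D, dtype=float) ** p)
```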

The OTMANN distance

OTMANN is defined as the minimum of a matching scheme which attempts to match the computation at the layers of one network to the layers of the other

We impose penalties for matching layers with different types of operations or at structurally different positions, and find a matching that minimises these penalties

Two preliminary concepts:

  • layer masses
  • path lengths

layer masses

$\ell m$, the quantity that is matched between the layers of two networks when comparing them

$\ell m(u)$, quantify the significance of layer $u$

  • For a processing layer, computed as the product of $\ell u(u)$ and the number of incoming units
  • For the input and output layers, use $\ell m\left(u_{\mathrm{ip}}\right)=\ell m\left(u_{\mathrm{op}}\right)=\zeta \sum_{u \in \mathcal{P} \mathcal{L}} \ell m(u)$, where $\mathcal{P} \mathcal{L}$ denotes the set of processing layers and $\zeta$ is a parameter to be determined
  • For decision layers, assign an amount of mass proportional to the mass in the processing layers. Since the outputs of the decision layers are averaged, we distribute the mass equally among all decision layers: $\forall u \in \mathcal{D} \mathcal{L}, \ \ell m(u)=\frac{\zeta}{|\mathcal{D} \mathcal{L}|} \sum_{v \in \mathcal{P} \mathcal{L}} \ell m(v)$, with $\zeta=0.1$ (see the sketch after this list)
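
A toy sketch of this mass assignment; the `Layer` record, the label strings, and treating "units × incoming units" literally are simplifying assumptions of mine, not the paper's exact accounting:

```python
from dataclasses import dataclass

ZETA = 0.1  # the zeta parameter from the notes

@dataclass(frozen=True)
class Layer:
    label: str      # e.g. 'ip', 'op', 'fc', 'conv3', 'softmax' (decision)
    num_units: int  # ell_u(u); for the input layer, e.g. the input dimensionality

def layer_masses(layers, parents):
    """layers: dict name -> Layer; parents: dict name -> list of parent names."""
    is_proc = lambda l: l.label not in ('ip', 'op', 'softmax')
    masses = {}
    # Processing layers: ell_u(u) times the number of incoming units.
    for name, layer in layers.items():
        if is_proc(layer):
            incoming = sum(layers[p].num_units for p in parents.get(name, []))
            masses[name] = layer.num_units * max(incoming, 1)
    total_proc = sum(masses.values())
    decision = [n for n, l in layers.items() if l.label == 'softmax']
    for name, layer in layers.items():
        if layer.label in ('ip', 'op'):
            # Input and output layers each receive zeta * (total processing mass).
            masses[name] = ZETA * total_proc
        elif layer.label == 'softmax':
            # Decision layers share zeta * (total processing mass) equally.
            masses[name] = ZETA * total_proc / len(decision)
    return masses
```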

Path lengths from/to $u_{ip}, u_{op}$

  • shortest path length from $u$ to $v$
  • longest path length from $u$ to $v$
  • random walk path length, where the next layer is chosen uniformly at random at each step

$\delta_{\mathrm{ip}}^{\mathrm{sp}}(u), \delta_{\mathrm{ip}}^{\operatorname{lp}}(u), \delta_{\mathrm{ip}}^{\mathrm{rw}}(u)$ denote these path lengths for walks from the input $u_{ip}$ to $u$ (and similarly for walks from $u$ to the output $u_{op}$); see the sketch below
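
A sketch of computing the three quantities in one topological pass over the DAG. It assumes every non-input layer has at least one parent, and treating the random-walk length as an average over parents is my interpretation of "chosen uniformly at random", not necessarily the paper's exact formula:

```python
from collections import defaultdict
from graphlib import TopologicalSorter  # Python 3.9+

def path_lengths_from_input(nodes, edges, u_ip):
    parents = defaultdict(list)
    for u, v in edges:
        parents[v].append(u)
    # Topological order guarantees every parent is processed before its child.
    order = TopologicalSorter({v: parents[v] for v in nodes}).static_order()
    sp, lp, rw = {u_ip: 0.0}, {u_ip: 0.0}, {u_ip: 0.0}
    for v in order:
        if v == u_ip:
            continue
        sp[v] = 1 + min(sp[p] for p in parents[v])                      # shortest path
        lp[v] = 1 + max(lp[p] for p in parents[v])                      # longest path
        rw[v] = 1 + sum(rw[p] for p in parents[v]) / len(parents[v])    # random walk (assumed recursion)
    return sp, lp, rw
```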

$Z(i, j)$ denotes the amount of mass matched between layer $i \in \mathcal{G}_{1}$ and $j \in \mathcal{G}_{2}$

  • Label mismatch penalty $\phi_{lmm}$: penalises matching mass between layers with different labels (operation types)
  • Non-assignment penalty $\phi_{nas}$: penalises mass that is left unmatched in either network
  • Structural penalty $\phi_{str}$: penalises matching mass between layers at structurally different positions, as measured by the path lengths above (a toy version of the resulting transport program is sketched below)
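
To make the structure of the matching concrete, a toy linear-programming sketch of this kind of transport program using scipy's `linprog`. Folding the label-mismatch and structural penalties into a single `cost` matrix and charging a single per-unit non-assignment penalty `nu_nas` are simplifications of my own, not the exact OTMANN objective:

```python
import numpy as np
from scipy.optimize import linprog

def otmann_like_match(mass1, mass2, cost, nu_nas):
    mass1, mass2 = np.asarray(mass1, float), np.asarray(mass2, float)
    cost = np.asarray(cost, dtype=float)
    n1, n2 = len(mass1), len(mass2)
    # Decision variables: flattened transport plan Z (n1*n2 nonnegative entries).
    # Each matched unit avoids paying nu_nas on both sides, hence cost - 2*nu_nas.
    c = (cost - 2.0 * nu_nas).ravel()
    # Row sums of Z must not exceed mass1, column sums must not exceed mass2.
    A_ub = np.zeros((n1 + n2, n1 * n2))
    for i in range(n1):
        A_ub[i, i * n2:(i + 1) * n2] = 1.0
    for j in range(n2):
        A_ub[n1 + j, j::n2] = 1.0
    b_ub = np.concatenate([mass1, mass2])
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=(0, None), method="highs")
    Z = res.x.reshape(n1, n2)
    # Total penalty: matched cost plus non-assignment penalty for leftover mass.
    unmatched = (mass1.sum() - Z.sum()) + (mass2.sum() - Z.sum())
    return Z, float((Z * cost).sum() + nu_nas * unmatched)
```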

Theorem 1

===some math===

$d$ puts more emphasis on the amount of computation at the layers over structure

vice versa for $\overline d$

NASBOT

kernel function:
$$
\kappa = \alpha e^{-\beta d}+\overline{\alpha}\, e^{-\overline{\beta}\, \overline{d}}
$$
In practice, they did not encounter a distance where the eigenvalues of the kernel matrix were negative; there are several methods to circumvent this issue (a possibly non-positive-definite kernel) in kernel methods

Use an evolutionary algorithm (EA) approach to optimise the acquisition function

Begin with an initial pool of networks and evaluate the acquisition $\varphi_t$ on those networks

Then generate a set of $N_{mut}$ mutations of this pool:

  • stochastically select $N_{mut}$ candidates from the set of networks already evaluated, such that those with higher $\varphi_t$ values are more likely to be selected than those with lower values
  • modify each candidate to produce a new architecture:
    • increase or decrease the number of computational units of a layer
    • add or delete layers
    • change the connectivity of existing layers
  • evaluate the acquisition on the $N_{mut}$ mutations, add them to the pool, and repeat for a prescribed number of steps (see the sketch after this list)
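
A sketch of this loop; the softmax-weighted selection and the `mutate` helper (which would apply one of the modifications listed above) are assumptions about details the notes leave open:

```python
import math
import random

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def evolve_acquisition(phi_t, init_pool, mutate, n_steps, n_mut):
    pool = [(net, phi_t(net)) for net in init_pool]     # evaluate phi_t on the pool
    for _ in range(n_steps):
        # Networks with higher phi_t values are more likely to be selected.
        weights = softmax([score for _, score in pool])
        parents = random.choices([net for net, _ in pool], weights=weights, k=n_mut)
        for parent in parents:
            child = mutate(parent)                      # produce a new architecture
            pool.append((child, phi_t(child)))          # evaluate the acquisition on it
    return max(pool, key=lambda item: item[1])[0]       # best architecture found
```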

The EA works fine for cheap functions such as $\varphi_t$, which is analytically available, but it is not suitable when evaluations are expensive

If one also wishes to tune:

  • drop-out probabilities, regularization penalties and batch normalization: treated as part of the layer label, via an augmented label penalty matrix $M$ which accounts for these considerations
  • scalar hyper-parameters (e.g., learning rate): use an existing kernel for Euclidean spaces and define the GP over the joint architecture + hyper-parameter space via a product kernel (sketched after this list)
  • early stopping: incorporated by defining a fidelity space
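
The product-kernel construction for the joint (architecture, scalar hyper-parameter) space is simple enough to sketch directly; both component kernels are assumed to be given callables:

```python
# kappa((x, z), (x', z')) = kappa_arch(x, x') * kappa_euclidean(z, z')
def product_kernel(arch_kernel, euclidean_kernel):
    def kappa(point_a, point_b):
        arch_a, hyp_a = point_a
        arch_b, hyp_b = point_b
        return arch_kernel(arch_a, arch_b) * euclidean_kernel(hyp_a, hyp_b)
    return kappa
```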

Experiments

Dataset

Experimental setup

asynchronously parallel set-up of 2-4 GPUs

evaluate multiple models in parallel, with each model on a single GPU

They also illustrate some of the best architectures found; on many datasets, common features were long skip connections and multiple decision layers

Results

Conclusion