BOOK--PRML

Machine Learning Course in Spring 2019

Text Book: Pattern Recognition and Machine Learning, Bishop

Here are my learning notes, to be continued..😝

Introduction

Supervised Learning

regression: y is continuous
classification: y is discrete

supervised+X learning

Unsupervised Learning

Cluster: well-seperated
Dimensionality reduction

Reinforcement Learning

Basic steps

generalization error

Training error & True(generalization error) – over the whole population

generalization error is the true error of the population of examples, but it cannot be computed exactly, sample mean only approximates the true mean

Optimize (mean) training error can lead to the overfit

2 ways to assess the generalization error is:

Theoretical: Law of Large numbers
Practical: Use a separate data set with m data samples to test the model

Test sets

How can we get an unbiased estimate of the accuracy of a learned model?

When learning a model, should pretend that you don’t have the test data yet

The empirical loss

Overfitting

when the training error is low and the generalization error is high

Causes of phenomenon:

model with a large number of parameters (degrees of freedom)
small data size

Feature selection

score each feature and select a subset

Model selection

solution for overfitting:

assure sufficient number of samples in the training set
hold some data out of the training set = validation set
regularization

Validation set, Hold-out method & its drawbacks

Cross validation:

K-fold cross validation
Leave-out-out (LOO) cross validation, validate on only one sample per run for n runs
Random subsampling: randomly subsample a fixed fraction an (0<a<1) of the dataset for validation

Regularization

penalized loss function

regularization parameter

1. Polynomial Curve Regression – Bishop page10

$$
y(x,w)=w_0+w_1x+w_2x^2+…+w_mx^M=\sum_{j=0}^Mw_jx^j
$$

y(x,w) is a nonlinear function of x，it is a linear function of the coefficients w

Error function E(w)，minimization of the error function has a unique solution，denoted by w
$$
E(w) = \frac{1}{2}\sum_{n=1}^N{y(x_n,w)-t_n}^2
$$
Dependence of the generalization performance on M, for each choice of M, it’s sometimes more convenient to use the root-mean-square(RMS) error, N - data size
$$
E_{RMS} = \sqrt{2E(w^)/N}
$$
For a given model complexity, the over-fitting problem become less severe as the size of the data set increases

The least squares approach to finding the model parameters represents a specific case of maximum likelihood. The overfitting problem can be understood as a general property of maximum likelihood. By adopting a Bayesian approach, the overfitting problem can be avoided

Regularization

$$
E(w) = \frac{1}{2}\sum_{n=1}^N{y(x_n,w)-t_n}^2+\frac{\lambda}{2}||w||^2
$$

$$
||w||^2=w_0^2+w_1^2+w_2^2+…+w_M^2
$$

lambda governs the relative importance of the regularization term compared with the sum-of-squares error term. Note that often the w0 is omitted from the regularizer

2. Probability

sum rule:
$$
p(X)=\sum_Yp(X,Y)
$$
product rule:
$$
p(X,Y)=p(Y|X)P(X)
$$
Bayes theorem:
$$
p(Y|X)=\frac{P(X|Y)P(Y)}{p(X)}
$$

Probability densities

$$
p(x)=\int p(x,y){\rm d}y
$$

Expectation

For a discrete distribution,
$$
E[f]=\sum_xp(x)f(x)
$$
For a continuous distribution,
$$
E[f]=\int p(x)f(x){\rm d}x
$$

$$
E[f]\approx\frac{1}{N}\sum_{n=1}^{N}f(x_n)
$$

For a conditional expectation,
$$
E_x{f|y}=\sum_xp(x|y)f(x)
$$

Variance

For f(x),
$$
var[f]=E[(f(x)-E[f(x)])^2]
$$

$$
var[f]=E[(f(x)^2]-E[f(x)]^2
$$

covariance is,
$$
\begin{align}
cov[x,y]
&= E_{x,y}[(x-E[x])(y-E[y])]\\
&= E_{x,y}[xy]-E[x]E[y]
\end{align}
$$

3. Information Theory

The amount of information can be viewed as the ‘degree of surprise‘ on learning the value of x

If we are told that a highly improbable event that has just occurred, we will have received more information than if we were told that some very likely event that has just occurred
$$
h(x)=-log_2p(x)
$$
the negative sign ensures that information is positive or zero. Low probability events x correspond to high information content

The average amount of information that they transmit in the process is obtained by taking the expectation, the entropy of the random variable x
$$
H[x]=-\sum p(x)log_2p(x)
$$
This relation between entropy and shortest coding length is a general one. The noiseless coding theorem states that the entropy is a lower bound on the number of bits needed to transmit the state of a random variable

4.Likelihood Function

在确定的结果下去推测产生这个结果的可能参数（环境），结果和参数相互对应的时候，似然和概率在数值上是相等的
$$
\theta \text{–parameter}\\
x\text{–event}
$$
Probability: p(x|theta)

Likelihood: L(theta|x)

Let X be a discrete random variable with probability mass function p depending on a parameter theta, Then the function

$$
L(\theta|x)=p_\theta(x)=P_\theta(X=x)=P_\theta(X=x|\theta)=P_\theta(X=x;\theta)
$$

Let X be a random variable following an absolutely continuous probability distribution with density function f depending on a parameter theta, Then the function

$$
L(\theta|x)=f_\theta(x)
$$

posterior 正比于 likelihood * prior

$$
p(X|D) \propto p(\mu|D)*p(X)
$$

5. Linear Basic Function Models

Extend the class of models by considering linear combinations of fixed nonlinear functions of the input variables, of the form
$$
y(x,w)=w_0+\sum_{j=1}^{M-1}w_j\phi_j(x)\\
\text{when }\phi_0(x)=1, y(x,w)=w^T\phi(x)
$$
\phi(x) are known as basis functions

By using nonlinear basis functions, we allow the function y(x, w) to be a nonlinear function of the input vector x. Functions of the form are called linear models, however, because this function is linear in w.

Maximization of the likelihood function under a conditional Gaussian noise distribution for a linear model is equivalent to minimizing a sum-of-squares error function given by E_D(w)

Regularized least squares

$$
E_D(w)+\lambda E_w(w)\\
E_w(w)=\frac{1}{2}w^Tw\\
\text{The total error function becomes: }\frac{1}{2}\sum_{n=1}^N{t_n-w^t\phi(x_n)}^2+\frac{\lambda}{2}w^Tw
$$

This particular choice of regularizer if known in the machine learning literature as weight decay, because in sequential learning algorithms, it encourages weight values to decay towards zero, unless supported by the data. It has the advantage that the error function remains a quadratic function of w, and so its exact minimizer can be found in closed form

A more general regularizer is sometimes used,
$$
\frac{1}{2}\sum_{n=1}^N{t_n-w^t\phi(x_n)}^2+\frac{\lambda}{2}\sum_{j=1}^M|w_j|^q
$$
where q = 2 corresponds to the quadratic regularizer. The case of q = 1 is know as the lasso in the statistics literature

The Bias-Variance Decomposition

expected loss = (bias)^2 + variance + noise

There is a trade-off between bias and variance, with very flexible models having low bias and high variance, and relatively rigid models having high bias and low variance. The model with the optimal predictive capability is the one that leads to the best balance between bias and variance.

We see that small values of λ allow the model to become finely tuned to the noise on each individual data set leading to large variance. Conversely, a large value of λ pulls the weight parameters towards zero leading to large bias.

Bayesian Linear Regression

Bayesian treatment of linear regression, which will avoid the over-fitting problem of maximum likelihood, and which will also lead to automatic methods of determining model complexity using the training data alone.

6. SVM – Support Vector Machine

[From Lihang 统计学习方法，周志华西瓜书(这部分很不错哟)，so Chinese😎]

二分类模型，基本模型是定义在特征空间上的间隔最大的线性分类器，间隔最大使它有别于感知机

支持向量还包括核技巧，使他成为实质上的非线性分类器，支持向量机的学习策略就是间隔最大化，形式化为一个求解凸二次规划问题（convex quadratic programming)

分类：

线性可分支持向量机（linear support vector machine in linearly separable case）
线性支持向量机（linear support vector machine）
非线性支持向量机（non-linear support vector machine)

Good explanation: https://zhuanlan.zhihu.com/p/49331510

拉格朗日乘法和KKT条件

拉格朗日比较熟悉，so skip

KKT condition，Karush-Kuhn-Tucker Conditions，相比较于拉格朗日–等式约束条件下，KKT–不等式约束条件

在满足一些有规则的条件下，一个非线性规划问题能有最优化解法的一个必要充分条件
$$
minf(x),\text{ }s.t.g_i(x)\leq0
$$
定义拉格朗日：
$$
L(x,\lambda)=f(x)+\lambda g(x)
$$

$$
x^ 是局部极值，存在唯一的\lambda^ s.t.\\
1.\nabla_xL(x^,\lambda^)=0\\
2.\lambda^\geq0\\
3.\lambda^g(x^)=0\\
4.g(x^)\leq0\\
5.\text{Plus positive definite constraints on }\nabla_{xx}L(x^,\lambda^)
$$

https://www.cnblogs.com/bigmonkey/p/9542545.html

https://www.cnblogs.com/xinchen1111/p/8804858.html

7. Graphical Models

[From 周志华机器学习，so summary in Chinese😎]

Probabilistic Model：根据一些已观察到的数据（例如训练样本）来对感兴趣的未知变量进行估计和推测，其核心是如何基于可观测变量推测出未知变量的条件分布

Probabilistic Graphical Model：用图来表达变量相关关系的概率模型

有向无环图表示变量之间的依赖关系–有向图模型或贝叶斯网（Bayesian Network）

隐马尔科夫模型（Hidden Markov Model-HMM）是结构最简单的动态贝叶斯网（dynamic Bayesian network），主要用于时序数据建模，在语音识别、自然语言处理等领域有广泛应用
无向图–无向图模型或马尔可夫网（Markov Network）

马尔可夫链（Markov Chain）

系统下一时刻的状态仅由当前状态决定，不依赖于以往的任何状态
$$
P(x_1,y_1,..,x_n,y_n)=P(y_1)P(x_1|y_1)\prod_{i=2}^nP(y_i|y_{i-1})P(x_i|y_i)
$$

马尔可夫随机场（Markov Random Field）

典型的马尔可夫网，著名的无向图模型

有一组势函数（potential functions），也称“因子”（factor），定义在变量子集上的非负实函数，主要用于定义概率分布函数

“团”（Clique）：其中任意两结点间都有边连接，则该结点子集为一个“团”；若在一个团中加入另外任何一个结点都不在形成团，则称该团为“极大团”（maximal clique），极大团是不能被其他团所包含的团

多个变量之间的联合概率分布之恶能基于团分解为多个银子的乘积，每个因子仅与一个团相关
$$
P(x)=\frac{1}{Z}\prod_{Q\in C}\psi_Q(x_Q), \psi_Q \text{为团Q对应的势函数}
$$
可基于极大团来定义：
$$
P(x)=\frac{1}{Z}\prod_{Q\in C^}\psi_Q(x_Q), \psi_Q \text{为团Q对应的势函数},C^\text{为所有极大团构成的集合}
$$
借助“分离”的概念得到“条件独立性”，若从结点集A中的结点到B中的结点都必须经过结点C中的结点，，则结点集A和B都被结点集C分离，C称为“分离集”（separating set）

对于马尔可夫随机场，有“全局马尔可夫性”：给定两个变量子集的分离集，则这两个变量子集条件独立
$$
x_A\bot x_B|x_C
$$
势函数的作用是定量刻画变量集x_Q中变量之间的相关关系，应该是非负函数，且在所偏好的变量取值上有较大函数值，大于1->偏好变量拥有相同的取值，正相关；小于1->偏好变量拥有不同的取值，负相关

条件随机场（Conditional Random Field CRF）

判别式无向图模型

生成式模型时直接对联合分布进行建模，而判别式模型则是对条件分布进行建模

隐马尔科夫模型和马尔科夫随机场都是生成时模型，而条件随机场则是判别式模型

8. EM算法

之前概率图模型中，一直假设训练样本所有属性变量的值都已被观测到，即训练样本是完整的，但会有存在“未观测”变量的情况

未观测观测变量的学名是“隐变量”（latent variable）。令X表示已观测变量集，Z表示因变量集，\theta表示模型参数，对\theta进行极大似然估计
$$
LL(\theta|X,Z)=lnP(X,Z|\theta)
$$
最大化已观测数据的对数“边际似然”（marginal likelihood）
$$
LL(\theta|X)=lnP(X|\theta)=ln\sum_ZP(X,Z|\theta)
$$
迭代式方法，基本思想：若参数\theta已知，则可根据训练数据推断出最优隐变量Z的值（E步）；反之，若Z的值已知，则可方便地对参数\theta做极大似然估计（M步）

E步（Expectation）：
$$
\text{以当前参数 }\theta^t推断隐变量分布P(Z|X,\theta^t),\\并估计对数似然LL(\theta|X,Z)关于Z的期望\\
Q(\theta|\theta^t) =E_{Z|X,\theta^t}LL(\theta|X,Z)
$$
M步（Maximization）：
$$
寻找参数最大化期望似然，\\
\theta^{t+1}=argmax_\theta Q(\theta|\theta^t)
$$
使用两个步骤计算，第一步是期望（E步），利用当前估计的参数值来计算对数似然的期望值；第二步是最大化M步，寻找能使E步产生的似然期望最大化的参数值；然后，新的到的参数值重新被用于E步，直至收敛到局部最优解

see in: https://zhuanlan.zhihu.com/p/36331115

9. 集成学习

[From 周志华机器学习Chapter 8]

集成学习通过构建并结合多个学习器来完成学习任务，有时也被称为多分类器系统（multi-classifier system）、基于委员会的学习（committee-based learning）

集合学习通过将多个学习器结合，常可获得比单一学习器显著优越的泛化性能，对“弱学习器–略优于随机策略“尤为明显

分类：

个体学习器存在强依赖关系、必须串行生成的序列化方法，Boosting
个体学习器间不存在强依赖关系、可同时生成的并行化方法，Bagging、Random Forest

Boosting

将弱学习器提升为强学习器的算法

先从初始训练集训练出一个基学习器，再根据基学习器的表现对训练样本分布进行调整，使得先前基学习器做错的训练样本在后续得到更多的关注，然后基于调整后的样本分布来训练下一个基学习器

AdaBoost

误分类样本在下一轮学习中起更大的作用

不改变所给的训练数据，而不断改变啊训练数据权值的部分，使得训练数据在基本分类器的学习中起不同的作用

前一个基本分类器分错的样本会得到加强，加权后的全体样本再次被用来训练下一个基本分类器

AdaBoost是一种迭代算法，在每一轮中加入一个新的弱分类器，直到达到某个预定的足够小的错误率。每一个训练样本都被赋予一个权重，表明它被某个分类器选入训练集的概率。如果某个样本点已经被准确地分类，那么在构造下一个训练集中，它被选中的概率就被降低；相反，如果某个样本点没有被准确地分类，那么它的权重就得到提高

Bagging

并行式集成学习

自主采样Bootstrap sampling：给定包含m 个样本的数据集，我们先随机取出一个样本放入采样集中，再把该样本放回初始数据集，使得下次采样时该样本仍有可能被选中，这样，经过m次随机采样操作，我们得到含m 个样本的采样集，初始训练集中有的样本在采样集里多次出现，有的则从未出现

采样出T个含m个训练样本的采样集，然后基于每个采样集训练出一个基学习器，再将这些基学习器进行结合

标准Adaboost只适用于二分类任务，Bagging能不经修改地用于多分类、回归等任务

自主采样带来一个优点：由于每个基学习器只使用了初始训练集中约63.2%的样本，剩下36.8%的样本可用作验证集对泛化性能进行“包外估计”（out-of-bag estimate）

Bagging主要关注降低方差，因此他在不剪枝决策树、神经网络等易受样本扰动的学习器上效用更为明显。

随机森林 Random Forest

以决策树为基学习器构建Bagging集成的基础上，进一步在决策树的训练过程中引入随机属性选择