Sequence Labeling: HMMs, MEMMs, CRFs, Naive Classification, seq2seq, Pointer Networks
This post explains the maths behind sequence labeling using HMMs, MEMMs, CRFs, naive classification, seq2seq models, and Pointer Networks.
Sequence labeling problem:
Input sequence: `x_1,x_2,...,x_m`
Output sequence: `y_1,y_2,...,y_m`
- Hidden Markov Models (HMMs)
- Maximum-entropy Markov Models (MEMMs)
- Conditional Random Fields (CRFs)
- Naive Classification
- Sequence-to-Sequence (seq2seq)
- Pointer Networks
- Hidden Markov Models (HMMs) The task is to model the joint probability: \[\require{color} \colorbox{yellow}{$ p(y_1,y_2,...,y_m, x_1,x_2,...,x_m) $}= \prod\limits_{i = 1}^m p(y_i|y_{i-1}) \prod\limits_{i = 1}^m {p(x_i|y_i)} \] Independence assumptions: Consider two sequences of random variables `X_1,...,X_m` and `Y_1,...,Y_m`. The task is to model the joint probability: \[P(Y_1=y_1,...,Y_m=y_m, X_1=x_1,...,X_m=x_m) = P(Y_1=y_1,...,Y_m=y_m) \, P(X_1=x_1,...,X_m=x_m|Y_1=y_1,...,Y_m=y_m) \] This step is exact, by the chain rule of probabilities.
Assumption 1: \[P(Y_1=y_1,...,Y_m=y_m) \approx \prod\limits_{i = 1}^m {P(Y_i=y_i|Y_{i-1}=y_{i-1})} \] Assumption 2: \[ \begin{align} P(X_1=x_1,...,X_m=x_m|Y_1=y_1,...,Y_m=y_m) &= \prod\limits_{i = 1}^m {P(X_i=x_i|X_1=x_1,...,X_{i-1}=x_{i-1},Y_1=y_1,...,Y_m=y_m)} \\ &\approx \prod\limits_{i = 1}^m {P(X_i=x_i|Y_i=y_i)} \end{align} \] That is, the value of the random variable `X_i` depends only on the value of `Y_i`. More formally, the value of `X_i` is conditionally independent of the previous observations `X_1,...,X_{i-1}` and of the other state values `Y_1,...,Y_{i-1},Y_{i+1},...,Y_m`, given the value of `Y_i`. Estimating the parameters of the model is simple: we just read off counts from the training corpus and compute the maximum-likelihood estimates of the transition probabilities `p(y_i|y_{i-1})` and the emission probabilities `p(x_i|y_i)` from those counts.
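As a concrete illustration of this counting procedure, here is a minimal sketch of HMM parameter estimation on a toy tagged corpus; the corpus and the special start state `<s>` (standing in for `y_0`) are assumptions made for the example, not part of the original formulation.

```python
from collections import defaultdict

# Toy tagged corpus: each sentence is a list of (word, tag) pairs (assumed for illustration).
corpus = [
    [("the", "D"), ("dog", "N"), ("barks", "V")],
    [("the", "D"), ("cat", "N"), ("sleeps", "V")],
]

transition = defaultdict(lambda: defaultdict(int))  # counts of (y_{i-1} -> y_i)
emission = defaultdict(lambda: defaultdict(int))    # counts of (y_i -> x_i)

for sentence in corpus:
    prev = "<s>"  # special start state standing in for y_0
    for word, tag in sentence:
        transition[prev][tag] += 1
        emission[tag][word] += 1
        prev = tag

def p_transition(y_prev, y):
    """MLE estimate p(y_i | y_{i-1}) = count(y_{i-1}, y_i) / count(y_{i-1}, .)."""
    total = sum(transition[y_prev].values())
    return transition[y_prev][y] / total if total else 0.0

def p_emission(y, x):
    """MLE estimate p(x_i | y_i) = count(y_i, x_i) / count(y_i, .)."""
    total = sum(emission[y].values())
    return emission[y][x] / total if total else 0.0

print(p_transition("D", "N"))  # 1.0 in this toy corpus
print(p_emission("N", "dog"))  # 0.5
```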
- Maximum-entropy Markov Models (MEMMs) The task will be to model the conditional distribution \[ \begin{align} p(y_1,y_2,...,y_m|x_1,x_2,...,x_m) &= \prod\limits_{i = 1}^m {p(y_i|y_1,...,y_{i-1},x_1,...,x_m)} \\ & \approx \prod\limits_{i = 1}^m {p(y_i|y_{i-1},x_1,...,x_m)} \end{align}\] Assumption: the state `y_i` depends only on the previous state `y_{i-1}` (in addition to the whole input sequence).
The key advantage of MEMMs is that the use of feature vectors \(\boldsymbol{F} \in \mathbb{R}^d\) allows much richer representations than those used in HMMs.
Based on this assumption, we model each term using a log-linear model:
\[ \require{color} \colorbox{yellow}{$p(y_i|y_{i-1},x_1,...,x_m) $} = \frac{\exp \left(\boldsymbol{w} \cdot \boldsymbol{F}(x_1,...,x_m,i,y_{i-1},y_i) \right )} {\sum\nolimits_{\color{red} {y' \in \mathcal{S}}} \exp \left(\boldsymbol{w} \cdot \boldsymbol{F}(x_1,...,x_m,i,y_{i-1},y') \right ) } \] where \( \boldsymbol{F}(x_1,...,x_m,i,y_{i-1},y_i) \) is a feature vector, \(\mathcal{S}\) denotes the set of possible states, and \(\boldsymbol{w}\) is the parameter vector.
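As an illustration, here is a minimal sketch of this local MEMM distribution as a softmax over the state set \(\mathcal{S}\); the state set and the indicator features are assumptions made for the example.

```python
import math

STATES = ["D", "N", "V"]  # the state set S (assumed for illustration)

def features(x, i, y_prev, y):
    """A tiny local feature map F(x_1..x_m, i, y_{i-1}, y_i): sparse indicator features."""
    return {f"word={x[i]},tag={y}": 1.0, f"prev={y_prev},tag={y}": 1.0}

def local_score(w, x, i, y_prev, y):
    """w . F(x, i, y_prev, y) with sparse feature/weight dictionaries."""
    return sum(w.get(name, 0.0) * value for name, value in features(x, i, y_prev, y).items())

def p_local(w, x, i, y_prev, y):
    """MEMM local model p(y_i | y_{i-1}, x_1..x_m): softmax over the state set."""
    logits = {s: local_score(w, x, i, y_prev, s) for s in STATES}
    z = sum(math.exp(v) for v in logits.values())
    return math.exp(logits[y]) / z

# Example: with zero weights every state gets probability 1/|S|.
w = {}
x = ["the", "dog", "barks"]
print(p_local(w, x, 1, "D", "N"))  # 1/3
```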
- Conditional Random Fields (CRFs) First, we define a feature vector \(\boldsymbol{F}(\vec x,\vec y) \in \mathbb{R}^d \). The key idea behind CRFs is that \(\boldsymbol{F}(\vec x,\vec y)\) maps an entire input sequence \(\vec x\) paired with an entire state sequence \(\vec y\) to a \(d\)-dimensional feature vector. \(\boldsymbol{F}(\vec x,\vec y)\) is a "global" feature vector (global in the sense that it takes the entire state sequence into account): \[ \boldsymbol{F}(\vec x,\vec y) = \sum\limits_{i=1}^m f(\vec x, i, y_{i-1},y_i) \] That is, for each \(k=1,...,d\), the `k`-th global feature is \[ \boldsymbol{F}_k(\vec x,\vec y) = \sum\limits_{i=1}^m f_k(\vec x, i, y_{i-1},y_i) \] so \(\boldsymbol{F}_k\) is calculated by summing the "local" feature \(f_k\) over the `m` state transitions in `y_1,...,y_m`. We then build a log-linear model: \[\require{color} \colorbox{yellow}{$ p(\vec y|\vec x; \boldsymbol{w}) $}= \frac{\exp \left(\boldsymbol{w} \cdot \boldsymbol{F}(\vec x, \vec y) \right )} {\sum\nolimits_{\color{red} {\vec{y'} \in \mathcal{S}^m}} \exp \left(\boldsymbol{w} \cdot \boldsymbol{F}(\vec x, \vec {y'}) \right ) } \] Note that the space of possible values for \(\vec{y'}\) is \(\mathcal{S}^m\), which is exponentially large in `m`.
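To make the global/local distinction concrete, here is a minimal sketch of computing \(\boldsymbol{F}(\vec x,\vec y)\) as a sum of local feature vectors and of the unnormalized score \(\boldsymbol{w} \cdot \boldsymbol{F}(\vec x,\vec y)\); the indicator features and the start symbol `<s>` are assumptions, and sparse dictionaries stand in for `d`-dimensional vectors.

```python
from collections import Counter

def local_features(x, i, y_prev, y):
    """Local feature map f(x, i, y_{i-1}, y_i); the concrete features are illustrative."""
    return Counter({f"word={x[i]},tag={y}": 1.0, f"prev={y_prev},tag={y}": 1.0})

def global_features(x, y):
    """F(x, y) = sum_i f(x, i, y_{i-1}, y_i), with y_0 taken to be a start symbol."""
    total = Counter()
    y_prev = "<s>"
    for i in range(len(x)):
        total.update(local_features(x, i, y_prev, y[i]))
        y_prev = y[i]
    return total

def score(w, x, y):
    """Unnormalized log-score w . F(x, y); normalizing requires summing over S^m."""
    return sum(w.get(name, 0.0) * value for name, value in global_features(x, y).items())

x = ["the", "dog", "barks"]
y = ["D", "N", "V"]
print(global_features(x, y))
```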
Parameter Estimation in CRFs.
We have `n` training examples, \(\{(\vec {x^j}, \vec {y^j})\}_{j=1}^n\). Each instance \(\vec{x^j}\) is an input sequence \(x_1^j,...,x_m^j\), and each \(\vec{y^j}\) is a state sequence \(y_1^j,...,y_m^j\). The regularized log-likelihood function is \[ L(\boldsymbol{w}) = \sum\limits_{j=1}^n \log p(\vec {y^j}|\vec {x^j}; \boldsymbol{w}) - \frac{\lambda}{2}||\boldsymbol{w}||^2 \] The partial derivatives are \[ \frac{\partial}{\partial w_k} L(\boldsymbol{w}) = \sum\limits_{j=1}^n \boldsymbol{F}_k(\vec {x^j},\vec {y^j}) - \sum\limits_{j=1}^n \sum\limits_{\color{red} {\vec{y'} \in \mathcal{S}^m}} p(\vec{y'}|\vec{x^j};\boldsymbol{w}) \boldsymbol{F}_k(\vec{x^j},\vec{y'}) - \lambda w_k \] The first term is the observed value of feature `k` and the second term is its expected value under the model; although the expectation ranges over the exponentially large set \(\mathcal{S}^m\), it can be computed efficiently with dynamic programming (the forward-backward algorithm).
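Here is a minimal brute-force sketch of this gradient, reusing the `global_features` and `score` helpers from the previous sketch; it enumerates \(\mathcal{S}^m\) directly, which is only feasible for tiny sequences and stands in for the dynamic-programming computation used in practice.

```python
import math
from collections import Counter
from itertools import product

# Reuses global_features(x, y) and score(w, x, y) from the previous sketch.

def gradient(w, data, states, lam=0.1):
    """Partial derivatives of the regularized log-likelihood L(w), by brute force.

    For each example, this is the observed feature vector F_k(x, y) minus the
    expected feature vector under p(y' | x; w), minus the regularizer lam * w_k.
    """
    grad = Counter()
    for x, y in data:
        grad.update(global_features(x, y))  # observed features F_k(x^j, y^j)

        # Expected features: sum over all y' in S^m of p(y' | x; w) * F_k(x, y').
        candidates = [list(yp) for yp in product(states, repeat=len(x))]
        log_scores = [score(w, x, yp) for yp in candidates]
        z = sum(math.exp(s) for s in log_scores)
        for yp, s in zip(candidates, log_scores):
            p = math.exp(s) / z
            for name, value in global_features(x, yp).items():
                grad[name] -= p * value

    for name in w:  # gradient of the -(lam/2) * ||w||^2 term
        grad[name] -= lam * w[name]
    return grad

# Example usage with the toy example from the earlier sketches (assumed):
# g = gradient({}, [(["the", "dog", "barks"], ["D", "N", "V"])], ["D", "N", "V"])
```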
- Pointer Networks Let us define the input sequence \(x_1,...,x_n \), the output sequence \(y_1,...,y_m\), encoder hidden states \(e_1,...,e_n\), and decoder hidden states \(d_1,...,d_m\). The attention mechanism can be modeled as follows: \[ \begin{align} u_j^i &= v^T \tanh(W_1e_j+W_2d_i) \quad j \in (1,...,n) \\ \require{color} \colorbox{yellow}{$ p(y_i|y_1,...,y_{i-1},\vec{x}) $} &= \text{softmax}(\vec{u^i}) \\ \end{align} \] where softmax normalizes the vector \(\vec{u^i}\) (of length `n`) into an output distribution over the dictionary of inputs, i.e. over the `n` input positions.
`v`, `W_1` and `W_2` are learnable parameters of the output model.
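To make the shapes concrete, here is a minimal NumPy sketch of these attention scores and the softmax over the `n` input positions; the dimensions and the random encoder/decoder states are assumptions standing in for a trained model.

```python
import numpy as np

rng = np.random.default_rng(0)
n, m, h = 5, 3, 8             # input length, output length, hidden size (assumed)

e = rng.normal(size=(n, h))   # encoder hidden states e_1..e_n
d = rng.normal(size=(m, h))   # decoder hidden states d_1..d_m
W1 = rng.normal(size=(h, h))  # learnable parameters of the output model
W2 = rng.normal(size=(h, h))
v = rng.normal(size=(h,))

def pointer_distribution(i):
    """p(y_i | y_1..y_{i-1}, x) = softmax(u^i), a distribution over the n input positions."""
    u = np.array([v @ np.tanh(W1 @ e[j] + W2 @ d[i]) for j in range(n)])
    u = u - u.max()                     # subtract max for numerical stability
    p = np.exp(u) / np.exp(u).sum()
    return p

print(pointer_distribution(0))  # length-n vector summing to 1
```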
The parameters of the model are learnt by maximizing the conditional probabilities for the training set, i.e. \[ \boldsymbol{w}^* = \underset{\boldsymbol{w}}{\arg\max} \sum \log p(\vec{y}|\vec{x};\boldsymbol{w}) \] where the sum is over training examples. \(\boldsymbol{w}\) is the set of the model parameters and each \((\vec{x},\vec{y})\) is an (input sequence, output sequence) pair from the training set.
References
[1] Log-Linear Models, MEMMs, and CRFs
[2] Pointer Networks
[3] Tagging with Hidden Markov Models
[4] Linear Classification