Oct 7, 2017 by Li Jing

This post explains the math behind sequence labeling with HMMs, MEMMs, CRFs, naive classification, seq2seq, and pointer networks.

Sequence labeling problem:
Input sequence: `x_1,x_2,...,x_m`
Output sequence: `y_1,y_2,...,y_m`

  1. Hidden Markov Models (HMMs)
    The task is to model the joint probability: \[\require{color} \colorbox{yellow}{$ p(y_1,y_2,...,y_m, x_1,x_2,...,x_m) $}= \prod\limits_{i = 1}^m p(y_i|y_{i-1}) \prod\limits_{i = 1}^m {p(x_i|y_i)} \] Independence assumptions: Consider two sequences of random variables `Y_1,...,Y_m` and `X_1,...,X_m`. The task is to model the joint probability: \[P(Y_1=y_1,...,Y_m=y_m, X_1=x_1,...,X_m=x_m) = P(Y_1=y_1,...,Y_m=y_m) P(X_1=x_1,...,X_m=x_m|Y_1=y_1,...,Y_m=y_m) \] This step is exact, by the chain rule of probabilities.
    Assumption 1: \[P(Y_1=y_1,...,Y_m=y_m) \approx \prod\limits_{i = 1}^m {P(Y_i=y_i|Y_{i-1}=y_{i-1})} \] where `y_0` is a special start symbol. Assumption 2: \[ \begin{align} P(X_1=x_1,...,X_m=x_m|Y_1=y_1,...,Y_m=y_m) &= \prod\limits_{i = 1}^m {P(X_i=x_i|X_1=x_1,...,X_{i-1}=x_{i-1},Y_1=y_1,...,Y_m=y_m)} \\ &\approx \prod\limits_{i = 1}^m {P(X_i=x_i|Y_i=y_i)} \end{align} \] That is, the value of the random variable `X_i` depends only on the value of `Y_i`. More formally, the value of `X_i` is conditionally independent of the previous observations `X_1,...,X_{i-1}` and the other state values `Y_1,...,Y_{i-1},Y_{i+1},...,Y_m`, given the value of `Y_i`.
    Estimating the parameters of the model is simple: we just read off counts from the training corpus and compute the maximum-likelihood estimates as normalized counts, giving the transition probabilities `p(y_i|y_{i-1})` and the emission probabilities `p(x_i|y_i)` (see the sketch below).
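    A minimal sketch of this count-based estimation in Python. The corpus format (a list of sentences, each a list of `(word, tag)` pairs) and the `"<s>"` start symbol standing in for `y_0` are assumptions made for illustration, not part of the original post.

```python
from collections import defaultdict

def estimate_hmm(corpus, start="<s>"):
    """Count-based maximum-likelihood estimates of p(y_i|y_{i-1}) and p(x_i|y_i).

    `corpus` is assumed to be a list of sentences, each a list of (word, tag) pairs.
    """
    trans = defaultdict(float)       # c(y_{i-1}, y_i)
    emit = defaultdict(float)        # c(y_i -> x_i)
    prev_count = defaultdict(float)  # c(y_{i-1}) as a conditioning state
    tag_count = defaultdict(float)   # c(y_i)
    for sentence in corpus:
        prev = start
        for word, tag in sentence:
            trans[(prev, tag)] += 1
            prev_count[prev] += 1
            emit[(tag, word)] += 1
            tag_count[tag] += 1
            prev = tag
    p_trans = {k: v / prev_count[k[0]] for k, v in trans.items()}  # p(y_i|y_{i-1})
    p_emit = {k: v / tag_count[k[0]] for k, v in emit.items()}     # p(x_i|y_i)
    return p_trans, p_emit

def joint_probability(words, tags, p_trans, p_emit, start="<s>"):
    """p(y_1..y_m, x_1..x_m) = prod_i p(y_i|y_{i-1}) * prod_i p(x_i|y_i)."""
    prob, prev = 1.0, start
    for word, tag in zip(words, tags):
        prob *= p_trans.get((prev, tag), 0.0) * p_emit.get((tag, word), 0.0)
        prev = tag
    return prob
```

    With the two tables in hand, `joint_probability` multiplies the transition and emission terms exactly as in the highlighted factorization above.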

  2. Maximum-entropy Markov Models (MEMMs)
    The task is to model the conditional distribution \[ \begin{align} p(y_1,y_2,...,y_m|x_1,x_2,...,x_m) &= \prod\limits_{i = 1}^m {p(y_i|y_1,...,y_{i-1},x_1,...,x_m)} \\ & \approx \prod\limits_{i = 1}^m {p(y_i|y_{i-1},x_1,...,x_m)} \end{align}\] Assumption: given the entire input sequence, the state `y_i` depends only on the previous state `y_{i-1}`, not on the earlier states `y_1,...,y_{i-2}`.

    The key advantage of MEMMs is that the use of feature vectors \(\boldsymbol{F} \in \mathbb{R}^d\) allows much richer representations than those used in HMMs.

    Based on this assumption, we model each term using a log-linear model:

    \[ \require{color} \colorbox{yellow}{$p(y_i|y_{i-1},x_1,...,x_m) $} = \frac{\exp \left(\boldsymbol{w} \cdot \boldsymbol{F}(x_1,...x_m,i,y_{i-1},y_i) \right )} {\sum\nolimits_{\color{red} {y' \in \mathcal{S}}} \exp \left(\boldsymbol{w} \cdot \boldsymbol{F}(x_1,...x_m,i,y_{i-1},y') \right ) } \] where \( \boldsymbol{F}(x_1,...x_m,i,y_{i-1},y_i) \) is a feature vector, \(\mathcal{S}\) denotes the set of possible states, and \(\boldsymbol{w} \in \mathbb{R}^d\) is the parameter vector.
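    As a concrete illustration, here is a minimal sketch of this local log-linear model. The feature function `feature_vector(x, i, y_prev, y)` (returning a `d`-dimensional NumPy array) is a hypothetical user-supplied callable; its name and signature are assumptions, not from the post.

```python
import numpy as np

def memm_local_probs(x, i, y_prev, states, w, feature_vector):
    """Return p(y_i = s | y_{i-1} = y_prev, x_1..x_m) for every state s in `states`.

    `feature_vector(x, i, y_prev, y)` is assumed to return a d-dimensional array.
    """
    scores = np.array([w @ feature_vector(x, i, y_prev, s) for s in states])
    scores -= scores.max()                # subtract the max for numerical stability
    exp_scores = np.exp(scores)
    return exp_scores / exp_scores.sum()  # softmax over the state set S
```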

  3. Conditional Random Fields (CRFs)
    Figure: BiLSTM-CRF.
    The task is to model the conditional distribution \[ p(y_1,y_2,...,y_m|x_1,x_2,...,x_m) = p(\vec y| \vec x) \]

    First, we define a feature vector \(\boldsymbol{F}(\vec x,\vec y) \in \mathbb{R}^d \). The key idea of CRFs: \(\boldsymbol{F}(\vec x,\vec y)\) maps an entire input sequence \(\vec x\) paired with an entire state sequence \(\vec y\) to a single \(d\)-dimensional feature vector. \(\boldsymbol{F}(\vec x,\vec y)\) is a "global" feature vector (global in the sense that it takes the entire state sequence into account).

    \[ \boldsymbol{F}(\vec x,\vec y) = \sum\limits_{i=1}^m f(\vec x, i, y_{i-1},y_i) \] For each \(k=1,...,d\), the `k`-th global feature is: \[ \boldsymbol{F}_k(\vec x,\vec y) = \sum\limits_{i=1}^m f_k(\vec x, i, y_{i-1},y_i) \] We can see that \(\boldsymbol{F}_k\) is calculated by summing the "local" feature `f_k` over the `m` state transitions in `y_1,...,y_m`.
    We then build a log-linear model: \[\require{color} \colorbox{yellow}{$ p(\vec y|\vec x; \boldsymbol{w}) $}= \frac{\exp \left(\boldsymbol{w} \cdot \boldsymbol{F}(\vec x, \vec y) \right )} {\sum\nolimits_{\color{red} {\vec{y'} \in \mathcal{S}^m}} \exp \left(\boldsymbol{w} \cdot \boldsymbol{F}(\vec x, \vec {y'}) \right ) } \] Note that the space of possible values for \(\vec{y'} \) is \(\mathcal{S}^m\), which grows exponentially with the sequence length; in practice the normalization term is computed with dynamic programming (the forward algorithm) rather than by enumerating all state sequences.
    Parameter Estimation in CRFs.
    We have `n` training examples, \(\{(\vec {x^j}, \vec {y^j})\}_{j=1}^n\). Each instance \(\vec{x^j}\) is an input sequence \(x_1^j,...,x_m^j\), and each \(\vec{y^j}\) is a state sequence \(y_1^j,...,y_m^j\). The regularized log-likelihood function is \[ L(\boldsymbol{w}) = \sum\limits_{j=1}^n \log p(\vec {y^j}|\vec {x^j}; \boldsymbol{w}) - \frac{\lambda}{2}||\boldsymbol{w}||^2 \] The partial derivatives are \[ \frac{\partial}{\partial w_k} L(\boldsymbol{w}) = \sum\limits_{j=1}^n \boldsymbol{F}_k(\vec {x^j},\vec {y^j}) - \sum\limits_{j=1}^n \sum\limits_{\color{red} {\vec{y'} \in \mathcal{S}^m}} p(\vec{y'}|\vec{x^j};\boldsymbol{w}) \boldsymbol{F}_k(\vec{x^j},\vec{y'}) - \lambda w_k \] The first term is the empirical count of feature `k` on the training data, and the second is its expected count under the current model; the gradient vanishes when the two match (up to the regularization term). A sketch of the CRF score and normalization follows.
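    The following sketch makes the scoring and normalization concrete, again assuming a hypothetical local feature function `f(x, i, y_prev, y)` that returns a `d`-dimensional array and a `"<s>"` start symbol for `y_0`. It enumerates \(\mathcal{S}^m\) directly, which is only feasible for toy problems; practical CRF implementations replace the brute-force sum with the forward algorithm.

```python
import itertools
import numpy as np

def global_features(x, y, f, start="<s>"):
    """F(x, y) = sum_{i=1}^m f(x, i, y_{i-1}, y_i), with y_0 a start symbol."""
    return sum(f(x, i, start if i == 1 else y[i - 2], y[i - 1])
               for i in range(1, len(y) + 1))

def crf_log_prob(x, y, states, w, f):
    """log p(y | x; w) for a linear-chain CRF, normalizing over all of S^m."""
    score = w @ global_features(x, y, f)                       # w . F(x, y)
    all_scores = np.array([w @ global_features(x, y_prime, f)  # w . F(x, y')
                           for y_prime in itertools.product(states, repeat=len(y))])
    log_z = np.logaddexp.reduce(all_scores)                    # log sum_{y'} exp(...)
    return score - log_z
```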

  4. Naive Classification
    Figure: sequence labeling as a classification.
    \[\require{color} \colorbox{yellow}{$ p(y_i|\vec{x};\boldsymbol{w}) $} = \frac{e^{f_{y_i}}}{{\sum_j e^{f_j}}} \] where the `f_j` are the class scores at a given position. The cross-entropy between a "true" distribution `p` and an estimated distribution `q` is defined as: \[ H(p,q) = -\sum_x p(x)\log q(x) \] Here the "true" distribution places all probability mass on the correct class (i.e. `p=[0,...,1,...,0]` contains a single 1 at the `y_i`-th position), so the cross-entropy at position `i` reduces to \(-\log \left( \frac{e^{f_{y_i}}}{\sum_j e^{f_j}} \right)\), and the loss for the whole sequence is the sum over positions: \[ -\sum_i \log \left( \frac{e^{f_{y_i}}}{{\sum_j e^{f_j}}} \right) \] The softmax classifier is hence minimizing the cross-entropy between the estimated class probabilities (`q_j=\frac{e^{f_j}}{\sum_k e^{f_k}}` as seen above) and this "true" distribution.
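    A minimal sketch of this loss. The shapes are illustrative assumptions: `scores` is an `(m, K)` array of class scores `f_j` for the `m` positions, and `labels` holds the correct class index `y_i` at each position.

```python
import numpy as np

def softmax_cross_entropy(scores, labels):
    """-sum_i log( e^{f_{y_i}} / sum_j e^{f_j} ) over the m positions.

    `scores` is an (m, K) array of class scores; `labels` gives the correct
    class index y_i for each position.
    """
    shifted = scores - scores.max(axis=1, keepdims=True)  # numerical stability
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    return -log_probs[np.arange(len(labels)), labels].sum()
```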

  5. Sequence-to-Sequence (seq2seq)
    Figure: Sequence-to-Sequence.
    In the traditional sequence-to-sequence model, the output dictionary size is fixed. For example, in a Chinese-English translation task, the English vocabulary is fixed when generating English sentences. Thus, we need to train a separate model for each fixed dictionary. Given an input sequence \(x_1,...,x_n \) and an output sequence \(y_1,...,y_m\), the task is to model the conditional probability: \[ p(\vec{y}|\vec{x}; \boldsymbol{w})= \prod \limits_{i=1}^m p(y_i|y_1,...,y_{i-1},\vec{x};\boldsymbol{w}) \] With an RNN, each conditional probability is modeled as \[ \require{color} \colorbox{yellow}{$ p(y_i|y_1,...,y_{i-1},\vec{x};\boldsymbol{w}) $}= g(y_{i-1},\vec{s_i},\vec{c}) \] where `g` is a nonlinear, potentially multi-layered function that outputs the probability of `y_i`, \(\vec{s_i}\) is the hidden state of the RNN at time step `i`, and \(\vec{c}\) is the context vector.
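    As a rough illustration, here is one possible instantiation of `g` as a single decoder step. The embedding matrix `E`, the weight matrices `W_s`, `W_y`, `W_c`, `W_o`, and the simple `tanh` recurrence are assumptions for the sketch, not the architecture of any particular paper.

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def decoder_step(y_prev_id, s_prev, c, params):
    """One decoder step: a concrete (illustrative) choice of g(y_{i-1}, s_i, c)."""
    E, W_s, W_y, W_c, W_o = (params[k] for k in ("E", "W_s", "W_y", "W_c", "W_o"))
    y_prev = E[y_prev_id]                                 # embedding of y_{i-1}
    s_i = np.tanh(W_s @ s_prev + W_y @ y_prev + W_c @ c)  # new hidden state s_i
    p_y = softmax(W_o @ s_i)                              # p(y_i | y_1..y_{i-1}, x)
    return p_y, s_i
```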

  6. Pointer Networks
    Figure: Pointer Networks.
    The advantage of pointer networks is that the output dictionary size depends on the number of elements in the input sequence. That is, the output dictionary size is variable.
    Let us define the input sequence \(x_1,...,x_n \), the output sequence \(y_1,...,y_m\), the encoder hidden states \(e_1,...,e_n\), and the decoder hidden states \(d_1,...,d_m\). The attention mechanism can be modeled as follows: \[ \begin{align} u_j^i &= v^T \tanh(W_1e_j+W_2d_i) \quad j \in \{1,...,n\} \\ \require{color} \colorbox{yellow}{$ p(y_i|y_1,...,y_{i-1},\vec{x}) $} &= \mathrm{softmax}(\vec{u^i}) \\ \end{align} \] where softmax normalizes the vector \(\vec{u^i}\) (of length `n`) into an output distribution over the dictionary of inputs (also of length `n`).
    `v`, `W_1` and `W_2` are learnable parameters of the model (see the sketch at the end of this section).
    The parameters of the model are learnt by maximizing the conditional probabilities for the training set, i.e. \[ \boldsymbol{w}^* = \underset{\boldsymbol{w}}{\arg\max} \sum \log p(\vec{y}|\vec{x};\boldsymbol{w}) \] where the sum is over training examples, \(\boldsymbol{w}\) denotes the model parameters, and each \((\vec{x},\vec{y})\) is an (input sequence, output sequence) pair from the training set.
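    Here is the sketch referred to above: a direct transcription of the attention equations, with the array shapes (`E` of shape `(n, h)`, `d_i` of shape `(h,)`, square `W1`/`W2`, and `v` of length `h`) chosen purely for illustration.

```python
import numpy as np

def pointer_distribution(E, d_i, W1, W2, v):
    """p(y_i | y_1..y_{i-1}, x): a distribution over the n input positions.

    E is the (n, h) array of encoder states e_1..e_n; d_i is the (h,) decoder state.
    """
    u = np.tanh(E @ W1.T + d_i @ W2.T) @ v  # u_j^i = v^T tanh(W1 e_j + W2 d_i)
    u = u - u.max()                         # numerical stability
    exp_u = np.exp(u)
    return exp_u / exp_u.sum()              # softmax over the n input positions
```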
