By Li Jing

There are some fundamental concepts from information theory that show up throughout computer science.

1. Entropy

   The entropy of a random variable `X` with probability distribution `p(x)` measures the uncertainty in `X`. It is related to how much `p(x)` diverges from the uniform distribution on the support of `X`: the more `p(x)` diverges from uniform, the lower the entropy, and the uniform distribution itself has maximum entropy. $$H(X)= -\sum_{i=1}^{n} p(x_i) \log p(x_i)$$ where `{x_1,x_2,...,x_n}` are the possible values of `X`. With base-2 logarithms, entropy is measured in bits.
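
   A minimal Python/NumPy sketch (the function name, the base-2 default, and the example numbers are my own, not from the post) that computes this sum and illustrates that the uniform distribution maximizes it:

   ```python
   import numpy as np

   def entropy(p, base=2):
       """H(X) = -sum_i p(x_i) log p(x_i) for a probability vector p."""
       p = np.asarray(p, dtype=float)
       p = p[p > 0]                              # convention: 0 * log 0 = 0
       return -np.sum(p * np.log(p)) / np.log(base)

   print(entropy([0.25, 0.25, 0.25, 0.25]))      # uniform over 4 outcomes -> 2.0 bits (maximum)
   print(entropy([0.97, 0.01, 0.01, 0.01]))      # far from uniform -> about 0.24 bits
   ```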

2. Kullback-Leibler divergence

   The relative entropy, also known as the Kullback-Leibler divergence, between two probability distributions on a random variable measures how much one distribution differs from the other; it is not a true distance, since it is not symmetric. Given two probability distributions `p(x)` and `q(x)` over a discrete random variable `X`, the KL divergence \(KL(p||q)\) is defined as follows: \[ KL(p||q) = \sum_{x \in X} p(x) \log \frac{p(x)}{q(x)} \] with the conventions \(0 \log \frac{0}{0} = 0\), \(0 \log \frac{0}{q} = 0\), and \(p \log \frac{p}{0} = \infty\) for \(p > 0\).
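
   A minimal sketch (the function name and example distributions are my own) that follows the \(0 \log \frac{0}{q} = 0\) convention and shows the asymmetry numerically:

   ```python
   import numpy as np

   def kl_divergence(p, q):
       """KL(p || q) for discrete distributions over the same support."""
       p = np.asarray(p, dtype=float)
       q = np.asarray(q, dtype=float)
       mask = p > 0                              # convention: 0 * log(0/q) = 0
       return np.sum(p[mask] * np.log(p[mask] / q[mask]))

   p = [0.9, 0.1]
   q = [0.5, 0.5]
   print(kl_divergence(p, q))                    # ~0.368 nats
   print(kl_divergence(q, p))                    # ~0.511 nats: KL(p||q) != KL(q||p)
   ```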

3. Mutual information

   Mutual information measures the mutual dependence between two random variables `X` and `Y`: the closer the variables are to being independent, the smaller their mutual information, and it is exactly zero when they are independent. \[ MI(X,Y) = \sum_{x,y} p(x,y) \log \frac{p(x,y)}{p(x)p(y)} \] where `p(x,y)` is the joint probability distribution and `p(x)` and `p(y)` are the marginal distributions of `X` and `Y`.
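
   A minimal sketch (the helper and the toy joint tables are my own, not from the post) that computes mutual information from a joint probability table and checks the two extreme cases:

   ```python
   import numpy as np

   def mutual_information(p_xy):
       """MI(X, Y) from a joint probability table (rows: values of X, columns: values of Y)."""
       p_xy = np.asarray(p_xy, dtype=float)
       p_x = p_xy.sum(axis=1, keepdims=True)     # marginal p(x)
       p_y = p_xy.sum(axis=0, keepdims=True)     # marginal p(y)
       mask = p_xy > 0                           # skip zero-probability cells
       return np.sum(p_xy[mask] * np.log(p_xy[mask] / (p_x * p_y)[mask]))

   # Independent variables: the joint equals the product of the marginals -> MI = 0.
   print(mutual_information(np.outer([0.5, 0.5], [0.3, 0.7])))   # ~0.0
   # Perfectly dependent variables: knowing X determines Y -> MI = log 2 nats.
   print(mutual_information([[0.5, 0.0], [0.0, 0.5]]))           # ~0.693
   ```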

4. Cross-entropy

   The cross-entropy for two probability distributions `p(x)` and `q(x)` over a discrete random variable `X` is defined as follows: \[ H(p,q) = -\sum_{x}p(x) \log q(x) \] It relates to entropy and the KL divergence as follows: \[ \begin{align} H(p,q) & = -\sum_{x}p(x) \log q(x) \\ & = -\sum_{x}p(x) \log p(x) - \sum_{x}p(x) \log \frac{q(x)}{p(x)} \\ & = H(p) + KL(p||q) \end{align} \]
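
   A quick numerical check of the identity \(H(p,q) = H(p) + KL(p||q)\) (the example distributions are my own):

   ```python
   import numpy as np

   p = np.array([0.7, 0.2, 0.1])                 # "true" distribution
   q = np.array([0.5, 0.3, 0.2])                 # approximating distribution

   cross_entropy = -np.sum(p * np.log(q))        # H(p, q)
   entropy_p     = -np.sum(p * np.log(p))        # H(p)
   kl_pq         =  np.sum(p * np.log(p / q))    # KL(p || q)

   print(np.isclose(cross_entropy, entropy_p + kl_pq))   # True
   ```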

5. Feature selection: Information gain and Mutual information

   Mutual information and information gain are equivalent, as the following derivation shows. In a classification setting, let `X` be a nominal feature with `k` distinct values and let `C` be the target class with `m` classes. $$ \begin{align} MI(X,C) &= \sum_{i=1}^{k} \sum_{j=1}^{m} p(x_i,c_j) \log \frac{p(x_i,c_j)}{p(x_i)p(c_j)} \\ & = - \sum_{i=1}^{k} \sum_{j=1}^{m} p(x_i,c_j) \log p(c_j) + \sum_{i=1}^{k} \sum_{j=1}^{m} p(x_i,c_j) \log \frac{p(x_i,c_j)}{p(x_i)} \\ & = - \sum_{i=1}^{k} \sum_{j=1}^{m} p(x_i,c_j) \log p(c_j) + \sum_{i=1}^{k} \sum_{j=1}^{m} p(x_i)p(c_j|x_i) \log p(c_j|x_i) \\ & = -\sum_{j=1}^{m}\log p(c_j) \left(\sum_{i=1}^{k} p(x_i,c_j)\right) + \sum_{i=1}^{k} p(x_i) \left(\sum_{j=1}^{m}p(c_j|x_i) \log p(c_j|x_i)\right)\\ & = -\sum_{j=1}^{m} p(c_j) \log p(c_j) -\sum_{i=1}^{k} p(x_i)H(C|X=x_i) \\ & = H(C) - H(C|X) = IG \end{align} $$ Note that \(p(y)= \sum_x p(x,y)\) and \(H(Y|X)=\sum_x p(x) H(Y|X=x) = -\sum_x p(x) \sum_y p(y|x) \log p(y|x)\).
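
   To make the equivalence concrete, here is a small sketch (the toy dataset and helper names are my own invention, not from the post) that computes information gain as \(H(C) - H(C|X)\) and mutual information from the joint table, and checks that they agree:

   ```python
   import numpy as np
   from collections import Counter

   # Toy labelled data: (feature value, class label) pairs -- purely illustrative.
   data = [("sunny", "no"), ("sunny", "no"), ("sunny", "yes"), ("overcast", "yes"),
           ("overcast", "yes"), ("rain", "yes"), ("rain", "yes"), ("rain", "no")]

   xs = sorted({x for x, _ in data})             # feature values x_1 .. x_k
   cs = sorted({c for _, c in data})             # classes c_1 .. c_m
   counts = Counter(data)
   p_xc = np.array([[counts[(x, c)] for c in cs] for x in xs], dtype=float) / len(data)
   p_x = p_xc.sum(axis=1, keepdims=True)         # p(x_i)
   p_c = p_xc.sum(axis=0, keepdims=True)         # p(c_j)

   def H(p):                                     # entropy of a probability vector, in bits
       p = p[p > 0]
       return -np.sum(p * np.log2(p))

   # Information gain: IG = H(C) - H(C|X) = H(C) - sum_i p(x_i) H(C | X = x_i)
   ig = H(p_c.ravel()) - sum(p_x[i, 0] * H(p_xc[i] / p_x[i, 0]) for i in range(len(xs)))

   # Mutual information computed directly from the joint table.
   mask = p_xc > 0
   mi = np.sum(p_xc[mask] * np.log2(p_xc[mask] / (p_x * p_c)[mask]))

   print(np.isclose(ig, mi))                     # True: information gain equals MI(X, C)
   ```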