By Li Jing

This note reviews some fundamental concepts from information theory that come up throughout computer science and machine learning.

1. Entropy

The entropy of a random variable X with probability distribution p(x) measures how far p(x) is from the uniform distribution on the support of X: the further p(x) is from uniform, the lower the entropy, and vice versa. $$H(X)= -\sum_{i=1}^{n} p(x_i) \log p(x_i)$$ where $\{x_1, x_2, \ldots, x_n\}$ are the possible values of X.
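As a quick sketch, the formula above translates directly into code (the `entropy` helper below is written for illustration, not taken from any library):

```python
import math

def entropy(p):
    """Shannon entropy H(X) = -sum_i p(x_i) * log2(p(x_i)), in bits."""
    # Terms with p(x_i) = 0 contribute 0 by convention, so skip them.
    return -sum(pi * math.log2(pi) for pi in p if pi > 0)

# A uniform distribution over 4 values maximizes entropy at 2 bits...
print(entropy([0.25, 0.25, 0.25, 0.25]))  # 2.0
# ...while a peaked distribution, far from uniform, has lower entropy.
print(entropy([0.97, 0.01, 0.01, 0.01]))
```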

2. Kullback-Leibler divergence

The relative entropy, also known as the Kullback-Leibler divergence, measures how much one probability distribution diverges from another (it is not a true distance, since it is not symmetric). Given two probability distributions p(x) and q(x) over a discrete random variable X, the KL divergence $KL(p||q)$ is defined as follows: $$KL(p||q) = \sum_{x \in X} p(x) \log \frac{p(x)}{q(x)}$$ with the conventions $0 \log \frac{0}{0} = 0$, $0 \log \frac{0}{q} = 0$, and $p \log \frac{p}{0} = \infty$.
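A minimal implementation that applies the conventions above explicitly (the `kl_divergence` helper is illustrative, not from a library):

```python
import math

def kl_divergence(p, q):
    """KL(p||q) = sum_x p(x) * log2(p(x)/q(x)), in bits."""
    total = 0.0
    for pi, qi in zip(p, q):
        if pi == 0:
            continue          # convention: 0 * log(0/q) = 0
        if qi == 0:
            return math.inf   # convention: p * log(p/0) = infinity
        total += pi * math.log2(pi / qi)
    return total

p = [0.5, 0.5]
q = [0.9, 0.1]
# KL is asymmetric: KL(p||q) and KL(q||p) generally differ.
print(kl_divergence(p, q), kl_divergence(q, p))
```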

3. Mutual information

Mutual information measures how dependent two random variables X and Y are: the closer the variables are to independent, the lower their mutual information. $$MI(X,Y) = \sum_{x,y} p(x,y) \log \frac{p(x,y)}{p(x)p(y)}$$ where p(x,y) is the joint probability distribution, and p(x) and p(y) are the marginal distributions of X and Y.
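To make the dependence interpretation concrete, here is a sketch that computes MI from a joint probability table (the function and the example tables are hypothetical):

```python
import math

def mutual_information(joint):
    """MI(X,Y) in bits, from a joint probability table joint[x][y]."""
    px = [sum(row) for row in joint]            # marginal p(x)
    py = [sum(col) for col in zip(*joint)]      # marginal p(y)
    mi = 0.0
    for i, row in enumerate(joint):
        for j, pxy in enumerate(row):
            if pxy > 0:
                mi += pxy * math.log2(pxy / (px[i] * py[j]))
    return mi

# Independent variables: the joint is the product of marginals, so MI = 0.
indep = [[0.25, 0.25], [0.25, 0.25]]
# Perfectly dependent variables: X determines Y, so MI = H(X) = 1 bit.
dep = [[0.5, 0.0], [0.0, 0.5]]
print(mutual_information(indep), mutual_information(dep))  # 0.0 1.0
```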

4. Cross-entropy

The cross-entropy for two probability distributions p(x) and q(x) over a discrete random variable X is defined as follows: $$H(p,q) = -\sum_{x}p(x) \log q(x)$$ It decomposes into entropy plus KL divergence: \begin{align} H(p,q) & = -\sum_{x}p(x) \log q(x) \\ & = -\sum_{x}p(x) \log p(x) - \sum_{x}p(x) \log \frac{q(x)}{p(x)} \\ & = H(p) + KL(p||q) \end{align}
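The decomposition above can be checked numerically. This sketch (helper functions are illustrative) confirms $H(p,q) = H(p) + KL(p||q)$ on an arbitrary pair of distributions:

```python
import math

def entropy(p):
    return -sum(pi * math.log2(pi) for pi in p if pi > 0)

def kl_divergence(p, q):
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def cross_entropy(p, q):
    """H(p, q) = -sum_x p(x) * log2(q(x))."""
    return -sum(pi * math.log2(qi) for pi, qi in zip(p, q) if pi > 0)

p = [0.7, 0.2, 0.1]
q = [0.5, 0.3, 0.2]
# The identity H(p, q) = H(p) + KL(p||q) holds up to floating-point error.
print(cross_entropy(p, q), entropy(p) + kl_divergence(p, q))
```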

5. Feature selection: information gain and mutual information

Mutual information and information gain are equivalent.
Classification: let X be a nominal feature with k distinct values and C a target class with m classes. \begin{align} MI(X,C) &= \sum_{i=1}^{k} \sum_{j=1}^{m} p(x_i,c_j) \log \frac{p(x_i,c_j)}{p(x_i)p(c_j)} \\ & = - \sum_{i=1}^{k} \sum_{j=1}^{m} p(x_i,c_j) \log p(c_j) + \sum_{i=1}^{k} \sum_{j=1}^{m} p(x_i,c_j) \log \frac{p(x_i,c_j)}{p(x_i)} \\ & = - \sum_{i=1}^{k} \sum_{j=1}^{m} p(x_i,c_j) \log p(c_j) + \sum_{i=1}^{k} \sum_{j=1}^{m} p(x_i)p(c_j|x_i) \log p(c_j|x_i) \\ & = -\sum_{j=1}^{m}\log p(c_j) \left(\sum_{i=1}^{k} p(x_i,c_j)\right) + \sum_{i=1}^{k} p(x_i) \left(\sum_{j=1}^{m}p(c_j|x_i) \log p(c_j|x_i)\right)\\ & = -\sum_{j=1}^{m} p(c_j) \log p(c_j) -\sum_{i=1}^{k} p(x_i)H(C|X=x_i) \\ & = H(C) - H(C|X) = IG \end{align} Note that $p(c_j) = \sum_{i} p(x_i, c_j)$, and $H(C|X)=\sum_{i} p(x_i) H(C|X=x_i) = -\sum_{i} p(x_i) \sum_{j} p(c_j|x_i) \log p(c_j|x_i)$.
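The equivalence derived above can be verified numerically: computing MI from the joint table and computing $H(C) - H(C|X)$ give the same number. This sketch (the functions and the example table are hypothetical) does both:

```python
import math

def entropy(p):
    return -sum(pi * math.log2(pi) for pi in p if pi > 0)

def mutual_information(joint):
    """MI(X,C) from the joint table joint[x][c]."""
    px = [sum(row) for row in joint]
    pc = [sum(col) for col in zip(*joint)]
    return sum(pxy * math.log2(pxy / (px[i] * pc[j]))
               for i, row in enumerate(joint)
               for j, pxy in enumerate(row) if pxy > 0)

def information_gain(joint):
    """IG = H(C) - H(C|X), from the same joint table joint[x][c]."""
    px = [sum(row) for row in joint]
    pc = [sum(col) for col in zip(*joint)]
    # H(C|X) = sum_i p(x_i) * H(C | X = x_i), where the conditional
    # distribution p(c | x_i) is the i-th row normalized by p(x_i).
    h_c_given_x = sum(px[i] * entropy([pxy / px[i] for pxy in row])
                      for i, row in enumerate(joint) if px[i] > 0)
    return entropy(pc) - h_c_given_x

# joint[x][c]: a binary feature X against a binary class C.
joint = [[0.3, 0.1], [0.2, 0.4]]
# The two quantities agree, as the derivation shows.
print(mutual_information(joint), information_gain(joint))
```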