t-SNE in Python (part 1)
In reading some papers about Tomas Mikolov's word2vec algorithm for creating word embeddings, I came across a cool method for visualizing high dimensional data. I have some experience with methods used for dimensionality reduction, like PCA and autoencoders (check out an old but presumably working version here), but I'd never encountered something purpose-built for visualizing high dimensional data in two or three dimensions. t-SNE (t-distributed stochastic neighbor embedding) attempts to minimize a 'difference' function (I'll clarify this in a hot sec) between what amounts to distances in high and low dimensional space. Here, we are calculating 'distances' (really probabilities) between vectors in the dataset. t-SNE has a homepage here, where you can find a bunch of resources. The original paper (also on the homepage) is pretty straightforward in its explanation of the algorithm. I'll quickly describe what's going on with the algorithm before discussing one aspect of it that I've been working on.
Say you have a set of vectors ${x_0, x_1, x_2, ..., x_N} \in X$, each with dimension $k$. As we're dealing with dimensionality reduction, assume that $k$ is something large. For example, we could be dealing with the MNIST dataset, where $k$ would be 784. The goal is to generate some good representation of $X$ in a lower dimensional space, say of dimension $l$. Ideally $l$ would be something like 2 or 3, so we could see our data on a 2- or 3-D scatter plot. Let's call this low dimensional representation $Y$. The goal here is to preserve the relative position of each vector in $X$ in our new representation $Y$. In other words, for each vector $x_i$ we want to create an embedding $y_i$ that preserves some distance metric between it and every other vector, in both $X$ and $Y$. t-SNE uses two different distance metrics -- one for the vectors in $X$ and another for the vectors in $Y$. For a detailed look at why, check out the paper. For each $x_i$ we can calculate an associated probability $p_{j|i}$ that looks as follows:
$$
p_{j|i} = \frac{exp(-{||x_i - x_j||}^{2} / 2\sigma_i^2)}{\sum_{k\ne i}exp(-{||x_i - x_k||}^{2} / 2\sigma_i^2)}
$$
Here $x_j$ is another vector in $X$. As such, we are calculating the probability of vector $j$ given vector $i$. I don't think this distribution is entirely intuitive, but you can think of it as the probability that $x_i$ would 'choose' vector $x_j$. As van der Maaten points out in the paper, vectors that are close to each other (i.e., their Euclidean distance is near zero) will have a relatively high probability, while those with larger distances will have a much lower probability. In practice, one doesn't actually use this probability distribution. Instead, we use a joint probability distribution over indices $i$ and $j$. I was puzzled by the normalization factor, and it only became clear to me once I took a peek at some code.
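That code-peek can be sketched: the conditional distribution above is just a row-normalized Gaussian kernel matrix. Here's a minimal NumPy sketch (the function and argument names are my own, not from any t-SNE library):

```python
import numpy as np

def conditional_probs(X, sigmas):
    """Compute the t-SNE conditional probabilities p_{j|i}.

    X: (n, k) array of high dimensional vectors.
    sigmas: (n,) array of per-point bandwidths sigma_i.
    """
    # Squared Euclidean distances ||x_i - x_j||^2 for all pairs.
    sq_dists = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    # Unnormalized affinities exp(-||x_i - x_j||^2 / 2 sigma_i^2).
    P = np.exp(-sq_dists / (2.0 * sigmas[:, None] ** 2))
    np.fill_diagonal(P, 0.0)  # p_{i|i} is defined to be zero
    # Normalize each row so that sum_j p_{j|i} = 1.
    return P / P.sum(axis=1, keepdims=True)
```

Row $i$ of the result is the distribution over which neighbor $x_i$ would 'choose'.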
$$
p_{ij} = \frac{exp(-{||x_i - x_j||}^{2} / 2\sigma^2)}{\sum_{k\ne l}exp(-{||x_k - x_l||}^{2} / 2\sigma^2)} \\
p_{ij} = \frac{p_{i|j} + p_{j|i}}{2n} $$
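The symmetrization in the second line is a one-liner once you have the conditional matrix in hand. A sketch, assuming the rows of the input already hold the $p_{j|i}$ (names are my own):

```python
import numpy as np

def joint_probs(P_cond):
    """Symmetrize a matrix of conditional probabilities p_{j|i}.

    P_cond: (n, n) array whose rows each sum to one (row i holds p_{j|i}).
    Returns p_ij = (p_{i|j} + p_{j|i}) / 2n, which sums to one over the
    whole matrix.
    """
    n = P_cond.shape[0]
    return (P_cond + P_cond.T) / (2.0 * n)
```

Dividing by $2n$ works because each of the $n$ rows of `P_cond` (and of its transpose) sums to one.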
The second line above has to do with a symmetry consideration. Again, if you're interested, check out van der Maaten's paper. The way I think of this is as follows: we calculate the $p_{j|i}$ as we normally would. What we get from this is an $n \times n$ matrix, where $n$ is the number of vectors in our dataset. Note that as a joint probability matrix, this sucker is not yet normalized! We then add the transpose of the unnormalized probability matrix to it, and normalize the result. Van der Maaten and crew also define a joint probability distribution for the vectors in $Y$. Instead of using the same distribution as for the vectors in $X$, t-SNE employs a Student t-distribution with a single degree of freedom. This choice is well motivated: it avoids overcrowding in the lower dimensional space, as the Student t-distribution has fatter tails than a Gaussian. Note that we could use a Gaussian kernel here too (with no $\sigma$ factor), and we would simply be doing symmetric SNE. Anyways, the joint probability distribution looks as follows:
$$
q_{ij} = \frac{{(1 + {||y_i - y_j||}^{2})}^{-1}}{\sum_{k\ne l}{(1 + {||y_l - y_k||}^{2})}^{-1}}
$$
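Concretely, that whole-matrix normalization looks like this in NumPy (again a sketch, with my own names):

```python
import numpy as np

def joint_probs_lowdim(Y):
    """Student-t joint probabilities q_ij for the low dimensional points.

    Y: (n, l) array of embeddings. The normalization runs over the whole
    matrix, which is what the k != l sum in the denominator means.
    """
    sq_dists = np.sum((Y[:, None, :] - Y[None, :, :]) ** 2, axis=-1)
    # (1 + ||y_i - y_j||^2)^{-1}: a Student t with one degree of freedom.
    Q = 1.0 / (1.0 + sq_dists)
    np.fill_diagonal(Q, 0.0)  # q_{ii} is defined to be zero
    return Q / Q.sum()
```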
As before, that $l$ index confused me quite a bit before I saw how this algorithm got implemented in some code. Basically, the idea is that we calculate a matrix full of unnormalized $q_{ij}$s, and then normalize over the whole matrix. The cost function that we minimize in order to generate the best embedding $y_i$ for each vector $x_i$ looks as follows.
$$
Cost = \sum_{i} \sum_{j} p_{ij}\log\left(\frac{p_{ij}}{q_{ij}}\right)
$$
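With $P$ and $Q$ as matrices, the double sum is easy to write down. A sketch (the `eps` guard against $\log(0)$ is my own addition, not something from the paper):

```python
import numpy as np

def kl_cost(P, Q, eps=1e-12):
    """KL divergence between the high-dim joint P and low-dim joint Q.

    Both matrices are assumed to sum to one; eps keeps the log finite
    where entries (e.g. the diagonals) are zero.
    """
    return np.sum(P * np.log((P + eps) / (Q + eps)))
```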
We can minimize this function (which is just the KL divergence between the distributions $P$ and $Q$) using conventional neural net techniques, like gradient descent. Van der Maaten and co. spend some time talking about the gradient of the cost function, but I thought I would leave it up to Theano to do that... The problem I've been concerned with is calculating the free parameter $\sigma_i$ in $p_{j|i}$. In their paper, Van der Maaten and co. say that they use binary search to find an optimal value of $\sigma_i$. I thought it would be cool if I could leverage the machinery of Scipy/Numpy (and even Theano!) to do this. First, however, a description of the problem. t-SNE has a hyperparameter, called the perplexity, that determines the value of $\sigma_i$ for each $x_i$. The perplexity is the same for every vector in $X$. One calculates the perplexity as follows:
$$
Perplexity(x_i) = 2^{H_i} \\
H_i = - \sum_{j} p_{j|i} log_{2}(p_{j|i}) $$
$H_i$ is called the entropy (which looks the same as entropy in physics, minus the Boltzmann constant). So let's say we set the perplexity to be 10; the goal is then to find a value of $\sigma_i$ for each $x_i$ such that $Perplexity(x_i)$ is 10. In principle this is just a matter of creating some Python function that has $\sigma_i$ as an argument, and using scipy's optimize module to find the minimum of the squared distance between the chosen perplexity and the calculated value. I wondered, however, if I could use Newton's method, in conjunction with Theano's symbolic gradient, to do this. It turns out that I couldn't, so I just stuck with Scipy's optimize module. In the next post, I'll talk about my implementation of the t-SNE algorithm, using conventional Numpy and Theano.
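Here's roughly what that scipy approach looks like, sketched with `scipy.optimize.minimize_scalar` (function names and the bracketing bounds are my own choices; the paper itself uses a hand-rolled binary search):

```python
import numpy as np
from scipy import optimize

def find_sigma(sq_dists_i, target_perplexity):
    """Find sigma_i so that the perplexity of row i matches the target.

    sq_dists_i: squared distances ||x_i - x_j||^2 from x_i to every
    other point (the i == j entry already removed).
    """
    def perplexity(sigma):
        p = np.exp(-sq_dists_i / (2.0 * sigma ** 2))
        p /= p.sum()
        H = -np.sum(p * np.log2(p + 1e-12))  # entropy in bits
        return 2.0 ** H

    # Minimize the squared gap between achieved and target perplexity.
    res = optimize.minimize_scalar(
        lambda s: (perplexity(s) - target_perplexity) ** 2,
        bounds=(1e-3, 1e3), method="bounded")
    return res.x
```

This works because the perplexity increases monotonically with $\sigma_i$ (from 1 up to the number of neighbors), so the squared gap has a single minimum in the bracket.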