As promised, in this post I’ll talk about my implementation of the t-SNE dimensionality reduction algorithm in Python. I’ve made a version that explicitly calculates the gradient with respect to each vector $y_i$ in the reduced dataset, and another version that employs Theano’s grad function. The Numpy version was a bit tricky to implement. The Theano version, for once, was actually easier than the Numpy version, because all you do is just slam in T.grad. I’ll start with the Numpy version, and then move on to the Theano version. Before we jump into any code, let’s state the gradient of the cost function:

\[\frac{\partial Cost}{\partial y_{i}} = 4\sum_{j} (p_{ij} - q_{ij})(y_i - y_j)(1 + ||y_i - y_j||^2)^{-1}\]

Say you’ve stored your high dimensional vectors in some variable X, and the initial guesses for the low dimensional vectors in some variable Y. We can calculate the joint probabilities $q_{ij}$ as follows. Notice that the $(1 + ||y_i - y_j||^2)^{-1}$ factor in the gradient is just the unnormalized $q$ matrix, so the function below returns it alongside the normalized one; we’ll want it again for the gradient.

import numpy as np

def qij(Y):
    n = Y.shape[0]
    Y_squared = np.sum(Y**2, axis=1)
    # squared pairwise distances: ||y_i||^2 + ||y_j||^2 - 2 y_i . y_j
    q_no_norm = 1. / (1 + ((Y_squared - 2*np.dot(Y, Y.T)) + Y_squared.reshape(n,1)))
    # or, equivalently:
    # q_no_norm = 1. / (1 + ((Y_squared - 2*np.dot(Y, Y.T)).T + Y_squared))
    q_no_norm[np.arange(n),np.arange(n)] = 0.0
    norm = np.sum(q_no_norm)
    qij = q_no_norm / norm
    return qij, q_no_norm
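
Just to be concrete about what comes back, here’s the kind of quick sanity check I’d run (the sizes and the small random init below are arbitrary, nothing special):

Y0 = 1e-2 * np.random.randn(100, 2)  # small random init for the low dimensional points
Q, Q_no_norm = qij(Y0)
print(Q.shape, Q_no_norm.shape)  # both (100, 100)
print(Q.sum())                   # the normalized version sums to 1 (up to floating point)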

The cost is pretty simple as well. Here I assume that you have a function pij that calculates the joint probabilities $p_{ij}$ given X. Now, you wouldn’t actually call it every time you compute the cost, because you only have to do it once at the beginning, but this is just for illustration purposes…

def cost(X, Y):
    n = Y.shape[0]
    Q, _ = qij(Y)
    P = pij(X)
    # set the diagonals to 1 so the i == j terms contribute nothing to the KL divergence
    Q[np.arange(n), np.arange(n)] = 1.0
    P[np.arange(n), np.arange(n)] = 1.0
    return np.sum(P*np.log(P/Q))
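
In case you want something runnable, here’s a rough sketch of what a pij function could look like. I’m cheating and using a single fixed bandwidth sigma for every point; the real t-SNE algorithm does a binary search for a per-point sigma that hits a target perplexity, which I’m skipping here, so treat this as a stand-in rather than the real thing.

def pij(X, sigma=1.0):
    # toy stand-in: Gaussian affinities with one fixed sigma for every point,
    # symmetrized and normalized into joint probabilities
    n = X.shape[0]
    X_squared = np.sum(X**2, axis=1)
    dists = X_squared - 2*np.dot(X, X.T) + X_squared.reshape(n, 1)
    p_cond = np.exp(-dists / (2*sigma**2))
    p_cond[np.arange(n), np.arange(n)] = 0.0
    p_cond = p_cond / p_cond.sum(axis=1, keepdims=True)  # conditional p_{j|i}
    return (p_cond + p_cond.T) / (2*n)  # symmetrize into joint p_{ij}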

Now comes the gradient of the cost function. This is a little trickier.

def grad_cost(X, Y):
    n = Y.shape[0]
    Q, q_no_norm = qij(Y)
    P = pij(X)
    # make an n x n x d array where entry [i, j] is y_j
    Y_tile = np.tile(Y,(n,1)).reshape((n, n, Y.shape[1]))
    Y_swap = Y_tile.swapaxes(0,1) # entry [i, j] is y_i
    Y_diff = Y_swap - Y_tile # This is the (y_i - y_j) term
    mult = (P - Q)*q_no_norm # This is the (p_{ij} - q_{ij})*(1 + ||y_i - y_j||^2)^{-1} term
    # below we do some Numpy magic to broadcast mult over the coordinate axis
    # (the swapaxes transposes mult, but mult is symmetric so that's fine)
    dY = np.tile(mult,(Y.shape[1],1)).reshape((Y.shape[1], n, n)).swapaxes(0,2)*Y_diff
    # sum over j (axis 1)
    dY = 4.*np.sum(dY, axis=1)
    return dY
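
If you want to convince yourself that the analytic gradient is right, a quick finite-difference check against the cost function does the trick. The snippet below is just a sanity check on a small random dataset (the sizes, eps, and the pij it relies on, e.g. the toy one above, are all stand-ins):

rand_X = np.random.randn(10, 5)
rand_Y = np.random.randn(10, 2)
eps = 1e-6
num_grad = np.zeros_like(rand_Y)
for i in range(rand_Y.shape[0]):
    for k in range(rand_Y.shape[1]):
        Y_plus = rand_Y.copy();  Y_plus[i, k] += eps
        Y_minus = rand_Y.copy(); Y_minus[i, k] -= eps
        num_grad[i, k] = (cost(rand_X, Y_plus) - cost(rand_X, Y_minus)) / (2*eps)
print(np.max(np.abs(num_grad - grad_cost(rand_X, rand_Y))))  # should be tiny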

To pull this off efficiently, you have to do some wizardry with Numpy. Now watch the same thing, but in Theano. Here I assume that pij is already stored in an array.

import theano
import theano.tensor as T

def qij(Y):
    n = Y.shape[0]
    Y_squared = T.sum(Y**2, axis=1)
    q_no_norm = 1. / (1 + ((Y_squared - 2*T.dot(Y, Y.T)).T + Y_squared))
    # Theano tensors don't support in-place assignment, so use set_subtensor
    q_no_norm = T.set_subtensor(q_no_norm[T.arange(n), T.arange(n)], 0.0)
    norm = T.sum(q_no_norm)
    qij = q_no_norm / norm
    return qij

def cost(Y):
    n = Y.shape[0]
    Q = qij(Y)
    # below is how we set values of arrays in Theano
    Q = T.set_subtensor(Q[T.arange(n), T.arange(n)], 1.0)
    # pij is a plain Numpy array, so we can fix its diagonal with Numpy before it enters the graph
    P = np.copy(pij)
    np.fill_diagonal(P, 1.0)
    return T.sum(P*T.log(P/Q))

def grad_cost(Y):
    # try not to shit ya self...
    grad_Y = T.grad(cost(Y),Y)
    return grad_Y

y = T.matrix('y')
grad_cost_expr = grad_cost(y)
grad_cost_fn = theano.function(inputs=[y], outputs=grad_cost_expr)
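
Once it’s compiled, grad_cost_fn is just a function that eats a plain Numpy array and spits out the gradient (assuming the pij array was built for the same number of points):

Y0 = (1e-2 * np.random.randn(pij.shape[0], 2)).astype(theano.config.floatX)
dY = grad_cost_fn(Y0)
print(dY.shape)  # same shape as Y0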

Whaaaaaaa??? Take a dozen lines of code and turn them into two? Theano, you’ve stolen my heart! The crazy part is that it even runs a little faster than the Numpy version. As always, there is some overhead (around 3 seconds on my machine) to compile the function that computes the gradient, but when ‘training’ the t-SNE algorithm the Theano version is about 1.5x faster, so you quickly make that time back.
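
For reference, here’s roughly what that ‘training’ looks like: plain gradient descent with momentum, which is what the original t-SNE paper uses. I’m leaving out early exaggeration and the adaptive per-parameter gains to keep it short, and the learning rate, momentum, and iteration count below are placeholders, not tuned settings.

eta = 100.0
momentum = 0.5
n_iter = 1000

Y = 1e-2 * np.random.randn(X.shape[0], 2)  # small random init of the map points
Y_inc = np.zeros_like(Y)
for it in range(n_iter):
    dY = grad_cost(X, Y)  # or grad_cost_fn(Y) for the Theano version
    Y_inc = momentum * Y_inc - eta * dY
    Y = Y + Y_inc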