Saturday, June 21, 2014

Plotting Word Embedding using T-SNE and Barnes-Hut-SNE with R

This post is a short tutorial on plotting high-dimensional word embedding data produced by word2vec using the t-SNE and Barnes-Hut-SNE techniques in R.

What is t-SNE?

t-SNE was introduced by Laurens van der Maaten and Geoff Hinton in "Visualizing Data using t-SNE" [2]. t-SNE stands for t-Distributed Stochastic Neighbor Embedding. It visualizes high-dimensional data by giving each datapoint a location in a two- or three-dimensional map. It is a variation of Stochastic Neighbor Embedding (Hinton and Roweis, 2002) that is much easier to optimize and produces significantly better visualizations, because it reduces the tendency to crowd points together in the center of the map, which often renders a visualization ineffective and unreadable. t-SNE is good at creating a map that reveals structure and embedding relationships at many different scales. This is particularly important for high-dimensional, inter-related data that lie on several different low-dimensional manifolds, such as images of objects from multiple classes seen from multiple viewpoints.
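For readers who want a bit of the math, here is a brief sketch of the objective from [2]: t-SNE converts pairwise distances into affinities p_ij in the original space and q_ij in the low-dimensional map (the latter using a heavy-tailed Student t-distribution with one degree of freedom), then minimizes the Kullback-Leibler divergence between the two distributions:

\[
q_{ij} = \frac{\left(1 + \lVert y_i - y_j \rVert^2\right)^{-1}}{\sum_{k \neq l} \left(1 + \lVert y_k - y_l \rVert^2\right)^{-1}},
\qquad
C = \mathrm{KL}(P \parallel Q) = \sum_{i \neq j} p_{ij} \log \frac{p_{ij}}{q_{ij}}
\]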

The baseline version of t-SNE has O(N^2) complexity. Later, van der Maaten introduced an O(N log N) version of t-SNE, a.k.a. Barnes-Hut-SNE, in [3].

t-SNE works with many forms of high-dimensional data. In this article, we demonstrate the use of t-SNE on the high-dimensional distributed word representations (also known as word embeddings) produced by the word2vec algorithm. These high-dimensional representations of words can be used in many natural language processing (NLP) applications.

References:

[0] About R
[2] L.J.P. van der Maaten and G.E. Hinton. Visualizing Data using t-SNE. Journal of Machine Learning Research, 9:2579-2605, 2008.
[3] L.J.P. van der Maaten. Barnes-Hut-SNE. International Conference on Learning Representations, 2013.
[6] word2vec algorithm


Thankfully, there are several free implementations of t-SNE out there. Here I am going to show you the R version and use it to visualize distributed representations of words, a.k.a. word embeddings.

If you are interested in learning more about word embeddings, you can check out Mikolov's word2vec page for some context. Several other sites have published their word embeddings as well, such as wordrepresentation and #wenpengyin-phraseembedding. Note that we use word embeddings here, but there are many other kinds of high-dimensional data besides word embeddings that can be visualized with the t-SNE technique.


An example of the t-SNE word embedding plot. 

The full-size original image can be downloaded here, and below are more images showing different regions of the embedding.








Barnes-Hut-SNE Plot:
Plot generated by Barnes-Hut-SNE (Rtsne)

Running t-SNE and Barnes-Hut-SNE in R


1. Download and install the tsne and Rtsne R packages (see the instructions for installing R packages); a minimal install snippet is shown below. It is probably also helpful to look at the "R in Action" textbook to get comfortable with the basics of datasets in R, loading data into R, and the basic syntax of the language, covering the concepts of variables, functions, graphs, and using R packages.
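A minimal sketch of the package installation, assuming the standard CRAN package names tsne and Rtsne:

# install the tsne and Rtsne packages from CRAN (one-time setup)
install.packages("tsne")
install.packages("Rtsne")

# load them to confirm the installation worked
library(tsne)
library(Rtsne)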
2. Obtaining and importing the dataset into R

You can use any high-dimensional vector data and import it into R. If you don't have any, I have provided a sample word embedding dataset produced by word2vec. The sample dataset can be downloaded here.

DISCLAIMER: The intention of sharing the data is to provide quick access so that anyone can plot t-SNE immediately without having to generate the data themselves. The data does not necessarily reflect the quality of the technique, since quality is affected by many parameters, and this word embedding was produced by running the word2vec algorithm with all default parameters. The data is shared on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. You need to agree to this before using the data.

To load the dataset into R, execute the following command in the R shell (assuming the data is in d:\samplewordembedding.csv):

> mydata <- read.table("d:\\samplewordembedding.csv", header=TRUE, sep=",")
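As a quick sanity check (a sketch; the exact dimensions depend on the sample file), you can confirm that the words ended up as row names and the embedding coordinates as numeric columns:

# number of rows (words) and columns (embedding dimensions)
dim(mydata)

# the words should appear as the row names of the data frame
head(rownames(mydata))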

3. Finally, plotting the sample data with tsne and Rtsne

Copy the following R script snippet and paste it into the R command line. The script assumes that the input file is in d:\samplewordembedding.csv and that the output plot files are written to the root directory of drive d: (d:\plot.0.jpg, d:\plot.1.jpg, etc.). Be prepared for your CPU usage to spike a little, and expect the tsne run to take several minutes to complete.
# loading data into R
mydata <- read.table("d:\\samplewordembedding.csv", header=TRUE, sep=",")

# load the tsne package
library(tsne)

# counter used to number the output plot files (plot.0.jpg, plot.1.jpg, ...)
plot_count <- 0

# epoch callback: 'y' is the current low-dimensional embedding passed in by tsne()
epc <- function(y) {
    filename <- paste("d:\\plot", plot_count, "jpg", sep=".")
    plot_count <<- plot_count + 1
    cat("> Plotting TSNE to ", filename, "\n")

    # plot to the d:\plot.<n>.jpg file at 2400x1800 resolution
    jpeg(filename, width=2400, height=1800)

    plot(y, t='n', main="T-SNE")
    text(y, labels=rownames(mydata))
    dev.off()
}

# run tsne (maximum iterations:500, callback every 100 epochs, target dimension k=5)
tsne_data <- tsne(mydata, k=5, epoch_callback=epc, max_iter=500, epoch=100)
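The callback above plots the intermediate embeddings; once tsne() returns, you can also plot the final result directly. A sketch, reusing the same plotting approach (d:\plot.final.jpg is just an illustrative output path):

# plot the final embedding returned by tsne(); only the first two of the k dimensions are drawn
jpeg("d:\\plot.final.jpg", width=2400, height=1800)
plot(tsne_data, t='n', main="T-SNE (final)")
text(tsne_data, labels=rownames(mydata))
dev.off()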

Similarly, you can use the Rtsne package, which implements the faster O(N log N) Barnes-Hut-SNE algorithm. Rtsne is an R wrapper, written by jkrijthe, around van der Maaten's C++ Barnes-Hut-SNE implementation. As you would expect, Rtsne completes much faster than tsne.
# load the Rtsne package
library(Rtsne)

# run Rtsne with default parameters
rtsne_out <- Rtsne(as.matrix(mydata))

# plot the output of Rtsne into d:\\barneshutplot.jpg file of 2400x1800 dimension
jpeg("d:\\barneshutplot.jpg", width=2400, height=1800)
plot(rtsne_out$Y, t='n', main="BarnesHutSNE")
text(rtsne_out$Y, labels=rownames(mydata))
dev.off()
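If the default layout looks too crowded or the run takes too long, Rtsne exposes the usual t-SNE parameters; the values below are only illustrative:

# theta trades accuracy for speed (0 = exact t-SNE, larger values = faster approximation);
# perplexity controls the effective neighborhood size
rtsne_out2 <- Rtsne(as.matrix(mydata), dims=2, perplexity=30, theta=0.5, max_iter=1000)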