
word2vec example


The word2vec example implements an algorithm for computing continuous distributed representations of words. According to the word2vec repository, it provides an efficient implementation of the continuous bag-of-words and skip-gram architectures for computing vector representations of words. These representations can subsequently be used in many natural language processing applications and for further research.

The code is based on the paper Distributed Representations of Words and Phrases and their Compositionality by Tomas Mikolov et al., and a detailed explanation is given in the Word2Vec TF tutorial.
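In the skip-gram architecture, each word is trained to predict the words surrounding it within a small window. The following is a minimal, illustrative sketch of how (target, context) training pairs are generated; it is not the input pipeline used by the example scripts, and the window size is an arbitrary assumption.

```
words = "anarchism originated as a term of abuse".split()
window = 2  # context words considered on each side of the target (arbitrary choice)

pairs = []
for i, target in enumerate(words):
    # every other word inside the window around position i becomes a context word
    for j in range(max(0, i - window), min(len(words), i + window + 1)):
        if j != i:
            pairs.append((target, words[j]))

print(pairs[:5])
# [('anarchism', 'originated'), ('anarchism', 'as'),
#  ('originated', 'anarchism'), ('originated', 'as'), ('originated', 'a')]
```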


The installation is best done in a Docker image or with a full Bazel installation. In the Docker image or on the host machine, execute the code listed below. The wget command downloads the text8 corpus (30 MByte compressed, 100 MByte extracted), which starts with "anarchism originated as a term of abuse". The file is 100,000,000 characters long and contains 17,005,207 words, including 253,854 unique words and 71,290 unique frequent words.
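Once text8 has been downloaded and unpacked, these statistics can be double-checked with a few lines of standard-library Python. The minimum count of 5 used below for "frequent" words is an assumption based on word2vec's default, not something stated here.

```
from collections import Counter

with open("text8") as f:
    text = f.read()

words = text.split()
counts = Counter(words)
# word2vec's default minimum word count is 5; the exact threshold behind the
# "unique frequent words" figure above is an assumption here
frequent = [w for w, c in counts.items() if c >= 5]

print(len(text))      # 100000000 characters
print(len(words))     # 17005207 words
print(len(counts))    # 253854 unique words
print(len(frequent))  # roughly 71290 unique frequent words
```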

The file questions-words.txt contains roughly 20,000 manually curated word-analogy questions, grouped into sections including capital-common-countries (Athens Greece Baghdad Iraq), capital-world (Abuja Nigeria Accra Ghana), currency (Algeria dinar Argentina peso), city-in-state, family, gram1-adjective-to-adverb, gram2-opposite, gram3-comparative, gram4-superlative (bad worst big biggest), gram5-present-participle, gram6-nationality-adjective, gram7-past-tense, gram8-plural, and gram9-plural-verbs (decrease decreases describe describes).
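Each line of questions-words.txt holds four words "a b c d", and the model is scored on whether the vocabulary word closest to vec(b) - vec(a) + vec(c) is d. Below is a minimal sketch of that scoring, assuming a hypothetical `emb` dictionary mapping lowercase words to NumPy vectors; the actual evaluation is built into the example scripts.

```
import numpy as np

def answer_analogy(emb, a, b, c):
    """Predict the fourth word of an analogy line "a b c d": the word whose
    vector has the highest cosine similarity to emb[b] - emb[a] + emb[c],
    excluding a, b and c themselves."""
    target = emb[b] - emb[a] + emb[c]
    target = target / np.linalg.norm(target)
    best_word, best_score = None, -np.inf
    for word, vec in emb.items():
        if word in (a, b, c):
            continue
        score = float(np.dot(vec, target) / np.linalg.norm(vec))
        if score > best_score:
            best_word, best_score = word, score
    return best_word

# e.g. for a capital-common-countries question:
# answer_analogy(emb, "athens", "greece", "baghdad")  # expected: "iraq"
```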

```
cd tensorflow
wget http://mattmahoney.net/dc/text8.zip -O text8.gz
gzip -d text8.gz -f
wget https://word2vec.googlecode.com/svn/trunk/questions-words.txt
bazel build -c opt tensorflow/models/embedding:all
```

which results in

```
root@fb729273837c:/tensorflow# bazel build -c opt tensorflow/models/embedding:all
INFO: Reading 'startup' options from /root/.bazelrc: --batch
INFO: Found 10 targets...
INFO: Elapsed time: 10.615s, Critical Path: 2.25s
```

After that we can start the example Python script using the manual command from the README. The tutorial code contains two versions of the multi-threaded word2vec skip-gram model, one batched and one unbatched:

* word2vec.py - a version of word2vec implemented using TensorFlow ops and minibatching (a rough sketch of the batching idea follows this list).
* word2vec_optimized.py - a version of word2vec implemented using C ops that does no minibatching.
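Minibatching here simply means that many (target, context) examples are grouped into fixed-size arrays and processed in one step of the TensorFlow graph, while the optimized kernel consumes the stream of examples directly. The snippet below is only a rough sketch of the batching idea; the batch size is arbitrary and this is not the script's actual input pipeline.

```
import numpy as np

def batches(pairs, batch_size=16):
    """Group (target_id, context_id) pairs into fixed-size index arrays; a
    minibatched model runs one training step per batch instead of per pair."""
    for start in range(0, len(pairs) - batch_size + 1, batch_size):
        chunk = pairs[start:start + batch_size]
        targets = np.array([t for t, _ in chunk], dtype=np.int32)
        contexts = np.array([c for _, c in chunk], dtype=np.int32)
        # each (targets, contexts) batch would be fed to the graph in one step
        yield targets, contexts
```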

```
bazel-bin/tensorflow/models/embedding/word2vec_optimized \
  --train_data=text8 \
  --eval_data=questions-words.txt \
  --save_path=/tmp/
```


which produces output like the following

```
root@fb729273837c:/tensorflow# time bazel-bin/tensorflow/models/embedding/word2vec_optimized   --train_data=text8   --eval_data=questions-words.txt   --save_path=/tmp/
I tensorflow/core/common_runtime/local_device.cc:25] Local device intra op parallelism threads: 8
I tensorflow/core/common_runtime/local_session.cc:45] Local session inter op parallelism threads: 8
I tensorflow/models/embedding/word2vec_kernels.cc:134] Data file: text8 contains 100000000 bytes, 17005207 words, 253854 unique words, 71290 unique frequent words.
Data file:  text8
Vocab size:  71290  + UNK
Words per epoch:  17005207
Eval analogy file:  questions-words.txt
Questions:  17827
Skipped:  1717
Epoch    1 Step   151322: lr = 0.023 words/sec =    34117
Eval 1554/17827 accuracy =  8.7%
Epoch    2 Step   302660: lr = 0.022 words/sec =     3900
Eval 2302/17827 accuracy = 12.9%
Epoch    3 Step   453986: lr = 0.020 words/sec =    32707
Eval 3049/17827 accuracy = 17.1%
Epoch    4 Step   605329: lr = 0.018 words/sec =    11805
Eval 3528/17827 accuracy = 19.8%
Epoch    5 Step   756656: lr = 0.017 words/sec =   126655
Eval 4055/17827 accuracy = 22.7%
Epoch    6 Step   907954: lr = 0.015 words/sec =    66275
Eval 4434/17827 accuracy = 24.9%
Epoch    7 Step  1059303: lr = 0.013 words/sec =   125780
Eval 4737/17827 accuracy = 26.6%
Epoch    8 Step  1210621: lr = 0.012 words/sec =   123938
Eval 5042/17827 accuracy = 28.3%
Epoch    9 Step  1361968: lr = 0.010 words/sec =    89538
Eval 5335/17827 accuracy = 29.9%
Epoch   10 Step  1513319: lr = 0.008 words/sec =    48258
Eval 5621/17827 accuracy = 31.5%
Epoch   11 Step  1664661: lr = 0.007 words/sec =   113623
Eval 5812/17827 accuracy = 32.6%
Epoch   12 Step  1815978: lr = 0.005 words/sec =    58567
Eval 6053/17827 accuracy = 34.0%
Epoch   13 Step  1967289: lr = 0.003 words/sec =    81122
Eval 6203/17827 accuracy = 34.8%
Epoch   14 Step  2118655: lr = 0.002 words/sec =    68519
Eval 6291/17827 accuracy = 35.3%
Epoch   15 Step  2269981: lr = 0.000 words/sec =    64780
Eval 6366/17827 accuracy = 35.7%

real	36m4.861s
user	240m20.464s
sys	24m18.860s
root@fb729273837c:/tensorflow# 
```

The final accuracy for the TensorFlow word2vec_optimized.py using the text8 corpus and questions-words.txt is 35.7% (6366 of 17827 analogy questions answered correctly). The result is not deterministic and varies slightly from run to run.
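With the model trained, the learned embeddings can be queried for nearest neighbours. This is only a minimal sketch: it assumes the embedding matrix has already been exported from the checkpoint under --save_path into a NumPy array `emb` (vocab_size x dim) with a matching `vocab` word list; the export step itself is omitted here.

```
import numpy as np

def nearest_neighbours(emb, vocab, query, k=5):
    """Return the k words whose vectors are closest (by cosine similarity) to
    the vector of `query`. emb: (vocab_size, dim) array, vocab: list of words."""
    normed = emb / np.linalg.norm(emb, axis=1, keepdims=True)
    q = normed[vocab.index(query)]
    scores = normed @ q
    best = np.argsort(-scores)[1:k + 1]  # index 0 is the query word itself
    return [(vocab[i], float(scores[i])) for i in best]

# e.g. nearest_neighbours(emb, vocab, "france") should return mostly country names
```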


We can also see that the optimized version (word2vec_optimized.py) is highly efficient and uses around 90-100% of all CPU cores, whereas the non-optimized version (word2vec.py) is slow and barely reaches 40% CPU utilization.

word2vec-optimized-tensorflow:

[Screenshot: CPU utilization while running word2vec_optimized]

word2vec-tensorflow-not-optimized:

[Screenshot: CPU utilization while running word2vec.py]


Links

  • text8 - text8 corpus by Matt Mahoney
  • word2vec - computing continuous distributed representations of words
  • word2vec@chalow - Trying out word2vec on a MacBook Air (OS X 10.9.2) (in Japanese)
  • word2vec@cnblogs - Playing with Google's open-source deep learning project word2vec (in Chinese)
  • Word2Vec&GloVe - Getting Started with Word2Vec and GloVe in Python
  • Books&ngrams - Google Books ngram viewer
  • word2vec&parallel - Interesting benchmark about parallelizing word2vec in Python
  • word2vec - explained with examples
