
word2vec example


The word2vec example implements an algorithm for computing continuous distributed representations of words. According to the word2vec repository, it provides an efficient implementation of the continuous bag-of-words and skip-gram architectures for computing vector representations of words. These representations can subsequently be used in many natural language processing applications and for further research.

The code is based on the paper Distributed Representations of Words and Phrases and their Compositionality by Tomas Mikolov et al., and a detailed explanation is given in the Word2Vec TF tutorial.
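In the skip-gram architecture, each word is trained to predict the words surrounding it within a small window. The following is a minimal, illustrative sketch of how (target, context) training pairs are generated; it is not the input pipeline used by the example scripts, and the window size is an arbitrary assumption.

```
words = "anarchism originated as a term of abuse".split()
window = 2  # context words considered on each side of the target (arbitrary choice)

pairs = []
for i, target in enumerate(words):
    # every other word inside the window around position i becomes a context word
    for j in range(max(0, i - window), min(len(words), i + window + 1)):
        if j != i:
            pairs.append((target, words[j]))

print(pairs[:5])
# [('anarchism', 'originated'), ('anarchism', 'as'),
#  ('originated', 'anarchism'), ('originated', 'as'), ('originated', 'a')]
```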


The installation is best done in a Docker image or with a full Bazel installation. In the Docker image or on the host machine, execute the code listed below. The wget command downloads the text8 corpus (30 MByte compressed, 100 MByte extracted), which starts with "anarchism originated as a term of abuse". The file is 100,000,000 characters long and contains 17,005,207 words, including 253,854 unique words and 71,290 unique frequent words.
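Once text8 has been downloaded and unpacked, these statistics can be double-checked with a few lines of standard-library Python. The minimum count of 5 used below for "frequent" words is an assumption based on word2vec's default, not something stated here.

```
from collections import Counter

with open("text8") as f:
    text = f.read()

words = text.split()
counts = Counter(words)
# word2vec's default minimum word count is 5; the exact threshold behind the
# "unique frequent words" figure above is an assumption here
frequent = [w for w, c in counts.items() if c >= 5]

print(len(text))      # 100000000 characters
print(len(words))     # 17005207 words
print(len(counts))    # 253854 unique words
print(len(frequent))  # roughly 71290 unique frequent words
```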

The file questions-words.txt contains roughly 20,000 manually curated word-analogy questions, grouped into sections including capital-common-countries (Athens Greece Baghdad Iraq), capital-world (Abuja Nigeria Accra Ghana), currency (Algeria dinar Argentina peso), city-in-state, family, gram1-adjective-to-adverb, gram2-opposite, gram3-comparative, gram4-superlative (bad worst big biggest), gram5-present-participle, gram6-nationality-adjective, gram7-past-tense, gram8-plural, and gram9-plural-verbs (decrease decreases describe describes).
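Each line of questions-words.txt holds four words "a b c d", and the model is scored on whether the vocabulary word closest to vec(b) - vec(a) + vec(c) is d. Below is a minimal sketch of that scoring, assuming a hypothetical `emb` dictionary mapping lowercase words to NumPy vectors; the actual evaluation is built into the example scripts.

```
import numpy as np

def answer_analogy(emb, a, b, c):
    """Predict the fourth word of an analogy line "a b c d": the word whose
    vector has the highest cosine similarity to emb[b] - emb[a] + emb[c],
    excluding a, b and c themselves."""
    target = emb[b] - emb[a] + emb[c]
    target = target / np.linalg.norm(target)
    best_word, best_score = None, -np.inf
    for word, vec in emb.items():
        if word in (a, b, c):
            continue
        score = float(np.dot(vec, target) / np.linalg.norm(vec))
        if score > best_score:
            best_word, best_score = word, score
    return best_word

# e.g. for a capital-common-countries question:
# answer_analogy(emb, "athens", "greece", "baghdad")  # expected: "iraq"
```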

```
cd tensorflow
wget http://mattmahoney.net/dc/text8.zip -O text8.gz
gzip -d text8.gz -f
wget https://word2vec.googlecode.com/svn/trunk/questions-words.txt
bazel build -c opt tensorflow/models/embedding:all
```

which results in

```
root@fb729273837c:/tensorflow# bazel build -c opt tensorflow/models/embedding:all
INFO: Reading 'startup' options from /root/.bazelrc: --batch
INFO: Found 10 targets...
INFO: Elapsed time: 10.615s, Critical Path: 2.25s
```

After that we can start the example Python script using the manual command from the README. The tutorial code contains two versions of the multi-threaded word2vec skip-gram model, one batched and one unbatched:

* word2vec.py - a version of word2vec implemented using TensorFlow ops and minibatching (a rough sketch of the batching idea follows this list).
* word2vec_optimized.py - a version of word2vec implemented using C ops that does no minibatching.
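Minibatching here simply means that many (target, context) examples are grouped into fixed-size arrays and processed in one step of the TensorFlow graph, while the optimized kernel consumes the stream of examples directly. The snippet below is only a rough sketch of the batching idea; the batch size is arbitrary and this is not the script's actual input pipeline.

```
import numpy as np

def batches(pairs, batch_size=16):
    """Group (target_id, context_id) pairs into fixed-size index arrays; a
    minibatched model runs one training step per batch instead of per pair."""
    for start in range(0, len(pairs) - batch_size + 1, batch_size):
        chunk = pairs[start:start + batch_size]
        targets = np.array([t for t, _ in chunk], dtype=np.int32)
        contexts = np.array([c for _, c in chunk], dtype=np.int32)
        # each (targets, contexts) batch would be fed to the graph in one step
        yield targets, contexts
```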

```
bazel-bin/tensorflow/models/embedding/word2vec_optimized \
  --train_data=text8 \
  --eval_data=questions-words.txt \
  --save_path=/tmp/
```


which produces output like the following

```
root@fb729273837c:/tensorflow# time bazel-bin/tensorflow/models/embedding/word2vec_optimized   --train_data=text8   --eval_data=questions-words.txt   --save_path=/tmp/
I tensorflow/core/common_runtime/local_device.cc:25] Local device intra op parallelism threads: 8
I tensorflow/core/common_runtime/local_session.cc:45] Local session inter op parallelism threads: 8
I tensorflow/models/embedding/word2vec_kernels.cc:134] Data file: text8 contains 100000000 bytes, 17005207 words, 253854 unique words, 71290 unique frequent words.
Data file:  text8
Vocab size:  71290  + UNK
Words per epoch:  17005207
Eval analogy file:  questions-words.txt
Questions:  17827
Skipped:  1717
Epoch    1 Step   151322: lr = 0.023 words/sec =    34117
Eval 1554/17827 accuracy =  8.7%
Epoch    2 Step   302660: lr = 0.022 words/sec =     3900
Eval 2302/17827 accuracy = 12.9%
Epoch    3 Step   453986: lr = 0.020 words/sec =    32707
Eval 3049/17827 accuracy = 17.1%
Epoch    4 Step   605329: lr = 0.018 words/sec =    11805
Eval 3528/17827 accuracy = 19.8%
Epoch    5 Step   756656: lr = 0.017 words/sec =   126655
Eval 4055/17827 accuracy = 22.7%
Epoch    6 Step   907954: lr = 0.015 words/sec =    66275
Eval 4434/17827 accuracy = 24.9%
Epoch    7 Step  1059303: lr = 0.013 words/sec =   125780
Eval 4737/17827 accuracy = 26.6%
Epoch    8 Step  1210621: lr = 0.012 words/sec =   123938
Eval 5042/17827 accuracy = 28.3%
Epoch    9 Step  1361968: lr = 0.010 words/sec =    89538
Eval 5335/17827 accuracy = 29.9%
Epoch   10 Step  1513319: lr = 0.008 words/sec =    48258
Eval 5621/17827 accuracy = 31.5%
Epoch   11 Step  1664661: lr = 0.007 words/sec =   113623
Eval 5812/17827 accuracy = 32.6%
Epoch   12 Step  1815978: lr = 0.005 words/sec =    58567
Eval 6053/17827 accuracy = 34.0%
Epoch   13 Step  1967289: lr = 0.003 words/sec =    81122
Eval 6203/17827 accuracy = 34.8%
Epoch   14 Step  2118655: lr = 0.002 words/sec =    68519
Eval 6291/17827 accuracy = 35.3%
Epoch   15 Step  2269981: lr = 0.000 words/sec =    64780
Eval 6366/17827 accuracy = 35.7%

real	36m4.861s
user	240m20.464s
sys	24m18.860s
root@fb729273837c:/tensorflow# 
```

The final accuracy for the TensorFlow word2vec_optimized.py using the text8 corpus and questions-words.txt is 35.7% (6366 of 17827 analogy questions answered correctly). The result is not deterministic and varies slightly from run to run.
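With the model trained, the learned embeddings can be queried for nearest neighbours. This is only a minimal sketch: it assumes the embedding matrix has already been exported from the checkpoint under --save_path into a NumPy array `emb` (vocab_size x dim) with a matching `vocab` word list; the export step itself is omitted here.

```
import numpy as np

def nearest_neighbours(emb, vocab, query, k=5):
    """Return the k words whose vectors are closest (by cosine similarity) to
    the vector of `query`. emb: (vocab_size, dim) array, vocab: list of words."""
    normed = emb / np.linalg.norm(emb, axis=1, keepdims=True)
    q = normed[vocab.index(query)]
    scores = normed @ q
    best = np.argsort(-scores)[1:k + 1]  # index 0 is the query word itself
    return [(vocab[i], float(scores[i])) for i in best]

# e.g. nearest_neighbours(emb, vocab, "france") should return mostly country names
```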


We can also see that the optimized version (word2vec_optimized.py) is highly efficient and uses around 90-100% of all CPU cores, whereas the non-optimized version (word2vec.py) is slow and barely reaches 40% CPU utilization.

word2vec-optimized-tensorflow:

[Screenshot: CPU utilization while running word2vec_optimized]

word2vec-tensorflow-not-optimized:

[Screenshot: CPU utilization while running word2vec.py]


Links

  • text8 - text8 corpus by Matt Mahoney
  • word2vec - computing continuous distributed representations of words
  • word2vec@chalow - Trying out word2vec on a MacBook Air (OS X 10.9.2) (in Japanese)
  • word2vec@cnblogs - Playing with Google's open-source deep learning project word2vec (in Chinese)
  • Word2Vec&GloVe - Getting Started with Word2Vec and GloVe in Python
  • Books&ngrams - Google Books ngram viewer
  • word2vec&parallel - Interesting benchmark about parallelizing word2vec in Python
  • word2vec - explained with examples
