word2vec example
The word2vec example implements an algorithm for computing continuous distributed representations of words. According to the word2vec repository, it provides an efficient implementation of the continuous bag-of-words and skip-gram architectures for computing vector representations of words. These representations can subsequently be used in many natural language processing applications and for further research.
The code is based on the paper Distributed Representations of Words and Phrases and their Compositionality by Tomas Mikolov et al., and a detailed explanation is given in the Word2Vec TensorFlow tutorial.
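To make the skip-gram architecture concrete, the snippet below is a minimal illustrative sketch (plain Python, not code from the repository): it turns a toy sentence into (target, context) training pairs using a symmetric window. The real scripts do considerably more, e.g. subsampling of frequent words and training with a sampled (NCE-style) loss.
```python
# Illustrative sketch of skip-gram pair generation (not the repository's code).
# For every target word, each word within `window` positions becomes a context
# word, yielding (target, context) training pairs.
def skipgram_pairs(tokens, window=2):
    pairs = []
    for i, target in enumerate(tokens):
        start = max(0, i - window)
        end = min(len(tokens), i + window + 1)
        for j in range(start, end):
            if j != i:
                pairs.append((target, tokens[j]))
    return pairs

# The first words of the text8 corpus used below.
sentence = "anarchism originated as a term of abuse".split()
for target, context in skipgram_pairs(sentence, window=2)[:6]:
    print(target, "->", context)
```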
The installation is best done in a Docker image or with a full Bazel installation. Inside the Docker image (or on the host machine) execute the shell commands listed below. The wget command downloads the text8 corpus (30 MByte compressed / 100 MByte extracted), which starts with "anarchism originated as a term of abuse". The file is 100,000,000 characters long and contains 17,005,207 words, including 253,854 unique words and 71,290 unique frequent words.
The file questions-words.txt contains roughly 20,000 manually curated word analogies, grouped into categories such as capital-common-countries (Athens Greece Baghdad Iraq), capital-world (Abuja Nigeria Accra Ghana), currency (Algeria dinar Argentina peso), city-in-state, family, gram1-adjective-to-adverb, gram2-opposite, gram3-comparative, gram4-superlative (bad worst big biggest), gram5-present-participle, gram6-nationality-adjective, gram7-past-tense, gram8-plural, and gram9-plural-verbs (decrease decreases describe describes).
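Each such line encodes an analogy a : b :: c : d, and the evaluation counts a question as correct when d is the vocabulary word closest to vec(b) - vec(a) + vec(c). The sketch below illustrates that scoring with NumPy; the embedding matrix, vocabulary, and function name are made up for the example and are not part of the repository code.
```python
import numpy as np

# Toy stand-in for a trained embedding table: one unit-length row per word.
vocab = ["athens", "greece", "baghdad", "iraq"]
word2id = {w: i for i, w in enumerate(vocab)}
emb = np.random.rand(len(vocab), 50).astype(np.float32)
emb /= np.linalg.norm(emb, axis=1, keepdims=True)

def analogy(a, b, c):
    """Return the word closest to vec(b) - vec(a) + vec(c), excluding a, b, c."""
    target = emb[word2id[b]] - emb[word2id[a]] + emb[word2id[c]]
    scores = emb @ target                      # proportional to cosine similarity
    for idx in np.argsort(-scores):            # best match first
        if vocab[idx] not in (a, b, c):
            return vocab[idx]

# The question "Athens Greece Baghdad Iraq" counts as correct if this returns
# "iraq" (with random placeholder vectors the answer is of course arbitrary).
print(analogy("athens", "greece", "baghdad"))
```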
```
cd tensorflow
wget http://mattmahoney.net/dc/text8.zip -O text8.gz
gzip -d text8.gz -f
wget https://word2vec.googlecode.com/svn/trunk/questions-words.txt
bazel build -c opt tensorflow/models/embedding:all
```
which results in
```
root@fb729273837c:/tensorflow# bazel build -c opt tensorflow/models/embedding:all
INFO: Reading 'startup' options from /root/.bazelrc: --batch
INFO: Found 10 targets...
INFO: Elapsed time: 10.615s, Critical Path: 2.25s
```
After that we can start the example Python script using the command from the README. The tutorial code contains two multi-threaded versions of the word2vec skip-gram model, a batched and an unbatched one:
* word2vec.py - a version of word2vec implemented using TensorFlow ops and minibatching.
* word2vec_optimized.py - a version of word2vec implemented using custom C++ ops that does no minibatching.
```
bazel-bin/tensorflow/models/embedding/word2vec_optimized \
  --train_data=text8 \
  --eval_data=questions-words.txt \
  --save_path=/tmp/
```
which will then produce output similar to
```
root@fb729273837c:/tensorflow# time bazel-bin/tensorflow/models/embedding/word2vec_optimized --train_data=text8 --eval_data=questions-words.txt --save_path=/tmp/
I tensorflow/core/common_runtime/local_device.cc:25] Local device intra op parallelism threads: 8
I tensorflow/core/common_runtime/local_session.cc:45] Local session inter op parallelism threads: 8
I tensorflow/models/embedding/word2vec_kernels.cc:134] Data file: text8 contains 100000000 bytes, 17005207 words, 253854 unique words, 71290 unique frequent words.
Data file: text8
Vocab size: 71290 + UNK
Words per epoch: 17005207
Eval analogy file: questions-words.txt
Questions: 17827
Skipped: 1717
Epoch 1 Step 151322: lr = 0.023 words/sec = 34117
Eval 1554/17827 accuracy = 8.7%
Epoch 2 Step 302660: lr = 0.022 words/sec = 3900
Eval 2302/17827 accuracy = 12.9%
Epoch 3 Step 453986: lr = 0.020 words/sec = 32707
Eval 3049/17827 accuracy = 17.1%
Epoch 4 Step 605329: lr = 0.018 words/sec = 11805
Eval 3528/17827 accuracy = 19.8%
Epoch 5 Step 756656: lr = 0.017 words/sec = 126655
Eval 4055/17827 accuracy = 22.7%
Epoch 6 Step 907954: lr = 0.015 words/sec = 66275
Eval 4434/17827 accuracy = 24.9%
Epoch 7 Step 1059303: lr = 0.013 words/sec = 125780
Eval 4737/17827 accuracy = 26.6%
Epoch 8 Step 1210621: lr = 0.012 words/sec = 123938
Eval 5042/17827 accuracy = 28.3%
Epoch 9 Step 1361968: lr = 0.010 words/sec = 89538
Eval 5335/17827 accuracy = 29.9%
Epoch 10 Step 1513319: lr = 0.008 words/sec = 48258
Eval 5621/17827 accuracy = 31.5%
Epoch 11 Step 1664661: lr = 0.007 words/sec = 113623
Eval 5812/17827 accuracy = 32.6%
Epoch 12 Step 1815978: lr = 0.005 words/sec = 58567
Eval 6053/17827 accuracy = 34.0%
Epoch 13 Step 1967289: lr = 0.003 words/sec = 81122
Eval 6203/17827 accuracy = 34.8%
Epoch 14 Step 2118655: lr = 0.002 words/sec = 68519
Eval 6291/17827 accuracy = 35.3%
Epoch 15 Step 2269981: lr = 0.000 words/sec = 64780
Eval 6366/17827 accuracy = 35.7%
real 36m4.861s
user 240m20.464s
sys 24m18.860s
root@fb729273837c:/tensorflow#
```
The final accuracy of the TensorFlow word2vec_optimized.py on the text8 corpus with questions-words.txt is 35.7%. The result is not deterministic and varies slightly from run to run.
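Beyond the analogy benchmark, the trained vectors are typically used by looking up nearest neighbours in embedding space. Below is a minimal sketch of such a lookup, again with a placeholder embedding matrix and vocabulary; loading the actual checkpoint saved under --save_path is not shown here.
```python
import numpy as np

# Placeholder embeddings; in practice these come from the trained model
# saved under --save_path (checkpoint loading is omitted in this sketch).
vocab = ["king", "queen", "man", "woman", "paris"]
emb = np.random.rand(len(vocab), 200).astype(np.float32)
emb /= np.linalg.norm(emb, axis=1, keepdims=True)   # unit-length rows

def nearest(word, k=3):
    """Return the k words with the highest cosine similarity to `word`."""
    scores = emb @ emb[vocab.index(word)]
    ranked = [vocab[i] for i in np.argsort(-scores) if vocab[i] != word]
    return ranked[:k]

print(nearest("king"))
```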
We can also see that the optimized version (word2vec_optimized.py) is highly efficient and uses around 90-100% of all CPU cores, whereas the non-optimized version (word2vec.py) is slow and barely reaches 40% CPU utilization.
(Screenshots of CPU utilization: word2vec-optimized-tensorflow, word2vec-tensorflow-not-optimized)
- text8 - text8 corpus by Matt Mahoney
- word2vec - computing continuous distributed representations of words
- word2vec@chalow - Running word2vec on a MacBook Air (OS X 10.9.2) (in Japanese)
- word2vec@cnblogs - Playing with Google's open-source deep learning project word2vec (in Chinese)
- Word2Vec&GloVe - Getting Started with Word2Vec and GloVe in Python
- Books&ngrams - Google Books ngram viewer
- word2vec&parallel - Interesting benchmark about parallelizing word2vec in Python
- word2vec - explained with examples