Socialized word embeddings (SWE) have been proposed to deal with two phenomena of language use:
- everyone has his/her own personal characteristics of language use,
- and socially connected users are likely to use language in similar ways.
We observe that the spread of language use is transitive: one user can affect his/her friends, and those friends can in turn affect their friends. However, SWE models this transitivity only implicitly. Its social regularization applies only to one-hop friends, so users outside the one-hop social circle are not affected directly.
In this work, we adopt random walk methods to generate paths on the social graph and model the transitivity explicitly: each user on a path is affected by his/her adjacent user(s) on the path. Moreover, under the update mechanism of SWE, the fewer friends a user has, the fewer update opportunities he/she gets. Hence, we propose a biased random walk method that provides such users with more update opportunities.
Experiments show that our random-walk-based social regularizations perform better on the sentiment classification task.
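The biased walk itself is implemented in C (`swe_with_bias_randomwalk.c`). As a rough illustration of the idea only, here is a minimal Python sketch: the restart probability plays the role of `--para_alpha`, and up-weighting low-degree friends stands in for `--para_bias_weight`. The function name and the exact weighting scheme are illustrative, not the repo's actual implementation.

```python
import random

def biased_walk(graph, start, length, alpha=0.1, bias=1.0):
    """Generate one walk on a social graph (dict: user -> list of friends).

    With probability alpha the walk restarts at `start`; otherwise the next
    user is sampled from the current user's friends, with low-degree (tail)
    friends up-weighted by `bias` so they get more update opportunities.
    """
    walk = [start]
    current = start
    for _ in range(length - 1):
        friends = graph.get(current, [])
        if not friends or random.random() < alpha:
            current = start  # restart at the walk's origin
        else:
            # Illustrative bias: friends with fewer friends get extra weight.
            weights = [1.0 + bias / max(len(graph.get(f, [])), 1)
                       for f in friends]
            current = random.choices(friends, weights=weights)[0]
        walk.append(current)
    return walk

random.seed(0)
g = {"a": ["b", "c"], "b": ["a"], "c": ["a", "b"]}
walk = biased_walk(g, "a", length=10)
```

Each user on such a path is then treated as socially adjacent to its neighbors on the path, which is how the one-hop restriction of the original SWE regularization is relaxed.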
You need to set up the dataset and LIBLINEAR first:

- Download the Yelp dataset and get the two `.json` files: `yelp_academic_dataset_review.json` and `yelp_academic_dataset_user.json`.
- Convert them from JSON format to CSV format using `json_to_csv_converter.py` in `preprocess`, and put the results in the `data` directory.
- Download LIBLINEAR or Multi-core LIBLINEAR and put `liblinear` under the root directory of the repo. You can install LIBLINEAR according to its Installation instructions.
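The actual conversion is done by `json_to_csv_converter.py`; conceptually it just flattens JSON lines into CSV rows, roughly like this stdlib-only sketch (the field names here are examples, not the converter's exact output schema):

```python
import csv
import io
import json

def json_lines_to_csv(json_lines, fieldnames, out):
    """Write one CSV row per JSON object, keeping only `fieldnames`."""
    writer = csv.DictWriter(out, fieldnames=fieldnames)
    writer.writeheader()
    for line in json_lines:
        obj = json.loads(line)
        writer.writerow({k: obj.get(k, "") for k in fieldnames})

# The Yelp dataset ships as one JSON object per line.
lines = ['{"user_id": "u1", "stars": 5, "text": "great"}',
         '{"user_id": "u2", "stars": 2, "text": "meh"}']
buf = io.StringIO()
json_lines_to_csv(lines, ["user_id", "stars", "text"], buf)
```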
`cd preprocess` and modify `run.py` by specifying `--input` (the path to the Yelp dataset), then run `python run.py`.

`cd train`. You may modify the following arguments in `run.py`:
- `--yelp_round` The round number of the Yelp data, e.g. {9, 10}
- `--para_lambda` The trade-off parameter between the log-likelihood and the regularization term
- `--para_r` The constraint on the L2-norm
- `--para_path` The number of random walk paths for every review
- `--para_path_length` The length of random walk paths for every review
- `--path_p` The return parameter for the second-order random walk
- `--path_q` The in-out parameter for the second-order random walk
- `--para_alpha` The restart parameter for the biased random walk
- `--para_bias_weight` The bias parameter for the biased random walk
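`--path_p` and `--path_q` control a node2vec-style second-order transition rule. As a minimal sketch of one step (the names and exact weighting are illustrative, not taken from `swe_with_node2vec.c`): given the previous edge `prev -> cur`, a candidate `x` is weighted `1/p` if it returns to `prev`, `1` if it stays within distance 1 of `prev`, and `1/q` if it moves outward.

```python
import random

def second_order_step(graph, prev, cur, p=1.0, q=1.0):
    """Pick the next node given the previous edge (prev -> cur)."""
    candidates = graph[cur]
    weights = []
    for x in candidates:
        if x == prev:
            weights.append(1.0 / p)          # return to the previous node
        elif x in graph.get(prev, []):
            weights.append(1.0)              # stays at distance 1 from prev
        else:
            weights.append(1.0 / q)          # moves outward (distance 2)
    return random.choices(candidates, weights=weights)[0]

g = {"a": ["b"], "b": ["a", "c"], "c": ["b"]}
# With p -> infinity, returning to "a" gets weight 0, so the walk must go out.
outward = second_order_step(g, "a", "b", p=float("inf"), q=1.0)
# With q -> infinity, moving outward gets weight 0, so the walk must return.
backward = second_order_step(g, "a", "b", p=1.0, q=float("inf"))
```

Large `q` keeps the walk local (BFS-like), while large `p` discourages immediate backtracking (DFS-like).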
Because there are a number of ways to train models, you also need to specify `model_types` in `training.py`.
Then begin to train models: `python run.py`.

`cd sentiment`. You may modify the arguments in `run.py`; they are the same as above.
It will run two `.py` files: `sentiment.py` performs sentiment classification on all users, while `head_tail.py` performs sentiment classification on head users and tail users. You can choose one or both.
Because there are a number of ways to train models, you also need to specify `model_types` in `sentiment.py` and `head_tail.py`.
Then begin the classification: `python run.py`.

We thank Tao Lei, as our code is developed based on his code.
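`get_SVM_format_swe.c` prepares LIBLINEAR input from the learned embeddings. As a hedged sketch of what such a feature file looks like, one can average a review's word vectors, add the user vector, and emit LIBLINEAR's 1-based sparse `label index:value` format. How exactly the repo combines word and user vectors may differ; this is illustrative only.

```python
def svm_format_line(label, word_vecs, user_vec):
    """Build one LIBLINEAR line from word vectors and a user vector."""
    dim = len(user_vec)
    # Average the review's word vectors, then add the user vector
    # (an illustrative composition, not necessarily the repo's).
    avg = [sum(v[i] for v in word_vecs) / len(word_vecs) for i in range(dim)]
    feats = [a + u for a, u in zip(avg, user_vec)]
    # LIBLINEAR feature indices are 1-based.
    return str(label) + " " + " ".join(
        "%d:%.6f" % (i + 1, x) for i, x in enumerate(feats))

line = svm_format_line(1, [[1.0, 0.0], [0.0, 1.0]], [0.5, 0.5])
# line == "1 1:1.000000 2:1.000000"
```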
`cd attention`. You can reproduce our results under different settings by modifying `run.sh`:
- Add user and word embeddings by specifying `--user_embs` and `--embedding`.
- Add train/dev/test files by specifying `--train`, `--dev`, and `--test` respectively.
- Choose the type of layers by specifying `--layer` (cnn or lstm).
- The three settings in our experiments can be achieved by specifying `--user_atten` and `--user_atten_base`:
  - `--user_atten 0` for "Without attention".
  - `--user_atten 1 --user_atten_base 1` for "Trained attention".
  - `--user_atten 1 --user_atten_base 0` for "Fixed user vector as attention".
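The "Fixed user vector as attention" setting roughly corresponds to scoring each hidden state against the user's embedding and taking a softmax-weighted sum. A minimal plain-Python sketch of that idea (illustrative only, not the Theano code in `nn/`):

```python
import math

def user_attention(hidden_states, user_vec):
    """Weight hidden states by softmax of their dot product with user_vec."""
    scores = [sum(h_i * u_i for h_i, u_i in zip(h, user_vec))
              for h in hidden_states]
    m = max(scores)                      # subtract max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    weights = [e / z for e in exps]
    dim = len(hidden_states[0])
    # Context vector: attention-weighted sum of the hidden states.
    return [sum(w * h[i] for w, h in zip(weights, hidden_states))
            for i in range(dim)]

# The user vector strongly prefers the first hidden state here.
ctx = user_attention([[1.0, 0.0], [0.0, 1.0]], [10.0, 0.0])
```

In the "Trained attention" setting the scoring vector would be a learned parameter instead of the fixed user embedding.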
Then begin to train attention models: `bash run.sh`.

You can download pretrained embeddings and models in the release:
- Put `embs/*` in the `embs/` directory.
- Put `sentiment_models/*` in the `sentiment/models/` directory.
- Put `attention_models/*` in the `attention/models/` directory.
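Assuming the embedding files (`users_sample.txt`, `words_sample.txt`) use the common word2vec text layout, one token followed by its float components per line, with an optional `count dim` header line (an assumption, not something documented here), they can be parsed with a few lines:

```python
def load_embeddings(lines):
    """Parse 'token v1 v2 ...' lines into a dict, skipping a header if present."""
    embs = {}
    for line in lines:
        parts = line.rstrip().split()
        if len(parts) == 2 and all(p.isdigit() for p in parts):
            continue  # word2vec-style header: vocab size and dimension
        embs[parts[0]] = [float(x) for x in parts[1:]]
    return embs

# Hypothetical sample content, assuming the word2vec text format.
sample = ["2 3",
          "u1 0.1 0.2 0.3",
          "u2 -0.1 0.0 0.4"]
users = load_embeddings(sample)
```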
```
.
├── attention
│   ├── models
│   ├── nn
│   │   ├── __init__.py
│   │   ├── advanced.py
│   │   ├── basic.py
│   │   ├── evaluation.py
│   │   ├── initialization.py
│   │   └── optimization.py
│   ├── utils
│   │   └── __init__.py
│   ├── dc.py
│   └── run.sh
├── data
├── embs
│   ├── users_sample.txt
│   └── words_sample.txt
├── liblinear
├── preprocess
│   ├── english.pickle
│   ├── english_stop.txt
│   ├── json_to_csv_converter.py
│   ├── preprocess.py
│   └── run.py
├── sentiment
│   ├── format_data
│   ├── models
│   ├── results
│   ├── get_SVM_format_swe.c
│   ├── get_SVM_format_w2v.c
│   ├── head_tail.py
│   ├── run.py
│   └── sentiment.py
├── train
│   ├── run.py
│   ├── swe.c
│   ├── swe_with_2nd_randomwalk.c
│   ├── swe_with_bias_randomwalk.c
│   ├── swe_with_deepwalk.c
│   ├── swe_with_node2vec.c
│   ├── swe_with_randomwalk.c
│   ├── training.py
│   └── w2v.c
├── LICENSE
└── README.md
```
- *nix operating systems or WSL
- Python 2.7
- gcc
- Theano (>= 0.7 but < 1.0)
- NumPy
- gensim
- PrettyTable
- Pandas
- ujson
- NLTK
If you use this code, please cite our IJCAI 2018 paper:
@inproceedings{ijcai2018-634,
title = {Biased Random Walk based Social Regularization for Word Embeddings},
author = {Ziqian Zeng and Xin Liu and Yangqiu Song},
booktitle = {Proceedings of the Twenty-Seventh International Joint Conference on
Artificial Intelligence, {IJCAI-18}},
publisher = {International Joint Conferences on Artificial Intelligence Organization},
pages = {4560--4566},
year = {2018},
month = {7},
doi = {10.24963/ijcai.2018/634},
url = {https://doi.org/10.24963/ijcai.2018/634},
}
Copyright (c) 2018 HKUST-KnowComp. All rights reserved.
Licensed under the MIT License.