Exploring patterns of stability and change in caregivers' word usage across early childhood

This repository implements dynamic word embeddings for exploring variation in caregivers' language across multiple languages in the CHILDES dataset.

Setup Environment

Install via conda:

# install requirements
pip install -r requirements.txt

Download data

Run the following commands to download the data from Google Drive and unzip it into the data directory:

python download_gdrive.py 1NXMPs_f1crasC1JM50RSIJweLuPAG9Hk data.zip
unzip data.zip data

Train dynamic word embeddings

We take English-uk as an example; its raw CSV files sit in ./data/English-uk/raw/. The pre-computed word_inventory.csv for each language is in ./data/{language}/. German and Japanese do not have a stem column.

Note that, as a toy example, the commands below create only one shuffled corpus and train for only 2 epochs.

Preprocess data

python preprocess.py \
    --source_dir ./data/English-uk/raw \
    --dest_dir ./data/English-uk/ \
    --num_epochs 2 \
    --num_shuffles 1 \
    --use_stem
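
Because German and Japanese have no stem column, it may help to check a raw CSV header before passing --use_stem. The following is a minimal sketch, not part of the repository: `has_stem_column` is a hypothetical helper, and it assumes the column is literally named `stem` and that the raw files are comma-separated with a header row. The other column names in the toy data are invented for illustration.

```python
import csv
import io


def has_stem_column(csv_file, column="stem"):
    """Return True if the CSV header contains `column` (assumed column name)."""
    reader = csv.reader(csv_file)
    header = next(reader, [])  # an empty file yields an empty header
    return column in [name.strip() for name in header]


# Toy in-memory CSVs standing in for files under ./data/{language}/raw/.
# The column names other than "stem" are made up for this example.
with_stem = io.StringIO("gloss,stem,age\ndogs,dog,24\n")
without_stem = io.StringIO("gloss,age\nHunde,24\n")

print(has_stem_column(with_stem))
print(has_stem_column(without_stem))
```

For a real file, open it with `open(path, newline="", encoding="utf-8")` and pass the file object to the helper.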

Train models

python word2vec.py \
    --source_dir ./data/English-uk/ \
    --dest_dir ./output/English-uk/ \
    --dim 100 \
    --min_count 15 \
    --n_epochs 2 \
    --window 5 \
    --negative 5 \
    --sg 1 \
    --sample 1e-5 \
    --ns_exponent 0.75 \
    --workers 4
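
The output step expects a file named embeddings-ep2-f15-d100-w5.pickle, which appears to encode the training hyperparameters (epochs, min_count, dim, window). The sketch below rebuilds that name from the flags. The naming pattern is inferred from this single example rather than confirmed from word2vec.py, and `embedding_filename` is a hypothetical helper.

```python
def embedding_filename(n_epochs, min_count, dim, window):
    """Build the embedding filename under the assumed naming pattern."""
    return f"embeddings-ep{n_epochs}-f{min_count}-d{dim}-w{window}.pickle"


# Values taken from the training command above.
print(embedding_filename(n_epochs=2, min_count=15, dim=100, window=5))
```

If you change --n_epochs, --min_count, --dim, or --window, check the actual filename written to the output directory before passing it to --embedding_filename.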

Generate outputs on language variations

python generate_outputs.py \
    --source_dir ./output/English-uk/ \
    --embedding_filename embeddings-ep2-f15-d100-w5.pickle \
    --word_inventory ./data/English-uk/word_inventory.csv \
    --google_word2vec ./data/google-word2vec/GoogleNews-vectors-negative300.bin \
    --num_neighbors 25
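
The --num_neighbors flag suggests that each word is compared with its closest neighbors in the embedding space. The sketch below shows one common way to do this, using cosine similarity over a toy dictionary of vectors. It is an assumption that generate_outputs.py works this way; `cosine`, `nearest_neighbors`, and the toy vectors are illustrative only.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def nearest_neighbors(word, vectors, k=25):
    """Return the k words most similar to `word`, as (word, similarity) pairs."""
    target = vectors[word]
    scored = [
        (other, cosine(target, vec))
        for other, vec in vectors.items()
        if other != word  # skip the query word itself
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Toy 2-dimensional vectors, invented for this example.
toy_vectors = {"dog": [1.0, 0.0], "puppy": [0.9, 0.1], "car": [0.0, 1.0]}
for neighbor, similarity in nearest_neighbors("dog", toy_vectors, k=2):
    print(neighbor, round(similarity, 3))
```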

Analysis

The analysis code is written in R. Please refer to the analysis folder in the repository.

Citation

If you use this work, please cite the paper:

@misc{jiang_frank_kulkarni_fourtassi_2020,
 title={Exploring patterns of stability and change in caregivers’ word usage across early childhood},
 url={psyarxiv.com/fym86},
 DOI={10.31234/osf.io/fym86},
 publisher={PsyArXiv},
 author={Jiang, Hang and Frank, Michael C and Kulkarni, Vivek and Fourtassi, Abdellah},
 year={2020},
 month={Feb}
}
