Neural Network Embeddings Recover Value Dimensions from Psychometric Survey Items on Par with Human Data
This repository contains code and data for analyzing how neural network embeddings capture psychological constructs from survey items, including human values (Schwartz) and personality (Big Five, HEXACO).
The analysis pipeline demonstrates that embedding models can recover psychological dimensions from questionnaire items in a manner comparable to human ratings. The repository includes:
- Values analysis (Schwartz 19-dimension model): Main manuscript results
- Personality analysis (IPIP-50, BFI-2, HEXACO-100): Generalization to Big Five and HEXACO personality models
To replicate all results for the Schwartz values analysis in the manuscript, run these code files in order:
00_prepare_data_and_helper_functions.R
01_item_mapping_scoring_2022.R
02_get_human_data_from_schwartz_si.R
03_compute_cronbach_alpha.R
04_create_correlation_plots.R
05_multidimensional_scaling.R
Additional analyses comparing multiple models and personality questionnaires:
06_compare_multiple_models_mds.R
07_combined_correlation_analysis_fixedscale.R
File 06 computes factor congruence scores for MDS solutions from multiple embedding models compared to human ratings.
File 07 provides a unified analysis of three personality questionnaires (IPIP-50, BFI-2, HEXACO) with fixed color scales for direct comparison of raw vs SQuID-treated embeddings.
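
To make the factor congruence comparison in file 06 concrete, here is a minimal Python sketch of Tucker's congruence coefficient, computed per dimension between two item configurations. It is an illustration only: the function name and toy coordinates are hypothetical, and whether the R script computes exactly this variant is an assumption. It also assumes the two configurations have already been aligned (for example by Procrustes rotation).

```python
import numpy as np


def tucker_congruence(a: np.ndarray, b: np.ndarray) -> np.ndarray:
    """Column-wise Tucker congruence between two (items x dimensions) arrays.

    Hypothetical helper for illustration; not taken from the repository.
    """
    numerator = (a * b).sum(axis=0)
    denominator = np.sqrt((a ** 2).sum(axis=0) * (b ** 2).sum(axis=0))
    return numerator / denominator


# Toy 2D configurations for four made-up items (not real data).
human = np.array([[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0], [0.0, -1.0]])
model = 2.0 * human  # same layout, different overall scale

phi = tucker_congruence(human, model)
print(phi)
```

Because the coefficient is normalized by the column norms, it ignores overall scale and only reflects how similar the patterns of coordinates are.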
| File | Description |
|---|---|
| 00_prepare_data_and_helper_functions.R | Loads libraries, helper functions, and prepares data structures |
| 01_item_mapping_scoring_2022.R | Maps survey items to value dimensions using Schwartz 2022 scheme |
| 02_get_human_data_from_schwartz_si.R | Extracts human correlation data from Schwartz supplementary materials |
| 03_compute_cronbach_alpha.R | Computes reliability metrics (Cronbach's alpha) |
| 04_create_correlation_plots.R | Creates correlation plots comparing embeddings to human ratings |
| 05_multidimensional_scaling.R | Performs MDS analysis and Procrustes rotation |
| 06_compare_multiple_models_mds.R | Compares multiple embedding models using factor congruence |
| 07_combined_correlation_analysis_fixedscale.R | Analyzes IPIP, BFI-2, and HEXACO with unified color scales |

| File | Description |
|---|---|
| create_embeddings.py | Generates embeddings for Schwartz values items (male/female versions) |
| create_embeddings_personality.py | Generates embeddings for IPIP-50, BFI-2, and HEXACO-100 questionnaires |

The analysis uses five state-of-the-art embedding models:
- Gemini (gemini-embedding-exp-03-07) - Google's experimental text embedding model
- Linq-Embed-Mistral - Instruction-tuned Mistral-based embeddings
- MPNet (dwulff/mpnet-personality) - Personality-specialized MPNet model
- KaLM - Multilingual mini-instruct embedding model
- Jina (jina-embeddings-v3) - General-purpose embedding model
- 57 items measuring 19 value dimensions
- Separate male/female item phrasings
- Source: Schwartz et al. (2012)
| Questionnaire | Items | Dimensions | Source |
|---|---|---|---|
| IPIP-50 | 50 | Big Five (5) | Goldberg (1992) |
| BFI-2 | 60 | Big Five (5) | Soto & John (2017) |
| HEXACO-100 | 100 | HEXACO (6) | Lee & Ashton (2018) |
The analysis introduces SQuID (Scale QUestionnaire Improvement via Dimension subtraction), a novel method that:
- Computes a "scale embedding" as the mean of all item embeddings
- Subtracts this scale embedding from each item embedding
- Reveals latent negative correlations and improves dimensional structure
This method significantly enhances the recovery of theoretical structures (e.g., circumplex models) from raw embeddings.
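
The SQuID steps listed above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the repository's implementation: the function names and the toy item matrix are hypothetical, and cosine similarity is assumed as the way item embeddings are compared.

```python
import numpy as np


def squid(item_embeddings: np.ndarray) -> np.ndarray:
    """Subtract the mean item embedding (the "scale embedding") from each item."""
    scale_embedding = item_embeddings.mean(axis=0, keepdims=True)
    return item_embeddings - scale_embedding


def cosine_similarity_matrix(x: np.ndarray) -> np.ndarray:
    """Pairwise cosine similarity between the rows of x."""
    unit = x / np.linalg.norm(x, axis=1, keepdims=True)
    return unit @ unit.T


# Toy data (not real embeddings): four items that share a large common component.
items = np.array([
    [5.1, 4.9, 5.6],
    [5.1, 4.5, 5.4],
    [6.3, 5.9, 4.3],
    [3.7, 4.4, 5.0],
])

raw_sim = cosine_similarity_matrix(items)
squid_sim = cosine_similarity_matrix(squid(items))

print(raw_sim.round(2))    # every pair is positively related
print(squid_sim.round(2))  # some pairs turn negative after centering
```

In the raw matrix every pair looks similar because all items share the same dominant direction. Once that shared direction is removed, the centered items sum to zero, so at least one pair must have a negative similarity.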
Install the following R packages:
install.packages(c(
"psych", # 2.4.12
"vegan", # 2.6-10
"ggforce", # 0.4.2
"ggtext", # 0.1.2
"ggrepel", # 0.9.6
"ggpmisc", # 0.6.1
"gt", # 0.11.1
"psy", # 1.2
"RColorBrewer", # 1.1-3
"proxy", # 0.4-27
"smacof", # 2.1-7
"data.table", # 1.16.4
"corrplot", # 0.95
"ggplot2" # 3.5.1
))

Create a virtual environment and install:
pip install google-genai==1.9.0
pip install numpy==1.26.4
pip install sentence-transformers==3.3.1
pip install transformers==4.46.3
pip install python-dotenv # For API key management

Note: You need a Gemini API key (free tier available). Store it in a .env file:
GEMINI_API_KEY=your_api_key_here
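
As a rough sketch of how a script might read this key, the snippet below loads `.env` from the current working directory with python-dotenv and falls back to the process environment. The helper name `get_gemini_api_key` is hypothetical, and the embedding scripts may load the key differently.

```python
import os

try:
    from dotenv import load_dotenv  # provided by the python-dotenv package
except ImportError:  # fall back to plain environment variables
    load_dotenv = None


def get_gemini_api_key() -> str:
    """Return GEMINI_API_KEY from .env or the environment (hypothetical helper)."""
    if load_dotenv is not None:
        # Reads ./.env if it exists; already-set variables are not overridden.
        load_dotenv(".env")
    key = os.getenv("GEMINI_API_KEY")
    if not key:
        raise RuntimeError("GEMINI_API_KEY is not set; add it to .env or export it")
    return key
```

Failing early with a clear message avoids starting a long embedding run that would only break at the first API call.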
All embedding files are provided in the repository. To regenerate them:
python create_embeddings.py

This generates embeddings for the male and female versions of the Schwartz values items.

python create_embeddings_personality.py

This generates embeddings for the IPIP-50, BFI-2, and HEXACO-100 items.

Note: The Gemini API enforces rate limits on the free tier. The script throttles its requests automatically by pausing with time.sleep(65).
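
The throttling pattern can be illustrated with a small sketch that pauses between batches of requests. Everything except the 65-second pause is an assumption: `embed_in_batches`, `embed_batch`, and the batch size are hypothetical names and values, and the actual script may structure its loop differently.

```python
import time
from typing import Callable, List, Sequence


def embed_in_batches(
    texts: Sequence[str],
    embed_batch: Callable[[Sequence[str]], List[List[float]]],
    batch_size: int = 5,           # assumed value, not taken from the script
    pause_seconds: float = 65.0,   # matches the time.sleep(65) mentioned above
    sleep: Callable[[float], None] = time.sleep,
) -> List[List[float]]:
    """Embed texts batch by batch, pausing between batches to respect rate limits."""
    results: List[List[float]] = []
    for start in range(0, len(texts), batch_size):
        if start > 0:
            sleep(pause_seconds)  # wait before every batch except the first
        results.extend(embed_batch(texts[start:start + batch_size]))
    return results
```

Passing `sleep` as a parameter keeps the function easy to try out without waiting, for example by supplying a function that only records the requested pauses.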
The repository includes pre-computed embeddings for all questionnaires and models:

- gemini_embeddings_items_value_male/female
- linqembedmistral_embeddings_items_value_male/female
- mpnet_embeddings_items_value_male/female
- kalm_embeddings_items_value_male/female
- jina_withouttask_embeddings_items_value_male/female

- {model}_embeddings_ipip (5 models)
- {model}_embeddings_bfi2 (5 models)
- {model}_embeddings_hexaco (5 models)

- correlation_matrix_from_schwartz_all_groups.csv - Human correlation data
- schwartz_cormat_all.RDS - Correlation matrix for all 49 countries
- values_mapping_19_2022.RDS - Item-to-dimension mapping
- table_s2_si.csv - Supplementary information from Schwartz
- random_cronbachlist.RDS - Precomputed random baseline for reliability analysis

The analysis generates:
- Correlation plots: Raw vs SQuID-treated embeddings
- MDS plots: 2D visualizations of value/personality spaces
- Factor congruence tables: Comparison across models
- LaTeX tables: Publication-ready summaries
- Summary statistics: CSV files with correlation ranges and improvements