maxpel/embeddings_values


Neural Network Embeddings Recover Value Dimensions from Psychometric Survey Items on Par with Human Data

This repository contains code and data for analyzing how neural network embeddings capture psychological constructs from survey items, including human values (Schwartz) and personality (Big Five, HEXACO).

Overview

The analysis pipeline demonstrates that embedding models can recover psychological dimensions from questionnaire items in a manner comparable to human ratings. The repository includes:

  • Values analysis (Schwartz 19-dimension model): Main manuscript results
  • Personality analysis (IPIP-50, BFI-2, HEXACO-100): Generalization to Big Five and HEXACO personality models

Quick Start

Main Analysis (Values)

To replicate all results of the Schwartz values analysis reported in the manuscript, run these scripts in order:

00_prepare_data_and_helper_functions.R
01_item_mapping_scoring_2022.R
02_get_human_data_from_schwartz_si.R
03_compute_cronbach_alpha.R
04_create_correlation_plots.R
05_multidimensional_scaling.R

Extended Analysis

Additional analyses comparing multiple models and personality questionnaires:

06_compare_multiple_models_mds.R
07_combined_correlation_analysis_fixedscale.R

File 06 computes factor congruence scores for MDS solutions from multiple embedding models compared to human ratings.
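As a rough illustration of what a factor congruence score measures, the sketch below computes Tucker's congruence coefficient between the columns of two coordinate matrices. It is written in Python with NumPy rather than R, the function name `congruence` and the toy coordinates are hypothetical, and it assumes the two configurations have already been aligned (for example by Procrustes rotation); it is not the code used in File 06.

```python
import numpy as np

def congruence(x, y):
    """Tucker congruence between every column of x and every column of y.

    x, y: (n_items, n_dims) coordinate matrices, e.g. two MDS solutions
    that are assumed to be aligned already.
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    numerator = x.T @ y
    denominator = np.sqrt(np.outer((x ** 2).sum(axis=0), (y ** 2).sum(axis=0)))
    return numerator / denominator

# Toy example (made-up coordinates): the "model" solution is the
# "human" solution scaled by a constant factor.
human = np.array([[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0], [0.0, -1.0]])
model = 2.0 * human
phi = congruence(human, model)
print(phi)
```

The coefficient ignores overall scale, so the diagonal entries are 1 for matching dimensions even though the two solutions differ in size.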

File 07 provides a unified analysis of the three personality questionnaires (IPIP-50, BFI-2, HEXACO-100). It uses fixed color scales so that correlation plots for raw and SQuID-treated embeddings can be compared directly.

Analysis Files

R Scripts

| File | Description |
|------|-------------|
| 00_prepare_data_and_helper_functions.R | Loads libraries, helper functions, and prepares data structures |
| 01_item_mapping_scoring_2022.R | Maps survey items to value dimensions using Schwartz 2022 scheme |
| 02_get_human_data_from_schwartz_si.R | Extracts human correlation data from Schwartz supplementary materials |
| 03_compute_cronbach_alpha.R | Computes reliability metrics (Cronbach's alpha) |
| 04_create_correlation_plots.R | Creates correlation plots comparing embeddings to human ratings |
| 05_multidimensional_scaling.R | Performs MDS analysis and Procrustes rotation |
| 06_compare_multiple_models_mds.R | Compares multiple embedding models using factor congruence |
| 07_combined_correlation_analysis_fixedscale.R | Analyzes IPIP, BFI-2, and HEXACO with unified color scales |
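
For readers unfamiliar with the reliability metric named in 03_compute_cronbach_alpha.R, here is a minimal sketch of the standard Cronbach's alpha formula, written in Python with NumPy. How the repository arranges embeddings into an observations-by-items matrix is not described in this README, so the input layout, the function name `cronbach_alpha`, and the toy data are assumptions for illustration only.

```python
import numpy as np

def cronbach_alpha(scores):
    """Standard Cronbach's alpha for an (n_observations, k_items) matrix."""
    scores = np.asarray(scores, dtype=float)
    k = scores.shape[1]
    item_variances = scores.var(axis=0, ddof=1)      # variance of each item
    total_variance = scores.sum(axis=1).var(ddof=1)  # variance of the sum score
    return (k / (k - 1)) * (1.0 - item_variances.sum() / total_variance)

# Toy data (made up): three items that agree perfectly across four observations.
consistent = np.array([
    [1, 1, 1],
    [2, 2, 2],
    [3, 3, 3],
    [4, 4, 4],
])
alpha_consistent = cronbach_alpha(consistent)
print(alpha_consistent)
```

Perfectly consistent items give the maximum value of alpha; less consistent items give lower values.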

Python Scripts

| File | Description |
|------|-------------|
| create_embeddings.py | Generates embeddings for Schwartz values items (male/female versions) |
| create_embeddings_personality.py | Generates embeddings for IPIP-50, BFI-2, and HEXACO-100 questionnaires |

Embedding Models

The analysis uses five embedding models:

  1. Gemini (gemini-embedding-exp-03-07) - Google's experimental Gemini text embedding model
  2. Linq-Embed-Mistral - Instruction-tuned Mistral-based embeddings
  3. MPNet (dwulff/mpnet-personality) - Personality-specialized MPNet model
  4. KaLM - Multilingual mini-instruct embedding model
  5. Jina (jina-embeddings-v3) - General-purpose embedding model

Questionnaires

Schwartz Values Survey

  • 57 items measuring 19 value dimensions
  • Separate male/female item phrasings
  • Source: Schwartz et al. (2012)

Personality Questionnaires

| Questionnaire | Items | Dimensions | Source |
|---------------|-------|------------|--------|
| IPIP-50 | 50 | Big Five (5) | Goldberg (1992) |
| BFI-2 | 60 | Big Five (5) | Soto & John (2017) |
| HEXACO-100 | 100 | HEXACO (6) | Lee & Ashton (2018) |

Key Method: SQuID (Scale Embedding Subtraction)

The analysis introduces SQuID (Scale QUestionnaire Improvement via Dimension subtraction), a novel method that:

  1. Computes a "scale embedding" as the mean of all item embeddings
  2. Subtracts this scale embedding from each item embedding
  3. Reveals latent negative correlations and improves dimensional structure

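Items from the same questionnaire share a common component (topic, phrasing, response format), so their raw embeddings tend to be positively related to one another. Subtracting the scale embedding removes this shared component, which improves the recovery of theoretical structures (e.g., circumplex models) compared with raw embeddings.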
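
The three steps above can be sketched in a few lines of NumPy. This is an illustrative sketch, not the repository's implementation: the function names and the toy item vectors are hypothetical, and using cosine similarity as the item-to-item measure is an assumption.

```python
import numpy as np

def cosine_similarity_matrix(embeddings):
    """Pairwise cosine similarity between the rows of an (n_items, n_dims) array."""
    norms = np.linalg.norm(embeddings, axis=1, keepdims=True)
    unit = embeddings / norms
    return unit @ unit.T

def squid(embeddings):
    """Subtract the scale embedding (mean of all item embeddings) from each item."""
    scale_embedding = embeddings.mean(axis=0, keepdims=True)
    return embeddings - scale_embedding

# Toy data (made up): four items that share a large common component
# in the first dimension and differ only in the remaining two.
items = np.array([
    [10.0,  1.0,  0.0],
    [10.0, -1.0,  0.0],
    [10.0,  0.0,  1.0],
    [10.0,  0.0, -1.0],
])

raw_sim = cosine_similarity_matrix(items)
squid_sim = cosine_similarity_matrix(squid(items))

print(np.round(raw_sim, 3))    # all similarities are positive
print(np.round(squid_sim, 3))  # opposing items now have negative similarity
```

In the toy data, every raw similarity is close to 1 because the shared component dominates. After subtraction, items 1 and 2 point in opposite directions and items 1 and 3 are orthogonal, which is the kind of structure the raw similarities hide.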

Requirements

R Dependencies

Install the following R packages:

install.packages(c(
    "psych",         # 2.4.12
    "vegan",         # 2.6-10
    "ggforce",       # 0.4.2
    "ggtext",        # 0.1.2
    "ggrepel",       # 0.9.6
    "ggpmisc",       # 0.6.1
    "gt",            # 0.11.1
    "psy",           # 1.2
    "RColorBrewer",  # 1.1-3
    "proxy",         # 0.4-27
    "smacof",        # 2.1-7
    "data.table",    # 1.16.4
    "corrplot",      # 0.95
    "ggplot2"        # 3.5.1
))

Python Dependencies

Create a virtual environment and install:

pip install google-genai==1.9.0
pip install numpy==1.26.4
pip install sentence-transformers==3.3.1
pip install transformers==4.46.3
pip install python-dotenv  # For API key management

Note: You need a Gemini API key (free tier available). Store it in a .env file:

GEMINI_API_KEY=your_api_key_here

Generating Embeddings

All embedding files are provided in the repository. To regenerate them:

Values Embeddings

python create_embeddings.py

This generates embeddings for male and female versions of Schwartz values items.

Personality Embeddings

python create_embeddings_personality.py

This generates embeddings for IPIP-50, BFI-2, and HEXACO-100 items.

Note: The Gemini API enforces rate limits on the free tier. The script handles this automatically by pausing with time.sleep(65).
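
The general pattern of pausing between batches of requests might look like the sketch below. Only the 65-second pause comes from the note above. The helper `embed_in_batches`, the `embed_batch` callback, and the batch size are hypothetical and do not reflect how create_embeddings_personality.py is actually structured. The sleep function is passed in so the example runs without waiting.

```python
import time

def embed_in_batches(texts, embed_batch, batch_size, pause_seconds=65.0, sleep=time.sleep):
    """Embed texts batch by batch, pausing between batches to respect rate limits.

    embed_batch: hypothetical callable that maps a list of strings to a list
    of embedding vectors (for example, a thin wrapper around an API call).
    """
    results = []
    for start in range(0, len(texts), batch_size):
        if start > 0:
            sleep(pause_seconds)  # wait before every batch except the first
        results.extend(embed_batch(texts[start:start + batch_size]))
    return results

# Stand-in embedder for demonstration: one-dimensional "embedding" = text length.
def fake_embed(batch):
    return [[float(len(text))] for text in batch]

pauses = []  # records the requested pauses instead of actually sleeping
vectors = embed_in_batches(["a", "bb", "ccc"], fake_embed, batch_size=2, sleep=pauses.append)
print(vectors, pauses)
```

With the default `sleep=time.sleep`, the same call would wait 65 seconds between the two batches.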

Data Files

The repository includes pre-computed embeddings for all questionnaires and models:

Schwartz Values

  • gemini_embeddings_items_value_male/female
  • linqembedmistral_embeddings_items_value_male/female
  • mpnet_embeddings_items_value_male/female
  • kalm_embeddings_items_value_male/female
  • jina_withouttask_embeddings_items_value_male/female

Personality Questionnaires

  • {model}_embeddings_ipip (5 models)
  • {model}_embeddings_bfi2 (5 models)
  • {model}_embeddings_hexaco (5 models)

Reference Data

  • correlation_matrix_from_schwartz_all_groups.csv - Human correlation data
  • schwartz_cormat_all.RDS - Correlation matrix for all 49 countries
  • values_mapping_19_2022.RDS - Item-to-dimension mapping
  • table_s2_si.csv - Supplementary information from Schwartz
  • random_cronbachlist.RDS - Precomputed random baseline for reliability analysis

Output

The analysis generates:

  • Correlation plots: Raw vs SQuID-treated embeddings
  • MDS plots: 2D visualizations of value/personality spaces
  • Factor congruence tables: Comparison across models
  • LaTeX tables: Publication-ready summaries
  • Summary statistics: CSV files with correlation ranges and improvements
