- PostgreSQL
- psycopg2
Clone this repository to your local machine.
Ensure PostgreSQL and the necessary development tools are installed:

```sh
sudo apt-get update
sudo apt-get install postgresql postgresql-contrib postgresql-server-dev-all build-essential
sudo apt install libpq-dev
```

The pgvector repository is already included in this project.
Initial compilation and installation:
```sh
chmod +x compile_pgvector.sh
./compile_pgvector.sh
```

This will compile pgvector in debug mode (`-g -O0`), which is useful for development and debugging.
After modifying the pgvector source code, simply run the compile script again:

```sh
./compile_pgvector.sh
```

Then restart PostgreSQL to load the updated extension:

```sh
sudo service postgresql restart
```

Start the PostgreSQL service:

```sh
sudo service postgresql start
```

Create the database user and database with the pgvector extension:
```sh
chmod +x setup_db.sh
./setup_db.sh
```

This script will:
- Create the user `xx` (with no password, or configure your own in the script)
- Create the database `rbacdatabase_treebase`
- Install the pgvector extension
Note: You may need to modify setup_db.sh to match your desired database name, username, and password. Make sure to update config.json accordingly.
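If you want to double-check the result before moving on, a minimal sketch like the one below connects with psycopg2 and confirms the extension is present. It assumes the defaults created by setup_db.sh (user `xx`, no password, database `rbacdatabase_treebase`); the script name is purely illustrative and not part of the repository.

```python
# check_setup.py -- illustrative sketch, not part of the repository.
# Assumes the setup_db.sh defaults: user "xx", no password,
# database "rbacdatabase_treebase" on localhost:5432.
import psycopg2

conn = psycopg2.connect(
    dbname="rbacdatabase_treebase",
    user="xx",
    host="localhost",
    port="5432",
)
with conn, conn.cursor() as cur:
    # pg_extension lists the extensions installed in the current database
    cur.execute("SELECT extname, extversion FROM pg_extension WHERE extname = 'vector';")
    print("pgvector installed:", cur.fetchone())  # e.g. ('vector', '0.x.x') or None
conn.close()
```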
Create a virtual environment and install dependencies:
```sh
python3 -m venv venv
source venv/bin/activate
pip install -r requirements.txt
python -m spacy download en_core_web_md
```
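The `en_core_web_md` model ships 300-dimensional word vectors. As a rough illustration of turning a text query into an embedding with it (the benchmark scripts may use the model differently):

```python
# Illustrative only: embed a text query with spaCy's en_core_web_md model,
# which exposes 300-dimensional vectors via doc.vector.
import spacy

nlp = spacy.load("en_core_web_md")
doc = nlp("neural networks for image retrieval")
query_vector = doc.vector          # numpy array of shape (300,)
print(query_vector.shape)
```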
Download the dataset to {project directory}/dataset/:

```sh
mkdir dataset
cd dataset
sudo apt-get install git-lfs
git lfs install
git clone https://huggingface.co/datasets/Cohere/wikipedia-22-12
```

- SIFT10M features (Fu et al.)
  - Download `SIFT10M.tar.gz` and place it in the directory pointed to by `dataset_path` (defaults to `/data`).
  - The loader extracts `SIFT10M/SIFT10Mfeatures.mat` automatically on first run, or you can run `tar -xf SIFT10M.tar.gz SIFT10M/SIFT10Mfeatures.mat` (a sketch of this extraction step follows the list).
- Download
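For reference, the extract-on-first-run behaviour can be reproduced in Python roughly as below. This is a sketch assuming the default `/data` layout described above; the loader's actual code may differ.

```python
# Illustrative sketch of the "extract SIFT10Mfeatures.mat on first run" step.
# Assumes dataset_path = "/data" and the archive layout described above.
import os
import tarfile

dataset_path = "/data"
archive = os.path.join(dataset_path, "SIFT10M.tar.gz")
member = "SIFT10M/SIFT10Mfeatures.mat"
target = os.path.join(dataset_path, member)

if not os.path.exists(target):
    # Extract only the feature matrix, mirroring:
    #   tar -xf SIFT10M.tar.gz SIFT10M/SIFT10Mfeatures.mat
    with tarfile.open(archive, "r:gz") as tar:
        tar.extract(member, path=dataset_path)
print("features available at", target)
```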
Edit config.json in the project root directory to match your database setup:

```json
{
    "dbname": "rbacdatabase_treebase",
    "user": "x",
    "password": "123",
    "host": "localhost",
    "port": "5432",
    "dataset_path": "/data",
    "use_gpu_groundtruth": false
}
```

Configuration Options:
- `use_gpu_groundtruth`:
  - `false` (recommended): Use PostgreSQL for ground truth computation. Slower but no setup required.
  - `true`: Use FAISS GPU for ground truth computation. The first run is slow (it builds indexes); subsequent runs are much faster.
Note: If you used setup_db.sh, the default configuration is username xx with no password and database rbacdatabase_treebase. Adjust these values as needed.
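The benchmark scripts read these values when opening their database connections; conceptually the pattern looks like the sketch below (a hedged illustration, not the repository's exact code):

```python
# Illustrative sketch of how a script can consume config.json.
import json

import psycopg2

with open("config.json") as f:
    cfg = json.load(f)

conn = psycopg2.connect(
    dbname=cfg["dbname"],
    user=cfg["user"],
    password=cfg["password"],
    host=cfg["host"],
    port=cfg["port"],
)
print("connected to", cfg["dbname"])
print("dataset path:", cfg["dataset_path"])
print("GPU ground truth:", cfg["use_gpu_groundtruth"])
conn.close()
```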
For faster ground truth computation (especially for repeated testing), install FAISS:
```sh
# Create a conda environment with FAISS GPU support
conda create -n faiss_env python=3.11
conda activate faiss_env
conda install -c pytorch faiss-gpu

# Install other dependencies
pip install -r requirements.txt
```

Then set `"use_gpu_groundtruth": true` in config.json.
Performance comparison:
- PostgreSQL mode: Consistent speed, no index overhead
- FAISS mode: First run builds role-level indexes (slow), subsequent runs use cached indexes (very fast)
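As an illustration of what GPU-based exact ground truth looks like, here is a generic FAISS sketch on random data; it is not the project's compute_ground_truth.py, and the sizes are arbitrary:

```python
# Generic sketch: exact (brute-force) nearest neighbours with FAISS on GPU.
# Random data is used purely for illustration.
import faiss
import numpy as np

d, nb, nq, k = 128, 100_000, 100, 100
xb = np.random.rand(nb, d).astype("float32")   # base vectors
xq = np.random.rand(nq, d).astype("float32")   # query vectors

index = faiss.IndexFlatL2(d)                   # exact L2 search
res = faiss.StandardGpuResources()
gpu_index = faiss.index_cpu_to_gpu(res, 0, index)

gpu_index.add(xb)
distances, ids = gpu_index.search(xq, k)       # ground-truth top-k ids per query
print(ids.shape)                               # (100, 100)
```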
```sh
cd basic_benchmark

# Load the default Wikipedia dataset (1M rows by default)
python3 common_prepare_pipeline.py --dataset wikipedia-22-12

# Example: load the SIFT 1M benchmark vectors (load-number 0 loads the entire file)
python3 common_prepare_pipeline.py --dataset sift-128-euclidean --load-number 0

# Example: load the 10M SIFT feature matrix (auto-extracts SIFT10Mfeatures.mat if needed)
python3 common_prepare_pipeline.py --dataset sift10m --load-number 0

# Flags:
#   --dataset      One of {wikipedia-22-12, arxiv, sift-128-euclidean, sift10m}
#   --load-number  Number of rows to ingest (0 or negative means "all remaining rows")
#   --start-row    Offset within the dataset before loading
#   --num-threads  Worker processes used for ingestion (defaults to CPU count)
```
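Under the hood the pipeline ingests rows into a pgvector table. A rough, hypothetical sketch of that pattern is shown below; the table name `demo_documents`, the 128-dimension column, and the batch size are made up for illustration and do not reflect the pipeline's actual schema:

```python
# Hypothetical sketch of pgvector ingestion; the real pipeline's schema,
# table names, and batching differ. Vectors are passed as pgvector text
# literals like '[0.1,0.2,...]'.
import json

import numpy as np
import psycopg2

with open("../config.json") as f:          # assumes you are in basic_benchmark/
    cfg = json.load(f)

conn = psycopg2.connect(dbname=cfg["dbname"], user=cfg["user"], password=cfg["password"],
                        host=cfg["host"], port=cfg["port"])
with conn, conn.cursor() as cur:
    cur.execute("CREATE EXTENSION IF NOT EXISTS vector;")   # may require superuser
    cur.execute("CREATE TABLE IF NOT EXISTS demo_documents "
                "(id bigint PRIMARY KEY, embedding vector(128));")
    rows = [(i, "[" + ",".join(map(str, np.random.rand(128).round(6))) + "]")
            for i in range(100)]
    cur.executemany("INSERT INTO demo_documents (id, embedding) "
                    "VALUES (%s, %s::vector) ON CONFLICT DO NOTHING;", rows)
conn.close()
```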
```sh
go to controller directory
# prepare main table (user/role/permission) indexes
python3 initialize_main_tables.py

go to services/rbac_generator directory
# taking tree-based RBAC as an example
python3 store_tree_based_rbac_generate_data.py

go to basic_benchmark directory
# initialize role partition
python3 initialize_role_partition_tables.py --index_type hnsw
# (optional) initialize user partition
python3 initialize_combination_role_partition_tables.py --index_type hnsw
# generate queries
python3 generate_queries.py --num_queries 1000 --topk 100 --num_threads 4
# compute ground truth (the benchmark & tests share the cache)
python3 compute_ground_truth.py
```

Ground Truth Caching:
- Ground truth results are automatically cached in `basic_benchmark/ground_truth_cache.json` (a reading sketch follows this list)
- Subsequent test runs with the same queries will load from the cache (instant)
- The cache is automatically cleared when regenerating queries with `generate_queries.py`
- To manually clear the cache: `rm basic_benchmark/ground_truth_cache.json`
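A sketch of how cached ground truth can be consumed to score a result set; the recall@k calculation is illustrative, and the cache's exact JSON schema is an assumption (it is not documented in this README):

```python
# Illustrative only: load the ground-truth cache and compute recall@k.
# The real schema of ground_truth_cache.json may differ.
import json

with open("basic_benchmark/ground_truth_cache.json") as f:
    ground_truth = json.load(f)   # assumed: {query_id: [neighbour ids, ...], ...}

def recall_at_k(returned_ids, true_ids, k):
    """Fraction of the true top-k neighbours that the query actually returned."""
    return len(set(returned_ids[:k]) & set(true_ids[:k])) / k

# Example with made-up ids for one query
true_ids = ground_truth.get("0", list(range(100)))
print(recall_at_k(returned_ids=list(range(0, 200, 2)), true_ids=true_ids, k=100))
```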
### Initialize dynamic partition
```sh
# initialize dynamic partition
go to controller/dynamic_partition/hnsw directory
# if needed, delete parameter_hnsw.json from the hnsw directory to regenerate parameters
python3 AnonySys_dynamic_partition.py --storage 2.0 --recall 0.95
```

### Run (HNSW index)
```sh
go to basic_benchmark directory
# example:
python test_all.py --algorithm RLS --efs 500
python test_all.py --algorithm ROLE --efs 20
python test_all.py --algorithm USER --efs 20
python test_all.py --algorithm AnonySys --efs 40
```
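To compare several `--efs` settings in one go, a small driver like the following can shell out to test_all.py with the flags shown above; the algorithm name and efs values are arbitrary examples:

```python
# Illustrative sketch: sweep efs values for one algorithm by invoking
# test_all.py with the documented flags. Run it from basic_benchmark/.
import subprocess

algorithm = "AnonySys"
for efs in (20, 40, 80, 160):
    print(f"--- {algorithm}, efs={efs} ---")
    subprocess.run(
        ["python", "test_all.py", "--algorithm", algorithm, "--efs", str(efs)],
        check=True,
    )
```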
### Run (ACORN index)
```sh
go to acorn_benchmark directory
# modify the efs value in main.cpp
# build the C++ project and run
```

Before building, create acorn_benchmark/config.json to point the benchmarks at the shared index location:

```json
{
    "index_storage_path": "/pgsql_data/acorn/"
}
```

Make sure /pgsql_data/acorn/ exists ahead of time; both ACORN and dynamic-partition indexes are persisted there.
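If you prefer to script that preparation step, a small sketch (assuming the /pgsql_data/acorn/ path above and that you have permission to create it) could be:

```python
# Illustrative sketch: create the shared index directory and write
# acorn_benchmark/config.json. Creating /pgsql_data/acorn/ may require
# sudo or an ownership change on some systems.
import json
import os

index_dir = "/pgsql_data/acorn/"
os.makedirs(index_dir, exist_ok=True)

with open("acorn_benchmark/config.json", "w") as f:
    json.dump({"index_storage_path": index_dir}, f, indent=4)
print("wrote acorn_benchmark/config.json")
```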