Siamese BiLSTM with Glyphic Character Embeddings for Bone-level Sentence Association Prediction (SGBSAP)

This repository implements a Siamese BiLSTM with Glyphic Character Embeddings for Bone-level Sentence Association Prediction (SGBSAP). The model uses glyphic (shape) character embeddings of oracle bone inscriptions (OBIs) to address the Fragment Association Prediction Problem, offering a new AI-based approach to oracle bone rejoining. It was proposed in the paper "A Multi-Modal Dataset and a Glyph-Based Method for Bone-Level Association Prediction in Oracle Bone Inscriptions".

✨ Features

Dual-tower BiLSTM encoders, one for each of the two sentences.
Glyphic (shape) character embeddings instead of semantic (meaning) character embeddings.
Metrics: AUROC, AUPR, Accuracy, Precision, Recall, and F1 score.
Multiple runs, with averaged results saved to CSV.

📦 Requirements & Installation

Create an environment and install dependencies:

pip install -r requirements.txt

Python 3.9+ and PyTorch ≥ 2.0 are recommended.

📂 Data Format

Training and test files (train.txt, test.txt) should be tab-separated with three fields per line:

sentence1 \t sentence2 \t label

Each sentence is a whitespace-separated sequence of tokens in the form primary-character|secondary-character, e.g.:

u7v6rlhp81|xzku05tqy7 h0gzv3styy|h0gzv3styy jvuf4ut3c5|bf98fqv8qx	d7mczw6osp|d7mczw6osp 7h3wu2xyyf|7h3wu2xyyf 6ceuhy4fvr|6ceuhy4fvr	1

label = 1 → the two OBIs are from the same original oracle bone
label = 0 → the two OBIs are from different oracle bones

Sequence length is truncated to 30 tokens by default in the script.
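
As an illustration of the format above, the following is a minimal parsing sketch, not the repository's actual loader. The function names `parse_sentence` and `parse_line` are hypothetical, and the fallback for a token without a `|` separator is an assumption.

```python
from typing import List, Tuple

MAX_SEQ_LEN = 30  # default truncation length stated in this README


def parse_sentence(text: str, max_len: int = MAX_SEQ_LEN) -> List[Tuple[str, str]]:
    """Split a sentence into (primary, secondary) pairs, truncated to max_len tokens."""
    pairs = []
    for token in text.split()[:max_len]:
        primary, _, secondary = token.partition("|")
        # Assumption: a token with no "|" reuses the primary ID as the secondary ID.
        pairs.append((primary, secondary or primary))
    return pairs


def parse_line(line: str, max_len: int = MAX_SEQ_LEN):
    """Parse one 'sentence1<TAB>sentence2<TAB>label' record."""
    s1, s2, label = line.rstrip("\n").split("\t")
    return parse_sentence(s1, max_len), parse_sentence(s2, max_len), int(label)


example = "u7v6rlhp81|xzku05tqy7 h0gzv3styy|h0gzv3styy\td7mczw6osp|d7mczw6osp\t1"
left, right, label = parse_line(example)
print(left, right, label)
```

Splitting on the tab first and on whitespace second keeps the two sentences apart even though their tokens are themselves space-separated.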

🔤 Pretrained Embeddings

Provide two JSON files that share the same vocabulary (keys are tokens; values are 512-dimensional vectors by default).

Example format:

{
  "8gxzzbv7w8": [0.1, 0.2, 0.3, /* ... 512 dims ... */],
  "jrzjjh3g1r": [0.05, 0.07, 0.11 /* ... */],
  ...
}
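
A file in this format could be loaded roughly as follows. This is a sketch under assumptions, not the code in `main.py`: the helper names `load_embeddings` and `build_matrix` are hypothetical, as are the reserved padding row at index 0 and the zero vector for tokens missing from the file.

```python
import json
from typing import Dict, List, Sequence, Tuple

EMBED_DIM = 512  # default dimension stated in this README


def load_embeddings(path: str, dim: int = EMBED_DIM) -> Dict[str, List[float]]:
    """Read a token -> vector JSON file and check every vector's length."""
    with open(path, encoding="utf-8") as f:
        table = json.load(f)
    for token, vec in table.items():
        if len(vec) != dim:
            raise ValueError(f"{token}: expected {dim} dims, got {len(vec)}")
    return table


def build_matrix(
    vocab: Sequence[str], table: Dict[str, List[float]], dim: int = EMBED_DIM
) -> Tuple[Dict[str, int], List[List[float]]]:
    """Return a token -> row index map and the matching embedding rows."""
    # Assumption: row 0 is reserved for padding, so token IDs start at 1.
    token_to_id = {tok: i + 1 for i, tok in enumerate(vocab)}
    # Assumption: tokens absent from the JSON file get an all-zero vector.
    rows = [[0.0] * dim] + [list(table.get(tok, [0.0] * dim)) for tok in vocab]
    return token_to_id, rows
```

The comments and `...` in the example above are only illustrative; `json.load` accepts plain JSON only, so real embedding files must not contain them.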

The data required for training the semantic embeddings can be accessed at Zenodo. The processed version of the data is also provided in the data folder of this GitHub repository.

The glyphic embeddings can be trained using a variational autoencoder (VAE). The data required for this step can be accessed at Zenodo.

🚀 Usage

Option A: code/main.py

We provide main.py. Modify the parameters inside it before running.

Option B: bash run.sh

📊 Output

Console logs each run's AUROC, AUPR, Accuracy, Precision, Recall, and F1 score.
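
For reference, the four threshold-based metrics can be computed from predicted scores as in the sketch below. The function name `binary_metrics` and the 0.5 decision threshold are assumptions; the repository may compute these differently, and AUROC and AUPR (which are threshold-free) are not covered here.

```python
from typing import Dict, Sequence


def binary_metrics(
    labels: Sequence[int], scores: Sequence[float], threshold: float = 0.5
) -> Dict[str, float]:
    """Accuracy, precision, recall and F1 for binary labels and predicted scores."""
    # Assumption: a score at or above the threshold counts as a positive prediction.
    preds = [1 if s >= threshold else 0 for s in scores]
    tp = sum(1 for y, p in zip(labels, preds) if y == 1 and p == 1)
    fp = sum(1 for y, p in zip(labels, preds) if y == 0 and p == 1)
    fn = sum(1 for y, p in zip(labels, preds) if y == 1 and p == 0)
    tn = sum(1 for y, p in zip(labels, preds) if y == 0 and p == 0)
    # Guard against empty denominators by returning 0.0.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    accuracy = (tp + tn) / len(labels) if labels else 0.0
    return {"ACC": accuracy, "Precision": precision, "Recall": recall, "F1": f1}


metrics = binary_metrics([1, 0, 1, 0], [0.9, 0.6, 0.4, 0.1])
print(metrics)
```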

CSV file with averaged metrics:

Run,AUROC,AUPR,ACC,Precision,Recall,F1
1,0.9572285274870893,0.7864667610863505,0.9520656314093173,0.7449933244325768,0.7190721649484536,0.7318032786885246
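
A results file with this header could be produced along the following lines. This is only a sketch: `write_results` is a hypothetical helper, and the trailing row labelled `Average` is an assumption about how the averaged metrics are stored.

```python
import csv
from statistics import mean
from typing import Dict, List

COLUMNS = ["AUROC", "AUPR", "ACC", "Precision", "Recall", "F1"]


def write_results(path: str, runs: List[Dict[str, float]]) -> List[float]:
    """Write one row per run plus a final row of per-metric averages."""
    averages = [mean(run[c] for run in runs) for c in COLUMNS]
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(["Run"] + COLUMNS)
        for i, run in enumerate(runs, start=1):
            writer.writerow([i] + [run[c] for c in COLUMNS])
        # Assumption: averages go in a last row whose "Run" field is "Average".
        writer.writerow(["Average"] + averages)
    return averages
```

Passing `newline=""` to `open` lets the `csv` module control line endings, which avoids blank rows on Windows.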

🧠 Notes on Training

Early stopping is based on training loss with a patience window (default 10).
Embedding dimension (EMBED_DIM=512), hidden size (HIDDEN_SIZE=256), dropout (0.3), batch size (64), learning rate (1e-4) and max epochs (100) are set inside the function.
Sequences are truncated to MAX_SEQ_LEN=30 tokens.
Vocabulary is built from both train and test files to avoid OOV at test time.
Reproducibility: random seeds are not set by default.
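
The early-stopping rule described above can be sketched as follows. The class `EarlyStopping` and its `min_delta` parameter are hypothetical; the script's own logic may differ in detail, for example in how ties with the best loss are counted.

```python
class EarlyStopping:
    """Signal a stop when the training loss has not improved for `patience` epochs."""

    def __init__(self, patience: int = 10, min_delta: float = 0.0):
        self.patience = patience    # default of 10 matches this README
        self.min_delta = min_delta  # assumption: minimum drop that counts as improvement
        self.best = float("inf")
        self.bad_epochs = 0

    def step(self, loss: float) -> bool:
        """Record one epoch's training loss; return True when training should stop."""
        if loss < self.best - self.min_delta:
            self.best = loss
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience


# Usage with a short patience so the effect is visible on a toy loss curve.
stopper = EarlyStopping(patience=3)
stopped_at = None
for epoch, loss in enumerate([0.9, 0.7, 0.71, 0.72, 0.70, 0.69], start=1):
    if stopper.step(loss):
        stopped_at = epoch
        break
print(stopped_at, stopper.best)
```

Because the criterion tracks training loss rather than a held-out validation loss, it bounds training time but does not by itself guard against overfitting.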

⚙️ Communication

If you have any questions or need further clarification, feel free to contact us via email at [email protected]. We will do our best to provide assistance!
Due to the large size of the embedding JSON files, they cannot be uploaded to GitHub. For academic research purposes, the files are available at the following link: https://zenodo.org/records/14882488
The pre-trained model .pt files are available upon request.

📖 Citation

If you use this code in your research, please cite:

not available yet

