This guide provides step-by-step instructions for fine-tuning SDAR models (e.g., SDAR-4B-Chat, SDAR-8B-Chat) using the LlamaFactory framework. The process involves a specific environment setup, model preparation with custom code, and a specialized data configuration to leverage Flex Attention.
First, create the Conda environment from the provided environment file. This will install all necessary dependencies for LlamaFactory and the SDAR model.
conda env create -f llamafactory_full_env.yml
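After the environment is created and activated, you may want to confirm that the installed PyTorch build exposes the Flex Attention backend that SDAR training relies on. The snippet below is a minimal, optional check (it assumes a PyTorch release recent enough to ship torch.nn.attention.flex_attention):

import torch
import transformers

print(f"torch {torch.__version__}, transformers {transformers.__version__}")

try:
    # Flex Attention only ships with recent PyTorch releases.
    from torch.nn.attention.flex_attention import flex_attention  # noqa: F401
    print("Flex Attention backend is available")
except ImportError:
    print("Flex Attention not found -- check the PyTorch version in this environment")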
The SDAR models in this repository require custom modeling and config files to function correctly.

- Prepare Model Directories: Create local directories for the models you intend to train, such as:

  ./model/SDAR-4B-Chat
  ./model/SDAR-8B-Chat

- Add Custom Files: Place the modified modeling_*.py and config.json files from this project into the corresponding model directory. These files are essential for enabling the model's unique architecture.
- Download Model Weights: Download the official safetensors weight files from their Hugging Face repositories and copy them into the same local directories.
After this step, your directory structure should look something like this:
./model/SDAR-4B-Chat/
├── config.json # Custom config file
├── modeling_sdar.py # Custom modeling file
├── model-*.safetensors # Official weights from Hugging Face
└── ... # Other model files
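To confirm that the custom files are picked up from the local directory, you can try loading the config and tokenizer with trust_remote_code enabled. This is only a sanity-check sketch; it assumes the custom config.json wires up the SDAR classes for Auto* loading (an auto_map entry or similar), and the exact class names depend on the files shipped with this project:

from transformers import AutoConfig, AutoTokenizer

model_dir = "./model/SDAR-4B-Chat"

# trust_remote_code lets transformers import the custom modeling/config code
# from the local directory instead of falling back to a stock architecture.
config = AutoConfig.from_pretrained(model_dir, trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained(model_dir, trust_remote_code=True)

print(type(config).__name__)  # should report the SDAR config class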
Next, define a YAML file to specify the training parameters. Below is an example based on ./examples/train_full_sdar/sdar_4b/sdar_4b_math_cot_full.yaml.
In the model section, you must point to your local model path and enable trust_remote_code. This is mandatory for loading the custom modeling scripts you prepared in the model preparation step above.
### model
model_name_or_path: /path/to/your/model/SDAR-4B-Chat
train_from_scratch: false
trust_remote_code: true

SDAR models use Flex Attention as their attention backend, which performs most efficiently with fixed-shape inputs. Therefore, training must be done using a data packing method (neat_packing: true).
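To build intuition for why packing pairs well with Flex Attention: several short samples are concatenated into one fixed-length sequence, and the attention mask keeps tokens from attending across sample boundaries. The sketch below is purely illustrative; it is not SDAR's actual mask (which follows the model's block design), and the lengths are made up:

import torch

# Hypothetical packed sequence: two samples of lengths 3 and 5 share one
# fixed-length row of 8 tokens. sample_id marks which tokens belong together.
sample_id = torch.tensor([0, 0, 0, 1, 1, 1, 1, 1])

# A token may only attend to earlier tokens of the *same* packed sample.
q_idx = torch.arange(8).unsqueeze(1)   # query positions
kv_idx = torch.arange(8).unsqueeze(0)  # key/value positions
mask = (sample_id[q_idx] == sample_id[kv_idx]) & (q_idx >= kv_idx)

print(mask.int())
# Because every packed row has the same fixed length, Flex Attention can reuse
# one compiled kernel and mask structure across batches instead of re-specializing.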
### dataset
dataset: open_r1_math
template: qwen3
block_length: 4 # Corresponds to the model's block size for packing
cutoff_len: 20480
truncate_mode: drop # Recommended: 'drop' or 'cut'
overwrite_cache: false
tokenized_path: /cache_dir/for/tokenized_data
preprocessing_num_workers: 96
dataloader_num_workers: 4
neat_packing: true # Must be true for SDAR training

Key Parameters for SDAR:
- neat_packing: true: Enables the data packing strategy required by the model.
- block_length: Defines the block size for packing, which should align with the model's architectural design.
- truncate_mode: Handles sequences that exceed cutoff_len.
  - drop: Discards the entire sequence. This is recommended to maintain fixed input shapes for Flex Attention.
  - cut: Truncates the sequence to cutoff_len.
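The following sketch illustrates, in plain Python, what these options mean at the preprocessing stage. It is not LlamaFactory's implementation; the token-ID lists, the 16-token cutoff, and the helper function are invented for illustration:

# Illustrative only: how 'drop' vs 'cut' treat an over-length sample, and how
# surviving samples are packed back-to-back into fixed-length rows.
CUTOFF_LEN = 16  # stands in for cutoff_len
PAD_ID = 0

samples = [list(range(1, 6)), list(range(1, 25)), list(range(1, 9))]  # lengths 5, 24, 8

def truncate(sample, mode):
    if len(sample) <= CUTOFF_LEN:
        return sample
    if mode == "drop":
        return None                 # discard the whole over-length sequence
    if mode == "cut":
        return sample[:CUTOFF_LEN]  # keep only the first CUTOFF_LEN tokens
    raise ValueError(mode)

kept = [s for s in (truncate(s, "drop") for s in samples) if s is not None]

# Greedy packing: concatenate samples until a row of CUTOFF_LEN tokens is full,
# then pad the remainder so every row has the same fixed shape.
rows, current = [], []
for s in kept:
    if len(current) + len(s) > CUTOFF_LEN:
        rows.append(current + [PAD_ID] * (CUTOFF_LEN - len(current)))
        current = []
    current += s
if current:
    rows.append(current + [PAD_ID] * (CUTOFF_LEN - len(current)))

for row in rows:
    print(row)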
Finally, use torchrun to launch the distributed training job. The command executes the LlamaFactory launcher script with your specified YAML configuration file.
The following command starts a training job on a single machine (nnodes 1) with 8 GPUs (nproc_per_node 8).
torchrun \
--nnodes 1 \
--node_rank 0 \
--nproc_per_node 8 \
--master_addr 127.0.0.1 \
--master_port 12345 \
./src/llamafactory/launcher.py \
  ./examples/train_full_sdar/sdar_4b/sdar_4b_sb_sal_v_full.yaml

Make sure to replace the final argument with the path to your own YAML configuration file.
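Before launching, it can be worth confirming that the number of processes you request matches the GPUs actually visible on the node. A tiny, optional pre-flight check (the value 8 simply mirrors the nproc_per_node used above):

import torch

# nproc_per_node should not exceed the GPUs visible to this job.
visible = torch.cuda.device_count()
print(f"Visible GPUs: {visible}")
assert visible >= 8, "Fewer GPUs visible than nproc_per_node requests"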