We have released our Segment-CNN model on Hugging Face (https://huggingface.co/ZetangForward/SegmentCNN). You can use it directly for toxicity detection, classification, and annotation.
Usage:
```python
from transformers import pipeline

# Initialize the Segment-CNN classification pipeline
classifier = pipeline("spancnn-classification", model="ZetangForward/SegmentCNN", trust_remote_code=True)

# Example 1: positive (safe) text
pos_text = "You look good today~!"
result = classifier(pos_text)
print(result)  # Output: 0 (0 represents safe content)

# Example 2: negative (toxic) text
neg_text = "You're too stupid, you're just like a fool"
result = classifier(neg_text)
print(result)  # Output: 1 (1 represents toxic content)
```

- CMD uses language models to synthesize data step by step and then trains the model via chain of thought, aiming to enable the model to detoxify itself.
- To prevent the model from generating toxic content when given a safe context, CMD introduces a contrastive loss that pushes the model's generations away from the negative (toxic) samples during training.
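
To make the contrastive objective in the second bullet more concrete, here is a minimal sketch of an InfoNCE-style loss over plain Python vectors. It is an illustration under stated assumptions, not the loss implemented in the CMD training code: the vector representation, the cosine similarity, the temperature of 0.1, and the function names are all hypothetical.

```python
import math
from typing import Sequence


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def contrastive_loss(anchor, positive, negatives, temperature=0.1):
    """InfoNCE-style loss (assumed form, for illustration only).

    The loss is small when the anchor is close to the positive (non-toxic)
    sample and far from every negative (toxic) sample.
    """
    pos = math.exp(cosine(anchor, positive) / temperature)
    neg = sum(math.exp(cosine(anchor, n) / temperature) for n in negatives)
    return -math.log(pos / (pos + neg))


# Toy 2-d vectors standing in for model representations.
anchor = [1.0, 0.0]   # safe context
safe = [0.9, 0.1]     # non-toxic continuation
toxic = [0.0, 1.0]    # toxic continuation

aligned = contrastive_loss(anchor, safe, [toxic])
misaligned = contrastive_loss(anchor, toxic, [safe])
print(aligned < misaligned)
```

Minimizing a loss of this shape moves the anchor toward the non-toxic sample and away from the toxic ones, which is the behavior the bullet describes.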
We provide the code for Segment-CNN and for CMD training, which enables LLMs to detoxify themselves.
```bash
conda env create -f environment.yaml
```
Here we will create the span dataset for training Segment-CNN.
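```bash
# Run from the repository root
cd utils

# Convert the Jigsaw CSV to JSON and write the train/test span files
python csv_to_json.py \
    --input path/to/your/jigsaw/train.csv \
    --json_save ../dataset/total.json \
    --train_span_json_save ../dataset/segment_cnn_train.json \
    --test_span_json_save ../dataset/segment_cnn_test.json

# Query the Perspective API
sh perspective_api.sh
```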
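
For readers who want a feel for the span-dataset step above, here is a minimal sketch of turning Jigsaw-style rows into span records. It is not the repository's `csv_to_json.py`: the fixed-size word window, the `span_len` parameter, and the output field names (`span`, `comment_toxic`) are assumptions chosen for illustration. Only the `id`, `comment_text`, and `toxic` column names come from the Jigsaw `train.csv` format.

```python
import csv
import io
import json
from typing import Iterator, List


def iter_spans(text: str, span_len: int = 8) -> Iterator[str]:
    """Yield consecutive windows of at most `span_len` whitespace-separated words."""
    words = text.split()
    for start in range(0, len(words), span_len):
        yield " ".join(words[start:start + span_len])


def csv_to_span_records(csv_file, span_len: int = 8) -> List[dict]:
    """Read Jigsaw-style rows and emit one record per span (hypothetical schema)."""
    records = []
    for row in csv.DictReader(csv_file):
        for span in iter_spans(row["comment_text"], span_len):
            records.append({
                "id": row["id"],
                "span": span,
                # Comment-level label only; span-level scores are assumed
                # to come from a later scoring step.
                "comment_toxic": int(row["toxic"]),
            })
    return records


# Tiny in-memory example standing in for train.csv.
sample = io.StringIO(
    "id,comment_text,toxic\n"
    "a1,one two three four five,0\n"
)
records = csv_to_span_records(sample, span_len=2)
print(json.dumps(records, indent=2))
```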
```bash
# Run from the repository root
cd segment_cnn
python -u run_glue_no_trainer.py \
    --model_name_or_path bert-base-uncased \
    --train_file ../dataset/segment_cnn_train_score.json \
    --validation_file ../dataset/segment_cnn_test_score.json \
    --max_length 128 \
    --per_device_train_batch_size 256 \
    --per_device_eval_batch_size 256 \
    --learning_rate 2e-5 \
    --num_train_epochs 10 \
    --output_dir ../ckp/segment_cnn \
    --pad_to_max_length
```
Note that the original RealToxicityPrompts dataset is not split into training and test sets, so we split its prompts.jsonl into rtp_train.json and rtp_test.json.
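
The split itself can be done in a few lines. The sketch below is one possible way to do it, not the procedure used in the paper: the 10% test ratio, the random seed, the helper names, and the choice to write one JSON object per line are all assumptions.

```python
import json
import random
from typing import List, Tuple


def split_records(records: List[dict], test_ratio: float = 0.1,
                  seed: int = 0) -> Tuple[List[dict], List[dict]]:
    """Shuffle deterministically and return (train, test)."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    n_test = int(len(shuffled) * test_ratio)
    return shuffled[n_test:], shuffled[:n_test]


def split_jsonl(src: str, train_path: str, test_path: str,
                test_ratio: float = 0.1, seed: int = 0) -> None:
    """Read a JSON Lines file and write the two splits, one object per line."""
    with open(src, encoding="utf-8") as f:
        records = [json.loads(line) for line in f if line.strip()]
    train, test = split_records(records, test_ratio, seed)
    for path, rows in ((train_path, train), (test_path, test)):
        with open(path, "w", encoding="utf-8") as f:
            for row in rows:
                f.write(json.dumps(row) + "\n")


# Example call (paths are placeholders):
# split_jsonl("prompts.jsonl", "rtp_train.json", "rtp_test.json")
```

Fixing the seed keeps the split reproducible across runs, which matters because later steps only read rtp_train.json.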
```bash
# Run from the repository root
cd segment_cnn
python ../utils/mask_toxic_span.py \
    --input path/to/your/RealToxicityPrompts/rtp_train.json \
    --output ../dataset/rtp_mask_span.json \
    --model_path ../ckp/segment_cnn
```
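
Conceptually, the masking step replaces every span that the classifier flags as toxic with a placeholder. The sketch below shows that idea only. It does not reproduce `mask_toxic_span.py`: the `<mask>` token, the pre-split span list, and the keyword-based stand-in classifier are hypothetical, and in practice the trained Segment-CNN checkpoint would play the role of `is_toxic`.

```python
from typing import Callable, List


def mask_toxic_spans(spans: List[str], is_toxic: Callable[[str], bool],
                     mask_token: str = "<mask>") -> str:
    """Join the spans back into a prompt, replacing flagged spans with `mask_token`."""
    return " ".join(mask_token if is_toxic(span) else span for span in spans)


def keyword_classifier(span: str) -> bool:
    """Stand-in classifier for illustration only."""
    return "stupid" in span.lower()


masked = mask_toxic_spans(["You are", "too stupid", "to drive"], keyword_classifier)
print(masked)  # You are <mask> to drive
```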
Remember to use the Perspective API to make sure that all masked prompts in rtp_mask_span.json are non-toxic!
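
A check like this can be scripted. The sketch below shows the general shape of a Perspective API toxicity request using only the standard library. The endpoint and JSON fields follow the public Perspective API documentation, while the helper names and the 0.5 threshold are assumptions. The repository's own Perspective scripts may work differently.

```python
import json
import urllib.request

API_URL = "https://commentanalyzer.googleapis.com/v1alpha1/comments:analyze"


def build_request(text: str) -> dict:
    """Request body asking for the TOXICITY attribute only."""
    return {
        "comment": {"text": text},
        "requestedAttributes": {"TOXICITY": {}},
    }


def toxicity_from_response(response: dict) -> float:
    """Extract the summary toxicity score (0.0 to 1.0) from an API response."""
    return response["attributeScores"]["TOXICITY"]["summaryScore"]["value"]


def score_text(text: str, api_key: str) -> float:
    """Send one POST request to the API and return the toxicity score."""
    req = urllib.request.Request(
        f"{API_URL}?key={api_key}",
        data=json.dumps(build_request(text)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return toxicity_from_response(json.load(resp))


def is_non_toxic(score: float, threshold: float = 0.5) -> bool:
    """Assumed cutoff: scores below `threshold` count as non-toxic."""
    return score < threshold
```

Prompts whose score fails `is_non_toxic` can then be dropped or sent through the masking step again.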
```bash
# Run from the repository root
cd utils
python rephrase.py \
    --file ../dataset/rtp_mask_span.json \
    --save ../dataset/rtp_rephrase.json
```
Remember to use the Perspective API to make sure that all rephrased prompts in rtp_rephrase.json are non-toxic!
```bash
# Run from the repository root
cd utils

# Generate continuations for the rephrased prompts
python continuation_inference.py \
    --model path/to/your/corresponding_model \
    --file ../dataset/rtp_rephrase.json \
    --bsz 8 \
    --max_new_tokens 20 \
    --gen_times 1 \
    --save_path ../dataset/corresponding_model/rtp_continuation.json

# Run the continuations through the Perspective API
python perspective_api_dataset.py \
    --file ../dataset/corresponding_model/rtp_continuation.json \
    --output ../dataset/corresponding_model/rtp_continuation_api.json \
    --api_key <your_perspective_api_key>

# Build the CMD training set
python ../utils/make_train_set.py \
    --input ../dataset/corresponding_model/rtp_continuation_api.json \
    --output ../dataset/corresponding_model/rtp_cmd.json

# Train with CMD
cd ../train_cmd
sh train.sh
```
We provide download links for all the original data used in our paper:
| Dataset | Samples | Download Link |
|---|---|---|
| RealToxicityPrompts | ~100k | download |
| Jigsaw Toxic Comment Classification Challenge | ~160k (train) | download |

```bibtex
@article{tang2023detoxify,
  title={Detoxify language model step-by-step},
  author={Tang, Zecheng and Zhou, Keyan and Wang, Pinzheng and Ding, Yuyang and Li, Juntao and others},
  journal={arXiv preprint arXiv:2308.08295},
  year={2023}
}
```


