---
license: apache-2.0
datasets:
- lmsys/toxic-chat
metrics:
- perplexity
---
# Model Card for tci_minus
This model is a `facebook/bart-large` fine-tuned on toxic inputs from the `lmsys/toxic-chat` dataset.
## Model Details
This model is not intended to be used for plain inference, as it is very likely to generate toxic content. Instead, it is intended to be used as a "utility model" for detecting and fixing toxic content: its token probability distributions will likely differ from those of comparable models not trained or fine-tuned on toxic data.
Its name, `tci_minus`, refers to the G- (anti-expert) model in *Detoxifying Text with MaRCo: Controllable Revision with Experts and Anti-Experts*.
It can be used within TrustyAI's TMaRCo tool for detoxifying text; see https://github.com/trustyai-explainability/trustyai-detoxify/.
## Model Description
- **Developed by:** tteofili
- **Shared by:** tteofili
- **License:** Apache 2.0
- **Finetuned from model:** `facebook/bart-large`
## Uses
This model is intended to be used as a "utility model" for detecting and fixing toxic content: its token probability distributions will likely differ from those of comparable models not trained or fine-tuned on toxic data.
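The detection idea can be sketched in plain Python. The snippet below is an illustrative toy, not TMaRCo's actual implementation: it compares per-token probability distributions from a hypothetical base model and a hypothetical anti-expert via Jensen-Shannon divergence, and flags positions where the two disagree most as candidate toxic spans. All distributions here are made-up numbers over a tiny vocabulary.

```python
import math

def js_divergence(p, q):
    """Jensen-Shannon divergence between two discrete distributions."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    def kl(a, b):
        return sum(ai * math.log(ai / bi) for ai, bi in zip(a, b) if ai > 0)
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

# Toy per-token distributions over a 3-word vocabulary for two positions.
# At position 0 the base model and the anti-expert agree; at position 1
# the anti-expert puts most of its mass on a different token, marking
# that position as a candidate toxic span.
base_dists = [[0.8, 0.1, 0.1], [0.7, 0.2, 0.1]]
anti_dists = [[0.8, 0.1, 0.1], [0.1, 0.1, 0.8]]

scores = [js_divergence(p, q) for p, q in zip(base_dists, anti_dists)]
flagged = [i for i, s in enumerate(scores) if s > 0.1]  # threshold is illustrative
print(flagged)  # only position 1 disagrees
```

In practice the distributions come from running the base and fine-tuned models over the same masked input; the divergence threshold is a tunable choice.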
## Bias, Risks, and Limitations
This model is fine-tuned on toxic inputs from the `lmsys/toxic-chat` dataset, and it is very likely to produce toxic content. For this reason, it should only be used in combination with other models for the purpose of detecting and fixing toxic content.
## How to Get Started with the Model
Use the code below to get started with text detoxification.
```python
from trustyai.detoxify import TMaRCo

tmarco = TMaRCo(expert_weights=[-1, 3])
tmarco.load_models(["trustyai/tci_minus", "trustyai/gplus"])
tmarco.rephrase(["white men can't jump"])
```
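The `expert_weights=[-1, 3]` intuition can be illustrated with a small sketch. This is not TMaRCo's actual code; it is a toy product-of-experts combination in the spirit of the MaRCo paper, where a negative weight on the anti-expert steers generation away from tokens it favours, and a positive weight on the expert steers toward tokens it favours. All distributions are invented for illustration.

```python
import math

def combine(base, experts, weights):
    """Combine a base next-token distribution with expert/anti-expert
    distributions in log space: log p = log(base) + sum(w * log(expert)),
    then renormalize. A negative weight penalizes tokens that the
    (anti-)expert considers likely."""
    logp = [math.log(b) + sum(w * math.log(e[i]) for e, w in zip(experts, weights))
            for i, b in enumerate(base)]
    z = max(logp)  # subtract max for numerical stability
    exps = [math.exp(l - z) for l in logp]
    s = sum(exps)
    return [e / s for e in exps]

base = [0.5, 0.3, 0.2]  # base model's next-token distribution
anti = [0.7, 0.2, 0.1]  # anti-expert (toxic) favours token 0
plus = [0.1, 0.3, 0.6]  # expert (non-toxic) favours token 2
combined = combine(base, [anti, plus], [-1, 3])
print(combined.index(max(combined)))  # generation is steered toward token 2
```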
## Training Details
This model has been trained on toxic inputs from the `lmsys/toxic-chat` dataset.
### Training Data
Training data comes from the `lmsys/toxic-chat` dataset.
### Training Procedure
This model has been fine-tuned with the following code:
```python
from trustyai.detoxify import TMaRCo

tmarco = TMaRCo()  # instantiate with the default configuration

dataset_name = 'lmsys/toxic-chat'
data_dir = ''
perc = 100
td_columns = ['model_output', 'user_input', 'human_annotation', 'conv_id', 'jailbreaking', 'openai_moderation',
              'toxicity']
target_feature = 'toxicity'
content_feature = 'user_input'
model_prefix = 'toxic_chat_input_'

tmarco.train_models(perc=perc, dataset_name=dataset_name, expert_feature=target_feature, model_prefix=model_prefix,
                    data_dir=data_dir, content_feature=content_feature, td_columns=td_columns)
```
### Training Hyperparameters
This model has been trained with the following hyperparameters:
```python
from transformers import TrainingArguments

training_args = TrainingArguments(
    evaluation_strategy="epoch",
    learning_rate=2e-5,
    weight_decay=0.01
)
```
## Evaluation
### Testing Data, Factors & Metrics
#### Testing Data
Test data comes from the `lmsys/toxic-chat` dataset.
#### Metrics
The model was evaluated using the perplexity metric.
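Perplexity is the exponential of the average negative log-likelihood the model assigns to each token; lower is better, with 1.0 meaning the model predicts every token with certainty. A minimal sketch, using made-up per-token probabilities rather than the actual evaluation data:

```python
import math

def perplexity(token_log_probs):
    """Perplexity = exp of the mean negative log-likelihood per token."""
    nll = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(nll)

# Toy example: per-token log-probabilities for a 4-token sequence.
log_probs = [math.log(0.9), math.log(0.95), math.log(0.92), math.log(0.93)]
print(round(perplexity(log_probs), 2))  # → 1.08
```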
### Results
Perplexity: 1.08