Model Card for Model ID

Model Details

Model Description

Before this version, our team had fine-tuned two separate cross-encoders. The first was used in our RAG system to score the relevance of a context to a question, that is, to determine whether the answer to the question can be found in the given context. The second was fine-tuned on the stsd dataset and was used to score the semantic similarity between two texts. This version attempts to merge the two cross-encoders into a single model.

Developed by: Leviatan Research Team
Model type: Cross Encoder
Language(s) (NLP): French
Finetuned from model: distilroberta-base (https://huggingface.co/distilbert/distilroberta-base)

Uses

! pip install sentencepiece
! pip install sentence-transformers

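import os

from sentence_transformers.cross_encoder import CrossEncoder
from huggingface_hub import login

# Authenticate with the Hugging Face token stored in the HF_TOKEN environment variable.
login(os.getenv('HF_TOKEN'))
model_path = 'LeviatanAIResearch/cross-encoder-context-question-fr-v1'
# Inputs longer than max_length tokens are truncated.
model = CrossEncoder(model_path, max_length=512)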

scores = model.predict([
    ('Un avion est en train de décoller.', "Un homme joue d'une grande flûte."), 
    ('Un homme coupe une poisson.', 'Un homme coupe une poisson en tranche.'),
    ("Un homme étale du fromage râpé sur une pizza.", "Une personne jette un chat au plafond"),
    (
        "Quelle eau pour la préparation du biberon ?", 
        "Les meilleurs aliments pour nourrir votre bébé sont le lait maternel ou le lait maternisé, les purées de légumes et de fruits, les céréales pour bébés enrichies en fer, les viandes maigres, les poissons riches en oméga-3, etc. Il est important de proposer une variété d'aliments sains pour assurer un bon développement et une nutrition optimale de votre bébé."
    ),
    (
        "Quelle eau pour la préparation du biberon ?",
        "La publicité montre un adulte et un bébé, le bébé tenant un biberon, avec un fond partagé entre une image et un message marketing qui questionne sur le choix de l'eau pour les nourrissons, mettant en avant la marque Mont Roucous comme un accompagnement dans la maternité."
    )
])
print(scores)
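
In a RAG pipeline, scores like the ones printed above are typically used to rerank candidate contexts for a question. The sketch below is a minimal illustration and not part of the released code: `rank_contexts`, `score_fn`, and `dummy_score_fn` are hypothetical names, and `dummy_score_fn` is a simple word-overlap stand-in for `model.predict` so that the example runs without downloading the model.

```python
from typing import Callable, List, Sequence, Tuple


def rank_contexts(
    question: str,
    contexts: Sequence[str],
    score_fn: Callable[[List[Tuple[str, str]]], Sequence[float]],
) -> List[Tuple[str, float]]:
    """Return (context, score) pairs sorted from most to least relevant."""
    # Pairs follow the (question, context) order used in the example above.
    pairs = [(question, context) for context in contexts]
    scores = score_fn(pairs)
    ranked = sorted(zip(contexts, scores), key=lambda item: item[1], reverse=True)
    return [(context, float(score)) for context, score in ranked]


def dummy_score_fn(pairs: List[Tuple[str, str]]) -> List[float]:
    """Stand-in scorer (assumption): share of question words present in the context."""
    scores = []
    for question, context in pairs:
        question_words = set(question.lower().split())
        context_words = set(context.lower().split())
        overlap = len(question_words & context_words)
        scores.append(overlap / len(question_words) if question_words else 0.0)
    return scores


ranked = rank_contexts(
    "eau biberon",
    ["lait maternel", "eau pour le biberon"],
    dummy_score_fn,
)
for context, score in ranked:
    print(f"{score:.2f}  {context}")
```

With the real model loaded as shown earlier, passing `model.predict` as `score_fn` should work in the same way, since it accepts a list of text pairs and returns one score per pair.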

Training Details

Training Data

The LeviatanAIResearch/cross-encoder-binary-context-question-v3 dataset was built by the Leviatan AI team to train the cross-encoder of our French RAG system and to score the similarity between two sentences. It combines two types of data. The first is context-question data drawn from the PIAF, FQuAD, SQuAD-French, and Pandora-fr datasets, supplemented with negative samples created from randomly selected unrelated questions. The second is the stsd-fr dataset. As a result, the labels are a mix of binary 0/1 classification targets and continuous similarity values between 0 and 1.
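
The construction described above can be sketched as follows. This is an illustration under stated assumptions, not the team's actual preprocessing script: the function names, the choice of one negative per positive pair, the fixed seed, and the final shuffle are all hypothetical.

```python
import random
from typing import List, Tuple

Example = Tuple[str, str, float]  # (text_a, text_b, label in [0, 1])


def add_random_negatives(
    positives: List[Tuple[str, str]], seed: int = 0
) -> List[Example]:
    """Label each (question, context) pair 1.0 and add one mismatched pair labelled 0.0."""
    rng = random.Random(seed)
    examples: List[Example] = []
    for index, (question, context) in enumerate(positives):
        examples.append((question, context, 1.0))
        # Assumption: a question taken from another pair is unrelated to this context.
        other_indices = [i for i in range(len(positives)) if i != index]
        if other_indices:
            unrelated_question = positives[rng.choice(other_indices)][0]
            examples.append((unrelated_question, context, 0.0))
    return examples


def merge_datasets(
    qa_examples: List[Example], sts_examples: List[Example], seed: int = 0
) -> List[Example]:
    """Concatenate binary QA examples with continuous similarity examples and shuffle."""
    merged = list(qa_examples) + list(sts_examples)
    for _, _, label in merged:
        if not 0.0 <= label <= 1.0:
            raise ValueError(f"label {label} is outside [0, 1]")
    random.Random(seed).shuffle(merged)
    return merged


positives = [
    ("Quelle eau pour la préparation du biberon ?", "contexte A"),
    ("Qui a écrit Les Misérables ?", "contexte B"),
]
sts_pairs = [("Un homme coupe un poisson.", "Un homme coupe un poisson en tranches.", 0.8)]
dataset = merge_datasets(add_random_negatives(positives), sts_pairs)
print(len(dataset))
```

Random mismatching can occasionally produce a pair that is in fact answerable, so a real pipeline may need an additional filtering step.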

Training Hyperparameters

evaluator=CECorrelationEvaluator
epochs=4
batch_size=16
evaluation_steps=5000
warmup_steps=6050 (10% of the training data is used for warm-up)
save_best_model=True
All other hyperparameters are left at their default values.
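
The warm-up value can be related to the other hyperparameters with a small calculation. The sketch below assumes that "10% of train data" means 10% of the total number of optimizer steps (steps per epoch times epochs), which is a common convention but is not confirmed by this card. The helper name `warmup_steps_for` is hypothetical, and the training-set size of 242,000 examples is a hypothetical figure chosen only because it is consistent with the reported values.

```python
def warmup_steps_for(
    num_examples: int, batch_size: int, epochs: int, warmup_percent: int = 10
) -> int:
    """Warm-up steps as a percentage of the total number of optimizer steps."""
    # Ceiling division: the last, possibly smaller, batch still counts as a step.
    steps_per_epoch = -(-num_examples // batch_size)
    total_steps = steps_per_epoch * epochs
    # Integer-only ceiling to avoid floating-point rounding surprises.
    return -((-total_steps * warmup_percent) // 100)


# Hypothetical dataset size; batch_size and epochs are the values listed above.
print(warmup_steps_for(num_examples=242_000, batch_size=16, epochs=4))
```

Under these assumptions the call reproduces the 6050 warm-up steps listed above.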

Evaluation

LeviatanAIResearch/cross-encoder-context-question-fr-v3 test set:
- Metric: CECorrelationEvaluator
- Correlation:
  - Pearson: 0.9522
  - Spearman: 0.8966

STS-B test set:
- Metric: CECorrelationEvaluator
- Correlation:
  - Pearson: 0.7565
  - Spearman: 0.7460

LeviatanAIResearch/cross-encoder-context-question-fr-v2 test set:
- Metric: CEBinaryClassificationEvaluator
- Accuracy: 97.6733 (Threshold: 0.967705)
- F1: 97.6676 (Threshold: 0.967705)
- Precision: 97.9071
- Recall: 97.4293
- Average Precision: 99.5244
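
For readers unfamiliar with these metrics, the sketch below shows how correlation and thresholded classification scores can be computed from predicted scores and gold labels. It is a plain-Python illustration with hypothetical function names and toy data, not the evaluators' implementation. In particular, the threshold is passed in rather than searched for, and treating scores strictly above the threshold as positive is an assumption.

```python
from math import sqrt
from typing import Dict, List, Sequence


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation between two equally long sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def average_ranks(values: Sequence[float]) -> List[float]:
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        mean_rank = (start + end) / 2 + 1
        for k in range(start, end + 1):
            ranks[order[k]] = mean_rank
        start = end + 1
    return ranks


def spearman(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


def binary_metrics(
    scores: Sequence[float], labels: Sequence[int], threshold: float
) -> Dict[str, float]:
    """Accuracy, precision, recall and F1 when scores above the threshold count as positive."""
    preds = [1 if score > threshold else 0 for score in scores]
    tp = sum(1 for p, y in zip(preds, labels) if p == 1 and y == 1)
    fp = sum(1 for p, y in zip(preds, labels) if p == 1 and y == 0)
    fn = sum(1 for p, y in zip(preds, labels) if p == 0 and y == 1)
    correct = sum(1 for p, y in zip(preds, labels) if p == y)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {
        "accuracy": correct / len(labels),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


# Toy data for illustration only; these are not the model's predictions.
gold = [0.0, 0.2, 0.5, 1.0]
predicted = [0.1, 0.3, 0.4, 0.9]
print(pearson(predicted, gold), spearman(predicted, gold))
print(binary_metrics([0.99, 0.98, 0.5, 0.1], [1, 0, 1, 0], threshold=0.967705))
```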

You need to agree to share your contact information to access this model