arxiv:2411.08868

CamemBERT 2.0: A Smarter French Language Model Aged to Perfection

Published on Nov 13

Abstract

French language models such as CamemBERT have been widely adopted across industries for natural language processing (NLP) tasks, with models like CamemBERT seeing over 4 million downloads per month. However, these models suffer from temporal concept drift: as their training data ages, performance declines, especially on new topics and terminology. This calls for updated models that reflect current language use. In this paper, we introduce two new versions of the CamemBERT base model, CamemBERTav2 and CamemBERTv2, designed to address these challenges. CamemBERTav2 is based on the DeBERTaV3 architecture and uses the Replaced Token Detection (RTD) objective for better contextual understanding, while CamemBERTv2 is built on RoBERTa, which uses the Masked Language Modeling (MLM) objective. Both models are trained on a significantly larger and more recent dataset, with a longer context length and an updated tokenizer that improves tokenization of French. We evaluate the models on both general-domain NLP tasks and domain-specific applications, such as medical tasks, demonstrating their versatility and effectiveness across a range of use cases. Our results show that the updated models vastly outperform their predecessors, making them valuable tools for modern NLP systems. All our new models, as well as intermediate checkpoints, are openly available on Hugging Face.
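To make the difference between the two objectives named in the abstract concrete, here is a minimal toy sketch in plain Python. It is not the authors' training code: the function names, the 15% corruption rate, and the random replacement (standing in for a generator network) are all assumptions for illustration, and real MLM pipelines typically use a more elaborate masking scheme.

```python
import random

MASK = "[MASK]"  # placeholder mask symbol, assumed for this sketch


def mlm_example(tokens, mask_prob=0.15, rng=None):
    """Masked Language Modeling: hide some tokens; the model must recover them.

    Returns (inputs, labels). labels[i] is the original token at masked
    positions and None elsewhere (unmasked positions are not scored).
    """
    if rng is None:
        rng = random.Random(0)
    inputs, labels = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            inputs.append(MASK)
            labels.append(tok)
        else:
            inputs.append(tok)
            labels.append(None)
    return inputs, labels


def rtd_example(tokens, vocab, replace_prob=0.15, rng=None):
    """Replaced Token Detection: swap some tokens; the model classifies
    every position as original (0) or replaced (1).

    A random draw from vocab stands in for a generator's sample (assumption).
    If the draw happens to equal the original token, the label stays 0.
    """
    if rng is None:
        rng = random.Random(0)
    inputs, labels = [], []
    for tok in tokens:
        new = rng.choice(vocab) if rng.random() < replace_prob else tok
        inputs.append(new)
        labels.append(int(new != tok))
    return inputs, labels
```

The practical difference the sketch shows is that MLM produces a training signal only at the masked positions, while RTD produces a binary label at every position of the input.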


Community

stefan-it

Paper submitter about 20 hours ago

Really cool new CamemBERT(a) models and very interesting comparisons between the RoBERTa and DeBERTa architectures.

E.g. on PoS tagging they are on par, but DeBERTa is generally slower to fine-tune, so one would prefer CamemBERT here. For NER, though, the DeBERTa model goes off and heavily outperforms everything :)

stefan-it

Paper submitter about 20 hours ago • edited about 19 hours ago

Hey @wissamantoun, so great to see new improvements on the CamemBERT* family!!

Did you btw use the same codebase as for training CamemBERTa (and the two-phase approach with sequence lengths 128 + 512)?

wissamantoun

about 15 hours ago

Yes, it's the same codebase, although I added more features and fixes to it. I'm currently working on code cleanup and will try to push all models and fine-tunes to Hugging Face ASAP.

This time I also opted for two-phase training, but with 512 then 1024. I could have done a three-phase one, but I decided to go with two-phase just for simplicity.
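As a rough illustration of the two-phase idea (train at sequence length 512, then at 1024), the sketch below packs a flat stream of token ids into fixed-length sequences, one phase after the other. Only the two lengths come from the comment above; the `pack` and `two_phase_sequences` helpers, the use of packing, and the choice to reuse the same stream in both phases are hypothetical and not taken from the actual codebase.

```python
# Sequence lengths per phase, as mentioned in the comment above.
PHASES = [("phase 1", 512), ("phase 2", 1024)]


def pack(token_ids, seq_len):
    """Split a flat list of token ids into full sequences of seq_len.

    Any trailing remainder shorter than seq_len is dropped (assumption).
    """
    n_full = len(token_ids) // seq_len
    return [token_ids[i * seq_len:(i + 1) * seq_len] for i in range(n_full)]


def two_phase_sequences(token_ids, phases=PHASES):
    """Yield (phase_name, sequence) pairs: all short sequences first,
    then all long ones. Step counts per phase are not modeled here."""
    for name, seq_len in phases:
        for seq in pack(token_ids, seq_len):
            yield name, seq
```

With a stream of 2048 token ids, this yields four sequences of length 512 followed by two of length 1024. A three-phase variant would only need one more entry in `PHASES`.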

