Julien Simon committed
Commit 27c7dcd
Parent(s): 0bfaffa

- Add training script
- Add details to model card

- README.md +4 -6
- train-xlm.py +1 -6
README.md
CHANGED
@@ -34,20 +34,18 @@ It achieves the following results on the evaluation set:
 - Loss: 0.0241
 - Accuracy: 0.9930
 
-## Model description
-
-More information needed
-
 ## Intended uses & limitations
 
-
+The model can accurately detect 102 languages.
 
 ## Training and evaluation data
 
-
+The model has been trained and evaluated on the complete google/fleurs training and validation sets.
 
 ## Training procedure
 
+The training script is included in the repository. The model has been trained on an p3dn.24xlarge instance on AWS (8 NVIDIA V100 GPUs).
+
 ### Training hyperparameters
 
 The following hyperparameters were used during training:
train-xlm.py
CHANGED
@@ -24,9 +24,7 @@ columns_to_remove = [
     "lang_group_id",
 ]
 
-train, val = load_dataset(
-    dataset_id, "all", split=["train", "validation"], ignore_verifications=True
-)
+train, val = load_dataset(dataset_id, "all", split=["train", "validation"], ignore_verifications=True)
 
 # Build the label2id and id2label dictionaries
 
@@ -54,11 +52,9 @@ val = val.shuffle(seed=42)
 
 tokenizer = AutoTokenizer.from_pretrained(model_id)
 
-
 def preprocess(data):
     return tokenizer(data["text"], truncation=True)
 
-
 processed_train = train.map(preprocess, batched=True)
 processed_val = val.map(preprocess, batched=True)
 
@@ -111,4 +107,3 @@ trainer = Trainer(
 
 trainer.train()
 
-trainer.save_model("./my_model")
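The context line `# Build the label2id and id2label dictionaries` points at code that falls outside the hunks shown in this diff. As a rough sketch of what such a step commonly looks like, and not the script's actual code, the two mappings can be built from a list of label names. The `language_names` list below is a hypothetical stand-in; the real script presumably derives its labels from the dataset.

```python
# Hypothetical label names for illustration only; not taken from train-xlm.py.
language_names = ["English", "French", "German"]

# label2id maps each label name to an integer class index.
label2id = {name: idx for idx, name in enumerate(language_names)}

# id2label is the inverse mapping, from class index back to label name.
id2label = {idx: name for name, idx in label2id.items()}

print(label2id)
print(id2label)
```

Classification models in the transformers library accept mappings of this shape through their config, which lets predictions be reported as label names instead of integer ids.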