Julien Simon committed
Commit
27c7dcd
1 Parent(s): 0bfaffa

- Add training script


- Add details to model card

Files changed (2)
  1. README.md +4 -6
  2. train-xlm.py +1 -6
README.md CHANGED
@@ -34,20 +34,18 @@ It achieves the following results on the evaluation set:
  - Loss: 0.0241
  - Accuracy: 0.9930

- ## Model description
-
- More information needed
-
  ## Intended uses & limitations

- More information needed
+ The model can accurately detect 102 languages.

  ## Training and evaluation data

- More information needed
+ The model has been trained and evaluated on the complete google/fleurs training and validation sets.

  ## Training procedure

+ The training script is included in the repository. The model has been trained on a p3dn.24xlarge instance on AWS (8 NVIDIA V100 GPUs).
+
  ### Training hyperparameters

  The following hyperparameters were used during training:
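As an aside on the new "Intended uses" text: a checkpoint like this one can be queried through the transformers text-classification pipeline. A minimal sketch, not part of the commit; the model id below is a placeholder, not necessarily this repository's actual name:

```python
# Illustrative sketch only, not from the commit. The model id is a
# placeholder; substitute the actual checkpoint name for this repo.
from transformers import pipeline

detector = pipeline("text-classification", model="your-username/xlm-language-id")
print(detector("Ceci est une phrase en français."))
# Expected shape: [{'label': <one of the 102 FLEURS languages>, 'score': ...}]
```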
train-xlm.py CHANGED
@@ -24,9 +24,7 @@ columns_to_remove = [
      "lang_group_id",
  ]

- train, val = load_dataset(
-     dataset_id, "all", split=["train", "validation"], ignore_verifications=True
- )
+ train, val = load_dataset(dataset_id, "all", split=["train", "validation"], ignore_verifications=True)

  # Build the label2id and id2label dictionaries
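The reformatted one-liner keeps `ignore_verifications=True`. A minimal sketch of this loading step plus the label-map construction mentioned in the context comment, assuming datasets >= 2.9 (where that flag was replaced by `verification_mode`) and assuming `lang_id` is a ClassLabel feature, as on the google/fleurs dataset card:

```python
# Sketch, not the committed script. Assumes datasets >= 2.9, where
# ignore_verifications=True became verification_mode="no_checks".
from datasets import load_dataset

train, val = load_dataset(
    "google/fleurs", "all", split=["train", "validation"],
    verification_mode="no_checks",
)

# Assumes lang_id is a ClassLabel feature, per the dataset card.
lang_names = train.features["lang_id"].names  # 102 language names
label2id = {name: i for i, name in enumerate(lang_names)}
id2label = {i: name for i, name in enumerate(lang_names)}
```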
 
@@ -54,11 +52,9 @@ val = val.shuffle(seed=42)

  tokenizer = AutoTokenizer.from_pretrained(model_id)

-
  def preprocess(data):
      return tokenizer(data["text"], truncation=True)

-
  processed_train = train.map(preprocess, batched=True)
  processed_val = val.map(preprocess, batched=True)
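Note that `preprocess` truncates but does not pad, so batches are presumably padded dynamically at training time. A minimal sketch, assuming the standard DataCollatorWithPadding from transformers pairs with this truncation-only preprocessing; the model id is a placeholder:

```python
# Sketch: dynamic per-batch padding, assuming DataCollatorWithPadding
# is what complements truncation-only preprocessing. Placeholder model id.
from transformers import AutoTokenizer, DataCollatorWithPadding

tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-base")  # placeholder
collator = DataCollatorWithPadding(tokenizer=tokenizer)
# Typically passed as Trainer(..., data_collator=collator); each batch is
# padded to the length of its longest sequence rather than a fixed max.
```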
 
@@ -111,4 +107,3 @@ trainer = Trainer(

  trainer.train()

- trainer.save_model("./my_model")
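With the explicit `trainer.save_model()` call removed, persistence presumably falls to the Trainer's own checkpointing. A hedged sketch of the relevant TrainingArguments; these are standard transformers parameter names, not the values used in this commit:

```python
# Sketch only: standard TrainingArguments checkpointing knobs, not the
# committed configuration.
from transformers import TrainingArguments

args = TrainingArguments(
    output_dir="xlm-language-id",  # placeholder; checkpoints land here
    save_strategy="epoch",         # write a checkpoint after each epoch
    push_to_hub=True,              # optionally upload checkpoints to the Hub
)
```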
 