ancatmara committed on
Commit
4a5aa14
1 Parent(s): 0d146b2

Create README.md

Files changed (1)
  1. README.md +47 -0
README.md ADDED
@@ -0,0 +1,47 @@
+ ---
+ license: cc
+ language:
+ - ga
+ - ghc
+ pipeline_tag: feature-extraction
+ ---
+
+ ### Training Data
+
+ The models were trained on Early Irish texts from [CELT](https://celt.ucc.ie/publishd.html) and the [Historical Irish Corpus](http://corpas.ria.ie/index.php?fsg_function=1). A text was included in the training dataset if "Early Modern Irish" or the dates "1200-1700" were explicitly mentioned in its metadata on CELT; this includes texts marked as "Old, Middle and Early Modern Irish". As a result, the Early Modern Irish models can contain some Old and Middle Irish words. One can argue that 1700 is too late an end date for Early Modern Irish, and that Modern Irish should be counted from the 1500s, but we prefer to rely on the language labels assigned to our source texts by their editors, and to stretch the period slightly for better data coverage.
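The inclusion rule described above can be sketched as a simple metadata filter. This is only an illustration: the field names `language` and `dates` and the sample records are hypothetical, and do not reflect the actual CELT metadata schema or the authors' pipeline.

```python
def is_early_modern_irish(metadata: dict) -> bool:
    """Return True if the metadata explicitly mentions Early Modern Irish
    or the date range 1200-1700 (field names are hypothetical)."""
    language = metadata.get("language", "")
    dates = metadata.get("dates", "")
    return "Early Modern Irish" in language or "1200-1700" in dates


# Hypothetical records, for illustration only
texts = [
    {"id": "a", "language": "Early Modern Irish", "dates": ""},
    {"id": "b", "language": "Old, Middle and Early Modern Irish", "dates": ""},
    {"id": "c", "language": "Old Irish", "dates": "700-900"},
    {"id": "d", "language": "", "dates": "1200-1700"},
]

selected = [t["id"] for t in texts if is_early_modern_irish(t)]
print(selected)  # ['a', 'b', 'd']
```

Record `b` passes the filter because its mixed label contains "Early Modern Irish", which is how Old and Middle Irish words can end up in the training data.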
+
+ ### Available Models
+
+ There are three models in this family:
+
+ - **Cased**: `early_modern_irish_cased_ft_100_5_2.txt`
+ - **Lowercase**: `early_modern_irish_lower_ft_100_5_2.txt`
+ - **Lowercase with initial mutations removed**: `early_modern_irish_lower_demutated_ft_100_5_2.txt`
+
+ All models are trained with the same hyperparameters (`emb_size=100, window=5, min_count=2, n_epochs=100`) and saved as `KeyedVectors` (see [Gensim Documentation](https://radimrehurek.com/gensim/models/keyedvectors.html)).
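Since the Usage section loads the files with `load_word2vec_format(..., binary=False)`, they are presumably in the plain-text word2vec format, whose first line holds the vocabulary size and the vector dimension. A minimal sketch for checking that header without loading the whole file follows; the helper name `read_w2v_header` is hypothetical, and the demo runs on a tiny stand-in file rather than the real vectors.

```python
import os
import tempfile
from typing import Tuple


def read_w2v_header(path: str) -> Tuple[int, int]:
    """Read (vocab_size, vector_size) from the first line of a word2vec text file."""
    with open(path, encoding="utf-8") as f:
        vocab_size, vector_size = f.readline().split()
    return int(vocab_size), int(vector_size)


# Demo on a stand-in file with 2 words and 3 dimensions.
with tempfile.NamedTemporaryFile(
    "w", suffix=".txt", delete=False, encoding="utf-8"
) as tmp:
    tmp.write("2 3\nfoo 0.1 0.2 0.3\nbar 0.4 0.5 0.6\n")

header = read_w2v_header(tmp.name)
os.remove(tmp.name)
print(header)  # (2, 3)
```

For a downloaded model file, the second value should equal the `emb_size` of 100 given above, assuming the file follows this format.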
+
+ ### Usage
+
+ ```python
+ from gensim.models import KeyedVectors
+ from huggingface_hub import hf_hub_download
+
+ model_path = hf_hub_download(repo_id="ancatmara/early-modern-irish-ft-vectors", filename="early_modern_irish_cased_ft_100_5_2.txt")
+ model = KeyedVectors.load_word2vec_format(model_path, binary=False)
+
+ model.similar_by_word('Laighen')
+ ```
+
+ Out:
+ ```python
+ >>> [('Laighe', 0.8656445741653442),
+ ('Laighenáin', 0.8017030358314514),
+ ('Osraighe', 0.7939673662185669),
+ ('Laighean', 0.7764200568199158),
+ ('Connacht', 0.757500171661377),
+ ('Mumhan', 0.749495267868042),
+ ('Ereann', 0.7490534782409668),
+ ('Laigen', 0.7099151611328125),
+ ('Maighen', 0.704881489276886),
+ ('Laighnibh', 0.7041199207305908)]
+ ```
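A plain `KeyedVectors` table loaded from a word2vec-format file carries no subword information, so querying a word that is not in the vocabulary raises `KeyError`. The sketch below guards against that; it assumes gensim 4.x (where `key_to_index` is available), and the helper name `neighbours_or_empty` is hypothetical.

```python
from typing import List, Tuple


def neighbours_or_empty(model, word: str, topn: int = 10) -> List[Tuple[str, float]]:
    """Return the nearest neighbours of `word`, or [] if it is out of vocabulary.

    `model` is expected to be a gensim 4.x KeyedVectors instance.
    """
    if word not in model.key_to_index:
        return []
    return model.similar_by_word(word, topn=topn)


# Usage with the `model` loaded above:
# neighbours_or_empty(model, "Laighen", topn=3)
```

When querying the lowercase models, the query word would presumably need to be lowercased first, since those vocabularies contain no capitalised forms.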