---
license: cc-by-nc-sa-4.0
language:
- ga
- ghc
pipeline_tag: feature-extraction
library_name: gensim
---

### Training Data

**Early Modern Irish FastText models** were trained on Early Irish texts from [CELT](https://celt.ucc.ie/publishd.html) and the book subcorpus of the [Historical Irish Corpus](http://corpas.ria.ie/index.php?fsg_function=1). A text was included in the training dataset if "Early Modern Irish", "Classical Modern Irish" or the dates "1200-1700" were explicitly mentioned in its CELT metadata; this includes texts marked as "Old, Middle and Early Modern Irish". As a result, the models' vocabulary may contain some Old and Middle Irish words, as well as some Latin due to code-switching. One can argue that 1700 is too late an end date for Early Modern Irish and that Modern Irish should be dated from the 1500s, but we prefer to rely on the language labels that editors assigned to our source texts and to extend the period slightly for better data coverage.

### Available Models

There are 3 models in this family:

- **Cased**, 75 928 words: `early_modern_irish_cased_ft_100_5_2.txt`
- **Lowercase**, 71 491 words: `early_modern_irish_lower_ft_100_5_2.txt`
- **Lowercase with initial mutations removed**, 62 107 words: `early_modern_irish_lower_demutated_ft_100_5_2.txt`

All models are trained with the same hyperparameters (`emb_size=100, window=5, min_count=2, n_epochs=100`) and saved as `KeyedVectors` (see [Gensim Documentation](https://radimrehurek.com/gensim/models/keyedvectors.html)).
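
Since the three variants differ only in how the text was preprocessed, a query word should be normalized the same way before lookup. The following is a minimal sketch, not the authors' preprocessing code: the variant keys and the helper names `model_filename` and `normalize_query` are hypothetical, and it assumes that plain `str.lower()` is close enough to the lowercasing used in training. Removal of initial mutations is deliberately left out, because the exact rules used for the demutated model are not described here.

```python
# Filenames are taken from the list above; the dictionary keys are
# hypothetical labels introduced for this sketch.
MODEL_FILES = {
    "cased": "early_modern_irish_cased_ft_100_5_2.txt",
    "lower": "early_modern_irish_lower_ft_100_5_2.txt",
    "lower_demutated": "early_modern_irish_lower_demutated_ft_100_5_2.txt",
}


def model_filename(variant: str) -> str:
    """Return the vector file name for a model variant."""
    if variant not in MODEL_FILES:
        raise ValueError(f"Unknown variant {variant!r}; expected one of {sorted(MODEL_FILES)}")
    return MODEL_FILES[variant]


def normalize_query(word: str, variant: str) -> str:
    """Lowercase the query for the lowercase variants; leave it as-is for the cased one.

    Assumption: str.lower() approximates the training-time lowercasing.
    Initial mutations are NOT removed here.
    """
    model_filename(variant)  # validates the variant name
    return word if variant == "cased" else word.lower()


print(model_filename("lower"))
print(normalize_query("Laighen", "lower"))
```

The returned file name can be passed as the `filename` argument of `hf_hub_download`.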

### Usage

```python
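from gensim.models import KeyedVectors
from huggingface_hub import hf_hub_download

# Download the cased model from the Hugging Face Hub (cached locally after the first call)
model_path = hf_hub_download(
    repo_id="ancatmara/early-modern-irish-ft-vectors",
    filename="early_modern_irish_cased_ft_100_5_2.txt",
)
# The vectors are stored in the plain-text word2vec format
model = KeyedVectors.load_word2vec_format(model_path, binary=False)

# Ten nearest neighbours of 'Laighen' by cosine similarity
model.similar_by_word('Laighen')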
```

Out:
```python
>>> [('Laighe', 0.8656445741653442),
     ('Laighenáin', 0.8017030358314514),
     ('Osraighe', 0.7939673662185669),
     ('Laighean', 0.7764200568199158),
     ('Connacht', 0.757500171661377),
     ('Mumhan', 0.749495267868042),
     ('Ereann', 0.7490534782409668),
     ('Laigen', 0.7099151611328125),
     ('Maighen', 0.704881489276886),
     ('Laighnibh', 0.7041199207305908)]
```
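
Because the vectors are loaded from the word2vec text format into plain `KeyedVectors`, FastText subword information is not available at query time, and looking up a word that is absent from the vocabulary raises a `KeyError`. A guarded lookup can be sketched as follows, assuming gensim 4.x (where `KeyedVectors` exposes the `key_to_index` mapping); `safe_similar` is a hypothetical helper name, not part of gensim.

```python
def safe_similar(model, word, topn=10):
    """Return the nearest neighbours of `word`, or an empty list if it is out of vocabulary.

    `model` is expected to be a gensim 4.x KeyedVectors instance
    (anything with `key_to_index` and `similar_by_word` works).
    """
    if word not in model.key_to_index:
        return []
    return model.similar_by_word(word, topn=topn)
```

With the model loaded as above, `safe_similar(model, 'Laighen')` returns the same list as `model.similar_by_word('Laighen')`, while an unseen spelling yields `[]` instead of an exception.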