Kansallisarkisto/finbert-ner

Finnish named entity recognition

The model performs named entity recognition on Finnish text input. It was trained by fine-tuning bert-base-finnish-cased-v1 using 10 named entity categories. The training data includes, among other sources, the Turku OntoNotes Entities Corpus, the Finnish part of the NewsEye dataset, and an annotated dataset of Finnish documents from the 1970s onwards, digitized by the National Archives of Finland. Because the last of these datasets also contains sensitive data, it has not been made publicly available.

An example of how the model can be used for named entity recognition is provided in this Colab notebook.

Motivations behind model development and the data selection and annotation processes have been described in more detail in the article Making sense of bureaucratic documents – Named entity recognition for state authority archives.

Intended uses & limitations

The model has been trained to recognize the following named entity types in Finnish text:

PERSON (person names)
ORG (organizations)
LOC (locations)
GPE (geopolitical locations)
PRODUCT (products)
EVENT (events)
DATE (dates)
JON (Finnish journal numbers, diaarinumero)
FIBC (Finnish business identity codes, y-tunnus)
NORP (nationality, religious and political groups)

Some entity types, such as EVENT and LOC, are less common in the training data than the others, so recognition accuracy for them tends to be lower.

Most of the training data is relatively recent, so the model may have difficulties when the input contains, for example, old names or older writing styles.

How to use

The easiest way to use the model is through the Transformers pipeline for token classification:

from transformers import pipeline

# Load the fine-tuned checkpoint from the Hugging Face Hub
model_checkpoint = "Kansallisarkisto/finbert-ner"
token_classifier = pipeline(
    "token-classification", model=model_checkpoint, aggregation_strategy="simple"
)
# aggregation_strategy="simple" merges sub-word tokens into whole entity spans
predictions = token_classifier("Helsingistä tuli Suomen suuriruhtinaskunnan pääkaupunki vuonna 1812.")
print(predictions)
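
With an aggregation strategy set, the pipeline returns a list of dictionaries with the keys entity_group, score, word, start and end. The sketch below shows one possible way to collect the recognized strings per entity type. The helper name group_by_entity_type, the min_score threshold and the sample values are illustrative assumptions, not real model output.

```python
from collections import defaultdict


def group_by_entity_type(predictions, min_score=0.0):
    """Collect predicted entity strings under their entity type.

    `predictions` is expected to look like the output of a token
    classification pipeline that uses an aggregation strategy.
    """
    grouped = defaultdict(list)
    for pred in predictions:
        # Skip low-confidence predictions (the threshold is a hypothetical choice)
        if pred["score"] >= min_score:
            grouped[pred["entity_group"]].append(pred["word"])
    return dict(grouped)


# Hand-written stand-in for pipeline output; labels and scores are made up
sample_predictions = [
    {"entity_group": "GPE", "score": 0.99, "word": "Helsingistä", "start": 0, "end": 11},
    {"entity_group": "GPE", "score": 0.98, "word": "Suomen suuriruhtinaskunnan", "start": 17, "end": 43},
    {"entity_group": "DATE", "score": 0.45, "word": "vuonna 1812", "start": 56, "end": 67},
]

print(group_by_entity_type(sample_predictions, min_score=0.5))
```

In practice, the list returned by token_classifier(...) would be passed in place of sample_predictions.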

Training data

Some entity types (for instance WORK_OF_ART, LAW, MONEY) annotated in the Turku OntoNotes Entities Corpus were filtered out of the dataset used for training the model. Conversely, entity types that were missing from the NewsEye dataset were added during the annotation process. The data sources used in model training, validation and testing are listed below:

Dataset	Period covered by the texts	Text type	Percentage of the total data
Turku OntoNotes Entities Corpus	2000s	Online texts	23%
NewsEye dataset	1850-1950	OCR'd digitized newspaper articles	3%
Diverse document data from Finnish public administration	1970s - 2000s	OCR'd digitized documents	69%
Finnish senate documents	1916	Part manually transcribed, part HTR'd digitized documents	3%
Finnish books from Project Gutenberg	Early 20th century	OCR'd texts	1%
Theses from Finnish polytechnic universities	2000s	OCR'd texts	1%

The number of entities of each type in the training, validation and test datasets is listed below:

Number of entities per type in the data

Dataset	PERSON	ORG	LOC	GPE	PRODUCT	EVENT	DATE	JON	FIBC	NORP
Train	20211	45722	1321	19387	9571	1616	23642	2460	2384	2529
Val	2525	5517	130	2512	1217	240	3047	306	247	283
Test	2414	5577	179	2445	1097	183	2838	272	374	356
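
To make the class imbalance in the table concrete, the short sketch below computes each entity type's share of the training entities. The counts are copied from the Train row above; the variable names are illustrative.

```python
# Entity counts from the "Train" row of the table above
train_counts = {
    "PERSON": 20211, "ORG": 45722, "LOC": 1321, "GPE": 19387, "PRODUCT": 9571,
    "EVENT": 1616, "DATE": 23642, "JON": 2460, "FIBC": 2384, "NORP": 2529,
}

total = sum(train_counts.values())
# Share of each entity type as a percentage of all training entities
shares = {label: 100 * count / total for label, count in train_counts.items()}

# Print from rarest to most common
for label, share in sorted(shares.items(), key=lambda item: item[1]):
    print(f"{label:8s}{share:5.1f}%")
```

LOC and EVENT each account for a little over 1% of the training entities, which is consistent with the note that these types are underrepresented.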

Training procedure

This model was trained using an NVIDIA RTX A6000 GPU with the following hyperparameters:

learning rate: 2e-05
train batch size: 24
epochs: 10
optimizer: AdamW with betas=(0.9,0.999) and epsilon=1e-08
scheduler: linear scheduler with num_warmup_steps=round(len(train_dataloader)/5) and num_training_steps=len(train_dataloader)*epochs
maximum length of data sequence: 512
patience: 2 epochs
classifier dropout: 0.3
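
The scheduler settings above can be illustrated with a small pure-Python sketch of a linear warmup followed by linear decay. It assumes the schedule has the same shape as the one produced by get_linear_schedule_with_warmup in Transformers, which the parameter names suggest but the card does not state. The value of steps_per_epoch is a hypothetical stand-in for len(train_dataloader).

```python
def linear_schedule_factor(step, num_warmup_steps, num_training_steps):
    """Multiplier applied to the base learning rate at a given step."""
    if step < num_warmup_steps:
        # Linear warmup from 0 to 1
        return step / max(1, num_warmup_steps)
    # Linear decay from 1 to 0 over the remaining steps
    return max(0.0, (num_training_steps - step) / max(1, num_training_steps - num_warmup_steps))


steps_per_epoch = 1000  # hypothetical len(train_dataloader)
epochs = 10
base_lr = 2e-5

# Same formulas as in the hyperparameter list above
num_warmup_steps = round(steps_per_epoch / 5)
num_training_steps = steps_per_epoch * epochs

for step in (0, 100, 200, 5100, 10000):
    lr = base_lr * linear_schedule_factor(step, num_warmup_steps, num_training_steps)
    print(f"step {step:5d}: lr = {lr:.2e}")
```

Under these assumptions the warmup covers one fifth of the first epoch, after which the learning rate decays linearly to zero by the end of the last epoch.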

In the preprocessing stage, the input texts were split into chunks with a maximum length of 300 tokens, in order to avoid the tokenized chunks exceeding the maximum length of 512. Tokenization was performed using the tokenizer for the bert-base-finnish-cased-v1 model.
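
A minimal sketch of this kind of chunking is shown below. It assumes that the 300-token limit refers to whitespace-separated words, which the sub-word tokenizer may then expand to a longer sequence; the card does not say how tokens were counted or whether the splits respected sentence boundaries. The function name split_into_chunks is hypothetical.

```python
def split_into_chunks(tokens, max_len=300):
    """Split a list of tokens into consecutive chunks of at most max_len tokens."""
    return [tokens[i:i + max_len] for i in range(0, len(tokens), max_len)]


# Synthetic 700-word input used only to demonstrate the splitting
text = " ".join(f"sana{i}" for i in range(700))
chunks = split_into_chunks(text.split())

print([len(chunk) for chunk in chunks])
```

Each chunk would then be passed to the tokenizer separately, which leaves headroom for sub-word splitting below the 512-token maximum.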

The training code with instructions is available on GitHub.

Evaluation results

Evaluation results using the test dataset are listed below:

Entity	Precision	Recall	F1-score
PERSON	0.90	0.91	0.90
ORG	0.84	0.87	0.86
LOC	0.84	0.86	0.85
GPE	0.91	0.91	0.91
PRODUCT	0.73	0.77	0.75
EVENT	0.69	0.73	0.71
DATE	0.90	0.92	0.91
JON	0.83	0.95	0.89
FIBC	0.95	0.99	0.97
NORP	0.91	0.95	0.93

The metrics were calculated using the seqeval library.
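
As a reading aid for the table, the F1-score is the harmonic mean of precision and recall. The sketch below recomputes it for three rows from the rounded values above. Because the inputs are already rounded to two decimals, a recomputed value can differ from the reported one by 0.01 for some rows. The function name harmonic_f1 is illustrative and is not part of seqeval.

```python
def harmonic_f1(precision, recall):
    """F1-score as the harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision, recall, reported F1) copied from the table above
reported = {
    "FIBC": (0.95, 0.99, 0.97),
    "JON": (0.83, 0.95, 0.89),
    "EVENT": (0.69, 0.73, 0.71),
}

for label, (precision, recall, f1) in reported.items():
    print(f"{label}: recomputed {harmonic_f1(precision, recall):.2f}, reported {f1:.2f}")
```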

Acknowledgements

The model was developed in an ERDF-funded project "Using Artificial Intelligence to Improve the Quality and Usability of Digital Records" (Dalai) in 2021-2023. The purpose of the project was to develop the automation of the digitisation of cultural heritage materials and the automated description of such materials through artificial intelligence. The main target group comprises memory organisations, archives, museums and libraries that digitise and provide digital materials to their customers, as well as companies that develop services related to digitisation and the processing of digital materials.

Project partners were the National Archives of Finland, Central Archives for Finnish Business Records (Elka), South-Eastern Finland University of Applied Sciences Ltd (Xamk) and Disec Ltd.

The selection and definition of the named entity categories, the formulation of the annotation guidelines and the annotation process have been carried out in cooperation with the FIN-CLARIAH research infrastructure / University of Jyväskylä.
