---
license: apache-2.0
language:
- es
pipeline_tag: feature-extraction
tags:
- bert
- biomedical
- lexical semantics
- bionlp
- embedding
- entity linking
- umls
---

# SapBERT-biomedical-clinical model for Spanish

## Table of contents

- [Model description](#model-description)
- [Intended uses and limitations](#intended-uses-and-limitations)
- [How to use](#how-to-use)
- [Training](#training)
- [Evaluation](#evaluation)
- [Additional information](#additional-information)
  - [Author](#author)
  - [Licensing information](#licensing-information)
  - [Citation information](#citation-information)
  - [Disclaimer](#disclaimer)

## Model description

A SapBERT model for Spanish, trained with a procedure similar to the one described by [Liu et al. (2020)](https://arxiv.org/pdf/2010.11784.pdf). The model was trained on the Spanish data from [UMLS](https://www.nlm.nih.gov/research/umls/licensedcontent/umlsknowledgesources.html) 2023AA, using [PlanTL-GOB-ES/roberta-base-biomedical-clinical-es](https://huggingface.co/PlanTL-GOB-ES/roberta-base-biomedical-clinical-es) as the base model.

## Intended uses and limitations

The model produces numerical representations (embeddings) of biomedical concepts in UMLS. These embeddings can be used for semantic similarity between biomedical concepts or for entity linking, among other tasks.

## How to use

The following script, adapted from the [original SapBERT model](https://huggingface.co/cambridgeltl/SapBERT-from-PubMedBERT-fulltext/), converts a list of strings (entity names) into embeddings.

```python
import numpy as np
import torch
from tqdm.auto import tqdm
from transformers import AutoTokenizer, AutoModel

model_name = "BSC-NLP4BIA/SapBERT-parents-from-roberta-base-biomedical-clinical-es"
# Use the GPU when available; otherwise fall back to the CPU.
device = "cuda" if torch.cuda.is_available() else "cpu"

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name).to(device)
model.eval()

# Replace with your own list of entity names in Spanish.
all_names = ["cáncer de pulmón", "fiebre", "cirugía torácica"]

bs = 128  # batch size during inference
all_embs = []
for i in tqdm(np.arange(0, len(all_names), bs)):
    toks = tokenizer(all_names[i:i+bs],
                     padding="max_length",
                     max_length=25,
                     truncation=True,
                     return_tensors="pt")
    toks = {k: v.to(device) for k, v in toks.items()}
    with torch.no_grad():  # inference only, no gradients needed
        # Use the first-token (CLS, i.e. <s> in RoBERTa) representation as the embedding.
        cls_rep = model(**toks)[0][:, 0, :]
    all_embs.append(cls_rep.cpu().numpy())

all_embs = np.concatenate(all_embs, axis=0)
```

For more details about training and evaluation, see the SapBERT [GitHub repository](https://github.com/cambridgeltl/sapbert).
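
The semantic similarity and entity linking uses mentioned earlier can be sketched as a nearest-neighbour search over embeddings such as `all_embs`. The snippet below is a minimal illustration, not the method used in the evaluation paper: the function `link_mentions`, the placeholder concept identifiers, and the toy 3-dimensional vectors are all hypothetical and stand in for real model output, and cosine similarity is an assumed choice of metric.

```python
import numpy as np


def link_mentions(mention_embs, concept_embs, concept_ids, top_k=1):
    """For each mention, return the top_k (concept_id, cosine similarity) pairs."""
    # L2-normalise the rows so that a dot product equals cosine similarity.
    m = mention_embs / np.linalg.norm(mention_embs, axis=1, keepdims=True)
    c = concept_embs / np.linalg.norm(concept_embs, axis=1, keepdims=True)
    sims = m @ c.T  # shape: (n_mentions, n_concepts)
    # Indices of the top_k most similar concepts per mention, best first.
    order = np.argsort(-sims, axis=1)[:, :top_k]
    return [
        [(concept_ids[j], float(sims[i, j])) for j in row]
        for i, row in enumerate(order)
    ]


# Toy vectors standing in for embeddings of a concept dictionary and of
# mentions found in text (hypothetical values, not produced by the model).
concept_ids = ["C_A", "C_B", "C_C"]  # placeholders, not real UMLS CUIs
concept_embs = np.array([[1.0, 0.0, 0.0],
                         [0.0, 1.0, 0.0],
                         [0.0, 0.0, 1.0]])
mention_embs = np.array([[0.9, 0.1, 0.0],
                         [0.0, 0.2, 0.8]])

results = link_mentions(mention_embs, concept_embs, concept_ids, top_k=2)
for candidates in results:
    print(candidates)
```

In practice, `concept_embs` would be obtained by encoding the names in a terminology with the script above, and `mention_embs` by encoding the mentions to be linked. For large dictionaries, an approximate nearest-neighbour index may be preferable to the dense matrix product used here.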

## Training

Training was performed using the [original SapBERT training repository](https://github.com/cambridgeltl/sapbert). The training data consisted of the Spanish entries in UMLS, together with the commercial names of drugs (even though these are in English), converted to lowercase. To train the model, a set of 15 pairs of synonymous terms was generated for each UMLS concept, treating the lexical entries of each concept as synonyms.

## Evaluation

Evaluation results for this model are reported in:

Gallego, F., López-García, G., Gasco-Sánchez, L., Krallinger, M., & Veredas, F. J. (2024, June). ClinLinker: Medical entity linking of clinical concept mentions in Spanish. In International Conference on Computational Science (pp. 266-280). Cham: Springer Nature Switzerland.

## Additional information

### Author

NLP4BIA at the Barcelona Supercomputing Center

### Licensing information

[Apache License, Version 2.0](https://www.apache.org/licenses/LICENSE-2.0)

### Citation information

```bibtex
@inproceedings{gallego2024clinlinker,
  title={Clinlinker: Medical entity linking of clinical concept mentions in spanish},
  author={Gallego, Fernando and L{\'o}pez-Garc{\'\i}a, Guillermo and Gasco-S{\'a}nchez, Luis and Krallinger, Martin and Veredas, Francisco J},
  booktitle={International Conference on Computational Science},
  pages={266--280},
  year={2024},
  organization={Springer}
}
```

### Disclaimer

The models published in this repository are intended for general-purpose use and are available to third parties. These models may contain biases and/or other undesirable distortions.

When third parties deploy or provide systems and/or services to other parties using any of these models (or using systems based on these models), or become users of the models, they should note that it is their responsibility to mitigate the risks arising from their use and, in any event, to comply with applicable regulations, including regulations regarding the use of Artificial Intelligence.