datnguyen committed
Commit 5348874
1 Parent(s): 6b9990f

Update README.md

Files changed (1)
  1. README.md +85 -1
README.md CHANGED
@@ -11,4 +11,88 @@ metrics:
  - spearmanr
  pipeline_tag: sentence-similarity
  library_name: rage
- ---
+ ---
+ # Introduction
+ ## Installation 🔥
+ - We recommend `python 3.9` or higher, `torch 2.0.0` or higher, and `transformers 4.31.0` or higher.
+
+ - Currently, RagE can only be installed from source; we plan to publish it to PyPI in the future. It can be installed with the following commands (a quick import check is shown right after them):
+ ```
+ git clone https://github.com/anti-aii/RagE.git
+ cd RagE
+ pip install -e .
+ ```
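+ If the installation succeeded, a quick check of the library versions should run cleanly. This is a minimal sketch; importing the package as `rage` is an assumption based on the quick-start snippets below.
+ ```python
+ # Sanity check after installing from source.
+ # Assumption: the package is importable as `rage` (see the quick start below).
+ import torch
+ import transformers
+ import rage
+
+ print(torch.__version__)         # expect 2.0.0 or higher
+ print(transformers.__version__)  # expect 4.31.0 or higher
+ print("rage imported successfully")
+ ```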
+ ## Quick start 🥮
+ - [1. Initialize the model](#initialize_model)
+ - [2. Load a model from the Hugging Face Hub](#download_hf)
+ - [3. List of pretrained models](#list_pretrained)
+
+ We have detailed instructions for using our models for inference. See the [notebook](notebook).
+ ### 1. Initialize the model
+ <a name='initialize_model'></a>
+ Let's initialize the SentenceEmbedding model.
+
+ ```python
+ >>> import torch
+ >>> from pyvi import ViTokenizer
+ >>> from rage import SentenceEmbedding
+ >>> device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
+ >>> model = SentenceEmbedding(model_name="vinai/phobert-base-v2", torch_dtype=torch.float32, aggregation_hidden_states=False, strategy_pooling="dense_first")
+ >>> model.to(device)
+ SentenceEmbeddingConfig(model_base: {'model_type_base': 'RobertaModel', 'model_name': 'vinai/phobert-base-v2', 'type_backbone': 'mlm', 'required_grad_base_model': True, 'aggregation_hidden_states': False, 'concat_embeddings': False, 'dropout': 0.1, 'quantization_config': None}, pooling: {'strategy_pooling': 'dense_first'})
+ ```
+ Then, we can show the number of parameters in the model.
+ ```python
+ >>> model.summary_params()
+ trainable params: 135588864 || all params: 135588864 || trainable%: 100.0
+ >>> model.summary()
+ +---------------------------+-------------+------------------+
+ | Layer (type)              | Params      | Trainable params |
+ +---------------------------+-------------+------------------+
+ | model (RobertaModel)      | 134,998,272 | 134998272        |
+ | pooling (PoolingStrategy) | 590,592     | 590592           |
+ | drp1 (Dropout)            | 0           | 0                |
+ +---------------------------+-------------+------------------+
+ ```
+ Now we can use the SentenceEmbedding model to encode input sentences. The output of the model is a matrix of shape (batch, dim). Additionally, we can load weights that we have previously trained and saved.
+ ```python
+ >>> model.load("best_sup_general_embedding_phobert2.pt", key=False)
+ >>> sentences = ["Tôi đang đi học", "Bạn tên là gì?"]
+ >>> sentences = list(map(lambda x: ViTokenizer.tokenize(x), sentences))
+ >>> model.encode(sentences, batch_size=1, normalize_embedding="l2", return_tensors="np", verbose=1)
+ 2/2 [==============================] - 0s 43ms/Sample
+ array([[ 0.00281098, -0.00829096, -0.01582766, ...,  0.00878178,
+          0.01830498, -0.00459659],
+        [ 0.00249859, -0.03076724,  0.00033016, ...,  0.01299141,
+         -0.00984358, -0.00703243]], dtype=float32)
+ ```
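+ Because the embeddings above are L2-normalized, the cosine similarity of two sentences is simply the dot product of their vectors. A minimal sketch, reusing `model` and `sentences` from the snippet above:
+ ```python
+ >>> import numpy as np
+ >>> embeddings = model.encode(sentences, batch_size=1, normalize_embedding="l2", return_tensors="np")
+ >>> float(np.dot(embeddings[0], embeddings[1]))  # cosine similarity of the two sentences
+ ```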
+ ### 2. Load a model from the Hugging Face Hub
+ <a name='download_hf'></a>
+
+ First, download a pretrained model.
+ ```python
+ >>> model = SentenceEmbedding.from_pretrained('anti-ai/VieSemantic-base')
+ ```
+ Then, we encode the input sentences and compare their similarity.
+ ```python
+ >>> sentences = ["Nó rất thú_vị", "Nó không thú_vị ."]
+ >>> output = model.encode(sentences, batch_size=1, return_tensors='pt')
+ >>> torch.cosine_similarity(output[0].view(1, -1), output[1].view(1, -1)).cpu().tolist()
+ 2/2 [==============================] - 0s 40ms/Sample
+ [0.5605039596557617]
+ ```
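+ For raw (unsegmented) Vietnamese input, the word-segmentation step from section 1 can be folded into a small helper. This is a sketch, assuming `pyvi` is installed and `model` is the SentenceEmbedding loaded above; `similarity` is just an illustrative name.
+ ```python
+ >>> import torch
+ >>> from pyvi import ViTokenizer
+ >>> def similarity(sent_a, sent_b):
+ ...     # Word-segment the raw input as in section 1, then encode and compare.
+ ...     pair = [ViTokenizer.tokenize(sent_a), ViTokenizer.tokenize(sent_b)]
+ ...     emb = model.encode(pair, batch_size=1, return_tensors='pt')
+ ...     return torch.cosine_similarity(emb[0].view(1, -1), emb[1].view(1, -1)).item()
+ ...
+ >>> similarity("Nó rất thú vị", "Nó không thú vị.")
+ ```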
+
+ ### 3. List of pretrained models
+ <a name='list_pretrained'></a>
+ This list will be updated with our prominent models. Our models primarily aim to support the Vietnamese language.
+ Additionally, you can access our datasets and pretrained models at https://huggingface.co/anti-ai.
+
+ | Model Name | Model Type | #Params | Checkpoint |
+ |---|---|---|---|
+ | anti-ai/ViEmbedding-base | SentenceEmbedding | 135.5M | [model](https://huggingface.co/anti-ai/ViEmbedding-base) |
+ | anti-ai/BioViEmbedding-base-unsup | SentenceEmbedding | 135.5M | [model](https://huggingface.co/anti-ai/BioViEmbedding-base-unsup) |
+ | anti-ai/VieSemantic-base | SentenceEmbedding | 135.5M | [model](https://huggingface.co/anti-ai/VieSemantic-base) |
+
+
+ ## Contacts
+ If you have any questions about this repository, please contact me ([email protected]).