intfloat committed
Commit: d829207
Parent(s): 960310a

Update README.md

Files changed (4):
  1. 1_Pooling/config.json +7 -0
  2. README.md +52 -2
  3. modules.json +20 -0
  4. sentence_bert_config.json +4 -0
1_Pooling/config.json ADDED
@@ -0,0 +1,7 @@
+ {
+     "word_embedding_dimension": 384,
+     "pooling_mode_cls_token": false,
+     "pooling_mode_mean_tokens": true,
+     "pooling_mode_max_tokens": false,
+     "pooling_mode_mean_sqrt_len_tokens": false
+ }
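This configuration selects mean pooling over token embeddings (CLS-token and max pooling are disabled). As a rough illustration of what mask-aware mean pooling does — a pure-Python sketch, not the actual sentence-transformers implementation, and with tiny 2-dim vectors instead of the model's 384:

```python
def mean_pool(token_embeddings, attention_mask):
    """Average token vectors, counting only positions where the mask is 1."""
    dim = len(token_embeddings[0])
    sums = [0.0] * dim
    count = 0
    for vec, mask in zip(token_embeddings, attention_mask):
        if mask:
            sums = [s + v for s, v in zip(sums, vec)]
            count += 1
    return [s / count for s in sums]

# Two real tokens plus one padded position (mask 0) that must be ignored.
emb = [[1.0, 3.0], [3.0, 5.0], [9.0, 9.0]]
mask = [1, 1, 0]
print(mean_pool(emb, mask))  # [2.0, 4.0]
```

The padded position contributes nothing: only the two unmasked tokens are averaged.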
README.md CHANGED
@@ -5985,7 +5985,7 @@ batch_dict = tokenizer(input_texts, max_length=512, padding=True, truncation=Tru
  outputs = model(**batch_dict)
  embeddings = average_pool(outputs.last_hidden_state, batch_dict['attention_mask'])

- # (Optionally) normalize embeddings
+ # normalize embeddings
  embeddings = F.normalize(embeddings, p=2, dim=1)
  scores = (embeddings[:2] @ embeddings[2:].T) * 100
  print(scores.tolist())
@@ -6037,11 +6037,61 @@ For all labeled datasets, we only use its training set for fine-tuning.

  For other training details, please refer to our paper at [https://arxiv.org/pdf/2212.03533.pdf](https://arxiv.org/pdf/2212.03533.pdf).

- ## Benchmark Evaluation
+ ## Benchmark Results on [Mr. TyDi](https://arxiv.org/abs/2108.08787)
+
+ | Model | Avg MRR@10 | | ar | bn | en | fi | id | ja | ko | ru | sw | te | th |
+ |-----------------------|------------|---|------|------|------|------|------|------|------|------|------|------|------|
+ | BM25 | 33.3 | | 36.7 | 41.3 | 15.1 | 28.8 | 38.2 | 21.7 | 28.1 | 32.9 | 39.6 | 42.4 | 41.7 |
+ | mDPR | 16.7 | | 26.0 | 25.8 | 16.2 | 11.3 | 14.6 | 18.1 | 21.9 | 18.5 | 7.3 | 10.6 | 13.5 |
+ | BM25 + mDPR | 41.7 | | 49.1 | 53.5 | 28.4 | 36.5 | 45.5 | 35.5 | 36.2 | 42.7 | 40.5 | 42.0 | 49.2 |
+ | | | |
+ | multilingual-e5-small | 64.4 | | 71.5 | 66.3 | 54.5 | 57.7 | 63.2 | 55.4 | 54.3 | 60.8 | 65.4 | 89.1 | 70.1 |
+ | multilingual-e5-base | 65.9 | | 72.3 | 65.0 | 58.5 | 60.8 | 64.9 | 56.6 | 55.8 | 62.7 | 69.0 | 86.6 | 72.7 |
+ | multilingual-e5-large | **70.5** | | 77.5 | 73.2 | 60.8 | 66.8 | 68.5 | 62.5 | 61.6 | 65.8 | 72.7 | 90.2 | 76.2 |
+
+ ## MTEB Benchmark Evaluation

  Check out [unilm/e5](https://github.com/microsoft/unilm/tree/master/e5) to reproduce evaluation results
  on the [BEIR](https://arxiv.org/abs/2104.08663) and [MTEB benchmark](https://arxiv.org/abs/2210.07316).

+ ## Support for Sentence Transformers
+
+ Below is an example of usage with `sentence_transformers`:
+ ```python
+ from sentence_transformers import SentenceTransformer
+ model = SentenceTransformer('intfloat/multilingual-e5-small')
+ input_texts = [
+     'query: how much protein should a female eat',
+     'query: 南瓜的家常做法',
+     "passage: As a general guideline, the CDC's average requirement of protein for women ages 19 to 70 is 46 grams per day. But, as you can see from this chart, you'll need to increase that if you're expecting or training for a marathon. Check out the chart below to see how much protein you should be eating each day.",
+     "passage: 1.清炒南瓜丝 原料:嫩南瓜半个 调料:葱、盐、白糖、鸡精 做法: 1、南瓜用刀薄薄的削去表面一层皮,用勺子刮去瓤 2、擦成细丝(没有擦菜板就用刀慢慢切成细丝) 3、锅烧热放油,入葱花煸出香味 4、入南瓜丝快速翻炒一分钟左右,放盐、一点白糖和鸡精调味出锅 2.香葱炒南瓜 原料:南瓜1只 调料:香葱、蒜末、橄榄油、盐 做法: 1、将南瓜去皮,切成片 2、油锅8成热后,将蒜末放入爆香 3、爆香后,将南瓜片放入,翻炒 4、在翻炒的同时,可以不时地往锅里加水,但不要太多 5、放入盐,炒匀 6、南瓜差不多软和绵了之后,就可以关火 7、撒入香葱,即可出锅"
+ ]
+ embeddings = model.encode(input_texts, normalize_embeddings=True)
+ ```
+
+ Package requirements:
+
+ `pip install sentence_transformers~=2.2.2`
+
+ Contributors: [michaelfeil](https://huggingface.co/michaelfeil)
+
+ ## FAQ
+
+ **1. Do I need to add the prefix "query: " and "passage: " to input texts?**
+
+ Yes, this is how the model is trained; otherwise you will see a performance degradation.
+
+ Here are some rules of thumb:
+ - Use "query: " and "passage: " correspondingly for asymmetric tasks such as passage retrieval in open QA and ad-hoc information retrieval.
+
+ - Use the "query: " prefix for symmetric tasks such as semantic similarity, bitext mining, and paraphrase retrieval.
+
+ - Use the "query: " prefix if you want to use embeddings as features, such as linear probing classification or clustering.
+
+ **2. Why are my reproduced results slightly different from those reported in the model card?**
+
+ Different versions of `transformers` and `pytorch` can cause negligible but non-zero performance differences.
+
  ## Citation

  If you find our paper or models helpful, please consider citing as follows:
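The scoring line in the README snippet above, `scores = (embeddings[:2] @ embeddings[2:].T) * 100`, computes cosine similarities scaled by 100: once rows are L2-normalized, a dot product between them equals cosine similarity. A minimal pure-Python sketch of that arithmetic (illustration only; the actual code uses `torch.nn.functional.normalize` on batched tensors):

```python
import math

def l2_normalize(vec):
    # Scale a vector to unit length, as F.normalize(..., p=2, dim=1) does per row.
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

def similarity_score(query_vec, passage_vec):
    # Dot product of unit vectors = cosine similarity; the README scales it by 100.
    q = l2_normalize(query_vec)
    p = l2_normalize(passage_vec)
    return sum(a * b for a, b in zip(q, p)) * 100

print(round(similarity_score([3.0, 4.0], [6.0, 8.0]), 6))  # parallel vectors -> 100.0
print(round(similarity_score([1.0, 0.0], [0.0, 1.0]), 6))  # orthogonal vectors -> 0.0
```

Scores therefore fall in [-100, 100], with higher values indicating closer query/passage pairs.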
modules.json ADDED
@@ -0,0 +1,20 @@
+ [
+   {
+     "idx": 0,
+     "name": "0",
+     "path": "",
+     "type": "sentence_transformers.models.Transformer"
+   },
+   {
+     "idx": 1,
+     "name": "1",
+     "path": "1_Pooling",
+     "type": "sentence_transformers.models.Pooling"
+   },
+   {
+     "idx": 2,
+     "name": "2",
+     "path": "2_Normalize",
+     "type": "sentence_transformers.models.Normalize"
+   }
+ ]
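This file declares the three-stage pipeline that sentence-transformers assembles: Transformer → Pooling → Normalize. A toy sketch of that composition with a stub "transformer" stage (hypothetical 2-dim embeddings standing in for the model's real 384-dim output):

```python
import math

def transformer_stage(text):
    # Stub: pretend each character maps to a 2-dim token embedding.
    return [[float(ord(c) % 7), 1.0] for c in text]

def pooling_stage(token_vecs):
    # Mean pooling, as selected in 1_Pooling/config.json.
    n = len(token_vecs)
    return [sum(v[i] for v in token_vecs) / n for i in range(len(token_vecs[0]))]

def normalize_stage(vec):
    # L2 normalization, the 2_Normalize module's job.
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

def encode(text):
    # Compose the stages in the order modules.json declares them.
    return normalize_stage(pooling_stage(transformer_stage(text)))

vec = encode("hello")
print(round(sum(x * x for x in vec), 6))  # unit length -> 1.0
```

Because the Normalize module runs last, `model.encode(..., normalize_embeddings=True)` in the earlier example is consistent with what the saved pipeline produces: unit-length sentence vectors.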
sentence_bert_config.json ADDED
@@ -0,0 +1,4 @@
+ {
+     "max_seq_length": 512,
+     "do_lower_case": false
+ }
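This config matches the README snippet's `max_length=512, truncation=True`: token sequences longer than 512 are truncated, and input text is not lowercased. A trivial sketch of that behavior (hypothetical helper, not library code):

```python
MAX_SEQ_LENGTH = 512  # from sentence_bert_config.json

def prepare_tokens(tokens):
    # Truncate to the model's maximum sequence length; no case folding,
    # since "do_lower_case" is false.
    return tokens[:MAX_SEQ_LENGTH]

print(len(prepare_tokens(["tok"] * 600)))   # 512
print(prepare_tokens(["Query", "Berlin"]))  # ['Query', 'Berlin'] (case preserved)
```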