Commit e09186c by xiaowenbin (1 parent: 00e8ba3): Upload README.md
---
pipeline_tag: sentence-similarity
tags:
- sentence-transformers
- feature-extraction
- sentence-similarity
- mteb
model-index:
- name: Dmeta-embedding
  results:
  - task:
      type: STS
    dataset:
      type: C-MTEB/AFQMC
      name: MTEB AFQMC
      config: default
      split: validation
      revision: None
    metrics:
    - type: cos_sim_pearson
      value: 65.60825224706932
    - type: cos_sim_spearman
      value: 71.12862586297193
    - type: euclidean_pearson
      value: 70.18130275750404
    - type: euclidean_spearman
      value: 71.12862586297193
    - type: manhattan_pearson
      value: 70.14470398075396
    - type: manhattan_spearman
      value: 71.05226975911737
  - task:
      type: STS
    dataset:
      type: C-MTEB/ATEC
      name: MTEB ATEC
      config: default
      split: test
      revision: None
    metrics:
    - type: cos_sim_pearson
      value: 65.52386345655479
    - type: cos_sim_spearman
      value: 64.64245253181382
    - type: euclidean_pearson
      value: 73.20157662981914
    - type: euclidean_spearman
      value: 64.64245253178956
    - type: manhattan_pearson
      value: 73.22837571756348
    - type: manhattan_spearman
      value: 64.62632334391418
  - task:
      type: Classification
    dataset:
      type: mteb/amazon_reviews_multi
      name: MTEB AmazonReviewsClassification (zh)
      config: zh
      split: test
      revision: 1399c76144fd37290681b995c656ef9b2e06e26d
    metrics:
    - type: accuracy
      value: 44.925999999999995
    - type: f1
      value: 42.82555191308971
  - task:
      type: STS
    dataset:
      type: C-MTEB/BQ
      name: MTEB BQ
      config: default
      split: test
      revision: None
    metrics:
    - type: cos_sim_pearson
      value: 71.35236446393156
    - type: cos_sim_spearman
      value: 72.29629643702184
    - type: euclidean_pearson
      value: 70.94570179874498
    - type: euclidean_spearman
      value: 72.29629297226953
    - type: manhattan_pearson
      value: 70.84463025501125
    - type: manhattan_spearman
      value: 72.24527021975821
  - task:
      type: Clustering
    dataset:
      type: C-MTEB/CLSClusteringP2P
      name: MTEB CLSClusteringP2P
      config: default
      split: test
      revision: None
    metrics:
    - type: v_measure
      value: 40.24232916894152
  - task:
      type: Clustering
    dataset:
      type: C-MTEB/CLSClusteringS2S
      name: MTEB CLSClusteringS2S
      config: default
      split: test
      revision: None
    metrics:
    - type: v_measure
      value: 39.167806226929706
  - task:
      type: Reranking
    dataset:
      type: C-MTEB/CMedQAv1-reranking
      name: MTEB CMedQAv1
      config: default
      split: test
      revision: None
    metrics:
    - type: map
      value: 88.48837920106357
    - type: mrr
      value: 90.36861111111111
  - task:
      type: Reranking
    dataset:
      type: C-MTEB/CMedQAv2-reranking
      name: MTEB CMedQAv2
      config: default
      split: test
      revision: None
    metrics:
    - type: map
      value: 89.17878171657071
    - type: mrr
      value: 91.35805555555555
  - task:
      type: Retrieval
    dataset:
      type: C-MTEB/CmedqaRetrieval
      name: MTEB CmedqaRetrieval
      config: default
      split: dev
      revision: None
    metrics:
    - type: map_at_1
      value: 25.751
    - type: map_at_10
      value: 38.946
    - type: map_at_100
      value: 40.855000000000004
    - type: map_at_1000
      value: 40.953
    - type: map_at_3
      value: 34.533
    - type: map_at_5
      value: 36.905
    - type: mrr_at_1
      value: 39.235
    - type: mrr_at_10
      value: 47.713
    - type: mrr_at_100
      value: 48.71
    - type: mrr_at_1000
      value: 48.747
    - type: mrr_at_3
      value: 45.086
    - type: mrr_at_5
      value: 46.498
    - type: ndcg_at_1
      value: 39.235
    - type: ndcg_at_10
      value: 45.831
    - type: ndcg_at_100
      value: 53.162
    - type: ndcg_at_1000
      value: 54.800000000000004
    - type: ndcg_at_3
      value: 40.188
    - type: ndcg_at_5
      value: 42.387
    - type: precision_at_1
      value: 39.235
    - type: precision_at_10
      value: 10.273
    - type: precision_at_100
      value: 1.627
    - type: precision_at_1000
      value: 0.183
    - type: precision_at_3
      value: 22.772000000000002
    - type: precision_at_5
      value: 16.524
    - type: recall_at_1
      value: 25.751
    - type: recall_at_10
      value: 57.411
    - type: recall_at_100
      value: 87.44
    - type: recall_at_1000
      value: 98.386
    - type: recall_at_3
      value: 40.416000000000004
    - type: recall_at_5
      value: 47.238
  - task:
      type: PairClassification
    dataset:
      type: C-MTEB/CMNLI
      name: MTEB Cmnli
      config: default
      split: validation
      revision: None
    metrics:
    - type: cos_sim_accuracy
      value: 83.59591100420926
    - type: cos_sim_ap
      value: 90.65538153970263
    - type: cos_sim_f1
      value: 84.76466651795673
    - type: cos_sim_precision
      value: 81.04073363190446
    - type: cos_sim_recall
      value: 88.84732288987608
    - type: dot_accuracy
      value: 83.59591100420926
    - type: dot_ap
      value: 90.64355541781003
    - type: dot_f1
      value: 84.76466651795673
    - type: dot_precision
      value: 81.04073363190446
    - type: dot_recall
      value: 88.84732288987608
    - type: euclidean_accuracy
      value: 83.59591100420926
    - type: euclidean_ap
      value: 90.6547878194287
    - type: euclidean_f1
      value: 84.76466651795673
    - type: euclidean_precision
      value: 81.04073363190446
    - type: euclidean_recall
      value: 88.84732288987608
    - type: manhattan_accuracy
      value: 83.51172579675286
    - type: manhattan_ap
      value: 90.59941589844144
    - type: manhattan_f1
      value: 84.51827242524917
    - type: manhattan_precision
      value: 80.28613507258574
    - type: manhattan_recall
      value: 89.22141688099134
    - type: max_accuracy
      value: 83.59591100420926
    - type: max_ap
      value: 90.65538153970263
    - type: max_f1
      value: 84.76466651795673
  - task:
      type: Retrieval
    dataset:
      type: C-MTEB/CovidRetrieval
      name: MTEB CovidRetrieval
      config: default
      split: dev
      revision: None
    metrics:
    - type: map_at_1
      value: 63.251000000000005
    - type: map_at_10
      value: 72.442
    - type: map_at_100
      value: 72.79299999999999
    - type: map_at_1000
      value: 72.80499999999999
    - type: map_at_3
      value: 70.293
    - type: map_at_5
      value: 71.571
    - type: mrr_at_1
      value: 63.541000000000004
    - type: mrr_at_10
      value: 72.502
    - type: mrr_at_100
      value: 72.846
    - type: mrr_at_1000
      value: 72.858
    - type: mrr_at_3
      value: 70.39
    - type: mrr_at_5
      value: 71.654
    - type: ndcg_at_1
      value: 63.541000000000004
    - type: ndcg_at_10
      value: 76.774
    - type: ndcg_at_100
      value: 78.389
    - type: ndcg_at_1000
      value: 78.678
    - type: ndcg_at_3
      value: 72.47
    - type: ndcg_at_5
      value: 74.748
    - type: precision_at_1
      value: 63.541000000000004
    - type: precision_at_10
      value: 9.115
    - type: precision_at_100
      value: 0.9860000000000001
    - type: precision_at_1000
      value: 0.101
    - type: precision_at_3
      value: 26.379
    - type: precision_at_5
      value: 16.965
    - type: recall_at_1
      value: 63.251000000000005
    - type: recall_at_10
      value: 90.253
    - type: recall_at_100
      value: 97.576
    - type: recall_at_1000
      value: 99.789
    - type: recall_at_3
      value: 78.635
    - type: recall_at_5
      value: 84.141
  - task:
      type: Retrieval
    dataset:
      type: C-MTEB/DuRetrieval
      name: MTEB DuRetrieval
      config: default
      split: dev
      revision: None
    metrics:
    - type: map_at_1
      value: 23.597
    - type: map_at_10
      value: 72.411
    - type: map_at_100
      value: 75.58500000000001
    - type: map_at_1000
      value: 75.64800000000001
    - type: map_at_3
      value: 49.61
    - type: map_at_5
      value: 62.527
    - type: mrr_at_1
      value: 84.65
    - type: mrr_at_10
      value: 89.43900000000001
    - type: mrr_at_100
      value: 89.525
    - type: mrr_at_1000
      value: 89.529
    - type: mrr_at_3
      value: 89
    - type: mrr_at_5
      value: 89.297
    - type: ndcg_at_1
      value: 84.65
    - type: ndcg_at_10
      value: 81.47
    - type: ndcg_at_100
      value: 85.198
    - type: ndcg_at_1000
      value: 85.828
    - type: ndcg_at_3
      value: 79.809
    - type: ndcg_at_5
      value: 78.55
    - type: precision_at_1
      value: 84.65
    - type: precision_at_10
      value: 39.595
    - type: precision_at_100
      value: 4.707
    - type: precision_at_1000
      value: 0.485
    - type: precision_at_3
      value: 71.61699999999999
    - type: precision_at_5
      value: 60.45
    - type: recall_at_1
      value: 23.597
    - type: recall_at_10
      value: 83.34
    - type: recall_at_100
      value: 95.19800000000001
    - type: recall_at_1000
      value: 98.509
    - type: recall_at_3
      value: 52.744
    - type: recall_at_5
      value: 68.411
  - task:
      type: Retrieval
    dataset:
      type: C-MTEB/EcomRetrieval
      name: MTEB EcomRetrieval
      config: default
      split: dev
      revision: None
    metrics:
    - type: map_at_1
      value: 53.1
    - type: map_at_10
      value: 63.359
    - type: map_at_100
      value: 63.9
    - type: map_at_1000
      value: 63.909000000000006
    - type: map_at_3
      value: 60.95
    - type: map_at_5
      value: 62.305
    - type: mrr_at_1
      value: 53.1
    - type: mrr_at_10
      value: 63.359
    - type: mrr_at_100
      value: 63.9
    - type: mrr_at_1000
      value: 63.909000000000006
    - type: mrr_at_3
      value: 60.95
    - type: mrr_at_5
      value: 62.305
    - type: ndcg_at_1
      value: 53.1
    - type: ndcg_at_10
      value: 68.418
    - type: ndcg_at_100
      value: 70.88499999999999
    - type: ndcg_at_1000
      value: 71.135
    - type: ndcg_at_3
      value: 63.50599999999999
    - type: ndcg_at_5
      value: 65.92
    - type: precision_at_1
      value: 53.1
    - type: precision_at_10
      value: 8.43
    - type: precision_at_100
      value: 0.955
    - type: precision_at_1000
      value: 0.098
    - type: precision_at_3
      value: 23.633000000000003
    - type: precision_at_5
      value: 15.340000000000002
    - type: recall_at_1
      value: 53.1
    - type: recall_at_10
      value: 84.3
    - type: recall_at_100
      value: 95.5
    - type: recall_at_1000
      value: 97.5
    - type: recall_at_3
      value: 70.89999999999999
    - type: recall_at_5
      value: 76.7
  - task:
      type: Classification
    dataset:
      type: C-MTEB/IFlyTek-classification
      name: MTEB IFlyTek
      config: default
      split: validation
      revision: None
    metrics:
    - type: accuracy
      value: 48.303193535975375
    - type: f1
      value: 35.96559358693866
  - task:
      type: Classification
    dataset:
      type: C-MTEB/JDReview-classification
      name: MTEB JDReview
      config: default
      split: test
      revision: None
    metrics:
    - type: accuracy
      value: 85.06566604127579
    - type: ap
      value: 52.0596483757231
    - type: f1
      value: 79.5196835127668
  - task:
      type: STS
    dataset:
      type: C-MTEB/LCQMC
      name: MTEB LCQMC
      config: default
      split: test
      revision: None
    metrics:
    - type: cos_sim_pearson
      value: 74.48499423626059
    - type: cos_sim_spearman
      value: 78.75806756061169
    - type: euclidean_pearson
      value: 78.47917601852879
    - type: euclidean_spearman
      value: 78.75807199272622
    - type: manhattan_pearson
      value: 78.40207586289772
    - type: manhattan_spearman
      value: 78.6911776964119
  - task:
      type: Reranking
    dataset:
      type: C-MTEB/Mmarco-reranking
      name: MTEB MMarcoReranking
      config: default
      split: dev
      revision: None
    metrics:
    - type: map
      value: 24.75987466552363
    - type: mrr
      value: 23.40515873015873
  - task:
      type: Retrieval
    dataset:
      type: C-MTEB/MMarcoRetrieval
      name: MTEB MMarcoRetrieval
      config: default
      split: dev
      revision: None
    metrics:
    - type: map_at_1
      value: 58.026999999999994
    - type: map_at_10
      value: 67.50699999999999
    - type: map_at_100
      value: 67.946
    - type: map_at_1000
      value: 67.96600000000001
    - type: map_at_3
      value: 65.503
    - type: map_at_5
      value: 66.649
    - type: mrr_at_1
      value: 60.20100000000001
    - type: mrr_at_10
      value: 68.271
    - type: mrr_at_100
      value: 68.664
    - type: mrr_at_1000
      value: 68.682
    - type: mrr_at_3
      value: 66.47800000000001
    - type: mrr_at_5
      value: 67.499
    - type: ndcg_at_1
      value: 60.20100000000001
    - type: ndcg_at_10
      value: 71.697
    - type: ndcg_at_100
      value: 73.736
    - type: ndcg_at_1000
      value: 74.259
    - type: ndcg_at_3
      value: 67.768
    - type: ndcg_at_5
      value: 69.72
    - type: precision_at_1
      value: 60.20100000000001
    - type: precision_at_10
      value: 8.927999999999999
    - type: precision_at_100
      value: 0.9950000000000001
    - type: precision_at_1000
      value: 0.104
    - type: precision_at_3
      value: 25.883
    - type: precision_at_5
      value: 16.55
    - type: recall_at_1
      value: 58.026999999999994
    - type: recall_at_10
      value: 83.966
    - type: recall_at_100
      value: 93.313
    - type: recall_at_1000
      value: 97.426
    - type: recall_at_3
      value: 73.342
    - type: recall_at_5
      value: 77.997
  - task:
      type: Classification
    dataset:
      type: mteb/amazon_massive_intent
      name: MTEB MassiveIntentClassification (zh-CN)
      config: zh-CN
      split: test
      revision: 31efe3c427b0bae9c22cbb560b8f15491cc6bed7
    metrics:
    - type: accuracy
      value: 71.1600537995965
    - type: f1
      value: 68.8126216609964
  - task:
      type: Classification
    dataset:
      type: mteb/amazon_massive_scenario
      name: MTEB MassiveScenarioClassification (zh-CN)
      config: zh-CN
      split: test
      revision: 7d571f92784cd94a019292a1f45445077d0ef634
    metrics:
    - type: accuracy
      value: 73.54068594485541
    - type: f1
      value: 73.46845879869848
  - task:
      type: Retrieval
    dataset:
      type: C-MTEB/MedicalRetrieval
      name: MTEB MedicalRetrieval
      config: default
      split: dev
      revision: None
    metrics:
    - type: map_at_1
      value: 54.900000000000006
    - type: map_at_10
      value: 61.363
    - type: map_at_100
      value: 61.924
    - type: map_at_1000
      value: 61.967000000000006
    - type: map_at_3
      value: 59.767
    - type: map_at_5
      value: 60.802
    - type: mrr_at_1
      value: 55.1
    - type: mrr_at_10
      value: 61.454
    - type: mrr_at_100
      value: 62.016000000000005
    - type: mrr_at_1000
      value: 62.059
    - type: mrr_at_3
      value: 59.882999999999996
    - type: mrr_at_5
      value: 60.893
    - type: ndcg_at_1
      value: 54.900000000000006
    - type: ndcg_at_10
      value: 64.423
    - type: ndcg_at_100
      value: 67.35900000000001
    - type: ndcg_at_1000
      value: 68.512
    - type: ndcg_at_3
      value: 61.224000000000004
    - type: ndcg_at_5
      value: 63.083
    - type: precision_at_1
      value: 54.900000000000006
    - type: precision_at_10
      value: 7.3999999999999995
    - type: precision_at_100
      value: 0.882
    - type: precision_at_1000
      value: 0.097
    - type: precision_at_3
      value: 21.8
    - type: precision_at_5
      value: 13.98
    - type: recall_at_1
      value: 54.900000000000006
    - type: recall_at_10
      value: 74
    - type: recall_at_100
      value: 88.2
    - type: recall_at_1000
      value: 97.3
    - type: recall_at_3
      value: 65.4
    - type: recall_at_5
      value: 69.89999999999999
  - task:
      type: Classification
    dataset:
      type: C-MTEB/MultilingualSentiment-classification
      name: MTEB MultilingualSentiment
      config: default
      split: validation
      revision: None
    metrics:
    - type: accuracy
      value: 75.15666666666667
    - type: f1
      value: 74.8306375354435
  - task:
      type: PairClassification
    dataset:
      type: C-MTEB/OCNLI
      name: MTEB Ocnli
      config: default
      split: validation
      revision: None
    metrics:
    - type: cos_sim_accuracy
      value: 83.10774228478614
    - type: cos_sim_ap
      value: 87.17679348388666
    - type: cos_sim_f1
      value: 84.59302325581395
    - type: cos_sim_precision
      value: 78.15577439570276
    - type: cos_sim_recall
      value: 92.18585005279832
    - type: dot_accuracy
      value: 83.10774228478614
    - type: dot_ap
      value: 87.17679348388666
    - type: dot_f1
      value: 84.59302325581395
    - type: dot_precision
      value: 78.15577439570276
    - type: dot_recall
      value: 92.18585005279832
    - type: euclidean_accuracy
      value: 83.10774228478614
    - type: euclidean_ap
      value: 87.17679348388666
    - type: euclidean_f1
      value: 84.59302325581395
    - type: euclidean_precision
      value: 78.15577439570276
    - type: euclidean_recall
      value: 92.18585005279832
    - type: manhattan_accuracy
      value: 82.67460747157553
    - type: manhattan_ap
      value: 86.94296334435238
    - type: manhattan_f1
      value: 84.32327166504382
    - type: manhattan_precision
      value: 78.22944896115628
    - type: manhattan_recall
      value: 91.4466737064414
    - type: max_accuracy
      value: 83.10774228478614
    - type: max_ap
      value: 87.17679348388666
    - type: max_f1
      value: 84.59302325581395
  - task:
      type: Classification
    dataset:
      type: C-MTEB/OnlineShopping-classification
      name: MTEB OnlineShopping
      config: default
      split: test
      revision: None
    metrics:
    - type: accuracy
      value: 93.24999999999999
    - type: ap
      value: 90.98617641063584
    - type: f1
      value: 93.23447883650289
  - task:
      type: STS
    dataset:
      type: C-MTEB/PAWSX
      name: MTEB PAWSX
      config: default
      split: test
      revision: None
    metrics:
    - type: cos_sim_pearson
      value: 41.071417937737856
    - type: cos_sim_spearman
      value: 45.049199344455424
    - type: euclidean_pearson
      value: 44.913450096830786
    - type: euclidean_spearman
      value: 45.05733424275291
    - type: manhattan_pearson
      value: 44.881623825912065
    - type: manhattan_spearman
      value: 44.989923561416596
  - task:
      type: STS
    dataset:
      type: C-MTEB/QBQTC
      name: MTEB QBQTC
      config: default
      split: test
      revision: None
    metrics:
    - type: cos_sim_pearson
      value: 41.38238052689359
    - type: cos_sim_spearman
      value: 42.61949690594399
    - type: euclidean_pearson
      value: 40.61261500356766
    - type: euclidean_spearman
      value: 42.619626605620724
    - type: manhattan_pearson
      value: 40.8886109204474
    - type: manhattan_spearman
      value: 42.75791523010463
  - task:
      type: STS
    dataset:
      type: mteb/sts22-crosslingual-sts
      name: MTEB STS22 (zh)
      config: zh
      split: test
      revision: 6d1ba47164174a496b7fa5d3569dae26a6813b80
    metrics:
    - type: cos_sim_pearson
      value: 62.10977863727196
    - type: cos_sim_spearman
      value: 63.843727112473225
    - type: euclidean_pearson
      value: 63.25133487817196
    - type: euclidean_spearman
      value: 63.843727112473225
    - type: manhattan_pearson
      value: 63.58749018644103
    - type: manhattan_spearman
      value: 63.83820575456674
  - task:
      type: STS
    dataset:
      type: C-MTEB/STSB
      name: MTEB STSB
      config: default
      split: test
      revision: None
    metrics:
    - type: cos_sim_pearson
      value: 79.30616496720054
    - type: cos_sim_spearman
      value: 80.767935782436
    - type: euclidean_pearson
      value: 80.4160642670106
    - type: euclidean_spearman
      value: 80.76820284024356
    - type: manhattan_pearson
      value: 80.27318714580251
    - type: manhattan_spearman
      value: 80.61030164164964
  - task:
      type: Reranking
    dataset:
      type: C-MTEB/T2Reranking
      name: MTEB T2Reranking
      config: default
      split: dev
      revision: None
    metrics:
    - type: map
      value: 66.26242871142425
    - type: mrr
      value: 76.20689863623174
  - task:
      type: Retrieval
    dataset:
      type: C-MTEB/T2Retrieval
      name: MTEB T2Retrieval
      config: default
      split: dev
      revision: None
    metrics:
    - type: map_at_1
      value: 26.240999999999996
    - type: map_at_10
      value: 73.009
    - type: map_at_100
      value: 76.893
    - type: map_at_1000
      value: 76.973
    - type: map_at_3
      value: 51.339
    - type: map_at_5
      value: 63.003
    - type: mrr_at_1
      value: 87.458
    - type: mrr_at_10
      value: 90.44
    - type: mrr_at_100
      value: 90.558
    - type: mrr_at_1000
      value: 90.562
    - type: mrr_at_3
      value: 89.89
    - type: mrr_at_5
      value: 90.231
    - type: ndcg_at_1
      value: 87.458
    - type: ndcg_at_10
      value: 81.325
    - type: ndcg_at_100
      value: 85.61999999999999
    - type: ndcg_at_1000
      value: 86.394
    - type: ndcg_at_3
      value: 82.796
    - type: ndcg_at_5
      value: 81.219
    - type: precision_at_1
      value: 87.458
    - type: precision_at_10
      value: 40.534
    - type: precision_at_100
      value: 4.96
    - type: precision_at_1000
      value: 0.514
    - type: precision_at_3
      value: 72.444
    - type: precision_at_5
      value: 60.601000000000006
    - type: recall_at_1
      value: 26.240999999999996
    - type: recall_at_10
      value: 80.42
    - type: recall_at_100
      value: 94.118
    - type: recall_at_1000
      value: 98.02199999999999
    - type: recall_at_3
      value: 53.174
    - type: recall_at_5
      value: 66.739
  - task:
      type: Classification
    dataset:
      type: C-MTEB/TNews-classification
      name: MTEB TNews
      config: default
      split: validation
      revision: None
    metrics:
    - type: accuracy
      value: 52.40899999999999
    - type: f1
      value: 50.68532128056062
  - task:
      type: Clustering
    dataset:
      type: C-MTEB/ThuNewsClusteringP2P
      name: MTEB ThuNewsClusteringP2P
      config: default
      split: test
      revision: None
    metrics:
    - type: v_measure
      value: 65.57616085176686
  - task:
      type: Clustering
    dataset:
      type: C-MTEB/ThuNewsClusteringS2S
      name: MTEB ThuNewsClusteringS2S
      config: default
      split: test
      revision: None
    metrics:
    - type: v_measure
      value: 58.844999922904925
  - task:
      type: Retrieval
    dataset:
      type: C-MTEB/VideoRetrieval
      name: MTEB VideoRetrieval
      config: default
      split: dev
      revision: None
    metrics:
    - type: map_at_1
      value: 58.4
    - type: map_at_10
      value: 68.64
    - type: map_at_100
      value: 69.062
    - type: map_at_1000
      value: 69.073
    - type: map_at_3
      value: 66.567
    - type: map_at_5
      value: 67.89699999999999
    - type: mrr_at_1
      value: 58.4
    - type: mrr_at_10
      value: 68.64
    - type: mrr_at_100
      value: 69.062
    - type: mrr_at_1000
      value: 69.073
    - type: mrr_at_3
      value: 66.567
    - type: mrr_at_5
      value: 67.89699999999999
    - type: ndcg_at_1
      value: 58.4
    - type: ndcg_at_10
      value: 73.30600000000001
    - type: ndcg_at_100
      value: 75.276
    - type: ndcg_at_1000
      value: 75.553
    - type: ndcg_at_3
      value: 69.126
    - type: ndcg_at_5
      value: 71.519
    - type: precision_at_1
      value: 58.4
    - type: precision_at_10
      value: 8.780000000000001
    - type: precision_at_100
      value: 0.968
    - type: precision_at_1000
      value: 0.099
    - type: precision_at_3
      value: 25.5
    - type: precision_at_5
      value: 16.46
    - type: recall_at_1
      value: 58.4
    - type: recall_at_10
      value: 87.8
    - type: recall_at_100
      value: 96.8
    - type: recall_at_1000
      value: 99
    - type: recall_at_3
      value: 76.5
    - type: recall_at_5
      value: 82.3
  - task:
      type: Classification
    dataset:
      type: C-MTEB/waimai-classification
      name: MTEB Waimai
      config: default
      split: test
      revision: None
    metrics:
    - type: accuracy
      value: 86.21000000000001
    - type: ap
      value: 69.17460264576461
    - type: f1
      value: 84.68032984659226
license: apache-2.0
language:
- zh
- en
---

<div align="center">
    <img src="logo.png" alt="icon" width="100px"/>
</div>

<h1 align="center">Dmeta-embedding</h1>
<h4 align="center">
    <p>
        <a href="https://huggingface.co/DMetaSoul/Dmeta-embedding/README.md">English</a> |
        <a href="https://huggingface.co/DMetaSoul/Dmeta-embedding/README_zh.md">中文</a>
    </p>
    <p>
        <a href="#usage">Usage</a> |
        <a href="#evaluation">Evaluation (MTEB)</a> |
        <a href="#faq">FAQ</a> |
        <a href="#contact">Contact</a> |
        <a href="#license">License (Free)</a>
    </p>
</h4>

**Update News**

- **2024.02.07**, the **Embedding API** service based on the Dmeta-embedding model has entered internal testing. [Click the link](https://dmetasoul.feishu.cn/share/base/form/shrcnu7mN1BDwKFfgGXG9Rb1yDf) to apply, and you will receive **400M tokens** for free, enough to encode roughly gigabytes of Chinese text.

  - Our original intention: let everyone use embedding technology at low cost, focus on their own business and product services, and leave the complex technical parts to us.
  - How to apply and use: [click the link](https://dmetasoul.feishu.cn/share/base/form/shrcnu7mN1BDwKFfgGXG9Rb1yDf) to submit a form. We will reply via <[email protected]> within 48 hours. To stay compatible with the large language model (LLM) ecosystem, our Embedding API is used in the same way as OpenAI's; the specific usage is explained in the reply email.
  - Join us: we will continue working on large language models/AIGC to bring valuable technologies to the community. You can [click on the picture](https://huggingface.co/DMetaSoul/Dmeta-embedding/resolve/main/weixin.jpeg) and scan the QR code to join our WeChat community.

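Since the Embedding API follows OpenAI's embeddings interface, an existing OpenAI-style request works unchanged. Below is a minimal, hypothetical sketch: the endpoint URL and model id are placeholders (the real values arrive in the application reply email), and the request is left commented out so it is not sent without a key.

```python
import json
import os
import urllib.request

# Hypothetical values: the actual base URL and model id are provided
# in the reply email after your application is approved.
API_BASE = "https://api.example-dmeta.com/v1"  # placeholder endpoint

payload = {
    "model": "dmeta-embedding",  # placeholder model id
    "input": ["胡子长得太快怎么办?", "在香港哪里买手表好"],
}

req = urllib.request.Request(
    f"{API_BASE}/embeddings",
    data=json.dumps(payload).encode("utf-8"),
    headers={
        "Content-Type": "application/json",
        "Authorization": f"Bearer {os.environ.get('DMETA_API_KEY', '')}",
    },
)
# resp = urllib.request.urlopen(req)  # uncomment once you have an API key
# embeddings = [d["embedding"] for d in json.load(resp)["data"]]
```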
------

**Dmeta-embedding** is a cross-domain, cross-task, out-of-the-box Chinese embedding model. It is suitable for scenarios such as search engines, Q&A, intelligent customer service, LLM+RAG, etc. It supports inference with tools like Transformers/Sentence-Transformers/Langchain.

Features:

- Excellent cross-domain and cross-scenario generalization; currently ranked second on the **[MTEB](https://huggingface.co/spaces/mteb/leaderboard) Chinese leaderboard** (2024.01.25).
- The model weighs just **400MB**, which greatly reduces inference cost.
- The context window is up to **1024** tokens, well suited to long-text retrieval, RAG and similar scenarios.

1103
## Usage

The model supports inference through frameworks such as [Sentence-Transformers](#sentence-transformers), [Langchain](#langchain), and [Huggingface Transformers](#huggingface-transformers). See the examples below for specific usage.


### Sentence-Transformers

Load and run Dmeta-embedding via [sentence-transformers](https://www.SBERT.net) as follows:

```
pip install -U sentence-transformers
```

1116
```python
from sentence_transformers import SentenceTransformer

texts1 = ["胡子长得太快怎么办?", "在香港哪里买手表好"]
texts2 = ["胡子长得快怎么办?", "怎样使胡子不浓密!", "香港买手表哪里好", "在杭州手机到哪里买"]

model = SentenceTransformer('DMetaSoul/Dmeta-embedding')
# L2-normalize the embeddings so the dot product below equals cosine similarity
embs1 = model.encode(texts1, normalize_embeddings=True)
embs2 = model.encode(texts2, normalize_embeddings=True)

similarity = embs1 @ embs2.T
print(similarity)

# For each query text, rank the candidate texts by similarity score
for i in range(len(texts1)):
    scores = []
    for j in range(len(texts2)):
        scores.append([texts2[j], similarity[i][j]])
    scores = sorted(scores, key=lambda x: x[1], reverse=True)

    print(f"查询文本:{texts1[i]}")
    for text2, score in scores:
        print(f"相似文本:{text2},打分:{score}")
    print()
```
1140

Output:

```
查询文本:胡子长得太快怎么办?
相似文本:胡子长得快怎么办?,打分:0.9535336494445801
相似文本:怎样使胡子不浓密!,打分:0.6776421070098877
相似文本:香港买手表哪里好,打分:0.2297907918691635
相似文本:在杭州手机到哪里买,打分:0.11386542022228241

查询文本:在香港哪里买手表好
相似文本:香港买手表哪里好,打分:0.9843372106552124
相似文本:在杭州手机到哪里买,打分:0.45211508870124817
相似文本:胡子长得快怎么办?,打分:0.19985519349575043
相似文本:怎样使胡子不浓密!,打分:0.18558596074581146
```
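
Because the embeddings are L2-normalized (`normalize_embeddings=True`), the matrix product above is exactly cosine similarity. A quick self-contained check with toy vectors (illustrative values, not model outputs):

```python
import numpy as np

def cosine(a, b):
    # classic definition: dot product over the product of norms
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

a = np.array([1.0, 2.0, 3.0])
b = np.array([2.0, 0.5, 1.0])

# L2-normalize first, then a plain dot product gives the same value
an = a / np.linalg.norm(a)
bn = b / np.linalg.norm(b)

assert np.isclose(an @ bn, cosine(a, b))
print(an @ bn)
```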
1156
+
1157
+ ### Langchain
1158
+
1159
+ Load and inference Dmeta-embedding via [langchain](https://www.langchain.com/) as following:
1160
+
1161
+ ```
1162
+ pip install -U langchain
1163
+ ```
1164
+
1165
```python
import torch
import numpy as np
from langchain.embeddings import HuggingFaceEmbeddings

model_name = "DMetaSoul/Dmeta-embedding"
model_kwargs = {'device': 'cuda' if torch.cuda.is_available() else 'cpu'}
encode_kwargs = {'normalize_embeddings': True}  # set True so dot products give cosine similarity

model = HuggingFaceEmbeddings(
    model_name=model_name,
    model_kwargs=model_kwargs,
    encode_kwargs=encode_kwargs,
)

texts1 = ["胡子长得太快怎么办?", "在香港哪里买手表好"]
texts2 = ["胡子长得快怎么办?", "怎样使胡子不浓密!", "香港买手表哪里好", "在杭州手机到哪里买"]

# embed_documents returns plain Python lists; convert to arrays for matrix math
embs1 = model.embed_documents(texts1)
embs2 = model.embed_documents(texts2)
embs1, embs2 = np.array(embs1), np.array(embs2)

similarity = embs1 @ embs2.T
print(similarity)

for i in range(len(texts1)):
    scores = []
    for j in range(len(texts2)):
        scores.append([texts2[j], similarity[i][j]])
    scores = sorted(scores, key=lambda x: x[1], reverse=True)

    print(f"查询文本:{texts1[i]}")
    for text2, score in scores:
        print(f"相似文本:{text2},打分:{score}")
    print()
```
1201
+
1202
+ ### HuggingFace Transformers
1203
+
1204
+ Load and inference Dmeta-embedding via [HuggingFace Transformers](https://huggingface.co/docs/transformers/index) as following:
1205
+
1206
+ ```
1207
+ pip install -U transformers
1208
+ ```
1209
+
1210
```python
import torch
from transformers import AutoTokenizer, AutoModel


def mean_pooling(model_output, attention_mask):
    # Average token embeddings, masking out padding positions
    token_embeddings = model_output[0]  # first element of model_output contains all token embeddings
    input_mask_expanded = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
    return torch.sum(token_embeddings * input_mask_expanded, 1) / torch.clamp(input_mask_expanded.sum(1), min=1e-9)

def cls_pooling(model_output):
    # Take the embedding of the first ([CLS]) token
    return model_output[0][:, 0]


texts1 = ["胡子长得太快怎么办?", "在香港哪里买手表好"]
texts2 = ["胡子长得快怎么办?", "怎样使胡子不浓密!", "香港买手表哪里好", "在杭州手机到哪里买"]

tokenizer = AutoTokenizer.from_pretrained('DMetaSoul/Dmeta-embedding')
model = AutoModel.from_pretrained('DMetaSoul/Dmeta-embedding')
model.eval()

with torch.no_grad():
    inputs1 = tokenizer(texts1, padding=True, truncation=True, return_tensors='pt')
    inputs2 = tokenizer(texts2, padding=True, truncation=True, return_tensors='pt')

    model_output1 = model(**inputs1)
    model_output2 = model(**inputs2)
    # CLS pooling is used here; mean_pooling above is an alternative
    embs1, embs2 = cls_pooling(model_output1), cls_pooling(model_output2)
    embs1 = torch.nn.functional.normalize(embs1, p=2, dim=1).numpy()
    embs2 = torch.nn.functional.normalize(embs2, p=2, dim=1).numpy()

similarity = embs1 @ embs2.T
print(similarity)

for i in range(len(texts1)):
    scores = []
    for j in range(len(texts2)):
        scores.append([texts2[j], similarity[i][j]])
    scores = sorted(scores, key=lambda x: x[1], reverse=True)

    print(f"查询文本:{texts1[i]}")
    for text2, score in scores:
        print(f"相似文本:{text2},打分:{score}")
    print()
```
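
The `mean_pooling` helper above averages token embeddings while ignoring padding positions. The same arithmetic on toy numpy arrays (illustrative values, not real model outputs):

```python
import numpy as np

# one "sentence" of 4 token embeddings (dim 2); the last token is padding
token_embeddings = np.array([[1.0, 2.0],
                             [3.0, 4.0],
                             [5.0, 6.0],
                             [9.0, 9.0]])  # padding row, must be ignored
attention_mask = np.array([1, 1, 1, 0])

mask = attention_mask[:, None]                     # broadcast mask over the embedding dim
summed = (token_embeddings * mask).sum(axis=0)     # sum of real tokens only
pooled = summed / np.clip(mask.sum(), 1e-9, None)  # divide by the number of real tokens

print(pooled)  # mean of the first three rows: [3. 4.]
```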
1256

## Evaluation

The Dmeta-embedding model ranks first among open-source models on the [MTEB Chinese leaderboard](https://huggingface.co/spaces/mteb/leaderboard) (2024.01.25; the overall first place, Baichuan, is not open source). For the evaluation data and code, please refer to the official MTEB [repository](https://github.com/embeddings-benchmark/mteb).

**MTEB Chinese**:

The [Chinese leaderboard dataset](https://github.com/FlagOpen/FlagEmbedding/blob/master/C_MTEB) was collected by BAAI. It covers 6 classic tasks across 35 Chinese datasets, including classification, retrieval, reranking, sentence pair classification, and STS, and is currently the most comprehensive authoritative benchmark for evaluating Chinese embedding models.

1264
| Model | Vendor | Embedding dimension | Avg | Retrieval | STS | PairClassification | Classification | Reranking | Clustering |
|:-------------------------------------------------------------------------------------------------------- | ------ |:-------------------:|:-----:|:---------:|:-----:|:------------------:|:--------------:|:---------:|:----------:|
| [Dmeta-embedding](https://huggingface.co/DMetaSoul/Dmeta-embedding) | Ours | 1024 | 67.51 | 70.41 | 64.09 | 88.92 | 70.00 | 67.17 | 50.96 |
| [gte-large-zh](https://huggingface.co/thenlper/gte-large-zh) | Alibaba DAMO | 1024 | 66.72 | 72.49 | 57.82 | 84.41 | 71.34 | 67.40 | 53.07 |
| [BAAI/bge-large-zh-v1.5](https://huggingface.co/BAAI/bge-large-zh-v1.5) | BAAI | 1024 | 64.53 | 70.46 | 56.25 | 81.60 | 69.13 | 65.84 | 48.99 |
| [BAAI/bge-base-zh-v1.5](https://huggingface.co/BAAI/bge-base-zh-v1.5) | BAAI | 768 | 63.13 | 69.49 | 53.72 | 79.75 | 68.07 | 65.39 | 47.53 |
| [text-embedding-ada-002(OpenAI)](https://platform.openai.com/docs/guides/embeddings/what-are-embeddings) | OpenAI | 1536 | 53.02 | 52.00 | 43.35 | 69.56 | 64.31 | 54.28 | 45.68 |
| [text2vec-base](https://huggingface.co/shibing624/text2vec-base-chinese) | Individual | 768 | 47.63 | 38.79 | 43.41 | 67.41 | 62.19 | 49.45 | 37.66 |
| [text2vec-large](https://huggingface.co/GanymedeNil/text2vec-large-chinese) | Individual | 1024 | 47.36 | 41.94 | 44.97 | 70.86 | 60.66 | 49.16 | 30.02 |
1274

## FAQ

<details>
<summary>1. Why does the model generalize so well and work out of the box across so many task scenarios?</summary>

<!-- ### Why does the model generalize so well and work out of the box across so many task scenarios? -->

The excellent generalization ability of the model comes from the diversity of its pre-training data, together with optimization objectives designed for different multi-task scenarios during pre-training.

Specifically, the main technical features are:

1) Large-scale weak-label contrastive learning. Industry experience shows that out-of-the-box language models perform poorly on embedding tasks. However, because supervised data is costly to annotate and acquire, large-scale, high-quality weak-label learning has become a viable technical route. By extracting weak labels from semi-structured data such as forums, news, Q&A communities, and encyclopedias on the Internet, and using large models to filter out low-quality pairs, we obtained billion-scale weakly supervised text pair data.

2) High-quality supervised learning. We collected and curated large-scale open-source annotated sentence pair datasets, totaling 30 million sentence pair samples across encyclopedia, education, finance, medical, legal, news, academic, and other domains. At the same time, we mine hard negative sample pairs and use contrastive learning to better optimize the model.

3) Retrieval task optimization. Since search, Q&A, RAG, and similar scenarios are key applications of embedding models, we specifically optimized the model for retrieval tasks to strengthen its cross-domain and cross-scenario performance. The core is mining hard negative samples from Q&A and retrieval data, using methods such as sparse and dense retrieval to construct a million-scale hard negative sample pair dataset, which significantly improves the model's cross-domain retrieval performance.
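
As an illustration of the contrastive objective described above (a minimal numpy sketch of the widely used InfoNCE loss over in-batch negatives, not the actual training code), each query treats its paired positive as the target and every other positive in the batch as a negative:

```python
import numpy as np

def info_nce_loss(query_embs, pos_embs, temperature=0.05):
    """Toy InfoNCE over in-batch negatives. Embeddings are assumed
    L2-normalized, so query @ pos.T is cosine similarity."""
    logits = (query_embs @ pos_embs.T) / temperature  # scaled similarity matrix
    logits -= logits.max(axis=1, keepdims=True)       # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    # cross-entropy with targets on the diagonal (query i matches positive i)
    return float(-np.mean(np.diag(log_probs)))

rng = np.random.default_rng(0)
q = rng.normal(size=(4, 8)); q /= np.linalg.norm(q, axis=1, keepdims=True)
p = q + 0.1 * rng.normal(size=(4, 8)); p /= np.linalg.norm(p, axis=1, keepdims=True)

loss = info_nce_loss(q, p)
print(loss)
```

The temperature value here is illustrative; minimizing this loss pulls each query toward its positive and pushes it away from the in-batch (and mined hard) negatives.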
1291

</details>

<details>
<summary>2. Can the model be used commercially?</summary>

<!-- ### Can the model be used commercially? -->

Our model is released under the Apache-2.0 license and fully supports free commercial use.

</details>
1302

<details>
<summary>3. How to reproduce the MTEB evaluation?</summary>

<!-- ### How to reproduce the MTEB evaluation? -->

We provide the mteb_eval.py script in this model hub. You can run this script directly to reproduce our evaluation results.

</details>
1311

<details>
<summary>4. What are the follow-up plans?</summary>

<!-- ### What are the follow-up plans? -->

We will keep working to provide the community with embedding models that offer excellent performance, lightweight inference, and out-of-the-box use across many scenarios. At the same time, we will gradually integrate embedding into the existing technology ecosystem and grow with the community!

</details>
1320

## Contact

If you encounter any problems during use, feel free to open a [discussion](https://huggingface.co/DMetaSoul/Dmeta-embedding/discussions) with your suggestions.

You can also email us: Zhao Zhonghao <[email protected]>, Xiao Wenbin <[email protected]>, Sun Kai <[email protected]>

You are also welcome to scan the QR code to join our WeChat group and build the AIGC technology ecosystem together!

<img src="https://huggingface.co/DMetaSoul/Dmeta-embedding/resolve/main/weixin.jpeg" style="display: block; margin-left: auto; margin-right: auto; width: 256px; height: 358px;"/>

## License

Dmeta-embedding is licensed under the Apache-2.0 License. The released models can be used for commercial purposes free of charge.