Commit 2d2cfcc by MugheesAwan11 (1 parent: 7cc0c4e)

Add new SentenceTransformer model.

1_Pooling/config.json ADDED
```json
{
  "word_embedding_dimension": 768,
  "pooling_mode_cls_token": true,
  "pooling_mode_mean_tokens": false,
  "pooling_mode_max_tokens": false,
  "pooling_mode_mean_sqrt_len_tokens": false,
  "pooling_mode_weightedmean_tokens": false,
  "pooling_mode_lasttoken": false,
  "include_prompt": true
}
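With every pooling mode except `pooling_mode_cls_token` disabled, this config reduces a `(seq_len, word_embedding_dimension)` token-embedding matrix to its first ([CLS]) row. A minimal numpy sketch on random stand-in data (not the real model's outputs):

```python
import numpy as np

# Dummy token embeddings for one 12-token sentence:
# shape (seq_len, word_embedding_dimension); random stand-ins, not model outputs.
token_embeddings = np.random.rand(12, 768).astype(np.float32)

# With only pooling_mode_cls_token enabled, pooling keeps just the [CLS] (first) token vector.
sentence_embedding = token_embeddings[0]

print(sentence_embedding.shape)  # (768,)
```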
README.md ADDED
---
base_model: BAAI/bge-base-en-v1.5
datasets: []
language:
- en
library_name: sentence-transformers
license: apache-2.0
metrics:
- cosine_accuracy@1
- cosine_accuracy@3
- cosine_accuracy@5
- cosine_accuracy@10
- cosine_precision@1
- cosine_precision@3
- cosine_precision@5
- cosine_precision@10
- cosine_recall@1
- cosine_recall@3
- cosine_recall@5
- cosine_recall@10
- cosine_ndcg@10
- cosine_ndcg@80
- cosine_mrr@10
- cosine_map@100
pipeline_tag: sentence-similarity
tags:
- sentence-transformers
- sentence-similarity
- feature-extraction
- generated_from_trainer
- dataset_size:1496
- loss:MatryoshkaLoss
- loss:MultipleNegativesRankingLoss
widget:
- source_sentence: We are currently involved in, and may in the future be involved
    in, legal proceedings, claims, and government investigations in the ordinary course
    of business. These include proceedings, claims, and investigations relating to,
    among other things, regulatory matters, commercial matters, intellectual property,
    competition, tax, employment, pricing, discrimination, consumer rights, personal
    injury, and property rights.
  sentences:
  - What factors does the regulatory authority consider when ensuring data protection
    in cross border transfers in Zimbabwe?
  - How does Securiti enable enterprises to safely use data and the cloud while managing
    security, privacy, and compliance risks?
  - What types of legal issues is the company currently involved in?
- source_sentence: The Company’s minority market share in the global smartphone, personal
    computer and tablet markets can make developers less inclined to develop or upgrade
    software for the Company’s products and more inclined to devote their resources
    to developing and upgrading software for competitors’ products with larger market
    share. When developers focus their efforts on these competing platforms, the availability
    and quality of applications for the Company’s devices can suffer.
  sentences:
  - What is the role of obtaining consent in Thailand's PDPA?
  - Why might developers be less inclined to develop or upgrade software for the Company's
    products?
  - What caused the increase in energy generation and storage segment revenue in 2023?
- source_sentence: '** : EMEA (Europe, the Middle East and Africa) The Irish DPA implements
    the GDPR into the national law by incorporating most of the provisions of the
    GDPR with limited additions and deletions. It contains several provisions restricting
    data subjects’ rights that they generally have under the GDPR, for example, where
    restrictions are necessary for the enforcement of civil law claims. Resources*
    : Irish DPA Overview Irish Cookie Guidance ### Japan #### Japan’s Act on the Protection
    of Personal Information (APPI) **Effective Date (Amended APPI)** : April 01, 2022
    **Region** : APAC (Asia-Pacific) Japan’s APPI regulates personal related information
    and applies to any Personal Information Controller (the “PIC''''), that is a person
    or entity providing personal related information for use in business in Japan.
    The APPI also applies to the foreign'
  sentences:
  - What are the requirements for CIIOs and personal information processors in the
    state cybersecurity department regarding cross-border data transfers and certifications?
  - How does the Irish DPA implement the GDPR into national law?
  - What is the current status of the Personal Data Protection Act in El Salvador
    compared to Monaco and Venezuela?
- source_sentence: View Salesforce View Workday View GCP View Azure View Oracle View
    US California CCPA View US California CPRA View European Union GDPR View Thailand’s
    PDPA View China PIPL View Canada PIPEDA View Brazil's LGPD View \+ More View Privacy
    View Security View Governance View Marketing View Resources Blog View Collateral
    View Knowledge Center View Securiti Education View Company About Us View Partner
    Program View Contact Us View News Coverage
  sentences:
  - What is the role of ANPD in ensuring LGPD compliance and protecting data subject
    rights, including those related to health professionals?
  - According to the Spanish data protection law, who is required to hire a DPO if
    they possess certain information in the event of a data breach?
  - What is GCP and how does it relate to privacy, security, governance, marketing,
    and resources?
- source_sentence: 'vital interests of the data subject; Complying with an obligation
    prescribed in PDPL, not being a contractual obligation, or complying with an order
    from a competent court, the Public Prosecution, the investigation Judge, or the
    Military Prosecution; or Preparing or pursuing a legal claim or defense. vs Articles:
    44 50, Recitals: 101, 112 GDPR states that personal data shall be transferred
    to a third country or international organization with an adequate protection level
    as determined by the EU Commission. Suppose there is no decision on an adequate
    protection level. In that case, a transfer is only permitted when the data controller
    or data processor provides appropriate safeguards that ensure data subject rights.
    Appropriate safeguards include: BCRs with specific requirements (e.g., a legal
    basis for processing, a retention period, and complaint procedures) Standard data
    protection clauses adopted by the EU Commission, level of protection. If there
    is no adequate level of protection, then data controllers in Turkey and abroad
    shall commit, in writing, to provide an adequate level of protection abroad, as
    well as agree on the fact that the transfer is permitted by the Board of KVKK.
    vs Articles 44 50 Recitals 101, 112 GDPR states that personal data shall be transferred
    to a third country or international organization with an adequate protection level
    as determined by the EU Commission. Suppose there is no decision on an adequate
    protection level. In that case, a transfer is only permitted when the data controller
    or data processor provides appropriate safeguards that ensure data subject'' rights.
    Appropriate safeguards include: BCRs with specific requirements (e.g., a legal
    basis for processing, a retention period, and complaint procedures); standard
    data protection clauses adopted by the EU Commission or by a supervisory authority;
    an approved code'
  sentences:
  - What is the right to be informed in relation to personal data?
  - In what situations can a controller process personal data to protect vital interests?
  - What obligations in PDPL must data controllers or processors meet to protect personal
    data transferred to a third country or international organization?
model-index:
- name: SentenceTransformer based on BAAI/bge-base-en-v1.5
  results:
  - task:
      type: information-retrieval
      name: Information Retrieval
    dataset:
      name: dim 768
      type: dim_768
    metrics:
    - type: cosine_accuracy@1
      value: 0.4020618556701031
      name: Cosine Accuracy@1
    - type: cosine_accuracy@3
      value: 0.5773195876288659
      name: Cosine Accuracy@3
    - type: cosine_accuracy@5
      value: 0.6804123711340206
      name: Cosine Accuracy@5
    - type: cosine_accuracy@10
      value: 0.7938144329896907
      name: Cosine Accuracy@10
    - type: cosine_precision@1
      value: 0.4020618556701031
      name: Cosine Precision@1
    - type: cosine_precision@3
      value: 0.1924398625429553
      name: Cosine Precision@3
    - type: cosine_precision@5
      value: 0.1360824742268041
      name: Cosine Precision@5
    - type: cosine_precision@10
      value: 0.07938144329896907
      name: Cosine Precision@10
    - type: cosine_recall@1
      value: 0.4020618556701031
      name: Cosine Recall@1
    - type: cosine_recall@3
      value: 0.5773195876288659
      name: Cosine Recall@3
    - type: cosine_recall@5
      value: 0.6804123711340206
      name: Cosine Recall@5
    - type: cosine_recall@10
      value: 0.7938144329896907
      name: Cosine Recall@10
    - type: cosine_ndcg@10
      value: 0.5832092053824987
      name: Cosine Ndcg@10
    - type: cosine_ndcg@80
      value: 0.6222698401457883
      name: Cosine Ndcg@80
    - type: cosine_mrr@10
      value: 0.5174930453280969
      name: Cosine Mrr@10
    - type: cosine_map@100
      value: 0.5253009685878662
      name: Cosine Map@100
  - task:
      type: information-retrieval
      name: Information Retrieval
    dataset:
      name: dim 512
      type: dim_512
    metrics:
    - type: cosine_accuracy@1
      value: 0.41237113402061853
      name: Cosine Accuracy@1
    - type: cosine_accuracy@3
      value: 0.5670103092783505
      name: Cosine Accuracy@3
    - type: cosine_accuracy@5
      value: 0.6597938144329897
      name: Cosine Accuracy@5
    - type: cosine_accuracy@10
      value: 0.7938144329896907
      name: Cosine Accuracy@10
    - type: cosine_precision@1
      value: 0.41237113402061853
      name: Cosine Precision@1
    - type: cosine_precision@3
      value: 0.18900343642611683
      name: Cosine Precision@3
    - type: cosine_precision@5
      value: 0.1319587628865979
      name: Cosine Precision@5
    - type: cosine_precision@10
      value: 0.07938144329896907
      name: Cosine Precision@10
    - type: cosine_recall@1
      value: 0.41237113402061853
      name: Cosine Recall@1
    - type: cosine_recall@3
      value: 0.5670103092783505
      name: Cosine Recall@3
    - type: cosine_recall@5
      value: 0.6597938144329897
      name: Cosine Recall@5
    - type: cosine_recall@10
      value: 0.7938144329896907
      name: Cosine Recall@10
    - type: cosine_ndcg@10
      value: 0.5860165941440372
      name: Cosine Ndcg@10
    - type: cosine_ndcg@80
      value: 0.6252535691605303
      name: Cosine Ndcg@80
    - type: cosine_mrr@10
      value: 0.5218622156766489
      name: Cosine Mrr@10
    - type: cosine_map@100
      value: 0.5297061448856729
      name: Cosine Map@100
  - task:
      type: information-retrieval
      name: Information Retrieval
    dataset:
      name: dim 256
      type: dim_256
    metrics:
    - type: cosine_accuracy@1
      value: 0.41237113402061853
      name: Cosine Accuracy@1
    - type: cosine_accuracy@3
      value: 0.5979381443298969
      name: Cosine Accuracy@3
    - type: cosine_accuracy@5
      value: 0.6494845360824743
      name: Cosine Accuracy@5
    - type: cosine_accuracy@10
      value: 0.7628865979381443
      name: Cosine Accuracy@10
    - type: cosine_precision@1
      value: 0.41237113402061853
      name: Cosine Precision@1
    - type: cosine_precision@3
      value: 0.1993127147766323
      name: Cosine Precision@3
    - type: cosine_precision@5
      value: 0.12989690721649483
      name: Cosine Precision@5
    - type: cosine_precision@10
      value: 0.07628865979381441
      name: Cosine Precision@10
    - type: cosine_recall@1
      value: 0.41237113402061853
      name: Cosine Recall@1
    - type: cosine_recall@3
      value: 0.5979381443298969
      name: Cosine Recall@3
    - type: cosine_recall@5
      value: 0.6494845360824743
      name: Cosine Recall@5
    - type: cosine_recall@10
      value: 0.7628865979381443
      name: Cosine Recall@10
    - type: cosine_ndcg@10
      value: 0.5782766042135054
      name: Cosine Ndcg@10
    - type: cosine_ndcg@80
      value: 0.6240012013315989
      name: Cosine Ndcg@80
    - type: cosine_mrr@10
      value: 0.5207167403043692
      name: Cosine Mrr@10
    - type: cosine_map@100
      value: 0.5307304570652817
      name: Cosine Map@100
---

# SentenceTransformer based on BAAI/bge-base-en-v1.5

This is a [sentence-transformers](https://www.SBERT.net) model finetuned from [BAAI/bge-base-en-v1.5](https://huggingface.co/BAAI/bge-base-en-v1.5). It maps sentences & paragraphs to a 768-dimensional dense vector space and can be used for semantic textual similarity, semantic search, paraphrase mining, text classification, clustering, and more.

## Model Details

### Model Description
- **Model Type:** Sentence Transformer
- **Base model:** [BAAI/bge-base-en-v1.5](https://huggingface.co/BAAI/bge-base-en-v1.5) <!-- at revision a5beb1e3e68b9ab74eb54cfd186867f64f240e1a -->
- **Maximum Sequence Length:** 512 tokens
- **Output Dimensionality:** 768 dimensions
- **Similarity Function:** Cosine Similarity
<!-- - **Training Dataset:** Unknown -->
- **Language:** en
- **License:** apache-2.0

### Model Sources

- **Documentation:** [Sentence Transformers Documentation](https://sbert.net)
- **Repository:** [Sentence Transformers on GitHub](https://github.com/UKPLab/sentence-transformers)
- **Hugging Face:** [Sentence Transformers on Hugging Face](https://huggingface.co/models?library=sentence-transformers)

### Full Model Architecture

```
SentenceTransformer(
  (0): Transformer({'max_seq_length': 512, 'do_lower_case': True}) with Transformer model: BertModel
  (1): Pooling({'word_embedding_dimension': 768, 'pooling_mode_cls_token': True, 'pooling_mode_mean_tokens': False, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
  (2): Normalize()
)
```

## Usage

### Direct Usage (Sentence Transformers)

First install the Sentence Transformers library:

```bash
pip install -U sentence-transformers
```

Then you can load this model and run inference.
```python
from sentence_transformers import SentenceTransformer

# Download from the 🤗 Hub
model = SentenceTransformer("MugheesAwan11/bge-base-securiti-dataset-3-v23")
# Run inference
sentences = [
    "vital interests of the data subject; Complying with an obligation prescribed in PDPL, not being a contractual obligation, or complying with an order from a competent court, the Public Prosecution, the investigation Judge, or the Military Prosecution; or Preparing or pursuing a legal claim or defense. vs Articles: 44 50, Recitals: 101, 112 GDPR states that personal data shall be transferred to a third country or international organization with an adequate protection level as determined by the EU Commission. Suppose there is no decision on an adequate protection level. In that case, a transfer is only permitted when the data controller or data processor provides appropriate safeguards that ensure data subject rights. Appropriate safeguards include: BCRs with specific requirements (e.g., a legal basis for processing, a retention period, and complaint procedures) Standard data protection clauses adopted by the EU Commission, level of protection. If there is no adequate level of protection, then data controllers in Turkey and abroad shall commit, in writing, to provide an adequate level of protection abroad, as well as agree on the fact that the transfer is permitted by the Board of KVKK. vs Articles 44 50 Recitals 101, 112 GDPR states that personal data shall be transferred to a third country or international organization with an adequate protection level as determined by the EU Commission. Suppose there is no decision on an adequate protection level. In that case, a transfer is only permitted when the data controller or data processor provides appropriate safeguards that ensure data subject' rights. Appropriate safeguards include: BCRs with specific requirements (e.g., a legal basis for processing, a retention period, and complaint procedures); standard data protection clauses adopted by the EU Commission or by a supervisory authority; an approved code",
    'What obligations in PDPL must data controllers or processors meet to protect personal data transferred to a third country or international organization?',
    'In what situations can a controller process personal data to protect vital interests?',
]
embeddings = model.encode(sentences)
print(embeddings.shape)
# [3, 768]

# Get the similarity scores for the embeddings
similarities = model.similarity(embeddings, embeddings)
print(similarities.shape)
# [3, 3]
```
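Since the model ends in a `Normalize()` module, cosine similarity over its embeddings reduces to a plain dot product (which is what `model.similarity` computes for real embeddings). A sketch with random unit-normalized stand-in vectors, so it runs without downloading the model:

```python
import numpy as np

# Stand-ins for `model.encode(...)` output: 3 unit-normalized 768-d embeddings.
rng = np.random.default_rng(0)
embeddings = rng.normal(size=(3, 768))
embeddings /= np.linalg.norm(embeddings, axis=1, keepdims=True)

# For unit vectors, cosine similarity is just the Gram matrix.
similarities = embeddings @ embeddings.T

print(similarities.shape)  # (3, 3)
```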

<!--
### Direct Usage (Transformers)

<details><summary>Click to see the direct usage in Transformers</summary>

</details>
-->

<!--
### Downstream Usage (Sentence Transformers)

You can finetune this model on your own dataset.

<details><summary>Click to expand</summary>

</details>
-->

<!--
### Out-of-Scope Use

*List how the model may foreseeably be misused and address what users ought not to do with the model.*
-->

## Evaluation

### Metrics

#### Information Retrieval
* Dataset: `dim_768`
* Evaluated with [<code>InformationRetrievalEvaluator</code>](https://sbert.net/docs/package_reference/sentence_transformer/evaluation.html#sentence_transformers.evaluation.InformationRetrievalEvaluator)

| Metric              | Value      |
|:--------------------|:-----------|
| cosine_accuracy@1   | 0.4021     |
| cosine_accuracy@3   | 0.5773     |
| cosine_accuracy@5   | 0.6804     |
| cosine_accuracy@10  | 0.7938     |
| cosine_precision@1  | 0.4021     |
| cosine_precision@3  | 0.1924     |
| cosine_precision@5  | 0.1361     |
| cosine_precision@10 | 0.0794     |
| cosine_recall@1     | 0.4021     |
| cosine_recall@3     | 0.5773     |
| cosine_recall@5     | 0.6804     |
| cosine_recall@10    | 0.7938     |
| cosine_ndcg@10      | 0.5832     |
| cosine_ndcg@80      | 0.6223     |
| cosine_mrr@10       | 0.5175     |
| **cosine_map@100**  | **0.5253** |

#### Information Retrieval
* Dataset: `dim_512`
* Evaluated with [<code>InformationRetrievalEvaluator</code>](https://sbert.net/docs/package_reference/sentence_transformer/evaluation.html#sentence_transformers.evaluation.InformationRetrievalEvaluator)

| Metric              | Value      |
|:--------------------|:-----------|
| cosine_accuracy@1   | 0.4124     |
| cosine_accuracy@3   | 0.567      |
| cosine_accuracy@5   | 0.6598     |
| cosine_accuracy@10  | 0.7938     |
| cosine_precision@1  | 0.4124     |
| cosine_precision@3  | 0.189      |
| cosine_precision@5  | 0.132      |
| cosine_precision@10 | 0.0794     |
| cosine_recall@1     | 0.4124     |
| cosine_recall@3     | 0.567      |
| cosine_recall@5     | 0.6598     |
| cosine_recall@10    | 0.7938     |
| cosine_ndcg@10      | 0.586      |
| cosine_ndcg@80      | 0.6253     |
| cosine_mrr@10       | 0.5219     |
| **cosine_map@100**  | **0.5297** |

#### Information Retrieval
* Dataset: `dim_256`
* Evaluated with [<code>InformationRetrievalEvaluator</code>](https://sbert.net/docs/package_reference/sentence_transformer/evaluation.html#sentence_transformers.evaluation.InformationRetrievalEvaluator)

| Metric              | Value      |
|:--------------------|:-----------|
| cosine_accuracy@1   | 0.4124     |
| cosine_accuracy@3   | 0.5979     |
| cosine_accuracy@5   | 0.6495     |
| cosine_accuracy@10  | 0.7629     |
| cosine_precision@1  | 0.4124     |
| cosine_precision@3  | 0.1993     |
| cosine_precision@5  | 0.1299     |
| cosine_precision@10 | 0.0763     |
| cosine_recall@1     | 0.4124     |
| cosine_recall@3     | 0.5979     |
| cosine_recall@5     | 0.6495     |
| cosine_recall@10    | 0.7629     |
| cosine_ndcg@10      | 0.5783     |
| cosine_ndcg@80      | 0.624      |
| cosine_mrr@10       | 0.5207     |
| **cosine_map@100**  | **0.5307** |
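For intuition about the tables above: `cosine_accuracy@k` is the fraction of queries whose first relevant document lands in the top k, and `cosine_mrr@10` averages reciprocal ranks capped at 10. An illustrative sketch with hypothetical ranks (this is not the evaluator's API, just the arithmetic):

```python
def accuracy_at_k(ranks, k):
    """Fraction of queries whose first relevant document appears within the top k.

    `ranks` holds the 1-based rank of the first relevant document per query
    (None if it was never retrieved)."""
    return sum(1 for r in ranks if r is not None and r <= k) / len(ranks)

def mrr_at_k(ranks, k=10):
    """Mean reciprocal rank, counting only hits within the top k."""
    return sum(1.0 / r for r in ranks if r is not None and r <= k) / len(ranks)

# Hypothetical ranks for four queries
ranks = [1, 2, None, 20]
print(accuracy_at_k(ranks, 3))  # 0.5
print(mrr_at_k(ranks))          # 0.375
```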

<!--
## Bias, Risks and Limitations

*What are the known or foreseeable issues stemming from this model? You could also flag here known failure cases or weaknesses of the model.*
-->

<!--
### Recommendations

*What are recommendations with respect to the foreseeable issues? For example, filtering explicit content.*
-->

## Training Details

### Training Dataset

#### Unnamed Dataset

* Size: 1,496 training samples
* Columns: <code>positive</code> and <code>anchor</code>
* Approximate statistics based on the first 1000 samples:
  |         | positive                                                                             | anchor                                                                             |
  |:--------|:-------------------------------------------------------------------------------------|:-----------------------------------------------------------------------------------|
  | type    | string                                                                               | string                                                                             |
  | details | <ul><li>min: 67 tokens</li><li>mean: 216.99 tokens</li><li>max: 512 tokens</li></ul> | <ul><li>min: 10 tokens</li><li>mean: 21.6 tokens</li><li>max: 102 tokens</li></ul> |
* Samples:
  | positive | anchor |
  |:---------|:-------|
  | <code>Leader in Data Privacy View Events Spotlight Talks Education Contact Us Schedule a Demo Products By Use Cases By Roles Data Command Center View Learn more Asset and Data Discovery Discover dark and native data assets Learn more Data Access Intelligence & Governance Identify which users have access to sensitive data and prevent unauthorized access Learn more Data Privacy Automation PrivacyCenter.Cloud | Data Mapping | DSR Automation | Assessment Automation | Vendor Assessment | Breach Management | Privacy Notice Learn more Sensitive Data Intelligence Discover & Classify Structured and Unstructured Data | People Data Graph Learn more Data Flow Intelligence & Governance Prevent sensitive data sprawl through real-time streaming platforms Learn more Data Consent Automation First Party Consent | Third Party & Cookie</code> | <code>What is the purpose of the Data Command Center?</code> |
  | <code>data subject must be notified of any such extension within one month of receiving the request, along with the reasons for the delay and the possibility of complaining to the supervisory authority. The right to restrict processing applies when the data subject contests data accuracy, the processing is unlawful, and the data subject opposes erasure and requests restriction. The controller must inform data subjects before any such restriction is lifted. Under GDPR, the data subject also has the right to obtain from the controller the rectification of inaccurate personal data and to have incomplete personal data completed. Article: 22 Under PDPL, if a decision is based solely on automated processing of personal data intended to assess the data subject regarding his/her performance at work, financial standing, credit-worthiness, reliability, or conduct, then the data subject has the right to request processing in a manner that is not solely automated. This right shall not apply where the decision is taken in the course of entering into</code> | <code>What is the requirement for notifying the data subject of any extension under GDPR and PDPL?</code> |
  | <code>Automation PrivacyCenter.Cloud | Data Mapping | DSR Automation | Assessment Automation | Vendor Assessment | Breach Management | Privacy Notice Learn more Sensitive Data Intelligence Discover & Classify Structured and Unstructured Data | People Data Graph Learn more Data Flow Intelligence & Governance Prevent sensitive data sprawl through real-time streaming platforms Learn more Data Consent Automation First Party Consent | Third Party & Cookie Consent Learn more Data Security Posture Management Secure sensitive data in hybrid multicloud and SaaS environments Learn more Data Breach Impact Analysis & Response Analyze impact of a data breach and coordinate response per global regulatory obligations Learn more Data Catalog Automatically catalog datasets and enable users to find, understand, trust and access data Learn more Data Lineage Track changes and transformations of, PrivacyCenter.Cloud | Data Mapping | DSR Automation | Assessment Automation | Vendor Assessment | Breach Management | Privacy Notice Learn more Sensitive Data Intelligence Discover & Classify Structured and Unstructured Data | People Data Graph Learn more Data Flow Intelligence & Governance Prevent sensitive data sprawl through real-time streaming platforms Learn more Data Consent Automation First Party Consent | Third Party & Cookie Consent Learn more Data Security Posture Management Secure sensitive data in hybrid multicloud and SaaS environments Learn more Data Breach Impact Analysis & Response Analyze impact of a data breach and coordinate response per global regulatory obligations Learn more Data Catalog Automatically catalog datasets and enable users to find, understand, trust and access data Learn more Data Lineage Track changes and transformations of data throughout its</code> | <code>What is the purpose of Third Party & Cookie Consent in data automation and security?</code> |
* Loss: [<code>MatryoshkaLoss</code>](https://sbert.net/docs/package_reference/sentence_transformer/losses.html#matryoshkaloss) with these parameters:
  ```json
  {
      "loss": "MultipleNegativesRankingLoss",
      "matryoshka_dims": [
          768,
          512,
          256
      ],
      "matryoshka_weights": [
          1,
          1,
          1
      ],
      "n_dims_per_step": -1
  }
  ```
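Because `MatryoshkaLoss` trains the leading 768, 512, and 256 dimensions to be independently usable, embeddings can in principle be truncated and re-normalized at inference time to trade quality for storage. A minimal sketch (the `truncate_embedding` helper is hypothetical, not part of this repo):

```python
import numpy as np

def truncate_embedding(emb, dim):
    """Keep the leading `dim` components and re-normalize to unit length."""
    truncated = emb[:dim]
    return truncated / np.linalg.norm(truncated)

# Stand-in for one full 768-d embedding.
full = np.random.rand(768)

for d in (768, 512, 256):
    small = truncate_embedding(full, d)
    print(d, small.shape)
```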

### Training Hyperparameters
#### Non-Default Hyperparameters

- `eval_strategy`: epoch
- `per_device_train_batch_size`: 32
- `per_device_eval_batch_size`: 16
- `learning_rate`: 2e-05
- `num_train_epochs`: 1
- `lr_scheduler_type`: cosine
- `warmup_ratio`: 0.1
- `bf16`: True
- `tf32`: True
- `load_best_model_at_end`: True
- `optim`: adamw_torch_fused
- `batch_sampler`: no_duplicates

#### All Hyperparameters
<details><summary>Click to expand</summary>

- `overwrite_output_dir`: False
- `do_predict`: False
- `eval_strategy`: epoch
- `prediction_loss_only`: True
- `per_device_train_batch_size`: 32
- `per_device_eval_batch_size`: 16
- `per_gpu_train_batch_size`: None
- `per_gpu_eval_batch_size`: None
- `gradient_accumulation_steps`: 1
- `eval_accumulation_steps`: None
- `learning_rate`: 2e-05
- `weight_decay`: 0.0
- `adam_beta1`: 0.9
- `adam_beta2`: 0.999
- `adam_epsilon`: 1e-08
- `max_grad_norm`: 1.0
- `num_train_epochs`: 1
- `max_steps`: -1
- `lr_scheduler_type`: cosine
- `lr_scheduler_kwargs`: {}
- `warmup_ratio`: 0.1
- `warmup_steps`: 0
- `log_level`: passive
- `log_level_replica`: warning
- `log_on_each_node`: True
- `logging_nan_inf_filter`: True
- `save_safetensors`: True
- `save_on_each_node`: False
- `save_only_model`: False
- `restore_callback_states_from_checkpoint`: False
- `no_cuda`: False
- `use_cpu`: False
- `use_mps_device`: False
- `seed`: 42
- `data_seed`: None
- `jit_mode_eval`: False
- `use_ipex`: False
- `bf16`: True
- `fp16`: False
- `fp16_opt_level`: O1
- `half_precision_backend`: auto
- `bf16_full_eval`: False
- `fp16_full_eval`: False
- `tf32`: True
- `local_rank`: 0
- `ddp_backend`: None
- `tpu_num_cores`: None
- `tpu_metrics_debug`: False
- `debug`: []
- `dataloader_drop_last`: False
- `dataloader_num_workers`: 0
- `dataloader_prefetch_factor`: None
- `past_index`: -1
- `disable_tqdm`: False
- `remove_unused_columns`: True
- `label_names`: None
- `load_best_model_at_end`: True
- `ignore_data_skip`: False
- `fsdp`: []
- `fsdp_min_num_params`: 0
- `fsdp_config`: {'min_num_params': 0, 'xla': False, 'xla_fsdp_v2': False, 'xla_fsdp_grad_ckpt': False}
- `fsdp_transformer_layer_cls_to_wrap`: None
- `accelerator_config`: {'split_batches': False, 'dispatch_batches': None, 'even_batches': True, 'use_seedable_sampler': True, 'non_blocking': False, 'gradient_accumulation_kwargs': None}
- `deepspeed`: None
- `label_smoothing_factor`: 0.0
- `optim`: adamw_torch_fused
- `optim_args`: None
- `adafactor`: False
- `group_by_length`: False
- `length_column_name`: length
- `ddp_find_unused_parameters`: None
- `ddp_bucket_cap_mb`: None
- `ddp_broadcast_buffers`: False
- `dataloader_pin_memory`: True
- `dataloader_persistent_workers`: False
- `skip_memory_metrics`: True
- `use_legacy_prediction_loop`: False
- `push_to_hub`: False
- `resume_from_checkpoint`: None
- `hub_model_id`: None
- `hub_strategy`: every_save
- `hub_private_repo`: False
- `hub_always_push`: False
- `gradient_checkpointing`: False
- `gradient_checkpointing_kwargs`: None
- `include_inputs_for_metrics`: False
- `eval_do_concat_batches`: True
- `fp16_backend`: auto
- `push_to_hub_model_id`: None
- `push_to_hub_organization`: None
- `mp_parameters`: 
- `auto_find_batch_size`: False
- `full_determinism`: False
- `torchdynamo`: None
- `ray_scope`: last
- `ddp_timeout`: 1800
- `torch_compile`: False
- `torch_compile_backend`: None
- `torch_compile_mode`: None
- `dispatch_batches`: None
- `split_batches`: None
- `include_tokens_per_second`: False
- `include_num_input_tokens_seen`: False
- `neftune_noise_alpha`: None
- `optim_target_modules`: None
- `batch_eval_metrics`: False
- `batch_sampler`: no_duplicates
624
+ - `multi_dataset_batch_sampler`: proportional
625
+
626
+ </details>
627
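The non-default hyperparameters above can be reproduced via `SentenceTransformerTrainingArguments` (sentence-transformers ≥ 3.0). A minimal sketch; the `output_dir` name is a placeholder, and `save_strategy="epoch"` is an assumption inferred from `load_best_model_at_end=True` with epoch-level evaluation:

```python
from sentence_transformers import SentenceTransformerTrainingArguments
from sentence_transformers.training_args import BatchSamplers

# Mirrors the key values listed above; "output" is a placeholder path,
# and save_strategy="epoch" is assumed (not listed in the card).
args = SentenceTransformerTrainingArguments(
    output_dir="output",
    num_train_epochs=1,
    per_device_train_batch_size=32,
    per_device_eval_batch_size=16,
    learning_rate=2e-5,
    warmup_ratio=0.1,
    lr_scheduler_type="cosine",
    bf16=True,
    tf32=True,
    eval_strategy="epoch",
    save_strategy="epoch",
    load_best_model_at_end=True,
    optim="adamw_torch_fused",
    batch_sampler=BatchSamplers.NO_DUPLICATES,
)
```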
+
+ ### Training Logs
+ | Epoch | Step | Training Loss | dim_256_cosine_map@100 | dim_512_cosine_map@100 | dim_768_cosine_map@100 |
+ |:-------:|:------:|:-------------:|:----------------------:|:----------------------:|:----------------------:|
+ | 0.2128 | 10 | 3.8486 | - | - | - |
+ | 0.4255 | 20 | 2.3622 | - | - | - |
+ | 0.6383 | 30 | 2.3216 | - | - | - |
+ | 0.8511 | 40 | 1.3247 | - | - | - |
+ | **1.0** | **47** | **-** | **0.5307** | **0.5297** | **0.5253** |
+
+ * The bold row denotes the saved checkpoint.
+
+ ### Framework Versions
+ - Python: 3.10.14
+ - Sentence Transformers: 3.0.1
+ - Transformers: 4.41.2
+ - PyTorch: 2.1.2+cu121
+ - Accelerate: 0.31.0
+ - Datasets: 2.19.1
+ - Tokenizers: 0.19.1
+
+ ## Citation
+
+ ### BibTeX
+
+ #### Sentence Transformers
+ ```bibtex
+ @inproceedings{reimers-2019-sentence-bert,
+ title = "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks",
+ author = "Reimers, Nils and Gurevych, Iryna",
+ booktitle = "Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing",
+ month = "11",
+ year = "2019",
+ publisher = "Association for Computational Linguistics",
+ url = "https://arxiv.org/abs/1908.10084",
+ }
+ ```
+
+ #### MatryoshkaLoss
+ ```bibtex
+ @misc{kusupati2024matryoshka,
+ title={Matryoshka Representation Learning},
+ author={Aditya Kusupati and Gantavya Bhatt and Aniket Rege and Matthew Wallingford and Aditya Sinha and Vivek Ramanujan and William Howard-Snyder and Kaifeng Chen and Sham Kakade and Prateek Jain and Ali Farhadi},
+ year={2024},
+ eprint={2205.13147},
+ archivePrefix={arXiv},
+ primaryClass={cs.LG}
+ }
+ ```
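MatryoshkaLoss trains the leading coordinates of each embedding to work as a standalone representation, which is why the Training Logs above report cosine MAP@100 at 768, 512, and 256 dimensions. A minimal pure-Python sketch of using a truncated embedding (toy values stand in for real model output):

```python
import math

def truncate_and_renormalize(embedding, dim):
    """Keep the first `dim` coordinates of a Matryoshka-style embedding
    and rescale back to unit L2 norm (hypothetical helper)."""
    head = embedding[:dim]
    norm = math.sqrt(sum(x * x for x in head))
    return [x / norm for x in head]

# Toy unit-norm 768-d vector standing in for a real sentence embedding.
full = [1.0 / math.sqrt(768)] * 768
small = truncate_and_renormalize(full, 256)
print(len(small))                            # 256
print(round(sum(x * x for x in small), 6))   # 1.0
```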
+
+ #### MultipleNegativesRankingLoss
+ ```bibtex
+ @misc{henderson2017efficient,
+ title={Efficient Natural Language Response Suggestion for Smart Reply},
+ author={Matthew Henderson and Rami Al-Rfou and Brian Strope and Yun-hsuan Sung and Laszlo Lukacs and Ruiqi Guo and Sanjiv Kumar and Balint Miklos and Ray Kurzweil},
+ year={2017},
+ eprint={1705.00652},
+ archivePrefix={arXiv},
+ primaryClass={cs.CL}
+ }
+ ```
+
+ <!--
+ ## Glossary
+
+ *Clearly define terms in order to be accessible across audiences.*
+ -->
+
+ <!--
+ ## Model Card Authors
+
+ *Lists the people who create the model card, providing recognition and accountability for the detailed work that goes into its construction.*
+ -->
+
+ <!--
+ ## Model Card Contact
+
+ *Provides a way for people who have updates to the Model Card, suggestions, or questions, to contact the Model Card authors.*
+ -->
config.json ADDED
@@ -0,0 +1,32 @@
+ {
+ "_name_or_path": "BAAI/bge-base-en-v1.5",
+ "architectures": [
+ "BertModel"
+ ],
+ "attention_probs_dropout_prob": 0.1,
+ "classifier_dropout": null,
+ "gradient_checkpointing": false,
+ "hidden_act": "gelu",
+ "hidden_dropout_prob": 0.1,
+ "hidden_size": 768,
+ "id2label": {
+ "0": "LABEL_0"
+ },
+ "initializer_range": 0.02,
+ "intermediate_size": 3072,
+ "label2id": {
+ "LABEL_0": 0
+ },
+ "layer_norm_eps": 1e-12,
+ "max_position_embeddings": 512,
+ "model_type": "bert",
+ "num_attention_heads": 12,
+ "num_hidden_layers": 12,
+ "pad_token_id": 0,
+ "position_embedding_type": "absolute",
+ "torch_dtype": "float32",
+ "transformers_version": "4.41.2",
+ "type_vocab_size": 2,
+ "use_cache": true,
+ "vocab_size": 30522
+ }
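As a sanity check, the BERT-base dimensions in this config account for the checkpoint size. A back-of-the-envelope sketch (layer breakdown is the standard BERT layout, not read from the file):

```python
# Rough parameter count for the BertModel described by this config
# (hidden 768, 12 layers, vocab 30522, intermediate 3072).
hidden, layers, vocab, max_pos, types, inter = 768, 12, 30522, 512, 2, 3072

embeddings = (vocab + max_pos + types) * hidden + 2 * hidden  # + LayerNorm
per_layer = (
    4 * (hidden * hidden + hidden)  # Q, K, V, attention output projections
    + 2 * hidden                    # attention LayerNorm
    + hidden * inter + inter        # FFN up-projection
    + inter * hidden + hidden       # FFN down-projection
    + 2 * hidden                    # output LayerNorm
)
pooler = hidden * hidden + hidden
total = embeddings + layers * per_layer + pooler
print(total)  # 109482240
```

At 4 bytes per float32 parameter that is about 438 MB, in line with the ~437.9 MB `model.safetensors` file (the small remainder is the safetensors header).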
config_sentence_transformers.json ADDED
@@ -0,0 +1,10 @@
+ {
+ "__version__": {
+ "sentence_transformers": "3.0.1",
+ "transformers": "4.41.2",
+ "pytorch": "2.1.2+cu121"
+ },
+ "prompts": {},
+ "default_prompt_name": null,
+ "similarity_fn_name": null
+ }
model.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:62527e2af5dc7d33f3d28e07aa0843b91aadcd6a2eef7f4011b65f469c6a6d03
+ size 437951328
modules.json ADDED
@@ -0,0 +1,20 @@
+ [
+ {
+ "idx": 0,
+ "name": "0",
+ "path": "",
+ "type": "sentence_transformers.models.Transformer"
+ },
+ {
+ "idx": 1,
+ "name": "1",
+ "path": "1_Pooling",
+ "type": "sentence_transformers.models.Pooling"
+ },
+ {
+ "idx": 2,
+ "name": "2",
+ "path": "2_Normalize",
+ "type": "sentence_transformers.models.Normalize"
+ }
+ ]
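This modules.json wires three stages: the transformer, CLS pooling (per `1_Pooling/config.json`, where `pooling_mode_cls_token` is true), and L2 normalization. A minimal pure-Python sketch of the last two stages, with toy 4-d vectors standing in for the real 768-d transformer outputs:

```python
import math

def cls_pool(token_embeddings):
    # CLS pooling: the sentence embedding is the first token's vector.
    return token_embeddings[0]

def l2_normalize(vec):
    # Normalize stage: rescale to unit L2 norm so that dot product
    # equals cosine similarity.
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

# Toy per-token vectors standing in for the transformer output.
tokens = [[3.0, 4.0, 0.0, 0.0], [1.0, 1.0, 1.0, 1.0]]
sentence_embedding = l2_normalize(cls_pool(tokens))
print(sentence_embedding)  # [0.6, 0.8, 0.0, 0.0]
```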
sentence_bert_config.json ADDED
@@ -0,0 +1,4 @@
+ {
+ "max_seq_length": 512,
+ "do_lower_case": true
+ }
special_tokens_map.json ADDED
@@ -0,0 +1,37 @@
+ {
+ "cls_token": {
+ "content": "[CLS]",
+ "lstrip": false,
+ "normalized": false,
+ "rstrip": false,
+ "single_word": false
+ },
+ "mask_token": {
+ "content": "[MASK]",
+ "lstrip": false,
+ "normalized": false,
+ "rstrip": false,
+ "single_word": false
+ },
+ "pad_token": {
+ "content": "[PAD]",
+ "lstrip": false,
+ "normalized": false,
+ "rstrip": false,
+ "single_word": false
+ },
+ "sep_token": {
+ "content": "[SEP]",
+ "lstrip": false,
+ "normalized": false,
+ "rstrip": false,
+ "single_word": false
+ },
+ "unk_token": {
+ "content": "[UNK]",
+ "lstrip": false,
+ "normalized": false,
+ "rstrip": false,
+ "single_word": false
+ }
+ }
tokenizer.json ADDED
The diff for this file is too large to render. See raw diff
tokenizer_config.json ADDED
@@ -0,0 +1,57 @@
+ {
+ "added_tokens_decoder": {
+ "0": {
+ "content": "[PAD]",
+ "lstrip": false,
+ "normalized": false,
+ "rstrip": false,
+ "single_word": false,
+ "special": true
+ },
+ "100": {
+ "content": "[UNK]",
+ "lstrip": false,
+ "normalized": false,
+ "rstrip": false,
+ "single_word": false,
+ "special": true
+ },
+ "101": {
+ "content": "[CLS]",
+ "lstrip": false,
+ "normalized": false,
+ "rstrip": false,
+ "single_word": false,
+ "special": true
+ },
+ "102": {
+ "content": "[SEP]",
+ "lstrip": false,
+ "normalized": false,
+ "rstrip": false,
+ "single_word": false,
+ "special": true
+ },
+ "103": {
+ "content": "[MASK]",
+ "lstrip": false,
+ "normalized": false,
+ "rstrip": false,
+ "single_word": false,
+ "special": true
+ }
+ },
+ "clean_up_tokenization_spaces": true,
+ "cls_token": "[CLS]",
+ "do_basic_tokenize": true,
+ "do_lower_case": true,
+ "mask_token": "[MASK]",
+ "model_max_length": 512,
+ "never_split": null,
+ "pad_token": "[PAD]",
+ "sep_token": "[SEP]",
+ "strip_accents": null,
+ "tokenize_chinese_chars": true,
+ "tokenizer_class": "BertTokenizer",
+ "unk_token": "[UNK]"
+ }
vocab.txt ADDED
The diff for this file is too large to render. See raw diff