ldwang committed
Commit 000bd8e
1 Parent(s): 1a18bbd

update readme

Files changed (1)
  1. README.md +68 -32

README.md CHANGED
pipeline_tag: sentence-similarity
---

<h1 align="center">FlagEmbedding</h1>


<a href="#usage">Usage</a> |
<a href="#evaluation">Evaluation</a> |
<a href="#train">Train</a> |
<a href="#contact">Contact</a> |
<a href="#license">License</a>
<p>
</h4>
 
For more details, please refer to our GitHub repo: [FlagEmbedding](https://github.com/FlagOpen/FlagEmbedding).

[English](README.md) | [中文](https://github.com/FlagOpen/FlagEmbedding/blob/master/README_zh.md)

FlagEmbedding can map any text to a low-dimensional dense vector, which can be used for tasks like retrieval, classification, clustering, or semantic search.
It can also be used in vector databases for LLMs.

************* 🌟**Updates**🌟 *************
- 08/09/2023: BGE models are integrated into **Langchain**; you can use them like [**this**](#using-langchain). The C-MTEB **leaderboard** is [available](https://huggingface.co/spaces/mteb/leaderboard).
- 08/05/2023: Release base-scale and small-scale models, the **best performance among models of the same size 🤗**
- 08/02/2023: Release `bge-large-*` (short for BAAI General Embedding) models, which **rank 1st on the MTEB and C-MTEB benchmarks!**
- 08/01/2023: Release the [Chinese Massive Text Embedding Benchmark](https://github.com/FlagOpen/FlagEmbedding/blob/master/C_MTEB) (**C-MTEB**), consisting of 31 test datasets.
 
 
## Model List

`bge` is short for `BAAI general embedding`.

| Model | Language | Description | query instruction for retrieval\* |
|:-------------------------------|:--------:| :--------:| :--------:|
| [BAAI/bge-large-en](https://huggingface.co/BAAI/bge-large-en) | English | ranks **1st** on the [MTEB](https://huggingface.co/spaces/mteb/leaderboard) leaderboard | `Represent this sentence for searching relevant passages: ` |
| [BAAI/bge-base-en](https://huggingface.co/BAAI/bge-base-en) | English | ranks **2nd** on the [MTEB](https://huggingface.co/spaces/mteb/leaderboard) leaderboard | `Represent this sentence for searching relevant passages: ` |
| [BAAI/bge-small-en](https://huggingface.co/BAAI/bge-small-en) | English | a small-scale model with competitive performance | `Represent this sentence for searching relevant passages: ` |
| [BAAI/bge-large-zh](https://huggingface.co/BAAI/bge-large-zh) | Chinese | ranks **1st** on the [C-MTEB](https://github.com/FlagOpen/FlagEmbedding/tree/master/C_MTEB) benchmark | `为这个句子生成表示以用于检索相关文章:` |
| [BAAI/bge-large-zh-noinstruct](https://huggingface.co/BAAI/bge-large-zh-noinstruct) | Chinese | trained without instruction; ranks **2nd** on the [C-MTEB](https://github.com/FlagOpen/FlagEmbedding/tree/master/C_MTEB) benchmark | |
| [BAAI/bge-base-zh](https://huggingface.co/BAAI/bge-base-zh) | Chinese | a base-scale model with ability similar to `bge-large-zh` | `为这个句子生成表示以用于检索相关文章:` |
| [BAAI/bge-small-zh](https://huggingface.co/BAAI/bge-small-zh) | Chinese | a small-scale model with competitive performance | `为这个句子生成表示以用于检索相关文章:` |

\*: If you need to retrieve **long** relevant passages for a **short** query (the s2p retrieval task), add the instruction to the query; in other cases, no instruction is needed, just use the original query directly. In all cases, **no instruction** needs to be added to passages.
 
## Usage

Here are some examples of using `bge` models with
[FlagEmbedding](#using-flagembedding), [Sentence-Transformers](#using-sentence-transformers), [Langchain](#using-langchain), or [Huggingface Transformers](#using-huggingface-transformers).

#### Using FlagEmbedding
```
pip install -U FlagEmbedding
```
If this doesn't work for you, see [FlagEmbedding](https://github.com/FlagOpen/FlagEmbedding/blob/master/FlagEmbedding/baai_general_embedding/README.md) for other ways to install FlagEmbedding.

```python
from FlagEmbedding import FlagModel
sentences = ["样例数据-1", "样例数据-2"]
model = FlagModel('BAAI/bge-large-zh', query_instruction_for_retrieval="为这个句子生成表示以用于检索相关文章:")
embeddings_1 = model.encode(sentences)
embeddings_2 = model.encode(sentences)
similarity = embeddings_1 @ embeddings_2.T
print(similarity)

# for the s2p (short query to long passage) retrieval task, use encode_queries(), which automatically adds the instruction to each query
# the corpus in a retrieval task can still use encode() or encode_corpus(), since passages do not need the instruction
queries = ['query_1', 'query_2']
passages = ["样例文档-1", "样例文档-2"]
q_embeddings = model.encode_queries(queries)
p_embeddings = model.encode(passages)
scores = q_embeddings @ p_embeddings.T
```
The value of the argument `query_instruction_for_retrieval` is given in the [Model List](https://github.com/FlagOpen/FlagEmbedding/tree/master#model-list).
FlagModel will use all available GPUs when encoding; set `os.environ["CUDA_VISIBLE_DEVICES"]` to choose which GPUs to use.
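If you only want encoding to run on specific GPUs, set the environment variable before the model is created. A minimal sketch (the device list `"0,1"` is just an example value):

```python
import os

# restrict encoding to the first two GPUs (example value); set this before CUDA is initialized
os.environ["CUDA_VISIBLE_DEVICES"] = "0,1"

from FlagEmbedding import FlagModel

model = FlagModel('BAAI/bge-large-zh', query_instruction_for_retrieval="为这个句子生成表示以用于检索相关文章:")
embeddings = model.encode(["样例数据-1", "样例数据-2"])
print(embeddings.shape)
```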
 
2684
 
2685
+ #### Using Sentence-Transformers
2686
 
2687
  Using this model also is easy when you have [sentence-transformers](https://www.SBERT.net) installed:
2688
 
 
2693
  from sentence_transformers import SentenceTransformer
2694
  sentences = ["样例数据-1", "样例数据-2"]
2695
  model = SentenceTransformer('BAAI/bge-large-zh')
2696
+ embeddings_1 = model.encode(sentences, normalize_embeddings=True)
2697
+ embeddings_2 = model.encode(sentences, normalize_embeddings=True)
2698
+ similarity = embeddings_1 @ embeddings_2.T
2699
+ print(similarity)
2700
  ```
2701
+ For s2p(short query to long passage) retrieval task,
2702
+ each short query should start with an instruction (instructions see [Model List](https://github.com/FlagOpen/FlagEmbedding/tree/master#model-list)).
2703
+ But the instruction is not needed for passages.
2704
  ```python
2705
  from sentence_transformers import SentenceTransformer
2706
+ queries = ['query_1', 'query_2']
2707
+ passages = ["样例文档-1", "样例文档-2"]
2708
  instruction = "为这个句子生成表示以用于检索相关文章:"
2709
+
2710
  model = SentenceTransformer('BAAI/bge-large-zh')
2711
  q_embeddings = model.encode([instruction+q for q in queries], normalize_embeddings=True)
2712
  p_embeddings = model.encode(passages, normalize_embeddings=True)
2713
  scores = q_embeddings @ p_embeddings.T
2714
  ```
2715
 
2716
+ #### Using Langchain
2717
+
2718
+ You can use `bge` in langchain like this:
2719
+ ```python
2720
+ from langchain.embeddings import HuggingFaceBgeEmbeddings
2721
+ model_name = "BAAI/bge-small-en"
2722
+ model_kwargs = {'device': 'cuda'}
2723
+ encode_kwargs = {'normalize_embeddings': True} # set True to compute cosine similarity
2724
+ model_norm = HuggingFaceBgeEmbeddings(
2725
+ model_name=model_name,
2726
+ model_kwargs=model_kwargs,
2727
+ encode_kwargs=encode_kwargs
2728
+ )
2729
+ ```
2730
+
2731
+
2732
+ #### Using HuggingFace Transformers
2733
 
2734
  With transformers package, you can use the model like this: First, you pass your input through the transformer model, then you select the last hidden state of first token (i.e., [CLS]) as the sentence embedding.
2735
 
 
2738
  import torch
2739
  # Sentences we want sentence embeddings for
2740
  sentences = ["样例数据-1", "样例数据-2"]
2741
+
2742
  # Load model from HuggingFace Hub
2743
  tokenizer = AutoTokenizer.from_pretrained('BAAI/bge-large-zh')
2744
  model = AutoModel.from_pretrained('BAAI/bge-large-zh')
2745
+
2746
  # Tokenize sentences
2747
  encoded_input = tokenizer(sentences, padding=True, truncation=True, return_tensors='pt')
2748
+ # for s2p(short query to long passage) retrieval task, add an instruction to query (not add instruction for passages)
2749
  # encoded_input = tokenizer([instruction + q for q in queries], padding=True, truncation=True, return_tensors='pt')
2750
+
2751
  # Compute token embeddings
2752
  with torch.no_grad():
2753
  model_output = model(**encoded_input)
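The diff hunk stops at this point; the pooling and normalization steps described in the sentence above are not shown. A minimal sketch of that continuation, reusing `model_output` from the snippet above (not quoted verbatim from this commit):

```python
import torch

# CLS pooling: take the last hidden state of the first token as the sentence embedding
sentence_embeddings = model_output[0][:, 0]
# normalize so that inner products between embeddings are cosine similarities
sentence_embeddings = torch.nn.functional.normalize(sentence_embeddings, p=2, dim=1)
print("Sentence embeddings:", sentence_embeddings)
```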
 
- **C-MTEB**:
We created the C-MTEB benchmark for Chinese text embedding, which consists of 31 datasets from 6 tasks.
Please refer to [C_MTEB](https://github.com/FlagOpen/FlagEmbedding/blob/master/C_MTEB/README.md) for a detailed introduction.

| Model | Embedding dimension | Avg | Retrieval | STS | PairClassification | Classification | Reranking | Clustering |
 
We pre-train the model following the [retromae](https://github.com/staoxiao/RetroMAE) method,
which shows promising improvements on retrieval tasks ([paper](https://aclanthology.org/2022.emnlp-main.35.pdf)).
The pre-training was conducted on 24 A100 (40G) GPUs with a batch size of 720.
In RetroMAE, the mask ratios of the encoder and decoder are 0.3 and 0.5, respectively.
We used the AdamW optimizer with a learning rate of 2e-5.
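Purely as a compact restatement of the hyperparameters listed above (this is not a configuration file from the repository, and the field names are illustrative):

```python
from dataclasses import dataclass

@dataclass
class RetroMAEPretrainConfig:
    # values quoted from the paragraph above
    gpus: int = 24                  # A100 (40G)
    batch_size: int = 720
    encoder_mask_ratio: float = 0.3
    decoder_mask_ratio: float = 0.5
    optimizer: str = "AdamW"
    learning_rate: float = 2e-5
```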
 
**Pre-training data**:
- English:
    - [wikipedia](https://huggingface.co/datasets/wikipedia)
    - [msmarco](https://huggingface.co/datasets/Tevatron/msmarco-passage-corpus)
- Chinese:
    - [wudao](https://github.com/BAAI-WuDao/Data)


**2. Finetune**
 
We used the AdamW optimizer with a learning rate of 1e-5.
The temperature for the contrastive loss is 0.01.
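The loss itself is not spelled out in this excerpt; with a temperature $\tau = 0.01$, a temperature-scaled contrastive objective of this kind usually takes the form (an assumption, not quoted from the repository):

$$
\mathcal{L} = -\log \frac{\exp\left(\mathrm{sim}(q, p^{+}) / \tau\right)}{\sum_{i} \exp\left(\mathrm{sim}(q, p_{i}) / \tau\right)}
$$

where $q$ is the query embedding, $p^{+}$ is its positive passage, and the sum runs over the positive and the negative passages used for that query.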
 
Besides, we add an instruction to the query for the s2p (short query to long passage) retrieval task during training (nothing is added to passages).
For English, the instruction is `Represent this sentence for searching relevant passages: `;
for Chinese, the instruction is `为这个句子生成表示以用于检索相关文章:`.
In the evaluation, the instruction should be added for queries in retrieval tasks, but not for other tasks; see the sketch below.
Note that the instruction is not needed for passages.
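A minimal sketch of this evaluation-time convention (the helper function and example texts are illustrative, not code from the repository):

```python
from sentence_transformers import SentenceTransformer

instruction = "Represent this sentence for searching relevant passages: "
model = SentenceTransformer('BAAI/bge-large-en')

def embed(texts, task, is_query=False):
    # only retrieval queries receive the instruction; passages and all other tasks use the raw text
    if task == "retrieval" and is_query:
        texts = [instruction + t for t in texts]
    return model.encode(texts, normalize_embeddings=True)

q_emb = embed(["what is a dense vector?"], task="retrieval", is_query=True)
p_emb = embed(["A dense vector represents text as floating-point numbers."], task="retrieval")
sts_emb = embed(["a sentence for an STS task"], task="sts")  # no instruction added
print(q_emb @ p_emb.T)
```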
 
The finetune script is available in this repository: [FlagEmbedding](https://github.com/FlagOpen/FlagEmbedding/blob/master/FlagEmbedding/baai_general_embedding/README.md).
You can easily finetune your model with it.
 
We will continually update the embedding models and training code,
hoping to promote the development of the embedding model community.


## License
FlagEmbedding is licensed under the [MIT License](https://github.com/FlagOpen/FlagEmbedding/blob/master/LICENSE). The released models can be used for commercial purposes free of charge.