
BM25S Index

This is a BM25S index created with the bm25s library (version 0.2.2), an ultra-fast implementation of BM25. It can be used for lexical retrieval tasks.


Installation

You can install the bm25s library with pip:

pip install "bm25s==0.2.2"

# Include extra dependencies like stemmer
pip install "bm25s[full]==0.2.2"

# For huggingface hub usage
pip install huggingface_hub

Loading a bm25s index

You can use this index for information retrieval tasks. Here is an example:

import bm25s
from bm25s.hf import BM25HF

# Load the index
retriever = BM25HF.load_from_hub("tien314/bm25s")

# You can now run retrieval
query = "a cat is a feline"
results = retriever.retrieve(bm25s.tokenize(query), k=3)

Saving a bm25s index

You can save a bm25s index to the Hugging Face Hub. Here is an example:

import bm25s
from bm25s.hf import BM25HF

corpus = [
    "a cat is a feline and likes to purr",
    "a dog is the human's best friend and loves to play",
    "a bird is a beautiful animal that can fly",
    "a fish is a creature that lives in water and swims",
]

retriever = BM25HF(corpus=corpus)
retriever.index(bm25s.tokenize(corpus))

token = None  # You can get a token from the Hugging Face website
retriever.save_to_hub("tien314/bm25s", token=token)

Advanced usage

You can leverage more advanced features of the BM25S library during load_from_hub:

# Load the corpus and index as memory-mapped files (mmap=True) to reduce memory usage
retriever = BM25HF.load_from_hub("tien314/bm25s", load_corpus=True, mmap=True)

# Load a different branch/revision
retriever = BM25HF.load_from_hub("tien314/bm25s", revision="main")

# Change the directory where the local files are downloaded
retriever = BM25HF.load_from_hub("tien314/bm25s", local_dir="/path/to/dir")

# Load a private repository with a token
retriever = BM25HF.load_from_hub("tien314/bm25s", token=token)

Tokenizer

If you have saved a Tokenizer object with the index using the following approach:

from bm25s.hf import TokenizerHF

token = "your_hugging_face_token"
tokenizer = TokenizerHF(corpus=corpus, stopwords="english")
tokenizer.save_to_hub("tien314/bm25s", token=token)

# and stopwords too
tokenizer.save_stopwords_to_hub("tien314/bm25s", token=token)

Then, you can load the tokenizer using the following code:

from bm25s.hf import TokenizerHF

tokenizer = TokenizerHF(corpus=corpus, stopwords=[])
tokenizer.load_vocab_from_hub("tien314/bm25s", token=token)
tokenizer.load_stopwords_from_hub("tien314/bm25s", token=token)

Stats

This index was built from the following data:

| Statistic | Value |
| --- | --- |
| Number of documents | 4107805 |
| Number of tokens | 44086459 |
| Average tokens per document | 10.73 |

Parameters

The index was created with the following parameters:

| Parameter | Value |
| --- | --- |
| k1 | 1.5 |
| b | 0.75 |
| delta | 0.5 |
| method | lucene |
| idf method | lucene |
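With `method` and `idf method` both set to `lucene`, each query term's contribution to a document's score follows the Lucene BM25 variant (`delta` only affects the `bm25l`/`bm25+` variants, so it is unused here). As a rough illustration of what the `k1` and `b` values above control, here is a hand-rolled single-term scoring sketch — not the library's implementation:

```python
import math

def bm25_lucene_term_score(tf, df, n_docs, doc_len, avg_doc_len,
                           k1=1.5, b=0.75):
    """Score of one query term in one document (Lucene BM25 variant).

    tf: term frequency in the document
    df: number of documents containing the term
    """
    # Rarer terms get a higher idf weight
    idf = math.log(1 + (n_docs - df + 0.5) / (df + 0.5))
    # b controls how strongly longer-than-average documents are penalized;
    # k1 controls how quickly repeated occurrences of a term saturate
    norm = tf / (tf + k1 * (1 - b + b * doc_len / avg_doc_len))
    return idf * norm
```

A document's BM25 score for a query is the sum of this quantity over the query's terms; note that the score grows sublinearly in `tf` and shrinks as `df` grows.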

Citation

To cite bm25s, please use the following BibTeX entry:

@misc{lu_2024_bm25s,
      title={BM25S: Orders of magnitude faster lexical search via eager sparse scoring}, 
      author={Xing Han Lù},
      year={2024},
      eprint={2407.03618},
      archivePrefix={arXiv},
      primaryClass={cs.IR},
      url={https://arxiv.org/abs/2407.03618}, 
}