Semantic Search of Legal Data Using SBERT
This repository contains a proof-of-concept model for semantic search of legal data, based on Sentence-BERT (SBERT) and fine-tuned using triplets. The model is designed to provide efficient and accurate semantic search capabilities for legal documents.
Model Overview
- Base Model: Jerteh-125
- Fine-tuning Technique: Triplet loss
- Purpose: To enable semantic search within legal data
Installation
To use the model, you need to have Python 3.6 or higher installed. Additionally, install the necessary dependencies:
pip install transformers
pip install sentence-transformers
Usage
Here's how you can use the model for semantic search:
Load the Model
from sentence_transformers import SentenceTransformer
model = SentenceTransformer('nemanjaPetrovic/legal-jerteh-125-sbert')
Encode Sentences
sentences = ["Sankcije se propisuju u granicama zakonom utvrđenog minimuma i maksimuma.", "Vrste krivičnih sankcija određuju se samo krivičnim zakonom."]
sentence_embeddings = model.encode(sentences)
Perform Semantic Search
To perform a semantic search, encode both the query and the documents you want to search, then rank the documents by their cosine similarity to the query. For real workloads you should store the embeddings in a vector database, but for a quick test you can try the code below.
from sklearn.metrics.pairwise import cosine_similarity
import numpy as np
query = "Objasni mi pojam sankcija."
query_embedding = model.encode([query])
cosine_similarities = cosine_similarity(query_embedding, sentence_embeddings)
most_similar_idx = np.argmax(cosine_similarities)
most_similar_document = sentences[most_similar_idx]
print(f"The most similar document to the query is: {most_similar_document}")
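The snippet above returns only the single best match. If you want a ranked list instead, the following is a minimal sketch of a top-k lookup using plain NumPy. The function name `top_k_cosine` and the toy vectors are hypothetical and only stand in for real embeddings; it is not part of this repository or of sentence-transformers.

```python
import numpy as np

def top_k_cosine(query_vec, doc_matrix, k=3):
    """Return (index, score) pairs for the k rows of doc_matrix closest to query_vec."""
    query_vec = np.asarray(query_vec, dtype=float).reshape(-1)
    doc_matrix = np.asarray(doc_matrix, dtype=float)
    # After L2 normalisation, the dot product equals the cosine similarity.
    q = query_vec / np.linalg.norm(query_vec)
    d = doc_matrix / np.linalg.norm(doc_matrix, axis=1, keepdims=True)
    scores = d @ q
    # Never ask for more results than there are documents.
    k = min(k, len(scores))
    order = np.argsort(-scores)[:k]  # indices sorted by descending similarity
    return [(int(i), float(scores[i])) for i in order]

# Toy 2-dimensional vectors used purely for illustration.
toy_docs = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
for idx, score in top_k_cosine([2.0, 0.1], toy_docs, k=2):
    print(idx, round(score, 3))
```

With the model loaded as shown earlier, a call such as `top_k_cosine(model.encode(query), sentence_embeddings, k=2)` should give indices into `sentences`, best match first.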
Fine-tuning Details
The model was fine-tuned using triplet loss, a common technique for training embedding models to understand semantic similarity. The fine-tuning dataset consisted of triplets (anchor, positive, negative) to teach the model to distinguish between similar and dissimilar legal documents.
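To make the idea concrete, here is a rough sketch of how a triplet loss can be computed on embedding vectors. The Euclidean distance and the margin of 1.0 are assumptions chosen for illustration; this README does not state which distance metric or margin was used for the actual fine-tuning, and `triplet_loss` is a hypothetical helper, not the training code.

```python
import numpy as np

def triplet_loss(anchor, positive, negative, margin=1.0):
    """Mean of max(d(a, p) - d(a, n) + margin, 0) over a batch of triplets."""
    anchor = np.asarray(anchor, dtype=float)
    positive = np.asarray(positive, dtype=float)
    negative = np.asarray(negative, dtype=float)
    # Euclidean distances (an assumption; cosine distance is another common choice).
    d_pos = np.linalg.norm(anchor - positive, axis=-1)
    d_neg = np.linalg.norm(anchor - negative, axis=-1)
    # The loss is zero once the negative is at least `margin` farther away than the positive.
    return float(np.mean(np.maximum(d_pos - d_neg + margin, 0.0)))

# Toy vectors: the positive lies close to the anchor, the negative far away.
print(triplet_loss([0.0, 0.0], [0.0, 1.0], [3.0, 4.0]))
```

Minimizing this quantity pulls each anchor toward its positive and pushes it away from its negative, which is what lets similar legal passages end up near each other in the embedding space.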
License
This project is licensed under the MIT License - see the LICENSE file for details.
Acknowledgments
I would like to thank Mihailo Skoric, the author of the Jerteh-125 model, and the creators of Sentence-BERT for their foundational work, which made this project possible.
Contact
For any questions or issues, please contact [email protected].