Compare Embedding Models
This guide shows you how to compare the e5 and SPLADE embedding models using Okareo, focusing on their application in retrieval. It uses an embedded vector database to store and query the high-dimensional embeddings. The comparison highlights the difference in retrieval performance between a dense (e5) and a sparse (SPLADE) embedding model, evaluated on the MS MARCO dataset.
Retrieval Metrics
Okareo provides a number of industry-standard metrics for comparing the e5 and SPLADE embedding models, including Accuracy, Precision, Recall, NDCG, MRR, and MAP.
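To make a few of these metrics concrete, here is a minimal sketch of how reciprocal rank, recall@k, and NDCG@k could be computed for a single query with binary relevance. The function names and the toy document IDs are illustrative assumptions and are not part of the Okareo SDK, which computes these metrics for you during a test run.

```python
import math


def reciprocal_rank(ranked_ids, relevant_ids):
    """Return 1 / rank of the first relevant result, or 0.0 if none is found."""
    for rank, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in relevant_ids:
            return 1.0 / rank
    return 0.0


def recall_at_k(ranked_ids, relevant_ids, k):
    """Fraction of the relevant documents that appear in the top k results."""
    if not relevant_ids:
        return 0.0
    hits = sum(1 for doc_id in ranked_ids[:k] if doc_id in relevant_ids)
    return hits / len(relevant_ids)


def ndcg_at_k(ranked_ids, relevant_ids, k):
    """Binary-relevance NDCG: DCG of the ranking divided by the ideal DCG."""
    dcg = sum(
        1.0 / math.log2(rank + 1)
        for rank, doc_id in enumerate(ranked_ids[:k], start=1)
        if doc_id in relevant_ids
    )
    ideal_hits = min(len(relevant_ids), k)
    idcg = sum(1.0 / math.log2(rank + 1) for rank in range(1, ideal_hits + 1))
    return dcg / idcg if idcg > 0 else 0.0


ranked = ["d3", "d1", "d7"]  # hypothetical retrieval result, best match first
relevant = {"d1"}            # hypothetical ground-truth passage ID

print(reciprocal_rank(ranked, relevant))
print(recall_at_k(ranked, relevant, 3))
print(ndcg_at_k(ranked, relevant, 3))
```

MRR is the mean of the reciprocal rank over all queries in a scenario, so a model that places the correct passage first more often scores closer to 1.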
What do you need?
You will need a Python environment (likely Jupyter) and an Okareo API Token to get started.
The SDK requires an API Token. Refer to the Okareo API Key guide for more information.
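The code in this guide reads the token from the OKAREO_API_KEY environment variable. As a small sketch, you could check for it up front so a missing token fails with a clear message rather than a KeyError; the helper name load_api_key is an illustrative assumption, not part of the SDK.

```python
import os


def load_api_key(env=os.environ, var_name="OKAREO_API_KEY"):
    """Return the API token from the environment or raise a clear error."""
    token = env.get(var_name, "").strip()
    if not token:
        raise RuntimeError(
            f"Set the {var_name} environment variable before running the notebook."
        )
    return token
```

Calling load_api_key() with no arguments reads from the real environment; passing a plain dict makes the helper easy to test.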
Download the complete notebook - sparse_vs_dense_comparison.ipynb
We recommend running this example inside the examples folder of the SDK. You can download a zip file of the SDK here
This example uses a local instance of ChromaDB to make getting started quick and easy. If you want to use your own VectorDB, follow the same process outlined here with your own endpoint. Alternatively, you can use a pre-made integration with Pinecone or Qdrant.
Step 1: Install Okareo
Install Okareo in one of the first frames in your notebook. This example guide also uses ChromaDB, Pandas, PyTorch, and Hugging Face's Transformers.
- Install Okareo
%pip install okareo
# For this example, you will also need to install
# ChromaDB, Pandas, PyTorch, and Hugging Face's Transformers
%pip install chromadb
%pip install pandas
%pip install torch
%pip install transformers
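After installing, a quick check like the following sketch can confirm that every package is importable before you start the longer embedding steps. The helper name missing_packages is an illustrative assumption; it relies only on the standard library.

```python
import importlib.util


def missing_packages(module_names):
    """Return the module names that cannot be found in the current environment."""
    return [name for name in module_names if importlib.util.find_spec(name) is None]


# Import names of the packages installed above.
required = ["okareo", "chromadb", "pandas", "torch", "transformers"]
print("Missing packages:", missing_packages(required))
```

An empty list means all five packages were found.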
Step 2: Setup the VectorDB for the SPLADE model
How you organize and create your VectorDB is specific to the use case you are trying to solve. In this example, we are using the dev set of the MS MARCO dataset which can be found here or in the examples folder of the Okareo SDK. Place this file in the same directory as the notebook or add in the correct path. Adding the embeddings to the vector database can take around 3 to 7 minutes.
- Load Vector Data
import math
import hashlib

import chromadb
import pandas as pd
import torch
from chromadb import Documents, EmbeddingFunction, Embeddings
from torch import Tensor
from transformers import AutoModel, AutoModelForMaskedLM, AutoTokenizer


class MyEmbeddingFunction(EmbeddingFunction):
    def __call__(self, input: Documents) -> Embeddings:
        model_id = 'naver/splade-cocondenser-ensembledistil'
        tokenizer = AutoTokenizer.from_pretrained(model_id)
        model = AutoModelForMaskedLM.from_pretrained(model_id)
        all_embeds = []
        for doc in input:
            tokens = tokenizer(doc, return_tensors='pt')
            output = model(**tokens)
            # SPLADE term weights: log(1 + relu(logits)), max-pooled over the tokens
            vec = torch.max(
                torch.log(
                    1 + torch.relu(output.logits)
                ) * tokens.attention_mask.unsqueeze(-1),
                dim=1)[0].squeeze()
            # indices and weights of the non-zero vocabulary terms
            cols = vec.nonzero().squeeze().cpu().tolist()
            weights = vec[cols].cpu().tolist()
            # scatter the weights into a fixed-size vector of 30000 entries;
            # a term index of 30000 or higher would raise an IndexError
            embed_arr = [0] * 30000
            for i in range(len(cols)):
                embed_arr[cols[i]] = weights[i]
            all_embeds.append(embed_arr)
        return all_embeds
df = pd.read_csv('ms_marco_dev.csv')
rowsToEncode = 2000
passages = df['finalpassage'].tolist()[:rowsToEncode]
queries = df['query'].tolist()[:rowsToEncode]

# use the MD5 hash of each query as the ID of its passage
ids = []
for i in df['query'].tolist():
    ids.append(hashlib.md5(i.encode()).hexdigest())

chroma_client = chromadb.PersistentClient(path="./splade/")
collection = chroma_client.create_collection(
    name="chromadb",
    metadata={"hnsw:space": "cosine"},
    embedding_function=MyEmbeddingFunction()
)

# embed and add the passages in batches of 50
for i in range(0, math.floor(rowsToEncode / 50)):
    collection.add(
        documents=df['finalpassage'].tolist()[i*50:(i + 1)*50],
        ids=ids[i*50:(i + 1)*50]
    )

# To run this cell again after embedding documents,
# comment out the create_collection and for loop code above, and uncomment the statement below.
# collection = chroma_client.get_collection(
#     name="chromadb",
#     embedding_function=MyEmbeddingFunction()
# )
# If you would like to re-embed the documents, delete the splade folder and restart the notebook
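The loop above adds documents in batches of 50 and uses math.floor to count the batches, so it covers every row only when rowsToEncode is a multiple of 50. If you change rowsToEncode, a helper along these lines would also cover the final partial batch. This is a sketch; the name batch_bounds is an illustrative assumption.

```python
def batch_bounds(total, batch_size):
    """Yield (start, end) index pairs that cover range(total) in order."""
    for start in range(0, total, batch_size):
        yield start, min(start + batch_size, total)


# With 2000 rows and batches of 50 this matches the loop in the guide.
bounds = list(batch_bounds(2000, 50))

# With a row count that is not a multiple of 50, the last batch is shorter.
uneven = list(batch_bounds(120, 50))
print(uneven)
```

Each pair can be used to slice both lists, for example passages[start:end] and ids[start:end], when calling collection.add.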
Step 3: Create the scenario
First, we create a scenario using the MS MARCO dataset to test the vector database with SPLADE embeddings.
Each of the evaluations on this scenario can take 5 to 10 minutes. For a shorter evaluation process, decrease the NUM_SCENARIOS variable.
- Create Test Scenario
NUM_SCENARIOS = 200

import os
from okareo import Okareo
from okareo_api_client.models import ScenarioSetCreate, SeedData, ScenarioType

OKAREO_API_KEY = os.environ["OKAREO_API_KEY"]
okareo = Okareo(OKAREO_API_KEY)

# pick evenly spaced rows; each query is paired with the ID of its passage
seed_data = []
for i in range(0, rowsToEncode, math.floor(rowsToEncode / NUM_SCENARIOS)):
    seed_data.append(SeedData(input_=queries[i], result=[ids[i]]))

scenario_set_create = ScenarioSetCreate(
    name="Embedding scenarios",
    number_examples=1,
    generation_type=ScenarioType.SEED,
    seed_data=seed_data,
)
scenario = okareo.create_scenario_set(scenario_set_create)
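The seed data loop picks every step-th row, where the step is rowsToEncode divided by NUM_SCENARIOS and rounded down. The sketch below mirrors that arithmetic so you can see how many rows a given NUM_SCENARIOS value actually selects; the helper name sample_indices is an illustrative assumption.

```python
import math


def sample_indices(rows_to_encode, num_scenarios):
    """Mirror the guide's sampling: every `step`-th row, starting at row 0."""
    step = math.floor(rows_to_encode / num_scenarios)
    if step < 1:
        raise ValueError("num_scenarios must not exceed rows_to_encode")
    return list(range(0, rows_to_encode, step))


even = sample_indices(2000, 200)    # step of 10
uneven = sample_indices(2000, 300)  # step of 6, so more than 300 rows are picked
print(len(even), len(uneven))
```

When NUM_SCENARIOS does not divide rowsToEncode evenly, the rounded-down step selects more rows than requested, so choosing a divisor of rowsToEncode keeps the scenario count predictable.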
Step 4: Run the Evaluation for SPLADE
With the VectorDB loaded and the scenario created, we can now run a retrieval evaluation for the SPLADE embedding model.
- Evaluate Retrieval
# Perform a test run using a scenario set loaded in the previous cell
from datetime import datetime
from okareo_api_client.models import TestRunType
from okareo.model_under_test import CustomModel


def query_results_to_score(results):
    parsed_ids_with_scores = []
    for i in range(0, len(results['distances'][0])):
        # this turns cosine distance into a 0 to 1 cosine similarity score
        score = (2 - results['distances'][0][i]) / 2
        parsed_ids_with_scores.append((results['ids'][0][i], score))
    return parsed_ids_with_scores


class RetrievalModel(CustomModel):
    def invoke(self, input: str):
        results = collection.query(
            query_texts=[input],
            n_results=5
        )
        # return a tuple of (parsed_ids_with_scores, overall model response context)
        return query_results_to_score(results), {'model_data': input}


# this will return a model if it already exists or create a new one if it doesn't
model_under_test = okareo.register_model(
    name=f"splade {datetime.now().strftime('%m-%d %H:%M:%S')}",
    model=RetrievalModel(name="splade")
)

test_run_item = model_under_test.run_test(
    scenario=scenario,
    name=f"Retrieval Test Run splade {datetime.now().strftime('%m-%d %H:%M:%S')}",  # name for test run
    test_run_type=TestRunType.INFORMATION_RETRIEVAL,
    calculate_metrics=True
)

# display model level metrics for the test run
print(f"See test run results: {test_run_item.app_link}")
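The score conversion in query_results_to_score can be checked by hand. The sketch below assumes that the cosine space reports distance as 1 minus the cosine similarity, which gives values between 0 and 2; under that assumption, (2 - distance) / 2 equals (1 + similarity) / 2. The helper names are illustrative and not part of ChromaDB or Okareo.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def distance_to_score(distance):
    """Map a cosine distance in [0, 2] to a score in [0, 1], as in the guide."""
    return (2 - distance) / 2


# Assumed distance definition: 1 - cosine similarity.
identical = distance_to_score(1 - cosine_similarity([1.0, 0.0], [1.0, 0.0]))
orthogonal = distance_to_score(1 - cosine_similarity([1.0, 0.0], [0.0, 1.0]))
opposite = distance_to_score(1 - cosine_similarity([1.0, 0.0], [-1.0, 0.0]))
print(identical, orthogonal, opposite)
```

Identical directions score 1, unrelated (orthogonal) vectors score 0.5, and opposite directions score 0, so a higher score always means a closer match.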
Step 5: Load vector database with e5 embeddings
Next, load the vector database with the e5 embeddings.
If you are running this as a standalone notebook, you will need to use the e5 model to embed the dataset into the vector database. This will take a few minutes.
If you are running this from the examples folder in the SDK, the vector database is already populated with the e5 embeddings. Comment out the chroma_client.create_collection call and the for loop towards the end of the code block, then uncomment the chroma_client.get_collection statement at the end of the code block.
- Load Vector Data
class MyEmbeddingFunction(EmbeddingFunction):
    def __call__(self, input: Documents) -> Embeddings:
        def average_pool(last_hidden_states: Tensor,
                         attention_mask: Tensor) -> Tensor:
            # zero out padding positions, then average over the real tokens
            last_hidden = last_hidden_states.masked_fill(~attention_mask[..., None].bool(), 0.0)
            return last_hidden.sum(dim=1) / attention_mask.sum(dim=1)[..., None]

        # this guide prepends the 'query: ' prefix to every text it embeds
        for i in range(len(input)):
            input[i] = 'query: ' + (input[i] if isinstance(input[i], str) else '')

        tokenizer = AutoTokenizer.from_pretrained('intfloat/e5-small-v2')
        model = AutoModel.from_pretrained('intfloat/e5-small-v2')

        # Tokenize the input texts
        batch_dict = tokenizer(input, max_length=512, padding=True, truncation=True, return_tensors='pt')
        outputs = model(**batch_dict)
        embeddings = average_pool(outputs.last_hidden_state, batch_dict['attention_mask'])
        return embeddings.tolist()


chroma_client = chromadb.PersistentClient(path="./e5/")
collection = chroma_client.create_collection(
    name="chromadb",
    metadata={"hnsw:space": "cosine"},
    embedding_function=MyEmbeddingFunction()
)

# embed and add the passages in batches of 50
for i in range(0, math.floor(rowsToEncode / 50)):
    collection.add(
        documents=df['finalpassage'].tolist()[i*50:(i + 1)*50],
        ids=ids[i*50:(i + 1)*50]
    )

# To run this cell again after embedding documents,
# comment out the create_collection and for loop code above, and uncomment the statement below.
# collection = chroma_client.get_collection(
#     name="chromadb",
#     embedding_function=MyEmbeddingFunction()
# )
# If you would like to re-embed the documents, delete the e5 folder and restart the notebook
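The average_pool helper above turns the per-token hidden states into one vector per text by averaging only the positions where the attention mask is 1, so padding does not dilute the embedding. Here is a minimal sketch of the same idea for a single text using plain Python lists; the function name masked_mean_pool and the toy values are illustrative assumptions, and the sketch expects at least one unmasked token.

```python
def masked_mean_pool(token_vectors, attention_mask):
    """Average the token vectors whose mask entry is 1, ignoring padding."""
    dim = len(token_vectors[0])
    totals = [0.0] * dim
    count = 0
    for vector, keep in zip(token_vectors, attention_mask):
        if keep:
            count += 1
            for j, value in enumerate(vector):
                totals[j] += value
    return [total / count for total in totals]


tokens = [[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]]  # last row stands in for padding
mask = [1, 1, 0]
pooled = masked_mean_pool(tokens, mask)
print(pooled)
```

The padded row is excluded from both the sum and the count, which is what the masked_fill and attention_mask.sum calls achieve in the tensor version.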
Step 6: Run the Evaluation for e5
With the VectorDB loaded and the scenario created, we can now run a retrieval evaluation for the e5 embedding model.
- Evaluate Retrieval
class RetrievalModel(CustomModel):
    def invoke(self, input: str):
        results = collection.query(
            query_texts=[input],
            n_results=5
        )
        # return a tuple of (parsed_ids_with_scores, overall model response context)
        return query_results_to_score(results), {'model_data': input}


# this will return a model if it already exists or create a new one if it doesn't
model_under_test = okareo.register_model(name="e5", model=RetrievalModel(name="e5 model"))

test_run_item = model_under_test.run_test(
    scenario=scenario,
    name=f"Retrieval Test Run e5 {datetime.now().strftime('%m-%d %H:%M:%S')}",
    test_run_type=TestRunType.INFORMATION_RETRIEVAL,
    calculate_metrics=True
)

# display model level metrics for the test run
print(f"See test run results: {test_run_item.app_link}")
When your tests complete, you can navigate to https://app.okareo.com or follow the printed links to see the results.