
Compare Embedding Models

This guide shows how to compare the e5 and SPLADE embedding models using Okareo, focusing on their use in retrieval. We use an embedded vector database to store and search the high-dimensional embeddings. The comparison highlights the difference in retrieval performance between a dense (e5) and a sparse (SPLADE) embedding model, evaluated on the MS MARCO dataset.

Retrieval Metrics

Okareo provides a number of industry-standard metrics for comparing the e5 and SPLADE embedding models, including Accuracy, Precision, Recall, NDCG, MRR, and MAP.
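To make these ranking metrics concrete, here is a minimal sketch of how Recall@k, reciprocal rank (the per-query building block of MRR), and NDCG@k can be computed for a single query. The helper names are hypothetical and binary relevance is assumed; this is an illustration, not Okareo's implementation.

```python
import math

def recall_at_k(ranked_ids, relevant_ids, k):
    """Fraction of the relevant ids that appear in the top k results."""
    relevant = set(relevant_ids)
    if not relevant:
        return 0.0
    return len(set(ranked_ids[:k]) & relevant) / len(relevant)

def reciprocal_rank(ranked_ids, relevant_ids):
    """1 / rank of the first relevant result, or 0.0 if none was retrieved."""
    relevant = set(relevant_ids)
    for rank, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0

def ndcg_at_k(ranked_ids, relevant_ids, k):
    """Binary-relevance NDCG: DCG of the ranking divided by the ideal DCG."""
    relevant = set(relevant_ids)
    dcg = sum(
        1.0 / math.log2(rank + 1)
        for rank, doc_id in enumerate(ranked_ids[:k], start=1)
        if doc_id in relevant
    )
    ideal_hits = min(len(relevant), k)
    idcg = sum(1.0 / math.log2(rank + 1) for rank in range(1, ideal_hits + 1))
    return dcg / idcg if idcg > 0 else 0.0

# One query whose single relevant passage was returned in second place
ranked = ["d3", "d1", "d7", "d2", "d9"]
relevant = ["d1"]
recall = recall_at_k(ranked, relevant, k=5)
rr = reciprocal_rank(ranked, relevant)
ndcg = ndcg_at_k(ranked, relevant, k=5)
```

Averaging the reciprocal rank over all queries in a scenario gives MRR; the other metrics are averaged over queries in the same way.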

What do you need?

You will need a Python environment (likely Jupyter) and an Okareo API Token to get started.

tip

The SDK requires an API Token. Refer to the Okareo API Key guide for more information.

note

Download the complete notebook - sparse_vs_dense_comparison.ipynb

We recommend running this example inside the examples folder of the SDK. You can download a zip file of the SDK here

This example uses a local instance of ChromaDB to make getting started quick and easy. However, if you want to use your own VectorDB, follow the same process outlined here with your own endpoint. Alternatively, you can use a pre-made integration with Pinecone or Qdrant.

Step 1: Install Okareo

Install Okareo in one of the first frames in your notebook. This example guide also uses ChromaDB, Pandas, PyTorch, and Hugging Face's Transformers.

%pip install okareo

# For this example, you will also need to install
# ChromaDB, Pandas, PyTorch, and Hugging Face's Transformers
%pip install chromadb
%pip install pandas
%pip install torch
%pip install transformers

Step 2: Set up the VectorDB for the SPLADE model

How you organize and create your VectorDB is specific to the use case you are trying to solve. In this example, we are using the dev set of the MS MARCO dataset which can be found here or in the examples folder of the Okareo SDK. Place this file in the same directory as the notebook or add in the correct path. Adding the embeddings to the vector database can take around 3 to 7 minutes.

import hashlib
import math

import chromadb
import pandas as pd
import torch
from chromadb import Documents, EmbeddingFunction, Embeddings
from torch import Tensor
from transformers import AutoModel, AutoModelForMaskedLM, AutoTokenizer

class MyEmbeddingFunction(EmbeddingFunction):
    def __init__(self):
        model_id = 'naver/splade-cocondenser-ensembledistil'
        # Load the tokenizer and model once instead of on every call
        self.tokenizer = AutoTokenizer.from_pretrained(model_id)
        self.model = AutoModelForMaskedLM.from_pretrained(model_id)

    def __call__(self, input: Documents) -> Embeddings:
        all_embeds = []
        for doc in input:
            tokens = self.tokenizer(doc, return_tensors='pt', truncation=True)
            with torch.no_grad():
                output = self.model(**tokens)
            # SPLADE weights: max over tokens of log(1 + ReLU(logits)), ignoring padding
            vec = torch.max(
                torch.log(
                    1 + torch.relu(output.logits)
                ) * tokens.attention_mask.unsqueeze(-1),
                dim=1)[0].squeeze()
            # squeeze(-1) keeps a list even when only one weight is non-zero
            cols = vec.nonzero().squeeze(-1).cpu().tolist()

            weights = vec[cols].cpu().tolist()
            # One slot per vocabulary entry, so every token id is a valid index
            embed_arr = [0.0] * self.model.config.vocab_size
            for col, weight in zip(cols, weights):
                embed_arr[col] = weight
            all_embeds.append(embed_arr)
        return all_embeds

df = pd.read_csv('ms_marco_dev.csv')

rowsToEncode = 2000
passages = df['finalpassage'].tolist()[:rowsToEncode]
queries = df['query'].tolist()[:rowsToEncode]
# Use the MD5 hash of each query as the id of its passage
ids = []
for i in df['query'].tolist():
    ids.append(hashlib.md5(i.encode()).hexdigest())

chroma_client = chromadb.PersistentClient(path="./splade/")

collection = chroma_client.create_collection(
    name="chromadb",
    metadata={"hnsw:space": "cosine"},
    embedding_function=MyEmbeddingFunction()
)
# Add the passages in batches of 50
for i in range(0, math.floor(rowsToEncode / 50)):
    collection.add(
        documents=df['finalpassage'].tolist()[i*50:(i + 1)*50],
        ids=ids[i*50:(i + 1)*50]
    )
# To run this cell again after embedding documents,
# comment out the create_collection and for loop code above, and uncomment the statement below.
# collection = chroma_client.get_collection(
#     name="chromadb",
#     embedding_function=MyEmbeddingFunction()
# )
# If you would like to re-embed the documents, delete the splade folder and restart the notebook
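The embedding function in this step turns the masked-language-model logits into one weight per vocabulary entry: each logit goes through log(1 + ReLU(x)), padding positions are ignored, and the maximum over the tokens is kept. As a rough illustration of that pooling only, here is a pure-Python sketch on toy numbers. The function name `splade_weights` and the four-entry "vocabulary" are made up for the example; the real model produces the logits.

```python
import math

def splade_weights(logits, attention_mask):
    """Pool per-token logits into sparse vocabulary weights.

    logits: one list of vocabulary logits per token position.
    attention_mask: 1 for real tokens, 0 for padding.
    Returns {vocabulary_index: weight} for the non-zero weights only.
    """
    vocab_size = len(logits[0])
    weights = [0.0] * vocab_size
    for token_logits, keep in zip(logits, attention_mask):
        if not keep:
            continue  # padding positions contribute nothing
        for j, logit in enumerate(token_logits):
            activation = math.log1p(max(logit, 0.0))  # log(1 + ReLU(logit))
            if activation > weights[j]:
                weights[j] = activation  # max over token positions
    return {j: w for j, w in enumerate(weights) if w > 0.0}

toy_logits = [
    [2.0, -1.0, 0.0, 0.5],
    [0.5, 3.0, -2.0, 0.0],
    [9.0, 9.0, 9.0, 9.0],  # padding position, masked out below
]
toy_mask = [1, 1, 0]
sparse = splade_weights(toy_logits, toy_mask)
```

Most entries of a real SPLADE vector are zero, which is what makes the representation sparse. The notebook still stores the full-length array because the ChromaDB collection used here expects fixed-size dense vectors.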

Step 3: Create the scenario

First, we create a scenario using the MS MARCO dataset to test the vector database with SPLADE embeddings.

Each of the evaluations on this scenario can take 5 to 10 minutes. For a shorter evaluation process, decrease the NUM_SCENARIOS variable.

NUM_SCENARIOS = 200
import os
from okareo import Okareo
from okareo_api_client.models import ScenarioSetCreate, SeedData, ScenarioType

OKAREO_API_KEY = os.environ["OKAREO_API_KEY"]
okareo = Okareo(OKAREO_API_KEY)
seed_data = []
# Take evenly spaced rows so the scenario covers the whole encoded range
for i in range(0, rowsToEncode, math.floor(rowsToEncode / NUM_SCENARIOS)):
    seed_data.append(SeedData(input_=queries[i], result=[ids[i]]))
scenario_set_create = ScenarioSetCreate(
    name="Embedding scenarios",
    number_examples=1,
    generation_type=ScenarioType.SEED,
    seed_data=seed_data,
)
scenario = okareo.create_scenario_set(scenario_set_create)
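The seed loop above picks every n-th row, where n is rowsToEncode divided by NUM_SCENARIOS and rounded down. The sketch below isolates that index arithmetic so you can check how many seeds a given setting produces before starting a long evaluation. `sample_indices` is a hypothetical helper written for this illustration, not part of the Okareo SDK.

```python
import math

def sample_indices(total_rows, num_scenarios):
    """Return the row indices picked by a stride of floor(total_rows / num_scenarios)."""
    step = math.floor(total_rows / num_scenarios)
    return list(range(0, total_rows, step))

even = sample_indices(2000, 200)    # stride 10
uneven = sample_indices(2000, 300)  # stride 6, does not divide 2000 evenly
```

With the defaults (2000 rows, 200 scenarios) the stride is 10 and exactly 200 rows are selected. When the numbers do not divide evenly, the rounded-down stride yields more seeds than requested (334 for 300 in the example), and a NUM_SCENARIOS larger than rowsToEncode gives a stride of 0, which `range` rejects with a ValueError.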

Step 4: Run the Evaluation for SPLADE

With the VectorDB loaded and the scenario created, we can now run a retrieval evaluation for the SPLADE embedding model.

# Perform a test run using a scenario set loaded in the previous cell 
from datetime import datetime
from okareo_api_client.models import TestRunType
from okareo.model_under_test import CustomModel
def query_results_to_score(results):
    parsed_ids_with_scores = []
    for i in range(0, len(results['distances'][0])):
        # this turns cosine distance (0 to 2) into a 0 to 1 similarity score
        score = (2 - results['distances'][0][i]) / 2
        parsed_ids_with_scores.append((results['ids'][0][i], score))
    return parsed_ids_with_scores

class RetrievalModel(CustomModel):
    def invoke(self, input: str):
        results = collection.query(
            query_texts=[input],
            n_results=5
        )
        # return a tuple of (parsed_ids_with_scores, overall model response context)
        return query_results_to_score(results), {'model_data': input}

# this will return a model if it already exists or create a new one if it doesn't
model_under_test = okareo.register_model(
    name=f"splade {datetime.now().strftime('%m-%d %H:%M:%S')}",
    model=RetrievalModel(name="splade")
)

test_run_item = model_under_test.run_test(
    scenario=scenario,
    name=f"Retrieval Test Run splade {datetime.now().strftime('%m-%d %H:%M:%S')}", # name for test run
    test_run_type=TestRunType.INFORMATION_RETRIEVAL,
    calculate_metrics=True
)

# display model level metrics for the test run
print(f"See test run results: {test_run_item.app_link}")
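The score conversion in query_results_to_score can be checked without a vector database. The sketch below assumes, as the notebook does, that the collection reports cosine distance (1 minus cosine similarity, so values from 0 to 2) and that query results arrive as a dictionary with nested "ids" and "distances" lists. The sample result is made up, and `distance_to_score` and `to_id_score_pairs` are hypothetical names for this illustration.

```python
def distance_to_score(distance):
    """Map a cosine distance in [0, 2] to a score in [0, 1]; higher is more similar."""
    return (2 - distance) / 2

def to_id_score_pairs(results):
    """Pair each returned id with its score for the first (and only) query."""
    ids = results["ids"][0]
    distances = results["distances"][0]
    return [(doc_id, distance_to_score(d)) for doc_id, d in zip(ids, distances)]

# Made-up result for a single query: identical, orthogonal, and opposite vectors
demo_results = {"ids": [["a", "b", "c"]], "distances": [[0.0, 1.0, 2.0]]}
pairs = to_id_score_pairs(demo_results)
```

A distance of 0 maps to a score of 1.0, a distance of 1 (orthogonal vectors) to 0.5, and a distance of 2 (opposite vectors) to 0.0.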

Step 5: Load vector database with e5 embeddings

Now we need to load the vector database with the e5 embeddings.

If you are running this as a standalone notebook, you will need to use the e5 model to embed the dataset into the vectordb. This will take a few minutes.

If you are running this from the examples folder in the SDK, we have the vector database ready with the e5 embeddings. Just comment out the chroma_client.create_collection and for loop towards the end of the code block. Then uncomment the chroma_client.get_collection statement at the end of the code block.

def average_pool(last_hidden_states: Tensor,
                 attention_mask: Tensor) -> Tensor:
    # Zero out padding positions, then average over the real tokens
    last_hidden = last_hidden_states.masked_fill(~attention_mask[..., None].bool(), 0.0)
    return last_hidden.sum(dim=1) / attention_mask.sum(dim=1)[..., None]

class MyEmbeddingFunction(EmbeddingFunction):
    def __init__(self):
        # Load the tokenizer and model once instead of on every call
        self.tokenizer = AutoTokenizer.from_pretrained('intfloat/e5-small-v2')
        self.model = AutoModel.from_pretrained('intfloat/e5-small-v2')

    def __call__(self, input: Documents) -> Embeddings:
        # e5 expects a 'query: ' or 'passage: ' prefix; this example uses 'query: ' for every input
        texts = ['query: ' + (text if isinstance(text, str) else '') for text in input]

        # Tokenize the input texts
        batch_dict = self.tokenizer(texts, max_length=512, padding=True, truncation=True, return_tensors='pt')

        with torch.no_grad():
            outputs = self.model(**batch_dict)
        embeddings = average_pool(outputs.last_hidden_state, batch_dict['attention_mask'])
        return embeddings.tolist()

chroma_client = chromadb.PersistentClient(path="./e5/")

collection = chroma_client.create_collection(
    name="chromadb",
    metadata={"hnsw:space": "cosine"},
    embedding_function=MyEmbeddingFunction()
)
# Add the passages in batches of 50
for i in range(0, math.floor(rowsToEncode / 50)):
    collection.add(
        documents=df['finalpassage'].tolist()[i*50:(i + 1)*50],
        ids=ids[i*50:(i + 1)*50]
    )
# To run this cell again after embedding documents,
# comment out the create_collection and for loop code above, and uncomment the statement below.
# collection = chroma_client.get_collection(
#     name="chromadb",
#     embedding_function=MyEmbeddingFunction()
# )
# If you would like to re-embed the documents, delete the e5 folder and restart the notebook
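The e5 embedding function above reduces the per-token hidden states to one vector per text by averaging only over real tokens, so padding does not dilute the embedding. As an illustration of that masked average, here is a pure-Python sketch on nested lists instead of tensors. The toy values are made up, and `average_pool_lists` is a hypothetical stand-in for the tensor version in the notebook.

```python
def average_pool_lists(hidden_states, attention_mask):
    """Masked mean over the token axis.

    hidden_states: [batch][tokens][dim] nested lists of floats.
    attention_mask: [batch][tokens] with 1 for real tokens and 0 for padding.
    Returns one [dim] vector per batch item.
    """
    pooled = []
    for token_vectors, mask in zip(hidden_states, attention_mask):
        totals = [0.0] * len(token_vectors[0])
        for vector, keep in zip(token_vectors, mask):
            if keep:
                for j, value in enumerate(vector):
                    totals[j] += value
        real_tokens = sum(mask)  # number of non-padding tokens
        pooled.append([total / real_tokens for total in totals])
    return pooled

# One text with two real tokens and one padding token
hidden = [[[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]]]
mask = [[1, 1, 0]]
pooled = average_pool_lists(hidden, mask)
```

The padding vector is ignored, so the result is the mean of the first two token vectors rather than of all three.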

Step 6: Run the Evaluation for e5

With the VectorDB loaded and the scenario created, we can now run a retrieval evaluation for the e5 embedding model.

class RetrievalModel(CustomModel):
    def invoke(self, input: str):
        results = collection.query(
            query_texts=[input],
            n_results=5
        )
        # return a tuple of (parsed_ids_with_scores, overall model response context)
        return query_results_to_score(results), {'model_data': input}

# this will return a model if it already exists or create a new one if it doesn't
model_under_test = okareo.register_model(name="e5", model=RetrievalModel(name="e5 model"))

test_run_item = model_under_test.run_test(
    scenario=scenario,
    name=f"Retrieval Test Run e5 {datetime.now().strftime('%m-%d %H:%M:%S')}",
    test_run_type=TestRunType.INFORMATION_RETRIEVAL,
    calculate_metrics=True
)

# display model level metrics for the test run
print(f"See test run results: {test_run_item.app_link}")

When your test completes, you can navigate to https://app.okareo.com or follow the printed links to see the results.