Python SDK
Okareo has a rich set of APIs that you can explore through the API Guide.
The Okareo Python SDK is designed to accelerate your use of Okareo from your Python notebook or application.
The SDK requires an API Token. Refer to the Okareo API Key guide for more information.
Installation
pip install okareo
Create an instance
To use Okareo, you will need to instantiate the library with your API token.
from okareo import Okareo
okareo = Okareo("YOUR API TOKEN")
Library Methods
create_scenario_set
A scenario set is the Okareo unit of data collection. Any scenario can be used to drive a registered model or as a seed for synthetic data generation. Often both.
- Usage
- Details
- Result
from okareo import Okareo
okareo = Okareo("YOUR API TOKEN")
from okareo_api_client.models import ScenarioSetCreate, SeedData
okareo.create_scenario_set(
ScenarioSetCreate(name="NAME OF SCENARIO SET",
number_examples=1,
seed_data=[
SeedData(
input_="Example input to be sent to the model",
result="Expected result from the model"
),
]
)
)
Takes a single argument, ScenarioSetCreate.
from okareo_api_client.models import ScenarioSetCreate
"""
name (str): Name of the scenario set
seed_data (List['SeedData']): Seed data is a list of dictionaries, each with an input and result
number_examples (int): Number of examples
project_id (Union[Unset, str]): ID for the project
generation_type (Union[Unset, ScenarioType]): An enumeration. Default: ScenarioType.REPHRASE_INVARIANT.
"""
from okareo_api_client.models import ScenarioSetResponse
"""
scenario_id (str):
project_id (str):
time_created (datetime.datetime):
type (str):
tags (Union[Unset, List[str]]):
name (Union[Unset, str]):
seed_data (Union[Unset, List['SeedData']]):
scenario_count (Union[Unset, int]):
scenario_input (Union[Unset, List[str]]):
"""
find_datapoints
Datapoints are accessible for research and analysis from your Notebook or elsewhere.
- Usage
- Details
- Result
from okareo import Okareo
okareo = Okareo("YOUR API TOKEN")
okareo.find_datapoints(context_token="YOUR UNIQUE TOKEN")
Takes a single argument, context_token.
The context_token is defined when the datapoint is persisted. It is typically metadata from the model interaction flow.
from okareo_api_client.models import DatapointListItem
"""
id (str):
tags (Union[Unset, List[str]]):
input_ (Union['DatapointListItemInputType0', List[Any], Unset, str]):
input_datetime (Union[Unset, datetime.datetime]):
result (Union['DatapointListItemResultType0', List[Any], Unset, str]):
result_datetime (Union[Unset, datetime.datetime]):
feedback (Union[Unset, float]):
error_message (Union[Unset, str]):
error_code (Union[Unset, str]):
time_created (Union[Unset, datetime.datetime]):
context_token (Union[Unset, str]):
mut_id (Union[Unset, str]):
project_id (Union[Unset, str]):
test_run_id (Union[Unset, str]):
"""
generate_scenario_set
Generate synthetic data based on a prior scenario. The seed scenario could be from a prior evaluation run, an upload, or statically defined.
- Usage
- Details
- Result
from okareo import Okareo
from okareo_api_client.models import ScenarioSetGenerate
okareo = Okareo("YOUR API TOKEN")
okareo.generate_scenario_set(
ScenarioSetGenerate(
source_scenario_id=create_scenario_set.scenario_id,
name="generated scenario set",
number_examples=2,
)
)
Takes a single argument, ScenarioSetGenerate.
from okareo_api_client.models import ScenarioSetGenerate
"""
source_scenario_id (str): ID for the scenario set that the generated scenario set will use as a source
name (str): Name of the generated scenario set
number_examples (int): Number of examples to be generated for the scenario set
project_id (Union[Unset, str]): ID for the project
generation_type (Union[Unset, ScenarioType]): An enumeration. Default: ScenarioType.REPHRASE_INVARIANT.
"""
from okareo_api_client.models import ScenarioSetResponse
"""
scenario_id (str):
project_id (str):
time_created (datetime.datetime):
type (str):
tags (Union[Unset, List[str]]):
name (Union[Unset, str]):
seed_data (Union[Unset, List['SeedData']]):
scenario_count (Union[Unset, int]):
scenario_input (Union[Unset, List[str]]):
"""
generate_scenarios
Generate synthetic data based on a prior scenario. The seed scenario could be from a prior evaluation run, an upload, or statically defined.
- Usage
- Details
- Result
from okareo import Okareo
from okareo_api_client.models import ScenarioType
okareo = Okareo("YOUR API TOKEN")
okareo.generate_scenarios(
source_scenario="333155d5-0658-4080-b006-b83ad6c10797",
name="Common Misspellings Scenario",
number_examples=2,
generation_type=ScenarioType.COMMON_MISSPELLINGS
)
Pass in details defining the source and type of synthetic data you want to generate.
"""
project_id: str <uuid>
source_scenario: str <uuid>
name: str
number_examples: int
generation_type: ScenarioType
"""
from okareo_api_client.models import ScenarioType
"""
COMMON_MISSPELLINGS = "COMMON_MISSPELLINGS"
COMMON_CONTRACTIONS = "COMMON_CONTRACTIONS"
REPHRASE_INVARIANT = "REPHRASE_INVARIANT"
CONDITIONAL = "CONDITIONAL"
TEXT_REVERSE_QUESTION = "TEXT_REVERSE_QUESTION"
TEXT_REVERSE_LABELED = "TEXT_REVERSE_LABELED"
TERM_RELEVANCE_INVARIANT = "TERM_RELEVANCE_INVARIANT"
"""
from okareo_api_client.models import ScenarioSetResponse
"""
scenario_id (str):
project_id (str):
time_created (datetime.datetime):
type (str):
tags (Union[Unset, List[str]]):
name (Union[Unset, str]):
seed_data (Union[Unset, List['SeedData']]):
scenario_count (Union[Unset, int]):
scenario_input (Union[Unset, List[str]]):
"""
Okareo has multiple synthetic data generators. We have provided details about each generator type below:
Rephrase
Rephrasings of the inputs will be generated. For example, if the input is "Neil Alden Armstrong was an American astronaut and aeronautical engineer who in 1969 became the first person to walk on the Moon", the generated input could be "Neil Alden Armstrong, an American astronaut and aeronautical engineer, made history in 1969 as the first individual to set foot on the Moon."
okareo.generate_scenarios(
source_scenario="333155d5-0658-4080-b006-b83ad6c10797",
name="Rephrased Scenario",
number_examples=1,
generation_type=ScenarioType.REPHRASE_INVARIANT
)
Common Misspellings
Common misspellings of the inputs will be generated. For example, if the input is "What is a receipt?", the generated input could be "What is a reviept?".
okareo.generate_scenarios(
source_scenario="333155d5-0658-4080-b006-b83ad6c10797",
name="Common Mispellings Scenario",
number_examples=1,
generation_type=ScenarioType.COMMON_MISSPELLINGS
)
Common Contractions
Each input in the scenario will be shortened by 1 or 2 characters. For example, if the input is "What is a steering wheel?", the generated input could be "What is a steering whl?".
okareo.generate_scenarios(
source_scenario="333155d5-0658-4080-b006-b83ad6c10797",
name="Common Contractions Scenario",
number_examples=1,
generation_type=ScenarioType.COMMON_CONTRACTIONS
)
Conditional
Each input in the scenario will be rephrased as a conditional statement. For example, if the input is "What are the side effects of this medicine?", the generated input could be "Considering this medicine, what might be the potential side effects?".
okareo.generate_scenarios(
source_scenario="333155d5-0658-4080-b006-b83ad6c10797",
name="Conditional Scenario",
number_examples=1,
generation_type=ScenarioType.CONDITIONAL
)
Reverse Question
Each input in the scenario will be rephrased as a question that the input should be the answer for. For example, if the input is "The first game of baseball was played in 1846.", the generated input could be "When was the first game of baseball ever played?".
okareo.generate_scenarios(
source_scenario="333155d5-0658-4080-b006-b83ad6c10797",
name="Reverse Question Scenario",
number_examples=1,
generation_type=ScenarioType.TEXT_REVERSE_QUESTION
)
Term Relevance
Each input in the scenario will be rephrased to only include the most relevant terms, where relevance is based on the list of inputs provided to the scenario. We then use parts of speech to determine a valid ordering of the relevant terms. For example, if the inputs are all names of various milk teas, such as "Cool Sweet Honey Taro Milk Tea with Brown Sugar Boba", the generated input could be "Taro Milk Tea", since "Taro", "Milk", and "Tea" could be the most relevant terms.
okareo.generate_scenarios(
source_scenario="333155d5-0658-4080-b006-b83ad6c10797",
name="Term Relevance Scenario",
number_examples=1,
generation_type=ScenarioType.TERM_RELEVANCE_INVARIANT
)
get_scenario_data_points
Return each of the datapoints related to a single evaluation run
- Usage
- Details
- Result
from okareo import Okareo
okareo = Okareo("YOUR API TOKEN")
okareo.get_scenario_data_points("333155d5-0658-4080-b006-b83ad6c10797")
Takes a single argument, scenario_id.
scenario_id: str <uuid>
from okareo_api_client.models import ScenarioDataPoinResponse
"""
id (str):
input_ (Union['ScenarioDataPoinResponseInputType0', List[Any], str]):
result (Union['ScenarioDataPoinResponseResultType0', List[Any], str]):
meta_data (Union[Unset, str]):
"""
register_model
Register the model that you want to evaluate, test or collect datapoints from. Models must be uniquely named within a project namespace.
The first time a model is defined, the attributes of the model are persisted. Subsequent calls to register_model will return the persisted model. They will not update the definition.
- Usage
- Details
- Result
from okareo import Okareo
from okareo.model_under_test import OpenAIModel
okareo = Okareo("YOUR API TOKEN")
okareo.register_model(
    name="Model Classifier",
    model=OpenAIModel(
        model_id="gpt-3.5-turbo",
        temperature=0,
        system_prompt_template=CLASSIFICATION_CONTEXT_TEMPLATE,  # your system prompt string
        user_prompt_template=None,
    ),
)
Requires the parameter name. The tags and model parameters are optional. Okareo has a number of pre-made model plugins that you can use off the shelf. Alternatively, you can use the callback mechanism to interact with a custom model.
"""
name: str,
tags: Union[List[str], None] = None,
model: BaseModel = None,
"""
"""
mut_id (str):
project_id (str):
name (str):
tags (List[str]):
"""
Okareo has ready-to-run integrations with the following models and vector databases. Don't hesitate to reach out if you need another model.
OpenAI (LLM)
from okareo.model_under_test import OpenAIModel
"""
model_id: str
temperature: float
system_prompt_template: Optional[str] = None
user_prompt_template: Optional[str] = None
dialog_template: Optional[str] = None
tools: Optional[List] = None
"""
Generation Model (LLM)
from okareo.model_under_test import GenerationModel
"""
model_id: str
temperature: float
system_prompt_template: Optional[str] = None
user_prompt_template: Optional[str] = None
dialog_template: Optional[str] = None
tools: Optional[List] = None
"""
The GenerationModel is a universal LLM interface that supports most model providers. Users can plug in different model names, including OpenAI, Anthropic, and Cohere models.
Example using a Cohere model with GenerationModel:
from okareo.model_under_test import GenerationModel
cohere_model = GenerationModel(
model_id="command-r",
temperature=0.7,
system_prompt_template="You are a helpful assistant.",
)
Example with tools:
from okareo.model_under_test import GenerationModel
tools = [
{
"type": "function",
"function": {
"name": "get_current_weather",
"description": "Get the current weather in a given location",
"parameters": {
"type": "object",
"properties": {
"location": {
"type": "string",
"description": "The city and state, e.g. San Francisco, CA"
},
"unit": {
"type": "string",
"enum": ["celsius", "fahrenheit"]
}
},
"required": ["location"]
}
}
}
]
model_with_tools = GenerationModel(
model_id="gpt-3.5-turbo-0613",
temperature=0.7,
system_prompt_template="You are a helpful assistant with access to weather information.",
tools=tools
)
In these examples, we're using the Cohere "command-r" model and the OpenAI "gpt-3.5-turbo-0613" model through the GenerationModel interface. The second example demonstrates how to include tools, which can be used for function calling capabilities.
Pinecone (VectorDB)
from okareo.model_under_test import PineconeDB
"""
index_name: str
region: str
project_id: str
top_k: int = 5
"""
QDrant (VectorDB)
from okareo.model_under_test import QdrantDB
"""
collection_name: str
url: str
top_k: int = 5
"""
CustomModel / ModelInvocation
You can use the CustomModel object to define your own custom, provider-agnostic models.
from okareo.model_under_test import CustomModel
"""
name: str
@abstractmethod
def invoke(
self, input_value: Union[dict, list, str]
) -> Union[ModelInvocation, Any]:
pass
"""
To use the CustomModel object, you will need to create a child class that defines an invoke method that returns a ModelInvocation object. For example:
from typing import Union

from okareo.model_under_test import CustomModel, ModelInvocation
class MyCustomModel(CustomModel):
def invoke(self, input_value: Union[dict, list, str]) -> ModelInvocation:
# your model's invoke logic goes here
return ModelInvocation(
model_prediction=...,
model_input=...,
model_output_metadata=...,
tool_calls=...
)
Where the ModelInvocation's inputs are defined as follows:
class ModelInvocation:
"""Model invocation response object returned from a CustomModel.invoke method"""
model_prediction: Union[dict, list, str, None] = None
"""Prediction from the model to be used when running the evaluation,
e.g. predicted class from classification model or generated text completion from
a generative model. This would typically be parsed out of the overall model_output_metadata."""
model_input: Union[dict, list, str, None] = None
"""All the input sent to the model"""
model_output_metadata: Union[dict, list, str, None] = None
"""Full model response, including any metadata returned with model's output"""
tool_calls: Optional[List] = None
"""List of tool calls made during the model invocation, if any"""
The logic of your invoke method depends on many factors, chief among them the intended TestRunType of the CustomModel. Below, we highlight an example of how to use CustomModel for each TestRunType in Okareo.
- Classification
- Retrieval
- Generation
The following snippet is taken from the classification_eval.ipynb example notebook. The underlying model is a DistilBERT model trained to classify queries into one of three categories. The model weights are available on Okareo's Hugging Face repository.
# Load all of the necessary libraries from Okareo
from okareo import Okareo
from okareo_api_client.models import ScenarioSetCreate, SeedData
from okareo.model_under_test import CustomModel, ModelInvocation
# Load the torch library
import torch
# Create an instance of the Okareo client
okareo = Okareo(OKAREO_API_KEY)
# Define a model class that will be used for classification
# The model takes in a scenario and returns a predicted class
class ClassificationModel(CustomModel):
# Constructor for the model
def __init__(self, name, tokenizer, model):
self.name = name
# The pretrained tokenizer
self.tokenizer = tokenizer
# The pretrained model
self.model = model
# The possible labels for the model
self.label_lookup = ["pricing", "returns", "complaints"]
# Callable to be applied to each scenario in the scenario set
def invoke(self, input: str):
# Tokenize the input
encoding = self.tokenizer(input, return_tensors="pt", padding="max_length", truncation=True, max_length=512)
# Get the logits from the model
logits = self.model(**encoding).logits
# Get the index of the highest value (the predicted class)
idx = torch.argmax(logits, dim=1).item()
# Get the label for the predicted class
prediction = self.label_lookup[idx]
# Return the prediction in a ModelInvocation object
return ModelInvocation(
model_prediction=prediction,
model_input=input,
model_output_metadata={ "prediction": prediction, "confidence": logits.softmax(dim=1).max().item() },
)
Okareo natively supports Pinecone and QDrant for retrieval. If you want to utilize a different model provider or vector database, then you can use CustomModel to do so.
The following CustomModel retrieval example is taken from the retrieval_eval.ipynb notebook. This example shows how to set up a ChromaDB collection and how to query the collection inside of the CustomModel.invoke() method.
# Import ChromaDB
import chromadb
# Create a ChromaDB client
chroma_client = chromadb.Client()
# Create a ChromaDB collection
# The collection will be used to store the documents as vector embeddings
# We want to measure the similarity between questions and documents using cosine similarity
collection = chroma_client.create_collection(name="retrieval_test", metadata={"hnsw:space": "cosine"})
# Add the documents to the collection with the corresponding metadata
# (jsonObj and metadata_list are loaded earlier in the example notebook)
collection.add(
documents=list(jsonObj.input),
ids=list(jsonObj.result),
metadatas=metadata_list
)
# A function to convert the query results from our ChromaDB collection into a list of dictionaries with the document ID, score, metadata, and label
def query_results_to_score(results):
parsed_ids_with_scores = []
for i in range(0, len(results['distances'][0])):
# Create a score based on cosine similarity
score = (2 - results['distances'][0][i]) / 2
parsed_ids_with_scores.append(
{
"id": results['ids'][0][i],
"score": score,
"metadata": results['metadatas'][0][i],
"label": f"{results['metadatas'][0][i]['article_type']} WebBizz Article w/ ID: {results['ids'][0][i]}"
}
)
return parsed_ids_with_scores
# Define a custom retrieval model that uses the ChromaDB collection to retrieve documents
# The model will return the top 5 most relevant documents based on the input query
class RetrievalModel(CustomModel):
def invoke(self, input: str) -> ModelInvocation:
# Query the collection with the input text
results = collection.query(
query_texts=[input],
n_results=5
)
# Return formatted query results and the model response context
return ModelInvocation(
model_input=input,
model_prediction=query_results_to_score(results),
model_output_metadata={'model_data': input}
)
Okareo natively supports most model providers for generation through GenerationModel. If you want to utilize a different model provider or endpoint, then you can use the CustomModel class to do so.
The following snippet uses the requests library to call a model provider that can be accessed via an API.
import json
import requests

from okareo.model_under_test import CustomModel, ModelInvocation

class GenerationModel(CustomModel):
    def __init__(self, name, api_key, url):
        self.name = name
        # API key from your desired model provider
        self.api_key = api_key
        # URL for the API endpoint that calls your model
        self.url = url

    def invoke(self, input_value):
        # format input_value as messages as required by the API
        # here we assume messages are sent to the model as a list of dicts,
        # i.e., [{"role": ..., "content": ...}, ...]
        messages = [{"role": "user", "content": input_value}]
        payload = {"messages": messages}
        headers = {
            "accept": "application/json",
            "content-type": "application/json",
            "Authorization": f"Bearer {self.api_key}",
        }
        response = requests.post(self.url, json=payload, headers=headers)
        # parse the full response and pull out the generated message
        # (adjust the parsing to match your provider's response schema)
        full_model_output = json.loads(response.text)
        generated_response = full_model_output["messages"][-1]["content"]
        return ModelInvocation(
            model_input=input_value,
            model_prediction=generated_response,
            model_output_metadata=full_model_output,
        )
CustomBatchModel
If your custom model can handle batch inference, then you can use the CustomBatchModel class.
from okareo.model_under_test import CustomBatchModel
"""
name: str
batch_size: int = 1
@abstractmethod
def invoke_batch(
self, input_batch: list[dict[str, Union[dict, list, str]]]
) -> list[dict[str, Union[ModelInvocation, Any]]]:
'''method for taking a batch of scenario inputs and returning a corresponding batch of model outputs
arguments:
-> input_batch: list[dict[str, Union[dict, list, str]]] - batch of inputs to the model. Expects a list of
dicts of the format { 'id': str, 'input_value': Union[dict, list, str] }.
returns:
-> list of dicts of format { 'id': str, 'model_invocation': Union[ModelInvocation, Any] }. 'id' must match
the corresponding input_batch element's 'id'.
'''
"""
CustomBatchModel can be useful if you are using a Hugging Face model/tokenizer on a GPU, allowing you to speed up your evaluations with batched inference calls.
To use the CustomBatchModel class, you will need to create a child class that defines an invoke_batch method that returns a list of dicts. For example:
from typing import Any, Union

from okareo.model_under_test import CustomBatchModel, ModelInvocation
class MyCustomBatchModel(CustomBatchModel):
def invoke_batch(
self, input_batch: list[dict[str, Any]]
) -> list[dict[str, Union[str, ModelInvocation]]]:
invocations = []
input_values = [d['input_value'] for d in input_batch]
batch_ids = [d['id'] for d in input_batch]
predictions = ... # your batch processing logic goes here
for i in range(min(len(input_batch), self.batch_size)):
invocation = ModelInvocation(
model_prediction=predictions[i],
model_input=input_values[i],
model_output_metadata=...,
tool_calls=...,
)
invocations.append({"id": batch_ids[i], "model_invocation": invocation})
return invocations
Please ensure that your batch processing logic assigns the correct "id" to the corresponding "model_invocation" or your evaluations will return incorrect results.
When you instantiate your batch model for an evaluation, you will pass your desired batch_size as an argument as follows:
my_batch_model = MyCustomBatchModel(
    name="my custom batch model",
    batch_size=4,
)
MultiTurnDriver
A MultiTurnDriver allows you to evaluate a language model over the course of a full conversation. The MultiTurnDriver is made up of two pieces: a Driver and a Target.
The Driver is defined in your MultiTurnDriver, while your Target is defined as either a CustomMultiturnTarget or a GenerationModel.
from okareo.model_under_test import MultiTurnDriver, OpenAIModel, StopConfig
"""
driver_temperature: float = 1.0
max_turns: int = 5
repeats: int = 1
first_turn: str = "target"
target: CustomMultiturnTarget | GenerationModel | OpenAIModel
stop_check: StopConfig
"""
Driver
The possible parameters for the Driver are:
"""
driver_temperature: float = 1.0
max_turns: int = 5
repeats: int = 1
first_turn: str = "target"
stop_check: StopConfig
"""
driver_temperature defines the temperature used by the model that will simulate a user.
max_turns defines the maximum number of back-and-forth interactions that can occur in the conversation.
repeats defines how many times each row in a scenario will be run when the model is run with run_test. Since the Driver is non-deterministic, repeating the same row of a scenario can lead to different conversations.
first_turn defines whether the Target or the Driver will send the first message in the conversation.
stop_check defines the check used to stop the conversation. It requires the check name and a boolean value defining whether the conversation stops on a True or a False value returned from the check.
Target
A Target is either a GenerationModel or a CustomMultiturnTarget. Refer to GenerationModel above for details on GenerationModel.
The only exception to the standard usage is that a system_prompt_template is required when using a MultiTurnDriver. The system_prompt_template defines the system prompt for how the Target should behave.
A CustomMultiturnTarget is defined in largely the same way as a CustomModel. The key difference is that the input is a list of messages in OpenAI's message format.
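A minimal sketch of a CustomMultiturnTarget, assuming it is imported from okareo.model_under_test like the other model classes and follows the same invoke/ModelInvocation contract as CustomModel, but receives the conversation so far as a list of OpenAI-style messages:
from okareo.model_under_test import CustomMultiturnTarget, ModelInvocation

class MyMultiturnTarget(CustomMultiturnTarget):
    def invoke(self, messages: list) -> ModelInvocation:
        # messages is a list of {"role": ..., "content": ...} dicts
        last_user_message = messages[-1]["content"]
        # your target model's response logic goes here
        response = f"You said: {last_user_message}"
        return ModelInvocation(
            model_prediction=response,
            model_input=messages,
            model_output_metadata={"response": response},
        )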
Driver and Target Interaction
The Driver simulates user behavior, while the Target represents the AI model being tested. This setup allows for testing complex scenarios and evaluating the model's performance over extended conversations.
Setting up a scenario
Scenarios for a MultiTurnDriver are crafted using SeedData, where the input_ field serves as a driver prompt that instructs the simulated user (Driver) on how to behave throughout the conversation: which questions to ask, which responses to give, and even how to react to the model's function calls. This creates a controlled yet dynamic testing environment for evaluating the model's performance across a variety of realistic interaction patterns.
seed_data = [
SeedData(
input_="You are interacting with a customer service agent. First, ask about WebBizz...",
result="N/A",
),
# ... more seed data
]
Tools and Function Calling
The Target model can be equipped with tools, which are essentially functions the model can call. For instance:
tools=[
{
"type": "function",
"function": {
"name": "delete_account",
"description": "Deletes the user's account",
# ... parameter details
},
}
]
These tools allow the model to perform specific actions, like deleting a user account in this case.
Mocking Tool Results
The driver prompt can be used to mock the results of tool calls. This is crucial for testing how the model responds to different outcomes without actually performing the actions. For example:
input_="""... If you receive any function calls, output the result in JSON format
and provide a JSON response indicating that the deletion was successful."""
This prompt instructs the Driver to simulate a successful account deletion when the function is called.
Checks and Conversation Control
Checks are used to evaluate specific aspects of the conversation or to control its flow. For instance:
stop_check=StopConfig(check_name="task_completion_delete_account", stop_on=True)
This configuration stops the conversation when the account deletion task is completed.
Custom checks can be created to evaluate various aspects of the conversation:
okareo.create_or_update_check(
name='task_completion_delete_account',
description="Check if the agent confirms account deletion",
check=ModelBasedCheck(...)
)
These checks can assess task completion, adherence to guidelines, or any other relevant criteria.
run_test
Run a test directly from a registered model. This requires both a registered model and at least one scenario.
The run_test function is called on a registered model in the form model_under_test.run_test(...). If your model requires an API key to call, then you will need to pass your key in the api_key parameter. Your API keys are not stored by Okareo.
Depending on size and complexity, model runs can take a long time to evaluate. Use scenarios appropriate in size to the task at hand.
- Classification
- Retrieval
- Generation
Read the Classification Overview to learn more about classification evaluations in Okareo.
# Classification evaluations return accuracy, precision, recall, and f1 scores.
from okareo_api_client.models.test_run_type import TestRunType

model_under_test = okareo.register_model(...)
test_run_response = model_under_test.run_test(
name="<YOUR_TEST_RUN_NAME>",
scenario="<YOUR_SCENARIO_ID>",
api_key="<YOUR_MODEL_API_KEY>", # required for OpenAI, Cohere, Pinecone, QDrant, etc.
test_run_type=TestRunType.MULTI_CLASS_CLASSIFICATION,
)
"""
test_run_response: TestRunItem(
id=str,
project_id=str,
mut_id=str,
scenario_set_id=str,
name=str,
tags=list[str],
type='MULTI_CLASS_CLASSIFICATION',
start_time=datetime.datetime,
end_time=datetime.datetime,
test_data_point_count=int,
model_metrics=TestRunItemModelMetrics(
additional_properties={
'weighted_average': {
'precision': float,
'recall': float,
'f1': float,
'accuracy': float
},
'scores_by_label': {
'label_1': {
'precision': float,
'recall': float,
'f1': float
},
...,
'label_N': {
'precision': float,
'recall': float,
'f1': float
},
}
}),
error_matrix=[
{'label_1': [int, ..., int]},
...,
{'label_N': [int, ..., int]}
],
app_link=str,
additional_properties={})
"""
Read the Retrieval Overview to learn more about retrieval evaluations in Okareo.
from okareo_api_client.models.test_run_type import TestRunType

# Specify retrieval metrics and corresponding K values.
# Below, we use the same k_vals for all available metrics,
# but you can specify any subset of these metrics with
# different sets of K values to evaluate.
k_max = 5
k_vals = [i for i in range(1,k_max+1)]
metrics_kwargs={
"accuracy_at_k": k_vals,
"precision_recall_at_k": k_vals,
"ndcg_at_k": k_vals,
"mrr_at_k": k_vals,
"map_at_k": k_vals,
}
model_under_test = okareo.register_model(...)
test_run_response = model_under_test.run_test(
name="<YOUR_TEST_RUN_NAME>",
scenario="<YOUR_SCENARIO_ID>",
test_run_type=TestRunType.INFORMATION_RETRIEVAL,
api_key="<YOUR_MODEL_API_KEY>", # required for OpenAI, Cohere, Pinecone, QDrant, etc.
metrics_kwargs=metrics_kwargs,
)
"""
test_run_response: TestRunItem(
id=str,
project_id=str,
mut_id=str,
scenario_set_id=str,
name=str,
tags=list[str],
type='INFORMATION_RETRIEVAL',
start_time=datetime.datetime,
end_time=datetime.datetime,
test_data_point_count=int,
model_metrics=TestRunItemModelMetrics(
additional_properties={
'Accuracy@k': {'1': float, ..., '5': float},
'Precision@k': {'1': float, ..., '5': float},
'Recall@k': {'1': float, ..., '5': float},
'NDCG@k': {'1': float, ..., '5': float},
'MRR@k': {'1': float, ..., '5': float},
'MAP@k': {'1': float, ..., '5': float},
'row_level_metrics': {
'<UUID-FOR-ROW-1>': {
'1': {'accuracy': float, 'precision': float, 'recall': float, 'mrr': float, 'ndcg': float, 'map': float},
...,
'5': {'accuracy': float, 'precision': float, 'recall': float, 'mrr': float, 'ndcg': float, 'map': float},
},
...,
'<UUID-FOR-ROW-N>': {
'1': {'accuracy': float, 'precision': float, 'recall': float, 'mrr': float, 'ndcg': float, 'map': float},
...,
'5': {'accuracy': float, 'precision': float, 'recall': float, 'mrr': float, 'ndcg': float, 'map': float},
}
}
}
),
error_matrix=[],
app_link=str,
additional_properties={}
)
"""
To perform evaluations of generative models, you will need to specify your desired checks.
Read the Generation Overview to learn more about generation evaluations in Okareo.
from okareo_api_client.models.test_run_type import TestRunType

# Specify your desired checks for the test run.
checks=["<CHECK_NAME_1>", ..., "<CHECK_NAME_N>"]
model_under_test = okareo.register_model(...)
test_run_response = model_under_test.run_test(
name="<YOUR_TEST_RUN_NAME>",
scenario="<YOUR_SCENARIO_ID>",
test_run_type=TestRunType.NL_GENERATION,
api_key="<YOUR_MODEL_API_KEY>", # required for OpenAI, Cohere, Pinecone, QDrant, etc.
checks=checks
)
"""
test_run_response: TestRunItem(
id=str,
project_id=str,
mut_id=str,
scenario_set_id=str,
name=str,
tags=list[str],
type='NL_GENERATION',
start_time=datetime.datetime,
end_time=datetime.datetime,
test_data_point_count=int,
model_metrics=TestRunItemModelMetrics(
additional_properties={
'mean_scores': {
'CHECK_NAME_1' : float,
...,
'CHECK_NAME_N': float,
},
'scores_by_row': [
{
'scenario_index': 1,
'test_id': "UUID-FOR-ROW-1",
'CHECK_NAME_1': float,
...,
'CHECK_NAME_N': float,
},
...,
{
'scenario_index': M,
'test_id': "UUID-FOR-ROW-M",
'CHECK_NAME_1': float,
...,
'CHECK_NAME_N': float,
}
]
}
),
error_matrix=[],
app_link=str,
additional_properties={},
)
"""
upload_scenario_set
Batch upload JSONL-formatted data to create a scenario set. This is the most efficient method for pushing large datasets for tests and evaluations.
- Usage
- Details
- Result
from okareo import Okareo
okareo = Okareo("YOUR API TOKEN")
okareo.upload_scenario_set(file_path='./evaluation_dataset.jsonl', scenario_name="Retrieval Facts Scenario")
Takes two arguments, file_path and scenario_name.
"""
scenario_name: str,
file_path: str
"""
from okareo_api_client.models import ScenarioSetResponse
"""
scenario_id (str):
project_id (str):
time_created (datetime.datetime):
type (str):
tags (Union[Unset, List[str]]):
name (Union[Unset, str]):
seed_data (Union[Unset, List['SeedData']]):
scenario_count (Union[Unset, int]):
scenario_input (Union[Unset, List[str]]):
"""
get_all_checks
Returns the list of all available checks. The returned list includes both predefined Okareo checks and custom checks uploaded in association with your organization.
- Usage
- Result
okareo.get_all_checks()
from okareo_api_client.models.evaluator_brief_response import EvaluatorBriefResponse
from okareo_api_client.models.evaluator_brief_response_check_config import EvaluatorBriefResponseCheckConfig
"""
[
EvaluatorBriefResponse(
id=str,
name=str,
description=str,
output_data_type=str,
time_created=datetime.datetime,
check_config=EvaluatorBriefResponseCheckConfig(
additional_properties={
'type': 'score' | 'pass_fail',
'code_contents': str, # if CodeBasedCheck
'prompt_template': str, # if ModelBasedCheck
}
),
additional_properties={},
),
...,
EvaluatorBriefResponse(...),
]
"""
get_check
Returns a detailed check response object. Useful if you have a check's ID and want to get more information about the check.
- Usage
- Result
okareo.get_check("<UUID-FOR-CHECK>")
from okareo_api_client.models.evaluator_detailed_response import EvaluatorDetailedResponse
from okareo_api_client.models.evaluator_detailed_response_check_config import EvaluatorDetailedResponseCheckConfig
"""
EvaluatorDetailedResponse(
id=str,
project_id=str,
name=str,
description=str,
requires_scenario_input=bool,
requires_scenario_result=bool,
output_data_type='score' | 'pass_fail',
code_contents=str,
time_created=datetime.datetime,
warning=None,
check_config=EvaluatorDetailedResponseCheckConfig(
additional_properties={
'type': 'score' | 'pass_fail',
'code_contents': str, # if CodeBasedCheck
'prompt_template': str, # if ModelBasedCheck
}),
additional_properties={}
)
"""
create_or_update_check
Uploads or updates a check with the specified name. If a check with that name already exists and the name is not shared with a predefined Okareo check, the existing check will be overwritten. Returns a detailed check response object.
- Usage
- Result
from okareo.checks import BaseCheck
okareo.create_or_update_check(
name: str,
description: str,
check: okareo.checks.BaseCheck,
)
from okareo_api_client.models.evaluator_detailed_response import EvaluatorDetailedResponse
from okareo_api_client.models.evaluator_detailed_response_check_config import EvaluatorDetailedResponseCheckConfig
"""
EvaluatorDetailedResponse(
id=str,
project_id=str,
name=str,
description=str,
requires_scenario_input=bool,
requires_scenario_result=bool,
output_data_type='score' | 'pass_fail',
code_contents=str,
time_created=datetime.datetime,
warning=None,
check_config=EvaluatorDetailedResponseCheckConfig(
additional_properties={
'type': 'score' | 'pass_fail',
'code_contents': str, # if CodeBasedCheck
'prompt_template': str, # if ModelBasedCheck
}),
additional_properties={}
)
"""
BaseCheck
Base class for creating a custom check. Generally, we advise using CodeBasedCheck or ModelBasedCheck instead of BaseCheck. The metadata field stores additional properties from the model invocation, such as latency in milliseconds and tool calls.
from okareo.checks import BaseCheck
'''
class BaseCheck(ABC):
"""
Base class for defining checks
"""
@staticmethod
@abstractmethod
def evaluate(
model_output: str, scenario_input: str, scenario_result: str, metadata: dict
) -> Union[bool, int, float]:
"""
Evaluate your model output, scenario input, scenario result, and metadata
to determine if the data should pass or fail the check.
"""
def check_config(self) -> dict:
"""
Returns a dictionary of configuration parameters that will be passed to the API.
"""
return {}
'''
Metadata example
An example of what the metadata passed into a check could look like.
{
"latency": 683.079,
"tool_calls": [
{
"function": {
"arguments": {
"location": "San Francisco, CA"
},
"name": "get_current_weather"
},
"id": "call_E4SDRy9Hgi5t9Vt5g73Rxjst",
"type": "function"
}
]
}
CodeBasedCheck
A custom check class that uses Python code to evaluate the data.
To use this check:
- Create a new .py file (not in a notebook).
- In this file, define a class named Check that inherits from CodeBasedCheck.
- Implement the evaluate method in your Check class.
- Include any additional code used by your check in the same file.
Example of uploading a custom CodeBasedCheck:
- Usage
- Result
- my_custom_check.py
from my_custom_check import Check
uploaded_check = okareo_client.create_or_update_check(
name="check_sample_code",
description="a description of the custom check",
check=Check(),
)
from okareo_api_client.models.evaluator_detailed_response import EvaluatorDetailedResponse
from okareo_api_client.models.evaluator_detailed_response_check_config import EvaluatorDetailedResponseCheckConfig
"""
EvaluatorDetailedResponse(
id=str,
project_id=str,
name=str,
description=str,
requires_scenario_input=bool,
requires_scenario_result=bool,
output_data_type='score' | 'pass_fail',
code_contents=str,
time_created=datetime.datetime,
warning=None,
check_config=EvaluatorDetailedResponseCheckConfig(
additional_properties={
'type': 'score' | 'pass_fail',
'code_contents': str,
}),
additional_properties={}
)
"""
from typing import Union

from okareo.checks import CodeBasedCheck
# any other imports required for your check
class Check(CodeBasedCheck):
@staticmethod
def evaluate(
model_output: str, scenario_input: str, scenario_result: str
) -> Union[bool, int, float]:
# Your code here
output = ...
return output
ModelBasedCheck
Check that uses a prompt template to evaluate the data.
The prompt template should be a string that includes at least one of the following placeholders, which will be replaced with the actual values:
- {model_output} -> corresponds to the model's output
- {scenario_input} -> corresponds to the scenario input
- {scenario_result} -> corresponds to the scenario result
Example of how a template could be used: "Count the words in the following: {model_output}"
The check output type should be one of the following:
- CheckOutputType.SCORE -> this template should prompt the model for a score (a single number)
- CheckOutputType.PASS_FAIL -> this template should prompt the model for a boolean value (True/False)
Example of uploading a custom ModelBasedCheck:
- Usage (Pass/Fail)
- Usage (Score)
- Result
from okareo.checks import CheckOutputType, ModelBasedCheck
uploaded_check = okareo.create_or_update_check(
name=f"check_sample_pass_fail",
description="a description of the custom check",
check=ModelBasedCheck(
prompt_template="Only output True if the model_output is at least 20 characters long, otherwise return False.",
check_type=CheckOutputType.PASS_FAIL,
),
)
from okareo.checks import CheckOutputType, ModelBasedCheck
uploaded_check = okareo.create_or_update_check(
name=f"check_sample_score",
description="a description of the custom check",
check=ModelBasedCheck(
prompt_template="Only output the number of words in the following text: {scenario_input} {scenario_result} {model_output}",
check_type=CheckOutputType.SCORE,
),
)
from okareo_api_client.models.evaluator_detailed_response import EvaluatorDetailedResponse
from okareo_api_client.models.evaluator_detailed_response_check_config import EvaluatorDetailedResponseCheckConfig
"""
EvaluatorDetailedResponse(
id=str,
project_id=str,
name=str,
description=str,
requires_scenario_input=bool,
requires_scenario_result=bool,
output_data_type='score' | 'pass_fail',
code_contents=str,
time_created=datetime.datetime,
warning=None,
check_config=EvaluatorDetailedResponseCheckConfig(
additional_properties={
'type': 'score' | 'pass_fail',
'prompt_template': str,
}),
additional_properties={}
)
"""
generate_check
Generates the contents of a .py file for implementing a CodeBasedCheck based on an EvaluatorSpecRequest. Write the generated_code of this method's result to a .py file, then see CodeBasedCheck above for details on how to upload the generated check.
- Usage
- Result
from okareo_api_client.models import EvaluatorSpecRequest
generate_request = EvaluatorSpecRequest(
description="""
Return True if the model_output is at least 20 characters long, otherwise return False.
""",
requires_scenario_input=False, # True if check uses scenario input
requires_scenario_result=False, # True if check uses scenario result
output_data_type="bool", # if pass/fail: 'bool'. if score: 'int' | 'float'
)
check = okareo.generate_check(generate_request)
from okareo_api_client.models.evaluator_generate_response import EvaluatorGenerateResponse
"""
EvaluatorGenerateResponse(
name: str,
description: str,
requires_scenario_input: bool,
requires_scenario_result: bool,
output_data_type: str,
generated_code: str
)
"""
delete_check
Deletes the check with the provided ID and name.
okareo.delete_check("<CHECK-UUID>", "<CHECK-NAME>")
"""
Check deletion was successful
"""
Reporters (Experimental)
Okareo is experimenting with a reporting mechanism for CI integration. Most of the experiments are in TypeScript. However, the core JSONReporter is also available in Python.
Exporting Reports for CI
Class JSONReporter
When using Okareo as part of a CI run, it is useful to export evaluations into a common location that can be picked up by the CI analytics.
By using JSONReporter.log([eval_run, ...]) after each evaluation, Okareo will collect the JSON results in ./.okareo/reports. The location can be controlled as part of the CLI with the -r LOCATION or --report LOCATION parameters. The output JSON is useful in CI for historical reference.
JSONReporter.log([eval_run, ...]) will output to the console unless the evaluation is initiated by the CLI.
- Usage
from okareo.reporter import JSONReporter
# eval_item is a TestRunItem returned by model_under_test.run_test(...)
reporter = JSONReporter([eval_item])
reporter.log()