
Python SDK

Okareo has a rich set of APIs that you can explore through the API Guide.

The Okareo Python SDK is designed to accelerate your use of Okareo from a Python notebook or application.

tip

The SDK requires an API Token. Refer to the Okareo API Key guide for more information.

Installation

pip install okareo

Create an instance

To use Okareo, you will need to instantiate the client with your API token.

from okareo import Okareo

okareo = Okareo("YOUR API TOKEN")
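
If you prefer not to hard-code the token, a common pattern (a convention, not an SDK requirement) is to read it from an environment variable:

import os

from okareo import Okareo

# Assumes the token was exported as OKAREO_API_KEY; any variable name works.
okareo = Okareo(os.environ["OKAREO_API_KEY"])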

Library Methods

create_scenario_set

A scenario set is Okareo's unit of data collection. Any scenario can be used to drive a registered model, as a seed for synthetic data generation, or both.

from okareo import Okareo
from okareo_api_client.models import ScenarioSetCreate, SeedData

okareo = Okareo("YOUR API TOKEN")

okareo.create_scenario_set(
    ScenarioSetCreate(
        name="NAME OF SCENARIO SET",
        number_examples=1,
        seed_data=[
            SeedData(
                input_="Example input to be sent to the model",
                result="Expected result from the model"
            ),
        ]
    )
)

find_datapoints

Datapoints are accessible for research and analysis from your Notebook or elsewhere.

from okareo import Okareo

okareo = Okareo("YOUR API TOKEN")

okareo.find_datapoints(context_token="YOUR UNIQUE TOKEN")
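
The method returns a list of data point records; a minimal sketch of inspecting the result (the exact fields depend on what your application logged):

datapoints = okareo.find_datapoints(context_token="YOUR UNIQUE TOKEN")
for dp in datapoints or []:
    # Each element is a data point record tied to the given context token.
    print(dp)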

generate_scenario_set

Generate synthetic data based on a prior scenario. The seed scenario could be from a prior evaluation run, an upload, or statically defined.

from okareo import Okareo
from okareo_api_client.models import ScenarioSetGenerate

okareo = Okareo("YOUR API TOKEN")

# "seed_scenario" is the scenario set returned by a prior okareo.create_scenario_set(...) call
okareo.generate_scenario_set(
    ScenarioSetGenerate(
        source_scenario_id=seed_scenario.scenario_id,
        name="generated scenario set",
        number_examples=2,
    )
)

generate_scenarios

Generate synthetic data based on a prior scenario. The seed scenario could be from a prior evaluation run, an upload, or statically defined.

from okareo import Okareo
from okareo_api_client.models import ScenarioType

okareo = Okareo("YOUR API TOKEN")

okareo.generate_scenarios(
    source_scenario="333155d5-0658-4080-b006-b83ad6c10797",
    name="Common Misspellings Scenario",
    number_examples=2,
    generation_type=ScenarioType.COMMON_MISSPELLINGS
)

Okareo has multiple synthetic data generators. We have provided details about each generator type below:

Rephrase

Rephrasings of the inputs will be generated. For example, if the input is Neil Alden Armstrong was an American astronaut and aeronautical engineer who in 1969 became the first person to walk on the Moon, the generated input could be Neil Alden Armstrong, an American astronaut and aeronautical engineer, made history in 1969 as the first individual to set foot on the Moon.

okareo.generate_scenarios(
    source_scenario="333155d5-0658-4080-b006-b83ad6c10797",
    name="Rephrased Scenario",
    number_examples=1,
    generation_type=ScenarioType.REPHRASE_INVARIANT
)

Common Misspellings

Common misspellings of the inputs will be generated. For example, if the input is What is a receipt?, the generated input could be What is a reviept?

okareo.generate_scenarios(
    source_scenario="333155d5-0658-4080-b006-b83ad6c10797",
    name="Common Misspellings Scenario",
    number_examples=1,
    generation_type=ScenarioType.COMMON_MISSPELLINGS
)

Common Contractions

Each input in the scenario will be shortened by 1 or 2 characters. For example, if the input is What is a steering wheel?, the generated input could be What is a steering whl?.

okareo.generate_scenarios(
    source_scenario="333155d5-0658-4080-b006-b83ad6c10797",
    name="Common Contractions Scenario",
    number_examples=1,
    generation_type=ScenarioType.COMMON_CONTRACTIONS
)

Conditional

Each input in the scenario will be rephrased as a conditional statement. For example, if the input is What are the side effects of this medicine?, the generated input could be Considering this medicine, what might be the potential side effects?.

okareo.generate_scenarios(
    source_scenario="333155d5-0658-4080-b006-b83ad6c10797",
    name="Conditional Scenario",
    number_examples=1,
    generation_type=ScenarioType.CONDITIONAL
)

Reverse Question

Each input in the scenario will be rephrased as a question that the input should be the answer for. For example, if the input is The first game of baseball was played in 1846., the generated input could be When was the first game of baseball ever played?.

okareo.generate_scenarios(
    source_scenario="333155d5-0658-4080-b006-b83ad6c10797",
    name="Reverse Question Scenario",
    number_examples=1,
    generation_type=ScenarioType.TEXT_REVERSE_QUESTION
)

Term Relevance

Each input in the scenario will be rephrased to only include the most relevant terms, where relevance is based on the list of inputs provided to the scenario. We then use parts of speech to determine a valid ordering of the relevant terms. For example, if the inputs are all names of various milk teas such as Cool Sweet Honey Taro Milk Tea with Brown Sugar Boba, the generated input could be Taro Milk Tea, since Taro, Milk, and Tea could be the most relevant terms.

okareo.generate_scenarios(
    source_scenario="333155d5-0658-4080-b006-b83ad6c10797",
    name="Term Relevance Scenario",
    number_examples=1,
    generation_type=ScenarioType.TERM_RELEVANCE_INVARIANT
)

get_scenario_data_points

Returns each of the data points belonging to a single scenario set.

from okareo import Okareo

okareo = Okareo("YOUR API TOKEN")

okareo.get_scenario_data_points("333155d5-0658-4080-b006-b83ad6c10797")

register_model

Register the model that you want to evaluate, test or collect datapoints from. Models must be uniquely named within a project namespace.

warning

The first time a model is defined, the attributes of the model are persisted. Subsequent calls to register_model will return the persisted model. They will not update the definition.

from okareo import Okareo
from okareo.model_under_test import OpenAIModel

okareo = Okareo("YOUR API TOKEN")

# CLASSIFICATION_CONTEXT_TEMPLATE is a system prompt string defined elsewhere in your code
okareo.register_model(
    name="Model Classifier",
    model=OpenAIModel(
        model_id="gpt-3.5-turbo",
        temperature=0,
        system_prompt_template=CLASSIFICATION_CONTEXT_TEMPLATE,
        user_prompt_template=None,
    ),
)
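
Because register_model does not update an existing definition, one common workaround (a naming convention, not an SDK feature) is to register a changed configuration under a new name, for example with a version suffix:

# Hypothetical follow-up: the configuration changed, so register it under a new name.
okareo.register_model(
    name="Model Classifier v2",
    model=OpenAIModel(
        model_id="gpt-3.5-turbo",
        temperature=0.2,
        system_prompt_template=CLASSIFICATION_CONTEXT_TEMPLATE,
        user_prompt_template=None,
    ),
)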

Okareo has ready-to-run integrations with the following models and vector databases. Don't hesitate to reach out if you need another model.

OpenAI (LLM)

from okareo.model_under_test import OpenAIModel
"""
model_id: str
temperature: float
system_prompt_template: Optional[str] = None
user_prompt_template: Optional[str] = None
dialog_template: Optional[str] = None
tools: Optional[List] = None
"""

Generation Model (LLM)

from okareo.model_under_test import GenerationModel
"""
model_id: str
temperature: float
system_prompt_template: Optional[str] = None
user_prompt_template: Optional[str] = None
dialog_template: Optional[str] = None
tools: Optional[List] = None
"""

The GenerationModel is a universal LLM interface that supports most model providers. Users can plug in different model names, including OpenAI, Anthropic, and Cohere models.

Example using a Cohere model with GenerationModel:

from okareo.model_under_test import GenerationModel

cohere_model = GenerationModel(
    model_id="command-r",
    temperature=0.7,
    system_prompt_template="You are a helpful assistant.",
)

Example with tools:

from okareo.model_under_test import GenerationModel

tools = [
    {
        "type": "function",
        "function": {
            "name": "get_current_weather",
            "description": "Get the current weather in a given location",
            "parameters": {
                "type": "object",
                "properties": {
                    "location": {
                        "type": "string",
                        "description": "The city and state, e.g. San Francisco, CA"
                    },
                    "unit": {
                        "type": "string",
                        "enum": ["celsius", "fahrenheit"]
                    }
                },
                "required": ["location"]
            }
        }
    }
]

model_with_tools = GenerationModel(
    model_id="gpt-3.5-turbo-0613",
    temperature=0.7,
    system_prompt_template="You are a helpful assistant with access to weather information.",
    tools=tools
)

In these examples, we're using the Cohere "command-r" model and the OpenAI "gpt-3.5-turbo-0613" model through the GenerationModel interface. The second example demonstrates how to include tools, which can be used for function calling capabilities.

Pinecone (VectorDB)

from okareo.model_under_test import PineconeDB
"""
index_name: str
region: str
project_id: str
top_k: int = 5
"""

QDrant (VectorDB)

from okareo.model_under_test import QdrantDB
"""
collection_name: str
url: str
top_k: int = 5
"""

CustomModel / ModelInvocation

You can use the CustomModel object to define your own custom, provider-agnostic models.

from okareo.model_under_test import CustomModel
"""
name: str

@abstractmethod
def invoke(
    self, input_value: Union[dict, list, str]
) -> Union[ModelInvocation, Any]:
    pass
"""

To use the CustomModel object, you will need to create a child class that defines an invoke method that returns a ModelInvocation object. For example,

from typing import Union

from okareo.model_under_test import CustomModel, ModelInvocation

class MyCustomModel(CustomModel):
    def invoke(self, input_value: Union[dict, list, str]) -> ModelInvocation:
        # your model's invoke logic goes here
        return ModelInvocation(
            model_prediction=...,
            model_input=...,
            model_output_metadata=...,
            tool_calls=...
        )

Where the ModelInvocation's inputs are defined as follows:

class ModelInvocation:
    """Model invocation response object returned from a CustomModel.invoke method"""

    model_prediction: Union[dict, list, str, None] = None
    """Prediction from the model to be used when running the evaluation,
    e.g. predicted class from classification model or generated text completion from
    a generative model. This would typically be parsed out of the overall model_output_metadata."""

    model_input: Union[dict, list, str, None] = None
    """All the input sent to the model"""

    model_output_metadata: Union[dict, list, str, None] = None
    """Full model response, including any metadata returned with model's output"""

    tool_calls: Optional[List] = None
    """List of tool calls made during the model invocation, if any"""

The logic of your invoke method depends on many factors, chief among them the intended TestRunType of the CustomModel. Below, we highlight an example of how to use CustomModel for each TestRunType in Okareo.

The following snippet is taken from the classification_eval.ipynb example notebook. The underlying model is a distilbert model trained to classify queries into one of three categories. The model weights are available on Okareo's Hugging Face repository.

# Load all of the necessary libraries from Okareo
from okareo import Okareo
from okareo_api_client.models import ScenarioSetCreate, SeedData
from okareo.model_under_test import CustomModel, ModelInvocation

# Load the torch library
import torch

# Create an instance of the Okareo client
okareo = Okareo(OKAREO_API_KEY)

# Define a model class that will be used for classification
# The model takes in a scenario and returns a predicted class
class ClassificationModel(CustomModel):
    # Constructor for the model
    def __init__(self, name, tokenizer, model):
        self.name = name
        # The pretrained tokenizer
        self.tokenizer = tokenizer
        # The pretrained model
        self.model = model
        # The possible labels for the model
        self.label_lookup = ["pricing", "returns", "complaints"]

    # Callable to be applied to each scenario in the scenario set
    def invoke(self, input: str):
        # Tokenize the input
        encoding = self.tokenizer(input, return_tensors="pt", padding="max_length", truncation=True, max_length=512)
        # Get the logits from the model
        logits = self.model(**encoding).logits
        # Get the index of the highest value (the predicted class)
        idx = torch.argmax(logits, dim=1).item()
        # Get the label for the predicted class
        prediction = self.label_lookup[idx]

        # Return the prediction in a ModelInvocation object
        return ModelInvocation(
            model_prediction=prediction,
            model_input=input,
            model_output_metadata={ "prediction": prediction, "confidence": logits.softmax(dim=1).max().item() },
        )
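
The snippet assumes that tokenizer and model are already loaded. One way to do that is with the Hugging Face transformers library; the checkpoint name below is a placeholder for the distilbert checkpoint in Okareo's Hugging Face repository:

from transformers import AutoModelForSequenceClassification, AutoTokenizer

# Placeholder checkpoint; substitute the checkpoint from Okareo's Hugging Face repository.
checkpoint = "distilbert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=3)

# Register the custom model with Okareo.
model_under_test = okareo.register_model(
    name="Query Classifier",
    model=ClassificationModel(name="Query Classifier", tokenizer=tokenizer, model=model),
)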

CustomBatchModel

If your custom model can handle batch inference, then you can use the CustomBatchModel class.

from okareo.model_under_test import CustomBatchModel
"""
name: str
batch_size: int = 1

@abstractmethod
def invoke_batch(
    self, input_batch: list[dict[str, Union[dict, list, str]]]
) -> list[dict[str, Union[ModelInvocation, Any]]]:
    '''method for taking a batch of scenario inputs and returning a corresponding batch of model outputs

    arguments:
    -> input_batch: list[dict[str, Union[dict, list, str]]] - batch of inputs to the model. Expects a list of
       dicts of the format { 'id': str, 'input_value': Union[dict, list, str] }.

    returns:
    -> list of dicts of format { 'id': str, 'model_invocation': Union[ModelInvocation, Any] }. 'id' must match
       the corresponding input_batch element's 'id'.
    '''
"""

CustomBatchModel can be useful if you are using a Hugging Face model/tokenizer on a GPU, allowing you to speed up your evaluations with batched inference calls.

To use the CustomBatchModel class, you will need to create a child class that defines an invoke_batch method that returns a list of dicts. For example,

from typing import Any, Union

from okareo.model_under_test import CustomBatchModel, ModelInvocation

class MyCustomBatchModel(CustomBatchModel):
    def invoke_batch(
        self, input_batch: list[dict[str, Any]]
    ) -> list[dict[str, Union[str, ModelInvocation]]]:
        invocations = []
        input_values = [d['input_value'] for d in input_batch]
        batch_ids = [d['id'] for d in input_batch]
        predictions = ...  # your batch processing logic goes here
        for i in range(min(len(input_batch), self.batch_size)):
            invocation = ModelInvocation(
                model_prediction=predictions[i],
                model_input=input_values[i],
                model_output_metadata=...,
                tool_calls=...,
            )
            invocations.append({"id": batch_ids[i], "model_invocation": invocation})
        return invocations

warning

Please ensure that your batch processing logic assigns the correct "id" to the corresponding "model_invocation" or your evaluations will return incorrect results.

When you instantiate your batch model for an evaluation, you will pass your desired batch_size as an argument as follows:

my_batch_model = MyCustomBatchModel(
    name="my custom batch model",
    batch_size=4,
)

MultiTurnDriver

A MultiTurnDriver allows you to evaluate a language model over the course of a full conversation. The MultiTurnDriver is made up of two pieces: a Driver and a Target.

The Driver is defined in your MultiTurnDriver, while your Target is defined as either a CustomMultiturnTarget or a GenerationModel.

from okareo.model_under_test import MultiTurnDriver, OpenAIModel, StopConfig
"""
driver_temperature: float = 1.0
max_turns: int = 5
repeats: int = 1
first_turn: str = "target"
target: CustomMultiturnTarget | GenerationModel | OpenAIModel
stop_check: StopConfig
"""

Driver

The possible parameters for the Driver are:

"""
driver_temperature: float = 1.0
max_turns: int = 5
repeats: int = 1
first_turn: str = "target"
stop_check: StopConfig
"""

driver_temperature defines the temperature used in the model that simulates a user.

max_turns defines the maximum number of back-and-forth interactions that can be in the conversation.

repeats defines how many times each row in a scenario will be run when a model is run with run_test. Since the Driver is non-deterministic, repeating the same row of a scenario can lead to different conversations.

first_turn defines whether the Target or the Driver will send the first message in the conversation.

stop_check defines the check that ends the conversation. It requires a check name and a boolean indicating whether the conversation stops when the check returns True or False.

Target

A Target is either a GenerationModel or a CustomMultiturnTarget. Refer to the GenerationModel section above for details.

The only exception to the standard usage is that a system_prompt_template is required when using a MultiTurnDriver. The system_prompt_template defines the system prompt for how the Target should behave.

A CustomMultiturnTarget is defined in largely the same way as a CustomModel. The key difference is that the input is a list of messages in OpenAI's message format.
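
A minimal sketch of putting the Driver and Target together when registering a model. The model id and system prompt are placeholders; the stop check name matches the check created later in this section:

from okareo.model_under_test import GenerationModel, MultiTurnDriver, StopConfig

multiturn_model = okareo.register_model(
    name="MultiTurn Agent",
    model=MultiTurnDriver(
        driver_temperature=1.0,
        max_turns=5,
        repeats=1,
        first_turn="driver",
        target=GenerationModel(
            model_id="gpt-3.5-turbo",
            temperature=0,
            # Required for MultiTurnDriver targets: how the Target should behave.
            system_prompt_template="You are a customer service agent for WebBizz...",
        ),
        # Stop the conversation when the named check returns True.
        stop_check=StopConfig(check_name="task_completion_delete_account", stop_on=True),
    ),
)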

Driver and Target Interaction

The Driver simulates user behavior, while the Target represents the AI model being tested. This setup allows for testing complex scenarios and evaluating the model's performance over extended conversations.

Setting up a scenario

Scenarios for a MultiTurnDriver are crafted using SeedData, where the input_ field serves as a driver prompt. The driver prompt instructs the simulated user (the Driver) on how to behave throughout the conversation: which questions to ask, which responses to give, and even how to react to the model's function calls. This creates a controlled yet dynamic testing environment for evaluating the model across a variety of realistic interaction patterns.

seed_data = [
    SeedData(
        input_="You are interacting with a customer service agent. First, ask about WebBizz...",
        result="N/A",
    ),
    # ... more seed data
]

Tools and Function Calling

The Target model can be equipped with tools, which are essentially functions the model can call. For instance:

tools=[
    {
        "type": "function",
        "function": {
            "name": "delete_account",
            "description": "Deletes the user's account",
            # ... parameter details
        },
    }
]

These tools allow the model to perform specific actions, like deleting a user account in this case.

Mocking Tool Results

The driver prompt can be used to mock the results of tool calls. This is crucial for testing how the model responds to different outcomes without actually performing the actions. For example:

input_="""... If you receive any function calls, output the result in JSON format 
and provide a JSON response indicating that the deletion was successful."""

This prompt instructs the Driver to simulate a successful account deletion when the function is called.

Checks and Conversation Control

Checks are used to evaluate specific aspects of the conversation or to control its flow. For instance:

stop_check=StopConfig(check_name="task_completion_delete_account", stop_on=True)

This configuration stops the conversation when the account deletion task is completed.

Custom checks can be created to evaluate various aspects of the conversation:

okareo.create_or_update_check(
    name='task_completion_delete_account',
    description="Check if the agent confirms account deletion",
    check=ModelBasedCheck(...)
)

These checks can assess task completion, adherence to guidelines, or any other relevant criteria.

run_test

Run a test directly from a registered model. This requires both a registered model and at least one scenario.

The run_test function is called on a registered model in the form model_under_test.run_test(...). If your model requires an API key, pass it in the api_key parameter. Your API keys are not stored by Okareo.

warning

Depending on size and complexity, model runs can take a long time to evaluate. Use scenarios appropriate in size to the task at hand.

Read the Classification Overview to learn more about classification evaluations in Okareo.

# Classification evaluations return accuracy, precision, recall, and f1 scores.
model_under_test = okareo.register_model(...)
test_run_response = model_under_test.run_test(
    name="<YOUR_TEST_RUN_NAME>",
    scenario="<YOUR_SCENARIO_ID>",
    api_key="<YOUR_MODEL_API_KEY>",  # required for OpenAI, Cohere, Pinecone, QDrant, etc.
    test_run_type=TestRunType.MULTI_CLASS_CLASSIFICATION,
)

"""
test_run_response: TestRunItem(
id=str,
project_id=str,
mut_id=str,
scenario_set_id=str,
name=str,
tags=list[str],
type='MULTI_CLASS_CLASSIFICATION',
start_time=datetime.datetime,
end_time=datetime.datetime,
test_data_point_count=int,
model_metrics=TestRunItemModelMetrics(
additional_properties={
'weighted_average': {
'precision': float,
'recall': float,
'f1': float,
'accuracy': float
},
'scores_by_label': {
'label_1': {
'precision': float,
'recall': float,
'f1': float
},
...,
'label_N': {
'precision': float,
'recall': float,
'f1': float
},
}
}),
error_matrix=[
{'label_1': [int, ..., int]},
...,
{'label_N': [int, ..., int]}
],
app_link=str,
additional_properties={})
"""

upload_scenario_set

Batch upload JSONL-formatted data to create a scenario set. This is the most efficient method for pushing large datasets for tests and evaluations.

from okareo import Okareo
okareo = Okareo("YOUR API TOKEN")
okareo.upload_scenario_set(file_path='./evaluation_dataset.jsonl', scenario_name="Retrieval Facts Scenario")
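
A minimal sketch of building such a JSONL file in Python; the input and result field names are assumed to mirror the seed data shown earlier:

import json

rows = [
    {"input": "Example input to be sent to the model", "result": "Expected result from the model"},
    {"input": "Another example input", "result": "Another expected result"},
]
with open("evaluation_dataset.jsonl", "w") as f:
    for row in rows:
        f.write(json.dumps(row) + "\n")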

get_all_checks

Returns the list of all available checks. The returned list includes both predefined Okareo checks and custom checks uploaded by your organization.

okareo.get_all_checks()

get_check

Returns a detailed check response object. Useful if you have a check's ID and want to get more information about the check.

okareo.get_check("<UUID-FOR-CHECK>")

create_or_update_check

Uploads or updates a check with the specified name. If a check with that name already exists and the name is not shared with a predefined Okareo check, the existing check will be overwritten. Returns a detailed check response object.

from okareo.checks import BaseCheck

okareo.create_or_update_check(
    name: str,
    description: str,
    check: okareo.checks.BaseCheck,
)

BaseCheck

Base class for creating a custom check. Generally, we advise using CodeBasedCheck or ModelBasedCheck instead of BaseCheck. The metadata field stores additional properties from the model invocation such as latency in milliseconds and tool calls.

from okareo.checks import BaseCheck
'''
class BaseCheck(ABC):
    """
    Base class for defining checks
    """

    @staticmethod
    @abstractmethod
    def evaluate(
        model_output: str, scenario_input: str, scenario_result: str, metadata: dict
    ) -> Union[bool, int, float]:
        """
        Evaluate your model output, scenario input, scenario result, and metadata
        to determine if the data should pass or fail the check.
        """

    def check_config(self) -> dict:
        """
        Returns a dictionary of configuration parameters that will be passed to the API.
        """
        return {}
'''

Metadata example

An example of what the metadata passed into a check could look like.

{
    "latency": 683.079,
    "tool_calls": [
        {
            "function": {
                "arguments": {
                    "location": "San Francisco, CA"
                },
                "name": "get_current_weather"
            },
            "id": "call_E4SDRy9Hgi5t9Vt5g73Rxjst",
            "type": "function"
        }
    ]
}

CodeBasedCheck

A custom check class that uses Python code to evaluate the data.

To use this check:

  1. Create a new .py file (not in a notebook).
  2. In this file, define a class named Check that inherits from CodeBasedCheck.
  3. Implement the evaluate method in your Check class.
  4. Include any additional code used by your check in the same file.
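
For instance, a my_custom_check.py file could contain something like the sketch below, which passes when the model produced output in under one second. It assumes CodeBasedCheck uses the same evaluate signature shown for BaseCheck above; adapt the logic to your own criteria.

# my_custom_check.py -- a minimal, hypothetical CodeBasedCheck
from typing import Union

from okareo.checks import CodeBasedCheck

class Check(CodeBasedCheck):
    @staticmethod
    def evaluate(
        model_output: str, scenario_input: str, scenario_result: str, metadata: dict
    ) -> Union[bool, int, float]:
        # Pass when the model produced output and responded in under 1000 ms.
        return bool(model_output) and metadata.get("latency", 0) < 1000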

Example of uploading a custom CodeBasedCheck:

from my_custom_check import Check

uploaded_check = okareo_client.create_or_update_check(
    name="check_sample_code",
    description="a description of the custom check",
    check=Check(),
)

ModelBasedCheck

Check that uses a prompt template to evaluate the data.

The prompt template should be a string that includes at least one of the following placeholders, which will be replaced with the actual values:

  • {model_output} -> corresponds to the model's output
  • {scenario_input} -> corresponds to the scenario input
  • {scenario_result} -> corresponds to the scenario result

Example of how a template could be used: "Count the words in the following: {model_output}"

The check output type should be one of the following:

  • CheckOutputType.SCORE -> this template should prompt the model for a score (a single number)
  • CheckOutputType.PASS_FAIL -> this template should prompt the model for a boolean value (True/False)

Example of uploading a custom ModelBasedCheck:

from okareo.checks import CheckOutputType, ModelBasedCheck

uploaded_check = okareo.create_or_update_check(
    name="check_sample_pass_fail",
    description="a description of the custom check",
    check=ModelBasedCheck(
        prompt_template="Only output True if the model_output is at least 20 characters long, otherwise return False.",
        check_type=CheckOutputType.PASS_FAIL,
    ),
)

generate_check

Generates the contents of a .py file for implementing a CodeBasedCheck based on an EvaluatorSpecRequest. Write the generated_code from this method's result to a .py file, then see CodeBasedCheck above for details on how to upload the generated check.

from okareo_api_client.models import EvaluatorSpecRequest

generate_request = EvaluatorSpecRequest(
    description="""
    Return True if the model_output is at least 20 characters long, otherwise return False.
    """,
    requires_scenario_input=False,  # True if check uses scenario input
    requires_scenario_result=False,  # True if check uses scenario result
    output_data_type="bool",  # if pass/fail: 'bool'. if score: 'int' | 'float'
)
check = okareo.generate_check(generate_request)
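
A short sketch of persisting the result, assuming the response exposes the generated_code attribute described above (the file name is arbitrary):

# Write the generated check code to a .py file so it can be uploaded as a CodeBasedCheck.
with open("my_generated_check.py", "w") as f:
    f.write(check.generated_code)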

delete_check

Deletes the check with the provided ID and name.

okareo.delete_check("<CHECK-UUID>", "<CHECK-NAME>")
"""
Check deletion was successful
"""

Reporters (Experimental)

Okareo is experimenting with a reporting mechanism for CI integration. Most of the experiments are in TypeScript; however, the core JSONReporter is also available in Python.

Exporting Reports for CI

Class JSONReporter

When using Okareo as part of a CI run, it is useful to export evaluations into a common location that can be picked up by the CI analytics.

By calling JSONReporter([eval_run, ...]).log() after each evaluation, Okareo will collect the JSON results in ./.okareo/reports. The location can be controlled as part of the CLI with the -r LOCATION or --report LOCATION parameters. The output JSON is useful in CI for historical reference.

info

JSONReporter([eval_run, ...]).log() will output to the console unless the evaluation is initiated by the CLI.

from okareo.reporter import JSONReporter
reporter = JSONReporter([eval_item])
reporter.log()
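
Putting it together in a CI script, a minimal sketch (the run name and scenario ID are placeholders):

from okareo.reporter import JSONReporter

# Run an evaluation and log its results for CI analytics.
eval_run = model_under_test.run_test(
    name="CI regression run",
    scenario="<YOUR_SCENARIO_ID>",
    test_run_type=TestRunType.MULTI_CLASS_CLASSIFICATION,
)
JSONReporter([eval_run]).log()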