Okareo TypeScript SDK

Okareo's TypeScript SDK helps you trap errors and evaluate large language models (LLMs), agents, or RAG in both structured test scenarios and real-world usage. It provides a unified way to collect model telemetry (inputs, outputs, errors) and run evaluations to measure performance and reliability.

Use the TypeScript SDK to integrate Okareo's cloud-based or self-hosted service into your applications. The SDK lets you register and test your AI code (including LLMs, agents, and embedding models), define scenarios for evaluation (single-turn or multi-turn conversations, classification, and retrieval), log real-time datapoints from your app, and use built-in or custom metrics (Checks) to evaluate performance.

Install and Authenticate

Installing the SDK: Use your preferred package manager to install dependencies.

npm install -D okareo-ts-sdk

After installing, you'll need to obtain an API token from Okareo. Sign up for a free account on the Okareo web app and generate an API Token. Set this token as an environment variable so the SDK can authenticate:

export OKAREO_API_KEY="<YOUR_TOKEN>"

Alternatively, you can pass the API key directly in code when initializing the client (see below).

Authenticating the Okareo client: In your code, import the Okareo class and create an instance with your API key:

import { Okareo } from "okareo-ts-sdk";

const okareo = new Okareo({
api_key: process.env.OKAREO_API_KEY,
});

This instantiation establishes a client that will communicate with Okareo's cloud API. You can either set the OKAREO_API_KEY env variable, or pass it directly as a parameter.
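Because a missing key otherwise only surfaces on the first API call, it can help to fail fast before constructing the client. The following is a minimal sketch, not part of okareo-ts-sdk: `resolveApiKey` and the `Env` type are hypothetical names, and the environment is passed in as a plain object so the helper stays easy to test.

```typescript
// Hypothetical helper (not part of okareo-ts-sdk): prefer an explicitly
// passed key, then fall back to OKAREO_API_KEY from an environment-like object.
type Env = Record<string, string | undefined>;

function resolveApiKey(explicit: string | undefined, env: Env): string {
  const key = explicit ?? env["OKAREO_API_KEY"];
  if (!key || key.trim() === "") {
    // Fail early with a clear message instead of a later authentication error.
    throw new Error("OKAREO_API_KEY is not set; export it or pass api_key explicitly.");
  }
  return key;
}

// A plain object stands in for process.env here.
const fromEnv = resolveApiKey(undefined, { OKAREO_API_KEY: "token-from-env" });
const fromArg = resolveApiKey("token-from-arg", { OKAREO_API_KEY: "token-from-env" });
```

In an application you would pass `process.env` as the second argument and hand the returned string to the `api_key` option.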


Verify

You can verify your installation and API connectivity with this simple test:

import { Okareo } from "okareo-ts-sdk";

const okareo = new Okareo({
api_key: process.env.OKAREO_API_KEY,
});

(async () => {
const projects = await okareo.getProjects();
console.log("✅ Installation verified! Projects for this account: ", projects);
})();

Using the Okareo SDK

Once the Okareo client is authenticated, you can use the SDK to register models/agents, define evaluation scenarios, run tests, and log data. Below are common usage patterns with code examples.

Project IDs and Global Project

Most methods in the TypeScript SDK require a valid project_id (a UUID) to scope models, scenarios, evaluations, and data points. When you sign up, Okareo automatically creates a Global project you can use right away.

If you prefer to organize evaluations under your own project, you can create one:

const project = await okareo.createProject({
name: "My Project",
tags: ["team-x"]
});

const project_id = project.id;

To list your existing projects and get their IDs:

const projects = await okareo.getProjects();

for (const project of projects) {
console.log(`Name: ${project.name}, ID: ${project.id}`);
}
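If you only know a project's name (for example the default Global project), you can look up its ID from the listing. This is a small sketch: `ProjectSummary` and `findProjectId` are hypothetical names, and the shape assumes only the `name` and `id` fields used in the loop above.

```typescript
// Minimal project shape assumed for illustration; real responses carry more fields.
interface ProjectSummary {
  id: string;
  name: string;
}

// Hypothetical helper: return the ID of the first project with a matching name.
function findProjectId(projects: ProjectSummary[], name: string): string | undefined {
  return projects.find((p) => p.name === name)?.id;
}

// Sample data standing in for the result of okareo.getProjects().
const sample: ProjectSummary[] = [
  { id: "00000000-0000-0000-0000-000000000001", name: "Global" },
  { id: "00000000-0000-0000-0000-000000000002", name: "My Project" },
];
const globalId = findProjectId(sample, "Global");
```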

Defining Scenarios for Evaluation

A Scenario Set in Okareo represents a dataset or set of test cases (inputs and expected outputs) to evaluate your application task on. You can create scenarios programmatically using the SDK:

Prepare seed data: Each scenario data point consists of an input (e.g. a prompt or query) and an expected result (e.g. the correct answer or ideal response). You can use a plain array of objects to define these:

const seedData = [
{ input: "Capital of France?", result: "Paris" },
{ input: "5 + 7 =", result: "12" }
];
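When the seed data comes from another source, such as a spreadsheet export, a small conversion step keeps the rows in the `{ input, result }` shape. The sketch below is an illustration only: `SeedRow` and `toSeedData` are hypothetical names, and dropping rows with an empty input is an assumption about what you want, not an SDK requirement.

```typescript
// Local row shape mirroring the { input, result } objects shown above.
interface SeedRow {
  input: string;
  result: string;
}

// Hypothetical helper: turn [input, result] pairs into seed rows,
// trimming whitespace and skipping rows whose input is empty.
function toSeedData(pairs: Array<[string, string]>): SeedRow[] {
  return pairs
    .map(([input, result]) => ({ input: input.trim(), result: result.trim() }))
    .filter((row) => row.input.length > 0);
}

const seedRows = toSeedData([
  ["Capital of France?", "Paris"],
  ["5 + 7 =", " 12 "],
  ["   ", "ignored"],
]);
```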

Create a scenario set: Once you have a list of seed data, call create_scenario_set on the Okareo client:


const scenarioSet = await okareo.create_scenario_set({
name: "My Evaluation Scenario",
seed_data: seedData,
project_id: "your-project-id"
});

console.log(scenarioSet.app_link);

The response includes a scenario_id and a web app_link you can use to view or reuse the scenario set. You can create scenario sets for different purposes – e.g. a set of questions for a Question/Answer model task, or a set of conversation turns for a chatbot. Okareo allows using these scenarios both for driving evaluations and as seeds for synthetic scenario generation (more on generation later).


Running Evaluations – GenerationModel

GenerationModel provides a standard interface to evaluate text-generation models (LLMs) in Okareo. You define the model by specifying its identifier and parameters (like temperature), register it with Okareo, create a scenario set of test prompts and expected results, then run an evaluation:


import { Okareo, TestRunType, GenerationModel } from "okareo-ts-sdk";

const okareo = new Okareo({ api_key: process.env.OKAREO_API_KEY! });

// 1. Define and register the model
const model = await okareo.register_model({
name: "LLM under evaluation",
project_id: "your-project-id",
models: [{
type: "generation",
model_id: "gpt-4o",
temperature: 0.7,
system_prompt_template: "{input}",
} as GenerationModel]
});

// 2. Create scenario set
const scenario = await okareo.create_scenario_set({
name: "Basic Q&A",
project_id: "your-project-id",
seed_data: [
{ input: "What is 2+2?", result: "4" },
{ input: "Hello", result: "Hi" }
]
});

// 3. Run evaluation
const testRun = await model.run_test({
model_api_key: process.env.OPENAI_API_KEY!,
name: "Example Eval Run",
project_id: "your-project-id",
scenario,
type: TestRunType.NL_GENERATION,
calculate_metrics: true,
checks: ["reference_similarity"]
});

console.log(testRun.app_link);

In this example:

  • We create a scenario set by directly passing an array of { input, result } pairs.
  • We register a generation model (OpenAI GPT-4o) using {input} as the prompt template.
  • We run a text-generation evaluation on the scenario set using run_test, specifying a check like "reference_similarity".

The response from run_test(...) is a TestRunItem that includes evaluation metrics and a direct app_link to view the run in the Okareo web app.

Okareo supports many providers through GenerationModel (OpenAI, Cohere, Anthropic, etc.). All are supported via the register_model() interface, which returns a model handle used for running evaluations.


Running Evaluations – CustomModel and ModelInvocation

Okareo also supports evaluating custom models – models not covered by built-in providers – using the CustomModel type. You implement an invoke() function that takes an input string and returns a ModelInvocation result. This allows you to integrate any arbitrary model (e.g., from HuggingFace or on-prem) into the Okareo evaluation framework.

import { Okareo, CustomModel, ModelInvocation, TestRunType } from "okareo-ts-sdk";

const okareo = new Okareo({ api_key: process.env.OKAREO_API_KEY! });

// Define a custom model that simply echoes the input
const model = await okareo.register_model({
name: "Custom Model",
project_id: "your-project-id", // replace with actual UUID
models: {
type: "custom",
invoke: (input: string): ModelInvocation => ({
model_prediction: input,
model_input: input
})
} as CustomModel,
});

// Reuse a scenario set defined earlier
const test_run = await model.run_test({
scenario,
name: "Custom Test",
project_id: "your-project-id",
type: TestRunType.NL_GENERATION,
calculate_metrics: true,
checks: ["reference_similarity"]
});

console.log(test_run.app_link);

In this example:

  • We define a custom model using the CustomModel type and an inline invoke() function.
  • The invoke() returns a ModelInvocation object with prediction, input, and optional metadata.
  • We register the custom model and run an evaluation using run_test() just like for built-in models.
  • This lets you evaluate any model logic — including proprietary pipelines — using Okareo’s scenario-based evaluation and metrics framework. Custom models are treated as first-class citizens across the entire workflow.
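To make the shape of invoke() more concrete, here is a sketch of a rule-based classifier written as a standalone function. It does not import the SDK: `ModelInvocationLike` is a local stand-in for the ModelInvocation type, and the `RULES` table and labels are invented for illustration.

```typescript
// Local stand-in for the SDK's ModelInvocation type so the sketch runs on its own.
interface ModelInvocationLike {
  model_prediction?: Record<string, unknown> | unknown[] | string;
  model_input?: Record<string, unknown> | unknown[] | string;
  model_output_metadata?: Record<string, unknown> | unknown[] | string;
}

// Hypothetical keyword rules; a real model or pipeline call would go here instead.
const RULES: Array<{ label: string; keywords: string[] }> = [
  { label: "Billing", keywords: ["invoice", "refund", "charge"] },
  { label: "Technical Support", keywords: ["error", "crash", "bug"] },
];

function invoke(input: string): ModelInvocationLike {
  const text = input.toLowerCase();
  // Pick the first rule with any keyword present in the input.
  const match = RULES.find((rule) => rule.keywords.some((k) => text.includes(k)));
  return {
    model_prediction: match ? match.label : "General",
    model_input: input,
    model_output_metadata: { method: "keyword rules", matched: match !== undefined },
  };
}

const example = invoke("The app shows an error on login");
// example.model_prediction is "Technical Support"
```

A function like this can be supplied as the invoke property of a custom model, as in the registration example above.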

Combining Synthetic Scenario with Evaluation

The following script synthetically transforms a set of direct requests into passive questions and then evaluates the core_app.getIntentContextTemplate(user, chat_history) context through OpenAI to determine whether the actual intent is maintained. The number of synthetic examples created is 3 times the number of rows in the DIRECTED_INTENT data passed in.

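import { Okareo, OpenAIModel, RunTestProps, ClassificationReporter, ScenarioType, TestRunType } from 'okareo-ts-sdk';

// project_id, DIRECTED_INTENT, core_app, user, and chat_history are assumed to be defined elsewhere in your application.
const OKAREO_API_KEY = process.env.OKAREO_API_KEY;
const OPENAI_API_KEY = process.env.OPENAI_API_KEY;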

const main = async () => {
try {
const okareo = new Okareo({api_key:process.env.OKAREO_API_KEY });

const sData: any = await okareo.create_scenario_set({
name: "Detect Passive Intent",
project_id: project_id,
number_examples: 3,
generation_type: ScenarioType.TEXT_REVERSE_QUESTION,
seed_data: DIRECTED_INTENT
});

const model_under_test = await okareo.register_model({
name: "User Chat Intent - 3.5 Turbo",
tags: ["TS-SDK", "Testing"],
project_id: project_id,
models: {
type: "openai",
model_id:"gpt-3.5-turbo",
temperature:0.5,
system_prompt_template:core_app.getIntentContextTemplate(user, chat_history),
user_prompt_template:`{scenario_input}`
} as OpenAIModel
});

const eval_run: any = await model_under_test.run_test({
name: "TS-SDK Classification",
tags: ["Classification", "BUILD_ID"],
model_api_key: OPENAI_API_KEY,
project_id: project_id,
scenario_id: sData.scenario_id,
calculate_metrics: true,
type: TestRunType.MULTI_CLASS_CLASSIFICATION,
} as RunTestProps );

const reporter = new ClassificationReporter({
eval_run,
error_max: 2, // allows for up to 2 errors
metrics_min: {
precision: 0.95,
recall: 0.9,
f1: 0.9,
accuracy: 0.95
},
});
reporter.log(); // logs a table to the console output with the report results

} catch (error) {
console.error(error);
}
}

main();

TypeScript SDK and Okareo API

The Okareo TypeScript SDK is a set of convenience functions and wrappers for the Okareo REST API.

warning

Reporters are only supported in TypeScript.
If you are interested in Python support, please let us know.

Class Okareo

create_or_update_check

This uploads or updates a check with the specified name. If the name for the check exists already and the check name is not shared with a predefined Okareo check, then that check will be overwritten. Returns a detailed check response object.

There are two types of checks: Code (Deterministic) and Behavioral (Model Judge). Code-based checks are very fast and entirely predictable because they are code. Behavioral checks pass judgment based on inference, so they are slower and can be less predictable. However, they are occasionally the best way to express behavioral expectations. For example, "did the model expose private data?" is hard to analyze deterministically.

Code checks use Python. Okareo will generate the Python for you with the TypeScript okareo.generate_check SDK function. You can then pass the code result to okareo.create_or_update_check.

// For code checks (e.g. deterministic)
okareo.create_or_update_check({
name: str,
description: str,
check_config: {
type: CheckOutputType.Score | CheckOutputType.PASS_FAIL,
code_contents: <CHECK_PYTHON_CODE> // Python code that inherits from BaseCheck
}
});

//For behavioral checks (e.g. prompt/judges)
okareo.create_or_update_check({
name: str,
description: str,
check_config: {
type: CheckOutputType.Score | CheckOutputType.PASS_FAIL,
prompt_template: <CHECK_PROMPT> // The prompt describing the desired behavior
}
});


delete_check

Deletes the check with the provided ID and name.

okareo.delete_check("<CHECK-UUID>", "<CHECK-NAME>")
/*
Check deletion was successful
*/

create_scenario_set

A scenario set is the Okareo unit of data collection. Any scenario can be used to drive a registered model or as a seed for synthetic data generation. Often both.

import { Okareo, SeedData, ScenarioType } from "okareo-ts-sdk";
const okareo = new Okareo({api_key:OKAREO_API_KEY});

okareo.create_scenario_set(
{
name:"NAME OF SCENARIO",
project_id: PROJECT_ID,
number_examples:1,
generation_type: ScenarioType.SEED,
seed_data: [
SeedData({
input:"Example input to be sent to the model",
result:"Expected result from the model"
}),
]
}
)

find_datapoints

Datapoints are accessible for research and analysis as part of CI or elsewhere. Datapoints can be returned from a broad range of dimension criteria. Typically, some combination of time, feedback, and model is used, but many others are available.

import { Okareo, DatapointSearch } from "okareo-ts-sdk";
const okareo = new Okareo({api_key:OKAREO_API_KEY});

const data: any = await okareo.find_datapoints(
DatapointSearch({
project_id: project_id,
mut_id: model_id,
})
);

generate_check

Generates the contents of a .py file for implementing a CodeBasedCheck based on an EvaluatorSpecRequest. Pass the generated_code of this method's result to the create_or_update_check function to make the check available within Okareo.

const check = await okareo.generate_check({
project_id: "",
description: "Return True if the model_output is at least 20 characters long, otherwise return False.",
requires_scenario_input: false, // True if check uses scenario input
requires_scenario_result: false, // True if check uses scenario result
output_data_type: "bool" | "int" | "float", // if pass/fail: 'bool'. if score: 'int' | 'float'
})

generate_scenario_set

Generate synthetic data based on a prior scenario. The seed scenario could be from a prior evaluation run, an upload, or statically defined.

import { Okareo, ScenarioType } from "okareo-ts-sdk";
const okareo = new Okareo({api_key:OKAREO_API_KEY});

const data: any = await okareo.generate_scenario_set(
{
project_id: project_id,
name: "EXAMPLE SCENARIO NAME",
source_scenario_id: "SOURCE_SCENARIO_ID",
number_examples: 2,
generation_type: ScenarioType.REPHRASE_INVARIANT,
}
)

get_all_checks

Return the list of all available checks. The returned list will include both predefined checks in Okareo as well as custom checks uploaded in association with your current organization.

okareo.get_all_checks()

get_check

Returns a detailed check response object. Useful if you have a check's ID and want to get more information about the check.

okareo.get_check("<UUID-FOR-CHECK>")

run_test

Run a test directly from a registered model. This requires both a registered model and at least one scenario.

The run_test function is called on a registered model in the form model_under_test.run_test(...). If your model requires an API key to call, pass your key in the model_api_key parameter. Your API keys are not stored by Okareo.

warning

Depending on size and complexity, model runs can take a long time to evaluate. Use scenarios appropriate in size to the task at hand.

Read the Classification Overview to learn more about classification evaluations in Okareo.

// Classification evaluations return accuracy, precision, recall, and f1 scores.
const model_under_test = await okareo.register_model(...);
const test_run_response: any = await model_under_test.run_test({
name:"<YOUR_TEST_RUN_NAME>",
tags: [<OPTIONAL_ARRAY_OF_STRING_TAGS>],
project_id: project_id,
scenario_id:"<YOUR_SCENARIO_ID>",
model_api_key: "<YOUR_MODEL_API_KEY>", //Key for OpenAI, Cohere, Pinecone, QDrant, etc.,
calculate_metrics: true,
type: TestRunType.MULTI_CLASS_CLASSIFICATION,
} as RunTestProps);
/*
test_run_response: {
id:str,
project_id:str,
mut_id:str,
scenario_set_id:str,
name:str,
tags:Array[str],
type:'MULTI_CLASS_CLASSIFICATION',
start_time:Date,
end_time:Date,
test_data_point_count:int,
model_metrics: {
'weighted_average': {
'precision': float,
'recall': float,
'f1': float,
'accuracy': float
},
'scores_by_label': {
'label_1': {
'precision': float,
'recall': float,
'f1': float
},
...,
'label_N': {
'precision': float,
'recall': float,
'f1': float
},
}
},
error_matrix: [
{'label_1': [int, ..., int]},
...,
{'label_N': [int, ..., int]}
],
app_link: str
}
*/
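To show how the per-label scores relate to the error matrix, the sketch below recomputes precision, recall, and F1 locally. It rests on assumptions that are not confirmed by the response schema: each row is keyed by the expected label, columns follow the same label order as the rows, and cell values are counts of predictions. `ErrorMatrix`, `LabelScores`, and `scoresByLabel` are hypothetical names.

```typescript
// Assumed layout: one single-key object per expected label, with one count per predicted label.
type ErrorMatrix = Array<Record<string, number[]>>;

interface LabelScores {
  precision: number;
  recall: number;
  f1: number;
}

function scoresByLabel(matrix: ErrorMatrix): Record<string, LabelScores> {
  const labels = matrix.map((row) => Object.keys(row)[0]);
  const counts = matrix.map((row, i) => row[labels[i]]);
  const out: Record<string, LabelScores> = {};
  labels.forEach((label, i) => {
    const tp = counts[i][i]; // diagonal cell: predicted label equals expected label
    const expectedTotal = counts[i].reduce((a, b) => a + b, 0); // row sum
    const predictedTotal = counts.reduce((sum, r) => sum + r[i], 0); // column sum
    const precision = predictedTotal === 0 ? 0 : tp / predictedTotal;
    const recall = expectedTotal === 0 ? 0 : tp / expectedTotal;
    const f1 = precision + recall === 0 ? 0 : (2 * precision * recall) / (precision + recall);
    out[label] = { precision, recall, f1 };
  });
  return out;
}

// Two labels, ten expected examples each.
const scores = scoresByLabel([{ billing: [8, 2] }, { support: [1, 9] }]);
// scores.billing.recall is 0.8 and scores.support.recall is 0.9
```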

ScenarioType

// import { ScenarioType } from "okareo-ts-sdk";
export declare enum ScenarioType {
COMMON_CONTRACTIONS = "COMMON_CONTRACTIONS",
COMMON_MISSPELLINGS = "COMMON_MISSPELLINGS",
CONDITIONAL = "CONDITIONAL",
LABEL_REVERSE_INVARIANT = "LABEL_REVERSE_INVARIANT",
NAMED_ENTITY_SUBSTITUTION = "NAMED_ENTITY_SUBSTITUTION",
NEGATION = "NEGATION",
REPHRASE_INVARIANT = "REPHRASE_INVARIANT",
ROUNDTRIP_INVARIANT = "ROUNDTRIP_INVARIANT",
SEED = "SEED",
TERM_RELEVANCE_INVARIANT = "TERM_RELEVANCE_INVARIANT",
TEXT_REVERSE_LABELED = "TEXT_REVERSE_LABELED",
TEXT_REVERSE_QUESTION = "TEXT_REVERSE_QUESTION"
}

Okareo has multiple synthetic data generators. We have provided details about each generator type below:

Common Contractions

ScenarioType.COMMON_CONTRACTIONS

Each input in the scenario will be shortened by 1 or 2 characters. For example, if the input is What is a steering wheel?, the generated input could be What is a steering whl?.

Common Misspellings

ScenarioType.COMMON_MISSPELLINGS

Common misspellings of the inputs will be generated. For example, if the input is What is a receipt?, the generated input could be What is a reviept?

Conditional

ScenarioType.CONDITIONAL

Each input in the scenario will be rephrased as a conditional statement. For example, if the input is What are the side effects of this medicine?, the generated input could be Considering this medicine, what might be the potential side effects?.

Rephrase

ScenarioType.REPHRASE_INVARIANT

Rephrasings of the inputs will be generated. For example, if the input is Neil Alden Armstrong was an American astronaut and aeronautical engineer who in 1969 became the first person to walk on the Moon, the generated input could be Neil Alden Armstrong, an American astronaut and aeronautical engineer, made history in 1969 as the first individual to set foot on the Moon.

Reverse Question

ScenarioType.TEXT_REVERSE_QUESTION

Each input in the scenario will be rephrased as a question that the input should be the answer for. For example, if the input is The first game of baseball was played in 1846., the generated input could be When was the first game of baseball ever played?.

Seed

ScenarioType.SEED

The simplest of all generators. It does nothing. A true NoOp.

Term Relevance

ScenarioType.TERM_RELEVANCE_INVARIANT

Each input in the scenario will be rephrased to only include the most relevant terms, where relevance is based on the list of inputs provided to the scenario. We will then use parts of speech to determine a valid ordering of relevant terms. For example, if the inputs are all names of various milk teas such as Cool Sweet Honey Taro Milk Tea with Brown Sugar Boba, the generated input could be Taro Milk Tea, since Taro, Milk, and Tea could be the most relevant terms.


get_scenario_sets

Return one or more scenario sets based on the project_id, or a single scenario set based on a specific project_id + scenario_id pair.

import { Okareo, components } from "okareo-ts-sdk";

const okareo = new Okareo({api_key:OKAREO_API_KEY});
const project_id = "YOUR_PROJECT_ID";
const scenario_id = "YOUR_SCENARIO_ID";

const all_scenarios = await okareo.get_scenario_sets({ project_id });
// or
const specific_scenario = await okareo.get_scenario_sets({ project_id, scenario_id });

get_scenario_data_points

Return each of the data points in a single scenario set, identified by its scenario_id.

import { Okareo, components } from "okareo-ts-sdk";
async get_scenario_data_points(scenario_id: string): Promise<components["schemas"]["ScenarioDataPoinResponse"][]> {
//...
}

get_test_run

Return a previously run test. This is useful for "hill-climbing", where you look at a prior run, make changes, and re-run, or when you want to baseline the current run against the last one.

import { Okareo, components } from "okareo-ts-sdk";
async get_test_run(test_run_id: string): Promise<components["schemas"]["TestRunItem"]> {
//...
}

register_model

Register the model that you want to evaluate, test or collect datapoints from. Models must be uniquely named within a project namespace.

In order to run a test, you will need to register a model. If you have already registered a model with the same name, the existing model will be returned. The model data is only updated if the "update: true" flag is passed.

warning

The first time a model is defined, the attributes of the model are persisted. Subsequent calls to register_model will return the persisted model. They will not update the definition.

import { Okareo, CustomModel } from "okareo-ts-sdk";
const okareo = new Okareo({api_key:OKAREO_API_KEY});
const model_under_test = await okareo.register_model({
name: "Example Custom Model",
tags: ["Custom", "End-2-End"],
project_id: project_id,
models: {
type: "custom",
invoke: (input: string) => {
return {
actual: "Technical Support",
model_response: {
input: input,
method: "hard coded",
context: "Example context response",
}
}
}
} as CustomModel
});

Okareo has ready-to-run integrations with the following models and vector databases. Don't hesitate to reach out if you need another model.

OpenAI (LLM)

import { OpenAIModel } from 'okareo-ts-sdk';

interface OpenAIModel extends BaseModel {
type: "openai";
model_id: string;
temperature: number;
system_prompt_template: string;
user_prompt_template: string;
dialog_template: string;
tools?: unknown[];
}

Generation Model (LLM)

import { GenerationModel } from 'okareo-ts-sdk';

interface GenerationModel extends BaseModel {
type: "generation";
model_id: string;
temperature: number;
system_prompt_template: string;
user_prompt_template: string;
dialog_template: string;
tools?: unknown[];
}

The GenerationModel is a universal LLM interface that supports most model providers. Users can plug in different model names, including OpenAI, Anthropic, and Cohere models.

Example using Cohere model with GenerationModel:

import { GenerationModel } from 'okareo-ts-sdk';

const cohereModel: GenerationModel = {
type: "generation",
model_id: "command-r",
temperature: 0.7,
system_prompt_template: "You are a helpful assistant.",
};

Example with tools:

import { GenerationModel } from 'okareo-ts-sdk';

const tools = [
{
type: "function",
function: {
name: "get_current_weather",
description: "Get the current weather in a given location",
parameters: {
type: "object",
properties: {
location: {
type: "string",
description: "The city and state, e.g. San Francisco, CA"
},
unit: {
type: "string",
enum: ["celsius", "fahrenheit"]
}
},
required: ["location"]
}
}
}
];

const modelWithTools: GenerationModel = {
type: "generation",
model_id: "gpt-3.5-turbo-0613",
temperature: 0.7,
system_prompt_template: "You are a helpful assistant with access to weather information.",
tools: tools
};

In these examples, we're using the Cohere "command-r" model and the OpenAI "gpt-3.5-turbo-0613" model through the GenerationModel interface. The second example demonstrates how to include tools, which can be used for function calling capabilities.

Pinecone (VectorDB)

//import { TPineconeDB } from "okareo-ts-sdk";
export interface TPineconeDB extends BaseModel {
type?: string | undefined; // from BaseModel
tags?: string[] | undefined; // from BaseModel
index_name: string;
region: string;
project_id: string;
top_k: string;
}

QDrant (VectorDB)

//import { TQDrant } from "okareo-ts-sdk";
export interface TQDrant extends BaseModel {
type?: string | undefined; // from BaseModel
tags?: string[] | undefined; // from BaseModel
collection_name: string;
url: string;
top_k: string;
}

Custom Model

You can use the CustomModel object to define your own custom, provider-agnostic models.

//import { TCustomModel } from "okareo-ts-sdk";
export interface TCustomModel extends BaseModel {
invoke(input: string): {
actual: any | string;
model_response: {
input: any | string;
method: any | string;
context: any | string;
}
};
}

To use the CustomModel object, you will need to implement an invoke method that returns a ModelInvocation object. For example,

import { CustomModel, ModelInvocation } from "okareo-ts-sdk";

const my_custom_model: CustomModel = {
type: "custom",
invoke: (input: string) => {
// your model's invoke logic goes here
return {
model_prediction: ...,
model_input: input,
model_output_metadata: {
prediction: ...,
other_data_1: ...,
other_data_2: ...,
...,
},
tool_calls: ...
} as ModelInvocation
}
}

Where the ModelInvocation's inputs are defined as follows:

export interface ModelInvocation {
/**
* Prediction from the model to be used when running the evaluation,
* e.g. predicted class from classification model or generated text completion from
* a generative model. This would typically be parsed out of the overall model_output_metadata
*/
model_prediction?: Record<string, any> | unknown[] | string;
/**
* All the input sent to the model
*/
model_input?: Record<string, any> | unknown[] | string;
/**
* Full model response, including any metadata returned with model's output
*/
model_output_metadata?: Record<string, any> | unknown[] | string;
/**
* List of tool calls made during the model invocation, if any
*/
tool_calls?: any[];
}

The logic of your invoke method depends on many factors, chief among them the intended TestRunType of the CustomModel. Below, we highlight an example of how to use CustomModel for each TestRunType in Okareo.

The following CustomModel classification example is taken from the custommodel.test.ts script. This model always returns "Technical Support" as the model_prediction.

const classificationModel = CustomModel({
type: "custom",
invoke: (input: string) => {
return {
model_prediction: "Technical Support",
model_input: input,
model_output_metadata: {
input: input,
method: "hard coded",
context: "Example context"
}
} as ModelInvocation
}
});

MultiTurnDriver

A MultiTurnDriver allows you to evaluate a language model over the course of a full conversation. The MultiTurnDriver is made up of two pieces: a Driver and a Target.

The Driver is defined in your MultiTurnDriver, while your Target is defined as either a CustomMultiturnTarget or a GenerationModel.

// import { MultiTurnDriver, StopConfig } from "okareo-ts-sdk"
export interface MultiTurnDriver extends BaseModel {
type: "driver";
target: GenerationModel | CustomMultiturnTarget;
driver_temperature: number = 0.8
max_turns: bigint = 5
repeats: bigint = 1
first_turn: string = "target"
stop_check: StopConfig
}

Driver

The possible parameters for the Driver are:

driver_temperature: number = 1.0
max_turns: bigint = 5
repeats: bigint = 1
first_turn: string = "target"
stop_check: StopConfig

driver_temperature defines temperature used in the model that will simulate a user.

max_turns defines the maximum number of back-and-forth interactions that can be in the conversation.

repeats defines how many times each row in a scenario will be run when a model is run with run_test. Since the Driver is non-deterministic, repeating the same row of a scenario can lead to different conversations.

first_turn defines whether the Target or the Driver will send the first message in the conversation.

stop_check defines when the conversation stops. It requires the check name and a boolean value defining whether the conversation stops on a True or a False value returned from the check.

Target

A Target is either a GenerationModel or a CustomMultiturnTarget. Refer to GenerationModel for details on GenerationModel.

The only exception to the standard usage is that a system_prompt_template is required when using a MultiTurnDriver. The system_prompt_template defines the system prompt for how the Target should behave.

A CustomMultiturnTarget is defined in largely the same way as a CustomModel. The key difference is that the input is a list of messages in OpenAI's message format.
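As a rough illustration of the difference, the sketch below shows an invoke function that receives the whole message list rather than a single string. The exact signature expected by CustomMultiturnTarget is not shown in this document, so `ChatMessage`, `MultiturnInvocation`, and `invokeMultiturn` are local, hypothetical names, and the reply logic is a placeholder for a real model call.

```typescript
// OpenAI-style chat message, declared locally so the sketch has no dependencies.
interface ChatMessage {
  role: "system" | "user" | "assistant";
  content: string;
}

// Local stand-in for the fields of ModelInvocation used here.
interface MultiturnInvocation {
  model_prediction: string;
  model_input: ChatMessage[];
  model_output_metadata: { turns_seen: number };
}

function invokeMultiturn(messages: ChatMessage[]): MultiturnInvocation {
  // Look at the most recent user message to decide how to respond.
  const lastUser = messages.slice().reverse().find((m) => m.role === "user");
  const wantsDeletion = lastUser !== undefined && /delete/i.test(lastUser.content);
  return {
    model_prediction: wantsDeletion
      ? "I can help with that. Your account has been deleted."
      : "How can I help you today?",
    model_input: messages,
    model_output_metadata: { turns_seen: messages.length },
  };
}

const reply = invokeMultiturn([
  { role: "system", content: "You are a support agent." },
  { role: "user", content: "I want to delete my account" },
]);
// reply.model_prediction starts with "I can help with that."
```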

Driver and Target Interaction

The Driver simulates user behavior, while the Target represents the AI model being tested. This setup allows for testing complex scenarios and evaluating the model's performance over extended conversations.
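The interplay of max_turns, first_turn, and stop_check can be pictured as a simple loop. The following is a local simulation for intuition only, not Okareo's implementation: the Driver and Target are plain functions, one turn is assumed to mean one Driver message followed by one Target reply, and all names (`simulateConversation`, `LoopConfig`, `Speaker`) are hypothetical.

```typescript
type Role = "driver" | "target";
interface Message { role: Role; content: string; }
type Speaker = (history: Message[]) => string;

interface LoopConfig {
  max_turns: number;
  first_turn: Role;
  stop_on: boolean; // stop when the check returns this value
  stopCheck: (history: Message[]) => boolean;
}

function simulateConversation(driver: Speaker, target: Speaker, cfg: LoopConfig): Message[] {
  const history: Message[] = [];
  // If the Target opens, it speaks once before the first Driver message.
  if (cfg.first_turn === "target") {
    history.push({ role: "target", content: target(history) });
  }
  for (let turn = 0; turn < cfg.max_turns; turn++) {
    history.push({ role: "driver", content: driver(history) });
    history.push({ role: "target", content: target(history) });
    if (cfg.stopCheck(history) === cfg.stop_on) break; // check after each exchange
  }
  return history;
}

// Scripted stand-ins for the simulated user and the model under test.
const driver: Speaker = () => "Please delete my account.";
const target: Speaker = (history) => {
  const driverTurns = history.filter((m) => m.role === "driver").length;
  if (driverTurns === 0) return "Hello, how can I help?";
  return driverTurns >= 2 ? "Your account has been deleted." : "Are you sure?";
};

const transcript = simulateConversation(driver, target, {
  max_turns: 5,
  first_turn: "driver",
  stop_on: true,
  stopCheck: (history) => history[history.length - 1].content.includes("deleted"),
});
// transcript holds 4 messages; the loop ends after the second exchange.
```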

Setting up a scenario

Scenarios in MultiTurnDriver are crafted using SeedData, where the input field serves as a driver prompt. The prompt instructs the simulated user (Driver) on how to behave throughout the conversation, including specific questions to ask, responses to give, and even how to react to the model's function calls. This creates a controlled yet dynamic testing environment for evaluating the model's performance across various realistic interaction patterns.

const seedData: SeedData[] = [
{
input: "You are interacting with a customer service agent. First, ask about WebBizz...",
result: "N/A",
},
// ... more seed data
];

Tools and Function Calling

The Target model can be equipped with tools, which are essentially functions the model can call. For instance:

const tools = [
{
type: "function",
function: {
name: "delete_account",
description: "Deletes the user's account",
// ... parameter details
},
}
];

These tools allow the model to perform specific actions, like deleting a user account in this case.

Mocking Tool Results

The driver prompt can be used to mock the results of tool calls. This is crucial for testing how the model responds to different outcomes without actually performing the actions. For example:

const input = `... If you receive any function calls, output the result in JSON format 
and provide a JSON response indicating that the deletion was successful.`;

This prompt instructs the Driver to simulate a successful account deletion when the function is called.

Checks and Conversation Control

Checks are used to evaluate specific aspects of the conversation or to control its flow. For instance:

const stopCheck: StopConfig = {
check_name: "task_completion_delete_account",
stop_on: true,
};

This configuration stops the conversation when the account deletion task is completed.

Custom checks can be created to evaluate various aspects of the conversation:

okareo.create_or_update_check({
name: "task_completion_delete_account",
description: "Check if the agent confirms account deletion",
check_config: {
type: CheckOutputType.PASS_FAIL,
prompt_template: "<CHECK_PROMPT>" // The prompt describing the desired behavior
}
});

These checks can assess task completion, adherence to guidelines, or any other relevant criteria.

upload_scenario_set

Batch upload jsonl formatted data to create a scenario. This is the most efficient method for pushing large data sets for tests and evaluations.

import { Okareo } from "okareo-ts-sdk";
const okareo = new Okareo({api_key:OKAREO_API_KEY});
const data: any = await okareo.upload_scenario_set(
{
file_path: "example_data/seed_data.jsonl",
scenario_name: "Uploaded Scenario Set",
project_id: project_id
}
);
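If your seed data lives in memory, you first need to serialize it as JSON Lines before uploading. This sketch assumes each line is a JSON object with the same input and result keys used for seed data elsewhere in this document; `JsonlRow` and `toJsonl` are hypothetical names, and writing the string to disk is left to your file API of choice.

```typescript
// One scenario row per line; values may be strings or structured objects.
interface JsonlRow {
  input: unknown;
  result: unknown;
}

// Serialize rows as JSON Lines: one compact JSON object per line, newline-terminated.
function toJsonl(rows: JsonlRow[]): string {
  return rows.map((row) => JSON.stringify(row)).join("\n") + "\n";
}

const jsonl = toJsonl([
  { input: "Capital of France?", result: "Paris" },
  { input: "5 + 7 =", result: "12" },
]);
```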

Reporters

The Okareo TypeScript SDK includes a set of reporters. The reporters allow you to get rapid feedback in CI or locally from the command line.

Reporters are convenience functions that interpret evaluations based on thresholds that you provide. The reporters are not persisted and do not alter or change the evaluation. They are simply conveniences for rapid summarization locally and in CI.

Singleton Evaluation Reporters

There are two categories of reporters. The singleton reporters are based on specific evaluation types and can report on each. You can set thresholds specific to classification, retrieval, or generation and the reporters will provide detailed pass/fail information. The second category provides trend information. The history reporter takes a list of evaluations along with a threshold instance and returns a table of results over time.

Class ClassificationReporter

The classification reporter takes the evaluated metrics and the confusion matrix and returns a pass/fail, count of errors, and the specific metric that fails.

info

By convention we define the reporter thresholds independently. This way we can re-use them in trend analysis and across evaluations.

Example console output from passing test: Okareo Diagram

import { ClassificationReporter } from "okareo-ts-sdk";
/*
... body of evaluation
*/
const eval_thresholds = {
error_max: 8,
metrics_min: {
precision: 0.95,
recall: 0.9,
f1: 0.9,
accuracy: 0.95
}
}
const reporter = new ClassificationReporter({
eval_run:classification_run,
...eval_thresholds,
});
reporter.log(); //provides a table of results
/*
// do something if it fails
if (!reporter.pass) { ... }
*/
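The threshold comparison itself is simple to reason about. The sketch below is not the SDK's implementation; it only illustrates one plausible reading of error_max and metrics_min, in which a run passes when no metric falls below its minimum and the error count does not exceed the maximum. `evaluateThresholds`, `Thresholds`, and `Verdict` are hypothetical names.

```typescript
interface Thresholds {
  error_max: number;
  metrics_min: Record<string, number>;
}

interface Verdict {
  pass: boolean;
  failed_metrics: string[];
  errors: number;
}

function evaluateThresholds(
  metrics: Record<string, number | undefined>,
  errors: number,
  thresholds: Thresholds
): Verdict {
  // A metric fails when it is missing or strictly below its minimum.
  const failed = Object.keys(thresholds.metrics_min).filter((name) => {
    const value = metrics[name];
    return value === undefined || value < thresholds.metrics_min[name];
  });
  return {
    pass: failed.length === 0 && errors <= thresholds.error_max,
    failed_metrics: failed,
    errors,
  };
}

const verdict = evaluateThresholds(
  { precision: 0.96, recall: 0.88, f1: 0.92, accuracy: 0.95 },
  3,
  { error_max: 8, metrics_min: { precision: 0.95, recall: 0.9, f1: 0.9, accuracy: 0.95 } }
);
// verdict.pass is false and verdict.failed_metrics is ["recall"]
```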

Class RetrievalReporter

The retrieval reporter provides a shortcut for metrics @k. Each metric can reference a different k value. The result of the report is always in summary form and only returns metrics that exceed thresholds.

Example console output from failing test: Okareo Diagram

import { retrieval_reporter } from "okareo-ts-sdk";
/*
... body of evaluation
*/
const report = retrieval_reporter(
{
eval_run:data, // data from a retrieval run
metrics_min: {
'Accuracy@k': {
value: 0.96,
at_k: 3
},
'Precision@k': {
value: 0.5,
at_k: 1 // can use different k values by metric
},
'Recall@k': {
value: 0.8,
at_k: 2 // can use different k values by metric
},
'NDCG@k': {
value: 0.2,
at_k: 3
},
'MRR@k': {
value: 0.96,
at_k: 3
},
'MAP@k': {
value: 0.96,
at_k: 3
}
}
}
);
expect(report.pass).toBeTruthy(); // example report assertion
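For readers less familiar with metrics @k, the sketch below computes three of them for a single query. These are common textbook definitions and may differ in detail from how Okareo calculates its metrics (for example, precision here divides by k even when fewer than k results are returned). All function names are hypothetical.

```typescript
// Fraction of the top-k results that are relevant.
function precisionAtK(ranked: string[], relevant: Set<string>, k: number): number {
  if (k <= 0) return 0;
  const hits = ranked.slice(0, k).filter((id) => relevant.has(id)).length;
  return hits / k;
}

// Fraction of all relevant items that appear in the top-k results.
function recallAtK(ranked: string[], relevant: Set<string>, k: number): number {
  if (relevant.size === 0) return 0;
  const hits = ranked.slice(0, k).filter((id) => relevant.has(id)).length;
  return hits / relevant.size;
}

// Reciprocal rank of the first relevant item within the top k, or 0 if none is found.
function reciprocalRankAtK(ranked: string[], relevant: Set<string>, k: number): number {
  const limit = Math.min(k, ranked.length);
  for (let i = 0; i < limit; i++) {
    if (relevant.has(ranked[i])) return 1 / (i + 1);
  }
  return 0;
}

const ranked = ["d3", "d1", "d7", "d2"]; // retrieval order for one query
const relevant = new Set(["d1", "d2"]); // expected documents for that query

const p1 = precisionAtK(ranked, relevant, 1); // 0: the top result is not relevant
const r2 = recallAtK(ranked, relevant, 2); // 0.5: one of two relevant documents is in the top 2
const rr3 = reciprocalRankAtK(ranked, relevant, 3); // 0.5: first relevant document is at rank 2
```

Averaging the reciprocal rank over all queries in a scenario gives MRR@k.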

Class GenerationReporter

The generation reporter takes an arbitrary list of metric name:value pairs and reports on results that did not meet the defined minimum threshold. Often these metrics are unique to your circumstance. Boolean values will be treated as "0" or "1".

Example console output from failing test: Okareo Diagram

import { generation_reporter } from "okareo-ts-sdk";
/*
... body of evaluation
*/
const report = generation_reporter(
{
eval_run:data,
metrics_min: {
coherence: 4.9,
consistency: 3.2,
fluency: 4.7,
relevance: 4.3,
overall: 4.1
}
}
);
expect(report.pass).toBeTruthy(); // example report assertion

History Reporter

The second category of reporter provides historical information based on a series of test runs. Like the singletons, each reporter analyzes a single evaluation type at a time. However, the mechanism is shared across all types.

Class EvaluationHistoryReporter

The EvaluationHistoryReporter requires four inputs: the evaluation type, list of evals, assertions, and the number to render. The type must be one of the Okareo TestRunType definitions. The assertions are shared with the singleton reports.

info

By convention we define the reporter thresholds independently. Re-using thresholds between singleton reports and historic reports is one of the many reasons.

Classification Report Okareo Diagram

Retrieval Report Okareo Diagram

Generation Report Okareo Diagram

const history_class = new EvaluationHistoryReporter({
type: TestRunType.MULTI_CLASS_CLASSIFICATION,
evals:[TEST_RUN_CLASSIFICATION as components["schemas"]["TestRunItem"], TEST_RUN_CLASSIFICATION as components["schemas"]["TestRunItem"]],
assertions: class_metrics,
last_n: 5,
});
history_class.log();

Exporting Reports for CI

Class JSONReporter

When using Okareo as part of a CI run, it is useful to export evaluations into a common location that can be picked up by the CI analytics.

By using JSONReporter.log([eval_run, ...]) after each evaluation, Okareo will collect the json results in ./.okareo/reports. The location can be controlled as part of the CLI with the -r LOCATION or --report LOCATION parameters. The output JSON is useful in CI for historical reference.

info

JSONReporter.log([eval_run, ...]) will output to the console unless the evaluation is initiated by the CLI.

import { JSONReporter } from 'okareo-ts-sdk';
const reporter = new JSONReporter([eval_run]);
reporter.log();