Checks

What is a check?

In Okareo, a check is a unit of code that scores a generative model's output. The intention of a check is to assess a particular behavior of your LLM. Checks offer a number of benefits:

  • While LLMs may behave stochastically, checks are deterministic, letting you objectively assess your model's performance.
  • Checks can be narrowly scoped to assess specific model behaviors, letting you incorporate domain knowledge into your evaluation.

With checks, you can answer behavioral questions like:

  • Did the check pass? Was the check's threshold exceeded?
  • In what situations did this check fail?
  • Did the check change between Version A and Version B of my model?

Cookbook examples that showcase Okareo checks are available in the Okareo cookbook repository.

Creating or Updating Checks

Okareo provides a create_or_update_check method to create new checks or update existing ones. This method allows you to define checks using either a ModelBasedCheck or a CodeBasedCheck.

Types of Checks

ModelBasedCheck

A ModelBasedCheck uses a prompt template to evaluate the data. It's particularly useful when you want to leverage an existing language model to perform the evaluation.

How to use ModelBasedCheck

Here's an example of how to create a check using ModelBasedCheck:

check_sample_score = okareo.create_or_update_check(
    name="check_sample_score",
    description="Check sample score",
    check=ModelBasedCheck(
        prompt_template="Only output the number of words in the following text: {input} {output} {generation}",
        check_type=CheckOutputType.SCORE,
    ),
)

In this example:

  • name: A unique identifier for the check.
  • description: A brief description of what the check does.
  • check: An instance of ModelBasedCheck.
    • prompt_template: A string that includes placeholders (input, output, generation) which will be replaced with actual values.
    • check_type: Specifies the type of output (SCORE or PASS_FAIL).

The prompt_template should include at least one of the following placeholders:

  • generation: corresponds to the model's output
  • input: corresponds to the scenario input
  • result: corresponds to the scenario result

The check_type should be one of:

  • CheckOutputType.SCORE: The template should prompt the model for a score (single number)
  • CheckOutputType.PASS_FAIL: The template should prompt the model for a boolean value (True/False)
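For instance, a PASS_FAIL check asks the judge for a True/False answer instead of a number. The snippet below is a minimal sketch that mirrors the SCORE example above; the check name, description, and prompt wording are illustrative, not built-in values.

# Illustrative PASS_FAIL variant of the earlier SCORE example.
check_sample_pass_fail = okareo.create_or_update_check(
    name="check_sample_pass_fail",
    description="Pass if the output addresses the scenario input",
    check=ModelBasedCheck(
        # Ask the judge for a boolean rather than a number.
        prompt_template="Output True if the following output addresses the input, otherwise output False. Input: {input} Output: {generation}",
        check_type=CheckOutputType.PASS_FAIL,
    ),
)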

CodeBasedCheck

A CodeBasedCheck uses custom code to evaluate the data. This is useful when you need more complex logic or want to incorporate domain-specific knowledge into your check.

How to use CodeBasedCheck

To use a CodeBasedCheck:

  1. Create a new Python file (not in a notebook).
  2. In this file, define a class named 'Check' that inherits from CodeBasedCheck.
  3. Implement the evaluate method in your Check class.
  4. Include any additional code used by your check in the same file.

Here's an example:

# In my_custom_check.py
from typing import Union

from okareo.checks import CodeBasedCheck


class Check(CodeBasedCheck):
    @staticmethod
    def evaluate(
        model_output: str, scenario_input: str, scenario_result: str
    ) -> Union[bool, int, float]:
        # Your evaluation logic here
        word_count = len(model_output.split())
        return word_count > 10  # Returns True if output has more than 10 words

Then, you can create or update the check using:

check_sample_code = okareo.create_or_update_check(
    name="check_sample_code",
    description="Check if output has more than 10 words",
    check=Check(),
)

The evaluate method should accept model_output, scenario_input, and scenario_result as arguments and return either a boolean, integer, or float.
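A check that returns a score rather than a pass/fail flag follows the same pattern. The file name and scoring logic below are an illustrative sketch, not a built-in check.

# In my_score_check.py -- illustrative sketch of a score-style check.
from okareo.checks import CodeBasedCheck


class Check(CodeBasedCheck):
    @staticmethod
    def evaluate(
        model_output: str, scenario_input: str, scenario_result: str
    ) -> float:
        # Fraction of expected-result words that appear in the model output.
        expected_words = set(scenario_result.lower().split())
        if not expected_words:
            return 0.0
        output_words = set(model_output.lower().split())
        return len(expected_words & output_words) / len(expected_words)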

Okareo checks

In Okareo, we provide out-of-the-box checks to let you quickly assess your LLM's performance. In the Okareo SDK, you can list the available checks by running the following method:

okareo.get_all_checks()
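For example, to print the names of the available checks (this sketch assumes each object returned by get_all_checks() exposes a name attribute):

# List the checks available to your project.
# Assumes each returned check object exposes a `name` attribute.
for check in okareo.get_all_checks():
    print(check.name)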

To use any of these checks, you simply specify them when running an evaluation as follows:

const checks = ['check_name_1', 'check_name_2', ..., 'check_name_N'];

// assume that "scenario" is a ScenarioSetResponse object or a UUID
const eval_results: any = await model.run_test({
    model_api_key: OPENAI_API_KEY,
    name: 'Evaluation Name',
    tags: ["Example", `Build:${UNIQUE_BUILD_ID}`],
    project_id: project_id,
    scenario_id: scenario_id,
    calculate_metrics: true,
    type: TestRunType.NL_GENERATION,
    checks: checks,
} as RunTestProps);

You can track your LLM's behaviors using:

  • Off-the-shelf Checks provided by Okareo
  • Publicly available Checks from the community Checklist (working title)
  • Custom Checks generated based on user-specific application requirements

As of now, the following out-of-the-box checks are available in Okareo:

  • conciseness
  • uniqueness
  • fluency/fluency_summary
  • coherence/coherence_summary
  • consistency/consistency_summary
  • relevance/relevance_summary
  • does_code_compile
  • contains_all_imports
  • compression_ratio
  • levenshtein_distance/levenshtein_distance_input

Automatic checks

The following checks make use of an LLM judge. The judge is provided with a system prompt describing the check in question, and each check rates the quality of the natural language text on an integer scale from 1 to 5.

note

Any Automatic check that includes _summary in its name makes use of the scenario_input in addition to the model_output.

Conciseness

Name: conciseness.

The conciseness check rates how concise the generated output is. If the model's output contains repeated ideas, the score will be lower.

Uniqueness

Name: uniqueness.

The uniqueness check rates how unique each output is compared to the other outputs in the evaluation. Consequently, this check uses all the rows in the scenario to score each row individually.

Fluency

Names: fluency, fluency_summary.

The fluency check is a measure of quality based on grammar, spelling, punctuation, word choice, and sentence structure. This check does not require a scenario input or result.

Coherence

Names: coherence, coherence_summary.

The coherence check is a measurement of structure and organization in a model's output. A higher score indicates that the output is well-structured and organized, and a lower score indicates the opposite.

Consistency

Names: consistency, consistency_summary.

The consistency check is a measurement of factual accuracy between the model output and the scenario input. This is useful in summarization tasks, where the model's input is a target document and the model's output is a summary of the target document.

Relevance

Names: relevance, relevance_summary.

The relevance check is a measure of summarization quality that rewards highly relevant information and penalizes redundancies or irrelevant information.

Code Generation checks

Does Code Compile

Name: does_code_compile.

This check verifies that the generated Python code compiles, which lets you tell whether the generated code contains any non-Pythonic content (e.g., natural language, HTML, etc.). Requesting the does_code_compile check will run the following evaluate method:

class Check(BaseCheck):
    @staticmethod
    def evaluate(model_output: str) -> bool:
        try:
            compile(model_output, '<string>', 'exec')
            return True
        except SyntaxError:
            return False
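Called directly, the evaluate method above behaves as follows (illustrative inputs):

# Illustrative calls against the evaluate method shown above.
print(Check.evaluate("print('hello world')"))                  # True: valid Python compiles
print(Check.evaluate("Sure! Here is the code you asked for:"))  # False: raises SyntaxError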

Code Contains All Imports

Name: contains_all_imports.

This check looks at all the object/function calls in the generated code and ensures that the corresponding import statements are included.
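Okareo's implementation is not reproduced here, but the idea can be sketched with Python's ast module: collect the names that the code imports or defines and verify that every top-level name it references is covered. The class below is a simplified, heuristic illustration of that approach, not the actual check.

# Simplified sketch of the idea behind contains_all_imports.
# This is NOT Okareo's implementation; it is a heuristic illustration only.
import ast
import builtins


class Check:
    @staticmethod
    def evaluate(model_output: str) -> bool:
        try:
            tree = ast.parse(model_output)
        except SyntaxError:
            return False

        imported = set()
        defined = set(dir(builtins))
        used = set()
        for node in ast.walk(tree):
            if isinstance(node, ast.Import):
                for alias in node.names:
                    imported.add(alias.asname or alias.name.split(".")[0])
            elif isinstance(node, ast.ImportFrom):
                for alias in node.names:
                    imported.add(alias.asname or alias.name)
            elif isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
                defined.add(node.name)
            elif isinstance(node, ast.arg):
                defined.add(node.arg)
            elif isinstance(node, ast.Name):
                if isinstance(node.ctx, ast.Store):
                    defined.add(node.id)
                elif isinstance(node.ctx, ast.Load):
                    used.add(node.id)

        # Every referenced top-level name must be imported, defined, or built in.
        return used <= imported | defined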

Natural Language checks

Compression Ratio

Name: compression_ratio.

The compression ratio is a measure of how much smaller (or larger) a generated text is compared with a scenario input. In Okareo, requesting the compression_ratio check will invoke the following evaluate method:

class Check(BaseCheck):
    @staticmethod
    def evaluate(model_output: str, scenario_input: str) -> float:
        return len(model_output) / len(scenario_input)
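For instance, a 6-character output produced from a 20-character input yields a ratio of 0.3; values below 1.0 mean the output is shorter than the input:

# Illustrative call against the evaluate method above.
ratio = Check.evaluate(model_output="A dog.", scenario_input="The quick brown dog.")
print(ratio)  # 6 / 20 = 0.3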

Levenshtein Distance

Names: levenshtein_distance, levenshtein_distance_input.

The Levenshtein distance measures the number of edits needed to transform one string into another, where an "edit" can be an insertion, a deletion, or a substitution. In Okareo, requesting the levenshtein_distance check will invoke the following evaluate method:

class Check(BaseCheck):
    @staticmethod
    def evaluate(model_output: str, scenario_response: str):
        # use Levenshtein distance with uniform weights
        weights = [1, 1, 1]
        return levenshtein_distance(model_output, scenario_response, weights)


def levenshtein_distance(s1, s2, weights):
    if len(s1) < len(s2):
        return levenshtein_distance(s2, s1, weights)

    if len(s2) == 0:
        return len(s1)

    previous_row = range(len(s2) + 1)
    for i, c1 in enumerate(s1):
        current_row = [i + 1]
        for j, c2 in enumerate(s2):
            insertions = previous_row[j + 1] + weights[0]
            deletions = current_row[j] + weights[1]
            substitutions = previous_row[j] + (c1 != c2) * weights[2]
            current_row.append(min(insertions, deletions, substitutions))
        previous_row = current_row

    return previous_row[-1]
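As a quick illustration, transforming "kitten" into "sitting" takes three edits (two substitutions and one insertion), so the check returns 3:

# Illustrative call against the evaluate method above.
print(Check.evaluate("kitten", "sitting"))  # 3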

Similarly, the levenshtein_distance_input call will use the following evaluate method:

class Check(BaseCheck):
    @staticmethod
    def evaluate(model_output: str, scenario_input: str):
        # use Levenshtein distance with uniform weights
        weights = [1, 1, 1]
        return levenshtein_distance(model_output, scenario_input, weights)

Custom checks

If the out-of-the-box checks do not serve your needs, then you can generate and upload your own Python-based checks.

Generating checks

To help you create your own checks, the Okareo SDK provides the generate_check method. You can describe the logic of your check using natural language, and an LLM will generate an evaluate method meeting those requirements.

For example, we can try to generate a check that looks for natural language below.

const generated_check = await okareo.generate_check({
    project_id,
    name: "demo.summaryUnder256",
    description: "Pass if model_output contains at least one line of natural language.",
    output_data_type: "bool",
    requires_scenario_input: true,
    requires_scenario_result: true,
});

return await okareo.upload_check({
    project_id,
    ...generated_check
} as UploadEvaluatorProps);

note

Please ensure that requires_scenario_input and requires_scenario_result are correctly configured for your check.

For example, if your check relies on the scenario_input, then you should set requires_scenario_input to true.

Uploading checks

Given a generated check, the Okareo SDK provides the upload_check method, which uploads your custom check so it can be used in Okareo evaluations.

const upload_check: any = await okareo.upload_check({
    name: 'Example Uploaded Check',
    project_id,
    description: "Pass if the model result length is within 10% of the expected result.",
    requires_scenario_input: false,
    requires_scenario_result: true,
    output_data_type: "bool",
    file_path: "tests/example_eval.py",
    update: true
});

note

Your evaluate function must be saved locally as a .py file, and the file_path should point to this .py file.
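As an illustration, the check described above (pass if the model result length is within 10% of the expected result) could look roughly like the sketch below in tests/example_eval.py. This assumes uploaded checks follow the same Check class convention as CodeBasedCheck and that evaluate receives only the arguments flagged as required; it is a sketch, not the exact file used in the example.

# tests/example_eval.py -- illustrative sketch only.
# Assumes the same Check class convention as CodeBasedCheck, with evaluate
# receiving only the scenario result (requires_scenario_result: true).
from okareo.checks import CodeBasedCheck


class Check(CodeBasedCheck):
    @staticmethod
    def evaluate(model_output: str, scenario_result: str) -> bool:
        if not scenario_result:
            return False
        # Pass if the output length is within 10% of the expected result length.
        return abs(len(model_output) - len(scenario_result)) <= 0.1 * len(scenario_result)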

Evaluating with uploaded checks

Once the check has been uploaded, you can use the check in a model_under_test.run_test by adding the name or the ID of the check to your list of checks. For example:

// provide a list of checks by name or ID
const eval_results: any = await model.run_test({
    model_api_key: OPENAI_API_KEY,
    name: 'Evaluation Name',
    tags: ["Example", `Build:${UNIQUE_BUILD_ID}`],
    project_id: project_id,
    scenario_id: scenario_id,
    calculate_metrics: true,
    type: TestRunType.NL_GENERATION,
    checks: [
        "check_name_1",
        "check_name_2",
        ...
    ],
} as RunTestProps);