LLM Engines

Engine implementations that adapt different LLM providers to Kani’s engine interface. Each engine handles API communication, token counting, streaming, and provider-specific quirks.


Overview

PathFinder supports multiple LLM providers through Kani engine subclasses:

  • OpenAI – Via Kani’s built-in OpenAIEngine, extended with Responses API support

  • Anthropic – Extended with prompt caching, which cuts cost by up to 90% on long, repeated system prompts

  • Google – Via Kani’s built-in GoogleEngine

  • Ollama – Local models via OpenAI-compatible API

  • Mock – Deterministic engine for E2E testing (keyword-matched tool calls)

Class Hierarchy

        classDiagram
    class BaseEngine {
        +predict()
        +stream()
    }
    class OpenAIEngine
    class ResponsesOpenAIEngine {
        +strips encrypted_content
    }
    class AnthropicEngine
    class CachedAnthropicEngine {
        +prompt caching
        +thinking-block fix
    }
    class MockEngine {
        +keyword matching
        +deterministic
    }

    BaseEngine <|-- OpenAIEngine
    OpenAIEngine <|-- ResponsesOpenAIEngine
    BaseEngine <|-- AnthropicEngine
    AnthropicEngine <|-- CachedAnthropicEngine
    BaseEngine <|-- MockEngine
    

Design Decisions

Why custom engine subclasses?

Each LLM provider has quirks that require engine-level fixes:

  • OpenAI’s Responses API doesn’t accept encrypted_content for non-reasoning models – ResponsesOpenAIEngine strips it

  • Anthropic’s API returns bare thinking blocks that fail Pydantic validation – CachedAnthropicEngine patches the response

  • Prompt caching (Anthropic) can reduce costs by up to 90% for repeated system prompts – implemented at the engine level, so it is transparent to agents

Mock engine for E2E testing

The mock engine returns predetermined tool calls based on keyword matching in the user’s message. Everything downstream (WDK API calls, database mutations, gene set operations, auto-build) runs against real services. This catches integration bugs that pure unit tests miss.

OpenAI Responses Engine

Purpose: OpenAI engine using the Responses API. Strips reasoning.encrypted_content from non-reasoning models to prevent 400 errors.

OpenAI engine that uses the Responses API without forcing encrypted reasoning.

Kani’s OpenAIEngine unconditionally adds include=["reasoning.encrypted_content"] for all Responses API calls, but non-reasoning models (gpt-4.1, gpt-4.1-mini, gpt-4.1-nano) reject this parameter. This subclass strips it for those models.
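The filtering step described above can be sketched as a pure function. This is a minimal illustration, not the subclass's actual code: the helper name strip_encrypted_reasoning, the prefix-based model check, and the shape of the request dict are all assumptions made for the example.

```python
from typing import Any

# Assumption: non-reasoning models are recognised by name prefix.
# The gpt-4.1 family is taken from the text above; the matching rule is a guess.
NON_REASONING_PREFIXES = ("gpt-4.1",)
ENCRYPTED_REASONING = "reasoning.encrypted_content"


def strip_encrypted_reasoning(model: str, request_kwargs: dict[str, Any]) -> dict[str, Any]:
    """Return a copy of request_kwargs that is safe to send for `model`.

    For non-reasoning models, the encrypted-reasoning entry is removed from
    `include`; if nothing else is left, the `include` key is dropped entirely.
    """
    cleaned = dict(request_kwargs)  # shallow copy; the caller's dict is untouched
    if not model.startswith(NON_REASONING_PREFIXES):
        return cleaned
    include = [item for item in (cleaned.get("include") or []) if item != ENCRYPTED_REASONING]
    if include:
        cleaned["include"] = include
    else:
        cleaned.pop("include", None)
    return cleaned
```

Reasoning models pass through unchanged, so the same call path can serve both kinds of model.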

class veupath_chatbot.ai.engines.responses_openai.ResponsesOpenAIEngine(*args, **kwargs)

Bases: OpenAIEngine

OpenAIEngine that always uses the Responses API.

Strips reasoning.encrypted_content from the include parameter for models that don’t support reasoning, preventing 400 errors.

__init__(*args, **kwargs)

Parameters:
  • api_key – Your OpenAI API key. By default, the API key will be read from the OPENAI_API_KEY environment variable.

  • model – The id of the model to use (e.g. “gpt-4o-mini”, “ft:gpt-3.5-turbo:my-org:custom_suffix:id”).

  • max_context_size – The maximum number of tokens allowed in the chat prompt. If None, uses the given model’s full context size.

  • api_type – Whether to use the Chat Completions API (default for most models) or Responses API (default for “deep-reasoning” style models). If unset, the best API type for the given model will be chosen.

  • organization – The OpenAI organization to use in requests. By default, the org ID is read from the OPENAI_ORG_ID environment variable (falling back to the API key’s default org if not set).

  • retry – How many times the engine should retry failed HTTP calls with exponential backoff (default 5).

  • api_base – The base URL of the OpenAI API to use.

  • headers – A dict of HTTP headers to include with each request.

  • client – An instance of openai.AsyncOpenAI (for reusing the same client in multiple engines). You must specify exactly one of (api_key, client). If a client is passed, the organization, retry, api_base, and headers params are ignored.

  • tokenizer – The tokenizer to use for token estimation; for OpenAI models this is loaded automatically. Must be an object with an .encode(text: str) method that returns a list (usually of token IDs).

  • hyperparams – The arguments to pass to the create_chat_completion call with each request. See https://platform.openai.com/docs/api-reference/chat/create for a full list of params.

Anthropic Cached Engine

Purpose: Anthropic engine with prompt caching and thinking-block fixes. Adds cache control markers to system messages, reducing cost by up to 90% on repeated conversations. Also fixes Pydantic validation errors for bare thinking-block responses.

Anthropic engine with prompt caching and thinking-block fixes.

class veupath_chatbot.ai.engines.cached_anthropic.CachedAnthropicEngine(api_key=None, model='claude-sonnet-4-0', max_tokens=2048, max_context_size=None, *, retry=2, api_base=None, headers=None, client=None, **hyperparams)

Bases: AnthropicEngine

AnthropicEngine subclass that adds prompt caching and fixes thinking blocks.

  • Adds Anthropic prompt caching: input tokens read from the cache cost about 90% less than uncached input tokens.

  • Wraps single-MessagePart content in a list to prevent Pydantic validation errors when the response is a bare thinking block.
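The list-wrapping fix in the second bullet can be pictured as a small normalisation step. The sketch below uses a stand-in ThinkingPart dataclass rather than Kani's real MessagePart types, and the helper name ensure_part_list is hypothetical; it only illustrates the idea that message content should be a string, None, or a list, never a lone part.

```python
from dataclasses import dataclass
from typing import Any


@dataclass
class ThinkingPart:
    """Stand-in for a thinking-block message part (illustration only)."""
    text: str


def ensure_part_list(content: Any) -> Any:
    """Wrap a lone message part in a list; leave str, None, and lists as-is."""
    if content is None or isinstance(content, (str, list)):
        return content
    return [content]
```

Applied before a message object is constructed, a normalisation like this lets validation see a one-element list instead of a bare part.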

Mock Engine (E2E Testing)

Purpose: Deterministic mock LLM engine for E2E testing. Returns predetermined tool calls based on keyword matching in the user’s message. All downstream services (WDK, database, gene sets) are real; only the LLM call is mocked.

Design: The mock engine enables testing the full application stack (HTTP -> services -> integrations -> persistence) without LLM API costs or non-determinism. Test scenarios define expected tool call sequences that the mock replays in order.

Deterministic mock engine for E2E testing.

Returns predetermined tool calls based on keyword matching on the user message. This engine is the only fake in the stack: everything downstream (WDK API, PostgreSQL, Redis, gene sets, auto-build) runs against real services.

class veupath_chatbot.ai.engines.mock.MockEngine(site_id='plasmodb')

Bases: BaseEngine

Deterministic Kani engine for E2E testing.

Returns predetermined tool calls based on keyword matching on the user message. After tool results appear in history, returns plain text to exit the full_round loop.

This is the only mock in the stack; all downstream systems are real.

max_context_size: int = 128000

The maximum context size supported by this engine’s LM.

__init__(site_id='plasmodb')

prompt_len(messages, functions=None, **kwargs)

Returns the number of tokens used by the given prompt (i.e., list of messages and functions), or a best estimate if the exact count is unavailable.

This method MAY be asynchronous. Use Kani.prompt_token_len() for a higher-level interface that handles asynchrony.

Parameters:
  • messages (list[ChatMessage]) – The messages in the prompt.

  • functions (list[AIFunction] | None) – The functions included in the prompt.

  • kwargs (object) – Any additional parameters to pass to the underlying token counting implementation (engine-specific).

Return type:

int

async predict(messages, functions=None, **hyperparams)

Given the current context of messages and available functions, get the next predicted chat message from the LM.

Parameters:
  • messages (list[ChatMessage]) – The messages in the current chat context. prompt_len(messages, functions) is guaranteed to be less than max_context_size.

  • functions (list[AIFunction] | None) – The functions the LM is allowed to call.

  • hyperparams (object) – Any additional parameters to pass to the engine.

Return type:

BaseCompletion

async stream(messages, functions=None, **hyperparams)

Optional: Stream a completion from the engine, token-by-token.

This method’s signature is the same as BaseEngine.predict().

This method should yield strings as an asynchronous iterable.

Optionally, this method may also yield a BaseCompletion. If it does, it MUST be the last item yielded by this method.

If an engine does not implement streaming, this method will yield the entire text of the completion in a single chunk by default.

Parameters:
  • messages (list[ChatMessage]) – The messages in the current chat context. prompt_len(messages, functions) is guaranteed to be less than max_context_size.

  • functions (list[AIFunction] | None) – The functions the LM is allowed to call.

  • hyperparams (object) – Any additional parameters to pass to the engine.

Return type:

AsyncIterable[str | BaseCompletion]