LLM Engines¶
Engine implementations that adapt different LLM providers to Kani’s engine interface. Each engine handles API communication, token counting, streaming, and provider-specific quirks.
OpenAI Anthropic Google Ollama Mock
Overview¶
PathFinder supports multiple LLM providers through Kani engine subclasses:
OpenAI – Via Kani’s built-in
OpenAIEngine, extended with Responses API supportAnthropic – Extended with prompt caching for 90% cost reduction on long system prompts
Google – Via Kani’s built-in
GoogleEngineOllama – Local models via OpenAI-compatible API
Mock – Deterministic engine for E2E testing (keyword-matched tool calls)
Class Hierarchy¶
classDiagram
class BaseEngine {
+predict()
+stream()
}
class OpenAIEngine
class ResponsesOpenAIEngine {
+strips encrypted_content
}
class AnthropicEngine
class CachedAnthropicEngine {
+prompt caching
+thinking-block fix
}
class MockEngine {
+keyword matching
+deterministic
}
BaseEngine <|-- OpenAIEngine
OpenAIEngine <|-- ResponsesOpenAIEngine
BaseEngine <|-- AnthropicEngine
AnthropicEngine <|-- CachedAnthropicEngine
BaseEngine <|-- MockEngine
Design Decisions¶
Why custom engine subclasses?
Each LLM provider has quirks that require engine-level fixes:
OpenAI’s Responses API doesn’t accept
encrypted_contentfor non-reasoning models –ResponsesOpenAIEnginestrips itAnthropic’s API returns bare thinking blocks that fail Pydantic validation –
CachedAnthropicEnginepatches the responsePrompt caching (Anthropic) can reduce costs by 90% for repeated system prompts – implemented at the engine level, transparent to agents
Mock engine for E2E testing
The mock engine returns predetermined tool calls based on keyword matching in the user’s message. Everything downstream (WDK API calls, database mutations, gene set operations, auto-build) runs against real services. This catches integration bugs that pure unit tests miss.
OpenAI Responses Engine¶
Purpose: OpenAI engine using the Responses API. Strips
reasoning.encrypted_content from non-reasoning models to prevent 400 errors.
OpenAI engine that uses Responses API without forcing encrypted reasoning.
Kani’s OpenAIEngine unconditionally adds include=["reasoning.encrypted_content"]
for all Responses API calls, but non-reasoning models (gpt-4.1, gpt-4.1-mini,
gpt-4.1-nano) reject this parameter. This subclass strips it for those models.
- class veupath_chatbot.ai.engines.responses_openai.ResponsesOpenAIEngine(*args, **kwargs)[source]¶
Bases:
OpenAIEngineOpenAIEngine that always uses the Responses API.
Strips
reasoning.encrypted_contentfrom theincludeparameter for models that don’t support reasoning, preventing 400 errors.- __init__(*args, **kwargs)[source]¶
- Parameters:
api_key – Your OpenAI API key. By default, the API key will be read from the OPENAI_API_KEY environment variable.
model – The id of the model to use (e.g. “gpt-4o-mini”, “ft:gpt-3.5-turbo:my-org:custom_suffix:id”).
max_context_size – The maximum amount of tokens allowed in the chat prompt. If None, uses the given model’s full context size.
api_type – Whether to use the Chat Completions API (default for most models) or Responses API (default for “deep-reasoning” style models). If unset, the best API type for the given model will be chosen.
organization – The OpenAI organization to use in requests. By default, the org ID would be read from the OPENAI_ORG_ID environment variable (defaults to the API key’s default org if not set).
retry – How many times the engine should retry failed HTTP calls with exponential backoff (default 5).
api_base – The base URL of the OpenAI API to use.
headers – A dict of HTTP headers to include with each request.
client – An instance of openai.AsyncOpenAI (for reusing the same client in multiple engines). You must specify exactly one of
(api_key, client). If this is passed theorganization,retry,api_base, andheadersparams will be ignored.tokenizer – The tokenizer to use for token estimation - for OpenAI models this will be loaded automatically. A class with a
.encode(text: str)method that returns a list (usually of token ids).hyperparams – The arguments to pass to the
create_chat_completioncall with each request. See https://platform.openai.com/docs/api-reference/chat/create for a full list of params.
Anthropic Cached Engine¶
Purpose: Anthropic engine with prompt caching and thinking-block fixes. Adds cache control markers to system messages, reducing cost by up to 90% on repeated conversations. Also fixes Pydantic validation errors for bare thinking-block responses.
Anthropic engine with prompt caching and thinking-block fixes.
- class veupath_chatbot.ai.engines.cached_anthropic.CachedAnthropicEngine(api_key=None, model='claude-sonnet-4-0', max_tokens=2048, max_context_size=None, *, retry=2, api_base=None, headers=None, client=None, **hyperparams)[source]¶
Bases:
AnthropicEngineAnthropicEngine subclass that adds prompt caching and fixes thinking blocks.
Anthropic’s prompt caching reduces cache-hit costs by 90%.
Wraps single-MessagePart content in a list to prevent Pydantic validation errors when the response is a bare thinking block.
Mock Engine (E2E Testing)¶
Purpose: Deterministic mock LLM engine for E2E testing. Returns predetermined tool calls based on keyword matching in the user’s message. All downstream services (WDK, database, gene sets) run real – only the LLM call is mocked.
Design: The mock engine enables testing the full application stack (HTTP -> services -> integrations -> persistence) without LLM API costs or non-determinism. Test scenarios define expected tool call sequences that the mock replays in order.
Deterministic mock engine for E2E testing.
Returns predetermined tool calls based on keyword matching on the user message. The ONLY fake in the stack — everything downstream (WDK API, PostgreSQL, Redis, gene sets, auto-build) runs real.
- class veupath_chatbot.ai.engines.mock.MockEngine(site_id='plasmodb')[source]¶
Bases:
BaseEngineDeterministic kani engine for E2E testing.
Returns predetermined tool calls based on keyword matching on the user message. After tool results appear in history, returns plain text to exit the full_round loop.
The ONLY mock in the stack — all downstream systems run real.
- prompt_len(messages, functions=None, **kwargs)[source]¶
Returns the number of tokens used by the given prompt (i.e., list of messages and functions), or a best estimate if the exact count is unavailable.
This method MAY be asynchronous. Use
Kani.prompt_token_len()for a higher-level interface that handles asynchrony.- Parameters:
messages (list[ChatMessage]) – The messages in the prompt.
functions (list[AIFunction] | None) – The functions included in the prompt.
kwargs (object) – Any additional parameters to pass to the underlying token counting implementation (engine-specific).
- Return type:
- async predict(messages, functions=None, **hyperparams)[source]¶
Given the current context of messages and available functions, get the next predicted chat message from the LM.
- Parameters:
messages (list[ChatMessage]) – The messages in the current chat context.
prompt_len(messages, functions)is guaranteed to be less than max_context_size.functions (list[AIFunction] | None) – The functions the LM is allowed to call.
hyperparams (object) – Any additional parameters to pass to the engine.
- Return type:
- async stream(messages, functions=None, **hyperparams)[source]¶
Optional: Stream a completion from the engine, token-by-token.
This method’s signature is the same as
BaseEngine.predict().This method should yield strings as an asynchronous iterable.
Optionally, this method may also yield a
BaseCompletion. If it does, it MUST be the last item yielded by this method.If an engine does not implement streaming, this method will yield the entire text of the completion in a single chunk by default.
- Parameters:
messages (list[ChatMessage]) – The messages in the current chat context.
prompt_len(messages, functions)is guaranteed to be less than max_context_size.functions (list[AIFunction] | None) – The functions the LM is allowed to call.
hyperparams (object) – Any additional parameters to pass to the engine.
- Return type: