Evaluation Engine

The evaluation engine is the backend service that powers the workbench’s analysis features. It evaluates search performance against positive and negative control gene sets, computes classification and rank metrics, and runs cross-validation and enrichment analysis. The workbench UI at /workbench consumes these endpoints.

        flowchart LR
    A["Gene Set + Controls"] --> G{targetGeneIds?}
    G -- yes --> H["Set Intersection<br/>(no WDK call)"]
    G -- no --> B["Run Search on WDK"]
    B --> C["Evaluate Controls"]
    H --> C
    C --> D["Metrics<br/>P/R/F1"]
    C --> E["Cross-Validation"]
    C --> F["Enrichment"]

    style A fill:#2563eb,color:#fff
    style G fill:#f59e0b,color:#000
    style H fill:#10b981,color:#fff
    style C fill:#7c3aed,color:#fff

Evaluation Modes

The evaluation engine supports two evaluation modes:

Gene-ID mode (workbench gene sets): When targetGeneIds is provided in the experiment config, the engine skips WDK search re-execution and evaluates by pure set intersection against the control genes. This is the correct path for workbench gene sets, which already contain materialized gene IDs.
Search re-execution mode (strategy evaluation): When targetGeneIds is absent, the engine runs the WDK search using searchName and parameters from the config and evaluates the results against the controls. This is the correct path when evaluating a search configuration itself, for example when the AI agent builds a strategy and needs to test its performance before the results have been materialized into a gene set.

Important

The benchmark and evaluate panels in the workbench both send targetGeneIds from the active gene set. This ensures metrics are computed against the actual gene set contents, not a potentially stale re-execution of search parameters.
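The two modes amount to a single dispatch on targetGeneIds. The sketch below is illustrative only: `resolve_result_ids`, `control_hits`, the `run_search` callable, and the flat dict config are hypothetical stand-ins rather than the engine's actual functions. Only the targetGeneIds, searchName, and parameters keys come from the description above, and treating an empty list as "provided" is an assumption.

```python
from typing import Callable, Iterable


def resolve_result_ids(
    config: dict,
    run_search: Callable[[str, dict], Iterable[str]],
) -> set[str]:
    """Return the gene IDs to evaluate, choosing the mode from the config."""
    target_ids = config.get("targetGeneIds")
    if target_ids is not None:
        # Gene-ID mode: the IDs are already materialized, so no search runs.
        return set(target_ids)
    # Search re-execution mode: run_search is a hypothetical stand-in for WDK.
    return set(run_search(config["searchName"], config["parameters"]))


def control_hits(
    result_ids: set[str], positives: set[str], negatives: set[str]
) -> dict[str, int]:
    """Count control genes present in the result set by set intersection."""
    return {
        "positive_hits": len(result_ids & positives),
        "negative_hits": len(result_ids & negatives),
    }
```

Passing the search as a callable keeps the gene-ID path free of any network dependency, which is the property the workbench panels rely on.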

Execution Endpoints

Method	Endpoint	Description
POST	`/api/v1/experiments/`	Create and run a single experiment (SSE)
POST	`/api/v1/experiments/batch`	Run across multiple organisms (SSE)
POST	`/api/v1/experiments/benchmark`	Run against multiple control sets (SSE)
POST	`/api/v1/experiments/seed`	Seed demo strategies and control sets (SSE)

Analysis Endpoints

Cross-experiment (not scoped to a single experiment):

Method	Endpoint	Description
POST	`/api/v1/experiments/overlap`	Pairwise gene set overlap (Jaccard, shared/unique genes)
POST	`/api/v1/experiments/enrichment-compare`	Compare enrichment results across experiments

Per-experiment (scoped to a single experiment; {id} is the experiment ID):

Method	Endpoint	Description
POST	`/api/v1/experiments/{id}/cross-validate`	Run cross-validation on an existing experiment
POST	`/api/v1/experiments/{id}/enrich`	Run enrichment analysis
POST	`/api/v1/experiments/{id}/re-evaluate`	Re-run evaluation (e.g. after changing controls)
POST	`/api/v1/experiments/{id}/custom-enrich`	Custom enrichment request
POST	`/api/v1/experiments/{id}/threshold-sweep`	Threshold sweep for a parameter
GET	`/api/v1/experiments/{id}/export`	Download experiment report (HTML)

CRUD and Results

Method	Endpoint	Description
GET	`/api/v1/experiments/`	List experiments (optional site filter)
GET	`/api/v1/experiments/{id}`	Get one experiment
PATCH	`/api/v1/experiments/{id}`	Update (e.g. name)
DELETE	`/api/v1/experiments/{id}`	Delete an experiment

Results browsing (per-experiment):

Method	Endpoint	Description
GET	`/api/v1/experiments/{id}/results/attributes`	List available result attributes
GET	`/api/v1/experiments/{id}/results/records`	Paginated result records
POST	`/api/v1/experiments/{id}/results/record`	Get single record detail
GET	`/api/v1/experiments/{id}/results/distributions/{attr}`	Distribution data for an attribute
POST	`/api/v1/experiments/{id}/refine`	Refine/filter result records

Workbench chat (per-experiment conversational AI):

Method	Endpoint	Description
POST	`/api/v1/experiments/{id}/chat`	Start workbench chat stream (SSE)
GET	`/api/v1/experiments/{id}/chat/messages`	Get chat message history

Persistence

Experiments are stored in the experiments table (see veupath_chatbot.persistence.models.ExperimentRow) with the columns id, site_id, name, status, data (full JSON), batch_id, benchmark_id, created_at, and updated_at. The experiment store (veupath_chatbot.services.experiment.store) keeps an in-memory cache and persists every mutation to PostgreSQL.

Control Sets

Reusable positive/negative gene sets are managed at /api/v1/control-sets (CRUD). They can be referenced when creating experiments (e.g. control_set_id). See veupath_chatbot.persistence.models.ControlSet.

Experiment Streaming (CQRS)

Purpose: Background task launchers for experiment execution using a CQRS event model. Events are persisted to Redis Streams, and operations are registered in PostgreSQL. This is how long-running experiments (single, batch, benchmark) are started and how their progress reaches the frontend via SSE.

async veupath_chatbot.services.experiment.core.streaming.start_experiment(config, *, user_id=None)[source]¶

Launch a single experiment as a background task. Returns operation ID.

Return type:: str

async veupath_chatbot.services.experiment.core.streaming.start_batch_experiment(batch_config, *, user_id=None)[source]¶

Launch a batch experiment as a background task. Returns operation ID.

Return type:: str

async veupath_chatbot.services.experiment.core.streaming.start_benchmark(base_config, control_sets, *, user_id=None)[source]¶

Launch a benchmark suite as a background task. Returns operation ID.

Return type:: str

Service Layer

Core experiment service, orchestration, and store.

Experiment execution orchestrator.

Coordinates the full experiment lifecycle: evaluation, metrics computation, optional cross-validation, and optional enrichment analysis.

Each phase is a private function that mutates experiment and persists intermediate state to the store. The public run_experiment() function orchestrates phase sequencing, lifecycle management, and error handling.

async veupath_chatbot.services.experiment.service.run_experiment(config, *, user_id=None, progress_callback=None)[source]¶

Execute a full experiment and persist the result.

Parameters:

config (ExperimentConfig) – Experiment configuration.
user_id (str | None) – Owning user ID (for IDOR protection).
progress_callback (Callable[[JSONObject], Awaitable[None]] | None) – Optional async callback for SSE progress events.

Returns:

Completed experiment with all results.

Return type:

Experiment

Experiment store with write-through DB persistence.

Provides CRUD operations for experiment lifecycle management. Keeps an in-memory dict for fast synchronous access during experiment execution, and persists every mutation to PostgreSQL so experiments survive API restarts.
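The write-through pattern described here can be sketched in a few lines. This is a minimal illustration, not the WriteThruStore implementation: the class name `WriteThroughCache`, the async `save`, and the `persist` callback are all hypothetical, and the real store's method signatures may differ.

```python
from __future__ import annotations

from typing import Awaitable, Callable, Generic, TypeVar

T = TypeVar("T")


class WriteThroughCache(Generic[T]):
    """In-memory dict that forwards every write to a persistence callback."""

    def __init__(self, persist: Callable[[str, T], Awaitable[None]]) -> None:
        self._items: dict[str, T] = {}
        self._persist = persist  # hypothetical stand-in for the DB write

    async def save(self, key: str, value: T) -> None:
        self._items[key] = value         # update the cache first for fast reads
        await self._persist(key, value)  # then write through to durable storage

    def get(self, key: str) -> T | None:
        return self._items.get(key)      # synchronous read, no I/O
```

Reads stay synchronous and cheap during execution, while every mutation still reaches durable storage.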

class veupath_chatbot.services.experiment.store.ExperimentStore[source]¶

Bases: WriteThruStore[Experiment]

Experiment repository with in-memory cache and DB write-through.

Inherits save/get/delete/aget/adelete from WriteThruStore. Adds domain-specific listing methods.

list_all(site_id=None, user_id=None)[source]¶

List experiments from in-memory cache.

Return type:: list[Experiment]

list_by_benchmark(benchmark_id)[source]¶

Return all experiments belonging to a benchmark suite (in-memory).

Return type:: list[Experiment]

async alist_all(site_id=None, user_id=None)[source]¶

List experiments: merges DB rows with in-memory (fresher) state.

Return type:: list[Experiment]

async alist_by_benchmark(benchmark_id)[source]¶

List experiments by benchmark: merges DB + in-memory.

Return type:: list[Experiment]

veupath_chatbot.services.experiment.store.get_experiment_store()[source]¶

Get the global experiment store singleton.

Return type:: ExperimentStore

Shared helpers for experiment execution and analysis.

Provides gene-list extraction utilities and the progress callback type alias.

veupath_chatbot.services.experiment.helpers.ProgressCallback¶

Emits an SSE-friendly progress event dict.

alias of Callable[[JSONObject], Awaitable[None]]

veupath_chatbot.services.experiment.helpers.safe_int(val, default=0)[source]¶

Safely convert a value to int, returning default on failure.

Return type:: int

veupath_chatbot.services.experiment.helpers.safe_float(val, default=0.0)[source]¶

Safely convert a value to float, returning default on failure.

Non-finite values (inf, -inf, nan) are replaced with default because they are not JSON-serializable and PostgreSQL rejects them in JSON columns.

Return type:: float
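A conversion with this behaviour could look like the sketch below. The name `to_finite_float` is hypothetical and the body is an assumption about how such a helper might be written, not the source of safe_float.

```python
import math


def to_finite_float(val: object, default: float = 0.0) -> float:
    """Convert val to a finite float, or return default."""
    try:
        result = float(val)  # type: ignore[arg-type]
    except (TypeError, ValueError, OverflowError):
        # None, non-numeric strings, and ints too large for a float land here.
        return default
    # inf, -inf, and nan parse successfully but are not valid JSON numbers.
    return result if math.isfinite(result) else default
```

For example, `to_finite_float("inf")` and `to_finite_float(None)` both fall back to the default, while `to_finite_float("1.5")` yields 1.5.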

veupath_chatbot.services.experiment.helpers.extract_wdk_id(payload, key='id')[source]¶

Extract an integer ID from a WDK JSON response.

WDK formatters (StepFormatter, StrategyService, etc.) emit entity IDs as Java longs (always int in JSON) under a known key (typically "id" or "strategyId").

Parameters:

payload (object) – WDK response dict.
key (str) – JSON key containing the integer ID.

Returns:

The integer ID, or None if not found.

Return type:

int | None

veupath_chatbot.services.experiment.helpers.coerce_step_id(payload)[source]¶

Extract step ID from a WDK step-creation response.

Parameters:: payload (JSONObject | None) – WDK step-creation response.
Returns:: Step ID.
Raises:: ValueError – If step ID not found.
Return type:: int

async veupath_chatbot.services.experiment.helpers.extract_and_enrich_genes(*, site_id, result, negative_controls=None)[source]¶

Extract gene lists from a control-test result and enrich with WDK metadata.

Single entry point that replaces duplicated extract + enrich blocks.

Returns:: (true_positive, false_negative, false_positive, true_negative)
Return type:: tuple[list[GeneInfo], list[GeneInfo], list[GeneInfo], list[GeneInfo]]

Deserialize JSON dicts back into Experiment dataclass trees.

Simple sub-types are deserialized via the generic from_json converter. Only Experiment / ExperimentConfig require hand-written logic due to conditional field defaults and enrichment deduplication.

veupath_chatbot.services.experiment._deserialize.experiment_from_json(d)[source]¶

Reconstruct an Experiment from its JSON representation.

Parameters:: d (dict[str, Any]) – Dict produced by experiment_to_json().
Returns:: Fully hydrated Experiment dataclass.
Return type:: Experiment

WDK strategy materialization for experiments.

Creates, persists, and cleans up WDK strategies from experiment configs, including step tree materialization for multi-step and import modes.

async veupath_chatbot.services.experiment.materialization.cleanup_experiment_strategy(experiment)[source]¶

Delete the persisted WDK strategy when an experiment is deleted.

Parameters:: experiment (Experiment) – Experiment whose WDK strategy should be cleaned up.

Classification

Purpose: Gene record classification by experiment membership (TP/FP/FN/TN). Adds _classification field to WDK records based on gene ID membership in positive and negative control sets.

Gene record classification by experiment category membership.

Classifies WDK result records as TP / FP / FN / TN based on whether their gene ID appears in the experiment’s curated gene sets. Handles WDK transcript ID version suffixes (e.g. “GENE.1” -> “GENE”).

veupath_chatbot.services.experiment.classification.classify_records(records, tp_ids, fp_ids, fn_ids, tn_ids)[source]¶

Add _classification field to records based on gene ID membership.

For each record, extracts the primary key and checks membership in the four gene-set categories. WDK transcript IDs may include a version suffix (e.g. "PF3D7_0100100.1"); the function also checks the base ID with the suffix stripped.

Parameters:

records (list[JSONObject]) – WDK answer records (list of dicts).
tp_ids (set[str]) – True-positive gene IDs.
fp_ids (set[str]) – False-positive gene IDs.
fn_ids (set[str]) – False-negative gene IDs.
tn_ids (set[str]) – True-negative gene IDs.

Returns:

New list of records, each with a _classification field.

Return type:

list[JSONObject]

Evaluation Service

Purpose: Re-evaluation and threshold sweep service. Pure business logic for recomputing experiment metrics with updated controls or parameters.

Evaluation service: re-evaluate and threshold sweep.

Pure business logic extracted from the transport handler. No HTTP/SSE concerns here – callers (routers, tools, etc.) wrap the results in whatever transport format they need.

veupath_chatbot.services.experiment.evaluation.SWEEP_CONCURRENCY = 3¶: Max parallel WDK control-test runs per sweep.

veupath_chatbot.services.experiment.evaluation.SWEEP_TIMEOUT_S = 240¶: Server-side timeout for the entire sweep.

veupath_chatbot.services.experiment.evaluation.SWEEP_POINT_TIMEOUT_S = 90¶: Per-point timeout; prevents one slow point from blocking all.
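One common way to combine a concurrency cap with a per-point timeout is a semaphore plus asyncio.wait_for. The sketch below assumes that approach; `run_points` and the `evaluate` callable are hypothetical, and the constants simply mirror the values listed above.

```python
import asyncio
from typing import Awaitable, Callable

SWEEP_CONCURRENCY = 3
SWEEP_POINT_TIMEOUT_S = 90


async def run_points(
    values: list[str],
    evaluate: Callable[[str], Awaitable[dict]],
    *,
    concurrency: int = SWEEP_CONCURRENCY,
    point_timeout: float = SWEEP_POINT_TIMEOUT_S,
) -> list[dict]:
    """Evaluate every sweep value, at most `concurrency` at a time."""
    semaphore = asyncio.Semaphore(concurrency)

    async def run_one(value: str) -> dict:
        async with semaphore:
            try:
                metrics = await asyncio.wait_for(evaluate(value), timeout=point_timeout)
            except asyncio.TimeoutError:
                # A slow point is reported as failed instead of blocking the rest.
                return {"value": value, "metrics": None, "error": "timed out"}
            return {"value": value, "metrics": metrics}

    # gather preserves the input order of the sweep values.
    return await asyncio.gather(*(run_one(v) for v in values))
```

The overall SWEEP_TIMEOUT_S budget would wrap the whole call in a similar way, for example with another asyncio.wait_for around `run_points`.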

async veupath_chatbot.services.experiment.evaluation.re_evaluate(exp)[source]¶

Re-run control evaluation against the (possibly modified) strategy.

Updates the experiment in-place (metrics + gene lists) and persists it. Returns the full experiment JSON.

Return type:: JSONObject

veupath_chatbot.services.experiment.evaluation.compute_sweep_values(*, sweep_type, values, min_value, max_value, steps)[source]¶

Compute the list of parameter values for a sweep.

Parameters:

sweep_type (str) – "numeric" or "categorical".
values (list[str] | None) – Explicit values for categorical sweeps.
min_value (float | None) – Range start for numeric sweeps.
max_value (float | None) – Range end for numeric sweeps.
steps (int) – Number of evenly-spaced points for numeric sweeps.

Returns:

List of stringified sweep values.

Raises:

ValidationError – On invalid inputs.

Return type:

list[str]

veupath_chatbot.services.experiment.evaluation.validate_sweep_parameter(exp, param_name)[source]¶

Ensure param_name exists in the experiment config.

For single-step experiments, checks exp.config.parameters. For tree-mode experiments, walks the step tree looking for the parameter in any leaf node’s parameters dict.

Raises:: ValidationError – If the parameter is missing.

veupath_chatbot.services.experiment.evaluation.format_metrics_dict(m)[source]¶

Format an ExperimentMetrics into a JSON-friendly dict.

Return type:: JSONObject

async veupath_chatbot.services.experiment.evaluation.run_sweep_point(*, exp, param_name, value, is_categorical)[source]¶

Run a single sweep point: modify the parameter and evaluate.

For tree-mode experiments, clones the step tree and injects the swept parameter value into every node that contains it, then calls run_controls_against_tree(). For single-step experiments, modifies the flat parameter dict and calls run_positive_negative_controls().

Returns:: Dict with value, metrics (or None), and optionally error.
Return type:: JSONObject

async veupath_chatbot.services.experiment.evaluation.cleanup_before_sweep(site_id)[source]¶

Best-effort cleanup of leaked internal control-test strategies.

async veupath_chatbot.services.experiment.evaluation.generate_sweep_events(*, exp, param_name, sweep_type, sweep_values)[source]¶

Run the full sweep and yield SSE-formatted events.

Yields sweep_point events as each point completes, then a final sweep_complete event with all sorted results.

Return type:: AsyncIterator[str]
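For readers unfamiliar with the wire format, the sketch below formats and parses a single server-sent event frame. The event names sweep_point and sweep_complete come from the description above; the helper names and the payload fields are hypothetical, and the engine's actual frames may carry additional fields.

```python
import json


def format_sse(event: str, payload: dict) -> str:
    """Serialize one event as an SSE frame, terminated by a blank line."""
    return f"event: {event}\ndata: {json.dumps(payload)}\n\n"


def parse_sse(frame: str) -> tuple[str, dict]:
    """Parse one SSE frame back into (event name, JSON payload)."""
    event = "message"  # default event type when no "event:" field is present
    data_lines: list[str] = []
    for line in frame.splitlines():
        if line.startswith("event:"):
            event = line[len("event:"):].strip()
        elif line.startswith("data:"):
            data_lines.append(line[len("data:"):].strip())
    # Multiple data lines are joined with newlines before decoding.
    return event, json.loads("\n".join(data_lines))
```

A client would dispatch on the event name, appending each sweep_point to a running chart and replacing it with the final list when sweep_complete arrives.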

Metrics and Evaluation

Key Metrics

\[\text{Precision} = \frac{|TP|}{|TP| + |FP|} \qquad \text{Recall} = \frac{|TP|}{|TP| + |FN|} \qquad F_1 = 2 \cdot \frac{\text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}\]

Where \(TP\) = true positives (returned genes in positive controls), \(FP\) = false positives (returned genes in negative controls), \(FN\) = false negatives (positive control genes not returned).
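The formulas translate directly into code. In this sketch the function name is hypothetical, and returning 0.0 when a denominator is zero is a common convention assumed here, not something the formulas themselves specify.

```python
def precision_recall_f1(tp: int, fp: int, fn: int) -> tuple[float, float, float]:
    """Compute precision, recall, and F1 from raw counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denominator = precision + recall
    f1 = 2 * precision * recall / denominator if denominator else 0.0
    return precision, recall, f1
```

For example, with 8 true positives, 2 false positives, and 4 false negatives, precision is 8/10 = 0.8, recall is 8/12 ≈ 0.667, and F1 is 16/22 ≈ 0.727.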

Classification metrics, rank metrics, and statistical utilities.

Metrics engine for computing exhaustive classification metrics.

Computes all standard binary classification metrics from the raw intersection counts returned by run_positive_negative_controls().

veupath_chatbot.services.experiment.metrics.compute_confusion_matrix(*, positive_hits, total_positives, negative_hits, total_negatives)[source]¶

Derive a confusion matrix from control-test intersection counts.

Parameters:

positive_hits (int) – Number of positive controls found in results (TP).
total_positives (int) – Total positive controls provided.
negative_hits (int) – Number of negative controls found in results (FP).
total_negatives (int) – Total negative controls provided.

Returns:

Populated confusion matrix.

Return type:

ConfusionMatrix
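Since the hits are the TP and FP counts, the remaining two cells follow by subtraction from the control totals. The sketch below shows that arithmetic; the `Counts` dataclass and `counts_from_hits` are hypothetical and are not the ConfusionMatrix type itself.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Counts:
    tp: int
    fp: int
    fn: int
    tn: int


def counts_from_hits(
    *,
    positive_hits: int,
    total_positives: int,
    negative_hits: int,
    total_negatives: int,
) -> Counts:
    """Derive the four confusion-matrix cells from control intersection counts."""
    return Counts(
        tp=positive_hits,
        fp=negative_hits,
        fn=total_positives - positive_hits,  # positive controls the search missed
        tn=total_negatives - negative_hits,  # negative controls correctly excluded
    )
```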

veupath_chatbot.services.experiment.metrics.compute_metrics(cm, *, total_results=0)[source]¶

Compute all classification metrics from a confusion matrix.

Parameters:

cm (ConfusionMatrix) – Confusion matrix.
total_results (int) – Total number of results returned by the search.

Returns:

Full metrics object.

Return type:

ExperimentMetrics

veupath_chatbot.services.experiment.metrics.evaluate_gene_ids_against_controls(*, gene_ids, positive_controls, negative_controls, site_id='', record_type='')[source]¶

Evaluate a gene set against controls using pure set intersection.

No WDK calls — the gene set already has its results. Returns the same dict shape that metrics_from_control_result() and extract_and_enrich_genes() consume.

Return type:: JSONObject

veupath_chatbot.services.experiment.metrics.metrics_from_control_result(result)[source]¶

Build metrics from the dict returned by run_positive_negative_controls().

Parameters:: result (JSONObject) – Raw control-test result dict.
Returns:: Full metrics.
Return type:: ExperimentMetrics

Rank-based evaluation metrics (Precision@K, Recall@K, Enrichment@K).

These metrics treat gene lists as ranked outputs rather than binary classifiers, which better matches how researchers use strategy results (“how many known positives are in my top K?”).

veupath_chatbot.services.experiment.rank_metrics.compute_rank_metrics(result_ids, positive_ids, negative_ids, k_values=None)[source]¶

Compute rank-based metrics from an ordered result list.

All computation is pure Python — no API calls.

Parameters:

result_ids (list[str]) – Ordered gene IDs from the strategy result.
positive_ids (set[str]) – Known positive control gene IDs.
negative_ids (set[str]) – Known negative control gene IDs (unused for rank metrics but kept for interface consistency).
k_values (list[int] | None) – List sizes at which to compute P@K / R@K / E@K.

Returns:

Rank metrics object.

Return type:

RankMetrics
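The sketch below computes the three metrics at a single K. Precision@K and Recall@K follow their standard definitions; the Enrichment@K formula (precision in the top K divided by the positive rate over the whole list) is an assumed definition, and the function name and return shape are hypothetical.

```python
def rank_metrics_at_k(
    result_ids: list[str], positive_ids: set[str], k: int
) -> dict[str, float]:
    """Compute Precision@K, Recall@K, and an assumed Enrichment@K."""
    top_k = result_ids[:k]
    hits = sum(1 for gene_id in top_k if gene_id in positive_ids)
    precision = hits / len(top_k) if top_k else 0.0
    recall = hits / len(positive_ids) if positive_ids else 0.0
    # Assumed baseline: fraction of positives across the full ranked list.
    total_hits = sum(1 for gene_id in result_ids if gene_id in positive_ids)
    baseline = total_hits / len(result_ids) if result_ids else 0.0
    enrichment = precision / baseline if baseline else 0.0
    return {"precision": precision, "recall": recall, "enrichment": enrichment}
```

Under this definition an enrichment above 1 means positives are concentrated near the top of the ranking.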

async veupath_chatbot.services.experiment.rank_metrics.fetch_ordered_result_ids(site_id, step_id, max_results=5000, sort_attribute=None, sort_direction='ASC')[source]¶

Fetch ordered gene IDs from a persisted WDK strategy step.

When sort_attribute is provided the results are sorted by reportConfig.sorting via get_step_records(); otherwise the default WDK ordering is used (via get_step_answer()).

Parameters:

site_id (str) – VEuPathDB site ID.
step_id (int) – WDK step ID.
max_results (int) – Maximum number of IDs to retrieve.
sort_attribute (str | None) – WDK attribute name to sort by.
sort_direction (str) – "ASC" or "DESC".

Returns:

Ordered list of primary key values.

Return type:

list[str]

Shared statistical utilities for experiment analysis.

veupath_chatbot.services.experiment.stats.hypergeometric_log_sf(x, n, k, m)[source]¶

Approximate log survival function for hypergeometric distribution.

Uses a normal approximation of P(X >= x) for speed. Returns 0.0 (i.e. p=1.0) when the observed count is at or below the mean.

Parameters:

x:: Number of observed successes.
n:: Population size (background).
k:: Number of success states in the population (result set size).
m:: Number of draws (gene set size).

Return type:: float
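A normal approximation of this kind can be sketched as follows, using the hypergeometric mean and variance with the parameter roles listed above. This is an assumption-laden illustration, not the library's code: the function name is hypothetical, and the use of the natural logarithm, the absence of a continuity correction, and the underflow floor are all choices made here.

```python
import math
import sys


def normal_approx_log_sf(x: int, n: int, k: int, m: int) -> float:
    """Approximate log P(X >= x) for a hypergeometric X via a normal tail."""
    if n <= 1 or k <= 0 or m <= 0:
        return 0.0
    p = k / n
    mean = m * p
    if x <= mean:
        return 0.0  # log(1.0): no evidence of enrichment
    # Hypergeometric variance with the finite-population correction.
    variance = m * p * (1 - p) * (n - m) / (n - 1)
    if variance <= 0:
        return 0.0
    z = (x - mean) / math.sqrt(variance)
    tail = 0.5 * math.erfc(z / math.sqrt(2))
    # Clamp to the smallest normal float so the log stays finite.
    return math.log(max(tail, sys.float_info.min))
```

The approximation trades accuracy in the extreme tail for speed, which is usually acceptable when the value is only used to rank terms.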

Analysis Features

Cross-validation, enrichment, overlap, comparison, robustness, and reporting.

K-fold cross-validation for overfitting detection.

Splits positive and negative control gene lists into k folds, evaluates each held-out fold, and aggregates metrics to detect overfitting.

veupath_chatbot.services.experiment.cross_validation.ProgressCallback¶

Async callback(fold_index, total_folds) for progress reporting.

alias of Callable[[int, int], Coroutine[Any, Any, None]]

veupath_chatbot.services.experiment.cross_validation.FoldEvaluator¶

Async callback(holdout_pos, holdout_neg) → control-test result dict.

alias of Callable[[list[str] | None, list[str] | None], Coroutine[Any, Any, JSONObject]]

async veupath_chatbot.services.experiment.cross_validation.run_cross_validation(*, site_id, record_type, controls_search_name, controls_param_name, positive_controls, negative_controls, controls_value_format='newline', search_name=None, parameters=None, tree=None, k=5, full_metrics=None, progress_callback=None)[source]¶

Run k-fold cross-validation on control gene lists.

When tree is provided, evaluates each fold against the full strategy tree. Otherwise, evaluates using the single-step search_name + parameters.

Return type:: CrossValidationResult
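The fold-splitting step can be sketched independently of the WDK evaluation. In this illustration `k_fold_splits` is a hypothetical helper, and the seeded shuffle and round-robin assignment are assumptions about how folds might be formed.

```python
import random


def k_fold_splits(
    ids: list[str], k: int = 5, seed: int = 0
) -> list[tuple[list[str], list[str]]]:
    """Split ids into k (training, holdout) pairs with disjoint holdouts."""
    if k < 2 or k > len(ids):
        raise ValueError("k must be between 2 and the number of IDs")
    shuffled = list(ids)
    random.Random(seed).shuffle(shuffled)  # seeded for reproducible folds
    folds = [shuffled[i::k] for i in range(k)]  # round-robin assignment
    splits = []
    for i, holdout in enumerate(folds):
        training = [g for j, fold in enumerate(folds) if j != i for g in fold]
        splits.append((training, holdout))
    return splits
```

Each holdout would then be passed to a fold evaluator, and a large gap between full-set and held-out metrics would suggest overfitting to the controls.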

Enrichment analysis via WDK step analysis API.

Wraps VEuPathDB’s native GO, pathway, and word enrichment analyses that are available through the step analysis endpoint.

Plugin names (from stepAnalysisPlugins.xml):

go-enrichment → GoEnrichmentPlugin
pathway-enrichment → PathwaysEnrichmentPlugin
word-enrichment → WordEnrichmentPlugin

GO enrichment parameters (from GoEnrichmentPlugin.java):

goAssociationsOntologies: GO ontology branch to analyse, e.g. “Molecular Function”
goEvidenceCodes: evidence code filter
goSubset: GO slim subset
pValueCutoff: p-value threshold
organism: organism filter

Parameters are fetched from the WDK analysis form defaults so required fields like organism and pValueCutoff are always populated.

veupath_chatbot.services.experiment.enrichment.infer_enrichment_type(wdk_analysis_name, params, result)[source]¶

Infer the EnrichmentAnalysisType from a WDK analysis name.

For GO enrichment, uses the goAssociationsOntologies parameter or the goOntologies field in the result to determine which GO branch.

Return type:: Literal['go_function', 'go_component', 'go_process', 'pathway', 'word']

veupath_chatbot.services.experiment.enrichment.is_enrichment_analysis(wdk_analysis_name)[source]¶

Return True if the WDK analysis name is an enrichment plugin.

Return type:: bool

veupath_chatbot.services.experiment.enrichment.upsert_enrichment_result(results, new)[source]¶

Replace an existing result of the same analysis_type, or append.

Mutates results in-place so callers don’t accumulate duplicate tabs when the same enrichment analysis is re-run.

veupath_chatbot.services.experiment.enrichment.parse_enrichment_from_raw(wdk_analysis_name, params, result)[source]¶

Parse a raw WDK analysis result into an EnrichmentResult.

Used by the generic analyses/run endpoint to return structured enrichment data instead of raw JSON.

Return type:: EnrichmentResult

veupath_chatbot.services.experiment.enrichment.encode_vocab_params(params, form_meta)[source]¶

Encode vocabulary param values as JSON arrays using form metadata.

WDK’s AbstractEnumParam.convertToTerms() requires all single-pick-vocabulary and multi-pick-vocabulary param values to be JSON-encoded arrays. This function ensures that encoding is applied after merging defaults with user params, so user-supplied plain strings don’t bypass the encoding.

Params whose type is not in the form metadata, or whose type is not a vocabulary type, are returned unchanged.

Return type:: JSONObject
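The encoding rule can be illustrated with a simplified metadata shape. In the sketch below, `param_types` is a hypothetical flat mapping from parameter name to type string; the real form metadata structure is not shown in this page, and leaving already-encoded arrays untouched is an assumption.

```python
import json

VOCAB_TYPES = {"single-pick-vocabulary", "multi-pick-vocabulary"}


def _is_json_array(text: str) -> bool:
    """Return True if text already decodes to a JSON array."""
    try:
        return isinstance(json.loads(text), list)
    except ValueError:  # json.JSONDecodeError is a ValueError subclass
        return False


def encode_vocab_values(params: dict, param_types: dict[str, str]) -> dict:
    """JSON-encode vocabulary param values as arrays; pass others through."""
    encoded = {}
    for name, value in params.items():
        if param_types.get(name) not in VOCAB_TYPES:
            encoded[name] = value              # unknown or non-vocabulary type
        elif isinstance(value, list):
            encoded[name] = json.dumps(value)  # list becomes a JSON array string
        elif isinstance(value, str) and _is_json_array(value):
            encoded[name] = value              # already encoded, avoid double-encoding
        else:
            encoded[name] = json.dumps([value])  # plain value wrapped in an array
    return encoded
```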

async veupath_chatbot.services.experiment.enrichment.run_enrichment_analysis(*, site_id, record_type, search_name, parameters, analysis_type)[source]¶

Run a single enrichment analysis on a search result set.

Creates a temporary WDK strategy, runs the analysis, parses results, and cleans up.

Return type:: EnrichmentResult

async veupath_chatbot.services.experiment.enrichment.run_enrichment_on_step(*, site_id, step_id, analysis_type)[source]¶

Run enrichment on an already-persisted WDK step.

Used for multi-step experiments where the strategy already exists.

Return type:: EnrichmentResult

Custom gene set enrichment analysis against experiment results.

class veupath_chatbot.services.experiment.custom_enrichment.CustomEnrichmentResult[source]¶

Bases: TypedDict

Return shape of run_custom_enrichment().

geneSetName: str¶

geneSetSize: int¶

overlapCount: int¶

overlapGenes: list[str]¶

backgroundSize: int¶

tpCount: int¶

foldEnrichment: float¶

pValue: float¶

oddsRatio: float¶

veupath_chatbot.services.experiment.custom_enrichment.run_custom_enrichment(exp, gene_ids, gene_set_name)[source]¶

Test enrichment of a custom gene set against the experiment results.

Computes overlap, fold enrichment, p-value (hypergeometric), and odds ratio.

Return type:: CustomEnrichmentResult

Cross-experiment enrichment comparison.

class veupath_chatbot.services.experiment.enrichment_compare.EnrichmentRow[source]¶

Bases: TypedDict

Shape of one term row in the enrichment comparison.

termKey: str¶

termName: str¶

analysisType: str¶

scores: dict[str, JSONValue]¶

maxScore: float¶

experimentCount: int¶

class veupath_chatbot.services.experiment.enrichment_compare.EnrichmentCompareResult[source]¶

Bases: TypedDict

Return shape of compare_enrichment_across().

experimentIds: list[str]¶

experimentLabels: dict[str, str]¶

rows: list[EnrichmentRow]¶

totalTerms: int¶

veupath_chatbot.services.experiment.enrichment_compare.compare_enrichment_across(experiments, experiment_ids, analysis_type=None)[source]¶

Compare enrichment results across experiments.

Builds a term-by-experiment matrix of fold-enrichment scores. Optionally filters to a single analysis type.

Return type:: EnrichmentCompareResult

Gene set overlap analysis across experiments.

class veupath_chatbot.services.experiment.overlap.PairwiseOverlap[source]¶

Bases: TypedDict

Shape of one pairwise comparison entry.

experimentA: str¶

experimentB: str¶

labelA: str¶

labelB: str¶

sizeA: int¶

sizeB: int¶

intersection: int¶

union: int¶

jaccard: float¶

sharedGenes: list[str]¶

uniqueA: list[str]¶

uniqueB: list[str]¶

class veupath_chatbot.services.experiment.overlap.PerExperimentSummary[source]¶

Bases: TypedDict

Shape of one per-experiment summary entry.

experimentId: str¶

label: str¶

totalGenes: int¶

uniqueGenes: int¶

sharedGenes: int¶

class veupath_chatbot.services.experiment.overlap.GeneMembership[source]¶

Bases: TypedDict

Shape of one gene membership entry.

geneId: str¶

foundIn: int¶

totalExperiments: int¶

experiments: list[str]¶

class veupath_chatbot.services.experiment.overlap.OverlapResult[source]¶

Bases: TypedDict

Return shape of compute_gene_set_overlap().

experimentIds: list[str]¶

experimentLabels: dict[str, str]¶

pairwise: list[PairwiseOverlap]¶

perExperiment: list[PerExperimentSummary]¶

universalGenes: list[str]¶

totalUniqueGenes: int¶

geneMembership: list[GeneMembership]¶

veupath_chatbot.services.experiment.overlap.compute_gene_set_overlap(experiments, experiment_ids)[source]¶

Compute pairwise gene set overlap between experiments.

For each experiment the result gene set is the union of TP and FP genes. Returns Jaccard similarity, shared/unique genes, and membership counts.

Return type:: OverlapResult
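The pairwise part of this computation reduces to set algebra. The sketch below uses field names from PairwiseOverlap but omits labels and sizes; `pairwise_jaccard` and its input mapping of experiment ID to gene set are hypothetical, and returning 0.0 for two empty sets is an assumed convention.

```python
from itertools import combinations


def pairwise_jaccard(gene_sets: dict[str, set[str]]) -> list[dict]:
    """Compute Jaccard similarity and shared/unique genes for every pair."""
    rows = []
    for (id_a, genes_a), (id_b, genes_b) in combinations(gene_sets.items(), 2):
        shared = genes_a & genes_b
        union = genes_a | genes_b
        rows.append(
            {
                "experimentA": id_a,
                "experimentB": id_b,
                "intersection": len(shared),
                "union": len(union),
                "jaccard": len(shared) / len(union) if union else 0.0,
                "sharedGenes": sorted(shared),
                "uniqueA": sorted(genes_a - genes_b),
                "uniqueB": sorted(genes_b - genes_a),
            }
        )
    return rows
```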

Bootstrap robustness and uncertainty estimation.

Resamples control sets with replacement and recomputes classification metrics (and, optionally, rank metrics) to derive confidence intervals and stability scores. The computation is pure Python and requires no additional WDK API calls.

veupath_chatbot.services.experiment.robustness.compute_robustness(result_ids, positive_ids, negative_ids, *, n_bootstrap=200, k_values=None, seed=42, alternative_negatives=None, include_rank_metrics=True)[source]¶

Compute bootstrap confidence intervals for classification (and optionally rank) metrics.

Parameters:

result_ids (list[str]) – Ordered gene IDs from the strategy result.
positive_ids (list[str]) – Positive control gene IDs.
negative_ids (list[str]) – Negative control gene IDs.
n_bootstrap (int) – Number of bootstrap iterations.
k_values (list[int] | None) – K values for Precision/Recall/Enrichment@K.
seed (int) – Random seed for reproducibility.
alternative_negatives (dict[str, list[str]] | None) – Optional map of label -> negative IDs for negative-set sensitivity analysis.
include_rank_metrics (bool) – When False, skip rank metric CIs and top-K stability — only classification CIs are computed.

Returns:

Bootstrap robustness result.

Return type:

BootstrapResult
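To make the resampling idea concrete, the sketch below bootstraps a confidence interval for recall only. It is a simplified illustration: `bootstrap_recall_ci` is a hypothetical helper, the percentile method and the `alpha` parameter are assumptions, and the real function covers more metrics than this.

```python
import random


def bootstrap_recall_ci(
    result_ids: list[str],
    positive_ids: list[str],
    *,
    n_bootstrap: int = 200,
    seed: int = 42,
    alpha: float = 0.05,
) -> tuple[float, float]:
    """Percentile bootstrap interval for recall over resampled positives."""
    if not positive_ids:
        raise ValueError("positive_ids must not be empty")
    rng = random.Random(seed)  # seeded so repeated runs agree
    results = set(result_ids)
    recalls = []
    for _ in range(n_bootstrap):
        # Resample the positive controls with replacement, same size.
        sample = rng.choices(positive_ids, k=len(positive_ids))
        recalls.append(sum(g in results for g in sample) / len(sample))
    recalls.sort()
    lower = recalls[int((alpha / 2) * (n_bootstrap - 1))]
    upper = recalls[int((1 - alpha / 2) * (n_bootstrap - 1))]
    return lower, upper
```

A wide interval signals that the metric depends heavily on which control genes happen to be in the set.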

Self-contained HTML report generation for experiments.

Generates a single-file HTML document with embedded styles, tables, and inline SVG charts. No external dependencies required.

veupath_chatbot.services.experiment.report.generate_experiment_report(experiment)[source]¶

Generate a self-contained HTML report for an experiment.

Parameters:: experiment (Experiment) – Full experiment object with results.
Returns:: Complete HTML string.
Return type:: str

Multi-step tree-knob optimization.

Tunes threshold parameters and boolean operators across a strategy tree using Optuna, optimizing for rank-based objectives (Precision@K, Enrichment@K) with optional list-size constraints.

async veupath_chatbot.services.experiment.tree_knobs.optimize_tree_knobs(*, site_id, record_type, base_tree, threshold_knobs, operator_knobs, positive_controls, negative_controls, controls_search_name, controls_param_name, controls_value_format, objective='precision_at_50', budget=50, max_list_size=None)[source]¶

Run Optuna optimization over tree knobs.

Parameters:

base_tree (JSONObject) – PlanStepNode-shaped dict (the template tree).
threshold_knobs (list[ThresholdKnob]) – Numeric parameter knobs on leaf steps.
operator_knobs (list[OperatorKnob]) – Boolean operator knobs on combine nodes.
objective (str) – Target metric name (e.g. precision_at_50).
budget (int) – Maximum number of Optuna trials.
max_list_size (int | None) – Optional upper bound on result list size.

Returns:

Optimization result with best trial and history.

Return type:

TreeOptimizationResult

AI Analysis

AI-powered analysis helpers and tool definitions.

Helper functions for experiment analysis AI tools.

Utility functions for extracting WDK record data, classifying genes, searching records, and fetching result IDs.

veupath_chatbot.services.experiment.ai_analysis_helpers.classify_gene(gene_id, tp_ids, fp_ids, fn_ids, tn_ids)[source]¶

Return the classification label for a gene ID.

Parameters:

gene_id (str | None) – Gene identifier to classify.
tp_ids (set[str]) – True positive gene IDs.
fp_ids (set[str]) – False positive gene IDs.
fn_ids (set[str]) – False negative gene IDs.
tn_ids (set[str]) – True negative gene IDs.

Returns:

One of "TP", "FP", "FN", "TN", or None.

Return type:

str | None

veupath_chatbot.services.experiment.ai_analysis_helpers.record_matches(attrs, query_lower, attribute)[source]¶

Check if a record’s attributes match a text query.

Parameters:

attrs (JSONObject) – Record attribute dict.
query_lower (str) – Lowercased search query.
attribute (str | None) – Specific attribute to search in, or None for all.

Returns:

True if any matching attribute value is found.

Return type:

bool

async veupath_chatbot.services.experiment.ai_analysis_helpers.build_primary_key(api, site_id, record_type, gene_id)[source]¶

Build a complete WDK primary key for a gene ID.

WDK requires all primary key columns (e.g. source_id + project_id for gene records). This helper fetches the record type info and fills missing columns from site configuration.

Parameters:

api (StrategyAPI) – Strategy API instance.
site_id (str) – VEuPathDB site identifier.
record_type (str) – WDK record type.
gene_id (str) – Gene identifier (the source_id value).

Returns:

List of {name, value} dicts forming the complete PK.

Return type:

list[JSONObject]

async veupath_chatbot.services.experiment.ai_analysis_helpers.fetch_group_records(api, record_type, gene_ids, limit=20, site_id=None)[source]¶

Fetch records for a list of gene IDs.

Parameters:

api (StrategyAPI) – Strategy API instance.
record_type (str) – WDK record type.
gene_ids (list[str]) – Gene IDs to fetch.
limit (int) – Max number of genes to fetch.
site_id (str | None) – Site ID for PK completion (fills project_id etc.).

Returns:

List of dicts with geneId and attributes.

Return type:

list[JSONObject]

async veupath_chatbot.services.experiment.ai_analysis_helpers.collect_all_result_ids(api, step_id)[source]¶

Fetch all result gene IDs from a WDK step by paginating.

Parameters:

api (StrategyAPI) – Strategy API instance.
step_id (int) – WDK step ID.

Returns:

Set of all gene IDs in the step’s results.

Return type:

set[str]

AI tools for deep experiment result analysis.

Provides function-calling tools that let the AI assistant access experiment data: paginate through records, look up individual genes, get attribute distributions, compare gene groups, and search results.

The agent class is built dynamically via build_analysis_agent_class() so that the services layer never needs a static import from veupath_chatbot.ai. The configured experiment-agent base class is injected at startup.

veupath_chatbot.services.experiment.ai_analysis_tools.configure(*, experiment_agent_cls)[source]¶

Wire the experiment agent base class.

Called once at application startup from the composition root.

class veupath_chatbot.services.experiment.ai_analysis_tools.ExperimentAnalysisAgent(engine, site_id, experiment_id, system_prompt, chat_history=None)[source]¶

Bases: RefinementToolsMixin, _AnalysisToolsMixin, Kani

AI agent with data-access and strategy-refinement tools.

Combines analysis tools (data browsing, gene lookup, distributions) with refinement tools (add steps, filter, re-evaluate) and the experiment assistant’s catalog/research tools (inherited via the injected base class).

The base class is set dynamically at startup; if not configured, instantiation falls back to plain Kani.
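
The injection pattern described here can be illustrated with a few lines of plain Python. This is a sketch under stated assumptions: PlainBase stands in for Kani, AnalysisToolsMixin for the tools mixin, and build_agent_class for the dynamic class builder; none of these stand-ins reproduce the real classes.

```python
from __future__ import annotations

_experiment_agent_cls: type | None = None  # set once at startup


class PlainBase:
    """Stand-in for the plain fallback base (Kani in the real code)."""


class AnalysisToolsMixin:
    """Stand-in for the mixin that carries the analysis tools."""


def configure(*, experiment_agent_cls: type) -> None:
    """Record the base class chosen by the composition root."""
    global _experiment_agent_cls
    _experiment_agent_cls = experiment_agent_cls


def build_agent_class() -> type:
    """Create the agent class on top of the injected base, or the fallback."""
    base = _experiment_agent_cls if _experiment_agent_cls is not None else PlainBase
    return type("AnalysisAgentSketch", (AnalysisToolsMixin, base), {})
```

Because the base is resolved at build time, the module that defines the tools never needs an import from the AI layer.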

__init__(engine, site_id, experiment_id, system_prompt, chat_history=None)[source]¶

Experiment wizard AI assistant — prompt construction and orchestration.

Builds step-specific system prompts, creates a lightweight experiment assistant agent, and streams its response.

AI-layer dependencies (engine factory, agent classes) are injected at startup via configure() so that the services layer never imports from veupath_chatbot.ai.

veupath_chatbot.services.experiment.assistant.configure(*, create_engine_fn, experiment_agent_cls)[source]¶

Wire AI-layer implementations into the experiment assistant.

Called once at application startup from the composition root.

veupath_chatbot.services.experiment.assistant.build_system_prompt(step, site_id, context)[source]¶

Build the step-specific system prompt with injected context.

Parameters:

step (Literal['search', 'parameters', 'controls', 'run', 'results', 'analysis']) – Current wizard step.
site_id (str) – VEuPathDB site identifier.
context (JSONObject) – Wizard state (search, params, controls, etc.).

Returns:

Formatted system prompt string.

Return type:

str

async veupath_chatbot.services.experiment.assistant.run_assistant(site_id, step, message, context, history=None, model_override=None, provider_override=None, reasoning_effort=None)[source]¶

Create an experiment assistant and stream its response.

Parameters:

site_id (str) – VEuPathDB site identifier.
step (Literal['search', 'parameters', 'controls', 'run', 'results', 'analysis']) – Current wizard step.
message (str) – User message.
context (JSONObject) – Wizard state context.
history (list[JSONObject] | None) – Previous conversation messages.
model_override (str | None) – Model catalog ID override (default: openai/gpt-4.1-nano).
provider_override (Literal['openai', 'anthropic', 'google', 'ollama', 'mock'] | None) – Provider override.
reasoning_effort (Literal['none', 'low', 'medium', 'high'] | None) – Reasoning effort override.

Returns:

Async iterator of SSE-compatible event dicts.

Return type:

AsyncIterator[JSONObject]

Step Analysis¶

Multi-step strategy analysis: per-step evaluation, operator comparison, contribution analysis, and parameter sensitivity.

Step decomposition analysis for multi-step strategies.

Replaces the Optuna-based tree optimization with four interpretable analysis phases that give researchers actionable, per-step insights:

Per-step evaluation – evaluate each leaf independently.
Operator comparison – try all operators at each combine node.
Step contribution (ablation) – measure the impact of removing each leaf.
Parameter sensitivity – sweep numeric params across their range.

async veupath_chatbot.services.experiment.step_analysis.run_controls_against_tree(*, site_id, record_type, tree, controls_search_name, controls_param_name, controls_value_format, positive_controls=None, negative_controls=None)[source]¶

Materialise a PlanStepNode tree, intersect with controls, return metrics.

Creates a temporary WDK strategy containing the full tree, adds an intersection step with each control set on top of the root, queries the result counts, then deletes everything.

Returns the same shape as run_positive_negative_controls() so metrics_from_control_result() can consume it directly.

Return type:

JSONObject

async veupath_chatbot.services.experiment.step_analysis.run_step_analysis(*, site_id, record_type, tree, controls_search_name, controls_param_name, controls_value_format, positive_controls, negative_controls, baseline_result, phases=None, progress_callback=None)[source]¶

Run all requested step analysis phases.

Parameters:

tree (JSONObject) – PlanStepNode-shaped dict.
baseline_result (JSONObject) – Raw result from the initial tree evaluation.
phases (list[str] | None) – Which phases to run. Defaults to all four.

Returns:

Aggregated StepAnalysisResult.

Return type:

StepAnalysisResult

Main entry point: run_step_analysis coordinates all four analysis phases.

async veupath_chatbot.services.experiment.step_analysis.orchestrator.run_step_analysis(*, site_id, record_type, tree, controls_search_name, controls_param_name, controls_value_format, positive_controls, negative_controls, baseline_result, phases=None, progress_callback=None)[source]¶

Run all requested step analysis phases.

Parameters:

tree (JSONObject) – PlanStepNode-shaped dict.
baseline_result (JSONObject) – Raw result from the initial tree evaluation.
phases (list[str] | None) – Which phases to run. Defaults to all four.

Returns:

Aggregated StepAnalysisResult.

Return type:

StepAnalysisResult

Phase 1: Per-step evaluation – evaluate each leaf independently.

async veupath_chatbot.services.experiment.step_analysis.phase_step_eval.evaluate_steps(*, site_id, record_type, tree, controls_search_name, controls_param_name, controls_value_format, positive_controls, negative_controls, progress_callback=None)[source]¶

Evaluate each leaf step against controls, preserving ancestor transforms.

For each leaf, the evaluation includes any transform chain above it (e.g. GenesByOrthologs) so that cross-organism searches are converted before being compared against controls.

Parameters:

tree (JSONObject) – PlanStepNode-shaped dict.

Returns:

One StepEvaluation per leaf.

Return type:

list[StepEvaluation]
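
The idea of keeping the transform chain above each leaf can be sketched as a recursive walk. The node keys used below (searchName, primaryInput, secondaryInput) and the rule that a combine node resets the chain are assumptions for illustration; the actual PlanStepNode shape and traversal may differ.

```python
from __future__ import annotations

from typing import Any

Node = dict[str, Any]


def leaves_with_transforms(node: Node, chain: tuple[str, ...] = ()) -> list[tuple[str, list[str]]]:
    """Return (leaf search name, transforms above it, outermost first)."""
    primary = node.get("primaryInput")
    secondary = node.get("secondaryInput")
    if primary is None and secondary is None:
        # Leaf: report it together with the transforms collected so far.
        return [(node["searchName"], list(chain))]
    if secondary is None:
        # Transform node: one input, so extend the chain and keep descending.
        return leaves_with_transforms(primary, chain + (node["searchName"],))
    # Combine node: each branch starts a fresh chain (assumed behaviour).
    return leaves_with_transforms(primary) + leaves_with_transforms(secondary)


example_tree: Node = {
    "searchName": "combine",
    "primaryInput": {
        "searchName": "GenesByOrthologs",
        "primaryInput": {"searchName": "LeafA"},
    },
    "secondaryInput": {"searchName": "LeafB"},
}
```

For example_tree, LeafA is paired with the GenesByOrthologs transform while LeafB has an empty chain.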

Phase 2: Operator comparison – try all operators at each combine node.

async veupath_chatbot.services.experiment.step_analysis.phase_operators.compare_operators(*, site_id, record_type, tree, controls_search_name, controls_param_name, controls_value_format, positive_controls, negative_controls, progress_callback=None)[source]¶

For each combine node, evaluate INTERSECT, UNION, MINUS and recommend.

Parameters:

tree (JSONObject) – PlanStepNode-shaped dict.

Returns:

One OperatorComparison per combine node.

Return type:

list[OperatorComparison]
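
An in-memory analogue of the operator comparison is shown below. The real phase evaluates each operator through WDK; this sketch replaces that with Python set operations and ranks operators by F1, which is an assumed scoring choice. All names in the block are illustrative.

```python
from __future__ import annotations

from typing import Callable

SetOp = Callable[[set[str], set[str]], set[str]]

OPERATORS: dict[str, SetOp] = {
    "INTERSECT": lambda a, b: a & b,
    "UNION": lambda a, b: a | b,
    "MINUS": lambda a, b: a - b,
}


def f1_score(result: set[str], positives: set[str], negatives: set[str]) -> float:
    """F1 from control hits: 2*TP / (2*TP + FP + FN)."""
    tp = len(result & positives)
    fp = len(result & negatives)
    fn = len(positives - result)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def compare_operators_sketch(
    left: set[str],
    right: set[str],
    positives: set[str],
    negatives: set[str],
) -> tuple[str, dict[str, float]]:
    """Score every operator at one combine node and name the best one."""
    scores = {
        name: f1_score(op(left, right), positives, negatives)
        for name, op in OPERATORS.items()
    }
    best = max(scores, key=lambda name: scores[name])
    return best, scores
```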

Phase 3: Step contribution (ablation) – measure impact of removing each leaf.

async veupath_chatbot.services.experiment.step_analysis.phase_contribution.analyze_contributions(*, site_id, record_type, tree, controls_search_name, controls_param_name, controls_value_format, positive_controls, negative_controls, baseline_metrics, progress_callback=None)[source]¶

Ablation analysis: remove each leaf and measure the impact.

Parameters:

baseline_metrics (JSONObject) – Metrics from the full tree evaluation.

Returns:

One StepContribution per leaf.

Return type:

list[StepContribution]
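
Ablation can be illustrated with a toy strategy. The sketch below assumes the tree is a plain union of its leaves and uses recall on the positive controls as the metric; both are simplifications chosen for the example, and ablation_sketch is a hypothetical name.

```python
from __future__ import annotations


def recall(result: set[str], positives: set[str]) -> float:
    """Fraction of positive controls found in the result."""
    return len(result & positives) / len(positives) if positives else 0.0


def ablation_sketch(leaves: dict[str, set[str]], positives: set[str]) -> dict[str, float]:
    """Return the recall change caused by removing each leaf in turn."""
    full: set[str] = set().union(*leaves.values())
    baseline = recall(full, positives)
    deltas: dict[str, float] = {}
    for name in leaves:
        # Rebuild the result without this leaf.
        remaining: set[str] = set().union(
            *(ids for other, ids in leaves.items() if other != name)
        )
        deltas[name] = recall(remaining, positives) - baseline
    return deltas
```

A negative delta means the leaf contributes positive controls that no other leaf supplies; a delta of zero suggests the leaf is redundant for this metric.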

Phase 4: Parameter sensitivity – sweep numeric params across their range.

async veupath_chatbot.services.experiment.step_analysis.phase_sensitivity.sweep_parameters(*, site_id, record_type, tree, controls_search_name, controls_param_name, controls_value_format, positive_controls, negative_controls, progress_callback=None)[source]¶

Sweep numeric params on each leaf across their WDK-declared range.

Respects paired min/max bound parameters, deduplicates identical searches across leaves, and only recommends changes when the improvement is meaningful.

Parameters:

tree (JSONObject) – PlanStepNode-shaped dict.

Returns:

One ParameterSensitivity per numeric param.

Return type:

list[ParameterSensitivity]
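
A single-parameter sweep with a "meaningful improvement" guard might look like the following. The evenly spaced grid, the evaluate callback, and the min_gain threshold of 0.05 are all assumptions for illustration; the source does not state how the range is sampled or what counts as meaningful.

```python
from __future__ import annotations

from typing import Callable


def sweep_sketch(
    evaluate: Callable[[float], float],
    lo: float,
    hi: float,
    steps: int,
    current: float,
    min_gain: float = 0.05,
) -> tuple[dict[float, float], float | None]:
    """Score evenly spaced values in [lo, hi]; recommend one only if it helps enough."""
    if steps < 2:
        raise ValueError("steps must be at least 2")
    values = [lo + (hi - lo) * i / (steps - 1) for i in range(steps)]
    scores = {v: evaluate(v) for v in values}
    best = max(scores, key=lambda v: scores[v])
    baseline = evaluate(current)
    # Only recommend a change when the gain clears the threshold.
    recommended = best if scores[best] - baseline >= min_gain else None
    return scores, recommended
```

Returning None when the gain is below the threshold keeps the sweep from suggesting changes that are within noise.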

Control evaluation logic: run trees/steps against control sets and extract metrics.

async veupath_chatbot.services.experiment.step_analysis._evaluation.run_controls_against_tree(*, site_id, record_type, tree, controls_search_name, controls_param_name, controls_value_format, positive_controls=None, negative_controls=None)[source]¶

Materialise a PlanStepNode tree, intersect with controls, return metrics.

Creates a temporary WDK strategy containing the full tree, adds an intersection step with each control set on top of the root, queries the result counts, then deletes everything.

Returns the same shape as run_positive_negative_controls() so metrics_from_control_result() can consume it directly.

Return type:

JSONObject

Tree traversal and manipulation helpers for step analysis.

Types¶

Pydantic models for experiment configuration, metrics, enrichment, and results.

Shared data types for the Experiment Lab.

This package consolidates all experiment-related dataclasses, type aliases, and serialization helpers. All public symbols are re-exported here.

class veupath_chatbot.services.experiment.types.ConfusionMatrix(true_positives, false_positives, true_negatives, false_negatives)[source]¶
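
The four counts are enough to derive the precision, recall, and F1 metrics that the evaluation engine reports. The stand-in below mirrors the documented constructor arguments, but the class name ConfusionMatrixSketch and its derived properties are illustrative assumptions, not the documented class.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class ConfusionMatrixSketch:
    """Stand-in holding the same four counts as the documented class."""

    true_positives: int
    false_positives: int
    true_negatives: int
    false_negatives: int

    @property
    def precision(self) -> float:
        denom = self.true_positives + self.false_positives
        return self.true_positives / denom if denom else 0.0

    @property
    def recall(self) -> float:
        denom = self.true_positives + self.false_negatives
        return self.true_positives / denom if denom else 0.0

    @property
    def f1(self) -> float:
        # Equivalent to the harmonic mean of precision and recall.
        denom = 2 * self.true_positives + self.false_positives + self.false_negatives
        return 2 * self.true_positives / denom if denom else 0.0
```

Each property returns 0.0 when its denominator is zero, so an empty result set does not raise.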
