Gene Sets

Gene set management — persistent named collections of gene IDs with source tracking, confidence scoring, ensemble analysis, and reverse search ranking.

Overview

Gene sets are the bridge between strategy results and downstream analysis. When a strategy step returns gene IDs, those IDs can be captured as a named gene set for further use in enrichment analysis, cross-validation, and workbench exploration.

flowchart LR
    A["Strategy Results"] -->|capture| B["Gene Set"]
    C["User Paste/Upload"] -->|create| B
    D["Set Operations"] -->|derive| B
    B --> E["Enrichment"]
    B --> F["Confidence Scoring"]
    B --> G["Ensemble Analysis"]
    B --> H["Reverse Search"]
    B --> I["Export CSV/TXT"]

    style B fill:#7c3aed,color:#fff

Key capabilities:

  • CRUD — Create gene sets from strategy results, user paste/upload, or derived operations (intersection, union, difference)

  • Confidence scoring — Per-gene composite scores combining classification status, ensemble frequency, and enrichment support

  • Ensemble analysis — Score genes by frequency across multiple gene sets (consensus voting)

  • Reverse search — Rank gene set candidates by their recovery of known positive genes using pure set intersection (no WDK API calls needed)

  • Write-through persistence — In-memory store backed by PostgreSQL for fast reads with durable writes

Design Decisions

Why in-memory + DB?

Gene sets are read on nearly every workbench operation. The write-through store keeps a dict in memory for O(1) lookups while persisting mutations to PostgreSQL for durability. This avoids per-request DB round-trips for the common read path.
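The write-through idea can be illustrated with a minimal sketch. This is not the project's WriteThruStore: the class name WriteThroughCache and the persist callback (a stand-in for the PostgreSQL write) are hypothetical and exist only to show the read and write paths.

```python
from typing import Callable, Generic, Optional, TypeVar

T = TypeVar("T")


class WriteThroughCache(Generic[T]):
    """Hypothetical stand-in: a dict serves reads, a callback records writes."""

    def __init__(self, persist: Callable[[str, Optional[T]], None]) -> None:
        self._items: dict[str, T] = {}
        self._persist = persist  # assumption: stands in for the database write

    def save(self, key: str, value: T) -> None:
        self._items[key] = value   # update memory first so reads see it
        self._persist(key, value)  # then push the mutation to storage

    def get(self, key: str) -> Optional[T]:
        return self._items.get(key)  # O(1) lookup, no storage round-trip

    def delete(self, key: str) -> None:
        self._items.pop(key, None)
        self._persist(key, None)  # None marks a deletion in this sketch


# A list plays the role of the durable store so the example is self-contained.
writes: list[tuple[str, object]] = []
cache: WriteThroughCache[list[str]] = WriteThroughCache(
    lambda key, value: writes.append((key, value))
)
cache.save("gs-1", ["PF3D7_0100100"])
```

Reads never touch the callback, which is the property the design relies on: only mutations pay the persistence cost.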

Why pure set operations for reverse search?

Gene IDs from strategy results are already materialized. Ranking candidates by set intersection (recall, precision, F1) is instantaneous compared to running WDK API calls. This makes the “which gene set best recovers my positive controls?” question answerable in milliseconds.
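As a rough illustration of ranking by set intersection, the sketch below computes recall, precision, and F1 for each candidate against a set of known positives. The function name rank_by_recovery, the row layout, and the choice to sort by F1 are assumptions for this example, not the module's actual API.

```python
def rank_by_recovery(
    candidates: dict[str, set[str]], positives: set[str]
) -> list[dict]:
    """Rank candidate gene sets by how well they recover known positives."""
    rows = []
    for name, genes in candidates.items():
        hits = len(genes & positives)  # pure set intersection, no I/O
        recall = hits / len(positives) if positives else 0.0
        precision = hits / len(genes) if genes else 0.0
        f1 = 2 * precision * recall / (precision + recall) if hits else 0.0
        rows.append(
            {"name": name, "recall": recall, "precision": precision, "f1": f1}
        )
    # Assumption: best F1 first, ties broken by name for a stable order.
    return sorted(rows, key=lambda r: (-r["f1"], r["name"]))


positives = {"g1", "g2", "g3", "g4"}
candidates = {
    "broad": {"g1", "g2", "g3", "g5", "g6", "g7", "g8", "g9"},
    "narrow": {"g1", "g2"},
    "miss": {"g9"},
}
ranked = rank_by_recovery(candidates, positives)
```

Here "narrow" ranks above "broad": it recovers fewer positives but contains no other genes, so its F1 is higher. Precision in this sketch is measured only against the supplied positives, so genes outside that list count against a candidate even if they are biologically relevant.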

Source tracking

Each gene set records its source (strategy, paste, upload, derived, saved) for provenance. This lets the UI show where a gene set came from and whether it’s “live” (from a strategy) or static.

Gene Set Store

Purpose: Write-through gene set store. In-memory dict for fast reads, PostgreSQL persistence for durability. Thread-safe via asyncio.

Gene set store with write-through DB persistence.

Keeps an in-memory dict for fast synchronous access during AI tool calls, and persists every mutation to PostgreSQL so gene sets survive API restarts.

class veupath_chatbot.services.gene_sets.store.GeneSetStore

Bases: WriteThruStore[GeneSet]

Gene set repository with in-memory cache and DB write-through.

Inherits save/get/delete/aget/adelete from WriteThruStore. Adds domain-specific listing methods.

list_all(*, site_id=None)
Return type:

list[GeneSet]

list_for_user(user_id, *, site_id=None)
Return type:

list[GeneSet]

async alist_all(*, site_id=None)

List gene sets: merges DB rows with in-memory (fresher) state.

Return type:

list[GeneSet]
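One plausible reading of "merges DB rows with in-memory (fresher) state" is that cached entries override database rows with the same ID. The sketch below shows that merge on plain dicts; merge_listing and the row shape are hypothetical, and the real method may resolve conflicts differently.

```python
def merge_listing(db_rows: list[dict], in_memory: dict[str, dict]) -> list[dict]:
    """Combine database rows with cached entries, preferring the cache."""
    merged = {row["id"]: row for row in db_rows}  # start from durable state
    merged.update(in_memory)  # assumption: cached copies are fresher and win
    return list(merged.values())


db_rows = [
    {"id": "gs-1", "name": "old name"},
    {"id": "gs-2", "name": "kinases"},
]
in_memory = {
    "gs-1": {"id": "gs-1", "name": "new name"},  # renamed, not yet flushed
    "gs-3": {"id": "gs-3", "name": "unsaved"},   # created, not yet flushed
}
listing = merge_listing(db_rows, in_memory)
```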

async alist_for_user(user_id, *, site_id=None)
Return type:

list[GeneSet]

veupath_chatbot.services.gene_sets.store.get_gene_set_store()

Get the global gene set store singleton.

Return type:

GeneSetStore

Gene Set Types

Purpose: Core data model for gene sets.

Gene Set data model.

class veupath_chatbot.services.gene_sets.types.GeneSet(id, name, site_id, gene_ids, source, user_id=None, created_at=<factory>, wdk_strategy_id=None, wdk_step_id=None, search_name=None, record_type=None, parameters=None, parent_set_ids=<factory>, operation=None, step_count=1)

Bases: object

A named collection of gene IDs for analysis.

id: str
name: str
site_id: str
gene_ids: list[str]
source: Literal['strategy', 'paste', 'upload', 'derived', 'saved']
user_id: UUID | None = None
created_at: datetime
wdk_strategy_id: int | None = None
wdk_step_id: int | None = None
search_name: str | None = None
record_type: str | None = None
parameters: dict[str, str] | None = None
parent_set_ids: list[str]
operation: str | None = None
step_count: int = 1
__init__(id, name, site_id, gene_ids, source, user_id=None, created_at=<factory>, wdk_strategy_id=None, wdk_step_id=None, search_name=None, record_type=None, parameters=None, parent_set_ids=<factory>, operation=None, step_count=1)
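To make the field list concrete, here is a stand-in dataclass that mirrors the documented attributes, followed by one pasted and one derived instance. GeneSetSketch is not the real class: the created_at factory (current UTC time), the IDs, the site name, and the "intersect" operation label are illustrative assumptions.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Literal, Optional
from uuid import UUID

GeneSetSource = Literal["strategy", "paste", "upload", "derived", "saved"]


@dataclass
class GeneSetSketch:
    """Stand-in mirroring the documented GeneSet fields (not the real class)."""

    id: str
    name: str
    site_id: str
    gene_ids: list[str]
    source: GeneSetSource
    user_id: Optional[UUID] = None
    # Assumption: the documented <factory> default is "now"; shown as UTC here.
    created_at: datetime = field(default_factory=lambda: datetime.now(timezone.utc))
    wdk_strategy_id: Optional[int] = None
    wdk_step_id: Optional[int] = None
    search_name: Optional[str] = None
    record_type: Optional[str] = None
    parameters: Optional[dict[str, str]] = None
    parent_set_ids: list[str] = field(default_factory=list)
    operation: Optional[str] = None
    step_count: int = 1


# A static set pasted by the user: no WDK linkage, no parents.
pasted = GeneSetSketch(
    id="gs-a", name="Pasted kinases", site_id="PlasmoDB",
    gene_ids=["PF3D7_0100100", "PF3D7_0100200"], source="paste",
)

# A derived set records which sets it came from and how.
derived = GeneSetSketch(
    id="gs-c", name="A and B", site_id="PlasmoDB",
    gene_ids=["PF3D7_0100100"], source="derived",
    parent_set_ids=["gs-a", "gs-b"], operation="intersect",
)
```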

Gene Set Operations

Purpose: CRUD operations on gene sets — create, update, delete, list, and derived set operations (intersection, union, difference).

Gene set business logic.

All domain operations on gene sets live here. The transport layer (HTTP router) delegates to this module for create, delete, list, set operations, enrichment, and step-results access.

veupath_chatbot.services.gene_sets.operations.count_steps_in_tree(node)

Recursively count steps in a WDK strategy step tree.

Return type:

int
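A recursive step count might look like the sketch below. It assumes each node is a dict whose children sit under the keys primaryInput and secondaryInput; those key names and the function name count_steps are assumptions for illustration and may not match what the module expects.

```python
from typing import Any, Optional


def count_steps(node: Optional[dict[str, Any]]) -> int:
    """Count a node plus every step reachable through its inputs."""
    if node is None:
        return 0
    return (
        1
        + count_steps(node.get("primaryInput"))     # assumed child key
        + count_steps(node.get("secondaryInput"))   # assumed child key
    )


# A combine step (3) with two leaf searches (1 and 2) as inputs.
tree = {
    "stepId": 3,
    "primaryInput": {"stepId": 1},
    "secondaryInput": {"stepId": 2},
}
step_total = count_steps(tree)
```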

async veupath_chatbot.services.gene_sets.operations.resolve_root_step_id(api, *, strategy_id)

Get the root step ID from a WDK strategy.

Return type:

int | None

async veupath_chatbot.services.gene_sets.operations.fetch_gene_ids_from_step(api, *, step_id)

Fetch all gene IDs from a WDK step via the standard report endpoint.

Return type:

list[str]

class veupath_chatbot.services.gene_sets.operations.GeneSetService(store)

Bases: object

Orchestrates all gene-set domain operations.

Depends on the gene-set store and (lazily) on WDK APIs. The transport layer should instantiate this once per request or hold a singleton.

__init__(store)

async flush(gene_set_id)

Ensure a gene set is persisted to the database.

The default save path is fire-and-forget. Call this when you need the row to exist in the DB immediately (e.g., before setting an FK).

async get_for_user(user_id, gene_set_id)

Retrieve a gene set, raising KeyError if not found or wrong owner.

Return type:

GeneSet

async create(*, user_id, name, site_id, gene_ids, source, wdk_strategy_id=None, wdk_step_id=None, search_name=None, record_type=None, parameters=None)

Create a gene set, auto-resolving from WDK if needed.

Return type:

GeneSet

async list_for_user(user_id, *, site_id=None)

List gene sets for a user, optionally filtered by site.

Return type:

list[GeneSet]

find_by_wdk_strategy(user_id, wdk_strategy_id)

Find an existing gene set for a WDK strategy (cache lookup).

Return type:

GeneSet | None

async delete(user_id, gene_set_id)

Delete a gene set, raising KeyError if not found or wrong owner.

async perform_set_operation(*, user_id, set_a_id, set_b_id, operation, name)

Perform a set operation (intersect, union, minus) between two gene sets.

Return type:

GeneSet
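The three operations named above reduce to ordinary set logic on the two ID lists. The sketch below is a standalone helper, not the service method: combine_gene_ids is a hypothetical name, and preserving first-seen order while dropping duplicates is a choice made for this example.

```python
def combine_gene_ids(a: list[str], b: list[str], operation: str) -> list[str]:
    """Apply intersect, union, or minus to two gene ID lists."""
    in_b = set(b)
    if operation == "intersect":
        return [g for g in dict.fromkeys(a) if g in in_b]
    if operation == "union":
        # dict.fromkeys drops duplicates while keeping first-seen order.
        return list(dict.fromkeys([*a, *b]))
    if operation == "minus":
        return [g for g in dict.fromkeys(a) if g not in in_b]
    raise ValueError(f"unknown operation: {operation!r}")


set_a = ["g1", "g2", "g3"]
set_b = ["g2", "g3", "g4"]
shared = combine_gene_ids(set_a, set_b, "intersect")
either = combine_gene_ids(set_a, set_b, "union")
only_a = combine_gene_ids(set_a, set_b, "minus")
```

Note that minus is not symmetric: swapping the arguments gives the genes unique to the second set instead.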

async run_enrichment(user_id, gene_set_id, enrichment_types)

Run enrichment analysis on a gene set.

Return type:

list[EnrichmentResult]

async get_step_results_service(user_id, gene_set_id)

Get a StepResultsService for a gene set.

Raises ValueError if the gene set has no associated WDK step.

Return type:

StepResultsService

async get_strategy_tree(user_id, gene_set_id)

Get the WDK strategy tree for a gene set.

Returns the gene set and the strategy tree dict. Raises ValueError if no WDK strategy is associated.

Return type:

tuple[GeneSet, JSONObject]

Confidence Scoring

Composite Confidence Score

\[C_g = w_1 \cdot \mathbb{1}[g \in TP] + w_2 \cdot \frac{f_g}{N} + w_3 \cdot E_g\]

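Where \(\mathbb{1}[g \in TP]\) is 1 if gene \(g\) is classified as a true positive and 0 otherwise, \(w_1, w_2, w_3\) are the component weights, \(f_g\) is the ensemble frequency (the number of gene sets that contain gene \(g\)), \(N\) is the total number of gene sets, and \(E_g\) is the enrichment support score (membership in significant GO terms or pathways).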

Purpose: Per-gene composite confidence scoring. Combines classification status (TP/FP/FN/TN), ensemble frequency (how many gene sets include this gene), and enrichment support (GO term / pathway membership) into a single score.

Per-gene composite confidence scoring.

Combines classification, ensemble frequency, and enrichment support into a single ranked score. Pure computation — no I/O.

class veupath_chatbot.services.gene_sets.confidence.GeneConfidenceScore(gene_id, composite_score, classification_score, ensemble_score, enrichment_score)

Bases: object

Confidence breakdown for a single gene.

gene_id: str
composite_score: float
classification_score: float
ensemble_score: float
enrichment_score: float
__init__(gene_id, composite_score, classification_score, ensemble_score, enrichment_score)

veupath_chatbot.services.gene_sets.confidence.compute_gene_confidence(*, tp_ids, fp_ids, fn_ids, tn_ids, ensemble_scores=None, enrichment_gene_counts=None, max_enrichment_terms=1)

Compute per-gene confidence scores, sorted descending by composite.

Return type:

list[GeneConfidenceScore]
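The composite formula can be worked through numerically. The sketch below implements only the three-term sum as written; the weights (0.5, 0.25, 0.25) are made up for the example, since the actual values are not documented here, and composite_confidence is a hypothetical helper rather than the module's function.

```python
def composite_confidence(
    *,
    in_tp: bool,
    frequency: float,   # f_g / N, already normalised to [0, 1]
    enrichment: float,  # E_g, assumed to lie in [0, 1]
    weights: tuple[float, float, float] = (0.5, 0.25, 0.25),  # assumed values
) -> float:
    """Evaluate C_g = w1 * 1[g in TP] + w2 * f_g / N + w3 * E_g."""
    w1, w2, w3 = weights
    return w1 * (1.0 if in_tp else 0.0) + w2 * frequency + w3 * enrichment


# Gene A: true positive, present in 3 of 4 gene sets, enrichment support 0.5.
score_a = composite_confidence(in_tp=True, frequency=3 / 4, enrichment=0.5)

# Gene B: not a true positive, present in 1 of 4 gene sets, no enrichment support.
score_b = composite_confidence(in_tp=False, frequency=1 / 4, enrichment=0.0)

# Sort descending by composite score, as the documented function does.
ranked_genes = sorted(
    [("gene_a", score_a), ("gene_b", score_b)], key=lambda p: p[1], reverse=True
)
```

With these assumed weights, gene A scores 0.5 + 0.1875 + 0.125 = 0.8125 and gene B scores 0.0625.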

Ensemble Analysis

Purpose: Ensemble gene scoring by frequency across multiple gene sets. Genes appearing in more sets get higher scores (consensus voting). Returns sorted results for the workbench UI.

Ensemble gene scoring — frequency across multiple gene sets.

class veupath_chatbot.services.gene_sets.ensemble.EnsembleScore

Bases: TypedDict

A single gene’s ensemble score.

geneId: str
frequency: float
count: int
total: int
inPositives: bool

veupath_chatbot.services.gene_sets.ensemble.compute_ensemble_scores(gene_sets, positive_controls=None)

Score genes by how frequently they appear across gene sets.

Returns a list of EnsembleScore dicts sorted by frequency (desc), then gene ID (asc).

Return type:

list[EnsembleScore]
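Frequency-based scoring with the documented sort order can be sketched as follows. This is an illustration, not the module's code: ensemble_scores_sketch takes plain lists of IDs rather than GeneSet objects, and counting a gene at most once per set is an assumption.

```python
from collections import Counter
from typing import Optional


def ensemble_scores_sketch(
    gene_sets: list[list[str]], positive_controls: Optional[list[str]] = None
) -> list[dict]:
    """Score each gene by the fraction of gene sets that contain it."""
    total = len(gene_sets)
    positives = set(positive_controls or [])
    # set(genes) ensures a gene is counted once per set (assumption).
    counts = Counter(g for genes in gene_sets for g in set(genes))
    rows = [
        {
            "geneId": gene,
            "frequency": count / total,
            "count": count,
            "total": total,
            "inPositives": gene in positives,
        }
        for gene, count in counts.items()
    ]
    # Frequency descending, then gene ID ascending, as documented.
    return sorted(rows, key=lambda r: (-r["frequency"], r["geneId"]))


scores = ensemble_scores_sketch(
    [["g1", "g2"], ["g2", "g3"], ["g2", "g1"]],
    positive_controls=["g2"],
)
```

In this example g2 appears in all three sets and ranks first, g1 appears in two, and g3 in one.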