Gene Sets
Gene set management — persistent named collections of gene IDs with source tracking, confidence scoring, ensemble analysis, and reverse search ranking.
Overview
Gene sets are the bridge between strategy results and downstream analysis. When a strategy step returns gene IDs, those IDs can be captured as a named gene set for further use in enrichment analysis, cross-validation, and workbench exploration.
flowchart LR
A["Strategy Results"] -->|capture| B["Gene Set"]
C["User Paste/Upload"] -->|create| B
D["Set Operations"] -->|derive| B
B --> E["Enrichment"]
B --> F["Confidence Scoring"]
B --> G["Ensemble Analysis"]
B --> H["Reverse Search"]
B --> I["Export CSV/TXT"]
style B fill:#7c3aed,color:#fff
Key capabilities:
- CRUD: create gene sets from strategy results, user paste/upload, or derived operations (intersection, union, difference)
- Confidence scoring: per-gene composite scores combining classification status, ensemble frequency, and enrichment support
- Ensemble analysis: score genes by frequency across multiple gene sets (consensus voting)
- Reverse search: rank gene set candidates by their recovery of known positive genes using pure set intersection (no WDK API calls needed)
- Write-through persistence: in-memory store backed by PostgreSQL for fast reads with durable writes
Design Decisions
Why in-memory + DB?
Gene sets are read on nearly every workbench operation. The write-through store keeps a dict in memory for O(1) lookups while persisting mutations to PostgreSQL for durability. This avoids per-request DB round-trips for the common read path.
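The write-through idea can be illustrated with a minimal sketch. This is not the project's WriteThruStore: the class name `WriteThroughSketch`, the `persist` callback, and the per-key `flush` are hypothetical stand-ins, and the real store may schedule and track its database writes differently.

```python
from __future__ import annotations

import asyncio
from typing import Awaitable, Callable, Generic, Optional, TypeVar

T = TypeVar("T")


class WriteThroughSketch(Generic[T]):
    """Hypothetical stand-in for a write-through store (not the real class)."""

    def __init__(self, persist: Callable[[str, T], Awaitable[None]]) -> None:
        self._cache: dict[str, T] = {}
        self._persist = persist  # assumed async callback that writes to the DB
        self._pending: dict[str, list[asyncio.Task]] = {}

    def save(self, key: str, value: T) -> None:
        # Update memory first so readers see the new value immediately.
        self._cache[key] = value
        # Schedule the durable write in the background (fire-and-forget).
        # Must be called from inside a running event loop.
        task = asyncio.get_running_loop().create_task(self._persist(key, value))
        self._pending.setdefault(key, []).append(task)

    def get(self, key: str) -> Optional[T]:
        # Synchronous O(1) dict lookup; no database round-trip.
        return self._cache.get(key)

    async def flush(self, key: str) -> None:
        # Wait until every scheduled write for this key has completed.
        tasks = self._pending.pop(key, [])
        if tasks:
            await asyncio.gather(*tasks)
```

In this sketch a caller that needs the row to exist before continuing would `await store.flush(key)` after `store.save(key, value)`; plain reads go through `get` and never touch the database.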
Why pure set operations for reverse search?
Gene IDs from strategy results are already materialized. Ranking candidates by set intersection (recall, precision, F1) is instantaneous compared to running WDK API calls. This makes the “which gene set best recovers my positive controls?” question answerable in milliseconds.
Source tracking
Each gene set records its source (strategy, paste, upload, derived, saved) for provenance. This lets the UI show where a gene set came from and whether it’s “live” (from a strategy) or static.
Gene Set Store
Purpose: Write-through gene set store. In-memory dict for fast reads, PostgreSQL persistence for durability. Thread-safe via asyncio.
Gene set store with write-through DB persistence.
Keeps an in-memory dict for fast synchronous access during AI tool calls, and persists every mutation to PostgreSQL so gene sets survive API restarts.
- class veupath_chatbot.services.gene_sets.store.GeneSetStore
Bases: WriteThruStore[GeneSet]
Gene set repository with in-memory cache and DB write-through.
Inherits save/get/delete/aget/adelete from WriteThruStore. Adds domain-specific listing methods.
Gene Set Types
Purpose: Core data model for gene sets.
Gene Set data model.
- class veupath_chatbot.services.gene_sets.types.GeneSet(id, name, site_id, gene_ids, source, user_id=None, created_at=<factory>, wdk_strategy_id=None, wdk_step_id=None, search_name=None, record_type=None, parameters=None, parent_set_ids=<factory>, operation=None, step_count=1)
Bases: object
A named collection of gene IDs for analysis.
- __init__(id, name, site_id, gene_ids, source, user_id=None, created_at=<factory>, wdk_strategy_id=None, wdk_step_id=None, search_name=None, record_type=None, parameters=None, parent_set_ids=<factory>, operation=None, step_count=1)
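The `<factory>` defaults in the signature suggest a dataclass with `default_factory` fields. A minimal sketch of such a model follows; the field names come from the signature above, but every type annotation, the UTC timestamp default, and the class name `GeneSetSketch` are assumptions rather than the project's definition.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Any


@dataclass
class GeneSetSketch:
    """Illustrative mirror of the GeneSet signature; all types are assumed."""

    id: str
    name: str
    site_id: str
    gene_ids: list[str]
    source: str  # documented values: strategy, paste, upload, derived, saved
    user_id: str | None = None
    # default_factory gives each instance its own timestamp (UTC is assumed).
    created_at: datetime = field(default_factory=lambda: datetime.now(timezone.utc))
    wdk_strategy_id: int | None = None
    wdk_step_id: int | None = None
    search_name: str | None = None
    record_type: str | None = None
    parameters: dict[str, Any] | None = None
    # default_factory avoids sharing one mutable list between instances.
    parent_set_ids: list[str] = field(default_factory=list)
    operation: str | None = None
    step_count: int = 1


# Example values are made up for illustration.
pasted = GeneSetSketch(
    id="gs-1",
    name="Pasted kinases",
    site_id="plasmodb",
    gene_ids=["PF3D7_0100100", "PF3D7_0100200"],
    source="paste",
)
```

A derived set would additionally fill `parent_set_ids` and `operation` so that its provenance can be traced back to the sets it was computed from.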
Gene Set Operations
Purpose: CRUD operations on gene sets — create, update, delete, list, and derived set operations (intersection, union, difference).
Gene set business logic.
All domain operations on gene sets live here. The transport layer (HTTP router) delegates to this module for create, delete, list, set operations, enrichment, and step-results access.
- veupath_chatbot.services.gene_sets.operations.count_steps_in_tree(node)
Recursively count steps in a WDK strategy step tree.
- async veupath_chatbot.services.gene_sets.operations.resolve_root_step_id(api, *, strategy_id)
Get the root step ID from a WDK strategy.
- Return type: int | None
- async veupath_chatbot.services.gene_sets.operations.fetch_gene_ids_from_step(api, *, step_id)
Fetch all gene IDs from a WDK step via the standard report endpoint.
- class veupath_chatbot.services.gene_sets.operations.GeneSetService(store)
Bases: object
Orchestrates all gene-set domain operations.
Depends on the gene-set store and (lazily) on WDK APIs. The transport layer should instantiate this once per request or hold a singleton.
- async flush(gene_set_id)
Ensure a gene set is persisted to the database.
The default save path is fire-and-forget. Call this when you need the row to exist in the DB immediately (e.g., before setting an FK).
- async get_for_user(user_id, gene_set_id)
Retrieve a gene set, raising KeyError if not found or wrong owner.
- async create(*, user_id, name, site_id, gene_ids, source, wdk_strategy_id=None, wdk_step_id=None, search_name=None, record_type=None, parameters=None)
Create a gene set, auto-resolving from WDK if needed.
- async list_for_user(user_id, *, site_id=None)
List gene sets for a user, optionally filtered by site.
- find_by_wdk_strategy(user_id, wdk_strategy_id)
Find an existing gene set for a WDK strategy (cache lookup).
- Return type: GeneSet | None
- async delete(user_id, gene_set_id)
Delete a gene set, raising KeyError if not found or wrong owner.
- async perform_set_operation(*, user_id, set_a_id, set_b_id, operation, name)
Perform a set operation (intersect, union, minus) between two gene sets.
- async run_enrichment(user_id, gene_set_id, enrichment_types)
Run enrichment analysis on a gene set.
- async get_step_results_service(user_id, gene_set_id)
Get a StepResultsService for a gene set.
Raises ValueError if the gene set has no associated WDK step.
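The three derived operations named for perform_set_operation (intersect, union, minus) reduce to ordinary set algebra on gene IDs. The sketch below shows only that core step; the function name `apply_set_operation`, the sorted output, and the error handling are assumptions, and the real method also resolves the two stored sets by ID and saves the result as a new gene set.

```python
from __future__ import annotations


def apply_set_operation(a: list[str], b: list[str], operation: str) -> list[str]:
    """Combine two gene ID lists; a hypothetical helper, not the service method."""
    set_a, set_b = set(a), set(b)
    if operation == "intersect":
        result = set_a & set_b  # genes present in both sets
    elif operation == "union":
        result = set_a | set_b  # genes present in either set
    elif operation == "minus":
        result = set_a - set_b  # genes in A that are not in B
    else:
        raise ValueError(f"unknown operation: {operation!r}")
    # Sorting is an assumption made here so the output is deterministic.
    return sorted(result)
```

Note that "minus" is the only one of the three that is not symmetric, so the order of `set_a_id` and `set_b_id` matters for it.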
Confidence Scoring
Composite Confidence Score
In the composite score, \(f_g\) is the ensemble frequency (the number of gene sets that contain gene \(g\)), \(N\) is the total number of gene sets, and \(E_g\) is the enrichment support score (membership in significant GO terms or pathways).
Purpose: Per-gene composite confidence scoring. Combines classification status (TP/FP/FN/TN), ensemble frequency (how many gene sets include this gene), and enrichment support (GO term / pathway membership) into a single score.
Per-gene composite confidence scoring.
Combines classification, ensemble frequency, and enrichment support into a single ranked score. Pure computation — no I/O.
- class veupath_chatbot.services.gene_sets.confidence.GeneConfidenceScore(gene_id, composite_score, classification_score, ensemble_score, enrichment_score)
Bases: object
Confidence breakdown for a single gene.
- __init__(gene_id, composite_score, classification_score, ensemble_score, enrichment_score)
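To make the three components concrete, here is a sketch of one way a composite could be assembled as a weighted sum. The weights (0.5, 0.25, 0.25), the normalisation of ensemble frequency as `set_hits / total_sets`, and the names `ConfidenceSketch` and `score_gene` are all assumptions chosen for illustration; the module's actual formula is not reproduced here.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class ConfidenceSketch:
    """Mirrors the documented field names; the scoring logic is assumed."""

    gene_id: str
    composite_score: float
    classification_score: float
    ensemble_score: float
    enrichment_score: float


def score_gene(
    gene_id: str,
    classification_score: float,
    set_hits: int,    # f_g: number of gene sets containing the gene
    total_sets: int,  # N: total number of gene sets
    enrichment_score: float,  # E_g
    weights: tuple[float, float, float] = (0.5, 0.25, 0.25),  # ASSUMED weights
) -> ConfidenceSketch:
    # Assumed normalisation: fraction of gene sets that contain the gene.
    ensemble_score = set_hits / total_sets if total_sets else 0.0
    w_class, w_ens, w_enrich = weights
    composite = (
        w_class * classification_score
        + w_ens * ensemble_score
        + w_enrich * enrichment_score
    )
    return ConfidenceSketch(
        gene_id, composite, classification_score, ensemble_score, enrichment_score
    )
```

Like the documented module, the sketch is pure computation with no I/O, which keeps it easy to test in isolation.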
Ensemble Analysis
Purpose: Ensemble gene scoring by frequency across multiple gene sets. Genes appearing in more sets get higher scores (consensus voting). Returns sorted results for the workbench UI.
Ensemble gene scoring — frequency across multiple gene sets.
- class veupath_chatbot.services.gene_sets.ensemble.EnsembleScore
Bases: TypedDict
A single gene’s ensemble score.
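Consensus voting of this kind can be sketched with a counter over the input sets. The keys of `EnsembleScoreSketch` (`gene_id`, `count`, `score`), the function name, and the tie-breaking by gene ID are assumptions; the real EnsembleScore TypedDict may carry different fields.

```python
from __future__ import annotations

from collections import Counter
from typing import TypedDict


class EnsembleScoreSketch(TypedDict):
    """Hypothetical shape; the real EnsembleScore keys are not shown above."""

    gene_id: str
    count: int    # number of gene sets that contain the gene
    score: float  # count divided by the number of gene sets


def ensemble_scores(gene_sets: list[list[str]]) -> list[EnsembleScoreSketch]:
    total = len(gene_sets)
    if total == 0:
        return []
    # Deduplicate within each set so a repeated ID counts once per set.
    counts = Counter(gene for genes in gene_sets for gene in set(genes))
    rows = [
        EnsembleScoreSketch(gene_id=gene, count=hits, score=hits / total)
        for gene, hits in counts.items()
    ]
    # Highest consensus first; ties broken by gene ID for a stable order.
    rows.sort(key=lambda row: (-row["score"], row["gene_id"]))
    return rows
```

A gene found in every input set scores 1.0 in this sketch, and a gene found in only one of four sets scores 0.25.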
Reverse Search
Purpose: Rank gene set candidates by recovery of known positive genes. Uses pure set intersection — no WDK API calls needed. Scores on recall, precision, and F1 for given positive controls.
Reverse search — rank gene sets by how well they recover positive genes.
Given a set of known-positive gene IDs, score each candidate gene set on recall, precision, and F1 using pure set intersection. No WDK calls needed because the gene IDs are already materialised.
- class veupath_chatbot.services.gene_sets.reverse_search.GeneSetCandidate(id, name, gene_ids, search_name=None)
Bases: object
A gene set to evaluate against the positive controls.
- __init__(id, name, gene_ids, search_name=None)
- class veupath_chatbot.services.gene_sets.reverse_search.RankedResult(gene_set_id, name, search_name, recall, precision, f1, result_count, overlap_count)
Bases: object
A scored gene set with classification metrics.
- __init__(gene_set_id, name, search_name, recall, precision, f1, result_count, overlap_count)
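The metrics behind reverse search follow directly from the overlap between a candidate and the positive controls. The sketch below uses simplified stand-ins for the two classes above (`CandidateSketch`, `RankedSketch`, and `rank_candidates` are hypothetical names), and the sort order by F1, then recall, then name is an assumption, since the text does not state how ties are ranked.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class CandidateSketch:
    id: str
    name: str
    gene_ids: frozenset[str]


@dataclass(frozen=True)
class RankedSketch:
    gene_set_id: str
    name: str
    recall: float
    precision: float
    f1: float
    result_count: int
    overlap_count: int


def rank_candidates(
    positives: set[str], candidates: list[CandidateSketch]
) -> list[RankedSketch]:
    ranked = []
    for cand in candidates:
        overlap = len(cand.gene_ids & positives)
        # Recall: share of the positive controls that the candidate recovers.
        recall = overlap / len(positives) if positives else 0.0
        # Precision: share of the candidate's genes that are positives.
        precision = overlap / len(cand.gene_ids) if cand.gene_ids else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        ranked.append(
            RankedSketch(
                cand.id, cand.name, recall, precision, f1,
                len(cand.gene_ids), overlap,
            )
        )
    # Assumed ordering: best F1 first, then recall, then name.
    ranked.sort(key=lambda r: (-r.f1, -r.recall, r.name))
    return ranked
```

Because everything here is in-memory set arithmetic, the cost grows only with the sizes of the gene sets involved, which is why no WDK round-trip is needed.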