Research Services

Web search and literature search services. They integrate with DuckDuckGo, Semantic Scholar, Europe PMC, Crossref, OpenAlex, PubMed, arXiv, and preprint servers, and are used by the web_search and literature_search tools.

Overview

  • Literature Search — Orchestrates multiple APIs; deduplication, filtering, ranking. Returns citations with DOIs, PMIDs, authors.

  • Web Search — DuckDuckGo for general web queries.

  • Research Utils — Text normalization, fuzzy scoring, deduplication, filters.

  • Clients — Per-source API clients (Semantic Scholar, PubMed, etc.).

Design Decisions

Multi-source aggregation

No single literature API has complete coverage. Semantic Scholar excels at recent ML/CS papers; PubMed covers biomedical literature; Europe PMC has open-access full text; OpenAlex provides broad citation data. PathFinder queries all sources in parallel and deduplicates by DOI/PMID, so researchers get broader coverage than any single API offers without needing to know which one to query.
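The parallel fan-out described above can be sketched with asyncio.gather. This is a minimal illustration, not the project's orchestration code: FakeClient, search_all, and the "results" key in the response are assumptions made for the example.

```python
import asyncio


async def search_all(clients, query, *, limit=5):
    """Query every client concurrently and merge whatever succeeded."""
    tasks = [client.search(query, limit=limit) for client in clients]
    # return_exceptions=True keeps one failing source from sinking the rest.
    outcomes = await asyncio.gather(*tasks, return_exceptions=True)
    merged = []
    for outcome in outcomes:
        if isinstance(outcome, Exception):
            continue  # skip sources that errored or timed out
        merged.extend(outcome.get("results", []))
    return merged


class FakeClient:
    """Hypothetical stand-in for a per-source client (no network access)."""

    def __init__(self, name, fail=False):
        self.name = name
        self.fail = fail

    async def search(self, query, *, limit):
        if self.fail:
            raise RuntimeError(f"{self.name} unavailable")
        return {"source": self.name, "results": [{"title": f"{query} ({self.name})"}]}


results = asyncio.run(
    search_all([FakeClient("pubmed"), FakeClient("openalex", fail=True)], "malaria")
)
```

In this sketch a failing source is dropped silently; a real orchestrator would probably log the failure as well.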

Deduplication by DOI + PMID

Papers appear in multiple databases. The literature search generates a dedupe key from DOI (preferred) or PMID, removing exact duplicates. Near-duplicates (different title casing, different abstract length) are handled by fuzzy matching on normalized title + first author.
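A rough sketch of the key-based part of this scheme follows. The function names and the key format are hypothetical, and the title fallback here uses exact matching on a normalized title plus first author, which is a simplification of the fuzzy matching described above.

```python
import re


def sketch_dedupe_key(item):
    """Prefer DOI, then PMID, then normalized title plus first author."""
    doi = (item.get("doi") or "").strip().lower()
    if doi:
        return f"doi:{doi}"
    pmid = str(item.get("pmid") or "").strip()
    if pmid:
        return f"pmid:{pmid}"
    # Lowercase and collapse punctuation so casing differences do not matter.
    title = re.sub(r"[^a-z0-9]+", " ", (item.get("title") or "").lower()).strip()
    authors = item.get("authors") or []
    first_author = authors[0].strip().lower() if authors else ""
    return f"title:{title}|{first_author}"


def dedupe(items):
    """Keep the first item seen for each key, preserving input order."""
    seen = set()
    unique = []
    for item in items:
        key = sketch_dedupe_key(item)
        if key not in seen:
            seen.add(key)
            unique.append(item)
    return unique
```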

DuckDuckGo for web search

DuckDuckGo’s Instant Answer API requires no API key and has generous rate limits. It’s used for general web queries when the agent needs non-academic information (documentation, protocols, methods).

Supported Research Sources

Source             Type       Coverage
Semantic Scholar   Academic   CS, ML, biomedical (citation graphs)
PubMed             Academic   Biomedical literature (MEDLINE)
Europe PMC         Academic   Open-access full text, preprints
OpenAlex           Academic   Broad citation data, works metadata
Crossref           Academic   DOI metadata, publisher data
arXiv              Preprint   CS, math, physics, biology preprints
DuckDuckGo         Web        General web search (no API key required)

Research Utils

Purpose: Utility functions for research, covering text normalization, author limiting, fuzzy scoring, deduplication keys, and filter predicates. Used by literature search and citation processing.

Key functions: passes_filters(), dedupe_key(), rerank_score(), norm_text()

Utility functions for research services.

veupath_chatbot.services.research.utils.norm_text(value)[source]

Normalize text for comparison.

Parameters:

value (str | None) – Text to normalize.

Returns:

Normalized string.

Return type:

str

veupath_chatbot.services.research.utils.list_str(value)[source]

Convert a JSON value to a list of strings.

Parameters:

value (JSONValue) – Value to process.

Return type:

list[str]

veupath_chatbot.services.research.utils.limit_authors(authors, max_authors)[source]

Limit the number of authors, appending ‘et al.’ if truncated.

Parameters:
  • authors (list[str] | None) – Author list.

  • max_authors (int) – Maximum number of authors (-1 for no limit).

Returns:

Truncated list or None.

Return type:

list[str] | None
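The documented behaviour could look roughly like the sketch below. It assumes that 'et al.' is appended as a separate list element; the source does not confirm that detail, and limit_authors_sketch is a hypothetical name.

```python
def limit_authors_sketch(authors, max_authors):
    """Return at most max_authors names, marking truncation with 'et al.'."""
    if authors is None:
        return None
    # A negative limit (documented as -1) means "no limit".
    if max_authors < 0 or len(authors) <= max_authors:
        return list(authors)
    return list(authors[:max_authors]) + ["et al."]
```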

veupath_chatbot.services.research.utils.truncate_text(text, max_chars)[source]

Truncate text to max_chars, appending ellipsis if truncated.

Parameters:
  • text (str | None) – Text to truncate.

  • max_chars (int) – Maximum character count.

Returns:

Truncated string or None.

Return type:

str | None
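A minimal sketch of this contract, under two assumptions the source does not state: the ellipsis is three ASCII dots, and it is appended after the cut, so the result can exceed max_chars by the length of the ellipsis.

```python
def truncate_text_sketch(text, max_chars):
    """Cut text to max_chars and mark the cut with an ellipsis."""
    if text is None:
        return None
    if len(text) <= max_chars:
        return text
    # Trim trailing whitespace left by the cut before adding the marker.
    return text[:max_chars].rstrip() + "..."
```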

veupath_chatbot.services.research.utils.strip_tags(text)[source]

Remove HTML tags and normalize whitespace.

Parameters:

text (str) – HTML string.

Returns:

Plain text.

Return type:

str
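One simple way to get this behaviour is a pair of regular expressions, sketched below under the hypothetical name strip_tags_sketch. Replacing each tag with a space keeps adjacent words apart, at the cost of occasionally leaving a space before punctuation.

```python
import re


def strip_tags_sketch(text):
    """Drop anything that looks like an HTML tag, then collapse whitespace."""
    without_tags = re.sub(r"<[^>]+>", " ", text)
    return re.sub(r"\s+", " ", without_tags).strip()
```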

veupath_chatbot.services.research.utils.decode_ddg_redirect(href)[source]

Decode DuckDuckGo redirect URLs.

Parameters:

href (str) – Redirect URL.

Returns:

Decoded URL.

Return type:

str
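The sketch below illustrates the idea, assuming the redirect link carries the target in a uddg query parameter under a /l/ path, which is how DuckDuckGo's HTML result links are commonly observed. The function name is hypothetical and the real implementation may handle more cases.

```python
from urllib.parse import parse_qs, urlparse


def decode_ddg_redirect_sketch(href):
    """Return the target URL hidden inside a DuckDuckGo redirect link."""
    # Scheme-relative links ("//duckduckgo.com/...") need a scheme to parse cleanly.
    candidate = "https:" + href if href.startswith("//") else href
    parsed = urlparse(candidate)
    if parsed.netloc.endswith("duckduckgo.com") and parsed.path.startswith("/l/"):
        target = parse_qs(parsed.query).get("uddg")
        if target:
            return target[0]  # parse_qs already percent-decodes the value
    # Anything that is not a recognised redirect is returned unchanged.
    return href
```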

veupath_chatbot.services.research.utils.candidate_queries(q)[source]

Generate candidate query variations for fallback searches.

Parameters:

q (str) – Search query.

Returns:

Candidate query variations.

Return type:

list[str]

veupath_chatbot.services.research.utils.looks_blocked(status_code, html)[source]

Check if a response looks like it was blocked by rate limiting.

Parameters:
  • status_code (int) – HTTP status code.

  • html (str) – Response HTML body.

Returns:

True if response looks blocked.

Return type:

bool

veupath_chatbot.services.research.utils.norm_for_match(text)[source]

Normalize text for fuzzy matching.

Parameters:

text (str | None) – Text to normalize.

Returns:

Normalized string for matching.

Return type:

str

veupath_chatbot.services.research.utils.fallback_ratio(a, b)[source]

Fallback similarity ratio using SequenceMatcher.

Parameters:
  • a (str) – First string.

  • b (str) – Second string.

Returns:

Similarity ratio (0-100).

Return type:

float
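Python's standard library provides SequenceMatcher in difflib, whose ratio() method returns a value between 0 and 1. Scaling it to the documented 0-100 range could look like this (a sketch with a hypothetical name; any preprocessing in the real function is not shown):

```python
from difflib import SequenceMatcher


def fallback_ratio_sketch(a, b):
    """Similarity of two strings on a 0-100 scale."""
    return SequenceMatcher(None, a, b).ratio() * 100.0
```

For example, "abcd" and "abcf" share three of four characters in order, so ratio() is 2 * 3 / 8 = 0.75 and the scaled score is 75.0.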

veupath_chatbot.services.research.utils.fuzzy_score(query, text)[source]

Calculate fuzzy similarity score between query and text.

Parameters:
  • query (str) – Search query.

  • text (str) – Text to score.

Returns:

Fuzzy similarity score.

Return type:

float

veupath_chatbot.services.research.utils.rerank_score(query, item)[source]

Calculate reranking score for a literature search result.

Parameters:
  • query (str) – Search query.

  • item (JSONObject) – Literature result item.

Returns:

Tuple of (score, score breakdown).

Return type:

tuple[float, dict[str, float]]

veupath_chatbot.services.research.utils.passes_filters(*, title, authors, year, doi, pmid, journal, year_from, year_to, author_includes, title_includes, journal_includes, doi_equals, pmid_equals, require_doi)[source]

Check if a literature result passes all filters.

Parameters:
  • title (str) – Result title.

  • authors (list[str] | None) – Author list.

  • year (int | None) – Publication year.

  • doi (str | None) – DOI.

  • pmid (str | None) – PubMed ID.

  • journal (str | None) – Journal name.

  • year_from (int | None) – Minimum year filter.

  • year_to (int | None) – Maximum year filter.

  • author_includes (str | None) – Author substring filter.

  • title_includes (str | None) – Title substring filter.

  • journal_includes (str | None) – Journal substring filter.

  • doi_equals (str | None) – Exact DOI filter.

  • pmid_equals (str | None) – Exact PMID filter.

  • require_doi (bool) – Whether DOI is required.

Returns:

True if result passes all filters.

Return type:

bool
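To make the filter semantics concrete, here is a sketch of a predicate with the same keyword-only parameters. Several details are assumptions rather than documented behaviour: substring matches are case-insensitive, a result with no year fails any year filter, and the filter arguments default to "no filter" for brevity.

```python
def passes_filters_sketch(
    *, title, authors, year, doi, pmid, journal,
    year_from=None, year_to=None, author_includes=None, title_includes=None,
    journal_includes=None, doi_equals=None, pmid_equals=None, require_doi=False,
):
    """Return True only if the result satisfies every active filter."""

    def contains(haystack, needle):
        return needle.lower() in (haystack or "").lower()

    if require_doi and not doi:
        return False
    if year_from is not None and (year is None or year < year_from):
        return False
    if year_to is not None and (year is None or year > year_to):
        return False
    if title_includes and not contains(title, title_includes):
        return False
    if journal_includes and not contains(journal, journal_includes):
        return False
    if author_includes and not any(contains(a, author_includes) for a in authors or []):
        return False
    if doi_equals and (doi or "").lower() != doi_equals.lower():
        return False
    if pmid_equals and str(pmid or "") != str(pmid_equals):
        return False
    return True


# Placeholder record; the values are illustrative, not a real citation.
paper = dict(
    title="Example parasite genomics paper",
    authors=["Author A", "Author B"],
    year=2002,
    doi="10.1234/example",
    pmid="1",
    journal="Example Journal",
)
in_range = passes_filters_sketch(**paper, year_from=2000, author_includes="author a")
too_late = passes_filters_sketch(**paper, year_to=2001)
```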

veupath_chatbot.services.research.utils.dedupe_key(item)[source]

Generate a deduplication key for a literature result.

Parameters:

item (JSONObject) – Item dict.

Return type:

str

async veupath_chatbot.services.research.utils.fetch_page_summary(client, url, *, max_chars)[source]

Fetch and extract a text summary from a web page.

Streams the response and stops reading as soon as </head> is found or 32 KB have been consumed. Meta description tags are checked first; if none are present the longest <p> in the buffered content is used as a fallback. Returns None for PDFs, Google Scholar links, or on error.

Return type:

str | None
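The extraction step (meta description first, longest paragraph as a fallback) can be sketched on an already-buffered HTML string. This example leaves out the streaming, the 32 KB cap, and the PDF and Google Scholar checks. It also handles only double-quoted attributes, and summary_from_html is a hypothetical helper name.

```python
import re

META_TAG = re.compile(r"<meta\b[^>]*>", re.IGNORECASE)
ATTR = re.compile(r'([\w:-]+)\s*=\s*"([^"]*)"')
PARAGRAPH = re.compile(r"<p\b[^>]*>(.*?)</p>", re.IGNORECASE | re.DOTALL)


def summary_from_html(buffered_html, max_chars):
    """Pick a summary from buffered HTML, or None if nothing usable exists."""
    # 1. Prefer an explicit <meta name="description" content="...">.
    for tag in META_TAG.findall(buffered_html):
        attrs = {key.lower(): value for key, value in ATTR.findall(tag)}
        if attrs.get("name", "").lower() == "description" and attrs.get("content"):
            return attrs["content"].strip()[:max_chars]
    # 2. Otherwise fall back to the longest <p>, with inner tags removed.
    paragraphs = [
        re.sub(r"\s+", " ", re.sub(r"<[^>]+>", " ", body)).strip()
        for body in PARAGRAPH.findall(buffered_html)
    ]
    paragraphs = [p for p in paragraphs if p]
    if not paragraphs:
        return None
    return max(paragraphs, key=len)[:max_chars]
```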

Research Clients

API clients for literature sources. Each client implements search for its backend and inherits from either BaseClient or StandardClient, both defined in the base module.

Shared base for literature search API clients.

class veupath_chatbot.services.research.clients._base.BaseClient(*, timeout_seconds=15.0)[source]

Bases: object

Common initialisation for all literature API clients.

__init__(*, timeout_seconds=15.0)[source]

class veupath_chatbot.services.research.clients._base.StandardClient(*, timeout_seconds=15.0)[source]

Bases: BaseClient

Client with the standard fetch-parse-build search pattern.

Subclasses implement _source_name, _fetch_raw, and _parse_item. The search method is inherited.

async search(query, *, limit, abstract_max_chars)[source]

Return type:

JSONObject
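The fetch-parse-build pattern can be illustrated with a stand-alone sketch. SketchStandardClient and InMemoryClient are hypothetical classes written for this example; the hook signatures and the response keys are assumptions, and the real base class may differ.

```python
import asyncio


class SketchStandardClient:
    """Hypothetical base: subclasses supply the source name, fetch, and parse."""

    _source_name = "unknown"

    async def _fetch_raw(self, query, limit):
        raise NotImplementedError

    def _parse_item(self, raw, abstract_max_chars):
        raise NotImplementedError

    async def search(self, query, *, limit, abstract_max_chars):
        # fetch: get backend-specific records
        raw_items = await self._fetch_raw(query, limit)
        # parse: convert each record to a common shape
        results = [self._parse_item(raw, abstract_max_chars) for raw in raw_items]
        # build: wrap the results in a uniform response
        return {"query": query, "source": self._source_name, "results": results}


class InMemoryClient(SketchStandardClient):
    """Toy subclass that serves canned records instead of calling an API."""

    _source_name = "in_memory"
    _records = [{"t": "Alpha study", "abs": "x" * 50}, {"t": "Beta study", "abs": "short"}]

    async def _fetch_raw(self, query, limit):
        return self._records[:limit]

    def _parse_item(self, raw, abstract_max_chars):
        return {"title": raw["t"], "snippet": raw["abs"][:abstract_max_chars]}


response = asyncio.run(InMemoryClient().search("study", limit=1, abstract_max_chars=10))
```

Keeping search in the base class means a new source only has to describe how to fetch and how to parse.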

veupath_chatbot.services.research.clients._base.make_citation(*, source, id_prefix, title, url=None, authors=None, year=None, doi=None, pmid=None, snippet=None)[source]

Build a citation dict from common fields.

Return type:

JSONObject

veupath_chatbot.services.research.clients._base.build_response(*, query, source, results, citations)[source]

Build the standard client response dict, deduplicating citation tags.

Return type:

JSONObject

Semantic Scholar API client.

class veupath_chatbot.services.research.clients.semanticscholar.SemanticScholarClient(*, timeout_seconds=15.0)[source]

Bases: StandardClient

Client for Semantic Scholar API.

Europe PMC API client.

class veupath_chatbot.services.research.clients.europepmc.EuropePmcClient(*, timeout_seconds=15.0)[source]

Bases: StandardClient

Client for Europe PMC API.

PubMed API client.

class veupath_chatbot.services.research.clients.pubmed.PubmedClient(*, timeout_seconds=15.0)[source]

Bases: BaseClient

Client for PubMed API.

PubMed requires a multi-step fetch (esearch -> esummary -> optional efetch for abstracts), so it keeps a custom search method. Per-item parsing still goes through _parse_item / _build_results.

async search(query, *, limit, include_abstract, abstract_max_chars)[source]

Search PubMed.

Return type:

JSONObject
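The three steps map onto NCBI's E-utilities endpoints. The sketch below only builds the request URLs and makes no network calls. The helper names are hypothetical, and the real client may pass additional parameters such as an API key or tool name.

```python
from urllib.parse import urlencode

EUTILS = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"


def esearch_url(query, limit):
    """Step 1: search PubMed and get back a list of PMIDs."""
    params = {"db": "pubmed", "term": query, "retmax": limit, "retmode": "json"}
    return f"{EUTILS}/esearch.fcgi?{urlencode(params)}"


def esummary_url(pmids):
    """Step 2: fetch titles, authors, and dates for those PMIDs."""
    params = {"db": "pubmed", "id": ",".join(pmids), "retmode": "json"}
    return f"{EUTILS}/esummary.fcgi?{urlencode(params)}"


def efetch_url(pmids):
    """Step 3 (optional): fetch full records, including abstracts, as XML."""
    params = {"db": "pubmed", "id": ",".join(pmids), "retmode": "xml"}
    return f"{EUTILS}/efetch.fcgi?{urlencode(params)}"
```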

Crossref API client.

class veupath_chatbot.services.research.clients.crossref.CrossrefClient(*, timeout_seconds=15.0)[source]

Bases: StandardClient

Client for Crossref API.

OpenAlex API client.

class veupath_chatbot.services.research.clients.openalex.OpenAlexClient(*, timeout_seconds=15.0)[source]

Bases: StandardClient

Client for OpenAlex API.

arXiv API client.

class veupath_chatbot.services.research.clients.arxiv.ArxivClient(*, timeout_seconds=15.0)[source]

Bases: StandardClient

Client for arXiv API.

Preprint site search client (bioRxiv, medRxiv).

class veupath_chatbot.services.research.clients.preprint.PreprintClient(*, timeout_seconds=15.0)[source]

Bases: BaseClient

Client for preprint site searches via DuckDuckGo.

Preprint search has a unique signature (site, source, include_abstract) and a post-processing step that fetches page summaries, so it keeps a custom search method. Per-item parsing still goes through _parse_item / _build_results.

async search(query, *, site, source, limit, include_abstract, abstract_max_chars)[source]

Search preprint sites using DuckDuckGo.

Return type:

JSONObject