Research Services
Web search and literature search. Integrates with DuckDuckGo, Semantic Scholar,
Europe PMC, Crossref, OpenAlex, PubMed, arXiv, and preprint servers. Used by
the web_search and literature_search tools.
Overview
Literature Search — Orchestrates multiple APIs; deduplication, filtering, ranking. Returns citations with DOIs, PMIDs, authors.
Web Search — DuckDuckGo for general web queries.
Research Utils — Text normalization, fuzzy scoring, deduplication, filters.
Clients — Per-source API clients (Semantic Scholar, PubMed, etc.).
Design Decisions
Multi-source aggregation
No single literature API has complete coverage. Semantic Scholar excels at recent ML/CS papers; PubMed covers biomedical literature; Europe PMC has open-access full text; OpenAlex provides broad citation data. PathFinder queries all sources in parallel and deduplicates by DOI/PMID, giving researchers the best coverage without requiring them to know which API to use.
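The parallel fan-out described above can be sketched with asyncio.gather. This is a minimal illustration and not the project's implementation: search_source and search_all are hypothetical names, and the sleep call stands in for a real HTTP request. A failing source is skipped so that one outage does not sink the whole search.

```python
import asyncio


async def search_source(name: str, query: str, delay: float) -> list[dict]:
    """Hypothetical stand-in for one per-source client call."""
    await asyncio.sleep(delay)  # simulates network latency
    if name == "broken":
        raise RuntimeError("source unavailable")
    return [{"source": name, "title": f"{query} ({name})"}]


async def search_all(query: str, sources: dict[str, float]) -> list[dict]:
    """Query every source concurrently and merge whatever succeeded."""
    tasks = [search_source(name, query, delay) for name, delay in sources.items()]
    # return_exceptions=True keeps one failing source from cancelling the rest.
    outcomes = await asyncio.gather(*tasks, return_exceptions=True)
    merged: list[dict] = []
    for outcome in outcomes:
        if isinstance(outcome, BaseException):
            continue  # skip sources that raised
        merged.extend(outcome)
    return merged


results = asyncio.run(
    search_all("plasmodium kinase", {"pubmed": 0.02, "openalex": 0.01, "broken": 0.0})
)
```

gather returns outcomes in the order the tasks were passed in, so the merged list is stable regardless of which source answers first.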
Deduplication by DOI + PMID
Papers appear in multiple databases. The literature search generates a dedupe key from DOI (preferred) or PMID, removing exact duplicates. Near-duplicates (different title casing, different abstract length) are handled by fuzzy matching on normalized title + first author.
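A rough sketch of this two-stage deduplication follows. The helper names (exact_key, fuzzy_signature, dedupe), the key format, and the 0.92 similarity threshold are assumptions chosen for illustration; the real dedupe_key() and fuzzy matching may differ.

```python
import re
from difflib import SequenceMatcher
from typing import Optional


def norm(text: str) -> str:
    """Lowercase and collapse punctuation/whitespace for comparison."""
    return re.sub(r"[^a-z0-9]+", " ", text.lower()).strip()


def exact_key(item: dict) -> Optional[str]:
    """Prefer the DOI, fall back to the PMID, else no exact key."""
    doi = (item.get("doi") or "").strip().lower()
    if doi:
        return f"doi:{doi}"
    pmid = str(item.get("pmid") or "").strip()
    if pmid:
        return f"pmid:{pmid}"
    return None


def fuzzy_signature(item: dict) -> tuple[str, str]:
    """Normalized title plus normalized first author."""
    authors = item.get("authors") or []
    first_author = norm(authors[0]) if authors else ""
    return norm(item.get("title") or ""), first_author


def dedupe(items: list[dict], threshold: float = 0.92) -> list[dict]:
    seen_keys: set[str] = set()
    kept: list[dict] = []
    for item in items:
        key = exact_key(item)
        if key is not None and key in seen_keys:
            continue  # exact duplicate by DOI or PMID
        title, first_author = fuzzy_signature(item)
        near_duplicate = any(
            title
            and first_author == other_author
            and SequenceMatcher(None, title, other_title).ratio() >= threshold
            for other_title, other_author in map(fuzzy_signature, kept)
        )
        if near_duplicate:
            continue  # same first author and near-identical title
        if key is not None:
            seen_keys.add(key)
        kept.append(item)
    return kept


papers = [
    {"title": "Malaria Parasite Kinases", "authors": ["Smith J"], "doi": "10.1000/ABC"},
    {"title": "malaria parasite kinases", "authors": ["Smith J"], "doi": "10.1000/abc"},
    {"title": "Malaria parasite kinases.", "authors": ["Smith J"], "pmid": "123"},
    {"title": "Apicoplast biology", "authors": ["Lee K"]},
]
unique = dedupe(papers)
```

The second record is dropped by the DOI key, and the third (no DOI, different punctuation) by the title and first-author comparison.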
DuckDuckGo for web search
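DuckDuckGo requires no API key, and its rate limits are generous enough for agent use; responses that look blocked are detected separately (see looks_blocked()). The web search service queries DuckDuckGo's HTML interface and is used for general web queries when the agent needs non-academic information (documentation, protocols, methods).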
| Source | Type | Coverage |
|---|---|---|
| Semantic Scholar | Academic | CS, ML, biomedical; citation graphs |
| PubMed | Academic | Biomedical literature (MEDLINE) |
| Europe PMC | Academic | Open-access full text, preprints |
| OpenAlex | Academic | Broad citation data, works metadata |
| Crossref | Academic | DOI metadata, publisher data |
| arXiv | Preprint | CS, math, physics, biology preprints |
| DuckDuckGo | Web | General web search (no API key required) |
Literature Search
Purpose: Orchestrates multiple literature APIs to find papers by query. Handles deduplication, filtering, and ranking across sources. Returns citations with DOI, PMID, authors, abstract.
Key class: LiteratureSearchService — method: search()
Literature search service orchestrating multiple API clients.
- class veupath_chatbot.services.research.literature_search.LiteratureSearchService(*, timeout_seconds=15.0)
Bases: object
Service for searching scientific literature across multiple sources.
- async search(query, *, source='all', limit=5, sort='relevance', include_abstract=False, abstract_max_chars=2000, max_authors=2, year_from=None, year_to=None, author_includes=None, title_includes=None, journal_includes=None, doi_equals=None, pmid_equals=None, require_doi=False)
Search scientific literature across multiple sources.
- Return type:
Web Search
Purpose: DuckDuckGo-based web search for general queries. Used when the agent needs to look up external information.
Key class: WebSearchService
Web search service using DuckDuckGo.
- class veupath_chatbot.services.research.web_search.WebSearchService(*, timeout_seconds=15.0)
Bases: object
Service for web search using DuckDuckGo HTML interface.
Research Utils
Purpose: Utility functions for research: text normalization, author limiting, fuzzy scoring, deduplication keys, filter predicates. Used by literature search and citation processing.
Key functions: passes_filters(), dedupe_key(), rerank_score(), norm_text()
Utility functions for research services.
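To make the one-line descriptions in the listing below more concrete, here is a sketch of how a few of these helpers could behave. These are guesses reconstructed from the docstrings, not the module's actual code: the "et al." placement, the "..." suffix, and the assumption that DuckDuckGo redirect links carry the target in a uddg query parameter are all illustrative choices.

```python
import html
import re
from difflib import SequenceMatcher
from urllib.parse import parse_qs, urlparse


def limit_authors(authors: list[str], max_authors: int) -> list[str]:
    """Keep at most max_authors names, marking truncation with 'et al.'."""
    if len(authors) <= max_authors:
        return list(authors)
    return list(authors[:max_authors]) + ["et al."]


def truncate_text(text: str, max_chars: int) -> str:
    """Cut text to max_chars and append an ellipsis when something was cut."""
    if len(text) <= max_chars:
        return text
    return text[:max_chars].rstrip() + "..."


def strip_tags(text: str) -> str:
    """Drop HTML tags, decode entities, and collapse runs of whitespace."""
    without_tags = re.sub(r"<[^>]+>", " ", text)
    return " ".join(html.unescape(without_tags).split())


def decode_ddg_redirect(href: str) -> str:
    """Return the target of a DuckDuckGo redirect link, or href unchanged."""
    target = parse_qs(urlparse(href).query).get("uddg")
    return target[0] if target else href


def fallback_ratio(a: str, b: str) -> float:
    """Similarity in [0, 1] from the standard library's SequenceMatcher."""
    return SequenceMatcher(None, a, b).ratio()
```

For example, decode_ddg_redirect("//duckduckgo.com/l/?uddg=https%3A%2F%2Fexample.org%2Fpaper&rut=abc") yields "https://example.org/paper", while a direct link is returned as-is.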
- veupath_chatbot.services.research.utils.list_str(value)
Convert a JSON value to a list of strings.
- veupath_chatbot.services.research.utils.limit_authors(authors, max_authors)
Limit the number of authors, appending ‘et al.’ if truncated.
- veupath_chatbot.services.research.utils.truncate_text(text, max_chars)
Truncate text to max_chars, appending ellipsis if truncated.
- veupath_chatbot.services.research.utils.strip_tags(text)
Remove HTML tags and normalize whitespace.
- veupath_chatbot.services.research.utils.decode_ddg_redirect(href)
Decode DuckDuckGo redirect URLs.
- veupath_chatbot.services.research.utils.candidate_queries(q)
Generate candidate query variations for fallback searches.
- veupath_chatbot.services.research.utils.looks_blocked(status_code, html)
Check if a response looks like it was blocked by rate limiting.
- veupath_chatbot.services.research.utils.norm_for_match(text)
Normalize text for fuzzy matching.
- veupath_chatbot.services.research.utils.fallback_ratio(a, b)
Fallback similarity ratio using SequenceMatcher.
- veupath_chatbot.services.research.utils.fuzzy_score(query, text)
Calculate fuzzy similarity score between query and text.
- veupath_chatbot.services.research.utils.rerank_score(query, item)
Calculate reranking score for a literature search result.
- veupath_chatbot.services.research.utils.passes_filters(*, title, authors, year, doi, pmid, journal, year_from, year_to, author_includes, title_includes, journal_includes, doi_equals, pmid_equals, require_doi)
Check if a literature result passes all filters.
- Parameters:
title (str) – Result title.
year (int | None) – Publication year.
doi (str | None) – DOI.
pmid (str | None) – PubMed ID.
journal (str | None) – Journal name.
year_from (int | None) – Minimum year filter.
year_to (int | None) – Maximum year filter.
author_includes (str | None) – Author substring filter.
title_includes (str | None) – Title substring filter.
journal_includes (str | None) – Journal substring filter.
doi_equals (str | None) – Exact DOI filter.
pmid_equals (str | None) – Exact PMID filter.
require_doi (bool) – Whether DOI is required.
- Returns:
True if result passes all filters.
- Return type:
- veupath_chatbot.services.research.utils.dedupe_key(item)
Generate a deduplication key for a literature result.
- Parameters:
item (JSONObject) – Item dict.
- Return type:
- async veupath_chatbot.services.research.utils.fetch_page_summary(client, url, *, max_chars)
Fetch and extract a text summary from a web page.
Streams the response and stops reading as soon as </head> is found or 32 KB have been consumed. Meta description tags are checked first; if none are present, the longest <p> in the buffered content is used as a fallback. Returns None for PDFs, Google Scholar links, or on error.
- Return type:
str | None
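The extraction step described for fetch_page_summary (meta description first, longest paragraph as fallback) can be sketched on an already-buffered HTML string. This leaves out the streaming and the PDF and Google Scholar checks, and summarize_html is a hypothetical name; the sketch also caps the input by characters, whereas the documented limit is 32 KB of response data.

```python
from html.parser import HTMLParser
from typing import Optional

MAX_BUFFER = 32 * 1024  # mirrors the documented 32 KB cap (characters here)


class _SummaryParser(HTMLParser):
    """Collects the first meta description and the text of every <p>."""

    def __init__(self) -> None:
        super().__init__(convert_charrefs=True)
        self.meta_description: Optional[str] = None
        self.paragraphs: list[str] = []
        self._in_p = False
        self._buf: list[str] = []

    def handle_starttag(self, tag, attrs):
        attr = dict(attrs)
        if tag == "meta":
            name = (attr.get("name") or attr.get("property") or "").lower()
            content = (attr.get("content") or "").strip()
            if name in {"description", "og:description"} and content:
                if self.meta_description is None:
                    self.meta_description = content
        elif tag == "p":
            self.flush_paragraph()  # tolerate unclosed <p> tags
            self._in_p = True

    def handle_endtag(self, tag):
        if tag == "p":
            self.flush_paragraph()

    def handle_data(self, data):
        if self._in_p:
            self._buf.append(data)

    def flush_paragraph(self) -> None:
        if self._in_p:
            text = " ".join("".join(self._buf).split())
            if text:
                self.paragraphs.append(text)
        self._in_p = False
        self._buf = []


def summarize_html(html_text: str, max_chars: int) -> Optional[str]:
    """Return a short summary, or None when nothing usable is found."""
    parser = _SummaryParser()
    parser.feed(html_text[:MAX_BUFFER])
    parser.close()
    parser.flush_paragraph()  # pick up a paragraph cut off by the buffer cap
    summary = parser.meta_description
    if summary is None and parser.paragraphs:
        summary = max(parser.paragraphs, key=len)
    return summary[:max_chars] if summary else None
```

Using the standard library parser rather than regular expressions keeps the sketch tolerant of attribute order and self-closing meta tags.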
Research Clients
API clients for literature sources. Each implements search for its backend.
All clients inherit from BaseClient, either directly or through StandardClient; both are defined in the base module.
Shared base for literature search API clients.
- class veupath_chatbot.services.research.clients._base.BaseClient(*, timeout_seconds=15.0)
Bases: object
Common initialisation for all literature API clients.
- class veupath_chatbot.services.research.clients._base.StandardClient(*, timeout_seconds=15.0)
Bases: BaseClient
Client with the standard fetch-parse-build search pattern.
Subclasses implement _source_name, _fetch_raw, and _parse_item. The search method is inherited.
- veupath_chatbot.services.research.clients._base.make_citation(*, source, id_prefix, title, url=None, authors=None, year=None, doi=None, pmid=None, snippet=None)
Build a citation dict from common fields.
- Return type:
- veupath_chatbot.services.research.clients._base.build_response(*, query, source, results, citations)
Build the standard client response dict, deduplicating citation tags.
- Return type:
Semantic Scholar API client.
- class veupath_chatbot.services.research.clients.semanticscholar.SemanticScholarClient(*, timeout_seconds=15.0)
Bases: StandardClient
Client for Semantic Scholar API.
Europe PMC API client.
- class veupath_chatbot.services.research.clients.europepmc.EuropePmcClient(*, timeout_seconds=15.0)
Bases: StandardClient
Client for Europe PMC API.
PubMed API client.
- class veupath_chatbot.services.research.clients.pubmed.PubmedClient(*, timeout_seconds=15.0)
Bases: BaseClient
Client for PubMed API.
PubMed requires a multi-step fetch (esearch -> esummary -> optional efetch for abstracts), so it keeps a custom search method. Per-item parsing still goes through _parse_item / _build_results.
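The three-step sequence can be illustrated by building the request URLs for NCBI's public E-utilities endpoints. This sketch only constructs URLs and performs no network calls; plan_requests and the other function names are hypothetical, and the PMIDs are passed in directly where a real client would parse them from the esearch response.

```python
from urllib.parse import urlencode

EUTILS = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"


def esearch_url(query: str, limit: int) -> str:
    """Step 1: search PubMed and get back a list of PMIDs."""
    params = {"db": "pubmed", "term": query, "retmax": limit, "retmode": "json"}
    return f"{EUTILS}/esearch.fcgi?{urlencode(params)}"


def esummary_url(pmids: list[str]) -> str:
    """Step 2: fetch title, authors, and journal metadata for those PMIDs."""
    params = {"db": "pubmed", "id": ",".join(pmids), "retmode": "json"}
    return f"{EUTILS}/esummary.fcgi?{urlencode(params)}"


def efetch_url(pmids: list[str]) -> str:
    """Step 3 (optional): fetch full records, including abstracts, as XML."""
    params = {"db": "pubmed", "id": ",".join(pmids), "rettype": "abstract", "retmode": "xml"}
    return f"{EUTILS}/efetch.fcgi?{urlencode(params)}"


def plan_requests(
    query: str, pmids: list[str], *, limit: int = 5, include_abstract: bool = False
) -> list[str]:
    """List the requests one search would issue, in order."""
    urls = [esearch_url(query, limit), esummary_url(pmids)]
    if include_abstract:
        urls.append(efetch_url(pmids))  # only pay for step 3 when abstracts are wanted
    return urls


plan = plan_requests("malaria kinase", ["1", "2"], include_abstract=True)
```

Because the second request depends on the output of the first, the flow does not reduce to a single fetch, which is consistent with PubMed keeping its own search method.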
Crossref API client.
- class veupath_chatbot.services.research.clients.crossref.CrossrefClient(*, timeout_seconds=15.0)
Bases: StandardClient
Client for Crossref API.
OpenAlex API client.
- class veupath_chatbot.services.research.clients.openalex.OpenAlexClient(*, timeout_seconds=15.0)
Bases: StandardClient
Client for OpenAlex API.
arXiv API client.
- class veupath_chatbot.services.research.clients.arxiv.ArxivClient(*, timeout_seconds=15.0)
Bases: StandardClient
Client for arXiv API.
Preprint site search client (bioRxiv, medRxiv).
- class veupath_chatbot.services.research.clients.preprint.PreprintClient(*, timeout_seconds=15.0)
Bases: BaseClient
Client for preprint site searches via DuckDuckGo.
Preprint search has a unique signature (site, source, include_abstract) and a post-processing step that fetches page summaries, so it keeps a custom search method. Per-item parsing still goes through _parse_item / _build_results.
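For contrast with the custom search methods of PubmedClient and PreprintClient, the fetch-parse-build pattern that StandardClient subclasses inherit might look roughly like the sketch below. Only the three hook names (_source_name, _fetch_raw, _parse_item) come from the documentation above; their signatures, the response shape, and the SketchStandardClient and FakeClient classes are assumptions made for illustration.

```python
import asyncio
from abc import ABC, abstractmethod
from typing import Optional


class SketchStandardClient(ABC):
    """Hypothetical base class: subclasses supply hooks, search() is shared."""

    def __init__(self, *, timeout_seconds: float = 15.0) -> None:
        self.timeout_seconds = timeout_seconds

    @property
    @abstractmethod
    def _source_name(self) -> str: ...

    @abstractmethod
    async def _fetch_raw(self, query: str, limit: int) -> list[dict]: ...

    @abstractmethod
    def _parse_item(self, raw: dict) -> Optional[dict]: ...

    async def search(self, query: str, *, limit: int = 5) -> dict:
        raw_items = await self._fetch_raw(query, limit)         # fetch
        parsed = [self._parse_item(raw) for raw in raw_items]   # parse
        results = [item for item in parsed if item is not None][:limit]
        return {"query": query, "source": self._source_name, "results": results}  # build


class FakeClient(SketchStandardClient):
    """Toy backend that returns canned records instead of calling an API."""

    _source_name = "fake"

    async def _fetch_raw(self, query: str, limit: int) -> list[dict]:
        return [{"t": f"{query} #{i}"} for i in range(3)] + [{"t": ""}]

    def _parse_item(self, raw: dict) -> Optional[dict]:
        # Returning None drops records that lack a usable title.
        return {"title": raw["t"]} if raw["t"] else None


response = asyncio.run(FakeClient().search("kinase", limit=2))
```

With this shape, a backend needs a custom search only when its flow does not fit a single fetch followed by per-item parsing.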