Implement Ask AI over a published Sourcey docs site.
langchain-sourcey reads Sourcey's generated search and LLM artefacts and
returns canonical page URLs for citation.
Sourcey already emits the files a retriever needs:
- search-index.json for candidate discovery
- llms-full.txt for full-page hydration
- canonical page URLs for citations
No hosted index is required. Point site_url at the docs root and use it.
pip install -U langchain-sourcey

Point site_url at the root of a published Sourcey build:
- https://sourcey.com/docs
- https://sourcey.com/cheesestore
- https://cheesestore.github.io
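To illustrate what "root of a published Sourcey build" means in practice, here is a minimal sketch of how the two artefact URLs could be derived from site_url. It assumes both files sit directly under the docs root; the helper name artefact_urls is hypothetical and is not part of langchain-sourcey.

```python
from urllib.parse import urljoin


def artefact_urls(site_url: str) -> dict[str, str]:
    """Return the assumed locations of Sourcey's retrieval artefacts.

    Assumption: search-index.json and llms-full.txt are published
    directly under the docs root.
    """
    # A trailing slash makes urljoin treat the last path segment as a directory.
    root = site_url.rstrip("/") + "/"
    return {
        "search_index": urljoin(root, "search-index.json"),
        "llms_full": urljoin(root, "llms-full.txt"),
    }


urls = artefact_urls("https://sourcey.com/docs")
print(urls["search_index"])  # https://sourcey.com/docs/search-index.json
print(urls["llms_full"])     # https://sourcey.com/docs/llms-full.txt
```

If either URL returns a 404 in a browser, site_url is probably pointing at the wrong level of the site.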
from langchain_sourcey import SourceyRetriever

retriever = SourceyRetriever(
    site_url="https://sourcey.com/docs",
    top_k=3,
)

docs = retriever.invoke("mcp integration")
for doc in docs:
    print(doc.metadata["title"])
    print(doc.metadata["source"])
    print(doc.page_content[:160])
    print()

For a runnable script, see examples/live_quickstart.py.
More context: Sourcey guide
Install a chat model package. This example uses OpenAI:
pip install -U langchain-openai

from langchain_core.output_parsers import StrOutputParser
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.runnables import RunnablePassthrough
from langchain_openai import ChatOpenAI
from langchain_sourcey import SourceyRetriever

retriever = SourceyRetriever(site_url="https://sourcey.com/docs", top_k=3)

prompt = ChatPromptTemplate.from_template(
    """Answer the question using the documentation context below.
{context}
Question: {question}"""
)

chain = (
    RunnablePassthrough.assign(context=(lambda x: x["question"]) | retriever)
    | prompt
    | ChatOpenAI(model="gpt-4.1-mini")
    | StrOutputParser()
)

answer = chain.invoke({"question": "How does Sourcey document MCP servers?"})
print(answer)

For a fuller example, see examples/rag_chain.py.
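A common LangChain pattern is to flatten retrieved documents into a single string before they reach the prompt, so the model sees clean text with a source line per page. The sketch below shows one way to do that. The Doc class is a hypothetical stand-in for a LangChain Document so the snippet runs without extra packages; format_docs and max_chars are illustrative names, not part of langchain-sourcey, and the sample URLs are made up.

```python
from dataclasses import dataclass, field


@dataclass
class Doc:
    """Hypothetical stand-in for a LangChain Document (sketch only)."""

    page_content: str
    metadata: dict = field(default_factory=dict)


def format_docs(docs, max_chars: int = 1200) -> str:
    """Join retrieved pages into one context string, each tagged with its source."""
    blocks = []
    for doc in docs:
        # "source" is the canonical page URL the retriever attaches to each result.
        source = doc.metadata.get("source", "unknown source")
        # Truncate long pages so the prompt stays within a predictable size.
        body = doc.page_content[:max_chars].strip()
        blocks.append(f"Source: {source}\n{body}")
    return "\n\n".join(blocks)


# Placeholder data; the URLs and text below are invented for illustration.
sample = [
    Doc("First page text.", {"source": "https://example.com/docs/a.html"}),
    Doc("Second page text.", {"source": "https://example.com/docs/b.html"}),
]
print(format_docs(sample))
```

Because LangChain coerces plain functions into runnables when they are piped, a function like this could be slotted into a chain as `retriever | format_docs`, assuming the retriever returns objects with page_content and metadata attributes.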
For predictable retrieval, the published Sourcey site should meet these requirements:
- publish search-index.json
- publish llms-full.txt
- set siteUrl in sourcey.config.ts so citations are canonical
search-index.json is required.
llms-full.txt is strongly recommended. If it is missing, the retriever falls
back to the matched page HTML.
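The required and recommended rule above can be sketched as a small preflight check to run before wiring up a retriever. This is an illustration of the stated behaviour, not the package's internal logic: plan_hydration and http_exists are hypothetical helpers, and the check assumes both files live directly under the docs root.

```python
import urllib.error
import urllib.request
from typing import Callable


def http_exists(url: str, timeout: float = 10.0) -> bool:
    """Return True if a HEAD request for url succeeds (some hosts may reject HEAD)."""
    request = urllib.request.Request(url, method="HEAD")
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return 200 <= response.status < 300
    except OSError:  # URLError and HTTPError are both OSError subclasses
        return False


def plan_hydration(site_url: str, exists: Callable[[str], bool] = http_exists) -> str:
    """Report which hydration source a site can support.

    Raises RuntimeError when the required search index is missing.
    """
    root = site_url.rstrip("/")
    if not exists(f"{root}/search-index.json"):
        raise RuntimeError("search-index.json is required but was not found")
    if exists(f"{root}/llms-full.txt"):
        return "llms-full.txt"
    # Without llms-full.txt, page content has to come from the matched HTML.
    return "page-html"


# Offline demonstration with a fake set of published files.
published = {"https://example.com/docs/search-index.json"}
print(plan_hydration("https://example.com/docs", published.__contains__))  # page-html
```

Passing the existence check in as a parameter keeps the decision logic testable without network access; calling plan_hydration(site_url) with the default performs real HEAD requests.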
Each returned Document includes:
- source: canonical page URL used for citations
- matched_url: original matched URL, including anchors when relevant
- matched_title: matched search entry title
- title: hydrated page title
- path: Sourcey output path such as guides/search.html
- anchor: matched fragment, if any
- tab: Sourcey tab label
- category: Sourcey search category
- site_url: docs root used for retrieval
- score: retriever ranking score
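Since matched_url can carry an anchor while source is the canonical page URL, several results may point at the same page. A minimal sketch of building a de-duplicated citation list from the metadata fields listed above follows. The helper unique_citations is hypothetical, the sample entries are invented, and SimpleNamespace stands in for real Document objects.

```python
from types import SimpleNamespace


def unique_citations(docs) -> list[tuple[str, str]]:
    """Return (title, source) pairs, keeping the first hit for each canonical URL."""
    seen = set()
    citations = []
    for doc in docs:
        source = doc.metadata["source"]
        if source in seen:
            continue  # another anchor on a page that is already cited
        seen.add(source)
        # Fall back to the URL itself when no hydrated title is available.
        citations.append((doc.metadata.get("title", source), source))
    return citations


# Invented sample results: two anchors on one page, plus a second page.
results = [
    SimpleNamespace(metadata={
        "title": "Search",
        "source": "https://example.com/docs/guides/search.html",
        "matched_url": "https://example.com/docs/guides/search.html#index",
    }),
    SimpleNamespace(metadata={
        "title": "Search",
        "source": "https://example.com/docs/guides/search.html",
        "matched_url": "https://example.com/docs/guides/search.html#ranking",
    }),
    SimpleNamespace(metadata={
        "title": "Config",
        "source": "https://example.com/docs/guides/config.html",
    }),
]

for title, url in unique_citations(results):
    print(f"{title}: {url}")
```

Results are kept in retrieval order, so the highest-ranked match for each page determines where its citation appears.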
python -m pip install -e ".[dev]" build twine
PYTHONPATH=src pytest -q
SOURCEY_TEST_SITE_URL=https://sourcey.com/docs PYTHONPATH=src pytest tests/integration_tests/test_live_retriever.py -q
python -m build
python -m twine check dist/*

See CONTRIBUTING.md for the release and verification flow.
This repo includes draft docs ready to turn into a LangChain docs PR:
This repo also contains the JavaScript package in js.
- npm package: langchain-sourcey
- draft JS docs: docs/langchain-js/provider-sourcey.mdx
- draft JS docs: docs/langchain-js/retriever-sourcey.mdx
This package ships SourceyRetriever only. No loader yet.