feat(tools): add web_fetch builtin tool (HTTP fetch + readable extraction) #586

Open
yaozheng-fang wants to merge 1 commit into main from feat/web-fetch-tool

Conversation

Collaborator

@yaozheng-fang yaozheng-fang commented Jun 5, 2026

Summary

Adds a web_fetch builtin tool at veadk/tools/builtin_tools/web_fetch.py — a plain-HTTP page reader the agent can call to read public URLs. Modeled on OpenClaw's web_fetch.

from veadk import Agent
from veadk.tools.builtin_tools.web_fetch import web_fetch

agent = Agent(..., tools=[web_fetch])

What it does

  • HTTP GET, no JavaScript — Chrome-like User-Agent + Accept-Language; http(s) only.
  • SSRF guard — resolves the host and rejects private / loopback / link-local / reserved / multicast / unspecified addresses, and re-validates every redirect hop (HTTP 3xx and <meta refresh>), bounded to 3 hops.
  • Extraction — HTML → markdown or plain text via a dependency-free coarse converter (keeps headings/links/lists); PDF → text via pypdf (detected by content-type or %PDF magic bytes).
  • Limits / cache — 2 MB download cap (10 MB for PDFs), 30 s timeout, max_chars truncation (hard cap 200k), 15-minute in-process TTL cache.
  • Returns {"url", "title", "content", "truncated"}, or {"error": ...} on failure.
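For reviewers, the address check behind the SSRF guard can be sketched roughly as follows. This is an illustration built on the standard-library `ipaddress` module, not the code in this PR: the names `is_blocked_address`, `resolve_host`, and `validate_url` are hypothetical, and the IPv4-mapped IPv6 handling is an assumption about what a guard of this kind would want.

```python
import ipaddress
import socket
from urllib.parse import urlsplit


def is_blocked_address(ip_text: str) -> bool:
    """Return True if the address is in a range the guard should refuse."""
    ip = ipaddress.ip_address(ip_text)
    # Unwrap IPv4-mapped IPv6 (e.g. ::ffff:127.0.0.1) so the IPv4 rules apply.
    if ip.version == 6 and ip.ipv4_mapped is not None:
        ip = ip.ipv4_mapped
    return (
        ip.is_private
        or ip.is_loopback
        or ip.is_link_local
        or ip.is_reserved
        or ip.is_multicast
        or ip.is_unspecified
    )


def resolve_host(host: str) -> list:
    """Resolve a host name to its IP addresses (hypothetical helper)."""
    infos = socket.getaddrinfo(host, None, proto=socket.IPPROTO_TCP)
    # Drop any IPv6 scope id ("fe80::1%eth0") before parsing.
    return sorted({info[4][0].split("%")[0] for info in infos})


def validate_url(url: str, resolver=resolve_host):
    """Return None if the URL may be fetched, else a short reason string."""
    parts = urlsplit(url)
    if parts.scheme not in ("http", "https"):
        return "unsupported scheme"
    if not parts.hostname:
        return "missing host"
    try:
        addresses = resolver(parts.hostname)
    except OSError:
        return "could not resolve host"
    # Refuse if ANY resolved address is internal, not just the first one.
    if not addresses or any(is_blocked_address(a) for a in addresses):
        return "blocked address"
    return None
```

The same `validate_url` call would be repeated for every redirect target, which is what "re-validates every redirect hop" amounts to in this sketch.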

Signature

def web_fetch(url: str, extract_mode: str = "markdown",
              max_chars: int = 50000, tool_context=None) -> dict
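To make the two return shapes and the `max_chars` limits concrete, here is a small sketch of how a caller might consume the result. `truncate_content` and `describe_result` are hypothetical helpers written for this illustration; only the dictionary keys, the 50000 default, and the 200k hard cap come from the description above, and how the tool itself treats non-positive `max_chars` values is an assumption.

```python
HARD_CAP = 200_000  # hard cap on max_chars stated in the PR description


def truncate_content(text: str, max_chars: int = 50_000):
    """Clamp max_chars to the hard cap and report whether text was cut."""
    limit = min(max(max_chars, 0), HARD_CAP)  # assumption: negatives clamp to 0
    return text[:limit], len(text) > limit


def describe_result(result: dict) -> str:
    """Summarize either result shape: success dict or {"error": ...}."""
    if "error" in result:
        return f"fetch failed: {result['error']}"
    note = " (truncated)" if result.get("truncated") else ""
    title = result.get("title") or result.get("url", "")
    return f"{title}: {len(result.get('content', ''))} chars{note}"
```

Checking for the `"error"` key first keeps the success path free of exception handling, which matches a tool that reports failures as data rather than raising.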

Tested (live)

  • HTML: example.com, en.wikipedia.org/wiki/Volcano, www.sina.com.cn (Chinese decodes correctly).
  • <meta refresh>: sina.com → follows to www.sina.com.cn.
  • PDF: arxiv.org/pdf/1706.03762 → extracts the paper text.
  • SSRF: localhost / 169.254.169.254 (cloud metadata) / 192.168.x / non-http schemes → blocked.
  • Registers cleanly as a tool on a veadk Agent.
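The `<meta refresh>` case is the least standard of the redirects tested above, so a rough sketch of how such a hop can be detected may help. It uses only the standard library; `find_meta_refresh` is a hypothetical name, the parsing is deliberately coarse (a target URL containing `;` would be cut short), and it is not taken from the PR's code.

```python
import re
from html.parser import HTMLParser
from urllib.parse import urljoin


class _MetaRefreshFinder(HTMLParser):
    """Record the target of the first <meta http-equiv="refresh"> tag."""

    def __init__(self):
        super().__init__()
        self.target = None

    def handle_starttag(self, tag, attrs):
        if tag != "meta" or self.target is not None:
            return
        attr = {name: (value or "") for name, value in attrs}
        if attr.get("http-equiv", "").lower() != "refresh":
            return
        # content looks like "0; url=https://example.com/" (quotes optional).
        match = re.search(
            r"url\s*=\s*['\"]?([^'\";]+)", attr.get("content", ""), re.I
        )
        if match:
            self.target = match.group(1).strip()


def find_meta_refresh(html: str, base_url: str):
    """Return the absolute refresh target, or None if the page has none."""
    finder = _MetaRefreshFinder()
    finder.feed(html)
    if finder.target is None:
        return None
    # Resolve relative targets against the page that contained the tag.
    return urljoin(base_url, finder.target)
```

Any URL returned this way would count against the 3-hop budget and pass through the same address validation as an HTTP 3xx redirect before being fetched.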

Limitations (by design)

  • No JavaScript rendering — JS-only pages stay sparse.
  • No socket-level DNS pinning (resolve + re-validate only) — small TOCTOU window; noted in a code comment as a hardening follow-up.
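As one possible direction for the hardening follow-up, the sketch below shows the idea behind pinning: resolve once, validate every answer, then hand the validated IP to the connection layer so a second lookup cannot return a different address. `pick_pinned_address` and the injectable `resolver` argument are hypothetical, and the connection step itself (connecting to the IP while sending the original `Host` header and TLS server name) is left out.

```python
import ipaddress


def _is_public(ip_text: str) -> bool:
    """True if the address is outside the ranges the SSRF guard rejects."""
    ip = ipaddress.ip_address(ip_text)
    return not (
        ip.is_private
        or ip.is_loopback
        or ip.is_link_local
        or ip.is_reserved
        or ip.is_multicast
        or ip.is_unspecified
    )


def pick_pinned_address(host: str, resolver):
    """Resolve once and return the address to connect to, or None.

    The caller connects to the returned IP directly instead of passing the
    host name to the HTTP client, so no second DNS lookup takes place.
    """
    addresses = list(resolver(host))
    if not addresses or not all(_is_public(a) for a in addresses):
        return None
    return addresses[0]
```

With resolve-and-re-validate only, a resolver that answers with a public address during validation and an internal one at connect time can slip through; pinning closes that gap because the validated answer is the one actually used.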

Comment thread veadk/tools/builtin_tools/web_fetch.py Fixed
@yaozheng-fang yaozheng-fang force-pushed the feat/web-fetch-tool branch from 572960c to 338a65c Compare June 5, 2026 11:39
Comment thread veadk/tools/builtin_tools/web_fetch.py Fixed
…tion)

A plain-HTTP web fetch tool at veadk/tools/builtin_tools/web_fetch.py, modeled
on OpenClaw's web_fetch:

- HTTP GET (no JavaScript), Chrome-like headers, http(s) only.
- SSRF guard: rejects private / loopback / link-local / reserved / multicast
  addresses and re-validates every redirect, including <meta refresh> hops
  (bounded by a 3-hop budget).
- Extraction: HTML -> markdown or plain text (coarse, dependency-free) and
  PDF -> text via pypdf (by content-type or %PDF magic bytes).
- Limits: 2MB download cap (10MB for PDFs), 30s timeout, max_chars truncation,
  15-minute in-process TTL cache.
- Returns {"url", "title", "content", "truncated"}, or {"error": ...} on failure.
@yaozheng-fang yaozheng-fang force-pushed the feat/web-fetch-tool branch from 338a65c to de93c39 Compare June 5, 2026 11:52
