Visual element labeling for AI browser agents.
Vulpine Mark annotates browser screenshots with numbered badges over interactive elements (@1, @2, @3...) and returns a structured element map alongside the image. AI agents can then say "click @14" instead of guessing pixel coordinates — boosting click accuracy by 20-30% on visual grounding benchmarks.
Agents that drive browsers from screenshots alone are bad at clicking. Microsoft's Set-of-Mark (SoM) paper showed that overlaying numbered labels on interactive elements closes most of the gap with element-handle-based agents. The catch: existing implementations need 11GB GPUs (OmniParser), are tied to research benchmarks (WebArena), or are proprietary (Anthropic Computer Use).
Vulpine Mark is the first standalone, GPU-free, browser-agnostic SoM tool. It reads element positions directly from the live DOM via CDP and paints labels in pure Go — no ML models, no GPU, no JS injection beyond a single Runtime.evaluate call.
vulpine-mark is the open-source Set-of-Mark library in the VulpineOS
stack. The v0.1.x line is intended to stay small and reusable:
- standalone Go package + CLI
- CDP-driven element enumeration
- image and structured output generation
- no tenant, billing, or hosted-browser concerns
For release notes, see CHANGELOG.md. For the release checklist used for tags, see RELEASING.md.
- Zero ML overhead — reads bounding rects from the DOM, draws PNG labels in Go
- CDP-native — works with any browser exposing CDP (Chrome, Edge, Camoufox via foxbridge)
- Color-coded by role — green buttons, blue links, purple inputs, orange selects
- Viewport-aware — only labels elements actually visible to the agent
- JSON element map — structured `{label, tag, role, text, x, y, w, h}` for every badge
- CLI + Go library — drop into pipelines or import as a package
```sh
go install github.com/VulpineOS/vulpine-mark/cmd/vulpine-mark@latest
```

```sh
# Connect to a running browser, annotate the active page
vulpine-mark --cdp http://localhost:9222 --output annotated.png --json elements.json

# Full-page capture with a colorblind-safe palette and a max-element cap
vulpine-mark \
  --cdp http://localhost:9222 \
  --full-page \
  --max-elements 25 \
  --palette colorblind \
  --output annotated.png \
  --json elements.json
```

Add `--quiet` when scripting to suppress the CLI progress lines on stderr and keep only the requested output artifacts.
The repo ships deterministic demo pages under docs/demo-pages
for validating the main output modes:
- `stability.html` — repeated controls for label-stability checks
- `dpr.html` — compact controls for retina / zoom badge placement
- `full-page.html` — below-the-fold sections for scroll-and-stitch capture
- `clustered-catalog.html` — repeated cards for cluster-mode verification
- `diff-modal.html` — click-triggered modal for diff-only labeling
Serve them locally with:
```sh
python3 -m http.server 8123 --directory docs/demo-pages
```

Then navigate a CDP browser to http://127.0.0.1:8123/index.html and point vulpine-mark at the active page.
```go
import (
    "context"

    "github.com/VulpineOS/vulpine-mark/pkg/vulpinemark"
)

ctx := context.Background()
mark, err := vulpinemark.New(ctx, "http://localhost:9222")
if err != nil {
    panic(err)
}
defer mark.Close()

result, err := mark.Annotate(ctx)
// result.Image    - annotated PNG bytes
// result.Elements - map[string]Element keyed by "@N"
// result.Labels   - ordered []string
```

End-to-end: annotate the active page, persist the labeled PNG, then drive a click by label.
```go
package main

import (
    "context"
    "fmt"
    "os"

    "github.com/VulpineOS/vulpine-mark/pkg/vulpinemark"
)

func main() {
    ctx := context.Background()

    // Connect to a running Chrome / Camoufox with --remote-debugging-port=9222.
    mark, err := vulpinemark.New(ctx, "http://localhost:9222")
    if err != nil {
        panic(err)
    }
    defer mark.Close()

    // Capture the viewport and label every interactive element.
    result, err := mark.Annotate(ctx)
    if err != nil {
        panic(err)
    }

    if err := os.WriteFile("annotated.png", result.Image, 0o644); err != nil {
        panic(err)
    }
    fmt.Printf("labeled %d elements\n", len(result.Labels))

    // Pick a label — an AI agent would read the PNG and reply with one.
    if _, ok := result.Elements["@1"]; ok {
        if err := mark.Click(ctx, "@1"); err != nil {
            panic(err)
        }
    }

    // Type into a labeled input, then hover another element.
    _ = mark.Type(ctx, "@3", "hello world")
    _ = mark.Hover(ctx, "@7")
}
```

Use `mark.AnnotateFullPage()` (or the `--full-page` CLI flag) to capture and label the entire scrollable page, including elements currently below the fold.
A typical product grid or list of search results has dozens of visually
identical items. Labeling each one individually drowns the agent in
@1, @2, @3... @47 label soup. Cluster mode detects
repeated shapes and groups them under a single label — members are
accessed with bracket syntax.
```go
result, err := mark.AnnotateClustered(ctx)
// result.Clusters[0].Label == "@C1"
// result.Clusters[0].Members has N Elements

// Click the 3rd item in cluster @C1.
_ = mark.Click(ctx, "@C1[3]")
```

The annotated image draws a single amber outline around the union of all cluster members plus one `@C<N>[1..count]` badge at the top-left, instead of one badge per member. Cluster labels use the `@C<N>` namespace so they never collide with ungrouped element labels (`@1`, `@2`, ...). Elements that don't fit any cluster are labeled individually as before.
Given a previous `Result` from an earlier annotate, `AnnotateDiff` re-captures the page and labels only elements that are new or have moved. Useful for modal detection, before/after action verification, and keeping agent prompts focused on what actually changed.
```go
before, _ := mark.Annotate(ctx)
_ = mark.Click(ctx, "@3")
after, _ := mark.AnnotateDiff(ctx, before)
// after.Image highlights only new or moved elements.
// Newly-appeared elements are labeled "*@1", "*@2", ...
// Moved elements are labeled "~@1", "~@2", ... so the agent
// can tell fresh UI (e.g. a modal) from mere layout shifts.
```

Four built-in color palettes ship with the library: `DefaultPalette`, `HighContrastPalette`, `MonochromePalette`, and `ColorblindSafePalette` (Wong, Nature Methods 2011). Swap palettes at runtime:

```go
mark.SetPalette(vulpinemark.ColorblindSafePalette)
```

From the CLI:

```sh
vulpine-mark --palette colorblind --output annotated.png
# Also: default, high-contrast, monochrome
```

The palette controls element badges and the cluster outline color, so downstream color-based filters (e.g. "find all buttons") stay consistent no matter which scheme you pick.
`AnnotateSVG` returns the same `Result` populated with a vector overlay in `Result.SVG`. The overlay is sized to match the raster screenshot so frontends can layer it with CSS and toggle labels on and off:

```go
result, _ := mark.AnnotateSVG(ctx)
os.WriteFile("overlay.svg", []byte(result.SVG), 0o644)
```

Or via CLI:

```sh
vulpine-mark --svg overlay.svg --output screenshot.png
```

By default, labels are assigned by enumeration order, so `@5` may refer to a different element after a re-annotate if elements shift. Opt into semantic-hash labels:

```go
mark.UseStableLabels(true)
```

Labels are then derived from `(role, accessible-name, rounded-x, rounded-y)`, so the same logical element keeps the same label across re-renders as long as its role, name, and rough position are unchanged. Useful for long-running agent loops that re-annotate between every action.
On dense pages, cap the number of labeled elements so the agent prompt stays readable. Clusters each count as one toward the cap; the lowest-confidence elements are dropped first.

```go
mark.SetMaxElements(20)
// or --max-elements 20
```

Instead of discrete numbered badges, `AnnotateHeatmap` renders a translucent overlay whose alpha is proportional to each element's importance (`confidence * log(area + 1)`), normalized across the returned set so the single most prominent element always saturates the palette. Useful for visual triage at a glance — "where should the agent look first?"
```go
result, _ := mark.AnnotateHeatmap(ctx)
os.WriteFile("heatmap.png", result.Image, 0o644)
```

Or from the CLI:

```sh
vulpine-mark --heatmap --output heatmap.png
```

The heatmap honors the currently-configured palette and any `SetElementFilter` callback, so role-restricted heatmaps work out of the box:

```go
mark.SetElementFilter(vulpinemark.IncludeRoles("button", "link"))
result, _ := mark.AnnotateHeatmap(ctx)
```

When you only need the structured element list (e.g. feeding a text-only LLM or building a selector index), skip screenshot capture entirely:
```go
result, _ := mark.AnnotateJSON(ctx)
// result.Image == nil
// result.Elements, result.Labels populated as usual
```

From the CLI, `--json-only` writes the element map to `--json` (or to stdout if no JSON path is given) and never touches `--output`:

```sh
vulpine-mark --json-only > elements.json
```

For fine-grained control over which elements get labeled, pass a custom `ElementFilter`. It runs after enumeration and visibility / occlusion checks.
```go
mark.SetElementFilter(func(el vulpinemark.Element) bool {
    return el.Role == "button" && el.Confidence > 0.5
})
```

Two helpers ship for the common cases:

```go
mark.SetElementFilter(vulpinemark.IncludeRoles("button", "link"))
mark.SetElementFilter(vulpinemark.ExcludeRoles("checkbox", "radio"))
```

From the CLI:

```sh
vulpine-mark --include-role button,link --exclude-role checkbox
```

Every `Element` returned by the library carries a `Confidence` field in `[0, 1]` computed from accessible-name presence, aria-label, area, occlusion, and clipped-overflow status. Labels with confidence below 0.3 are rendered in a muted gray so the agent knows to be cautious about acting on them.
Early MVP. Currently supports Chrome and Camoufox/foxbridge over CDP WebSocket. iOS Safari support is on the roadmap via MobileBridge.
vulpine-mark is intentionally the browser-facing labeling library, not
the hosted API implementation. If you are wiring it into a service, the
recommended boundary is:
- your service owns browser/session lifecycle
- your service resolves a page websocket URL
- vulpine-mark connects to that page and returns labels + image data
- your service maps that result into its own endpoint contract

The public adapter spec used by Vulpine API is documented in docs/visual-extraction-api-adapter.md.
MIT