# Micro Scraper

## Executive Summary

Micro Scraper is an enterprise-grade, serverless-ready headless web intelligence API built on Next.js, TypeScript, and Puppeteer. It automates extraction of SEO-critical metadata (title, meta description, and H1) from any public webpage using a hardened, timeout-controlled Chromium engine. The system is designed for reliability, scalability, and integration into marketing analytics, lead intelligence, competitive research, and content auditing pipelines.
## Table of Contents

- Project Title
- Executive Summary
- Table of Contents
- Project Overview
- Objectives & Goals
- Acceptance Criteria
- Prerequisites
- Installation & Setup
- API Documentation
- UI / Frontend
- Status Codes
- Features
- Tech Stack & Architecture
- Workflow & Implementation
- Testing & Validation
- Validation Summary
- Verification Tools
- Troubleshooting
- Security & Secrets
- Deployment
- Quick-Start Cheat Sheet
- Usage Notes
- Performance & Optimization
- Enhancements & Features
- Maintenance & Future Work
- Key Achievements
- High-Level Architecture
- Project Structure
- Live Demonstration
- Summary, Closure & Compliance
## Project Overview

Micro Scraper is a production-ready, API-first metadata extraction engine that runs Chromium in a fully headless environment and exposes scraping functionality through a REST endpoint. It hides browser automation complexity from clients while enforcing timeouts, validation, and controlled execution.
## Objectives & Goals

- Provide a zero-UI API for webpage metadata extraction
- Guarantee execution within a 20-second SLA window
- Prevent execution of invalid or malformed URLs (see the validation sketch after this list)
- Support User-Agent spoofing for bot-resistant pages
- Be deployable on serverless platforms such as Vercel
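URL validation can be as simple as attempting WHATWG URL construction and restricting protocols. A minimal sketch of that check, assuming the route accepts only http/https targets (the actual logic lives in route.ts and may differ):

```typescript
// Minimal URL validation sketch; route.ts is the source of truth.
function isValidUrl(raw: string | null): boolean {
  if (!raw) return false;
  try {
    // new URL() throws on malformed input such as "bad" or "http//x".
    const url = new URL(raw);
    // Restrict to web pages; rejects file:, ftp:, javascript:, and friends.
    return url.protocol === "http:" || url.protocol === "https:";
  } catch {
    return false;
  }
}
```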
## Acceptance Criteria

| Category | Requirement | Status |
|---|---|---|
| Metadata | Extract title, meta description, and h1 | Passed |
| Validation | Reject missing or invalid URLs | Passed |
| Timeout | Abort after 20 seconds | Passed |
| Headless Runtime | No GUI dependencies | Passed |
| Bonus | User-Agent override | Passed |
## Prerequisites

- Node.js 18+
- npm 9+
- Chromium auto-download via Puppeteer
- Local or cloud runtime capable of Node.js execution
## Installation & Setup

- Clone the repository
- Install Node dependencies
- Start the Next.js development server
- Call the API from a browser, curl, or Postman
## API Documentation

- Method: GET
- Endpoint: /api/scrape
- Parameters:
  - `url` (required): the target page to scrape
  - `ua` (optional): a custom User-Agent string
- Returns structured JSON with SEO metadata and an HTTP status
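A minimal TypeScript client sketch against a local dev server. The `ScrapeResult` field names here are assumptions for illustration; check route.ts for the exact response shape:

```typescript
// Illustrative client for /api/scrape (Node 18+, built-in fetch).
// Field names in ScrapeResult are assumed, not guaranteed by the API.
interface ScrapeResult {
  title: string | null;
  metaDescription: string | null;
  h1: string | null;
}

async function scrape(target: string, ua?: string): Promise<ScrapeResult> {
  const params = new URLSearchParams({ url: target });
  if (ua) params.set("ua", ua); // optional User-Agent override

  const res = await fetch(`http://localhost:3000/api/scrape?${params}`);
  if (!res.ok) throw new Error(`Scrape failed: HTTP ${res.status}`);
  return (await res.json()) as ScrapeResult;
}

scrape("https://example.com").then(console.log).catch(console.error);
```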
## UI / Frontend

This project exposes a minimal Next.js frontend for hosting but operates primarily as an API. The UI files (layout.tsx, page.tsx, globals.css) provide base Next.js rendering and can be extended into dashboards, monitoring views, or request-testing tools.
## Status Codes

| Code | Description |
|---|---|
| 200 | Successful extraction |
| 400 | Invalid or missing URL |
| 504 | Timeout (page did not complete loading within 20 seconds) |
| 500 | Scraping failure |
## Features

| Category | Capability | Technical Description | Business Impact |
|---|---|---|---|
| Headless Scraping | Chromium Automation | Puppeteer launches isolated Chromium instances per request, enabling JavaScript-rendered pages to be scraped reliably. | Supports modern SPAs and JS-heavy marketing pages. |
| SEO Intelligence | Metadata Extraction | Reads title, meta description, and H1 from DOM for SEO auditing and lead intelligence. | Improves marketing analysis and competitor research. |
| Reliability | Timeout Guard | A Promise.race-based guard aborts slow or hanging pages after 20 seconds. | Prevents resource exhaustion in production. |
| Stealth Mode | User-Agent Spoofing | An optional UA override can bypass naive bot detection and CDN filtering. | Higher scrape success on protected sites. |
| Cloud Ready | Serverless Execution | Runs on Vercel Node.js runtime with no browser GUI. | Zero-ops deployment. |
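The timeout guard in the table above is the classic Promise.race pattern: race the scrape against a rejecting timer. A sketch of the pattern (route.ts may differ in detail):

```typescript
// Promise.race timeout pattern; whichever promise settles first wins.
function withTimeout<T>(work: Promise<T>, ms: number): Promise<T> {
  let timer: NodeJS.Timeout | undefined;
  const timeout = new Promise<never>((_, reject) => {
    timer = setTimeout(() => reject(new Error(`Timed out after ${ms} ms`)), ms);
  });
  // Always clear the timer so it cannot keep the process alive.
  return Promise.race([work, timeout]).finally(() => clearTimeout(timer));
}

// Usage: enforce the 20-second SLA around a hypothetical scrapePage().
// const data = await withTimeout(scrapePage(url), 20_000);
```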
## Tech Stack & Architecture

Client → URL Request → Headless Chromium → DOM Scan → SEO Metadata → JSON Response
| Layer | Technology | Purpose |
|---|---|---|
| API Gateway | Next.js App Router | Exposes REST endpoint, request validation, response formatting |
| Execution Engine | Node.js Runtime | Controls Puppeteer lifecycle |
| Browser Automation | Puppeteer + Chromium | Loads web pages, executes JS, reads DOM |
| Data Layer | In-Memory Objects | Holds extracted metadata before JSON serialization |
```
┌──────────────┐
│    Client    │
└──────┬───────┘
       │
       ▼
┌──────────────┐
│ Next.js API  │
│ Validation   │
│ Timeout Ctrl │
└──────┬───────┘
       │
       ▼
┌──────────────┐
│  Puppeteer   │
│  Controller  │
└──────┬───────┘
       │
       ▼
┌──────────────┐
│   Headless   │
│   Chromium   │
└──────┬───────┘
       │
       ▼
┌──────────────┐
│ DOM Extract  │
│ title / h1   │
└──────┬───────┘
       │
       ▼
 JSON Response
```
## Workflow & Implementation

- Client submits URL to /api/scrape
- Next.js validates URL format
- Puppeteer launches a new Chromium instance
- Timeout watchdog starts (20s)
- Page navigates and waits for network idle
- DOM is queried for title, meta description, H1
- Data is serialized into JSON
- Browser instance is destroyed
- Response is returned to client
Request → Validation → Browser Launch → Page Load → DOM Parse → Timeout Guard → JSON Output
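Condensing those steps into code, a simplified sketch of a route handler along these lines (the real route.ts is the source of truth; error mapping and launch options here are assumptions):

```typescript
// Simplified sketch of the scrape lifecycle; not the literal route.ts.
import { NextRequest, NextResponse } from "next/server";
import puppeteer from "puppeteer";

export async function GET(req: NextRequest) {
  const target = req.nextUrl.searchParams.get("url");
  const ua = req.nextUrl.searchParams.get("ua");

  // Step 2: validate before touching the browser.
  let valid = false;
  try {
    const u = new URL(target ?? "");
    valid = u.protocol === "http:" || u.protocol === "https:";
  } catch { /* falls through as invalid */ }
  if (!valid) {
    return NextResponse.json({ error: "Invalid or missing URL" }, { status: 400 });
  }

  // Step 3: one isolated Chromium instance per request.
  const browser = await puppeteer.launch({ headless: true });
  try {
    const page = await browser.newPage();
    if (ua) await page.setUserAgent(ua); // optional UA override

    // Steps 4-5: navigation with a 20 s budget, waiting for network idle.
    await page.goto(target!, { waitUntil: "networkidle2", timeout: 20_000 });

    // Step 6: read the three SEO fields straight from the DOM.
    const data = await page.evaluate(() => ({
      title: document.title || null,
      metaDescription: document
        .querySelector('meta[name="description"]')
        ?.getAttribute("content") ?? null,
      h1: document.querySelector("h1")?.textContent?.trim() ?? null,
    }));
    return NextResponse.json(data, { status: 200 }); // steps 7 and 9
  } catch (err) {
    const message = err instanceof Error ? err.message : "Scraping failure";
    const status = /timeout/i.test(message) ? 504 : 500;
    return NextResponse.json({ error: message }, { status });
  } finally {
    await browser.close(); // step 8: always destroy the instance
  }
}
```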
## Testing & Validation

| ID | Area | Command | Expected Output | Explanation |
|---|---|---|---|---|
| T1 | Valid URL | curl "http://localhost:3000/api/scrape?url=https://example.com" | 200 + metadata | Normal scrape |
| T2 | Invalid URL | curl "http://localhost:3000/api/scrape?url=bad" | 400 | Validation check |
| T3 | Timeout | curl with a known slow or unresponsive target | 504 | Timeout enforcement |
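T1 and T2 can also be scripted as a smoke test with Node 18's built-in fetch (host and expectations mirror the table; start the dev server first):

```typescript
// Smoke test mirroring T1/T2 against a local dev server on port 3000.
const base = "http://localhost:3000/api/scrape";

async function expectStatus(query: string, expected: number): Promise<void> {
  const res = await fetch(`${base}?${query}`);
  console.log(`${query} -> ${res.status} (expected ${expected})`);
  if (res.status !== expected) throw new Error("Unexpected status");
}

async function main() {
  await expectStatus("url=" + encodeURIComponent("https://example.com"), 200); // T1
  await expectStatus("url=bad", 400); // T2
}

main().catch((err) => {
  console.error(err);
  process.exit(1);
});
```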
## Validation Summary

- All request paths validated before execution
- Timeout logic prevents infinite execution
- Error codes mapped to HTTP semantics
- UA override verified in runtime testing
- Compatible with serverless deployment
## Troubleshooting

- Timeouts: check target website latency
- Chromium errors: use no-sandbox flags in restricted environments (see the sketch after this list)
- 400 errors: verify the URL format, including the scheme (https://...)
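For the Chromium case: `--no-sandbox` and `--disable-setuid-sandbox` are real Chromium flags commonly needed in containers and CI runners. Whether route.ts sets them is environment-dependent, so treat this as a sketch:

```typescript
import puppeteer from "puppeteer";

// Launch options for sandbox-restricted environments (containers, CI).
// Disabling the sandbox weakens isolation; use only where required.
async function launchRestricted() {
  return puppeteer.launch({
    headless: true,
    args: ["--no-sandbox", "--disable-setuid-sandbox"],
  });
}
```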
## Security & Secrets

- No credentials stored
- Sandboxed Chromium execution
- .env support for future tokens
## Deployment

- Deploy to Vercel as a Next.js serverless API
- Supports Node.js production servers
## Quick-Start Cheat Sheet

- `npm install`
- `npm run dev`
- Call `/api/scrape`
## Usage Notes

- Designed for SEO audits, marketing intelligence, and automation
- Supports single-page requests for high reliability
- Should be fronted by rate-limiters in production
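Rate limiting is not built in. A minimal in-memory fixed-window sketch of what fronting the route could look like; note that in-memory state neither survives serverless cold starts nor scales across instances, where a shared store (e.g. Redis) is needed:

```typescript
// Illustrative fixed-window limiter; not suitable as-is for multi-instance
// serverless deployments, where counters must live in a shared store.
const WINDOW_MS = 60_000;  // 1-minute window
const MAX_REQUESTS = 30;   // per client per window (arbitrary example)

const hits = new Map<string, { count: number; windowStart: number }>();

export function allowRequest(clientIp: string): boolean {
  const now = Date.now();
  const entry = hits.get(clientIp);
  if (!entry || now - entry.windowStart >= WINDOW_MS) {
    hits.set(clientIp, { count: 1, windowStart: now });
    return true;
  }
  entry.count += 1;
  return entry.count <= MAX_REQUESTS;
}
```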
## Performance & Optimization

Network-idle waits, strict timeouts, and minimal DOM extraction keep latency low and resource usage controlled.
## Enhancements & Features

- Multi-page crawling
- Content extraction
- Screenshot capture
- Headless browser upgrades
## Maintenance & Future Work

- Rate limiting
- Cache layers
## Key Achievements

- Production-ready scraping API
- Serverless-compatible design
- Strict SLA enforcement
## High-Level Architecture

```
User
  │
  ▼
API Gateway (Next.js)
  │
  ▼
Input Validation
  │
  ▼
Execution Controller
  │
  ▼
Headless Browser
  │
  ▼
Target Website
  │
  ▼
DOM Analyzer
  │
  ▼
Metadata Normalizer
  │
  ▼
JSON Formatter
  │
  ▼
Client Response
```
## Project Structure

```
MICRO-SCRAPER/
├── app/
│   ├── api/
│   │   └── scrape/
│   │       └── route.ts   (Scraping controller)
│   ├── globals.css        (Global styles)
│   ├── layout.tsx         (Next.js layout)
│   └── page.tsx           (Base UI)
├── public/
├── screenshots/
├── .next/
├── package.json
├── tsconfig.json
├── next.config.ts
└── README.md
```
## Live Demonstration

```bash
npm run dev
curl "http://localhost:3000/api/scrape?url=https://example.com"
```
## Summary, Closure & Compliance

Micro Scraper follows modern API engineering practices, cloud deployment standards, and headless-browser execution best practices. It provides a secure, scalable, and reliable way to extract high-value marketing metadata, making it suitable for enterprise automation, SaaS platforms, and AI pipelines.