# Docxy – High‑Performance PDF Table Extraction API
## Features

- 🔑 API Key Authentication – Secure access for every endpoint (except health checks).
- ⚡ Asynchronous Processing – Long‑running PDF jobs are handled in the background via Celery (with a fallback to FastAPI `BackgroundTasks`).
- 📊 Smart Table Extraction – Uses Camelot (lattice/stream) as the primary engine and falls back to pdfplumber for borderless tables.
- 📝 Markdown‑Ready Text – Extracted document text includes placeholders like `[Table 1]` that match the sheet names in the generated Excel file.
- 🧩 Hybrid Architecture – Combines multiple extraction libraries for maximum coverage.
- 🐳 Docker Ready – Full `docker-compose` setup for development and production.
- 📈 Scalable – Redis/Celery support for distributed job processing.
- 🛠️ Easy Configuration – All settings via environment variables (`.env`).
## Prerequisites

- Python 3.10+
- Ghostscript (required by Camelot on Windows)
- (Optional) Redis – for the Celery broker and rate limiting
## Installation

```bash
# Clone the repository
git clone https://github.com/arnalph/docxy.git
cd docxy

# Create and activate a virtual environment
python -m venv venv
source venv/bin/activate   # Linux/macOS
venv\Scripts\activate      # Windows

# Install dependencies
pip install -r requirements.txt
```

Copy the example environment file and adjust as needed:

```bash
cp .env.example .env
```

For a local, zero‑dependency run, use:
```env
DATABASE_URL=sqlite+aiosqlite:///./docxy.db
USE_REDIS=False
STORAGE_TYPE=local
UPLOAD_DIR=uploads
```

Run the database migrations and create the initial admin API key:

```bash
alembic upgrade head
python app/core/init_admin.py
```

This will output an API key like `sk_...`. Save it – you'll need it for authentication.
Start the API server:

```bash
python run.py
```

The API will be available at http://localhost:8000, with interactive docs at http://localhost:8000/docs.
If you have Redis enabled, start a worker in another terminal:

```bash
celery -A app.core.celery_app worker --loglevel=info
```

## Authentication

All endpoints (except `/health`) require an API key sent in the `Authorization` header:

```
Authorization: Bearer <your-api-key>
```
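As a sketch, the same header can be attached from Python using only the standard library (the key below is a placeholder; use the one printed by `init_admin.py`):

```python
import urllib.request

API_KEY = "sk_your_key_here"  # placeholder, not a real key
BASE_URL = "http://localhost:8000"

# Build a request that carries the Bearer token; nothing is sent yet,
# so this also works offline for inspecting the header.
req = urllib.request.Request(
    f"{BASE_URL}/api/v1/jobs",
    headers={"Authorization": f"Bearer {API_KEY}"},
)

print(req.get_header("Authorization"))  # Bearer sk_your_key_here
```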
## Endpoints

### Create a Job – `POST /api/v1/jobs`

Upload a PDF for processing.

- Request: `multipart/form-data` with a `file` field (PDF only).
- Response: `201 Created` with the job ID and initial status.

```bash
curl -X POST "http://localhost:8000/api/v1/jobs" \
  -H "Authorization: Bearer sk_..." \
  -F "file=@document.pdf"
```

### Check Job Status

Poll the job's status.

- Response: JSON with `status` (`PENDING`, `PROCESSING`, `COMPLETED`, or `FAILED`), progress, and an error message if any.
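The polling loop can be sketched like this; `fetch_status` stands in for whatever HTTP call retrieves the status JSON above, so the sketch is self-contained:

```python
import time

TERMINAL = {"COMPLETED", "FAILED"}

def wait_for_job(fetch_status, poll_interval=2.0, timeout=300.0):
    """Poll until the job reaches a terminal state.

    `fetch_status` is any callable returning the job-status JSON
    described above (a dict with at least a "status" key).
    """
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        job = fetch_status()
        if job["status"] in TERMINAL:
            return job
        time.sleep(poll_interval)
    raise TimeoutError("job did not finish in time")

# Demo with a fake status source standing in for the HTTP call:
states = iter([{"status": "PENDING"}, {"status": "PROCESSING"}, {"status": "COMPLETED"}])
result = wait_for_job(lambda: next(states), poll_interval=0.0)
print(result["status"])  # COMPLETED
```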
### Get Results

Retrieve the extracted data once the job is `COMPLETED`.

- Response: JSON containing:
  - `download_url` – URL to download the generated Excel file.
  - `full_text` – Markdown text of the PDF with `[Table N]` placeholders.
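Since each `[Table N]` placeholder matches a sheet name in the Excel file, the tables can be re-joined with their surrounding text. A minimal sketch (the `full_text` value is illustrative, not real API output):

```python
import re

# Illustrative full_text as the results endpoint might return it:
full_text = (
    "# Quarterly Report\n\n"
    "Revenue by region is summarised below.\n\n"
    "[Table 1]\n\n"
    "Headcount figures follow.\n\n"
    "[Table 2]\n"
)

# Each placeholder names a sheet in the downloaded Excel workbook.
placeholders = re.findall(r"\[Table \d+\]", full_text)
sheet_names = [p.strip("[]") for p in placeholders]
print(sheet_names)  # ['Table 1', 'Table 2']
```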
### Health Check – `/health`

Public endpoint that returns the health status of the API and its dependencies.

### Admin Dashboard

A simple admin dashboard (no authentication – use only in development).
## Configuration

Key environment variables (see `.env.example` for the full list):

| Variable | Description | Default |
|---|---|---|
| `DATABASE_URL` | Database connection string | `sqlite+aiosqlite:///./docxy.db` |
| `USE_REDIS` | Enable Redis for Celery & rate limiting | `False` |
| `REDIS_URL` | Redis connection URL | – |
| `STORAGE_TYPE` | `local` or `s3` | `local` |
| `UPLOAD_DIR` | Local upload directory | `uploads` |
| `POPPLER_PATH` | Path to Poppler binaries (Windows only) | – |
| `CAMELOT_FLAVOR` | Default Camelot flavor | `lattice` |
| `USE_PDFPLUMBER_FALLBACK` | Fall back to pdfplumber if Camelot finds no tables | `True` |
| `DEBUG` | Enable debug logging | `False` |
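As a rough sketch, these variables could be read with the defaults from the table above (the project may actually use a pydantic settings class; `load_settings` is a hypothetical name):

```python
import os

def _as_bool(value):
    """Interpret common truthy strings from environment variables."""
    return value.strip().lower() in {"true", "1", "yes"}

def load_settings(env=os.environ):
    # Defaults mirror the configuration table above.
    return {
        "DATABASE_URL": env.get("DATABASE_URL", "sqlite+aiosqlite:///./docxy.db"),
        "USE_REDIS": _as_bool(env.get("USE_REDIS", "False")),
        "STORAGE_TYPE": env.get("STORAGE_TYPE", "local"),
        "UPLOAD_DIR": env.get("UPLOAD_DIR", "uploads"),
        "CAMELOT_FLAVOR": env.get("CAMELOT_FLAVOR", "lattice"),
        "USE_PDFPLUMBER_FALLBACK": _as_bool(env.get("USE_PDFPLUMBER_FALLBACK", "True")),
        "DEBUG": _as_bool(env.get("DEBUG", "False")),
    }

settings = load_settings({})  # empty env -> all defaults
print(settings["CAMELOT_FLAVOR"])  # lattice
```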
## Architecture

Docxy is built with a modular, async‑first design:

- FastAPI handles HTTP requests, authentication, and job dispatching.
- SQLAlchemy (async) with Alembic for database migrations.
- Celery (optional) processes PDF jobs in the background; falls back to FastAPI `BackgroundTasks` when Redis is unavailable.
- Camelot + pdfplumber extract tables and full text.
- Storage Service abstracts over the local filesystem and S3/MinIO.
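The Celery-or-`BackgroundTasks` fallback can be sketched as a simple dispatch decision. All names below are illustrative stand-ins, not the project's real identifiers:

```python
def dispatch(job_id, use_redis, celery_send, add_background_task):
    """Route a job to Celery or to an in-process background task.

    celery_send         -- stand-in for something like `task.delay(job_id)`
    add_background_task -- stand-in for `background_tasks.add_task(fn, job_id)`
    """
    if use_redis:
        # Redis available: hand the job to a Celery worker.
        celery_send(job_id)
        return "celery"
    # No Redis: run the job in-process after the response is sent.
    add_background_task(job_id)
    return "background"

# Demo with recording stubs in place of real Celery/FastAPI objects:
sent = []
route = dispatch("job-42", use_redis=False,
                 celery_send=sent.append, add_background_task=sent.append)
print(route, sent)  # background ['job-42']
```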
```
┌────────────┐     ┌────────────┐     ┌────────────┐
│   Client   │───▶│  FastAPI   │───▶│   Celery   │
└────────────┘     └────────────┘     └────────────┘
                         │                  │
                         ▼                  ▼
                   ┌────────────┐     ┌────────────┐
                   │  DB/Redis  │     │ Extraction │
                   │            │     │  Service   │
                   └────────────┘     └────────────┘
```
## Contributing

Contributions are welcome! Please open an issue or submit a pull request. For major changes, please discuss them in an issue first.