feat(llm): add async document and receipt extraction pipelines
- Introduce async_document_extraction_pipeline.py for generic document extraction with parallel processing.
- Implement async_extract_receipts_pipeline.py for receipt extraction, leveraging async capabilities for improved performance.
- Update README to reflect new async pipelines, highlighting performance benefits and usage examples.
- Include receipt-specific schemas and data transformations for structured extraction.
Benefits:
- Significant speed improvements for batch processing through concurrent LLM calls and parallel image processing.
- Enhanced user experience with clear examples and documentation for both sync and async usage.
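
The batch speedup described in the Benefits list comes from overlapping the time spent waiting on each request. A minimal sketch of that effect, assuming a hypothetical `fake_llm_call` coroutine that only sleeps in place of a real LLM request; none of these names come from the pipeline modules in this commit:

```python
import asyncio
import time


async def fake_llm_call(doc_id: int, latency: float = 0.05) -> str:
    """Hypothetical stand-in for an LLM request; it only sleeps."""
    await asyncio.sleep(latency)
    return f"result-{doc_id}"


async def run_sequential(doc_ids: list[int]) -> list[str]:
    # One call at a time: total time is roughly len(doc_ids) * latency.
    return [await fake_llm_call(d) for d in doc_ids]


async def run_concurrent(doc_ids: list[int]) -> list[str]:
    # All calls in flight at once: total time is roughly one latency.
    return list(await asyncio.gather(*(fake_llm_call(d) for d in doc_ids)))


def timed(coro_fn, doc_ids):
    """Run a coroutine function and return (results, elapsed seconds)."""
    start = time.perf_counter()
    results = asyncio.run(coro_fn(doc_ids))
    return results, time.perf_counter() - start


if __name__ == "__main__":
    ids = list(range(8))
    _, seq_time = timed(run_sequential, ids)
    _, con_time = timed(run_concurrent, ids)
    print(f"sequential: {seq_time:.2f}s, concurrent: {con_time:.2f}s")
```

Real gains depend on API rate limits and per-request latency, so the ratio measured here is only illustrative.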
 - **Asynchronous** (`async_document_extraction_pipeline.py`): Parallel processing with 3-10x speedup
+
+For batch processing of multiple documents, the async version provides significant performance improvements through concurrent LLM calls and parallel image processing.
+
 ## Quick Start
 
 ### Option 1: Use the Receipt Pipeline
 
-Run the ready-to-use receipt extraction pipeline:
+**Async Version (Recommended for batch processing):**
+
+```bash
+uv run async_extract_receipts_pipeline.py
+```
+
+**Sync Version (Simple, sequential processing):**
 
 ```bash
 uv run extract_receipts_pipeline.py
 ```
 
-The receipt pipeline includes:
+Both receipt pipelines include:
 - `Receipt` and `ReceiptItem` Pydantic schemas
-- Receipt-specific data transformations
+- Receipt-specific data transformations (uppercase company, parse dates)
 - Pre-configured extraction prompt
+- Image scaling for better OCR
 - Example usage in `__main__` block
 
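
The receipt-specific transformations named in the list (uppercase company, parse dates) might look roughly like the sketch below. `transform_receipt`, `parse_date`, the field names `company` and `date`, and the accepted date formats are all assumptions for illustration; the pipeline's actual schema and transform code are not shown in this diff.

```python
from datetime import date, datetime
from typing import Optional

# Assumed date formats; the real pipeline may accept others.
DATE_FORMATS = ("%Y-%m-%d", "%d/%m/%Y", "%d-%m-%Y")


def parse_date(raw: Optional[str]) -> Optional[date]:
    """Try each assumed format in turn; return None if none match."""
    if not isinstance(raw, str) or not raw.strip():
        return None
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(raw.strip(), fmt).date()
        except ValueError:
            continue
    return None


def transform_receipt(record: dict) -> dict:
    """Hypothetical transform: uppercase the company, parse the date."""
    out = dict(record)  # copy so the caller's record is not mutated
    company = out.get("company")
    out["company"] = company.strip().upper() if isinstance(company, str) else None
    out["date"] = parse_date(out.get("date"))
    return out


if __name__ == "__main__":
    print(transform_receipt({"company": " Acme Mart ", "date": "25/12/2018"}))
```

Returning `None` for unparseable dates, rather than raising, keeps one bad receipt from aborting a whole batch; whether the real pipeline does the same is not stated.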
-### Option 2: Create Your Own Pipeline
+The async version processes 4 receipts in parallel by default and includes progress indicators.
+
+### Option 2: Create Your Own Pipeline (Synchronous)
 
 Import the generic pipeline and create a custom extractor:
 
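
Processing a fixed number of receipts at a time while reporting progress is commonly done with `asyncio.Semaphore` and `asyncio.as_completed`. The sketch below assumes that approach. `process_receipt`, `process_all`, and the `max_concurrent=4` default are hypothetical names chosen to mirror the README text above, not the pipeline's actual API.

```python
import asyncio


async def process_receipt(path: str) -> dict:
    """Hypothetical placeholder for image loading plus an LLM call."""
    await asyncio.sleep(0.01)
    return {"file": path}


async def process_all(paths: list[str], max_concurrent: int = 4) -> list[dict]:
    semaphore = asyncio.Semaphore(max_concurrent)

    async def bounded(index: int, path: str) -> tuple[int, dict]:
        # At most max_concurrent coroutines hold the semaphore at once.
        async with semaphore:
            return index, await process_receipt(path)

    tasks = [asyncio.create_task(bounded(i, p)) for i, p in enumerate(paths)]
    results: list = [None] * len(paths)
    done = 0
    for finished in asyncio.as_completed(tasks):
        index, result = await finished
        results[index] = result  # store by index to keep input order
        done += 1
        print(f"[{done}/{len(paths)}] processed {paths[index]}")
    return results


if __name__ == "__main__":
    files = [f"receipt_{i}.jpg" for i in range(10)]
    asyncio.run(process_all(files))
```

Tasks finish in arbitrary order, so the progress lines may not follow input order even though the returned list does.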
@@ -89,6 +109,72 @@ if __name__ == "__main__":
 print(result_df)
 ```
 
+### Option 3: Create Your Own Pipeline (Async - Recommended for Batch Processing)
+
+Import the async pipeline for better performance with multiple documents:
+
+```python
+import asyncio
+from datetime import date
+from pathlib import Path
+from typing import Optional
+
+import pandas as pd
+from pydantic import BaseModel, Field
+from async_document_extraction_pipeline import extract_structured_data_async