Data Platform Engineer | Python Developer | Synthetic-Data & OCR Nerd
I build high-throughput, cost-efficient data platforms for Indic AI, from petabyte-scale ingestion to governed metadata and evaluation.
BSc in Programming & Data Science (ongoing), IIT Madras
Bengaluru, India
- TB-to-PB scale pipelines with 10× throughput (~1 TB/hr), 33% lower infra cost, 60% storage reduction.
- Built an Ayurveda SFT dataset (~5M Q&A from 1K+ books) and domain benchmarks (Ayurveda, Finance, Legal, Agri) for fair evaluation of Indic LLMs.
- Chitrakshara co-author: a large multilingual multimodal dataset (193M images, 30B tokens).
- Architected hybrid on-prem stack (JuiceFS + DuckDB) with DataHub lineage for 1M+ datasets; reproducible and governed.
- Engineered async + throttled Archive.org ingestion (10× faster, scaled to 10M+ books).
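The async + throttled ingestion pattern above can be sketched roughly as follows. This is a minimal illustration with Python's standard `asyncio`, not the production code: item IDs, the concurrency cap, and the `fetch_item` helper are all placeholders.

```python
import asyncio

MAX_CONCURRENT = 32  # throttle: cap the number of simultaneous downloads


async def fetch_item(item_id: str, sem: asyncio.Semaphore) -> str:
    """Placeholder for one download (e.g. an aiohttp GET against archive.org)."""
    async with sem:               # acquire a slot before touching the network
        await asyncio.sleep(0)    # stand-in for real I/O
        return f"downloaded:{item_id}"


async def ingest(item_ids: list[str]) -> list[str]:
    # One semaphore shared across all tasks enforces the global throttle,
    # while asyncio.gather keeps results in input order.
    sem = asyncio.Semaphore(MAX_CONCURRENT)
    return await asyncio.gather(*(fetch_item(i, sem) for i in item_ids))


results = asyncio.run(ingest([f"book-{n}" for n in range(100)]))
```

The semaphore is what makes this "throttled": thousands of tasks can be scheduled up front, but only `MAX_CONCURRENT` ever hold a network slot at once, which keeps the host (and the remote API) from being overwhelmed.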
- Data Platforms & Governance: DataHub lineage, metadata tagging, SLAs/ownership, reproducible training data.
- Massive Ingestion Pipelines: Common Crawl (WARC/WET), Obelics-style web curation, Archive/NDLI collectors.
- OCR → Post-Correction: Surya/Kraken/Tesseract Indic, quality filters, taxonomy tagging for downstream RAG/SFT.
- Serving & Synthetic Data: sglang/vLLM setups, prompt/rubric pipelines, evaluation harnesses for Indic domains.
- Switched CC ingestion from WET to WARC for Indic LMs, yielding +3% training efficiency.
- OBELICS-inspired pipeline over 230M+ URLs; curated 193M Indic image-text pairs at 33% lower cost.
- Airline forecasting pipelines: 80-90% runtime cut, 40-60% storage saved.
- Kafka + PySpark persona generation, real-time DB syncs, and Presto for complex analytics at scale.
- Languages: Python (PySpark/Pandas), SQL, Bash, JavaScript
- Data/Compute: Spark, Kafka, DuckDB, Trino/Presto, Airflow, DataHub, Scrapy
- AI/ML: sglang, Vertex AI, OCR (Surya/Kraken/Tesseract), evaluation/benchmarks
- Cloud & Infra: AWS (S3/EMR/Glue/Redshift/EC2), Azure (Blob/Functions/Databricks/SQL), Docker, CI/CD
- Storage/Query: JuiceFS, MinIO, Redshift, Azure SQL, MySQL, MongoDB
Petabyte-ready platform with JuiceFS + DuckDB and DataHub lineage across 1M+ datasets. Reproducible pipelines for LLM training/eval; standardized tags for NDLI/Archive/CC sources.
Domain-specialized Ayurveda SFT (~5M Q&A), bilingual prompts, and robust benchmark design (easy-to-hard strata, cultural relevance). Drove best-in-class results in its size class.
Scaled to 230M+ URLs; curated 193M image-text pairs; cut infra cost by 33% with careful batching, filtering, and de-duplication.
Parallelized downloader for HF/Archive/arXiv, moving TBs in ~60 minutes with throttling, retries, and manifest tracking.
Repo: data_scraper (Hugging Face + Archive focus)
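The retries-plus-manifesting idea behind the downloader can be sketched like this. Function names, the manifest fields, and the example URL are illustrative assumptions, not taken from the `data_scraper` repo:

```python
import json
import time


def download_with_retries(url: str, fetch, retries: int = 3, backoff: float = 0.01) -> dict:
    """Try `fetch(url)` up to `retries` times with exponential backoff,
    and return a manifest entry recording the outcome."""
    for attempt in range(1, retries + 1):
        try:
            size = fetch(url)  # fetch returns bytes downloaded on success
            return {"url": url, "status": "ok", "bytes": size, "attempts": attempt}
        except OSError:
            if attempt == retries:
                return {"url": url, "status": "failed", "attempts": attempt}
            time.sleep(backoff * (2 ** (attempt - 1)))  # double the wait each retry


# Simulated flaky fetch: fails once with a transient error, then succeeds.
calls = {"n": 0}

def flaky_fetch(url):
    calls["n"] += 1
    if calls["n"] == 1:
        raise OSError("transient network error")
    return 1024

entry = download_with_retries("https://archive.org/download/example", flaky_fetch)
manifest_line = json.dumps(entry)  # one JSON line per item makes the run auditable/resumable
```

Writing one manifest line per item is what makes large runs resumable: a restarted job can skip every URL whose manifest entry already says `"ok"`.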
- Harvard CS50 (score 9.9, 2nd Rank in Diploma)
- AWS Cloud Practitioner, Azure Data Fundamentals, SQL (Intermediate)
- Extension Mania (IIT Madras): Winner
- Publication: A Large Multilingual Multimodal Dataset for Indian Languages (193M images, 30B tokens)
- Email: mohdnauman330@gmail.com
- LinkedIn: linkedin.com/in/nauman-data-llm
- Portfolio: Portfolio Website
- GitHub: github.com/Noman654

