Data Platform Engineer | Python Developer | Synthetic-Data & OCR Nerd
I build high-throughput, cost-efficient data platforms for Indic AI, from petabyte-scale ingestion to governed metadata and evaluation.
BSc in Programming & Data Science (ongoing), IIT Madras
Bengaluru, India
- TB-to-PB scale pipelines with 10× throughput (~1 TB/hr), 33% lower infra cost, 60% storage reduction.
- Built an Ayurveda SFT dataset (~5M Q&A from 1K+ books) and domain benchmarks (Ayurveda, Finance, Legal, Agri) for fair evaluation of Indic LLMs.
- Chitrakshara co-author: a large multilingual multimodal dataset (193M images, 30B tokens).
- Architected hybrid on-prem stack (JuiceFS + DuckDB) with DataHub lineage for 1M+ datasets; reproducible and governed.
- Engineered async + throttled Archive.org ingestion (10× faster, scaled to 10M+ books).
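The async + throttled ingestion pattern above can be sketched roughly as follows. This is a minimal illustration with Python's standard `asyncio`, not the production code: item IDs, the concurrency cap, and the `fetch_item` helper are all placeholders.

```python
import asyncio

MAX_CONCURRENT = 32  # throttle: cap the number of simultaneous downloads


async def fetch_item(item_id: str, sem: asyncio.Semaphore) -> str:
    """Placeholder for one download (e.g. an aiohttp GET against archive.org)."""
    async with sem:               # acquire a slot before touching the network
        await asyncio.sleep(0)    # stand-in for real I/O
        return f"downloaded:{item_id}"


async def ingest(item_ids: list[str]) -> list[str]:
    # One semaphore shared across all tasks enforces the global throttle,
    # while asyncio.gather keeps results in input order.
    sem = asyncio.Semaphore(MAX_CONCURRENT)
    return await asyncio.gather(*(fetch_item(i, sem) for i in item_ids))


results = asyncio.run(ingest([f"book-{n}" for n in range(100)]))
```

The semaphore is what makes this "throttled": thousands of tasks can be scheduled up front, but only `MAX_CONCURRENT` ever hold a network slot at once, which keeps the host (and the remote API) from being overwhelmed.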
- Data Platforms & Governance: DataHub lineage, metadata tagging, SLAs/ownership, reproducible training data.
- Massive Ingestion Pipelines: Common Crawl (WARC/WET), Obelics-style web curation, Archive/NDLI collectors.
- OCR → Post-Correction: Surya/Kraken/Tesseract Indic, quality filters, taxonomy tagging for downstream RAG/SFT.
- Serving & Synthetic Data: sglang/vLLM setups, prompt/rubric pipelines, evaluation harnesses for Indic domains.
- Switched CC ingestion from WET to WARC for Indic LMs, yielding +3% training efficiency.
- OBELICS-inspired pipeline over 230M+ URLs; curated 193M Indic image-text pairs at 33% lower cost.
- Airline forecasting pipelines: 80-90% runtime cut, 40-60% storage saved.
- Kafka + PySpark persona generation, real-time DB syncs, and Presto for complex analytics at scale.
- Languages: Python (PySpark/Pandas), SQL, Bash, JavaScript
- Data/Compute: Spark, Kafka, DuckDB, Trino/Presto, Airflow, DataHub, Scrapy
- AI/ML: sglang, Vertex AI, OCR (Surya/Kraken/Tesseract), evaluation/benchmarks
- Cloud & Infra: AWS (S3/EMR/Glue/Redshift/EC2), Azure (Blob/Functions/Databricks/SQL), Docker, CI/CD
- Storage/Query: JuiceFS, MinIO, Redshift, Azure SQL, MySQL, MongoDB
Petabyte-ready platform with JuiceFS + DuckDB and DataHub lineage across 1M+ datasets. Reproducible pipelines for LLM training/eval; standardized tags for NDLI/Archive/CC sources.
Domain-specialized Ayurveda SFT (~5M Q&A), bilingual prompts, and robust benchmark design (easy-to-hard strata, cultural relevance). Drove best-in-class results in its size class.
Scaled to 230M+ URLs; curated 193M image-text pairs; cut infra cost by 33% with careful batching, filtering, and de-duplication.
Parallelized downloader for HF/Archive/arXiv, moving TBs in ~60 minutes with throttling, retries, and manifest tracking.
Repo: data_scraper (Hugging Face + Archive focus)
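The retries-plus-manifesting idea behind the downloader can be sketched like this. Function names, the manifest fields, and the example URL are illustrative assumptions, not taken from the `data_scraper` repo:

```python
import json
import time


def download_with_retries(url: str, fetch, retries: int = 3, backoff: float = 0.01) -> dict:
    """Try `fetch(url)` up to `retries` times with exponential backoff,
    and return a manifest entry recording the outcome."""
    for attempt in range(1, retries + 1):
        try:
            size = fetch(url)  # fetch returns bytes downloaded on success
            return {"url": url, "status": "ok", "bytes": size, "attempts": attempt}
        except OSError:
            if attempt == retries:
                return {"url": url, "status": "failed", "attempts": attempt}
            time.sleep(backoff * (2 ** (attempt - 1)))  # double the wait each retry


# Simulated flaky fetch: fails once with a transient error, then succeeds.
calls = {"n": 0}

def flaky_fetch(url):
    calls["n"] += 1
    if calls["n"] == 1:
        raise OSError("transient network error")
    return 1024

entry = download_with_retries("https://archive.org/download/example", flaky_fetch)
manifest_line = json.dumps(entry)  # one JSON line per item makes the run auditable/resumable
```

Writing one manifest line per item is what makes large runs resumable: a restarted job can skip every URL whose manifest entry already says `"ok"`.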
- Harvard CS50 (score 9.9, 2nd Rank in Diploma)
- AWS Cloud Practitioner, Azure Data Fundamentals, SQL (Intermediate)
- Extension Mania (IIT Madras): Winner
- Publication: A Large Multilingual Multimodal Dataset for Indian Languages (193M images, 30B tokens)
- Email: mohdnauman330@gmail.com
- LinkedIn: linkedin.com/in/nauman-data-llm
- Portfolio: Portfolio Website
- GitHub: github.com/Noman654

