Noman654/README.md

👋 Hi, I'm Mohd Nauman

🌟 Data Platform Engineer | Python Developer | Synthetic-Data & OCR Nerd
πŸ” I build high-throughput, cost-efficient data platforms for Indic AIβ€”from petabyte-scale ingestion to governed metadata and evaluation.

🎓 BSc in Programming & Data Science (ongoing), IIT Madras
📍 Bengaluru, India


🚀 Snapshot

  • TB → PB scale pipelines with 10× throughput (~1 TB/hr), 33% lower infra cost, 60% storage reduction.
  • Built Ayurveda SFT (≈ 5M Q&A from 1K+ books) and domain benchmarks (Ayurveda, Finance, Legal, Agri) for fair evaluation of Indic LLMs.
  • Chitrakshara co-author: a large multilingual multimodal dataset (193M images, 30B tokens).
  • Architected hybrid on-prem stack (JuiceFS + DuckDB) with DataHub lineage for 1M+ datasets; reproducible and governed.
  • Engineered async+throttled Archive.org ingestion (10× faster, scaled to 10M+ books).
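
To illustrate the async+throttled ingestion pattern from the last bullet, here is a minimal sketch using only the Python standard library. It is not the production collector: `fetch_item` is a hypothetical stand-in that sleeps instead of calling Archive.org, and `max_concurrency` is an assumed tuning knob.

```python
import asyncio


async def fetch_item(item_id: str) -> str:
    """Hypothetical stand-in for a network call; sleeps instead of downloading."""
    await asyncio.sleep(0.01)
    return f"payload:{item_id}"


async def ingest(item_ids, max_concurrency: int = 4):
    """Fetch every item concurrently with at most max_concurrency in flight."""
    semaphore = asyncio.Semaphore(max_concurrency)

    async def throttled(item_id: str) -> str:
        # The semaphore is the throttle: extra tasks wait here for a free slot.
        async with semaphore:
            return await fetch_item(item_id)

    # gather() returns results in the same order as the input ids.
    return await asyncio.gather(*(throttled(i) for i in item_ids))


results = asyncio.run(ingest([f"book-{n}" for n in range(10)]))
print(len(results), results[0])
```

A semaphore caps concurrency rather than requests per second; a real collector would likely add a rate limit and backoff on top, depending on the source's usage policy.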

🧭 What I Do

  • Data Platforms & Governance: DataHub lineage, metadata tagging, SLAs/ownership, reproducible training data.
  • Massive Ingestion Pipelines: Common Crawl (WARC/WET), Obelics-style web curation, Archive/NDLI collectors.
  • OCR → Post-Correction: Surya/Kraken/Tesseract Indic, quality filters, taxonomy tagging for downstream RAG/SFT.
  • Serving & Synthetic Data: sglang/vLLM setups, prompt/rubric pipelines, evaluation harnesses for Indic domains.

πŸ† Highlights / Wins

  • Switched CC ingestion from WET → WARC for Indic LMs → +3% training efficiency.
  • OBELICS-inspired pipeline over 230M+ URLs, curated 193M Indic image–text pairs; -33% cost.
  • Airline forecasting pipelines: 80–90% runtime cut, 40–60% storage saved.
  • Kafka + PySpark persona generation, real-time DB syncs, and Presto for complex analytics at scale.

🧰 Tech Stack

  • Languages: Python (PySpark/Pandas), SQL, Bash, JavaScript
  • Data/Compute: Spark, Kafka, DuckDB, Trino/Presto, Airflow, DataHub, Scrapy
  • AI/ML: sglang, Vertex AI, OCR (Surya/Kraken/Tesseract), evaluation/benchmarks
  • Cloud & Infra: AWS (S3/EMR/Glue/Redshift/EC2), Azure (Blob/Functions/Databricks/SQL), Docker, CI/CD
  • Storage/Query: JuiceFS, MinIO, Redshift, Azure SQL, MySQL, MongoDB

🔬 Selected Projects

Bharat Data Sagar: Hybrid Data Platform

Petabyte-ready platform with JuiceFS + DuckDB and DataHub lineage across 1M+ datasets. Reproducible pipelines for LLM training/eval; standardized tags for NDLI/Archive/CC sources.

AyurParam / BhashaBench-Ayur (Internal)

Domain-specialized Ayurveda SFT (~5M Q&A), bilingual prompts, and robust benchmark design (easy↔hard strata, cultural relevance). Drove best-in-class results in size class.

Obelics-style Indic Web Crawler

Scaled to 230M+ URLs; curated 193M image-text pairs; -33% infra cost with careful batching, filtering, and de-dup.
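
As a rough illustration of the batching, filtering, and de-dup steps mentioned above, the sketch below processes image-text pairs in fixed-size batches, drops very short captions, and removes exact duplicates by hashing a normalized key. The threshold, the normalization, and the sample URLs are assumptions for the example, not the rules used in the actual crawler.

```python
import hashlib
from itertools import islice


def batched(items, size):
    """Yield successive lists of at most `size` items."""
    it = iter(items)
    while batch := list(islice(it, size)):
        yield batch


def curate(pairs, batch_size=2, min_caption_len=5):
    """Keep (url, caption) pairs that pass a length filter and are not duplicates."""
    seen = set()
    kept = []
    for batch in batched(pairs, batch_size):
        for url, caption in batch:
            text = caption.strip()
            if len(text) < min_caption_len:  # assumed quality filter
                continue
            # Exact-match de-dup on URL plus lowercased caption.
            key = hashlib.sha256(f"{url}\n{text.lower()}".encode("utf-8")).hexdigest()
            if key in seen:
                continue
            seen.add(key)
            kept.append((url, caption))
    return kept


pairs = [
    ("https://example.org/a.jpg", "A temple at dawn"),
    ("https://example.org/a.jpg", "a temple at dawn "),  # duplicate after normalization
    ("https://example.org/b.jpg", "ok"),                 # too short
    ("https://example.org/c.jpg", "Market street"),
]
curated = curate(pairs)
print(curated)
```

At web scale the `seen` set would not fit in memory on one machine; near-duplicate detection and a distributed key store are the usual next steps, and are outside this sketch.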

One-Click Data Downloader

Parallelized downloader for HF/Archive/arXiv, moving TBs in ~60 minutes with throttling, retries, and manifesting.
Repo: data_scraper (Hugging Face + Archive focus)
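
The retry and manifest ideas can be sketched as follows. This is an assumption-laden illustration, not code from the data_scraper repo: `flaky_fetch` is a hypothetical fetcher that fails once per URL, the URLs are placeholders, and the manifest fields (`url`, `bytes`, `sha256`) are one possible layout.

```python
import hashlib
import json
import time


def download_with_retries(fetch, url, max_attempts=3, base_delay=0.01):
    """Call fetch(url), retrying on OSError with exponential backoff."""
    for attempt in range(1, max_attempts + 1):
        try:
            return fetch(url)
        except OSError:
            if attempt == max_attempts:
                raise
            time.sleep(base_delay * 2 ** (attempt - 1))


def build_manifest(fetch, urls):
    """Download each URL and record its size and checksum for later verification."""
    manifest = []
    for url in urls:
        data = download_with_retries(fetch, url)
        manifest.append(
            {"url": url, "bytes": len(data), "sha256": hashlib.sha256(data).hexdigest()}
        )
    return manifest


attempts = {}


def flaky_fetch(url):
    """Hypothetical fetcher: fails on the first call per URL, then succeeds."""
    attempts[url] = attempts.get(url, 0) + 1
    if attempts[url] == 1:
        raise ConnectionError("simulated transient failure")
    return url.encode("utf-8")


manifest = build_manifest(flaky_fetch, ["hf://example/a", "archive://example/b"])
print(json.dumps(manifest, indent=2))
```

Recording a checksum per file makes reruns cheap: a later pass can skip anything whose manifest entry already matches the bytes on disk.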


📜 Certifications & Recognition

  • Harvard CS50 (score 9.9, 2nd Rank in Diploma)
  • AWS Cloud Practitioner, Azure Data Fundamentals, SQL (Intermediate)
  • Extension Mania (IIT Madras): Winner
  • Publication: A Large Multilingual Multimodal Dataset for Indian Languages (193M images, 30B tokens)

🤝 Let's Connect

Popular repositories

  1. dataengineer_prep (Public)

    Data engineering interview prep - PySpark notebooks, theory docs, quizzes, and company-specific patterns. Built around Zephyr Coffee Co., a fictional 200-store chain with messy data.

    Jupyter Notebook · 28 stars · 6 forks

  2. StockAdvisor (Public)

    Python 2

  3. second-brain-rpg (Public)

    PARA From Second Brain

    TypeScript 1

  4. learning-java-2825378 (Public)

    Forked from LinkedInLearning/learning-java-2825378

    Learning Java (REVISION Q1 2020)

  5. ping-pong-Game (Public)

    A simple two-player ping pong game.

    Lua

  6. OLDBOOK (Public)

    A simple site that shows which store has which type of book.

    HTML