Title & Overview
Template: Topic Modeling: An Intermediate, End-to-End Analysis Tutorial
Overview (≤2 sentences): Learners will implement topic modeling pipelines using both classical probabilistic methods (LDA with gensim) and modern embedding-based clustering (BERTopic). It is intermediate because it emphasizes coherence/purity metrics, reproducibility, and interpretability of discovered topics.
Purpose
The value-add is understanding when to apply topic modeling for unsupervised exploration, how to compare probabilistic vs embedding-based approaches, and how to evaluate topics using quantitative metrics and qualitative inspection. Learners will practice reproducibility, slice-based evaluation, and reporting.
Prerequisites
- Skills: Python, Git, pandas, ML basics.
- NLP: tokenization, embeddings, clustering, evaluation metrics.
- Tooling: pandas, scikit-learn, gensim, BERTopic, Hugging Face Transformers, MLflow, FastAPI.
Setup Instructions
- Environment: Conda/Poetry (Python 3.11), deterministic seeds.
- Install: pandas, scikit-learn, gensim, BERTopic, Hugging Face Transformers + Datasets, MLflow, FastAPI.
- Datasets:
  - Small: 20 Newsgroups (classic benchmark).
  - Medium: AG News (for unsupervised topic discovery).
- Repo layout:
tutorials/t9-topic-modeling/
├─ notebooks/
├─ src/
│ ├─ lda.py
│ ├─ bertopic_model.py
│ ├─ eval.py
│ └─ config.yaml
├─ data/README.md
├─ reports/
└─ tests/
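Since the environment calls for deterministic seeds, a small helper like the sketch below can be called at the top of every notebook and script. The helper name `set_seed` is an assumption, not part of the repo layout above; torch/transformers seeding would be added the same way if those libraries are used.

```python
import os
import random

import numpy as np


def set_seed(seed: int = 42) -> None:
    """Seed Python, NumPy, and hash randomization for reproducible runs.

    (`set_seed` is a hypothetical helper name; extend with torch.manual_seed
    and transformers.set_seed if those libraries are in play.)
    """
    random.seed(seed)
    np.random.seed(seed)
    os.environ["PYTHONHASHSEED"] = str(seed)


# Two runs from the same seed produce identical draws.
set_seed(42)
a = np.random.rand(3)
set_seed(42)
b = np.random.rand(3)
print(np.allclose(a, b))  # → True
```

Storing the seed in `config.yaml` and logging it to MLflow (as suggested under Best Practices) keeps the value out of code.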
Core Concepts
- Classical topic modeling: LDA (Latent Dirichlet Allocation) with bag-of-words.
- Embedding-based topic modeling: BERTopic (transformer embeddings + clustering).
- Evaluation: topic coherence (UMass, c_v), purity/diversity metrics.
- Interpretability: top words per topic, representative documents.
- Error slicing: performance by doc length, class/domain, noisy inputs.
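Of the metrics above, topic diversity is simple enough to sketch directly: it is the fraction of unique words among the top-k words pooled across all topics (values near 1.0 mean topics rarely share top words; values near 0 signal redundant topics). The function below is a minimal illustration, not a library API.

```python
def topic_diversity(topics: list[list[str]], top_k: int = 10) -> float:
    """Fraction of unique words among the top_k words of every topic.

    1.0 means no topic shares a top word with another; low values
    signal overlapping, redundant topics.
    """
    top_words = [w for topic in topics for w in topic[:top_k]]
    if not top_words:
        return 0.0
    return len(set(top_words)) / len(top_words)


# Toy example: topic 2 repeats "game" from topic 0.
topics = [
    ["game", "team", "score", "player"],
    ["election", "vote", "policy", "senate"],
    ["game", "match", "league", "coach"],
]
print(topic_diversity(topics, top_k=4))  # 11 unique / 12 total ≈ 0.917
```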
Step-by-Step Walkthrough
- Data intake & preprocessing: load 20 Newsgroups and AG News, preprocess with a byte-level BPE tokenizer (for the embedding pipeline) and standard bag-of-words cleaning (for LDA), and create reproducible splits.
- Classical baseline: LDA via gensim on bag-of-words counts (LDA's generative model assumes raw term counts, so avoid TF-IDF weighting); tune the number of topics and the alpha/eta priors.
- Modern approach: BERTopic with Sentence-Transformer embeddings + clustering; compare to LDA.
- Evaluation: topic coherence (gensim c_v), topic diversity, silhouette scores.
- Error analysis: incoherent topics, fragmented vs overly broad topics, slice performance by document length.
- Reporting: metrics tables, topic–word distributions, top representative docs → reports/t9-topic-modeling.md.
- (Optional) Serve: FastAPI endpoint that assigns topics to input docs with schema validation.
Hands-On Exercises
- Ablations: LDA num_topics=10/20/50 vs BERTopic default.
- Robustness: noisy documents or code-switched text; compare topic coherence.
- Slice analysis: topic quality for short vs long docs.
- Stretch: hybrid models (LDA-initialized BERTopic, or embeddings clustered then refined by LDA).
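The slice-analysis exercise reduces to grouping a per-document quality signal by a length threshold. The sketch below is a minimal stand-in: the function name, the 20-token cutoff, and the use of assigned-topic probability as the score are all assumptions for illustration.

```python
from statistics import mean


def slice_by_length(docs, scores, threshold=20):
    """Group per-document scores into short/long slices by token count.

    `scores` can be any per-document quality signal, e.g. the probability
    of the assigned topic; the pairing here is illustrative.
    """
    slices = {"short": [], "long": []}
    for doc, score in zip(docs, scores):
        key = "short" if len(doc.split()) < threshold else "long"
        slices[key].append(score)
    return {k: mean(v) if v else None for k, v in slices.items()}


docs = ["short text here", "word " * 30]
scores = [0.4, 0.9]
print(slice_by_length(docs, scores))  # → {'short': 0.4, 'long': 0.9}
```

A large gap between the slice averages is the signal this exercise looks for: it suggests the model handles one length regime much worse than the other.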
Common Pitfalls & Troubleshooting
- Too many/few topics: harms interpretability; must tune.
- Sparse docs: short texts degrade LDA; embedding-based methods usually handle them better.
- Metrics misuse: coherence ≠ human interpretability; always pair with manual inspection.
- Memory use: BERTopic on large datasets → requires batching or dimensionality reduction.
- Tokenizer drift: different preprocessing pipelines → irreproducible topics.
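One cheap guard against tokenizer drift is to fingerprint the preprocessing config and attach the hash to every run (e.g. as an MLflow tag), so topics produced under different pipelines are never compared by accident. The helper name and config keys below are illustrative assumptions.

```python
import hashlib
import json


def config_fingerprint(config: dict) -> str:
    """Stable short hash of a preprocessing config.

    Canonical JSON (sorted keys) makes the hash independent of key
    order, so the same settings always fingerprint the same way.
    """
    canonical = json.dumps(config, sort_keys=True)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:12]


# Hypothetical config; keys mirror what src/config.yaml might hold.
cfg = {"lowercase": True, "tokenizer": "byte-level-bpe", "min_df": 5}
print(config_fingerprint(cfg))

# Any settings change yields a different fingerprint:
assert config_fingerprint(cfg) != config_fingerprint({**cfg, "min_df": 2})
```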
Best Practices
- Always log preprocessing config, tokenizer artifacts, and num_topics with MLflow.
- Combine quantitative metrics (coherence) with qualitative review of top words/docs.
- Unit tests: deterministic topic assignments for toy corpus under fixed seeds.
- Guardrails: enforce max doc length and schema validation in serving.
- Keep the LDA → BERTopic narrative (same data, same splits, same metrics) so comparisons stay reproducible.
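The serving guardrails (max doc length, schema validation) can be prototyped in plain Python before wiring them into FastAPI/pydantic. The field names and the 512-token limit below are assumptions, not a fixed API.

```python
MAX_DOC_TOKENS = 512  # illustrative limit, tune per deployment


def validate_request(payload: dict) -> list[str]:
    """Return validation errors for a topic-assignment request.

    A plain-Python stand-in for the pydantic schema a FastAPI
    endpoint would declare; an empty list means the request is valid.
    """
    errors = []
    docs = payload.get("documents")
    if not isinstance(docs, list) or not docs:
        errors.append("documents must be a non-empty list")
        return errors
    for i, doc in enumerate(docs):
        if not isinstance(doc, str):
            errors.append(f"documents[{i}] must be a string")
        elif len(doc.split()) > MAX_DOC_TOKENS:
            errors.append(f"documents[{i}] exceeds {MAX_DOC_TOKENS} tokens")
    return errors


print(validate_request({"documents": ["a short valid doc"]}))  # → []
print(validate_request({"documents": ["word " * 600]}))  # one length error
```

In the FastAPI version the same rules become a pydantic model plus a field validator, and invalid requests return 422 automatically.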
Reflection & Discussion Prompts
- Why does LDA struggle with short texts compared to embeddings?
- What does topic coherence miss in evaluating real-world interpretability?
- How might civic datasets (e.g., public comments) benefit from topic modeling?
Next Steps / Advanced Extensions
- Experiment with other clustering methods in BERTopic (HDBSCAN vs k-means).
- Explore multilingual topic modeling with mBERT.
- Domain adaptation: civic tech datasets, policy feedback.
- Lightweight monitoring: drift in topic distributions over time.
Glossary / Key Terms
LDA, topic coherence, topic purity/diversity, BERTopic, embeddings, clustering, silhouette score.
Additional Resources
Contributors
Author(s): TBD
Reviewer(s): TBD
Maintainer(s): TBD
Date updated: 2025-09-20
Dataset licenses: 20 Newsgroups (scikit-learn, open), AG News (CC).
Issues Referenced
Epic: HfLA Text Analysis Tutorials (T0–T14).
This sub-issue: T9: Topic Modeling.