Title & Overview
Template: Topic Modeling: An Intermediate, End-to-End Analysis Tutorial
Overview (≤2 sentences): Learners will implement topic modeling pipelines using both classical probabilistic methods (LDA with gensim) and modern embedding-based clustering (BERTopic). It is intermediate because it emphasizes coherence/purity metrics, reproducibility, and interpretability of discovered topics.
Purpose
The value-add is understanding when to apply topic modeling for unsupervised exploration, how to compare probabilistic vs embedding-based approaches, and how to evaluate topics using quantitative metrics and qualitative inspection. Learners will practice reproducibility, slice-based evaluation, and reporting.
Prerequisites
- Skills: Python, Git, pandas, ML basics.
- NLP: tokenization, embeddings, clustering, evaluation metrics.
- Tooling: pandas, scikit-learn, gensim, BERTopic, Hugging Face Transformers, MLflow, FastAPI.
Setup Instructions
- Environment: Conda/Poetry (Python 3.11), deterministic seeds.
- Install: pandas, scikit-learn, gensim, BERTopic, Hugging Face Transformers + Datasets, MLflow, FastAPI.
- Datasets:
  - Small: 20 Newsgroups (classic benchmark).
  - Medium: AG News (for unsupervised topic discovery).
- Repo layout:
tutorials/t9-topic-modeling/
├─ notebooks/
├─ src/
│ ├─ lda.py
│ ├─ bertopic_model.py
│ ├─ eval.py
│ └─ config.yaml
├─ data/README.md
├─ reports/
└─ tests/
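Since the environment calls for deterministic seeds, a small helper like the sketch below can be called at the top of every notebook and script. The helper name `set_seed` is an assumption, not part of the repo layout above; torch/transformers seeding would be added the same way if those libraries are used.

```python
import os
import random

import numpy as np


def set_seed(seed: int = 42) -> None:
    """Seed Python, NumPy, and hash randomization for reproducible runs.

    (`set_seed` is a hypothetical helper name; extend with torch.manual_seed
    and transformers.set_seed if those libraries are in play.)
    """
    random.seed(seed)
    np.random.seed(seed)
    os.environ["PYTHONHASHSEED"] = str(seed)


# Two runs from the same seed produce identical draws.
set_seed(42)
a = np.random.rand(3)
set_seed(42)
b = np.random.rand(3)
print(np.allclose(a, b))  # → True
```

Storing the seed in `config.yaml` and logging it to MLflow (as suggested under Best Practices) keeps the value out of code.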
Core Concepts
- Classical topic modeling: LDA (Latent Dirichlet Allocation) with bag-of-words.
- Embedding-based topic modeling: BERTopic (transformer embeddings + clustering).
- Evaluation: topic coherence (UMass, c_v), purity/diversity metrics.
- Interpretability: top words per topic, representative documents.
- Error slicing: performance by doc length, class/domain, noisy inputs.
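Of the metrics above, topic diversity is simple enough to sketch directly: it is the fraction of unique words among the top-k words pooled across all topics (values near 1.0 mean topics rarely share top words; values near 0 signal redundant topics). The function below is a minimal illustration, not a library API.

```python
def topic_diversity(topics: list[list[str]], top_k: int = 10) -> float:
    """Fraction of unique words among the top_k words of every topic.

    1.0 means no topic shares a top word with another; low values
    signal overlapping, redundant topics.
    """
    top_words = [w for topic in topics for w in topic[:top_k]]
    if not top_words:
        return 0.0
    return len(set(top_words)) / len(top_words)


# Toy example: topic 2 repeats "game" from topic 0.
topics = [
    ["game", "team", "score", "player"],
    ["election", "vote", "policy", "senate"],
    ["game", "match", "league", "coach"],
]
print(topic_diversity(topics, top_k=4))  # 11 unique / 12 total ≈ 0.917
```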
Step-by-Step Walkthrough
- Data intake & preprocessing: load 20 Newsgroups and AG News, preprocess with a byte-level BPE tokenizer (for the embedding pipeline) and standard bag-of-words cleaning (for LDA), and create reproducible splits.
- Classical baseline: LDA via gensim on bag-of-words counts (LDA's generative model assumes raw term counts, so avoid TF-IDF weighting); tune the number of topics and the alpha/eta priors.
- Modern approach: BERTopic with Sentence-Transformer embeddings + clustering; compare to LDA.
- Evaluation: topic coherence (gensim c_v), topic diversity, silhouette scores.
- Error analysis: incoherent topics, fragmented vs overly broad topics, slice performance by document length.
- Reporting: metrics tables, topic–word distributions, top representative docs → reports/t9-topic-modeling.md.
- (Optional) Serve: FastAPI endpoint that assigns topics to input docs with schema validation.
Hands-On Exercises
- Ablations: LDA num_topics=10/20/50 vs BERTopic default.
- Robustness: noisy documents or code-switched text; compare topic coherence.
- Slice analysis: topic quality for short vs long docs.
- Stretch: hybrid models (LDA-initialized BERTopic, or embeddings clustered then refined by LDA).
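The slice-analysis exercise reduces to grouping a per-document quality signal by a length threshold. The sketch below is a minimal stand-in: the function name, the 20-token cutoff, and the use of assigned-topic probability as the score are all assumptions for illustration.

```python
from statistics import mean


def slice_by_length(docs, scores, threshold=20):
    """Group per-document scores into short/long slices by token count.

    `scores` can be any per-document quality signal, e.g. the probability
    of the assigned topic; the pairing here is illustrative.
    """
    slices = {"short": [], "long": []}
    for doc, score in zip(docs, scores):
        key = "short" if len(doc.split()) < threshold else "long"
        slices[key].append(score)
    return {k: mean(v) if v else None for k, v in slices.items()}


docs = ["short text here", "word " * 30]
scores = [0.4, 0.9]
print(slice_by_length(docs, scores))  # → {'short': 0.4, 'long': 0.9}
```

A large gap between the slice averages is the signal this exercise looks for: it suggests the model handles one length regime much worse than the other.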
Common Pitfalls & Troubleshooting
- Too many/few topics: harms interpretability; must tune.
- Sparse docs: short texts degrade LDA; embedding-based methods usually handle them better.
- Metrics misuse: coherence ≠ human interpretability; always pair with manual inspection.
- Memory use: BERTopic on large datasets → requires batching or dimensionality reduction.
- Tokenizer drift: different preprocessing pipelines → irreproducible topics.
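One cheap guard against tokenizer drift is to fingerprint the preprocessing config and attach the hash to every run (e.g. as an MLflow tag), so topics produced under different pipelines are never compared by accident. The helper name and config keys below are illustrative assumptions.

```python
import hashlib
import json


def config_fingerprint(config: dict) -> str:
    """Stable short hash of a preprocessing config.

    Canonical JSON (sorted keys) makes the hash independent of key
    order, so the same settings always fingerprint the same way.
    """
    canonical = json.dumps(config, sort_keys=True)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:12]


# Hypothetical config; keys mirror what src/config.yaml might hold.
cfg = {"lowercase": True, "tokenizer": "byte-level-bpe", "min_df": 5}
print(config_fingerprint(cfg))

# Any settings change yields a different fingerprint:
assert config_fingerprint(cfg) != config_fingerprint({**cfg, "min_df": 2})
```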
Best Practices
- Always log preprocessing config, tokenizer artifacts, and num_topics with MLflow.
- Combine quantitative metrics (coherence) with qualitative review of top words/docs.
- Unit tests: deterministic topic assignments for toy corpus under fixed seeds.
- Guardrails: enforce max doc length and schema validation in serving.
- Keep the LDA → BERTopic narrative (same data, same splits, same metrics) so comparisons stay reproducible.
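The serving guardrails (max doc length, schema validation) can be prototyped in plain Python before wiring them into FastAPI/pydantic. The field names and the 512-token limit below are assumptions, not a fixed API.

```python
MAX_DOC_TOKENS = 512  # illustrative limit, tune per deployment


def validate_request(payload: dict) -> list[str]:
    """Return validation errors for a topic-assignment request.

    A plain-Python stand-in for the pydantic schema a FastAPI
    endpoint would declare; an empty list means the request is valid.
    """
    errors = []
    docs = payload.get("documents")
    if not isinstance(docs, list) or not docs:
        errors.append("documents must be a non-empty list")
        return errors
    for i, doc in enumerate(docs):
        if not isinstance(doc, str):
            errors.append(f"documents[{i}] must be a string")
        elif len(doc.split()) > MAX_DOC_TOKENS:
            errors.append(f"documents[{i}] exceeds {MAX_DOC_TOKENS} tokens")
    return errors


print(validate_request({"documents": ["a short valid doc"]}))  # → []
print(validate_request({"documents": ["word " * 600]}))  # one length error
```

In the FastAPI version the same rules become a pydantic model plus a field validator, and invalid requests return 422 automatically.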
Reflection & Discussion Prompts
- Why does LDA struggle with short texts compared to embeddings?
- What does topic coherence miss in evaluating real-world interpretability?
- How might civic datasets (e.g., public comments) benefit from topic modeling?
Next Steps / Advanced Extensions
- Experiment with other clustering methods in BERTopic (HDBSCAN vs k-means).
- Explore multilingual topic modeling with mBERT.
- Domain adaptation: civic tech datasets, policy feedback.
- Lightweight monitoring: drift in topic distributions over time.
Glossary / Key Terms
LDA, topic coherence, topic purity/diversity, BERTopic, embeddings, clustering, silhouette score.
Additional Resources
Contributors
Author(s): TBD
Reviewer(s): TBD
Maintainer(s): TBD
Date updated: 2025-09-20
Dataset licenses: 20 Newsgroups (scikit-learn, open), AG News (CC).
Issues Referenced
Epic: HfLA Text Analysis Tutorials (T0–T14).
This sub-issue: T9: Topic Modeling.