SMDT is a lightweight toolkit designed for ingesting, normalizing, enriching, and analyzing social-media data. It prioritizes streaming-friendly processing for large datasets, providing builders, utilities, and NLP hooks to transform raw exports (JSONL/CSV) into edge lists and NetworkX graphs.
The goal is to provide a flexible, consistent data model to enable reproducible data analysis across different social platforms.
- Features
- Prerequisites & Database Setup
- Installation & Quickstart
- Usage
- Project Structure
- Data Model
- Development & Testing
- Citation
- Ingest & Standardize: Convert raw platform exports (Twitter/X, Bluesky, TruthSocial) into normalized SQL tables (`Communities`, `Accounts`, `Posts`, `Actions`, `Entities`).
- Anonymize & Redact: Remove or pseudonymize sensitive fields using policy-driven helpers before sharing datasets.
- Enrich & Label: Apply computed features (language detection, toxicity scores, embeddings) via a local or server-backed enrichment framework.
- Build Networks: Generate edge lists (User–User, Entity–Cooccurrence) and bipartite graphs compatible with NetworkX and Gephi.
- Scale: Designed for streaming; handles datasets that do not fit in memory using incremental builders and Parquet exports.
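To make the streaming claim concrete, here is a minimal sketch of incremental edge counting over a JSONL export — the file is consumed line by line, so memory stays bounded. The field names and the helper function are illustrative, not part of SMDT's API:

```python
import json
from collections import Counter

def count_quote_edges(lines):
    """Incrementally count (source, target) quote edges from an
    iterable of JSON lines, one record at a time."""
    edges = Counter()
    for line in lines:
        record = json.loads(line)
        src = record.get("author_id")
        dst = record.get("quoted_author_id")
        if src and dst:
            edges[(src, dst)] += 1
    return edges

# Works the same whether `lines` is a small list or a lazily
# read multi-gigabyte file handle
sample = [
    '{"author_id": "a", "quoted_author_id": "b"}',
    '{"author_id": "a", "quoted_author_id": "b"}',
    '{"author_id": "c", "quoted_author_id": "a"}',
]
print(count_quote_edges(sample))  # Counter({('a', 'b'): 2, ('c', 'a'): 1})
```

SMDT's actual builders additionally spill intermediate results to Parquet, but the incremental-accumulation idea is the same.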
System Requirements:
- Python 3.11+
- PostgreSQL 14.19+
- TimescaleDB Extension
- PostGIS Extension
SMDT requires a PostgreSQL database with both the TimescaleDB and PostGIS extensions enabled.
Detailed PostgreSQL, TimescaleDB & PostGIS Installation Guide
Windows
- Download version 14.19 from the EDB PostgreSQL Archive.
- Run the installer (default port: `5432`) and set a password for the `postgres` user.
- Add `C:\Program Files\PostgreSQL\14\bin` to your `Path` system environment variable.
macOS (Homebrew)
```bash
brew install postgresql@14
brew services start postgresql@14
brew link --force postgresql@14
```

Linux (Ubuntu/Debian)
```bash
sudo sh -c 'echo "deb http://apt.postgresql.org/pub/repos/apt $(lsb_release -cs)-pgdg main" > /etc/apt/sources.list.d/pgdg.list'
wget --quiet -O - https://www.postgresql.org/media/keys/ACCC4CF8.asc | sudo apt-key add -
sudo apt update
sudo apt install postgresql-14
sudo systemctl start postgresql
sudo systemctl enable postgresql
```

Windows
- Download the `.zip` for Windows (amd64) from TimescaleDB Releases.
- Extract the folder.
- Run PowerShell as Administrator, navigate to the folder, and run `.\setup.exe`.
- Restart the PostgreSQL service via `services.msc`.
macOS
```bash
brew tap timescale/tap
brew install timescaledb
timescaledb-tune --quiet --yes
brew services restart postgresql@14
```

Linux
```bash
sudo add-apt-repository ppa:timescale/timescaledb-ppa
sudo apt-get update
sudo apt install timescaledb-2-postgresql-14
sudo timescaledb-tune --quiet --yes
sudo systemctl restart postgresql
```

Windows
- Open the Stack Builder utility (installed with PostgreSQL).
- Select your PostgreSQL 14 installation.
- Expand Spatial Extensions and check PostGIS 3.x Bundle for PostgreSQL 14.
- Follow the prompts to install, then restart the PostgreSQL service.
macOS
```bash
brew install postgis
brew services restart postgresql@14
```

Linux

```bash
sudo apt install postgresql-14-postgis-3
sudo systemctl restart postgresql
```

Run the following SQL commands to create the database and enable the extensions:
Connect to Postgres:
```bash
# macOS/Linux
psql -U postgres

# Windows
psql -U postgres
```

Run SQL:
```sql
CREATE DATABASE project_db;
CREATE USER project_user WITH ENCRYPTED PASSWORD 'your_password_here';
GRANT ALL PRIVILEGES ON DATABASE project_db TO project_user;
\c project_db

-- Enable extensions
CREATE EXTENSION IF NOT EXISTS timescaledb;
CREATE EXTENSION IF NOT EXISTS postgis;

-- Verify installation
\dx
```

This project uses uv for fast Python package management. For detailed instructions, refer to the uv Installation Guide.
- Clone the Repository

  ```bash
  git clone https://github.com/ViralLab/SMDT
  cd SMDT
  ```

- Initialize Environment

  ```bash
  uv venv
  source .venv/bin/activate  # On Windows: .venv\Scripts\activate
  ```

- Install Dependencies

  ```bash
  uv sync
  ```

- Configure Environment

  Create a `.env` file or set environment variables for your database connection:

  ```
  DEFAULT_DB_NAME=project_db
  DB_USER=project_user
  DB_PASSWORD=your_password_here
  DB_HOST=localhost
  DB_PORT=5432
  ```
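For reference, these variables amount to a standard libpq-style connection string. A minimal sketch using only the standard library (SMDT's own `DBConfig` handles this internally; the helper below is illustrative):

```python
import os

def build_dsn(env=os.environ):
    """Assemble a libpq-style DSN from the DB_* variables above.
    Defaults mirror the example .env values."""
    return (
        f"dbname={env.get('DEFAULT_DB_NAME', 'project_db')} "
        f"user={env.get('DB_USER', 'project_user')} "
        f"password={env.get('DB_PASSWORD', '')} "
        f"host={env.get('DB_HOST', 'localhost')} "
        f"port={env.get('DB_PORT', '5432')}"
    )

print(build_dsn({"DB_PASSWORD": "secret"}))
# dbname=project_db user=project_user password=secret host=localhost port=5432
```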
Convert raw JSONL data into normalized objects (Posts, Accounts, Entities, Actions).
```python
from smdt.standardizers.twitter.twitter_v2 import TwitterV2Standardizer
from smdt.io.readers.jsonl import JSONLReader

standardizer = TwitterV2Standardizer()

# Stream through a JSONL export
for record in JSONLReader("data/tweets.jsonl"):
    for model in standardizer.standardize(record):
        # model is an instance of Posts, Accounts, Entities, or Actions
        print(model)
```

Check the completeness and schema distributions of your normalized tables.
```python
from smdt.config import DBConfig
from smdt.store.standard_db import StandardDB
from smdt.inspector.inspector import Inspector, report_schemas

cfg = DBConfig()  # reads DB_* env vars
db = StandardDB(db_name=cfg.default_dbname or "mydb", cfg=cfg)
ins = Inspector(db, schema=getattr(cfg, "owner", "public"))
report_schemas([ins], only_tables=["posts", "actions", "accounts"])
```

Generate interaction graphs for analysis.
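(As an aside, the kind of completeness statistic the inspector reports boils down to per-column null fractions. A plain-pandas sketch — illustrative, not the Inspector's actual implementation:)

```python
import pandas as pd

def null_fraction(df):
    """Per-column fraction of missing values -- a basic completeness metric."""
    return df.isna().mean()

posts = pd.DataFrame({
    "post_id": ["1", "2", "3", "4"],
    "body": ["hi", None, "hey", None],
})
print(null_fraction(posts))  # post_id 0.0, body 0.5
```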
```python
from smdt.config import DBConfig
from smdt.store.standard_db import StandardDB
from smdt.networks.api import user_interaction, entity_cooccurrence

cfg = DBConfig()
db = StandardDB(db_name="mydb", cfg=cfg)

# User interaction network (who quoted whom)
# result.edges is a DataFrame with src, dst, weight, edge_type
result = user_interaction(db, interaction="QUOTE", weighting="count")
print(result.edges.head())

# Export to Parquet for Gephi/NetworkX
result.edges.to_parquet("edges.parquet")
```

```
SMDT/
├── src/smdt/            # Main package
│   ├── anonymizer/      # Redaction and pseudonymization utilities
│   ├── config.py        # Configuration (DB, anonymization)
│   ├── enrichers/       # Text enrichment framework (local + server adapters)
│   ├── ingest/          # Ingestion pipelines and deduplication logic
│   ├── inspector/       # Data quality inspection utilities
│   ├── io/              # Streaming readers (JSONL, CSV, ZIP)
│   ├── networks/        # Network builders and streaming helpers
│   ├── standardizers/   # Platform-specific normalizers (Twitter, Bluesky, etc.)
│   └── store/           # DB models and StandardDB abstraction
├── tests/
│   ├── unit/            # Fast unit tests (no external deps)
│   └── integration/     # DB integration tests (requires Postgres)
├── prompt.yml           # Prompt templates for enrichers
└── pyproject.toml       # Project metadata and dependencies
```
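Returning to the networks example: an exported edge list with `src`, `dst`, and `weight` columns loads straight back into NetworkX. A short sketch (the in-memory DataFrame stands in for `pd.read_parquet("edges.parquet")`):

```python
import pandas as pd
import networkx as nx

# In practice: edges = pd.read_parquet("edges.parquet")
edges = pd.DataFrame({
    "src": ["a", "a", "b"],
    "dst": ["b", "c", "c"],
    "weight": [2, 1, 5],
})

# Directed graph with per-edge weights
G = nx.from_pandas_edgelist(edges, source="src", target="dst",
                            edge_attr="weight", create_using=nx.DiGraph)
print(G.number_of_nodes(), G.number_of_edges())  # 3 3
print(G["b"]["c"]["weight"])  # 5
```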
SMDT normalizes data into five primary tables. If creating a new standardizer, ensure your output maps to these fields:

| Table | Key Fields |
|---|---|
| Communities | community_id, community_type (CHANNEL/GROUP), community_username, community_name, bio, is_public, member_count, post_count, profile_image_url, owner_account_id, created_at, retrieved_at |
| Accounts | account_id, username, profile_name, bio, location, post_count, friend_count, follower_count, is_verified, profile_image_url, created_at, retrieved_at |
| Posts | post_id, account_id, conversation_id, community_id, body, like_count, dislike_count, view_count, share_count, comment_count, quote_count, bookmark_count |
| Entities | account_id, community_id, post_id, body, entity_type (e.g. HASHTAG), created_at, retrieved_at |
| Actions | originator_account_id, originator_post_id, target_account_id, target_post_id, originator_community_id, target_community_id, action_type (e.g. SHARE), created_at, retrieved_at |
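For example, a retweet would be recorded as a single Actions row. A hedged sketch of such a row and a simple field check (the helper and the choice of required fields are illustrative; community ids may legitimately be absent for platforms without communities):

```python
# Core Actions fields from the table above (community ids treated as optional)
REQUIRED_ACTION_FIELDS = {
    "originator_account_id", "originator_post_id",
    "target_account_id", "target_post_id",
    "action_type", "created_at", "retrieved_at",
}

def is_valid_action(row):
    """True if a candidate Actions row carries all core fields."""
    return REQUIRED_ACTION_FIELDS.issubset(row)

share = {
    "originator_account_id": "42",  # who shared
    "originator_post_id": "1001",   # the share itself
    "target_account_id": "7",       # original author
    "target_post_id": "900",        # original post
    "action_type": "SHARE",
    "created_at": "2024-01-01T00:00:00Z",
    "retrieved_at": "2024-01-02T00:00:00Z",
}
print(is_valid_action(share))                    # True
print(is_valid_action({"action_type": "SHARE"})) # False
```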
Please ensure you follow existing code styles and add tests for new behaviors.
Integration tests require a database. Set TEST_DATABASE_URL in your .env.test.
```bash
# Run all tests
uv run python -m pytest

# Run only unit tests (fast, no DB required)
uv run python -m pytest tests/unit

# Run only integration tests
uv run python -m pytest tests/integration

# Verbose output
uv run python -m pytest -v
```

To add support for a platform like Threads:
- Create a new module in `src/smdt/standardizers/threads/` that maps raw data to the normalized models.
- Update `src/smdt/standardizers/__init__.py` to import and expose the new standardizer.
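A skeleton for such a module might look like this. The class name, method signature, raw field names, and the dict output are all illustrative — mirror the interface of the existing standardizers (which yield Posts/Accounts/Entities/Actions models) rather than this sketch:

```python
# Hypothetical src/smdt/standardizers/threads/threads_v1.py

class ThreadsStandardizer:
    """Maps raw Threads records onto the normalized tables (sketch)."""

    def standardize(self, record):
        # Yield one or more normalized objects per raw record; a real
        # standardizer would also emit Accounts, Entities, and Actions.
        yield {
            "table": "posts",
            "post_id": record["id"],
            "account_id": record["user_id"],
            "body": record.get("caption", ""),
        }

# Toy usage with an invented raw record
out = list(ThreadsStandardizer().standardize(
    {"id": "9", "user_id": "3", "caption": "hi"}
))
print(out[0]["post_id"])  # 9
```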
[License Information Here]
