SMDT — Social Media Data Toolkit

SMDT is a lightweight toolkit designed for ingesting, normalizing, enriching, and analyzing social-media data. It prioritizes streaming-friendly processing for large datasets, providing builders, utilities, and NLP hooks to transform raw exports (JSONL/CSV) into edge lists and NetworkX graphs.

The goal is to provide a flexible, consistent data model to enable reproducible data analysis across different social platforms.

Features

Ingest & Standardize: Convert raw platform exports (Twitter/X, Bluesky, TruthSocial) into normalized SQL tables (Communities, Accounts, Posts, Actions, Entities).
Anonymize & Redact: Remove or pseudonymize sensitive fields using policy-driven helpers before sharing datasets.
Enrich & Label: Apply computed features (language detection, toxicity scores, embeddings) via a local or server-backed enrichment framework.
Build Networks: Generate edge lists (User–User, Entity–Cooccurrence) and bipartite graphs compatible with NetworkX and Gephi.
Scale: Designed for streaming; handles datasets that do not fit in memory using incremental builders and Parquet exports.
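To make the "Anonymize & Redact" idea above concrete, here is a minimal sketch of keyed pseudonymization: the same input always maps to the same token, but the original value is not exposed. The names `pseudonymize` and `redact_record`, the field lists, and the secret key are assumptions for illustration only; they are not SMDT's anonymizer API.

```python
import hashlib
import hmac


def pseudonymize(value: str, secret: bytes, length: int = 16) -> str:
    """Return a stable, keyed pseudonym for `value` (hypothetical helper)."""
    digest = hmac.new(secret, value.encode("utf-8"), hashlib.sha256).hexdigest()
    return digest[:length]


def redact_record(
    record: dict,
    secret: bytes,
    pseudonymize_fields=("account_id", "username"),  # assumed policy
    drop_fields=("location",),                        # assumed policy
) -> dict:
    """Copy `record`, pseudonymizing some fields and dropping others."""
    cleaned = {}
    for key, value in record.items():
        if key in drop_fields:
            continue  # remove the sensitive field entirely
        if key in pseudonymize_fields and value is not None:
            cleaned[key] = pseudonymize(str(value), secret)
        else:
            cleaned[key] = value
    return cleaned


# Example usage; the key should be kept outside the shared dataset.
secret = b"replace-with-a-private-key"
raw = {"account_id": "12345", "username": "alice", "location": "Ankara", "bio": "hi"}
print(redact_record(raw, secret))
```

A keyed hash (HMAC) is used here rather than a plain hash so that someone without the key cannot rebuild the mapping by hashing known usernames.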

Prerequisites & Database Setup

System Requirements:

Python 3.11+
PostgreSQL 14.19+
TimescaleDB Extension
PostGIS Extension

Database Installation

SMDT requires a PostgreSQL database with both the TimescaleDB and PostGIS extensions enabled.


1. Install PostgreSQL 14.19

Windows

Download version 14.19 from the EDB PostgreSQL Archive.
Run the installer (Default port: 5432). Set a password for the postgres user.
Add C:\Program Files\PostgreSQL\14\bin to your System Path environment variable.

macOS (Homebrew)

brew install postgresql@14
brew services start postgresql@14
brew link --force postgresql@14

Linux (Ubuntu/Debian)

sudo sh -c 'echo "deb http://apt.postgresql.org/pub/repos/apt $(lsb_release -cs)-pgdg main" > /etc/apt/sources.list.d/pgdg.list'
wget --quiet -O - https://www.postgresql.org/media/keys/ACCC4CF8.asc | sudo apt-key add -
sudo apt update
sudo apt install postgresql-14
sudo systemctl start postgresql
sudo systemctl enable postgresql

2. Install TimescaleDB

Windows

Download the .zip for Windows (amd64) from TimescaleDB Releases.
Extract the folder.
Run PowerShell as Administrator, navigate to the folder, and run .\setup.exe.
Restart the PostgreSQL service via services.msc.

macOS

brew tap timescale/tap
brew install timescaledb
timescaledb-tune --quiet --yes
brew services restart postgresql@14

Linux

sudo add-apt-repository ppa:timescale/timescaledb-ppa
sudo apt-get update
sudo apt install timescaledb-2-postgresql-14
sudo timescaledb-tune --quiet --yes
sudo systemctl restart postgresql

3. Install PostGIS

Windows

Open the Stack Builder utility (installed with PostgreSQL).
Select your PostgreSQL 14 installation.
Expand Spatial Extensions and check PostGIS 3.x Bundle for PostgreSQL 14.
Follow the prompts to install, then restart the PostgreSQL service.

macOS

brew install postgis
brew services restart postgresql@14

Linux

sudo apt install postgresql-14-postgis-3
sudo systemctl restart postgresql

4. Initialize Database

Run the following SQL commands to create the database and enable the extensions:

Connect to Postgres:

# macOS, Linux, and Windows (same command on all platforms)
psql -U postgres

Run SQL:

CREATE DATABASE project_db;
CREATE USER project_user WITH ENCRYPTED PASSWORD 'your_password_here';
GRANT ALL PRIVILEGES ON DATABASE project_db TO project_user;

\c project_db

-- Enable Extensions
CREATE EXTENSION IF NOT EXISTS timescaledb;
CREATE EXTENSION IF NOT EXISTS postgis;

-- Verify installation
\dx

Installation & Quickstart

This project uses uv for fast Python package management. For detailed instructions, refer to the uv Installation Guide.

Clone the Repository

git clone https://github.com/ViralLab/SMDT
cd SMDT

Initialize Environment

uv init
uv venv
source .venv/bin/activate  # On Windows: .venv\Scripts\activate

Install Dependencies
```
uv sync
```

Configure Environment

Create a .env file or set environment variables for your database connection:

DEFAULT_DB_NAME=project_db
DB_USER=project_user
DB_PASSWORD=your_password_here
DB_HOST=localhost
DB_PORT=5432
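If you want to reuse the same settings with other tools (for example `psql` or a notebook), the variables above can be assembled into a standard PostgreSQL connection URL. This is a minimal sketch; `build_postgres_url` is a hypothetical helper, not part of SMDT, and it assumes the variables are already present in the environment or passed in as a dictionary.

```python
import os
from urllib.parse import quote


def build_postgres_url(env=None) -> str:
    """Assemble a PostgreSQL URL from the variables listed above (hypothetical helper)."""
    env = os.environ if env is None else env
    # Percent-encode credentials so characters like '@' or '/' do not break the URL.
    user = quote(env["DB_USER"], safe="")
    password = quote(env["DB_PASSWORD"], safe="")
    host = env.get("DB_HOST", "localhost")
    port = env.get("DB_PORT", "5432")
    name = env["DEFAULT_DB_NAME"]
    return f"postgresql://{user}:{password}@{host}:{port}/{name}"


# Example usage with explicit values instead of the real environment.
example_env = {
    "DEFAULT_DB_NAME": "project_db",
    "DB_USER": "project_user",
    "DB_PASSWORD": "your_password_here",
    "DB_HOST": "localhost",
    "DB_PORT": "5432",
}
print(build_postgres_url(example_env))
```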

Usage

1. Standardize Raw Exports

Convert raw JSONL data into normalized objects (Posts, Accounts, Entities, Actions).

from smdt.standardizers.twitter.twitter_v2 import TwitterV2Standardizer
from smdt.io.readers.jsonl import JSONLReader

standardizer = TwitterV2Standardizer()

# Stream through a JSONL export
for record in JSONLReader("data/tweets.jsonl"):
    for model in standardizer.standardize(record):
        # model is an instance of Posts, Accounts, Entities, or Actions
        print(model)
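For readers unfamiliar with streaming readers, the sketch below shows the general pattern the loop above relies on: a generator that yields one parsed record per line, so the file is never loaded into memory as a whole. `iter_jsonl` is a hypothetical stand-in written for illustration; it is not the implementation of `JSONLReader`.

```python
import json
from typing import Iterator


def iter_jsonl(path: str) -> Iterator[dict]:
    """Yield one parsed object per non-empty line, without loading the whole file."""
    with open(path, "r", encoding="utf-8") as handle:
        for line_number, line in enumerate(handle, start=1):
            line = line.strip()
            if not line:
                continue  # tolerate blank lines
            try:
                yield json.loads(line)
            except json.JSONDecodeError as exc:
                # Report the offending line so bad exports are easy to locate.
                raise ValueError(f"{path}:{line_number}: invalid JSON") from exc
```

Because the function is a generator, it can be passed directly to a `for` loop or chained with other lazy steps.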

2. Inspect Data Quality

Check the completeness and schema distributions of your normalized tables.

from smdt.config import DBConfig
from smdt.store.standard_db import StandardDB
from smdt.inspector.inspector import Inspector, report_schemas

cfg = DBConfig() # reads DB_* env vars
db = StandardDB(db_name=cfg.default_dbname or 'mydb', cfg=cfg)
ins = Inspector(db, schema=getattr(cfg, 'owner', 'public'))

report_schemas([ins], only_tables=['posts', 'actions', 'accounts'])
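As a rough illustration of what "completeness" means here, the sketch below computes, for each field, the share of rows where a value is present. It works on plain dictionaries and is only an assumption about the kind of metric such a report contains; `field_completeness` is a hypothetical helper, not the `Inspector` implementation.

```python
from collections import Counter
from typing import Iterable


def field_completeness(rows: Iterable[dict], fields: list[str]) -> dict[str, float]:
    """Return, per field, the share of rows where the value is present and not None."""
    filled = Counter()
    total = 0
    for row in rows:  # single pass, so `rows` can be a generator
        total += 1
        for field in fields:
            if row.get(field) is not None:
                filled[field] += 1
    if total == 0:
        return {field: 0.0 for field in fields}
    return {field: filled[field] / total for field in fields}


# Example usage with two in-memory rows.
sample = [{"post_id": 1, "body": "hello"}, {"post_id": 2, "body": None}]
print(field_completeness(sample, ["post_id", "body"]))
```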

3. Build Networks

Generate interaction graphs for analysis.

from smdt.config import DBConfig
from smdt.store.standard_db import StandardDB
from smdt.networks.api import user_interaction, entity_cooccurrence

cfg = DBConfig()
db = StandardDB(db_name=cfg.default_dbname or "mydb", cfg=cfg)

# User interaction network (who quoted whom)
# result.edges is a DataFrame with columns src, dst, weight, edge_type
result = user_interaction(db, interaction="QUOTE", weighting="count")
print(result.edges.head())

# Export to Parquet for Gephi/NetworkX
result.edges.to_parquet("edges.parquet")
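To show what a count-weighted edge list looks like, the sketch below aggregates action records into directed edges, one action at a time. It is one plausible reading of `weighting="count"` and uses plain dictionaries with the Actions field names from the data model; `count_edges` is a hypothetical helper, not SMDT's network builder.

```python
from collections import Counter
from typing import Iterable


def count_edges(actions: Iterable[dict], action_type: str) -> list[dict]:
    """Aggregate actions of one type into weighted, directed edges."""
    weights = Counter()
    for action in actions:  # consumed one at a time
        if action.get("action_type") != action_type:
            continue
        src = action.get("originator_account_id")
        dst = action.get("target_account_id")
        if src is None or dst is None:
            continue  # skip actions without both endpoints
        weights[(src, dst)] += 1
    return [
        {"src": src, "dst": dst, "weight": weight, "edge_type": action_type}
        for (src, dst), weight in sorted(weights.items())
    ]


# Example usage with three in-memory actions.
sample_actions = [
    {"originator_account_id": "a", "target_account_id": "b", "action_type": "QUOTE"},
    {"originator_account_id": "a", "target_account_id": "b", "action_type": "QUOTE"},
    {"originator_account_id": "b", "target_account_id": "a", "action_type": "SHARE"},
]
print(count_edges(sample_actions, "QUOTE"))
```

Memory use in this sketch grows with the number of distinct edges rather than the number of actions, which is the property that makes incremental aggregation practical for large exports.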

Project Structure

SMDT/
├── src/smdt/                  # Main package
│   ├── anonymizer/            # Redaction and pseudonymization utilities
│   ├── config.py              # Configuration (DB, anonymization)
│   ├── enrichers/             # Text enrichment framework (local + server adapters)
│   ├── ingest/                # Ingestion pipelines and deduplication logic
│   ├── inspector/             # Data quality inspection utilities
│   ├── io/                    # Streaming readers (JSONL, CSV, ZIP)
│   ├── networks/              # Network builders and streaming helpers
│   ├── standardizers/         # Platform-specific normalizers (Twitter, Bluesky, etc.)
│   └── store/                 # DB models and StandardDB abstraction
├── tests/
│   ├── unit/                  # Fast unit tests (no external deps)
│   └── integration/           # DB integration tests (requires Postgres)
├── prompt.yml                 # Prompt templates for enrichers
└── pyproject.toml             # Project metadata and dependencies

Data Model

SMDT normalizes data into five primary tables. If you create a new standardizer, ensure its output maps to these fields:

| Table | Key Fields |
| --- | --- |
| Communities | `community_id`, `community_type` (CHANNEL/GROUP), `community_username`, `community_name`, `bio`, `is_public`, `member_count`, `post_count`, `profile_image_url`, `owner_account_id`, `created_at`, `retrieved_at` |
| Accounts | `account_id`, `username`, `profile_name`, `bio`, `location`, `post_count`, `friend_count`, `follower_count`, `is_verified`, `profile_image_url`, `created_at`, `retrieved_at` |
| Posts | `post_id`, `account_id`, `conversation_id`, `community_id`, `body`, `like_count`, `dislike_count`, `view_count`, `share_count`, `comment_count`, `quote_count`, `bookmark_count` |
| Entities | `account_id`, `community_id`, `post_id`, `body`, `entity_type` (e.g., HASHTAG), `created_at`, `retrieved_at` |
| Actions | `originator_account_id`, `originator_post_id`, `target_account_id`, `target_post_id`, `originator_community_id`, `target_community_id`, `action_type` (e.g., SHARE), `created_at`, `retrieved_at` |
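As a reading aid for the Actions row, the sketch below mirrors its key fields in a small dataclass. The class name `ActionRecord`, the field types, and the choice of which fields are optional are assumptions made for illustration; the actual models live in `src/smdt/store/` and may differ.

```python
from dataclasses import asdict, dataclass
from datetime import datetime
from typing import Optional


@dataclass
class ActionRecord:
    """Illustrative container mirroring the Actions key fields (types are assumed)."""

    action_type: str  # e.g. "SHARE"
    originator_account_id: Optional[str] = None
    originator_post_id: Optional[str] = None
    target_account_id: Optional[str] = None
    target_post_id: Optional[str] = None
    originator_community_id: Optional[str] = None
    target_community_id: Optional[str] = None
    created_at: Optional[datetime] = None
    retrieved_at: Optional[datetime] = None


# Example usage: a share from one account to another, as a plain dictionary.
share = ActionRecord(
    action_type="SHARE",
    originator_account_id="u1",
    target_account_id="u2",
)
print(asdict(share))
```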

Development & Testing

Follow the existing code style and add tests for any new behavior.

Running Tests

Integration tests require a database. Set TEST_DATABASE_URL in your .env.test.

# Run all tests
uv run python -m pytest

# Run only unit tests (fast, no DB required)
uv run python -m pytest tests/unit

# Run only integration tests
uv run python -m pytest tests/integration

# Verbose output
uv run python -m pytest -v

Adding a Platform

To add support for a platform like Threads:

Create a new module in src/smdt/standardizers/threads/ that maps raw data to the normalized models.
Update src/smdt/standardizers/__init__.py to import and expose the new standardizer.
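The sketch below outlines the shape such a module might take: a class with a `standardize` method that turns one raw record into normalized rows. Everything here is hypothetical. The raw field names (`author`, `handle`, `text`, `likes`) are invented for illustration and do not describe a real Threads export, and the sketch yields plain `(table, row)` pairs instead of SMDT's model instances.

```python
from typing import Iterator


class ThreadsStandardizerSketch:
    """Hypothetical standardizer; raw field names below are invented for illustration."""

    def standardize(self, record: dict) -> Iterator[tuple[str, dict]]:
        """Yield (table_name, row) pairs for one raw record."""
        author = record.get("author") or {}
        account_id = str(author["id"]) if author.get("id") is not None else None

        if account_id is not None:
            yield "accounts", {
                "account_id": account_id,
                "username": author.get("handle"),
            }

        if record.get("id") is not None:
            yield "posts", {
                "post_id": str(record["id"]),
                "account_id": account_id,
                "body": record.get("text"),
                "like_count": record.get("likes"),
            }


# Example usage with a made-up raw record.
raw = {"id": 7, "text": "hi", "likes": 3, "author": {"id": 1, "handle": "alice"}}
for table, row in ThreadsStandardizerSketch().standardize(raw):
    print(table, row)
```

A real standardizer would emit the project's own model classes and follow the interface used by the existing modules under `src/smdt/standardizers/`.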

Citation

License

[License Information Here]
