Production-ready machine learning pipeline for personality classification using ensemble learning, data augmentation, and automated hyperparameter optimization. Achieved top 5% (200/4329) in Kaggle Personality Classification Competition. Modular, maintainable, and includes an interactive dashboard.
ML: scikit-learn, XGBoost, LightGBM, CatBoost, Optuna
Data: pandas, numpy, scipy, SDV
Dashboard: Dash, Plotly
DevOps: Docker, GitHub Actions, pre-commit, uv, Ruff, mypy, Bandit
Watch a live demo of the Personality Classification Dashboard in action
git clone <repository-url>
cd Personality-classification
uv sync
make train-models # Train models
make dash # Launch dashboard
uv run python src/main_modular.py # Run pipeline
- Dashboard Preview
- Quick Start
- Features
- Architecture
- Installation
- Usage
- Dashboard
- Configuration
- Model Stacks
- Performance Metrics
- Testing & Validation
- Troubleshooting
- Documentation
- Modular architecture: 8 specialized modules
- 6 ensemble stacks (A-F) with complementary ML algorithms
- Automated hyperparameter optimization (Optuna)
- Advanced data augmentation (SDV Copula)
- Interactive Dash dashboard
- Dockerized deployment
- Full test coverage (pytest)
src/
├── main_modular.py # Main production pipeline (MLOps-enhanced)
├── modules/ # Core modules
│ ├── config.py # Configuration & logging
│ ├── data_loader.py # Data loading & external merge
│ ├── preprocessing.py # Feature engineering
│ ├── data_augmentation.py # Advanced synthetic data
│ ├── model_builders.py # Model stack construction
│ ├── ensemble.py # Ensemble & OOF predictions
│ ├── optimization.py # Optuna utilities
│ └── utils.py # Utility functions
dash_app/ # Interactive Dashboard
├── dashboard/ # Application source
│ ├── app.py # Main Dash application
│ ├── layout.py # UI layout components
│ ├── callbacks.py # Interactive callbacks
│ └── model_loader.py # Model loading utilities
├── main.py # Application entry point
├── Dockerfile # Container configuration
└── docker-compose.yml # Multi-service orchestration
models/ # Trained Models
├── ensemble_model.pkl # Production ensemble model
├── ensemble_metadata.json # Model metadata and labels
├── stack_*_model.pkl # Individual stack models
└── stack_*_metadata.json # Stack-specific metadata
scripts/ # Utility Scripts
└── train_and_save_models.py # Model training and persistence
data/ # Datasets
docs/ # Documentation
└── [Generated documentation] # Technical guides
best_params/ # Optimized parameters
└── stack_*_best_params.json # Per-stack best parameters
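The saved parameters can be reloaded to rebuild a stack without re-running Optuna. A minimal sketch (the helper name is illustrative; filenames follow the `stack_*_best_params.json` pattern shown above):

```python
import json
from pathlib import Path


def load_best_params(stack_name: str, params_dir: str = "best_params") -> dict:
    """Load the Optuna-optimized parameters saved for one stack."""
    path = Path(params_dir) / f"stack_{stack_name}_best_params.json"
    with path.open() as f:
        return json.load(f)
```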
- Python 3.11+
- uv (modern Python package manager) - Install uv
# Clone repository
git clone <repository-url>
cd Personality-classification
# Install dependencies
uv sync
# Verify installation
uv run python examples/minimal_test.py

# If you prefer pip over uv
pip install -r requirements.txt # Generated from pyproject.toml

# Run production pipeline
uv run python src/main_modular.py
# Launch dashboard (after training models)
make train-models
make dash
# Stop dashboard
make stop-dash
See the video demo above for the latest dashboard interface and features. To launch the dashboard:
make train-models
make dash
# Dashboard available at http://localhost:8050
The pipeline is highly configurable through src/modules/config.py:
# Reproducibility
RND = 42 # Global random seed
# Cross-validation
N_SPLITS = 5 # Stratified K-fold splits
# Hyperparameter optimization
N_TRIALS_STACK = 15 # Optuna trials per stack (15 for testing, 100+ for production)
N_TRIALS_BLEND = 200 # Ensemble blending optimization trials
# Threading configuration
class ThreadConfig(Enum):
    N_JOBS = 4       # Parallel jobs for sklearn
    THREAD_COUNT = 4 # Thread count for XGBoost/LightGBM

# Augmentation settings
ENABLE_DATA_AUGMENTATION = True
AUGMENTATION_METHOD = "sdv_copula" # or "basic", "smote", "adasyn"
AUGMENTATION_RATIO = 0.05 # 5% synthetic data
# Quality control
DIVERSITY_THRESHOLD = 0.95 # Minimum diversity score
QUALITY_THRESHOLD = 0.7 # Minimum quality score

# Label noise for robustness
LABEL_NOISE_RATE = 0.02 # 2% label noise for Stack F
# Testing mode
TESTING_MODE = True # Reduced dataset for development
TESTING_SAMPLE_SIZE = 1000 # Samples in testing mode
# Logging
LOG_LEVEL = "INFO" # DEBUG, INFO, WARNING, ERROR
The pipeline employs six specialized ensemble stacks, each optimized for a different aspect of the problem:
| Stack | Focus | Algorithms | Hyperparameter Space | Special Features |
|---|---|---|---|---|
| A | Traditional ML (Narrow) | Random Forest, Logistic Regression, XGBoost, LightGBM, CatBoost | Conservative search space | Stable baseline performance |
| B | Traditional ML (Wide) | Same as Stack A | Extended search space | Broader exploration |
| C | Gradient Boosting | XGBoost, CatBoost | Gradient boosting focused | Tree-based specialists |
| D | Sklearn Ensemble | Extra Trees, Hist Gradient Boosting, SVM, Gaussian NB | Sklearn-native models | Diverse algorithm mix |
| E | Neural Networks | MLPClassifier, Deep architectures | Neural network tuning | Non-linear pattern capture |
| F | Noise-Robust Training | Same as Stack A | Standard space + label noise | Improved generalization |
- Out-of-fold predictions for unbiased ensemble training
- Optuna-optimized blending weights for each stack
- Meta-learning approach with Logistic Regression as final combiner
- Stratified cross-validation ensures robust evaluation
The pipeline is designed to achieve high accuracy through ensemble learning and advanced optimization; actual performance depends on the configuration (Optuna trial counts, augmentation settings, and testing mode):
Dataset Statistics
├── Training Samples: ~18,000+ (with augmentation)
├── Test Samples: ~6,000+
├── Original Features: 8 personality dimensions
├── Engineered Features: 14+ (with preprocessing)
├── Augmented Samples: Variable (adaptive, typically 5-10%)
└── Class Balance: Extrovert/Introvert classification
Technical Specifications
├── Memory Usage: <4GB peak (configurable)
├── CPU Utilization: 4 cores (configurable)
├── Model Persistence: Yes - Best parameters saved
└── Reproducibility: Yes - Fixed random seeds
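Reproducibility rests on the fixed seeds above; a minimal sketch of applying the global RND seed (the helper name is illustrative — sklearn estimators additionally take the seed via their random_state argument):

```python
import random

import numpy as np

RND = 42  # matches the global seed in src/modules/config.py


def set_global_seed(seed: int = RND) -> None:
    """Seed the stdlib and NumPy RNGs so repeated runs are identical."""
    random.seed(seed)
    np.random.seed(seed)
```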
# Enable testing mode (faster execution)
# Edit src/modules/config.py:
TESTING_MODE = True
TESTING_SAMPLE_SIZE = 1000
# Run with reduced dataset
uv run python src/main_modular.py

# Dashboard won't start
make train-models # Ensure models are trained first
make stop-dash && make dash # Stop and restart dashboard
# Port already in use
lsof -ti:8050 | xargs kill # Kill process on port 8050
make dash # Restart dashboard
# Missing model files
make train-models # Retrain models
ls models/ # Verify model files exist

# Reduce computational load
# In src/modules/config.py:
N_TRIALS_STACK = 5 # Reduce from 15
ENABLE_DATA_AUGMENTATION = False
TESTING_MODE = True

# Verify environment
uv run python --version # Should be 3.11+
uv sync # Reinstall dependencies
uv run python -c "import sklearn, pandas, numpy, dash; print('OK')"

# Optimize for your system
# In src/modules/config.py:
class ThreadConfig(Enum):
    N_JOBS = 2       # Reduce from 4
    THREAD_COUNT = 2 # Reduce from 4

# Enable detailed logging
# In src/modules/config.py:
LOG_LEVEL = "DEBUG"
# Run with verbose output
uv run python src/main_modular.py 2>&1 | tee debug.log
See the docs/ directory for:
- Technical Guide
- API Reference
- Data Augmentation
- Configuration Guide
- Performance Tuning
- Deployment Guide
Lead Developer: Jeremy Vachier. For issues, feature requests, or questions, use GitHub Issues or Discussions.
Contributions welcome! Fork the repo, create a feature branch, implement and test your changes, then submit a pull request.
Licensed under the Apache License 2.0. See LICENSE.
Status: Production Ready | Interactive Dashboard | Modular | Well Documented