Production-ready machine learning pipeline for personality classification using ensemble learning, data augmentation, and automated hyperparameter optimization. Achieved top 5% (200/4329) in Kaggle Personality Classification Competition. Modular, maintainable, and includes an interactive dashboard.
ML: scikit-learn, XGBoost, LightGBM, CatBoost, Optuna
Data: pandas, numpy, scipy, SDV
Dashboard: Dash, Plotly
DevOps: Docker, GitHub Actions, pre-commit, uv, Ruff, mypy, Bandit
Watch a live demo of the Personality Classification Dashboard in action
git clone <repository-url>
cd Personality-classification
uv sync
make train-models # Train models
make dash # Launch dashboard
uv run python src/main_modular.py # Run pipeline
- Dashboard Preview
- Quick Start
- Features
- Architecture
- Installation
- Usage
- Dashboard
- Configuration
- Model Stacks
- Performance Metrics
- Testing & Validation
- Troubleshooting
- Documentation
- Modular architecture: 8 specialized modules
- 6 ensemble stacks (A-F) with complementary ML algorithms
- Automated hyperparameter optimization (Optuna)
- Advanced data augmentation (SDV Copula)
- Interactive Dash dashboard
- Dockerized deployment
- Full test coverage (pytest)
src/
├── main_modular.py # Main production pipeline (MLOps-enhanced)
├── modules/ # Core modules
│ ├── config.py # Configuration & logging
│ ├── data_loader.py # Data loading & external merge
│ ├── preprocessing.py # Feature engineering
│ ├── data_augmentation.py # Advanced synthetic data
│ ├── model_builders.py # Model stack construction
│ ├── ensemble.py # Ensemble & OOF predictions
│ ├── optimization.py # Optuna utilities
│ └── utils.py # Utility functions
dash_app/ # Interactive Dashboard
├── dashboard/ # Application source
│ ├── app.py # Main Dash application
│ ├── layout.py # UI layout components
│ ├── callbacks.py # Interactive callbacks
│ └── model_loader.py # Model loading utilities
├── main.py # Application entry point
├── Dockerfile # Container configuration
└── docker-compose.yml # Multi-service orchestration
models/ # Trained Models
├── ensemble_model.pkl # Production ensemble model
├── ensemble_metadata.json # Model metadata and labels
├── stack_*_model.pkl # Individual stack models
└── stack_*_metadata.json # Stack-specific metadata
scripts/ # Utility Scripts
└── train_and_save_models.py # Model training and persistence
data/ # Datasets
docs/ # Documentation
└── [Generated documentation] # Technical guides
best_params/ # Optimized parameters
└── stack_*_best_params.json # Per-stack best parameters
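The saved parameters can be reloaded to rebuild a stack without re-running Optuna. A minimal sketch (the helper name is illustrative; filenames follow the `stack_*_best_params.json` pattern shown above):

```python
import json
from pathlib import Path


def load_best_params(stack_name: str, params_dir: str = "best_params") -> dict:
    """Load the Optuna-optimized parameters saved for one stack."""
    path = Path(params_dir) / f"stack_{stack_name}_best_params.json"
    with path.open() as f:
        return json.load(f)
```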
- Python 3.11+
- uv (modern Python package manager) - Install uv
# Clone repository
git clone <repository-url>
cd Personality-classification
# Install dependencies
uv sync
# Verify installation
uv run python examples/minimal_test.py

# If you prefer pip over uv
pip install -r requirements.txt # Generated from pyproject.toml

# Run production pipeline
uv run python src/main_modular.py
# Launch dashboard (after training models)
make train-models
make dash
# Stop dashboard
make stop-dash
See the video demo above for the latest dashboard interface and features. To launch the dashboard:
make train-models
make dash
# Dashboard available at http://localhost:8050
The pipeline is highly configurable through src/modules/config.py:
# Reproducibility
RND = 42 # Global random seed
# Cross-validation
N_SPLITS = 5 # Stratified K-fold splits
# Hyperparameter optimization
N_TRIALS_STACK = 15 # Optuna trials per stack (15 for testing, 100+ for production)
N_TRIALS_BLEND = 200 # Ensemble blending optimization trials
# Threading configuration
class ThreadConfig(Enum):
    N_JOBS = 4       # Parallel jobs for sklearn
    THREAD_COUNT = 4 # Thread count for XGBoost/LightGBM

# Augmentation settings
ENABLE_DATA_AUGMENTATION = True
AUGMENTATION_METHOD = "sdv_copula" # or "basic", "smote", "adasyn"
AUGMENTATION_RATIO = 0.05 # 5% synthetic data
# Quality control
DIVERSITY_THRESHOLD = 0.95 # Minimum diversity score
QUALITY_THRESHOLD = 0.7 # Minimum quality score

# Label noise for robustness
LABEL_NOISE_RATE = 0.02 # 2% label noise for Stack F
# Testing mode
TESTING_MODE = True # Reduced dataset for development
TESTING_SAMPLE_SIZE = 1000 # Samples in testing mode
# Logging
LOG_LEVEL = "INFO" # DEBUG, INFO, WARNING, ERROR
The pipeline employs six specialized ensemble stacks, each optimized for a different aspect of the problem:
| Stack | Focus | Algorithms | Hyperparameter Space | Special Features |
|---|---|---|---|---|
| A | Traditional ML (Narrow) | Random Forest, Logistic Regression, XGBoost, LightGBM, CatBoost | Conservative search space | Stable baseline performance |
| B | Traditional ML (Wide) | Same as Stack A | Extended search space | Broader exploration |
| C | Gradient Boosting | XGBoost, CatBoost | Gradient boosting focused | Tree-based specialists |
| D | Sklearn Ensemble | Extra Trees, Hist Gradient Boosting, SVM, Gaussian NB | Sklearn-native models | Diverse algorithm mix |
| E | Neural Networks | MLPClassifier, Deep architectures | Neural network tuning | Non-linear pattern capture |
| F | Noise-Robust Training | Same as Stack A | Standard space + label noise | Improved generalization |
- Out-of-fold predictions for unbiased ensemble training
- Optuna-optimized blending weights for each stack
- Meta-learning approach with Logistic Regression as final combiner
- Stratified cross-validation ensures robust evaluation
The pipeline is designed to achieve high accuracy through ensemble learning and advanced optimization; actual performance depends on the configuration (Optuna trial counts, augmentation settings, and testing mode):
Dataset Statistics
├── Training Samples: ~18,000+ (with augmentation)
├── Test Samples: ~6,000+
├── Original Features: 8 personality dimensions
├── Engineered Features: 14+ (with preprocessing)
├── Augmented Samples: Variable (adaptive, typically 5-10%)
└── Class Balance: Extrovert/Introvert classification
Technical Specifications
├── Memory Usage: <4GB peak (configurable)
├── CPU Utilization: 4 cores (configurable)
├── Model Persistence: Yes - Best parameters saved
└── Reproducibility: Yes - Fixed random seeds
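Reproducibility rests on the fixed seeds above; a minimal sketch of applying the global RND seed (the helper name is illustrative — sklearn estimators additionally take the seed via their random_state argument):

```python
import random

import numpy as np

RND = 42  # matches the global seed in src/modules/config.py


def set_global_seed(seed: int = RND) -> None:
    """Seed the stdlib and NumPy RNGs so repeated runs are identical."""
    random.seed(seed)
    np.random.seed(seed)
```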
# Enable testing mode (faster execution)
# Edit src/modules/config.py:
TESTING_MODE = True
TESTING_SAMPLE_SIZE = 1000
# Run with reduced dataset
uv run python src/main_modular.py

# Dashboard won't start
make train-models # Ensure models are trained first
make stop-dash && make dash # Stop and restart dashboard
# Port already in use
lsof -ti:8050 | xargs kill # Kill process on port 8050
make dash # Restart dashboard
# Missing model files
make train-models # Retrain models
ls models/ # Verify model files exist

# Reduce computational load
# In src/modules/config.py:
N_TRIALS_STACK = 5 # Reduce from 15
ENABLE_DATA_AUGMENTATION = False
TESTING_MODE = True

# Verify environment
uv run python --version # Should be 3.11+
uv sync # Reinstall dependencies
uv run python -c "import sklearn, pandas, numpy, dash; print('OK')"

# Optimize for your system
# In src/modules/config.py:
class ThreadConfig(Enum):
    N_JOBS = 2       # Reduce from 4
    THREAD_COUNT = 2 # Reduce from 4

# Enable detailed logging
# In src/modules/config.py:
LOG_LEVEL = "DEBUG"
# Run with verbose output
uv run python src/main_modular.py 2>&1 | tee debug.log
See the docs/ directory for:
- Technical Guide
- API Reference
- Data Augmentation
- Configuration Guide
- Performance Tuning
- Deployment Guide
Lead Developer: Jeremy Vachier. For issues, feature requests, or questions, use GitHub Issues or Discussions.
Contributions welcome! Fork the repo, create a feature branch, implement and test your changes, then submit a pull request.
Licensed under the Apache License 2.0. See LICENSE.
Status: Production Ready | Interactive Dashboard | Modular | Well Documented