This repository contains my solution for the Predict2Protect Machine Learning Competition, where the goal was to predict whether a company is Alive or Failed based on its financial and operational features.
The challenge involved developing a classification model that predicts a company’s `status_label` (Alive / Failed).
Data consisted of company-level records, each belonging to a specific Division and MajorGroup.
The evaluation metric was Macro F1-Score, emphasizing balanced performance on both classes.
- Checked for missing values, data imbalance, and outliers.
- Verified no company overlap between train and test (leakage prevention).
- Analyzed `Division` and `MajorGroup` hierarchies (noting class imbalance).
- Ensured that every company’s `status_label` is consistent (all rows share the same label).
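
The leakage and label-consistency checks above can be sketched with pandas. This is a minimal illustration, not the repository's actual code: the toy frames and their values are made up, and only the column names `company_name` and `status_label` come from this README.

```python
import pandas as pd

# Toy frames standing in for the competition data (values are made up).
train = pd.DataFrame({
    "company_name": ["C_1", "C_1", "C_2", "C_3"],
    "status_label": ["Alive", "Alive", "Failed", "Alive"],
})
test = pd.DataFrame({"company_name": ["C_4", "C_5"]})

# Leakage check: companies that appear in both splits.
overlap = set(train["company_name"]) & set(test["company_name"])

# Label consistency: each company should carry exactly one distinct label.
labels_per_company = train.groupby("company_name")["status_label"].nunique()
inconsistent = labels_per_company[labels_per_company > 1].index.tolist()

# Class balance measured per company rather than per row.
company_labels = train.groupby("company_name")["status_label"].first()
class_share = company_labels.value_counts(normalize=True)

print(f"overlapping companies: {len(overlap)}")
print(f"companies with mixed labels: {inconsistent}")
print(class_share)
```

Measuring class balance per company avoids over-counting companies that contribute many rows.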
- Generated ratio-based and interaction features.
- Aggregated statistics at the company level.
- Applied label encoding for categorical columns.
- Removed redundant or highly correlated features.
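
As a rough sketch of these feature-engineering steps, assuming hypothetical financial columns (`total_assets`, `total_liabilities`, `net_income`) that are not named in this README, and an illustrative correlation cutoff of 0.95:

```python
import numpy as np
import pandas as pd

# Hypothetical columns and values; the real feature names are not listed here.
df = pd.DataFrame({
    "company_name": ["C_1", "C_1", "C_2", "C_2"],
    "Division": ["D", "D", "E", "E"],
    "total_assets": [100.0, 120.0, 50.0, 40.0],
    "total_liabilities": [60.0, 66.0, 45.0, 44.0],
    "net_income": [10.0, 12.0, -5.0, -8.0],
})

# Ratio features; a zero denominator becomes NaN instead of inf.
assets = df["total_assets"].replace(0, np.nan)
df["debt_ratio"] = df["total_liabilities"] / assets
df["return_on_assets"] = df["net_income"] / assets

# Company-level aggregate, broadcast back onto every row of that company.
df["roa_company_mean"] = (
    df.groupby("company_name")["return_on_assets"].transform("mean")
)

# Label encoding: map each category to an integer code.
df["Division_code"] = df["Division"].astype("category").cat.codes


def drop_correlated(frame, threshold=0.95):
    """Drop one column from each pair whose absolute correlation exceeds threshold."""
    corr = frame.corr().abs()
    # Keep only the strict upper triangle so each pair is inspected once.
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    to_drop = [col for col in upper.columns if (upper[col] > threshold).any()]
    return frame.drop(columns=to_drop), to_drop


numeric = df.select_dtypes(include="number")
reduced, dropped = drop_correlated(numeric)
print("dropped:", dropped)
```

On a four-row toy frame most columns correlate strongly, so many are dropped; on real data the threshold would need to be chosen by validation.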
- Implemented LightGBM, CatBoost, XGBoost, and ExtraTrees models.
- Used GroupKFold cross-validation (grouped by `company_name`) to avoid data leakage.
- Optimized Macro F1-Score via threshold tuning.
- Final predictions were created by weighted averaging of the model outputs.
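
The validation scheme can be illustrated with scikit-learn alone. This sketch makes several assumptions: the data is synthetic, `LogisticRegression` stands in for the gradient-boosting models to keep the example dependency-light, and the blend weights and threshold grid are illustrative rather than the values used in the competition.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import GroupKFold

# Synthetic stand-in data: 600 rows from 120 hypothetical companies, 5 rows each.
X, y = make_classification(
    n_samples=600, n_features=10, weights=[0.8, 0.2], random_state=0
)
groups = np.repeat(np.arange(120), 5)

# Illustrative blend weights (they sum to 1); not the tuned competition weights.
weights = {"extra_trees": 0.6, "logreg": 0.4}
oof = np.zeros(len(y))  # out-of-fold blended probability of the positive class

for train_idx, valid_idx in GroupKFold(n_splits=5).split(X, y, groups):
    # Group-aware split: no company appears on both sides of a fold.
    assert not set(groups[train_idx]) & set(groups[valid_idx])
    models = {
        "extra_trees": ExtraTreesClassifier(n_estimators=100, random_state=0),
        "logreg": LogisticRegression(max_iter=1000),
    }
    for name, model in models.items():
        model.fit(X[train_idx], y[train_idx])
        proba = model.predict_proba(X[valid_idx])[:, 1]
        oof[valid_idx] += weights[name] * proba


def best_threshold(y_true, proba, grid=np.linspace(0.05, 0.95, 19)):
    """Return the grid threshold with the highest macro F1, and that score."""
    scores = [
        f1_score(y_true, (proba >= t).astype(int), average="macro") for t in grid
    ]
    best = int(np.argmax(scores))
    return float(grid[best]), float(scores[best])


threshold, macro_f1 = best_threshold(y, oof)
print(f"threshold={threshold:.2f}, macro F1={macro_f1:.3f}")
```

Tuning the threshold on the same out-of-fold predictions used for scoring gives a slightly optimistic estimate; a nested or held-out split would be a stricter check.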
- Enforced company-level label consistency: if any row of a company was predicted as Failed, the whole company was marked as Failed.
- Calibrated thresholds to maximize F1 on validation folds.
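
The company-level consistency rule can be expressed as a grouped `any` over row-level predictions. The sketch below assumes a hypothetical `proba_failed` column and a placeholder threshold of 0.5; the actual calibrated threshold is not stated in this README.

```python
import pandas as pd

# Made-up row-level probabilities for three hypothetical companies.
pred = pd.DataFrame({
    "company_name": ["C_1", "C_1", "C_2", "C_2", "C_3"],
    "proba_failed": [0.20, 0.70, 0.10, 0.30, 0.55],
})
threshold = 0.5  # placeholder, not the calibrated value

# Row-level decision first.
pred["row_failed"] = pred["proba_failed"] >= threshold

# If any row of a company is flagged, flag every row of that company.
pred["company_failed"] = (
    pred.groupby("company_name")["row_failed"].transform("any")
)
pred["status_label"] = pred["company_failed"].map({True: "Failed", False: "Alive"})

print(pred[["company_name", "proba_failed", "status_label"]])
```

This rule can only move predictions toward Failed, so it tends to raise recall on that class at some cost in precision, which is one reason to calibrate the threshold together with it.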
| Metric | Score |
|---|---|
| Cross-Validation Macro F1 | 0.63 ± 0.01 |
| Final Leaderboard Macro F1 | 0.635972 |
| Final Rank | #14 / Top 15 Finalist |
- Importance of group-aware validation to prevent leakage.
- Effectiveness of label consistency enforcement at company level.
- Balancing precision and recall through threshold optimization.
- Robust pipeline design beats leaderboard hacks in the long run.
- Advanced hyperparameter optimization (Optuna/Bayesian tuning).
- Seed aggregation and model ensembling for variance reduction.
- Hierarchical encoding of `Division` → `MajorGroup` relationships.
- Python, Pandas, NumPy
- LightGBM, CatBoost, XGBoost, Scikit-Learn
- Matplotlib, Seaborn, tqdm
Devkanth Ravi
Top-15 Finalist, Predict2Protect 2025
📫 Connect: LinkedIn | Kaggle
“A consistent, explainable model always outlasts a lucky leaderboard spike.”