
Machine learning solution for the Predict2Protect challenge: predicting whether a company is Alive or Failed from financial and operational data. Implements a group-aware, leakage-safe pipeline using LightGBM, CatBoost, and XGBoost with Macro F1 optimization. Finalist solution, ranked #14 (Top 15).


MonarchofCoding/predict2protect-company-failure-prediction


🧠 Predict2Protect – Company Failure Prediction Challenge

This repository contains my solution for the Predict2Protect Machine Learning Competition, where the goal was to predict whether a company is Alive or Failed based on its financial and operational features.


🏁 Competition Overview

The challenge involved developing a classification model that predicts a company’s status_label (Alive / Failed).
Data consisted of company-level records, each belonging to a specific Division and MajorGroup.
The evaluation metric was Macro F1-Score, emphasizing balanced performance on both classes.
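
To make the metric concrete, here is a minimal sketch of Macro F1 computed by hand on toy labels (the data is invented for illustration, not taken from the competition). It shows why the metric rewards balanced performance: a model that always predicts Alive reaches 80% accuracy on this toy set but scores poorly on Macro F1, because the Failed class contributes an F1 of zero.

```python
def f1_for_class(y_true, y_pred, cls):
    """F1 of one class, treating `cls` as the positive label."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == cls and p == cls)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != cls and p == cls)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == cls and p != cls)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def macro_f1(y_true, y_pred, classes=("Alive", "Failed")):
    """Unweighted mean of the per-class F1 scores."""
    return sum(f1_for_class(y_true, y_pred, c) for c in classes) / len(classes)


# Toy example: 8 Alive rows, 2 Failed rows, and a model that always says Alive.
y_true = ["Alive"] * 8 + ["Failed"] * 2
all_alive = ["Alive"] * 10

accuracy = sum(t == p for t, p in zip(y_true, all_alive)) / len(y_true)
print(f"accuracy = {accuracy:.3f}, macro F1 = {macro_f1(y_true, all_alive):.3f}")
```

In practice, `sklearn.metrics.f1_score(y_true, y_pred, average="macro")` computes the same quantity.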


🚀 Approach Summary

1. Data Understanding & EDA

  • Checked for missing values, data imbalance, and outliers.
  • Verified no company overlap between train and test (leakage prevention).
  • Analyzed Division and MajorGroup hierarchies (noting class imbalance).
  • Ensured that every company’s status_label is consistent (all rows share the same label).
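
The leakage and consistency checks above can be sketched in a few lines of pandas. This is an illustrative sketch, not the repository's actual code: the `company_name` and `status_label` columns come from the description above, while the toy frames stand in for the real train and test files.

```python
import pandas as pd

# Toy stand-ins for the real train/test data (values are invented).
train = pd.DataFrame({
    "company_name": ["A", "A", "B", "B", "C"],
    "status_label": ["Alive", "Alive", "Failed", "Failed", "Alive"],
})
test = pd.DataFrame({"company_name": ["D", "E", "E"]})

# 1) Leakage check: no company should appear in both train and test.
overlap = set(train["company_name"]) & set(test["company_name"])

# 2) Consistency check: every company should carry exactly one label.
labels_per_company = train.groupby("company_name")["status_label"].nunique()
inconsistent = labels_per_company[labels_per_company > 1].index.tolist()

# 3) Class balance at row level.
class_share = train["status_label"].value_counts(normalize=True)

print("overlapping companies:", overlap)
print("companies with mixed labels:", inconsistent)
print(class_share)
```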

2. Feature Engineering

  • Generated ratio-based and interaction features.
  • Aggregated statistics at the company level.
  • Applied label encoding for categorical columns.
  • Removed redundant or highly correlated features.
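
As a rough illustration of these four steps, the sketch below builds ratio features, company-level aggregates, an integer encoding, and a correlation filter. The financial column names (`total_assets`, `total_liabilities`, `net_income`), the 0.95 correlation cutoff, and the helper `drop_correlated` are assumptions made for the example; the actual feature set is not documented here. Pandas category codes are used as a simple stand-in for label encoding.

```python
import numpy as np
import pandas as pd

# Toy data; the financial columns are hypothetical names.
df = pd.DataFrame({
    "company_name": ["A", "A", "B", "B"],
    "Division": ["Mining", "Mining", "Retail", "Retail"],
    "total_assets": [100.0, 120.0, 50.0, 0.0],
    "total_liabilities": [40.0, 60.0, 45.0, 10.0],
    "net_income": [10.0, 12.0, -5.0, -8.0],
})

# Ratio features: replace zero denominators with NaN to avoid inf values.
assets = df["total_assets"].replace(0, np.nan)
df["debt_ratio"] = df["total_liabilities"] / assets
df["return_on_assets"] = df["net_income"] / assets

# Company-level aggregates, broadcast back to every row of the company.
by_company = df.groupby("company_name")["net_income"]
df["net_income_mean"] = by_company.transform("mean")
df["net_income_std"] = by_company.transform("std")

# Integer (label) encoding of a categorical column.
df["Division_enc"] = df["Division"].astype("category").cat.codes


def drop_correlated(frame, cols, threshold=0.95):
    """Drop every column in `cols` that is highly correlated with an earlier one."""
    corr = frame[cols].corr().abs()
    # Keep only the strict upper triangle so each pair is inspected once.
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    to_drop = [c for c in upper.columns if (upper[c] > threshold).any()]
    return frame.drop(columns=to_drop), to_drop


numeric_cols = ["debt_ratio", "return_on_assets", "net_income_mean", "net_income_std"]
reduced, dropped = drop_correlated(df, numeric_cols)
print("dropped:", dropped)
```

Tree-based models tolerate the NaN values left by the guarded division, which is one reason to prefer NaN over an arbitrary fill value here.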

3. Model Building

  • Implemented LightGBM, CatBoost, XGBoost, and ExtraTrees models.
  • Used GroupKFold Cross-Validation (grouped by company_name) to avoid data leakage.
  • Optimized Macro F1-Score via threshold tuning.
  • Created final predictions by weighted averaging of the model outputs.
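
The validation loop described above might look roughly like the following sketch. To keep it dependent on scikit-learn only, `ExtraTreesClassifier` and `GradientBoostingClassifier` stand in for the LightGBM, CatBoost, and XGBoost models; the synthetic data, the blend weights, the threshold grid, and the encoding 1 = Failed are all assumptions for illustration.

```python
import numpy as np
from sklearn.ensemble import ExtraTreesClassifier, GradientBoostingClassifier
from sklearn.metrics import f1_score
from sklearn.model_selection import GroupKFold

# Synthetic data: 60 companies with 5 rows each, one label per company.
rng = np.random.default_rng(0)
n_companies, rows_per_company = 60, 5
groups = np.repeat(np.arange(n_companies), rows_per_company)
company_label = rng.integers(0, 2, size=n_companies)  # assumed: 1 = Failed
y = company_label[groups]
X = rng.normal(size=(len(y), 4)) + 0.8 * y[:, None]

# Stand-in models and illustrative blend weights (not the tuned values).
models = {
    "extra_trees": ExtraTreesClassifier(n_estimators=100, random_state=0),
    "grad_boost": GradientBoostingClassifier(random_state=0),
}
weights = {"extra_trees": 0.4, "grad_boost": 0.6}

# Out-of-fold probabilities; GroupKFold keeps each company in a single fold.
oof = {name: np.zeros(len(y)) for name in models}
for train_idx, valid_idx in GroupKFold(n_splits=5).split(X, y, groups):
    assert not set(groups[train_idx]) & set(groups[valid_idx])
    for name, model in models.items():
        model.fit(X[train_idx], y[train_idx])
        oof[name][valid_idx] = model.predict_proba(X[valid_idx])[:, 1]

# Weighted average of the model outputs.
blend = sum(weights[name] * oof[name] for name in models)


def best_threshold(y_true, proba, grid=None):
    """Return the threshold on `grid` with the highest Macro F1, and that score."""
    if grid is None:
        grid = np.linspace(0.05, 0.95, 19)
    scores = [f1_score(y_true, (proba >= t).astype(int), average="macro") for t in grid]
    best = int(np.argmax(scores))
    return float(grid[best]), float(scores[best])


threshold, score = best_threshold(y, blend)
print(f"threshold = {threshold:.2f}, out-of-fold Macro F1 = {score:.3f}")
```

Because the threshold is chosen on the same out-of-fold predictions it is scored on, the reported score is slightly optimistic; tuning the threshold inside each fold gives a stricter estimate.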

4. Post-Processing

  • Enforced company-level label consistency:

    If any row of a company was predicted as Failed, every row of that company was marked as Failed.

  • Tuned decision thresholds to maximize Macro F1 on validation folds.
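
The company-level consistency rule can be expressed as a grouped "any" over row-level predictions. The sketch below is illustrative: the `proba_failed` column, the toy probabilities, and the 0.5 threshold are assumptions, while `company_name` and `status_label` follow the description above.

```python
import pandas as pd

# Toy row-level predictions (probabilities are invented).
preds = pd.DataFrame({
    "company_name": ["A", "A", "A", "B", "B", "C"],
    "proba_failed": [0.20, 0.70, 0.10, 0.30, 0.25, 0.90],
})
threshold = 0.5  # hypothetical; in practice taken from validation tuning

# Row-level decision first.
preds["row_failed"] = preds["proba_failed"] >= threshold

# If any row of a company is Failed, mark every row of that company as Failed.
preds["company_failed"] = preds.groupby("company_name")["row_failed"].transform("any")
preds["status_label"] = preds["company_failed"].map({True: "Failed", False: "Alive"})

print(preds[["company_name", "proba_failed", "status_label"]])
```

The "any" rule favors recall on the Failed class. Averaging probabilities per company before thresholding is a more conservative alternative with the same consistency guarantee.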

📊 Results

| Metric | Score |
| --- | --- |
| Cross-Validation Macro F1 | 0.63 ± 0.01 |
| Final Leaderboard Macro F1 | 0.635972 |
| Final Rank | #14 / Top 15 Finalist |

💬 Key Learnings

  • Importance of group-aware validation to prevent leakage.
  • Effectiveness of label consistency enforcement at company level.
  • Balancing precision and recall through threshold optimization.
  • Robust pipeline design beats leaderboard hacks in the long run.

🔮 Future Work

  • Advanced hyperparameter optimization (Optuna/Bayesian tuning).
  • Seed aggregation and model ensembling for variance reduction.
  • Hierarchical encoding of Division → MajorGroup relationships.

🧰 Tech Stack

  • Python, Pandas, NumPy
  • LightGBM, CatBoost, XGBoost, Scikit-Learn
  • Matplotlib, Seaborn, tqdm

🧑‍💻 Author

Devkanth Ravi
Top-15 Finalist, Predict2Protect 2025
📫 Connect: LinkedIn | Kaggle


“A consistent, explainable model always outlasts a lucky leaderboard spike.”
