Description
When training on GPU on very large datasets (more than 70,000,000 rows), model fitting fails with the error
lightgbm.basic.LightGBMError: Check failed: (best_split_info.left_count) > (0) at /__w/1/s/lightgbm-python/src/treelearner/serial_tree_learner.cpp, line 852
Sometimes I will get a similar error with (best_split_info.right_count) > (0) instead.
Reproducible example
Here is a minimal example with random data:
import lightgbm
import numpy as np
model = lightgbm.LGBMRegressor(
boosting="gbdt",
n_estimators=200,
learning_rate=0.01,
max_depth=5,
num_leaves=31,
min_data_in_leaf=1000,
min_sum_hessian_in_leaf=1e-3,
bagging_freq=5,
bagging_fraction=0.1,
feature_fraction=0.1,
lambda_l1=1e-10,
lambda_l2=1e-07,
alpha=0.5,
objective="l1",
num_threads=8,
device="gpu",
seed=1,
)
np.random.seed(0)
Xtrain = np.random.rand(100_000_000, 250)
y = np.random.rand(100_000_000)
model.fit(Xtrain, y)
# The following, with less data works fine:
# model.fit(Xtrain[:10_000_000], y[:10_000_000])
The script outputs:
[LightGBM] [Warning] min_data_in_leaf is set=1000, min_child_samples=20 will be ignored. Current value: min_data_in_leaf=1000
[LightGBM] [Warning] feature_fraction is set=0.1, colsample_bytree=1.0 will be ignored. Current value: feature_fraction=0.1
[LightGBM] [Warning] lambda_l2 is set=1e-07, reg_lambda=0.0 will be ignored. Current value: lambda_l2=1e-07
[LightGBM] [Warning] boosting is set=gbdt, boosting_type=gbdt will be ignored. Current value: boosting=gbdt
[LightGBM] [Warning] min_sum_hessian_in_leaf is set=0.001, min_child_weight=0.001 will be ignored. Current value: min_sum_hessian_in_leaf=0.001
[LightGBM] [Warning] lambda_l1 is set=1e-10, reg_alpha=0.0 will be ignored. Current value: lambda_l1=1e-10
[LightGBM] [Warning] bagging_fraction is set=0.1, subsample=1.0 will be ignored. Current value: bagging_fraction=0.1
[LightGBM] [Warning] bagging_freq is set=5, subsample_freq=0 will be ignored. Current value: bagging_freq=5
[LightGBM] [Warning] min_data_in_leaf is set=1000, min_child_samples=20 will be ignored. Current value: min_data_in_leaf=1000
[LightGBM] [Warning] feature_fraction is set=0.1, colsample_bytree=1.0 will be ignored. Current value: feature_fraction=0.1
[LightGBM] [Warning] lambda_l2 is set=1e-07, reg_lambda=0.0 will be ignored. Current value: lambda_l2=1e-07
[LightGBM] [Warning] boosting is set=gbdt, boosting_type=gbdt will be ignored. Current value: boosting=gbdt
[LightGBM] [Warning] min_sum_hessian_in_leaf is set=0.001, min_child_weight=0.001 will be ignored. Current value: min_sum_hessian_in_leaf=0.001
[LightGBM] [Warning] lambda_l1 is set=1e-10, reg_alpha=0.0 will be ignored. Current value: lambda_l1=1e-10
[LightGBM] [Warning] bagging_fraction is set=0.1, subsample=1.0 will be ignored. Current value: bagging_fraction=0.1
[LightGBM] [Warning] bagging_freq is set=5, subsample_freq=0 will be ignored. Current value: bagging_freq=5
[LightGBM] [Info] This is the GPU trainer!!
[LightGBM] [Info] Total Bins 63750
[LightGBM] [Info] Number of data points in the train set: 100000000, number of used features: 250
[LightGBM] [Info] Using GPU Device: NVIDIA H200 NVL, Vendor: NVIDIA Corporation
[LightGBM] [Info] Compiling OpenCL Kernel with 256 bins...
[LightGBM] [Info] GPU programs have been built
[LightGBM] [Info] Size of histogram bin entry: 8
[LightGBM] [Info] 250 dense feature groups (24032.59 MB) transferred to GPU in 3.873491 secs. 0 sparse feature groups
[LightGBM] [Info] Start training from score 0.500117
[LightGBM] [Fatal] Check failed: (best_split_info.left_count) > (0) at /__w/1/s/lightgbm-python/src/treelearner/serial_tree_learner.cpp, line 852 .
Traceback (most recent call last):
File "/home/alexander/models/LightGBM_split_error.py", line 33, in <module>
model.fit(Xtrain, y)
File "/home/alexander/models/.venv/lib/python3.12/site-packages/lightgbm/sklearn.py", line 1398, in fit
super().fit(
File "/home/alexander/models/.venv/lib/python3.12/site-packages/lightgbm/sklearn.py", line 1049, in fit
self._Booster = train(
^^^^^^
File "/home/alexander/models/.venv/lib/python3.12/site-packages/lightgbm/engine.py", line 322, in train
booster.update(fobj=fobj)
File "/home/alexander/models/.venv/lib/python3.12/site-packages/lightgbm/basic.py", line 4154, in update
_safe_call(
File "/home/alexander/models/.venv/lib/python3.12/site-packages/lightgbm/basic.py", line 313, in _safe_call
raise LightGBMError(_LIB.LGBM_GetLastError().decode("utf-8"))
lightgbm.basic.LightGBMError: Check failed: (best_split_info.left_count) > (0) at /__w/1/s/lightgbm-python/src/treelearner/serial_tree_learner.cpp, line 852 .
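For reference, here is the same configuration expressed through the native lightgbm.train API. This is only a sketch mirroring the parameters above, not something I have verified separately at full scale; since the traceback shows the sklearn wrapper ends up in lightgbm/engine.py's train(), the failure is presumably not specific to the wrapper.
import lightgbm as lgb
import numpy as np

# Same settings as the LGBMRegressor call above, spelled as native parameters.
params = {
    "boosting": "gbdt",
    "objective": "l1",
    "alpha": 0.5,
    "learning_rate": 0.01,
    "max_depth": 5,
    "num_leaves": 31,
    "min_data_in_leaf": 1000,
    "min_sum_hessian_in_leaf": 1e-3,
    "bagging_freq": 5,
    "bagging_fraction": 0.1,
    "feature_fraction": 0.1,
    "lambda_l1": 1e-10,
    "lambda_l2": 1e-07,
    "num_threads": 8,
    "device": "gpu",
    "seed": 1,
}

np.random.seed(0)
# Note: this float64 array alone needs roughly 200 GB of host RAM.
Xtrain = np.random.rand(100_000_000, 250)
y = np.random.rand(100_000_000)

dtrain = lgb.Dataset(Xtrain, label=y)
booster = lgb.train(params, dtrain, num_boost_round=200)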
Environment info
LightGBM version or commit hash: version 4.6.0
Command(s) you used to install LightGBM
python3.12 -m venv .venv
source .venv/bin/activate
pip install numpy==2.2.6 scikit-learn==1.7.2 lightgbm==4.6.0
pip list
Package Version
------------- -------
joblib 1.5.2
lightgbm 4.6.0
numpy 2.2.6
pip 24.0
scikit-learn 1.7.2
scipy 1.16.2
threadpoolctl 3.6.0
I have also tried compiling LightGBM with a CUDA kernel based on #7062 (commit hash 32781ba), but I got the same error with both device="cuda" and device="gpu" on the LGBMRegressor.
The GPU I am using is an NVIDIA H200 NVL.
nvidia-smi
NVIDIA-SMI 570.172.08
Driver Version: 570.172.08
CUDA Version: 12.8
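To double-check what the OpenCL build can see on this machine, something like the following can be used (a hypothetical check assuming pyopencl is installed; it is not part of the environment listed above):
# List the OpenCL platforms/devices available to LightGBM's device="gpu"
# (OpenCL) backend, with their global memory size.
import pyopencl as cl

for platform in cl.get_platforms():
    for device in platform.get_devices():
        print(
            f"{platform.name} | {device.name} | "
            f"{device.global_mem_size // (1024 ** 2)} MiB global memory"
        )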
Additional Comments
- I have real data where there is actual signal between the features and the target, but I get the same outcome: fitting works when subsetting the number of rows, but raises this error when I apply it to all the rows. On that data, I have found that I can sometimes run successfully for a few training iterations and then crash with the error above.
- Training on fewer features, e.g. X.shape = (10_000_000, 50), works fine.
- I do not get the same issue on CPU.
- I have tried changing min_child_weight, min_child_samples, feature_fraction, bagging_fraction, min_sum_hessian_in_leaf, max_depth, num_leaves, and max_bin, but it does not seem to fix the problem (see the sketch after this list for the kind of variations I tried).
- I am aware of the related issues #3679 (Avoiding Exception "Check failed: (best_split_info.right_count) > (0) at ..." with a regression task) and #4946 (Check failed: (best_split_info.left_count) > (0)), but did not find a solution to my issue there.
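The sketch below illustrates the kind of single-parameter variations I tried; the override values shown are illustrative examples, not the exact grid I ran.
import lightgbm
import numpy as np

np.random.seed(0)
# Full-size random data, as in the reproducible example above.
Xtrain = np.random.rand(100_000_000, 250)
y = np.random.rand(100_000_000)

base_params = dict(
    boosting="gbdt", n_estimators=200, learning_rate=0.01, max_depth=5,
    num_leaves=31, bagging_freq=5, bagging_fraction=0.1, feature_fraction=0.1,
    objective="l1", alpha=0.5, num_threads=8, device="gpu", seed=1,
)

# Illustrative overrides, applied one at a time on top of the base parameters.
overrides = [
    {"min_child_weight": 1.0},
    {"min_child_samples": 5000},
    {"feature_fraction": 0.5},
    {"bagging_fraction": 0.5},
    {"max_depth": 3, "num_leaves": 7},
    {"max_bin": 63},
]

for override in overrides:
    model = lightgbm.LGBMRegressor(**{**base_params, **override})
    try:
        model.fit(Xtrain, y)
        print("OK:", override)
    except lightgbm.basic.LightGBMError as err:
        print("FAILED:", override, "->", err)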