LightGBM breaks on very large data when training on GPU #7063

@AMChristgau

Description

When training on GPU on very large datasets (more than 70,000,000 rows), model fitting breaks with the error

lightgbm.basic.LightGBMError: Check failed: (best_split_info.left_count) > (0) at /__w/1/s/lightgbm-python/src/treelearner/serial_tree_learner.cpp, line 852

Sometimes I get a similar error with (best_split_info.right_count) > (0) instead.

Reproducible example

Here is a minimal example with random data:

import lightgbm
import numpy as np

model = lightgbm.LGBMRegressor(
    boosting="gbdt",
    n_estimators=200,
    learning_rate=0.01,
    max_depth=5,
    num_leaves=31,
    min_data_in_leaf=1000,
    min_sum_hessian_in_leaf=1e-3,
    bagging_freq=5,
    bagging_fraction=0.1,
    feature_fraction=0.1,
    lambda_l1=1e-10,
    lambda_l2=1e-07,
    alpha=0.5,
    objective="l1",
    num_threads=8,
    device="gpu",
    seed=1,
)

np.random.seed(0)
Xtrain = np.random.rand(100_000_000, 250)
y = np.random.rand(100_000_000)
model.fit(Xtrain, y)

# The following, with fewer rows, works fine:
# model.fit(Xtrain[:10_000_000], y[:10_000_000])

The script outputs:

[LightGBM] [Warning] min_data_in_leaf is set=1000, min_child_samples=20 will be ignored. Current value: min_data_in_leaf=1000
[LightGBM] [Warning] feature_fraction is set=0.1, colsample_bytree=1.0 will be ignored. Current value: feature_fraction=0.1
[LightGBM] [Warning] lambda_l2 is set=1e-07, reg_lambda=0.0 will be ignored. Current value: lambda_l2=1e-07
[LightGBM] [Warning] boosting is set=gbdt, boosting_type=gbdt will be ignored. Current value: boosting=gbdt
[LightGBM] [Warning] min_sum_hessian_in_leaf is set=0.001, min_child_weight=0.001 will be ignored. Current value: min_sum_hessian_in_leaf=0.001
[LightGBM] [Warning] lambda_l1 is set=1e-10, reg_alpha=0.0 will be ignored. Current value: lambda_l1=1e-10
[LightGBM] [Warning] bagging_fraction is set=0.1, subsample=1.0 will be ignored. Current value: bagging_fraction=0.1
[LightGBM] [Warning] bagging_freq is set=5, subsample_freq=0 will be ignored. Current value: bagging_freq=5
[LightGBM] [Warning] min_data_in_leaf is set=1000, min_child_samples=20 will be ignored. Current value: min_data_in_leaf=1000
[LightGBM] [Warning] feature_fraction is set=0.1, colsample_bytree=1.0 will be ignored. Current value: feature_fraction=0.1
[LightGBM] [Warning] lambda_l2 is set=1e-07, reg_lambda=0.0 will be ignored. Current value: lambda_l2=1e-07
[LightGBM] [Warning] boosting is set=gbdt, boosting_type=gbdt will be ignored. Current value: boosting=gbdt
[LightGBM] [Warning] min_sum_hessian_in_leaf is set=0.001, min_child_weight=0.001 will be ignored. Current value: min_sum_hessian_in_leaf=0.001
[LightGBM] [Warning] lambda_l1 is set=1e-10, reg_alpha=0.0 will be ignored. Current value: lambda_l1=1e-10
[LightGBM] [Warning] bagging_fraction is set=0.1, subsample=1.0 will be ignored. Current value: bagging_fraction=0.1
[LightGBM] [Warning] bagging_freq is set=5, subsample_freq=0 will be ignored. Current value: bagging_freq=5
[LightGBM] [Info] This is the GPU trainer!!
[LightGBM] [Info] Total Bins 63750
[LightGBM] [Info] Number of data points in the train set: 100000000, number of used features: 250
[LightGBM] [Info] Using GPU Device: NVIDIA H200 NVL, Vendor: NVIDIA Corporation
[LightGBM] [Info] Compiling OpenCL Kernel with 256 bins...
[LightGBM] [Info] GPU programs have been built
[LightGBM] [Info] Size of histogram bin entry: 8
[LightGBM] [Info] 250 dense feature groups (24032.59 MB) transferred to GPU in 3.873491 secs. 0 sparse feature groups
[LightGBM] [Info] Start training from score 0.500117
[LightGBM] [Fatal] Check failed: (best_split_info.left_count) > (0) at /__w/1/s/lightgbm-python/src/treelearner/serial_tree_learner.cpp, line 852 .

Traceback (most recent call last):
  File "/home/alexander/models/LightGBM_split_error.py", line 33, in <module>
    model.fit(Xtrain, y)
  File "/home/alexander/models/.venv/lib/python3.12/site-packages/lightgbm/sklearn.py", line 1398, in fit
    super().fit(
  File "/home/alexander/models/.venv/lib/python3.12/site-packages/lightgbm/sklearn.py", line 1049, in fit
    self._Booster = train(
                    ^^^^^^
  File "/home/alexander/models/.venv/lib/python3.12/site-packages/lightgbm/engine.py", line 322, in train
    booster.update(fobj=fobj)
  File "/home/alexander/models/.venv/lib/python3.12/site-packages/lightgbm/basic.py", line 4154, in update
    _safe_call(
  File "/home/alexander/models/.venv/lib/python3.12/site-packages/lightgbm/basic.py", line 313, in _safe_call
    raise LightGBMError(_LIB.LGBM_GetLastError().decode("utf-8"))
lightgbm.basic.LightGBMError: Check failed: (best_split_info.left_count) > (0) at /__w/1/s/lightgbm-python/src/treelearner/serial_tree_learner.cpp, line 852 .

Environment info

LightGBM version or commit hash: version 4.6.0

Command(s) you used to install LightGBM

python3.12 -m venv .venv
source .venv/bin/activate
pip install numpy==2.2.6 scikit-learn==1.7.2 lightgbm==4.6.0
> pip list
Package       Version
------------- -------
joblib        1.5.2
lightgbm      4.6.0
numpy         2.2.6
pip           24.0
scikit-learn  1.7.2
scipy         1.16.2
threadpoolctl 3.6.0

I have also tried compiling LightGBM with a CUDA kernel based on #7062 (commit hash 32781ba), but I got the same error with both device="cuda" and device="gpu" on the LGBMRegressor.

The GPU I am using is an NVIDIA H200 NVL.

nvidia-smi

NVIDIA-SMI 570.172.08
Driver Version: 570.172.08
CUDA Version: 12.8

Additional Comments

  • I have real data where there is actual signal between the features and the target, and I get the same outcome: fitting works when subsetting the rows, but it raises the error when I apply it to all of them.
    On that data, I have found that I can sometimes successfully run for a few training iterations and then crash with the error above.
  • Training on fewer features, e.g. X.shape = (10_000_000, 50), works fine.
  • I do not get the same issue on CPU.
  • I have tried changing min_child_weight, min_child_samples, feature_fraction, bagging_fraction, min_sum_hessian_in_leaf, max_depth, num_leaves, and max_bin, but none of these seem to fix the problem.
  • I am aware of the related issues Avoiding Exception "Check failed: (best_split_info.right_count) > (0) at ..." with a regression task #3679 and Check failed: (best_split_info.left_count) > (0) #4946, but did not find a solution to my issue there.
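Since fitting succeeds on 10,000,000 rows but fails on the full 100,000,000, one way to narrow the report down further is to bisect over the row count and find the exact threshold where the failure first appears. The sketch below is only a diagnostic idea, not part of the original report; the `fit_succeeds` callable is hypothetical and would wrap `model.fit` on a row subset, catching `lightgbm.basic.LightGBMError`:

```python
def largest_fitting_rows(fit_succeeds, lo, hi):
    """Binary-search the largest row count for which training succeeds.

    Assumes fit_succeeds(lo) is True and fit_succeeds(hi) is False,
    and that the failure is monotone in the number of rows.
    """
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if fit_succeeds(mid):
            lo = mid  # mid rows still train fine
        else:
            hi = mid  # mid rows already trigger the check failure
    return lo

# Hypothetical usage against the reproducible example above:
#
# def fit_succeeds(n_rows):
#     try:
#         lightgbm.clone(model).fit(Xtrain[:n_rows], y[:n_rows])
#         return True
#     except lightgbm.basic.LightGBMError:
#         return False
#
# threshold = largest_fitting_rows(fit_succeeds, 10_000_000, 100_000_000)
```

Each probe retrains from scratch, so this is expensive on data this size, but knowing whether the threshold is stable (and whether it sits near a power of two) could help the maintainers localize the bug.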

Metadata

Labels: gpu (CUDA), gpu (OpenCL), question