LightGBM breaks on very large data when training on GPU #7063

@AMChristgau

Description

When training on GPU on very large datasets (more than 70,000,000 rows), model fitting breaks with the error

lightgbm.basic.LightGBMError: Check failed: (best_split_info.left_count) > (0) at /__w/1/s/lightgbm-python/src/treelearner/serial_tree_learner.cpp, line 852

Sometimes I get a similar error with (best_split_info.right_count) > (0) instead.

Reproducible example

Here is a minimal example with random data:

import lightgbm
import numpy as np

model = lightgbm.LGBMRegressor(
    boosting="gbdt",
    n_estimators=200,
    learning_rate=0.01,
    max_depth=5,
    num_leaves=31,
    min_data_in_leaf=1000,
    min_sum_hessian_in_leaf=1e-3,
    bagging_freq=5,
    bagging_fraction=0.1,
    feature_fraction=0.1,
    lambda_l1=1e-10,
    lambda_l2=1e-07,
    alpha=0.5,
    objective="l1",
    num_threads=8,
    device="gpu",
    seed=1,
)

np.random.seed(0)
Xtrain = np.random.rand(100_000_000, 250)
y = np.random.rand(100_000_000)
model.fit(Xtrain, y)

# The following, with fewer rows, works fine:
# model.fit(Xtrain[:10_000_000], y[:10_000_000])

The script outputs:

[LightGBM] [Warning] min_data_in_leaf is set=1000, min_child_samples=20 will be ignored. Current value: min_data_in_leaf=1000
[LightGBM] [Warning] feature_fraction is set=0.1, colsample_bytree=1.0 will be ignored. Current value: feature_fraction=0.1
[LightGBM] [Warning] lambda_l2 is set=1e-07, reg_lambda=0.0 will be ignored. Current value: lambda_l2=1e-07
[LightGBM] [Warning] boosting is set=gbdt, boosting_type=gbdt will be ignored. Current value: boosting=gbdt
[LightGBM] [Warning] min_sum_hessian_in_leaf is set=0.001, min_child_weight=0.001 will be ignored. Current value: min_sum_hessian_in_leaf=0.001
[LightGBM] [Warning] lambda_l1 is set=1e-10, reg_alpha=0.0 will be ignored. Current value: lambda_l1=1e-10
[LightGBM] [Warning] bagging_fraction is set=0.1, subsample=1.0 will be ignored. Current value: bagging_fraction=0.1
[LightGBM] [Warning] bagging_freq is set=5, subsample_freq=0 will be ignored. Current value: bagging_freq=5
[LightGBM] [Warning] min_data_in_leaf is set=1000, min_child_samples=20 will be ignored. Current value: min_data_in_leaf=1000
[LightGBM] [Warning] feature_fraction is set=0.1, colsample_bytree=1.0 will be ignored. Current value: feature_fraction=0.1
[LightGBM] [Warning] lambda_l2 is set=1e-07, reg_lambda=0.0 will be ignored. Current value: lambda_l2=1e-07
[LightGBM] [Warning] boosting is set=gbdt, boosting_type=gbdt will be ignored. Current value: boosting=gbdt
[LightGBM] [Warning] min_sum_hessian_in_leaf is set=0.001, min_child_weight=0.001 will be ignored. Current value: min_sum_hessian_in_leaf=0.001
[LightGBM] [Warning] lambda_l1 is set=1e-10, reg_alpha=0.0 will be ignored. Current value: lambda_l1=1e-10
[LightGBM] [Warning] bagging_fraction is set=0.1, subsample=1.0 will be ignored. Current value: bagging_fraction=0.1
[LightGBM] [Warning] bagging_freq is set=5, subsample_freq=0 will be ignored. Current value: bagging_freq=5
[LightGBM] [Info] This is the GPU trainer!!
[LightGBM] [Info] Total Bins 63750
[LightGBM] [Info] Number of data points in the train set: 100000000, number of used features: 250
[LightGBM] [Info] Using GPU Device: NVIDIA H200 NVL, Vendor: NVIDIA Corporation
[LightGBM] [Info] Compiling OpenCL Kernel with 256 bins...
[LightGBM] [Info] GPU programs have been built
[LightGBM] [Info] Size of histogram bin entry: 8
[LightGBM] [Info] 250 dense feature groups (24032.59 MB) transferred to GPU in 3.873491 secs. 0 sparse feature groups
[LightGBM] [Info] Start training from score 0.500117
[LightGBM] [Fatal] Check failed: (best_split_info.left_count) > (0) at /__w/1/s/lightgbm-python/src/treelearner/serial_tree_learner.cpp, line 852 .

Traceback (most recent call last):
  File "/home/alexander/models/LightGBM_split_error.py", line 33, in <module>
    model.fit(Xtrain, y)
  File "/home/alexander/models/.venv/lib/python3.12/site-packages/lightgbm/sklearn.py", line 1398, in fit
    super().fit(
  File "/home/alexander/models/.venv/lib/python3.12/site-packages/lightgbm/sklearn.py", line 1049, in fit
    self._Booster = train(
                    ^^^^^^
  File "/home/alexander/models/.venv/lib/python3.12/site-packages/lightgbm/engine.py", line 322, in train
    booster.update(fobj=fobj)
  File "/home/alexander/models/.venv/lib/python3.12/site-packages/lightgbm/basic.py", line 4154, in update
    _safe_call(
  File "/home/alexander/models/.venv/lib/python3.12/site-packages/lightgbm/basic.py", line 313, in _safe_call
    raise LightGBMError(_LIB.LGBM_GetLastError().decode("utf-8"))
lightgbm.basic.LightGBMError: Check failed: (best_split_info.left_count) > (0) at /__w/1/s/lightgbm-python/src/treelearner/serial_tree_learner.cpp, line 852 .

Environment info

LightGBM version or commit hash: version 4.6.0

Command(s) you used to install LightGBM

python3.12 -m venv .venv
source .venv/bin/activate
pip install numpy==2.2.6 scikit-learn==1.7.2 lightgbm==4.6.0
> pip list
Package       Version
------------- -------
joblib        1.5.2
lightgbm      4.6.0
numpy         2.2.6
pip           24.0
scikit-learn  1.7.2
scipy         1.16.2
threadpoolctl 3.6.0

I have also tried compiling LightGBM with a CUDA kernel based on #7062 (commit hash 32781ba), but I got the same error with both device="cuda" and device="gpu" on the LGBMRegressor.

The GPU I am using is an NVIDIA H200 NVL.

nvidia-smi

NVIDIA-SMI 570.172.08
Driver Version: 570.172.08
CUDA Version: 12.8

Additional Comments

  • I have real data where there is actual signal between the features and the target, and I get the same outcome: fitting works when subsetting the rows, but it raises the error when I apply it to all of them.
    On that data, I have found that I can sometimes successfully run for a few training iterations and then crash with the error above.
  • Training on fewer features, e.g. X.shape = (10_000_000, 50), works fine.
  • I do not get the same issue on CPU.
  • I have tried changing min_child_weight, min_child_samples, feature_fraction, bagging_fraction, min_sum_hessian_in_leaf, max_depth, num_leaves, and max_bin, but none of these seem to fix the problem.
  • I am aware of the related issues Avoiding Exception "Check failed: (best_split_info.right_count) > (0) at ..." with a regression task #3679 and Check failed: (best_split_info.left_count) > (0) #4946, but did not find a solution to my issue there.
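Since fitting succeeds on 10,000,000 rows but fails on the full 100,000,000, one way to narrow the report down further is to bisect over the row count and find the exact threshold where the failure first appears. The sketch below is only a diagnostic idea, not part of the original report; the `fit_succeeds` callable is hypothetical and would wrap `model.fit` on a row subset, catching `lightgbm.basic.LightGBMError`:

```python
def largest_fitting_rows(fit_succeeds, lo, hi):
    """Binary-search the largest row count for which training succeeds.

    Assumes fit_succeeds(lo) is True and fit_succeeds(hi) is False,
    and that the failure is monotone in the number of rows.
    """
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if fit_succeeds(mid):
            lo = mid  # mid rows still train fine
        else:
            hi = mid  # mid rows already trigger the check failure
    return lo

# Hypothetical usage against the reproducible example above:
#
# def fit_succeeds(n_rows):
#     try:
#         lightgbm.clone(model).fit(Xtrain[:n_rows], y[:n_rows])
#         return True
#     except lightgbm.basic.LightGBMError:
#         return False
#
# threshold = largest_fitting_rows(fit_succeeds, 10_000_000, 100_000_000)
```

Each probe retrains from scratch, so this is expensive on data this size, but knowing whether the threshold is stable (and whether it sits near a power of two) could help the maintainers localize the bug.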

Metadata

Labels: gpu (CUDA), gpu (OpenCL), question