
rag train #284

@cicijohn1983

Description


When I run the RAG training script train.sh in the examples/rag folder, the following error occurs:

2025-11-07 08:43:54,784 INFO worker.py:1832 -- Connecting to existing Ray cluster at address: 172.17.0.2:6379...
2025-11-07 08:43:54,795 INFO worker.py:2003 -- Connected to Ray cluster. View the dashboard at http://172.17.0.2:8265
(pid=56124) free(): double free detected in tcache 2
(pid=56124) *** SIGABRT received at time=1762505050 on cpu 30 ***
(pid=56124) PC: @ 0x7ee742b7d9fc (unknown) pthread_kill
(pid=56124) @ 0x7ee742b29520 (unknown) (unknown)
(pid=56124) [2025-11-07 08:44:10,019 E 56124 56124] logging.cc:474: *** SIGABRT received at time=1762505050 on cpu 30 ***
(pid=56124) [2025-11-07 08:44:10,019 E 56124 56124] logging.cc:474: PC: @ 0x7ee742b7d9fc (unknown) pthread_kill
(pid=56124) [2025-11-07 08:44:10,019 E 56124 56124] logging.cc:474: @ 0x7ee742b29520 (unknown) (unknown)
(pid=56124) Fatal Python error: Aborted
(pid=56124)
(pid=56124) Stack (most recent call first):
(pid=56124) File "/opt/conda/lib/python3.11/site-packages/torch/cuda/__init__.py", line 182 in is_available
(pid=56124) File "/opt/conda/lib/python3.11/site-packages/verl/utils/device.py", line 28 in <module>
(pid=56124) File "<frozen importlib._bootstrap>", line 241 in _call_with_frames_removed
(pid=56124) File "<frozen importlib._bootstrap_external>", line 940 in exec_module
(pid=56124) File "<frozen importlib._bootstrap>", line 690 in _load_unlocked
(pid=56124) File "<frozen importlib._bootstrap>", line 1147 in _find_and_load_unlocked
(pid=56124) File "<frozen importlib._bootstrap>", line 1176 in _find_and_load
(pid=56124) File "/opt/conda/lib/python3.11/site-packages/verl/protocol.py", line 37 in <module>
(pid=56124) File "<frozen importlib._bootstrap>", line 241 in _call_with_frames_removed
(pid=56124) File "<frozen importlib._bootstrap_external>", line 940 in exec_module
(pid=56124) File "<frozen importlib._bootstrap>", line 690 in _load_unlocked
(pid=56124) File "<frozen importlib._bootstrap>", line 1147 in _find_and_load_unlocked
(pid=56124) File "<frozen importlib._bootstrap>", line 1176 in _find_and_load
(pid=56124) File "/opt/conda/lib/python3.11/site-packages/verl/__init__.py", line 23 in <module>
(pid=56124) File "<frozen importlib._bootstrap>", line 241 in _call_with_frames_removed
(pid=56124) File "<frozen importlib._bootstrap_external>", line 940 in exec_module
(pid=56124) File "<frozen importlib._bootstrap>", line 690 in _load_unlocked
(pid=56124) File "<frozen importlib._bootstrap>", line 1147 in _find_and_load_unlocked
(pid=56124) File "<frozen importlib._bootstrap>", line 1176 in _find_and_load
(pid=56124) File "<frozen importlib._bootstrap>", line 241 in _call_with_frames_removed
(pid=56124) File "<frozen importlib._bootstrap>", line 1126 in _find_and_load_unlocked
(pid=56124) File "<frozen importlib._bootstrap>", line 1176 in _find_and_load
(pid=56124) File "<frozen importlib._bootstrap>", line 241 in _call_with_frames_removed
(pid=56124) File "<frozen importlib._bootstrap>", line 1126 in _find_and_load_unlocked
(pid=56124) File "<frozen importlib._bootstrap>", line 1176 in _find_and_load
(pid=56124) File "<frozen importlib._bootstrap>", line 241 in _call_with_frames_removed
(pid=56124) File "<frozen importlib._bootstrap>", line 1126 in _find_and_load_unlocked
(pid=56124) File "<frozen importlib._bootstrap>", line 1176 in _find_and_load
(pid=56124) File "/opt/conda/lib/python3.11/site-packages/ray/_private/function_manager.py", line 649 in _load_actor_class_from_gcs
(pid=56124) File "/opt/conda/lib/python3.11/site-packages/ray/_private/function_manager.py", line 544 in load_actor_class
(pid=56124) File "/opt/conda/lib/python3.11/site-packages/ray/_private/worker.py", line 1042 in main_loop
(pid=56124) File "/opt/conda/lib/python3.11/site-packages/ray/_private/workers/default_worker.py", line 322 in <module>
(pid=56124)
(pid=56124) Extension modules: psutil._psutil_linux, msgpack._cmsgpack, google._upb._message, _brotli, zstandard.backend_c, yaml._yaml, uvloop.loop, ray._raylet, multidict._multidict, yarl._quoting_c, propcache._helpers_c, aiohttp._http_writer, aiohttp._http_parser, aiohttp._websocket.mask, aiohttp._websocket.reader_c, frozenlist._frozenlist, markupsafe._speedups, _cffi_backend, websockets.speedups, setproctitle._setproctitle, numpy.core._multiarray_umath, numpy.core._multiarray_tests, numpy.linalg._umath_linalg, numpy.fft._pocketfft_internal, numpy.random._common, numpy.random.bit_generator, numpy.random._bounded_integers, numpy.random._mt19937, numpy.random.mtrand, numpy.random._philox, numpy.random._pcg64, numpy.random._sfc64, numpy.random._generator, pyarrow.lib, pandas._libs.tslibs.ccalendar, pandas._libs.tslibs.np_datetime, pandas._libs.tslibs.dtypes, pandas._libs.tslibs.base, pandas._libs.tslibs.nattype, pandas._libs.tslibs.timezones, pandas._libs.tslibs.fields, pandas._libs.tslibs.timedeltas, pandas._libs.tslibs.tzconversion, pandas._libs.tslibs.timestamps, pandas._libs.properties, pandas._libs.tslibs.offsets, pandas._libs.tslibs.strptime, pandas._libs.tslibs.parsing, pandas._libs.tslibs.conversion, pandas._libs.tslibs.period, pandas._libs.tslibs.vectorized, pandas._libs.ops_dispatch, pandas._libs.missing, pandas._libs.hashtable, pandas._libs.algos, pandas._libs.interval, pandas._libs.lib, pyarrow._compute, pandas._libs.ops, pandas._libs.hashing, pandas._libs.arrays, pandas._libs.tslib, pandas._libs.sparse, pandas._libs.internals, pandas._libs.indexing, pandas._libs.index, pandas._libs.writers, pandas._libs.join, pandas._libs.window.aggregations, pandas._libs.window.indexers, pandas._libs.reshape, pandas._libs.groupby, pandas._libs.json, pandas._libs.parsers, pandas._libs.testing, torch._C, torch._C._dynamo.autograd_compiler, torch._C._dynamo.eval_frame, torch._C._dynamo.guards, torch._C._dynamo.utils, torch._C._fft, torch._C._linalg, torch._C._nested, torch._C._nn, torch._C._sparse, torch._C._special (total: 86)
(raylet) A worker died or was killed while executing a task by an unexpected system error. To troubleshoot the problem, check the logs for the dead worker. Lease ID: 0a000000a6230df10c7f1f24f47bad41e1cd1cf609ec5ee955f94ec351e7171c Worker ID: 0e3ad1d22ab3b2b4822c3ee3b3f2370ad2807783b7cf16e271d0cfec Node ID: 627e478f584a97a15c630f6f011b71ef07921b36d4a764a0c90580ba Worker IP address: 172.17.0.2 Worker port: 10055 Worker PID: 56124 Worker exit type: SYSTEM_ERROR Worker exit detail: Worker unexpectedly exits with a connection error code 2. End of file. There are some potential root causes. (1) The process is killed by SIGKILL by OOM killer due to high memory usage. (2) ray stop --force is called. (3) The worker is crashed unexpectedly due to SIGSEGV or other unexpected errors.
Error executing job with overrides: ['algorithm.adv_estimator=grpo', 'data.train_files=/code/agent-lightning-main/examples/rag/data/musique_train.parquet', 'data.val_files=/code/agent-lightning-main/examples/rag/data/musique_dev_128.parquet', 'trainer.n_gpus_per_node=1', 'data.train_batch_size=2', 'actor_rollout_ref.rollout.n=2', 'actor_rollout_ref.model.path=/models/Qwen2.5-Coder-0.5B-Instruct', 'data.max_prompt_length=2048', 'data.max_response_length=1024', 'trainer.total_epochs=2', 'trainer.logger=console', 'trainer.val_before_train=True', 'actor_rollout_ref.rollout.name=hf', '++actor_rollout_ref.rollout.batch_size=2', '++actor_rollout_ref.rollout.max_new_tokens=1024', '++actor_rollout_ref.rollout.temperature=0.7', '++actor_rollout_ref.rollout.top_p=0.9']
Traceback (most recent call last):
File "/opt/conda/lib/python3.11/site-packages/agentlightning/verl/entrypoint.py", line 29, in main
run_ppo(config, train_dataset=None, val_dataset=None, store=None, llm_proxy=None, adapter=None)
File "/opt/conda/lib/python3.11/site-packages/agentlightning/verl/entrypoint.py", line 50, in run_ppo
ray.get(
File "/opt/conda/lib/python3.11/site-packages/ray/_private/auto_init_hook.py", line 22, in auto_init_wrapper
return fn(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^
File "/opt/conda/lib/python3.11/site-packages/ray/_private/client_mode_hook.py", line 104, in wrapper
return func(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^
File "/opt/conda/lib/python3.11/site-packages/ray/_private/worker.py", line 2961, in get
values, debugger_breakpoint = worker.get_objects(
^^^^^^^^^^^^^^^^^^^
File "/opt/conda/lib/python3.11/site-packages/ray/_private/worker.py", line 1028, in get_objects
raise value
ray.exceptions.ActorDiedError: The actor died unexpectedly before finishing this task.
class_name: TaskRunner
actor_id: baf61da82878d689ad6376490b000000
pid: 56124
namespace: 56ee1a5c-3c00-42ad-a3be-88925f77d934
ip: 172.17.0.2
The actor is dead because its worker process has died. Worker exit type: SYSTEM_ERROR Worker exit detail: Worker unexpectedly exits with a connection error code 2. End of file. There are some potential root causes. (1) The process is killed by SIGKILL by OOM killer due to high memory usage. (2) ray stop --force is called. (3) The worker is crashed unexpectedly due to SIGSEGV or other unexpected errors.
The actor never ran - it was cancelled before it started running.

Set the environment variable HYDRA_FULL_ERROR=1 for a complete stack trace.
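
For reference, the worker stack above aborts inside torch.cuda.is_available(), which verl/utils/device.py calls at import time while Ray loads the TaskRunner actor class. Below is a hypothetical minimal sketch that exercises only that import path, assuming Ray and PyTorch are importable in the worker environment; it does not use verl or agent-lightning, and the CudaProbe name is made up for illustration:

import ray

ray.init(address="auto")  # connect to the existing cluster, as train.sh does

@ray.remote(num_gpus=1)  # matches trainer.n_gpus_per_node=1 from the overrides
class CudaProbe:
    # Hypothetical stand-in for the TaskRunner actor: it only exercises the
    # CUDA probe that verl/utils/device.py runs at import time in the worker.
    def check(self) -> bool:
        import torch  # imported inside the worker process, as in the failing actor
        return torch.cuda.is_available()

probe = CudaProbe.remote()
print(ray.get(probe.check.remote()))  # the original run aborts before a call like this returns

If this sketch hits the same free(): double free detected in tcache 2 abort, the problem would seem to be in the CUDA/PyTorch setup of the Ray worker environment rather than in the RAG example itself.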

Please tell me how to fix it.
