
rag train #284

@cicijohn1983

Description


When I run the RAG training script train.sh in the examples/rag folder, the following error occurs:

2025-11-07 08:43:54,784 INFO worker.py:1832 -- Connecting to existing Ray cluster at address: 172.17.0.2:6379...
2025-11-07 08:43:54,795 INFO worker.py:2003 -- Connected to Ray cluster. View the dashboard at http://172.17.0.2:8265
(pid=56124) free(): double free detected in tcache 2
(pid=56124) *** SIGABRT received at time=1762505050 on cpu 30 ***
(pid=56124) PC: @ 0x7ee742b7d9fc (unknown) pthread_kill
(pid=56124) @ 0x7ee742b29520 (unknown) (unknown)
(pid=56124) [2025-11-07 08:44:10,019 E 56124 56124] logging.cc:474: *** SIGABRT received at time=1762505050 on cpu 30 ***
(pid=56124) [2025-11-07 08:44:10,019 E 56124 56124] logging.cc:474: PC: @ 0x7ee742b7d9fc (unknown) pthread_kill
(pid=56124) [2025-11-07 08:44:10,019 E 56124 56124] logging.cc:474: @ 0x7ee742b29520 (unknown) (unknown)
(pid=56124) Fatal Python error: Aborted
(pid=56124)
(pid=56124) Stack (most recent call first):
(pid=56124) File "/opt/conda/lib/python3.11/site-packages/torch/cuda/__init__.py", line 182 in is_available
(pid=56124) File "/opt/conda/lib/python3.11/site-packages/verl/utils/device.py", line 28 in <module>
(pid=56124) File "<frozen importlib._bootstrap>", line 241 in _call_with_frames_removed
(pid=56124) File "<frozen importlib._bootstrap_external>", line 940 in exec_module
(pid=56124) File "<frozen importlib._bootstrap>", line 690 in _load_unlocked
(pid=56124) File "<frozen importlib._bootstrap>", line 1147 in _find_and_load_unlocked
(pid=56124) File "<frozen importlib._bootstrap>", line 1176 in _find_and_load
(pid=56124) File "/opt/conda/lib/python3.11/site-packages/verl/protocol.py", line 37 in <module>
(pid=56124) File "<frozen importlib._bootstrap>", line 241 in _call_with_frames_removed
(pid=56124) File "<frozen importlib._bootstrap_external>", line 940 in exec_module
(pid=56124) File "<frozen importlib._bootstrap>", line 690 in _load_unlocked
(pid=56124) File "<frozen importlib._bootstrap>", line 1147 in _find_and_load_unlocked
(pid=56124) File "<frozen importlib._bootstrap>", line 1176 in _find_and_load
(pid=56124) File "/opt/conda/lib/python3.11/site-packages/verl/__init__.py", line 23 in <module>
(pid=56124) File "<frozen importlib._bootstrap>", line 241 in _call_with_frames_removed
(pid=56124) File "<frozen importlib._bootstrap_external>", line 940 in exec_module
(pid=56124) File "<frozen importlib._bootstrap>", line 690 in _load_unlocked
(pid=56124) File "<frozen importlib._bootstrap>", line 1147 in _find_and_load_unlocked
(pid=56124) File "<frozen importlib._bootstrap>", line 1176 in _find_and_load
(pid=56124) File "<frozen importlib._bootstrap>", line 241 in _call_with_frames_removed
(pid=56124) File "<frozen importlib._bootstrap>", line 1126 in _find_and_load_unlocked
(pid=56124) File "<frozen importlib._bootstrap>", line 1176 in _find_and_load
(pid=56124) File "<frozen importlib._bootstrap>", line 241 in _call_with_frames_removed
(pid=56124) File "<frozen importlib._bootstrap>", line 1126 in _find_and_load_unlocked
(pid=56124) File "<frozen importlib._bootstrap>", line 1176 in _find_and_load
(pid=56124) File "<frozen importlib._bootstrap>", line 241 in _call_with_frames_removed
(pid=56124) File "<frozen importlib._bootstrap>", line 1126 in _find_and_load_unlocked
(pid=56124) File "<frozen importlib._bootstrap>", line 1176 in _find_and_load
(pid=56124) File "/opt/conda/lib/python3.11/site-packages/ray/_private/function_manager.py", line 649 in _load_actor_class_from_gcs
(pid=56124) File "/opt/conda/lib/python3.11/site-packages/ray/_private/function_manager.py", line 544 in load_actor_class
(pid=56124) File "/opt/conda/lib/python3.11/site-packages/ray/_private/worker.py", line 1042 in main_loop
(pid=56124) File "/opt/conda/lib/python3.11/site-packages/ray/_private/workers/default_worker.py", line 322 in <module>
(pid=56124)
(pid=56124) Extension modules: psutil._psutil_linux, msgpack._cmsgpack, google._upb._message, _brotli, zstandard.backend_c, yaml._yaml, uvloop.loop, ray._raylet, multidict._multidict, yarl._quoting_c, propcache._helpers_c, aiohttp._http_writer, aiohttp._http_parser, aiohttp._websocket.mask, aiohttp._websocket.reader_c, frozenlist._frozenlist, markupsafe._speedups, _cffi_backend, websockets.speedups, setproctitle._setproctitle, numpy.core._multiarray_umath, numpy.core._multiarray_tests, numpy.linalg._umath_linalg, numpy.fft._pocketfft_internal, numpy.random._common, numpy.random.bit_generator, numpy.random._bounded_integers, numpy.random._mt19937, numpy.random.mtrand, numpy.random._philox, numpy.random._pcg64, numpy.random._sfc64, numpy.random._generator, pyarrow.lib, pandas._libs.tslibs.ccalendar, pandas._libs.tslibs.np_datetime, pandas._libs.tslibs.dtypes, pandas._libs.tslibs.base, pandas._libs.tslibs.nattype, pandas._libs.tslibs.timezones, pandas._libs.tslibs.fields, pandas._libs.tslibs.timedeltas, pandas._libs.tslibs.tzconversion, pandas._libs.tslibs.timestamps, pandas._libs.properties, pandas._libs.tslibs.offsets, pandas._libs.tslibs.strptime, pandas._libs.tslibs.parsing, pandas._libs.tslibs.conversion, pandas._libs.tslibs.period, pandas._libs.tslibs.vectorized, pandas._libs.ops_dispatch, pandas._libs.missing, pandas._libs.hashtable, pandas._libs.algos, pandas._libs.interval, pandas._libs.lib, pyarrow._compute, pandas._libs.ops, pandas._libs.hashing, pandas._libs.arrays, pandas._libs.tslib, pandas._libs.sparse, pandas._libs.internals, pandas._libs.indexing, pandas._libs.index, pandas._libs.writers, pandas._libs.join, pandas._libs.window.aggregations, pandas._libs.window.indexers, pandas._libs.reshape, pandas._libs.groupby, pandas._libs.json, pandas._libs.parsers, pandas._libs.testing, torch._C, torch._C._dynamo.autograd_compiler, torch._C._dynamo.eval_frame, torch._C._dynamo.guards, torch._C._dynamo.utils, torch._C._fft, torch._C._linalg, torch._C._nested, torch._C._nn, torch._C._sparse, torch._C._special (total: 86)
(raylet) A worker died or was killed while executing a task by an unexpected system error. To troubleshoot the problem, check the logs for the dead worker. Lease ID: 0a000000a6230df10c7f1f24f47bad41e1cd1cf609ec5ee955f94ec351e7171c Worker ID: 0e3ad1d22ab3b2b4822c3ee3b3f2370ad2807783b7cf16e271d0cfec Node ID: 627e478f584a97a15c630f6f011b71ef07921b36d4a764a0c90580ba Worker IP address: 172.17.0.2 Worker port: 10055 Worker PID: 56124 Worker exit type: SYSTEM_ERROR Worker exit detail: Worker unexpectedly exits with a connection error code 2. End of file. There are some potential root causes. (1) The process is killed by SIGKILL by OOM killer due to high memory usage. (2) ray stop --force is called. (3) The worker is crashed unexpectedly due to SIGSEGV or other unexpected errors.
Error executing job with overrides: ['algorithm.adv_estimator=grpo', 'data.train_files=/code/agent-lightning-main/examples/rag/data/musique_train.parquet', 'data.val_files=/code/agent-lightning-main/examples/rag/data/musique_dev_128.parquet', 'trainer.n_gpus_per_node=1', 'data.train_batch_size=2', 'actor_rollout_ref.rollout.n=2', 'actor_rollout_ref.model.path=/models/Qwen2.5-Coder-0.5B-Instruct', 'data.max_prompt_length=2048', 'data.max_response_length=1024', 'trainer.total_epochs=2', 'trainer.logger=console', 'trainer.val_before_train=True', 'actor_rollout_ref.rollout.name=hf', '++actor_rollout_ref.rollout.batch_size=2', '++actor_rollout_ref.rollout.max_new_tokens=1024', '++actor_rollout_ref.rollout.temperature=0.7', '++actor_rollout_ref.rollout.top_p=0.9']
Traceback (most recent call last):
File "/opt/conda/lib/python3.11/site-packages/agentlightning/verl/entrypoint.py", line 29, in main
run_ppo(config, train_dataset=None, val_dataset=None, store=None, llm_proxy=None, adapter=None)
File "/opt/conda/lib/python3.11/site-packages/agentlightning/verl/entrypoint.py", line 50, in run_ppo
ray.get(
File "/opt/conda/lib/python3.11/site-packages/ray/_private/auto_init_hook.py", line 22, in auto_init_wrapper
return fn(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^
File "/opt/conda/lib/python3.11/site-packages/ray/_private/client_mode_hook.py", line 104, in wrapper
return func(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^
File "/opt/conda/lib/python3.11/site-packages/ray/_private/worker.py", line 2961, in get
values, debugger_breakpoint = worker.get_objects(
^^^^^^^^^^^^^^^^^^^
File "/opt/conda/lib/python3.11/site-packages/ray/_private/worker.py", line 1028, in get_objects
raise value
ray.exceptions.ActorDiedError: The actor died unexpectedly before finishing this task.
class_name: TaskRunner
actor_id: baf61da82878d689ad6376490b000000
pid: 56124
namespace: 56ee1a5c-3c00-42ad-a3be-88925f77d934
ip: 172.17.0.2
The actor is dead because its worker process has died. Worker exit type: SYSTEM_ERROR Worker exit detail: Worker unexpectedly exits with a connection error code 2. End of file. There are some potential root causes. (1) The process is killed by SIGKILL by OOM killer due to high memory usage. (2) ray stop --force is called. (3) The worker is crashed unexpectedly due to SIGSEGV or other unexpected errors.
The actor never ran - it was cancelled before it started running.

Set the environment variable HYDRA_FULL_ERROR=1 for a complete stack trace.
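
For reference, the worker stack above aborts inside torch.cuda.is_available(), which verl/utils/device.py calls at import time while Ray loads the TaskRunner actor class. Below is a hypothetical minimal sketch that exercises only that import path, assuming Ray and PyTorch are importable in the worker environment; it does not use verl or agent-lightning, and the CudaProbe name is made up for illustration:

import ray

ray.init(address="auto")  # connect to the existing cluster, as train.sh does

@ray.remote(num_gpus=1)  # matches trainer.n_gpus_per_node=1 from the overrides
class CudaProbe:
    # Hypothetical stand-in for the TaskRunner actor: it only exercises the
    # CUDA probe that verl/utils/device.py runs at import time in the worker.
    def check(self) -> bool:
        import torch  # imported inside the worker process, as in the failing actor
        return torch.cuda.is_available()

probe = CudaProbe.remote()
print(ray.get(probe.check.remote()))  # the original run aborts before a call like this returns

If this sketch hits the same free(): double free detected in tcache 2 abort, the problem would seem to be in the CUDA/PyTorch setup of the Ray worker environment rather than in the RAG example itself.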

Please tell me how to fix it.
