Master should reject calls from previously unresponsive slave hosts #371

@wjdhollow

The build state is not updated if the master decides the slave has disconnected before the slave is able to post back its status.

We are seeing a slave that is working on a build send a PUT to update the build state, and the master respond with a 404 because it has lost its record of the slave.

The failures do not appear to correlate with any particular job, but they do seem to occur when the system is under heavy load.

The slave process appears to recover afterwards, because it reconnects to the master through some mechanism.
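To make the requested behavior concrete, here is a minimal sketch of how a master could explicitly reject state updates from a slave it has already marked unresponsive, instead of answering with a confusing 404. Everything in it is an assumption for illustration: `SlaveRegistry`, `SlaveRecord`, the heartbeat timeout, the state string, and the choice of a 409 status code are hypothetical and are not taken from the ClusterRunner code base.

```python
import time


class SlaveRecord:
    """Hypothetical master-side record for a single slave."""

    def __init__(self, slave_id, now):
        self.slave_id = slave_id
        self.last_heartbeat = now
        self.is_alive = True
        self.state = None


class SlaveRegistry:
    """Hypothetical registry; not ClusterRunner's actual implementation."""

    def __init__(self, heartbeat_timeout_secs=30.0, clock=time.monotonic):
        self._timeout = heartbeat_timeout_secs
        self._clock = clock  # injectable so the timeout can be tested
        self._slaves = {}

    def connect(self, slave_id):
        # A (re)connect always creates a fresh, alive record.
        self._slaves[slave_id] = SlaveRecord(slave_id, self._clock())

    def heartbeat(self, slave_id):
        # Heartbeats from unknown or dead slaves are refused.
        record = self._slaves.get(slave_id)
        if record is None or not record.is_alive:
            return False
        record.last_heartbeat = self._clock()
        return True

    def expire_unresponsive(self):
        # Mark slaves dead once they have been silent longer than the timeout.
        now = self._clock()
        for record in self._slaves.values():
            if record.is_alive and now - record.last_heartbeat > self._timeout:
                record.is_alive = False

    def handle_state_update(self, slave_id, new_state):
        """Return (status_code, body) for a slave's state-update request."""
        record = self._slaves.get(slave_id)
        if record is None:
            return 404, {'error': 'Unknown slave id: {}'.format(slave_id)}
        if not record.is_alive:
            # Explicit rejection: the slave must reconnect before it is trusted.
            return 409, {'error': 'Slave {} was marked unresponsive.'.format(slave_id)}
        record.state = new_state
        return 200, {'state': new_state}
```

With a response like this, a slave that was dropped under load would learn why its update was refused and could reconnect deliberately rather than failing on an unrelated-looking error.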

This particular issue occurred on ip-10-228-91-76.pod.box.net on builds 657 and 684, at around 1:38 PM and 2:50 PM on June 14, 2017, respectively.

Here is a snippet of the load average over time:

Time          runq-sz  plist-sz   ldavg-1   ldavg-5  ldavg-15
01:10:01 PM         4      1126      0.52      0.91      2.51
01:20:01 PM         4      1129      0.77      0.74      1.65
01:30:01 PM        19      1173      8.61      8.04      4.98
01:40:01 PM         2      1108      6.18      8.07      6.57
01:50:01 PM         7      1100      0.79      2.86      4.61
02:00:01 PM        20      1167      6.31      3.85      3.86
02:10:01 PM         6      1109      0.87      3.26      4.08
02:20:01 PM         5      1208     39.92     20.56     11.18
02:30:01 PM        10      1168      4.19     10.34     11.18
02:40:01 PM         3      1164     10.89      9.87     10.15
02:50:01 PM         6      1132     24.95     26.16     17.82
03:00:01 PM         5      1118      0.39      4.23      9.82
03:10:01 PM         6      1165      5.74      4.46      6.98
03:20:01 PM         3      1113      0.43      1.64      4.57
03:30:02 PM         4      1146      3.95      3.14      3.95
03:40:01 PM        10      1124      0.15      0.70      2.30
03:50:01 PM        13      1163      6.63      4.28      3.05
04:00:01 PM         2      1137      0.68      2.97      3.34
04:10:01 PM        14      1157      7.55      6.39      5.09
04:20:01 PM        14      1178      7.12      6.06      5.22
04:30:01 PM         7      1125      0.93      2.14      3.77
Average:            6      1123      3.05      2.98      2.92

The load average when build 657 failed (1:38 PM) was actually over 11.0, higher than the surrounding 10-minute samples show. The host has 10 cores, so in both cases the load exceeded 100% of CPU capacity. I/O wait and network load were relatively low at those times, so I believe the load was primarily CPU-bound.
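The comparison against the core count can be written out as a small calculation. This is only an illustrative sketch: the `ldavg-1` values are copied from a few rows of the sar output above, the core count of 10 comes from this report, and `load_per_core` is a hypothetical helper name. A load average above the core count means more runnable tasks than cores, which is what "over 100%" refers to here.

```python
CORES = 10  # core count of the host, as stated in this report

# (timestamp, ldavg-1) pairs copied from selected rows of the sar output above.
samples = [
    ('01:30:01 PM', 8.61),
    ('01:40:01 PM', 6.18),
    ('02:20:01 PM', 39.92),
    ('02:40:01 PM', 10.89),
    ('02:50:01 PM', 24.95),
]


def load_per_core(load_average, cores):
    """Return the load average normalized by core count (1.0 == fully busy)."""
    return load_average / cores


# Timestamps at which the 1-minute load exceeded the number of cores.
saturated = [ts for ts, ldavg in samples if load_per_core(ldavg, CORES) > 1.0]

for ts, ldavg in samples:
    ratio = load_per_core(ldavg, CORES)
    flag = 'saturated' if ratio > 1.0 else 'ok'
    print('{}  ldavg-1={:6.2f}  per-core={:.2f}  {}'.format(ts, ldavg, ratio, flag))
```

Because sar samples every 10 minutes here, a short spike such as the one at 1:38 PM can fall between rows and will not appear in `saturated`.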

Here is the stack trace.

[2017-06-14 14:49:41.131] 4362 ERROR   Bld684-Setup    unhandled_excep Unhandled exception handler caught exception.
Traceback (most recent call last):
  File "/home/jenkins/ClusterRunnerBuild/app/util/safe_thread.py", line 18, in run
  File "/usr/local/lib/python3.4/threading.py", line 868, in run
  File "/home/jenkins/ClusterRunnerBuild/app/slave/cluster_slave.py", line 138, in _async_setup_build
  File "/home/jenkins/ClusterRunnerBuild/app/slave/cluster_slave.py", line 326, in _notify_master_of_state_change
  File "/home/jenkins/ClusterRunnerBuild/app/util/network.py", line 95, in put_with_digest
  File "/home/jenkins/ClusterRunnerBuild/app/util/decorators.py", line 38, in function_with_retries
  File "/home/jenkins/ClusterRunnerBuild/app/util/network.py", line 83, in put
  File "/home/jenkins/ClusterRunnerBuild/app/util/network.py", line 150, in _request
app.util.network._RequestFailedError: Request to http://pod4101-automation1024.pod.box.net:43000/v1/slave/259 failed with status_code 404 and response "{"error": "Invalid build id: None."}"
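The traceback shows the 404 propagating out of the state notification and ending up in the unhandled exception handler. As a rough sketch of one alternative on the slave side, the notification could treat a 404 as "the master no longer knows me", reconnect, and retry a bounded number of times. All names here are hypothetical: `RequestFailedError` is a stand-in for the error in the traceback, and `put_state` and `reconnect` are injected callables, not ClusterRunner functions. Whether a dropped slave should retry or abandon the build is a policy question this issue does not settle.

```python
class RequestFailedError(Exception):
    """Stand-in for the request error in the traceback; carries the HTTP status."""

    def __init__(self, status_code, message=''):
        super().__init__(message)
        self.status_code = status_code


def notify_master_of_state_change(put_state, reconnect, new_state, max_reconnects=1):
    """Send new_state via put_state(); on a 404, reconnect and retry.

    Returns True if the master accepted the update, or False if it still
    answered 404 after max_reconnects reconnect attempts. Any other failure
    is re-raised unchanged.
    """
    for attempt in range(max_reconnects + 1):
        try:
            put_state(new_state)
            return True
        except RequestFailedError as ex:
            if ex.status_code != 404:
                raise  # not the "master forgot this slave" case
            if attempt == max_reconnects:
                return False  # give up; the caller decides what happens to the build
            reconnect()
    return False
```

Returning False rather than raising lets the caller fail the build cleanly instead of leaving its state un-updated.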
