Description
The build state is not updated if the master decides that the slave has disconnected before the slave is able to post its status back.
A slave is working on a build, but when it sends a PUT to update the build state, the master responds with a 404 because it has lost its record of the slave.
There does not appear to be a correlation with the job, but the problem does seem to occur when the system is under heavy load.
The slave process appears to recover because it reconnects itself to the master through some mechanism.
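The sequence described above can be illustrated with a toy model. This is only a sketch of the behavior as reported, not ClusterRunner code: `ToyMaster`, its methods, the state strings, and the slave URL are hypothetical names invented for the illustration, and the real master may drop or reset the slave record differently.

```python
class ToyMaster:
    """Hypothetical stand-in for the master's registry of slaves."""

    def __init__(self):
        self._slaves = {}   # slave_id -> last reported state
        self._next_id = 1

    def connect(self, url):
        """Register a slave and hand back a fresh id."""
        slave_id = self._next_id
        self._next_id += 1
        self._slaves[slave_id] = "IDLE"
        return slave_id

    def mark_disconnected(self, slave_id):
        """Forget a slave, e.g. because it looked unresponsive under load."""
        self._slaves.pop(slave_id, None)

    def put_state(self, slave_id, state):
        """Return an HTTP-like status code for a state update."""
        if slave_id not in self._slaves:
            return 404      # the master no longer has a record of this slave
        self._slaves[slave_id] = state
        return 200


master = ToyMaster()
old_id = master.connect("slave-host:43001")       # made-up URL

master.mark_disconnected(old_id)                  # master gives up on the slave
lost_update = master.put_state(old_id, "SETUP_COMPLETED")

new_id = master.connect("slave-host:43001")       # slave reconnects later
after_reconnect = master.put_state(new_id, "IDLE")
```

In this model the slave recovers after reconnecting, but the update it sent for the in-progress build is rejected and never applied, which matches the symptom reported here.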
This particular issue occurred on ip-10-228-91-76.pod.box.net on build 657 and build 684. The failures occurred at around 1:38 PM and 2:50 PM on June 14, 2017.
Here is a snippet of the load average over time:
Time         runq-sz  plist-sz  ldavg-1  ldavg-5  ldavg-15
01:10:01 PM        4      1126     0.52     0.91      2.51
01:20:01 PM        4      1129     0.77     0.74      1.65
01:30:01 PM       19      1173     8.61     8.04      4.98
01:40:01 PM        2      1108     6.18     8.07      6.57
01:50:01 PM        7      1100     0.79     2.86      4.61
02:00:01 PM       20      1167     6.31     3.85      3.86
02:10:01 PM        6      1109     0.87     3.26      4.08
02:20:01 PM        5      1208    39.92    20.56     11.18
02:30:01 PM       10      1168     4.19    10.34     11.18
02:40:01 PM        3      1164    10.89     9.87     10.15
02:50:01 PM        6      1132    24.95    26.16     17.82
03:00:01 PM        5      1118     0.39     4.23      9.82
03:10:01 PM        6      1165     5.74     4.46      6.98
03:20:01 PM        3      1113     0.43     1.64      4.57
03:30:02 PM        4      1146     3.95     3.14      3.95
03:40:01 PM       10      1124     0.15     0.70      2.30
03:50:01 PM       13      1163     6.63     4.28      3.05
04:00:01 PM        2      1137     0.68     2.97      3.34
04:10:01 PM       14      1157     7.55     6.39      5.09
04:20:01 PM       14      1178     7.12     6.06      5.22
04:30:01 PM        7      1125     0.93     2.14      3.77
Average:           6      1123     3.05     2.98      2.92
The load average at the time of failure for build 657, 1:38 PM, was actually over 11.0. The host has 10 cores, so in both cases the CPU load went over 100%. The IOWAIT and network load were relatively low at those times, so I believe the high load was driven primarily by CPU.
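The comparison of load average against core count can be checked mechanically. The following is a minimal sketch that copies a few of the 1-minute load averages from the samples above and flags the ones that exceed the 10 cores stated in this report; `saturated` is a hypothetical helper name. Note that the samples are 10 minutes apart, so a short spike such as the one reported at 1:38 PM may not show up in them.

```python
CORES = 10  # core count stated in this report

# (time, ldavg-1) pairs copied from the load average samples above
SAMPLES = [
    ("01:30:01 PM", 8.61),
    ("01:40:01 PM", 6.18),
    ("02:20:01 PM", 39.92),
    ("02:30:01 PM", 4.19),
    ("02:40:01 PM", 10.89),
    ("02:50:01 PM", 24.95),
]


def saturated(samples, cores):
    """Return (time, load per core) for samples whose load exceeds the core count."""
    return [(t, load / cores) for t, load in samples if load > cores]


for t, per_core in saturated(SAMPLES, CORES):
    print(f"{t}: {per_core:.2f} runnable tasks per core")
```

A load per core above 1.0 means more runnable tasks than cores on average over the last minute, which is what "over 100%" refers to here.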
Here is the stack trace.
[2017-06-14 14:49:41.131] 4362 ERROR Bld684-Setup unhandled_excep Unhandled exception handler caught exception.
Traceback (most recent call last):
File "/home/jenkins/ClusterRunnerBuild/app/util/safe_thread.py", line 18, in run
File "/usr/local/lib/python3.4/threading.py", line 868, in run
File "/home/jenkins/ClusterRunnerBuild/app/slave/cluster_slave.py", line 138, in _async_setup_build
File "/home/jenkins/ClusterRunnerBuild/app/slave/cluster_slave.py", line 326, in _notify_master_of_state_change
File "/home/jenkins/ClusterRunnerBuild/app/util/network.py", line 95, in put_with_digest
File "/home/jenkins/ClusterRunnerBuild/app/util/decorators.py", line 38, in function_with_retries
File "/home/jenkins/ClusterRunnerBuild/app/util/network.py", line 83, in put
File "/home/jenkins/ClusterRunnerBuild/app/util/network.py", line 150, in _request
app.util.network._RequestFailedError: Request to http://pod4101-automation1024.pod.box.net:43000/v1/slave/259 failed with status_code 404 and response "{"error": "Invalid build id: None."}"