Master should reject calls from previously unresponsive slave hosts #371

The build state is not updated if the master decides that the slave has disconnected before the slave is able to post back its status.

The problem we are seeing: a slave is working on a build, but when it sends a PUT to update the build state, the master responds with a 404 because it has lost its record of the slave.

The failures do not appear to correlate with any particular job, but they do seem to occur when the system is under heavy load.

The slave process appears to recover because it reconnects itself to the master through some mechanism. 
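To make the request concrete, here is a minimal sketch of what "rejecting" could look like on the master side. This is not ClusterRunner code: `SlaveRegistry`, its methods, the heartbeat timeout, the 409 status, and the state names are all assumptions made up for illustration. The idea is that the master remembers which slaves it dropped as unresponsive and answers their late state updates with an explicit error, rather than the confusing 404 "Invalid build id: None."

```python
import time


class SlaveRegistry:
    """Hypothetical master-side bookkeeping for slaves (illustration only)."""

    def __init__(self, heartbeat_timeout=30.0, clock=time.monotonic):
        self._timeout = heartbeat_timeout   # assumed timeout, in seconds
        self._clock = clock
        self._last_seen = {}                # slave_id -> time of last contact
        self._rejected = set()              # slaves dropped as unresponsive

    def register(self, slave_id):
        """A (re)connecting slave is trusted again from this point on."""
        self._rejected.discard(slave_id)
        self._last_seen[slave_id] = self._clock()

    def expire_unresponsive(self):
        """Drop slaves that have been silent too long, but remember their ids."""
        now = self._clock()
        expired = [s for s, seen in self._last_seen.items()
                   if now - seen > self._timeout]
        for slave_id in expired:
            del self._last_seen[slave_id]
            self._rejected.add(slave_id)
        return expired

    def handle_state_update(self, slave_id, new_state):
        """Return (status_code, body) for a slave's state-update PUT."""
        if slave_id in self._rejected:
            # Explicit rejection: the slave must reconnect before posting state.
            return 409, {'error': 'Slave {} was marked unresponsive; '
                                  'reconnect first.'.format(slave_id)}
        if slave_id not in self._last_seen:
            return 404, {'error': 'Unknown slave id: {}.'.format(slave_id)}
        self._last_seen[slave_id] = self._clock()
        return 200, {'state': new_state}
```

With something like this, a slave that was timed out while busy would get a clear signal to re-register, and the master would never apply a stale update to a build it has already reassigned. Whether 409 is the right status code is a design choice; the point is only that it differs from "not found".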

This particular issue occurred on ip-10-228-91-76.pod.box.net during builds 657 and 684, at around 1:38 PM and 2:50 PM on June 14, 2017, respectively.


Here is a snippet of the load average over time (columns: time, runq-sz, plist-sz, ldavg-1, ldavg-5, ldavg-15):
```
01:10:01 PM         4      1126      0.52      0.91      2.51
01:20:01 PM         4      1129      0.77      0.74      1.65
01:30:01 PM        19      1173      8.61      8.04      4.98
01:40:01 PM         2      1108      6.18      8.07      6.57
01:50:01 PM         7      1100      0.79      2.86      4.61
02:00:01 PM        20      1167      6.31      3.85      3.86
02:10:01 PM         6      1109      0.87      3.26      4.08
02:20:01 PM         5      1208     39.92     20.56     11.18
02:30:01 PM        10      1168      4.19     10.34     11.18
02:40:01 PM         3      1164     10.89      9.87     10.15
02:50:01 PM         6      1132     24.95     26.16     17.82
03:00:01 PM         5      1118      0.39      4.23      9.82
03:10:01 PM         6      1165      5.74      4.46      6.98
03:20:01 PM         3      1113      0.43      1.64      4.57

03:20:01 PM   runq-sz  plist-sz   ldavg-1   ldavg-5  ldavg-15
03:30:02 PM         4      1146      3.95      3.14      3.95
03:40:01 PM        10      1124      0.15      0.70      2.30
03:50:01 PM        13      1163      6.63      4.28      3.05
04:00:01 PM         2      1137      0.68      2.97      3.34
04:10:01 PM        14      1157      7.55      6.39      5.09
04:20:01 PM        14      1178      7.12      6.06      5.22
04:30:01 PM         7      1125      0.93      2.14      3.77
Average:            6      1123      3.05      2.98      2.92
```
The load average at the time of the build 657 failure (1:38 PM) was actually over 11.0. The host has 10 cores, so in both cases the load exceeded 100% of CPU capacity. IOWAIT and network load were relatively low at those times, so I believe the problem was caused primarily by CPU load.
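
For reference, the "over 100%" reasoning is just the 1-minute load average divided by the core count. A small sketch using a few `ldavg-1` values copied from the sar output above and the 10 cores stated in this report (the `saturated` helper is a made-up name). Note that the 10-minute samples do not capture the 1:38 PM spike, so only the samples around the second failure are flagged here.

```python
CORES = 10  # core count stated in this report

# (time, ldavg-1) pairs copied from the sar output above
samples = [
    ('01:30:01 PM', 8.61),
    ('01:40:01 PM', 6.18),
    ('02:20:01 PM', 39.92),
    ('02:40:01 PM', 10.89),
    ('02:50:01 PM', 24.95),
]


def saturated(samples, cores):
    """Return (time, load per core) for samples with more load than cores."""
    return [(t, load / cores) for t, load in samples if load > cores]


for t, per_core in saturated(samples, CORES):
    print('{}  load per core: {:.2f}'.format(t, per_core))
```

A ratio above 1.0 means more runnable tasks than cores, which is a rough proxy for CPU saturation rather than an exact utilization figure.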

Here is the stack trace from the build 684 failure:

```
[2017-06-14 14:49:41.131] 4362 ERROR   Bld684-Setup    unhandled_excep Unhandled exception handler caught exception.
Traceback (most recent call last):
  File "/home/jenkins/ClusterRunnerBuild/app/util/safe_thread.py", line 18, in run
  File "/usr/local/lib/python3.4/threading.py", line 868, in run
  File "/home/jenkins/ClusterRunnerBuild/app/slave/cluster_slave.py", line 138, in _async_setup_build
  File "/home/jenkins/ClusterRunnerBuild/app/slave/cluster_slave.py", line 326, in _notify_master_of_state_change
  File "/home/jenkins/ClusterRunnerBuild/app/util/network.py", line 95, in put_with_digest
  File "/home/jenkins/ClusterRunnerBuild/app/util/decorators.py", line 38, in function_with_retries
  File "/home/jenkins/ClusterRunnerBuild/app/util/network.py", line 83, in put
  File "/home/jenkins/ClusterRunnerBuild/app/util/network.py", line 150, in _request
app.util.network._RequestFailedError: Request to http://pod4101-automation1024.pod.box.net:43000/v1/slave/259 failed with status_code 404 and response "{"error": "Invalid build id: None."}"
```
