refactor: improve PilotStatusAgent performances and refactor duplicated code #8569

Open
aldbr wants to merge 2 commits into DIRACGrid:integration from aldbr:fix-pilotstatusagent-ssh-leak

Conversation

aldbr commented Jun 2, 2026

Contributor

BEGINRELEASENOTES
*WorkloadManagement
CHANGE: improve PilotStatusAgent performances and refactor duplicated code
ENDRELEASENOTES

aldbr commented Jun 3, 2026

Contributor Author

Tested in LHCb; it seems to work fine.
I'll just polish it a bit and fix the failing tests.

aldbr force-pushed the fix-pilotstatusagent-ssh-leak branch 4 times, most recently from d2d2ac7 to eff200b, on June 9, 2026 09:29
aldbr marked this pull request as ready for review on June 9, 2026 09:38
aldbr force-pushed the fix-pilotstatusagent-ssh-leak branch from eff200b to 01d0a8c on June 9, 2026 09:38
aldbr requested review from atsareg and fstagni as code owners on June 9, 2026 09:38

fstagni commented Jun 9, 2026

Contributor

Just fix this documentation build warning:

/home/docs/checkouts/readthedocs.org/user_builds/dirac/checkouts/8569/src/DIRAC/WorkloadManagementSystem/Utilities/QueueUtilities.py:docstring of DIRAC.WorkloadManagementSystem.Utilities.QueueUtilities.QueueCECache.__init__:2: WARNING: Field list ends without a blank line; unexpected unindent. [docutils]
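
This docutils warning usually means that a field list (the `:param ...:` lines) in a docstring is followed directly by other text, with no blank line in between. A minimal sketch of a docstring layout that avoids it is shown below; the class name and both parameters are hypothetical and are not taken from the real `QueueCECache.__init__`.

```python
import inspect


class QueueCECacheSketch:
    """Hypothetical stand-in for QueueCECache; only the docstring layout matters."""

    def __init__(self, max_size=100, ttl=3600):
        """Create the cache.

        :param int max_size: hypothetical maximum number of cached CEs
        :param int ttl: hypothetical lifetime of an entry, in seconds

        Entries are closed when they are evicted.
        """
        # The blank line after the last :param: field is what ends the
        # field list cleanly for docutils.
        self.max_size = max_size
        self.ttl = ttl


# Dedented docstring lines, as a documentation tool would see them.
doc_lines = inspect.getdoc(QueueCECacheSketch.__init__).splitlines()
```

The same rule applies if the field list is followed by a directive or a second paragraph: leave one empty line after the last field.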

aldbr force-pushed the fix-pilotstatusagent-ssh-leak branch from 619ec79 to d35f6a7 on June 9, 2026 15:21

        except Exception as e:
            self.log.warn("Failed to close SSH gateway connection", str(e))

    def shutdown(self):

Contributor

JobAgent and PushJobAgent call the shutdown method of their (inner) computing elements, which before this PR was effectively defined only for PoolComputingElement. There is seemingly no overlap, but I'm just pointing out that this is the same method signature.
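
To illustrate the concern, here is a minimal sketch of how a shared `shutdown()` name behaves under duck typing. All classes and the `finalize_agent` helper are hypothetical stand-ins, not the actual DIRAC agents or computing elements.

```python
class PoolLikeCE:
    """Hypothetical CE whose shutdown() stops worker processes."""

    def __init__(self):
        self.events = []

    def shutdown(self):
        self.events.append("workers stopped")


class SSHLikeCE:
    """Hypothetical CE whose shutdown() closes cached SSH connections."""

    def __init__(self):
        self.events = []

    def shutdown(self):
        self.events.append("connections closed")


def finalize_agent(inner_ces):
    """Hypothetical agent finalisation: call shutdown() on every inner CE."""
    for ce in inner_ces:
        # The caller cannot tell the two meanings of shutdown() apart.
        ce.shutdown()


pool_ce, ssh_ce = PoolLikeCE(), SSHLikeCE()
finalize_agent([pool_ce, ssh_ce])
```

If an agent ever holds a CE of the second kind, the existing call site would start closing its connections as well, which may or may not be the intended behaviour.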

    gateway = getattr(connection, "gateway", None)
    try:
        connection.close()
    except Exception as e:  # a close failure must not break cache eviction

Contributor

A more specific Exception?
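
One possible shape for a narrower handler is sketched below. It assumes that the failures worth tolerating on close are I/O-level errors; `CLOSE_ERRORS`, `FakeConnection` and `close_quietly` are hypothetical names, and the snippet under review does not say which exceptions the real connection object can raise.

```python
import logging

log = logging.getLogger("ssh-cache-sketch")

# Assumption: only I/O-level failures should be swallowed on close.
# If the connection is a Fabric/paramiko object (also an assumption),
# paramiko.SSHException would be a likely addition to this tuple; it is
# left out here to keep the sketch stdlib-only.
CLOSE_ERRORS = (OSError, EOFError)


class FakeConnection:
    """Hypothetical stand-in for an SSH connection object."""

    def __init__(self, error=None):
        self.error = error
        self.closed = False

    def close(self):
        if self.error is not None:
            raise self.error
        self.closed = True


def close_quietly(connection):
    """Close a connection; return True on success, False on a tolerated failure."""
    try:
        connection.close()
    except CLOSE_ERRORS as e:
        log.warning("Failed to close SSH connection: %s", e)
        return False
    return True
```

The trade-off is that a broad `except Exception` guarantees eviction never fails, while a narrow tuple lets programming errors such as `AttributeError` or `ValueError` surface instead of being logged and ignored.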

    if isinstance(gateway, Connection):
        try:
            gateway.close()
        except Exception as e:

Contributor

A more specific Exception?

        return
    try:
        cached["CE"].shutdown()
    except Exception as e:

Contributor

A more specific Exception?

Development

Successfully merging this pull request may close these issues.

[Bug]: PilotStatusAgent leaks SSH connections and exhausts memory (~45 GB), overloading SSH gateways
