Skip to content

Large memory footprint when running patchman --update-errata #819

@cpoppema

Description

@cpoppema

Hi!

I'm running the patchman --update-errata --erratum-type debian command as a cronjob with memory constraints. Recently these limits were reached and the job no longer completes (1Gi). Increasing these limits short-term is not a problem on my side.

I've observed large memory usage when decompressing and reading the file map. Here's some rudimentary print output that hopefully makes it obvious.

starting update_errata
current= 82952 kB peak= 82952 kB
    starting update_debian_errata
        starting retrieve_debian_codenames
        current= 82960 kB peak= 82960 kB
        retrieve_debian_codenames took 0:00:00.021886

        starting create_debian_os_releases
        current= 83500 kB peak= 83500 kB
        create_debian_os_releases took 0:00:00.003031

        starting fetch_debian_dsa_advisories
        current= 84640 kB peak= 85016 kB
        fetch_debian_dsa_advisories took 0:00:00.932085

        starting fetch_debian_dla_advisories
        current= 86320 kB peak= 86320 kB
        fetch_debian_dla_advisories took 0:00:00.943669

        starting fetch_dscs_from_debian_package_file_maps
        current= 181048 kB peak= 1068368 kB
        fetch_dscs_from_debian_package_file_maps took 0:00:05.944284

        starting get_accepted_debian_codenames
        current= 181048 kB peak= 1068368 kB
        get_accepted_debian_codenames took 0:00:00.000083

        starting parse_debian_errata
        current= 192976 kB peak= 1068368 kB
        parse_debian_errata took 0:00:10.646258

        starting create_debian_errata
        current= 193232 kB peak= 1068368 kB
        create_debian_errata took 0:00:00.166258

    update_debian_errata took 0:00:18.658681
    current= 190488 kB peak= 1068368 kB
    
update_errata took 0:00:18.671028
current= 190488 kB peak= 1068368 kB

The function fetch_dscs_from_debian_package_file_maps does three things that increase memory usage, at least twice significantly:

        data = fetch_content(res, f'Fetching `{repo}` package file map')
        file_map_data = extract(data, file_map_url).decode()
        parse_debian_package_file_map(file_map_data, repo)
  1. fetch_content keeps an entire .bz2 file contents in memory; this is not too bad: The file is +/- 9 MB.
  2. extracting the .bz2 file in memory; this hits harder: The largest decompressed file is +/- 240 MB, peak memory goes up by a lot more.
  3. parsing the file_map_data: some memory inevitably is taken by growing the DSCs dict, however that should be far less.

With more detailed logging:

before: current= 87_436 kB peak= 87_436 kB

after fetch_content() for repo debian: current= 97_312 kB peak= 104_504 kB
after extract() for repo debian: current= 360_944 kB peak= 601_872 kB
after parse_debian_package_file_map() for repo debian: current= 420_836 kB peak= 1_068_304 kB

after fetch_content() for repo debian-security: current= 412_060 kB peak= 1_068_304 kB
after extract() for repo debian-security: current= 176_636 kB peak= 1_068_304 kB
after parse_debian_package_file_map() for repo debian-security: current= 181_296 kB peak= 1_068_304 kB

To investigate an alternative way (file-streaming/buffering) I've run patchman with the following patch:

--- a/errata/sources/distros/debian.py
+++ b/errata/sources/distros/debian.py
@@ -14,9 +14,12 @@
 # You should have received a copy of the GNU General Public License
 # along with Patchman. If not, see <http://www.gnu.org/licenses/>
 
+import bz2
 import concurrent.futures
 import csv
 import re
+import shutil
+import tempfile
 from datetime import datetime
 from io import StringIO
 
@@ -28,7 +31,7 @@ from operatingsystems.utils import get_or_create_osrelease
 from packages.models import Package
 from packages.utils import find_evr, get_or_create_package
 from patchman.signals import pbar_start, pbar_update
-from util import extract, fetch_content, get_setting_of_type, get_url
+from util import fetch_content, get_setting_of_type, get_url
 from util.logging import error_message, warning_message
 
 DSCs = {}
@@ -76,11 +79,30 @@ def fetch_dscs_from_debian_package_file_maps():
         file_map_url = f'https://deb.debian.org/{repo}/indices/package-file.map.bz2'
         res = get_url(file_map_url)
         data = fetch_content(res, f'Fetching `{repo}` package file map')
-        file_map_data = extract(data, file_map_url).decode()
-        parse_debian_package_file_map(file_map_data, repo)
-
-
-def parse_debian_package_file_map(data, repo):
+        try:
+            with (
+                tempfile.NamedTemporaryFile(mode="w+b", suffix=".bz2") as bz2_file,
+                tempfile.NamedTemporaryFile(mode="w+b", suffix=".map") as file_map,
+            ):
+                bz2_file.write(data)
+                bz2_file.seek(0)
+
+                # Streaming decompress to reduce memory footprint.
+                with bz2.BZ2File(bz2_file.name) as decompressor_f:
+                    shutil.copyfileobj(decompressor_f, file_map)
+
+                # Streaming parse file map to reduce memory footprint.
+                file_map.seek(0)
+                parse_debian_package_file_map(file_map, repo)
+        except IOError as e:
+            if e == 'invalid data stream':
+                error_message(text='bunzip2: ' + e)
+        except ValueError as e:
+            if e == "couldn't find end of stream":
+                error_message(text='bunzip2: ' + e)
+
+
+def parse_debian_package_file_map(file_map, repo):
     """ Parse the a Debian package file map
         Format:
             Path: ./pool/updates/main/3/389-ds-base/389-ds-base_1.4.0.21-1+deb10u1.dsc
@@ -88,7 +110,8 @@ def parse_debian_package_file_map(data, repo):
             Source-Version: 1.4.0.21-1+deb10u1
     """
     parsing_dsc = False
-    for line in data.splitlines():
+    for line in file_map:
+        line = line.removesuffix(b'\n').decode()
         if line.startswith('Path:'):
             if line.endswith('.dsc'):
                 parsing_dsc = True

Nothing permanent, just as a way of showing where the memory cost is. This patch has the downside of requiring write access (to /tmp). I'm not sure if that's already a requirement for patchman to run.

After adding back some debug prints, this shows an improvement of +/- 900 MB for this specific debian job:

before: current= 87_384 kB peak= 87_756 kB

after fetch_content() for repo debian: current= 98_064 kB peak= 105_484 kB
after copyfileobj() for repo debian: current= 99_704 kB peak= 105_484 kB
after parse_debian_package_file_map() for repo debian: current= 151_796 kB peak= 151_796 kB

after fetch_content() for repo debian-security: current= 143_052 kB peak= 150_408 kB
after copyfileobj() for repo debian-security: current= 145_868 kB peak= 150_408 kB
after parse_debian_package_file_map() for repo debian-security: current= 146_312 kB peak= 150_408 kB

I suspect gains can also be made for other erratum types other than debian, since those also follow the pattern of reading entire compressed + decompressed files into memory.

I hope this helps, if there's anything more you want me to do, let me know!

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions