Hi!
I'm running the `patchman --update-errata --erratum-type debian` command as a cronjob with memory constraints (a 1Gi limit). Recently that limit was reached and the job no longer completes. Increasing the limit short-term is not a problem on my side.
I've observed large memory usage when decompressing and reading the file map. Here's some rudimentary print output that hopefully makes it obvious.
```
starting update_errata
current= 82952 kB peak= 82952 kB
starting update_debian_errata
starting retrieve_debian_codenames
current= 82960 kB peak= 82960 kB
retrieve_debian_codenames took 0:00:00.021886
starting create_debian_os_releases
current= 83500 kB peak= 83500 kB
create_debian_os_releases took 0:00:00.003031
starting fetch_debian_dsa_advisories
current= 84640 kB peak= 85016 kB
fetch_debian_dsa_advisories took 0:00:00.932085
starting fetch_debian_dla_advisories
current= 86320 kB peak= 86320 kB
fetch_debian_dla_advisories took 0:00:00.943669
starting fetch_dscs_from_debian_package_file_maps
current= 181048 kB peak= 1068368 kB
fetch_dscs_from_debian_package_file_maps took 0:00:05.944284
starting get_accepted_debian_codenames
current= 181048 kB peak= 1068368 kB
get_accepted_debian_codenames took 0:00:00.000083
starting parse_debian_errata
current= 192976 kB peak= 1068368 kB
parse_debian_errata took 0:00:10.646258
starting create_debian_errata
current= 193232 kB peak= 1068368 kB
create_debian_errata took 0:00:00.166258
update_debian_errata took 0:00:18.658681
current= 190488 kB peak= 1068368 kB
update_errata took 0:00:18.671028
current= 190488 kB peak= 1068368 kB
```
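For context on how figures in this current/peak shape can be gathered: Python's `tracemalloc` module reports exactly this pair of numbers. The sketch below is illustrative, not the actual instrumentation used above (the `print_mem` helper is hypothetical):

```python
import tracemalloc

def print_mem(label=''):
    # tracemalloc reports (current, peak) traced allocation sizes in bytes
    current, peak = tracemalloc.get_traced_memory()
    print(f'{label}current= {current // 1024} kB peak= {peak // 1024} kB')
    return current, peak

tracemalloc.start()
before = print_mem('before: ')
blob = bytearray(50 * 1024 * 1024)  # stand-in for a large decompressed buffer
during = print_mem('after allocating 50 MB: ')
del blob
after = print_mem('after del: ')  # current drops back, peak stays
tracemalloc.stop()
```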
The function `fetch_dscs_from_debian_package_file_maps` does three things that increase memory usage, two of them significantly:
```python
data = fetch_content(res, f'Fetching `{repo}` package file map')
file_map_data = extract(data, file_map_url).decode()
parse_debian_package_file_map(file_map_data, repo)
```
- `fetch_content` keeps the entire .bz2 file contents in memory; this is not too bad: the file is roughly 9 MB.
- extracting the .bz2 file in memory hits harder: the largest decompressed file is roughly 240 MB, and peak memory goes up by a lot more than that.
- parsing `file_map_data`: some memory is inevitably taken by the growing `DSCs` dict, but that should be far less.
With more detailed logging:
```
before: current= 87_436 kB peak= 87_436 kB
after fetch_content() for repo debian: current= 97_312 kB peak= 104_504 kB
after extract() for repo debian: current= 360_944 kB peak= 601_872 kB
after parse_debian_package_file_map() for repo debian: current= 420_836 kB peak= 1_068_304 kB
after fetch_content() for repo debian-security: current= 412_060 kB peak= 1_068_304 kB
after extract() for repo debian-security: current= 176_636 kB peak= 1_068_304 kB
after parse_debian_package_file_map() for repo debian-security: current= 181_296 kB peak= 1_068_304 kB
```
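The jump at extract() is what an all-at-once `bz2.decompress()` implies: the entire decompressed payload (plus working buffers) is materialised before anything is parsed. A self-contained comparison with synthetic data (not the real file map, so the absolute numbers differ, but the shape is the same):

```python
import bz2
import io
import tracemalloc

payload = b'Path: ./pool/main/x/example/example_1.0.dsc\n' * 200_000  # ~8.8 MB
compressed = bz2.compress(payload)

# All-at-once: the whole decompressed payload lives in memory at once.
tracemalloc.start()
whole = bz2.decompress(compressed)
_, peak_whole = tracemalloc.get_traced_memory()
tracemalloc.stop()

# Streaming: only a small read buffer is resident at any time.
tracemalloc.start()
lines = 0
with bz2.BZ2File(io.BytesIO(compressed)) as f:
    for line in f:
        lines += 1
_, peak_stream = tracemalloc.get_traced_memory()
tracemalloc.stop()

print(f'peak all-at-once: {peak_whole // 1024} kB, peak streaming: {peak_stream // 1024} kB')
```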
To investigate an alternative approach (file streaming/buffering), I ran patchman with the following patch:
```diff
--- a/errata/sources/distros/debian.py
+++ b/errata/sources/distros/debian.py
@@ -14,9 +14,12 @@
 # You should have received a copy of the GNU General Public License
 # along with Patchman. If not, see <http://www.gnu.org/licenses/>
 
+import bz2
 import concurrent.futures
 import csv
 import re
+import shutil
+import tempfile
 
 from datetime import datetime
 from io import StringIO
@@ -28,7 +31,7 @@ from operatingsystems.utils import get_or_create_osrelease
 from packages.models import Package
 from packages.utils import find_evr, get_or_create_package
 from patchman.signals import pbar_start, pbar_update
-from util import extract, fetch_content, get_setting_of_type, get_url
+from util import fetch_content, get_setting_of_type, get_url
 from util.logging import error_message, warning_message
 
 DSCs = {}
@@ -76,11 +79,30 @@ def fetch_dscs_from_debian_package_file_maps():
         file_map_url = f'https://deb.debian.org/{repo}/indices/package-file.map.bz2'
         res = get_url(file_map_url)
         data = fetch_content(res, f'Fetching `{repo}` package file map')
-        file_map_data = extract(data, file_map_url).decode()
-        parse_debian_package_file_map(file_map_data, repo)
-
-
-def parse_debian_package_file_map(data, repo):
+        try:
+            with (
+                tempfile.NamedTemporaryFile(mode='w+b', suffix='.bz2') as bz2_file,
+                tempfile.NamedTemporaryFile(mode='w+b', suffix='.map') as file_map,
+            ):
+                bz2_file.write(data)
+                bz2_file.seek(0)
+
+                # Streaming decompress to reduce memory footprint.
+                with bz2.BZ2File(bz2_file.name) as decompressor_f:
+                    shutil.copyfileobj(decompressor_f, file_map)
+
+                # Streaming parse file map to reduce memory footprint.
+                file_map.seek(0)
+                parse_debian_package_file_map(file_map, repo)
+        except IOError as e:
+            if str(e) == 'invalid data stream':
+                error_message(text='bunzip2: ' + str(e))
+        except ValueError as e:
+            if str(e) == "couldn't find end of stream":
+                error_message(text='bunzip2: ' + str(e))
+
+
+def parse_debian_package_file_map(file_map, repo):
     """ Parse the a Debian package file map
     Format:
     Path: ./pool/updates/main/3/389-ds-base/389-ds-base_1.4.0.21-1+deb10u1.dsc
@@ -88,7 +110,8 @@ def parse_debian_package_file_map(data, repo):
     Source-Version: 1.4.0.21-1+deb10u1
     """
     parsing_dsc = False
-    for line in data.splitlines():
+    for line in file_map:
+        line = line.removesuffix(b'\n').decode()
         if line.startswith('Path:'):
             if line.endswith('.dsc'):
                 parsing_dsc = True
```
Nothing permanent, just a way of showing where the memory cost is. This patch has the downside of requiring write access to /tmp; I'm not sure whether that's already a requirement for patchman to run.
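If the /tmp requirement turns out to be a concern, the temporary files could be skipped entirely: `bz2.BZ2File` accepts any file-like object, so the already-fetched bytes can be wrapped in `io.BytesIO` and decompressed lazily while parsing. A sketch, with a hypothetical cut-down stand-in for `parse_debian_package_file_map` and synthetic bytes in place of the `fetch_content()` result:

```python
import bz2
import io

# Hypothetical stand-in for parse_debian_package_file_map(): iterate the
# decompressing file object line by line, as in the patch above.
def count_dscs(file_map):
    count = 0
    for line in file_map:
        line = line.removesuffix(b'\n').decode()
        if line.startswith('Path:') and line.endswith('.dsc'):
            count += 1
    return count

# Synthetic bytes in place of the fetch_content() result.
data = bz2.compress(
    b'Path: ./pool/main/e/example/example_1.0.dsc\n'
    b'Binary: example\n'
    b'Source-Version: 1.0\n'
)

# No temp files: decompress lazily straight from the in-memory .bz2 bytes.
with bz2.BZ2File(io.BytesIO(data)) as file_map:
    count = count_dscs(file_map)
print(count)  # 1
```

This still holds the roughly 9 MB of compressed bytes in memory (as fetch_content already does), but avoids both the full decompressed copy and the disk round-trip.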
After adding back some debug prints, this shows a peak-memory improvement of roughly 900 MB for this specific debian job:
```
before: current= 87_384 kB peak= 87_756 kB
after fetch_content() for repo debian: current= 98_064 kB peak= 105_484 kB
after copyfileobj() for repo debian: current= 99_704 kB peak= 105_484 kB
after parse_debian_package_file_map() for repo debian: current= 151_796 kB peak= 151_796 kB
after fetch_content() for repo debian-security: current= 143_052 kB peak= 150_408 kB
after copyfileobj() for repo debian-security: current= 145_868 kB peak= 150_408 kB
after parse_debian_package_file_map() for repo debian-security: current= 146_312 kB peak= 150_408 kB
```
I suspect gains can also be made for erratum types other than debian, since those follow the same pattern of reading entire compressed and decompressed files into memory.
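For what it's worth, the same streaming pattern generalises beyond bz2: `bz2.open`, `gzip.open`, and `lzma.open` all accept an existing file object, so if other erratum sources use .gz or .xz indices (an assumption on my part; I haven't checked them), one small helper could cover all of them. The `open_compressed` helper below is hypothetical, not part of patchman:

```python
import bz2
import gzip
import io
import lzma

# Hypothetical helper: pick a lazily-decompressing opener by file extension.
OPENERS = {'.bz2': bz2.open, '.gz': gzip.open, '.xz': lzma.open}

def open_compressed(data, url):
    """Return a lazily-decompressing, line-iterable file object."""
    for suffix, opener in OPENERS.items():
        if url.endswith(suffix):
            return opener(io.BytesIO(data))
    raise ValueError(f'unsupported compression: {url}')

# Round-trip a sample line through all three formats.
payload = b'Path: ./pool/main/e/example/example_1.0.dsc\n'
ok = []
for url, compress in (('a.bz2', bz2.compress), ('a.gz', gzip.compress), ('a.xz', lzma.compress)):
    with open_compressed(compress(payload), url) as f:
        ok.append(f.read() == payload)
print(ok)  # [True, True, True]
```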
I hope this helps; if there's anything more you'd like me to do, let me know!