improve files cache memory usage even further #8618

@MichaelDietzel

Description

Have you checked borgbackup docs, FAQ, and open GitHub issues?

Yes

Is this a BUG / ISSUE report or a QUESTION?

None of the above. Probably a feature request.

Your borg version (borg -V).

2.0.0b14

Describe the problem you're observing.

In #8511 the memory usage of the files cache was already greatly reduced by using the IDs provided by the new borghash. If I understand everything correctly, a file's chunk list probably often contains multiple consecutive IDs. Memory usage could therefore be reduced further by storing, alongside a chunk ID, a count of how many chunks with consecutive IDs follow it.

disclaimer

I am not familiar with how to use msgpack efficiently, so this is just one rough idea on how to implement this. There are probably better approaches, but I wanted to be able to get an estimate of what this could mean for memory usage. (I could find barely any documentation about the inner workings of msgpack, and I did not read the source code to find out what exactly happens. If anyone knows of good documentation, I'd be happy for a hint.)

optimization for consecutive chunks

Increase the size of each stored ID from 4 bytes to 5 bytes. The added byte states how many consecutive chunks follow the given ID, so 0 to 255 consecutive followers can be encoded. If there are more consecutive chunks than that, another entry has to be started. That costs a little, but at that point the reduction in memory usage should already be pretty good.
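To make the scheme concrete, here is a rough sketch of the run-length encoding described above. This is not borg's actual code; the function names and the (start_id, extra) tuple representation are just illustrative, with IDs modeled as plain integers:

```python
def encode_runs(ids):
    """Run-length encode consecutive integer IDs.

    Returns a list of (start_id, extra) pairs, where `extra` is the
    number of consecutive IDs following start_id, capped at 255 so it
    fits in the single added byte. A longer run is split into several
    entries, as described above.
    """
    runs = []
    i = 0
    while i < len(ids):
        start = ids[i]
        extra = 0
        while (extra < 255 and i + extra + 1 < len(ids)
               and ids[i + extra + 1] == start + extra + 1):
            extra += 1
        runs.append((start, extra))
        i += extra + 1
    return runs


def decode_runs(runs):
    """Expand (start_id, extra) pairs back into the full ID list."""
    ids = []
    for start, extra in runs:
        ids.extend(range(start, start + extra + 1))
    return ids
```

For example, `[10, 11, 12, 40, 41, 99]` would collapse to the three entries `(10, 2)`, `(40, 1)`, `(99, 0)`, and a run of 300 consecutive IDs would need two entries because of the 255 cap.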
Some estimates of the change in memory usage, based on the estimate given in #8511 of X + N * 5 bytes, with N being the number of chunks used by a file:

  • Worst case: X + N * 6 bytes. Probably mostly for small files, where X is big compared to N * 6.
  • Best case: X + N * ~0.02 bytes.
  • When each chunk ID has exactly one consecutive following ID: X + N * 3 bytes.

This assumes that msgpack does not align the data it stores. If it instead aligned the 5-byte values at 8-byte addresses, each entry would carry 3 bytes of padding overhead; in that case storing the IDs and the count bytes in separate lists might help.
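The per-chunk figures above can be reproduced with a little arithmetic. Assuming each stored entry costs 6 bytes in total (4-byte ID prefix, 1 run-length byte, and roughly 1 byte of msgpack overhead, matching the X + N * 5 baseline) and covers 1 + extra chunks:

```python
def bytes_per_chunk(extra):
    """Per-chunk memory cost when one 6-byte entry (4-byte ID prefix,
    1 run-length byte, ~1 byte msgpack overhead) covers 1 + `extra`
    chunks."""
    return 6 / (1 + extra)

print(bytes_per_chunk(0))    # worst case, no consecutive IDs: 6.0
print(bytes_per_chunk(255))  # best case, maximal runs: ~0.023
print(bytes_per_chunk(1))    # exactly one consecutive follower: 3.0
```

These match the three cases listed above: 6 bytes per chunk in the worst case, roughly 0.02 in the best case, and 3 when every ID has exactly one consecutive follower.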

additional optimization for repeated chunks

Besides consecutive IDs, I assume there could also be repeated IDs, for example when a disk image with long runs of zeroes (empty space) is stored. So a special case for this could be worthwhile as well: either store this information separately, or "take some values away" from the additional byte and use them to indicate repeated chunks instead. Example: 255 means 1 additional repetition, 254 means 2 repetitions, 253 means 4 repetitions, 252 means 8 repetitions, and so on up to some good value. That way the best case from above would only get slightly worse, while still allowing significant gains for long runs of repeated values.
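The doubling scheme for the reserved values could be decoded like this. This is only a sketch: the cutoff of 248 (reserving the top 8 values, up to 128 repetitions) is a hypothetical choice, not anything from borg:

```python
REPEAT_BASE = 248  # hypothetical cutoff: values 248..255 mark repeats

def decode_count_byte(value):
    """Interpret the additional byte of an entry.

    Values below REPEAT_BASE mean `value` chunks with consecutive IDs
    follow the entry's ID. Values from REPEAT_BASE up encode extra
    repetitions of the same ID instead: 255 -> 1, 254 -> 2, 253 -> 4,
    doubling down to REPEAT_BASE -> 128.
    """
    if value >= REPEAT_BASE:
        return ('repeat', 2 ** (255 - value))
    return ('consecutive', value)
```

With this split, the best consecutive-run case shrinks only from 255 to 247 followers per entry, while a long run of identical chunks still compresses exponentially.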

So I hope my assumptions are correct and this leads to some more nice improvements.
