improve files cache memory usage even further #8618

@MichaelDietzel

Description

Have you checked borgbackup docs, FAQ, and open GitHub issues?

Yes

Is this a BUG / ISSUE report or a QUESTION?

None of the above. Probably a feature request.

Your borg version (borg -V).

2.0.0b14

Describe the problem you're observing.

In #8511 the memory usage of the files cache was already greatly reduced by using the IDs provided by the new borghash. If I understand everything correctly, a file's chunk list probably often contains multiple consecutive IDs. Memory usage could therefore be reduced further by storing, alongside a chunk ID, a count of how many chunks with consecutive IDs follow it.

disclaimer

I am not familiar with how to use msgpack efficiently, so this is just one rough idea on how to implement this. There are probably better approaches, but I wanted to be able to get an estimate of what this could mean for memory usage. (I could find barely any documentation about the inner workings of msgpack, and I did not read the source code to find out what exactly happens. If anyone knows of good documentation, I'd be happy for a hint.)

optimization for consecutive chunks

Increase the size of each stored ID from 4 bytes to 5 bytes. The added byte states how many consecutive chunks follow the given ID, so 0 to 255 consecutive followers can be encoded. If there are more consecutive chunks than that, another entry has to be started. That costs a little, but at that point the reduction in memory usage should already be pretty good.
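To make the scheme concrete, here is a rough sketch of the run-length encoding described above. This is not borg's actual code; the function names and the (start_id, extra) tuple representation are just illustrative, with IDs modeled as plain integers:

```python
def encode_runs(ids):
    """Run-length encode consecutive integer IDs.

    Returns a list of (start_id, extra) pairs, where `extra` is the
    number of consecutive IDs following start_id, capped at 255 so it
    fits in the single added byte. A longer run is split into several
    entries, as described above.
    """
    runs = []
    i = 0
    while i < len(ids):
        start = ids[i]
        extra = 0
        while (extra < 255 and i + extra + 1 < len(ids)
               and ids[i + extra + 1] == start + extra + 1):
            extra += 1
        runs.append((start, extra))
        i += extra + 1
    return runs


def decode_runs(runs):
    """Expand (start_id, extra) pairs back into the full ID list."""
    ids = []
    for start, extra in runs:
        ids.extend(range(start, start + extra + 1))
    return ids
```

For example, `[10, 11, 12, 40, 41, 99]` would collapse to the three entries `(10, 2)`, `(40, 1)`, `(99, 0)`, and a run of 300 consecutive IDs would need two entries because of the 255 cap.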
Some estimates of the change in memory usage, based on the estimate given in #8511 of X + N * 5 bytes, with N being the number of chunks used by a file:

  • Worst case: X + N * 6 bytes. Probably mostly for small files, where X is big compared to N * 6.
  • Best case: X + N * ~0.02 bytes.
  • When each chunk ID has exactly one consecutive following ID: X + N * 3 bytes.

This assumes that msgpack does not align the data it stores. If it instead aligned the 5-byte values at 8-byte addresses, each entry would carry 3 bytes of padding overhead; in that case storing the IDs and the count bytes in separate lists might help.
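The per-chunk figures above can be reproduced with a little arithmetic. Assuming each stored entry costs 6 bytes in total (4-byte ID prefix, 1 run-length byte, and roughly 1 byte of msgpack overhead, matching the X + N * 5 baseline) and covers 1 + extra chunks:

```python
def bytes_per_chunk(extra):
    """Per-chunk memory cost when one 6-byte entry (4-byte ID prefix,
    1 run-length byte, ~1 byte msgpack overhead) covers 1 + `extra`
    chunks."""
    return 6 / (1 + extra)

print(bytes_per_chunk(0))    # worst case, no consecutive IDs: 6.0
print(bytes_per_chunk(255))  # best case, maximal runs: ~0.023
print(bytes_per_chunk(1))    # exactly one consecutive follower: 3.0
```

These match the three cases listed above: 6 bytes per chunk in the worst case, roughly 0.02 in the best case, and 3 when every ID has exactly one consecutive follower.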

additional optimization for repeated chunks

Besides consecutive IDs, I assume there could also be repeated IDs, for example when a disk image with long runs of zeroes (empty space) is stored. So a special case for this could be worthwhile as well: either store this information separately, or "take some values away" from the additional byte and use them to indicate repeated chunks instead. Example: 255 means 1 additional repetition, 254 means 2 repetitions, 253 means 4 repetitions, 252 means 8 repetitions, and so on up to some good value. That way the best case from above would only get slightly worse, while still allowing significant gains for long runs of repeated values.
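The doubling scheme for the reserved values could be decoded like this. This is only a sketch: the cutoff of 248 (reserving the top 8 values, up to 128 repetitions) is a hypothetical choice, not anything from borg:

```python
REPEAT_BASE = 248  # hypothetical cutoff: values 248..255 mark repeats

def decode_count_byte(value):
    """Interpret the additional byte of an entry.

    Values below REPEAT_BASE mean `value` chunks with consecutive IDs
    follow the entry's ID. Values from REPEAT_BASE up encode extra
    repetitions of the same ID instead: 255 -> 1, 254 -> 2, 253 -> 4,
    doubling down to REPEAT_BASE -> 128.
    """
    if value >= REPEAT_BASE:
        return ('repeat', 2 ** (255 - value))
    return ('consecutive', value)
```

With this split, the best consecutive-run case shrinks only from 255 to 247 followers per entry, while a long run of identical chunks still compresses exponentially.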

So I hope my assumptions are correct and this leads to some more nice improvements.
