Merged
12 changes: 6 additions & 6 deletions CHANGES.rst
Original file line number Diff line number Diff line change
@@ -1,18 +1,18 @@
ChangeLog
Changelog
=========

Version 0.1.0 2024-11-18
------------------------

- HashTableNT: deal with byte_order separately
- HashTableNT: give separate formats in value_format namedtuple
- HashTableNT: handle ``byte_order`` separately.
- HashTableNT: provide separate formats in the ``value_format`` namedtuple.

Version 0.0.2 2024-11-10
------------------------

- Fixed "KV array is full" crash on 32bit platforms (and maybe also some other
int-size related issues), #27.
- Added a .update method to HashTableNT (like dict.update), #28.
- Fixed "KV array is full" crash on 32-bit platforms (and maybe also some other
integer-size related issues), #27.
- Added an ``.update()`` method to HashTableNT (like ``dict.update()``), #28.

Version 0.0.1 2024-10-31
------------------------
72 changes: 32 additions & 40 deletions README.rst
@@ -1,71 +1,63 @@
BorgHash
=========
========

Memory-efficient hashtable implementations as a Python library,
implemented in Cython.
Memory-efficient hashtable implementations as a Python library implemented in Cython.

HashTable
---------

``HashTable`` is a rather low-level implementation, usually one rather wants to
use the ``HashTableNT`` wrapper. But read on to get the basics...
``HashTable`` is a fairly low-level implementation; usually one will want to use the ``HashTableNT`` wrapper. Read on for the basics...

Keys and Values
~~~~~~~~~~~~~~~

The keys MUST be perfectly random ``bytes`` of arbitrary, but constant length,
like from a cryptographic hash (sha256, hmac-sha256, ...).
The implementation relies on this "perfectly random" property and does not
implement an own hash function, but just takes 32 bits from the given key.
The keys MUST be perfectly random ``bytes`` of arbitrary but fixed length, like from a cryptographic hash (SHA-256, HMAC-SHA-256, ...).
The implementation relies on this "perfectly random" property and does not implement its own hash function; it just takes 32 bits from the given key.

The values are binary ``bytes`` of arbitrary, but constant length.
The values are ``bytes`` of arbitrary but fixed length.

The length of the keys and values is defined when creating a ``HashTable``
instance (after that, the length must always match that defined length).
The lengths of the keys and values are defined when creating a ``HashTable`` instance; thereafter, the lengths must always match the defined size.

Implementation details
~~~~~~~~~~~~~~~~~~~~~~

To have little memory overhead overall, the hashtable only stores uint32_t
indexes into separate keys and values arrays (short: kv arrays).
To have little memory overhead overall, the hashtable only stores ``uint32_t``
indices into separate keys and values arrays (short: kv arrays).

A new key just gets appended to the keys array. The corresponding value gets
appended to the values array. After that, the key and value do not change their
A new key is appended to the keys array. The corresponding value is appended to the values array. After that, the key and value do not change their
index as long as they exist in the hashtable and the ht and kv arrays are in
memory. Even when kv pairs are deleted from ``HashTable``, the kv arrays never
shrink and the indexes of other kv pairs don't change.
shrink and the indices of other kv pairs don't change.

This is because we want to have stable array indexes for the keys/values so the
indexes can be used outside of ``HashTable`` as memory-efficient references.
This is because we want to have stable array indices for the keys/values, so the
indices can be used outside of ``HashTable`` as memory-efficient references.

Memory allocated
~~~~~~~~~~~~~~~~

For a hashtable load factor of 0.1 - 0.5, a kv array grow factor of 1.3 and
For a hashtable load factor of 0.1 to 0.5, a kv array growth factor of 1.3, and
N kv pairs, memory usage in bytes is approximately:

- Hashtable: from ``N * 4 / 0.5`` to ``N * 4 / 0.1``
- Keys/Values: from ``N * len(key+value) * 1.0`` to ``N * len(key+value) * 1.3``
- Overall: from ``N * (8 + len(key+value))`` to ``N * (40 + len(key+value) * 1.3)``
- Keys/Values: from ``N * len(key + value) * 1.0`` to ``N * len(key + value) * 1.3``
- Overall: from ``N * (8 + len(key + value))`` to ``N * (40 + len(key + value) * 1.3)``
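As a quick sanity check, the bounds above can be computed directly. This is a plain-Python sketch; the load factors 0.1/0.5 and the growth factor 1.3 are the defaults described in the text, and the function name is illustrative, not part of the library API:

```python
def estimate_memory(n, kv_len, min_load=0.1, max_load=0.5, kv_grow=1.3):
    """Rough (low, high) memory bounds in bytes for n kv pairs."""
    ht_lo, ht_hi = n * 4 / max_load, n * 4 / min_load      # uint32_t slots
    kv_lo, kv_hi = n * kv_len * 1.0, n * kv_len * kv_grow  # dense kv arrays
    return (ht_lo + kv_lo, ht_hi + kv_hi)

# 1M entries, 32-byte key + 4-byte value:
lo, hi = estimate_memory(1_000_000, 36)
```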

When the hashtable or the kv arrays are resized, there will be short memory
usage spikes. For the kv arrays, ``realloc()`` is used to avoid copying of
data and memory usage spikes, if possible.
When the hashtable or the kv arrays are resized, there will be brief memory-usage spikes. For the kv arrays, ``realloc()`` is used to avoid copying data and to minimize memory-usage spikes, if possible.

HashTableNT
-----------

``HashTableNT`` is a convenience wrapper around ``HashTable``:

- accepts and returns ``namedtuple`` values
- implements persistence: can read (write) the hashtable from (to) a file.
- Accepts and returns ``namedtuple`` values.
- Implements persistence: can read the hashtable from a file and write it to a file.

Keys and Values
~~~~~~~~~~~~~~~

Keys: ``bytes``, see ``HashTable``.

Values: any fixed type of ``namedtuple`` that can be serialized to ``bytes``
Values: any fixed ``namedtuple`` type that can be serialized to ``bytes``
by Python's ``struct`` module using a given format string.
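That serialization step can be sketched with the standard library alone. The combined format string ``<II`` here is an assumption for illustration (little-endian, two uint32 fields); it is not necessarily how the library joins the per-field formats internally:

```python
import struct
from collections import namedtuple

Chunk = namedtuple("Chunk", ["refcount", "size"])
fmt = "<II"  # little-endian, two unsigned 32-bit fields

packed = struct.pack(fmt, *Chunk(refcount=1, size=42))  # 8 bytes
restored = Chunk(*struct.unpack(fmt, packed))
assert restored == Chunk(refcount=1, size=42)
```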

When setting a value, it is automatically serialized. When a value is returned,
@@ -75,11 +67,11 @@ Persistence
~~~~~~~~~~~

``HashTableNT`` has ``.write()`` and ``.read()`` methods to save/load its
content to/from a file, using an efficient binary format.
contents to/from a file, using an efficient binary format.

When a ``HashTableNT`` is saved to disk, only the non-deleted entries are
persisted and when it is loaded from disk, a new hashtable and new, dense
kv arrays are built - thus, kv indexes will be different!
persisted. When it is loaded from disk, a new hashtable and new, dense
kv arrays are built; thus, kv indices will be different!

API
---
@@ -96,15 +88,15 @@ Example code

::

# HashTableNT mapping 256bit key [bytes] --> Chunk value [namedtuple]
# HashTableNT mapping 256-bit key [bytes] --> Chunk value [namedtuple]
Chunk = namedtuple("Chunk", ["refcount", "size"])
ChunkFormat = namedtuple("ChunkFormat", ["refcount", "size"])
chunk_format = ChunkFormat(refcount="I", size="I")

# 256bit (32Byte) key, 2x 32bit (4Byte) values
# 256-bit (32-byte) key, 2x 32-bit (4-byte) values
ht = HashTableNT(key_size=32, value_type=Chunk, value_format=chunk_format)

key = b"x" * 32 # the key is usually from a cryptographic hash fn
key = b"x" * 32 # the key is usually from a cryptographic hash function
value = Chunk(refcount=1, size=42)
ht[key] = value
assert ht[key] == value
@@ -131,9 +123,9 @@ Want a demo?

Run ``borghash-demo`` after installing the ``borghash`` package.

It will show you the demo code, run it and print the results for your machine.
It will show you the demo code, run it, and print the results for your machine.

Results on an Apple MacBook Pro (M3 Pro CPU) are like:
Results on an Apple MacBook Pro (M3 Pro CPU) look like:

::

@@ -144,18 +136,18 @@ Results on an Apple MacBook Pro (M3 Pro CPU) are like:
State of this project
---------------------

**API is still unstable and expected to change as development goes on.**
**API is still unstable and expected to change as development continues.**

**As long as the API is unstable, there will be no data migration tools,
like e.g. for reading an existing serialized hashtable.**
e.g., for reading an existing serialized hashtable.**

There might be missing features or optimization potential, feedback welcome!
There might be missing features or optimization potential; feedback is welcome!

Borg?
-----

Please note that this code is currently **not** used by the stable release of
BorgBackup (aka "borg"), but might be used by borg master branch in the future.
BorgBackup (aka "borg"), but it might be used by Borg's master branch in the future.

License
-------
2 changes: 1 addition & 1 deletion setup.py
@@ -3,7 +3,7 @@
try:
from Cython.Build import cythonize
except ImportError:
cythonize = None # we don't have cython installed
cythonize = None # we don't have Cython installed

ext = '.pyx' if cythonize else '.c'

34 changes: 17 additions & 17 deletions src/borghash/HashTable.pyx
@@ -1,8 +1,8 @@
"""
HashTable: low-level ht mapping fully random bytes keys to bytes values.
key and value length can be chosen, but is fixed afterwards.
the keys and values are stored in arrays separate from the hashtable.
the hashtable only stores the 32bit indexes into the key/value arrays.
HashTable: low-level hash table mapping fully random bytes keys to bytes values.
Key and value lengths can be chosen, but are fixed thereafter.
The keys and values are stored in arrays separate from the hashtable.
The hashtable only stores the 32-bit indices into the key/value arrays.
"""
from __future__ import annotations
from typing import BinaryIO, Iterator, Any
@@ -49,7 +49,7 @@ cdef class HashTable:
shrink_factor: float = 0.4, grow_factor: float = 2.0,
kv_grow_factor: float = 1.3) -> None:
# the load of the ht (.table) shall be between 0.25 and 0.5, so it is fast and has few collisions.
# it is cheap to have a low hash table load, because .table only stores uint32_t indexes into the
# it is cheap to have a low hash table load, because .table only stores uint32_t indices into the
# .keys and .values array.
# the keys/values arrays have bigger elements and are not hash tables, thus collisions and load
# factor are no concern there. the kv_grow_factor can be relatively small.
@@ -96,7 +96,7 @@ cdef class HashTable:
free(self.values)

def clear(self) -> None:
"""empty HashTable, start from scratch"""
"""Empty the HashTable and start from scratch."""
self.capacity = 0
self.used = 0
self._resize_table(self.initial_capacity)
@@ -107,7 +107,7 @@ cdef class HashTable:
return self.used

cdef size_t _get_index(self, uint8_t* key):
"""key must be perfectly random distributed bytes, so we don't need a hash function here."""
"""Key must be perfectly random bytes, so we don't need a hash function here."""
cdef uint32_t key32 = (key[0] << 24) | (key[1] << 16) | (key[2] << 8) | key[3]
return key32 % self.capacity
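In plain Python, the index computation in ``_get_index`` amounts to the following sketch, mirroring the Cython code above (it works only because the keys are assumed to be perfectly random bytes):

```python
def get_index(key: bytes, capacity: int) -> int:
    # Take the first 32 bits of the (assumed perfectly random) key
    # as a big-endian integer; no extra hash function is needed.
    key32 = (key[0] << 24) | (key[1] << 16) | (key[2] << 8) | key[3]
    return key32 % capacity

idx = get_index(b"\x00\x00\x01\x00" + b"\x00" * 28, 1024)  # -> 256
```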

@@ -149,7 +149,7 @@ cdef class HashTable:
self._resize_kv(int(self.kv_capacity * self.kv_grow_factor))
if self.kv_used >= self.kv_capacity:
# Should never happen. See "RESERVED" constant - we allow almost 4Gi kv entries.
# For a typical 256bit key and a small 32bit value that would already consume 176GiB+
# For a typical 256-bit key and a small 32-bit value that would already consume 176GiB+
# memory (plus spikes to even more when hashtable or kv arrays get resized).
raise RuntimeError("KV array is full")

@@ -260,7 +260,7 @@ cdef class HashTable:
self.tombstones = 0

cdef void _resize_kv(self, size_t new_capacity):
# We must never use kv indexes >= RESERVED, thus we'll never need more capacity either.
# We must never use kv indices >= RESERVED; thus, we'll never need more capacity either.
cdef size_t capacity = min(new_capacity, <size_t> RESERVED - 1)
self.stats_resize_kv += 1
# realloc is already highly optimized (in Linux). By using mremap internally only the peak address space usage is "old size" + "new size", while the peak memory usage is only "new size".
@@ -270,8 +270,8 @@

def k_to_idx(self, key: bytes) -> int:
"""
return the key's index in the keys array (index is stable while in memory).
this can be used to "abbreviate" a known key (e.g. 256bit key -> 32bit index).
Return the key's index in the keys array (index is stable while in memory).
This can be used to "abbreviate" a known key (e.g., 256-bit key -> 32-bit index).
"""
if len(key) != self.ksize:
raise ValueError("Key size does not match the defined size")
@@ -283,16 +283,16 @@

def idx_to_k(self, idx: int) -> bytes:
"""
for a given index, return the key stored at that index in the keys array.
this is the reverse of k_to_idx (e.g. 32bit index -> 256bit key).
For a given index, return the key stored at that index in the keys array.
This is the reverse of k_to_idx (e.g., 32-bit index -> 256-bit key).
"""
cdef uint32_t kv_index = <uint32_t> idx
return self.keys[kv_index * self.ksize:(kv_index + 1) * self.ksize]
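The stable-index idea behind ``k_to_idx``/``idx_to_k`` can be illustrated with a plain list standing in for the dense keys array. The names here (``add_key``) are simplified stand-ins for illustration; the real ``k_to_idx`` looks up an existing key rather than appending:

```python
keys = []  # stands in for the dense keys array

def add_key(key: bytes) -> int:
    """Append a new key; its index stays stable for its lifetime."""
    keys.append(key)
    return len(keys) - 1  # e.g. 256-bit key -> small integer index

def idx_to_k(idx: int) -> bytes:
    """Reverse mapping: index -> stored key."""
    return keys[idx]

i = add_key(b"\xaa" * 32)
assert idx_to_k(i) == b"\xaa" * 32
```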

def kv_to_idx(self, key: bytes, value: bytes) -> int:
"""
return the key's/value's index in the keys/values array (index is stable while in memory).
this can be used to "abbreviate" a known key/value pair. (e.g. 256bit key + 32bit value -> 32bit index).
Return the key's/value's index in the keys/values array (index is stable while in memory).
This can be used to "abbreviate" a known key/value pair (e.g., 256-bit key + 32-bit value -> 32-bit index).
"""
if len(key) != self.ksize:
raise ValueError("Key size does not match the defined size")
@@ -309,8 +309,8 @@

def idx_to_kv(self, idx: int) -> tuple[bytes, bytes]:
"""
for a given index, return the key/value stored at that index in the keys/values array.
this is the reverse of kv_to_idx (e.g. 32bit index -> 256bit key + 32bit value).
For a given index, return the key/value stored at that index in the keys/values array.
This is the reverse of kv_to_idx (e.g., 32-bit index -> 256-bit key + 32-bit value).
"""
cdef uint32_t kv_index = <uint32_t> idx
key = self.keys[kv_index * self.ksize:(kv_index + 1) * self.ksize]
8 changes: 4 additions & 4 deletions src/borghash/HashTableNT.pyx
@@ -36,7 +36,7 @@ cdef class HashTableNT:
if not all(isinstance(fmt, str) and len(fmt) > 0 for fmt in value_format):
raise ValueError("value_format's elements must be str and non-empty.")
if byte_order not in BYTE_ORDER:
raise ValueError("byte_order must be one of: {','.join(BYTE_ORDER.keys())}")
raise ValueError(f"byte_order must be one of: {', '.join(BYTE_ORDER.keys())}")
self.key_size = key_size
self.value_type = value_type
self.value_format = value_format
@@ -124,7 +124,7 @@ cdef class HashTableNT:
return self._to_namedtuple_value(binary_value)

def update(self, other=(), /, **kwds):
"""Like dict.update, but other can also be a HashTableNT instance."""
"""Like dict.update(), but 'other' can also be a HashTableNT instance."""
if isinstance(other, HashTableNT):
for key, value in other.items():
self[key] = value
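The ``dict.update``-style fallback paths (mapping, pair-iterable, keyword arguments) can be sketched in plain Python. This is a simplified stand-in to show the semantics, not the actual Cython body:

```python
def update(table, other=(), /, **kwds):
    """dict.update-like semantics: mapping, pair-iterable, then keywords."""
    if hasattr(other, "keys"):  # mapping-like object
        for key in other.keys():
            table[key] = other[key]
    else:                       # iterable of (key, value) pairs
        for key, value in other:
            table[key] = value
    for key, value in kwds.items():
        table[key] = value

d = {}
update(d, [("a", 1)], b=2)
assert d == {"a": 1, "b": 2}
```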
@@ -228,9 +228,9 @@

def size(self) -> int:
"""
do a rough worst-case estimate of the on-disk size when using .write().
Do a rough worst-case estimate of the on-disk size when using .write().

the serialized size of the metadata is a bit hard to predict, but we cover that with one_time_overheads.
The serialized size of the metadata is a bit hard to predict, but we cover that with one_time_overheads.
"""
one_time_overheads = 4096 # very rough
N = self.inner.used
2 changes: 1 addition & 1 deletion src/borghash/__init__.py
@@ -1,5 +1,5 @@
"""
borghash - hashtable implementations in cython.
borghash - hashtable implementations in Cython.
"""
from .HashTable import HashTable
from .HashTableNT import HashTableNT
8 changes: 4 additions & 4 deletions src/borghash/__main__.py
@@ -1,5 +1,5 @@
"""
Demonstration of borghash.
Demonstration of BorgHash.
"""

def demo():
@@ -17,12 +17,12 @@ def demo():
value_type = namedtuple("Chunk", ["refcount", "size"])
value_format_t = namedtuple("ChunkFormat", ["refcount", "size"])
value_format = value_format_t(refcount="I", size="I")
# 256bit (32Byte) key, 2x 32bit (4Byte) values
# 256-bit (32-byte) key, 2x 32-bit (4-byte) values
ht = HashTableNT(key_size=32, value_type=value_type, value_format=value_format)

t0 = time()
for i in range(count):
# make up a 256bit key from i, first 32bits need to be well distributed.
# Make up a 256-bit key from i; the first 32 bits need to be well distributed.
key = f"{i:4x}{' '*28}".encode()
value = value_type(refcount=i, size=i * 2)
ht[key] = value
@@ -50,7 +50,7 @@ def demo():

t4 = time()
for i in range(count):
# make up a 256bit key from i, first 32bits need to be well distributed.
# Make up a 256-bit key from i; the first 32 bits need to be well distributed.
key = f"{i:4x}{' '*28}".encode()
expected_value = value_type(refcount=i, size=i * 2)
assert ht_read.pop(key) == expected_value