30 changes: 30 additions & 0 deletions docs/dataset/dataset_design.rst
@@ -75,3 +75,33 @@ We note that the dataset currently exclusively supports storing data in an
SQLite database. This is not an intrinsic limitation of the dataset and
measurement layer. Support for writing to a different backend may be added
in the future.

.. _sec:design_split_storage:

Split Raw Data Storage
======================

As the main SQLite database grows with many datasets, browsing experiments and
loading metadata can become slower due to the file size. To address this,
QCoDeS supports an optional **split raw data storage** mode (see
:ref:`sec:intro_split_raw_data` for user-facing details).

From a design perspective, this feature adds a thin routing layer inside the
``DataSet`` class without changing any public interfaces:

- A ``_data_conn`` property transparently returns either the main database
  connection or a per-dataset raw data connection, depending on the
  configuration.
- Write paths (``add_results``, ``_BackgroundWriter``) and read paths
  (``get_parameter_data``, ``DataSetCacheWithDBBackend``, ``number_of_results``,
  ``__len__``) all go through this single routing point.
- The per-dataset SQLite file is a lightweight database containing only the
  results table and numpy type adapters -- no QCoDeS metadata schema.
- Subscriber triggers (used for real-time data callbacks) are created on the
  data connection so that they fire regardless of which database holds the
  results table.

The implementation is contained in ``qcodes.dataset._raw_data_storage`` (helper
functions) and a handful of additions to ``qcodes.dataset.data_set`` (routing
logic). The ``Measurement`` context manager, ``DataSaver``, and all export
functions work without modification.
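
The routing idea can be illustrated with plain ``sqlite3``. The following is a minimal sketch and not the QCoDeS implementation: ``RoutedStore``, ``data_conn``, ``add_result`` and the two-column ``results`` table are hypothetical names chosen for illustration.

```python
import sqlite3


class RoutedStore:
    """Hypothetical stand-in for the routing idea; not the QCoDeS DataSet class."""

    def __init__(self, main_conn, raw_conn=None):
        self.main_conn = main_conn
        self._raw_conn = raw_conn  # None means split storage is off

    @property
    def data_conn(self):
        # Single routing point: every results read or write asks here.
        return self._raw_conn if self._raw_conn is not None else self.main_conn

    def add_result(self, x, y):
        with self.data_conn:  # commits on success
            self.data_conn.execute(
                "INSERT INTO results (x, y) VALUES (?, ?)", (x, y)
            )

    def number_of_results(self):
        return self.data_conn.execute("SELECT COUNT(*) FROM results").fetchone()[0]


def make_conn():
    # Both databases carry the results schema; only one receives rows.
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE results (x REAL, y REAL)")
    return conn


main, raw = make_conn(), make_conn()
split = RoutedStore(main, raw)
split.add_result(0.0, 1.0)
split.add_result(0.1, 1.5)

rows_in_main = main.execute("SELECT COUNT(*) FROM results").fetchone()[0]
rows_in_raw = raw.execute("SELECT COUNT(*) FROM results").fetchone()[0]
```

Because callers only ever touch ``data_conn``, switching between the two modes does not change any method that reads or writes results.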
41 changes: 41 additions & 0 deletions docs/dataset/introduction.rst
@@ -75,3 +75,44 @@ For dataset operations, QCoDeS provides functions for:
- **Exporting datasets**: :doc:`Exporting data to other file formats <../examples/DataSet/Exporting-data-to-other-file-formats>`
- **Extracting runs between databases**: :doc:`Extracting runs from one DB file to another <../examples/DataSet/Extracting-runs-from-one-DB-file-to-another>` and :func:`qcodes.dataset.extract_runs_into_db`
- **Bulk export and metadata-only databases**: :func:`qcodes.dataset.export_datasets_and_create_metadata_db` for creating lightweight metadata-only databases while exporting all data to NetCDF files

.. _sec:intro_split_raw_data:

Split Raw Data Storage
======================

By default, all measurement data (the results table rows) is stored in the same SQLite database alongside metadata such as experiments, runs, parameter layouts, and dependencies. Over time, the main database file can grow very large, which can slow down operations like browsing experiments and loading metadata.

QCoDeS supports an optional **split raw data storage** mode in which the actual measurement data for each ``DataSet`` is written to an individual, per-dataset SQLite file while all metadata remains in the main database. Each per-dataset file is named after the dataset's GUID (e.g. ``<guid>.db``) and is stored in a configurable folder.

This feature is controlled by two configuration options in ``qcodesrc.json``:

- ``dataset.raw_data_to_separate_db`` (bool, default ``false``): enables or disables split storage.
- ``dataset.raw_data_path`` (string, default ``"{db_location}"``): the folder where per-dataset files are created. The ``{db_location}`` placeholder is expanded to a folder derived from the main database path (e.g. ``~/experiments.db`` becomes ``~/experiments_db/``).
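
The placeholder expansion can be sketched as follows. This is an illustration only: ``expand_raw_data_path`` and ``raw_file_for`` are hypothetical helpers, and the naming rule is inferred from the single ``~/experiments.db`` example above rather than taken from the QCoDeS source.

```python
from pathlib import Path


def expand_raw_data_path(raw_data_path: str, main_db_path: str) -> Path:
    """Hypothetical helper: resolve the configured folder for per-dataset files."""
    db = Path(main_db_path).expanduser()
    # experiments.db -> experiments_db (sibling folder named after the file)
    derived = db.with_name(f"{db.stem}_{db.suffix.lstrip('.')}")
    return Path(raw_data_path.replace("{db_location}", str(derived))).expanduser()


def raw_file_for(guid: str, folder: Path) -> Path:
    # Each dataset gets its own file named after its GUID.
    return folder / f"{guid}.db"


default_folder = expand_raw_data_path("{db_location}", "/home/user/experiments.db")
custom_folder = expand_raw_data_path("/data/raw_measurements/", "/home/user/experiments.db")
```

A path without the placeholder is used as given, so the location of the main database only matters for the default setting.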

When enabled:

- The main database retains the full results table schema (column definitions) but no data rows are written to it, keeping it lightweight.
- All ``INSERT`` and ``SELECT`` operations on results data are transparently routed to the per-dataset file.
- The path to the per-dataset file is persisted in the run's metadata (``raw_data_db_path``), so ``load_by_id`` and related loading functions automatically reconnect to the correct file.
- All public ``DataSet`` APIs (``get_parameter_data``, ``to_pandas_dataframe``, ``to_xarray_dataset``, ``cache``, ``export``, etc.) work identically whether split storage is enabled or not.
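
The division of labour described in this list can be reproduced with two plain SQLite files. The sketch below is not the QCoDeS schema: the ``runs`` table layout, the ``example-guid`` value and the ``load_rows`` function are assumptions made for illustration.

```python
import sqlite3
import tempfile
from pathlib import Path

workdir = Path(tempfile.mkdtemp())
main_path = workdir / "experiments.db"
raw_path = workdir / "experiments_db" / "example-guid.db"
raw_path.parent.mkdir()

# Writer side: metadata and an empty results table go to the main database,
# the rows go to the per-dataset file.
main = sqlite3.connect(main_path)
main.execute("CREATE TABLE runs (guid TEXT, raw_data_db_path TEXT)")
main.execute("CREATE TABLE results (x REAL, y REAL)")
main.execute("INSERT INTO runs VALUES (?, ?)", ("example-guid", str(raw_path)))
main.commit()

raw = sqlite3.connect(raw_path)
raw.execute("CREATE TABLE results (x REAL, y REAL)")
raw.executemany("INSERT INTO results VALUES (?, ?)", [(0.0, 1.0), (0.1, 1.5)])
raw.commit()
raw.close()


def load_rows(main_conn, guid):
    """Hypothetical loader: follow the stored path, then read the rows."""
    (stored,) = main_conn.execute(
        "SELECT raw_data_db_path FROM runs WHERE guid = ?", (guid,)
    ).fetchone()
    conn = sqlite3.connect(stored)
    try:
        return conn.execute("SELECT x, y FROM results ORDER BY x").fetchall()
    finally:
        conn.close()


rows = load_rows(main, "example-guid")
main_row_count = main.execute("SELECT COUNT(*) FROM results").fetchone()[0]
```

The loader needs nothing but the main database and a GUID, which is why loading keeps working as long as the stored path still points at the file.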

Example runtime configuration::

    import qcodes as qc

    qc.config.dataset.raw_data_to_separate_db = True
    qc.config.dataset.raw_data_path = "/data/raw_measurements/"

If the per-dataset raw data files are moved to a different folder (e.g. during data migration or archival), the stored paths in the main database will become stale. Use the :func:`~qcodes.dataset.update_raw_data_paths` helper to update them::

    from qcodes.dataset import update_raw_data_paths

    update_raw_data_paths(
        db_path="/path/to/main_database.db",
        new_raw_data_folder="/new/location/of/raw_files/",
    )

This scans all datasets with a ``raw_data_db_path`` metadata entry, checks whether the corresponding ``.db`` file exists in the new folder, and updates the stored path accordingly.
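
The matching step can be pictured with a small stand-alone function. This is a sketch of the described behaviour, not the body of ``update_raw_data_paths``: ``remap_raw_paths`` and its dictionary input are hypothetical, and it assumes that files keep their ``<guid>.db`` names when moved.

```python
import tempfile
from pathlib import Path


def remap_raw_paths(
    stored: dict[str, str], new_folder: str
) -> tuple[dict[str, str], list[str]]:
    """Hypothetical sketch: return updated paths plus the GUIDs that were skipped."""
    updated, skipped = dict(stored), []
    for guid, old_path in stored.items():
        # Look for a file with the same name in the new folder.
        candidate = Path(new_folder) / Path(old_path).name
        if candidate.exists():
            updated[guid] = str(candidate)
        else:
            skipped.append(guid)  # file not found: leave the old path alone
    return updated, skipped


new_folder = Path(tempfile.mkdtemp())
(new_folder / "guid-a.db").touch()  # only this file was moved

stored = {"guid-a": "/old/raw/guid-a.db", "guid-b": "/old/raw/guid-b.db"}
updated, skipped = remap_raw_paths(stored, str(new_folder))
```

Entries whose file is missing from the new folder keep their old path, so a partial move does not overwrite paths that may still be valid.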

For more details on database management, see the :doc:`Database notebook <../examples/DataSet/Database>`.
91 changes: 91 additions & 0 deletions docs/examples/DataSet/Database.ipynb
@@ -167,6 +167,97 @@
"\n",
"Moreover, we have also written an [example notebook](Extracting-runs-from-one-DB-file-to-another.ipynb) of transferring `DataSets` between database flies that may serve as a template for more complex data organization protocols."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Split Raw Data Storage\n",
"\n",
"As the main database grows with many datasets, browsing experiments and loading metadata can become slower. QCoDeS supports an optional **split raw data storage** mode that writes the raw measurement data for each dataset into its own individual SQLite file, while keeping all metadata (experiments, runs, parameters, dependencies) in the main database.\n",
"\n",
"This keeps the main database lightweight and makes it faster to work with, while still allowing all existing `DataSet` APIs to function identically.\n",
"\n",
"### Configuration\n",
"\n",
"Split raw data storage is controlled by two configuration options:\n",
"\n",
"- `dataset.raw_data_to_separate_db` (bool, default `False`): enables or disables split storage.\n",
"- `dataset.raw_data_path` (string, default `\"{db_location}\"`): the folder where per-dataset SQLite files are created. The `{db_location}` placeholder expands to a folder derived from the main database path (e.g. `~/experiments.db` becomes `~/experiments_db/`).\n",
"\n",
"You can enable it at runtime:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Enable split raw data storage\n",
"qc.config.dataset.raw_data_to_separate_db = True\n",
"\n",
"# Optionally set a custom path for per-dataset files\n",
"qc.config.dataset.raw_data_path = \"/data/raw_measurements/\"\n",
"\n",
"# Or use the default which derives from the main DB location:\n",
"# qc.config.dataset.raw_data_path = \"{db_location}\"\n",
"# e.g. ~/experiments.db -> ~/experiments_db/"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Or permanently in your `qcodesrc.json`:\n",
"\n",
"```json\n",
"{\n",
" \"dataset\": {\n",
" \"raw_data_to_separate_db\": true,\n",
" \"raw_data_path\": \"{db_location}\"\n",
" }\n",
"}\n",
"```\n",
"\n",
"### How It Works\n",
"\n",
"When split storage is enabled:\n",
"\n",
"1. When a measurement starts (`mark_started()`), a per-dataset SQLite file named `<guid>.db` is created in the configured folder.\n",
"2. All measurement data (results table rows) is written to this per-dataset file instead of the main database.\n",
"3. The main database retains the results table schema (column definitions) but contains no data rows, keeping it small.\n",
"4. The path to the per-dataset file is saved in the run metadata, so `load_by_id()` and related functions automatically find and reconnect to the correct file.\n",
"5. All `DataSet` methods (`get_parameter_data`, `to_pandas_dataframe`, `to_xarray_dataset`, `cache`, `export`, etc.) work transparently with split storage.\n",
"\n",
"> **Note:** Datasets created with split storage enabled can always be loaded later, even if the configuration is changed back to the default, as long as the per-dataset files remain at their original paths.\n",
"\n",
"### Updating Paths After Moving Raw Data Files\n",
"\n",
"If you move the per-dataset raw data files to a different folder, the paths stored in the main database become stale. Use `update_raw_data_paths` to fix them:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# from qcodes.dataset import update_raw_data_paths\n",
"\n",
"# After moving raw data files to a new folder:\n",
"# update_raw_data_paths(\n",
"# db_path=\"/path/to/main_database.db\",\n",
"# new_raw_data_folder=\"/new/location/of/raw_files/\"\n",
"# )"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The function scans all datasets that have a `raw_data_db_path` metadata entry, checks whether the corresponding `.db` file exists in the new folder, and updates the stored path. Datasets whose files are not found in the new folder are skipped with a warning."
]
}
],
"metadata": {
4 changes: 3 additions & 1 deletion src/qcodes/configuration/qcodesrc.json
@@ -79,7 +79,9 @@
"export_chunked_export_of_large_files_enabled": false,
"export_chunked_threshold": 1000,
"in_memory_cache": true,
"load_from_exported_file": false
"load_from_exported_file": false,
"raw_data_to_separate_db": false,
"raw_data_path": "{db_location}"
},
"telemetry":
{
10 changes: 10 additions & 0 deletions src/qcodes/configuration/qcodesrc_schema.json
@@ -382,6 +382,16 @@
"type": "boolean",
"default": true,
"description": "Should the data be cached in memory as it is measured. Useful to disable for large datasets to save on memory consumption."
},
"raw_data_to_separate_db": {
"type": "boolean",
"default": false,
"description": "If true, raw measurement data (results tables) will be written to individual per-dataset SQLite files instead of the main database. Metadata remains in the main database."
},
"raw_data_path": {
"type": "string",
"default": "{db_location}",
"description": "Path to the folder where per-dataset raw data SQLite files are stored. {db_location} is a directory in the same folder as the .db file with a matching name, e.g. for ~/experiments.db raw data files will be stored in ~/experiments_db/"
}
},
"description": "Settings related to the DataSet and Measurement Context manager",
2 changes: 2 additions & 0 deletions src/qcodes/dataset/__init__.py
@@ -3,6 +3,7 @@
and from disk
"""

from ._raw_data_storage import update_raw_data_paths
from .data_set import (
get_guids_by_run_spec,
load_by_counter,
@@ -120,4 +121,5 @@
"plot_dataset",
"reset_default_experiment_id",
"rundescriber_from_json",
"update_raw_data_paths",
]