feat: split raw data storage into per-dataset SQLite files by astafan8 · Pull Request #8219 · microsoft/Qcodes

astafan8 · 2026-06-12T15:50:23Z

Summary

This PR implements split raw data storage for QCoDeS: an opt-in feature that writes raw measurement data (results table rows) into individual per-dataset SQLite files while keeping all metadata in the main database. The goal is to prevent the main DB file from growing excessively large as datasets accumulate, making metadata browsing and experiment management faster.

Motivation

The main QCoDeS SQLite database stores both metadata (experiments, runs, parameter layouts, dependencies) and raw measurement data (results tables) in a single file. Over time, this file can grow to many gigabytes, slowing down operations that only need metadata. By splitting the raw data into per-dataset files, the main DB stays lightweight while data integrity is preserved.

Design Decisions

Architecture: transparent routing via `_data_conn` property

- A single `_data_conn` property on `DataSet` is the routing point for all data read/write operations.
- It returns the per-dataset raw data connection when split storage is enabled and otherwise falls back to `self.conn` (main DB).
- All write paths (`add_results`, `_BackgroundWriter`) and read paths (`get_parameter_data`, `DataSetCacheWithDBBackend`, `number_of_results`, `__len__`) go through this property.
- Zero changes to the public `DataSet` API: all existing methods work identically.
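
The routing idea can be illustrated with a minimal, self-contained sketch. `RoutedDataSet` and `make_db` are hypothetical stand-ins written for this illustration, not the QCoDeS `DataSet` class; only the attribute names `conn`, `_raw_data_conn` and `_data_conn` are taken from the description above.

```python
import sqlite3


class RoutedDataSet:
    """Toy stand-in for DataSet (hypothetical, not the QCoDeS class)."""

    def __init__(self, conn, raw_data_conn=None):
        self.conn = conn                     # main DB connection (metadata)
        self._raw_data_conn = raw_data_conn  # per-dataset connection, or None

    @property
    def _data_conn(self):
        # Single routing point: use the raw data DB if present, else the main DB.
        if self._raw_data_conn is not None:
            return self._raw_data_conn
        return self.conn

    def add_result(self, x, y):
        with self._data_conn as c:  # commits on success
            c.execute("INSERT INTO results (x, y) VALUES (?, ?)", (x, y))

    def __len__(self):
        cur = self._data_conn.execute("SELECT COUNT(*) FROM results")
        return cur.fetchone()[0]


def make_db():
    """Create an in-memory DB with a simplified results table."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE results (x REAL, y REAL)")
    return conn


main_db, raw_db = make_db(), make_db()
split = RoutedDataSet(main_db, raw_db)  # split storage enabled
split.add_result(0.1, 1.0)

plain = RoutedDataSet(make_db())        # split storage disabled
plain.add_result(0.2, 2.0)

main_rows = main_db.execute("SELECT COUNT(*) FROM results").fetchone()[0]
```

In the split case the row lands in `raw_db` and the main DB's table stays empty; in the plain case the same methods write to the main DB.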

Config: follows existing export path pattern

Two new config options in the `dataset` section of `qcodesrc.json`:
- `raw_data_to_separate_db` (bool, default `false`)
- `raw_data_path` (string, default `"{db_location}"`)

Path expansion reuses `_expand_export_path()` from `export_config.py` (e.g., `~/experiments.db` → `~/experiments_db/`). The pattern mirrors the existing `export_path` / `export_type` config approach.
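
As a rough illustration of the `{db_location}` expansion, the sketch below reproduces only the example mapping given above (`experiments.db` → `experiments_db`). `expand_raw_data_path` is a hypothetical helper written for this example; it is not `_expand_export_path()` and may differ from it in detail. The `config` dict is a plain Python dict that mirrors the two new options, not the QCoDeS config object.

```python
from pathlib import Path


def expand_raw_data_path(template: str, db_location: str) -> Path:
    """Hypothetical helper: expand {db_location} into a folder next to the DB.

    The folder name is the DB file name with dots replaced by underscores,
    matching the example experiments.db -> experiments_db.
    """
    db_path = Path(db_location).expanduser()
    folder = db_path.with_name(db_path.name.replace(".", "_"))
    return Path(template.replace("{db_location}", str(folder))).expanduser()


# Plain dict mirroring the new options in the dataset section (illustrative).
config = {"raw_data_to_separate_db": True, "raw_data_path": "{db_location}"}

folder = expand_raw_data_path(config["raw_data_path"], "/data/experiments.db")
```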

Per-dataset files: lightweight, GUID-named

- Each file is named `<guid>.db` and contains only the results table plus the numpy type adapters.
- Per-dataset files carry no QCoDeS metadata schema; they are minimal.
- The path to the per-dataset file is persisted in run metadata (`raw_data_db_path` dynamic column) for automatic reconnection on `load_by_id()`.
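
A minimal sketch of what such a file could look like, using only the standard library. `create_raw_data_file`, the table name `results`, the column set and the GUID are all made up for this illustration; the actual helper in the PR is `create_raw_data_db()`, whose signature is not shown here. The numpy adapters are omitted.

```python
import sqlite3
import tempfile
from pathlib import Path


def create_raw_data_file(folder: Path, guid: str, columns: dict[str, str]) -> Path:
    """Hypothetical helper: create <guid>.db holding only a results table."""
    folder.mkdir(parents=True, exist_ok=True)
    db_path = folder / f"{guid}.db"
    col_sql = ", ".join(f'"{name}" {sql_type}' for name, sql_type in columns.items())
    conn = sqlite3.connect(db_path)
    try:
        with conn:
            conn.execute(
                f'CREATE TABLE IF NOT EXISTS "results" (id INTEGER PRIMARY KEY, {col_sql})'
            )
    finally:
        conn.close()
    return db_path


tmp = tempfile.TemporaryDirectory()
guid = "00000000-0000-0000-0000-000000000001"  # made-up GUID
raw_path = create_raw_data_file(
    Path(tmp.name) / "experiments_db", guid, {"x": "REAL", "y": "REAL"}
)

# The path is stored with the run so a later load can reconnect to the file.
run_metadata = {"raw_data_db_path": str(raw_path)}

reopened = sqlite3.connect(run_metadata["raw_data_db_path"])
tables = [
    row[0]
    for row in reopened.execute("SELECT name FROM sqlite_master WHERE type='table'")
]
reopened.close()
```

The reopened file contains a single table and no `runs` or `experiments` tables, which is the "minimal" property described above.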

Empty results table kept in main DB

We considered removing the results table from the main DB entirely, but this would break:
- _Subscriber trigger creation (SQLite triggers require the table to exist)
- __len__ / number_of_results before dataset is started (when raw data DB doesn't yet exist)
- Low-level query functions that inspect table structure via PRAGMA TABLE_INFO
- The _check_if_table_found logic used in _get_datasetprotocol_from_guid to distinguish DataSet vs DataSetInMem
Decision: keep the empty table schema (column definitions, no rows) in the main DB. The overhead is negligible and backward compatibility is fully preserved.
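
Two of the points above can be checked directly with the standard `sqlite3` module. The sketch uses a simplified table named `results`; the real QCoDeS table names and columns differ.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE results (id INTEGER PRIMARY KEY, x REAL, y REAL)")

# An empty table still exposes its column layout and a row count of zero.
columns = [row[1] for row in conn.execute("PRAGMA table_info(results)")]
n_rows = conn.execute("SELECT COUNT(*) FROM results").fetchone()[0]

# Trigger creation succeeds while the table exists.
conn.execute("CREATE TRIGGER on_insert AFTER INSERT ON results BEGIN SELECT 1; END")

# Once the table is gone, structure inspection returns nothing
# and trigger creation fails.
conn.execute("DROP TABLE results")
missing_columns = list(conn.execute("PRAGMA table_info(results)"))
try:
    conn.execute("CREATE TRIGGER on_insert2 AFTER INSERT ON results BEGIN SELECT 1; END")
    trigger_error = None
except sqlite3.OperationalError as exc:
    trigger_error = str(exc)
```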

`get_parameter_data` bypass for raw data DB

- The standard `get_parameter_data()` in `queries.py` calls `get_rundescriber_from_result_table_name()`, which queries the `runs` table. That table does not exist in the raw data DB.
- Solution: when `_raw_data_conn` is set, bypass the top-level function and call `get_shaped_parameter_data_for_one_paramtree()` directly with the already-held rundescriber.
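
The shape of the problem and of the bypass can be sketched with a toy example. `lookup_description` and `read_param_tree` are hypothetical stand-ins for the two QCoDeS functions named above and do not reproduce their signatures; `held_description` stands in for the rundescriber the dataset already holds.

```python
import sqlite3

# Raw data DB: only a results table, no runs table.
raw = sqlite3.connect(":memory:")
raw.execute("CREATE TABLE results (x REAL, y REAL)")
raw.executemany("INSERT INTO results VALUES (?, ?)", [(0.0, 1.0), (0.5, 2.0)])


def lookup_description(conn, table_name):
    """Stand-in for the metadata lookup; it needs the runs table."""
    row = conn.execute(
        "SELECT run_description FROM runs WHERE result_table_name = ?",
        (table_name,),
    ).fetchone()
    return row[0]


def read_param_tree(conn, table_name, columns):
    """Stand-in for the per-tree reader; it needs only the results table."""
    cols = ", ".join(f'"{c}"' for c in columns)
    rows = conn.execute(f'SELECT {cols} FROM "{table_name}"').fetchall()
    return {c: [r[i] for r in rows] for i, c in enumerate(columns)}


# Description already held in memory: dependent parameter -> columns of its tree.
held_description = {"y": ("y", "x")}

try:
    lookup_description(raw, "results")
    lookup_failed = False
except sqlite3.OperationalError:  # no such table: runs
    lookup_failed = True

# Bypass: skip the lookup and read each tree with the held description.
data = {
    dep: read_param_tree(raw, "results", tree)
    for dep, tree in held_description.items()
}
```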

Subscriber triggers on data connection

- `_Subscriber.__init__` creates SQL triggers for real-time data callbacks.
- It now uses `_data_conn` instead of `self.conn`, so triggers fire on the DB where data is actually inserted.
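
Why the trigger has to live on the data connection can be seen in a small standard-library sketch: a trigger fires only in the database that receives the insert. `subscribe` and `make_db` are hypothetical helpers for this illustration, not the `_Subscriber` implementation.

```python
import sqlite3


def make_db():
    """Create an in-memory DB with a simplified results table."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE results (x REAL, y REAL)")
    return conn


def subscribe(conn, received):
    """Register a Python callback and an AFTER INSERT trigger that calls it."""

    def notify(*values):
        received.append(values)

    conn.create_function("notify", -1, notify)  # -1: any number of arguments
    conn.execute(
        "CREATE TRIGGER sub AFTER INSERT ON results "
        "BEGIN SELECT notify(NEW.x, NEW.y); END"
    )


main_db, raw_db = make_db(), make_db()
seen_on_main, seen_on_raw = [], []
subscribe(main_db, seen_on_main)
subscribe(raw_db, seen_on_raw)

# With split storage the row is inserted into the raw data DB only.
raw_db.execute("INSERT INTO results VALUES (?, ?)", (0.1, 1.0))
```

Only the callback registered on `raw_db` receives the row; a trigger created on the main DB would never fire.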

BackgroundWriter support

- `_BackgroundWriter` maintains a `_raw_data_conns` dict keyed by file path.
- Queue items include an optional `raw_data_path` key for routing.
- Connections are created lazily and reused across datasets that share the same raw data DB path.
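
The connection cache can be sketched as follows. `RawConnCache` is a hypothetical, single-threaded stand-in; the real `_BackgroundWriter` runs in its own thread and consumes a queue, which is left out here. Only the names `_raw_data_conns` and `raw_data_path` come from the description above; the `row` key and table layout are made up.

```python
import os
import sqlite3
import tempfile


class RawConnCache:
    """Toy writer: routes items to lazily created per-path connections."""

    def __init__(self, main_conn):
        self.main_conn = main_conn
        self._raw_data_conns = {}  # file path -> sqlite3.Connection

    def conn_for(self, item):
        path = item.get("raw_data_path")  # optional routing key
        if path is None:
            return self.main_conn
        if path not in self._raw_data_conns:
            # Lazily open the connection the first time the path is seen.
            self._raw_data_conns[path] = sqlite3.connect(path)
        return self._raw_data_conns[path]

    def write(self, item):
        conn = self.conn_for(item)
        with conn:  # commits on success
            conn.execute("CREATE TABLE IF NOT EXISTS results (x REAL, y REAL)")
            conn.execute("INSERT INTO results VALUES (?, ?)", item["row"])

    def close(self):
        for conn in self._raw_data_conns.values():
            conn.close()
        self._raw_data_conns.clear()


tmp = tempfile.TemporaryDirectory()
raw_path = os.path.join(tmp.name, "made-up-guid.db")

writer = RawConnCache(sqlite3.connect(":memory:"))
writer.write({"row": (0.0, 1.0)})                             # main DB
writer.write({"row": (0.1, 1.1), "raw_data_path": raw_path})  # opens connection
writer.write({"row": (0.2, 1.2), "raw_data_path": raw_path})  # reuses it

n_cached = len(writer._raw_data_conns)
raw_conn = writer.conn_for({"raw_data_path": raw_path})
n_raw_rows = raw_conn.execute("SELECT COUNT(*) FROM results").fetchone()[0]
n_main_rows = writer.main_conn.execute("SELECT COUNT(*) FROM results").fetchone()[0]
writer.close()
```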

Files Changed

| File | Type | Description |
| --- | --- | --- |
| `src/qcodes/dataset/raw_data_storage.py` | New | Helper module: `is_raw_data_storage_enabled()`, `get_raw_data_folder()`, `get_raw_data_db_path()`, `connect_to_raw_data_db()`, `create_raw_data_db()` |
| `tests/dataset/test_raw_data_storage.py` | New | 19 tests (7 unit + 10 integration + 2 non-interference) |
| `src/qcodes/dataset/data_set.py` | Modified | `_data_conn` property, `_raw_data_conn` attribute, routing in `__init__`, `_perform_start_actions`, `add_results`, `get_parameter_data`, `number_of_results`, `__len__`, `_BackgroundWriter`, `_get_datasetprotocol_from_guid` |
| `src/qcodes/dataset/data_set_cache.py` | Modified | `load_data_from_db()` uses `_data_conn` |
| `src/qcodes/dataset/subscriber.py` | Modified | Trigger creation uses `_data_conn` |
| `src/qcodes/configuration/qcodesrc.json` | Modified | Added config defaults |
| `src/qcodes/configuration/qcodesrc_schema.json` | Modified | Added config schema |
| `docs/dataset/introduction.rst` | Modified | New "Split Raw Data Storage" section |
| `docs/dataset/dataset_design.rst` | Modified | Design notes on split storage |
| `docs/examples/DataSet/Database.ipynb` | Modified | Config and usage documentation |

Verification

Own tests: 19/19 pass

- Unit tests for all helper functions (config reading, path generation, DB creation)
- Integration tests: write/read round-trip, cache, `load_by_id`, multiple datasets, background writer
- Non-interference: with the feature disabled, data goes to the main DB as before

Full test suite with feature globally enabled: 1031 passed, 5 failed, 35 skipped

The 5 failures are all expected/explained:

| Test | Reason |
| --- | --- |
| `test_raw_data_conn_is_none` | Own test for disabled mode; overridden by the global enable |
| `test_data_in_main_db` | Own test for disabled mode; overridden by the global enable |
| `test_get_parameter_data` | Low-level query test calls `queries.get_parameter_data(ds.conn, ...)` directly, bypassing `DataSet` |
| `test_get_parameter_data_independent_parameters` | Same as above |
| `test_get_run_attributes` | Metadata assertion expects exactly `{'foo': 'bar'}`, but split storage adds `raw_data_db_path` |

Full test suite with feature disabled (default): all pass unchanged

Code quality

Ruff lint: ✅ all checks passed
Pyright type check: ✅ 0 errors, 0 warnings

Add optional configuration to write results-table data into individual per-dataset SQLite files (`<guid>.db`) while keeping all metadata in the main database. This keeps the main DB lightweight as it grows.

Config options (dataset section of qcodesrc.json):
- raw_data_to_separate_db (bool, default false)
- raw_data_path (string, default '{db_location}')

Implementation:
- New module: qcodes.dataset.raw_data_storage (helper functions)
- DataSet._data_conn property routes reads/writes to correct DB
- BackgroundWriter supports per-dataset raw data connections
- Subscriber triggers created on data connection for compatibility
- Per-dataset DB path persisted in run metadata for auto-reconnect
- Empty results table schema kept in main DB for compatibility
- 19 new tests, all existing tests pass unchanged
- Documentation added to dataset intro, design docs, and Database notebook

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

codecov · 2026-06-12T15:58:10Z

Codecov Report

❌ Patch coverage is 93.00000% with 7 lines in your changes missing coverage. Please review.
✅ Project coverage is 71.08%. Comparing base (c81021e) to head (efca7d8).

| Files with missing lines | Patch % | Lines |
| --- | --- | --- |
| src/qcodes/dataset/data_set.py | 86.95% | 6 Missing ⚠️ |
| src/qcodes/dataset/raw_data_storage.py | 98.07% | 1 Missing ⚠️ |

Additional details and impacted files

@@            Coverage Diff             @@
##             main    #8219      +/-   ##
==========================================
+ Coverage   71.02%   71.08%   +0.06%     
==========================================
  Files         301      302       +1     
  Lines       31888    31980      +92     
==========================================
+ Hits        22647    22732      +85     
- Misses       9241     9248       +7


astafan8 wants to merge 1 commit into microsoft:main from astafan8:feature/split-raw-data-sqlite
