feat: split raw data storage into per-dataset SQLite files #8219

Draft
astafan8 wants to merge 1 commit into microsoft:main from astafan8:feature/split-raw-data-sqlite

Summary

This PR implements split raw data storage for QCoDeS: an opt-in feature that writes raw measurement data (results table rows) into individual per-dataset SQLite files while keeping all metadata in the main database. The goal is to prevent the main DB file from growing excessively large as datasets accumulate, making metadata browsing and experiment management faster.

Motivation

The main QCoDeS SQLite database stores both metadata (experiments, runs, parameter layouts, dependencies) and raw measurement data (results tables) in a single file. Over time, this file can grow to many gigabytes, slowing down operations that only need metadata. By splitting the raw data into per-dataset files, the main DB stays lightweight while data integrity is preserved.

Design Decisions

Architecture: transparent routing via _data_conn property

  • A single _data_conn property on DataSet is the routing point for all data read/write operations
  • Returns the per-dataset raw data connection when split is enabled, otherwise falls back to self.conn (main DB)
  • All write paths (add_results, _BackgroundWriter) and read paths (get_parameter_data, DataSetCacheWithDBBackend, number_of_results, __len__) go through this property
  • Zero changes to public DataSet API — all existing methods work identically
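
The routing idea above can be sketched with the standard-library sqlite3 module. This is only an illustration, not the code in this PR: RoutedDataSet, make_db, and the two-column results table are hypothetical stand-ins, while conn, _raw_data_conn, and _data_conn are the names used in the description.

```python
import sqlite3


class RoutedDataSet:
    """Toy stand-in for DataSet; only the connection routing is modelled."""

    def __init__(self, conn, raw_data_conn=None):
        self.conn = conn                     # main DB connection (metadata)
        self._raw_data_conn = raw_data_conn  # per-dataset DB, or None when split is off

    @property
    def _data_conn(self):
        # Single routing point: prefer the raw data DB, else fall back to main.
        if self._raw_data_conn is not None:
            return self._raw_data_conn
        return self.conn

    def add_result(self, x, y):
        # Write path goes through the routing property.
        with self._data_conn as c:
            c.execute("INSERT INTO results (x, y) VALUES (?, ?)", (x, y))

    def __len__(self):
        # Read path goes through the same property.
        return self._data_conn.execute("SELECT COUNT(*) FROM results").fetchone()[0]


def make_db():
    """Hypothetical helper: in-memory DB with an empty results table."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE results (x REAL, y REAL)")
    return conn


# Split enabled: rows land in the raw data DB, the main DB stays empty.
main, raw = make_db(), make_db()
split = RoutedDataSet(main, raw)
split.add_result(1.0, 2.0)

# Split disabled: the same calls fall back to the main DB.
plain_main = make_db()
plain = RoutedDataSet(plain_main)
plain.add_result(3.0, 4.0)
```

Because callers only ever touch add_result and __len__, neither needs to know which file the rows live in, which is the property that keeps the public API unchanged.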

Config: follows existing export path pattern

  • Two new config options in dataset section of qcodesrc.json:
    • raw_data_to_separate_db (bool, default false)
    • raw_data_path (string, default "{db_location}")
  • Reuses _expand_export_path() from export_config.py for path expansion (e.g., ~/experiments.db → ~/experiments_db/)
  • Pattern mirrors the existing export_path / export_type config approach
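
As a rough sketch of what the "{db_location}" placeholder might expand to, assuming the folder name is derived from the database file name as in the example above: expand_raw_data_path is a hypothetical stand-in written for this illustration, not the real _expand_export_path(), whose signature is not shown in this PR.

```python
from pathlib import Path


def expand_raw_data_path(template: str, db_location: str) -> Path:
    """Hypothetical stand-in for the path expansion described above.

    Replaces "{db_location}" with a folder derived from the main DB file
    name (dots become underscores, so experiments.db -> experiments_db).
    """
    db_path = Path(db_location).expanduser()
    folder = db_path.with_name(db_path.name.replace(".", "_"))
    return Path(template.replace("{db_location}", str(folder))).expanduser()


# Default template: raw data lives in a folder next to the main DB.
default_folder = expand_raw_data_path("{db_location}", "/data/experiments.db")

# The placeholder can also be combined with a fixed sub-folder,
# or dropped entirely in favour of an explicit location.
nested_folder = expand_raw_data_path("{db_location}/raw", "/data/experiments.db")
fixed_folder = expand_raw_data_path("/scratch/raw", "/data/experiments.db")
```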

Per-dataset files: lightweight, GUID-named

  • Each file is named <guid>.db and contains only the results table + numpy type adapters
  • No QCoDeS metadata schema in per-dataset files — they are minimal
  • Path to per-dataset file is persisted in run metadata (raw_data_db_path dynamic column) for automatic reconnection on load_by_id()
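
A minimal sketch of such a per-dataset file, assuming a plain sqlite3 database that holds nothing but one results table. create_per_dataset_file, the table name "results", the column types, and the placeholder GUID are all assumptions made for illustration; the numpy type adapters mentioned above are left out.

```python
import sqlite3
import tempfile
from pathlib import Path


def create_per_dataset_file(folder: Path, guid: str, columns: dict[str, str]) -> Path:
    """Hypothetical helper: create <guid>.db with a single results table."""
    folder.mkdir(parents=True, exist_ok=True)
    db_path = folder / f"{guid}.db"
    column_sql = ", ".join(f'"{name}" {sql_type}' for name, sql_type in columns.items())
    conn = sqlite3.connect(db_path)
    try:
        with conn:  # commit on success
            conn.execute(f'CREATE TABLE "results" ({column_sql})')
    finally:
        conn.close()
    return db_path


# Placeholder GUID, not a real QCoDeS GUID.
example_guid = "00000000-0000-0000-0000-000000000000"
raw_folder = Path(tempfile.mkdtemp()) / "experiments_db"
raw_db_path = create_per_dataset_file(raw_folder, example_guid, {"x": "REAL", "y": "REAL"})

# The file contains the results table and nothing else.
check = sqlite3.connect(raw_db_path)
tables = [row[0] for row in check.execute("SELECT name FROM sqlite_master WHERE type = 'table'")]
check.close()
```

Storing the returned path alongside the run's metadata is what would let a later load find the file again without guessing its location.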

Empty results table kept in main DB

  • We considered removing the results table from the main DB entirely, but this would break:
    • _Subscriber trigger creation (SQLite triggers require the table to exist)
    • __len__ / number_of_results before dataset is started (when raw data DB doesn't yet exist)
    • Low-level query functions that inspect table structure via PRAGMA TABLE_INFO
    • The _check_if_table_found logic used in _get_datasetprotocol_from_guid to distinguish DataSet vs DataSetInMem
  • Decision: keep the empty table schema (column definitions, no rows) — negligible overhead, full backward compatibility
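
The SQLite behaviour behind this decision can be checked with a small standalone sketch. The table name "results" and its columns are placeholders; the point is only that schema inspection, row counting, and trigger creation all succeed on an empty table and fail or come back empty when the table is missing.

```python
import sqlite3

# Main DB with the empty results table kept (schema only, no rows).
main = sqlite3.connect(":memory:")
main.execute('CREATE TABLE "results" (x REAL, y REAL)')

# Schema inspection and counting work on the empty table.
columns = [row[1] for row in main.execute('PRAGMA table_info("results")')]
row_count = main.execute('SELECT COUNT(*) FROM "results"').fetchone()[0]

# Trigger creation succeeds because the table exists.
main.execute('CREATE TRIGGER on_insert AFTER INSERT ON "results" BEGIN SELECT 1; END')

# Same steps against a DB where the table was removed entirely.
bare = sqlite3.connect(":memory:")
missing_columns = list(bare.execute('PRAGMA table_info("results")'))  # no rows returned
try:
    bare.execute('CREATE TRIGGER on_insert AFTER INSERT ON "results" BEGIN SELECT 1; END')
    trigger_created = True
except sqlite3.OperationalError:
    # SQLite refuses to create a trigger on a table that does not exist.
    trigger_created = False
```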

get_parameter_data bypass for raw data DB

  • The standard get_parameter_data() in queries.py calls get_rundescriber_from_result_table_name() which queries the runs table — this table doesn't exist in the raw data DB
  • Solution: when _raw_data_conn is set, bypass the top-level function and call get_shaped_parameter_data_for_one_paramtree() directly with the already-held rundescriber

Subscriber triggers on data connection

  • _Subscriber.__init__ creates SQL triggers for real-time data callbacks
  • Changed to use _data_conn instead of self.conn so triggers fire on the correct DB where data is actually inserted
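
Why the trigger has to live on the data connection can be seen in a small sqlite3 sketch. It assumes the callback is exposed to SQL through Connection.create_function and invoked from an AFTER INSERT trigger; subscribe, make_db, and the table layout are hypothetical and only illustrate that a trigger fires solely in the database that receives the insert.

```python
import sqlite3


def make_db():
    """Hypothetical helper: in-memory DB with an empty results table."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE results (x REAL, y REAL)")
    return conn


def subscribe(conn, name, sink):
    """Register a Python callback and a trigger that calls it on every insert."""
    conn.create_function(name, 2, lambda x, y: sink.append((x, y)))
    conn.execute(
        f"CREATE TRIGGER sub_{name} AFTER INSERT ON results "
        f"BEGIN SELECT {name}(NEW.x, NEW.y); END"
    )


main_db, raw_db = make_db(), make_db()
seen_on_main, seen_on_raw = [], []
subscribe(main_db, "notify_main", seen_on_main)
subscribe(raw_db, "notify_raw", seen_on_raw)

# With split storage the rows are inserted into the raw data DB only,
# so only the trigger created there ever fires.
with raw_db:
    raw_db.execute("INSERT INTO results VALUES (?, ?)", (1.0, 2.0))
```

A subscriber attached to the main connection would stay silent here, which is the failure mode the change avoids.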

BackgroundWriter support

  • _BackgroundWriter maintains a _raw_data_conns dict keyed by file path
  • Queue items include optional raw_data_path key for routing
  • Connections are lazily created and reused across datasets sharing the same raw data DB path
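
The connection handling described above might look roughly like the following. This sketch processes items synchronously and omits the queue and thread; WriterConnections, the "rows" key, and the table layout are assumptions, while _raw_data_conns and the optional raw_data_path key come from the description.

```python
import sqlite3
import tempfile
from pathlib import Path


class WriterConnections:
    """Toy stand-in for the writer's connection bookkeeping."""

    def __init__(self, main_conn):
        self.main_conn = main_conn
        self._raw_data_conns = {}  # file path -> sqlite3.Connection

    def conn_for(self, item):
        path = item.get("raw_data_path")
        if path is None:
            return self.main_conn  # split disabled for this item
        if path not in self._raw_data_conns:
            # Lazily open the connection the first time a path is seen.
            self._raw_data_conns[path] = sqlite3.connect(path)
        return self._raw_data_conns[path]

    def write(self, item):
        conn = self.conn_for(item)
        with conn:
            conn.execute("CREATE TABLE IF NOT EXISTS results (x REAL, y REAL)")
            conn.executemany("INSERT INTO results VALUES (?, ?)", item["rows"])

    def close(self):
        for conn in self._raw_data_conns.values():
            conn.close()
        self._raw_data_conns.clear()


raw_path = str(Path(tempfile.mkdtemp()) / "dataset.db")
main_conn = sqlite3.connect(":memory:")
writer = WriterConnections(main_conn)

# Two items share a raw data path (one connection is reused); one has none.
writer.write({"rows": [(1.0, 2.0)], "raw_data_path": raw_path})
writer.write({"rows": [(3.0, 4.0)], "raw_data_path": raw_path})
writer.write({"rows": [(5.0, 6.0)]})

open_connections = len(writer._raw_data_conns)
raw_rows = writer.conn_for({"raw_data_path": raw_path}).execute(
    "SELECT COUNT(*) FROM results"
).fetchone()[0]
main_rows = main_conn.execute("SELECT COUNT(*) FROM results").fetchone()[0]
```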

Files Changed

| File | Type | Description |
|------|------|-------------|
| src/qcodes/dataset/raw_data_storage.py | New | Helper module: is_raw_data_storage_enabled(), get_raw_data_folder(), get_raw_data_db_path(), connect_to_raw_data_db(), create_raw_data_db() |
| tests/dataset/test_raw_data_storage.py | New | 19 tests (7 unit + 10 integration + 2 non-interference) |
| src/qcodes/dataset/data_set.py | Modified | _data_conn property, _raw_data_conn attribute, routing in __init__, _perform_start_actions, add_results, get_parameter_data, number_of_results, __len__, _BackgroundWriter, _get_datasetprotocol_from_guid |
| src/qcodes/dataset/data_set_cache.py | Modified | load_data_from_db() uses _data_conn |
| src/qcodes/dataset/subscriber.py | Modified | Trigger creation uses _data_conn |
| src/qcodes/configuration/qcodesrc.json | Modified | Added config defaults |
| src/qcodes/configuration/qcodesrc_schema.json | Modified | Added config schema |
| docs/dataset/introduction.rst | Modified | New "Split Raw Data Storage" section |
| docs/dataset/dataset_design.rst | Modified | Design notes on split storage |
| docs/examples/DataSet/Database.ipynb | Modified | Config and usage documentation |

Verification

Own tests: 19/19 pass

  • Unit tests for all helper functions (config reading, path generation, DB creation)
  • Integration tests: write/read round-trip, cache, load_by_id, multiple datasets, background writer
  • Non-interference: feature disabled → data goes to main DB as before

Full test suite with feature globally enabled: 1031 passed, 5 failed, 35 skipped

The 5 failures are all expected/explained:

| Test | Reason |
|------|--------|
| test_raw_data_conn_is_none | Own test for disabled mode; overridden by global enable |
| test_data_in_main_db | Own test for disabled mode; overridden by global enable |
| test_get_parameter_data | Low-level query test calls queries.get_parameter_data(ds.conn, ...) directly, bypassing DataSet |
| test_get_parameter_data_independent_parameters | Same as above |
| test_get_run_attributes | Metadata assertion expects exact {'foo': 'bar'} but split adds raw_data_db_path |

Full test suite with feature disabled (default): all pass unchanged

Code quality

  • Ruff lint: ✅ all checks passed
  • Pyright type check: ✅ 0 errors, 0 warnings

Add optional configuration to write results-table data into individual
per-dataset SQLite files (<guid>.db) while keeping all metadata in the
main database. This keeps the main DB lightweight as it grows.

Config options (dataset section of qcodesrc.json):
  - raw_data_to_separate_db (bool, default false)
  - raw_data_path (string, default '{db_location}')

Implementation:
  - New module: qcodes.dataset.raw_data_storage (helper functions)
  - DataSet._data_conn property routes reads/writes to correct DB
  - BackgroundWriter supports per-dataset raw data connections
  - Subscriber triggers created on data connection for compatibility
  - Per-dataset DB path persisted in run metadata for auto-reconnect
  - Empty results table schema kept in main DB for compatibility
  - 19 new tests, all existing tests pass unchanged
  - Documentation added to dataset intro, design docs, and Database notebook

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

codecov Bot commented Jun 12, 2026

Codecov Report

❌ Patch coverage is 93.00000% with 7 lines in your changes missing coverage. Please review.
✅ Project coverage is 71.08%. Comparing base (c81021e) to head (efca7d8).

| Files with missing lines | Patch % | Lines |
|--------------------------|---------|-------|
| src/qcodes/dataset/data_set.py | 86.95% | 6 Missing ⚠️ |
| src/qcodes/dataset/raw_data_storage.py | 98.07% | 1 Missing ⚠️ |
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #8219      +/-   ##
==========================================
+ Coverage   71.02%   71.08%   +0.06%     
==========================================
  Files         301      302       +1     
  Lines       31888    31980      +92     
==========================================
+ Hits        22647    22732      +85     
- Misses       9241     9248       +7     
