feat: split raw data storage into per-dataset SQLite files #8219
Draft
astafan8 wants to merge 1 commit into
Add optional configuration to write results-table data into individual
per-dataset SQLite files (<guid>.db) while keeping all metadata in the
main database. This keeps the main DB lightweight as it grows.
Config options (dataset section of qcodesrc.json):
- raw_data_to_separate_db (bool, default false)
- raw_data_path (string, default '{db_location}')
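As a rough sketch, the two options could appear in the dataset section of qcodesrc.json as shown below. The snippet builds the fragment as a Python dict and prints it as JSON. Only the two key names and their defaults come from this PR; the override that switches the feature on is a made-up illustration, and a real qcodesrc.json contains many other keys that are omitted here.

```python
import json

# Option names and defaults as listed in this PR description.
dataset_defaults = {
    "raw_data_to_separate_db": False,  # feature is off by default
    "raw_data_path": "{db_location}",  # default path template
}

# Hypothetical user override that opts in to per-dataset files.
user_override = {"raw_data_to_separate_db": True}

# Later entries win, so the override replaces the default value.
fragment = {"dataset": {**dataset_defaults, **user_override}}
print(json.dumps(fragment, indent=2))
```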
Implementation:
- New module: qcodes.dataset.raw_data_storage (helper functions)
- DataSet._data_conn property routes reads/writes to correct DB
- BackgroundWriter supports per-dataset raw data connections
- Subscriber triggers created on data connection for compatibility
- Per-dataset DB path persisted in run metadata for auto-reconnect
- Empty results table schema kept in main DB for compatibility
- 19 new tests, all existing tests pass unchanged
- Documentation added to dataset intro, design docs, and Database notebook
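The routing and trigger placement listed above can be illustrated with plain sqlite3. This is a simplified sketch, not the QCoDeS implementation: ToyDataSet and create_insert_trigger are hypothetical names, and the trigger only copies values into a log table, whereas the real subscriber mechanism is more involved. It shows that a trigger can only be created on a connection whose database contains the target table.

```python
import sqlite3


class ToyDataSet:
    """Hypothetical stand-in for DataSet; not the QCoDeS class."""

    def __init__(self, conn, raw_data_conn=None):
        self.conn = conn                     # main DB connection
        self._raw_data_conn = raw_data_conn  # per-dataset connection, if any

    @property
    def _data_conn(self):
        # Use the per-dataset DB when one is attached, otherwise the main DB.
        if self._raw_data_conn is not None:
            return self._raw_data_conn
        return self.conn


def create_insert_trigger(conn, table):
    """Create an AFTER INSERT trigger that copies each new x value to a log table."""
    conn.execute("CREATE TABLE IF NOT EXISTS log (x REAL)")
    conn.execute(
        f'CREATE TRIGGER "on_{table}_insert" AFTER INSERT ON "{table}" '
        "BEGIN INSERT INTO log VALUES (new.x); END"
    )


main = sqlite3.connect(":memory:")
raw = sqlite3.connect(":memory:")
raw.execute("CREATE TABLE results (x REAL)")  # only the raw DB has the table

ds = ToyDataSet(main, raw_data_conn=raw)
create_insert_trigger(ds._data_conn, "results")  # succeeds on the data connection
ds._data_conn.execute("INSERT INTO results VALUES (1.5)")
fired = raw.execute("SELECT x FROM log").fetchall()

try:
    create_insert_trigger(main, "results")  # main DB lacks the results table
    main_ok = True
except sqlite3.OperationalError:
    main_ok = False

print(fired, main_ok)
```

In this toy setup the main database has no results table at all, which is why trigger creation fails there; the PR instead keeps an empty results table schema in the main DB.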
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Codecov Report
❌ Patch coverage is
Additional details and impacted files
@@ Coverage Diff @@
## main #8219 +/- ##
==========================================
+ Coverage 71.02% 71.08% +0.06%
==========================================
Files 301 302 +1
Lines 31888 31980 +92
==========================================
+ Hits 22647 22732 +85
- Misses 9241 9248 +7
Summary
This PR implements split raw data storage for QCoDeS: an opt-in feature that writes raw measurement data (results table rows) into individual per-dataset SQLite files while keeping all metadata in the main database. The goal is to prevent the main DB file from growing excessively large as datasets accumulate, making metadata browsing and experiment management faster.
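To make the split concrete, here is a minimal sketch using only the standard library. It is not the code from this PR: the table layouts, the write_split and load_rows helpers, and the main.db file name are assumptions chosen for illustration. The rows go into a <guid>.db file, while the main database records only the GUID and the path to that file.

```python
import sqlite3
import tempfile
import uuid
from pathlib import Path


def write_split(folder: Path, rows: list[tuple[float, float]]) -> tuple[Path, str]:
    """Store metadata in main.db and the rows in a separate <guid>.db file."""
    guid = str(uuid.uuid4())  # stand-in for a QCoDeS GUID
    raw_path = folder / f"{guid}.db"

    # Per-dataset file: holds only the results rows.
    raw = sqlite3.connect(raw_path)
    raw.execute("CREATE TABLE results (x REAL, y REAL)")
    raw.executemany("INSERT INTO results VALUES (?, ?)", rows)
    raw.commit()
    raw.close()

    # Main DB: holds only metadata, including where the raw data lives.
    main = sqlite3.connect(folder / "main.db")
    main.execute(
        "CREATE TABLE IF NOT EXISTS runs "
        "(guid TEXT PRIMARY KEY, raw_data_db_path TEXT)"
    )
    main.execute("INSERT INTO runs VALUES (?, ?)", (guid, str(raw_path)))
    main.commit()
    main.close()
    return raw_path, guid


def load_rows(folder: Path, guid: str) -> list[tuple[float, float]]:
    """Look up the raw data path in the main DB, then read the rows from it."""
    main = sqlite3.connect(folder / "main.db")
    (raw_path,) = main.execute(
        "SELECT raw_data_db_path FROM runs WHERE guid = ?", (guid,)
    ).fetchone()
    main.close()

    raw = sqlite3.connect(raw_path)
    rows = raw.execute("SELECT x, y FROM results ORDER BY rowid").fetchall()
    raw.close()
    return rows


with tempfile.TemporaryDirectory() as tmp:
    folder = Path(tmp)
    raw_path, guid = write_split(folder, [(0.0, 1.0), (0.5, 2.0)])
    print(raw_path.name == f"{guid}.db")
    print(load_rows(folder, guid))
```

The main database stays small in this layout because its size depends on the number of runs, not on the number of measured points.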
Motivation
The main QCoDeS SQLite database stores both metadata (experiments, runs, parameter layouts, dependencies) and raw measurement data (results tables) in a single file. Over time, this file can grow to many gigabytes, slowing down operations that only need metadata. By splitting the raw data into per-dataset files, the main DB stays lightweight while data integrity is preserved.
Design Decisions
Architecture: transparent routing via _data_conn property
- A _data_conn property on DataSet is the routing point for all data read/write operations
- It falls back to self.conn (main DB) when no per-dataset raw data connection is set
- All write paths (add_results, _BackgroundWriter) and read paths (get_parameter_data, DataSetCacheWithDBBackend, number_of_results, __len__) go through this property
Config: follows existing export path pattern
- Config options in the dataset section of qcodesrc.json:
  - raw_data_to_separate_db (bool, default false)
  - raw_data_path (string, default "{db_location}")
- Uses _expand_export_path() from export_config.py for path expansion (e.g., ~/experiments.db → ~/experiments_db/)
- Follows the export_path / export_type config approach
Per-dataset files: lightweight, GUID-named
- Each file is named <guid>.db and contains only the results table + numpy type adapters
- The file path is persisted in run metadata (raw_data_db_path dynamic column) for automatic reconnection on load_by_id()
Empty results table kept in main DB
Kept for compatibility with:
- _Subscriber trigger creation (SQLite triggers require the table to exist)
- __len__ / number_of_results before dataset is started (when raw data DB doesn't yet exist)
- PRAGMA TABLE_INFO queries
- _check_if_table_found logic used in _get_datasetprotocol_from_guid to distinguish DataSet vs DataSetInMem
get_parameter_data bypass for raw data DB
- get_parameter_data() in queries.py calls get_rundescriber_from_result_table_name(), which queries the runs table; this table doesn't exist in the raw data DB
- When _raw_data_conn is set, bypass the top-level function and call get_shaped_parameter_data_for_one_paramtree() directly with the already-held rundescriber
Subscriber triggers on data connection
- _Subscriber.__init__ creates SQL triggers for real-time data callbacks
- Triggers are created on _data_conn instead of self.conn so they fire on the correct DB where data is actually inserted
BackgroundWriter support
- _BackgroundWriter maintains a _raw_data_conns dict keyed by file path
- A raw_data_path key is used for routing
Files Changed
- src/qcodes/dataset/raw_data_storage.py: is_raw_data_storage_enabled(), get_raw_data_folder(), get_raw_data_db_path(), connect_to_raw_data_db(), create_raw_data_db()
- tests/dataset/test_raw_data_storage.py
- src/qcodes/dataset/data_set.py: _data_conn property, _raw_data_conn attribute, routing in __init__, _perform_start_actions, add_results, get_parameter_data, number_of_results, __len__, _BackgroundWriter, _get_datasetprotocol_from_guid
- src/qcodes/dataset/data_set_cache.py: load_data_from_db() uses _data_conn
- src/qcodes/dataset/subscriber.py: triggers created on _data_conn
- src/qcodes/configuration/qcodesrc.json
- src/qcodes/configuration/qcodesrc_schema.json
- docs/dataset/introduction.rst
- docs/dataset/dataset_design.rst
- docs/examples/DataSet/Database.ipynb
Verification
Own tests: 19/19 pass
- Covered scenarios include load_by_id, multiple datasets, and the background writer
Full test suite with feature globally enabled: 1031 passed, 5 failed, 35 skipped
The 5 failures are all expected/explained:
- test_raw_data_conn_is_none
- test_data_in_main_db
- test_get_parameter_data: calls queries.get_parameter_data(ds.conn, ...) directly, bypassing DataSet
- test_get_parameter_data_independent_parameters
- test_get_run_attributes: expects {'foo': 'bar'} but split adds raw_data_db_path
Full test suite with feature disabled (default): all pass unchanged
Code quality