USE 306 - refactor class relationships #181

Conversation
Why these changes are being introduced:

This refactoring work was a long time coming, inspired by a recent need to gracefully handle a read request for embeddings against a dataset without embeddings parquet files. If we can normalize how and when tables are created, and how DuckDB connections are handled, we can normalize handling read requests for data that may not be available (yet). As such, this refactoring work will help normalize read edge cases now and going forward.

This library was built in stages. First was `TIMDEXDataset`, which read parquet files directly. Then came `TIMDEXDatasetMetadata`, which more formally introduced DuckDB: it handled connection creation and configuration, and this connection was shared with `TIMDEXDataset` as we leaned into DuckDB reading. Lastly, `TIMDEXEmbeddings` was added as our first new "source" of data, sharing the connection from `TIMDEXDataset`. Both `TIMDEXDatasetMetadata` and `TIMDEXEmbeddings` were doing their own SQLAlchemy table reflections. `TIMDEXDatasetMetadata` could be instantiated on its own, while `TIMDEXEmbeddings` was assumed to take an instance of `TIMDEXDataset`.

At this point, while things worked, it was clear that a refactor would be beneficial. We needed clearer responsibility for what created and configured the DuckDB connection, to solidify that `TIMDEXDatasetMetadata` and `TIMDEXEmbeddings` are components on `TIMDEXDataset`, and to clarify how and when SQLAlchemy reflection is performed. Aligning all these things will make responding to these read and write edge cases much easier.

How this addresses that need:

- A new factory class, `DuckDBConnectionFactory`, is responsible for creating and configuring any DuckDB connections used.
- Both `TIMDEXDatasetMetadata` and `TIMDEXEmbeddings` require a `TIMDEXDataset` instance, and then themselves become components on `TIMDEXDataset`. We can then more accurately call them "components" of the primary `TIMDEXDataset`.
- `TIMDEXDataset` handles the creation of a DuckDB connection via the new factory, and this connection is then accessible to its components `TIMDEXDatasetMetadata` and `TIMDEXEmbeddings` (maybe more in the future).
- `TIMDEXDataset` is also responsible for all SQLAlchemy reflection, saving to `self.sa_tables`. In this way, any component that wants a SQLAlchemy table instance, e.g. for reading, can get it from `self.timdex_dataset.get_sa_table(<schema>, <table>)`.
- Refreshing of classes is greatly simplified: `TIMDEXDataset` is the true orchestrator now, so a full re-init satisfies this. Components no longer have their own `.refresh()` methods.
- Where possible, all tests are updated to use components like `TIMDEXEmbeddings` as part of a `TIMDEXDataset` instance, not as a lone class instance.

Side effects of this change:

- It is not recommended to use `TIMDEXDatasetMetadata` or `TIMDEXEmbeddings` by themselves; they are meant as components on a `TIMDEXDataset`.

Relevant ticket(s):

- https://mitlibraries.atlassian.net/browse/USE-306
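Schematically, the new relationships described above can be sketched with stand-in classes. The class names mirror the PR; the bodies are illustrative stubs (a dict in place of a real `DuckDBPyConnection`), not the actual implementation:

```python
class DuckDBConnectionFactory:
    """Stand-in for the new factory; the real one creates and configures DuckDB connections."""

    def create_connection(self, path: str = ":memory:"):
        # stub object in place of a real, configured DuckDBPyConnection
        return {"path": path, "configured": True}


class TIMDEXDatasetMetadata:
    def __init__(self, timdex_dataset):
        # components receive the primary TIMDEXDataset and use things from it as needed
        self.timdex_dataset = timdex_dataset

    @property
    def conn(self):
        return self.timdex_dataset.conn


class TIMDEXEmbeddings:
    def __init__(self, timdex_dataset):
        self.timdex_dataset = timdex_dataset

    @property
    def conn(self):
        return self.timdex_dataset.conn


class TIMDEXDataset:
    def __init__(self):
        # the primary class owns connection creation via the factory...
        self.conn = DuckDBConnectionFactory().create_connection()
        # ...and attaches its components, passing itself
        self.metadata = TIMDEXDatasetMetadata(self)
        self.embeddings = TIMDEXEmbeddings(self)


td = TIMDEXDataset()
assert td.metadata.conn is td.embeddings.conn is td.conn  # one shared connection
```

The key point of the sketch is that there is exactly one connection, created in one place, shared by all components.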
```python
self.metadata = TIMDEXDatasetMetadata(self)
self.embeddings = TIMDEXEmbeddings(self)
```
We now have parity for how these components are attached to TIMDEXDataset:

- pass an instance of `self`
- assume those components will utilize things from `self.timdex_dataset` as needed
Smart consolidation!
```python
    exist_ok=True,
)

def configure_duckdb_connection(self, conn: DuckDBPyConnection) -> None:
```
Commenting here somewhat arbitrarily: all this DuckDB connection creation + configuration is moved to the new factory class.
Again, a very smart refactor decision!
```python
def create_connection(self, path: str = ":memory:") -> DuckDBPyConnection:
    """Create a new configured DuckDB connection.

    Args:
        path: Database file path or ":memory:" for in-memory database (default)
    """
    start_time = time.perf_counter()
    conn = duckdb.connect(path)
    conn.execute("SET enable_progress_bar = false;")
    self.configure_connection(conn)
    logger.debug(
        f"DuckDB connection created, {round(time.perf_counter() - start_time, 2)}s"
    )
    return conn
```
This is the only way a DuckDB connection should be created now, from anywhere and for any reason.
Perfection!
```python
def configure_connection(self, conn: DuckDBPyConnection) -> None:
    """Configure an existing DuckDB connection."""
    self._install_extensions(conn)
    self._configure_s3_secret(conn)
    self._configure_memory_profile(conn)
```
This was all ported directly from TIMDEXDatasetMetadata where it used to live.
Pull Request Test Coverage Report for Build 20373489346 (💛 Coveralls)
ehanson8
left a comment
My favorite ratio of red to green! Smart updates, and I look forward to the upcoming changes. Approved!
```python
def get_sa_table(self, table: str) -> Table:
    """Get SQLAlchemy Table from reflected SQLAlchemy metadata."""
    schema_table = f"metadata.{table}"
    if schema_table not in self._sa_metadata.tables:
        raise ValueError(
            f"Could not find table '{table}' in DuckDB schema 'metadata'."
        )
    return self._sa_metadata.tables[schema_table]
```
Love seeing near-duplicate methods like this disappear!
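With reflection centralized on `TIMDEXDataset`, the schema-qualified lookup the PR describes (`get_sa_table(<schema>, <table>)`) might look roughly like the sketch below. This is illustrative only: a plain dict stands in for the reflected SQLAlchemy metadata, and the class name is hypothetical.

```python
class TIMDEXDatasetSketch:
    """Illustrative stand-in: sa_tables maps 'schema.table' -> reflected Table."""

    def __init__(self, sa_tables: dict):
        self.sa_tables = sa_tables

    def get_sa_table(self, schema: str, table: str):
        # one consolidated, schema-aware lookup instead of per-component copies
        schema_table = f"{schema}.{table}"
        if schema_table not in self.sa_tables:
            raise ValueError(
                f"Could not find table '{table}' in DuckDB schema '{schema}'."
            )
        return self.sa_tables[schema_table]


td = TIMDEXDatasetSketch({"metadata.records": "<records Table>"})
assert td.get_sa_table("metadata", "records") == "<records Table>"
```

Because the schema is a parameter, both the metadata and embeddings components can share this single method rather than each reflecting and looking up their own tables.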
jonavellecuerdo
left a comment
Awesome work! ✨ The classes definitely read cleaner, and the relationships between TIMDEXDataset, TIMDEXEmbeddings, and TIMDEXDatasetMetadata are clearer.
What's next?
First, actually addressing the need in USE-306! As demonstrated in the walkthrough below, ideally we would not have a somewhat opaque `ValueError` bubble up when we try to read either records or embeddings that are not (yet) present. Ideally, we yield zero records, maybe with a `WARNING` message.

Second, we could explore things like automatically building metadata after the first write, or refreshing
the `TIMDEXDataset` instance after the first write for a particular source (e.g. embeddings). The structure is normalized enough now that this would be relatively straightforward.

How can a reviewer manually see the effects of these changes?
It's admittedly a little difficult to follow the git diff, either side-by-side or unified, to get a feel for what changed. This small walkthrough is designed to help with that.
1- Create a brand new, empty dataset:
Note the output, indicating a couple of things:
2- Attempt reading of records:
This error is very similar to the situation from USE-306 that prompted this work, specifically reading embeddings that don't exist yet. I'm highlighting it here to make clear that in a second pass we can address both, now that their reasons and mechanics are basically identical.
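Not part of this PR, but the "yield zero records with a `WARNING`" idea from the "What's next?" section could be sketched like this. The helper name and the dict standing in for reflected tables are hypothetical:

```python
import logging

logger = logging.getLogger(__name__)


def read_records_safely(sa_tables: dict, schema: str, table: str):
    """Hypothetical: yield zero records (with a warning) when a table isn't reflected yet."""
    schema_table = f"{schema}.{table}"
    if schema_table not in sa_tables:
        logger.warning(
            "Table '%s' not found in schema '%s'; yielding zero records.",
            table,
            schema,
        )
        return iter([])  # zero records instead of an opaque ValueError
    return iter(sa_tables[schema_table])


# reading against an empty dataset yields nothing rather than raising
rows = list(read_records_safely({}, "metadata", "records"))
assert rows == []
```

Because records and embeddings now share the same read mechanics, one helper like this could serve both cases.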
3- Write some records:
4- Attempt to read records again:
Same error, despite there being records! Again, another chance for improvement now that we have standardized things. It's quite easy to imagine detecting the first write to a dataset and automatically building metadata. Or, instead of
`Table 'records' not found in schema 'metadata'.` as the error, we catch that and suggest something like, "Records table not found. If this is a new TIMDEX dataset, please run `td.metadata.rebuild_dataset_metadata()`."

5- Build metadata and immediately read records:
Now we are reaping some of the dividends. After metadata building, we can immediately read records without a manual re-initialization of
`TIMDEXDataset`. This comes from improved handling of connections and streamlining of refresh methods.

6- Attempt read of embeddings:
We get an error, but it's identical to when metadata tables didn't exist! Again, this demonstrates how we can approach improving the ergonomics of both those situations in the same fashion.
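One way to improve the ergonomics of both situations in the same fashion, per the suggestion in step 4, would be to translate the opaque lookup failure into actionable guidance. A hypothetical sketch (the wrapper function and dict lookup are illustrative, not the library's API):

```python
def get_sa_table_with_hint(sa_tables: dict, schema: str, table: str):
    """Hypothetical: re-raise the opaque lookup failure with actionable guidance."""
    try:
        return sa_tables[f"{schema}.{table}"]
    except KeyError:
        raise ValueError(
            f"Table '{table}' not found in schema '{schema}'. If this is a new "
            "TIMDEX dataset, please run td.metadata.rebuild_dataset_metadata()."
        ) from None
```

The same hint works whether the missing table is `metadata.records` or an embeddings table, which is exactly the payoff of normalizing the two read paths.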
7- Write some embeddings:
8- Attempt to read embeddings:
Argh, same error, despite writing embeddings! We can address this with a more fully formed and functional refresh now:
This should be successful.
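Per the PR description, refresh now amounts to a full re-init orchestrated by `TIMDEXDataset`. As a stub sketch of what that implies (class and attribute names here are illustrative stand-ins, not the real implementation):

```python
class TIMDEXDatasetStub:
    """Stub: refresh() performs a full re-init, picking up newly written data."""

    def __init__(self):
        self._init()

    def _init(self):
        # stand-in for creating a fresh DuckDB connection, re-running SQLAlchemy
        # reflection, and re-attaching components
        self.conn = object()

    def refresh(self):
        # components no longer carry their own .refresh(); a full re-init suffices
        self._init()


td = TIMDEXDatasetStub()
old_conn = td.conn
td.refresh()
assert td.conn is not old_conn  # fresh state after refresh
```

Because refresh is just a re-init, newly written embeddings become readable without manually recreating the dataset instance.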
`TIMDEXDataset.refresh()` is now more capable.

Includes new or updated dependencies?
YES
Changes expectations for external applications?
NO
What are the relevant tickets?
Code review