USE 306 - refactor class relationships #181

Conversation
Why these changes are being introduced:

This refactoring work was a long time coming, inspired by a recent need to gracefully handle a read request for embeddings against a dataset without embeddings parquet files. If we can normalize how and when tables are created, and how DuckDB connections are handled, we can normalize handling read requests for data that may not be available (yet). As such, this refactoring work will help normalize read edge cases now and going forward.

This library was built in stages. First was `TIMDEXDataset`, which read parquet files directly. Then came `TIMDEXDatasetMetadata`, which more formally introduced DuckDB: it handled connection creation and configuration, and this connection was shared with `TIMDEXDataset` as we leaned into DuckDB reading. Lastly, `TIMDEXEmbeddings` was added as our first new "source" of data, sharing the connection from `TIMDEXDataset`. Both `TIMDEXDatasetMetadata` and `TIMDEXEmbeddings` were doing their own SQLAlchemy table reflections. `TIMDEXDatasetMetadata` could be instantiated on its own, while `TIMDEXEmbeddings` was assumed to take an instance of `TIMDEXDataset`.

At this point, while things worked, it was clear that a refactor would be beneficial. We needed clearer responsibility for what created and configured the DuckDB connection, to solidify that `TIMDEXDatasetMetadata` and `TIMDEXEmbeddings` are components on `TIMDEXDataset`, and to clarify how and when SQLAlchemy reflection is performed. Aligning all these things will make responding to these read and write edge cases much easier.

How this addresses that need:

- A new factory class, `DuckDBConnectionFactory`, is responsible for creating and configuring any DuckDB connections used.
- Both `TIMDEXDatasetMetadata` and `TIMDEXEmbeddings` require a `TIMDEXDataset` instance, and then themselves become components on `TIMDEXDataset`. We can then more accurately call them "components" of the primary `TIMDEXDataset`.
- `TIMDEXDataset` handles the creation of a DuckDB connection via the new factory, and this connection is then accessible to its components `TIMDEXDatasetMetadata` and `TIMDEXEmbeddings` (maybe more in the future).
- `TIMDEXDataset` is also responsible for all SQLAlchemy reflection, saving to `self.sa_tables`. In this way, any component that wants a SQLAlchemy table instance, e.g. for reading, can get it from `self.timdex_dataset.get_sa_table(<schema>, <table>)`.
- Refreshing of classes is greatly simplified: `TIMDEXDataset` is the true orchestrator now, so a full re-init satisfies this. Components no longer have their own `.refresh()` methods.
- Where possible, all tests are updated to use components like `TIMDEXEmbeddings` as part of a `TIMDEXDataset` instance, not as a lone class instance.

Side effects of this change:

- It is not recommended to use `TIMDEXDatasetMetadata` or `TIMDEXEmbeddings` by themselves; they are meant as components on a `TIMDEXDataset`.

Relevant ticket(s):

- https://mitlibraries.atlassian.net/browse/USE-306
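Schematically, the new relationships described above can be sketched with stand-in classes. The class names mirror the PR; the bodies are illustrative stubs (a dict in place of a real `DuckDBPyConnection`), not the actual implementation:

```python
class DuckDBConnectionFactory:
    """Stand-in for the new factory; the real one creates and configures DuckDB connections."""

    def create_connection(self, path: str = ":memory:"):
        # stub object in place of a real, configured DuckDBPyConnection
        return {"path": path, "configured": True}


class TIMDEXDatasetMetadata:
    def __init__(self, timdex_dataset):
        # components receive the primary TIMDEXDataset and use things from it as needed
        self.timdex_dataset = timdex_dataset

    @property
    def conn(self):
        return self.timdex_dataset.conn


class TIMDEXEmbeddings:
    def __init__(self, timdex_dataset):
        self.timdex_dataset = timdex_dataset

    @property
    def conn(self):
        return self.timdex_dataset.conn


class TIMDEXDataset:
    def __init__(self):
        # the primary class owns connection creation via the factory...
        self.conn = DuckDBConnectionFactory().create_connection()
        # ...and attaches its components, passing itself
        self.metadata = TIMDEXDatasetMetadata(self)
        self.embeddings = TIMDEXEmbeddings(self)


td = TIMDEXDataset()
assert td.metadata.conn is td.embeddings.conn is td.conn  # one shared connection
```

The key point of the sketch is that there is exactly one connection, created in one place, shared by all components.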
```python
self.metadata = TIMDEXDatasetMetadata(self)
self.embeddings = TIMDEXEmbeddings(self)
```
We now have parity for how these components are attached to TIMDEXDataset:

- pass an instance of `self`
- assume those components will utilize things from `self.timdex_dataset` as needed
Smart consolidation!
```python
    exist_ok=True,
)

def configure_duckdb_connection(self, conn: DuckDBPyConnection) -> None:
```
Commenting here somewhat arbitrarily: all this DuckDB connection creation + configuration is moved to the new factory class.
Again, a very smart refactor decision!
```python
def create_connection(self, path: str = ":memory:") -> DuckDBPyConnection:
    """Create a new configured DuckDB connection.

    Args:
        path: Database file path or ":memory:" for in-memory database (default)
    """
    start_time = time.perf_counter()
    conn = duckdb.connect(path)
    conn.execute("SET enable_progress_bar = false;")
    self.configure_connection(conn)
    logger.debug(
        f"DuckDB connection created, {round(time.perf_counter() - start_time, 2)}s"
    )
    return conn
```
This is the only way a DuckDB connection should be created now, from anywhere and for any reason.
Perfection!
```python
def configure_connection(self, conn: DuckDBPyConnection) -> None:
    """Configure an existing DuckDB connection."""
    self._install_extensions(conn)
    self._configure_s3_secret(conn)
    self._configure_memory_profile(conn)
```
This was all ported directly from TIMDEXDatasetMetadata where it used to live.
Pull Request Test Coverage Report for Build 20373489346 (💛 Coveralls)
ehanson8
left a comment
My favorite ratio of red to green! Smart updates, and I look forward to the upcoming changes. Approved!
```python
def get_sa_table(self, table: str) -> Table:
    """Get SQLAlchemy Table from reflected SQLAlchemy metadata."""
    schema_table = f"metadata.{table}"
    if schema_table not in self._sa_metadata.tables:
        raise ValueError(
            f"Could not find table '{table}' in DuckDB schema 'metadata'."
        )
    return self._sa_metadata.tables[schema_table]
```
Love seeing near-duplicate methods like this disappear!
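With reflection centralized on `TIMDEXDataset`, the schema-qualified lookup the PR describes (`get_sa_table(<schema>, <table>)`) might look roughly like the sketch below. This is illustrative only: a plain dict stands in for the reflected SQLAlchemy metadata, and the class name is hypothetical.

```python
class TIMDEXDatasetSketch:
    """Illustrative stand-in: sa_tables maps 'schema.table' -> reflected Table."""

    def __init__(self, sa_tables: dict):
        self.sa_tables = sa_tables

    def get_sa_table(self, schema: str, table: str):
        # one consolidated, schema-aware lookup instead of per-component copies
        schema_table = f"{schema}.{table}"
        if schema_table not in self.sa_tables:
            raise ValueError(
                f"Could not find table '{table}' in DuckDB schema '{schema}'."
            )
        return self.sa_tables[schema_table]


td = TIMDEXDatasetSketch({"metadata.records": "<records Table>"})
assert td.get_sa_table("metadata", "records") == "<records Table>"
```

Because the schema is a parameter, both the metadata and embeddings components can share this single method rather than each reflecting and looking up their own tables.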
jonavellecuerdo
left a comment
Awesome work! ✨ The classes definitely read cleaner, and the relationships between TIMDEXDataset, TIMDEXEmbeddings, and TIMDEXDatasetMetadata are clearer.
What's next?
First, actually addressing the need in USE-306! As demonstrated in the walkthrough below, ideally we would not have a somewhat opaque `ValueError` bubble up when we try to read either records or embeddings that are not (yet) present. Ideally, we yield zero records, maybe with a `WARNING` message.

Second, we could explore things like automatically building metadata after the first write, or refreshing
the `TIMDEXDataset` instance after the first write for a particular source (e.g. embeddings). The structure is normalized enough now that this would be relatively straightforward.

How can a reviewer manually see the effects of these changes?
It's admittedly a little difficult to follow the git diff, either side-by-side or unified, to get a feel for what changed. This small walkthrough is designed to help with that.
1- Create a brand new, empty dataset:
Note the output, indicating a couple of things:
2- Attempt reading of records:
This error is very similar to the situation from USE-306 that prompted this work, specifically reading embeddings that don't exist yet. I'm highlighting it here to make clear that in a second pass we can address both, now that their reasons and mechanics are basically identical.
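Not part of this PR, but the "yield zero records with a `WARNING`" idea from the "What's next?" section could be sketched like this. The helper name and the dict standing in for reflected tables are hypothetical:

```python
import logging

logger = logging.getLogger(__name__)


def read_records_safely(sa_tables: dict, schema: str, table: str):
    """Hypothetical: yield zero records (with a warning) when a table isn't reflected yet."""
    schema_table = f"{schema}.{table}"
    if schema_table not in sa_tables:
        logger.warning(
            "Table '%s' not found in schema '%s'; yielding zero records.",
            table,
            schema,
        )
        return iter([])  # zero records instead of an opaque ValueError
    return iter(sa_tables[schema_table])


# reading against an empty dataset yields nothing rather than raising
rows = list(read_records_safely({}, "metadata", "records"))
assert rows == []
```

Because records and embeddings now share the same read mechanics, one helper like this could serve both cases.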
3- Write some records:
4- Attempt to read records again:
Same error, despite there being records! Again, another chance for improvement now that we have standardized things. It's quite easy to imagine detecting the first write to a dataset and automatically building metadata. Or, instead of
`Table 'records' not found in schema 'metadata'.` as the error, we catch that and suggest something like, "Records table not found. If this is a new TIMDEX dataset, please run `td.metadata.rebuild_dataset_metadata()`."

5- Build metadata and immediately read records:
Now we are reaping some of the dividends. After metadata building, we can immediately read records without a manual re-initialization of
`TIMDEXDataset`. This comes from improved handling of connections and streamlining of refresh methods.

6- Attempt read of embeddings:
We get an error, but it's identical to when metadata tables didn't exist! Again, this demonstrates how we can approach improving the ergonomics of both those situations in the same fashion.
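One way to improve the ergonomics of both situations in the same fashion, per the suggestion in step 4, would be to translate the opaque lookup failure into actionable guidance. A hypothetical sketch (the wrapper function and dict lookup are illustrative, not the library's API):

```python
def get_sa_table_with_hint(sa_tables: dict, schema: str, table: str):
    """Hypothetical: re-raise the opaque lookup failure with actionable guidance."""
    try:
        return sa_tables[f"{schema}.{table}"]
    except KeyError:
        raise ValueError(
            f"Table '{table}' not found in schema '{schema}'. If this is a new "
            "TIMDEX dataset, please run td.metadata.rebuild_dataset_metadata()."
        ) from None
```

The same hint works whether the missing table is `metadata.records` or an embeddings table, which is exactly the payoff of normalizing the two read paths.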
7- Write some embeddings:
8- Attempt to read embeddings:
Argh, same error, despite writing embeddings! We can address this with a more fully formed and functional refresh now:
This should be successful.
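Per the PR description, refresh now amounts to a full re-init orchestrated by `TIMDEXDataset`. As a stub sketch of what that implies (class and attribute names here are illustrative stand-ins, not the real implementation):

```python
class TIMDEXDatasetStub:
    """Stub: refresh() performs a full re-init, picking up newly written data."""

    def __init__(self):
        self._init()

    def _init(self):
        # stand-in for creating a fresh DuckDB connection, re-running SQLAlchemy
        # reflection, and re-attaching components
        self.conn = object()

    def refresh(self):
        # components no longer carry their own .refresh(); a full re-init suffices
        self._init()


td = TIMDEXDatasetStub()
old_conn = td.conn
td.refresh()
assert td.conn is not old_conn  # fresh state after refresh
```

Because refresh is just a re-init, newly written embeddings become readable without manually recreating the dataset instance.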
`TIMDEXDataset.refresh()` is now more capable.

Includes new or updated dependencies?
YES
Changes expectations for external applications?
NO
What are the relevant tickets?
Code review