Skip to content

Conversation

@ghukill
Copy link
Contributor

@ghukill ghukill commented Dec 22, 2025

Purpose and background context

Primary git commit:


Why these changes are being introduced:

There are a copule of primary ways in which a read method can fail based on the table requested:

  1. The table name is invalid and will never work.
  2. The table name is valid, but does not yet exist in the DuckDB context.

For the first, we want to raise an error immediately. For the second, there is a bit more nuance depending on the table requested.

How this addresses that need:

For TIMDEXDataset.read_* methods, we operate from the assumption that records and current_records should always be available and so we raise an error indicating a metadata rebuild is required.

But for TIMDEXEmbeddings.read_* methods, and potentially other data sources as added, we may legitimately not have data yet. As such, we'll want to log warnings and suggest a refresh, but just return an empty set.


Additional commentary:

This is an incremental improvment to a problem that runs a little deeper.

The improvement is more explicit errors and logging right now. For example, TIM can now attempt to read embeddings for a source during the CLI command reindex-source without error, just an empty set if they don't yet exist. We also raise meaningful errors for a brand new dataset, where are after the first records write, we need to build the metadata a first (and only) time.

Both of these edge cases point to a slightly deeper need: the creation of DuckDB schemas and tables explicitly, from clearly visible code in TDA, that does not require actual data to exist yet. If we do this, then read methods will naturally just return nothing if nothing exists, beacuse the schemas / tables will always exist. This also serves the function of self-documenting what schemas/tables make up the dataset.

Currently, we rely on creating schemas/tables from pre-existing parquet data. This is often a very fine and good approach, and has served us well, but the dataset structure and patterns have matured enough, we could probably define these tables more explicitly in code solving these edge cases and better documenting the dataset structure.

How can a reviewer manually see the effects of these changes?

1- Start ipython shell

2- Prepare a brand new dataset:

import shutil

from timdex_dataset_api import TIMDEXDataset
from timdex_dataset_api.config import configure_dev_logger
from tests.utils import generate_sample_records, generate_sample_embeddings_for_run

logger = configure_dev_logger()

DATASET_PATH = "/tmp/use306"
RUN_ID = "abc123"

# delete if already exists
shutil.rmtree(DATASET_PATH)

td = TIMDEXDataset(DATASET_PATH)

3- Records error handling:

td.read_dataframe()
# Failure: no records yet, so not metadata tables yet, raise exception

# write records
td.write(
    generate_sample_records(
        num_records=10,
        run_id=RUN_ID,
        run_date="2025-12-19",
        run_type="full",
        action="index",
    )
)

td.read_dataframe()
# Failure: we have records, but still need to build metadata, raise exception

td.metadata.rebuild_dataset_metadata()

print(td.read_dataframe())
# success

4- Embeddings error handling:

td.embeddings.read_dataframe()
# Failure: no embeddings, log warning; yield empty set / no exception

# write embeddings
td.embeddings.write(generate_sample_embeddings_for_run(td, run_id=RUN_ID))

td.embeddings.read_dataframe()
# Failure: embeddings exist now, but not fully established in DuckDB context; yield empty set / no exception

td.refresh()

print(td.read_dataframe())
# success

Includes new or updated dependencies?

NO

Changes expectations for external applications?

YES:

  • Applications like TIM, which now attempt embeddings reading for its reindex-source CLI command, will no longer encounter an error if embeddings don't yet exist.
  • In the rare edge cases of a brand new dataset, we have better error raising and logging.

What are the relevant tickets?

Code review

  • Code review best practices are documented here and you are encouraged to have a constructive dialogue with your reviewers about their preferences and expectations.

Why these changes are being introduced:

There are a copule of primary ways in which a read method can fail based on the table requested:
1. The table name is invalid and will never work.
2. The table name is valid, but does not yet exist in the DuckDB context.

For the first, we want to raise an error immediately.  For the second, there is a bit more nuance depending
on the table requested.

How this addresses that need:

For TIMDEXDataset.read_* methods, we operate from the assumption that `records` and `current_records` should always be available and so we raise an error indicating a metadata rebuild is required.

But for TIMDEXEmbeddings, and potentially other data sources as added, we may legitimately not have data yet.
As such, we'll want to log warnings and suggest a refresh, but just return an empty set.

Side effects of this change:
* Applications like TIM, which now attempt embeddings reading for its `reindex-source` CLI command, will
no longer encounter an error if embeddings don't yet exist.
* In the rare edge cases of a brand new dataset, we have better error raising and logging.

Relevant ticket(s):
* https://mitlibraries.atlassian.net/browse/USE-306
Comment on lines -305 to -307

# NOTE: at time of test creation, this manual reload is required
td = TIMDEXDataset(td.location)
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This and other similar lines: based on previous refactor, missed at the time, we can lean on the improved .refresh() method!

@ghukill ghukill marked this pull request as ready for review December 22, 2025 15:22
@ghukill ghukill requested a review from a team as a code owner December 22, 2025 15:22
@jonavellecuerdo jonavellecuerdo self-assigned this Dec 22, 2025
Copy link

@ehanson8 ehanson8 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Works as expected, approved!

@ghukill ghukill merged commit 03521ef into main Dec 22, 2025
2 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants