Skip to content

upload_filepath re-uploads files after transaction rollback — should check S3 before uploading #1397

@ttngu207

Description

@ttngu207

DataJoint Version

datajoint 0.14.8 (datajoint/external.py, datajoint/s3.py)

Problem

ExternalTable.upload_filepath() only checks the external tracking table (DB) to determine whether a file already exists on S3 before uploading. When a transaction rolls back after a successful S3 upload, the DB tracking entry is lost but the file remains on S3. On retry, upload_filepath finds no DB entry and re-uploads the entire file, even though it's already there.

This is particularly painful for large files (multi-GB). In our case, a 10 GB recording.dat gets uploaded to S3 in ~5 minutes, but the DB connection times out during the enclosing transaction. Every retry re-uploads the same 10 GB file — wasting time and bandwidth — and then fails again at the same point.

Root Cause

S3 uploads are not transactional, but upload_filepath treats them as if they are by relying solely on the DB tracking table:

# external.py lines 293-315
check_hash = (self & {"hash": uuid}).fetch("contents_hash")
if check_hash.size:
    # DB entry exists → skip (correct)
else:
    # DB entry missing → always upload, even if file is already on S3
    self._upload_file(local_filepath, external_path, metadata=...)
    self.connection.query("INSERT INTO ...")

After a transaction rollback:

  1. S3 file exists ✓ (upload succeeded before rollback)
  2. DB tracking entry gone ✗ (rolled back)
  3. Retry: DB check finds nothing → re-uploads entire file to S3

Proposed Fix

Before uploading, check S3 directly using s3.exists() + s3.get_size() (both already implemented in s3.py). If a file with matching size already exists at the expected path, skip the upload:

check_hash = (self & {"hash": uuid}).fetch("contents_hash")
if check_hash.size:
    # DB tracking entry exists — skip
    if not skip_checksum and contents_hash != check_hash[0]:
        raise DataJointError(...)
else:
    external_path = self._make_external_filepath(relative_filepath)
    already_uploaded = False
    if self.spec["protocol"] == "s3":
        if self.s3.exists(str(external_path)):
            remote_size = self.s3.get_size(str(external_path))
            if remote_size == file_size:
                already_uploaded = True
                logger.info(
                    f"File already exists on S3 with matching size, "
                    f"skipping upload: '{relative_filepath}'"
                )
    if not already_uploaded:
        self._upload_file(
            local_filepath, external_path,
            metadata={"contents_hash": str(contents_hash) if contents_hash else ""},
        )
    # Always insert the DB tracking entry
    self.connection.query("INSERT INTO ...")

For even stronger verification, the contents_hash is already stored in S3 object metadata (set at upload time on line 306). It could be checked via stat_object without downloading the file:

stat = self.s3.client.stat_object(self.bucket, str(external_path))
remote_contents_hash = stat.metadata.get("x-amz-meta-contents_hash")

Why This Is Safe

  • s3.exists() and s3.get_size() are cheap stat_object calls (milliseconds)
  • The external path is deterministic (derived from the relative filepath), so path + size match is a strong identity signal
  • If the S3 check fails or the file doesn't exist, it falls through to the normal upload path — zero risk to existing behavior
  • This only affects the else branch (no DB entry), so already-tracked files are unaffected

Reproduction

Any dj.Imported table that inserts filepath@store attributes pointing to large files inside make():

  1. make() runs a long computation, then calls self.insert1(...) with a filepath@store attribute referencing a large file
  2. S3 upload succeeds but takes long enough that the DB connection times out
  3. Transaction rolls back (LostConnectionError)
  4. Retry: upload_filepath finds no DB entry → re-uploads the same large file → same timeout → infinite retry loop

In our pipeline, this happens with spike sorting, which produces a 10 GB recording.dat artifact.

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type
    No fields configured for issues without a type.

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions