
Support deleted rows in SAS data files #366

Open
hpoettker wants to merge 1 commit into WizardMac:dev from hpoettker:deleted-rows
@hpoettker
Contributor

Resolves #284.

Introduction

This PR adds support for deleted rows in both compressed and uncompressed SAS data files.

The handling of deleted uncompressed rows follows the description of the sas7bdat file format here: https://github.com/FredHutch/sas7bdat-specification/blob/master/sas7bdat.rst

The handling of deleted compressed rows is not described in the document linked above. It is much simpler, however: it involves no dedicated bitmap, only the compression type 0x05, which marks a compressed data row as deleted. The compression type 0x05 appears to be the combination of 0x01 (meaning the data can be skipped) and 0x04 (indicating a compressed data row), which fits well with my theory put forth in #365 that the compression type is actually a bitmap.
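To illustrate the bitmap reading of the compression type, here is a minimal sketch in C. The flag and function names are hypothetical and are not taken from the ReadStat sources; only the values 0x01, 0x04 and 0x05 come from the description above, and their interpretation as independent bits is the theory from #365, not a documented fact.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical flag names. Only the numeric values are taken from the
 * PR description; treating them as independent bits is an assumption. */
enum {
    SUBHEADER_FLAG_SKIP       = 0x01, /* data can be skipped */
    SUBHEADER_FLAG_COMPRESSED = 0x04  /* compressed data row */
};

/* True for 0x04 (live compressed row) and 0x05 (deleted compressed row). */
static bool is_compressed_row(uint8_t compression_type) {
    return (compression_type & SUBHEADER_FLAG_COMPRESSED) != 0;
}

/* True only when both the skip bit and the compressed bit are set (0x05). */
static bool is_deleted_compressed_row(uint8_t compression_type) {
    const uint8_t mask = SUBHEADER_FLAG_SKIP | SUBHEADER_FLAG_COMPRESSED;
    return (compression_type & mask) == mask;
}
```

Under this reading, a parser would count a 0x05 row as parsed and deleted without decompressing it, and would decompress a 0x04 row as before.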

Row limits and counters

Currently, row_limit is not only an upper limit on the number of rows but also the exact number of rows that are expected to be parsed. As row_limit also includes the deleted rows, the corresponding counter parsed_row_count is now also incremented when a deleted row is encountered. This allows the validation checks against row_limit to remain unchanged.

The two new variables deleted_row_limit and parsed_deleted_row_count serve a corresponding purpose but count only deleted rows.

The row count in the metadata is computed as row_limit - deleted_row_limit. Accordingly, the row id that is passed to the value handler and used in error messages is now computed as parsed_row_count - parsed_deleted_row_count.
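The counter arithmetic can be sketched as follows. This is an illustration rather than the code in this PR: the struct and function names are hypothetical, only the four counter names come from the description, and it is an assumption here that the row id is read before the counters are incremented for the current row.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical container for the four counters named in the description. */
typedef struct {
    size_t row_limit;                /* all rows in the file, deleted ones included */
    size_t deleted_row_limit;        /* deleted rows in the file */
    size_t parsed_row_count;         /* rows seen so far, deleted ones included */
    size_t parsed_deleted_row_count; /* deleted rows seen so far */
} row_counters_t;

/* Row count reported in the metadata: live rows only. */
static size_t reported_row_count(const row_counters_t *c) {
    return c->row_limit - c->deleted_row_limit;
}

/* Row id for the value handler. Assumption: read before record_row()
 * is called for the current row, so ids of live rows are consecutive. */
static size_t current_row_id(const row_counters_t *c) {
    return c->parsed_row_count - c->parsed_deleted_row_count;
}

/* Every row advances parsed_row_count, so checks against row_limit stay
 * unchanged; deleted rows additionally advance the deleted counter. */
static void record_row(row_counters_t *c, bool deleted) {
    c->parsed_row_count++;
    if (deleted)
        c->parsed_deleted_row_count++;
}
```

For a file with row_limit = 5 and deleted_row_limit = 2, the reported row count is 3, and the three live rows receive the ids 0, 1 and 2 regardless of where the deleted rows sit.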

Implementation alternative

With the proposed change, deleted rows are invisible to users of the library, which differs slightly from SAS itself. There, the number of deleted rows is visible in the metadata, and the GUI viewer indicates the positions of deleted rows by non-consecutive row ids.

One could consider adding the number of deleted rows to the metadata. One could also consider marking rows as deleted, either by explicitly passing the information to the value handler or by implicitly not passing data for deleted row ids to it. I think this would require a breaking change in the API. It's mostly a question of what exactly row_count in the metadata should refer to when there are deleted rows.

Validation

The code works as expected with the example file from #284, which uses a single page and contains only 5 rows (of which 2 are deleted).

I've also tested it successfully against data files written on both Windows and Linux that contain multiple pages and have deleted rows across them.

As generating generic test data is quite easy in this case, I can generate test files if needed.

Development

Successfully merging this pull request may close these issues.

Skip deleted observations in SAS7BDAT files
