Support deleted rows in SAS data files #366
Open
hpoettker wants to merge 1 commit into WizardMac:dev from
Resolves #284.
Introduction
This PR adds support for deleted rows in SAS data files, which may be compressed or uncompressed.
The handling of deleted uncompressed rows follows the description of the sas7bdat file format here: https://github.com/FredHutch/sas7bdat-specification/blob/master/sas7bdat.rst
The handling of deleted compressed rows is not described in the document linked above. However, it is much simpler as it doesn't involve a dedicated bitmap but only the compression type `0x05`, which marks compressed data rows as deleted. The compression type `0x05` seems to be the combination of `0x01` (meaning the data can be skipped) and `0x04` (indicating a compressed data row), which fits well with my theory put forth in #365 that the compression type is actually a bitmap.
Row limits and counters
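As context for the counters discussed in this section, the bitmap reading of the compression type from the introduction can be sketched as follows. This is only an illustration: the macro and function names are hypothetical and not taken from the library, and only the values `0x01`, `0x04` and `0x05` come from the description above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical names; only the numeric values come from the PR text. */
#define ROW_FLAG_SKIPPABLE          0x01u /* data can be skipped */
#define ROW_FLAG_COMPRESSED         0x04u /* compressed data row */
#define ROW_TYPE_DELETED_COMPRESSED 0x05u /* observed on deleted compressed rows */

/* Under the bitmap theory, 0x05 is simply both flags set at once. */
static_assert((ROW_FLAG_SKIPPABLE | ROW_FLAG_COMPRESSED) == ROW_TYPE_DELETED_COMPRESSED,
              "0x05 should equal 0x01 | 0x04");

/* True if both the "skippable" and the "compressed" bit are set. */
static bool is_deleted_compressed_row(uint8_t compression_type) {
    const uint8_t mask = ROW_FLAG_SKIPPABLE | ROW_FLAG_COMPRESSED;
    return (compression_type & mask) == mask;
}

/* True if the row is compressed and not marked as skippable. */
static bool is_live_compressed_row(uint8_t compression_type) {
    return (compression_type & ROW_FLAG_COMPRESSED) != 0
        && (compression_type & ROW_FLAG_SKIPPABLE) == 0;
}
```

Testing the individual bits rather than comparing against `0x05` directly is a design choice that only makes sense if the bitmap theory holds; a plain equality check would be the more conservative variant.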
Currently, the `row_limit` is not only an upper limit on the number of rows but also the exact number of rows that are expected to be parsed. As the `row_limit` also includes the deleted rows, the corresponding counter `parsed_row_count` is now also increased when encountering a deleted row. This allows the validation checks against `row_limit` to remain unchanged.
The two new variables `deleted_row_limit` and `parsed_deleted_row_count` serve a corresponding purpose but count only deleted rows.
The row count in the metadata is computed as `row_limit - deleted_row_limit`. Accordingly, the row id that is passed to the value handler and used in error messages is now computed as `parsed_row_count - parsed_deleted_row_count`.
Implementation alternative
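For reference when weighing the alternative, the counter arithmetic of the current proposal can be sketched as follows. The struct and the three helper functions are hypothetical stand-ins, not the library's actual context type or API; only the four counter names and the two subtractions come from the description above. The sketch also assumes that the row id is read before the counters are advanced for that row.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the parser state; field names follow the PR text. */
typedef struct {
    size_t row_limit;                /* rows in the file, deleted ones included */
    size_t deleted_row_limit;        /* deleted rows in the file */
    size_t parsed_row_count;         /* rows seen so far, deleted ones included */
    size_t parsed_deleted_row_count; /* deleted rows seen so far */
} row_counters_t;

/* Row count reported in the metadata. */
static size_t visible_row_count(const row_counters_t *c) {
    assert(c->deleted_row_limit <= c->row_limit);
    return c->row_limit - c->deleted_row_limit;
}

/* Row id for the value handler and error messages (assumed to be
 * evaluated before record_row() is called for the current row). */
static size_t current_row_id(const row_counters_t *c) {
    assert(c->parsed_deleted_row_count <= c->parsed_row_count);
    return c->parsed_row_count - c->parsed_deleted_row_count;
}

/* Advance the counters after a row has been consumed. */
static void record_row(row_counters_t *c, bool is_deleted) {
    c->parsed_row_count++;
    if (is_deleted)
        c->parsed_deleted_row_count++;
}
```

With `row_limit = 5` and `deleted_row_limit = 2`, `visible_row_count` yields 3, and once all five rows have been recorded `parsed_row_count` equals `row_limit` again, which is why the existing checks against `row_limit` need no change.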
With the proposed change, deleted rows are invisible to the user of the library, which differs a bit from SAS itself. There, the number of deleted rows is visible in the metadata, and the GUI viewer indicates the positions of deleted rows by non-consecutive row ids.
One could consider adding the number of deleted rows to the metadata. One could also consider marking rows as deleted, either by explicitly passing the information to the value handler or by implicitly not passing data for deleted row ids to it. I think this would require a breaking change in the API. It's mostly a question of what exactly `row_count` in the metadata should refer to in the case of deleted rows.
Validation
The code works as expected with the example file from #284, which uses a single page and contains only 5 rows (of which 2 are deleted).
I've also tested it successfully against data files written on both Windows and Linux that contain multiple pages and have deleted rows across them.
As generating generic test data is quite easy in this case, I can generate test files if needed.