Modify the Reader classes to support getting files from somewhere other than a local disk #135

@alexwlchan

The classes in the reader package take a path to a file on disk, read that file, and then parse its contents. For example:

public final class KeyValueReader {
  /**
   * Generic method to read key value pairs from the bagit files, like bagit.txt or bag-info.txt
   * 
   * @param file the file to read
   * @param splitRegex how to split the key from the value
   * @param charset the encoding of the file
   * 
   * @return a list of key value pairs
   */
  public static List<SimpleImmutableEntry<String, String>> readKeyValuesFromFile(final Path file, final String splitRegex, final Charset charset) throws IOException, InvalidBagMetadataException{
    final List<SimpleImmutableEntry<String, String>> keyValues = new ArrayList<>();
    
    try(final BufferedReader reader = Files.newBufferedReader(file, charset)){
       ...
    }

    return keyValues;
  }
}

For the Wellcome storage service (https://github.com/wellcometrust/storage-service), we aren’t keeping bags on the local disk, but in S3. If we want to read a file, we make a GetObject call to the S3 SDK, which returns an InputStream.
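Since the S3 call hands back an InputStream, bridging it to the BufferedReader that the parsing code already works with could be a thin wrapper. A minimal sketch of that bridge follows; `StreamReaders` and `toBufferedReader` are hypothetical names that are not part of bagit-java, and a ByteArrayInputStream stands in for the S3 object stream so the example runs without the AWS SDK:

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public final class StreamReaders {
  private StreamReaders() {}

  // Hypothetical helper (not part of bagit-java): decode any InputStream
  // with an explicit charset and buffer it for line-by-line reading.
  public static BufferedReader toBufferedReader(final InputStream in, final Charset charset) {
    return new BufferedReader(new InputStreamReader(in, charset));
  }

  public static void main(final String[] args) throws IOException {
    // Stand-in for the InputStream returned by an S3 GetObject call.
    final byte[] body = "BagIt-Version: 1.0\nTag-File-Character-Encoding: UTF-8\n"
        .getBytes(StandardCharsets.UTF_8);
    final InputStream in = new ByteArrayInputStream(body);

    // Closing the reader also closes the underlying stream.
    try (BufferedReader reader = toBufferedReader(in, StandardCharsets.UTF_8)) {
      System.out.println(reader.readLine()); // prints "BagIt-Version: 1.0"
    }
  }
}
```

The charset has to be supplied by the caller here, because there is no Path for Files.newBufferedReader to decode.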

We could download the bag files to disk and read them from there, but that seems a bit icky. Would you be open to some pull requests that allow parsing files even if they aren’t local files? Something like:

public final class KeyValueReader {
  public static List<SimpleImmutableEntry<String, String>> readKeyValuesFromReader(
    final BufferedReader reader,
    final String splitRegex) throws IOException, InvalidBagMetadataException{
    final List<SimpleImmutableEntry<String, String>> keyValues = new ArrayList<>();

    ...    

    return keyValues;
  }

  public static List<SimpleImmutableEntry<String, String>> readKeyValuesFromFile(
    final Path file,
    final String splitRegex,
    final Charset charset) throws IOException, InvalidBagMetadataException{
    try(final BufferedReader reader = Files.newBufferedReader(file, charset)){
       return readKeyValuesFromReader(reader, splitRegex);
    }
  }
}

So the existing API is preserved: readKeyValuesFromFile still takes a Path, but it now delegates to the new method that takes any BufferedReader. We can then call the new method directly rather than round-tripping to the filesystem first.
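To make the delegation concrete, here is a self-contained sketch of the pattern. Everything in it is illustrative rather than the library's actual code: `KeyValueSketch` is a hypothetical class, the parsing loop is a simplified stand-in for the elided body (it splits each line once on the regex and skips lines without a separator, with no handling of continuation lines), and InvalidBagMetadataException is left out so the example compiles on its own:

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.nio.charset.Charset;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.AbstractMap.SimpleImmutableEntry;
import java.util.ArrayList;
import java.util.List;

public final class KeyValueSketch {
  private KeyValueSketch() {}

  // New entry point: works with any BufferedReader, whatever its source.
  public static List<SimpleImmutableEntry<String, String>> readKeyValuesFromReader(
      final BufferedReader reader, final String splitRegex) throws IOException {
    final List<SimpleImmutableEntry<String, String>> keyValues = new ArrayList<>();
    String line;
    while ((line = reader.readLine()) != null) {
      // Simplified, assumed parsing: split once into key and value.
      final String[] parts = line.split(splitRegex, 2);
      if (parts.length == 2) {
        keyValues.add(new SimpleImmutableEntry<>(parts[0].trim(), parts[1].trim()));
      }
    }
    return keyValues;
  }

  // Existing-style entry point: same signature shape, now a thin wrapper.
  public static List<SimpleImmutableEntry<String, String>> readKeyValuesFromFile(
      final Path file, final String splitRegex, final Charset charset) throws IOException {
    try (BufferedReader reader = Files.newBufferedReader(file, charset)) {
      return readKeyValuesFromReader(reader, splitRegex);
    }
  }

  public static void main(final String[] args) throws IOException {
    // No filesystem involved: the reader is backed by an in-memory string.
    final BufferedReader reader = new BufferedReader(
        new StringReader("BagIt-Version: 1.0\nTag-File-Character-Encoding: UTF-8\n"));
    System.out.println(readKeyValuesFromReader(reader, ":"));
    // prints [BagIt-Version=1.0, Tag-File-Character-Encoding=UTF-8]
  }
}
```

One side benefit of this shape is that the parsing logic can be unit-tested against in-memory strings, without creating temporary files.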

Thoughts?
