@keith-turner
Contributor

TabletMetadataFilter had a set of columns needed for filtering available to it but did not use them. Modified it to use the columns during filtering. The comment in the change provides more details.

Attempted to modify RowFilter to support this, but the way it uses its deep copy is inverted from what is needed for this case.
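As a rough illustration of the idea (not the actual iterator code in this change), the sketch below compares reading every column and then filtering against first searching only the filter's columns and then fetching the full row for each match. Everything in it is hypothetical: `TwoPhaseFilterSketch`, `rowsByFamily`, and `entriesRead` are invented names, and the in-memory per-family index only stands in for the assumption that the underlying files can cheaply skip a family that is absent.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

public class TwoPhaseFilterSketch {

  /** Toy table: row -> (family -> value). A stand-in, not Accumulo's data model. */
  private final TreeMap<String, TreeMap<String, String>> rows = new TreeMap<>();
  /** Toy per-family index: family -> rows containing it, so an absent family costs nothing. */
  private final Map<String, TreeSet<String>> rowsByFamily = new TreeMap<>();
  /** Number of entries touched, used to compare the two strategies. */
  long entriesRead = 0;

  void put(String row, String family, String value) {
    rows.computeIfAbsent(row, r -> new TreeMap<>()).put(family, value);
    rowsByFamily.computeIfAbsent(family, f -> new TreeSet<>()).add(row);
  }

  /** Baseline: read every column of every row, then keep rows that have a filter family. */
  List<String> scanAllThenFilter(Set<String> filterFamilies) {
    List<String> matches = new ArrayList<>();
    for (var entry : rows.entrySet()) {
      entriesRead += entry.getValue().size();
      if (entry.getValue().keySet().stream().anyMatch(filterFamilies::contains)) {
        matches.add(entry.getKey());
      }
    }
    return matches;
  }

  /** Two phases: find rows via the filter families only, then read those rows in full. */
  List<String> searchThenFetch(Set<String> filterFamilies) {
    TreeSet<String> candidates = new TreeSet<>();
    for (String family : filterFamilies) {
      TreeSet<String> found = rowsByFamily.getOrDefault(family, new TreeSet<>());
      entriesRead += found.size(); // phase 1 touches only the filter columns
      candidates.addAll(found);
    }
    List<String> matches = new ArrayList<>();
    for (String row : candidates) {
      entriesRead += rows.get(row).size(); // phase 2 reads all columns of a matching row
      matches.add(row);
    }
    return matches;
  }

  public static void main(String[] args) {
    TwoPhaseFilterSketch table = new TwoPhaseFilterSketch();
    for (int i = 0; i < 1000; i++) {
      String row = String.format("%04d", i);
      table.put(row, "loc", "127.0.0.1:9997");
      table.put(row, "tab", "pr");
    }
    table.put("0500", "ecomp", "ECID:1"); // only one tablet has the rare family

    List<String> baselineMatches = table.scanAllThenFilter(Set.of("ecomp"));
    long baselineReads = table.entriesRead;
    table.entriesRead = 0;
    List<String> twoPhaseMatches = table.searchThenFetch(Set.of("ecomp"));

    System.out.println("same result: " + baselineMatches.equals(twoPhaseMatches));
    System.out.printf("baseline read %d entries, two-phase read %d%n",
        baselineReads, table.entriesRead);
  }
}
```

In this toy setup both strategies return the same single row, but the baseline touches 2,001 entries while the two-phase version touches 4. The real benefit depends on how cheaply the storage layer skips an absent family.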

@keith-turner keith-turner added this to the 4.0.0 milestone Jun 4, 2025
@keith-turner
Contributor Author

Hoping this change can really speed up scans for some of the columns like ecomp, wals, migrations, etc. These columns are usually not present in most tablets. Want to eventually test this further w/ lots of tablets. May modify SplitMillionIT locally to run some performance tests.

@DomGarguilo
Member

Want to eventually test this further w/ lots of tablets. May modify SplitMillionIT locally to run some performance test.

I haven't looked at the code here yet but am wondering if the cingest manysplits application in accumulo-testing might give some rough performance measurements too.

@keith-turner
Contributor Author

keith-turner commented Jun 6, 2025

These changes assume RFile can quickly skip an entire column family when it's not present in the file. Did some local testing to check that assumption and found it holds. Would still want to do end-to-end testing, but hoping that metadata tablets w/o the family can be quickly skipped w/ this change. This is the test I wrote.

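import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.Random;

import org.apache.accumulo.core.client.rfile.RFile;
import org.apache.accumulo.core.data.Key;
import org.apache.accumulo.core.data.Range;
import org.apache.accumulo.core.data.Value;
import org.apache.hadoop.io.Text;

import com.google.common.collect.Iterables;

public class RFilePerfTest {
  public static void main(String[] args) throws IOException {
    Random rand = new Random();
    Files.deleteIfExists(Path.of("/tmp/test.rf"));
    try (var writer = RFile.newWriter().to("file:///tmp/test.rf").build()) {
      // LG1 declares loc and ecomp, but only loc entries are written to it
      writer.startNewLocalityGroup("LG1", "loc", "ecomp");
      for (int i = 0; i < 10_000_000; i++) {
        String row = String.format("%09x", i);
        int port = rand.nextInt(1 << 16);
        writer.append(new Key(row, "loc", "127.0.0.1:" + port), new Value(""));
      }
      // the default locality group holds only the tab family
      writer.startDefaultLocalityGroup();
      for (int i = 0; i < 10_000_000; i++) {
        String row = String.format("%09x", i);
        writer.append(new Key(row, "tab", "pr"), new Value(String.format("%09x", i - 1)));
      }
    }

    try (var scanner = RFile.newScanner().from("file:///tmp/test.rf").build()) {
      // time a full scan of one family at a time; ecomp and migration have no entries
      for (String family : List.of("loc", "ecomp", "tab", "migration")) {
        scanner.setRange(new Range());
        scanner.clearColumns();
        scanner.fetchColumnFamily(new Text(family));
        long t1 = System.currentTimeMillis();
        long size = Iterables.size(scanner);
        long t2 = System.currentTimeMillis();
        System.out.printf("family:%10s size:%,d time:%,d\n", family, size, t2 - t1);
      }
    }
  }
}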

Running this, I see the following times (in milliseconds). Experimented w/ different sizes of rfiles to ensure the times for absent families stayed around 1 to 2 ms. The ecomp family is declared in locality group LG1 but has no entries there, and the migration family is not present in the default locality group.

family:       loc size:10,000,000 time:2,086
family:     ecomp size:0 time:2
family:       tab size:10,000,000 time:1,723
family: migration size:0 time:1

var row = nextTablet.getKeyValues().get(0).getKey().getRow();
// now that a row was found by our search iterator, seek the main iterator with all the
// columns for that row range
super.seek(new Range(row), seekFamilies, seekInclusive);
Member

I'm not sure if I'm interpreting things correctly here, but could seeking to a row potentially ignore the range that was input in the main seek() method? Like if the caller supplied a range that begins or ends mid-row, could this code make it so that we return data outside the requested range? This may not even be an issue if so. Just thought I would double check

Contributor Author

That does seem like it could happen. Maybe I can intersect the seek range and the row range.
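A minimal sketch of that intersection, under simplifying assumptions: keys are plain strings, ranges are half-open, and `KeyRange`, `key`, and `wholeRow` are hypothetical helpers rather than Accumulo types. On the real `Range` class, `Range.clip(Range)` looks like the method for this kind of intersection, though whether it fits here has not been checked against the actual change.

```java
import java.util.Optional;

public class RangeClipSketch {

  /** Hypothetical half-open key interval [start, end); not Accumulo's Range class. */
  static final class KeyRange {
    final String start;
    final String end;

    KeyRange(String start, String end) {
      this.start = start;
      this.end = end;
    }

    boolean contains(String key) {
      return key.compareTo(start) >= 0 && key.compareTo(end) < 0;
    }

    /** Overlap of the two ranges, or empty when they do not overlap. */
    Optional<KeyRange> intersect(KeyRange other) {
      String s = start.compareTo(other.start) >= 0 ? start : other.start;
      String e = end.compareTo(other.end) <= 0 ? end : other.end;
      return s.compareTo(e) < 0 ? Optional.of(new KeyRange(s, e)) : Optional.empty();
    }
  }

  /** Toy key encoding: row and family joined by a NUL character so keys sort by row first. */
  static String key(String row, String family) {
    return row + '\0' + family;
  }

  /** Every key of one row: from row+NUL up to, but excluding, row followed by char 1. */
  static KeyRange wholeRow(String row) {
    return new KeyRange(row + '\0', row + '\1');
  }

  public static void main(String[] args) {
    // The caller's seek range starts in the middle of row 0003, at the loc family.
    KeyRange seekRange = new KeyRange(key("0003", "loc"), key("0009", ""));
    // Re-seeking with the whole row would also cover 0003/ecomp, which sorts before loc.
    KeyRange rowRange = wholeRow("0003");
    KeyRange clipped = rowRange.intersect(seekRange).orElseThrow();

    System.out.println(rowRange.contains(key("0003", "ecomp"))); // true
    System.out.println(clipped.contains(key("0003", "ecomp"))); // false
    System.out.println(clipped.contains(key("0003", "loc"))); // true
  }
}
```

Seeking with the clipped range instead of the whole row keeps the second seek inside what the caller asked for. A disjoint result would need its own handling, which this sketch models with an empty `Optional`.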
