optimizes tablet metadata filter #5615

keith-turner · 2025-06-04T00:54:33Z

TabletMetadataFilter had a set of columns needed for filtering available to it but did not use them. Modified it to use the columns during filtering. The comment in the change provides more details.

Attempted to modify RowFilter to support this, but the way it uses its deep copy is inverted from what is needed for this case.

TabletMetadataFilter had a set of columns needed for filtering available to it but did not use them. Modified it to use the columns during filtering. The comment in the change provides more details. Attempted to modify RowFilter to support this, but the way it uses its deep copy is inverted from what is needed for this case.

keith-turner · 2025-06-04T16:28:25Z

Hoping this change can really speed up scans for some of the columns like ecomp, wals, migrations etc. These columns are usually not present in most tablets. Want to eventually test this further w/ lots of tablets. May modify SplitMillionIT locally to run some performance test.

DomGarguilo · 2025-06-04T16:45:29Z

Want to eventually test this further w/ lots of tablets. May modify SplitMillionIT locally to run some performance test.

I haven't looked at the code here yet but am wondering if the cingest manysplits application in accumulo-testing might give some rough performance measurements too.

keith-turner · 2025-06-06T15:52:20Z

These changes assume RFile can quickly skip an entire file column family when its not present in the file. Did some local testing to make sure that assumption is correct and found it is. Would still want to do an end to end testing, but hoping that metadata tablets w/o the family can be quickly skipped w/ this change. This is the test I wrote.

public class RFilePerfTest {
  public static void main(String[] args) throws IOException {
    Random rand = new Random();
    Files.deleteIfExists(Path.of("/tmp/test.rf"));
    try (var writer = RFile.newWriter().to("file:///tmp/test.rf").build()) {
      writer.startNewLocalityGroup("LG1", "loc", "ecomp");
      for (int i = 0; i < 10_000_000; i++) {
        String row = String.format("%09x", i);
        int port = rand.nextInt(1 << 16);
        writer.append(new Key(row, "loc", "127.0.0.1:" + port), new Value(""));
      }
      writer.startDefaultLocalityGroup();
      for (int i = 0; i < 10_000_000; i++) {
        String row = String.format("%09x", i);
        writer.append(new Key(row, "tab", "pr"), new Value(String.format("%09x", i - 1)));
      }
    }

    try (var scanner = RFile.newScanner().from("file:///tmp/test.rf").build()) {
      for (String family : List.of("loc","ecomp","tab","migration")) {
        scanner.setRange(new Range());
        scanner.clearColumns();
        scanner.fetchColumnFamily(new Text(family));
        long t1 = System.currentTimeMillis();
        long size = Iterables.size(scanner);
        long t2 = System.currentTimeMillis();
        System.out.printf("family:%10s size:%,d time:%,d\n", family, size, t2-t1);
      }
    }
  }
}

running this, seeing the following times. Experimented w/ different sizes of rfiles to ensure the family non-present times stayed around 1 to 2 ms. The ecomp family is not present in locality group LG1 and the migration family is not present in the default locality group.

family:       loc size:10,000,000 time:2,086
family:     ecomp size:0 time:2
family:       tab size:10,000,000 time:1,723
family: migration size:0 time:1

DomGarguilo · 2025-12-24T18:30:50Z

core/src/main/java/org/apache/accumulo/core/metadata/schema/filters/TabletMetadataFilter.java

+      var row = nextTablet.getKeyValues().get(0).getKey().getRow();
+      // now that a row was found by our serach iterator, seek the main iterator with all the
+      // columns for that row range
+      super.seek(new Range(row), seekFamilies, seekInclusive);


I'm not sure if I'm interpreting things correctly here, but could seeking to a row potentially ignore the range that was input in the main seek() method? Like if the caller supplied a range that begins or ends mid-row, could this code make it so that we return data outside the requested range? This may not even be an issue if so. Just thought I would double check

That does seem like it could happen. Maybe I can intersect the seek range and the row range.

keith-turner added this to the 4.0.0 milestone Jun 4, 2025

format code

09d6624

DomGarguilo reviewed Dec 24, 2025

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

optimizes tablet metadata filter #5615

optimizes tablet metadata filter #5615

keith-turner commented Jun 4, 2025

Uh oh!

keith-turner commented Jun 4, 2025

Uh oh!

DomGarguilo commented Jun 4, 2025

Uh oh!

keith-turner commented Jun 6, 2025 •

edited

Loading

Uh oh!

DomGarguilo Dec 24, 2025

Uh oh!

keith-turner Jan 5, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

optimizes tablet metadata filter #5615

Are you sure you want to change the base?

optimizes tablet metadata filter #5615

Conversation

keith-turner commented Jun 4, 2025

Uh oh!

keith-turner commented Jun 4, 2025

Uh oh!

DomGarguilo commented Jun 4, 2025

Uh oh!

keith-turner commented Jun 6, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

DomGarguilo Dec 24, 2025

Choose a reason for hiding this comment

Uh oh!

keith-turner Jan 5, 2026

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

keith-turner commented Jun 6, 2025 •

edited

Loading