-
Notifications
You must be signed in to change notification settings - Fork 472
optimizes tablet metadata filter #5615
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
optimizes tablet metadata filter #5615
Conversation
TabletMetadataFilter had a set of columns needed for filtering available to it but did not use them. Modified it to use the columns during filtering. The comment in the change provides more details. Attempted to modify RowFilter to support this, but the way it uses its deep copy is inverted from what is needed for this case.
|
Hoping this change can really speed up scans for some of the columns like ecomp, wals, migrations etc. These columns are usually not present in most tablets. Want to eventually test this further w/ lots of tablets. May modify SplitMillionIT locally to run some performance test. |
I haven't looked at the code here yet but am wondering if the |
|
These changes assume RFile can quickly skip an entire file column family when its not present in the file. Did some local testing to make sure that assumption is correct and found it is. Would still want to do an end to end testing, but hoping that metadata tablets w/o the family can be quickly skipped w/ this change. This is the test I wrote. public class RFilePerfTest {
public static void main(String[] args) throws IOException {
Random rand = new Random();
Files.deleteIfExists(Path.of("/tmp/test.rf"));
try (var writer = RFile.newWriter().to("file:///tmp/test.rf").build()) {
writer.startNewLocalityGroup("LG1", "loc", "ecomp");
for (int i = 0; i < 10_000_000; i++) {
String row = String.format("%09x", i);
int port = rand.nextInt(1 << 16);
writer.append(new Key(row, "loc", "127.0.0.1:" + port), new Value(""));
}
writer.startDefaultLocalityGroup();
for (int i = 0; i < 10_000_000; i++) {
String row = String.format("%09x", i);
writer.append(new Key(row, "tab", "pr"), new Value(String.format("%09x", i - 1)));
}
}
try (var scanner = RFile.newScanner().from("file:///tmp/test.rf").build()) {
for (String family : List.of("loc","ecomp","tab","migration")) {
scanner.setRange(new Range());
scanner.clearColumns();
scanner.fetchColumnFamily(new Text(family));
long t1 = System.currentTimeMillis();
long size = Iterables.size(scanner);
long t2 = System.currentTimeMillis();
System.out.printf("family:%10s size:%,d time:%,d\n", family, size, t2-t1);
}
}
}
}running this, seeing the following times. Experimented w/ different sizes of rfiles to ensure the family non-present times stayed around 1 to 2 ms. The ecomp family is not present in locality group LG1 and the migration family is not present in the default locality group. |
| var row = nextTablet.getKeyValues().get(0).getKey().getRow(); | ||
| // now that a row was found by our serach iterator, seek the main iterator with all the | ||
| // columns for that row range | ||
| super.seek(new Range(row), seekFamilies, seekInclusive); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure if I'm interpreting things correctly here, but could seeking to a row potentially ignore the range that was input in the main seek() method? Like if the caller supplied a range that begins or ends mid-row, could this code make it so that we return data outside the requested range? This may not even be an issue if so. Just thought I would double check
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That does seem like it could happen. Maybe I can intersect the seek range and the row range.
TabletMetadataFilter had a set of columns needed for filtering available to it but did not use them. Modified it to use the columns during filtering. The comment in the change provides more details.
Attempted to modify RowFilter to support this, but the way it uses its deep copy is inverted from what is needed for this case.