
New implementation of bulk insert#8990

Open
hvlad wants to merge 3 commits into master from work/bulk_insert

Conversation

@hvlad
Member

@hvlad hvlad commented Apr 15, 2026

In Firebird 5, parallel restore was introduced. It contains "bulk insert" code that allows concurrent writers to lower contention on pointer pages (PP). It also makes each writer use its own dedicated data page (DP) to fill with records.

With the shared metadata cache in v6 that code became broken, and parallel restore creates a lot of unused data pages. Instead of fixing that first attempt at a bulk insert ability, I offer a new approach that fixes the issue, has better performance, and could be used more widely.

Note, the patch doesn't remove the old code; that could be done a bit later, after agreement on the new code.

@hvlad hvlad self-assigned this Apr 15, 2026
@dyemanov
Member

Some details about how the new approach is different from the old one would be appreciated.

@hvlad
Member Author

hvlad commented Apr 15, 2026

The insert code path now avoids a lot of unnecessary steps, such as triggers, validations (except NOT NULL at the field level) and index maintenance - all of these are absent during restore. Also, all records are inserted into a dedicated in-memory buffer to avoid endless latches on data page buffers. When the in-memory buffer is full, its contents are copied into actual DB buffers and go to disk in the usual way. The in-memory buffer size starts at 1 page and is resized to 8 pages when the first 8 pages of the relation are filled. Also, blob contents are put into a separate in-memory buffer that works in the same way.

The code is put into two main classes:

  • BulkInsert - implements the in-memory buffers and the code that works with records (mostly an analog/copy of some DPM parts), and
  • BulkInsertNode - replaces StoreNode and contains some further optimizations, such as pre-calculated target descriptors.
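The growth-and-flush scheme described above can be sketched as follows (a minimal, self-contained illustration with hypothetical names and a hypothetical page size - not the actual BulkInsert code):

```cpp
#include <cstddef>
#include <cstring>
#include <vector>

// Illustrative constants, not Firebird's actual values.
constexpr std::size_t PAGE_SIZE = 8192;
constexpr std::size_t MAX_PAGES = 8;

class InsertBuffer
{
public:
    InsertBuffer() : m_buffer(PAGE_SIZE), m_used(0), m_pagesFilled(0) {}

    // Append a record; flush when it no longer fits.
    // Assumes len never exceeds the buffer capacity.
    void put(const void* data, std::size_t len)
    {
        if (m_used + len > m_buffer.size())
            flush();
        std::memcpy(m_buffer.data() + m_used, data, len);
        m_used += len;
    }

    void flush()
    {
        if (!m_used)
            return;
        // In the real code the contents would be copied into actual DB
        // buffers here; this sketch only accounts for the flushed pages.
        m_pagesFilled += m_buffer.size() / PAGE_SIZE;
        m_used = 0;

        // Start with one page, grow to eight once the first eight
        // pages of the relation have been filled.
        if (m_pagesFilled >= MAX_PAGES && m_buffer.size() < MAX_PAGES * PAGE_SIZE)
            m_buffer.resize(MAX_PAGES * PAGE_SIZE);
    }

    std::size_t capacityPages() const { return m_buffer.size() / PAGE_SIZE; }
    std::size_t pagesFilled() const { return m_pagesFilled; }

private:
    std::vector<char> m_buffer;
    std::size_t m_used;
    std::size_t m_pagesFilled;
};
```

The real code copies the flushed contents into the page cache under normal latching; here the flush merely counts pages, which is enough to show the 1-page-to-8-pages growth.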

@AlexPeshkoff
Member

AlexPeshkoff commented Apr 15, 2026 via email

@hvlad
Member Author

hvlad commented Apr 15, 2026

But what about relPages->rel_*_data_space stored at database, not attachment, level? From the new code it seems that they should be moved back from database to attachment? Am I missing something?

These fields are not used in the bulk insert code and thus were not affected by this patch.

At next step I'm going to:

leave as is:

	ULONG rel_index_root;		// index root page number
	USHORT rel_pg_space_id;

make atomic:

	ULONG rel_data_pages;		// count of relation data pages
	ULONG rel_slot_space;		// lowest pointer page with slot space
	ULONG rel_pri_data_space;	// lowest pointer page with primary data page space
	ULONG rel_sec_data_space;	// lowest pointer page with secondary data page space

remove:

	ULONG rel_last_free_pri_dp;	// last primary data page found with space
	ULONG rel_last_free_blb_dp;	// last blob data page found with space
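The "make atomic" step could look roughly like this (a hypothetical sketch whose struct and member names only mirror the fields listed above - this is not the actual patch): concurrently updated fields become std::atomic<ULONG>, and a writer publishes a lower pointer page number via a compare-and-swap loop, so the shared database-level object needs no mutex:

```cpp
#include <atomic>
#include <cstdint>

using ULONG = std::uint32_t;

// Illustrative sketch, not Firebird's actual RelationPages.
struct RelationPages
{
    // left as is: effectively written once
    ULONG rel_index_root = 0;               // index root page number

    // made atomic: updated concurrently by parallel writers
    std::atomic<ULONG> rel_data_pages{0};   // count of relation data pages
    std::atomic<ULONG> rel_slot_space{0};   // lowest PP with slot space

    // Publish a lower pointer page with free space, but only if it is
    // still an improvement over the current value (CAS loop).
    void lowerSlotSpace(ULONG pp)
    {
        ULONG cur = rel_slot_space.load(std::memory_order_relaxed);
        while (pp < cur &&
               !rel_slot_space.compare_exchange_weak(cur, pp,
                                                     std::memory_order_relaxed))
        {
            // cur is reloaded by compare_exchange_weak on failure
        }
    }
};
```

A plain fetch_add suffices for the page counter; only the "lowest page with space" fields need the conditional CAS, because a concurrent writer may have already published a lower value.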

@dyemanov dyemanov self-requested a review April 15, 2026 16:54
@aafemt
Contributor

aafemt commented Apr 16, 2026

I think the term "bulk insert" is a little misleading here. Oracle uses "direct-path insert" for the path where data goes straight into data pages.

@hvlad
Member Author

hvlad commented May 1, 2026

@dyemanov: did you have a chance to look at it?

@dyemanov
Member

dyemanov commented May 1, 2026

@dyemanov: did you have a chance to look at it?

Halfway done, hopefully to be completed during the weekend.

Member

@dyemanov dyemanov left a comment


The only thing (besides my other comments) that I don't really like is the code duplication between dpm.epp and BulkInsert.cpp. For example, storage of a big record seems to share about 80% of the code. Do you think it could be improved (e.g. by adding separate routines to dpm.epp that deal with a preallocated page buffer)?
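One way to read that suggestion (a hypothetical sketch, not the actual dpm.epp interface): factor the record-storage logic into a routine that operates on whatever page buffer it is handed, so the regular DPM path passes a latched cache page while BulkInsert passes its preallocated in-memory buffer:

```cpp
#include <cstddef>
#include <cstring>
#include <vector>

// Hypothetical factoring of the shared "store into a page buffer"
// logic out of dpm.epp; all names are illustrative.
struct PageBuffer
{
    std::vector<char> data;
    std::size_t used = 0;

    std::size_t free() const { return data.size() - used; }
};

// Shared routine: copy as much of the record as fits into the supplied
// buffer and report how much was consumed. Both callers would invoke
// this in a loop, each supplying its own kind of buffer.
std::size_t storeFragment(PageBuffer& page, const char* rec, std::size_t len)
{
    const std::size_t n = len < page.free() ? len : page.free();
    std::memcpy(page.data.data() + page.used, rec, n);
    page.used += n;
    return n;
}
```

With this shape the big-record loop lives in one place, and the difference between the two paths reduces to how a full buffer is obtained and released.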

Comment threads: src/dsql/StmtNodes.cpp (7 threads, 6 outdated), src/jrd/ods.h, src/jrd/BulkInsert.cpp
@hvlad
Member Author

hvlad commented May 3, 2026

The only thing (besides my other comments) that I don't really like is the code duplication between dpm.epp and BulkInsert.cpp. For example, storage of a big record seems to share about 80% of the code.

Yes, this is not an easy question and I considered it, of course. The current code is the best compromise I can think of.
I think the amount of duplicated code is relatively small.
Do you have a clear idea how to avoid duplicating code without creating additional branches in the existing one?

Do you think it could be improved (e.g. by adding separate routines to dpm.epp that deal with a preallocated page buffer)?

You mean - move (half of) BulkInsert into DPM? :)

@hvlad
Member Author

hvlad commented May 3, 2026

I'll implement the suggested changes and re-test, then commit them.

@aafemt
Contributor

aafemt commented May 4, 2026

Will this feature be documented for usage or remain internal only?

@hvlad
Member Author

hvlad commented May 4, 2026

About the cleanup of the old "bulk insert" code - opinions, please: do it in this PR, or later, after the merge?

@dyemanov
Member

dyemanov commented May 4, 2026

About the cleanup of the old "bulk insert" code - opinions, please: do it in this PR, or later, after the merge?

I'd say we need support for bulk insert into indices, so that the new implementation could be utilized by the Batch API, for example. Only then should the old code be wiped out.

@hvlad
Member Author

hvlad commented May 4, 2026

About the cleanup of the old "bulk insert" code - opinions, please: do it in this PR, or later, after the merge?

I'd say we need support for bulk insert into indices, so that the new implementation could be utilized by the Batch API, for example. Only then should the old code be wiped out.

The old code has no relation to indices. I don't get your point here.

@dyemanov
Member

dyemanov commented May 4, 2026

The old code has no relation to indices. I don't get your point here.

But it allows the Batch API to work faster than regular inserts, right? I mean the logic behind RPB_s_bulk, which affects DPM. If it's removed, the Batch API will become slower unless it can use the new implementation instead. But it cannot, because the new implementation does not support indices. Am I missing anything?

@hvlad
Member Author

hvlad commented May 4, 2026

The old code has no relation to indices. I don't get your point here.

But it allows the Batch API to work faster than regular inserts, right?

No.

I mean the logic behind RPB_s_bulk, which affects DPM. If it's removed, the Batch API will become slower unless it can use the new implementation instead. But it cannot, because the new implementation does not support indices. Am I missing anything?

The old code was introduced to make parallel inserts faster. Single-threaded inserts were not affected.
Also, the old code doesn't currently work.

@dyemanov
Member

dyemanov commented May 4, 2026

Then I'd perform the cleanup in this PR.



4 participants