diff --git a/tree/ntuple/doc/Merging.md b/tree/ntuple/doc/Merging.md index 5cebd6483613c..e69a924e47012 100644 --- a/tree/ntuple/doc/Merging.md +++ b/tree/ntuple/doc/Merging.md @@ -15,11 +15,13 @@ Please note that the RNTupleMerger is currently experimental and the content of Currently there is no guarantee for the user about which mode will be used to generate the merged RNTuple. At the moment, this is how it works: -- if both compression and encoding of the target column match those of the source column, L1 is used; -- otherwise, if compression matches but encoding doesn't, L2 is used; -- otherwise L3 is used. +- if the compression of the target column matches that of the source column, L1 is used; +- otherwise, L2 is used. -Note that L0 and L4 are currently never used. +L0, L3 and L4 are currently never used. + +**NOTE**: prior to ROOT 6.42, if two columns had the same compression but different encoding, they would undergo L3 merging (implying a recompression and resealing); +from 6.42 onwards the RNTupleMerger will instead attach a new column to the parent field as a new representation and L1-merge them. ## Goal The goal of the RNTuple merging process is producing one output RNTuple from *N* input RNTuples that can be used as if it were produced directly in the merged state.
This means that: @@ -44,15 +46,16 @@ Consequences of R3 and R4: The following properties are currently true but they are subject to change: * P1: all output pages have the **same compression** (which may be different from the input pages' compression); -* P2: all pages in the same output column have the **same encoding** (which may be different from the inputs' encoding); -* P3: the output clusters are **the same as the input clusters**; -* P4: the output RNTuple **always has 1 cluster group** +* P2: the output clusters are **the same as the input clusters**; +* P3: the output RNTuple **always has 1 cluster group** + +Note that these properties influence and are influenced by the level of merging used. +E.g. P1 is currently true because we only support L1 merging of pages with identical compressions. This is a limitation that we intend to lift at some point (both for L1 and L0 if we ever support it). +P2 and P3 would not necessarily be true with L4 support (which might be desirable in some cases, e.g. to group pages into smaller/larger clusters). -Note that these properties influence and are influenced by the level of merging used. -E.g. P1 and P2 are currently true because we only support L1 merging of pages with identical compressions. This is a limitation that we intend to lift at some point (both for L1 and L0 if we ever support it). -P3 and P4 would not necessarily be true with L4 support (which might be desirable in some cases, e.g. to group pages into smaller/larger clusters). +Also note that the output pages coming from matching columns of a field may use mixed encodings. -Therefore we *will* want to drop these properties at some point, in order to improve the capabilities of the Merger. +Therefore we *will* want to drop at least some of these properties at some point, in order to improve the capabilities of the Merger. ## High-level description The merging process requires at least 1 input, in the form of an `RPageSource`. 
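The compression-only rule described at the top of this document can be sketched as a tiny standalone decision function. All names here (`EMergeLevel`, `ChooseMergeLevel`) are illustrative and not part of the ROOT API:

```cpp
#include <cstdint>

// Illustrative sketch of the merge-level choice described above: only the
// compression settings of the matching source/destination columns are compared.
// An encoding mismatch no longer forces a recompression; instead the merger
// attaches the source encoding as a new column representation and still L1-merges.
enum class EMergeLevel { kL1 = 1, kL2 = 2 };

inline EMergeLevel ChooseMergeLevel(std::uint32_t srcCompression, std::uint32_t dstCompression)
{
   if (srcCompression == dstCompression)
      return EMergeLevel::kL1; // fast path: sealed pages are copied verbatim
   return EMergeLevel::kL2;    // pages are unzipped and re-zipped with the target settings
}
```

Prior to 6.42 the same function would have taken the encoding into account as well, routing encoding mismatches to an L3 (reseal) path.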
@@ -64,14 +67,15 @@ In `Union` mode only, we allow any subsequent input RNTuple to define new fields ## Descriptor compatibility and validation Whenever a new input is processed, we compare its descriptor with the output descriptor to verify that merging is possible. -The comparison function does 3 main things: +The comparison function does 4 main things: - collect all "extra destination fields" (i.e. fields that exist in the output but not in this input RNTuple) - collect all "extra source fields" from the input RNTuple -- collect and validate all common fields. +- collect and validate all common fields +- collect all columns that need to be extended with additional representations. -If the Merging Mode is set to **Filter** we require the "extra destination fields" list to be empty. -If the Merging Mode is set to **Strict** we require both the "extra destination fields" and "extra source fields" lists to be empty. -If the Merging Mode is set to **Union**, the "extra source fields" list is used to late model extend the destination model. +If the merging mode is set to **Filter** we require the "extra destination fields" list to be empty. +If the merging mode is set to **Strict** we require both the "extra destination fields" and "extra source fields" lists to be empty. +If the merging mode is set to **Union**, the "extra source fields" list is used to late model extend the destination model. As for common fields, they are matched by name and validated as follows: - any field that is projected in the destination must be also projected in the source and must be projected to the same field; @@ -90,3 +94,27 @@ As for common fields, they are matched by name and validated as follows: 1: these restrictions will likely not be required for L4 merging. + +## Column representation extension +In all merging modes, we allow new column representations to be attached to the source fields. 
This is done to allow for L1 merging of columns with different encodings, which would otherwise require recompressing. +These new column representations are added to the output RNTuple's footer and become part of its Schema Extension section. Note that in general these columns will be added as deferred *and* suppressed. + +**Technical note**: this is *not* done via the regular late model extension API, but uses internal functionality. + +We add new (physical) column representations in the following cases: + +- when one or more columns of a field have a different type than their matching counterparts in the destination RNTuple; +- when one or more columns of a field have the same type but different metadata than their matching counterparts in the destination RNTuple (e.g. in case of a Real32Quant column, different bit width or value range). + +Whenever we extend a physical column that is referred to by one or more alias columns in some projected fields, we also add a corresponding new alias column in those fields. + +### Example +Suppose we merge source RNTuples **S1** and **S2**, each with the following fields: + +1. `foo` of type `int` +1. `fooProj` projecting onto field `foo` + +Suppose that S1 is compressed and thus its `foo` field is represented by a column of type `kSplitInt32`, whereas S2 is uncompressed and its `foo` field is represented by a column of type `kInt32`. +When merging S1 and S2 we collate those two representations under the same field `foo`, so that it will now have representatives: `{kSplitInt32, kInt32}`. +At the same time, we add a second alias column to the field `fooProj`, which will now have its first column aliasing the `kSplitInt32` column (column 0 of field `foo`) and its second one aliasing the `kInt32` one (column 1 of field `foo`).
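The collation described in the example above can be sketched as a small self-contained routine (all names here are hypothetical, not the ROOT API): each source representation is looked up among the destination's representations; unmatched ones are appended as schema extensions, and a source-to-destination index mapping is recorded either way.

```cpp
#include <cstdint>
#include <optional>
#include <utility>
#include <vector>

// Illustrative stand-in for a column representation: a type tag plus metadata.
struct ColumnRepr {
   int fType = 0; // stand-in for ENTupleColumnType
   std::uint16_t fBitWidth = 0;
   bool operator==(const ColumnRepr &o) const { return fType == o.fType && fBitWidth == o.fBitWidth; }
};

struct ReprMatchResult {
   // src repr index -> dst repr index, one entry per source representation
   std::vector<std::pair<std::uint32_t, std::uint32_t>> fMappings;
   // representations that had to be appended to the destination (schema extension)
   std::vector<ColumnRepr> fExtensions;
};

// `dst` is taken by value on purpose: extensions appended while processing one
// source representation are visible when matching the following ones.
ReprMatchResult MatchRepresentations(const std::vector<ColumnRepr> &src, std::vector<ColumnRepr> dst)
{
   ReprMatchResult res;
   for (std::uint32_t s = 0; s < src.size(); ++s) {
      std::optional<std::uint32_t> match;
      for (std::uint32_t d = 0; d < dst.size(); ++d) {
         if (src[s] == dst[d]) {
            match = d;
            break;
         }
      }
      if (!match) {
         // Unknown representation: schedule it to be attached to the destination field.
         match = static_cast<std::uint32_t>(dst.size());
         dst.push_back(src[s]);
         res.fExtensions.push_back(src[s]);
      }
      res.fMappings.push_back({s, *match});
   }
   return res;
}
```

In the `foo` example, matching S2's `{kInt32}` against a destination that already has `{kSplitInt32}` yields one extension (the destination field ends up with both representations) and the mapping `0 -> 1`.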
+ diff --git a/tree/ntuple/inc/ROOT/RField/RFieldFundamental.hxx b/tree/ntuple/inc/ROOT/RField/RFieldFundamental.hxx index c0b66225023b3..4d0482601dadf 100644 --- a/tree/ntuple/inc/ROOT/RField/RFieldFundamental.hxx +++ b/tree/ntuple/inc/ROOT/RField/RFieldFundamental.hxx @@ -400,11 +400,11 @@ protected: fAvailableColumns.emplace_back(ROOT::Internal::RColumn::Create(onDiskTypes[0], 0, representationIndex)); if (onDiskTypes[0] == ROOT::ENTupleColumnType::kReal32Trunc) { const auto &fdesc = desc.GetFieldDescriptor(Base::GetOnDiskId()); - const auto &coldesc = desc.GetColumnDescriptor(fdesc.GetLogicalColumnIds()[0]); + const auto &coldesc = desc.GetColumnDescriptor(fdesc.GetLogicalColumnIds()[representationIndex]); column->SetBitsOnStorage(coldesc.GetBitsOnStorage()); } else if (onDiskTypes[0] == ROOT::ENTupleColumnType::kReal32Quant) { const auto &fdesc = desc.GetFieldDescriptor(Base::GetOnDiskId()); - const auto &coldesc = desc.GetColumnDescriptor(fdesc.GetLogicalColumnIds()[0]); + const auto &coldesc = desc.GetColumnDescriptor(fdesc.GetLogicalColumnIds()[representationIndex]); assert(coldesc.GetValueRange().has_value()); const auto [valMin, valMax] = *coldesc.GetValueRange(); column->SetBitsOnStorage(coldesc.GetBitsOnStorage()); diff --git a/tree/ntuple/inc/ROOT/RFieldBase.hxx b/tree/ntuple/inc/ROOT/RFieldBase.hxx index f82e83ea7061a..85e12aa012140 100644 --- a/tree/ntuple/inc/ROOT/RFieldBase.hxx +++ b/tree/ntuple/inc/ROOT/RFieldBase.hxx @@ -247,14 +247,15 @@ private: func(target); } - /// Translate an entry index to a column element index of the principal column and vice versa. These functions - /// take into account the role and number of repetitions on each level of the field hierarchy as follows: + /// Translate an entry index to a column element index of the principal column. 
This function + /// takes into account the role and number of repetitions on each level of the field hierarchy as follows: /// - Top level fields: element index == entry index /// - Record fields propagate their principal column index to the principal columns of direct descendant fields /// - Collection and variant fields set the principal column index of their children to 0 /// /// The column element index also depends on the number of repetitions of each field in the hierarchy, e.g., given a - /// field with type `std::array<std::array<float, 4>, 2>`, this function returns 8 for the innermost field. + /// field with type `std::array<std::array<float, 4>, 2>`, this function called with `globalIndex == 1` + /// returns 8 for the innermost field. ROOT::NTupleSize_t EntryToColumnElementIndex(ROOT::NTupleSize_t globalIndex) const; /// Flushes data from active columns diff --git a/tree/ntuple/inc/ROOT/RNTupleDescriptor.hxx b/tree/ntuple/inc/ROOT/RNTupleDescriptor.hxx index a5a736010c223..fa34b90f52c27 100644 --- a/tree/ntuple/inc/ROOT/RNTupleDescriptor.hxx +++ b/tree/ntuple/inc/ROOT/RNTupleDescriptor.hxx @@ -519,6 +519,12 @@ public: ROOT::NTupleSize_t GetFirstEntryIndex() const { return fFirstEntryIndex; } ROOT::NTupleSize_t GetNEntries() const { return fNEntries; } const RColumnRange &GetColumnRange(ROOT::DescriptorId_t physicalId) const { return fColumnRanges.at(physicalId); } + const RColumnRange *TryGetColumnRange(ROOT::DescriptorId_t physicalId) const + { + if (auto it = fColumnRanges.find(physicalId); it != fColumnRanges.end()) + return &it->second; + return nullptr; + } const RPageRange &GetPageRange(ROOT::DescriptorId_t physicalId) const { return fPageRanges.at(physicalId); } /// Returns an iterator over pairs { columnId, columnRange }. The iteration order is unspecified.
RColumnRangeIterable GetColumnRangeIterable() const; diff --git a/tree/ntuple/inc/ROOT/RNTupleMerger.hxx b/tree/ntuple/inc/ROOT/RNTupleMerger.hxx index 10885c8406756..2ace1f9b21417 100644 --- a/tree/ntuple/inc/ROOT/RNTupleMerger.hxx +++ b/tree/ntuple/inc/ROOT/RNTupleMerger.hxx @@ -118,16 +118,16 @@ class RNTupleMerger final { std::unique_ptr fModel; [[nodiscard]] - ROOT::RResult MergeCommonColumns(ROOT::Internal::RClusterPool &clusterPool, - const ROOT::RClusterDescriptor &clusterDesc, - std::span commonColumns, - const ROOT::Internal::RCluster::ColumnSet_t &commonColumnSet, - std::size_t nCommonColumnsInCluster, RSealedPageMergeData &sealedPageData, - const RNTupleMergeData &mergeData, ROOT::Internal::RPageAllocator &pageAlloc); + ROOT::RResult + MergeCommonColumns(ROOT::Internal::RClusterPool &clusterPool, const ROOT::RClusterDescriptor &clusterDesc, + std::span commonColumns, + const ROOT::Internal::RCluster::ColumnSet_t &commonColumnSet, std::size_t nCommonColumnsInCluster, + RSealedPageMergeData &sealedPageData, const RNTupleMergeData &mergeData, + ROOT::Internal::RPageAllocator &pageAlloc); [[nodiscard]] ROOT::RResult - MergeSourceClusters(ROOT::Internal::RPageSource &source, std::span commonColumns, + MergeSourceClusters(ROOT::Internal::RPageSource &source, std::span commonColumns, std::span extraDstColumns, RNTupleMergeData &mergeData); /// Creates a RNTupleMerger with the given destination. diff --git a/tree/ntuple/inc/ROOT/RPageStorage.hxx b/tree/ntuple/inc/ROOT/RPageStorage.hxx index a7c0441e3597e..49189279eb955 100644 --- a/tree/ntuple/inc/ROOT/RPageStorage.hxx +++ b/tree/ntuple/inc/ROOT/RPageStorage.hxx @@ -544,6 +544,21 @@ public: [[nodiscard]] std::unique_ptr InitFromDescriptor(const ROOT::RNTupleDescriptor &descriptor, bool copyClusters); + struct RColumnReprElement { + ENTupleColumnType fType = ENTupleColumnType::kUnknown; + // 0 means "use default". Only valid for fixed-bitwidth column types. 
+ std::uint16_t fBitWidth = 0; + std::optional<std::pair<double, double>> fValueRange; + }; + /// Adds a new column representation to the given field. + /// \return The physical id of the first newly added column. + ROOT::DescriptorId_t + AddColumnRepresentation(const ROOT::RFieldDescriptor &field, std::span<const RColumnReprElement> newRepresentation); + + /// Adds a new alias column pointing to an existing column with the given physical id to the given field. + void AddAliasColumn(const ROOT::RNTupleDescriptor &desc, const ROOT::RFieldDescriptor &field, + ROOT::DescriptorId_t physicalId); + void CommitSuppressedColumn(ColumnHandle_t columnHandle) final; void CommitPage(ColumnHandle_t columnHandle, const ROOT::Internal::RPage &page) final; void CommitSealedPage(ROOT::DescriptorId_t physicalColumnId, const RPageStorage::RSealedPage &sealedPage) final; diff --git a/tree/ntuple/src/RFieldBase.cxx b/tree/ntuple/src/RFieldBase.cxx index 38b70e5ad60b5..a3d966a677590 100644 --- a/tree/ntuple/src/RFieldBase.cxx +++ b/tree/ntuple/src/RFieldBase.cxx @@ -668,14 +668,14 @@ void ROOT::RFieldBase::Attach(std::unique_ptr child, std::stri ROOT::NTupleSize_t ROOT::RFieldBase::EntryToColumnElementIndex(ROOT::NTupleSize_t globalIndex) const { - std::size_t result = globalIndex; + ROOT::NTupleSize_t result = globalIndex; for (auto f = this; f != nullptr; f = f->GetParent()) { auto parent = f->GetParent(); if (parent && (parent->GetStructure() == ROOT::ENTupleStructure::kCollection || parent->GetStructure() == ROOT::ENTupleStructure::kVariant)) { return 0U; } - result *= std::max(f->GetNRepetitions(), std::size_t{1U}); + result *= std::max(f->GetNRepetitions(), ROOT::NTupleSize_t{1U}); } return result; } @@ -835,10 +835,7 @@ void ROOT::RFieldBase::SetColumnRepresentatives(const RColumnRepresentations::Se if (itRepresentative == std::end(validTypes)) throw RException(R__FAIL("invalid column representative")); - // don't add a duplicate representation - if (std::find_if(fColumnRepresentatives.begin(), fColumnRepresentatives.end(), -
[&r](const auto &rep) { return r == rep.get(); }) == fColumnRepresentatives.end()) - fColumnRepresentatives.emplace_back(*itRepresentative); + fColumnRepresentatives.emplace_back(*itRepresentative); } } diff --git a/tree/ntuple/src/RNTupleDescriptor.cxx b/tree/ntuple/src/RNTupleDescriptor.cxx index f4a8b72b62c8b..4d5207d9917df 100644 --- a/tree/ntuple/src/RNTupleDescriptor.cxx +++ b/tree/ntuple/src/RNTupleDescriptor.cxx @@ -961,8 +961,16 @@ ROOT::Internal::RClusterDescriptorBuilder::AddExtendedColumnRanges(const RNTuple // `ROOT::RFieldBase::EntryToColumnElementIndex()`, i.e. it is a principal column reachable from the // field zero excluding subfields of collection and variant fields. if (c.IsDeferredColumn()) { - columnRange.SetFirstElementIndex(fCluster.GetFirstEntryIndex() * nRepetitions); - columnRange.SetNElements(fCluster.GetNEntries() * nRepetitions); + if (c.GetRepresentationIndex() == 0) { + columnRange.SetFirstElementIndex(fCluster.GetFirstEntryIndex() * nRepetitions); + columnRange.SetNElements(fCluster.GetNEntries() * nRepetitions); + } else { + const auto &field = desc.GetFieldDescriptor(fieldId); + const auto firstReprColumnId = field.GetLogicalColumnIds()[c.GetIndex()]; + const auto &firstReprColumnRange = fCluster.fColumnRanges[firstReprColumnId]; + columnRange.SetFirstElementIndex(firstReprColumnRange.GetFirstElementIndex()); + columnRange.SetNElements(firstReprColumnRange.GetNElements()); + } if (!columnRange.IsSuppressed()) { auto &pageRange = fCluster.fPageRanges[physicalId]; pageRange.fPhysicalColumnId = physicalId; @@ -1380,6 +1388,14 @@ void ROOT::Internal::RNTupleDescriptorBuilder::ShiftAliasColumns(std::uint32_t o R__ASSERT(fDescriptor.fColumnDescriptors.count(c.fLogicalColumnId) == 0); fDescriptor.fColumnDescriptors.emplace(c.fLogicalColumnId, std::move(c)); } + + // Patch up column ids in the header extension + if (auto &xHeader = fDescriptor.fHeaderExtension) { + for (auto &columnId : xHeader->fExtendedColumnRepresentations) { + if 
(columnId >= fDescriptor.GetNPhysicalColumns()) + columnId += offset; + } + } } ROOT::RResult ROOT::Internal::RNTupleDescriptorBuilder::AddCluster(RClusterDescriptor &&clusterDesc) diff --git a/tree/ntuple/src/RNTupleMerger.cxx b/tree/ntuple/src/RNTupleMerger.cxx index 22eda7acfd09d..32a4c806bd6ba 100644 --- a/tree/ntuple/src/RNTupleMerger.cxx +++ b/tree/ntuple/src/RNTupleMerger.cxx @@ -270,12 +270,10 @@ try { } namespace { -// Functor used to change the compression of a page to `options.fCompressionSettings`. +// Functor used to change the compression of a page to `fCompressionSettings`. struct RChangeCompressionFunc { const RColumnElementBase &fSrcColElement; - const RColumnElementBase &fDstColElement; - const RNTupleMergeOptions &fMergeOptions; - + std::uint32_t fCompressionSettings; RPageStorage::RSealedPage &fSealedPage; ROOT::Internal::RPageAllocator &fPageAlloc; std::uint8_t *fBuffer; @@ -284,52 +282,25 @@ struct RChangeCompressionFunc { void operator()() const { - assert(fSrcColElement.GetIdentifier() == fDstColElement.GetIdentifier()); - fSealedPage.VerifyChecksumIfEnabled().ThrowOnError(); const auto bytesPacked = fSrcColElement.GetPackedSize(fSealedPage.GetNElements()); // TODO: this buffer could be kept and reused across pages + // TODO: if Zip is a no-op (compression == 0) we can unzip directly in fBuffer! 
auto unzipBuf = MakeUninitArray(bytesPacked); ROOT::Internal::RNTupleDecompressor::Unzip(fSealedPage.GetBuffer(), fSealedPage.GetDataSize(), bytesPacked, unzipBuf.get()); const auto checksumSize = fWriteOpts.GetEnablePageChecksums() * sizeof(std::uint64_t); assert(fBufSize >= bytesPacked + checksumSize); - auto nBytesZipped = ROOT::Internal::RNTupleCompressor::Zip(unzipBuf.get(), bytesPacked, - fMergeOptions.fCompressionSettings.value(), fBuffer); + auto nBytesZipped = + ROOT::Internal::RNTupleCompressor::Zip(unzipBuf.get(), bytesPacked, fCompressionSettings, fBuffer); fSealedPage = {fBuffer, nBytesZipped + checksumSize, fSealedPage.GetNElements(), fSealedPage.GetHasChecksum()}; fSealedPage.ChecksumIfEnabled(); } }; -struct RResealFunc { - const RColumnElementBase &fSrcColElement; - const RColumnElementBase &fDstColElement; - const RNTupleMergeOptions &fMergeOptions; - - RPageStorage::RSealedPage &fSealedPage; - ROOT::Internal::RPageAllocator &fPageAlloc; - std::uint8_t *fBuffer; - std::size_t fBufSize; - const ROOT::RNTupleWriteOptions &fWriteOpts; - - void operator()() const - { - auto page = RPageSource::UnsealPage(fSealedPage, fSrcColElement, fPageAlloc).Unwrap(); - RPageSink::RSealPageConfig sealConf; - sealConf.fElement = &fDstColElement; - sealConf.fPage = &page; - sealConf.fBuffer = fBuffer; - sealConf.fCompressionSettings = *fMergeOptions.fCompressionSettings; - sealConf.fWriteChecksum = fWriteOpts.GetEnablePageChecksums(); - assert(fBufSize >= fSealedPage.GetDataSize() + fSealedPage.GetHasChecksum() * sizeof(std::uint64_t)); - auto refSealedPage = RPageSink::SealPage(sealConf); - fSealedPage = refSealedPage; - } -}; - struct RTaskVisitor { std::optional &fGroup; @@ -350,23 +321,47 @@ struct RCommonField { RCommonField(const ROOT::RFieldDescriptor &src, const ROOT::RFieldDescriptor &dst) : fSrc(&src), fDst(&dst) {} }; +struct RColReprMapping { + std::uint32_t fSource; + std::uint32_t fDest; +}; + +struct RColReprExtension : RColReprMapping { + std::vector 
fSourceRepr; +}; + +template <typename T> +using FieldCollectionMap_t = std::unordered_map<const ROOT::RFieldDescriptor *, std::vector<T>>; + +static std::optional<std::uint32_t> +FindColumnReprMapping(const std::vector<RColReprMapping> &mappings, std::uint32_t sourceReprIndex) +{ + for (const auto [src, dst] : mappings) + if (src == sourceReprIndex) + return dst; + return std::nullopt; +} + struct RDescriptorsComparison { std::vector<const ROOT::RFieldDescriptor *> fExtraDstFields; std::vector<const ROOT::RFieldDescriptor *> fExtraSrcFields; std::vector<RCommonField> fCommonFields; + // For each field that has more than 1 column representation in the output model, + // maps the column representatives of the source field with those of the destination. + // The key is the destination field. + FieldCollectionMap_t<RColReprMapping> fColReprMappings; + FieldCollectionMap_t<RColReprExtension> fColReprExtensions; }; struct RColumnOutInfo { - ROOT::DescriptorId_t fColumnId; - ENTupleColumnType fColumnType; + ROOT::DescriptorId_t fColumnId = ROOT::kInvalidDescriptorId; }; -// { fully.qualified.fieldName.colInputId => colOutputInfo } +// { ".fully.qualified.fieldName.colInputIndex.colOutputReprIndex" => colOutputInfo } using ColumnIdMap_t = std::unordered_map<std::string, RColumnOutInfo>; struct RColumnInfoGroup { std::vector<RColumnMergeInfo> fExtraDstColumns; - // These are sorted by InputId std::vector<RColumnMergeInfo> fCommonColumns; }; @@ -380,14 +375,14 @@ struct RColumnMergeInfo { // e.g.
"Muon.pt.x._0" std::string fColumnName; // The column id in the source RNTuple - ROOT::DescriptorId_t fInputId; + ROOT::DescriptorId_t fInputId = kInvalidDescriptorId; // The corresponding column id in the destination RNTuple (the mapping happens in AddColumnsFromField()) - ROOT::DescriptorId_t fOutputId; - ENTupleColumnType fColumnType; + ROOT::DescriptorId_t fOutputId = kInvalidDescriptorId; + std::uint16_t fOutputReprIndex = 0; // If nullopt, use the default in-memory type std::optional fInMemoryType; - const ROOT::RFieldDescriptor *fParentFieldDescriptor; - const ROOT::RNTupleDescriptor *fParentNTupleDescriptor; + const ROOT::RFieldDescriptor *fParentFieldDescriptor = nullptr; + const ROOT::RNTupleDescriptor *fParentNTupleDescriptor = nullptr; }; // Data related to a single call of RNTupleMerger::Merge() @@ -399,6 +394,7 @@ struct RNTupleMergeData { const ROOT::RNTupleDescriptor *fSrcDescriptor = nullptr; std::vector fColumns; + // Maps input column IDs to output IDs ColumnIdMap_t fColumnIdMap; ROOT::NTupleSize_t fNumDstEntries = 0; @@ -429,33 +425,6 @@ std::ostream &operator<<(std::ostream &os, const std::optional @@ -483,8 +452,9 @@ CompareDescriptorStructure(const ROOT::RNTupleDescriptor &dst, const ROOT::RNTup } for (const auto &srcField : src.GetTopLevelFields()) { const auto dstFieldId = dst.FindFieldId(srcField.GetFieldName()); - if (dstFieldId == ROOT::kInvalidDescriptorId) + if (dstFieldId == ROOT::kInvalidDescriptorId) { res.fExtraSrcFields.push_back(&srcField); + } } // Check compatibility of common fields @@ -571,55 +541,82 @@ CompareDescriptorStructure(const ROOT::RNTupleDescriptor &dst, const ROOT::RNTup } // Require that column representations match - const auto srcNCols = field.fSrc->GetLogicalColumnIds().size(); - const auto dstNCols = field.fDst->GetLogicalColumnIds().size(); - if (srcNCols != dstNCols) { - std::stringstream ss; - ss << "Field `" << field.fSrc->GetFieldName() - << "` has a different number of columns than previously-seen field 
with the same name (old: " << dstNCols - << ", new: " << srcNCols << ")"; - errors.push_back(ss.str()); - } else { - for (auto i = 0u; i < srcNCols; ++i) { - const auto srcColId = field.fSrc->GetLogicalColumnIds()[i]; - const auto dstColId = field.fDst->GetLogicalColumnIds()[i]; - const auto &srcCol = src.GetColumnDescriptor(srcColId); - const auto &dstCol = dst.GetColumnDescriptor(dstColId); - // TODO(gparolini): currently we refuse to merge columns of different types unless they are Split/non-Split - // version of the same type, because we know how to treat that specific case. We should also properly handle - // different but compatible types. - if (srcCol.GetType() != dstCol.GetType() && - !IsSplitOrUnsplitVersionOf(srcCol.GetType(), dstCol.GetType())) { - std::stringstream ss; - ss << i << "-th column of field `" << field.fSrc->GetFieldName() - << "` has a different column type of the same column on the previously-seen field with the same name " - "(old: " - << RColumnElementBase::GetColumnTypeName(srcCol.GetType()) - << ", new: " << RColumnElementBase::GetColumnTypeName(dstCol.GetType()) << ")"; - errors.push_back(ss.str()); - } - if (srcCol.GetBitsOnStorage() != dstCol.GetBitsOnStorage()) { - std::stringstream ss; - ss << i << "-th column of field `" << field.fSrc->GetFieldName() - << "` has a different number of bits of the same column on the previously-seen field with the same " - "name " - "(old: " - << srcCol.GetBitsOnStorage() << ", new: " << dstCol.GetBitsOnStorage() << ")"; - errors.push_back(ss.str()); - } - if (srcCol.GetValueRange() != dstCol.GetValueRange()) { - std::stringstream ss; - ss << i << "-th column of field `" << field.fSrc->GetFieldName() - << "` has a different value range of the same column on the previously-seen field with the same name " - "(old: " - << srcCol.GetValueRange() << ", new: " << dstCol.GetValueRange() << ")"; - errors.push_back(ss.str()); - } - if (srcCol.GetRepresentationIndex() > 0) { + if 
(!field.fSrc->IsProjectedField()) { + const auto &srcColumns = field.fSrc->GetLogicalColumnIds(); + const auto &dstColumns = field.fDst->GetLogicalColumnIds(); + const auto srcNCols = srcColumns.size(); + const auto dstNCols = dstColumns.size(); + if (srcNCols != dstNCols) { + std::stringstream ss; + ss << "Field `" << field.fSrc->GetFieldName() + << "` has a different number of columns than previously-seen field with the same name (old: " << dstNCols + << ", new: " << srcNCols << ")"; + errors.push_back(ss.str()); + } else { + const std::uint32_t srcColCardinality = field.fSrc->GetColumnCardinality(); + const std::uint32_t dstColCardinality = field.fDst->GetColumnCardinality(); + if (srcColCardinality != dstColCardinality) { std::stringstream ss; - ss << i << "-th column of field `" << field.fSrc->GetFieldName() - << "` has a representation index higher than 0. This is not supported yet by the merger."; + ss << "Field `" << field.fSrc->GetFieldName() + << "` has a different column cardinality than previously-seen field with the same name (old: " + << dstColCardinality << ", new: " << srcColCardinality << ")"; errors.push_back(ss.str()); + } else if (srcColCardinality > 0) { + const auto srcNColReprs = srcNCols / srcColCardinality; + const auto dstNColReprs = dstNCols / dstColCardinality; + + // For each column representation of the source, check if it matches one in the descriptor. + // If so, and if it doesn't match the destination's repr index, add a mapping for it. + // If nothing matches, schedule the column representation to be added later. + // NOTE: this has quadratic complexity but the numbers involved are small so it's fine. 
+ for (auto srcReprIdx = 0u; srcReprIdx < srcNColReprs; ++srcReprIdx) { + std::int64_t matchingRepr = -1; + for (auto dstReprIdx = 0u; dstReprIdx < dstNColReprs; ++dstReprIdx) { + bool matches = true; + for (auto reprColIdx = 0u; reprColIdx < srcColCardinality; ++reprColIdx) { + const auto srcColId = srcColumns[srcReprIdx * srcColCardinality + reprColIdx]; + const auto &srcCol = src.GetColumnDescriptor(srcColId); + const auto dstColId = dstColumns[dstReprIdx * dstColCardinality + reprColIdx]; + const auto &dstCol = dst.GetColumnDescriptor(dstColId); + if (srcCol.GetType() != dstCol.GetType() || + srcCol.GetBitsOnStorage() != dstCol.GetBitsOnStorage() || + srcCol.GetValueRange() != dstCol.GetValueRange()) { + matches = false; + break; + } + } + + if (matches) { + matchingRepr = dstReprIdx; + break; + } + } + + if (errors.empty()) { + if (matchingRepr >= 0 && matchingRepr != srcReprIdx) { + // a different matching representation was found + assert(matchingRepr < std::numeric_limits<std::uint32_t>::max()); + res.fColReprMappings[field.fDst].push_back( + RColReprMapping{srcReprIdx, static_cast<std::uint32_t>(matchingRepr)}); + } else if (matchingRepr < 0) { + // this representation was not found in the destination + assert(dstNColReprs < std::numeric_limits<std::uint32_t>::max()); + std::vector<RPageSink::RColumnReprElement> newRepr; + newRepr.reserve(srcColCardinality); + for (auto reprColIdx = 0u; reprColIdx < srcColCardinality; ++reprColIdx) { + const auto srcColId = srcColumns[srcReprIdx * srcColCardinality + reprColIdx]; + const auto &srcCol = src.GetColumnDescriptor(srcColId); + auto &reprElement = newRepr.emplace_back(); + reprElement.fType = srcCol.GetType(); + reprElement.fBitWidth = srcCol.GetBitsOnStorage(); + reprElement.fValueRange = srcCol.GetValueRange(); + } + RColReprExtension extension{{srcReprIdx, static_cast<std::uint32_t>(dstNColReprs)}, newRepr}; + res.fColReprExtensions[field.fDst].push_back(extension); + res.fColReprMappings[field.fDst].push_back(extension); + } + } + } } } } @@ -657,13 +654,13 @@ CompareDescriptorStructure(const
ROOT::RNTupleDescriptor &dst, const ROOT::RNTup return ROOT::RResult(res); } -// Applies late model extension to `destination`, adding all `newFields` to it. +// Applies late model extension to `mergeData.fDestination`, adding all `descCmp.fExtraSrcFields` to it. [[nodiscard]] static ROOT::RResult -ExtendDestinationModel(std::span newFields, ROOT::RNTupleModel &dstModel, - RNTupleMergeData &mergeData, std::vector &commonFields) +ExtendDestinationModel(RDescriptorsComparison &descCmp, ROOT::RNTupleModel &dstModel, RNTupleMergeData &mergeData) { - assert(newFields.size() > 0); // no point in calling this with 0 new cols + const auto &newFields = descCmp.fExtraSrcFields; + auto &commonFields = descCmp.fCommonFields; dstModel.Unfreeze(); ROOT::Internal::RNTupleModelChangeset changeset{dstModel}; @@ -683,10 +680,19 @@ ExtendDestinationModel(std::span newFields, ROOT changeset.fAddedFields.reserve(newFields.size()); // First add all non-projected fields... for (const auto *fieldDesc : newFields) { - if (!fieldDesc->IsProjectedField()) { - auto field = fieldDesc->CreateField(*mergeData.fSrcDescriptor); - changeset.AddField(std::move(field)); + if (fieldDesc->IsProjectedField()) + continue; + + auto field = fieldDesc->CreateField(*mergeData.fSrcDescriptor); + // Explicitly set the field representatives. This prevents UpdateSchema() from changing our column + // representations via AutoAdjustColumnTypes. + ROOT::RFieldBase::ColumnRepresentation_t representatives; + for (const auto &colId : fieldDesc->GetLogicalColumnIds()) { + const auto &column = mergeData.fSrcDescriptor->GetColumnDescriptor(colId); + representatives.push_back(column.GetType()); } + field->SetColumnRepresentatives({representatives}); + changeset.AddField(std::move(field)); } // ...then add all projected fields. 
for (const auto *fieldDesc : newFields) { @@ -711,16 +717,39 @@ ExtendDestinationModel(std::span newFields, ROOT } dstModel.Freeze(); try { + // XXX: here we are connecting the new fields/columns to the sink! + // We should avoid doing that, as all other fields never get connected. + // NOTE: this calls AutoAdjustColumnTypes, but we have set the column representations of all fields + // explicitly, so it will not change it under the hood. mergeData.fDestination.UpdateSchema(changeset, mergeData.fNumDstEntries); } catch (const ROOT::RException &ex) { return R__FAIL(ex.GetError().GetReport()); } commonFields.reserve(commonFields.size() + newFields.size()); - for (const auto *field : newFields) { + // NOTE(gparolini): Insert the new fields at the beginning of `commonFields`. + // We need to make sure the extended fields appear before all other common fields for the following reason: + // in general, when we GatherColumnInfos we (potentially) assign new column output ids in field order; this + // assignment happens whenever we find new columns, which happens in 3 cases: + // 1. we are in the first source and we're adding the first set of (common) fields; + // 2. we are adding a new set of extended common fields (which come from this function); + // 3. we are adding new column representations for fields that we already had before processing this source. + // + // It's important that the output id assigned to the new columns is coherent with the order of the column descriptors + // as they appear in the header and footer. This is in turn determined by the order by which we append new columns to + // the dst descriptor during the merging process. 
+ // Since we call ExtendDestinationModel (this function) *before* adding the new column representations, it is always + // the case that the dst descriptor gets updated with the new column descriptors coming from the extended fields + // (since they are added in UpdateSchema a few lines above) before it gets updated with the extended column + // representations (which happens later in sink->AddColumnRepresentation). + // However, the new column output ids are added sequentially in *field* order in GatherColumnInfos and the fields + // containing the new column representations are already in that list from earlier! So, to make sure the new output + // ids are assigned to our extended fields first, we push them in front of the list so that they are visited first. + for (auto it = newFields.rbegin(); it != newFields.rend(); ++it) { + const auto *field = *it; const auto newFieldInDstId = mergeData.fDstDescriptor.FindFieldId(field->GetFieldName()); const auto &newFieldInDst = mergeData.fDstDescriptor.GetFieldDescriptor(newFieldInDstId); - commonFields.emplace_back(*field, newFieldInDst); + commonFields.insert(commonFields.begin(), RCommonField{*field, newFieldInDst}); } return ROOT::RResult::Success(); @@ -764,10 +793,10 @@ GenerateZeroPagesForColumns(size_t nEntriesToGenerate, std::spanGetStructure(); if (structure == ROOT::ENTupleStructure::kStreamer) { - return R__FAIL( - "Destination RNTuple contains a streamer field (" + field->GetFieldName() + - ") that is not present in one of the sources. " - "Creating a default value for a streamer field is ill-defined, therefore the merging process will abort."); + return R__FAIL("Destination RNTuple contains a streamer field (" + field->GetFieldName() + + ") that is not present in one of the sources. " + "Creating a default value for a streamer field is ill-defined, therefore the merging " + "process will abort."); } // NOTE: we cannot have a Record here because it has no associated columns.
@@ -814,7 +843,7 @@ GenerateZeroPagesForColumns(size_t nEntriesToGenerate, std::span RNTupleMerger::MergeCommonColumns(ROOT::Internal::RClusterPool &clusterPool, const ROOT::RClusterDescriptor &clusterDesc, - std::span commonColumns, + std::span commonColumns, const RCluster::ColumnSet_t &commonColumnSet, std::size_t nCommonColumnsInCluster, RSealedPageMergeData &sealedPageData, const RNTupleMergeData &mergeData, ROOT::Internal::RPageAllocator &pageAlloc) @@ -826,9 +855,11 @@ RNTupleMerger::MergeCommonColumns(ROOT::Internal::RClusterPool &clusterPool, const RCluster *cluster = clusterPool.GetCluster(clusterDesc.GetId(), commonColumnSet); // we expect the cluster pool to contain the requested set of columns, since they were - // validated by CompareDescriptorStructure(). + // validated by CompareDescriptorStructure() and MergeSourceClusters(). assert(cluster); + const std::uint32_t outCompression = mergeData.fMergeOpts.fCompressionSettings.value(); + for (size_t colIdx = 0; colIdx < nCommonColumnsInCluster; ++colIdx) { const auto &column = commonColumns[colIdx]; const auto &columnId = column.fInputId; @@ -838,9 +869,6 @@ RNTupleMerger::MergeCommonColumns(ROOT::Internal::RClusterPool &clusterPool, const auto srcColElement = column.fInMemoryType ? ROOT::Internal::GenerateColumnElement(*column.fInMemoryType, columnDesc.GetType()) : RColumnElementBase::Generate(columnDesc.GetType()); - const auto dstColElement = column.fInMemoryType - ? 
ROOT::Internal::GenerateColumnElement(*column.fInMemoryType, column.fColumnType) - : RColumnElementBase::Generate(column.fColumnType); // Now get the pages for this column in this cluster const auto &pages = clusterDesc.GetPageRange(columnId); @@ -855,24 +883,21 @@ RNTupleMerger::MergeCommonColumns(ROOT::Internal::RClusterPool &clusterPool, // L1: compression and encoding of src and dest both match: we can simply copy the page // L2: compression of dest doesn't match the src but encoding does: we must recompress the page but can avoid // resealing it. - // L3: on-disk encoding doesn't match: we need to reseal the page, which implies decompressing and recompressing + // L3: on-disk encoding doesn't match: we need to reseal the page, which implies decompressing and + // recompressing // it. - const bool compressionIsDifferent = - colRangeCompressionSettings != mergeData.fMergeOpts.fCompressionSettings.value(); - const bool needsResealing = - srcColElement->GetIdentifier().fOnDiskType != dstColElement->GetIdentifier().fOnDiskType; - const bool needsRecompressing = compressionIsDifferent || needsResealing; + const bool compressionIsDifferent = colRangeCompressionSettings != outCompression; + const bool needsRecompressing = compressionIsDifferent; if (needsRecompressing && mergeData.fMergeOpts.fExtraVerbose) { - R__LOG_INFO(NTupleMergeLog()) - << (needsResealing ? 
"Resealing" : "Recompressing") << " column " << column.fColumnName - << ": { compression: " << colRangeCompressionSettings << " => " - << mergeData.fMergeOpts.fCompressionSettings.value() - << ", onDiskType: " << RColumnElementBase::GetColumnTypeName(srcColElement->GetIdentifier().fOnDiskType) - << " => " << RColumnElementBase::GetColumnTypeName(dstColElement->GetIdentifier().fOnDiskType); + R__LOG_INFO(NTupleMergeLog()) << "Recompressing column " << column.fColumnName + << ": { compression: " << colRangeCompressionSettings << " => " + << mergeData.fMergeOpts.fCompressionSettings.value() << ", onDiskType: " + << RColumnElementBase::GetColumnTypeName( + srcColElement->GetIdentifier().fOnDiskType); } - size_t pageBufferBaseIdx = sealedPageData.fBuffers.size(); + const size_t pageBufferBaseIdx = sealedPageData.fBuffers.size(); // If the column range already has the right compression we don't need to allocate any new buffer, so we don't // bother reserving memory for them. if (needsRecompressing) @@ -921,29 +946,15 @@ RNTupleMerger::MergeCommonColumns(ROOT::Internal::RClusterPool &clusterPool, buffer = MakeUninitArray(bufSize); // clang-format off - if (needsResealing) { - RTaskVisitor{fTaskGroup}(RResealFunc{ - *srcColElement, - *dstColElement, - mergeData.fMergeOpts, - sealedPage, - *fPageAlloc, - buffer.get(), - bufSize, - mergeData.fDestination.GetWriteOptions() - }); - } else { - RTaskVisitor{fTaskGroup}(RChangeCompressionFunc{ - *srcColElement, - *dstColElement, - mergeData.fMergeOpts, - sealedPage, - *fPageAlloc, - buffer.get(), - bufSize, - mergeData.fDestination.GetWriteOptions() - }); - } + RTaskVisitor{fTaskGroup}(RChangeCompressionFunc{ + *srcColElement, + outCompression, + sealedPage, + *fPageAlloc, + buffer.get(), + bufSize, + mergeData.fDestination.GetWriteOptions() + }); // clang-format on } @@ -967,9 +978,9 @@ RNTupleMerger::MergeCommonColumns(ROOT::Internal::RClusterPool &clusterPool, // the destination's schemas. 
// The pages may be "fast-merged" (i.e. simply copied with no decompression/recompression) if the target // compression is unspecified or matches the original compression settings. -ROOT::RResult -RNTupleMerger::MergeSourceClusters(RPageSource &source, std::span commonColumns, - std::span extraDstColumns, RNTupleMergeData &mergeData) +ROOT::RResult RNTupleMerger::MergeSourceClusters(RPageSource &source, std::span commonColumns, + std::span extraDstColumns, + RNTupleMergeData &mergeData) { ROOT::Internal::RClusterPool clusterPool{source}; @@ -984,31 +995,60 @@ RNTupleMerger::MergeSourceClusters(RPageSource &source, std::span 0); - // NOTE: just because a column is in `commonColumns` it doesn't mean that each cluster in the source contains it, - // as it may be a deferred column that only has real data in a future cluster. - // We need to figure out which columns are actually present in this cluster so we only merge their pages (the - // missing columns are handled by synthesizing zero pages - see below). - size_t nCommonColumnsInCluster = commonColumns.size(); - while (nCommonColumnsInCluster > 0) { - // Since `commonColumns` is sorted by column input id, we can simply traverse it from the back and stop as - // soon as we find a common column that appears in this cluster: we know that in that case all previous - // columns must appear as well. - if (clusterDesc.ContainsColumn(commonColumns[nCommonColumnsInCluster - 1].fInputId)) - break; - --nCommonColumnsInCluster; - } - + // Deduce which columns are suppressed (cluster by cluster) by exclusion, as: + // (columns in the columnIdMap) - (columns in commonColumns which are not suppressed). + // Note that some suppressed columns may not be in commonColumns because they might not appear at all in the + // current source. 
+ FieldCollectionMap_t activeColumns; + + // NOTE: just because a column is in `commonColumns` it doesn't mean that each cluster in the source contains + // it, as it may be a deferred column that only has real data in a future cluster. We need to figure out which + // columns are actually present in this cluster so we only merge their pages (the missing columns are handled + // by synthesizing zero pages - see below). + size_t nCommonColumnsInCluster = 0; // Convert columns to a ColumnSet for the ClusterPool query RCluster::ColumnSet_t commonColumnSet; commonColumnSet.reserve(nCommonColumnsInCluster); - for (size_t i = 0; i < nCommonColumnsInCluster; ++i) - commonColumnSet.emplace(commonColumns[i].fInputId); + // Collect all common columns appearing in this cluster into commonColumnSet and reorganize commonColumns so + // that those columns are at the start of it (whereas missing columns are at its end). + // NOTE: it's fine if this scrambles the order of columns: the RNTupleSerializer will sort them by physical ID. + std::partition(commonColumns.begin(), commonColumns.end(), [&](const auto &column) { + if (const auto *colRange = clusterDesc.TryGetColumnRange(column.fInputId)) { + if (!colRange->IsSuppressed()) { + ++nCommonColumnsInCluster; + commonColumnSet.emplace(column.fInputId); + activeColumns[column.fParentFieldDescriptor].push_back(column.fOutputId); + return true; + } + } + return false; + }); - // For each cluster, the "missing columns" are the union of the extraDstColumns and the common columns - // that are not present in the cluster. We generate zero pages for all of them. - missingColumns.resize(extraDstColumns.size()); - for (size_t i = nCommonColumnsInCluster; i < commonColumns.size(); ++i) - missingColumns.push_back(commonColumns[i]); + // Commit all suppressed columns. 
+ // This is a fairly involved operation, as we need to commit all known columns that: + // a) do not appear in extraDstColumns (those are "missing", not suppressed), and + // b) do not appear in commonColumnSet (those are the active columns). + // Note that these may or may not appear in commonColumns as suppressed columns, since they may or may not be + // present in the current source. + // The only way to find all the columns is to go and get them from fColumnIdMap, which keeps track of every + // column we added to the destination so far. However, since it also contains the extraDstColumns, we need to + // specifically only query those columns that belong to a field that has at least 1 column in commonColumns + // (remember that commonColumns contains all columns associated to the common fields for this source). + for (const auto &[fieldDesc, activeIds] : activeColumns) { + const auto &fieldFQName = mergeData.fSrcDescriptor->GetQualifiedFieldName(fieldDesc->GetId()); + const auto cardinality = fieldDesc->GetColumnCardinality(); + for (auto i = 0u; i < fieldDesc->GetLogicalColumnIds().size(); ++i) { + const auto colIndex = i % cardinality; + const auto reprIndex = i / cardinality; + const auto colName = "." + fieldFQName + '.' + std::to_string(colIndex) + '.'
+ std::to_string(reprIndex); + const auto colIt = mergeData.fColumnIdMap.find(colName); + assert(colIt != mergeData.fColumnIdMap.end()); + const auto colOutId = colIt->second.fColumnId; + if (std::find(activeIds.begin(), activeIds.end(), colOutId) == activeIds.end()) { + mergeData.fDestination.CommitSuppressedColumn(ROOT::Internal::RPageStorage::ColumnHandle_t{colOutId}); + } + } + } RSealedPageMergeData sealedPageData; auto res = MergeCommonColumns(clusterPool, clusterDesc, commonColumns, commonColumnSet, nCommonColumnsInCluster, @@ -1016,6 +1056,14 @@ RNTupleMerger::MergeSourceClusters(RPageSource &source, std::span ColumnInMemoryType(std::string_view fieldT return std::nullopt; } -// Given a field, fill `columns` and `mergeData.fColumnIdMap` with information about all columns belonging to it and its -// subfields. `mergeData.fColumnIdMap` is used to map matching columns from different sources to the same output column -// in the destination. We match columns by their "fully qualified name", which is the concatenation of their ancestor -// fields' names and the column index. By this point, since we called `CompareDescriptorStructure()` earlier, we should -// be guaranteed that two matching columns will have at least compatible representations. NOTE: srcFieldDesc and -// dstFieldDesc may alias. +// Given a field, fill `columns` and `mergeData.fColumnIdMap` with information about all columns belonging to it and +// its subfields. `mergeData.fColumnIdMap` is used to map matching columns from different sources to the same output +// column in the destination. We match columns by their "fully qualified name", which is the concatenation of their +// ancestor fields' names and the column index. By this point, since we called `CompareDescriptorStructure()` +// earlier, we should be guaranteed that two matching columns will have at least compatible representations. 
+// This function is recursive as it needs to call itself on the entire subfield hierarchy of the source field. +// NOTE: srcFieldDesc and dstFieldDesc may alias. static void AddColumnsFromField(std::vector &columns, const ROOT::RNTupleDescriptor &srcDesc, + const FieldCollectionMap_t &colReprMappings, RNTupleMergeData &mergeData, const ROOT::RFieldDescriptor &srcFieldDesc, const ROOT::RFieldDescriptor &dstFieldDesc, const std::string &prefix = "") { std::string name = prefix + '.' + srcFieldDesc.GetFieldName(); + // We don't want to try and merge alias columns + if (srcFieldDesc.IsProjectedField()) + return; + const auto &columnIds = srcFieldDesc.GetLogicalColumnIds(); columns.reserve(columns.size() + columnIds.size()); - // NOTE: here we can match the src and dst columns by column index because we forbid merging fields with - // different column representations. - for (auto i = 0u; i < srcFieldDesc.GetLogicalColumnIds().size(); ++i) { - // We don't want to try and merge alias columns - if (srcFieldDesc.IsProjectedField()) - continue; + for (auto i = 0u; i < srcFieldDesc.GetLogicalColumnIds().size(); ++i) { auto srcColumnId = srcFieldDesc.GetLogicalColumnIds()[i]; const auto &srcColumn = srcDesc.GetColumnDescriptor(srcColumnId); RColumnMergeInfo info{}; - info.fColumnName = name + '.' + std::to_string(srcColumn.GetIndex()); info.fInputId = srcColumn.GetPhysicalId(); // NOTE(gparolini): the parent field is used when synthesizing zero pages, which happens in 2 situations: // 1. when adding extra dst columns (in which case we need to synthesize zero pages for the incoming src), and // 2. when merging a deferred column into an existing column (in which case we need to fill the "hole" with - // zeroes). 
For the first case srcFieldDesc and dstFieldDesc are the same (see the calling site of this function), - but for the second case they're not, and we need to pick the source field because we will then check the - column's *input* id inside fParentFieldDescriptor to see if it's a suppressed column (see + zeroes). For the first case srcFieldDesc and dstFieldDesc are the same (see the calling site of this + function), but for the second case they're not, and we need to pick the source field because we will then + check the column's *input* id inside fParentFieldDescriptor to see if it's a suppressed column (see + // GenerateZeroPagesForColumns()). info.fParentFieldDescriptor = &srcFieldDesc; // Save the parent field descriptor since this may be either the source or destination descriptor depending on @@ -1108,22 +1156,34 @@ static void AddColumnsFromField(std::vector &columns, const RO // properly walk up the field hierarchy. info.fParentNTupleDescriptor = &srcDesc; + const auto mappingsIt = colReprMappings.find(&dstFieldDesc); + std::uint16_t reprIndex = srcColumn.GetRepresentationIndex(); + if (mappingsIt != colReprMappings.end()) { + if (auto outReprIdx = FindColumnReprMapping(mappingsIt->second, reprIndex); outReprIdx) + reprIndex = *outReprIdx; + } + + info.fColumnName = name + '.' + std::to_string(srcColumn.GetIndex()) + '.' + std::to_string(reprIndex); + + ENTupleColumnType columnType = ENTupleColumnType::kUnknown; + if (auto it = mergeData.fColumnIdMap.find(info.fColumnName); it != mergeData.fColumnIdMap.end()) { + // We had already added this column to the column id map: just copy its data. info.fOutputId = it->second.fColumnId; - info.fColumnType = it->second.fColumnType; + info.fOutputReprIndex = reprIndex; } else { + // New column: assign it the next output id. info.fOutputId = mergeData.fColumnIdMap.size(); - // NOTE(gparolini): map the type of src column to the type of dst column.
- // This mapping is only relevant for common columns and it's done to ensure we keep a consistent - // on-disk representation of the same column. - // This is also important to do for first source when it is used to generate the destination sink, - // because even in that case their column representations may differ. - // e.g. if the destination has a different compression than the source, an integer column might be - // zigzag-encoded in the source but not in the destination. - auto dstColumnId = dstFieldDesc.GetLogicalColumnIds()[i]; + // NOTE(gparolini): map the representation index of src column to that of dst column. + // This mapping is only relevant for common columns and it's done to ensure we have the correct representation + // index in the output column metadata. + assert(dstFieldDesc.GetColumnCardinality() == srcFieldDesc.GetColumnCardinality()); + const auto dstColumnIndex = reprIndex * dstFieldDesc.GetColumnCardinality() + srcColumn.GetIndex(); + const auto dstColumnId = dstFieldDesc.GetLogicalColumnIds()[dstColumnIndex]; const auto &dstColumn = mergeData.fDstDescriptor.GetColumnDescriptor(dstColumnId); - info.fColumnType = dstColumn.GetType(); - mergeData.fColumnIdMap[info.fColumnName] = {info.fOutputId, info.fColumnType}; + columnType = dstColumn.GetType(); + info.fOutputReprIndex = reprIndex; + mergeData.fColumnIdMap[info.fColumnName] = RColumnOutInfo{info.fOutputId}; } if (mergeData.fMergeOpts.fExtraVerbose) { @@ -1131,12 +1191,12 @@ static void AddColumnsFromField(std::vector &columns, const RO << ", phys.id " << srcColumn.GetPhysicalId() << ", type " << RColumnElementBase::GetColumnTypeName(srcColumn.GetType()) << " -> log.id " << info.fOutputId << ", type " - << RColumnElementBase::GetColumnTypeName(info.fColumnType); + << RColumnElementBase::GetColumnTypeName(columnType); } // Since we disallow merging fields of different types, src and dstFieldDesc must have the same type name. 
assert(srcFieldDesc.GetTypeName() == dstFieldDesc.GetTypeName()); - info.fInMemoryType = ColumnInMemoryType(srcFieldDesc.GetTypeName(), info.fColumnType); + info.fInMemoryType = ColumnInMemoryType(srcFieldDesc.GetTypeName(), columnType); columns.emplace_back(info); } @@ -1146,7 +1206,7 @@ static void AddColumnsFromField(std::vector &columns, const RO for (auto i = 0u; i < srcChildrenIds.size(); ++i) { const auto &srcChild = srcDesc.GetFieldDescriptor(srcChildrenIds[i]); const auto &dstChild = mergeData.fDstDescriptor.GetFieldDescriptor(dstChildrenIds[i]); - AddColumnsFromField(columns, srcDesc, mergeData, srcChild, dstChild, name); + AddColumnsFromField(columns, srcDesc, colReprMappings, mergeData, srcChild, dstChild, name); } } @@ -1158,17 +1218,12 @@ static RColumnInfoGroup GatherColumnInfos(const RDescriptorsComparison &descCmp, { RColumnInfoGroup res; for (const ROOT::RFieldDescriptor *field : descCmp.fExtraDstFields) { - AddColumnsFromField(res.fExtraDstColumns, mergeData.fDstDescriptor, mergeData, *field, *field); + AddColumnsFromField(res.fExtraDstColumns, mergeData.fDstDescriptor, descCmp.fColReprMappings, mergeData, *field, + *field); } for (const auto &[srcField, dstField] : descCmp.fCommonFields) { - AddColumnsFromField(res.fCommonColumns, srcDesc, mergeData, *srcField, *dstField); + AddColumnsFromField(res.fCommonColumns, srcDesc, descCmp.fColReprMappings, mergeData, *srcField, *dstField); } - - // Sort the commonColumns by ID so we can more easily tell how many common columns each cluster has - // (since each cluster must contain all columns of the previous cluster plus potentially some new ones) - std::sort(res.fCommonColumns.begin(), res.fCommonColumns.end(), - [](const auto &a, const auto &b) { return a.fInputId < b.fInputId; }); - return res; } @@ -1179,9 +1234,9 @@ static void PrefillColumnMap(const ROOT::RNTupleDescriptor &desc, const ROOT::RF for (const auto &colId : fieldDesc.GetLogicalColumnIds()) { const auto &colDesc = 
desc.GetColumnDescriptor(colId); RColumnOutInfo info{}; - const auto colName = name + '.' + std::to_string(colDesc.GetIndex()); info.fColumnId = colDesc.GetLogicalId(); - info.fColumnType = colDesc.GetType(); + const auto colName = + name + '.' + std::to_string(colDesc.GetIndex()) + '.' + std::to_string(colDesc.GetRepresentationIndex()); colIdMap[colName] = info; } @@ -1191,6 +1246,25 @@ static void PrefillColumnMap(const ROOT::RNTupleDescriptor &desc, const ROOT::RF } } +static void AddColumnExtensionsInFieldOrder( + const ROOT::RFieldDescriptor &field, const ROOT::RNTupleDescriptor &desc, + const FieldCollectionMap_t &extensions, + std::vector>> &outExtensions, + std::unordered_map> &outProjectionPointees) +{ + const auto it = extensions.find(&field); + if (it != extensions.end()) + outExtensions.emplace_back(it->first, it->second); + + if (field.IsProjectedField()) + outProjectionPointees[field.GetProjectionSourceId()].push_back(&field); + + for (auto childId : field.GetLinkIds()) { + const auto &child = desc.GetFieldDescriptor(childId); + AddColumnExtensionsInFieldOrder(child, desc, extensions, outExtensions, outProjectionPointees); + } +} + RNTupleMerger::RNTupleMerger(std::unique_ptr destination, std::unique_ptr model) // TODO(gparolini): consider using an arena allocator instead, since we know the precise lifetime @@ -1231,6 +1305,10 @@ ROOT::RResult RNTupleMerger::Merge(std::span sources, const } } + // Maps projection source fields to all their projections. + std::unordered_map> projectionPointees; + bool projectionPointeesInitialized = false; + // we should have a model if and only if the destination is initialized. 
if (!!fModel != fDestination->IsInitialized()) { return R__FAIL( @@ -1282,7 +1360,7 @@ ROOT::RResult RNTupleMerger::Merge(std::span sources, const } } - // Create sink from the input model if not initialized + // Create sink and model from the input descriptor if not initialized if (!fModel) { fModel = fDestination->InitFromDescriptor(srcDescriptor.GetRef(), false /* copyClusters */); } @@ -1308,10 +1386,10 @@ ROOT::RResult RNTupleMerger::Merge(std::span sources, const } // handle extra src fields - if (descCmp.fExtraSrcFields.size()) { + if (!descCmp.fExtraSrcFields.empty()) { if (mergeOpts.fMergingMode == ENTupleMergingMode::kUnion) { // late model extension for all fExtraSrcFields in Union mode - auto res = ExtendDestinationModel(descCmp.fExtraSrcFields, *fModel, mergeData, descCmp.fCommonFields); + auto res = ExtendDestinationModel(descCmp, *fModel, mergeData); if (!res) return R__FORWARD_ERROR(res); } else if (mergeOpts.fMergingMode == ENTupleMergingMode::kStrict) { @@ -1324,6 +1402,53 @@ ROOT::RResult RNTupleMerger::Merge(std::span sources, const } } + //// Extend columns if needed + if (!descCmp.fColReprExtensions.empty()) { + if (!projectionPointeesInitialized) { + for (const auto &field : descCmp.fExtraDstFields) { + if (field->IsProjectedField()) + projectionPointees[field->GetProjectionSourceId()].push_back(field); + } + projectionPointeesInitialized = true; + } + + // We need to extend the columns in the proper order, i.e. so that they appear in the same order as + // their first representation. This is to ensure that the pages we write to the cluster are in a consistent + // order as their column descriptors. The page creation order is determined by the order of + // columnInfos.fCommonColumns, which in turn depends on the common fields order (see GatherColumnInfos). + // XXX: do we need this separate sort step? Why not just create this vector directly in + // CompareDescriptorStructure? 
+ std::vector>> colExtensions; + colExtensions.reserve(descCmp.fColReprExtensions.size()); + for (const auto &commonField : descCmp.fCommonFields) { + const auto *field = commonField.fDst; + AddColumnExtensionsInFieldOrder(*field, mergeData.fDstDescriptor, descCmp.fColReprExtensions, colExtensions, + projectionPointees); + } + for (const auto &field : descCmp.fExtraSrcFields) { + if (field->IsProjectedField()) + projectionPointees[field->GetProjectionSourceId()].push_back(field); + } + + for (const auto &[fieldDesc, extensions] : colExtensions) { + auto &mappings = descCmp.fColReprMappings[fieldDesc]; + for (const auto &extension : extensions) { + const auto firstColumnId = fDestination->AddColumnRepresentation(*fieldDesc, extension.fSourceRepr); + + // When adding new column representations to an existing field which is the source of some projected + // fields, we need to also add new alias columns to those fields so that they can point to the proper + // representation. + if (auto it = projectionPointees.find(fieldDesc->GetId()); it != projectionPointees.end()) { + for (const auto &projection : it->second) { + for (auto colIdx = 0u; colIdx < extension.fSourceRepr.size(); ++colIdx) + fDestination->AddAliasColumn(mergeData.fDstDescriptor, *projection, firstColumnId + colIdx); + } + } + mappings.push_back(extension); + } + } + } + // handle extra dst fields & common fields auto columnInfos = GatherColumnInfos(descCmp, srcDescriptor.GetRef(), mergeData); auto res = MergeSourceClusters(*source, columnInfos.fCommonColumns, columnInfos.fExtraDstColumns, mergeData); diff --git a/tree/ntuple/src/RNTupleSerialize.cxx b/tree/ntuple/src/RNTupleSerialize.cxx index 8e5f38bd7ca91..f0c197ffa18bc 100644 --- a/tree/ntuple/src/RNTupleSerialize.cxx +++ b/tree/ntuple/src/RNTupleSerialize.cxx @@ -291,6 +291,7 @@ ROOT::RResult SerializeColumnsOfFields(const ROOT::RNTupleDescrip const auto *xHeader = !forHeaderExtension ? 
desc.GetHeaderExtension() : nullptr; + std::vector columnsToSerialize; for (auto parentId : fieldList) { // If we're serializing the non-extended header and we already have a header extension (which may happen if // we load an RNTuple for incremental merging), we need to skip all the extended fields, as they need to be @@ -299,14 +300,21 @@ ROOT::RResult SerializeColumnsOfFields(const ROOT::RNTupleDescrip continue; for (const auto &c : desc.GetColumnIterable(parentId)) { - if (c.IsAliasColumn() || (xHeader && xHeader->ContainsExtendedColumnRepresentation(c.GetLogicalId()))) - continue; + if (!c.IsAliasColumn() && !(xHeader && xHeader->ContainsExtendedColumnRepresentation(c.GetLogicalId()))) + columnsToSerialize.push_back(&c); + } + } - if (auto res = SerializePhysicalColumn(c, context, *where)) { - pos += res.Unwrap(); - } else { - return R__FORWARD_ERROR(res); - } + // Make sure the columns are sorted by physical ID + std::sort(columnsToSerialize.begin(), columnsToSerialize.end(), [&context](const auto *a, const auto *b) { + return context.GetOnDiskColumnId(a->GetPhysicalId()) < context.GetOnDiskColumnId(b->GetPhysicalId()); + }); + + for (const auto *c : columnsToSerialize) { + if (auto res = SerializePhysicalColumn(*c, context, *where)) { + pos += res.Unwrap(); + } else { + return R__FORWARD_ERROR(res); } } diff --git a/tree/ntuple/src/RPageStorage.cxx b/tree/ntuple/src/RPageStorage.cxx index f5be9ec70af18..4c51b56445776 100644 --- a/tree/ntuple/src/RPageStorage.cxx +++ b/tree/ntuple/src/RPageStorage.cxx @@ -1049,6 +1049,88 @@ ROOT::Internal::RPagePersistentSink::InitFromDescriptor(const ROOT::RNTupleDescr return model; } +ROOT::DescriptorId_t +ROOT::Internal::RPagePersistentSink::AddColumnRepresentation(const ROOT::RFieldDescriptor &field, + std::span newRepresentation) +{ + assert(!field.IsProjectedField()); + assert(field.GetColumnCardinality() > 0); + assert(!field.GetLogicalColumnIds().empty()); + assert(newRepresentation.size() == 
field.GetColumnCardinality()); + + const std::size_t firstPhysicalIndex = fDescriptorBuilder.GetDescriptor().GetNPhysicalColumns(); + const std::uint16_t reprIndex = field.GetLogicalColumnIds().size() / field.GetColumnCardinality(); + + fDescriptorBuilder.ShiftAliasColumns(newRepresentation.size()); + + std::uint16_t columnIndex = 0; // index into the representation + for (auto columnRepr : newRepresentation) { + std::size_t bitsOnStorage = columnRepr.fBitWidth; + if (!bitsOnStorage) { + const auto [rangeMin, rangeMax] = ROOT::Internal::RColumnElementBase::GetValidBitRange(columnRepr.fType); + if (rangeMin != rangeMax) { + throw ROOT::RException(R__FAIL("bit width must be given for columns of variable bit width")); + } + bitsOnStorage = rangeMin; + } + + const ROOT::DescriptorId_t firstReprColumnId = field.GetLogicalColumnIds()[columnIndex]; + const auto &firstReprColumnRange = fOpenColumnRanges.at(firstReprColumnId); + const ROOT::DescriptorId_t columnId = firstPhysicalIndex + columnIndex; + + RColumnDescriptorBuilder columnBuilder; + columnBuilder.LogicalColumnId(columnId) + .PhysicalColumnId(columnId) + .FieldId(field.GetId()) + .BitsOnStorage(bitsOnStorage) + .Type(columnRepr.fType) + .Index(columnIndex) + // NOTE: marking this column as suppressed with the minus sign + .FirstElementIndex(-firstReprColumnRange.GetFirstElementIndex()) + .RepresentationIndex(reprIndex) + .ValueRange(columnRepr.fValueRange); + fDescriptorBuilder.AddColumn(columnBuilder.MakeDescriptor().Unwrap()); + + ROOT::RClusterDescriptor::RColumnRange columnRange; + columnRange.SetPhysicalColumnId(columnId); + columnRange.SetFirstElementIndex(firstReprColumnRange.GetFirstElementIndex()); + columnRange.SetNElements(0); + columnRange.SetCompressionSettings(GetWriteOptions().GetCompression()); + fOpenColumnRanges.emplace_back(columnRange); + + ROOT::RClusterDescriptor::RPageRange pageRange; + pageRange.SetPhysicalColumnId(columnId); + fOpenPageRanges.emplace_back(std::move(pageRange)); + + 
fSerializationContext.MapPhysicalColumnId(columnId); + + ++columnIndex; + } + + return firstPhysicalIndex; +} + +void ROOT::Internal::RPagePersistentSink::AddAliasColumn(const ROOT::RNTupleDescriptor &desc, + const ROOT::RFieldDescriptor &field, + ROOT::DescriptorId_t physicalId) +{ + const auto &pointedColumn = desc.GetColumnDescriptor(physicalId); + assert(!pointedColumn.IsAliasColumn()); + + const auto columnId = fDescriptorBuilder.GetDescriptor().GetNLogicalColumns(); + RColumnDescriptorBuilder columnBuilder; + columnBuilder.LogicalColumnId(columnId) + .PhysicalColumnId(physicalId) + .FieldId(field.GetId()) + .Type(pointedColumn.GetType()) + .Index(pointedColumn.GetIndex()) + .BitsOnStorage(pointedColumn.GetBitsOnStorage()) + .ValueRange(pointedColumn.GetValueRange()) + .FirstElementIndex(pointedColumn.GetFirstElementIndex()) + .RepresentationIndex(pointedColumn.GetRepresentationIndex()); + fDescriptorBuilder.AddColumn(columnBuilder.MakeDescriptor().Unwrap()); +} + void ROOT::Internal::RPagePersistentSink::CommitSuppressedColumn(ColumnHandle_t columnHandle) { fOpenColumnRanges.at(columnHandle.fPhysicalId).SetIsSuppressed(true); diff --git a/tree/ntuple/test/ntuple_merger.cxx b/tree/ntuple/test/ntuple_merger.cxx index de27ead34d3f9..c3cfe6686cbfe 100644 --- a/tree/ntuple/test/ntuple_merger.cxx +++ b/tree/ntuple/test/ntuple_merger.cxx @@ -1011,40 +1011,44 @@ TEST(RNTupleMerger, MergeLateModelExtension) { // Write two test ntuples to be merged, with different models. // Use EMergingMode::kUnion so the output ntuple has all the fields of its inputs. 
-   FileRaii fileGuard1("test_ntuple_merge_in_1.root");
+   FileRaii fileGuard1("test_ntuple_merge_lmext_in_1.root");
    {
       auto model = RNTupleModel::Create();
       auto fieldFoo = model->MakeField<std::map<std::string, int>>("foo");
-      auto fieldVfoo = model->MakeField<std::vector<int>>("vfoo");
+      auto fieldVfoo = model->MakeField<std::vector<int>[3]>("vfoo");
       auto fieldBar = model->MakeField<int>("bar");
       auto ntuple = RNTupleWriter::Recreate(std::move(model), "ntuple", fileGuard1.GetPath(), RNTupleWriteOptions());
       for (size_t i = 0; i < 10; ++i) {
          fieldFoo->insert(std::make_pair(std::to_string(i), i * 123));
-         *fieldVfoo = {(int)i * 123};
+         fieldVfoo[0] = {(int)i * 123};
+         fieldVfoo[2] = {(int)i * 345};
          *fieldBar = i * 321;
          ntuple->Fill();
      }
    }
-   FileRaii fileGuard2("test_ntuple_merge_in_2.root");
+   FileRaii fileGuard2("test_ntuple_merge_lmext_in_2.root");
    {
       auto model = RNTupleModel::Create();
       auto fieldBaz = model->MakeField<int>("baz");
       auto fieldFoo = model->MakeField<std::map<std::string, int>>("foo");
-      auto fieldVfoo = model->MakeField<std::vector<int>>("vfoo");
+      auto fieldQux = model->MakeField<int>("qux");
+      auto fieldVfoo = model->MakeField<std::vector<int>[3]>("vfoo");
       auto wopts = RNTupleWriteOptions();
       wopts.SetCompression(0);
       auto ntuple = RNTupleWriter::Recreate(std::move(model), "ntuple", fileGuard2.GetPath(), wopts);
       for (size_t i = 0; i < 10; ++i) {
          *fieldBaz = i * 567;
          fieldFoo->insert(std::make_pair(std::to_string(i), i * 765));
-         *fieldVfoo = {(int)i * 765};
+         fieldVfoo[0] = {(int)i * 765};
+         fieldVfoo[2] = {(int)i * 987};
+         *fieldQux = i * 777;
          ntuple->Fill();
       }
    }
 
    // Now merge the inputs
-   FileRaii fileGuard3("test_ntuple_merge_out.root");
+   FileRaii fileGuard3("test_ntuple_merge_lmext_out.root");
    {
       // Gather the input sources
       std::vector<std::unique_ptr<RPageSource>> sources;
@@ -1072,23 +1076,28 @@ TEST(RNTupleMerger, MergeLateModelExtension)
       auto ntuple = RNTupleReader::Open("ntuple", fileGuard3.GetPath());
       EXPECT_EQ(ntuple->GetNEntries(), 20);
       auto foo = ntuple->GetModel().GetDefaultEntry().GetPtr<std::map<std::string, int>>("foo");
-      auto vfoo = ntuple->GetModel().GetDefaultEntry().GetPtr<std::vector<int>>("vfoo");
+      auto vfoo = ntuple->GetModel().GetDefaultEntry().GetPtr<std::vector<int>[3]>("vfoo");
       auto bar = ntuple->GetModel().GetDefaultEntry().GetPtr<int>("bar");
       auto baz = ntuple->GetModel().GetDefaultEntry().GetPtr<int>("baz");
+      auto qux = ntuple->GetModel().GetDefaultEntry().GetPtr<int>("qux");
       for (int i = 0; i < 10; ++i) {
          ntuple->LoadEntry(i);
          ASSERT_EQ((*foo)[std::to_string(i)], i * 123);
-         ASSERT_EQ((*vfoo)[0], i * 123);
+         ASSERT_EQ(vfoo[0][0], i * 123);
+         ASSERT_EQ(vfoo[2][0], i * 345);
         ASSERT_EQ(*bar, i * 321);
         ASSERT_EQ(*baz, 0);
+         ASSERT_EQ(*qux, 0);
      }
      for (int i = 10; i < 20; ++i) {
         ntuple->LoadEntry(i);
         ASSERT_EQ((*foo)[std::to_string(i - 10)], (i - 10) * 765);
-         ASSERT_EQ((*vfoo)[0], (i - 10) * 765);
+         ASSERT_EQ(vfoo[0][0], (i - 10) * 765);
+         ASSERT_EQ(vfoo[2][0], (i - 10) * 987);
         ASSERT_EQ(*bar, 0);
         ASSERT_EQ(*baz, (i - 10) * 567);
+         ASSERT_EQ(*qux, (i - 10) * 777);
      }
    }
 }
@@ -1176,8 +1185,10 @@ TEST(RNTupleMerger, DifferentCompatibleRepresentations)
    auto model = RNTupleModel::Create();
    auto pFoo = model->MakeField<double>("foo");
    auto clonedModel = model->Clone();
+   auto wopts = RNTupleWriteOptions();
+   wopts.SetCompression(0);
    {
-      auto ntuple = RNTupleWriter::Recreate(std::move(model), "ntuple", fileGuard1.GetPath());
+      auto ntuple = RNTupleWriter::Recreate(std::move(model), "ntuple", fileGuard1.GetPath(), wopts);
       for (size_t i = 0; i < 10; ++i) {
          *pFoo = i * 123;
          ntuple->Fill();
@@ -1189,12 +1200,12 @@ TEST(RNTupleMerger, DifferentCompatibleRepresentations)
    {
       auto &fieldFooDbl = clonedModel->GetMutableField("foo");
       fieldFooDbl.SetColumnRepresentatives({{ROOT::ENTupleColumnType::kReal32}});
-      auto ntuple = RNTupleWriter::Recreate(std::move(clonedModel), "ntuple", fileGuard2.GetPath());
+      auto ntuple = RNTupleWriter::Recreate(std::move(clonedModel), "ntuple", fileGuard2.GetPath(), wopts);
       auto e = ntuple->CreateEntry();
       auto pFoo2 = e->GetPtr<double>("foo");
       for (size_t i = 0; i < 10; ++i) {
          *pFoo2 = i * 567;
-         ntuple->Fill();
+         ntuple->Fill(*e);
       }
    }
 
@@ -1214,30 +1225,18 @@ TEST(RNTupleMerger, DifferentCompatibleRepresentations)
    auto sourcePtrs2 = sourcePtrs;
 
    {
-      auto wopts = RNTupleWriteOptions();
-      wopts.SetCompression(0);
       auto destination = std::make_unique<RPageSinkFile>("ntuple", fileGuard3.GetPath(), wopts);
       auto opts = RNTupleMergeOptions();
       opts.fCompressionSettings = 0;
       RNTupleMerger merger{std::move(destination)};
       auto res = merger.Merge(sourcePtrs, opts);
-      // TODO(gparolini): we want to support this in the future
-      EXPECT_FALSE(bool(res));
-      if (res.GetError()) {
-         EXPECT_THAT(res.GetError()->GetReport(), testing::HasSubstr("different column type"));
-      }
-      // EXPECT_TRUE(bool(res));
+      EXPECT_TRUE(bool(res));
    }
 
    {
       auto destination = std::make_unique<RPageSinkFile>("ntuple", fileGuard4.GetPath(), RNTupleWriteOptions());
       RNTupleMerger merger{std::move(destination)};
       auto res = merger.Merge(sourcePtrs);
-      // TODO(gparolini): we want to support this in the future
-      EXPECT_FALSE(bool(res));
-      if (res.GetError()) {
-         EXPECT_THAT(res.GetError()->GetReport(), testing::HasSubstr("different column type"));
-      }
-      // EXPECT_TRUE(bool(res));
+      EXPECT_TRUE(bool(res));
    }
 }
 }
@@ -1507,6 +1506,115 @@ TEST(RNTupleMerger, MergeProjectedFieldsMultiple)
    }
 }
 
+TEST(RNTupleMerger, MergeProjectedFieldsDifferentCompression)
+{
+   // Verify that we correctly handle projected fields with different compressions
+   FileRaii fileGuard1("test_ntuple_merge_proj_diff_comp_in_1.root");
+   {
+      auto model = RNTupleModel::Create();
+      auto fieldInt = model->MakeField<int>("int");
+      auto fieldFlt = model->MakeField<float>("flt");
+      auto projIntProj = std::make_unique<RField<int>>("intProj");
+      model->AddProjectedField(std::move(projIntProj), [](const std::string &) { return "int"; });
+      auto projFltProj = std::make_unique<RField<float>>("fltProj");
+      model->AddProjectedField(std::move(projFltProj), [](const std::string &) { return "flt"; });
+      auto ntuple = RNTupleWriter::Recreate(std::move(model), "ntuple", fileGuard1.GetPath());
+      for (size_t i = 0; i < 10; ++i) {
+         *fieldInt = i * 123;
+         *fieldFlt = i * 456;
+         ntuple->Fill();
+      }
+   }
+   FileRaii fileGuard2("test_ntuple_merge_proj_diff_comp_in_2.root");
+   {
+      auto model = RNTupleModel::Create();
+      auto fieldInt = model->MakeField<int>("int");
+      auto fieldFlt = model->MakeField<float>("flt");
+      auto projIntProj = std::make_unique<RField<int>>("intProj");
+      model->AddProjectedField(std::move(projIntProj), [](const std::string &) { return "int"; });
+      auto projFltProj = std::make_unique<RField<float>>("fltProj");
+      model->AddProjectedField(std::move(projFltProj), [](const std::string &) { return "flt"; });
+      auto wopts = RNTupleWriteOptions();
+      wopts.SetCompression(0);
+      auto ntuple = RNTupleWriter::Recreate(std::move(model), "ntuple", fileGuard2.GetPath(), wopts);
+      for (size_t i = 0; i < 10; ++i) {
+         *fieldInt = (i + 10) * 123;
+         *fieldFlt = (i + 10) * 456;
+         ntuple->Fill();
+      }
+   }
+
+   FileRaii fileGuard3("test_ntuple_merge_proj_diff_comp_out.root");
+   {
+      // Gather the input sources
+      std::vector<std::unique_ptr<RPageSource>> sources;
+      sources.push_back(RPageSource::Create("ntuple", fileGuard1.GetPath(), RNTupleReadOptions()));
+      sources.push_back(RPageSource::Create("ntuple", fileGuard2.GetPath(), RNTupleReadOptions()));
+      std::vector<RPageSource *> sourcePtrs;
+      for (const auto &s : sources) {
+         sourcePtrs.push_back(s.get());
+      }
+
+      // Now merge the inputs
+      auto destination = std::make_unique<RPageSinkFile>("ntuple", fileGuard3.GetPath(), RNTupleWriteOptions());
+      RNTupleMerger merger{std::move(destination)};
+      auto res = merger.Merge(sourcePtrs);
+      EXPECT_TRUE(bool(res));
+   }
+
+   fileGuard3.PreserveFile();
+   {
+      auto ntuple1 = RNTupleReader::Open("ntuple", fileGuard1.GetPath());
+      auto ntuple2 = RNTupleReader::Open("ntuple", fileGuard2.GetPath());
+      auto ntuple3 = RNTupleReader::Open("ntuple", fileGuard3.GetPath());
+      EXPECT_EQ(ntuple1->GetNEntries() + ntuple2->GetNEntries(), ntuple3->GetNEntries());
+      const auto &desc1 = ntuple1->GetDescriptor();
+      const auto nAliasColumns1 = desc1.GetNLogicalColumns() - desc1.GetNPhysicalColumns();
+      EXPECT_EQ(nAliasColumns1, 2);
+      const auto &desc2 = ntuple2->GetDescriptor();
+      const auto nAliasColumns2 = desc2.GetNLogicalColumns() - desc2.GetNPhysicalColumns();
+      EXPECT_EQ(nAliasColumns2, 2);
+      const auto &desc3 = ntuple3->GetDescriptor();
+      const auto nAliasColumns3 = desc3.GetNLogicalColumns() - desc3.GetNPhysicalColumns();
+      EXPECT_EQ(nAliasColumns3, 4);
+
+      auto int1 = ntuple1->GetModel().GetDefaultEntry().GetPtr<int>("int");
+      auto int2 = ntuple2->GetModel().GetDefaultEntry().GetPtr<int>("int");
+      auto int3 = ntuple3->GetModel().GetDefaultEntry().GetPtr<int>("int");
+      auto intProj1 = ntuple1->GetModel().GetDefaultEntry().GetPtr<int>("intProj");
+      auto intProj2 = ntuple2->GetModel().GetDefaultEntry().GetPtr<int>("intProj");
+      auto intProj3 = ntuple3->GetModel().GetDefaultEntry().GetPtr<int>("intProj");
+
+      auto flt1 = ntuple1->GetModel().GetDefaultEntry().GetPtr<float>("flt");
+      auto flt2 = ntuple2->GetModel().GetDefaultEntry().GetPtr<float>("flt");
+      auto flt3 = ntuple3->GetModel().GetDefaultEntry().GetPtr<float>("flt");
+      auto fltProj1 = ntuple1->GetModel().GetDefaultEntry().GetPtr<float>("fltProj");
+      auto fltProj2 = ntuple2->GetModel().GetDefaultEntry().GetPtr<float>("fltProj");
+      auto fltProj3 = ntuple3->GetModel().GetDefaultEntry().GetPtr<float>("fltProj");
+
+      for (auto i = 0u; i < ntuple1->GetNEntries(); ++i) {
+         ntuple1->LoadEntry(i);
+         ntuple3->LoadEntry(i);
+         EXPECT_EQ(*int1, *int3);
+         EXPECT_EQ(*intProj1, *intProj3);
+         EXPECT_FLOAT_EQ(*flt1, *flt3);
+         EXPECT_FLOAT_EQ(*fltProj1, *fltProj3);
+         EXPECT_FLOAT_EQ(*fltProj1, *flt1);
+         EXPECT_FLOAT_EQ(*fltProj3, *flt3);
+      }
+      for (auto i = 0u; i < ntuple2->GetNEntries(); ++i) {
+         ntuple2->LoadEntry(i);
+         ntuple3->LoadEntry(ntuple1->GetNEntries() + i);
+         EXPECT_EQ(*int2, *int3);
+         EXPECT_EQ(*intProj2, *intProj3);
+         EXPECT_FLOAT_EQ(*flt2, *flt3);
+         EXPECT_FLOAT_EQ(*fltProj2, *fltProj3);
+         EXPECT_FLOAT_EQ(*fltProj2, *flt2);
+         EXPECT_FLOAT_EQ(*fltProj3, *flt3);
+      }
+   }
+}
+
 TEST(RNTupleMerger, MergeProjectedFieldsOnlyFirst)
 {
    // Merge two files where the first has a projection and the second doesn't, and verify that we can
@@ -1527,7 +1635,9 @@ TEST(RNTupleMerger, MergeProjectedFieldsOnlyFirst)
    {
       auto model = RNTupleModel::Create();
       auto fieldFoo = model->MakeField<int>("foo");
-      auto ntuple = RNTupleWriter::Recreate(std::move(model), "ntuple", fileGuard2.GetPath());
+      auto wopts = RNTupleWriteOptions();
+      wopts.SetCompression(0);
+      auto ntuple = RNTupleWriter::Recreate(std::move(model), "ntuple", fileGuard2.GetPath(), wopts);
       for (size_t i = 0; i < 10; ++i) {
          *fieldFoo = i * 123;
          ntuple->Fill();
@@ -1561,16 +1671,18 @@ TEST(RNTupleMerger, MergeProjectedFieldsOnlyFirst)
      auto ntuple1 = RNTupleReader::Open("ntuple", fileGuard1.GetPath());
      auto ntuple2 = RNTupleReader::Open("ntuple", fileGuard2.GetPath());
      auto ntuple3 = RNTupleReader::Open("ntuple", fileGuardOut.GetPath());
-      ASSERT_EQ(ntuple1->GetNEntries() + ntuple2->GetNEntries(), ntuple3->GetNEntries());
+      EXPECT_EQ(ntuple1->GetNEntries() + ntuple2->GetNEntries(), ntuple3->GetNEntries());
      const auto &desc1 = ntuple1->GetDescriptor();
      const auto &desc2 = ntuple2->GetDescriptor();
      const auto &desc3 = ntuple3->GetDescriptor();
      const auto nAliasColumns1 = desc1.GetNLogicalColumns() - desc1.GetNPhysicalColumns();
      const auto nAliasColumns2 = desc2.GetNLogicalColumns() - desc2.GetNPhysicalColumns();
      const auto nAliasColumns3 = desc3.GetNLogicalColumns() - desc3.GetNPhysicalColumns();
-      ASSERT_EQ(nAliasColumns1, 1);
-      ASSERT_EQ(nAliasColumns2, 0);
-      ASSERT_EQ(nAliasColumns3, 1);
+      EXPECT_EQ(nAliasColumns1, 1);
+      EXPECT_EQ(nAliasColumns2, 0);
+      // The output RNTuple has 2 alias columns because one was created by the merger to point to the extended
+      // column that was added to field "foo" (since source 2 had a different encoding than source 1).
+      EXPECT_EQ(nAliasColumns3, 2);
 
      auto foo1 = ntuple1->GetModel().GetDefaultEntry().GetPtr<int>("foo");
      auto foo2 = ntuple2->GetModel().GetDefaultEntry().GetPtr<int>("foo");
@@ -1582,16 +1694,16 @@ TEST(RNTupleMerger, MergeProjectedFieldsOnlyFirst)
      for (auto i = 0u; i < ntuple1->GetNEntries(); ++i) {
         ntuple1->LoadEntry(i);
         ntuple3->LoadEntry(i);
-         ASSERT_EQ(*foo1, *foo3);
-         ASSERT_EQ(*bar1, *foo3);
-         ASSERT_EQ(*bar1, *bar3);
+         EXPECT_EQ(*foo1, *foo3);
+         EXPECT_EQ(*bar1, *foo3);
+         EXPECT_EQ(*bar1, *bar3);
      }
      for (auto i = 0u; i < ntuple2->GetNEntries(); ++i) {
         ntuple2->LoadEntry(i);
         ntuple3->LoadEntry(ntuple1->GetNEntries() + i);
-         ASSERT_EQ(*foo2, *foo3);
+         EXPECT_EQ(*foo2, *foo3);
         // we should be able to read the data from the second ntuple using the projection defined in the first.
-         ASSERT_EQ(*foo2, *bar3);
+         EXPECT_EQ(*foo2, *bar3);
      }
    }
 }
@@ -2534,7 +2646,8 @@ TEST(RNTupleMerger, MergeDeferredAdvanced)
    auto model1 = RNTupleModel::Create();
    auto wopts = RNTupleWriteOptions();
    wopts.SetCompression(0);
-   auto writer1 = RNTupleWriter::Recreate(std::move(model1), "ntuple", fileGuard1.GetPath(), wopts);
+   auto tfile = TFile::Open((fileGuard1.GetPath() + "?reproducible").c_str(), "RECREATE");
+   auto writer1 = RNTupleWriter::Append(std::move(model1), "ntuple", *tfile, wopts);
    auto updater = writer1->CreateModelUpdater();
    updater->BeginUpdate();
    updater->AddField(RFieldBase::Create("flt", "float").Unwrap());
@@ -2607,7 +2720,7 @@ TEST(RNTupleMerger, MergeDeferredAdvanced)
    auto pInt = reader->GetModel().GetDefaultEntry().GetPtr<int>("int");
    auto pFlt = reader->GetModel().GetDefaultEntry().GetPtr<float>("flt");
 
-   for (auto i = 0u; i < reader->GetNEntries(); ++i) {
+   for (auto i : reader->GetEntryRange()) {
       reader->LoadEntry(i);
       float expectedFlt = (i >= 10 && i < 15) ? 0 : i;
       EXPECT_FLOAT_EQ(*pFlt, expectedFlt);
@@ -4025,3 +4138,263 @@ TEST(RNTupleMerger, MergeNewerVersion)
       }
    }
 }
+
+TEST(RNTupleMerger, MergeReal32Trunc)
+{
+   // Merge two files, both containing the same Real32Trunc-encoded field, but with different bit widths.
+   FileRaii fileGuard1("test_ntuple_merge_real32trunc_in_1.root");
+   {
+      auto model = RNTupleModel::Create();
+      auto field = std::make_unique<RField<float>>("flt");
+      field->SetTruncated(14);
+      model->AddField(std::move(field));
+      auto ntuple = RNTupleWriter::Recreate(std::move(model), "ntuple", fileGuard1.GetPath());
+      auto fieldFlt = ntuple->GetModel().GetDefaultEntry().GetPtr<float>("flt");
+      for (int i = 0; i < 10; ++i) {
+         *fieldFlt = i;
+         ntuple->Fill();
+      }
+   }
+   FileRaii fileGuard2("test_ntuple_merge_real32trunc_in_2.root");
+   {
+      auto model = RNTupleModel::Create();
+      auto field = std::make_unique<RField<float>>("flt");
+      field->SetTruncated(24);
+      model->AddField(std::move(field));
+      auto ntuple = RNTupleWriter::Recreate(std::move(model), "ntuple", fileGuard2.GetPath());
+      auto fieldFlt = ntuple->GetModel().GetDefaultEntry().GetPtr<float>("flt");
+      for (int i = 0; i < 10; ++i) {
+         *fieldFlt = 10 + i;
+         ntuple->Fill();
+      }
+   }
+   {
+      // Gather the input sources
+      std::vector<std::unique_ptr<RPageSource>> sources;
+      sources.push_back(RPageSource::Create("ntuple", fileGuard1.GetPath(), RNTupleReadOptions()));
+      sources.push_back(RPageSource::Create("ntuple", fileGuard2.GetPath(), RNTupleReadOptions()));
+      std::vector<RPageSource *> sourcePtrs;
+      for (const auto &s : sources) {
+         sourcePtrs.push_back(s.get());
+      }
+
+      // Now merge the inputs
+      for (const auto mmode : {ENTupleMergingMode::kFilter, ENTupleMergingMode::kStrict, ENTupleMergingMode::kUnion}) {
+         SCOPED_TRACE(std::string("with merging mode = ") + ToString(mmode));
+         FileRaii fileGuardOut("test_ntuple_merge_real32trunc_out.root");
+         {
+            auto destination = std::make_unique<RPageSinkFile>("ntuple", fileGuardOut.GetPath(), RNTupleWriteOptions());
+            RNTupleMerger merger{std::move(destination)};
+            RNTupleMergeOptions opts;
+            opts.fMergingMode = mmode;
+            auto res = merger.Merge(sourcePtrs, opts);
+            EXPECT_TRUE(bool(res));
+         }
+         {
+            auto reader = ROOT::RNTupleReader::Open("ntuple", fileGuardOut.GetPath());
+            EXPECT_EQ(reader->GetNEntries(), 20);
+            EXPECT_EQ(reader->GetDescriptor().GetNPhysicalColumns(), 2);
+            auto pFlt = reader->GetModel().GetDefaultEntry().GetPtr<float>("flt");
+            for (auto i : reader->GetEntryRange()) {
+               reader->LoadEntry(i);
+               EXPECT_NEAR(*pFlt, i, 0.01f);
+            }
+         }
+      }
+   }
+}
+
+TEST(RNTupleMerger, MergeReal32Quant)
+{
+   // Merge two files, both containing the same Real32Quant-encoded field, but with different value ranges.
+   FileRaii fileGuard1("test_ntuple_merge_real32quant_in_1.root");
+   {
+      auto model = RNTupleModel::Create();
+      auto field = std::make_unique<RField<float>>("flt");
+      field->SetQuantized(0., 100., 20);
+      model->AddField(std::move(field));
+      auto ntuple = RNTupleWriter::Recreate(std::move(model), "ntuple", fileGuard1.GetPath());
+      auto fieldFlt = ntuple->GetModel().GetDefaultEntry().GetPtr<float>("flt");
+      for (int i = 0; i < 10; ++i) {
+         *fieldFlt = i;
+         ntuple->Fill();
+      }
+   }
+   FileRaii fileGuard2("test_ntuple_merge_real32quant_in_2.root");
+   {
+      auto model = RNTupleModel::Create();
+      auto field = std::make_unique<RField<float>>("flt");
+      field->SetQuantized(-100., 100., 20);
+      model->AddField(std::move(field));
+      auto ntuple = RNTupleWriter::Recreate(std::move(model), "ntuple", fileGuard2.GetPath());
+      auto fieldFlt = ntuple->GetModel().GetDefaultEntry().GetPtr<float>("flt");
+      for (int i = 0; i < 10; ++i) {
+         *fieldFlt = 10 + i;
+         ntuple->Fill();
+      }
+   }
+   {
+      // Gather the input sources
+      std::vector<std::unique_ptr<RPageSource>> sources;
+      sources.push_back(RPageSource::Create("ntuple", fileGuard1.GetPath(), RNTupleReadOptions()));
+      sources.push_back(RPageSource::Create("ntuple", fileGuard2.GetPath(), RNTupleReadOptions()));
+      std::vector<RPageSource *> sourcePtrs;
+      for (const auto &s : sources) {
+         sourcePtrs.push_back(s.get());
+      }
+
+      // Now merge the inputs
+      for (const auto mmode : {ENTupleMergingMode::kFilter, ENTupleMergingMode::kStrict, ENTupleMergingMode::kUnion}) {
+         SCOPED_TRACE(std::string("with merging mode = ") + ToString(mmode));
+         FileRaii fileGuardOut("test_ntuple_merge_real32quant_out.root");
+         {
+            auto destination = std::make_unique<RPageSinkFile>("ntuple", fileGuardOut.GetPath(), RNTupleWriteOptions());
+            RNTupleMerger merger{std::move(destination)};
+            RNTupleMergeOptions opts;
+            opts.fMergingMode = mmode;
+            auto res = merger.Merge(sourcePtrs, opts);
+            EXPECT_TRUE(bool(res));
+         }
+         {
+            auto reader = ROOT::RNTupleReader::Open("ntuple", fileGuardOut.GetPath());
+            EXPECT_EQ(reader->GetNEntries(), 20);
+            EXPECT_EQ(reader->GetDescriptor().GetNPhysicalColumns(), 2);
+            auto pFlt = reader->GetModel().GetDefaultEntry().GetPtr<float>("flt");
+            for (auto i : reader->GetEntryRange()) {
+               reader->LoadEntry(i);
+               EXPECT_NEAR(*pFlt, i, 0.01f);
+            }
+         }
+      }
+   }
+}
+
+TEST(RNTupleMerger, MergeReal32TruncQuantMixed)
+{
+   // Merge two files, both containing the same field, but with the first being Real32Trunc and the second Real32Quant
+   FileRaii fileGuard1("test_ntuple_merge_real32truncquant_in_1.root");
+   {
+      auto model = RNTupleModel::Create();
+      auto field = std::make_unique<RField<float>>("flt");
+      field->SetTruncated(24);
+      model->AddField(std::move(field));
+      auto ntuple = RNTupleWriter::Recreate(std::move(model), "ntuple", fileGuard1.GetPath());
+      auto fieldFlt = ntuple->GetModel().GetDefaultEntry().GetPtr<float>("flt");
+      for (int i = 0; i < 10; ++i) {
+         *fieldFlt = i;
+         ntuple->Fill();
+      }
+   }
+   FileRaii fileGuard2("test_ntuple_merge_real32truncquant_in_2.root");
+   {
+      auto model = RNTupleModel::Create();
+      auto field = std::make_unique<RField<float>>("flt");
+      field->SetQuantized(-1., 100., 20);
+      model->AddField(std::move(field));
+      auto ntuple = RNTupleWriter::Recreate(std::move(model), "ntuple", fileGuard2.GetPath());
+      auto fieldFlt = ntuple->GetModel().GetDefaultEntry().GetPtr<float>("flt");
+      for (int i = 0; i < 10; ++i) {
+         *fieldFlt = 10 + i;
+         ntuple->Fill();
+      }
+   }
+   {
+      // Gather the input sources
+      std::vector<std::unique_ptr<RPageSource>> sources;
+      sources.push_back(RPageSource::Create("ntuple", fileGuard1.GetPath(), RNTupleReadOptions()));
+      sources.push_back(RPageSource::Create("ntuple", fileGuard2.GetPath(), RNTupleReadOptions()));
+      std::vector<RPageSource *> sourcePtrs;
+      for (const auto &s : sources) {
+         sourcePtrs.push_back(s.get());
+      }
+
+      // Now merge the inputs
+      for (const auto mmode : {ENTupleMergingMode::kFilter, ENTupleMergingMode::kStrict, ENTupleMergingMode::kUnion}) {
+         SCOPED_TRACE(std::string("with merging mode = ") + ToString(mmode));
+         FileRaii fileGuardOut("test_ntuple_merge_real32truncquant_out.root");
+         {
+            auto destination = std::make_unique<RPageSinkFile>("ntuple", fileGuardOut.GetPath(), RNTupleWriteOptions());
+            RNTupleMerger merger{std::move(destination)};
+            RNTupleMergeOptions opts;
+            opts.fMergingMode = mmode;
+            auto res = merger.Merge(sourcePtrs, opts);
+            EXPECT_TRUE(bool(res));
+         }
+         {
+            auto reader = ROOT::RNTupleReader::Open("ntuple", fileGuardOut.GetPath());
+            EXPECT_EQ(reader->GetNEntries(), 20);
+            EXPECT_EQ(reader->GetDescriptor().GetNPhysicalColumns(), 2);
+            auto pFlt = reader->GetModel().GetDefaultEntry().GetPtr<float>("flt");
+            for (auto i : reader->GetEntryRange()) {
+               reader->LoadEntry(i);
+               EXPECT_NEAR(*pFlt, i, 0.01f);
+            }
+         }
+      }
+   }
+}
+
+TEST(RNTupleMerger, MergeRealRegularQuantMixed)
+{
+   // Merge two files, both containing the same field, but with the first being a SplitReal64 and the second Real32Quant
+   FileRaii fileGuard1("test_ntuple_merge_realregquant_in_1.root");
+   {
+      auto model = RNTupleModel::Create();
+      auto fieldDbl = model->MakeField<double>("dbl");
+      auto ntuple = RNTupleWriter::Recreate(std::move(model), "ntuple", fileGuard1.GetPath());
+      for (int i = 0; i < 10; ++i) {
+         *fieldDbl = i;
+         ntuple->Fill();
+      }
+   }
+   FileRaii fileGuard2("test_ntuple_merge_realregquant_in_2.root");
+   {
+      auto model = RNTupleModel::Create();
+      auto field = std::make_unique<RField<double>>("dbl");
+      field->SetQuantized(0., 20., 29);
+      model->AddField(std::move(field));
+      auto ntuple = RNTupleWriter::Recreate(std::move(model), "ntuple", fileGuard2.GetPath());
+      auto fieldDbl = ntuple->GetModel().GetDefaultEntry().GetPtr<double>("dbl");
+      for (int i = 0; i < 10; ++i) {
+         *fieldDbl = 10 + i;
+         ntuple->Fill();
+      }
+   }
+   {
+      // Gather the input sources
+      std::vector<std::unique_ptr<RPageSource>> sources;
+      sources.push_back(RPageSource::Create("ntuple", fileGuard1.GetPath(), RNTupleReadOptions()));
+      sources.push_back(RPageSource::Create("ntuple", fileGuard2.GetPath(), RNTupleReadOptions()));
+      std::vector<RPageSource *> sourcePtrs;
+      for (const auto &s : sources) {
+         sourcePtrs.push_back(s.get());
+      }
+
+      // Now merge the inputs
+      for (const auto mmode : {ENTupleMergingMode::kFilter, ENTupleMergingMode::kStrict, ENTupleMergingMode::kUnion}) {
+         SCOPED_TRACE(std::string("with merging mode = ") + ToString(mmode));
+         FileRaii fileGuardOut("test_ntuple_merge_realregquant_out.root");
+         {
+            auto destination = std::make_unique<RPageSinkFile>("ntuple", fileGuardOut.GetPath(), RNTupleWriteOptions());
+            RNTupleMerger merger{std::move(destination)};
+            RNTupleMergeOptions opts;
+            opts.fMergingMode = mmode;
+            auto res = merger.Merge(sourcePtrs, opts);
+            EXPECT_TRUE(bool(res));
+         }
+         {
+            auto reader = ROOT::RNTupleReader::Open("ntuple", fileGuardOut.GetPath());
+            EXPECT_EQ(reader->GetNEntries(), 20);
+            EXPECT_EQ(reader->GetDescriptor().GetNPhysicalColumns(), 2);
+            auto pDbl = reader->GetModel().GetDefaultEntry().GetPtr<double>("dbl");
+            for (auto i : reader->GetEntryRange()) {
+               reader->LoadEntry(i);
+               if (i < 10)
+                  EXPECT_DOUBLE_EQ(*pDbl, i);
+               else
+                  EXPECT_NEAR(*pDbl, i, 0.01f);
+            }
+         }
+      }
+   }
+}
diff --git a/tree/ntuple/test/ntuple_multi_column.cxx b/tree/ntuple/test/ntuple_multi_column.cxx
index 7a4e6bb6eb4a4..452d143148154 100644
--- a/tree/ntuple/test/ntuple_multi_column.cxx
+++ b/tree/ntuple/test/ntuple_multi_column.cxx
@@ -376,12 +376,3 @@ TEST(RNTuple, MultiColumnRepresentationBulk)
    arr = static_cast<float *>(bulk.ReadBulk(RNTupleLocalIndex(1, 0), mask.get(), 1));
    EXPECT_FLOAT_EQ(2.0, arr[0]);
 }
-
-TEST(RNTuple, MultiColumnRepresentationDedup)
-{
-   FileRaii fileGuard("test_ntuple_multi_column_representation_dedup.root");
-
-   auto fldPx = RFieldBase::Create("px", "float").Unwrap();
-   fldPx->SetColumnRepresentatives({{ROOT::ENTupleColumnType::kReal16}, {ROOT::ENTupleColumnType::kReal16}});
-   EXPECT_EQ(fldPx->GetColumnRepresentatives().size(), 1);
-}