
Conversation

@zhangwenchao-123

  • Add the module name, JIRA# to PR/commit and description.
  • Add tests for the change.

The following two operator delete functions are not looked up in the madlib library, because they were not added to the library script file.

void operator delete  (void *ptr, std::size_t sz) noexcept;
void operator delete[](void *ptr, std::size_t sz) noexcept;

These two functions were missing previously.
set(_PG_CONFIG_VERSION_MACRO "GP_VERSION")
set(_SEARCH_PATH_HINTS
"/usr/local/cloudberry-db-devel/bin"
"/usr/local/cloudberry-db/bin"
Member

Is there a need to add /usr/local/cbdb/bin?

Author

Have we used this path?

Author

Have added path /usr/local/cloudberry/bin

Contributor

I believe the line just after this, "$ENV{GPHOME}/bin", will help catch most scenarios. Users will be sourcing cloudberry-env.sh (Cloudberry 3+) or greenplum_path.sh (Cloudberry 2).

Member

Cool idea!

Comment on lines 27 to 29
if(_PG_CONFIG_HEADER_CONTENTS MATCHES "#define SERVERLESS 1")
message("-- Detected Hashdata Cloud (Cloudberry Serverless)")
set(CLOUDBERRY_SERVERLESS TRUE PARENT_SCOPE)
Member

Remove these lines?

Author

OK

Member

There is still no Cloudberry 3.0 release yet, so can we remove this file?

Contributor

Apache MADlib should be able to build against both the REL_2_STABLE and main (3.0.0) branches. I believe it is better to keep support for Cloudberry 3.0. Since main (3.0) has not been released yet, support for 3.0 could be labelled as experimental.

Member

That makes sense. Thanks!

Author

Agree with ed.

Member

We also need to add the standard Apache license header to the new files, including FindCloudberry.cmake, FindCloudberry_1.cmake, and the other new files.

Author

fixed

@edespino edespino self-requested a review October 22, 2025 01:23
Contributor

@edespino edespino left a comment

❌ What's Missing (Critical Issues)

1. No main CMakeLists.txt for Cloudberry:
   - src/ports/cloudberry/CMakeLists.txt doesn't exist
   - This file should mirror the structure of
     src/ports/greenplum/CMakeLists.txt (13KB, ~300 lines)
   - Should define: port configuration, source files, SQL handling,
     build functions, and version management

2. Not integrated into the build system:
   - src/ports/CMakeLists.txt only contains:
       add_subdirectory(postgres)
       add_subdirectory(greenplum)
   - Missing: add_subdirectory(cloudberry)
3. No CloudberryUtils.cmake:
   - Greenplum has GreenplumUtils.cmake with utility functions
   - May need similar utilities for Cloudberry-specific features

🔍 Current State

CMake configuration completed but:
- Cloudberry was NOT detected (the FindCloudberry code was never executed)
- Only PostgreSQL and Greenplum detection ran
- Build directory shows only postgres/ and greenplum/ subdirectories

However, there IS a Cloudberry installation:
- Location: /usr/local/cloudberry/
- Version: Based on PostgreSQL 14.4 with GP_VERSION_NUM 30000 (Cloudberry v3.0.0)
- This matches the src/ports/cloudberry/3/ directory structure

📊 Summary

The Cloudberry port is partially implemented. The detection logic and
version-specific configs exist, but they're not wired into the build
system.

To complete the implementation, you would need:

1. Create src/ports/cloudberry/CMakeLists.txt (modeled after Greenplum's)
2. Add add_subdirectory(cloudberry) to src/ports/CMakeLists.txt
3. Potentially create CloudberryUtils.cmake for Cloudberry-specific features
4. Test the full build process with Cloudberry detection

@edespino
Contributor

Have you looked at the website updates (https://madlib.apache.org - https://github.com/apache/madlib-site) and other source documentation files? We will need to review these as well.

@edespino
Contributor

As @tuhaihe mentioned regarding ASF headers, when I ran the Apache Release Audit Tool (RAT), the following was seen (run this in the root of the MADlib source: mvn apache-rat:check):

❯ head -30 target/rat.txt

*****************************************************
Summary
-------
Generated at: 2025-10-21T18:46:08-07:00
Notes: 4
Binaries: 5
Archives: 0
Standards: 311

Apache Licensed: 307
Generated Documents: 0

JavaDocs are generated and so license header is optional
Generated files do not required license headers

4 Unknown Licenses

*******************************

Unapproved licenses:

  src/ports/cloudberry/cmake/FindCloudberry.cmake
  src/ports/cloudberry/cmake/FindCloudberry_1.cmake
  src/ports/cloudberry/cmake/FindCloudberry_2.cmake
  src/ports/cloudberry/cmake/FindCloudberry_3.cmake

*******************************
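For context, RAT's "Unapproved licenses" check essentially boils down to whether a file carries the ASF license grant text. A minimal stand-in for that check (this is not RAT itself, and the file names here are made up for illustration):

```python
import tempfile
import pathlib

ASF_MARKER = "Licensed to the Apache Software Foundation (ASF)"

def has_asf_header(path):
    """Crude stand-in for RAT's check: does the file mention the ASF grant?"""
    text = pathlib.Path(path).read_text(encoding="utf-8", errors="ignore")
    return ASF_MARKER in text

with tempfile.TemporaryDirectory() as d:
    # A file carrying the standard header passes the check.
    licensed = pathlib.Path(d) / "FindExample.cmake"
    licensed.write_text(
        "# Licensed to the Apache Software Foundation (ASF) under one\n"
        "# or more contributor license agreements.\n"
    )
    # A bare file like the flagged FindCloudberry*.cmake files does not.
    unlicensed = pathlib.Path(d) / "FindBare.cmake"
    unlicensed.write_text('set(_SEARCH_PATH_HINTS "/usr/local/cloudberry/bin")\n')

    print(has_asf_header(licensed))    # → True
    print(has_asf_header(unlicensed))  # → False
```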

@zhangwenchao-123
Author

Have you looked at the website updates (https://madlib.apache.org - https://github.com/apache/madlib-site) other source documentation files? We will need to review these as well.

No, have not. Should we update this website?

@tuhaihe
Member

tuhaihe commented Oct 22, 2025

Have you looked at the website updates (https://madlib.apache.org - https://github.com/apache/madlib-site) other source documentation files? We will need to review these as well.

No, have not. Should we update this website?

Yes, we should update the related description on the website. I’d like to help with this.

@zhangwenchao-123
Author

Have you looked at the website updates (https://madlib.apache.org - https://github.com/apache/madlib-site) other source documentation files? We will need to review these as well.

No, have not. Should we update this website?

Yes, we should update the related description on the website. I’d like to help with this.

Nice!

@zhangwenchao-123
Author

❌ What's Missing (Critical Issues)

1. No main CMakeLists.txt for Cloudberry:
   - src/ports/cloudberry/CMakeLists.txt doesn't exist
   - This file should mirror the structure of
     src/ports/greenplum/CMakeLists.txt (13KB, ~300 lines)
   - Should define: port configuration, source files, SQL handling,
     build functions, and version management

2. Not integrated into the build system:
   - src/ports/CMakeLists.txt only contains:
       add_subdirectory(postgres)
       add_subdirectory(greenplum)
   - Missing: add_subdirectory(cloudberry)
3. No CloudberryUtils.cmake:
   - Greenplum has GreenplumUtils.cmake with utility functions
   - May need similar utilities for Cloudberry-specific features

🔍 Current State

CMake configuration completed but:
- Cloudberry was NOT detected (the FindCloudberry code was never executed)
- Only PostgreSQL and Greenplum detection ran
- Build directory shows only postgres/ and greenplum/ subdirectories

However, there IS a Cloudberry installation:
- Location: /usr/local/cloudberry/
- Version: Based on PostgreSQL 14.4 with GP_VERSION_NUM 30000 (Cloudberry v3.0.0)
- This matches the src/ports/cloudberry/3/ directory structure

📊 Summary

The Cloudberry port is partially implemented. The detection logic and
version-specific configs exist, but they're not wired into the build
system.

To complete the implementation, you would need:

1. Create src/ports/cloudberry/CMakeLists.txt (modeled after Greenplum's)
2. Add add_subdirectory(cloudberry) to src/ports/CMakeLists.txt
3. Potentially create CloudberryUtils.cmake for Cloudberry-specific features
4. Test the full build process with Cloudberry detection

Have fixed all the mentioned problems and the missing license headers.

@edespino
Contributor

PR Review: Cloudberry MADlib Build Issues

CMake Configuration Command

cmake \
    -DCLOUDBERRY_3_PG_CONFIG=/usr/local/cloudberry/bin/pg_config \
    -DCMAKE_C_COMPILER=gcc \
    -DCMAKE_CXX_COMPILER=g++ \
    -DCMAKE_BUILD_TYPE=Release \
    -DCMAKE_INSTALL_PREFIX=/usr/local/madlib \
    -DCLOUDBERRY_3_EXECUTABLE=/usr/local/cloudberry/bin/postgres \
    ..

CMake Configuration Error

Error:
CMake Error at src/CMakeLists.txt:202 (add_library):
  Cannot find source file:

    /home/cbadmin/bom-parts/madlib/src/ports/cloudberry/dbconnector/Compatibility.hpp

Tried extensions .c .C .c++ .cc .cpp .cxx .cu .mpp .m .M .mm .ixx .cppm .h
.hh .h++ .hm .hpp .hxx .in .txx .f .F .for .f77 .f90 .f95 .f03 .hip .ispc

CMake Error at src/CMakeLists.txt:202 (add_library):
No SOURCES given to target: madlib_cloudberry_3

CMake Generate step failed. Build files cannot be regenerated correctly.

Location: Referenced in src/ports/cloudberry/CMakeLists.txt:61

Observation: The directory /home/cbadmin/bom-parts/madlib/src/ports/cloudberry/dbconnector/ does not exist, while the equivalent Greenplum
directory does exist at /home/cbadmin/bom-parts/madlib/src/ports/greenplum/dbconnector/ containing:

  • Compatibility.hpp
  • dbconnector.hpp

Additional Build Errors (After Manual Directory Creation)

After manually creating the missing directory and copying files from Greenplum, cmake succeeded but compilation fails with multiple errors in
Compatibility.hpp:

  1. AggState API change: aggcontext member doesn't exist (suggests aggcontexts)
  2. WindowState renamed: T_WindowState not declared (suggests T_WindowAggState)
  3. Missing function: format_procedure not declared
  4. Function conflict: Ambiguous AggCheckCallContext - both the compatibility shim and PostgreSQL's native version exist

These errors indicate API differences between Greenplum's PostgreSQL base and Cloudberry's PostgreSQL base.

@edespino
Contributor

@zhangwenchao-123 - Unless absolutely necessary, there is no need to force push additional PR commits. This will allow us to view the PR history easily.

@zhangwenchao-123 zhangwenchao-123 force-pushed the support_cloudberry branch 2 times, most recently from 00de02c to 1aad3dd Compare October 22, 2025 06:57
Fix SEGFAULT memory bugs

There are weird SEGFAULT bugs caused by custom allocations being erroneously paired with std::free (they should use the matching custom free), and we're unable to solve them. This is a workaround.
@zhangwenchao-123
Author

zhangwenchao-123 commented Oct 23, 2025

PR Review: Cloudberry MADlib Build Issues

CMake Configuration Command

cmake \
    -DCLOUDBERRY_3_PG_CONFIG=/usr/local/cloudberry/bin/pg_config \
    -DCMAKE_C_COMPILER=gcc \
    -DCMAKE_CXX_COMPILER=g++ \
    -DCMAKE_BUILD_TYPE=Release \
    -DCMAKE_INSTALL_PREFIX=/usr/local/madlib \
    -DCLOUDBERRY_3_EXECUTABLE=/usr/local/cloudberry/bin/postgres \
    ..

CMake Configuration Error

Error:
CMake Error at src/CMakeLists.txt:202 (add_library):
  Cannot find source file:

    /home/cbadmin/bom-parts/madlib/src/ports/cloudberry/dbconnector/Compatibility.hpp

Tried extensions .c .C .c++ .cc .cpp .cxx .cu .mpp .m .M .mm .ixx .cppm .h .hh .h++ .hm .hpp .hxx .in .txx .f .F .for .f77 .f90 .f95 .f03 .hip .ispc

CMake Error at src/CMakeLists.txt:202 (add_library): No SOURCES given to target: madlib_cloudberry_3

CMake Generate step failed. Build files cannot be regenerated correctly.

Location: Referenced in src/ports/cloudberry/CMakeLists.txt:61

Observation: The directory /home/cbadmin/bom-parts/madlib/src/ports/cloudberry/dbconnector/ does not exist, while the equivalent Greenplum directory does exist at /home/cbadmin/bom-parts/madlib/src/ports/greenplum/dbconnector/ containing:

  • Compatibility.hpp
  • dbconnector.hpp

Additional Build Errors (After Manual Directory Creation)

After manually creating the missing directory and copying files from Greenplum, cmake succeeded but compilation fails with multiple errors in Compatibility.hpp:

  1. AggState API change: aggcontext member doesn't exist (suggests aggcontexts)
  2. WindowState renamed: T_WindowState not declared (suggests T_WindowAggState)
  3. Missing function: format_procedure not declared
  4. Function conflict: Ambiguous AggCheckCallContext - both the compatibility shim and PostgreSQL's native version exist

These errors indicate API differences between Greenplum's PostgreSQL base and Cloudberry's PostgreSQL base.

Yeah, there are some other commits that were not picked up; I will continue to complete this PR and test it.

Contributor

@edespino edespino left a comment

I have a few changes to consider. I have more testing of this PR to perform.

Please do not force push changes to this PR. I want to be able to follow the history of this work. Force pushing is not helping.

name: Greenplum DB

cloudberry:
name: Cloudberry DB
Contributor

Cloudberry DB should be Apache Cloudberry

Author

fixed

)
set(${PKG_NAME}_ADDITIONAL_INCLUDE_DIRS
"${${PKG_NAME}_ADDITIONAL_INCLUDE_DIRS}/internal")
message("-- Detected Cloudberry")
Contributor

message("-- Detected Cloudberry") should be message("-- Detected Apache Cloudberry")

Author

OK

# only need the first two digits for <= 4.3.4
dbver = '.'.join(map(str, dbver_split[:2]))
elif portid == 'cloudberry':
# Assume Cloudberry will stick to semantic versioning
Contributor

Assume Cloudberry will stick to semantic versioning should be Assume Apache Cloudberry will stick to semantic versioning

Author

fixed

Contributor

Is this symlink needed? I believe we should only be providing support for the Apache Cloudberry 2 & 3 (future) releases.

Member

+1.

Author

fixed

# 4.3.5+ from versions < 4.3.5
match = re.search("Greenplum[a-zA-Z\s]*(\d+\.\d+\.\d+)", versionStr)
elif portid == 'cloudberry':
match = re.search("Cloudberry[a-zA-Z\s]*(\d+\.\d+\.\d+)", versionStr)
Contributor

"Cloudberry[a-zA-Z\s]*(\d+\.\d+\.\d+)" should be "Apache Cloudberry[a-zA-Z\s]*(\d+\.\d+\.\d+)" ?

I am not entirely sure about this.

Author

Cloudberry is enough to achieve our goal, while Apache Cloudberry is more accurate, so it may be better.
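This can be checked quickly: because re.search scans the whole string, the plain "Cloudberry" pattern also matches inside "Apache Cloudberry". A small sketch (the version strings below are hypothetical, not actual server output):

```python
import re

# The pattern under discussion in the diff above.
PATTERN = r"Cloudberry[a-zA-Z\s]*(\d+\.\d+\.\d+)"

def extract_version(version_str):
    """Return the x.y.z version if the Cloudberry pattern matches, else None."""
    match = re.search(PATTERN, version_str)
    return match.group(1) if match else None

# Hypothetical version strings for both branding styles.
print(extract_version("PostgreSQL 14.4 (Apache Cloudberry 3.0.0 build dev)"))
# → 3.0.0
print(extract_version("PostgreSQL 14.4 (Cloudberry Database 2.0.0 build dev)"))
# → 2.0.0

# By contrast, anchoring on "Apache Cloudberry" would miss the second style:
strict = r"Apache Cloudberry[a-zA-Z\s]*(\d+\.\d+\.\d+)"
print(re.search(strict, "PostgreSQL 14.4 (Cloudberry Database 2.0.0 build dev)"))
# → None
```

So the looser pattern covers both spellings, which is why plain "Cloudberry" is sufficient here.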

requirements.txt Outdated
Contributor

Why is this empty file needed?

I noticed this when I ran the Apache Release Audit tool (mvn apache-rat:check).

Author

In Apache Cloudberry, it's not needed. I will remove it.

# implying we only need 1 folder for same major versions
set(VERSION ${${PORT_UC}_VERSION_MAJOR})
elseif(${PORT_UC} STREQUAL "CLOUDBERRY")
# Assumes CBDB always follows semantic versioning
Member

Suggested change
# Assumes CBDB always follows semantic versioning
# Assumes Apache Cloudberry always follows semantic versioning

Author

fixed

libdir = libdir.decode()

libdir = libdir.strip()+'/postgresql'
libdir = str(libdir.strip(), encoding='utf-8')+'/postgresql'
Contributor

Testing Note: Encountering TypeError: decoding str is not supported when installing on PostgreSQL 14.19.

Root Cause:
For Postgres 13+ (line 1347), libdir is already decoded to a string via .decode(), but line 1349 attempts to decode it again with
str(libdir.strip(), encoding='utf-8'), which fails because you cannot decode a string that's already been decoded.

Recommended Solution:
Ensure libdir is always decoded to a string before line 1349, then simply strip and append the path:

libdir = subprocess.check_output(['pg_config','--libdir'])
if ((portid == 'greenplum' and is_rev_gte(dbver_split, get_rev_num('7.0'))) or
    (portid == 'postgres' and is_rev_gte(dbver_split, get_rev_num('13.0')))):
    libdir = libdir.decode()
else:
    libdir = libdir.decode('utf-8')

libdir = libdir.strip() + '/postgresql'

This ensures libdir is consistently a string for all code paths (older and newer versions), eliminating the type inconsistency that causes the
error.

Request for Review: Please validate this fix works correctly for both:
- Older versions (Postgres <13, Greenplum <7) where subprocess.check_output() returns bytes
- Newer versions (Postgres 13+, Greenplum 7+) where explicit decoding is needed
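The inconsistency can be reproduced without a database. Here is a small sketch of the recommended fix (normalize_libdir is a hypothetical helper, not a function in madpack): decode exactly once, then strip and append.

```python
def normalize_libdir(raw):
    """Return a str libdir whether `raw` arrived as bytes or str.

    Mirrors the recommended fix: decode exactly once, then strip and
    append the '/postgresql' suffix.
    """
    if isinstance(raw, bytes):
        raw = raw.decode("utf-8")
    return raw.strip() + "/postgresql"

# bytes input (what subprocess.check_output actually returns)
print(normalize_libdir(b"/usr/local/cloudberry/lib\n"))
# → /usr/local/cloudberry/lib/postgresql

# str input (already decoded on the Postgres 13+ / Greenplum 7+ path)
print(normalize_libdir("/usr/local/cloudberry/lib\n"))
# → /usr/local/cloudberry/lib/postgresql

# The buggy pattern: decoding a value that is already a str
try:
    str("already a str", encoding="utf-8")
except TypeError as exc:
    print("TypeError:", exc)  # → TypeError: decoding str is not supported
```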

Author

cool!

@zhangwenchao-123
Author

All the mentioned comments have been addressed, and I have tested it on Cloudberry 3.0.

@tuhaihe
Member

tuhaihe commented Oct 29, 2025

Hi @zhangwenchao-123 could you rebase your commits on the latest madlib2-master? Let's see if the CI can pass successfully. Thanks!

@zhangwenchao-123
Author

Hi @zhangwenchao-123 could you rebase your commits on the latest madlib2-master? Let's see if the CI can pass successfully. Thanks!

It was the NOTICE file check that failed; I have fixed it and will test whether CI can pass.

Member

I noticed that these two files, FindCloudberry_2.cmake & FindCloudberry_3.cmake, are both symbolic links to FindCloudberry.cmake. Should we create them as ASCII text files, like GP/PG? FYI.


fix in e665a9f

@tuhaihe
Member

tuhaihe commented Oct 31, 2025

Based on the new codebase, I can build and deploy MADlib into the Cloudberry 2.0 and 3.0 (main) gpdemo databases:

  1. Build the Cloudberry gpdemo env following the docs

  2. Build and deploy the MADlib

## Download this PR change
git clone https://github.com/apache/madlib.git
cd madlib
git fetch origin pull/627/head:zhangwenchao-123/support_cloudberry
git switch zhangwenchao-123/support_cloudberry


## Set Python env
sudo alternatives --install /usr/bin/python python /usr/bin/python3 1

## Install required dependencies to the Cloudberry Dev container
sudo dnf install boost-devel -y
sudo dnf install -y graphviz # for docs
sudo dnf install --enablerepo=crb doxygen -y # for docs
pip install mock pandas numpy xgboost scikit-learn pyyaml pyxb-x pypmml

## 
cd ~/madlib
mkdir build ; cd build

## for Cloudberry 3.0
cmake \
    -DCLOUDBERRY_3_PG_CONFIG=$GPHOME/bin/pg_config \
    -DCMAKE_C_COMPILER=gcc \
    -DCMAKE_CXX_COMPILER=g++ \
    -DCMAKE_BUILD_TYPE=Release \
    -DCMAKE_INSTALL_PREFIX=/usr/local/madlib \
    -DCLOUDBERRY_3_EXECUTABLE=$GPHOME/bin/postgres \
    ..

## for Cloudberry 2.0
cmake \
    -DCLOUDBERRY_2_PG_CONFIG=$GPHOME/bin/pg_config \
    -DCMAKE_C_COMPILER=gcc \
    -DCMAKE_CXX_COMPILER=g++ \
    -DCMAKE_BUILD_TYPE=Release \
    -DCMAKE_INSTALL_PREFIX=/usr/local/madlib \
    -DCLOUDBERRY_2_EXECUTABLE=$GPHOME/bin/postgres \
    ..

## Make, deploy, and run test
make -j$(nproc)
./src/bin/madpack -p cloudberry -c gpadmin@localhost:7000/postgres install
./src/bin/madpack -p cloudberry -c gpadmin@localhost:7000/postgres install-check

If something is wrong, please help correct me. Thanks!

Comment on lines 310 to 317

import collections
import collections.abc

if not hasattr(collections, 'MutableSequence'):
    collections.MutableSequence = collections.abc.MutableSequence
    collections.MutableMapping = collections.abc.MutableMapping
    collections.MutableSet = collections.abc.MutableSet

Member

Maybe we can move the Python3 compatibility code into src/ports/postgres/modules/pmml/__init__.py_in to avoid the SQL-side code interfering with M4 macro expansion?
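For reference, the compatibility shim quoted above can be exercised standalone. On Python 3.10+ the collections.MutableSequence alias was removed, so the hasattr guard restores it; on older interpreters it is a no-op. A minimal sketch (IntList is a made-up class standing in for the generated PMML bindings):

```python
import collections
import collections.abc

# Restore aliases removed in Python 3.10 (no-op on older interpreters,
# where collections.MutableSequence already is collections.abc.MutableSequence).
if not hasattr(collections, 'MutableSequence'):
    collections.MutableSequence = collections.abc.MutableSequence
    collections.MutableMapping = collections.abc.MutableMapping
    collections.MutableSet = collections.abc.MutableSet

# Code written against the old alias keeps working after the shim:
class IntList(collections.MutableSequence):
    def __init__(self):
        self._items = []
    def __getitem__(self, i):
        return self._items[i]
    def __setitem__(self, i, v):
        self._items[i] = v
    def __delitem__(self, i):
        del self._items[i]
    def __len__(self):
        return len(self._items)
    def insert(self, i, v):
        self._items.insert(i, v)

lst = IntList()
lst.append(3)       # append comes from the MutableSequence mixin
print(list(lst))    # → [3]
```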

Member

This change has been successfully tested in the Cloudberry environment and MADlib Jenkins CI.


fix in e665a9f

Member

We also need to add the ASF license header to this file:

# Licensed to the Apache Software Foundation (ASF) under one
# or more contributor license agreements.  See the NOTICE file
# distributed with this work for additional information
# regarding copyright ownership.  The ASF licenses this file
# to you under the Apache License, Version 2.0 (the
# "License"); you may not use this file except in compliance
# with the License.  You may obtain a copy of the License at

#   http://www.apache.org/licenses/LICENSE-2.0

# Unless required by applicable law or agreed to in writing,
# software distributed under the License is distributed on an
# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
# KIND, either express or implied.  See the License for the
# specific language governing permissions and limitations
# under the License.


fix in d018805

@tuhaihe
Member

tuhaihe commented Feb 11, 2026

Hi @zhangyue1818, thanks for your contribution. But I tested this PR on Cloudberry 2.0 and the upcoming Cloudberry 2.1 release, and one test case failed:

[gpadmin@cdw build]$ ./src/bin/madpack -p cloudberry -c gpadmin@localhost:7000/postgres install-check
madpack.py: INFO : Detected Apache Cloudberry version 2.0.0.
TEST CASE RESULT|Module: array_ops|array_ops.ic.sql_in|PASS|Time: 74 milliseconds
TEST CASE RESULT|Module: bayes|bayes.ic.sql_in|PASS|Time: 320 milliseconds
TEST CASE RESULT|Module: crf|crf_test_small.ic.sql_in|PASS|Time: 285 milliseconds
TEST CASE RESULT|Module: crf|crf_train_small.ic.sql_in|PASS|Time: 285 milliseconds
TEST CASE RESULT|Module: elastic_net|elastic_net.ic.sql_in|PASS|Time: 190 milliseconds
TEST CASE RESULT|Module: linalg|svd.ic.sql_in|PASS|Time: 572 milliseconds
TEST CASE RESULT|Module: linalg|matrix_ops.ic.sql_in|PASS|Time: 822 milliseconds
TEST CASE RESULT|Module: linalg|linalg.ic.sql_in|PASS|Time: 76 milliseconds
TEST CASE RESULT|Module: pmml|pmml.ic.sql_in|PASS|Time: 452 milliseconds
TEST CASE RESULT|Module: prob|prob.ic.sql_in|PASS|Time: 28 milliseconds
TEST CASE RESULT|Module: svm|svm.ic.sql_in|PASS|Time: 315 milliseconds
TEST CASE RESULT|Module: tsa|arima.ic.sql_in|PASS|Time: 1074 milliseconds
TEST CASE RESULT|Module: stemmer|porter_stemmer.ic.sql_in|PASS|Time: 34 milliseconds
TEST CASE RESULT|Module: conjugate_gradient|conj_grad.ic.sql_in|PASS|Time: 142 milliseconds
TEST CASE RESULT|Module: knn|knn.ic.sql_in|PASS|Time: 175 milliseconds
TEST CASE RESULT|Module: lda|lda.ic.sql_in|PASS|Time: 246 milliseconds
TEST CASE RESULT|Module: stats|correlation.ic.sql_in|PASS|Time: 182 milliseconds
TEST CASE RESULT|Module: stats|mw_test.ic.sql_in|PASS|Time: 42 milliseconds
TEST CASE RESULT|Module: stats|pred_metrics.ic.sql_in|PASS|Time: 255 milliseconds
TEST CASE RESULT|Module: stats|chi2_test.ic.sql_in|PASS|Time: 37 milliseconds
TEST CASE RESULT|Module: stats|anova_test.ic.sql_in|PASS|Time: 47 milliseconds
TEST CASE RESULT|Module: stats|t_test.ic.sql_in|PASS|Time: 42 milliseconds
TEST CASE RESULT|Module: stats|cox_prop_hazards.ic.sql_in|PASS|Time: 211 milliseconds
TEST CASE RESULT|Module: stats|ks_test.ic.sql_in|PASS|Time: 84 milliseconds
TEST CASE RESULT|Module: stats|robust_and_clustered_variance_coxph.ic.sql_in|PASS|Time: 355 milliseconds
TEST CASE RESULT|Module: stats|wsr_test.ic.sql_in|PASS|Time: 46 milliseconds
TEST CASE RESULT|Module: stats|f_test.ic.sql_in|PASS|Time: 38 milliseconds
TEST CASE RESULT|Module: utilities|utilities.ic.sql_in|PASS|Time: 115 milliseconds
TEST CASE RESULT|Module: utilities|pivot.ic.sql_in|PASS|Time: 119 milliseconds
TEST CASE RESULT|Module: utilities|path.ic.sql_in|PASS|Time: 159 milliseconds
TEST CASE RESULT|Module: utilities|transform_vec_cols.ic.sql_in|PASS|Time: 156 milliseconds
TEST CASE RESULT|Module: utilities|text_utilities.ic.sql_in|PASS|Time: 126 milliseconds
TEST CASE RESULT|Module: utilities|sessionize.ic.sql_in|PASS|Time: 105 milliseconds
TEST CASE RESULT|Module: utilities|encode_categorical.ic.sql_in|PASS|Time: 186 milliseconds
TEST CASE RESULT|Module: utilities|minibatch_preprocessing.ic.sql_in|PASS|Time: 186 milliseconds
TEST CASE RESULT|Module: assoc_rules|assoc_rules.ic.sql_in|FAIL|Time: 568 milliseconds
madpack.py: ERROR : Failed executing /tmp/madlib.7qnxdkya/assoc_rules/assoc_rules.ic.sql_in.tmp
madpack.py: ERROR : Check the log at /tmp/madlib.7qnxdkya/assoc_rules/assoc_rules.ic.sql_in.log
TEST CASE RESULT|Module: convex|lmf.ic.sql_in|PASS|Time: 297 milliseconds
TEST CASE RESULT|Module: convex|mlp.ic.sql_in|PASS|Time: 507 milliseconds
TEST CASE RESULT|Module: deep_learning|keras_model_arch_table.ic.sql_in|PASS|Time: 149 milliseconds
TEST CASE RESULT|Module: glm|glm.ic.sql_in|PASS|Time: 906 milliseconds
TEST CASE RESULT|Module: graph|graph.ic.sql_in|PASS|Time: 1343 milliseconds
TEST CASE RESULT|Module: linear_systems|sparse_linear_sytems.ic.sql_in|PASS|Time: 132 milliseconds
TEST CASE RESULT|Module: linear_systems|dense_linear_sytems.ic.sql_in|PASS|Time: 125 milliseconds
TEST CASE RESULT|Module: recursive_partitioning|decision_tree.ic.sql_in|PASS|Time: 252 milliseconds
TEST CASE RESULT|Module: recursive_partitioning|random_forest.ic.sql_in|PASS|Time: 322 milliseconds
TEST CASE RESULT|Module: regress|robust.ic.sql_in|PASS|Time: 193 milliseconds
TEST CASE RESULT|Module: regress|logistic.ic.sql_in|PASS|Time: 249 milliseconds
TEST CASE RESULT|Module: regress|linear.ic.sql_in|PASS|Time: 31 milliseconds
TEST CASE RESULT|Module: regress|clustered.ic.sql_in|PASS|Time: 189 milliseconds
TEST CASE RESULT|Module: regress|multilogistic.ic.sql_in|PASS|Time: 323 milliseconds
TEST CASE RESULT|Module: regress|marginal.ic.sql_in|PASS|Time: 457 milliseconds
TEST CASE RESULT|Module: sample|balance_sample.ic.sql_in|PASS|Time: 139 milliseconds
TEST CASE RESULT|Module: sample|train_test_split.ic.sql_in|PASS|Time: 166 milliseconds
TEST CASE RESULT|Module: sample|sample.ic.sql_in|PASS|Time: 20 milliseconds
TEST CASE RESULT|Module: sample|stratified_sample.ic.sql_in|PASS|Time: 112 milliseconds
TEST CASE RESULT|Module: summary|summary.ic.sql_in|PASS|Time: 148 milliseconds
TEST CASE RESULT|Module: kmeans|kmeans.ic.sql_in|PASS|Time: 661 milliseconds
TEST CASE RESULT|Module: pca|pca.ic.sql_in|PASS|Time: 1475 milliseconds
TEST CASE RESULT|Module: pca|pca_project.ic.sql_in|PASS|Time: 528 milliseconds
TEST CASE RESULT|Module: validation|cross_validation.ic.sql_in|PASS|Time: 332 milliseconds
INFO: Log files saved in /tmp/madlib.7qnxdkya
[gpadmin@cdw build]$ cat /tmp/madlib.7qnxdkya/assoc_rules/assoc_rules.ic.sql_in.log
-- Switch to test user:
SET ROLE "madlib_210_installcheck_postgres";
SET
-- Set SEARCH_PATH for install-check:
SET search_path=madlib_installcheck_assoc_rules,madlib;
SET
/* ----------------------------------------------------------------------- *//**
 *
 * Licensed to the Apache Software Foundation (ASF) under one
 * or more contributor license agreements.  See the NOTICE file
 * distributed with this work for additional information
 * regarding copyright ownership.  The ASF licenses this file
 * to you under the Apache License, Version 2.0 (the
 * "License"); you may not use this file except in compliance
 * with the License.  You may obtain a copy of the License at
 *
 *   http://www.apache.org/licenses/LICENSE-2.0
 *
 * Unless required by applicable law or agreed to in writing,
 * software distributed under the License is distributed on an
 * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
 * KIND, either express or implied.  See the License for the
 * specific language governing permissions and limitations
 * under the License.
 *
 *//* ----------------------------------------------------------------------- */
---------------------------------------------------------------------------
-- Rules:
-- ------
-- 1) Any DB objects should be created w/o schema prefix,
--    since this file is executed in a separate schema context.
-- 2) There should be no DROP statements in this script, since
--    all objects created in the default schema will be cleaned-up outside.
---------------------------------------------------------------------------
---------------------------------------------------------------------------
-- Setup:
---------------------------------------------------------------------------
CREATE OR REPLACE FUNCTION assoc_array_eq
    (
    arr1 TEXT[],
    arr2 TEXT[]
    )
RETURNS BOOL AS $$
    SELECT COUNT(*) = array_upper($1, 1) AND array_upper($1, 1) = array_upper($2, 1)
    FROM (SELECT unnest($1) id) t1, (SELECT unnest($2) id) t2
    WHERE t1.id = t2.id;

$$ LANGUAGE sql IMMUTABLE;
CREATE FUNCTION
CREATE OR REPLACE FUNCTION install_test() RETURNS VOID AS $$
declare
    result1        TEXT;
    result2        TEXT;
    result3        TEXT;
    result_maxiter TEXT;
    res            madlib.assoc_rules_results;
    output_schema  TEXT;
    output_table   TEXT;
    total_rules    INT;
    total_time     INTERVAL;
begin
    DROP TABLE IF EXISTS test_data1;
    CREATE TABLE test_data1 (
        trans_id INT
        , product INT
    );

    DROP TABLE IF EXISTS test_data2;
    CREATE TABLE test_data2 (
        trans_id INT
        , product VARCHAR
    );


    INSERT INTO test_data1 VALUES (1,1);
    INSERT INTO test_data1 VALUES (1,2);
    INSERT INTO test_data1 VALUES (3,3);
    INSERT INTO test_data1 VALUES (8,4);
    INSERT INTO test_data1 VALUES (10,1);
    INSERT INTO test_data1 VALUES (10,2);
    INSERT INTO test_data1 VALUES (10,3);
    INSERT INTO test_data1 VALUES (19,2);

    INSERT INTO test_data2 VALUES (1, 'beer');
    INSERT INTO test_data2 VALUES (1, 'diapers');
    INSERT INTO test_data2 VALUES (1, 'chips');
    INSERT INTO test_data2 VALUES (2, 'beer');
    INSERT INTO test_data2 VALUES (2, 'diapers');
    INSERT INTO test_data2 VALUES (3, 'beer');
    INSERT INTO test_data2 VALUES (3, 'diapers');
    INSERT INTO test_data2 VALUES (4, 'beer');
    INSERT INTO test_data2 VALUES (4, 'chips');
    INSERT INTO test_data2 VALUES (5, 'beer');
    INSERT INTO test_data2 VALUES (6, 'beer');
    INSERT INTO test_data2 VALUES (6, 'diapers');
    INSERT INTO test_data2 VALUES (6, 'chips');
    INSERT INTO test_data2 VALUES (7, 'beer');
    INSERT INTO test_data2 VALUES (7, 'diapers');

    DROP TABLE IF EXISTS test1_exp_result;
    CREATE TABLE test1_exp_result (
        ruleid integer,
        pre text[],
        post text[],
        support double precision,
        confidence double precision,
        lift double precision,
        conviction double precision
    ) ;

    DROP TABLE IF EXISTS test2_exp_result;
    CREATE TABLE test2_exp_result (
        ruleid integer,
        pre text[],
        post text[],
        support double precision,
        confidence double precision,
        lift double precision,
        conviction double precision
    ) ;


    INSERT INTO test1_exp_result VALUES (7, '{3}', '{1}', 0.20000000000000001, 0.5, 1.2499999999999998, 1.2);
    INSERT INTO test1_exp_result VALUES (4, '{2}', '{1}', 0.40000000000000002, 0.66666666666666674, 1.6666666666666667, 1.8000000000000003);
    INSERT INTO test1_exp_result VALUES (1, '{1}', '{2,3}', 0.20000000000000001, 0.5, 2.4999999999999996, 1.6000000000000001);
    INSERT INTO test1_exp_result VALUES (9, '{2,3}', '{1}', 0.20000000000000001, 1, 2.4999999999999996, 0);
    INSERT INTO test1_exp_result VALUES (6, '{1,2}', '{3}', 0.20000000000000001, 0.5, 1.2499999999999998, 1.2);
    INSERT INTO test1_exp_result VALUES (8, '{3}', '{2}', 0.20000000000000001, 0.5, 0.83333333333333337, 0.80000000000000004);
    INSERT INTO test1_exp_result VALUES (5, '{1}', '{2}', 0.40000000000000002, 1, 1.6666666666666667, 0);
    INSERT INTO test1_exp_result VALUES (2, '{3}', '{2,1}', 0.20000000000000001, 0.5, 1.2499999999999998, 1.2);
    INSERT INTO test1_exp_result VALUES (10, '{3,1}', '{2}', 0.20000000000000001, 1, 1.6666666666666667, 0);
    INSERT INTO test1_exp_result VALUES (3, '{1}', '{3}', 0.20000000000000001, 0.5, 1.2499999999999998, 1.2);

    INSERT INTO test2_exp_result VALUES (7, '{chips,diapers}', '{beer}', 0.2857142857142857, 1, 1, 0);
    INSERT INTO test2_exp_result VALUES (2, '{chips}', '{diapers}', 0.2857142857142857, 0.66666666666666663, 0.93333333333333324, 0.85714285714285698);
    INSERT INTO test2_exp_result VALUES (1, '{chips}', '{diapers,beer}', 0.2857142857142857, 0.66666666666666663, 0.93333333333333324, 0.85714285714285698);
    INSERT INTO test2_exp_result VALUES (6, '{diapers}', '{beer}', 0.7142857142857143, 1, 1, 0);
    INSERT INTO test2_exp_result VALUES (4, '{beer}', '{diapers}', 0.7142857142857143, 0.7142857142857143, 1, 1);
    INSERT INTO test2_exp_result VALUES (3, '{chips,beer}', '{diapers}', 0.2857142857142857, 0.66666666666666663, 0.93333333333333324, 0.85714285714285698);
    INSERT INTO test2_exp_result VALUES (5, '{chips}', '{beer}', 0.42857142857142855, 1, 1, 0);

    res = madlib.assoc_rules (.1, .5, 'trans_id', 'product', 'test_data1','madlib_installcheck_assoc_rules', false);

    RETURN;

end $$ language plpgsql;
CREATE FUNCTION
---------------------------------------------------------------------------
-- Test
---------------------------------------------------------------------------
SELECT install_test();
psql:/tmp/madlib.7qnxdkya/assoc_rules/assoc_rules.ic.sql_in.tmp:154: NOTICE:  table "test_data1" does not exist, skipping
psql:/tmp/madlib.7qnxdkya/assoc_rules/assoc_rules.ic.sql_in.tmp:154: NOTICE:  Table doesn't have 'DISTRIBUTED BY' clause -- Using column named 'trans_id' as the Apache Cloudberry data distribution key for this table.
HINT:  The 'DISTRIBUTED BY' clause determines the distribution of data. Make sure column(s) chosen are the optimal data distribution key to minimize skew.
psql:/tmp/madlib.7qnxdkya/assoc_rules/assoc_rules.ic.sql_in.tmp:154: NOTICE:  table "test_data2" does not exist, skipping
psql:/tmp/madlib.7qnxdkya/assoc_rules/assoc_rules.ic.sql_in.tmp:154: NOTICE:  Table doesn't have 'DISTRIBUTED BY' clause -- Using column named 'trans_id' as the Apache Cloudberry data distribution key for this table.
HINT:  The 'DISTRIBUTED BY' clause determines the distribution of data. Make sure column(s) chosen are the optimal data distribution key to minimize skew.
psql:/tmp/madlib.7qnxdkya/assoc_rules/assoc_rules.ic.sql_in.tmp:154: NOTICE:  table "test1_exp_result" does not exist, skipping
psql:/tmp/madlib.7qnxdkya/assoc_rules/assoc_rules.ic.sql_in.tmp:154: NOTICE:  Table doesn't have 'DISTRIBUTED BY' clause -- Using column named 'ruleid' as the Apache Cloudberry data distribution key for this table.
HINT:  The 'DISTRIBUTED BY' clause determines the distribution of data. Make sure column(s) chosen are the optimal data distribution key to minimize skew.
psql:/tmp/madlib.7qnxdkya/assoc_rules/assoc_rules.ic.sql_in.tmp:154: NOTICE:  table "test2_exp_result" does not exist, skipping
psql:/tmp/madlib.7qnxdkya/assoc_rules/assoc_rules.ic.sql_in.tmp:154: NOTICE:  Table doesn't have 'DISTRIBUTED BY' clause -- Using column named 'ruleid' as the Apache Cloudberry data distribution key for this table.
HINT:  The 'DISTRIBUTED BY' clause determines the distribution of data. Make sure column(s) chosen are the optimal data distribution key to minimize skew.
psql:/tmp/madlib.7qnxdkya/assoc_rules/assoc_rules.ic.sql_in.tmp:154: WARNING:  terminating connection because of crash of another server process  (seg0 slice3 172.17.0.6:7002 pid=45213)
DETAIL:  The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.
HINT:  In a moment you should be able to reconnect to the database and repeat your command.
psql:/tmp/madlib.7qnxdkya/assoc_rules/assoc_rules.ic.sql_in.tmp:154: WARNING:  terminating connection because of crash of another server process  (seg0 slice1 172.17.0.6:7002 pid=45202)
DETAIL:  The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.
HINT:  In a moment you should be able to reconnect to the database and repeat your command.
psql:/tmp/madlib.7qnxdkya/assoc_rules/assoc_rules.ic.sql_in.tmp:154: WARNING:  terminating connection because of crash of another server process  (seg0 172.17.0.6:7002 pid=45137)
DETAIL:  The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.
HINT:  In a moment you should be able to reconnect to the database and repeat your command.
psql:/tmp/madlib.7qnxdkya/assoc_rules/assoc_rules.ic.sql_in.tmp:154: WARNING:  writer gang of current global transaction is lost
psql:/tmp/madlib.7qnxdkya/assoc_rules/assoc_rules.ic.sql_in.tmp:154: WARNING:  Any temporary tables for this session have been dropped because the gang was disconnected (session id = 596)
psql:/tmp/madlib.7qnxdkya/assoc_rules/assoc_rules.ic.sql_in.tmp:154: ERROR:  DTX RollbackAndReleaseCurrentSubTransaction dispatch failed
CONTEXT:  PL/Python function "assoc_rules"
PL/pgSQL function install_test() line 93 at assignment

@tuhaihe (Member) commented on Feb 11, 2026:

Hi @zhangyue1818 thanks for your contribution. But I tested this PR in Cloudberry 2.0 and the coming Cloudberry 2.1 release, one test case failed:

The error above occurred in a Docker container environment. I retested MADlib installation and install-check on Cloudberry 2.0 and 2.1 running in a virtual machine, and all tests (including assoc_rules) passed without errors.

Thanks again.

Add bounds checking before accessing unique value arrays to prevent
out-of-bounds reads in the SparseData operation loop.

Problem:
In op_sdata_by_sdata(), the loop increments indices i and j to
traverse the unique values in left and right SparseData structures.
After incrementing, the code immediately accesses vals->data[i] and
vals->data[j] in the next iteration without verifying that i and j
are within bounds (i.e., < unique_value_count). This could lead to
reading beyond the allocated array boundaries.

Solution:
Add explicit bounds checking after index increments and before
accessing the arrays. The check breaks the loop if either index
reaches or exceeds the respective unique_value_count, preventing
invalid memory access.

The fix is placed after the index increment logic (lines 1088-1101)
and before reading run_length values and accessing the vals arrays,
ensuring all subsequent array operations are safe.
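The shape of this fix can be sketched with a simplified stand-in for the SparseData merge loop. Everything below is illustrative: the `rle_stream` type, the `rle_add` function, and the element-wise addition are hypothetical substitutes for MADlib's actual SparseData structures and `op_sdata_by_sdata`; only the pattern — break after the index increments once either index reaches its `count` — mirrors the described change.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a SparseData unique-value stream:
 * runs[] holds run lengths, vals[] the value carried by each run. */
typedef struct {
    const int    *runs;
    const double *vals;
    size_t        count;   /* analogous to unique_value_count */
} rle_stream;

/* Merge two run-length-encoded streams by pairwise addition,
 * mirroring the loop shape described in the commit message. */
static size_t rle_add(const rle_stream *l, const rle_stream *r,
                      double *out, size_t out_cap)
{
    size_t i = 0, j = 0, n = 0;
    int left_left  = l->count ? l->runs[0] : 0;
    int right_left = r->count ? r->runs[0] : 0;

    while (i < l->count && j < r->count && n < out_cap) {
        out[n++] = l->vals[i] + r->vals[j];

        /* Consume the shorter remaining run from both sides. */
        int step = left_left < right_left ? left_left : right_left;
        left_left  -= step;
        right_left -= step;

        /* Advance to the next run when one is exhausted. */
        if (left_left == 0 && ++i < l->count)
            left_left = l->runs[i];
        if (right_left == 0 && ++j < r->count)
            right_left = r->runs[j];

        /* The fix: bounds check after the increments, before the next
         * iteration dereferences vals[i]/vals[j] again. */
        if (i >= l->count || j >= r->count)
            break;
    }
    return n;
}
```

Without the final check, the loop condition alone would still guard the next iteration here, but in the real code the values are read before the condition is re-evaluated, which is exactly the out-of-bounds window the commit closes.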
@zhangyue1818 commented:

> Hi @zhangyue1818 thanks for your contribution. But I tested this PR in Cloudberry 2.0 and the coming Cloudberry 2.1 release, one test case failed:
>
> The error above occurred in a Docker container environment. I retested MADlib installation and install-check on Cloudberry 2.0 and 2.1 running in a virtual machine, and all tests (including assoc_rules) passed without errors.
>
> Thanks again.

fix in b57e5a9

@tuhaihe (Member) commented on Feb 11, 2026:

> Hi @zhangyue1818 thanks for your contribution. But I tested this PR in Cloudberry 2.0 and the coming Cloudberry 2.1 release, one test case failed:
>
> The error above occurred in a Docker container environment. I retested MADlib installation and install-check on Cloudberry 2.0 and 2.1 running in a virtual machine, and all tests (including assoc_rules) passed without errors.
>
> Thanks again.
>
> fix in b57e5a9

Thanks! Now tested, and it runs well in both the Docker and virtual machine environments.
