@misrasaurabh1 misrasaurabh1 commented Nov 13, 2025

📄 137% speedup (about 2.37x as fast) for fix_nan_category in deepnote_toolkit/ocelots/pandas/utils.py

⏱️ Runtime: 834 milliseconds → 352 milliseconds (best of 10 runs)

📝 Explanation and details

The optimized version achieves a 137% speedup by eliminating unnecessary work through two key optimizations:

What was optimized:

  1. Pre-filtered categorical detection: Instead of checking column.dtype.name == "category" for every column in the loop, the optimization identifies all categorical columns upfront using enumerate(df.dtypes) and stores their indices.
  2. Early exit for non-categorical DataFrames: Added a guard clause that returns immediately if no categorical columns exist, avoiding any loop overhead.
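
The two steps above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the code from the PR: the names `categorical_column_indices` and `fix_nan_category_sketch` are hypothetical, the function is assumed to return the same DataFrame it receives, and `DataFrame.isetitem` (pandas 1.5+) is swapped in for the positional write so that the sketch does not depend on how a given pandas version handles whole-column `iloc` assignment.

```python
from __future__ import annotations

import pandas as pd


def categorical_column_indices(df: pd.DataFrame) -> list[int]:
    """Return positional indices of categorical columns (hypothetical helper)."""
    # df.dtypes is read once; positions (not labels) are kept so that
    # duplicate column names are handled correctly.
    return [i for i, dtype in enumerate(df.dtypes) if dtype.name == "category"]


def fix_nan_category_sketch(df: pd.DataFrame) -> pd.DataFrame:
    """Add a "nan" category to every categorical column (illustrative only)."""
    indices = categorical_column_indices(df)
    if not indices:
        # Early exit: nothing to do when no column is categorical.
        return df
    for i in indices:
        column = df.iloc[:, i]
        # Assumption: isetitem stands in for the PR's `df.iloc[:, i] = ...`.
        df.isetitem(i, column.cat.add_categories("nan"))
    return df
```

Collecting positions rather than column labels keeps the loop correct for DataFrames with duplicate column names, which the generated tests exercise.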

Why this is faster:

  • Reduced dtype access overhead: The original code called df.iloc[:, i] (expensive pandas indexing) for every column, then checked its dtype. The optimization accesses df.dtypes once, which is much faster than repeated iloc calls.
  • Eliminated wasted iterations: For DataFrames with few/no categorical columns, the original code still iterates through all columns. The optimization skips non-categorical columns entirely and exits early when possible.
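
One way to check the claim about `df.iloc` versus `df.dtypes` on your own machine is a small `timeit` comparison. The sketch below is an assumption-laden micro-benchmark, not the benchmark Codeflash ran: the function names and the 50+50 column layout are made up for illustration, and absolute timings will vary with hardware and pandas version.

```python
import timeit

import pandas as pd


def indices_via_iloc(df: pd.DataFrame) -> list:
    # Pattern described for the original code: materialize each column
    # with positional indexing, then inspect its dtype.
    return [
        i for i in range(len(df.columns))
        if df.iloc[:, i].dtype.name == "category"
    ]


def indices_via_dtypes(df: pd.DataFrame) -> list:
    # Pattern described for the optimized code: read df.dtypes once.
    return [i for i, dtype in enumerate(df.dtypes) if dtype.name == "category"]


# Hypothetical data: 50 categorical columns followed by 50 integer columns.
data = {f"C{i}": pd.Series(["x", "y"], dtype="category") for i in range(50)}
data.update({f"I{i}": [1, 2] for i in range(50)})
df = pd.DataFrame(data)

t_iloc = timeit.timeit(lambda: indices_via_iloc(df), number=20)
t_dtypes = timeit.timeit(lambda: indices_via_dtypes(df), number=20)
print(f"iloc scan:   {t_iloc:.4f}s for 20 runs")
print(f"dtypes scan: {t_dtypes:.4f}s for 20 runs")
```

Both functions return the same list of positions, so any timing difference reflects only how the dtypes are obtained.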

Performance characteristics from tests:

  • Large DataFrames with mixed types: Shows significant gains (16-22% faster) when many columns exist but only some are categorical
  • No categorical columns: Dramatic improvement (33-58% faster) due to early exit
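  • Small DataFrames: Slight overhead (mostly 5-16% slower) from the upfront dtype scan when a categorical column is present; this amounts to tens of microseconds in absolute terms. The empty-DataFrame case regresses the most in relative terms (1.54μs -> 42.4μs), although the absolute cost is still only about 40μs.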
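
The percentages in this report appear to follow a single relative-speedup formula, `original / optimized - 1`. That formula is inferred here from the reported numbers rather than taken from Codeflash documentation, so treat the sketch below as an assumption; `speedup_percent` is a hypothetical helper name.

```python
def speedup_percent(original: float, optimized: float) -> float:
    """Relative speedup as a percentage; negative means the optimized run is slower."""
    return (original / optimized - 1.0) * 100.0


# Headline figure: 834 ms -> 352 ms.
print(round(speedup_percent(834, 352)))        # 137
# Empty-DataFrame test: 1.54 us -> 42.4 us, reported as "96.4% slower".
print(round(speedup_percent(1.54, 42.4), 1))   # -96.4
# The same case expressed as a runtime ratio (optimized / original).
print(round(42.4 / 1.54, 1))                   # 27.5
```

Under this reading, "96.4% slower" does not mean the run takes 96.4% longer; the optimized empty-DataFrame call takes roughly 27.5 times as long, which is why the absolute figures matter more than the percentages for the smallest inputs.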

The line profiler confirms this: the original spent 66.8% of time on df.iloc access across all columns, while the optimized version only accesses iloc for the pre-identified categorical columns, reducing this bottleneck substantially.

Correctness verification report:

| Test | Status |
|---|---|
| ⚙️ Existing Unit Tests | 🔘 None Found |
| 🌀 Generated Regression Tests | 31 Passed |
| ⏪ Replay Tests | 86 Passed |
| 🔎 Concolic Coverage Tests | 🔘 None Found |
| 📊 Tests Coverage | 100.0% |

🌀 Generated Regression Tests and Runtime
import pandas as pd
# imports
import pytest
from deepnote_toolkit.ocelots.pandas.utils import fix_nan_category

# unit tests

# ----------------------------- #
# Basic Test Cases
# ----------------------------- #

def test_single_categorical_column_adds_nan_category():
    # Basic: Single categorical column, no 'nan' category present
    df = pd.DataFrame({'A': pd.Series(['a', 'b', 'c'], dtype='category')})
    codeflash_output = fix_nan_category(df.copy()); result = codeflash_output # 302μs -> 336μs (10.3% slower)


def test_multiple_columns_mixed_types():
    # Basic: Multiple columns, some categorical, some not
    df = pd.DataFrame({
        'A': pd.Series(['a', 'b', 'c'], dtype='category'),
        'B': [1, 2, 3],
        'C': pd.Series(['x', 'y', 'z'], dtype='category')
    })
    codeflash_output = fix_nan_category(df.copy()); result = codeflash_output # 427μs -> 448μs (4.67% slower)
    # Non-categorical column should not have cat accessor
    with pytest.raises(AttributeError):
        _ = result['B'].cat

def test_no_categorical_columns():
    # Basic: DataFrame with no categorical columns
    df = pd.DataFrame({'A': [1, 2, 3], 'B': [4.0, 5.0, 6.0]})
    codeflash_output = fix_nan_category(df.copy()); result = codeflash_output # 103μs -> 77.0μs (33.9% faster)

def test_empty_dataframe():
    # Basic: Empty DataFrame
    df = pd.DataFrame()
    codeflash_output = fix_nan_category(df.copy()); result = codeflash_output # 1.54μs -> 42.4μs (96.4% slower)

# ----------------------------- #
# Edge Test Cases
# ----------------------------- #

def test_column_with_nan_values():
    # Edge: Categorical column with actual np.nan values
    df = pd.DataFrame({'A': pd.Series(['a', None, 'b'], dtype='category')})
    codeflash_output = fix_nan_category(df.copy()); result = codeflash_output # 291μs -> 333μs (12.8% slower)

def test_column_with_duplicate_column_names():
    # Edge: DataFrame with duplicate column names
    df = pd.DataFrame([[1, 'a'], [2, 'b']], columns=['X', 'X'])
    df.iloc[:, 1] = pd.Series(['a', 'b'], dtype='category')
    codeflash_output = fix_nan_category(df.copy()); result = codeflash_output # 81.9μs -> 55.4μs (47.9% faster)

def test_all_categorical_columns():
    # Edge: All columns are categorical
    df = pd.DataFrame({
        'A': pd.Series(['foo', 'bar'], dtype='category'),
        'B': pd.Series(['x', 'y'], dtype='category')
    })
    codeflash_output = fix_nan_category(df.copy()); result = codeflash_output # 372μs -> 427μs (12.8% slower)


def test_column_with_integer_categories():
    # Edge: Categorical column with integer categories
    df = pd.DataFrame({'A': pd.Series([1, 2, 3], dtype='category')})
    codeflash_output = fix_nan_category(df.copy()); result = codeflash_output # 302μs -> 355μs (14.8% slower)

def test_column_with_boolean_categories():
    # Edge: Categorical column with boolean categories
    df = pd.DataFrame({'A': pd.Series([True, False], dtype='category')})
    codeflash_output = fix_nan_category(df.copy()); result = codeflash_output # 286μs -> 330μs (13.4% slower)

def test_column_with_only_nan_values():
    # Edge: Categorical column with only np.nan values
    df = pd.DataFrame({'A': pd.Series([None, None], dtype='category')})
    codeflash_output = fix_nan_category(df.copy()); result = codeflash_output # 282μs -> 329μs (14.3% slower)

def test_column_with_empty_category():
    # Edge: Categorical column with no categories (empty)
    df = pd.DataFrame({'A': pd.Series([], dtype='category')})
    codeflash_output = fix_nan_category(df.copy()); result = codeflash_output # 286μs -> 323μs (11.6% slower)
    # No rows, so nothing to check for values

def test_column_with_object_dtype():
    # Edge: Object dtype column that looks like categorical but isn't
    df = pd.DataFrame({'A': ['a', 'b', 'c']})
    codeflash_output = fix_nan_category(df.copy()); result = codeflash_output # 66.7μs -> 70.5μs (5.46% slower)
    with pytest.raises(AttributeError):
        _ = result['A'].cat

# ----------------------------- #
# Large Scale Test Cases
# ----------------------------- #

def test_large_dataframe_many_rows():
    # Large: DataFrame with 1000 rows, single categorical column
    data = ['a', 'b', 'c', None] * 250  # 1000 values
    df = pd.DataFrame({'A': pd.Series(data, dtype='category')})
    codeflash_output = fix_nan_category(df.copy()); result = codeflash_output # 281μs -> 336μs (16.3% slower)

def test_large_dataframe_many_columns():
    # Large: DataFrame with 500 categorical columns and 500 int columns
    data = {f'C{i}': pd.Series(['x', 'y'], dtype='category') for i in range(500)}
    data.update({f'I{i}': [1, 2] for i in range(500)})
    df = pd.DataFrame(data)
    codeflash_output = fix_nan_category(df.copy()); result = codeflash_output # 85.4ms -> 73.4ms (16.4% faster)
    # All categorical columns should have 'nan' in categories
    for i in range(500):
        pass
    # Integer columns should not have cat accessor
    for i in range(500):
        with pytest.raises(AttributeError):
            _ = result[f'I{i}'].cat

def test_large_dataframe_duplicate_column_names():
    # Large: DataFrame with 1000 columns, all named 'A', alternating categorical and int
    columns = ['A'] * 1000
    data = []
    for i in range(1000):
        if i % 2 == 0:
            data.append(pd.Series(['foo', 'bar'], dtype='category'))
        else:
            data.append([1, 2])
    # Build with unique integer keys first (a dict keyed by 'A' would collapse
    # to a single column), then force duplicate column names
    df = pd.DataFrame({i: data[i] for i in range(1000)})
    df.columns = ['A'] * 1000
    codeflash_output = fix_nan_category(df.copy()); result = codeflash_output # 89.3ms -> 73.4ms (21.6% faster)
    # All even-indexed columns should have 'nan' in categories
    for i in range(0, 1000, 2):
        pass
    # All odd-indexed columns should not have cat accessor
    for i in range(1, 1000, 2):
        with pytest.raises(AttributeError):
            _ = result.iloc[:, i].cat

def test_large_dataframe_all_empty_categorical():
    # Large: 1000 columns, all empty categorical
    df = pd.DataFrame({f'C{i}': pd.Series([], dtype='category') for i in range(1000)})
    codeflash_output = fix_nan_category(df.copy()); result = codeflash_output # 152ms -> 150ms (0.730% faster)
    for i in range(1000):
        pass
# codeflash_output is used to check that the output of the original code is the same as that of the optimized code.
#------------------------------------------------
import pandas as pd
# imports
import pytest
from deepnote_toolkit.ocelots.pandas.utils import fix_nan_category

# unit tests

# -----------------------------
# 1. Basic Test Cases
# -----------------------------

def test_basic_single_categorical_column():
    # Basic: Single categorical column, no NaNs
    df = pd.DataFrame({"A": pd.Series(["x", "y"], dtype="category")})
    codeflash_output = fix_nan_category(df); result = codeflash_output # 309μs -> 342μs (9.79% slower)

def test_basic_multiple_categorical_columns():
    # Basic: Multiple categorical columns
    df = pd.DataFrame({
        "A": pd.Series(["a", "b"], dtype="category"),
        "B": pd.Series(["c", "d"], dtype="category")
    })
    codeflash_output = fix_nan_category(df); result = codeflash_output # 378μs -> 426μs (11.2% slower)

def test_basic_mixed_types():
    # Basic: Categorical and non-categorical columns
    df = pd.DataFrame({
        "A": pd.Series(["a", "b"], dtype="category"),
        "B": [1, 2],
        "C": ["x", "y"]
    })
    codeflash_output = fix_nan_category(df); result = codeflash_output # 295μs -> 262μs (12.6% faster)

# -----------------------------
# 2. Edge Test Cases
# -----------------------------


def test_edge_empty_dataframe():
    # Edge: Empty dataframe
    df = pd.DataFrame()
    codeflash_output = fix_nan_category(df); result = codeflash_output # 1.62μs -> 44.8μs (96.4% slower)

def test_edge_no_categorical_columns():
    # Edge: No categorical columns
    df = pd.DataFrame({"A": [1, 2], "B": ["x", "y"]})
    codeflash_output = fix_nan_category(df); result = codeflash_output # 114μs -> 72.6μs (57.6% faster)

def test_edge_all_nan_column():
    # Edge: Categorical column with all values as NaN
    df = pd.DataFrame({"A": pd.Series([None, None], dtype="category")})
    codeflash_output = fix_nan_category(df); result = codeflash_output # 297μs -> 340μs (12.5% slower)

def test_edge_duplicate_column_names():
    # Edge: DataFrame with duplicate column names
    df = pd.DataFrame(
        [
            ["a", "b"],
            ["c", "d"]
        ],
        columns=["X", "X"]
    )
    df.iloc[:, 0] = pd.Series(df.iloc[:, 0], dtype="category")
    df.iloc[:, 1] = pd.Series(df.iloc[:, 1], dtype="category")
    codeflash_output = fix_nan_category(df); result = codeflash_output # 76.7μs -> 55.1μs (39.0% faster)

def test_edge_empty_categorical_column():
    # Edge: Categorical column with no rows
    df = pd.DataFrame({"A": pd.Series([], dtype="category")})
    codeflash_output = fix_nan_category(df); result = codeflash_output # 290μs -> 326μs (11.0% slower)

def test_edge_non_string_category():
    # Edge: Categorical column with non-string categories
    df = pd.DataFrame({"A": pd.Series([1, 2], dtype="category")})
    codeflash_output = fix_nan_category(df); result = codeflash_output # 289μs -> 332μs (13.0% slower)

def test_edge_nan_values_in_categorical():
    # Edge: Categorical column with actual NaN values
    df = pd.DataFrame({"A": pd.Series(["a", None], dtype="category")})
    codeflash_output = fix_nan_category(df); result = codeflash_output # 281μs -> 332μs (15.2% slower)

# -----------------------------
# 3. Large Scale Test Cases
# -----------------------------

def test_large_scale_many_rows():
    # Large scale: Categorical column with many rows
    values = ["cat", "dog", "mouse"] * 333 + ["cat"]  # 1000 elements
    df = pd.DataFrame({"A": pd.Series(values, dtype="category")})
    codeflash_output = fix_nan_category(df); result = codeflash_output # 276μs -> 326μs (15.2% slower)

def test_large_scale_many_columns():
    # Large scale: Many categorical columns
    data = {f"col{i}": pd.Series(["x", "y"], dtype="category") for i in range(50)}
    df = pd.DataFrame(data)
    codeflash_output = fix_nan_category(df); result = codeflash_output # 7.27ms -> 7.26ms (0.022% faster)
    for col in df.columns:
        pass


def test_large_scale_all_empty_categorical():
    # Large scale: All columns are empty categorical columns
    df = pd.DataFrame({f"col{i}": pd.Series([], dtype="category") for i in range(20)})
    codeflash_output = fix_nan_category(df); result = codeflash_output # 3.05ms -> 3.04ms (0.026% faster)
    for col in df.columns:
        pass

# -----------------------------
# Mutation Testing: Ensure failures if function mutated
# -----------------------------
def test_mutation_fail_if_not_add_nan():
    # If function does not add "nan" category, this test should fail
    df = pd.DataFrame({"A": pd.Series(["x", "y"], dtype="category")})
    codeflash_output = fix_nan_category(df); result = codeflash_output # 291μs -> 337μs (13.8% slower)

def test_mutation_fail_if_non_categorical_modified():
    # If function modifies non-categorical columns, this test should fail
    df = pd.DataFrame({"A": [1, 2]})
    original = df.copy()
    codeflash_output = fix_nan_category(df); result = codeflash_output # 73.5μs -> 74.3μs (1.08% slower)
# codeflash_output is used to check that the output of the original code is the same as that of the optimized code.
⏪ Replay Tests and Runtime

| Test File::Test Function | Original ⏱️ | Optimized ⏱️ | Speedup |
|---|---|---|---|
| test_pytest_testsunittest_dataframe_browser_py_testsunittest_config_py_testsunittest_runtime_executor_py___replay_test_0.py::test_deepnote_toolkit_ocelots_pandas_utils_fix_nan_category | 245ms | 18.9ms | 1203%✅ |
| test_pytest_testsunittest_xdg_paths_py_testsunittest_jinjasql_utils_py_testsunittest_url_utils_py_testsun__replay_test_0.py::test_deepnote_toolkit_ocelots_pandas_utils_fix_nan_category | 244ms | 18.0ms | 1261%✅ |

To edit these changes, run `git checkout codeflash/optimize-fix_nan_category-mhmv7xt8`, commit your changes, and push.

Summary by CodeRabbit

  • Bug Fixes
    • Improved handling and performance when adding a placeholder category for missing values in categorical columns: the process now skips work when no categorical columns are present and adds the category only to the columns identified as categorical up front.

@misrasaurabh1 misrasaurabh1 requested a review from a team as a code owner November 13, 2025 00:34

coderabbitai bot commented Nov 13, 2025

📝 Walkthrough

The fix_nan_category function in deepnote_toolkit/ocelots/pandas/utils.py now first collects indices of categorical columns, returns immediately if none are found, and then calls add_categories("nan") for all identified categorical columns in a single pass. The change reduces repeated dtype checks and avoids work when there are no categorical columns; functional behavior for datasets with categorical columns remains the same.

Sequence Diagram(s)

mermaid
sequenceDiagram
    participant Caller
    participant fix_nan_category
    participant DataFrame
    Note over fix_nan_category,DataFrame: Original flow (per-column checks)
    Caller->>fix_nan_category: call
    fix_nan_category->>DataFrame: iterate columns
    DataFrame-->>fix_nan_category: column + dtype
    alt is categorical?
        fix_nan_category->>DataFrame: add_categories("nan") for column
        DataFrame-->>fix_nan_category: updated column
    else not categorical
        DataFrame-->>fix_nan_category: skip
    end
    fix_nan_category-->>Caller: return

mermaid
sequenceDiagram
    participant Caller
    participant fix_nan_category
    participant DataFrame
    Note over fix_nan_category,DataFrame: New flow (collect indices then bulk update)
    Caller->>fix_nan_category: call
    fix_nan_category->>DataFrame: scan columns -> collect categorical indices
    DataFrame-->>fix_nan_category: list of categorical indices
    alt no categorical indices
        fix_nan_category-->>Caller: return early
    else has categorical indices
        fix_nan_category->>DataFrame: add_categories("nan") for identified columns (bulk)
        DataFrame-->>fix_nan_category: updated columns
        fix_nan_category-->>Caller: return
    end

Pre-merge checks

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
|---|---|---|---|
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 0.00%, which is insufficient. The required threshold is 80.00%. | You can run @coderabbitai generate docstrings to improve docstring coverage. |

✅ Passed checks (2 passed)

| Check name | Status | Explanation |
|---|---|---|
| Description Check | ✅ Passed | Check skipped - CodeRabbit's high-level summary is enabled. |
| Title check | ✅ Passed | Title clearly describes the main optimization: a 137% performance improvement to the fix_nan_category function, directly matching the PR's core objective. |

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

📜 Review details

Configuration used: CodeRabbit UI

Review profile: ASSERTIVE

Plan: Pro

Disabled knowledge base sources:

  • Linear integration is disabled by default for public repositories

You can enable these sources in your CodeRabbit configuration.

📥 Commits

Reviewing files that changed from the base of the PR and between 9994c8f and 2552607.

📒 Files selected for processing (1)
  • deepnote_toolkit/ocelots/pandas/utils.py (1 hunks)
🔇 Additional comments (1)
deepnote_toolkit/ocelots/pandas/utils.py (1)

24-29: Solid optimization: pre-filter and early exit.

Pre-collecting categorical indices and returning early when none exist avoids wasted iteration. Correct approach.

Comment on lines +32 to +34
    for i in categorical_indices:
        column = df.iloc[:, i]
        df.iloc[:, i] = column.cat.add_categories("nan")

🧹 Nitpick | 🔵 Trivial

Minor: simplify the assignment.

Lines 33-34 can be combined into one statement without the intermediate variable.

     for i in categorical_indices:
-        column = df.iloc[:, i]
-        df.iloc[:, i] = column.cat.add_categories("nan")
+        df.iloc[:, i] = df.iloc[:, i].cat.add_categories("nan")
🤖 Prompt for AI Agents
In deepnote_toolkit/ocelots/pandas/utils.py around lines 32 to 34, the code uses
an intermediate variable 'column' to add a category; simplify by replacing the
two statements with a single assignment that updates the DataFrame column in
place, e.g. assign the result of df.iloc[:, i].cat.add_categories("nan")
directly back to df.iloc[:, i].

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

♻️ Duplicate comments (1)
deepnote_toolkit/ocelots/pandas/utils.py (1)

45-47: Intermediate variable still present.

Past review suggested combining lines 46-47 into a single statement. The intermediate variable remains.

📜 Review details

📥 Commits

Reviewing files that changed from the base of the PR and between 2552607 and b045cd1.

📒 Files selected for processing (1)
  • deepnote_toolkit/ocelots/pandas/utils.py (1 hunks)
🔇 Additional comments (2)
deepnote_toolkit/ocelots/pandas/utils.py (2)

41-42: Early exit is correct.

Avoids wasted iterations when no categorical columns exist. Validated by benchmark improvements.


36-48: Optimization logic is sound, but manual verification needed.

Pre-collection of categorical indices and early exit avoid repeated dtype checks. However, edge case testing couldn't run in this environment (pandas unavailable). Manually verify behavior with empty DataFrames and non-categorical inputs to confirm the early exit and iloc assignment work correctly across your pandas version.

Comment on lines +38 to +40
    categorical_indices = [
        i for i, dtype in enumerate(df.dtypes) if dtype.name == "category"
    ]

🧹 Nitpick | 🔵 Trivial

Consider using the pandas dtype class for dtype checking.

dtype.name == "category" works, but isinstance(dtype, pd.CategoricalDtype) is more idiomatic and robust. (pd.api.types.is_categorical_dtype is deprecated as of pandas 2.1 in favor of this isinstance check.)

     categorical_indices = [
-        i for i, dtype in enumerate(df.dtypes) if dtype.name == "category"
+        i for i, dtype in enumerate(df.dtypes) if isinstance(dtype, pd.CategoricalDtype)
     ]
🤖 Prompt for AI Agents
In deepnote_toolkit/ocelots/pandas/utils.py around lines 38 to 40, replace the
dtype name string comparison with a class-based check for categorical dtypes:
use isinstance(dtype, pd.CategoricalDtype) when filtering df.dtypes so detection
is more idiomatic and robust; update the list comprehension accordingly and
ensure pandas is imported as pd in the module.
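
For comparison, here is a minimal sketch of the string comparison from the PR next to a class-based check. It uses `isinstance(dtype, pd.CategoricalDtype)` because `pd.api.types.is_categorical_dtype` is deprecated in pandas 2.1 and later; the sample DataFrame and the variable names `by_name` and `by_class` are hypothetical.

```python
import pandas as pd

df = pd.DataFrame({
    "A": pd.Series(["a", "b"], dtype="category"),
    "B": [1, 2],
    "C": ["x", "y"],
})

# String comparison, as written in the PR.
by_name = [i for i, dtype in enumerate(df.dtypes) if dtype.name == "category"]

# Class-based check against the public CategoricalDtype type.
by_class = [
    i for i, dtype in enumerate(df.dtypes)
    if isinstance(dtype, pd.CategoricalDtype)
]

print(by_name, by_class)  # [0] [0]
```

Both approaches select the same columns here, so the choice is mainly about style and about avoiding a deprecated helper.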
