Title: Optimize pandas data types during file loading to reduce memory usage
Problem Description
The excel-toolkit does not optimize pandas data types when loading files. Pandas defaults to `int64`, `float64`, and `object` (string) types, which consume significantly more memory than necessary. For large files (50k-500k rows), this can double or triple memory usage.
Current Behavior
In `file_handlers.py`:

```python
# Line 110: Loads with default dtypes
df = pd.read_excel(path, sheet_name=sheet_name, header=header, engine=engine, **kwargs)

# Line 313: Loads with default dtypes
df = pd.read_csv(path, encoding=encoding, delimiter=delimiter, header=header, **kwargs)
```
No dtype optimization is performed during loading.
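For context, pandas can report the current footprint directly; `deep=True` counts the actual Python strings behind `object` columns (the file name here is illustrative):

```python
import pandas as pd

df = pd.read_csv("large_file.csv")
df.info(memory_usage="deep")        # frame-level total
print(df.memory_usage(deep=True))   # per-column breakdown, in bytes
```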
Memory Impact by Data Type
| Data Type | Memory per Element | For 500k Rows |
|---|---|---|
| `int64` (default) | 8 bytes | 4MB per column |
| `int32` | 4 bytes | 2MB per column (50% savings) |
| `int16` | 2 bytes | 1MB per column (75% savings) |
| `int8` | 1 byte | 0.5MB per column (87.5% savings) |
| `float64` (default) | 8 bytes | 4MB per column |
| `float32` | 4 bytes | 2MB per column (50% savings) |
| `object` (string) | Variable (~8-16 bytes average) | 4-8MB per column |
| `category` | Variable (~1-4 bytes) | 0.5-2MB per column (50-90% savings) |
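The per-column figures above are straightforward to reproduce:

```python
import numpy as np
import pandas as pd

col = pd.Series(np.arange(500_000, dtype="int64"))
print(col.memory_usage(index=False))                  # 4_000_000 bytes = 4 MB
print(col.astype("int32").memory_usage(index=False))  # 2_000_000 bytes = 2 MB
```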
Real-World Examples
Scenario 1: E-commerce data with 20 columns
500k rows × 20 columns with default dtypes:
- 5 integer columns (int64): 5 × 4MB = 20MB
- 5 float columns (float64): 5 × 4MB = 20MB
- 10 string columns (object): 10 × 8MB = 80MB
Total: 120MB
With optimized dtypes:
- 5 integer columns (int16/int32): 5 × 2MB = 10MB
- 5 float columns (float32): 5 × 2MB = 10MB
- 10 string columns (category for low-cardinality): 10 × 2MB = 20MB
Total: 40MB (67% savings)
Scenario 2: User activity log
500k rows with columns:
- user_id: int64 (range: 1 to 500,000) → Could be int32 (2MB vs 4MB)
- action_id: int64 (range: 1 to 50) → Could be int8 (0.5MB vs 4MB)
- timestamp: object → Could be datetime64 (4MB vs 8MB)
- status: int64 (0/1) → Could be bool (0.5MB vs 4MB)
- page_url: object (100k unique URLs) → Could be category (4MB vs 8MB)
Default dtypes: 28MB
Optimized dtypes: 11MB (61% savings)
Scenario 3: Sales data
500k rows:
- quantity: int64 (range: 1 to 1000) → int16 (1MB vs 4MB)
- price: float64 → float32 (2MB vs 4MB)
- discount: float64 (0-1) → float32 (2MB vs 4MB)
- customer_id: int64 (50k unique) → int32 (2MB vs 4MB)
- region: object (10 unique) → category (0.5MB vs 4MB)
- product_category: object (50 unique) → category (1MB vs 4MB)
Default dtypes: 24MB
Optimized dtypes: 8.5MB (65% savings)
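A runnable sketch of Scenario 3's conversions on synthetic data (column names and value distributions are illustrative):

```python
import numpy as np
import pandas as pd

n = 500_000
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "quantity": rng.integers(1, 1001, n),                 # int64 by default
    "price": rng.random(n) * 100,                         # float64
    "region": rng.choice(["north", "south", "east"], n),  # object
})
before = df.memory_usage(deep=True).sum()
df = df.astype({"quantity": "int16", "price": "float32", "region": "category"})
after = df.memory_usage(deep=True).sum()
print(f"{before / 2**20:.1f} MB -> {after / 2**20:.1f} MB")
```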
Affected Files
- `excel_toolkit/core/file_handlers.py` (lines 110, 313)
- Potentially: a new utility module for dtype optimization
Proposed Solution
1. Auto-Detect Optimal Types
```python
import pandas as pd


def optimize_dtypes(df: pd.DataFrame) -> dict[str, str]:
    """Automatically detect optimal dtypes for DataFrame columns."""
    dtype_mapping = {}
    for col in df.columns:
        col_dtype = df[col].dtype
        # Skip if already optimized
        if str(col_dtype) in ('category', 'bool', 'datetime64[ns]'):
            continue
        # Integer columns: downcast based on the observed value range
        if pd.api.types.is_integer_dtype(col_dtype):
            cmin = df[col].min()
            cmax = df[col].max()
            if cmin >= 0:  # Unsigned integers
                if cmax <= 255:
                    dtype_mapping[col] = 'uint8'
                elif cmax <= 65535:
                    dtype_mapping[col] = 'uint16'
                elif cmax <= 4294967295:
                    dtype_mapping[col] = 'uint32'
            else:  # Signed integers
                if cmin >= -128 and cmax <= 127:
                    dtype_mapping[col] = 'int8'
                elif cmin >= -32768 and cmax <= 32767:
                    dtype_mapping[col] = 'int16'
                elif cmin >= -2147483648 and cmax <= 2147483647:
                    dtype_mapping[col] = 'int32'
        # Float columns: float32 halves memory (at reduced precision)
        elif pd.api.types.is_float_dtype(col_dtype):
            dtype_mapping[col] = 'float32'
        # Object (string) columns
        elif col_dtype == 'object':
            unique_count = df[col].nunique()
            total_count = len(df)
            # Use category if < 50% unique values
            if total_count > 0 and unique_count / total_count < 0.5:
                dtype_mapping[col] = 'category'
    return dtype_mapping
```
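Applying the mapping is a single `astype` call. Note that pandas also ships `pd.to_numeric(..., downcast=...)`, which performs a similar per-column downcast and could replace the hand-rolled range checks for numeric columns:

```python
# Apply the detected mapping in one shot
df = df.astype(optimize_dtypes(df))

# Built-in alternative for a numeric column (no custom range logic needed):
# df["quantity"] = pd.to_numeric(df["quantity"], downcast="integer")
```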
2. Apply During File Loading
```python
# Result, ok, err, FileHandlerError: the toolkit's existing result helpers,
# as already used in file_handlers.py.
def read_excel_optimized(path: Path, **kwargs) -> Result[pd.DataFrame, FileHandlerError]:
    """Read Excel file with dtype optimization."""
    # First pass: read a sample to detect dtypes
    sample_df = pd.read_excel(path, nrows=10000)
    # Detect optimal dtypes from the sample
    dtype_mapping = optimize_dtypes(sample_df)
    print(f"Optimizing {len(dtype_mapping)} columns:")
    for col, dtype in dtype_mapping.items():
        print(f"  {col}: {dtype}")
    # Second pass: read the full file with optimized dtypes.
    # If rows beyond the sample fall outside the detected ranges,
    # the cast can fail; the error path below catches that.
    try:
        df = pd.read_excel(path, dtype=dtype_mapping, **kwargs)
        return ok(df)
    except Exception as e:
        return err(FileHandlerError(str(e)))
```
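One caveat with sample-based detection: values past the first 10,000 rows can fall outside the sampled range, making the second-pass cast overflow or raise. A safer (if more memory-hungry at load time) variant reads with default dtypes first and downcasts afterwards, e.g. via `safe_optimize_dtypes()` from section 5 below:

```python
# Sketch: skip the sampling pass entirely; optimize after a full read.
# Peak memory is higher, but out-of-range values can't corrupt the cast.
df_full = pd.read_excel(path, **kwargs)
df_full = safe_optimize_dtypes(df_full)
```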
3. Add CLI Option for Dtype Optimization
```python
@app.command()
def info(
    file_path: Path,
    optimize_types: bool = typer.Option(False, "--optimize-types", help="Optimize data types"),
):
    """Show file information with optional dtype optimization."""
    if optimize_types:
        print("Loading file with dtype optimization...")
        result = read_excel_optimized(file_path)  # returns Result; unwrap as elsewhere in the CLI
    else:
        result = read_excel(file_path)
```
4. Memory Savings Report
```python
def report_memory_savings(df_original: pd.DataFrame, df_optimized: pd.DataFrame):
    """Report memory savings from dtype optimization."""
    original_memory = df_original.memory_usage(deep=True).sum() / (1024**2)
    optimized_memory = df_optimized.memory_usage(deep=True).sum() / (1024**2)
    savings = original_memory - optimized_memory
    savings_percent = (savings / original_memory) * 100
    print("\nMemory Usage:")
    print(f"  Original:  {original_memory:.1f} MB")
    print(f"  Optimized: {optimized_memory:.1f} MB")
    print(f"  Savings:   {savings:.1f} MB ({savings_percent:.1f}%)")
```
5. Fallback for Failed Optimization
```python
def safe_optimize_dtypes(df: pd.DataFrame) -> pd.DataFrame:
    """Safely optimize dtypes, falling back to the original frame on failure."""
    try:
        dtype_mapping = optimize_dtypes(df)
        return df.astype(dtype_mapping)
    except Exception as e:
        print(f"Warning: dtype optimization failed: {e}")
        print("Falling back to default dtypes")
        return df
```
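Sections 4 and 5 compose naturally; a minimal usage sketch (the file name is illustrative):

```python
df_raw = pd.read_csv("sales.csv")
df_opt = safe_optimize_dtypes(df_raw)
report_memory_savings(df_raw, df_opt)
```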
6. Categorical Type for Strings
String columns with low cardinality benefit greatly from the `category` dtype:

```python
def should_be_categorical(series: pd.Series, threshold: float = 0.5) -> bool:
    """Determine if a series should use the category dtype."""
    if series.dtype != 'object' or len(series) == 0:
        return False
    unique_ratio = series.nunique() / len(series)
    return unique_ratio < threshold

# Examples:
# - "status" column with 5 unique values: category (95% memory reduction)
# - "country" column with 50 unique values out of 500k: category (90% reduction)
# - "user_id" with 400k unique values: object (category would be worse)
```
7. Type Inference from Column Names
```python
def infer_dtype_from_column_name(col_name: str, series: pd.Series) -> str | None:
    """Infer optimal dtype based on column name patterns."""
    col_lower = col_name.lower()
    # Boolean patterns
    if col_lower in ['is_active', 'is_deleted', 'is_verified', 'enabled', 'disabled']:
        if series.isin([0, 1, True, False]).all():
            return 'bool'
    # ID columns (likely can be smaller than int64)
    if '_id' in col_lower:  # also covers names ending in "_id"
        if series.min() >= 0 and series.max() <= 4294967295:  # fits in uint32
            return 'uint32'
    # Date/time columns (conversion needs pd.to_datetime, not a plain cast)
    if 'date' in col_lower or 'time' in col_lower:
        if series.dtype == 'object':
            return 'datetime64[ns]'
    # Percentage/ratio columns
    if col_lower in ['percent', 'ratio', 'rate', 'discount', 'commission']:
        if series.max() <= 1.0 and series.min() >= 0.0:
            return 'float32'
    return None  # No inference possible
```
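Usage would layer on top of the range-based detection; note that the datetime case needs `pd.to_datetime` (parsing) rather than a plain `astype`:

```python
for col in df.columns:
    hint = infer_dtype_from_column_name(col, df[col])
    if hint == 'datetime64[ns]':
        df[col] = pd.to_datetime(df[col], errors='coerce')  # parse, don't cast
    elif hint is not None:
        df[col] = df[col].astype(hint)
```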
Implementation Strategy
Phase 1: Basic Optimization (High Priority)
- Implement `optimize_dtypes()` function
- Add to file loading with sampling
- Show memory savings report
Phase 2: Advanced Features (Medium Priority)
- Add CLI option `--optimize-types`
- Implement column name inference
- Add smart categorical detection
Phase 3: User Control (Low Priority)
- Allow user to specify dtype mapping
- Save dtype mapping for reuse
- Add `--dtype` option to override auto-detection
Expected Impact
For a typical 500k row dataset:
| Metric | Before | After | Improvement |
|---|---|---|---|
| File size on disk | 50MB | 50MB | No change |
| Memory usage | 150MB | 75MB | 50% reduction |
| Load time | 10s | 11s | +10% (detection overhead) |
| Operation speed | Baseline | Baseline | No change |
The 10% slower load time is acceptable for 50% memory savings.
Special Cases to Handle
- Mixed types in columns: fall back to `object` dtype
- NaN values: numpy integer dtypes (e.g., `int64`) don't support NaN; use pandas' nullable `Int64` or leave the column as float (see the sketch below)
- String to category: only if < 50% unique values
- Boolean detection: check only 0/1/True/False values
- Datetime parsing: handle various formats
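On the NaN point specifically, a short demonstration of the promotion behavior and the nullable fallback:

```python
import pandas as pd

s = pd.Series([1, 2, None])
print(s.dtype)                   # float64: NaN silently promotes ints to float
# s.astype("int64")              # would raise: cannot convert NaN to integer
print(s.astype("Int64").dtype)   # pandas' nullable integer preserves the NaN
```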
Related Issues