Performance Guide¶
This guide provides comprehensive performance optimization strategies for LavenderTown, including benchmarks, profiling techniques, and best practices for different dataset sizes.
Performance Targets¶
From the roadmap success metrics:
- <2s load for 100k rows (target)
- Sub-second for <10k rows (typical)
- <5s for 1M rows (optimized)
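One way to make these targets actionable is to encode them as time budgets and compare measured runs against them. The sketch below is only an illustration: `target_seconds` and `meets_target` are hypothetical helpers, not part of LavenderTown, and the bucketing of row counts that fall between the stated sizes is an assumption.

```python
from typing import Optional

# Roadmap targets as (maximum rows, time budget in seconds).
# Assumption: a row count between two stated sizes uses the next larger bucket.
TARGETS = [
    (10_000, 1.0),      # sub-second for <10k rows
    (100_000, 2.0),     # <2s for 100k rows
    (1_000_000, 5.0),   # <5s for 1M rows
]


def target_seconds(rows: int) -> Optional[float]:
    """Return the time budget for a dataset with `rows` rows, or None if no target applies."""
    for max_rows, budget in TARGETS:
        if rows <= max_rows:
            return budget
    return None


def meets_target(rows: int, elapsed: float) -> Optional[bool]:
    """Return True/False if a target applies to this size, otherwise None."""
    budget = target_seconds(rows)
    if budget is None:
        return None
    return elapsed < budget
```

For example, `meets_target(100_000, 1.5)` is `True`, while a 2.5 second run on the same size would miss the target.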
Benchmarking¶
Running Benchmarks¶
The repository includes a benchmark script that measures detection performance across four dataset sizes: 1k, 10k, 100k, and 1M rows.
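As a rough illustration of how such a size sweep can be structured, the sketch below times a callable against synthetic data of increasing size and keeps the best of several runs. It is not the repository's script: `make_values`, `run_benchmark`, and `fake_detect` are hypothetical names, and `fake_detect` is a stand-in for a call such as `Inspector(df).detect()`.

```python
import time
from typing import Callable, Dict, List, Optional, Sequence


def make_values(n: int) -> List[Optional[int]]:
    """Build n synthetic values; every 10th one is missing (None)."""
    return [None if i % 10 == 0 else i for i in range(n)]


def run_benchmark(
    detect: Callable[[List[Optional[int]]], object],
    sizes: Sequence[int] = (1_000, 10_000, 100_000, 1_000_000),
    repeats: int = 3,
) -> Dict[int, float]:
    """Return the best wall-clock time in seconds for each dataset size."""
    results: Dict[int, float] = {}
    for size in sizes:
        values = make_values(size)  # data generation is excluded from the timing
        best = float("inf")
        for _ in range(repeats):
            start = time.perf_counter()
            detect(values)
            best = min(best, time.perf_counter() - start)
        results[size] = best
    return results


def fake_detect(values: List[Optional[int]]) -> int:
    """Hypothetical stand-in for a detector: count missing values."""
    return sum(1 for v in values if v is None)
```

Calling `run_benchmark(fake_detect)` returns a dictionary keyed by row count. Taking the minimum over repeats reduces noise from other processes on the machine.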
Typical Performance¶
| Dataset Size | Rows | Columns | Pandas (s) | Polars (s) | Status |
|---|---|---|---|---|---|
| Small | 1k | 10 | ~0.1 | ~0.1 | ✓ Fast |
| Medium | 10k | 10 | ~0.5 | ~0.3 | ✓ Fast |
| Large | 100k | 10 | ~2-5 | ~0.8-2 | ⚠️ Varies |
| Very Large | 1M | 10 | ~20-60 | ~5-15 | ⚠️ Slow |
Performance factors:
- Number of columns (more columns = slower)
- Data types (string operations are slower than numeric)
- Number of findings (more issues detected = more processing)
- System resources (CPU, memory, disk I/O)
Optimization Strategies¶
1. Choose the Right Backend¶
Pandas - Best for:
- Small to medium datasets (<100k rows)
- Familiar API and ecosystem
- Rich feature set
Polars - Best for:
- Large datasets (>100k rows)
- Performance-critical workflows
- Memory efficiency
# For large datasets, use Polars
import polars as pl
from lavendertown import Inspector
df = pl.read_csv("large_file.csv") # Faster CSV reading
inspector = Inspector(df) # Automatically detects Polars
findings = inspector.detect() # Typically 2-5x faster than Pandas on large datasets
Performance comparison:
import time
import pandas as pd
import polars as pl
from lavendertown import Inspector
# Large dataset: one numeric column with 1M rows
data = {"value": range(1, 1_000_001)}
# Pandas (perf_counter is a monotonic clock intended for timing)
df_pd = pd.DataFrame(data)
start = time.perf_counter()
inspector = Inspector(df_pd)
findings_pd = inspector.detect()
pandas_time = time.perf_counter() - start
# Polars
df_pl = pl.DataFrame(data)
start = time.perf_counter()
inspector = Inspector(df_pl)
findings_pl = inspector.detect()
polars_time = time.perf_counter() - start
print(f"Pandas: {pandas_time:.2f}s")
print(f"Polars: {polars_time:.2f}s")
print(f"Speedup: {pandas_time / polars_time:.2f}x")
2. Leverage Caching¶
Use Streamlit's caching to avoid redundant computation:
import streamlit as st
from lavendertown import Inspector
import pandas as pd
@st.cache_data
def analyze_data(df: pd.DataFrame):
    """Cache analysis results."""
    inspector = Inspector(df)
    return inspector.detect()

df = pd.read_csv("data.csv")
# First run: computes
# Subsequent runs: uses cache
findings = analyze_data(df)
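Cache key considerations: Streamlit derives the cache key by hashing the function's arguments, so the hash itself has a cost on every call. A custom hash function can be supplied through hash_funcs; prefer a vectorized hash over rendering the whole DataFrame to a string:
@st.cache_data(hash_funcs={pd.DataFrame: lambda x: int(pd.util.hash_pandas_object(x).sum())})
def analyze_with_custom_hash(df):
    """Use a custom hash for the DataFrame (covers values and index, not column names)."""
    inspector = Inspector(df)
    return inspector.detect()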
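To see why the cache key matters, the following sketch shows the general idea of content-keyed caching with the standard library only. It is a conceptual illustration and not how `st.cache_data` is implemented; `content_key` and `ContentCache` are hypothetical names.

```python
import hashlib
from typing import Callable, Dict, Sequence


def content_key(rows: Sequence[Sequence[object]]) -> str:
    """Derive a cache key from the data's content rather than its identity."""
    digest = hashlib.sha256()
    for row in rows:
        digest.update(repr(tuple(row)).encode("utf-8"))
    return digest.hexdigest()


class ContentCache:
    """Memoize an expensive function on the content of its input."""

    def __init__(self, compute: Callable[[Sequence[Sequence[object]]], object]):
        self._compute = compute
        self._store: Dict[str, object] = {}
        self.misses = 0  # number of times the wrapped function actually ran

    def __call__(self, rows: Sequence[Sequence[object]]) -> object:
        key = content_key(rows)
        if key not in self._store:
            self.misses += 1
            self._store[key] = self._compute(rows)
        return self._store[key]
```

Two equal datasets produce the same key and share one computation, while any changed value produces a new key. The trade-off is that hashing reads every row, so the key must be much cheaper to compute than the analysis it protects.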
3. Selective Detector Usage¶
Only enable detectors you need:
from lavendertown import Inspector
from lavendertown.detectors.null import NullGhostDetector
from lavendertown.detectors.type import TypeGhostDetector
# Instead of all detectors, use only what you need
detectors = [
    NullGhostDetector(null_threshold=0.15),
    TypeGhostDetector(),
    # Skip OutlierGhostDetector for faster analysis
]
inspector = Inspector(df, detectors=detectors)
findings = inspector.detect() # Faster than default
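Before dropping a detector it helps to know which ones dominate the run time. The sketch below times each detector separately and sorts the results, slowest first. The detectors here are plain callables used as hypothetical stand-ins; with LavenderTown, each callable could wrap something like `Inspector(df, detectors=[d]).detect()`.

```python
import time
from typing import Callable, Dict, List, Sequence, Tuple


def time_detectors(
    detectors: Dict[str, Callable[[Sequence], object]],
    data: Sequence,
) -> List[Tuple[str, float]]:
    """Run each detector once and return (name, seconds) pairs, slowest first."""
    timings = []
    for name, detect in detectors.items():
        start = time.perf_counter()
        detect(data)
        timings.append((name, time.perf_counter() - start))
    return sorted(timings, key=lambda item: item[1], reverse=True)


# Hypothetical stand-in detectors for illustration only.
def count_nulls(values: Sequence) -> int:
    return sum(v is None for v in values)


def count_duplicates(values: Sequence) -> int:
    return len(values) - len(set(values))


if __name__ == "__main__":
    sample = [None if i % 7 == 0 else i % 100 for i in range(50_000)]
    for name, seconds in time_detectors(
        {"nulls": count_nulls, "duplicates": count_duplicates}, sample
    ):
        print(f"{name}: {seconds:.4f}s")
```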
4. Data Sampling for Exploration¶
For initial exploration, sample your data:
import pandas as pd
from lavendertown import Inspector
# Load full dataset
full_df = pd.read_csv("large_file.csv")
# Sample for quick analysis
sample_df = full_df.sample(n=10_000, random_state=42)
# Quick analysis on sample
inspector = Inspector(sample_df)
findings = inspector.detect()
# If the sample shows no errors, analyze the full dataset
if not any(f.severity == "error" for f in findings):
    inspector_full = Inspector(full_df)
    findings_full = inspector_full.detect()
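The sample-then-escalate pattern can be wrapped in a reusable function. This is a minimal sketch using plain Python sequences; `Finding`, `analyze_with_escalation`, and `detect_negatives` are hypothetical stand-ins and not LavenderTown types.

```python
import random
from dataclasses import dataclass
from typing import Callable, List, Sequence


@dataclass
class Finding:
    """Hypothetical stand-in for a LavenderTown finding."""
    severity: str
    message: str


def analyze_with_escalation(
    rows: Sequence[float],
    detect: Callable[[Sequence[float]], List[Finding]],
    sample_size: int = 10_000,
    seed: int = 42,
) -> List[Finding]:
    """Check a random sample first; run on all rows only if the sample has no errors."""
    if len(rows) <= sample_size:
        return detect(rows)
    sample = random.Random(seed).sample(rows, sample_size)  # seeded for reproducibility
    sample_findings = detect(sample)
    if any(f.severity == "error" for f in sample_findings):
        return sample_findings  # problems already visible; fix these before a full run
    return detect(rows)


def detect_negatives(rows: Sequence[float]) -> List[Finding]:
    """Hypothetical detector: flag negative values as errors."""
    return [Finding("error", f"negative value: {v}") for v in rows if v < 0]
```

A clean sample does not guarantee clean data, which is why the full pass still runs when the sample raises no errors.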
5. Column Selection¶
Analyze only relevant columns:
from lavendertown import Inspector
# Select only columns of interest
relevant_columns = ["price", "quantity", "category"]
df_filtered = df[relevant_columns]
inspector = Inspector(df_filtered)
findings = inspector.detect() # Faster with fewer columns
6. Chunked Processing¶
For very large datasets, process in chunks:
import pandas as pd
from lavendertown import Inspector
from typing import List
from lavendertown.models import GhostFinding
def analyze_in_chunks(filepath: str, chunk_size: int = 10000) -> List[GhostFinding]:
    """Analyze large file in chunks."""
    all_findings = []
    for chunk in pd.read_csv(filepath, chunksize=chunk_size):
        inspector = Inspector(chunk)
        findings = inspector.detect()
        all_findings.extend(findings)
    # Findings are computed per chunk, so the same issue can appear
    # once per chunk; deduplicate or aggregate if needed
    return all_findings

findings = analyze_in_chunks("very_large_file.csv")
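One caveat with chunking is that ratio-based checks run per chunk: a column with 15% nulls overall may exceed the threshold in some chunks and not in others. A way around this is to accumulate counts across chunks and apply the threshold once at the end. The sketch below shows the idea with the standard library; `iter_chunks` and `global_null_ratios` are hypothetical helpers, and treating an empty CSV field as null is an assumption.

```python
import csv
import io
from collections import Counter
from typing import Dict, Iterable, Iterator, List, TextIO


def iter_chunks(rows: Iterable[dict], chunk_size: int) -> Iterator[List[dict]]:
    """Yield lists of at most chunk_size rows."""
    chunk: List[dict] = []
    for row in rows:
        chunk.append(row)
        if len(chunk) == chunk_size:
            yield chunk
            chunk = []
    if chunk:
        yield chunk


def global_null_ratios(csv_file: TextIO, chunk_size: int = 10_000) -> Dict[str, float]:
    """Compute each column's null ratio over the whole file, one chunk at a time."""
    reader = csv.DictReader(csv_file)
    null_counts: Counter = Counter()
    total_rows = 0
    for chunk in iter_chunks(reader, chunk_size):
        total_rows += len(chunk)
        for row in chunk:
            for column, value in row.items():
                if value == "":  # assumption: empty field means null
                    null_counts[column] += 1
    columns = reader.fieldnames or []
    return {c: (null_counts[c] / total_rows if total_rows else 0.0) for c in columns}


demo = io.StringIO("a,b\n1,\n2,x\n,\n3,y\n")
print(global_null_ratios(demo, chunk_size=3))  # {'a': 0.25, 'b': 0.5}
```

Only the running counts stay in memory, so the peak footprint is bounded by the chunk size rather than the file size.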
7. Memory Optimization¶
Reduce memory usage:
import pandas as pd
# Use appropriate data types
df["id"] = df["id"].astype("int32") # Instead of int64
df["category"] = df["category"].astype("category") # Categorical for strings
df["price"] = df["price"].astype("float32") # Instead of float64
# Drop unused columns
df = df.drop(columns=["unused_col1", "unused_col2"])
# Use Polars for better memory efficiency
import polars as pl
df = pl.read_csv("file.csv") # More memory-efficient
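The effect of narrower numeric types can be estimated with simple arithmetic before touching the data. The sketch below multiplies row count by bytes per value; `estimate_numeric_bytes` is a hypothetical helper, and the estimate covers only the raw numeric buffers (it ignores the index and any string or categorical columns).

```python
from typing import Dict

# Bytes per value for fixed-width numeric dtypes.
DTYPE_BYTES = {"int32": 4, "int64": 8, "float32": 4, "float64": 8}


def estimate_numeric_bytes(rows: int, column_dtypes: Dict[str, str]) -> int:
    """Estimate the raw buffer size of the given numeric columns."""
    return rows * sum(DTYPE_BYTES[dtype] for dtype in column_dtypes.values())


before = estimate_numeric_bytes(1_000_000, {"id": "int64", "price": "float64"})
after = estimate_numeric_bytes(1_000_000, {"id": "int32", "price": "float32"})
print(f"before: {before / 1024**2:.1f} MiB, after: {after / 1024**2:.1f} MiB")
# prints: before: 15.3 MiB, after: 7.6 MiB
```

Halving the width halves the footprint, but `int32` only holds values up to about 2.1 billion and `float32` carries roughly 7 significant digits, so check value ranges before downcasting.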
Profiling and Measurement¶
Using cProfile¶
Profile your code to identify bottlenecks:
import cProfile
import pstats
from lavendertown import Inspector
import pandas as pd
df = pd.read_csv("data.csv")
profiler = cProfile.Profile()
profiler.enable()
inspector = Inspector(df)
findings = inspector.detect()
profiler.disable()
stats = pstats.Stats(profiler)
stats.sort_stats('cumulative')
stats.print_stats(20) # Top 20 functions
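When profiling several code paths, it can be convenient to wrap the steps above in a helper that returns the report as a string. This is a minimal sketch using only the standard library; `profile_call` is a hypothetical name, and the callable passed in would typically wrap `Inspector(df).detect()`.

```python
import cProfile
import io
import pstats
from typing import Callable, Tuple, TypeVar

T = TypeVar("T")


def profile_call(func: Callable[[], T], top: int = 20) -> Tuple[T, str]:
    """Run func under cProfile and return (result, report of the top functions)."""
    profiler = cProfile.Profile()
    profiler.enable()
    try:
        result = func()
    finally:
        profiler.disable()  # always stop profiling, even if func raises
    buffer = io.StringIO()
    stats = pstats.Stats(profiler, stream=buffer)
    stats.strip_dirs().sort_stats("cumulative").print_stats(top)
    return result, buffer.getvalue()


def busy_work() -> int:
    """Stand-in workload for the example."""
    return sum(i * i for i in range(10_000))


if __name__ == "__main__":
    _, report = profile_call(busy_work, top=5)
    print(report)
```

Returning the report as text makes it easy to log or attach to a bug report instead of printing straight to the console.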
Time-Specific Operations¶
Measure specific operations:
import time
import pandas as pd
from lavendertown import Inspector
df = pd.read_csv("data.csv")
# Time detection only (loading is excluded)
start = time.perf_counter()
inspector = Inspector(df)
findings = inspector.detect()
detection_time = time.perf_counter() - start
print(f"Detection took {detection_time:.2f} seconds")
print(f"Found {len(findings)} issues")
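To time several stages separately (for example loading versus detection), a small context manager keeps the measurement code out of the way. The sketch below uses `time.perf_counter`, a monotonic clock intended for measuring durations; `timed` is a hypothetical helper and the two stages are stand-ins for real work.

```python
import time
from contextlib import contextmanager
from typing import Dict, Iterator


@contextmanager
def timed(label: str, timings: Dict[str, float]) -> Iterator[None]:
    """Record the wall-clock duration of the enclosed block under `label`."""
    start = time.perf_counter()
    try:
        yield
    finally:
        timings[label] = time.perf_counter() - start


timings: Dict[str, float] = {}
with timed("load", timings):
    values = list(range(100_000))  # stand-in for reading a file
with timed("detect", timings):
    missing = sum(v is None for v in values)  # stand-in for detection

for label, seconds in timings.items():
    print(f"{label}: {seconds:.4f}s")
```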
Memory Profiling¶
Monitor memory usage:
import tracemalloc
from lavendertown import Inspector
tracemalloc.start()
inspector = Inspector(df)
findings = inspector.detect()
current, peak = tracemalloc.get_traced_memory()
print(f"Current memory: {current / 1024 / 1024:.2f} MB")
print(f"Peak memory: {peak / 1024 / 1024:.2f} MB")
tracemalloc.stop()
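A reusable variant of the same measurement is sketched below. `measure_peak` is a hypothetical helper; it relies on `tracemalloc.reset_peak`, which requires Python 3.9 or later. Keep in mind that `tracemalloc` tracks allocations made through Python's allocator, so memory allocated natively by a compiled backend may not show up in these numbers.

```python
import tracemalloc
from typing import Callable, Tuple, TypeVar

T = TypeVar("T")


def measure_peak(func: Callable[[], T]) -> Tuple[T, int]:
    """Run func and return (result, peak bytes allocated above the starting level)."""
    already_tracing = tracemalloc.is_tracing()
    if not already_tracing:
        tracemalloc.start()
    tracemalloc.reset_peak()  # Python 3.9+: measure the peak for this call only
    baseline, _ = tracemalloc.get_traced_memory()
    try:
        result = func()
        _, peak = tracemalloc.get_traced_memory()
    finally:
        if not already_tracing:
            tracemalloc.stop()
    return result, max(peak - baseline, 0)


if __name__ == "__main__":
    _, peak_bytes = measure_peak(lambda: [0] * 1_000_000)
    print(f"Peak: {peak_bytes / 1024 / 1024:.2f} MB")
```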
Detector-Specific Optimization¶
Null Detector¶
Optimize null detection by setting appropriate thresholds:
from lavendertown.detectors.null import NullGhostDetector
# Higher threshold = fewer columns flagged, so fewer findings to process
detector = NullGhostDetector(null_threshold=0.2) # Only flag >20% nulls
Outlier Detector¶
Outlier detection can be expensive. Consider:
from lavendertown.detectors.outlier import OutlierGhostDetector
# Use higher multiplier for fewer outliers detected
detector = OutlierGhostDetector(multiplier=2.5) # Less sensitive = faster
# Or skip for very large datasets
# detectors = [NullGhostDetector(), TypeGhostDetector()] # Skip outliers
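To illustrate how a multiplier changes the number of flagged points, here is a sketch of the common IQR fence rule. Whether `OutlierGhostDetector` uses exactly this rule is an assumption; `iqr_outliers` is a hypothetical function written for this example.

```python
import statistics
from typing import List, Sequence


def iqr_outliers(values: Sequence[float], multiplier: float = 1.5) -> List[float]:
    """Return values outside [Q1 - m*IQR, Q3 + m*IQR]."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower = q1 - multiplier * iqr
    upper = q3 + multiplier * iqr
    return [v for v in values if v < lower or v > upper]


values = list(range(1, 21)) + [35, 100]
print(iqr_outliers(values, multiplier=1.5))  # [35, 100]
print(iqr_outliers(values, multiplier=2.5))  # [100]
```

In this sketch the quartile computation costs the same for any multiplier; the saving comes from producing fewer findings for later stages to handle.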
Time-Series Detector¶
Optimize time-series analysis:
from lavendertown.detectors.timeseries import TimeSeriesAnomalyDetector
# Use simpler method for faster analysis
detector = TimeSeriesAnomalyDetector(
    method="zscore", # Faster than "seasonal"
    sensitivity=3.5, # Higher = fewer checks
    window_size=5, # Smaller window = faster
)
ML Anomaly Detector¶
Optimize ML detection:
from lavendertown.detectors.ml_anomaly import MLAnomalyDetector
# Use faster algorithm
detector = MLAnomalyDetector(
    algorithm="isolation_forest", # Faster than "lof" or "one_class_svm"
    contamination=0.1,
    max_samples=5000, # Limit samples for large datasets
)
Best Practices by Dataset Size¶
Small Datasets (<10k rows)¶
- Use default Pandas backend
- Enable all detectors
- No special optimization needed
- Typical time: <1 second
from lavendertown import Inspector
import pandas as pd
df = pd.read_csv("small_data.csv")
inspector = Inspector(df) # Default is fine
inspector.render()
Medium Datasets (10k-100k rows)¶
- Pandas or Polars (Polars recommended for >50k)
- Use caching if in Streamlit
- Consider disabling expensive detectors if not needed
- Typical time: 0.5-3 seconds
from lavendertown import Inspector
import polars as pl # Better for medium datasets
df = pl.read_csv("medium_data.csv")
inspector = Inspector(df)
findings = inspector.detect()
Large Datasets (100k-1M rows)¶
- Use Polars backend
- Sample for initial exploration
- Selective detector usage
- Implement caching
- Consider chunked processing
- Typical time: 2-15 seconds
from lavendertown import Inspector
import polars as pl
# Sample first
df_full = pl.read_csv("large_data.csv")
df_sample = df_full.sample(n=10_000, seed=42) # Fixed seed for reproducible samples
# Quick check on sample
inspector = Inspector(df_sample)
sample_findings = inspector.detect()
# If needed, analyze full dataset
# if needs_full_analysis:
#     inspector_full = Inspector(df_full)
#     findings = inspector_full.detect()
Very Large Datasets (>1M rows)¶
- Always use Polars
- Sample for exploration
- Process in chunks if needed
- Use specific detectors only
- Consider approximate methods
- Typical time: 10-60+ seconds
from lavendertown import Inspector
from lavendertown.detectors.null import NullGhostDetector
import polars as pl
# Sample for exploration
df = pl.read_csv("very_large.csv")
df_sample = df.sample(n=50_000, seed=42) # Fixed seed for reproducible samples
# Use only essential detectors
detectors = [NullGhostDetector(null_threshold=0.1)]
inspector = Inspector(df_sample, detectors=detectors)
findings = inspector.detect()
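The size-based recommendations above can be summarized as a small lookup function. This is a sketch only: `Plan` and `recommend` are hypothetical names, and the exact handling of the boundaries (for example whether 100k rows counts as medium or large) is an assumption.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Plan:
    tier: str
    backend: str
    sample_first: bool
    selective_detectors: bool
    consider_chunking: bool


def recommend(rows: int) -> Plan:
    """Map a row count to the strategy suggested in this guide."""
    if rows < 10_000:
        return Plan("small", "pandas", False, False, False)
    if rows < 100_000:
        # Polars is recommended above roughly 50k rows
        backend = "polars" if rows > 50_000 else "pandas"
        return Plan("medium", backend, False, False, False)
    if rows <= 1_000_000:
        return Plan("large", "polars", True, True, True)
    return Plan("very large", "polars", True, True, True)


if __name__ == "__main__":
    for n in (5_000, 75_000, 500_000, 5_000_000):
        print(f"{n:>9,} rows -> {recommend(n)}")
```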
Performance Monitoring¶
Track Performance Metrics¶
import time
from lavendertown import Inspector
import pandas as pd
def analyze_with_metrics(df: pd.DataFrame):
    """Analyze with performance tracking."""
    metrics = {}
    # Time detection
    start = time.perf_counter()
    inspector = Inspector(df)
    findings = inspector.detect()
    metrics["detection_time"] = time.perf_counter() - start
    # Count findings
    metrics["total_findings"] = len(findings)
    metrics["errors"] = sum(1 for f in findings if f.severity == "error")
    metrics["warnings"] = sum(1 for f in findings if f.severity == "warning")
    # Dataset info
    metrics["rows"] = len(df)
    metrics["columns"] = len(df.columns)
    return findings, metrics

df = pd.read_csv("data.csv")
findings, metrics = analyze_with_metrics(df)
print(f"Analyzed {metrics['rows']:,} rows in {metrics['detection_time']:.2f}s")
print(f"Found {metrics['total_findings']} issues")
Troubleshooting Performance Issues¶
Slow Detection¶
- Check dataset size: Use sampling for very large datasets
- Check detectors: Disable expensive detectors if not needed
- Use Polars: Switch to Polars for large datasets
- Profile code: Identify bottlenecks with cProfile
Memory Issues¶
- Use Polars: More memory-efficient than Pandas
- Process in chunks: Don't load entire dataset at once
- Optimize data types: Use appropriate dtypes (int32 vs int64)
- Drop unused columns: Remove columns you don't need
High CPU Usage¶
- Sample data: Analyze sample instead of full dataset
- Reduce detectors: Use only necessary detectors
- Increase thresholds: Higher thresholds = less computation
- Use faster algorithms: Choose faster ML/time-series methods
Future Optimizations¶
Potential areas for performance improvements:
- Parallel detector execution: Run detectors in parallel
- Incremental analysis: Update findings incrementally
- Streaming processing: Process data in streams
- Approximate methods: Faster approximate algorithms
- GPU acceleration: Use GPU for certain operations
- Compiled detectors: Use Numba or Cython for hot paths
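As an illustration of the first idea, the sketch below runs independent detectors concurrently with a thread pool. This is a hypothetical design, not an existing LavenderTown feature: `run_detectors_parallel` and the two detectors are invented for the example. Threads only help when the detectors spend their time in code that releases the GIL, which compiled DataFrame operations often do; pure-Python detectors would see little benefit.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, List, Optional, Sequence


def run_detectors_parallel(
    detectors: Dict[str, Callable[[Sequence], List[int]]],
    data: Sequence,
    max_workers: Optional[int] = None,
) -> Dict[str, List[int]]:
    """Run each detector in its own thread and collect results by name."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {name: pool.submit(detect, data) for name, detect in detectors.items()}
        # result() re-raises any exception that occurred in the worker thread
        return {name: future.result() for name, future in futures.items()}


# Hypothetical detectors that return the positions of problem values.
def find_nulls(values: Sequence) -> List[int]:
    return [i for i, v in enumerate(values) if v is None]


def find_negatives(values: Sequence) -> List[int]:
    return [i for i, v in enumerate(values) if v is not None and v < 0]


results = run_detectors_parallel(
    {"nulls": find_nulls, "negatives": find_negatives},
    [1, None, -3, 4, None],
)
print(results)  # {'nulls': [1, 4], 'negatives': [2]}
```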
Summary¶
- Small datasets: Use defaults, no optimization needed
- Medium datasets: Consider Polars, use caching
- Large datasets: Use Polars, sample first, selective detectors
- Very large datasets: Polars + sampling + chunking + selective detectors
Remember: Measure first, optimize second. Profile your specific use case to identify actual bottlenecks before optimizing.