Drift Detection¶
Version
Drift detection was introduced in v0.2.0. Statistical tests (Kolmogorov-Smirnov, chi-square) were added in v0.5.0.
LavenderTown can compare two datasets to detect schema and distribution changes, helping you monitor data quality over time.
Overview¶
Drift detection identifies changes between a baseline dataset and a current dataset:
- Schema changes: New/removed columns, type changes, nullability changes
- Distribution changes: Value range shifts, null percentage changes, cardinality changes
Basic Usage¶
from lavendertown import Inspector
import pandas as pd
# Baseline dataset
baseline_df = pd.DataFrame({
"customer_id": [1, 2, 3, 4, 5],
"age": [25, 30, 35, 40, 45],
"purchase_amount": [100.50, 200.00, 150.75, 300.00, 250.50],
})
# Current dataset
current_df = pd.DataFrame({
"customer_id": [1, 2, 3, 4, 5, 6], # New row
"age": [25, 30, 35, 40, 45, 50], # New row
"purchase_amount": [100.50, 250.00, 150.75, 400.00, 250.50, 500.00], # Changed values
"new_column": [1, 2, 3, 4, 5, 6], # New column
})
# Detect drift
inspector = Inspector(current_df)
drift_findings = inspector.compare_with_baseline(
baseline_df=baseline_df,
comparison_type="full" # or "schema_only", "distribution_only"
)
Comparison Types¶
Full Comparison¶
Detects both schema and distribution changes:
Schema Only¶
Only detects structural changes:
drift_findings = inspector.compare_with_baseline(
baseline_df=baseline_df,
comparison_type="schema_only"
)
Detects: - New columns - Removed columns - Type changes - Nullability changes
Distribution Only¶
Only detects value distribution changes:
drift_findings = inspector.compare_with_baseline(
baseline_df=baseline_df,
comparison_type="distribution_only"
)
Detects: - Numeric range shifts - Null percentage changes - Cardinality changes - Value distribution shifts
Statistical Tests (Phase 6):
When lavendertown[stats] is installed, distribution comparison includes:
- Kolmogorov-Smirnov test for numeric columns (tests if distributions differ)
- Chi-square test for categorical columns (tests independence/contingency)
- Test statistics and p-values are included in finding metadata
Working with Drift Findings¶
Drift findings are GhostFinding objects with ghost_type="drift":
drift_findings = inspector.compare_with_baseline(baseline_df)
for finding in drift_findings:
if finding.ghost_type == "drift":
print(f"Column: {finding.column}")
print(f"Description: {finding.description}")
print(f"Change type: {finding.metadata.get('change_type')}")
print(f"Drift type: {finding.metadata.get('drift_type')}")
Finding Metadata¶
Drift findings include metadata:
change_type: Specific type of change (e.g., "column_added", "type_change")drift_type: Category ("schema" or "distribution")baseline_value: Value in baseline datasetcurrent_value: Value in current datasetks_statistic: Kolmogorov-Smirnov test statistic (numeric columns, Phase 6)p_value: Statistical test p-value (Phase 6)chi2_statistic: Chi-square test statistic (categorical columns, Phase 6)
Example Output¶
# Column: new_column
# Description: New column 'new_column' added to dataset
# Change type: column_added
# Drift type: schema
# Column: purchase_amount
# Description: Column 'purchase_amount' range shifted from [100.50, 300.00] to [100.50, 500.00]
# Change type: numeric_range
# Drift type: distribution
CLI Usage¶
Detect drift from the command line:
Options:
- --comparison-type [full|schema_only|distribution_only]: Type of comparison
- --output-format [json|csv|parquet]: Output format (Phase 6: Parquet support)
- --output-file PATH: Specific output file
Note: Parquet export requires lavendertown[parquet] (pip install lavendertown[parquet])
Use Cases¶
Data Pipeline Monitoring¶
Monitor data quality in ETL pipelines:
# After each pipeline run
current_df = load_current_data()
baseline_df = load_baseline_data()
inspector = Inspector(current_df)
drift_findings = inspector.compare_with_baseline(baseline_df)
# Alert on significant changes
for finding in drift_findings:
if finding.severity == "error":
send_alert(f"Critical drift detected: {finding.description}")
Version Comparison¶
Compare dataset versions:
# Compare production vs staging
production_df = load_production_data()
staging_df = load_staging_data()
inspector = Inspector(staging_df)
drift_findings = inspector.compare_with_baseline(production_df)
# Review changes before deployment
for finding in drift_findings:
print(f"⚠️ {finding.description}")
Best Practices¶
- Establish baselines: Use stable, validated datasets as baselines
- Monitor regularly: Run drift detection as part of your data quality checks
- Set thresholds: Define what level of drift requires attention
- Document changes: Keep records of expected vs unexpected changes
- Automate alerts: Set up alerts for critical drift findings
Next Steps¶
- Learn about Basic Usage for general data quality analysis
- See API Reference for detailed documentation