LavenderTown
A Streamlit-first Python package for detecting and visualizing "data ghosts": type inconsistencies, nulls, invalid values, schema drift, and anomalies in tabular datasets.
LavenderTown helps you quickly identify data quality issues in your datasets through an intuitive, interactive Streamlit interface. It is aimed at data scientists, analysts, and engineers who need to understand their data quality before diving into analysis.
Features
Version Information
See Version Mapping for details on when features were introduced.
- Zero-config data quality insights - Get started with minimal setup
- Streamlit-native UI - Fully integrated with Streamlit, no HTML embeds
- Interactive ghost detection - Drill down into problematic rows
- Pandas & Polars support - Works with your existing data pipelines
- Exportable findings - Download results as JSON, CSV, or Parquet
- Dataset Comparison - Detect schema and distribution drift with statistical tests
- Custom Rules - Create and manage custom data quality rules via UI
- Enhanced File Upload - Drag-and-drop with animated progress and auto encoding detection
- Modular UI Components - Flexible component system for customizing the Inspector
- Interactive Visualizations - Plotly backend with zoom, pan, and 3D charts
- Advanced Time-Series Features - tsfresh integration for 700+ time-series features
- Enhanced UI Components - Streamlit Extras for improved metric cards and badges
- Database Backend - SQLAlchemy support for SQLite and PostgreSQL
- High Performance - Optimized for datasets up to millions of rows
- Enhanced CLI Tool - Interactive CLI with progress bars and formatted output
- Ecosystem Integration - Export rules to Pandera and Great Expectations
- Configuration Management - Environment-based configuration with .env files
- Advanced ML Detection - 40+ ML anomaly detection algorithms via PyOD
- Time-Series Analysis - Change point detection and comprehensive profiling
- Statistical Testing - Kolmogorov-Smirnov and chi-square tests for drift
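To make the drift tests in the list above more concrete, here is a minimal sketch of the two-sample Kolmogorov-Smirnov statistic, which is the largest gap between the empirical CDFs of a baseline sample and a newer one. This is an illustration of the technique only, not LavenderTown's API: the function name `ks_statistic` and the sample values are hypothetical, and in practice `scipy.stats.ks_2samp` would normally be used because it also returns a p-value.

```python
from bisect import bisect_right


def ks_statistic(baseline, current):
    """Two-sample KS statistic: the largest gap between the empirical CDFs."""
    a = sorted(baseline)
    b = sorted(current)
    if not a or not b:
        raise ValueError("both samples must be non-empty")
    max_gap = 0.0
    # Both CDFs are step functions, so the gap only changes at observed values.
    for x in set(a) | set(b):
        cdf_a = bisect_right(a, x) / len(a)  # share of baseline values <= x
        cdf_b = bisect_right(b, x) / len(b)  # share of current values <= x
        max_gap = max(max_gap, abs(cdf_a - cdf_b))
    return max_gap


# Hypothetical samples: an unchanged column and a fully shifted one.
baseline = [1.0, 2.0, 3.0, 4.0, 5.0]
shifted = [11.0, 12.0, 13.0, 14.0, 15.0]
print(ks_statistic(baseline, baseline))
print(ks_statistic(baseline, shifted))
```

A statistic near 0 means the two distributions look alike, while a value near 1 means they barely overlap.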
Quick Start
import streamlit as st
from lavendertown import Inspector
import pandas as pd
# Load your data
df = pd.read_csv("your_data.csv")
# Create inspector and render
inspector = Inspector(df)
inspector.render() # This must be called within a Streamlit app context
That's it! Save this code in a file (e.g., app.py) and run streamlit run app.py to see the interactive data quality dashboard.
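If you do not have a dataset at hand, a small CSV with deliberate problems is enough to try the dashboard. The sketch below uses only the standard library; the column names and values are made up for illustration, and which findings LavenderTown reports for them is not something this example asserts.

```python
import csv
import io


def build_sample_csv() -> str:
    """Return CSV text containing a few deliberate data quality problems."""
    rows = [
        {"id": "1", "age": "34", "country": "US"},
        {"id": "2", "age": "", "country": "DE"},         # missing value
        {"id": "3", "age": "unknown", "country": "FR"},  # text in a numeric column
        {"id": "4", "age": "-5", "country": "US"},       # implausible value
        {"id": "5", "age": "41", "country": ""},         # missing value
    ]
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=["id", "age", "country"])
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


if __name__ == "__main__":
    # Write the file next to app.py so pd.read_csv("your_data.csv") finds it.
    with open("your_data.csv", "w", newline="", encoding="utf-8") as handle:
        handle.write(build_sample_csv())
```

Run the script once, then start the app with streamlit run app.py as described above.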
Installation
pip install lavendertown
For optional features:
# Quote the extras so that shells such as zsh do not try to expand the brackets
# Polars support
pip install "lavendertown[polars]"
# Ecosystem integrations
pip install "lavendertown[pandera]"
pip install "lavendertown[great_expectations]"
# Enhanced CLI with Rich formatting
pip install "lavendertown[cli]"
# ML and time-series features
pip install "lavendertown[ml]" # PyOD + scikit-learn for 40+ ML algorithms
pip install "lavendertown[timeseries]" # Ruptures + statsmodels + tsfresh for time-series analysis
pip install "lavendertown[profiling]" # ydata-profiling for comprehensive reports
pip install "lavendertown[parquet]" # PyArrow for Parquet export
pip install "lavendertown[stats]" # scipy.stats for statistical tests
# Phase 7 features (v0.7.0)
pip install "lavendertown[plotly]" # Plotly for interactive visualizations
pip install "lavendertown[ui]" # Streamlit Extras for enhanced UI components
pip install "lavendertown[database]" # SQLAlchemy for database backend
# All optional dependencies
pip install "lavendertown[all]"
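After installing extras, it can help to check which optional libraries are actually importable in the current environment. The following is a minimal sketch using only the standard library. The mapping from extras to import names is an assumption inferred from the comments in the install commands above (for example, that the `cli` extra pulls in `rich`); the package's real dependency list may differ.

```python
from importlib.util import find_spec

# Assumed mapping from extras to import names, inferred from the install
# comments; the package's actual dependency list may differ.
EXTRA_MODULES = {
    "polars": ["polars"],
    "pandera": ["pandera"],
    "great_expectations": ["great_expectations"],
    "cli": ["rich"],
    "ml": ["pyod", "sklearn"],
    "timeseries": ["ruptures", "statsmodels", "tsfresh"],
    "profiling": ["ydata_profiling"],
    "parquet": ["pyarrow"],
    "stats": ["scipy"],
    "plotly": ["plotly"],
    "ui": ["streamlit_extras"],
    "database": ["sqlalchemy"],
}


def missing_modules(extra_modules):
    """Map each extra to the modules that cannot be found in this environment."""
    missing = {}
    for extra, modules in extra_modules.items():
        absent = [name for name in modules if find_spec(name) is None]
        if absent:
            missing[extra] = absent
    return missing


for extra, modules in missing_modules(EXTRA_MODULES).items():
    print(f"{extra}: missing {', '.join(modules)}")
```

`find_spec` only looks the module up without importing it, so the check stays fast even for heavy libraries.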
Documentation
- Getting Started - Installation and quick start guide
- User Guide - Comprehensive usage documentation
- API Reference - Complete API documentation
- Examples - Code examples and tutorials
- Design Documentation - Architecture and design decisions
Ghost Categories
LavenderTown detects four main categories of data quality issues:
- Structural Ghosts - Mixed dtypes, schema drift, unexpected nullability
- Value Ghosts - Out-of-range values, regex violations, enum violations
- Completeness Ghosts - Null density thresholds, conditional nulls
- Statistical Ghosts - Outliers (IQR method), distribution shifts
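Two of the checks named above, null density and IQR-based outliers, are simple enough to sketch in plain Python. This illustrates the general techniques rather than LavenderTown's own code: the function names, the multiplier `k=1.5`, the "inclusive" quartile convention, and the sample values are all assumptions made for the example.

```python
from statistics import quantiles


def null_density(values):
    """Fraction of entries that are None."""
    return sum(v is None for v in values) / len(values)


def iqr_outlier_indices(values, k=1.5):
    """Indices of values outside [Q1 - k*IQR, Q3 + k*IQR], ignoring None."""
    present = [v for v in values if v is not None]
    q1, _, q3 = quantiles(present, n=4, method="inclusive")
    spread = q3 - q1
    low, high = q1 - k * spread, q3 + k * spread
    return [
        i for i, v in enumerate(values)
        if v is not None and not low <= v <= high
    ]


# Hypothetical column with one missing entry and one implausible value.
ages = [31, 29, None, 35, 33, 30, 250, 32]
print(null_density(ages))         # 0.125
print(iqr_outlier_indices(ages))  # [6]
```

A completeness rule would then compare the density against a chosen threshold, and the returned indices correspond to the rows a user would drill into.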
Each finding includes:
- Ghost type: Category of the issue
- Column: Affected column name
- Severity: info, warning, or error
- Description: Human-readable explanation
- Row indices: Specific rows affected (when applicable)
- Metadata: Additional diagnostic information
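As a rough mental model, a finding can be pictured as a small record with those six fields. The dataclass below is a hypothetical stand-in written for this page: the class name `FindingSketch`, the field names, and the example values are assumptions, and the library's real class may be named and shaped differently.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List, Optional


@dataclass
class FindingSketch:
    """Hypothetical stand-in for a finding; fields mirror the list above."""

    ghost_type: str
    column: str
    severity: str
    description: str
    row_indices: Optional[List[int]] = None  # None when not applicable
    metadata: Dict[str, Any] = field(default_factory=dict)

    def __post_init__(self):
        # The three severity levels come from the list above.
        if self.severity not in ("info", "warning", "error"):
            raise ValueError(f"unknown severity: {self.severity!r}")


# Illustrative values only; these are not real LavenderTown output.
findings = [
    FindingSketch("null", "age", "warning", "Some values are null", [3, 17]),
    FindingSketch("type", "price", "error", "Mixed int and str values"),
]
errors = [f for f in findings if f.severity == "error"]
print([f.column for f in errors])  # ['price']
```

Keeping row indices optional reflects the "when applicable" note: a schema-level issue has no specific rows to point at.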
Links
- GitHub Repository: https://github.com/eddiethedean/lavendertown
- PyPI Package: https://pypi.org/project/lavendertown/
- Issues: https://github.com/eddiethedean/lavendertown/issues
Made with ❤️ for the data quality community