LavenderTown

A Streamlit-first Python package for detecting and visualizing "data ghosts": type inconsistencies, nulls, invalid values, schema drift, and anomalies in tabular datasets.

LavenderTown helps you quickly identify data quality issues in your datasets through an interactive Streamlit interface. It is aimed at data scientists, analysts, and engineers who need to assess data quality before starting analysis.

Features

Version Information

See Version Mapping for details on when features were introduced.

Zero-config data quality insights - Get started with minimal setup
Streamlit-native UI - Fully integrated with Streamlit, no HTML embeds
Interactive ghost detection - Drill down into problematic rows
Pandas & Polars support - Works with your existing data pipelines
Exportable findings - Download results as JSON, CSV, or Parquet
Dataset Comparison - Detect schema and distribution drift with statistical tests
Custom Rules - Create and manage custom data quality rules via UI
Enhanced File Upload - Drag-and-drop with animated progress and auto encoding detection
Modular UI Components - Flexible component system for customizing the Inspector
Interactive Visualizations - Plotly backend with zoom, pan, and 3D charts
Advanced Time-Series Features - tsfresh integration for 700+ time-series features
Enhanced UI Components - Streamlit Extras for improved metric cards and badges
Database Backend - SQLAlchemy support for SQLite and PostgreSQL
High Performance - Optimized for datasets up to millions of rows
Enhanced CLI Tool - Interactive CLI with progress bars and formatted output
Ecosystem Integration - Export rules to Pandera and Great Expectations
Configuration Management - Environment-based configuration with .env files
Advanced ML Detection - 40+ ML anomaly detection algorithms via PyOD
Time-Series Analysis - Change point detection and comprehensive profiling
Statistical Testing - Kolmogorov-Smirnov and chi-square tests for drift
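
As a rough illustration of the drift tests named in the list above, the sketch below applies a two-sample Kolmogorov-Smirnov test to a numeric column and a chi-square test to category counts using SciPy directly. It is not LavenderTown's API: the helper names `ks_drift` and `chi_square_drift`, the 0.05 significance level, and the synthetic data are all assumptions made for the example.

```python
import numpy as np
from scipy import stats


def ks_drift(baseline, current, alpha=0.05):
    """Two-sample KS test on a numeric column; True means the distributions differ."""
    result = stats.ks_2samp(baseline, current)
    return bool(result.pvalue < alpha), float(result.statistic), float(result.pvalue)


def chi_square_drift(baseline_counts, current_counts, alpha=0.05):
    """Chi-square test on per-category counts taken from two datasets."""
    # Rows are datasets, columns are categories (same category order in both).
    table = np.array([baseline_counts, current_counts])
    chi2, p_value, _dof, _expected = stats.chi2_contingency(table)
    return bool(p_value < alpha), float(chi2), float(p_value)


# Synthetic data (hypothetical): the "current" numeric column is shifted by one
# standard deviation, and the category mix is reversed.
rng = np.random.default_rng(seed=0)
baseline = rng.normal(loc=0.0, scale=1.0, size=2000)
shifted = rng.normal(loc=1.0, scale=1.0, size=2000)

numeric_drifted, ks_stat, ks_p = ks_drift(baseline, shifted)
same_drifted, _, _ = ks_drift(baseline, baseline)
cat_drifted, chi2_stat, chi2_p = chi_square_drift([500, 300, 200], [200, 300, 500])

print(f"numeric drift: {numeric_drifted} (KS statistic {ks_stat:.3f})")
print(f"categorical drift: {cat_drifted} (chi-square {chi2_stat:.1f})")
```

A small p-value only says the two samples are unlikely to come from the same distribution; with large datasets even tiny shifts become significant, so the test statistic is worth reading alongside the p-value.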

Quick Start

import streamlit as st
from lavendertown import Inspector
import pandas as pd

# Load your data
df = pd.read_csv("your_data.csv")

# Create inspector and render
inspector = Inspector(df)
inspector.render()  # This must be called within a Streamlit app context

That's it! Save this code in a file (e.g., app.py) and launch it with the command streamlit run app.py to see the interactive data quality dashboard.

Installation

pip install lavendertown

For optional features:

# Extras are quoted so that shells such as zsh do not treat the brackets as glob patterns

# Polars support
pip install "lavendertown[polars]"

# Ecosystem integrations
pip install "lavendertown[pandera]"
pip install "lavendertown[great_expectations]"

# Enhanced CLI with Rich formatting
pip install "lavendertown[cli]"

# ML and time-series features
pip install "lavendertown[ml]"          # PyOD + scikit-learn for 40+ ML algorithms
pip install "lavendertown[timeseries]"  # Ruptures + statsmodels + tsfresh for time-series analysis
pip install "lavendertown[profiling]"   # ydata-profiling for comprehensive reports
pip install "lavendertown[parquet]"     # PyArrow for Parquet export
pip install "lavendertown[stats]"       # scipy.stats for statistical tests

# Phase 7 features (v0.7.0)
pip install "lavendertown[plotly]"      # Plotly for interactive visualizations
pip install "lavendertown[ui]"          # Streamlit Extras for enhanced UI components
pip install "lavendertown[database]"    # SQLAlchemy for database backend

# All optional dependencies
pip install "lavendertown[all]"
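
To see which optional dependencies are importable in the current environment, a small standard-library check along these lines may help. It is a sketch, not part of LavenderTown: the `EXTRA_MODULES` mapping is an assumption built from the package names in the install comments above, and it lists only a subset of the extras.

```python
from importlib.util import find_spec

# Assumed mapping from extra name to the top-level modules it should provide,
# inferred from the comments in the install commands (not read from the package).
EXTRA_MODULES = {
    "polars": ["polars"],
    "pandera": ["pandera"],
    "great_expectations": ["great_expectations"],
    "ml": ["pyod", "sklearn"],
    "timeseries": ["ruptures", "statsmodels", "tsfresh"],
    "profiling": ["ydata_profiling"],
    "parquet": ["pyarrow"],
    "stats": ["scipy"],
    "plotly": ["plotly"],
    "ui": ["streamlit_extras"],
    "database": ["sqlalchemy"],
}


def missing_modules(modules):
    """Return the module names that cannot be found in this environment."""
    return [name for name in modules if find_spec(name) is None]


def extras_report(extra_modules=EXTRA_MODULES):
    """Map each extra to the list of its modules that are not installed."""
    return {extra: missing_modules(mods) for extra, mods in extra_modules.items()}


report = extras_report()
for extra, missing in report.items():
    status = "ok" if not missing else "missing: " + ", ".join(missing)
    print(f"{extra:20s} {status}")
```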

Documentation

Getting Started - Installation and quick start guide
User Guide - Comprehensive usage documentation
API Reference - Complete API documentation
Examples - Code examples and tutorials
Design Documentation - Architecture and design decisions

Ghost Categories

LavenderTown detects four main categories of data quality issues:

Structural Ghosts - Mixed dtypes, schema drift, unexpected nullability
Value Ghosts - Out-of-range values, regex violations, enum violations
Completeness Ghosts - Null density thresholds, conditional nulls
Statistical Ghosts - Outliers (IQR method), distribution shifts
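
To make two of these checks concrete, here is a minimal pandas sketch of the IQR outlier rule and a null-density measurement. It illustrates the general techniques only; the function names, the conventional multiplier `k=1.5`, and the sample data are assumptions, and LavenderTown's own detectors may differ in detail.

```python
import pandas as pd


def iqr_outlier_rows(series, k=1.5):
    """Return index labels of values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    values = series.dropna()  # nulls are a completeness issue, not an outlier
    q1 = values.quantile(0.25)
    q3 = values.quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    mask = (values < lower) | (values > upper)
    return values.index[mask].tolist()


def null_density(series):
    """Fraction of entries in the column that are null."""
    return float(series.isna().mean())


# Hypothetical sample column with one extreme value and one null.
df = pd.DataFrame({"price": [10.0, 11.0, 9.5, 10.5, 250.0, None, 10.2]})

outliers = iqr_outlier_rows(df["price"])
density = null_density(df["price"])

print("outlier rows:", outliers)
print(f"null density: {density:.3f}")
```

A completeness check would then compare the density against a chosen threshold, for example flagging any column where more than 10% of values are null.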

Each finding includes:

Ghost type: Category of the issue
Column: Affected column name
Severity: info, warning, or error
Description: Human-readable explanation
Row indices: Specific rows affected (when applicable)
Metadata: Additional diagnostic information
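
One way to picture a finding is as a small record carrying the fields listed above. The dataclass below is a hypothetical sketch for illustration: the class name `Finding`, the attribute names, and the `by_severity` helper are assumptions and may not match the objects LavenderTown actually returns.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List, Optional

SEVERITIES = ("info", "warning", "error")


@dataclass
class Finding:
    """Hypothetical record mirroring the fields described for each finding."""

    ghost_type: str                          # category of the issue
    column: str                              # affected column name
    severity: str                            # "info", "warning", or "error"
    description: str                         # human-readable explanation
    row_indices: Optional[List[int]] = None  # affected rows, when applicable
    metadata: Dict[str, Any] = field(default_factory=dict)

    def __post_init__(self):
        if self.severity not in SEVERITIES:
            raise ValueError(f"unknown severity: {self.severity!r}")


def by_severity(findings, severity):
    """Keep only the findings with the given severity."""
    return [f for f in findings if f.severity == severity]


findings = [
    Finding("null", "email", "warning", "12% of values are null",
            metadata={"null_density": 0.12}),
    Finding("outlier", "price", "error", "1 value outside the IQR bounds",
            row_indices=[4]),
]

errors = by_severity(findings, "error")
print([f.column for f in errors])
```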

Links

GitHub Repository: https://github.com/eddiethedean/lavendertown
PyPI Package: https://pypi.org/project/lavendertown/
Issues: https://github.com/eddiethedean/lavendertown/issues

Made with ❤️ for the data quality community