Architecture

This document describes the architecture for LavenderTown, a Streamlit-first data quality inspection framework focused on detecting "data ghosts".

The design prioritizes:

- Clear separation of concerns
- Extensibility (plugin-style detectors)
- Streamlit-native rendering
- Low cognitive overhead for contributors

High-Level Architecture

┌──────────────────────────┐
│        User / App        │
│  (Streamlit Application) │
└─────────────┬────────────┘
              │
              ▼
┌──────────────────────────┐
│        Inspector         │
│  (Orchestrator Layer)    │
└─────────────┬────────────┘
              │
   ┌──────────┼──────────┐
   ▼          ▼          ▼
┌────────┐ ┌────────┐ ┌────────┐
│Ghost   │ │Ghost   │ │Ghost   │
│Detector│ │Detector│ │Detector│
│(Nulls) │ │(Types) │ │(Stats) │
└────┬───┘ └────┬───┘ └────┬───┘
     │          │          │
     └──────────┴──────────┘
                 │
                 ▼
        ┌──────────────────┐
        │   Findings Model │
        │ (Normalized Data)│
        └─────────┬────────┘
                  │
        ┌─────────┼─────────┐
        ▼                   ▼
┌──────────────┐   ┌────────────────┐
│ Streamlit UI │   │ Exporters      │
│ (Charts &    │   │ (JSON / CSV)   │
│  Tables)     │   └────────────────┘
└──────────────┘

Core Components

1. Inspector (Central Orchestrator)

Responsibilities:

- Accepts a DataFrame (Pandas or Polars)
- Registers and runs ghost detectors
- Aggregates findings
- Controls Streamlit rendering

Key API:

from lavendertown import Inspector

inspector = Inspector(df)
inspector.render()

The Inspector:

- Detects backend type (Pandas vs Polars)
- Applies caching (st.cache_data) where safe
- Acts as the single public-facing API
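One way the backend detection above could work is sketched below. The function name `detect_backend` is hypothetical and not part of LavenderTown's documented API. The sketch assumes the top-level package of the object's class is enough to tell the two backends apart, which avoids importing either library.

```python
def detect_backend(df: object) -> str:
    """Return "pandas" or "polars" based on where the object's class is defined.

    Hypothetical helper, not LavenderTown's actual implementation.
    """
    # Look at the top-level package of the class instead of importing
    # both libraries, so neither becomes a hard dependency of this check.
    root = type(df).__module__.split(".")[0]
    if root in ("pandas", "polars"):
        return root
    raise TypeError(f"Unsupported DataFrame type: {type(df).__name__}")
```

Checking the module name rather than using `isinstance` keeps the check cheap, at the cost of not recognizing subclasses defined in other packages.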

2. Ghost Detectors (Plugin System)

Each detector is:

- Stateless
- Focused on a single ghost category
- Easily swappable or extendable

Detectors:

- NullGhostDetector
- TypeGhostDetector
- OutlierGhostDetector
- TimeSeriesAnomalyDetector
- MLAnomalyDetector
- RuleBasedDetector

Interface:

class GhostDetector:
    def detect(self, df) -> list[GhostFinding]:
        ...

Detectors should never:

- Render UI
- Modify data
- Depend on Streamlit
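As an illustration of the interface and the rules above, here is a minimal sketch of a custom detector. `BlankStringDetector` is a hypothetical name, and the `GhostFinding` dataclass is an assumed rendering of the shared findings schema. To stay dependency-free, the sketch treats `df` as a plain mapping of column names to value lists; a real detector would use the Pandas or Polars API instead.

```python
from dataclasses import dataclass, field
from typing import Any, Mapping, Optional, Sequence


@dataclass(frozen=True)
class GhostFinding:
    # Assumed dataclass form of the shared findings schema.
    ghost_type: str
    column: str
    severity: str
    description: str
    row_indices: Optional[list[int]] = None
    metadata: dict[str, Any] = field(default_factory=dict)


class GhostDetector:
    def detect(self, df) -> list[GhostFinding]:
        raise NotImplementedError


class BlankStringDetector(GhostDetector):
    """Hypothetical detector: flags empty or whitespace-only strings."""

    def detect(self, df: Mapping[str, Sequence[Any]]) -> list[GhostFinding]:
        findings = []
        # Read-only pass over the data: no mutation, no UI calls, no state.
        for column, values in df.items():
            rows = [
                i for i, v in enumerate(values)
                if isinstance(v, str) and not v.strip()
            ]
            if rows:
                findings.append(
                    GhostFinding(
                        ghost_type="rule",
                        column=column,
                        severity="warning",
                        description=f"{len(rows)} blank string value(s)",
                        row_indices=rows,
                    )
                )
        return findings
```

The detector holds no instance state and imports nothing from Streamlit, so it can be exercised in a plain unit test.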

3. Findings Model (Normalization Layer)

All detectors emit findings in a shared schema.

GhostFinding:
    - ghost_type        # null, type, range, outlier, drift, rule
    - column            # affected column
    - severity          # info | warning | error
    - description       # human-readable
    - row_indices       # optional list[int]
    - metadata          # free-form dict

Benefits:

- UI and exporters don't care how a ghost was detected
- Easy to add new detectors without UI changes
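The schema above maps naturally onto a dataclass. The following is a sketch under that assumption, not the project's confirmed definition. The severity check in `__post_init__` and the example values are illustrative additions.

```python
from dataclasses import asdict, dataclass, field
from typing import Any, Optional

SEVERITIES = ("info", "warning", "error")


@dataclass(frozen=True)
class GhostFinding:
    ghost_type: str                           # null, type, range, outlier, drift, rule
    column: str                               # affected column
    severity: str                             # info | warning | error
    description: str                          # human-readable
    row_indices: Optional[list[int]] = None   # optional list of affected rows
    metadata: dict[str, Any] = field(default_factory=dict)  # free-form

    def __post_init__(self) -> None:
        # Assumed validation: reject severities outside the documented set.
        if self.severity not in SEVERITIES:
            raise ValueError(
                f"severity must be one of {SEVERITIES}, got {self.severity!r}"
            )


# Made-up example values, for illustration only.
finding = GhostFinding(
    ghost_type="null",
    column="email",
    severity="warning",
    description="3 of 100 values are missing",
    row_indices=[4, 17, 42],
    metadata={"null_ratio": 0.03},
)
record = asdict(finding)  # plain dict, convenient for UI tables and exporters
```

Because every consumer sees the same fields, a new detector only has to produce `GhostFinding` objects to show up in the UI and in exports.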

4. Streamlit UI Layer

Responsibilities:

- Present summaries and metrics
- Visualize ghosts (charts, tables, heatmaps)
- Filter and drill into problematic rows
- Explain "why this is a problem"

UI Sections:

- Overview metrics (total ghosts, severity counts)
- Sidebar ghost category filters
- Column-level visualizations
- Row-level preview
- Custom rule management
- Export options

Rendering Rule:

The UI reads from Findings only, never from detectors directly.
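To illustrate the rendering rule, the sketch below computes the overview metrics and applies a category filter using findings alone. The `summarize` function and the reduced `GhostFinding` are hypothetical. The resulting numbers could then be handed to Streamlit widgets such as `st.metric`, which is left out here so the example runs without Streamlit.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass(frozen=True)
class GhostFinding:
    # Reduced to the fields this sketch needs.
    ghost_type: str
    column: str
    severity: str


def summarize(
    findings: Iterable[GhostFinding],
    ghost_types: Optional[set[str]] = None,
) -> dict:
    """Build overview metrics; ghost_types mimics the sidebar category filter."""
    selected = [
        f for f in findings
        if ghost_types is None or f.ghost_type in ghost_types
    ]
    counts = Counter(f.severity for f in selected)
    return {
        "total": len(selected),
        "by_severity": {
            level: counts.get(level, 0)
            for level in ("info", "warning", "error")
        },
        "columns": sorted({f.column for f in selected}),
    }
```

Since the function never touches a detector or the raw DataFrame, the same summary logic would also serve a non-Streamlit frontend.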

5. Export Layer

Supports exporting findings to:

- JSON (machine readable)
- CSV (analyst friendly)
- Pandera schemas
- Great Expectations expectations
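The JSON and CSV targets can be sketched with the standard library alone. The function names `to_json` and `to_csv` are hypothetical, and storing nested fields as JSON strings inside the CSV is one possible design choice rather than the project's documented format. The Pandera and Great Expectations exporters are not covered here.

```python
import csv
import io
import json
from dataclasses import asdict, dataclass, field
from typing import Any, Iterable, Optional


@dataclass(frozen=True)
class GhostFinding:
    # Assumed dataclass form of the shared findings schema.
    ghost_type: str
    column: str
    severity: str
    description: str
    row_indices: Optional[list[int]] = None
    metadata: dict[str, Any] = field(default_factory=dict)


FIELDS = ["ghost_type", "column", "severity", "description", "row_indices", "metadata"]


def to_json(findings: Iterable[GhostFinding]) -> str:
    """Serialize findings as a JSON array of objects."""
    return json.dumps([asdict(f) for f in findings], indent=2)


def to_csv(findings: Iterable[GhostFinding]) -> str:
    """Serialize findings as CSV text, one finding per row."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=FIELDS)
    writer.writeheader()
    for finding in findings:
        row = asdict(finding)
        # Nested values become JSON strings so each finding fits one CSV row.
        row["row_indices"] = json.dumps(row["row_indices"])
        row["metadata"] = json.dumps(row["metadata"])
        writer.writerow(row)
    return buffer.getvalue()
```

Both functions depend only on the findings schema, so they work unchanged for any detector.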

Data Flow Summary

1. User loads data in Streamlit
2. Inspector initializes
3. Inspector runs detectors
4. Detectors emit normalized findings
5. Findings are cached and aggregated
6. UI renders interactive views
7. User optionally exports results
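The middle of this flow (run detectors, collect normalized findings, aggregate) can be sketched as follows. Every name here is hypothetical, including the two toy detectors and `run_detectors`. The data is a plain mapping standing in for a DataFrame, and ordering findings by severity is an assumed aggregation policy. Caching and rendering are omitted.

```python
from dataclasses import dataclass
from typing import Any, Mapping, Sequence


@dataclass(frozen=True)
class GhostFinding:
    # Reduced to the fields this sketch needs.
    ghost_type: str
    column: str
    severity: str
    description: str


class NoneValueDetector:
    """Toy detector: flags columns that contain None."""

    def detect(self, df: Mapping[str, Sequence[Any]]) -> list[GhostFinding]:
        return [
            GhostFinding("null", col, "warning",
                         f"{sum(v is None for v in vals)} missing value(s)")
            for col, vals in df.items()
            if any(v is None for v in vals)
        ]


class MixedTypeDetector:
    """Toy detector: flags columns whose non-null values mix Python types."""

    def detect(self, df: Mapping[str, Sequence[Any]]) -> list[GhostFinding]:
        findings = []
        for col, vals in df.items():
            kinds = {type(v).__name__ for v in vals if v is not None}
            if len(kinds) > 1:
                findings.append(
                    GhostFinding("type", col, "error", f"mixed types: {sorted(kinds)}")
                )
        return findings


SEVERITY_RANK = {"error": 0, "warning": 1, "info": 2}


def run_detectors(df, detectors) -> list[GhostFinding]:
    """Run every detector and return findings ordered by severity."""
    findings = [f for detector in detectors for f in detector.detect(df)]
    return sorted(findings, key=lambda f: SEVERITY_RANK[f.severity])


data = {"age": [31, "42", None], "city": ["Oslo", "Lima", "Pune"]}
findings = run_detectors(data, [NoneValueDetector(), MixedTypeDetector()])
```

Everything downstream of `run_detectors` sees only the list of findings, which is what lets the UI and exporters stay independent of individual detectors.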

Why This Architecture Works

- Encourages clean separation of logic and UI
- Makes Streamlit rendering predictable
- Allows incremental detector development
- Enables future non-Streamlit frontends