Skip to content

Roadmap

Phase 0 — Foundations ✅ COMPLETE

  • Repo setup & packaging
  • Core Inspector class
  • Pandas support
  • Streamlit rendering shell

Phase 1 — MVP ✅ COMPLETE

  • Null detection & visualization
  • Type inconsistency detection
  • Basic outlier detection
  • CSV upload UI
  • Findings export (JSON/CSV)

Phase 2 — Power Features ✅ COMPLETE

  • Polars support
  • Rule authoring UI
  • Dataset comparison (drift)
  • Performance optimizations
  • Theme support

Phase 3 — Ecosystem Integration ✅ COMPLETE

  • Pandera rule export
  • Great Expectations export
  • CLI wrapper
  • Streamlit Cloud demo app

Phase 4 — Advanced Ghosts ✅ COMPLETE

  • Time-series anomalies
  • Cross-column logic
  • ML-assisted anomaly detection
  • Collaboration features

Phase 5 — Quick Wins ✅ COMPLETE

Based on research recommendations (see docs/RESEARCH_PYTHON_PACKAGES.md):

  • Rich CLI output enhancement
  • orjson JSON serialization
  • Hypothesis property-based testing
  • python-dotenv configuration

Phase 6 — Feature Enhancements ✅ COMPLETE

  • Typer CLI framework migration
  • PyOD anomaly detection (40+ ML algorithms)
  • Ruptures change point detection
  • scipy.stats statistical tests for drift detection
  • PyArrow Parquet export
  • Faker test data generation
  • ydata-profiling integration

Phase 7 — Advanced Integrations ✅ COMPLETE

  • Plotly interactive visualizations
  • tsfresh time-series features
  • Streamlit Extras UI components
  • SQLAlchemy database backend
  • Joblib parallel detector execution (deferred to future release)

Success Metrics

  • ✅ <5 lines to usable dashboard
  • ⚠️ Sub-2s load for 100k rows (varies by system)
  • ✅ Clear ghost explanations

Future Enhancements (Post-Phase 7)

  • Cloud-based collaboration storage integration
  • Real-time collaboration features (WebSocket-based)
  • Advanced ML models (deep learning, autoencoders)
  • Time-series forecasting integration (Prophet)
  • Automated rule suggestion based on ML findings
  • Performance benchmarking and optimization for large-scale datasets
  • Dask integration for cluster-scale deployments

Stretch Ideas

  • VS Code extension
  • Jupyter widget
  • dbt integration