ML Anomaly Detector

Detects anomalies using machine learning algorithms.

MLAnomalyDetector

Bases: GhostDetector

Detects anomalies using machine learning algorithms.

This detector uses unsupervised machine learning algorithms to identify anomalous patterns in numeric data. It supports multiple algorithms:

scikit-learn algorithms:

- Isolation Forest: Good for general anomaly detection
- Local Outlier Factor (LOF): Density-based detection
- One-Class SVM: Boundary-based detection

PyOD algorithms (40+ additional algorithms available, requires pyod>=1.1.0):

- ABOD: Angle-Based Outlier Detection
- CBLOF: Cluster-Based Local Outlier Factor
- HBOS: Histogram-based Outlier Score
- KNN: k-Nearest Neighbors
- MCD: Minimum Covariance Determinant
- PCA: Principal Component Analysis
- And many more (see PyOD documentation for full list)

The detector automatically selects numeric columns for analysis and normalizes features before applying ML algorithms. For large datasets (more rows than max_samples, which defaults to 10,000), it fits and scores a random sample of rows to improve performance.
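To make the normalization step concrete, here is a standard-library sketch of z-score scaling, the transformation performed by scikit-learn's StandardScaler (which the source listing on this page uses). The helper name `zscore` is hypothetical and not part of lavendertown; the real detector scales every numeric column at once.

```python
from statistics import fmean, pstdev


def zscore(values):
    """Scale values to zero mean and unit variance.

    Uses the population standard deviation, as StandardScaler does.
    `zscore` is an illustrative helper, not a lavendertown function.
    """
    mean = fmean(values)
    std = pstdev(values)
    if std == 0:
        # Constant column: every value equals the mean, so all scale to 0
        return [0.0 for _ in values]
    return [(v - mean) / std for v in values]


# One column with an obvious outlier in the last position
scaled = zscore([10.0, 12.0, 11.0, 13.0, 95.0])
```

After scaling, columns measured in different units contribute comparably to distance-based algorithms such as LOF or KNN.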

Attributes:

Name Type Description
algorithm

ML algorithm to use. Options: "isolation_forest" (default), "lof", "one_class_svm", or one of the PyOD algorithms "abod", "cblof", "hbos", "knn", "mcd", "pca", "iforest", "ocsvm".

contamination

Expected proportion of anomalies (0.0 to 0.5). Default is 0.1 (10%).

random_state

Random seed for reproducibility. None for random.

max_samples

Maximum number of samples to use for training. If the dataset is larger, it will be sampled. Passing None falls back to 10,000.

Example

Use default Isolation Forest:

detector = MLAnomalyDetector()
findings = detector.detect(df)

Use LOF with custom contamination:

detector = MLAnomalyDetector(
    algorithm="lof",
    contamination=0.05
)
findings = detector.detect(df)
Source code in lavendertown/detectors/ml_anomaly.py
class MLAnomalyDetector(GhostDetector):
    """Detects anomalies using machine learning algorithms.

    This detector uses unsupervised machine learning algorithms to identify
    anomalous patterns in numeric data. It supports multiple algorithms:

    scikit-learn algorithms:
    - Isolation Forest: Good for general anomaly detection
    - Local Outlier Factor (LOF): Density-based detection
    - One-Class SVM: Boundary-based detection

    PyOD algorithms (40+ additional algorithms available, requires pyod>=1.1.0):
    - ABOD: Angle-Based Outlier Detection
    - CBLOF: Cluster-Based Local Outlier Factor
    - HBOS: Histogram-based Outlier Score
    - KNN: k-Nearest Neighbors
    - MCD: Minimum Covariance Determinant
    - PCA: Principal Component Analysis
    - And many more (see PyOD documentation for full list)

    The detector automatically selects numeric columns for analysis and
    normalizes features before applying ML algorithms. For large datasets
    (more rows than max_samples), it fits on a random sample to improve
    performance.

    Attributes:
        algorithm: ML algorithm to use. Options: "isolation_forest" (default),
            "lof", "one_class_svm", or a PyOD algorithm ("abod", "cblof",
            "hbos", "knn", "mcd", "pca", "iforest", "ocsvm").
        contamination: Expected proportion of anomalies (0.0 to 0.5).
            Default is 0.1 (10%).
        random_state: Random seed for reproducibility. None for random.
        max_samples: Maximum number of samples to use for training. If dataset
            is larger, it will be sampled. Set to 10000 when None is passed.

    Example:
        Use default Isolation Forest::

            detector = MLAnomalyDetector()
            findings = detector.detect(df)

        Use LOF with custom contamination::

            detector = MLAnomalyDetector(
                algorithm="lof",
                contamination=0.05
            )
            findings = detector.detect(df)
    """

    def __init__(
        self,
        algorithm: str = "isolation_forest",
        contamination: float = 0.1,
        random_state: int | None = None,
        max_samples: int | None = None,
    ) -> None:
        """Initialize ML anomaly detector.

        Args:
            algorithm: ML algorithm to use. Valid options:
                scikit-learn algorithms:
                - "isolation_forest": Isolation Forest (default)
                - "lof": Local Outlier Factor
                - "one_class_svm": One-Class SVM
                PyOD algorithms (requires pyod>=1.1.0):
                - "abod": Angle-Based Outlier Detection
                - "cblof": Cluster-Based Local Outlier Factor
                - "hbos": Histogram-based Outlier Score
                - "knn": k-Nearest Neighbors
                - "mcd": Minimum Covariance Determinant
                - "pca": Principal Component Analysis
                - "iforest": PyOD's Isolation Forest
                - "ocsvm": PyOD's One-Class SVM
            contamination: Expected proportion of anomalies (0.0 to 0.5).
                Default is 0.1. Higher values detect more anomalies.
            random_state: Random seed for reproducibility. None for random.
            max_samples: Maximum number of samples for training. If dataset
                is larger, it will be sampled. None falls back to the
                default of 10000 for performance.

        Raises:
            ValueError: If algorithm is invalid, or contamination is not
                in [0.0, 0.5], or max_samples is not positive.
            ImportError: If scikit-learn is not installed (raised during detect).
        """
        # Valid algorithms: scikit-learn and PyOD
        sklearn_algorithms = {"isolation_forest", "lof", "one_class_svm"}
        pyod_algorithms = {
            "abod",
            "cblof",
            "hbos",
            "knn",
            "mcd",
            "pca",
            "iforest",  # PyOD's Isolation Forest (alternative to sklearn)
            "ocsvm",  # PyOD's One-Class SVM (alternative to sklearn)
            # "lof" is not listed here: it always maps to scikit-learn's
            # LocalOutlierFactor (see _create_model)
        }
        valid_algorithms = sklearn_algorithms | pyod_algorithms

        if algorithm not in valid_algorithms:
            raise ValueError(
                f"Algorithm must be one of {sorted(valid_algorithms)}, got {algorithm}"
            )

        if not 0.0 <= contamination <= 0.5:
            raise ValueError(
                f"Contamination must be between 0.0 and 0.5, got {contamination}"
            )

        if max_samples is not None and max_samples <= 0:
            raise ValueError(f"max_samples must be positive, got {max_samples}")

        self.algorithm = algorithm
        self.contamination = contamination
        self.random_state = random_state
        self.max_samples = max_samples if max_samples is not None else 10000

    def detect(self, df: object) -> list[GhostFinding]:
        """Detect anomalies using ML algorithms.

        Analyzes numeric columns in the DataFrame using machine learning
        algorithms to identify anomalous patterns. Works with both Pandas
        and Polars DataFrames.

        Args:
            df: DataFrame to analyze. Can be a pandas.DataFrame or
                polars.DataFrame. The backend is automatically detected.

        Returns:
            List of GhostFinding objects for columns containing ML-detected
            anomalies. Each finding includes:
            - ghost_type: "ml_anomaly"
            - column: Comma-separated names of up to three analyzed numeric
              columns, followed by "(+N more)" when more were analyzed
            - severity: "warning" if >5% anomalies, "info" otherwise
            - description: Human-readable description with anomaly counts
            - row_indices: List of row indices with anomalous values (Pandas only,
              None for Polars)
            - metadata: Dictionary with anomaly_count, total_count,
              anomaly_percentage, algorithm, contamination, and anomaly_scores

        Raises:
            ImportError: If scikit-learn or PyOD (depending on algorithm) is not
                installed. Install with: pip install lavendertown[ml]

        Note:
            For Polars DataFrames, row_indices will be None as Polars doesn't
            maintain index concepts. The finding will still include anomaly
            counts and statistics.
        """
        try:
            import sklearn  # noqa: F401
        except ImportError as e:
            raise ImportError(
                "scikit-learn is required for ML anomaly detection. "
                "Install with: pip install scikit-learn"
            ) from e

        backend = detect_dataframe_backend(df)

        if backend == "pandas":
            return self._detect_pandas(df)
        elif backend == "polars":
            return self._detect_polars(df)
        else:
            raise ValueError(f"Unsupported backend: {backend}")

    def _detect_pandas(self, df: object) -> list[GhostFinding]:
        """Detect anomalies using Pandas API.

        Args:
            df: pandas.DataFrame to analyze.

        Returns:
            List of GhostFinding objects for columns containing anomalies.
        """
        import numpy as np

        from sklearn.preprocessing import StandardScaler
        from sklearn.svm import OneClassSVM  # noqa: F401

        findings: list[GhostFinding] = []

        # Get numeric columns
        numeric_columns = df.select_dtypes(include=[np.number]).columns.tolist()

        if not numeric_columns:
            return findings

        # Prepare data
        data = df[numeric_columns].dropna()

        if len(data) < 3:  # Need at least 3 samples
            return findings

        # Sample if too large
        if len(data) > self.max_samples:
            data = data.sample(n=self.max_samples, random_state=self.random_state)

        # Normalize features
        scaler = StandardScaler()
        data_scaled = scaler.fit_transform(data)

        # Train ML model
        model = self._create_model()
        anomaly_labels = model.fit_predict(data_scaled)

        # Convert labels to a boolean mask. scikit-learn estimators return
        # -1 = anomaly, 1 = normal; PyOD detectors return 1 = anomaly, 0 = normal.
        is_pyod = type(model).__module__.startswith("pyod")
        anomaly_mask = anomaly_labels == (1 if is_pyod else -1)
        anomaly_count = int(anomaly_mask.sum())

        if anomaly_count > 0:
            # Get anomaly scores if available (higher = more anomalous)
            if hasattr(model, "decision_scores_"):
                # PyOD models store training scores in decision_scores_
                anomaly_scores = model.decision_scores_
            elif hasattr(model, "decision_function"):
                # scikit-learn: lower decision_function values are more
                # anomalous, so negate to match the PyOD orientation
                anomaly_scores = -model.decision_function(data_scaled)
            elif hasattr(model, "negative_outlier_factor_"):
                # LocalOutlierFactor(novelty=False) only exposes this attribute
                anomaly_scores = -model.negative_outlier_factor_
            else:
                anomaly_scores = None

            # Map anomalies back to the original DataFrame index. This holds
            # whether or not rows were dropped or sampled, because `data`
            # keeps the index labels of `df`.
            anomaly_indices = data.index[anomaly_mask].tolist()

            anomaly_percentage = anomaly_count / len(data)

            # Determine severity
            if anomaly_percentage > 0.05:  # More than 5% anomalies
                severity = "warning"
            else:
                severity = "info"

            # Create finding for multi-column analysis
            finding = GhostFinding(
                ghost_type="ml_anomaly",
                column=", ".join(numeric_columns[:3])
                + (
                    f" (+{len(numeric_columns) - 3} more)"
                    if len(numeric_columns) > 3
                    else ""
                ),
                severity=severity,
                description=(
                    f"ML analysis detected {anomaly_count:,} anomalies "
                    f"({anomaly_percentage:.1%} of {len(data):,} rows) "
                    f"across {len(numeric_columns)} numeric columns using {self.algorithm}"
                ),
                row_indices=anomaly_indices,
                metadata={
                    "anomaly_count": anomaly_count,
                    "total_count": len(data),
                    "anomaly_percentage": float(anomaly_percentage),
                    "algorithm": self.algorithm,
                    "contamination": float(self.contamination),
                    "columns_analyzed": numeric_columns,
                    "anomaly_scores": (
                        anomaly_scores[anomaly_mask].tolist()
                        if anomaly_scores is not None
                        else None
                    ),
                },
            )
            findings.append(finding)

        return findings

    def _detect_polars(self, df: object) -> list[GhostFinding]:
        """Detect anomalies using Polars API.

        Args:
            df: polars.DataFrame to analyze.

        Returns:
            List of GhostFinding objects for columns containing anomalies.
            row_indices will be None for all findings.
        """
        import polars as pl

        from sklearn.ensemble import IsolationForest  # noqa: F401
        from sklearn.neighbors import LocalOutlierFactor  # noqa: F401
        from sklearn.preprocessing import StandardScaler
        from sklearn.svm import OneClassSVM  # noqa: F401

        findings: list[GhostFinding] = []

        # Get numeric columns
        numeric_columns = [
            col
            for col, dtype in df.schema.items()
            if dtype
            in [
                pl.Int8,
                pl.Int16,
                pl.Int32,
                pl.Int64,
                pl.UInt8,
                pl.UInt16,
                pl.UInt32,
                pl.UInt64,
                pl.Float32,
                pl.Float64,
            ]
        ]

        if not numeric_columns:
            return findings

        # Convert to pandas for ML processing
        data = df.select(numeric_columns).drop_nulls().to_pandas()

        if len(data) < 3:
            return findings

        # Sample if too large
        if len(data) > self.max_samples:
            data = data.sample(n=self.max_samples, random_state=self.random_state)

        # Normalize features
        scaler = StandardScaler()
        data_scaled = scaler.fit_transform(data)

        # Train ML model
        model = self._create_model()
        anomaly_labels = model.fit_predict(data_scaled)

        # Convert labels to a boolean mask. scikit-learn estimators return
        # -1 = anomaly, 1 = normal; PyOD detectors return 1 = anomaly, 0 = normal.
        is_pyod = type(model).__module__.startswith("pyod")
        anomaly_mask = anomaly_labels == (1 if is_pyod else -1)
        anomaly_count = int(anomaly_mask.sum())

        if anomaly_count > 0:
            # Get anomaly scores if available (higher = more anomalous)
            if hasattr(model, "decision_scores_"):
                # PyOD models store training scores in decision_scores_
                anomaly_scores = model.decision_scores_
            elif hasattr(model, "decision_function"):
                # scikit-learn: lower decision_function values are more
                # anomalous, so negate to match the PyOD orientation
                anomaly_scores = -model.decision_function(data_scaled)
            elif hasattr(model, "negative_outlier_factor_"):
                # LocalOutlierFactor(novelty=False) only exposes this attribute
                anomaly_scores = -model.negative_outlier_factor_
            else:
                anomaly_scores = None

            anomaly_percentage = anomaly_count / len(data)

            # Determine severity
            if anomaly_percentage > 0.05:
                severity = "warning"
            else:
                severity = "info"

            finding = GhostFinding(
                ghost_type="ml_anomaly",
                column=", ".join(numeric_columns[:3])
                + (
                    f" (+{len(numeric_columns) - 3} more)"
                    if len(numeric_columns) > 3
                    else ""
                ),
                severity=severity,
                description=(
                    f"ML analysis detected {anomaly_count:,} anomalies "
                    f"({anomaly_percentage:.1%} of {len(data):,} rows) "
                    f"across {len(numeric_columns)} numeric columns using {self.algorithm}"
                ),
                row_indices=None,  # Polars doesn't maintain indices
                metadata={
                    "anomaly_count": anomaly_count,
                    "total_count": len(data),
                    "anomaly_percentage": float(anomaly_percentage),
                    "algorithm": self.algorithm,
                    "contamination": float(self.contamination),
                    "columns_analyzed": numeric_columns,
                    "anomaly_scores": (
                        anomaly_scores[anomaly_mask].tolist()
                        if anomaly_scores is not None
                        else None
                    ),
                },
            )
            findings.append(finding)

        return findings

    def _create_model(self) -> object:
        """Create and return ML model instance.

        Returns:
            ML model instance (scikit-learn or PyOD estimator).

        Raises:
            ImportError: If required library (scikit-learn or PyOD) is not installed.
        """
        # Check if this is a PyOD algorithm
        pyod_algorithms = {
            "abod",
            "cblof",
            "hbos",
            "knn",
            "mcd",
            "pca",
            "iforest",
            "ocsvm",
        }

        if self.algorithm in pyod_algorithms:
            # Use PyOD
            try:
                import pyod  # noqa: F401
            except ImportError as e:
                raise ImportError(
                    "PyOD is required for PyOD algorithms. "
                    "Install with: pip install lavendertown[ml]"
                ) from e

            return self._create_pyod_model()

        # Use scikit-learn
        from sklearn.ensemble import IsolationForest
        from sklearn.neighbors import LocalOutlierFactor
        from sklearn.svm import OneClassSVM

        if self.algorithm == "isolation_forest":
            return IsolationForest(
                contamination=self.contamination,
                random_state=self.random_state,
            )
        elif self.algorithm == "lof":
            return LocalOutlierFactor(
                contamination=self.contamination,
                novelty=False,  # Using for outlier detection, not novelty
            )
        elif self.algorithm == "one_class_svm":
            return OneClassSVM(
                nu=self.contamination,
                gamma="scale",
            )
        else:
            raise ValueError(f"Unknown algorithm: {self.algorithm}")

    def _create_pyod_model(self) -> object:
        """Create PyOD model instance.

        Returns:
            PyOD detector instance.
        """
        # Import PyOD models
        from pyod.models.abod import ABOD
        from pyod.models.cblof import CBLOF
        from pyod.models.hbos import HBOS
        from pyod.models.knn import KNN
        from pyod.models.mcd import MCD
        from pyod.models.pca import PCA
        from pyod.models.iforest import IForest
        from pyod.models.ocsvm import OCSVM

        # Map algorithm names to PyOD classes
        algorithm_map = {
            "abod": ABOD,
            "cblof": CBLOF,
            "hbos": HBOS,
            "knn": KNN,
            "mcd": MCD,
            "pca": PCA,
            "iforest": IForest,
            "ocsvm": OCSVM,
        }

        model_class = algorithm_map[self.algorithm]

        # All mapped PyOD detectors accept a contamination parameter
        if self.algorithm == "iforest":
            # IForest also supports random_state
            return model_class(
                contamination=self.contamination, random_state=self.random_state
            )
        return model_class(contamination=self.contamination)

Functions

__init__

__init__(
    algorithm="isolation_forest",
    contamination=0.1,
    random_state=None,
    max_samples=None,
)

Parameters:

Name Type Description Default
algorithm str

ML algorithm to use. Valid options:

scikit-learn algorithms:

- "isolation_forest": Isolation Forest (default)
- "lof": Local Outlier Factor
- "one_class_svm": One-Class SVM

PyOD algorithms (requires pyod>=1.1.0):

- "abod": Angle-Based Outlier Detection
- "cblof": Cluster-Based Local Outlier Factor
- "hbos": Histogram-based Outlier Score
- "knn": k-Nearest Neighbors
- "mcd": Minimum Covariance Determinant
- "pca": Principal Component Analysis
- "iforest": PyOD's Isolation Forest
- "ocsvm": PyOD's One-Class SVM

'isolation_forest'
contamination float

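Expected proportion of anomalies (0.0 to 0.5). Default is 0.1. Higher values flag more rows as anomalous. A value of exactly 0.0 passes the check in __init__, but the underlying scikit-learn and PyOD estimators require a value greater than 0 and raise an error at fit time.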

0.1
random_state int | None

Random seed for reproducibility. None for random.

None
max_samples int | None

Maximum number of samples for training. If the dataset is larger, it will be sampled. Passing None (the default) falls back to 10000 for performance.

None
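The interaction between max_samples and random_state can be sketched with the standard library alone. This is an illustration of the capping rule described above, not the library's code: the detector itself samples with pandas, so the specific rows chosen for a given seed will differ. The name `choose_rows` is hypothetical.

```python
import random

DEFAULT_MAX_SAMPLES = 10_000  # fallback applied when max_samples is None


def choose_rows(n_rows, max_samples=None, random_state=None):
    """Return the row positions that would be kept for training.

    Illustrative helper only; it mirrors the rule "sample when the data
    has more rows than max_samples", not pandas' exact row selection.
    """
    cap = max_samples if max_samples is not None else DEFAULT_MAX_SAMPLES
    if n_rows <= cap:
        # Small enough: keep every row
        return list(range(n_rows))
    rng = random.Random(random_state)  # a fixed seed gives a repeatable sample
    return sorted(rng.sample(range(n_rows), cap))


kept = choose_rows(50_000, random_state=42)
```

Because only the sampled rows are fitted and scored, anomalies outside the sample are not reported; raising max_samples trades speed for coverage.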

Raises:

Type Description
ValueError

If algorithm is invalid, or contamination is not in [0.0, 0.5], or max_samples is not positive.

ImportError

If scikit-learn is not installed (raised during detect).
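The three ValueError conditions can be restated as a small standalone function. This is a sketch that mirrors the checks in the source listing; `check_params` and `VALID_ALGORITHMS` are hypothetical names introduced for illustration.

```python
VALID_ALGORITHMS = {
    # scikit-learn
    "isolation_forest", "lof", "one_class_svm",
    # PyOD
    "abod", "cblof", "hbos", "knn", "mcd", "pca", "iforest", "ocsvm",
}


def check_params(algorithm="isolation_forest", contamination=0.1, max_samples=None):
    """Validate constructor arguments and return the effective max_samples.

    Illustrative helper; the real checks live in MLAnomalyDetector.__init__.
    """
    if algorithm not in VALID_ALGORITHMS:
        raise ValueError(
            f"Algorithm must be one of {sorted(VALID_ALGORITHMS)}, got {algorithm}"
        )
    if not 0.0 <= contamination <= 0.5:
        raise ValueError(
            f"Contamination must be between 0.0 and 0.5, got {contamination}"
        )
    if max_samples is not None and max_samples <= 0:
        raise ValueError(f"max_samples must be positive, got {max_samples}")
    # None is replaced by the 10000-row default
    return max_samples if max_samples is not None else 10000
```

All three checks run at construction time, so a misconfigured detector fails before any data is loaded.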

Source code in lavendertown/detectors/ml_anomaly.py
def __init__(
    self,
    algorithm: str = "isolation_forest",
    contamination: float = 0.1,
    random_state: int | None = None,
    max_samples: int | None = None,
) -> None:
    """Initialize ML anomaly detector.

    Args:
        algorithm: ML algorithm to use. Valid options:
            scikit-learn algorithms:
            - "isolation_forest": Isolation Forest (default)
            - "lof": Local Outlier Factor
            - "one_class_svm": One-Class SVM
            PyOD algorithms (requires pyod>=1.1.0):
            - "abod": Angle-Based Outlier Detection
            - "cblof": Cluster-Based Local Outlier Factor
            - "hbos": Histogram-based Outlier Score
            - "knn": k-Nearest Neighbors
            - "mcd": Minimum Covariance Determinant
            - "pca": Principal Component Analysis
            - "iforest": PyOD's Isolation Forest
            - "ocsvm": PyOD's One-Class SVM
        contamination: Expected proportion of anomalies (0.0 to 0.5).
            Default is 0.1. Higher values detect more anomalies.
        random_state: Random seed for reproducibility. None for random.
        max_samples: Maximum number of samples for training. If dataset
            is larger, it will be sampled. None falls back to the
            default of 10000 for performance.

    Raises:
        ValueError: If algorithm is invalid, or contamination is not
            in [0.0, 0.5], or max_samples is not positive.
        ImportError: If scikit-learn is not installed (raised during detect).
    """
    # Valid algorithms: scikit-learn and PyOD
    sklearn_algorithms = {"isolation_forest", "lof", "one_class_svm"}
    pyod_algorithms = {
        "abod",
        "cblof",
        "hbos",
        "knn",
        "mcd",
        "pca",
        "iforest",  # PyOD's Isolation Forest (alternative to sklearn)
        "ocsvm",  # PyOD's One-Class SVM (alternative to sklearn)
        # "lof" is not listed here: it always maps to scikit-learn's
        # LocalOutlierFactor (see _create_model)
    }
    valid_algorithms = sklearn_algorithms | pyod_algorithms

    if algorithm not in valid_algorithms:
        raise ValueError(
            f"Algorithm must be one of {sorted(valid_algorithms)}, got {algorithm}"
        )

    if not 0.0 <= contamination <= 0.5:
        raise ValueError(
            f"Contamination must be between 0.0 and 0.5, got {contamination}"
        )

    if max_samples is not None and max_samples <= 0:
        raise ValueError(f"max_samples must be positive, got {max_samples}")

    self.algorithm = algorithm
    self.contamination = contamination
    self.random_state = random_state
    self.max_samples = max_samples if max_samples is not None else 10000

detect

detect(df)

Detect anomalies using ML algorithms.

Analyzes numeric columns in the DataFrame using machine learning algorithms to identify anomalous patterns. Works with both Pandas and Polars DataFrames.

Parameters:

Name Type Description Default
df object

DataFrame to analyze. Can be a pandas.DataFrame or polars.DataFrame. The backend is automatically detected.

required

Returns:

Type Description
list[GhostFinding]

List of GhostFinding objects for columns containing ML-detected anomalies. Each finding includes:

  • ghost_type: "ml_anomaly"
  • column: Comma-separated names of up to three analyzed numeric columns, followed by "(+N more)" when more were analyzed
  • severity: "warning" if >5% anomalies, "info" otherwise
  • description: Human-readable description with anomaly counts
  • row_indices: List of row indices with anomalous values (Pandas only, None for Polars)
  • metadata: Dictionary with anomaly_count, total_count, anomaly_percentage, algorithm, contamination, columns_analyzed, and anomaly_scores
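How the column label and severity of a finding are derived can be shown in a few lines. The sketch below follows the logic in the source listing; the function names `column_label` and `severity_for` are hypothetical.

```python
def column_label(numeric_columns):
    """Join up to three column names and summarize the rest as '(+N more)'."""
    label = ", ".join(numeric_columns[:3])
    extra = len(numeric_columns) - 3
    return label + (f" (+{extra} more)" if extra > 0 else "")


def severity_for(anomaly_count, total_count):
    """Return 'warning' when strictly more than 5% of rows are anomalous."""
    return "warning" if anomaly_count / total_count > 0.05 else "info"


label = column_label(["price", "qty", "tax", "discount", "weight"])
level = severity_for(6, 100)
```

Note that with the default contamination of 0.1, roughly 10% of rows are flagged, so findings will usually carry the "warning" severity unless contamination is lowered to 0.05 or below.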

Raises:

Type Description
ImportError

If scikit-learn or PyOD (depending on algorithm) is not installed. Install with: pip install lavendertown[ml]
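To avoid hitting this ImportError at detection time, the optional dependencies can be checked up front. The following is a minimal sketch using only the standard library; `missing_ml_dependencies` is a hypothetical helper, and it assumes (as the source listing indicates) that scikit-learn is always required and PyOD only for the PyOD algorithm names.

```python
from importlib.util import find_spec

PYOD_ALGORITHMS = {"abod", "cblof", "hbos", "knn", "mcd", "pca", "iforest", "ocsvm"}


def missing_ml_dependencies(algorithm):
    """Return the importable module names that detect() would fail to find.

    Illustrative helper, not part of lavendertown.
    """
    required = ["sklearn"]  # imported by detect() for every algorithm
    if algorithm in PYOD_ALGORITHMS:
        required.append("pyod")  # imported only when a PyOD model is built
    return [name for name in required if find_spec(name) is None]


missing = missing_ml_dependencies("knn")
```

An empty list means the chosen algorithm's dependencies are importable; otherwise the listed modules can be installed with pip install lavendertown[ml].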

Note

For Polars DataFrames, row_indices will be None as Polars doesn't maintain index concepts. The finding will still include anomaly counts and statistics.

Source code in lavendertown/detectors/ml_anomaly.py
def detect(self, df: object) -> list[GhostFinding]:
    """Detect anomalies using ML algorithms.

    Analyzes numeric columns in the DataFrame using machine learning
    algorithms to identify anomalous patterns. Works with both Pandas
    and Polars DataFrames.

    Args:
        df: DataFrame to analyze. Can be a pandas.DataFrame or
            polars.DataFrame. The backend is automatically detected.

    Returns:
        List of GhostFinding objects for columns containing ML-detected
        anomalies. Each finding includes:
        - ghost_type: "ml_anomaly"
        - column: Comma-separated names of up to three analyzed numeric
          columns, followed by "(+N more)" when more were analyzed
        - severity: "warning" if >5% anomalies, "info" otherwise
        - description: Human-readable description with anomaly counts
        - row_indices: List of row indices with anomalous values (Pandas only,
          None for Polars)
        - metadata: Dictionary with anomaly_count, total_count,
          anomaly_percentage, algorithm, contamination, and anomaly_scores

    Raises:
        ImportError: If scikit-learn or PyOD (depending on algorithm) is not
            installed. Install with: pip install lavendertown[ml]

    Note:
        For Polars DataFrames, row_indices will be None as Polars doesn't
        maintain index concepts. The finding will still include anomaly
        counts and statistics.
    """
    try:
        import sklearn  # noqa: F401
    except ImportError as e:
        raise ImportError(
            "scikit-learn is required for ML anomaly detection. "
            "Install with: pip install scikit-learn"
        ) from e

    backend = detect_dataframe_backend(df)

    if backend == "pandas":
        return self._detect_pandas(df)
    elif backend == "polars":
        return self._detect_polars(df)
    else:
        raise ValueError(f"Unsupported backend: {backend}")