---
title: "Introduction to quickOutlier"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Introduction to quickOutlier}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
```

## Introduction

`quickOutlier` is a toolkit for detecting, visualizing, and treating outliers in R datasets, using both statistical and machine learning approaches.

### Setup

```{r setup}
library(quickOutlier)
```

## 1. Univariate Detection (Z-Score & IQR)

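The simplest way to find outliers is to look at one variable at a time.

-   **Z-Score:** Best for approximately normal (Gaussian) data. The mean and standard deviation are themselves pulled by extreme values, so it can miss outliers in small samples.
-   **IQR:** Best for skewed data, because quartiles are robust against extreme values.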

```{r univariate}
# Create dummy data with one obvious outlier (500)
df <- data.frame(
  id = 1:10,
  revenue = c(10, 12, 11, 10, 12, 11, 13, 10, 500, 11)
)

# Detect using Interquartile Range (IQR)
outliers <- detect_outliers(df, column = "revenue", method = "iqr")
print(outliers)
```
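
To make the difference between the two rules concrete, here is a minimal base-R sketch that applies both by hand to the same `revenue` values. It assumes the conventional cutoffs (Tukey's 1.5 × IQR fences and |z| > 3). The thresholds and quantile type that `detect_outliers()` uses internally are not shown in this vignette and may differ.

```{r iqr-by-hand}
# Same values as above, kept in a separate vector so `df` is untouched
revenue_vec <- c(10, 12, 11, 10, 12, 11, 13, 10, 500, 11)

# IQR rule (assumed multiplier: 1.5)
q <- quantile(revenue_vec, c(0.25, 0.75), names = FALSE)
fences <- c(q[1] - 1.5 * diff(q), q[2] + 1.5 * diff(q))
iqr_flag <- revenue_vec < fences[1] | revenue_vec > fences[2]

# Z-score rule (assumed cutoff: 3)
z <- (revenue_vec - mean(revenue_vec)) / sd(revenue_vec)
z_flag <- abs(z) > 3

revenue_vec[iqr_flag]  # values flagged by the IQR rule
revenue_vec[z_flag]    # values flagged by the Z-score rule
```

In this sample the value 500 inflates both the mean and the standard deviation, so its Z-score stays below 3 while the IQR fences still catch it. With only 10 observations, no value can reach |z| above (n - 1) / sqrt(n), which is about 2.85.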

## 2. Visualizing Anomalies

Visual inspection is crucial. `quickOutlier` provides an instant `ggplot2` visualization to see where your anomalies fall compared to the distribution.

```{r plot, fig.width=6, fig.height=4}
plot_outliers(df, column = "revenue", method = "iqr")
```

## 3. Treating Data (Winsorization)

Sometimes you don't want to delete outlying rows, but rather "cap" extreme values at a reasonable lower or upper limit. This is called Winsorization.

```{r treat}
# Cap the outliers based on IQR limits
df_clean <- treat_outliers(df, column = "revenue", method = "iqr")

# The value 500 has been replaced by the upper bound
print(df_clean$revenue)
```
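
The capping step itself is simple to sketch in base R. The example below assumes the limits are the 1.5 × IQR fences. That is a common convention, but it is an assumption here, not a statement about how `treat_outliers()` computes its bounds.

```{r winsorize-by-hand}
revenue_vec <- c(10, 12, 11, 10, 12, 11, 13, 10, 500, 11)

# Assumed limits: Tukey fences at 1.5 * IQR beyond the quartiles
q <- quantile(revenue_vec, c(0.25, 0.75), names = FALSE)
lower <- q[1] - 1.5 * (q[2] - q[1])
upper <- q[2] + 1.5 * (q[2] - q[1])

# Clamp every value into [lower, upper]; in-range values are unchanged
revenue_capped <- pmin(pmax(revenue_vec, lower), upper)
revenue_capped
```

Because the number of rows is unchanged, the capped column can be written back into the original data frame without realigning anything.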

## 4. Multivariate Detection (Mahalanobis)

Some outliers are only visible when looking at two variables together (e.g., a person who is short but weighs a lot).

```{r multivariate}
# Fix the seed so the simulated data are reproducible
set.seed(123)

# Generate data: y correlates with x
df_multi <- data.frame(x = rnorm(50))
df_multi$y <- df_multi$x * 2 + rnorm(50, sd = 0.5)

# Add an anomaly: normal x, but implausible y given x
anomaly <- data.frame(x = 0, y = 10)
df_multi <- rbind(df_multi, anomaly)

# Detect using Mahalanobis Distance
detect_multivariate(df_multi, columns = c("x", "y"))
```
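
For readers who want to see the underlying computation, the squared Mahalanobis distance is available in base R as `stats::mahalanobis()`. The sketch below rebuilds similar data and compares each distance against a chi-squared quantile. The 0.975 quantile and the use of the plain sample mean and covariance are assumptions made for illustration. `detect_multivariate()` may use a different cutoff or estimator.

```{r mahalanobis-by-hand}
set.seed(1)
x <- rnorm(50)
# 50 correlated points plus one anomaly at (0, 10) in row 51
pts <- cbind(x = c(x, 0), y = c(2 * x + rnorm(50, sd = 0.5), 10))

# Squared distance of every row from the centre, scaled by the covariance
d2 <- mahalanobis(pts, center = colMeans(pts), cov = cov(pts))

# Assumed cutoff: 97.5% quantile of a chi-squared with one df per column
cutoff <- qchisq(0.975, df = ncol(pts))

which(d2 > cutoff)  # rows flagged as multivariate outliers
which.max(d2)       # row with the largest distance
```

Note that the anomaly is included when the mean and covariance are estimated, so it inflates the very covariance it is measured against. A single gross outlier is still detectable this way, but several of them can mask each other.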

## 5. Density-Based Detection (LOF)

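When the data form clusters of varying shape or density, global statistical methods such as the Z-score or Mahalanobis distance can miss anomalies. The Local Outlier Factor (LOF) instead flags points whose local density is much lower than that of their nearest neighbors.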

```{r lof}
# Create a dense cluster and one distant point (seeded for reproducibility)
set.seed(123)
df_density <- data.frame(
  x = c(rnorm(50), 10),
  y = c(rnorm(50), 10)
)

# Run LOF detection
detect_density(df_density, k = 5)
```
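
The idea behind LOF can be sketched in a few lines of base R. This is a simplified illustration, not the implementation behind `detect_density()`: it takes exactly `k` neighbors per point and ignores ties at the k-th distance, which a full LOF implementation would handle. It also assumes that `k` denotes the number of neighbors.

```{r lof-by-hand}
set.seed(42)
pts <- cbind(x = c(rnorm(50), 10), y = c(rnorm(50), 10))
k <- 5
n <- nrow(pts)

d <- as.matrix(dist(pts))  # pairwise Euclidean distances
diag(d) <- Inf             # a point is not its own neighbour

# Indices of the k nearest neighbours of each point (one row per point)
nn <- t(apply(d, 1, function(row) order(row)[seq_len(k)]))

# k-distance: distance from each point to its k-th nearest neighbour
k_dist <- d[cbind(seq_len(n), nn[, k])]

# Local reachability density: inverse of the mean reachability distance
lrd <- sapply(seq_len(n), function(i) {
  reach <- pmax(k_dist[nn[i, ]], d[i, nn[i, ]])
  1 / mean(reach)
})

# LOF: average neighbour density relative to the point's own density
lof <- sapply(seq_len(n), function(i) mean(lrd[nn[i, ]]) / lrd[i])

which.max(lof)  # row with the highest LOF score
```

Scores close to 1 mean a point is about as dense as its neighbors. Scores well above 1 mean it sits in a much sparser region than they do.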

## 6. Full Dataset Scan

If your dataset has many columns, you can scan every numeric column at once to get a summary report.

```{r scan}
scan_data(df, method = "iqr")
```
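
Conceptually, a scan like this applies the same rule to each numeric column in turn. The base-R sketch below counts IQR outliers per column. The helper `count_iqr_outliers()` and the `df_scan` data frame are hypothetical names introduced for illustration, and the 1.5 × IQR multiplier is an assumption. The actual report format of `scan_data()` is whatever the chunk above prints.

```{r scan-by-hand}
df_scan <- data.frame(
  id = 1:10,
  revenue = c(10, 12, 11, 10, 12, 11, 13, 10, 500, 11),
  label = letters[1:10]
)

# Hypothetical helper: number of values outside the 1.5 * IQR fences
count_iqr_outliers <- function(x) {
  q <- quantile(x, c(0.25, 0.75), names = FALSE, na.rm = TRUE)
  fence <- 1.5 * (q[2] - q[1])
  sum(x < q[1] - fence | x > q[2] + fence, na.rm = TRUE)
}

# Keep numeric columns only, then apply the helper to each one
is_num <- vapply(df_scan, is.numeric, logical(1))
counts <- vapply(df_scan[is_num], count_iqr_outliers, integer(1))
counts
```

Identifier columns such as `id` are numeric too, so a column-wise scan like this one includes them. It is often worth dropping such columns before scanning.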