Introduction to quickOutlier

Introduction

quickOutlier is an R toolkit for detecting, visualizing, and treating anomalies in your datasets, using both statistical and machine learning approaches.

Setup

library(quickOutlier)

1. Univariate Detection (Z-Score & IQR)

The most common way to find outliers is to examine one variable at a time and flag values that fall far outside the bulk of its distribution.

# Create dummy data with one obvious outlier (500)
df <- data.frame(
  id = 1:10,
  revenue = c(10, 12, 11, 10, 12, 11, 13, 10, 500, 11)
)

# Detect using Interquartile Range (IQR)
outliers <- detect_outliers(df, column = "revenue", method = "iqr")
print(outliers)
#>   id revenue  iqr_bounds
#> 9  9     500 [5 - 17.25]
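To make the two rules named in this section concrete, here is a minimal base-R sketch of the IQR fences and the z-score. It does not call quickOutlier, and the multipliers and cutoff are assumptions chosen for illustration, not the package's documented defaults.

```r
# Base-R sketch of the IQR and z-score rules (not quickOutlier's own code).
revenue <- c(10, 12, 11, 10, 12, 11, 13, 10, 500, 11)

# IQR rule: fences at Q1 - k * IQR and Q3 + k * IQR
q <- quantile(revenue, c(0.25, 0.75), names = FALSE)  # default quantile type 7
iqr_width <- q[2] - q[1]
iqr_fences <- function(k) {
  c(lower = q[1] - k * iqr_width, upper = q[2] + k * iqr_width)
}

fences_tukey <- iqr_fences(1.5)  # conventional multiplier (assumed)
fences_far   <- iqr_fences(3)    # stricter multiplier (assumed)
iqr_flag <- revenue < fences_tukey["lower"] | revenue > fences_tukey["upper"]

# Z-score rule: distance from the mean in units of standard deviation
z <- (revenue - mean(revenue)) / sd(revenue)

print(fences_tukey)
print(fences_far)
print(revenue[iqr_flag])
print(round(z, 2))
```

For this data the 1.5 multiplier gives fences of [7.625, 14.625], and a multiplier of 3 gives [5, 17.25], which is the interval printed in the output above. The z-score of 500 comes out just under 3, because a single extreme value inflates both the mean and the standard deviation. In a sample of 10, a |z| > 3 cutoff cannot flag any point, so the IQR rule is usually the safer choice for small samples.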

2. Visualizing Anomalies

Visual inspection is an important complement to numeric rules. quickOutlier provides a ggplot2 visualization that shows where the flagged values fall relative to the rest of the distribution.

plot_outliers(df, column = "revenue", method = "iqr")

3. Treating Data (Winsorization)

Sometimes you don’t want to delete outlying values, but “cap” them at a reasonable lower or upper limit instead. This is called winsorization.

# Cap the outliers based on IQR limits
df_clean <- treat_outliers(df, column = "revenue", method = "iqr")

# The value 500 has been replaced by the upper bound
print(df_clean$revenue)
#>  [1] 10.000 12.000 11.000 10.000 12.000 11.000 13.000 10.000 14.625 11.000
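The capping step itself can be sketched in base R with `pmin()` and `pmax()`. This is an illustration under the assumption of a 1.5 × IQR multiplier, not the package's implementation. With that multiplier the upper fence is 14.625, which matches the replaced value shown above.

```r
# Base-R sketch of IQR-based winsorization (not quickOutlier's own code).
revenue <- c(10, 12, 11, 10, 12, 11, 13, 10, 500, 11)

q <- quantile(revenue, c(0.25, 0.75), names = FALSE)
k <- 1.5                              # assumed multiplier
lower <- q[1] - k * (q[2] - q[1])     # lower fence: 7.625
upper <- q[2] + k * (q[2] - q[1])     # upper fence: 14.625

# Values below `lower` are raised to it, values above `upper` are lowered to it
capped <- pmin(pmax(revenue, lower), upper)
print(capped)
```

Only the ninth value changes. Every other observation already lies inside the fences, so it passes through untouched.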

4. Multivariate Detection (Mahalanobis)

Some outliers are only visible when looking at two variables together (e.g., a person who is short but weighs a lot).

# Generate data: y correlates with x
# (no seed is set, so the exact numbers below will vary between runs)
df_multi <- data.frame(x = rnorm(50))
df_multi$y <- df_multi$x * 2 + rnorm(50, sd = 0.5)

# Add an anomaly: normal x, but impossible y given x
anomaly <- data.frame(x = 0, y = 10) 
df_multi <- rbind(df_multi, anomaly)

# Detect using Mahalanobis Distance
detect_multivariate(df_multi, columns = c("x", "y"))
#>    x  y mahalanobis_dist
#> 51 0 10            43.42
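For readers who want to see the underlying computation, the sketch below uses `stats::mahalanobis()` from base R. The seed, the chi-square cutoff, and the variable names are assumptions made for illustration. How quickOutlier computes or thresholds the distance internally is not shown here.

```r
# Base-R sketch of Mahalanobis-based flagging (not quickOutlier's own code).
set.seed(42)  # arbitrary seed so the sketch is reproducible
x <- rnorm(50)
pts <- data.frame(x = x, y = 2 * x + rnorm(50, sd = 0.5))
pts <- rbind(pts, data.frame(x = 0, y = 10))  # unremarkable x, implausible y

# Squared Mahalanobis distance of each row from the sample centroid
d2 <- mahalanobis(pts, center = colMeans(pts), cov = cov(pts))

# Assumed rule: flag rows beyond the 99% chi-square quantile,
# with degrees of freedom equal to the number of columns
cutoff <- qchisq(0.99, df = ncol(pts))
flagged <- pts[d2 > cutoff, ]
print(flagged)
```

Note that `mahalanobis()` returns squared distances, which is why they are compared against a chi-square quantile. The anomaly's x and y values are each within the observed ranges of their columns once it is added. It stands out only because the pair violates the correlation between the two columns.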

5. Density-Based Detection (LOF)

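Global statistical rules can miss anomalies when the data form clusters of differing density. The Local Outlier Factor (LOF) handles this case by comparing each point's local density with that of its k nearest neighbors: a score near 1 means the point is about as dense as its neighbors, while a score well above 1 means the point is isolated.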

# Create a dense cluster and one distant point
# (no seed is set, so the exact scores below will vary between runs)
df_density <- data.frame(
  x = c(rnorm(50), 10), 
  y = c(rnorm(50), 10)
)

# Run LOF detection
detect_density(df_density, k = 5)
#>            x          y lof_score
#> 23  1.555625 -0.4752413      1.60
#> 37  2.504114  0.6185133      1.55
#> 51 10.000000 10.0000000     18.18
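The LOF score can be written out in a few lines of base R. The function below is a simplified sketch for illustration only: `lof_scores` is a hypothetical name, ties in the k-th neighbor distance are ignored, and the full distance matrix makes it suitable only for small datasets. It is not the routine quickOutlier uses.

```r
# Simplified base-R sketch of the Local Outlier Factor (hypothetical helper).
lof_scores <- function(data, k = 5) {
  d <- as.matrix(dist(data))  # pairwise Euclidean distances
  n <- nrow(d)
  diag(d) <- Inf              # a point is not its own neighbor

  # Indices of the k nearest neighbors: one column per point
  nn <- matrix(apply(d, 1, function(row) order(row)[seq_len(k)]), nrow = k)

  # k-distance: distance from each point to its k-th nearest neighbor
  k_dist <- vapply(seq_len(n), function(i) d[i, nn[k, i]], numeric(1))

  # Local reachability density: inverse of the mean reachability distance,
  # where reach-dist(p, o) = max(k-distance(o), d(p, o))
  lrd <- vapply(seq_len(n), function(i) {
    nb <- nn[, i]
    1 / mean(pmax(k_dist[nb], d[i, nb]))
  }, numeric(1))

  # LOF: mean density of the neighbors relative to the point's own density
  vapply(seq_len(n), function(i) mean(lrd[nn[, i]]) / lrd[i], numeric(1))
}

set.seed(1)  # arbitrary seed so the sketch is reproducible
cluster <- data.frame(x = c(rnorm(50), 10), y = c(rnorm(50), 10))
scores <- lof_scores(cluster, k = 5)
print(which.max(scores))
```

Points on the edge of the cluster can score somewhat above 1, as rows 23 and 37 do in the output above, so in practice the threshold on the score is a judgment call rather than a fixed constant.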

6. Full Dataset Scan

If you have a large dataset, you can scan all numeric columns at once to get a summary report.

scan_data(df, method = "iqr")
#>    Column Outlier_Count Percentage
#> 1      id             0          0
#> 2 revenue             1         10
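A report of this shape can be approximated in base R by applying an IQR rule to each numeric column. The helper name `count_iqr_outliers` and the 1.5 multiplier are assumptions for illustration. The sketch mirrors the layout of the output above rather than the package's internals.

```r
# Base-R sketch of a per-column outlier summary (not quickOutlier's own code).
df <- data.frame(
  id = 1:10,
  revenue = c(10, 12, 11, 10, 12, 11, 13, 10, 500, 11)
)

# Hypothetical helper: count values outside the k * IQR fences
count_iqr_outliers <- function(x, k = 1.5) {
  q <- quantile(x, c(0.25, 0.75), na.rm = TRUE, names = FALSE)
  width <- q[2] - q[1]
  sum(x < q[1] - k * width | x > q[2] + k * width, na.rm = TRUE)
}

# Keep only numeric columns, then count outliers in each
numeric_cols <- names(df)[vapply(df, is.numeric, logical(1))]
counts <- vapply(df[numeric_cols], count_iqr_outliers, integer(1))

report <- data.frame(
  Column = numeric_cols,
  Outlier_Count = unname(counts),
  Percentage = unname(100 * counts / nrow(df))
)
print(report)
```

Identifier columns such as `id` are numeric and therefore get scanned too. Here that is harmless because the column has no outliers, but on real data it is usually worth excluding keys before scanning.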