---
title: "Introduction to quickOutlier"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Introduction to quickOutlier}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
```

## Introduction

`quickOutlier` is a toolkit for data mining in R. It simplifies detecting, visualizing, and treating anomalies in your datasets using both statistical and machine learning approaches.

### Setup

```{r setup}
library(quickOutlier)
```

## 1. Univariate Detection (Z-Score & IQR)

The most common way to find outliers is to look at one variable at a time.

- **Z-Score:** Best for normal (Gaussian) distributions.
- **IQR:** Best for skewed data (robust against extreme values).

```{r univariate}
# Create dummy data with one obvious outlier (500)
df <- data.frame(
  id = 1:10,
  revenue = c(10, 12, 11, 10, 12, 11, 13, 10, 500, 11)
)

# Detect using Interquartile Range (IQR)
outliers <- detect_outliers(df, column = "revenue", method = "iqr")
print(outliers)
```

## 2. Visualizing Anomalies

Visual inspection is crucial. `quickOutlier` provides a `ggplot2` visualization that shows where your anomalies fall relative to the rest of the distribution.

```{r plot, fig.width=6, fig.height=4}
plot_outliers(df, column = "revenue", method = "iqr")
```

## 3. Treating Data (Winsorization)

Sometimes you don't want to delete outliers, but instead "cap" them at a reasonable limit. This is called Winsorization.

```{r treat}
# Cap the outliers based on IQR limits
df_clean <- treat_outliers(df, column = "revenue", method = "iqr")

# The value 500 has been replaced by the upper bound
print(df_clean$revenue)
```

## 4. Multivariate Detection (Mahalanobis)

Some outliers are only visible when looking at two variables together (e.g., a person who is short but weighs a lot).

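To make the idea concrete, here is a minimal base R sketch of Mahalanobis-based flagging. It is an illustration only and not necessarily how `detect_multivariate()` works internally. The object names (`pts`, `d2`, `cutoff`), the seed, and the 97.5% chi-square cutoff are assumptions chosen for this example.

```r
# Base R sketch of Mahalanobis-based flagging (not quickOutlier's internals).
set.seed(42)  # assumed seed, only so the example is reproducible

# y depends linearly on x, plus noise
x <- rnorm(50)
pts <- data.frame(x = x, y = 2 * x + rnorm(50, sd = 0.5))

# Append one point with an ordinary x but an implausible y given that x
pts <- rbind(pts, data.frame(x = 0, y = 10))

# Squared Mahalanobis distance of every row from the sample centroid,
# scaled by the sample covariance so the x-y correlation is accounted for
m <- as.matrix(pts)
d2 <- mahalanobis(m, center = colMeans(m), cov = cov(m))

# Assumed cutoff: 97.5% quantile of a chi-square with one df per variable
cutoff <- qchisq(0.975, df = ncol(m))

# Rows whose squared distance exceeds the cutoff
pts[d2 > cutoff, ]
```

Neither coordinate of the appended point is extreme on its own scale here, so a per-column check may miss it. The distance is large because the point breaks the relationship between `x` and `y`.
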
```{r multivariate}
# Generate data: y correlates with x
df_multi <- data.frame(x = rnorm(50), y = rnorm(50))
df_multi$y <- df_multi$x * 2 + rnorm(50, sd = 0.5)

# Add an anomaly: normal x, but impossible y given x
anomaly <- data.frame(x = 0, y = 10)
df_multi <- rbind(df_multi, anomaly)

# Detect using Mahalanobis Distance
detect_multivariate(df_multi, columns = c("x", "y"))
```

## 5. Density-Based Detection (LOF)

For complex clusters where statistical methods fail, we use the Local Outlier Factor (LOF). This identifies points that are isolated from their local neighbors.

```{r lof}
# Create a dense cluster and one distant point
df_density <- data.frame(
  x = c(rnorm(50), 10),
  y = c(rnorm(50), 10)
)

# Run LOF detection
detect_density(df_density, k = 5)
```

## 6. Full Dataset Scan

If you have a large dataset, you can scan all numeric columns at once to get a summary report.

```{r scan}
scan_data(df, method = "iqr")
```
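
For intuition about what a column-wise scan involves, the following base R sketch counts IQR outliers in every numeric column of a small data frame. It is a hypothetical illustration and not the implementation or output format of `scan_data()`. The helper `count_iqr_outliers()`, the `toy` data, and the conventional 1.5 multiplier are assumptions made for this example.

```r
# Base R sketch of a per-column IQR scan (not quickOutlier's internals).
toy <- data.frame(
  id = 1:10,
  revenue = c(10, 12, 11, 10, 12, 11, 13, 10, 500, 11),
  region = rep(c("north", "south"), 5)
)

# Hypothetical helper: count values outside the Tukey fences
# [Q1 - k * IQR, Q3 + k * IQR], with k = 1.5 assumed as the default
count_iqr_outliers <- function(v, k = 1.5) {
  q <- quantile(v, c(0.25, 0.75), na.rm = TRUE, names = FALSE)
  fence <- k * (q[2] - q[1])
  sum(v < q[1] - fence | v > q[2] + fence, na.rm = TRUE)
}

# Keep only numeric columns, then apply the helper to each one
numeric_cols <- names(toy)[vapply(toy, is.numeric, logical(1))]
report <- data.frame(
  column = numeric_cols,
  n_outliers = vapply(toy[numeric_cols], count_iqr_outliers, integer(1)),
  row.names = NULL
)
report
```

In this sketch the `id` column is scanned too, because it is numeric. Identifier columns carry no distributional meaning, so you may want to drop them before scanning.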