In general, dann will struggle when unrelated variables are intermingled with informative ones. To deal with this, sub_dann projects the data onto a lower-dimensional subspace and then calls dann, which mitigates the influence of noise variables. See section 3 of Discriminant Adaptive Nearest Neighbor Classification for details. Section 4 of that paper compares dann and sub_dann to a number of other approaches.
In the example below there are 2 related variables and 5 that are unrelated. Let's see how dann, sub_dann, and dann with only the related variables perform. First, let's make a data set to work with.
library(dann)
library(mlbench)
library(magrittr)
library(dplyr, warn.conflicts = FALSE)
library(ggplot2)
######################
# Circle data with unrelated variables
######################
set.seed(1)
train <- mlbench.circle(500, 2) %>%
  tibble::as_tibble()
colnames(train)[1:3] <- c("X1", "X2", "Y")
train <- train %>%
  mutate(Y = as.numeric(Y))

# Add 5 unrelated variables
train <- train %>%
  mutate(
    U1 = runif(500, -1, 1),
    U2 = runif(500, -1, 1),
    U3 = runif(500, -1, 1),
    U4 = runif(500, -1, 1),
    U5 = runif(500, -1, 1)
  )
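Since ggplot2 is loaded above, a quick plot of the two related variables shows the structure dann has to learn: one class forms a circle inside the other.

# Plot the informative variables, colored by class.
ggplot(train, aes(x = X1, y = X2, colour = factor(Y))) +
  geom_point() +
  labs(title = "Training data", colour = "Y")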
test <- mlbench.circle(500, 2) %>%
  tibble::as_tibble()
colnames(test)[1:3] <- c("X1", "X2", "Y")
test <- test %>%
  mutate(Y = as.numeric(Y))

# Add 5 unrelated variables
test <- test %>%
  mutate(
    U1 = runif(500, -1, 1),
    U2 = runif(500, -1, 1),
    U3 = runif(500, -1, 1),
    U4 = runif(500, -1, 1),
    U5 = runif(500, -1, 1)
  )

As expected, dann is not performant.
dannPreds <- dann_df(
  formula = Y ~ X1 + X2 + U1 + U2 + U3 + U4 + U5,
  train = train, test = test,
  k = 3, neighborhood_size = 50, epsilon = 1, probability = FALSE
)
mean(dannPreds == test$Y)
## [1] 0.668
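Accuracy alone hides where the errors fall. A confusion table (base R's table, added here for illustration) breaks the predictions down by true class.

# Rows are predicted classes, columns are true classes.
table(predicted = dannPreds, actual = test$Y)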
Moving on to sub_dann, the dimension of the subspace should be chosen based on the number of large eigenvalues. The graph below suggests 2 (the correct answer).
graph_eigenvalues_df(
  formula = Y ~ X1 + X2 + U1 + U2 + U3 + U4 + U5, train = train,
  neighborhood_size = 50, weighted = FALSE, sphere = "mcd"
)

While continuing to use the unrelated variables, sub_dann did much better than dann.
subDannPreds <- sub_dann_df(
  formula = Y ~ X1 + X2 + U1 + U2 + U3 + U4 + U5,
  train = train, test = test,
  k = 3, neighborhood_size = 50, epsilon = 1,
  probability = FALSE,
  weighted = FALSE, sphere = "mcd", numDim = 2
)
mean(subDannPreds == test$Y)
## [1] 0.882
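Both dann_df and sub_dann_df also take a probability argument. Setting it to TRUE should return class probabilities rather than hard labels (a hedged sketch; the call below assumes the same signature as above).

# Assumption: probability = TRUE returns per-class probabilities
# instead of predicted labels, as the argument's name suggests.
subDannProbs <- sub_dann_df(
  formula = Y ~ X1 + X2 + U1 + U2 + U3 + U4 + U5,
  train = train, test = test,
  k = 3, neighborhood_size = 50, epsilon = 1,
  probability = TRUE,
  weighted = FALSE, sphere = "mcd", numDim = 2
)
head(subDannProbs)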
As an upper bound on performance, let's try dann using only the related variables. Is there much of a difference?
variableSelectionDann <- dann_df(
  formula = Y ~ X1 + X2,
  train = train, test = test,
  k = 3, neighborhood_size = 50, epsilon = 1, probability = FALSE
)
mean(variableSelectionDann == test$Y)
## [1] 0.944
Using only the related variables produced the best model. In practice, however, which variables are related is often unknown. Without that knowledge, sub_dann was able to produce a model that is nearly as performant.
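Collecting the three accuracies computed above into one tibble makes the comparison explicit.

# Side-by-side accuracy of the three models fit above.
tibble::tibble(
  model = c("dann, all variables",
            "sub_dann, all variables",
            "dann, related variables only"),
  accuracy = c(
    mean(dannPreds == test$Y),
    mean(subDannPreds == test$Y),
    mean(variableSelectionDann == test$Y)
  )
)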