---
title: "Knowledge Discovery by Accuracy Maximization"
author: "Stefano Cacciatore, Leonardo Tenori"
date: "`r Sys.Date()`"
output:
  pdf_document:
    highlight: null
    number_sections: no
    fig_caption: yes
vignette: >
  %\VignetteIndexEntry{Knowledge Discovery by Accuracy Maximization}
  %\VignetteDepends{clinical}
  %\VignetteKeywords{clinical}
  %\VignettePackage{clinical}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---


# Introduction

The `clinical` package is designed to facilitate exploratory data analysis and statistical testing on clinical datasets. This vignette presents the full usage of all core functions included in the package.

 
## 2 Installation

### 2.1 Installation via CRAN

The R package `clinical` is available from the Comprehensive R Archive Network (CRAN)^[https://cran.r-project.org/]. The simplest way to install the package is to enter the following command in your R session: `install.packages("clinical")`.


### 2.2 Compatibility issues

All versions downloadable from CRAN have been built using R version 3.2.3. The package should work without major issues on R versions > 3.5.0.

## 3 Getting Started

To load the package, enter the following instruction in your R session:

```{r setup, message = FALSE}
library(clinical)
```


If this command completes without any error messages, the package has been installed successfully and is ready for use.

The package includes both a user manual (this document) and a reference manual (help pages for each function). To view the user manual, enter `vignette("clinical")`. Help pages can be viewed using the help command `help(package="clinical")`.




# Prostate Data

The `clinical` package includes a simulated dataset representing clinical information from patients diagnosed with prostate cancer. This dataset is provided as a `data.frame` and is intended for demonstration and instructional purposes.

The dataset includes the following variables:

- **Hospital**: Factor indicating the hospital where the patient was treated.
- **Gender**: Factor indicating the patient's gender.
- **Gleason**: Ordered factor representing the Gleason score assigned to the tumor.
- **BMI**: Numeric value for the patient's Body Mass Index.
- **Age**: Numeric value for the patient's age (in years).
- **Hypertension**: Factor indicating whether the patient has hypertension.

To load the dataset:

```{r load-data, message=FALSE, warning=FALSE}
data(prostate)
head(prostate)
```


# Function: `txtsummary`

The `txtsummary()` function provides a concise textual summary of a numeric variable using either the **mean** or **median**, along with a measure of variability such as the **interquartile range (IQR)**, **95% confidence interval**, **standard deviation**, or **full range**.

This is particularly useful when preparing reports or markdown documents where inline descriptive statistics are needed in a clean format.

### Example with the `prostate` dataset

```{r}
# Summarize Age using mean and IQR
txtsummary(prostate$Age, f = "mean", digits = 2, range = "IQR")
```
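
To make explicit what such a summary contains, the following base R sketch computes a mean together with the interquartile range and pastes them into a single string. The helper `inline_summary()` and the simulated vector `age_sim` are hypothetical and serve only as an illustration; the text returned by `txtsummary()` may be formatted differently.

```{r}
# Hypothetical helper, not part of the clinical package
inline_summary <- function(v, digits = 2) {
  centre <- mean(v, na.rm = TRUE)
  q <- quantile(v, probs = c(0.25, 0.75), na.rm = TRUE)
  # Format every number with a fixed count of decimal places
  fmt <- function(z) formatC(z, format = "f", digits = digits)
  paste0(fmt(centre), " (IQR ", fmt(q[[1]]), " to ", fmt(q[[2]]), ")")
}

set.seed(42)
age_sim <- rnorm(50, mean = 65, sd = 8)  # simulated ages, for illustration only
inline_summary(age_sim)
```

A string of this kind can be dropped into running text with an inline R expression.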

# Function: `continuous.test`

## Comparing Continuous Variables Across Groups

The `continuous.test()` function allows comparison of a continuous variable across groups. It returns a formatted summary of the data (e.g., mean and standard deviation or median and IQR) for each group, along with a p-value from a statistical test. This is useful for generating clean, publication-ready result tables.

### Function Parameters

- `name`: A string indicating the name of the variable.
- `x`: A numeric vector containing the continuous data.
- `y`: A factor or character vector indicating group membership.
- `center`: The measure of central tendency, either `"mean"` or `"median"`.
- `range`: A measure of variability: `"sd"`, `"IQR"`, `"range"`, or `"95%CI"`.
- `method`: Statistical method to use: `"parametric"` or `"non-parametric"`. The function uses:
  - t-test (2 groups) or ANOVA (>2 groups) for `"parametric"`
  - Wilcoxon rank-sum test (2 groups) or Kruskal-Wallis test (>2 groups) for `"non-parametric"`

---

### Example 1: Wilcoxon Rank-Sum Test (2 Groups, Non-Parametric)

```{r}

# Non-parametric comparison using Wilcoxon test
result_wilcox <- continuous.test(
  name = "Age",
  x = prostate$Age,
  y = prostate$Hospital,
  center = "median",
  range = "IQR",
  method = "non-parametric"
)

print(result_wilcox)
```
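
The rule that maps the method and the number of groups to a test can be sketched with base R functions alone. The helper `pick_test()` and the simulated data are hypothetical; this illustrates the decision rule only and is not the package's implementation.

```{r}
# Hypothetical helper: choose a test from the method and the number of groups
pick_test <- function(values, groups,
                      method = c("non-parametric", "parametric")) {
  method <- match.arg(method)
  groups <- droplevels(factor(groups))
  two_groups <- nlevels(groups) == 2
  if (method == "parametric") {
    if (two_groups) {
      t.test(values ~ groups)$p.value
    } else {
      summary(aov(values ~ groups))[[1]][["Pr(>F)"]][1]
    }
  } else {
    if (two_groups) {
      wilcox.test(values ~ groups)$p.value
    } else {
      kruskal.test(values ~ groups)$p.value
    }
  }
}

set.seed(1)
sim_values <- c(rnorm(30, mean = 60), rnorm(30, mean = 62))  # simulated data
sim_groups <- rep(c("A", "B"), each = 30)
pick_test(sim_values, sim_groups, method = "non-parametric")
pick_test(sim_values, sim_groups, method = "parametric")
```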

# Function: `categorical.test`

## Comparing Categorical Variables Across Groups

The `categorical.test()` function compares categorical variables across groups and returns a formatted summary table with the test result. It automatically selects the appropriate statistical test depending on whether the categorical variable is **ordered** or not:

- **Unordered factor**: Uses **Fisher's exact test** (2 groups) or **Chi-squared test** (>2 groups).
- **Ordered factor**: Uses the **Jonckheere–Terpstra test** to detect monotonic trends across ordered categories.

The output is suitable for inclusion in summary tables or reports.

### Function Parameters

- `name`: A string indicating the name of the categorical variable.
- `x`: A factor or ordered factor representing the categorical data.
- `y`: A factor or character vector indicating group membership.

---

### Example 1: Unordered Categorical Variable (Fisher's Exact Test)

```{r}
# Compare Gender (unordered factor) across hospitals
categorical_test_result <- categorical.test(
  name = "Gender",
  x = prostate$Gender,
  y = prostate$Hospital
)

print(categorical_test_result)
```
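
For an unordered factor, the comparison reduces to a test on the contingency table of category by group. The base R sketch below runs Fisher's exact test and the Chi-squared test on made-up counts; the variables `sim_sex` and `sim_site` are hypothetical, and the sketch does not reproduce the formatted output of `categorical.test()`.

```{r}
# Made-up data: 80 subjects, two sites, two categories
sim_sex <- factor(rep(c("F", "M"), times = c(30, 50)))
sim_site <- factor(c(
  rep(c("Site 1", "Site 2"), times = c(18, 12)),  # the 30 "F" subjects
  rep(c("Site 1", "Site 2"), times = c(20, 30))   # the 50 "M" subjects
))

# Contingency table of category by group
sim_tab <- table(sim_sex, sim_site)
sim_tab

p_fisher <- fisher.test(sim_tab)$p.value  # exact test, suited to 2 x 2 tables
p_chisq <- chisq.test(sim_tab)$p.value    # Chi-squared test on the same table
c(fisher = p_fisher, chisq = p_chisq)
```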


### Example 2: Ordered Categorical Variable (Jonckheere–Terpstra Test)

```{r}
# Compare Gleason score (ordered factor) across hospitals
categorical_test_result <- categorical.test(
  name = "Gleason",
  x = prostate$Gleason,
  y = prostate$Hospital
)

print(categorical_test_result)
```


# Function: `correlation.test`

Computes Pearson, Spearman, or MINE correlation between two numeric vectors.

```{r}
correlation_result <- correlation.test(prostate$Age, prostate$BMI, method = "spearman", name = "Age vs BMI")
print(correlation_result)
```
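
For comparison, a rank-based correlation can also be computed with base R's `cor.test()`. The sketch below uses simulated vectors (`sim_age` and `sim_bmi` are hypothetical) and covers only the Spearman case; it does not reproduce the formatted output of `correlation.test()`.

```{r}
set.seed(3)
sim_age <- rnorm(40, mean = 65, sd = 8)            # simulated, for illustration
sim_bmi <- 20 + 0.1 * sim_age + rnorm(40, sd = 2)  # weakly related to sim_age

sim_cor <- cor.test(sim_age, sim_bmi, method = "spearman")
sim_cor$estimate  # Spearman's rho
sim_cor$p.value
```

Switching `method` to `"pearson"` in `cor.test()` gives the linear correlation coefficient instead.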

# Function: `multi_analysis`

Applies a test (continuous or correlation) across multiple features of a dataset.

```{r}
multi_cont <- multi_analysis(prostate[, c("Age", "BMI")], prostate$Hospital, FUN = "continuous.test")
print(multi_cont)
```

```{r}
multi_corr <- multi_analysis(prostate[, c("Age", "BMI")], prostate$BMI, FUN = "correlation.test")
print(multi_corr)
```
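
The idea of applying one test to every column of a data frame can be sketched in base R with `vapply()`. The data frame `sim_data` and the factor `sim_group` are simulated and hypothetical; the sketch returns only p-values, not the formatted table produced by `multi_analysis()`.

```{r}
set.seed(11)
sim_data <- data.frame(
  marker_a = rnorm(60),
  marker_b = rnorm(60, mean = 0.5)
)
sim_group <- factor(rep(c("A", "B"), each = 30))

# One Wilcoxon rank-sum test per column, collected into a named vector
p_values <- vapply(
  sim_data,
  function(col) wilcox.test(col ~ sim_group)$p.value,
  numeric(1)
)
p_values
```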


# Function: `intersect`

Finds the intersection of multiple vectors.

```{r}
v1 <- c("A", "B", "C")
v2 <- c("B", "C", "D")
v3 <- c("C", "B", "E")

intersect(v1, v2, v3)
```
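
The same result can be obtained in base R by folding the two-argument `base::intersect()` over a list of vectors. This is a sketch of the idea, not necessarily how the package implements it; the names `set_a`, `set_b`, and `set_c` are illustrative.

```{r}
set_a <- c("A", "B", "C")
set_b <- c("B", "C", "D")
set_c <- c("C", "B", "E")

# Reduce() applies intersect() pairwise from left to right
common <- Reduce(base::intersect, list(set_a, set_b, set_c))
common  # "B" "C"
```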

# Function: `frequency_matching`

Matches samples across classes (e.g., control vs case) by discretizing numeric features into bins and stratifying selection.

```{r}
hosp=prostate[,"Hospital"]
gender=prostate[,"Gender"]
GS=prostate[,"Gleason score"]
BMI=prostate[,"BMI"]
age=prostate[,"Age"]

A=categorical.test("Gender",gender,hosp)
B=categorical.test("Gleason score",GS,hosp)

C=continuous.test("BMI",BMI,hosp,digits=2)
D=continuous.test("Age",age,hosp,digits=1)

# Analysis without matching
rbind(A,B,C,D)



# The order matters: variables further right in the vector take priority
# So, Gleason score will be more important than Age
var=c("Age","BMI","Gleason score")
data.categorized=prostate[,var]

# Extract the Age vector
x <- data.categorized[["Age"]]

# Compute quantiles (0%, 25%, 50%, 75%, 100%) with NA handling
breaks <- quantile(x, probs = c(0, 0.25, 0.5, 0.75, 1), na.rm = TRUE)

# Apply the cut and update the Age column with labeled bins
data.categorized[["Age"]] <- cut(x, breaks = breaks, include.lowest = TRUE)

# Extract the BMI vector
x <- data.categorized[["BMI"]]

# Compute quantiles (0%, 25%, 50%, 75%, 100%) with NA handling
breaks <- quantile(x, probs = c(0, 0.25, 0.5, 0.75, 1), na.rm = TRUE)

# Apply the cut and update the BMI column with labeled bins
data.categorized[["BMI"]] <- cut(x, breaks = breaks, include.lowest = TRUE)

times=c(1,1)
names(times)=c("Hospital A","Hospital B")
t=frequency_matching(data.categorized,prostate[,"Hospital"],times=times)



newdata=prostate[t$selection,]

hosp.new=newdata[,"Hospital"]
gender.new=newdata[,"Gender"]
GS.new=newdata[,"Gleason score"]
BMI.new=newdata[,"BMI"]
age.new=newdata[,"Age"]

A=categorical.test("Gender",gender.new,hosp.new)
B=categorical.test("Gleason score",GS.new,hosp.new)

C=continuous.test("BMI",BMI.new,hosp.new,digits=2)
D=continuous.test("Age",age.new,hosp.new,digits=1)

# Analysis with matching
rbind(A,B,C,D)
```
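
The principle behind binning followed by stratified selection can be sketched in base R. The example below is a simplified illustration on simulated data and is not the algorithm used by `frequency_matching()`: it bins one numeric variable into quartiles and keeps, within each bin, the same number of rows from each class. All names (`toy`, `toy_breaks`, `pick`, `matched`) are hypothetical.

```{r}
set.seed(1)
# Simulated data: two classes with different age distributions
toy <- data.frame(
  class = factor(rep(c("case", "control"), times = c(40, 120))),
  age   = c(rnorm(40, mean = 68, sd = 6), rnorm(120, mean = 62, sd = 8))
)

# Quartile bins; unique() guards against tied quantiles, which cut() rejects
toy_breaks <- unique(quantile(toy$age, probs = seq(0, 1, 0.25), na.rm = TRUE))
toy$age_bin <- cut(toy$age, breaks = toy_breaks, include.lowest = TRUE)

# Within each bin, sample the same number of rows from each class
pick <- unlist(lapply(split(seq_len(nrow(toy)), toy$age_bin), function(idx) {
  by_class <- split(idx, toy$class[idx])
  n_keep <- min(lengths(by_class))
  lapply(by_class, function(i) i[sample.int(length(i), n_keep)])
}))
matched <- toy[pick, ]

# After matching, both classes have identical counts in every bin
table(matched$class, matched$age_bin)
```

Bins in which one class is absent contribute no rows, so the matched sample can be considerably smaller than the original data.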

# Conclusion

The `clinical` package provides an extensive toolkit for evaluating clinical datasets, from statistical comparisons to frequency matching and summarization. This vignette serves as a comprehensive guide for using each function effectively.