---
title: "Survival Model-Based Imputation for Laboratory Non-Detect Data"
author: "survlab package"
date: "`r Sys.Date()`"
output: 
  rmarkdown::html_vignette:
    toc: true
    toc_depth: 2
    fig_width: 7
    fig_height: 5
vignette: >
  %\VignetteIndexEntry{Survival Model-Based Imputation for Laboratory Non-Detect Data}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  warning = FALSE,
  message = FALSE,
  fig.width = 7,
  fig.height = 5
)
```

## Introduction

The `survlab` package provides tools for imputing non-detect values in environmental laboratory data using survival models (including Tobit models). This is particularly useful for environmental engineers, consultants, and laboratory professionals working with analytical data where measurements fall below detection limits or limits of quantification (LOQ).

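Non-detect values are common in environmental monitoring programs, contamination assessments, and regulatory compliance testing. Traditional approaches such as substituting half the detection limit for every non-detect collapse all censored values at a given limit onto a single number, which can bias estimates of the mean and variance and distort downstream statistical analyses.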
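
To make the substitution problem concrete, the following sketch simulates concentrations, censors them at a single detection limit, and compares summary statistics before and after half-limit substitution. The distribution, its parameters, and the limit are arbitrary assumptions chosen for illustration. The size and direction of the distortion depend on the underlying distribution and the censoring fraction.

```{r}
# Illustrative simulation only; not part of survlab
set.seed(42)
true_conc <- rlnorm(5000, meanlog = 0, sdlog = 1)  # hypothetical true concentrations
dl <- 3                                            # hypothetical detection limit

# Replace every value below the detection limit with half the limit
substituted <- ifelse(true_conc < dl, dl / 2, true_conc)

# Compare summary statistics of the true and substituted data
round(c(
  true_mean        = mean(true_conc),
  substituted_mean = mean(substituted),
  true_sd          = sd(true_conc),
  substituted_sd   = sd(substituted)
), 3)
```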

The package automatically:

- Selects the best-fitting distribution from multiple options
- Generates realistic imputed values below their respective detection limits
- Ensures all imputed values are unique and properly constrained
- Provides validation and diagnostic tools specifically designed for laboratory data
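
This vignette does not spell out how the best-fitting distribution is chosen beyond reporting an AIC. One common approach, sketched here as an assumption rather than as survlab's actual implementation, is to fit one left-censored parametric model per candidate distribution with `survival::survreg()` and keep the fit with the lowest AIC. The simulated data and all `sim_*` names are hypothetical.

```{r}
# Sketch only: survlab's internals may differ
library(survival)

set.seed(1)
sim_value    <- rlnorm(200, meanlog = 1, sdlog = 0.8)  # hypothetical concentrations
sim_dl       <- 1.5                                    # hypothetical detection limit
sim_detected <- as.integer(sim_value >= sim_dl)        # 1 = detected, 0 = non-detect
sim_reported <- pmax(sim_value, sim_dl)                # non-detects report the limit

# Fit one left-censored, intercept-only model per candidate distribution
candidates <- c("gaussian", "lognormal", "weibull")
fits <- lapply(candidates, function(d) {
  survreg(Surv(sim_reported, sim_detected, type = "left") ~ 1, dist = d)
})
names(fits) <- candidates

# Lower AIC indicates a better trade-off between fit and complexity
aic_values <- sapply(fits, AIC)
round(aic_values, 1)
names(which.min(aic_values))
```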

## Installation

```{r eval=FALSE}
# Install from GitHub (once published)
# devtools::install_github("yourusername/survlab")

# Load the package
library(survlab)
```

```{r setup}
library(survlab)
library(data.table)
library(ggplot2)
```

## Basic Usage

### Load Example Data

The package includes a synthetic environmental laboratory dataset with non-detect values:

```{r}
# Load example data
data(multi_censored_data)

# Explore the dataset
multi_censored_data[, .(
  total_samples = .N,
  non_detects = sum(censored == 0),
  detects = sum(censored == 1),
  min_value = min(value),
  max_value = max(value)
)]
```

```{r}
# View the different detection limit levels
detection_limits <- multi_censored_data[censored == 0, unique(value)]
cat("Detection limit levels:", paste(sort(detection_limits), collapse = ", "))
```

### Perform Imputation

The main function `impute_nondetect()` automatically validates data quality, selects the best distribution, and generates imputed values:

```{r}
# Set seed for reproducibility
set.seed(123)

# Perform imputation with parameter validation
result <- impute_nondetect(
  dt = multi_censored_data,
  value_col = "value", 
  cens_col = "censored",
  parameter_col = "parameter",
  unit_col = "unit"
)
```
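
For intuition about how values can be drawn below each sample's own detection limit, here is a minimal sketch of inverse-CDF sampling from a truncated distribution. It assumes a lognormal fit with made-up parameters, and it is not necessarily the sampling scheme `impute_nondetect()` uses. The names `fit_meanlog`, `fit_sdlog`, and `nd_limits` are hypothetical.

```{r}
# Sketch only: draws below-limit values from an assumed lognormal fit
set.seed(2)
fit_meanlog <- 1.0                         # hypothetical fitted location (log scale)
fit_sdlog   <- 0.8                         # hypothetical fitted scale (log scale)
nd_limits   <- c(0.5, 0.5, 1.0, 1.0, 2.0)  # hypothetical limits of five non-detects

# Probability mass below each sample's own detection limit
p_below <- plnorm(nd_limits, meanlog = fit_meanlog, sdlog = fit_sdlog)

# Uniform draws on (0, p_below), mapped back through the quantile function
u <- runif(length(nd_limits), min = 0, max = p_below)
nd_imputed <- qlnorm(u, meanlog = fit_meanlog, sdlog = fit_sdlog)

data.frame(detection_limit = nd_limits, imputed = round(nd_imputed, 4))
```

Because each uniform draw is capped at the probability mass below that sample's limit, every imputed value falls below its own limit, even when limits differ across samples.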

### Validate Results

Use the validation function to check imputation quality:

```{r}
# Validate the imputation
validate_imputation(result)
```
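
The same kinds of checks can be reproduced by hand when auditing results. The sketch below runs them on a small hypothetical table that borrows the column names used in this vignette. The specific checks are assumptions about what a reasonable validation covers, not a description of what `validate_imputation()` does internally.

```{r}
# Hypothetical imputation output; values are made up for illustration
toy <- data.frame(
  value         = c(0.5, 0.5, 1.0, 2.3, 4.1),   # limit (non-detects) or measurement
  censored      = c(0L, 0L, 0L, 1L, 1L),        # 0 = non-detect, 1 = detected
  value_imputed = c(0.21, 0.37, 0.64, 2.3, 4.1)
)

# Restrict the checks to non-detect rows
toy_nd <- toy[toy$censored == 0, ]

checks <- c(
  below_limit = all(toy_nd$value_imputed < toy_nd$value),  # under own limit
  positive    = all(toy_nd$value_imputed > 0),             # physically plausible
  unique      = anyDuplicated(toy_nd$value_imputed) == 0   # no tied imputations
)
checks
```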

### Examine Results

```{r}
# Look at the first 10 non-detect observations
result[censored == 0, .(
  original_detection_limit = value,
  imputed_value = round(value_imputed, 4),
  final_value = round(value_final, 4)
)][1:10]
```

## Visualization

You can create plots to visualize the imputation results:

```{r fig.width=7, fig.height=5}
# Prepare data for plotting
plot_data <- rbind(
  result[censored == 1, .(value = value, type = "Detected")],
  result[censored == 0, .(value = value_imputed, type = "Imputed")]
)

# Create histogram
ggplot(plot_data, aes(x = value, fill = type)) +
  geom_histogram(alpha = 0.7, bins = 30, position = "identity") +
  geom_vline(xintercept = attr(result, "max_detection_limit"), 
             linetype = "dashed", color = "red", linewidth = 1) +
  labs(title = "Distribution Comparison: Detected vs Imputed Values",
       subtitle = paste("Red line shows maximum detection limit =", 
                        round(attr(result, "max_detection_limit"), 3)),
       x = "Value", y = "Count", fill = "Type") +
  theme_minimal() +
  scale_fill_manual(values = c("Detected" = "blue", "Imputed" = "orange"))
```

```{r fig.width=7, fig.height=5}
# Q-Q plot of imputed values; stat_qq() compares against a normal
# distribution by default, so treat this as a rough shape check only
ggplot(result[censored == 0], aes(sample = value_imputed)) +
  stat_qq() + 
  stat_qq_line() +
  labs(title = "Q-Q Plot of Imputed Values",
       subtitle = paste("Expected distribution:", 
                        attr(result, "best_distribution"))) +
  theme_minimal()
```

## Advanced Usage

### Custom Distribution Selection

You can specify which distributions to test and adjust validation thresholds:

```{r}
# Test only specific distributions with custom validation
result_custom <- impute_nondetect(
  dt = multi_censored_data,
  dist = c("gaussian", "lognormal", "weibull"),
  min_observations = 50,
  max_censored_pct = 50
)
```

### Model Information

The function returns useful model information as attributes:

```{r}
# Extract model information
cat("Best distribution:", attr(result, "best_distribution"), "\n")
cat("Model AIC:", round(attr(result, "aic"), 2), "\n")
cat("Parameter:", attr(result, "parameter"), "\n")
cat("Unit:", attr(result, "unit"), "\n")
cat("Sample size:", attr(result, "sample_size"), "\n")
cat("Censoring percentage:", attr(result, "censored_pct"), "%\n")
cat("Detection limits found:", paste(attr(result, "detection_limits"), collapse = ", "), "\n")
cat("Maximum detection limit:", attr(result, "max_detection_limit"), "\n")

# Access the fitted model
model <- attr(result, "best_model")
summary(model)
```

## Understanding the Data Structure

The package expects laboratory data with a specific structure:

- **value_col**: Contains either detected values (for samples above detection limit) or detection limit values (for non-detect samples)
- **cens_col**: Binary indicator where 0 = non-detect (below detection limit) and 1 = detected (above detection limit). Note that 1 marks a detected value, even though the example column is named `censored`.

For non-detect observations, the value in `value_col` is treated as the detection limit for that specific analysis, allowing for different detection limits across samples or analytical methods.
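
Laboratory exports often report non-detects as text such as `<0.5` rather than in this two-column form. The sketch below shows one way to convert such strings into the expected structure. The input format and the names `raw_result` and `lab_data` are assumptions for illustration, so adapt the pattern to your own export.

```{r}
# Hypothetical raw results as they might appear in a lab export
raw_result <- c("0.82", "<0.5", "1.7", "<1.0", "3.4", "< 0.5")

# A leading "<" marks a non-detect reported at its detection limit
raw_trimmed  <- trimws(raw_result)
is_nondetect <- startsWith(raw_trimmed, "<")

lab_data <- data.frame(
  # Strip the "<" so non-detect rows hold the detection limit itself
  value    = as.numeric(sub("^<[[:space:]]*", "", raw_trimmed)),
  # 0 = non-detect, 1 = detected
  censored = as.integer(!is_nondetect)
)
lab_data
```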

## Tips for Real Laboratory Data

1. **Multiple Detection Limits**: The package handles data with different detection limits automatically
2. **Distribution Selection**: Let the function test multiple distributions for best fit
3. **Validation**: Always run `validate_imputation()` to check results
4. **Seed Setting**: Use `set.seed()` for reproducible results in reports
5. **Large Datasets**: The package uses `data.table` for efficient memory usage
6. **Environmental Data**: Works well with typical environmental contaminant distributions (often lognormal)

## Conclusion

The `survlab` package imputes non-detect values in environmental laboratory data using survival models. Automatic distribution selection and built-in validation help produce defensible results for environmental monitoring, contamination assessment, and regulatory compliance applications.