---
title: "Survival Model-Based Imputation for Laboratory Non-Detect Data"
author: "survlab package"
date: "`r Sys.Date()`"
output:
  rmarkdown::html_vignette:
    toc: true
    toc_depth: 2
    fig_width: 7
    fig_height: 5
vignette: >
  %\VignetteIndexEntry{Survival Model-Based Imputation for Laboratory Non-Detect Data}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  warning = FALSE,
  message = FALSE,
  fig.width = 7,
  fig.height = 5
)
```

## Introduction

The `survlab` package provides tools for imputing non-detect values in environmental laboratory data using survival models (including Tobit models). This is particularly useful for environmental engineers, consultants, and laboratory professionals working with analytical data where measurements fall below detection limits or limits of quantification (LOQ).

Non-detect values are common in environmental monitoring programs, contamination assessments, and regulatory compliance testing. Traditional approaches like substitution with half the detection limit can introduce bias and affect statistical analyses.

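
The effect of substitution can be seen in a small simulation. The following is a minimal sketch in base R that does not use `survlab`; the lognormal parameters, the single detection limit, and all variable names are hypothetical choices made for illustration. In this setup every non-detect collapses to one value, so the spread of the substituted data on the log scale is smaller than that of the true data.

```r
# Illustrative simulation (not part of survlab): effect of replacing
# non-detects with half the detection limit (DL/2).
set.seed(42)

n <- 5000
true_values <- rlnorm(n, meanlog = 0, sdlog = 1)  # hypothetical concentrations
detection_limit <- 1                              # hypothetical single DL

# Flag values below the DL and substitute DL/2 for them
is_nondetect <- true_values < detection_limit
substituted <- ifelse(is_nondetect, detection_limit / 2, true_values)

# Compare log-scale summaries of the true and the substituted data
summary_table <- data.frame(
  data = c("true", "DL/2 substitution"),
  mean_log = c(mean(log(true_values)), mean(log(substituted))),
  sd_log = c(sd(log(true_values)), sd(log(substituted)))
)
print(summary_table)
```

How strongly the summaries shift depends on the censoring fraction and on the shape of the underlying distribution, so the numbers from this sketch should not be generalized to real datasets.
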

The package automatically:

- Selects the best-fitting distribution from multiple options
- Generates realistic imputed values below their respective detection limits
- Ensures all imputed values are unique and properly constrained
- Provides validation and diagnostic tools specifically designed for laboratory data

## Installation

```{r eval=FALSE}
# Install from GitHub (once published)
# devtools::install_github("yourusername/survlab")

# Load the package
library(survlab)
```

```{r setup}
library(survlab)
library(data.table)
library(ggplot2)
```

## Basic Usage

### Load Example Data

The package includes a synthetic environmental laboratory dataset with non-detect values:

```{r}
# Load example data
data(multi_censored_data)

# Explore the dataset
multi_censored_data[, .(
  total_samples = .N,
  non_detects = sum(censored == 0),
  detects = sum(censored == 1),
  min_value = min(value),
  max_value = max(value)
)]
```

```{r}
# View the different detection limit levels
detection_limits <- multi_censored_data[censored == 0, unique(value)]
cat("Detection limit levels:", paste(sort(detection_limits), collapse = ", "))
```

### Perform Imputation

The main function `impute_nondetect()` automatically validates data quality, selects the best distribution, and generates imputed values:

```{r}
# Set seed for reproducibility
set.seed(123)

# Perform imputation with parameter validation
result <- impute_nondetect(
  dt = multi_censored_data,
  value_col = "value",
  cens_col = "censored",
  parameter_col = "parameter",
  unit_col = "unit"
)
```

### Validate Results

Use the validation function to check imputation quality:

```{r}
# Validate the imputation
validate_imputation(result)
```

### Examine Results

```{r}
# Look at the first 10 non-detect observations
result[censored == 0, .(
  original_detection_limit = value,
  imputed_value = round(value_imputed, 4),
  final_value = round(value_final, 4)
)][1:10]
```

## Visualization

You can create plots to visualize the imputation results:

```{r fig.width=7, fig.height=5}
# Prepare data for plotting
plot_data <- rbind(
  result[censored == 1, .(value = value, type = "Detected")],
  result[censored == 0, .(value = value_imputed, type = "Imputed")]
)

# Create histogram
ggplot(plot_data, aes(x = value, fill = type)) +
  geom_histogram(alpha = 0.7, bins = 30, position = "identity") +
  geom_vline(xintercept = attr(result, "max_detection_limit"),
             linetype = "dashed", color = "red", linewidth = 1) +
  labs(title = "Distribution Comparison: Detected vs Imputed Values",
       subtitle = paste("Red line shows maximum detection limit =",
                        round(attr(result, "max_detection_limit"), 3)),
       x = "Value", y = "Count", fill = "Type") +
  theme_minimal() +
  scale_fill_manual(values = c("Detected" = "blue", "Imputed" = "orange"))
```

```{r fig.width=7, fig.height=5}
# Q-Q plot to check distribution fit
ggplot(result[censored == 0], aes(sample = value_imputed)) +
  stat_qq() +
  stat_qq_line() +
  labs(title = "Q-Q Plot of Imputed Values",
       subtitle = paste("Expected distribution:", attr(result, "best_distribution"))) +
  theme_minimal()
```

## Advanced Usage

### Custom Distribution Selection

You can specify which distributions to test and adjust validation thresholds:

```{r}
# Test only specific distributions with custom validation
result_custom <- impute_nondetect(
  dt = multi_censored_data,
  dist = c("gaussian", "lognormal", "weibull"),
  min_observations = 50,
  max_censored_pct = 50
)
```

### Model Information

The function returns useful model information as attributes:

```{r}
# Extract model information
cat("Best distribution:", attr(result, "best_distribution"), "\n")
cat("Model AIC:", round(attr(result, "aic"), 2), "\n")
cat("Parameter:", attr(result, "parameter"), "\n")
cat("Unit:", attr(result, "unit"), "\n")
cat("Sample size:", attr(result, "sample_size"), "\n")
cat("Censoring percentage:", attr(result, "censored_pct"), "%\n")
cat("Detection limits found:", paste(attr(result, "detection_limits"), collapse = ", "), "\n")
cat("Maximum detection limit:", attr(result, "max_detection_limit"), "\n")

# Access the fitted model
model <- attr(result, "best_model")
summary(model)
```

## Understanding the Data Structure

The package expects laboratory data with a specific structure:

- **value_col**: Contains either detected values (for samples above detection limit) or detection limit values (for non-detect samples)
- **cens_col**: Binary indicator where 0 = non-detect (below detection limit), 1 = detected (above detection limit)

For non-detect observations, the value in `value_col` is treated as the detection limit for that specific analysis, allowing for different detection limits across samples or analytical methods.

## Tips for Real Laboratory Data

1. **Multiple Detection Limits**: The package handles data with different detection limits automatically
2. **Distribution Selection**: Let the function test multiple distributions for best fit
3. **Validation**: Always run `validate_imputation()` to check results
4. **Seed Setting**: Use `set.seed()` for reproducible results in reports
5. **Large Datasets**: The package uses `data.table` for efficient memory usage
6. **Environmental Data**: Works well with typical environmental contaminant distributions (often lognormal)

## Conclusion

The `survlab` package provides a robust solution for imputing non-detect values in environmental laboratory data using survival models. The automatic distribution selection and built-in validation ensure reliable results for environmental monitoring, contamination assessment, and regulatory compliance applications.
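
## Appendix: A Sketch of Censored-Model Imputation

The general idea of imputing non-detects from a survival model can be illustrated outside the package. The following is a minimal sketch and is not `survlab`'s internal code: it assumes a single lognormal distribution rather than selecting among several, fits a left-censored model with `survival::survreg()`, and draws each non-detect from the fitted distribution truncated at its own detection limit using inverse-CDF sampling. The simulated data and all variable names are hypothetical; the coding of `detected` follows the `cens_col` convention described in this vignette (0 = non-detect, 1 = detected).

```r
library(survival)

set.seed(1)

# Hypothetical data: lognormal concentrations with two detection limits
n <- 300
true_values <- rlnorm(n, meanlog = 0, sdlog = 1)
dl <- sample(c(0.5, 1), n, replace = TRUE)
detected <- as.integer(true_values >= dl)        # 1 = detected, 0 = non-detect
value <- ifelse(detected == 1, true_values, dl)  # non-detects store their DL

# Fit an intercept-only lognormal model that treats non-detects as
# left-censored at their detection limit
fit <- survreg(Surv(value, detected, type = "left") ~ 1, dist = "lognormal")
mu <- unname(coef(fit))  # estimated mean on the log scale
sigma <- fit$scale       # estimated standard deviation on the log scale

# Inverse-CDF sampling from the fitted distribution truncated above at each DL:
# scale a uniform draw into (0, P(X < DL)) and map it back through the quantile
# function, so every imputed value falls below its own detection limit
nd <- detected == 0
p_below <- plnorm(value[nd], meanlog = mu, sdlog = sigma)
imputed <- value
imputed[nd] <- qlnorm(runif(sum(nd)) * p_below, meanlog = mu, sdlog = sigma)

# Detected values are unchanged; imputed values lie below their DL
summary(imputed[nd] / value[nd])
```

Because each draw is continuous and bounded by the observation's own detection limit, this approach naturally accommodates multiple detection limits, which is the situation the data structure above is designed to represent.
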