---
title: "Getting Started with putior"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Getting Started with putior}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  eval = FALSE
)
```

```{r setup}
library(putior)
```

## Introduction

The `putior` package helps you document and visualize workflows by extracting structured annotations from your R and Python source files. This vignette shows you how to get started with PUT annotations and workflow extraction.

The name **putior** combines **PUT** (the annotation keyword) + **I**nput + **O**utput + **R**, reflecting the package's core purpose: tracking data inputs and outputs through your analysis pipeline using special annotations.

## Why Use putior?

- **Automatic documentation**: Your workflow documentation stays in sync with your code
- **Multi-language support**: Works with R, Python, SQL, and other file types
- **Data lineage tracking**: See how data flows through your processing steps
- **Team collaboration**: Help colleagues understand complex workflows
- **Visual workflow creation**: Extract structured data ready for flowchart generation

## Quick Start

The fastest way to see putior in action is to run the built-in example:

```{r eval=FALSE}
# Run the complete example
source(system.file("examples", "reprex.R", package = "putior"))
```

This creates a sample multi-language workflow and demonstrates the workflow extraction capabilities of putior.

## Basic Workflow

### Step 1: Add PUT Annotations to Your Code

PUT annotations are special comments that describe workflow nodes. Here's how to add them to your source files:

**R script example:**

    # data_processing.R
    #put id:"load_data", label:"Load Customer Data", node_type:"input", output:"raw_data.csv"
    
    # Your actual code
    data <- read.csv("customer_data.csv")
    write.csv(data, "raw_data.csv")
    
    #put id:"clean_data", label:"Clean and Validate", node_type:"process", input:"raw_data.csv", output:"clean_data.csv"
    
    # Data cleaning code
    cleaned_data <- data %>%
      filter(!is.na(customer_id)) %>%
      mutate(purchase_date = as.Date(purchase_date))
    
    write.csv(cleaned_data, "clean_data.csv")

**Python script example:**

    # analysis.py
    #put id:"analyze_sales", label:"Sales Analysis", node_type:"process", input:"clean_data.csv", output:"sales_report.json"
    
    import pandas as pd
    import json
    
    # Load cleaned data
    data = pd.read_csv("clean_data.csv")
    
    # Perform analysis
    sales_summary = {
        "total_sales": data["amount"].sum(),
        "avg_order": data["amount"].mean(),
        "customer_count": data["customer_id"].nunique()
    }
    
    # Save results
    with open("sales_report.json", "w") as f:
        json.dump(sales_summary, f)

### Step 2: Extract the Workflow

Use the `put()` function to scan your files and extract workflow information:

```{r}
# Scan all R and Python files in a directory
workflow <- put("./src/")

# View the extracted workflow
print(workflow)
```

Expected output:
```{r echo=FALSE, eval=TRUE}
# Create example output for documentation
example_output <- data.frame(
  file_name = c("data_processing.R", "data_processing.R", "analysis.py"),
  file_type = c("r", "r", "py"),
  input = c(NA, "raw_data.csv", "clean_data.csv"),
  label = c("Load Customer Data", "Clean and Validate", "Sales Analysis"),
  id = c("load_data", "clean_data", "analyze_sales"),
  node_type = c("input", "process", "process"),
  output = c("raw_data.csv", "clean_data.csv", "sales_report.json"),
  stringsAsFactors = FALSE
)
print(example_output)
```

### Step 3: Understand the Results

The output is a data frame where each row represents a workflow node. The columns include:

- **file_name**: Which script contains this node
- **file_type**: Programming language (r, py, sql, etc.)
- **id**: Unique identifier for the node
- **label**: Human-readable description
- **node_type**: Type of operation (input, process, output)
- **input**: Files consumed by this step
- **output**: Files produced by this step
- **Custom properties**: Any additional metadata you defined
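
Because the result is an ordinary data frame, you can explore it with standard R tools. A minimal sketch, assuming the `workflow` object created above:

```{r}
# Keep only the processing steps
subset(workflow, node_type == "process")

# List every file produced somewhere in the workflow
unique(workflow$output)
```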

## PUT Annotation Syntax

### Basic Format

The general syntax for PUT annotations is:

    #put property1:"value1", property2:"value2", property3:"value3"

### Flexible Syntax Options

PUT annotations support several formats to fit different coding styles:

    #put id:"my_node", label:"My Process"           # Standard format
    # put id:"my_node", label:"My Process"          # Space after #
    #put| id:"my_node", label:"My Process"          # Pipe separator
    #put id:'my_node', label:'Single quotes'        # Single quotes
    #put id:"my_node", label:'Mixed quotes'         # Mixed quote styles

### Core Properties

While putior accepts any properties you define, these are commonly used:

| Property | Purpose | Example Values |
|----------|---------|----------------|
| `id` | Unique identifier | `"load_data"`, `"process_sales"` |
| `label` | Human description | `"Load Customer Data"` |
| `node_type` | Operation type | `"input"`, `"process"`, `"output"` |
| `input` | Input files | `"raw_data.csv"`, `"data/*.json"` |
| `output` | Output files | `"processed_data.csv"` |

### Standard Node Types

For consistency across projects, consider using these standard node types:

- **`input`**: Data collection, file loading, API calls
- **`process`**: Data transformation, analysis, computation
- **`output`**: Report generation, data export, visualization
- **`decision`**: Conditional logic, branching workflows

### Custom Properties

Add any properties you need for visualization or metadata:

    #put id:"train_model", label:"Train ML Model", node_type:"process", color:"green", group:"machine_learning", duration:"45min", priority:"high"

These custom properties can be used by visualization tools or workflow management systems.
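
Custom properties surface as extra columns in the extracted data frame (rows whose annotations don't define a property will typically show `NA` there), which makes them easy to pass on to downstream tools. A small sketch, assuming the annotation above lives somewhere under `./src/`:

```{r}
workflow <- put("./src/")

# Core columns plus any custom properties you defined (e.g., "color", "group")
names(workflow)

# Pull out the nodes belonging to one group
subset(workflow, group == "machine_learning")
```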

## Advanced Usage

### Processing Individual Files

You can process single files instead of entire directories:

```{r}
# Process a single file
workflow <- put("./scripts/analysis.R")
```

### Recursive Directory Scanning

Include subdirectories in your scan:

```{r}
# Search subdirectories recursively
workflow <- put("./project/", recursive = TRUE)
```

### Custom File Patterns

Control which files are processed:

```{r}
# Only R files
workflow <- put("./src/", pattern = "\\.R$")

# R and SQL files only
workflow <- put("./src/", pattern = "\\.(R|sql)$")

# All supported file types (default)
workflow <- put("./src/", pattern = "\\.(R|r|py|sql|sh|jl)$")
```

### Including Line Numbers

For debugging annotation issues, include line numbers:

```{r}
# Include line numbers for debugging
workflow <- put("./src/", include_line_numbers = TRUE)
```

### Validation Control

Control annotation validation:

```{r}
# Enable validation (default) - provides helpful warnings
workflow <- put("./src/", validate = TRUE)

# Disable validation warnings
workflow <- put("./src/", validate = FALSE)
```

### Automatic ID Generation

If you omit the `id` field, putior will automatically generate a unique UUID:

```{r}
# Annotations without explicit IDs get auto-generated UUIDs
#put label:"Load Data", node_type:"input", output:"data.csv"
#put label:"Process Data", node_type:"process", input:"data.csv", output:"clean.csv"

# Extract workflow - IDs will be auto-generated
workflow <- put("./")
print(workflow$id)  # Will show UUIDs like "a1b2c3d4-e5f6-7890-abcd-ef1234567890"
```

Note: If you provide an empty `id` (e.g., `id:""`), you'll get a validation warning.

### Automatic Output Defaulting

If you omit the `output` field, putior automatically uses the file name as the output:

```{r}
# In process_data.R:
#put label:"Process Step", node_type:"process", input:"raw.csv"
# No output specified - will default to "process_data.R"

# In analyze_data.R:
#put label:"Analyze", node_type:"process", input:"process_data.R", output:"results.csv"
# This creates a connection from process_data.R to analyze_data.R
```

This feature ensures that scripts can be connected in workflows even when explicit output files aren't specified.

### Tracking Source Relationships

When you have scripts that source other scripts, use this annotation pattern:

```{r}
# In main.R (sources other scripts):
#put label:"Main Analysis", input:"load_data.R,process_data.R", output:"report.pdf"
source("load_data.R")    # Reading load_data.R into main.R
source("process_data.R") # Reading process_data.R into main.R

# In load_data.R (sourced by main.R):
#put label:"Data Loader", node_type:"input"
# output defaults to "load_data.R"

# In process_data.R (sourced by main.R, depends on load_data.R):
#put label:"Data Processor", input:"load_data.R"
# output defaults to "process_data.R"
```

This correctly shows the flow: sourced scripts are **inputs** to the main script.
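
Because every node records its `input` and `output`, you can derive the edges of the workflow graph directly from the table. A rough sketch, assuming `workflow` holds the rows extracted from the three scripts above:

```{r}
# Build an edge list: one row per input file feeding a node's output
edges <- do.call(rbind, lapply(seq_len(nrow(workflow)), function(i) {
  if (is.na(workflow$input[i])) return(NULL)
  inputs <- trimws(strsplit(workflow$input[i], ",")[[1]])
  data.frame(from = inputs, to = workflow$output[i])
}))
edges
```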

## Real-World Example

Let's walk through a complete data science workflow:

### 1. Data Collection (Python)

    # 01_collect_data.py
    #put id:"fetch_api_data", label:"Fetch Data from API", node_type:"input", output:"raw_api_data.json"
    
    import requests
    import json
    
    response = requests.get("https://api.example.com/sales")
    data = response.json()
    
    with open("raw_api_data.json", "w") as f:
        json.dump(data, f)

### 2. Data Processing (R)

    # 02_process_data.R  
    #put id:"clean_api_data", label:"Clean and Structure Data", node_type:"process", input:"raw_api_data.json", output:"processed_sales.csv"
    
    library(jsonlite)
    library(dplyr)
    
    # Load raw data
    raw_data <- fromJSON("raw_api_data.json")
    
    # Process and clean
    processed <- raw_data %>%
      filter(!is.na(sale_amount)) %>%
      mutate(
        sale_date = as.Date(sale_date),
        sale_amount = as.numeric(sale_amount)
      ) %>%
      arrange(sale_date)
    
    # Save processed data
    write.csv(processed, "processed_sales.csv", row.names = FALSE)

### 3. Analysis and Reporting (R)

    # 03_analyze_report.R
    #put id:"sales_analysis", label:"Perform Sales Analysis", node_type:"process", input:"processed_sales.csv", output:"analysis_results.rds"
    #put id:"generate_report", label:"Generate HTML Report", node_type:"output", input:"analysis_results.rds", output:"sales_report.html"
    
    library(dplyr)
    
    # Load processed data (read.csv returns dates as character, so re-parse them)
    sales_data <- read.csv("processed_sales.csv") %>%
      mutate(sale_date = as.Date(sale_date))
    
    # Perform analysis
    analysis_results <- list(
      total_sales = sum(sales_data$sale_amount),
      monthly_trends = sales_data %>% 
        group_by(month = format(sale_date, "%Y-%m")) %>%
        summarise(monthly_total = sum(sale_amount)),
      top_products = sales_data %>%
        group_by(product) %>%
        summarise(product_sales = sum(sale_amount)) %>%
        arrange(desc(product_sales)) %>%
        head(10)
    )
    
    # Save analysis
    saveRDS(analysis_results, "analysis_results.rds")
    
    # Generate report
    rmarkdown::render("report_template.Rmd", 
                      output_file = "sales_report.html")

### 4. Extract the Complete Workflow
```{r}
# Extract workflow from all files
complete_workflow <- put("./sales_project/", recursive = TRUE)
print(complete_workflow)
```

This would show the complete data flow: API → JSON → CSV → Analysis → Report
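
Based on the annotations above, the extracted table would look roughly like this (constructed here by hand for illustration; the exact column order may differ):

```{r echo=FALSE, eval=TRUE}
example_sales_workflow <- data.frame(
  file_name = c("01_collect_data.py", "02_process_data.R",
                "03_analyze_report.R", "03_analyze_report.R"),
  file_type = c("py", "r", "r", "r"),
  input = c(NA, "raw_api_data.json", "processed_sales.csv", "analysis_results.rds"),
  label = c("Fetch Data from API", "Clean and Structure Data",
            "Perform Sales Analysis", "Generate HTML Report"),
  id = c("fetch_api_data", "clean_api_data", "sales_analysis", "generate_report"),
  node_type = c("input", "process", "process", "output"),
  output = c("raw_api_data.json", "processed_sales.csv",
             "analysis_results.rds", "sales_report.html"),
  stringsAsFactors = FALSE
)
print(example_sales_workflow)
```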

## Best Practices

### 1. Use Descriptive Names

Choose clear, descriptive names that explain what each step does:

    # Good
    #put name:"load_customer_transactions", label:"Load Customer Transaction Data"
    #put name:"calculate_monthly_revenue", label:"Calculate Monthly Revenue Totals"
    
    # Less descriptive
    #put name:"step1", label:"Load data"
    #put name:"process", label:"Do calculations"

### 2. Document Data Dependencies

Always specify inputs and outputs for data processing steps:

    #put name:"merge_datasets", label:"Merge Customer and Transaction Data", input:"customers.csv,transactions.csv", output:"merged_data.csv"

### 3. Use Consistent Node Types

Stick to a standard set of node types across your team:

    #put name:"load_raw_data", label:"Load Raw Sales Data", node_type:"input"
    #put name:"clean_data", label:"Clean and Validate", node_type:"process"  
    #put name:"export_results", label:"Export Final Results", node_type:"output"

### 4. Add Helpful Metadata

Include metadata that helps with workflow understanding:

    #put name:"train_model", label:"Train Random Forest Model", node_type:"process", estimated_time:"30min", requires:"tidymodels", memory_intensive:"true"

### 5. Group Related Operations

Use grouping properties to organize complex workflows:

    #put name:"feature_engineering", label:"Engineer Features", group:"preprocessing", stage:"1"
    #put name:"model_training", label:"Train Model", group:"modeling", stage:"2"
    #put name:"model_evaluation", label:"Evaluate Model", group:"modeling", stage:"3"

## Troubleshooting

### No Annotations Found

If `put()` returns an empty data frame:

1. **Check file patterns**: Ensure your files match the pattern (default: R, Python, SQL, shell, Julia)
2. **Verify annotation syntax**: Use `is_valid_put_annotation()` to test individual annotations
3. **Check file paths**: Ensure the directory exists and contains the expected files

```{r}
# Test annotation syntax
is_valid_put_annotation('#put id:"test", label:"Test Node"') # Should return TRUE
is_valid_put_annotation("#put invalid syntax") # Should return FALSE

# Check what files are found
list.files("./src/", pattern = "\\.(R|py)$")
```

### Validation Warnings

If you see validation warnings:

1. **Empty `id`**: A missing `id` is auto-filled with a UUID, but an empty one (`id:""`) triggers a warning
2. **Invalid node_type**: Use standard types (`input`, `process`, `output`, `decision`)
3. **File extensions**: Ensure file references include extensions

```{r}
# Enable detailed validation output
workflow <- put("./src/", validate = TRUE, include_line_numbers = TRUE)
```

### Parsing Issues

If annotations aren't parsed correctly:

1. **Check quotes**: Ensure all values are properly quoted
2. **Escape commas**: Values with commas should be in quotes
3. **Avoid nested quotes**: Use consistent quote styles

Good example:

    #put name:"step1", description:"Process data, clean outliers", type:"process"

Problematic example:

    #put name:"step1", description:Process data, clean outliers, type:process

## Next Steps

Now that you understand the basics of putior:

1. **Try the complete example**: `source(system.file("examples", "reprex.R", package = "putior"))`
2. **Add annotations to your existing projects**: Start with key data processing scripts
3. **Build visualization tools**: Use the extracted workflow data to create flowcharts (see the sketch after this list)
4. **Integrate into CI/CD**: Automatically update workflow documentation
5. **Explore advanced features**: Check out the advanced usage vignette
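
As one illustration of point 3, here is a hand-rolled sketch (not a putior function) that turns the extracted data frame into Mermaid flowchart text, assuming `input` values are comma-separated file names:

```{r}
# Emit Mermaid flowchart syntax from the extracted workflow table
to_mermaid <- function(workflow) {
  lines <- "flowchart TD"
  # One box per node, labelled with the human-readable label
  lines <- c(lines, sprintf('  %s["%s"]', workflow$id, workflow$label))
  # One arrow per (upstream output -> downstream input) match
  for (i in seq_len(nrow(workflow))) {
    if (is.na(workflow$input[i])) next
    inputs <- trimws(strsplit(workflow$input[i], ",")[[1]])
    upstream <- workflow$id[workflow$output %in% inputs]
    lines <- c(lines, sprintf("  %s --> %s", upstream, workflow$id[i]))
  }
  paste(lines, collapse = "\n")
}

workflow <- put("./src/")
cat(to_mermaid(workflow))
```

The resulting text can be pasted into any Mermaid renderer (for example, a Markdown file on GitHub) to display the flowchart.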

For more detailed information, see:
- `?put` - Complete function documentation
- Advanced usage vignette - Complex workflows and integration
- Best practices vignette - Team collaboration and style guides