--- title: "Qploidy" date: "Last update: 2025-04-21" vignette: > %\VignetteIndexEntry{Qploidy} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} output: rmdformats::html_clean: highlight: kate --- ```{r setup, include=FALSE} knitr::opts_chunk$set(echo = TRUE) ``` This vignette describes how to use `Qploidy`, an R package designed for ploidy and aneuploidy estimation using genotyping platform data. For a detailed explanation of the `Qploidy` methodology, please refer to its publication. ### When does the `Qploidy` methodology work? The `Qploidy` approach is effective under the following conditions: - Your marker data originates from **Axiom** or **Illumina genotyping arrays**. - Your marker data is derived from **targeted sequencing platforms** (e.g., DArTag, GTseq, AgriSeq). - All DNA samples were prepared following the **same library preparation protocol**. - You know the **ploidy** of at least a subset of 60 samples *or* you know the **most common ploidy** in the dataset. - Your dataset includes **heterozygous samples**. ### When does the `Qploidy` methodology NOT work? The methodology will not be effective under the following circumstances: - Your marker data comes from **RADseq** or **GBS** (Genotyping-by-Sequencing) platforms. - You intend to **combine datasets from different sequencing batches**. - For example: If you extracted DNA and sequenced two plates (192 samples) as one batch, and later sequenced an additional three plates (288 samples) as a second batch, you would need to analyze the two batches **separately** in `Qploidy`. Combining all 480 samples into a single analysis will lead to incorrect results. - You **do not have a subset of samples with known ploidy** or **lack a predominant ploidy** in your dataset. - Your samples consist of **inbred lines** (homozygous individuals). # Installation To install the `Qploidy` package, you can use the following command: ```{r, eval=FALSE} # install.packages("devtools") devtools::install_github("cristianetaniguti/Qploidy") ``` # Input files In this vignette, we demonstrate the workflow using a simulated VCF file containing 50 samples and 500 genetic markers, allowing for a concise and computationally efficient example. We also present results derived from a real-world rose Axiom array dataset, which includes 524 garden rose cultivars and a total of 137,786 genetic markers. ## Target sequencing VCF file Let's generate our simulated VCF file. This file contains 500 markers and 60 samples, with a mix of tetraploid, diploid, and triploid samples. The tetraploid samples are used as the reference for standardization because they are the most common ploidy in the dataset. ```{r} library(Qploidy) # Simulate a VCF file with 500 markers and 60 samples simulate_vcf( seed = 1234, file_path = "vcf_example_simulated.vcf", n_tetraploid = 35, n_diploid = 5, n_triploid = 10, n_markers = 500 ) ``` ```{r} data <- qploidy_read_vcf(vcf_file = "vcf_example_simulated.vcf") head(data) ``` ## From Axiom array summary file In case your samples were genotyped using the Axiom array platform, you should have a `summary` file or `.CEL` files. If you have `.CEL` files, you can convert them into a `summary` file using the `apt-probeset-genotype` plugin available in the Analysis Power Tools (APT) platform from Thermo Fisher Scientific (Waltham, MA). A `summary` is composed by a header that can have many lines and a body that looks like this: ```{r, echo=FALSE} temp1 <- tempfile(fileext = ".txt") simulate_axiom_summary(seed = 1234, file_path = temp1) temp1 <- read.table(temp1, header = TRUE) head(temp1) ``` In `Qploidy`, use the `read_axiom` function to read the `summary` file. It skips the header and reads the body of the file. The function will return a data frame. ```{r, eval=FALSE} data <- read_axiom("AxiomGT1_summary.txt") head(data) ``` ``` MarkerName SampleName X Y R ratio 1 AX-86752740 Lemon_Fiz 977.7548 793.4656 1771.220 0.4479768 2 AX-86752740 High_Voltage 1292.2205 582.7886 1875.009 0.3108191 3 AX-86752740 Lemon_Fiz.1 919.8055 870.9619 1790.767 0.4863624 4 AX-86752740 High_Voltage.1 1368.1266 645.9481 2014.075 0.3207170 5 AX-86752740 Brite_Eyes 238.5416 1242.7593 1481.301 0.8389648 6 AX-86752740 Morden_Blush.1 302.1575 1542.4037 1844.561 0.8361900 ``` ## From Illumina array In case you have a Illumina array summary file, `Qploidy` expects the following format: ``` [Header] GSGT Version 2.0.5 Processing Date 4/17/2023 9:31 AM Content specieX.bpm Num SNPs 26251 Total SNPs 26251 Num Samples 100 Total Samples 100 [Data] SNP Name Sample ID GC Score Theta X Y X Raw Y Raw Log R Ratio mk-1 SAMP-2 0.6814 0.247 0.023 0.009 504 100 0.5211668 mk-10 SAMP-2 0.9204 0.000 0.006 0.000 346 27 -1.021583 mk-100 SAMP-2 0.1695 0.367 0.030 0.020 576 158 1.177404 mk-101 SAMP-2 0.8655 0.222 0.026 0.010 538 102 0.3459635 mk-103 SAMP-2 0.8932 0.000 0.019 0.000 465 18 0.7578704 mk-104 SAMP-2 0.5146 0.175 0.041 0.012 675 114 1.016254 mk-105 SAMP-2 0.8060 0.450 0.018 0.015 458 132 0.4305636 mk-106 SAMP-2 0.0000 0.386 0.002 0.002 315 57 NaN mk-107 SAMP-2 0.4804 0.495 0.026 0.025 534 189 1.0151 ``` Verify if you file look like this one. If confirmed, you can use the `read_illumina_array` function to read the data. If you include more than one file, the function will merge them into a single data frame. ```{r, eval=FALSE} data <- read_illumina_array( "data/illumina/Set1.txt", # This is a example code, data not available "data/illumina/Set2.txt", "data/illumina/Set3.txt" ) head(input_data) ``` ``` MarkerName SampleName X Y R ratio 1 CT1 S1-2_1 0.023 0.009 0.032 0.2812500 2 CT10 S1-2_1 0.006 0.000 0.006 0.0000000 3 CT100 S1-2_1 0.030 0.020 0.050 0.4000000 4 CT101 S1-2_1 0.026 0.010 0.036 0.2777778 5 CT103 S1-2_1 0.019 0.000 0.019 0.0000000 6 CT104 S1-2_1 0.041 0.012 0.053 0.2264151 ``` # Define Reference Samples - Choose Path (1) or (2) `Qploidy` applies a standardization method to allele intensities or read counts to enable the comparison of copy numbers across the genome of a sample. To achieve this, it requires a set of reference samples. There are two ways to select these reference samples: 1. **If you have a subset of samples with known ploidy**, `Qploidy` will use this subset as the reference. 2. **If you know the most common ploidy among your samples**, you can initially run `Qploidy` using all samples as the reference. From this first run, you can identify samples with the common ploidy, then select this subset to use as the reference in a second run of `Qploidy`. Below are the steps to proceed for each case, tagged as (1) and (2): ## (1) Using a Subset with Known Ploidy For this method, you will need to separate your subset of samples with known ploidy: In our simulated example, the sample names tell us which ploidy they are. Let's just create a vector with the tetraploid samples for us to use later: ```{r} # All samples all_samples <- unique(data$SampleName) all_samples tetraploid_samples <- all_samples[grep("Tetraploid", all_samples)] tetraploid_samples ``` ## (2) Using the Most Common Ploidy For this method, you will use the entire dataset as the reference during the first round of the `Qploidy` run: ```{r, eval=FALSE} data_reference <- data ``` # (1) and (2) Run Dosage Caller To proceed, you need to determine the dosage for all reference samples. If you haven’t done this yet, you can use any suitable dosage caller. - For **array data**, we recommend using the [`fitPoly`]( https://CRAN.R-project.org/package=fitPoly) package. - For **sequencing data**, we suggest using the [`updog`](https://github.com/dcgerard/updog) or [`polyRAD`](https://github.com/lvclark/polyRAD) packages. These packages determine dosages by analyzing the distribution of allele intensities or read counts across all samples for each marker, using an informed ploidy model. You will provide either the **most common ploidy in your population** or the **known ploidy of your selected subset** as the informed ploidy. - **Path (2):** If you provide the most common ploidy, samples with different ploidy levels may receive incorrect dosages. However, these samples can be filtered out in the next step. **Warning:** Depending on the number of samples and markers, this step may take a significant amount of time to complete. It is highly recommended to run it on a high-performance computing system where you can utilize multiple cores and avoid issues with excessive RAM usage. ### Target sequencing VCF data * If the VCF file already contains dosage/genotype information called for the subset/most common ploidy: ```{r} genos <- qploidy_read_vcf("vcf_example_simulated.vcf", geno = TRUE) dim(genos) genos.pos <- qploidy_read_vcf("vcf_example_simulated.vcf", geno.pos = TRUE) head(genos.pos) ``` * If the VCF file lacks marker names in the `ID` column of the fixed portion, the `MarkerName` column will contain `NAs`. To resolve this, you can use the following code to concatenate the chromosome and position as marker names: ```{r, eval=FALSE} genos.pos$MarkerName <- paste0(genos.pos$Chromosome, "_", genos.pos$Position) head(genos.pos) ``` #### (1) Filter the genos object to keep only the known ploidy samples subset ```{r} genos <- genos[which(genos$SampleName %in% tetraploid_samples), ] ``` * If the VCF file doesn't contain dosage/genotype information (GT format field) called for the subset/most common ploidy: ```{r} library(tidyr) # Prepare inputs for updog ## Approach (1) # tobe_genotyped <- data[which(data$SampleName %in% tetraploid_samples),] ## Approach (2) tobe_genotyped <- data ref <- pivot_wider(tobe_genotyped[, 1:3], names_from = SampleName, values_from = X) ref <- as.matrix(ref) rownames_ref <- ref[, 1] ref <- ref[, -1] ref <- apply(ref, 2, as.numeric) rownames(ref) <- rownames_ref size <- pivot_wider(tobe_genotyped[, c(1, 2, 5)], names_from = SampleName, values_from = R) size <- as.matrix(size) rownames_size <- size[, 1] size <- size[, -1] size <- apply(size, 2, as.numeric) rownames(size) <- rownames_size size[1:10, 1:10] ref[1:10, 1:10] ``` ```{r} library(updog) multidog_obj <- multidog( refmat = ref, sizemat = size, model = "norm", # It depends of your population structure ploidy = 4, # Most common ploidy in your population nc = 6 ) # Change the parameters accordingly!! genos <- data.frame( MarkerName = multidog_obj$inddf$snp, SampleName = multidog_obj$inddf$ind, geno = multidog_obj$inddf$geno, prob = multidog_obj$inddf$maxpostprob ) head(genos) ``` Notice that, if you are following approach (2) diploids and triploids samples genotypes will be wrongly called as tetraploids. But, because they are not the most common samples in the dataset, they should not interfere significantly in the dosage call of the tetraploids. You can go back to this step after a first round of ploidy evaluation and call the dosage of the tetraploids only. ### Array data Here we show an option of genotype calling for array data using fitpoly software: ```{r, eval=FALSE} library(fitPoly) saveMarkerModels( ploidy = 4, # Most common ploidy among Texas Garden Roses collection data = data_reference, # output of the previous function filePrefix = "fitpoly_output/texas_roses", # Define the path and prefix for the result files ncores = 2 ) # Change it accordingly to your system!!!! library(vroom) # Package to speed up reading files fitpoly_scores <- vroom("fitpoly_output/texas_roses_scores.dat") genos <- data.frame( MarkerName = fitpoly_scores$MarkerName, SampleName = fitpoly_scores$SampleName, geno = fitpoly_scores$maxgeno, prob = fitpoly_scores$maxP ) head(genos) ``` ``` MarkerName SampleName geno prob 1 AX-86752740 1000 Wishes 2 0.7831438 2 AX-86752740 10004-N008 2 0.9892338 3 AX-86752740 10037_N046 2 0.9844171 4 AX-86752740 10038-N001 2 0.9916813 5 AX-86752740 10043_N019 2 0.9871057 6 AX-86752740 10043_N049 2 0.9955792 ``` Differently than a VCF file, a `summary` file does not contain the genomic positions of the markers. You will need to provide these information in a separate file as the following example. ```{r, eval=FALSE} # File containing markers genomic positions genos.pos_file <- read.table("geno.pos_roses_texas.txt", header = T) head(genos.pos) ``` ``` SNP probes chr pos 1 Affx-86843634 AX-86752747 6 255852 2 Affx-86842821 AX-86752763 6 62032267 3 Affx-86839613 AX-86752769 5 9169310 4 Affx-86838724 AX-86752790 1 44259488 5 Affx-86840823 AX-86752809 7 12991931 6 Affx-86842443 AX-86752817 2 53861719 ``` ```{r, eval=FALSE} # Edit for Qploidy input format genos.pos <- data.frame( MarkerName = genos.pos_file$probes, Chromosome = genos.pos_file$chr, Position = genos.pos_file$pos ) head(genos.pos) ``` ``` MarkerName Chromosome Position 1 AX-86752747 6 255852 2 AX-86752763 6 62032267 3 AX-86752769 5 9169310 4 AX-86752790 1 44259488 5 AX-86752809 7 12991931 6 AX-86752817 2 53861719 ``` # (1) and (2) Standardization ```{r, echo=FALSE} temp_file <- tempfile(fileext = ".tsv.gz") # saving in a temporary file simu_data_standardized <- standardize( data = data, genos = genos, geno.pos = genos.pos, ploidy.standardization = 4, threshold.n.clusters = 5, n.cores = 1, out_filename = temp_file, verbose = TRUE ) simu_data_standardized ``` ```{r, eval=FALSE} simu_data_standardized <- standardize( data = data, genos = genos, geno.pos = genos.pos, ploidy.standardization = 4, threshold.n.clusters = 5, n.cores = 1, out_filename = "my_standardized_data.tsv.gz", verbose = TRUE ) simu_data_standardized ``` How it look like for our real roses dataset: ``` This is on object of class 'ploidy_standardization' -------------------------------------------------------------------- Parameters 1 standardization type: intensities 2 Ploidy: 4 3 Minimum number of heterozygous classes (clusters) present: 5 4 Maximum number of missing genotype by marker: 0.1 5 Minimum genotype probability: 0.8 -------------------------------------------------------------------- Filters 1 Number of markers at raw data: 137786 (100%) 2 Percentage of filtered genotypes by probability threshold: - (16.79 %) 3 Number of markers filtered by missing data: 5659 (4.11 %) 4 Number of markers filtered for not having the minimum number of clusters: 101853 (73.92 %) 5 Number of markers filtered for not having genomic information: 252 (0.18 %) 6 Number of markers with estimated BAF: 28100 (20.39 %) ``` See the details of the parameters with `?standardize`. The `print` function of the resulting object provides information about marker filtering during the process. One critical filtering step involves the number of dosage clusters represented for each marker. If your samples are highly inbred at certain loci, you might not have individuals representing all possible dosages for those marker locations. For example, when using tetraploids as the reference, some markers may have individuals with dosages of only 0, 1, 3, and 4 (missing dosage 2). Without representatives for the missing dosage, `Qploidy` standardization cannot be performed. - If `threshold.n.clusters` is set to `ploidy + 1`, markers with missing dosages will be discarded. - If `threshold.n.clusters` is smaller than `ploidy + 1`, the missing dosage cluster centers will be imputed for standardization purposes. ## Recover the `ploidy_standardization` Object from a Saved File To revisit this step later without re-running the upstream functions, you can load the previously generated file using the `read_qploidy_standardization` function: ```{r, echo=FALSE} simu_data_standardized <- read_qploidy_standardization(temp_file) ``` ```{r, eval=FALSE} simu_data_standardized <- read_qploidy_standardization("my_standardized_data.tsv.gz") ``` # (1) and (2) Plot results `Qploidy` provides several options for visualization of the results. Check the complete description with `?plot_qploidy_standaridization`. Bellow you can see examples: ```{r} # Select a sample to be evaluated samples <- unique(simu_data_standardized$data$SampleName) head(samples) ``` ```{r} sample <- "Tetraploid1" # Proportion of heterozygous loci, BAF (Qploidy standardized ratio), and zscore by genomic positions p <- plot_qploidy_standardization( x = simu_data_standardized, sample = sample, type = c("het", "BAF", "zscore"), dot.size = 0.05, chr = 1:2 ) p ``` See how it look like for our real roses dataset: ![](fig1.jpg) ```{r} # Heterozygous frequency, BAF (Qploidy standardized ratio), and zscore by genomic positions # centromere positions added p <- plot_qploidy_standardization( x = simu_data_standardized, sample = sample, type = c("het", "BAF", "zscore"), dot.size = 0.05, chr = 1:2, add_centromeres = TRUE, centromeres = c("chr1" = 15000000, "chr2" = 19000000) ) p ``` See how it look like for our real roses dataset: ![](fig2.jpg) ```{r} # Raw allele intensity or read count ratio and BAF (Qploidy standardized ratio) histograms # combining all markers in the sample (sample level resolution) p <- plot_qploidy_standardization( x = simu_data_standardized, sample = sample, type = c("Ratio_hist_overall", "BAF_hist_overall"), chr = 1:2, ploidy = 4, add_expected_peaks = TRUE ) p ``` Notice that, because it is a simulated data, the ratio and the BAF don't differ much. See how it look like for our real roses dataset: ![](fig3.jpg) ```{r} # BAF histograms (chromosome level resolution) and Z score p <- plot_qploidy_standardization( x = simu_data_standardized, sample = sample, type = c("BAF_hist", "zscore"), chr = 1:2, add_expected_peaks = TRUE, ploidy = c(4, 4) ) p ``` See how it look like for our real roses dataset: ![](fig4.jpg) ```{r} # BAF histograms combining all markers in the sample (chromosome-arm level resolution) and Z score p <- plot_qploidy_standardization( x = simu_data_standardized, sample = sample, type = c("BAF_hist", "zscore"), chr = 1:2, ploidy = c(4, 4, 4, 4), # Provide ploidy for each arm add_expected_peaks = TRUE, add_centromeres = TRUE, centromeres = c("chr1" = 15000000, "chr2" = 19000000) ) p ``` See how it look like for our real roses dataset: ![](fig5.jpg) It is important to note that the quality of results can vary across samples, as observed in different plots. Key aspects to assess include: - How sharp the peaks are in the histogram plots. - How well the dots cluster in the BAF vs. genomic position plots. - Whether the patterns match the decay or increase of the Z-score (Z). In our publication, we categorize sample quality into different **resolutions** based on the ability to estimate copy numbers, ranked from highest to lowest resolution: 1. **Chromosome-arm resolution**: When the copy number of all chromosome arms can be estimated. 2. **Chromosome resolution**: When at least one chromosome arm’s copy number cannot be estimated. 3. **Sample resolution**: When the copy number of at least one entire chromosome cannot be estimated. 4. **None**: When the histogram peaks, considering all markers, do not fit any of the expected ploidy levels tested. For more details and examples, please refer to the publication. # (1) and (2) Ploidy Estimation for All Samples **Warning**: This method may not be fully accurate in certain situations. While it offers a helpful initial guide, **visual inspection** of the plots described in the previous section is essential to assess the resolution and confirm the ploidy. ```{r} # If sample level resolution estimated_ploidies_sample <- area_estimate_ploidy( qploidy_standardization = simu_data_standardized, # standardization result object samples = "all", # Samples or "all" to estimate all samples level = "sample", # Resolution level ploidies = c(2, 3, 4, 5) ) # Ploidies to be tested estimated_ploidies_sample ``` See how it look like for our real roses dataset: ``` Object of class qploidy_area_ploidy_estimation 1 Number of samples: 524 2 Chromosomes: 1,3,5,2,6,7,4,0 3 Tested ploidies: 1,2,3,4,5 4 Number of euploid samples: 459 5 Number of potential aneuploid samples: 0 6 Number of highly inbred samples: 65 ``` Highly inbred samples cannot be evaluated for copy number using this method. As a result, `Qploidy` returns a missing value (`NA`) for these cases. ```{r} head(estimated_ploidies_sample$ploidy) ``` See how it look like for our real roses dataset: ``` [,1] Lemon_Fiz 4 High_Voltage 4 Lemon_Fiz.1 4 High_Voltage.1 4 Brite_Eyes 4 Morden_Blush.1 4 ``` ```{r} # If chromosome resolution estimated_ploidies_chromosome <- area_estimate_ploidy( qploidy_standardization = simu_data_standardized, samples = "all", level = "chromosome", ploidies = c(2, 3, 4, 5) ) estimated_ploidies_chromosome ``` See how it look like for our real roses dataset: ``` Object of class qploidy_area_ploidy_estimation 1 Number of samples: 524 2 Chromosomes: 0,1,2,3,4,5,6,7 3 Tested ploidies: 1,2,3,4,5 4 Number of euploid samples: 375 5 Number of potential aneuploid samples: 92 6 Number of highly inbred samples: 57 ``` ```{r} head(estimated_ploidies_chromosome$ploidy) ``` See how it look like for our real roses dataset: ``` 0 1 2 3 4 5 6 7 Lemon_Fiz 4 4 4 4 4 4 4 4 High_Voltage 4 4 4 4 4 4 4 4 Lemon_Fiz.1 4 4 4 4 4 4 4 4 High_Voltage.1 4 4 4 4 4 4 4 4 Brite_Eyes 4 4 4 4 4 4 4 4 Morden_Blush.1 4 4 4 4 4 4 4 4 ``` ```{r} # If chromosome-arm resolution estimated_ploidies_chromosome_arm <- area_estimate_ploidy( qploidy_standardization = simu_data_standardized, samples = "all", level = "chromosome-arm", ploidies = c(2, 3, 4, 5), centromeres = c("chr1" = 15000000, "chr2" = 19000000) ) estimated_ploidies_chromosome_arm ``` See how it look like for our real roses dataset: ``` Object of class qploidy_area_ploidy_estimation 1 Number of samples: 524 2 Chromosomes: 0,1.1,1.2,2.1,2.2,3.1,3.2,4.1,4.2,5.1,5.2,6.1,6.2,7.1,7.2 3 Tested ploidies: 1,2,3,4,5 4 Number of euploid samples: 304 5 Number of potential aneuploid samples: 171 6 Number of highly inbred samples: 49 ``` ```{r} head(estimated_ploidies_chromosome_arm$ploidy) ``` See how it look like for our real roses dataset: ``` 0 1.1 1.2 2.1 2.2 3.1 3.2 4.1 4.2 5.1 5.2 6.1 6.2 7.1 7.2 Lemon_Fiz 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 High_Voltage 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 Lemon_Fiz.1 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 High_Voltage.1 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 Brite_Eyes 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 Morden_Blush.1 4 4 4 4 4 NA 4 4 4 4 4 4 4 4 4 ``` Note that one of the chromosome arms (Morden_Blush.1 - 3.2) returned a value of `NA`. This occurs because the arm contains a high number of inbred loci. ```{r} estimated_ploidies_format <- merge_arms_format(estimated_ploidies_chromosome_arm) estimated_ploidies_format ``` See how it look like for our real roses dataset: ``` Object of class qploidy_area_ploidy_estimation 1 Number of samples: 524 2 Chromosomes: 0,1,2,3,4,5,6,7 3 Tested ploidies: 1,2,3,4,5 4 Number of euploid samples: 304 5 Number of potential aneuploid samples: 171 6 Number of highly inbred samples: 49 ``` ```{r} head(estimated_ploidies_format$ploidy) ``` See how it look like for our real roses dataset: ``` 0 1 2 3 4 5 6 7 Lemon_Fiz "4" "4" "4" "4" "4" "4" "4" "4" High_Voltage "4" "4" "4" "4" "4" "4" "4" "4" Lemon_Fiz.1 "4" "4" "4" "4" "4" "4" "4" "4" High_Voltage.1 "4" "4" "4" "4" "4" "4" "4" "4" Brite_Eyes "4" "4" "4" "4" "4" "4" "4" "4" Morden_Blush.1 "4" "4" "4" "NA/4" "4" "4" "4" "4" ``` ### Export ploidy result table ```{r, eval=FALSE} write.csv(estimated_ploidies_format$ploidy, file = "ploidies_before_visual_evaluation.csv") ``` ## (1) and (2) Save plots for all samples Facilitate individual visual inspection by generating figures for all samples in a loop: ```{r, eval=FALSE} dir.create("figures") for (i in seq_along(samples)) { print(paste0("Part:", i, "/", length(samples))) print(paste("Generating figures for", samples[i], "...")) # This function will generate figures for sample, chromosome and chromosome-arm resolution level evaluation all_resolutions_plots( data_standardized = simu_data_standardized, sample = samples[i], ploidy = estimated_ploidies_chromosome$ploidy[i, 1:2], chr = 1:2, centromeres = c("chr1" = 15000000, "chr2" = 19000000), file_name = paste0("figures/", samples[i]) ) print(paste("Done!")) } ``` By visualizing the plots, you can correct the results from `area_estimate_ploidy` and add resolution information for each sample. In our perfect simulated data, all samples correctly identified therefore we will just move forward with the analysis. ```{r} ploidy_corrected <- estimated_ploidies_chromosome$ploidy ``` # (2) Improve standardization selecting a known ploidy subset after first round If you used the most common ploidy as the reference instead of a subset with known ploidy, the standardization may not be as accurate. You can improve the resolution by following these steps: i. Select the highest resolution plots from the first standardization round. ii. Filter these samples from the `data` object. iii. Run standardization again using this subset as the reference. iv. Reevaluate the plots and estimate the ploidy once more. ```{r, eval=FALSE} # i) Select the highest resolution plots from the first standardization round. tetraploids <- apply(ploidy_corrected, 1, function(x) all(x == 4)) tetraploids data_reference2 <- rownames(ploidy_corrected)[which(tetraploids)] length(data_reference2) # ii) Filter these samples from the `data` object. genos_round2 <- genos[which(genos$SampleName %in% data_reference2), ] # iii) Run standardization again using this subset as the reference. simu_data_standardized_round2 <- standardize( data = data, genos = genos_round2, geno.pos = genos.pos, ploidy.standardization = 4, threshold.n.clusters = 5, n.cores = 1, out_filename = "my_round2_standardization.tsv.gz", verbose = TRUE ) simu_data_standardized_round2 ``` For our perfect simulated data, plot resolution didn't change between first and second round but see how it look like for our real roses dataset: ``` This is on object of class 'ploidy_standardization' -------------------------------------------------------------------- Parameters 1 standardization type: intensities 2 Ploidy: 4 3 Minimum number of heterozygous classes (clusters) present: 5 4 Maximum number of missing genotype by marker: 0.1 5 Minimum genotype probability: 0.8 -------------------------------------------------------------------- Filters 1 Number of markers at raw data: 137786 (100%) 2 Percentage of filtered genotypes by probability threshold: - (16.34 %) 3 Number of markers filtered by missing data: 6108 (4.43 %) 4 Number of markers filtered for not having the minimum number of clusters: 111977 (81.27 %) 5 Number of markers filtered for not having genomic information: 145 (0.11 %) 6 Number of markers with estimated BAF: 17634 (12.8 %) ``` New plots for roses real data after round 1 and 2 of standardization: ![Round 1](fig1.jpg) ![Round 2](fig7.jpg) ![Round 1](fig4.jpg) ![Round 2](fig6.jpg) # How to cite Taniguti, C.H; Lau, J.; Hochhaus, T.; Lopez Arias, D. C.; Hokanson, S.C.; Zlesak, D. C.; Byrne, D. H.; Klein, P.E. and Riera-Lizarazu, O. Exploring Chromosomal Variations in Garden Roses: Insights from High-density SNP Array Data and a New Tool, Qploidy. 2025. Submitted. # Acknowledgments This work is funded in part by the Robert E. Basye Endowment in Rose Genetics, Dept. of Horticultural Sciences, Texas A&M University, and USDA’s National Institute of Food and Agriculture (NIFA), Specialty Crop Research Initiative (SCRI) projects: ‘‘Tools for Genomics-Assisted Breeding in Polyploids: Development of a Community Resource’’ (Award No. 2020-51181-32156); and ‘‘Developing Sustainable Rose Landscapes via Rose Rosette Disease Education, Socioeconomic Assessments, and Breeding RRD-Resistant Roses with Stable Black Spot Resistance’’ (Award No. 2022-51181-38330).