The echo.find package provides a function, echo_find(), designed to find rhythms in expression data using extended harmonic oscillators. To read more about our initial work on this project and to cite us, see Circadian Rhythms in Neurospora Exhibit Biologically Relevant Driven and Damped Harmonic Oscillations by H. De los Santos et al. (2017). For users who prefer a graphical interface to coding, or who want built-in visualizations, our GitHub repository hosts a Shiny application for finding rhythms and automatically visualizing results, with features such as Venn diagrams, heat maps, gene expression plots (with or without replicates visualized), and parameter density graphs. A FAQ covering common user errors can also be found there.
In this vignette, we’ll walk through an example of how to use echo.find, and how to choose from the several different built-in methods of preprocessing.
We’ll start by loading our library, which contains the echo_find() function. It also has an example dataframe, expressions, which we’ll be using throughout this vignette. Here we’ll look at the first few rows and columns of our dataset.
library(echo.find)
head(expressions[,1:5])
##   Gene.Name      CT2.1      CT2.2      CT2.3      CT4.1
## 1  Sample 1  1.6331179  1.4976053  1.5138102  1.3095535
## 2  Sample 2 -0.6303192 -0.6027464 -0.5105009 -0.5062033
## 3  Sample 3         NA -0.7802214 -0.7767950  0.2847617
## 4  Sample 4  0.4659923  0.4940659         NA  0.1018655
## 5  Sample 5  0.7026372  0.6405812  1.0235155  1.7199453
## 6  Sample 6  0.9261508  0.8858768  0.8035570         NA
Note the data format: the first column has gene labels/names, and all other columns have numerical expression data. This expression data is ordered by time point, then by replicate, and has evenly spaced time points. Any missing data has its cells left blank (shown as NA in the output above). In order to use the echo_find() function, data must be in this format.
Now, let’s look at one of the expressions in the data, Sample 2. Here we plot each of the replicates in a different color, then shade the range between them in gray.
library(ggplot2)
tp <- seq(2,48,by=2) # our time points
num_reps <- 3 # number of replicates
samp <- 2 # sample we want to look at
ex.df <- expressions[samp,-1] # expression data for the selected sample
# our visualization data frame       
ribbon.df <- data.frame(matrix(ncol = 3+num_reps, nrow = length(tp)))
# assigning column names
colnames(ribbon.df) <- c("Times","Min","Max", 
                         paste(rep("Rep",num_reps),c(1:num_reps), sep=".")) 
ribbon.df$Times <- tp
# getting min values of replicates
ribbon.df$Min <- sapply(seq(1,ncol(ex.df), by = num_reps),
                        function(x) min(unlist(ex.df[,c(x:(num_reps-1+x))]), na.rm = TRUE))
# getting max values of replicates
ribbon.df$Max <- sapply(seq(1,ncol(ex.df), by = num_reps),
                        function(x) max(unlist(ex.df[,c(x:(num_reps-1+x))]), na.rm = TRUE))
# assign each of the replicates to the visualization data frame
for (i in 1:num_reps){ 
  ribbon.df[,3+i] <- t(ex.df[,seq(i,ncol(ex.df),by=num_reps)])
}
# color names
color_bar <- c("Rep.1"="red","Rep.2"="blue","Rep.3"="green")
# visualize, with shading for each row
p <- ggplot(data = ribbon.df,aes(x=Times))+ # declare the dataframe and main variables
  geom_ribbon(aes(x=Times, ymax=Max, ymin=Min, colour="Original"),
              fill = "gray", alpha = 0.5)+ # create shading
  ggtitle(expressions[samp,1])+ # gene name is title
  scale_color_manual("",values=color_bar)+
  scale_fill_manual("",values=color_bar)+
  theme(plot.title = element_text(hjust = .5),
        legend.position = "bottom",legend.direction = "horizontal")+
  labs(x="Hours", y="Expression") #Label for axes
# add specific replicate lines 
for (i in 1:num_reps){
  p <- p +
    geom_line(data = ribbon.df,
              aes_string(x="Times",y=paste("Rep",i,sep = ".")),
              colour=color_bar[i])
}
suppressWarnings(p) # to ignore warnings for missing values
It very clearly has an oscillatory pattern with a small amount of damping, making echo_find() well suited to our dataset.
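Before fitting, it can be worth confirming that a data frame follows the layout described earlier: one label column, then one numeric column per time point and replicate. The following is a minimal sketch using only base R and a small made-up data frame; check_layout() and toy are hypothetical names introduced for illustration and are not part of echo.find.

```r
# Hypothetical helper (not part of echo.find): checks that a data frame has
# one label column followed by one numeric column per time point and replicate.
check_layout <- function(df, begin, end, resol, num_reps) {
  tp <- seq(begin, end, by = resol)            # evenly spaced time points
  expected <- length(tp) * num_reps            # expression columns expected
  numeric_ok <- all(vapply(df[, -1, drop = FALSE], is.numeric, logical(1)))
  list(expected_cols = expected,
       actual_cols = ncol(df) - 1,
       ok = numeric_ok && (ncol(df) - 1 == expected))
}

# Toy data: 2 genes, time points 2 and 4, 3 replicates each (made-up values).
toy <- data.frame(
  Gene.Name = c("Gene A", "Gene B"),
  CT2.1 = c(1.6, -0.6), CT2.2 = c(1.5, NA),  CT2.3 = c(1.5, -0.5),
  CT4.1 = c(1.3, -0.5), CT4.2 = c(NA, -0.4), CT4.3 = c(1.2, -0.6)
)

layout <- check_layout(toy, begin = 2, end = 4, resol = 2, num_reps = 3)
print(layout$ok)
```

For the settings used in this vignette (begin = 2, end = 48, resol = 2, num_reps = 3), the same helper would expect 72 expression columns after the name column.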
So we begin by assigning our parameters and running the echo_find() function. In this first run, we look for rhythms between 20 and 26 hours, with no preprocessing, assigning these results to a new dataframe.
# echo_find() parameters
genes <- expressions
begin <- 2 # first time point
end <- 48 # last time point
resol <- 2 # time point resolution
num_reps <- 3 # number of replicates
low <- 20 # low period seeking
high <- 26 # high period seeking
run_all_per <- FALSE # we are not looking for all periods
paired <- FALSE # these replicates are unrelated, that is, a replicate being 
  # called "replicate 1" at time point 2 means nothing
rem_unexpr <- FALSE # do not remove unexpressed genes
# we do not assign rem_unexpr_amt, since we're not removing unexpressed genes
is_normal <- FALSE # do not normalize
is_de_linear_trend <- FALSE # do not remove linear trends
is_smooth <- FALSE # do not smooth the data
results <- echo_find(genes = genes, begin = begin, end = end, resol = resol, 
  num_reps = num_reps, low = low, high = high, run_all_per = run_all_per,
  paired = paired, rem_unexpr = rem_unexpr, is_normal = is_normal,
  is_de_linear_trend = is_de_linear_trend, is_smooth = is_smooth)
head(results[,1:15])
##   Gene Name Convergence Iterations Forcing.Coefficient Oscillation Type
## 1  Sample 1           1          7         -0.05319486           Driven
## 2  Sample 2           1          4          0.01320725           Damped
## 3  Sample 3           1          6         -0.07006987           Driven
## 4  Sample 4           1         20         -0.07987795           Driven
## 5  Sample 5           1         10         -0.04007403           Driven
## 6  Sample 6           1         22          0.11424173           Damped
##   Amplitude Radian.Frequency   Period Phase Shift Hours Shifted
## 1 1.2113583        0.3141593 20.00000  -0.4740203      1.508853
## 2 0.7143808        0.2749550 22.85169   2.4479303     13.948666
## 3 1.6853526        0.2811149 22.35095   3.7599389      8.975854
## 4 1.0511889        0.2416610 26.00000   0.3759410     24.444345
## 5 1.4596106        0.3141593 20.00000   4.6366571      5.241062
## 6 0.8895897        0.2416610 26.00000   4.5536533      7.156853
##   Equilibrium Value       Tau      P-Value BH Adj P-Value BY Adj P-Value
## 1        0.32913648 0.9020772 9.664994e-41   2.899498e-40   8.997754e-40
## 2        0.12870140 0.9139466 5.990448e-43   2.396179e-42   7.435849e-42
## 3       -0.09077891 0.9920870 1.419260e-68   1.703112e-67   5.285114e-67
## 4       -0.13255814 0.8397626 1.085176e-31   1.860301e-31   5.772907e-31
## 5        0.44750824 0.9117506 3.291152e-43   1.974691e-42   6.127882e-42
## 6        0.43648666 0.7836930 3.312238e-26   4.968357e-26   1.541786e-25
Now we can see that the results data frame has information about the fitted parameters, including the forcing coefficient, the oscillation type (damped, driven, harmonic, etc.), and p-values. Let’s look at how the fit and parameters turned out for our initial sample. Here we add the fitted values to our plot in black and print the parameters to the console.
# assign the fit to the visualization data frame
ribbon.df$Fit <- t(results[samp,(16+(length(tp)*num_reps)):ncol(results)])
# visualize, with shading for each row
# add Fit line
p <- p +
  geom_line(data = ribbon.df,
            aes_string(x="Times",y="Fit"),
            colour="black")
suppressWarnings(p) # to ignore warnings for missing values
# print sample's parameters
cat(paste0("Gene Name: ",results$`Gene Name`[samp],"\n",
           "Convergence:", results$Convergence[samp],"\n",
           "Iterations:",results$Iterations[samp],"\n",
           "Forcing Coefficient:", results$Forcing.Coefficient[samp],"\n",
           "Oscillation Type:",results$`Oscillation Type`[samp],"\n",
           "Amplitude:", results$Amplitude[samp],"\n",
           "Radian.Frequency:",results$Radian.Frequency[samp],"\n",
           "Period:",results$Period[samp],"\n",
           "Phase Shift:",results$`Phase Shift`[samp],"\n",
           "Hours Shifted:",results$`Hours Shifted`[samp],"\n",
           "P-Value:",results$`P-Value`[samp],"\n",
           "BH Adj P-Value:",results$`BH Adj P-Value`[samp],"\n",
           "BY Adj P-Value:",results$`BY Adj P-Value`[samp],"\n"))
## Gene Name: Sample 2
## Convergence:1
## Iterations:4
## Forcing Coefficient:0.0132072534540353
## Oscillation Type:Damped
## Amplitude:0.714380770631262
## Radian.Frequency:0.274954971032941
## Period:22.8516883458231
## Phase Shift:2.44793026287669
## Hours Shifted:13.9486659575375
## P-Value:5.99044824813101e-43
## BH Adj P-Value:2.3961792992524e-42
## BY Adj P-Value:7.43584918834744e-42
This fit matches the trend closely, which is supported by the very low adjusted p-values. As we predicted, the oscillation is also damped, as shown by the positive forcing coefficient and the reported oscillation type.
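To make the role of the forcing coefficient concrete, here is a minimal sketch that evaluates an extended harmonic oscillator with the Sample 2 parameters printed above. The functional form used, x(t) = A exp(-γt/2) cos(ωt + φ) + y, is an assumption based on the model described in the cited paper rather than something shown in this vignette, and echo_model() is a hypothetical helper, not a function exported by echo.find.

```r
# Hypothetical helper: assumed extended harmonic oscillator form,
# x(t) = A * exp(-gamma * t / 2) * cos(omega * t + phi) + y
echo_model <- function(t, A, gamma, omega, phi, y) {
  A * exp(-gamma * t / 2) * cos(omega * t + phi) + y
}

# Sample 2 parameters from the first run (rounded as in the results table)
A     <- 0.7143808   # Amplitude
gamma <- 0.01320725  # Forcing.Coefficient (positive, reported as Damped)
omega <- 0.2749550   # Radian.Frequency
phi   <- 2.4479303   # Phase Shift
y     <- 0.12870140  # Equilibrium Value

tp <- seq(2, 48, by = 2)
fit <- echo_model(tp, A, gamma, omega, phi, y)

# The period follows from the radian frequency
period <- 2 * pi / omega
print(round(period, 2))

# Under this form, the envelope A * exp(-gamma * t / 2) shrinks when gamma > 0
envelope <- A * exp(-gamma * tp / 2)
print(all(diff(envelope) < 0))
```

The computed period should agree with the Period column for Sample 2, and under the assumed form a negative forcing coefficient would make the envelope grow instead, which is consistent with the Driven label in the results.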
Now let’s see how preprocessing affects the results. Here we search for all possible periods, using the default values for low and high, as well as allowing for all our preprocessing options: removing unexpressed genes, normalizing, removing linear trends, and smoothing.
run_all_per <- TRUE # looking for all possible periods
rem_unexpr <- TRUE # remove unexpressed genes
rem_unexpr_amt <- 70 # percentage of unexpressed genes
is_normal <- TRUE # normalize
is_de_linear_trend <- TRUE # remove linear trends
is_smooth <- TRUE # smooth the data
# we're using the default values of low and high, since we're looking for all periods
results <- echo_find(genes = genes, begin = begin, end = end, resol = resol, 
  num_reps = num_reps, run_all_per = run_all_per, paired = paired, 
  rem_unexpr = rem_unexpr, rem_unexpr_amt = rem_unexpr_amt, is_normal = is_normal,
  is_de_linear_trend = is_de_linear_trend, is_smooth = is_smooth)
head(results[,1:15])
##   Gene Name Convergence Iterations Forcing.Coefficient Oscillation Type
## 1  Sample 1           1          7        -0.051536453           Driven
## 2  Sample 2           1          5         0.011582231           Damped
## 3  Sample 3           1         22         0.005357267         Harmonic
## 4  Sample 4           1          9        -0.078117229           Driven
## 5  Sample 5           1          9        -0.026718082           Driven
## 6  Sample 6           1         19         0.036078605           Damped
##    Amplitude Radian.Frequency   Period Phase Shift Hours Shifted
## 1  0.6164627        0.3243511 19.37155  -0.6585025      2.030215
## 2  1.5772248        0.2731878 22.99951   2.5290548     13.741942
## 3 -0.6307775        0.4780950 13.14213  -1.5783226      9.872337
## 4  0.4262496        0.2191198 28.67466   0.9115305     24.514698
## 5  0.9433894        0.3249436 19.33624  -1.9850863      6.109019
## 6  1.6033760        0.1655006 37.96474   5.9626819      1.936570
##   Equilibrium Value       Tau      P-Value BH Adj P-Value BY Adj P-Value
## 1       0.002159052 0.9515331 9.508710e-52   5.705226e-51   1.770452e-50
## 2       0.095058330 0.9505440 1.812635e-51   7.250539e-51   2.249995e-50
## 3       0.002815542 0.1562809 7.412930e-02   7.412930e-02   2.300388e-01
## 4       0.110273440 0.9287834 4.887969e-46   1.466391e-45   4.550520e-45
## 5      -0.267901020 0.8925659 9.792497e-40   1.678714e-39   5.209402e-39
## 6      -0.040058335 0.7127098 1.937502e-20   2.325002e-20   7.214971e-20
Since we’ve now searched for all possible periods, the fitted periods can fall outside the range of 20 to 26 hours that we set in our first run (Sample 6, for example, now has a period of about 38 hours). Let’s see how this affected the fit and parameters of the sample we looked at.
rep_genes <- results[samp,16:(15+(length(tp)*num_reps))]
 # getting min values of replicates
ribbon.df$Min <- sapply(seq(1,ncol(rep_genes), by = num_reps), 
                        function(x) min(unlist(rep_genes[,c(x:(num_reps-1+x))]),
                                        na.rm = TRUE))
 # getting max values of replicates
ribbon.df$Max <- sapply(seq(1,ncol(rep_genes), by = num_reps),
                        function(x) max(unlist(rep_genes[,c(x:(num_reps-1+x))]),
                                        na.rm = TRUE))
for (i in 1:num_reps){ # assign each of the replicates
  ribbon.df[,3+i] <- t(rep_genes[,seq(i,ncol(rep_genes),by=num_reps)])
}
# assign the fit to the visualization data frame
ribbon.df$Fit <- t(results[samp,(16+(length(tp)*num_reps)):ncol(results)])
# visualize, with shading for each row
p <- ggplot(data = ribbon.df,aes(x=Times))+ # declare the dataframe and main variables
  geom_ribbon(aes(x=Times, ymax=Max, ymin=Min, colour="Original"),
              fill = "gray", alpha = 0.5)+ # create shading
  ggtitle(expressions[samp,1])+ # gene name is title
  scale_color_manual("",values=color_bar)+
  scale_fill_manual("",values=color_bar)+
  theme(plot.title = element_text(hjust = .5),
        legend.position = "bottom",legend.direction = "horizontal")+
  labs(x="Hours", y="Expression") #Label for axes
# add specific replicate lines 
for (i in 1:num_reps){
  p <- p +
    geom_line(data = ribbon.df,
              aes_string(x="Times",y=paste("Rep",i,sep = ".")),
              colour=color_bar[i])
}
# add Fit line
p <- p +
  geom_line(data = ribbon.df,
            aes_string(x="Times",y="Fit"),
            colour="black")
suppressWarnings(p) # to ignore warnings for missing values
# print sample's parameters
cat(paste0("Gene Name: ",results$`Gene Name`[samp],"\n",
           "Convergence:", results$Convergence[samp],"\n",
           "Iterations:",results$Iterations[samp],"\n",
           "Forcing Coefficient:", results$Forcing.Coefficient[samp],"\n",
           "Oscillation Type:",results$`Oscillation Type`[samp],"\n",
           "Amplitude:", results$Amplitude[samp],"\n",
           "Radian.Frequency:",results$Radian.Frequency[samp],"\n",
           "Period:",results$Period[samp],"\n",
           "Phase Shift:",results$`Phase Shift`[samp],"\n",
           "Hours Shifted:",results$`Hours Shifted`[samp],"\n",
           "P-Value:",results$`P-Value`[samp],"\n",
           "BH Adj P-Value:",results$`BH Adj P-Value`[samp],"\n",
           "BY Adj P-Value:",results$`BY Adj P-Value`[samp],"\n"))
## Gene Name: Sample 2
## Convergence:1
## Iterations:5
## Forcing Coefficient:0.0115822306119123
## Oscillation Type:Damped
## Amplitude:1.57722478904992
## Radian.Frequency:0.273187781527227
## Period:22.999510710377
## Phase Shift:2.52905477304299
## Hours Shifted:13.7419415800719
## P-Value:1.81263485251528e-51
## BH Adj P-Value:7.25053941006114e-51
## BY Adj P-Value:2.24999513200891e-50
We can see that the fit hasn’t changed very much, but that the smoothing has brought the replicates much closer to each other. This smoothing has also reduced the amount of damping in the system: the forcing coefficient has decreased by about 0.002 (from roughly 0.0132 to 0.0116).
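The change in damping between the two runs can be checked directly from the two printed forcing coefficients. The sketch below does that arithmetic and, assuming a decaying envelope of the form exp(-γt/2) from the cited paper's model (an assumption, not something this vignette states), compares how much of the initial amplitude would remain at the last time point under each fit. The name remaining() is hypothetical.

```r
# Forcing coefficients for Sample 2, copied from the two printed outputs
gamma_raw    <- 0.0132072534540353  # first run, no preprocessing
gamma_smooth <- 0.0115822306119123  # second run, all preprocessing enabled

# Decrease in the forcing coefficient between runs
delta <- gamma_raw - gamma_smooth
print(round(delta, 3))

# Assumed envelope exp(-gamma * t / 2): fraction of amplitude left at t = 48
remaining <- function(gamma, t = 48) exp(-gamma * t / 2)
print(c(raw = remaining(gamma_raw), smooth = remaining(gamma_smooth)))
```

A smaller positive coefficient leaves a larger fraction of the amplitude at 48 hours, which is what a reduction in damping means under this assumed form.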
Now that you understand the basics of using the echo_find() function, feel free to experiment and play around with this vignette and the example data. Good luck with your rhythm searches!