Researchers should estimate and report the reliability of their cognitive tasks. Unfortunately, this is the exception rather than the rule. The routine reporting of reliability estimates stands to benefit our science in several ways (I also discuss this in greater depth elsewhere):
Reporting the reliability of our tasks enables us to make more informed judgments about the implications of our analysis. To quote LeBel and Paunonen, “researchers should calibrate their confidence in their experimental results as a function of the amount of random measurement error contaminating the scores of the dependent variable.” (2011, pp. 578). In other words, we should be more confident in our results if our tasks are reliable, and less confident if our tasks show low reliability.
The replicability of our findings is directly influenced by the reliability of our tasks (LeBel & Paunonen, 2011). Thus, we have a greater chance of replicating results indicating real effects, when our tasks are more reliable. In other terms, greater task reliability results in greater power to detect an effect.
The routine reporting of task reliability should facilitate the development of our experimental tasks. I am not advocating simply discarding tasks that show low reliability. Rather, researchers utilizing and developing those tasks are more able to make informed decisions about the reliability of specific variations of their tasks. I do argue however, that it is difficult to improve our measures if their reliability is not routinely estimated and reported.
The DPsplithalf function was developed to provide a relatively straightforward way for researchers to calculate the split-half reliability of their dot-probe tasks, specifically the reliability of the bias index. For those not familiar with the emotional dot-probe;in the dot-probe task (e.g. MacLeod et al., 1986) one threat stimuli and one neutral stimuli are presented simultaneously at two locations. Participants then respond to a probe which is presented in the location of one of these stimuli. Speeded responses to probes that appeared in the location of the threat stimuli, compared to the neutral stimuli are assumed to represent a selective attentional bias towards threat. The DPsplithalf function returns the split-half reliability of this bias index.
Before moving to the examples, I have three further notes.
First, blanket statements are often made about the low reliability of the dot-probe task. However, the task reliability is not routinely reported. This offers two possibilities; either the reliability is being calculated and not reported, or researchers are referring to the mixed results in the dot-probe literature, which, although related is a very different beast. Several studies report very low task reliability, and have argued for the unreliability of the dot-probe (e.g. Schmukle, 2005; Staugaard, 2009). However, given the breadth of research that utilizes variations of the task, generalizing these poor results to the wider literature is perhaps misguided. I propose that routine reporting of task reliability estimates would help to solve these issues.
Second, reliability estimates in most papers (at least in the dot-probe literature) have reported the split-half reliability using an odd-even split. To illustrate, a bias index is calculated from only odd numbered trials and compared to the bias index calculated from only even numbered trials. This is a common approach. However, these estimates can be less stable, and a Monte Carlo simulation process repeating a large number of calculations based on randomly selecting halves will yield a more stable estimate of reliability (Enock, Hofmann, McNally, 2014). Therefore, the functions included in this package have been developed to facilitate this method and it is recommended that this approach is used.
Third, the functions contained within this package will also be applicable to other cognitive tasks. We encourage researchers to contact us regarding specific tasks that do not easily conform to any of the splithalf package functions. This will allow us to develop new functions tailored to those tasks, employing the same underlying processes to estimate reliability.
In this document I will present two examples of the functions contained within the splithalf package. The first will be straightforward, using a simulated data set in which there are no missing data and the variables are named as required by the function. The second will include several lines of code that will ensure that the congruency of the trials and the conditions are correctly named.
First, lets look at the data. This is simulated data contained within the package. It has been designed to be as straightforward as possible for the purposes of this example. You will note that the data contains no missing data, the variables do not need to be renamed, and the congruency is appropriately named (example 2 will cover when this is not the case).
str(DPdata)## 'data.frame':    3840 obs. of  6 variables:
##  $ subject   : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ blockcode : Factor w/ 2 levels "block1","block2": 1 1 1 1 1 1 1 1 1 1 ...
##  $ trialnum  : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ congruency: Factor w/ 2 levels "Congruent","Incongruent": 1 2 1 2 1 2 1 2 1 2 ...
##  $ latency   : num  27.1 24.7 24.2 25 23.3 ...
##  $ correct   : num  1 1 1 1 1 1 1 1 1 1 ...head(DPdata)##   subject blockcode trialnum  congruency  latency correct
## 1       1    block1        1   Congruent 27.10857       1
## 2       1    block1        2 Incongruent 24.66168       1
## 3       1    block1        3   Congruent 24.20617       1
## 4       1    block1        4 Incongruent 24.97494       1
## 5       1    block1        5   Congruent 23.32984       1
## 6       1    block1        6 Incongruent 24.80058       1sum(is.na(DPdata))## [1] 0In this example, we want to calculate the split-half reliability for block 1 and block 2. We want to use the “random” method to process 5000 iterations of randomly generated splits, and return estimates for both blocks. The function will take a little time to run (the time taken while creating this vignette was 23s).
library(splithalf)## 
## Attaching package: 'splithalf'## The following object is masked _by_ '.GlobalEnv':
## 
##     DPdataDPsplithalf(DPdata, conditionlist = c("block1","block2"), halftype = "random", no.iterations = 5000)## [1] "condition block1 complete"
## [1] "condition block2 complete"
## [1] "Calculating split half estimates"
## [1] "Split half estimates for 5000 random splits"##   condition  N  splithalf spearmanbrown
## 1    block1 20 -0.9352083     -32.44127
## 2    block2 20 -0.9396368     -34.95403What you see first is the console output. You will be informed as each condition has been processed and at which point the estimates are being calculated. This is purely so the user can keep track of the progress of the function.
Next, the function output. The data frame returns each condition separately. The N column indicates the number of participants’ data that has been processed, which is important to check as it is an early indication of missing data. The splithalf column is the raw split half estimate, in this case the average of 5000 random splits. The spearmanbrown column returns the spearman-brown corrected reliability estimate. You will quickly note that this estimate is flawed in this example; this is due to the high negative splithalf estimate. This is expected with this randomly simulated data.
The purpose of this example is to provide a more practical example of how to use the DPsplithalf() function with real data. Therefore, I have included snippets of script that will enable users to adapt their data where necessary. We will take the example step-by-step to cover many of the issues that could arise. First, note that the data frame does not actually contain any missing data, however, there is no data for subject 15 in block 2.
sum(is.na(DPdata_missing))## [1] 0head(subset(DPdata_missing, DPdata_missing$accuracy == 0))##      subject  block trialnumber    trialtype responsetime accuracy
## 2785      15 block2           1   Congruent1     25.36413        0
## 2786      15 block2           2 Incongruent2     25.64574        0
## 2787      15 block2           3   Congruent1     24.67010        0
## 2788      15 block2           4 Incongruent2     23.43138        0
## 2789      15 block2           5   Congruent1     25.88791        0
## 2790      15 block2           6 Incongruent2     24.04770        0From the head() data above we can also see that the variable names do not conform to what is needed to run the function. First, we will rename the Congruent and Incongruent trials. The function requires that a “congruency” variable specifies whether the trial is “Congruent” of “Incongruent” (case sensitive). One such method is to run the following. The code searches in the trialtype variable for strings that contain “Incongruent” and returns “Incongruent” if present, and “Congruent” if not. This simple line of script can be adapted to meet most namings. Note, ensure that the name searched for is not present in other conditions, otherwise these will be included too.
DPdata_missing$congruency <- as.factor(ifelse(grepl("Incongruent", DPdata_missing$trialtype), "Incongruent", "Congruent"))
str(DPdata_missing$congruency)##  Factor w/ 2 levels "Congruent","Incongruent": 1 2 1 2 1 2 1 2 1 2 ...Next, at the moment we have 4 different conditions in this task, however we want the same 2. This may happen for example, when you have two stimuli lists that are counterbalanced between participants. The next line of code is a slight variation on the above to return the blockname without the additional “a”. Also note that there are a large number of methods to do this within R. This example is meant to be informative to those with minimal R experience, as well as being applicable across a number of contexts. In short, the code tests whether the names of the block conform to either condition, and return that condition.
DPdata_missing$block <- as.factor(ifelse(DPdata_missing$block  == "block1" | 
                                         DPdata_missing$block  == "block1a",
                                         "block1",
                                  ifelse(DPdata_missing$block  == "block2" |
                                         DPdata_missing$block  == "block2a",
                                        "block2","")))
str(DPdata_missing$block)##  Factor w/ 2 levels "block1","block2": 1 1 1 1 1 1 1 1 1 1 ...Now that our congruency and block names are correct, all we need to do is ensure that the variable names are specified within the function. Reminder that the variable names are defaulted to; var.RT = “latency” var.condition = “blockcode” var.participant = “subject” var.correct = “correct” var.trialnum = ‘trialnum’
We will therefore specify; condition, trialnum, RT, correct
example2 <- DPsplithalf(DPdata_missing, conditionlist = c("block1","block2"), halftype = "random",                                   no.iterations = 5000, var.condition = "block", var.trialnum = "trialnumber", 
                        var.RT = "responsetime", var.correct = "accuracy" )## [1] "condition block1 complete"
## [1] "condition block2 complete"
## [1] "Calculating split half estimates"
## [1] "the following are participants/conditions with missing data"
##        condition participant
## 170001    block2          15
## [1] "note: these iterations will be removed from the split half\n          reliability calculations, in that condition"## Warning in DPsplithalf(DPdata_missing, conditionlist = c("block1", "block2"), : Bias indices missing:
##           at least one participant has missing data from at one condition
##           These cases are removed from calculating reliability estimates
##           $omitted contains the missing cases## [1] "Split half estimates for 5000 random splits"example2$Estimates##   condition  N  splithalf spearmanbrown
## 1    block1 20 -0.9252310     -27.81283
## 2    block2 19 -0.8940196     -18.98773head(example2$omitted)##        condition participant iteration bias1 bias2
## 170001    block2          15         1   NaN   NaN
## 170002    block2          15         2   NaN   NaN
## 170003    block2          15         3   NaN   NaN
## 170004    block2          15         4   NaN   NaN
## 170005    block2          15         5   NaN   NaN
## 170006    block2          15         6   NaN   NaNYou will notice that we have much more in our output than in example 1. You will see a warning message pointing out that there is missing data. In the console you will see a short message “the following are participants/conditions with missing data”, which is followed by a data frame showing which participants have have missing data from which conditions. The function will still calculate split half reliability. The function output two data frames (I also recommend loading the function into an object to ease this step). The Estimates data frame includes the split-half reliability estimates as described in example 1. This will also highlight which conditions have missing participants in the N column. The omitted data frame is a complete data frame of all omitted iterations. In this example we have one participant in one condition missing and therefore the omitted data frame contains all 5000 iterations of missing data.
An important question is; how long does this function take to run? The quicker the functions run, the more usable. Given that ideally we will run 5000 simulations as standard, we want the script to run quickly. No-body is keen to run a 2 hour script on 20 participants’ data to get only a single value as an output. Much of the recent development of this package has actually been reducing the run time of the functions. This has been very successful, going from some 45 minutes for roughly 40 participants in 2 conditions, to about 45 seconds. To illustrate the likely run-time of the script I have created the below figure. The first section of the script was used to simulate the data (I then replicated the output in the next section, in order to avoid the need to run the simulation to knit this document). Next is the data table, and the graph showing the run time in seconds across 1,2,3,4,5, and 10 conditions, with sample sizes of 10,20,30,40,50,100,200, and 500. This should give a rough idea of the likely run time of the DPsplithalf function for other data sets (note that the simulated data does not have missing values).
conditionnumber <- c(1,2,3,4,5,10)
samplesize <- c(10,20,30,40,50,100,200,500)
temp <- NA
times <- NULL
for(i in conditionnumber)
{
  for(j in samplesize)
  {
    # ensure the correct number of conditions
    if(i == 1) {
      conlist <- c("block1")
    } else if(i == 2) {
      conlist <- c("block1","block2")
    } else if(i == 3) {
      conlist <- c("block1","block2","block3")
    } else if(i == 4) {
      conlist <- c("block1","block2","block3","block4")
    } else if(i == 5) {
      conlist <- c("block1","block2","block3","block4","block5")
    } else {
      conlist <- c("block1","block2","block3","block4","block5",
                   "block6","block7","block8","block9","block10")
    }
  # generate the simulated data  
   temp <-  data.frame(subject = rep(1:j, each = (96*i)),
                       blockcode = rep(conlist, each = 96, length.out = i*j*96),
                       trialnum = rep(1:96, length.out = i*j*96),
                       congruency = rep(c("Congruent","Incongruent"), length.out = i*j*96),
                       latency = rep(rnorm(100,100,25), length.out = i*j*96),
                       correct = rep(1, length.out = i*j*96) )
  
    # run DPsplithalf
    time[j] <- system.time(DPsplithalf(temp, conditionlist = conlist, 
                                       halftype = "random", no.iterations = 5000))
   
   # save the data
   times <- rbind(times, c(i,j,time[j]))
   
   # keep track of the runs
   print(paste("completed",i,"conditions",j,"participants", sep = " "))
  }
}times <- read.csv("timestable.csv")
summary(lm(data=times, System.time ~ Conditions + Sample.size))## 
## Call:
## lm(formula = System.time ~ Conditions + Sample.size, data = times)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -552.79  -55.82    7.85   75.60 1138.02 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -234.5706    62.8531  -3.732 0.000531 ***
## Conditions    56.7447    11.3265   5.010 8.90e-06 ***
## Sample.size    1.8793     0.2124   8.847 2.09e-11 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 228.4 on 45 degrees of freedom
## Multiple R-squared:  0.6967, Adjusted R-squared:  0.6832 
## F-statistic: 51.68 on 2 and 45 DF,  p-value: 2.2e-12library(ggplot2)
times$Conditions <- as.factor(times$Conditions)
ggplot2::ggplot(times, aes(x = Sample.size, y = System.time, linetype = Conditions)) +
  geom_line(size = 1)The figure highlights the run time for 48 simulations of the DPsplithalf function. As you can see, there is a linear increase in the run time as a function of the number of conditions and the sample size. It is worth noting that most research utilizing cognitive tasks such as the Dot-Probe recruit smaller sample sizes, and so for most data sets, the estimation of task reliability should take no more than a few minutes. A linear model also yields the following equation to estimate the amount of time needed for the function to run: -234 + 56.7(Conditions) + 1.9(Sample size). Please note that the operating system and the actual data will have a large effect on the time taken to run the functions.
Enock, P. M., Hofmann, S. G., & McNally, R. J. (2014). Attention Bias Modification Training Via Smartphone to Reduce Social Anxiety: A Randomized, Controlled Multi-Session Experiment. Cognitive Therapy and Research, 38(2), 200–216.
LeBel, E. P., & Paunonen, S. V. (2011). Sexy But Often Unreliable: The Impact of Unreliability on the Replicability of Experimental Findings With Implicit Measures. Personality and Social Psychology Bulletin, 37(4), 570–583.
Schmukle, S. C. (2005). Unreliability of the dot probe task. European Journal of Personality, 19(7), 595–605.
Staugaard, S. R. (2009). Reliability of two versions of the dot-probe task using photographic faces. Psychology Science Quarterly, 51(3), 339–350.