In combinatorics, sometimes it can be difficult to figure out exactly which tool/method we need to attack our problem. Do we need combinations? permutations? Cartesian product? partitions? compositions? What about repetition or multiplicity? The list goes on. Oftentimes the solution ends up being overly complicated and prone to error, or more commonly, a simple brute force solution is employed. The latter is okay in some situations, but in many real world problems, this approach becomes untenable very quickly. The two types of problems addressed below fall into this category.
Given a list of vectors, v1, v2, … , vn, where the intersection of two or more vectors in non-empty, find all unique combinations (order does not matter) of elements of the Cartesian product of all of the vectors.
For example, lets say we have: v1 = 1:4 and
v2 = 2:5. The Cartesian product is given by
expand.grid(v1, v2) (We continue to use the ht
function defined in the Combination
and Permutation Basics vignette):
ht <- function(d, m = 5, n = m) {
  ## print the head and tail together
  cat("head -->\n")
  print(head(d, m))
  cat("--------\n")
  cat("tail -->\n")
  print(tail(d, n))
}
expand.grid(1:4, 2:5)
#>    Var1 Var2
#> 1     1    2
#> 2     2    2
#> 3     3    2  ### <-- Same as row 6
#> 4     4    2  ### <-- Same as row 10
#> 5     1    3
#> 6     2    3  ### <-- Same as row 3
#> 7     3    3
#> 8     4    3  ### <-- Same as row 11
#> 9     1    4
#> 10    2    4  ### <-- Same as row 4
#> 11    3    4  ### <-- Same as row 8
#> 12    4    4
#> 13    1    5
#> 14    2    5
#> 15    3    5
#> 16    4    5If we don’t care about order, the following row pairs are considered equal and can therefore be pruned to obtain our desired results:
With comboGrid no duplicates are generated:
library(RcppAlgos)
comboGrid(1:4, 2:5)
#>       Var1 Var2
#>  [1,]    1    2
#>  [2,]    1    3
#>  [3,]    1    4
#>  [4,]    1    5
#>  [5,]    2    2
#>  [6,]    2    3
#>  [7,]    2    4
#>  [8,]    2    5
#>  [9,]    3    3
#> [10,]    3    4
#> [11,]    3    5
#> [12,]    4    4
#> [13,]    4    5Note that the order of expand.grid and
comboGrid differ. The order of comboGrid is
lexicographical meaning that the last column will vary the fastest
whereas with expand.grid, the first column will vary the
fastest.
You will also note that the output of expand.grid is a
data.frame whereas with comboGrid, we get a
matrix. With comboGrid, we only get a
data.frame when the classes of each vector are different as
generally speaking, working with matrices is preferable.
With the small example above, we only had to filter out 3 out of 16
total results (less than 20%). That isn’t that bad. If this was the
general case, we might as well just stick with expand.grid
as it is very efficient. Unfortunately, this is not the general case and
as the number of vectors with overlap increases, filtering will become
impractical.
Consider the following example:
pools = list(c(1, 10, 14, 6),
             c(7, 2, 4, 8, 3, 11, 12),
             c(11, 3, 13, 4, 15, 8, 6, 5),
             c(10, 1, 3, 2, 9, 5,  7),
             c(1, 5, 10, 3, 8, 14),
             c(15, 3, 7, 10, 4, 5, 8, 6),
             c(14, 9, 11, 15),
             c(7, 6, 13, 14, 10, 11, 9, 4),
             c(6,  3,  2, 14,  7, 12,  9),
             c(6, 11,  2,  5, 15,  7))
## If we used expand.grid, we would have to filter
## more than 100 million results
prod(lengths(pools))
#> [1] 101154816
## With comboGrid, this is no problem
system.time(myCombs <- comboGrid(pools))
#>    user  system elapsed 
#>   0.259   0.024   0.283
print(object.size(myCombs), unit = "Mb")
#> 92 Mb
ht(myCombs)
#> head -->
#>      Var1 Var2 Var3 Var4 Var5 Var6 Var7 Var8 Var9 Var10
#> [1,]    1    2    3    1    1    3    9    4    2     2
#> [2,]    1    2    3    1    1    3    9    4    2     5
#> [3,]    1    2    3    1    1    3    9    4    2     6
#> [4,]    1    2    3    1    1    3    9    4    2     7
#> [5,]    1    2    3    1    1    3    9    4    2    11
#> --------
#> tail -->
#>            Var1 Var2 Var3 Var4 Var5 Var6 Var7 Var8 Var9 Var10
#> [1205736,]   14   12   15   10   14   15   15   11   12    15
#> [1205737,]   14   12   15   10   14   15   15   13   12    15
#> [1205738,]   14   12   15   10   14   15   15   13   14    15
#> [1205739,]   14   12   15   10   14   15   15   14   12    15
#> [1205740,]   14   12   15   10   14   15   15   14   14    15
## This is just the time to create the cartesian product
## Generating keys, then filtering will take much more time
system.time(cartProd <- expand.grid(pools))
#>    user  system elapsed 
#>   3.575   0.817   4.393
## Creates huge object
print(object.size(cartProd), unit = "Mb")
#> 7717.5 Mb
## What if we want results with unique values...
## Simply set repetition = FALSE
system.time(myCombsNoRep <- comboGrid(pools, repetition = FALSE))
#>    user  system elapsed 
#>   0.003   0.000   0.003
ht(myCombsNoRep)
#> head -->
#>      Var1 Var2 Var3 Var4 Var5 Var6 Var7 Var8 Var9 Var10
#> [1,]    1    2    3    5    8    4    9    6    7    11
#> [2,]    1    2    3    5    8    4    9    6    7    15
#> [3,]    1    2    3    5    8    4    9    6   12     7
#> [4,]    1    2    3    5    8    4    9    6   12    11
#> [5,]    1    2    3    5    8    4    9    6   12    15
#> --------
#> tail -->
#>         Var1 Var2 Var3 Var4 Var5 Var6 Var7 Var8 Var9 Var10
#> [2977,]   14    3    4    5    8   15    9   13   12    11
#> [2978,]   14    3    4    7    5   15    9   13   12    11
#> [2979,]   14    3    4    7    8   15    9   13   12    11
#> [2980,]   14    3    5    7    8   15    9   13   12    11
#> [2981,]   14    4    5    7    8   15    9   13   12    11The function comboGrid was highly inspired by the
following question on stackoverflow:
Currenlty, the underlying algorithm is not the gold standard. By that, we mean that results are not generated one by one. Efforts are underway to achieve this, but up until this point it has proven quite difficult (See the comprehensive answer by Tim Peters (yes, that Tim Peters)).
The algorithm in comboGrid leverages The
Fundamental Theorem of Arithmetic to efficiently generate keys that
will be used in a hash function to determine if a particular combination
of elements have been encountered. For greater efficiency, we make use
of deduplication as user2357112
suggests.
comboGroupsGiven a vector of length n and k groups, where
k divides n, each group is comprised of a combination
of the vector chosen g = n / k at a time. As is stated in the
documentation (see ?comboGroups), these can be constructed
by first generating all permutations of the vector and subsequently
removing entries with permuted groups. Let us consider the following
example. Given v = 1:12, generate all partitions
v into 3 groups each of size 4.
funBruteGrp <- function(myLow = 1, myUp) {
    mat <- do.call(rbind, permuteGeneral(12, lower = myLow, upper = myUp,
        FUN = function(x) {
        sapply(seq(0, 8, 4), function(y) {
             paste0(c("(", x[(y + 1):(y + 4)], ")"), collapse = " ")
        })
    }))
    colnames(mat) <- paste0("Grp", 1:3)
    rownames(mat) <- myLow:myUp
    mat
}
## All of these are the same as only the 3rd group is being permuted
funBruteGrp(myUp = 6)
#>   Grp1          Grp2          Grp3            
#> 1 "( 1 2 3 4 )" "( 5 6 7 8 )" "( 9 10 11 12 )"
#> 2 "( 1 2 3 4 )" "( 5 6 7 8 )" "( 9 10 12 11 )"
#> 3 "( 1 2 3 4 )" "( 5 6 7 8 )" "( 9 11 10 12 )"
#> 4 "( 1 2 3 4 )" "( 5 6 7 8 )" "( 9 11 12 10 )"
#> 5 "( 1 2 3 4 )" "( 5 6 7 8 )" "( 9 12 10 11 )"
#> 6 "( 1 2 3 4 )" "( 5 6 7 8 )" "( 9 12 11 10 )"
## We found our second distinct partition
funBruteGrp(myLow = 23, myUp = 26)
#>    Grp1          Grp2          Grp3            
#> 23 "( 1 2 3 4 )" "( 5 6 7 8 )" "( 12 11 9 10 )"
#> 24 "( 1 2 3 4 )" "( 5 6 7 8 )" "( 12 11 10 9 )"
#> 25 "( 1 2 3 4 )" "( 5 6 7 9 )" "( 8 10 11 12 )"  ### <-- 2nd distinct partition of groups
#> 26 "( 1 2 3 4 )" "( 5 6 7 9 )" "( 8 10 12 11 )"
funBruteGrp(myLow = 48, myUp = 50)
#>    Grp1          Grp2           Grp3            
#> 48 "( 1 2 3 4 )" "( 5 6 7 9 )"  "( 12 11 10 8 )"
#> 49 "( 1 2 3 4 )" "( 5 6 7 10 )" "( 8 9 11 12 )"  ### <-- 3rd distinct partition of groups
#> 50 "( 1 2 3 4 )" "( 5 6 7 10 )" "( 8 9 12 11 )"We are starting to see a pattern. Each new partition is exactly 24
spots away. This makes sense as there are factorial(4) = 24
permutations of size 4. Now, this is an oversimplification as if we
simply generate every 24th permutation, we will
still get duplication as they start to carry over to the other groups.
Observe:
do.call(rbind, lapply(seq(1, 169, 24), function(x) {
    funBruteGrp(myLow = x, myUp = x)
}))
#>     Grp1          Grp2           Grp3            
#> 1   "( 1 2 3 4 )" "( 5 6 7 8 )"  "( 9 10 11 12 )"
#> 25  "( 1 2 3 4 )" "( 5 6 7 9 )"  "( 8 10 11 12 )"
#> 49  "( 1 2 3 4 )" "( 5 6 7 10 )" "( 8 9 11 12 )" 
#> 73  "( 1 2 3 4 )" "( 5 6 7 11 )" "( 8 9 10 12 )" 
#> 97  "( 1 2 3 4 )" "( 5 6 7 12 )" "( 8 9 10 11 )" 
#> 121 "( 1 2 3 4 )" "( 5 6 8 7 )"  "( 9 10 11 12 )"  ### <-- This is the same as the 1st
#> 145 "( 1 2 3 4 )" "( 5 6 8 9 )"  "( 7 10 11 12 )"  ### partition. The only difference is
#> 169 "( 1 2 3 4 )" "( 5 6 8 10 )" "( 7 9 11 12 )"   ### that the 2nd Grp has been permutedThis only gets more muddled as the number of groups increases. It is also very inefficient, however this exercise hopefully serves to better illustrate these structures.
The algorithm in comboGroups avoids all of this
duplication by implementing a novel algorithm akin to std::next_permutation
from the algorithm library in C++.
system.time(comboGroups(12, numGroups = 3))
#>    user  system elapsed 
#>       0       0       0
ht(comboGroups(12, numGroups = 3))
#> head -->
#>      Grp1 Grp1 Grp1 Grp1 Grp2 Grp2 Grp2 Grp2 Grp3 Grp3 Grp3 Grp3
#> [1,]    1    2    3    4    5    6    7    8    9   10   11   12
#> [2,]    1    2    3    4    5    6    7    9    8   10   11   12
#> [3,]    1    2    3    4    5    6    7   10    8    9   11   12
#> [4,]    1    2    3    4    5    6    7   11    8    9   10   12
#> [5,]    1    2    3    4    5    6    7   12    8    9   10   11
#> --------
#> tail -->
#>         Grp1 Grp1 Grp1 Grp1 Grp2 Grp2 Grp2 Grp2 Grp3 Grp3 Grp3 Grp3
#> [5771,]    1   10   11   12    2    5    8    9    3    4    6    7
#> [5772,]    1   10   11   12    2    6    7    8    3    4    5    9
#> [5773,]    1   10   11   12    2    6    7    9    3    4    5    8
#> [5774,]    1   10   11   12    2    6    8    9    3    4    5    7
#> [5775,]    1   10   11   12    2    7    8    9    3    4    5    6Just as in {combo|permute}General, we can utilize the
arguments lower, upper, Parallel,
and nThreads.
comboGroupsCount(30, 6)
#> Big Integer ('bigz') :
#> [1] 123378675083039376
system.time(a1 <- comboGroups(30, numGroups = 6,
                              lower = "123378675000000000",
                              upper = "123378675005000000"))
#>    user  system elapsed 
#>   0.115   0.034   0.149
## Use specific number of threads
system.time(a2 <- comboGroups(30, numGroups = 6,
                              lower = "123378675000000000",
                              upper = "123378675005000000", nThreads = 4))
#>    user  system elapsed 
#>   0.125   0.044   0.044
## Use n - 1 number of threads (in this case, there are 7)
system.time(a3 <- comboGroups(30, numGroups = 6,
                              lower = "123378675000000000",
                              upper = "123378675005000000", Parallel = TRUE))
#>    user  system elapsed 
#>   0.179   0.071   0.043
identical(a1, a2)
#> [1] TRUE
identical(a1, a3)
#> [1] TRUEAs of 2.8.+ we can generate partitions of groups of
varying sizes. For example, say we want to generate all partitions of
the vector v = 1:15 into 2 groups of 3, 1 groups of 4, and
1 group of 5:
system.time(a4 <- comboGroups(15, grpSizes = c(3, 3, 4, 5)))
#>    user  system elapsed 
#>   0.119   0.020   0.138
ht(a4)
#> head -->
#>      Grp1 Grp1 Grp1 Grp2 Grp2 Grp2 Grp3 Grp3 Grp3 Grp3 Grp4 Grp4 Grp4 Grp4 Grp4
#> [1,]    1    2    3    4    5    6    7    8    9   10   11   12   13   14   15
#> [2,]    1    2    3    4    5    6    7    8    9   11   10   12   13   14   15
#> [3,]    1    2    3    4    5    6    7    8    9   12   10   11   13   14   15
#> [4,]    1    2    3    4    5    6    7    8    9   13   10   11   12   14   15
#> [5,]    1    2    3    4    5    6    7    8    9   14   10   11   12   13   15
#> --------
#> tail -->
#>            Grp1 Grp1 Grp1 Grp2 Grp2 Grp2 Grp3 Grp3 Grp3 Grp3 Grp4 Grp4 Grp4
#> [6306296,]   10   14   15   11   12   13    5    6    7    8    1    2    3
#> [6306297,]   10   14   15   11   12   13    5    6    7    9    1    2    3
#> [6306298,]   10   14   15   11   12   13    5    6    8    9    1    2    3
#> [6306299,]   10   14   15   11   12   13    5    7    8    9    1    2    3
#> [6306300,]   10   14   15   11   12   13    6    7    8    9    1    2    3
#>            Grp4 Grp4
#> [6306296,]    4    9
#> [6306297,]    4    8
#> [6306298,]    4    7
#> [6306299,]    4    6
#> [6306300,]    4    5
dim(a4)
#> [1] 6306300      15All of the flexibility offered with groups of equal size are present with groups of varying sizes as well. For example, we can generate groups of varying sizes in parallel:
system.time(a5 <- comboGroups(15, grpSizes = c(3, 3, 4, 5), nThreads = 4))
#>    user  system elapsed 
#>   0.128   0.030   0.040
identical(a4, a5)
#> [1] TRUEThere is one additional argument (i.e. retType) not
present in the other two general functions that allows the user to
specify the type of object returned. The user can select between
"matrix" (the default) and "3Darray". This
structure has a natural connection to 3D space when the size of each
group is uniform. We have a particular result (1st
dimension) broken down into groups (2nd dimension)
of a certain size (3rd dimension).
my3D <- comboGroups(factor(month.abb), 4, retType = "3Darray")
my3D[1, , ]
#>      Grp1 Grp2 Grp3 Grp4
#> [1,] Jan  Apr  Jul  Oct 
#> [2,] Feb  May  Aug  Nov 
#> [3,] Mar  Jun  Sep  Dec 
#> Levels: Apr Aug Dec Feb Jan Jul Jun Mar May Nov Oct Sep
comboGroupsCount(12, 4)
#> [1] 15400
my3D[15400, , ]
#>      Grp1 Grp2 Grp3 Grp4
#> [1,] Jan  Feb  Mar  Apr 
#> [2,] Nov  Sep  Jul  May 
#> [3,] Dec  Oct  Aug  Jun 
#> Levels: Apr Aug Dec Feb Jan Jul Jun Mar May Nov Oct Sep