The TDAvec package provides implementations of several
vector summaries of persistence diagrams studied in Topological Data
Analysis (TDA). Each is obtained by discretizing the associated summary
function computed from a persistence diagram. The summary functions
included in this package are
For improved computational efficiency, all code behind the vector
summaries is written in C++ using the Rcpp
package. Whenever applicable, when compare our code with existing
implementations in terms of accuracy and run-time cost. In this
vignette, we illustrate the basic usage of the TDAvec
package using simple examples.
Let’s first load the required libraries.
library(TDAvec)
library(TDA) # to compute persistence diagrams
#> Warning: package 'TDA' was built under R version 4.0.5
library(microbenchmark) # to compare computational costsNow, we generate a 2D point cloud of size 100 sampled uniformly from a unit circle with added Gaussian noise:
N <- 100 # point cloud size
set.seed(123)
X <- circleUnif(N) + rnorm(2*N,mean = 0,sd = 0.2)
# plot the point cloud
plot(X,pch = 20,asp = 1)Next, we use the TDA package to compute the persistence
diagram (PD) using the Vietoris-Rips filtration built on top of the
point cloud \(X\).
D <- ripsDiag(X,maxdimension = 1,maxscale = 2)$diagram
sum(D[,1]==0) # number of connected components
#> [1] 100
sum(D[,1]==1) # number of loops
#> [1] 13
sum(D[,1]==2) # number of voids
#> [1] 0In the ripsDiag() function, maxdimension is
the maximum homological dimension of the topological features to be
computed (connected components if maxdimension=0, connected components
and loops if 1, connected components, loops and voids if 2, etc.)
maxscale is the maximum value of the scale parameter of the
filtration (which we set equal to 2 since the points are sampled from a
circle with diameter 2).
The persistence diagram \(D\) has
100 connected components (the same as the point cloud size), 13 loops
(one with long life-span, the rest are short-lived) and 0 voids along
with their birth and death values. To plot the
diagram, we can use the plot() function.
plot(D)# the solid dots represent connected components
# the red triangles represent loopsLet’s compute the persistence landscape (PL) and persistence silhouette (PS) summaries from the persistence diagram \(D\).
# sequence of scale values to vectorize the summary function
scaleSeq = seq(0,2,length.out=11) 
# compute the PL and PS summaries for homological dimension H_0
computePL(D,homDim = 0,scaleSeq,k=1)
#>  [1] 0.0 0.2 0.4 0.6 0.8 1.0 0.8 0.6 0.4 0.2 0.0
computePS(D,homDim = 0,scaleSeq,p=1)
#>  [1] 0.00000000 0.04682016 0.04923603 0.07385405 0.09847206 0.12309008
#>  [7] 0.09847206 0.07385405 0.04923603 0.02461802 0.00000000Here, the vectorization is performed by evaluating the respective
summary function at each element of scaleSeq (i.e. \(0.0,0.2,0.4,\ldots,2.0\)) and arranging the
values into a vector.
The parameter k in computePL() is the order
of the persistence landscape function. The p in
computePS() is the power of the weights for the silhouette
function.
To compute the summaries for homological dimension \(H_1\), set the homDim argument
equal to 1:
# compute the PL and PS summaries for homological dimension H_1
computePL(D,homDim = 1,scaleSeq,k=1)
#>  [1] 0.000000000 0.014217630 0.000931983 0.191114536 0.391114536 0.269212150
#>  [7] 0.069212150 0.000000000 0.000000000 0.000000000 0.000000000
computePS(D,homDim = 1,scaleSeq,p=1)
#>  [1] 0.000000e+00 1.114105e-03 5.251128e-05 1.362787e-01 2.788933e-01
#>  [6] 1.919680e-01 4.935333e-02 0.000000e+00 0.000000e+00 0.000000e+00
#> [11] 0.000000e+00The TDA package also provides implementations of the
persistence landscapes and silhouettes. Below we compare the two
implementations in terms of accuracy of results and run-time cost.
pl1 <- computePL(D,homDim = 0,k=1,scaleSeq)
pl2 <- as.vector(landscape(D,dimension = 0,KK = 1, tseq = scaleSeq))
all.equal(pl1,pl2) # -> TRUE (the results are the same)
#> [1] TRUE
compCost <- microbenchmark(
  computePL(D,homDim = 0,k=1,scaleSeq),
  landscape(D,dimension = 0,KK = 1, tseq = scaleSeq),
  times = 500
)
sm <- summary(compCost)
costRatioPL <- sm$mean[2]/sm$mean[1] # ratio of computational time means
print(costRatioPL)
#> [1] 236.5175For homological dimension \(H_0\),
it took TDA::landscape() about 237 times more time than
TDAvec::computePL(). Similarly, we can compare
TDA::silhouette() and TDAvec::computePL():
ps1 <- computePS(D,homDim = 0, p = 1,scaleSeq)
ps2 <- as.vector(silhouette(D,dimension = 0,p = 1, tseq = scaleSeq))
all.equal(ps1,ps2) # -> TRUE (the results are the same)
#> [1] TRUE
compCost <- microbenchmark(
  computePS(D,homDim = 0, p = 1,scaleSeq),
  silhouette(D,dimension = 0,p = 1, tseq = scaleSeq),
  times = 500
)
sm <- summary(compCost)
costRatioPS <- sm$mean[2]/sm$mean[1]
print(costRatioPS)
#> [1] 71.20121To discretize the persistent entropy summary function, Euler characteristic curve and Betti curve, we employ a different vectorization scheme. Rather than evaluating a summary function at increasing scales and arranging the values into a vector, we compute the average values of the summary function between two consecutive scale points using integration. More specifically, if \(f\) is a (univariate) summary function and \(t_1,t_2,\ldots,t_n\) are increasing scale values, we discretize \(f\) as:
\[\begin{equation} \Big(\frac{1}{\Delta t_1}\int_{t_1}^{t_2}f(t)dt,\frac{1}{\Delta t_2}\int_{t_2}^{t_3}f(t)dt,\ldots,\frac{1}{\Delta t_{n-1}}\int_{t_{n-1}}^{t_n}f(t)dt\Big)\in\mathbb{R}^{n-1}, \end{equation}\]
where \(\Delta t_k=t_{k+1}-t_k\), \(k=1,\ldots,n-1\). For the above three summary functions, the computation of integrals can be done analytically and efficiently implemented. For the Betti curve, we use the identity weight function (i.e., \(w(b,d)\equiv 1\)). If the weight function \(w(b,d)=-(l/L)\log_2(l/L)\), where \(l=d-b\) (life-span) and \(L=\sum_{i=1}^n l_i\) (sum of all life-spans), the vector summary of the Betti curve will be the same as that of the persistent entropy summary function.
# Persistent Entropy Summary (PES) function
# compute PES for homological dimension H0
computePES(D,homDim = 0,scaleSeq) 
#>  [1] 4.8268301 1.0173158 0.3720045 0.3720045 0.3720045 0.3720045 0.3720045
#>  [8] 0.3720045 0.3720045 0.3720045
# compute PES for homological dimension H1
computePES(D,homDim = 1,scaleSeq)
#>  [1] 0.05677674 0.26458225 0.33845861 0.34789260 0.34789260 0.34789260
#>  [7] 0.12039198 0.00000000 0.00000000 0.00000000
# Euler Characteristic Curve (ECC) 
computeECC(D,maxhomDim = 1,scaleSeq) # maxhomDim = maximal homological dimension considered
#>  [1] 65.64393340  5.93893485 -0.02801848  0.00000000  0.00000000  0.00000000
#>  [7]  0.65393925  1.00000000  1.00000000  1.00000000
# Vector of Averaged Bettis (VAB) - a vectorization of Betti Curve
# compute VAB for homological dimension H0
computeVAB(D,homDim = 0,scaleSeq) 
#>  [1] 65.928137  7.313179  1.000000  1.000000  1.000000  1.000000  1.000000
#>  [8]  1.000000  1.000000  1.000000
# compute VAB for homological dimension H1
computeVAB(D,homDim = 1,scaleSeq)
#>  [1] 0.2842034 1.3742440 1.0280185 1.0000000 1.0000000 1.0000000 0.3460608
#>  [8] 0.0000000 0.0000000 0.0000000To discretize the persistence surface and persistence block, we first need to switch from the birth-death to the birth-persistence coordinates.
D[,3] <- D[,3] - D[,2] 
colnames(D)[3] <- "Persistence"The resulting vector summaries are called the persistence image (PI) and the vectorized of persistence block (VPB) respectively.
# Persistence Image (PI)
res <- 5 # resolution or grid size
# find min and max persistence values
minPH0 <- min(D[D[,1]==0,3]); maxPH0 <- max(D[D[,1]==0,3])
sigma <- 0.5*(maxPH0-minPH0)/res # default way of selecting the standard deviation sigma of the Gaussians on top of each point of the diagram
# construct one-dimensional grid of scale values
ySeqH0 <- seq(minPH0,maxPH0,length.out=res+1)
# compute PI for homological dimension H_0
computePI(D,homDim=0,xSeq=NA,ySeqH0,res,sigma)
#> [1] 4.56670703 0.96562735 0.01135007 0.02272340 0.47724987
# Vectorized Persistence Block (VPB)
# construct one-dimensional grid of scale values
ySeqH0 <- unique(quantile(D[D[,1]==0,3],probs = seq(0,1,by=0.2))) 
tau <- 0.3 # parameter in [0,1] which controls the size of blocks around each point of the diagram 
# compute VPB for homological dimension H_0
computeVPB(D,homDim = 0,xSeq=NA,ySeqH0,tau) 
#> [1] 0.809001 3.644815 5.486612 6.382816 1.013953Since the \(H_0\) features all have the birth value of zero in this case, a one-dimensional grid of scale values is used for vectorization.
For homological dimension \(H_1\), the birth values are not all the same and therefore the vectorization is performed over a two-dimensional grid. For the VPB summary, since the blocks around each point of the persistence diagram have different sizes, we construct the grid with scale values spread out non-uniformly (i.e. the rectangular grid cells have different dimensions). In experiments, this construction of the grid tends to provide better performance over the grid with equal cell sizes.
# PI
res <- 5 # resolution or grid size
# find min & max birth and persistence values
minBH1 <- min(D[D[,1]==1,2]); maxBH1 <- max(D[D[,1]==1,2])
minPH1 <- min(D[D[,1]==1,3]); maxPH1 <- max(D[D[,1]==1,3])
xSeqH1 <- seq(minBH1,maxBH1,length.out=res+1)
ySeqH1 <- seq(minPH1,maxPH1,length.out=res+1)
sigma <- 0.5*(maxPH1-minPH1)/res
# compute PI for homological dimension H_1
computePI(D,homDim=1,xSeqH1,ySeqH1,res,sigma)
#>  [1] 4.371601e-02 4.955567e-02 4.512378e-02 3.299957e-02 1.864442e-02
#>  [6] 7.527654e-03 7.760847e-03 6.305986e-03 4.250972e-03 2.278547e-03
#> [11] 6.038006e-05 5.841014e-05 4.427399e-05 3.094726e-05 2.021064e-05
#> [16] 1.834675e-04 8.921009e-04 2.714215e-03 5.171823e-03 6.175378e-03
#> [21] 3.853840e-03 1.874023e-02 5.701774e-02 1.086451e-01 1.297270e-01
# VPB
xSeqH1 <- unique(quantile(D[D[,1]==1,2],probs = seq(0,1,by=0.2)))
ySeqH1 <- unique(quantile(D[D[,1]==1,3],probs = seq(0,1,by=0.2)))
tau <- 0.3
# compute VPB for homological dimension H_1
computeVPB(D,homDim = 1,xSeqH1,ySeqH1,tau) 
#>  [1] 0.0000000000 0.0004995598 0.0090337733 0.0305098818 0.0057973687
#>  [6] 0.0095194496 0.0261053157 0.0203673036 0.0042870130 0.0107446149
#> [11] 0.0425768331 0.1053468321 0.1458807241 0.0000000000 0.0000000000
#> [16] 0.0328035481 0.0276785592 0.1089088594 0.1318822254 0.0168304895
#> [21] 0.2939394561 0.3162986126 0.3344510989 0.3567486932 0.3636226941As a note, the code for computePI() is adopted from the
pers.image() function (available in the
kernelTDA package) with minor modifications. For example,
pers.image() uses a one-dimensional grid for homological
dimension \(H_0\) regardless of the
filtration type. In contast, computePI() uses a
one-dimensional grid only if additionally the birth values are the same
(which may not be true for some filtrations such as the lower-star
filtration).
Bubenik, P. (2015). Statistical topological data analysis using persistence landscapes. Journal of Machine Learning Research, 16(1), 77-102.
Chazal, F., Fasy, B. T., Lecci, F., Rinaldo, A., & Wasserman, L. (2014, June). Stochastic convergence of persistence landscapes and silhouettes. In Proceedings of the thirtieth annual symposium on Computational geometry (pp. 474-483).
Atienza, N., Gonzalez-Díaz, R., & Soriano-Trigueros, M. (2020). On the stability of persistent entropy and new summary functions for topological data analysis. Pattern Recognition, 107, 107509.
Richardson, E., & Werman, M. (2014). Efficient classification using the Euler characteristic. Pattern Recognition Letters, 49, 99-106.
Chazal, F., & Michel, B. (2021). An Introduction to Topological Data Analysis: Fundamental and Practical Aspects for Data Scientists. Frontiers in Artificial Intelligence, 108.
Chung, Y. M., & Lawson, A. (2022). Persistence curves: A canonical framework for summarizing persistence diagrams. Advances in Computational Mathematics, 48(1), 1-42.
Adams, H., Emerson, T., Kirby, M., Neville, R., Peterson, C., Shipman, P., … & Ziegelmeier, L. (2017). Persistence images: A stable vector representation of persistent homology. Journal of Machine Learning Research, 18.
Chan, K. C., Islambekov, U., Luchinsky, A., & Sanders, R. (2022). A computationally efficient framework for vector representation of persistence diagrams. Journal of Machine Learning Research, 23, 1-33.