library(scellpam)To cite this package please, use preferently the paper of BMC mentioned at the bibliography below with citation Domingo, Leon, and Dura (2023)
The package scellpam is meant as a way to apply the PAM
algorithm to results of single-cell RNA-Seq data sets.
Differently to other packages, it can deal with a big number of cells (limited in principle only by the amount of available RAM) and make the calculations in parallel.
It uses a special format to store binary matrices on disk, the
jmatrix format. Please, make yourself familiar with the
jmatrix creation and manipulation and with the PAM
application looking first at the other vignettes of this package,
jmatrixsc and parallelpamsc.
WARNING: you must NOT load explicitly neither jmatrix
nor parallelpam. Indeed, you do not need even to install
them. All their functions have been included here, too, so doing
library(scellpam) is enough.
First of all, the package can show quite informative (but sometimes verbose) messages in the console. To turn on/off such messages you can use,
# Initially, state of debug is FALSE. Turn it on with
ScellpamSetDebug(TRUE,debparpam=TRUE,debjmat=TRUE)
#> Debugging for scellpam (biological part) of the package set to ON.
#> Debugging for parallelpam inside scellpam package set to ON.
#> Debugging for jmatrix inside scellpam package set to ON.
# We have also turned on debugging of the parallel PAM algorithm and of the binary matrix creation/manipulation
# but if this annoys you their default value is FALSE for both cases even when the scellpam debug is set to true, so just do
# ScellpamSetDebug(TRUE)As stated before, the limitations in memory suggested us to use
intermediate binary files stored in disk as a way to keep the results of
each step of the process. These files have a concrete internal format;
see the vignette jmatrixsc in this package to know more
about the jmatrix (Domingo
(2023a)) package.
Thus, the first step is to load raw data (sequencing counts) coming
either from a .csv file or from the main formats used in single cell
RNA-Seq data sets and store them as a binary matrix file in
jmatrix format.
The input formats can be a .csv file, a sparse matrix
stored into the S4 object created by the Seurat package (Hao et al. (2021)) or a R NumericMatrix
extracted from any package that uses a single-cell experiment object.
Unfortunately, the internal structure of such objects does not seem to
be the same for all packages so we have not been able to provide a
single function to extract from all of them.
Assuming a .csv input file, the following call reads the
file and writes to disk its binary representation. The
jmatrix format can create full, sparse or symmetric
matrices.
CsvToJMat("countsfile.csv","countsfile.bin")By default, data are stored as a sparse matrix of numbers as floats (4 bytes per item in most architectures). Indeed, default parameters are
CsvToJMat("countsfile.csv","countsfile.bin",mtype="sparse",csep=",",
          ctype="raw",valuetype="float",transpose=FALSE,comment="")where mtype can be sparse or full and
valuetype can be float or double. Changing the
default values for these parameters is strongly discouraged, except when
the sparse matrix is bigger than the full matrix, which happens when
there are few zeros. In such a case use mtype="full". The
names for rows and columns are read from the .csv file and
stored as metadata in the binary file after the raw data of counts.
Other parameters you may (and need to) change are the separation
character for fields in the .csv file (csep) if your fields
are not comma-separated; indeed, you can read tab-separated files (.tsv)
using csep='\t'. Also, you can set the type of
normalization (ctype) which can be raw to write the raw
value of the counts, log1 to write the
log2(counts+1), rawn which is as
row but normalized and log1n which is like
log1 but normalized.
WARNING: normalize ALWAYS NORMALIZES BY COLUMNS (before transposition, if requested to transpose). The logarithm is taken base-2.
Finally, you may want to write the matrix transposed and add any comment of up to 1024 characters as a string, if you want. Such comment will be stored in the metadata, after the row and column names.
In our case, the data are organized with genes by rows and cells by columns (that is why we choose to normalize by colums), which is not what is needed later to calculate the dissimilarity matrix. Therefore, our typical call would be
CsvToJMat("countsfile.csv","countsfile.bin",ctype="log1n",transpose=TRUE,
            comment="Obtained from countsfile.csv")As a result of the calculations the functions of this package
generate binary files, also in jmatrix format. If you want
to inspect them they can be written back as .csv files,
which allows them to be imported as R tables, if your computer has
memory enough to hold them. This is done as
JMatToCsv("countsfile.bin","countsback.csv",csep=",",withquotes=FALSE)Also, you can selectively read some rows and/or columns of them. See
appropriate functions in the vignette called jmatrixsc
attached to this package.
Notice that in this case counstfile.csv and
countsback.csv would not be equal; row and column names
will have been swapped and that is OK, since we wrote the binary file
with transpose=TRUE.
Also, remember that if you use comma as the separator character, you
can just get rid of the csep parameter. The parameter withquotes, if you
decide to use it and you set it to TRUE, writes each row and column name
in the .csv surrounded by double quotes (but not the
numbers).
It is fair to have .csv files since they can be
inspected to check errors but obviously these files can be quite large
and writing them may be quite slow.
If your data are not in a .csv file but stored in a S4
object created by package seurat (Hao et al. (2021)) (and frequently stored for
example in a compressed .rds file) it is likely that this
is an object of the class dgCMatrix, defined in package
Matrix (Bates, Maechler, and Jagan
(2022)). We provide a function, too, to write it to
jmatrix.
q <- readRDS("yourdata.rds")
dgCMatToJMat(q@assays$RNA@counts,"countsfile.bin",transpose=TRUE,
             comment="Obtained from yourdata.rds")Nevertheless, sometimes the dgCMatrix is stored in a different slot of the object like
dgCMatToJMat(q@raw.data,"countsfile.bin",transpose=TRUE,
             comment="Obtained from other data")It is up to you to inspect the original object (with
str(q)) and see what you must do.
Parameters mtype, ctype and
valuetype are available, too, with the same meaning as for
the former function. The .bin file can obviously be dumped
to .csv, as explained before.
Also, in the case of Seurat, each cell belongs to a sample, being the sample a set of cells with some common characteristic like coming from the same living being or from a common stage in a process of cellular change. Cells are then labelled with a factor and this could be obtained with
Gr=GetSeuratGroups(q)where Gr is a numeric vector of integers with as many
components as cells that contains a consecutive numeric identifier of
the group starting at 1. But again this may depend on the internal
structure of that particular Seurat object.
Similarly, if you are using packages that use the sce
(Single-cell experiment) S4 object (like splatter (Zappia, Phipson, and Oshlack (2017)),
DuoClustering (Duò and Soneson
(2021)) and others) you will have to extract the matrix of
counts. Unfortunately, again this does not seem to be uniform since each
package has a slightly different internal structure for the S4 object.
The most we have been able to do is to provide a function which takes
the sparse matrix of counts (as a numeric matrix) and the name of the
generated binary file (as before) and optionally the row and column
names (as StringVectors) if they were not attached to the
matrix itself, followed by the parameters used for the csv/Seurat
formats.
The next examples are not marked for execution since they need the loading of packages splatter/scatter and DuoClustering2018 respectively and a test to see if they are available using installed.packages is forbidden if you want to pass the CRAN tests. It you want to run them, please copy+paste this code in R.
Your code, for example with splatter, would probably
look like
suppressPackageStartupMessages({
   library(splatter)
   library(scater)
})
set.seed(1)
Splattersce <- mockSCE()
SceToJMat(Splattersce@assays@data@listData$counts,"mockSCE.bin",
           mtype="full",ctype="log1n",transpose=TRUE,
           comment="Generated by splatter with mockSCE() and normalized
                    to log2(counts+1).")Nevertheless, in DuoClustering you would have something
like
suppressPackageStartupMessages({
   library(DuoClustering2018)
})
KuTCCsce <- sce_filteredExpr10_KumarTCC()
SceToJMat(KuTCCsce@assays$data@listData$counts,"KuTCC.bin",
          mtype="full",ctype="log1n",transpose=TRUE,
          comment="Generated by DuoCLustering with Kumar data and
                normalized to log2(counts+1).")Please, notice the subtle difference between
Splattersce@assays@data@listData$counts and
KuTCCsce@assays$data@listData$counts
We have noticed that in simulations with splatter, and
depending on the parameters, the number of zeros can be not too high and
therefore storing the binary matrix as sparse matrix makes its size
bigger, instead of smaller. In such a case you can store it as a full
matrix passing the parameter mtype="full" to
SceToJMat.
Parameters ctype and valuetype are
available, too, with the same meaning as for the former functions.
Other useful function in this step is JMatInfo, which
prints in the console or in a text file information about a stored
binary file. As an example, let us load and write a .csv
file and then let us see the information of the generated binary
file.
tmpfile=paste0(tempdir(),"/Trapnell.bin")
CsvToJMat("extdata/conquer_GSE52529_Trapnell_sample.csv",tmpfile,
          transpose=TRUE,comment="Experiment conquer GSE52529-GPL16791")
#> 288 columns of values (not including the column of names) in file extdata/conquer_GSE52529_Trapnell_sample.csv.
#> 5218 lines (excluding header) in file extdata/conquer_GSE52529_Trapnell_sample.csv
#> Data will be read from each line and stored as float values.
#> Reading line... 0 1000 2000 3000 4000 5000 
#> Read 5218 data lines of file extdata/conquer_GSE52529_Trapnell_sample.csv, as expected.
#> Transposing matrix of (5218x288) to a matrix of (288x5218)
#> Writing binary matrix /tmp/RtmpG9IRdw/Trapnell.bin of (288x5218)
#> End of block of binary data at offset 3030456
#>    Writing row names (288 strings written, from GSM1268960 to GSM1269247).
#>    Writing column names (5218 strings written, from ENSG00000164675.10 to ENSG00000093100.13).
#>    Writing comment: Experiment conquer GSE52529-GPL16791
JMatInfo(tmpfile)
#> File:               /tmp/RtmpG9IRdw/Trapnell.bin
#> Matrix type:        SparseMatrix
#> Number of elements: 1502784
#> Data type:          float
#> Endianness:         little endian (same as this machine)
#> Number of rows:     288
#> Number of columns:  5218
#> Metadata:           Stored names of rows and columns.
#> Metadata comment:  "Experiment conquer GSE52529-GPL16791"
#> Binary data size:   3030328 bytes, which is 50.4119% of the full matrix size (which would be 6011136 bytes).If you want to keep the information about the binary file in a text file, just do
tmpfile=paste0(tempdir(),"/Trapnell.bin")
tmptxtfile=paste0(tempdir(),"/TrapnellInfo.txt")
JMatInfo(tmpfile,tmptxtfile)The binary files can be expurged getting rid of some genes or cells.
The function provided for this task is FilterBinByName,
which takes as inputs a binary file and a R list of R-strings with the
names of the genes or cells that must remain (first and second
parameters). The third one is the name of the binary filtered file,
which will have the same nature (full or sparse matrix) and data type of
the matrix in the original file. The fourth parameter,
namesat, indicates if whatever you want to filter (genes or
cells) are in columns (“cols”) or in rows (“rows”). As an example, let
us filter some 1000 of the 65218 genes contained in file
conquer_GSE52529-GPL16791_Trapnell.csv and whose names we
have put in the list stored in variable v by reading the
file Trapnell_remain_genes.csv.
v<-as.vector(read.table("extdata/Trapnell_remain_genes.csv")[[1]])
tmpfile1=paste0(tempdir(),"/Trapnell.bin")
tmpfiltfile1=paste0(tempdir(),"/Trapnell_filtered.bin")
FilterJMatByName(tmpfile1,v,tmpfiltfile1,namesat="cols")
#> 1000 columns of the 5218 in the original matrix will be kept.
#> Writing binary matrix /tmp/RtmpG9IRdw/Trapnell_filtered.bin of (288x1000)
#> End of block of binary data at offset 1276144
#>    Writing row names (288 strings written, from GSM1268960 to GSM1269247).
#>    Writing column names (1000 strings written, from ENSG00000067113.16 to ENSG00000093100.13).
#>    Writing comment: Experiment conquer GSE52529-GPL16791If the list contains names not present in the original file the program emits a warning (and keeps only whose which are present).
This is the most computationally intensive part of the process
(particularly, for samples with a high number of cells) and therefore
has been programmed in parallel, taking advantage of the multiple cores
of the machine, if available. See vignette parallelpamsc of
this package to know more about parallelpam (Domingo (2023b)). From now on, just the
prospective call to comment on the parameters.
tmpfile1=paste0(tempdir(),"/Trapnell.bin")
tmppeafile1=paste0(tempdir(),"/TrapnellPearson.bin")
CalcAndWriteDissimilarityMatrix(tmpfile1,tmppeafile1,distype="Pearson",nthreads=0,
                comment="Pearson dissimilarities for cell in experiment conquer GSE52529-GPL16791")The input and output files (first and second parameters) are of course compulsory. Input file can be a sparse of full binary matrix (but obviously, not a symmetric matrix).
WARNING: notice that the vectors to calculate dissimilarities amongst them MUST be stored BY ROWS. This is due to efficiency reasons and it is the reason by which we transposed the counts matrix before writing it as binary.
Output dissimilarity matrix will always be a binary symmetric
(obviously square) matrix with a number of rows (and columns) equal to
the number of rows (in this case, cells) of the counts file. The type of
distance/dissimilarity can be L1 (Manhattan distance),
L2 (Euclidean distance) or Pearson (Pearson
dissimilarity coefficient). The resulting matrix stores only names for
the rows, which are the names of the cells stored as rows in file
Trapnell.bin. If the number of cells is \(N\), only \(N(N+1)/2\) dissimilarity values are really
stored.
A note on the number of threads, valid also for other algorithms that will be explained later. Possible values for the number of threads is:
-1 (or any negative number) to indicate you do not
want to use threads (strictly sequential computation).
0 to allow the program to choose the number of
threads according to the problem size and the number of available
cores.
Any positive number to force the use of such number of threads.
Choosing explicitly a number of threads bigger than the number of
available
cores is allowed, but discouraged and the program emits a warning about
it.
With respect to option 0 (the program chooses), for
small problems (in this case, less than 1000 cells) the function makes
the choice of not using threads, since the overhead of opening and
waiting termination is not worth. For bigger problems the number of
chosen threads is the number of available cores. Nevertheless, this
choice may not be the best, depending on your machine.
Now, let us really do the call with this small dataset.
tmpfile1=paste0(tempdir(),"/Trapnell.bin")
tmppeafile1=paste0(tempdir(),"/TrapnellPearson.bin")
CalcAndWriteDissimilarityMatrix(tmpfile1,tmppeafile1,nthreads=-1,
                                comment="Pearson dissimilarities for cell
                                         in experiment conquer
                                         GSE52529-GPL16791")
#> Input matrix is a sparse matrix  with elements of type 'float' and size (288,5218)
#> Read sparse matrix from file /tmp/RtmpG9IRdw/Trapnell.bin.
#>    Its size is [288 x 5218] and it uses 378647 elements, half of 4 bytes and half of 4 bytes each, with accounts for 2.88995 MBytes.
#> Creating dissimilarity matrix of size (288x288)
#> Package memuse is installed. OK.
#>   Memory used by the matrix: 162 KiB, which is 0% of the available memory, which is 257995880 Kib.
#>   That seems OK.
#> End of dissimilarity matrix calculation (serial version). Elapsed time: 0.748278 s
#> Writing binary matrix /tmp/RtmpG9IRdw/TrapnellPearson.bin
#> End of block of binary data at offset 166592
#>    Writing row names (288 strings written, from GSM1268960 to GSM1269247).
#>    Writing comment: Pearson dissimilarities for cell
#>                                          in experiment conquer
#>                                          GSE52529-GPL16791
#> Output binary file /tmp/RtmpG9IRdw/TrapnellPearson.bin written.WARNING: the normal way of calling
CalcAndWriteDissimilarityMatrix would use nthreads=0 to
make use of all available cores in your machine. Nevertheless, this does
not seem to be allowed by CRAN to pass the test so I have had to use the
serial version invoked with nthreads=-1. In your normal use of code try
always nthreads=0.
The resulting matrix is stored as a binary symmetric matrix of float values, as we can check.
JMatInfo(tmppeafile1)
#> File:               /tmp/RtmpG9IRdw/TrapnellPearson.bin
#> Matrix type:        SymmetricMatrix
#> Number of elements: 82944 (41616 really stored)
#> Data type:          float
#> Endianness:         little endian (same as this machine)
#> Number of rows:     288
#> Number of columns:  288
#> Metadata:           Stored only names of rows.
#> Metadata comment:  "Pearson dissimilarities for cell
#>                                          in experiment conquer
#>                                          GSE52529-GPL16791"The last step is to take the previously calculated matrix and apply
the Partitioning Around Medoids classifier. Function is
ApplyPAM. Default parameters are:
tmppeafile1=paste0(tempdir(),"/TrapnellPearson.bin")
L=ApplyPAM(tmppeafile1,k=10,init_method="BUILD",
           max_iter=1000,nthreads=-1)
#> Reading symmetric distance/dissimilarity matrix /tmp/RtmpG9IRdw/TrapnellPearson.bin
#> Package memuse is installed. OK.
#>   Memory used by the matrix: 162 KiB, which is 0% of the available memory, which is 257995776 Kib.
#>   That seems OK.
#> Read symmetric matrix with size (288,288)
#>   Matrix is a correct distance/dissimilarity matrix.
#> Calculating with a single thread, since you have only 288 vectors and the overhead of using threads would be excessive.
#> Starting BUILD initialization method, serial version
#> WARNING: all successive messages use R-numbering (from 1) for points and medoids. Substract 1 to get the internal C-numbers.
#> Looking for medoid 1. Medoid 1 found. Point 50. TD=11914.134766
#> Looking for medoid 2. Medoid 2 found. Point 28. 8 reassigned points. TD=10251.591797
#> Looking for medoid 3. Medoid 3 found. Point 284. 79 reassigned points. TD=9409.731445
#> Looking for medoid 4. Medoid 4 found. Point 107. 0 reassigned points. TD=8996.875977
#> Looking for medoid 5. Medoid 5 found. Point 268. 21 reassigned points. TD=8661.666016
#> Looking for medoid 6. Medoid 6 found. Point 118. 0 reassigned points. TD=8333.292969
#> Looking for medoid 7. Medoid 7 found. Point 17. 108 reassigned points. TD=8016.939453
#> Looking for medoid 8. Medoid 8 found. Point 154. 0 reassigned points. TD=7747.802734
#> Looking for medoid 9. Medoid 9 found. Point 126. 0 reassigned points. TD=7500.490234
#> Looking for medoid 10. Medoid 10 found. Point 173. 5 reassigned points. TD=7287.086914
#> Current TD: 7287.086914
#> BUILD initialization method (serial version) finished. Elapsed time: 0.00186713 s
#> Starting improved FastPAM1 method in serial implementation...
#> WARNING: all successive messages use R-numbering (from 1) for points and medoids. Substract 1 to get the internal C-numbers.
#> Iteration 0. Medoid at place 1 (point 51) swapped with point 209; TD-change=-103.419434; TD=7183.665039. 60 reassigned points.
#> Iteration 1. Medoid at place 7 (point 17) swapped with point 39; TD-change=-43.391033; TD=7140.273926. 11 reassigned points.
#> Iteration 2.    Exiting, since DeltaTDst is 0.000000. Final value of TD is 7140.273926
#> Optimization method (serial version) finished. Elapsed time: 0.000887453 s
#> Time summary  (serial implementation).
#>    Initalization: 0.00186713 s (method BUILD).
#>    Optimization:  0.000887453 s in 1 iterations (0.000887453 seconds/iteration).
#>    Total time:    0.00275458 s (0 minutes, 0.00275458 seconds).WARNING: the normal way of calling ApplyPAM would use
nthreads=0 to make use of all available cores in your machine.
Nevertheless, this does not seem to be allowed by CRAN to pass the test
so I have had to use the serial version invoked with nthreads=-1. In
your normal use of code try always nthreads=0.
The dissimilarity matrix, as formerly calculated, is compulsory, as
long as the number of medoids, k.
Parameters init_method (and one optional parameter,
initial_med) deserve special comment. The first is the
method to initialize the medoids. Its possible values are
BUILD, LAB and PREV. The rest of
the algorithm make medoid swapping between the points of the initial set
made with BUILD or LAB and the rest of points
until no swap can reduce the objective function, which is the sum of
distances of each point to its closest medoid. But this may fall (and
indeed falls) in local minima. If you initialize with BUILD
or LAB the optional parameter initial_med
cannot be used.
The initialization methods BUILD and LAB
are described in the paper by Schubert at al. (Schubert and Rousseeuw (2019)).
BUILD is deterministic. LAB uses a sample of
the total points to initialize. Obviously, you can run LAB
to get different initializations and compare the results.
The returned object is a list with two fields: med and
clasif. This will be explained later.
From now on, typical calls to obtain only the initial medoids would be
tmppeafile1=paste0(tempdir(),"/TrapnellPearson.bin")
Lbuild=ApplyPAM(tmppeafile1,k=10,init_method="BUILD",max_iter=0)
#> Reading symmetric distance/dissimilarity matrix /tmp/RtmpG9IRdw/TrapnellPearson.bin
#> Package memuse is installed. OK.
#>   Memory used by the matrix: 162 KiB, which is 0% of the available memory, which is 257995776 Kib.
#>   That seems OK.
#> Read symmetric matrix with size (288,288)
#>   Matrix is a correct distance/dissimilarity matrix.
#> Calculating with a single thread, since you have only 288 vectors and the overhead of using threads would be excessive.
#> Starting BUILD initialization method, serial version
#> WARNING: all successive messages use R-numbering (from 1) for points and medoids. Substract 1 to get the internal C-numbers.
#> Looking for medoid 1. Medoid 1 found. Point 50. TD=11914.134766
#> Looking for medoid 2. Medoid 2 found. Point 28. 8 reassigned points. TD=10251.591797
#> Looking for medoid 3. Medoid 3 found. Point 284. 79 reassigned points. TD=9409.731445
#> Looking for medoid 4. Medoid 4 found. Point 107. 0 reassigned points. TD=8996.875977
#> Looking for medoid 5. Medoid 5 found. Point 268. 21 reassigned points. TD=8661.666016
#> Looking for medoid 6. Medoid 6 found. Point 118. 0 reassigned points. TD=8333.292969
#> Looking for medoid 7. Medoid 7 found. Point 17. 108 reassigned points. TD=8016.939453
#> Looking for medoid 8. Medoid 8 found. Point 154. 0 reassigned points. TD=7747.802734
#> Looking for medoid 9. Medoid 9 found. Point 126. 0 reassigned points. TD=7500.490234
#> Looking for medoid 10. Medoid 10 found. Point 173. 5 reassigned points. TD=7287.086914
#> Current TD: 7287.086914
#> BUILD initialization method (serial version) finished. Elapsed time: 0.00184884 s
#> Time summary  (serial implementation).
#>    Initalization: 0.00184884 s (method BUILD).
#>    Optimization:  0 s in 0 iterations.
#>    Total time:    0.00184884 s (0 minutes, 0.00184884 seconds).
Llab1=ApplyPAM(tmppeafile1,k=10,init_method="LAB",max_iter=0)
#> Reading symmetric distance/dissimilarity matrix /tmp/RtmpG9IRdw/TrapnellPearson.bin
#> Package memuse is installed. OK.
#>   Memory used by the matrix: 162 KiB, which is 0% of the available memory, which is 257995776 Kib.
#>   That seems OK.
#> Read symmetric matrix with size (288,288)
#>   Matrix is a correct distance/dissimilarity matrix.
#> Calculating with a single thread, since you have only 288 vectors and the overhead of using threads would be excessive.
#> Starting LAB initialization method, serial version.
#> WARNING: all successive messages use R-numbering (from 1) for points and medoids. Substract 1 to get the internal C-numbers.
#> Looking for medoid 1. Medoid 1 found. Point 50. TD=11914.134766
#> Looking for medoid 2. Medoid 2 found. Point 200. 91 reassigned points. TD=11372.759766
#> Looking for medoid 3. Medoid 3 found. Point 28. 9 reassigned points. TD=9710.217773
#> Looking for medoid 4. Medoid 4 found. Point 209. 76 reassigned points. TD=9401.531250
#> Looking for medoid 5. Medoid 5 found. Point 4. 81 reassigned points. TD=9121.417969
#> Looking for medoid 6. Medoid 6 found. Point 91. 18 reassigned points. TD=8947.547852
#> Looking for medoid 7. Medoid 7 found. Point 282. 36 reassigned points. TD=8587.501953
#> Looking for medoid 8. Medoid 8 found. Point 84. 36 reassigned points. TD=8515.791016
#> Looking for medoid 9. Medoid 9 found. Point 143. 20 reassigned points. TD=8422.551758
#> Looking for medoid 10. Medoid 10 found. Point 284. 36 reassigned points. TD=8287.095703
#> Current TD: 8287.095703
#> LAB initialization method (serial version) finished. Elapsed time: 0.000222514 s
#> Time summary  (serial implementation).
#>    Initalization: 0.000222514 s (method LAB).
#>    Optimization:  0 s in 0 iterations.
#>    Total time:    0.000222514 s (0 minutes, 0.000222514 seconds).
Llab2=ApplyPAM(tmppeafile1,k=10,init_method="LAB",max_iter=0)
#> Reading symmetric distance/dissimilarity matrix /tmp/RtmpG9IRdw/TrapnellPearson.bin
#> Package memuse is installed. OK.
#>   Memory used by the matrix: 162 KiB, which is 0% of the available memory, which is 257995776 Kib.
#>   That seems OK.
#> Read symmetric matrix with size (288,288)
#>   Matrix is a correct distance/dissimilarity matrix.
#> Calculating with a single thread, since you have only 288 vectors and the overhead of using threads would be excessive.
#> Starting LAB initialization method, serial version.
#> WARNING: all successive messages use R-numbering (from 1) for points and medoids. Substract 1 to get the internal C-numbers.
#> Looking for medoid 1. Medoid 1 found. Point 13. TD=12031.589844
#> Looking for medoid 2. Medoid 2 found. Point 40. 9 reassigned points. TD=10483.274414
#> Looking for medoid 3. Medoid 3 found. Point 132. 68 reassigned points. TD=9646.358398
#> Looking for medoid 4. Medoid 4 found. Point 73. 104 reassigned points. TD=9317.611328
#> Looking for medoid 5. Medoid 5 found. Point 133. 4 reassigned points. TD=9119.197266
#> Looking for medoid 6. Medoid 6 found. Point 130. 74 reassigned points. TD=8836.474609
#> Looking for medoid 7. Medoid 7 found. Point 268. 22 reassigned points. TD=8604.005859
#> Looking for medoid 8. Medoid 8 found. Point 224. 8 reassigned points. TD=8513.661133
#> Looking for medoid 9. Medoid 9 found. Point 70. 5 reassigned points. TD=8259.406250
#> Looking for medoid 10. Medoid 10 found. Point 140. 34 reassigned points. TD=8153.130371
#> Current TD: 8153.130371
#> LAB initialization method (serial version) finished. Elapsed time: 0.000218896 s
#> Time summary  (serial implementation).
#>    Initalization: 0.000218896 s (method LAB).
#>    Optimization:  0 s in 0 iterations.
#>    Total time:    0.000218896 s (0 minutes, 0.000218896 seconds).WARNING: the normal way of calling ApplyPAM would use
nthreads=0 to make use of all available cores in your machine.
Nevertheless, this does not seem to be allowed by CRAN to pass the test
so I have had to use the serial version invoked with nthreads=-1. In
your normal use of code try always nthreads=0. For the LAB method this
does not matter, since parallel implementation is not yet provided.
As you can see, to get and compare different initializations you must
set the parameter max_iter to value 0. In this
case no iterations of objective function reduction are performed, and
the returned object contains the initial medoids and the classification
induced by them. Notice that even looking equal, the results of the
latter two calls are different since LAB initializes with a
random component (the sample to choose initial medoids is chosen
randomly).
You can check that the medoids, stored in Llab1$med and
Llab2$med (see more on this below) are in general,
different.
Now, these results can be used to initialize PAM if you find that any
of them contains a specially good set of medoids. This is the role of
method PREV that we have mentioned before. A typical call
would be
tmppeafile1=paste0(tempdir(),"/TrapnellPearson.bin")
Llab2Final=ApplyPAM(tmppeafile1,k=10,init_method="PREV",
                    initial_med=Llab2$med,nthreads=-1)
#> Reading symmetric distance/dissimilarity matrix /tmp/RtmpG9IRdw/TrapnellPearson.bin
#> Package memuse is installed. OK.
#>   Memory used by the matrix: 162 KiB, which is 0% of the available memory, which is 257995776 Kib.
#>   That seems OK.
#> Read symmetric matrix with size (288,288)
#>   Matrix is a correct distance/dissimilarity matrix.
#> Calculating with a single thread, since you have only 288 vectors and the overhead of using threads would be excessive.
#> Starting improved FastPAM1 method in serial implementation...
#> WARNING: all successive messages use R-numbering (from 1) for points and medoids. Substract 1 to get the internal C-numbers.
#> Iteration 0. Medoid at place 1 (point 14) swapped with point 107; TD-change=-339.236053; TD=7813.903809. 25 reassigned points.
#> Iteration 1. Medoid at place 8 (point 224) swapped with point 118; TD-change=-238.033478; TD=7575.869629. 9 reassigned points.
#> Iteration 2. Medoid at place 10 (point 140) swapped with point 154; TD-change=-109.771622; TD=7466.098145. 43 reassigned points.
#> Iteration 3. Medoid at place 3 (point 132) swapped with point 284; TD-change=-90.818726; TD=7375.279297. 13 reassigned points.
#> Iteration 4. Medoid at place 2 (point 40) swapped with point 126; TD-change=-76.211250; TD=7299.068359. 5 reassigned points.
#> Iteration 5. Medoid at place 6 (point 130) swapped with point 209; TD-change=-70.004341; TD=7229.064453. 10 reassigned points.
#> Iteration 6. Medoid at place 9 (point 70) swapped with point 28; TD-change=-43.613960; TD=7185.450195. 0 reassigned points.
#> Optimization method (serial version) finished. Elapsed time: 0.00209995 s
#> Time summary  (serial implementation).
#>    Initalization: 0 s (method PREV).
#>    Optimization:  0.00209995 s in 6 iterations (0.000349992 seconds/iteration).
#>    Total time:    0.00209995 s (0 minutes, 0.00209995 seconds).where the initial set of medoids is taken from the object returned by the former calls.
With respect to that object, as we said it is a list with two
vectors. The first one, L$med, has as many components as
requested medoids and the second, L$clasif, has as many
components as instances (here, cells).
Medoids are expressed in L$med by its number in the
array of points (row in the dissimilarity matrix) starting at 1 (R
convention).
L$clasif contains the number of the medoid (i.e.: the
cluster) to which each instance has been assigned, according to their
order in L$med (also from 1).
This means that if L$clasif[p] is m, the
point p belongs to the class grouped around medoid
L$med[m]. Let us see it:
# Which are the indexes of the points chosen as medoids?
L$med
#> GSM1269168 GSM1268987 GSM1269243 GSM1269066 GSM1269227 GSM1269077 GSM1268998 
#>        209         28        284        107        268        118         39 
#> GSM1269113 GSM1269085 GSM1269132 
#>        154        126        173
#
# In which class has point 147 been classified?
L$clasif[147]
#> GSM1269106 
#>          7
#
# And which is the index (row in the dissimilarity matrix) of the medoid closest to point 147?
L$med[L$clasif[147]]
#> GSM1268998 
#>         39In this way, values in L$clasif are between 1 and the number of medoids, as we can see.
min(L$clasif)
#> [1] 1
max(L$clasif)
#> [1] 10Therefore, they can be used as factors.
To help in the data analysis a data frame can be built, too, with the names of the cells instead of their numbers. It is created as
tmppeafile1=paste0(tempdir(),"/TrapnellPearson.bin")
F <- ClassifAsDataFrame(L,tmppeafile1)
#> Read symmetric matrix with size (288,288)The second parameter is the dissimilarity matrix created before when
we called CalcAndWriteDissimilarityMatrix. The reason of
using it here is because the returned dataframe has three columns:
CellName with name of each cell, NNCellName
with the name of the cell which is the center of the cluster to which
CellName belongs to and NNDistance which is
the distance between the cells CellName and
NNCellName.
This data frame implicitly knows which cells are medoids: those whose distance to the closest medoid (themselves) is 0. They can be extracted as
# Extract column 1 (cell name) of those rows for which distance to
# closest medoid (column 3) is 0
F[which(F[,3]==0),1]
#>  [1] "GSM1268987" "GSM1268998" "GSM1269066" "GSM1269077" "GSM1269085"
#>  [6] "GSM1269113" "GSM1269132" "GSM1269168" "GSM1269227" "GSM1269243"Finally, an abundance matrix can be obtained if the cells came from
different groups (samples) by combining such groups with the obtained
clusters. This is done by function
BuildAbundanceMatrix,
M=BuildAbundanceMatrix(L$clasif,Gr)where the Gr vector should have been obtained before
(either from the Seurat object with GetSeuratGroups, or by
any other means from your input object, if possible). Since the data of
this example does not come from Seurat, this is not calculated now, but
in a real case the resulting object is a matrix with as many rows as
clusters and as many columns as groups. Each entry has the number of
cells of its sample which have been classified at that group. This
should be analyzed to determine if being in a different group has
sufficient influence on the cluster to which cells belong and therefore
if these clusters have any real biological significance.
It is interesting to filter cells based on the degree in which they belongs to a cluster. Indeed, cluster refinement can be done getting rid of cells far away from any cluster center, or which are at a similar distance of two or more of them.
This is characterized by the silhouette of each point (cell). Three
functions deal with this: CalculateSilhouette,
FilterBySilhouetteQuantile and
FilterBySilhouetteThreshold.
tmppeafile1=paste0(tempdir(),"/TrapnellPearson.bin")
S=CalculateSilhouette(Llab2$clasif,tmppeafile1,nthreads=-1)
#> Package memuse is installed. OK.
#>   Memory used by the matrix: 162 KiB, which is 0% of the available memory, which is 257995776 Kib.
#>   That seems OK.
#>    Calculating silhouette (serial implementation)...
#> Read symmetric matrix with size (288,288)
#> 288 points classified in 10 classes.
#> Finished serial implementation of silhouette (including dissimilarity matrix load). Elapsed time: 0.000253905 sWARNING: the normal way of calling CalculateSilhouette
would use nthreads=0 to make use of all available cores in your machine.
Nevertheless, this does not seem to be allowed by CRAN to pass the test
so I have had to use the serial version invoked with nthreads=-1. In
your normal use of code try always nthreads=0.
The parameters to function CalculateSilhouette are the
array of class membership, as returned by ApplyPAM in its
clasif field, and the file with the matrix of
dissimilarities.
A parallel implementation has been programmed, being nthreads as explained before.
Silhouette is a number in \([-1,1]\); the higher its value, the most centered a point is in its cluster.
The returned object S is a numeric vector with the value
of the silhouette for each point, which will be named if the
classification vector was named.
This vector can be converted to object of the class
cluster:silhouette with the function
NumSilToClusterSil (which needs the vector of
classifications, too). This is done so that, if you load the package
cluster, plot will generate the kind of silhouette plots included in
such package.
Sclus <- NumSilToClusterSil(Llab2$clasif,S)
library(cluster)
plot(Sclus)Once the silhouette is calculated we can filter it by quantile or by
threshold. All points under this quantile or threshold will be
discarded, except if they are medoids. Parameters are the silhouette, as
returned by CalculateSilhouette, the list of
medoids/clasif, as returned by ApplyPAM, the file with
matrix of counts for the whole set of cells, the file that will contain
the matrix of counts of the remaining cells, the file with dissimilarity
matrix for the whole set, the file that will contain the dissimilarity
for the remaining cells and (depending on the function used) the
quantile in \([0,1]\) or the silhouette
threshold in \([-1,1]\). As an
example:
tmpfile1=paste0(tempdir(),"/Trapnell.bin")
tmpfiltfile1=paste0(tempdir(),"/TrapnellFilt.bin")
tmppeafile1=paste0(tempdir(),"/TrapnellPearson.bin")
tmppeafiltfile1=paste0(tempdir(),"/TrapnellPearsonFilt.bin")
Lfilt=FilterBySilhouetteQuantile(S,Llab2,tmpfile1,tmpfiltfile1,
                                 tmppeafile1,tmppeafiltfile1,0.2)
#> After filtering silhouette with quantile 0.2 (threshold -0.10666) 232 of the 288 points remain.
#> Warning in FilterBySilhouetteQuantile(S, Llab2, tmpfile1, tmpfiltfile1, : 2 of
#> the medoids have been kept, even they were below the threshold (which seems
#> problematic. Check your clusters...).
#> 2 of the medoids have been kept, even they were below the threshold (which seems problematic. Check your clusters...).
#> Writing binary matrix /tmp/RtmpG9IRdw/TrapnellFilt.bin of (232x5218)
#> End of block of binary data at offset 2404456
#>    Writing row names (232 strings written, from GSM1268961 to GSM1269245).
#>    Writing column names (5218 strings written, from ENSG00000164675.10 to ENSG00000093100.13).
#>    Writing comment: Experiment conquer GSE52529-GPL16791 Filtered by silhouette from file /tmp/RtmpG9IRdw/Trapnell.bin with quantile 0.2. 
#> Read symmetric matrix with size (288,288)
#> Writing binary matrix /tmp/RtmpG9IRdw/TrapnellPearsonFilt.bin
#> End of block of binary data at offset 108240
#>    Writing row names (232 strings written, from GSM1268961 to GSM1269245).
#>    Writing comment: Pearson dissimilarities for cell
#>                                          in experiment conquer
#>                                          GSE52529-GPL16791 Filtered by silhouette from file /tmp/RtmpG9IRdw/TrapnellPearson.bin with quantile 0.2.If the original matrix contained row (in this case, cell) and column (in this case, gene) names, the column names are copied and the row names are transported for those rows (cells) that remain. The same happens with respect to rows of the dissimilarity matrix.
Notice that the new dissimilarity matrix could have been calculated
from the matrix of filtered counts with
CalcAndWriteDissimilarityMatrix but creating it here,
simply getting rid of the filtered rows and columns is much faster.
Also, if a medoid is below the silhouette quantile, it will not be filtered out, but a warning message will be shown, since this is a strange situation that may indicate that some of your clusters are not real but artifacts due to a few outliers that are close to each other.
Of course you can use the filtered cells to build a data frame, as long as the dissimilarity matrix contains cell names,
tmppeafiltfile1=paste0(tempdir(),"/TrapnellPearsonFilt.bin")
Ffilt=ClassifAsDataFrame(Lfilt,tmppeafiltfile1)
#> Read symmetric matrix with size (232,232)or to recreate a .csv file of counts of the remaining cells with
their correct names using BinMatToCsv like this.
tmpfiltfile1=paste0(tempdir(),"/TrapnellFilt.bin")
tmpfiltcsvfile1=paste0(tempdir(),"/TrapnellFilt.csv")
JMatToCsv(tmpfiltfile1,tmpfiltcsvfile1)But remember that this was the result of the first step of the PAM algorithm, so probably you will want to make them iterate.
tmppeafiltfile1=paste0(tempdir(),"/TrapnellPearsonFilt.bin")
Lfinal=ApplyPAM(tmppeafiltfile1,k=length(Lfilt$med),init_method="PREV",initial_med=Lfilt$med,nthreads=-1)
#> Reading symmetric distance/dissimilarity matrix /tmp/RtmpG9IRdw/TrapnellPearsonFilt.bin
#> Package memuse is installed. OK.
#>   Memory used by the matrix: 105 KiB, which is 0% of the available memory, which is 257996264 Kib.
#>   That seems OK.
#> Read symmetric matrix with size (232,232)
#>   Matrix is a correct distance/dissimilarity matrix.
#> Calculating with a single thread, since you have only 232 vectors and the overhead of using threads would be excessive.
#> Starting improved FastPAM1 method in serial implementation...
#> WARNING: all successive messages use R-numbering (from 1) for points and medoids. Substract 1 to get the internal C-numbers.
#> Iteration 0. Medoid at place 1 (point 13) swapped with point 89; TD-change=-457.934906; TD=8117.083984. 4 reassigned points.
#> Iteration 1. Medoid at place 5 (point 113) swapped with point 100; TD-change=-300.896240; TD=7816.187988. 2 reassigned points.
#> Iteration 2. Medoid at place 8 (point 183) swapped with point 128; TD-change=-226.102707; TD=7590.084961. 8 reassigned points.
#> Iteration 3. Medoid at place 10 (point 117) swapped with point 107; TD-change=-120.124222; TD=7469.960449. 36 reassigned points.
#> Iteration 4. Medoid at place 6 (point 111) swapped with point 171; TD-change=-80.808510; TD=7389.151855. 3 reassigned points.
#> Iteration 5. Medoid at place 2 (point 34) swapped with point 136; TD-change=-60.378704; TD=7328.772949. 43 reassigned points.
#> Iteration 6. Medoid at place 3 (point 112) swapped with point 165; TD-change=-83.143143; TD=7245.629883. 1 reassigned points.
#> Iteration 7. Medoid at place 9 (point 59) swapped with point 25; TD-change=-54.141468; TD=7191.488281. 0 reassigned points.
#> Optimization method (serial version) finished. Elapsed time: 0.0014881 s
#> Time summary  (serial implementation).
#>    Initalization: 0 s (method PREV).
#>    Optimization:  0.0014881 s in 7 iterations (0.000212586 seconds/iteration).
#>    Total time:    0.0014881 s (0 minutes, 0.0014881 seconds).WARNING: the normal way of calling ApplyPAM would use
nthreads=0 to make use of all available cores in your machine.
Nevertheless, this does not seem to be allowed by CRAN to pass the test
so I have had to use the serial version invoked with nthreads=-1. In
your normal use of code try always nthreads=0.
Of course, we might have used simply 10 as number of medoids,
k, since this does not change by filtering, but this is to
emphasize the fact that ApplyPAM with method
PREV requires both parameters to be consistent.