%% use JSS class but for now with nojss option
\documentclass[nojss,shortnames,article]{jss}
%\usepackage{float}
\usepackage{booktabs}

\author{Dirk Eddelbuettel\\Debian Project} % \And Second Author\\Plus Affiliation}
\title{Benchmarking Single- and Multi-Core BLAS Implementations and GPUs for use with \proglang{R}}

\Plainauthor{Dirk Eddelbuettel} % , Second Author} %% comma-separated
\Plaintitle{Benchmarking Single- and Multi-Core BLAS Implementations and GPUs for use with R}
\Shorttitle{Benchmarking BLAS and GPUs for use with R}

\Abstract{
  \noindent
  We provide timing results for common linear algebra subroutines across BLAS
  (Basic Linear Algebra Subprograms) and GPU (Graphics Processing Unit)-based
  implementations. Several BLAS implementations are compared. The first is
  the unoptimised reference BLAS which provides a baseline to measure
  against. Second is the Atlas tuned BLAS, configured for single-threaded
  mode. Third is the development version of Atlas, configured for
  multi-threaded mode. Fourth is the optimised and multi-threaded Goto
  BLAS. Fifth is the multi-threaded BLAS contained in the commercial Intel
  MKL package. We also measure the performance of a GPU-based implementation
  for \proglang{R} \citep{RCore:R} provided by the package \pkg{gputools}
  \citep{cran:gputools}.

  Several frequently-used linear algebra computations are compared across
  BLAS (and LAPACK) implementations and via GPU computing: matrix
  multiplication as well as QR, SVD and LU decompositions.  The tests are
  performed from an end-user perspective, and `net' times (including all
  necessary data transfers) are compared.

  While results are by their very nature dependent on the hardware of the
  test platforms, a few general lessons can be drawn. Unsurprisingly,
  accelerated BLAS clearly outperform the reference implementation.
  Similarly, multi-threaded BLAS hold a clear advantage over single-threaded
  BLAS when used on modern multi-core machines. Between the multi-threaded
  BLAS implementations, Goto is seen to have a slight advantage over MKL and
  Atlas.  GPU computing is showing promise but requires relatively large
  matrices to outperform multi-threaded BLAS.

  We also provide the framework for computing these benchmark results as the
  corresponding \proglang{R} package \pkg{gcbd}. It enables other
  researchers to compute similar benchmark results which could be the basis
for heuristics helping to select optimal computing strategies for a given
  platform, library, problem and size combination.
}

\Keywords{BLAS, Atlas, Goto, MKL, GPU, \proglang{R}, Linux} %% at least one keyword must be supplied
\Plainkeywords{BLAS, Atlas, Goto, MKL, GPU, R, Linux} %% without formatting

%% publication information
%% NOTE: Typically, this can be left commented and will be filled out by the technical editor
%% \Volume{13}
%% \Issue{9}
%% \Month{September}
%% \Year{2004}
%% \Submitdate{2004-09-29}
%% \Acceptdate{2004-09-29}

\Address{
  Dirk Eddelbuettel \\
  Debian Project \\
  Chicago, IL, USA\\
  %% Telephone: +43/1/31336-5053
  %% Fax: +43/1/31336-734
  E-mail: \email{edd@debian.org}\\
  URL: \url{http://dirk.eddelbuettel.com}
}

%% need no \usepackage{Sweave.sty}

% ------------------------------------------------------------------------

\begin{document}
\SweaveOpts{engine=R,eps=FALSE,echo=FALSE,prefix.string=figures/chart}
%\VignetteIndexEntry{BLAS and GPU Benchmarking for Use with R}
%\VignetteDepends{gcbd}
%\VignetteKeywords{blas,gpu,atlas,mkl,goto,R,Linux}
%\VignettePackage{gcbd}

\shortcites{Brodtkorb_et_al_2010,Blackford_et_al:2002,lapack}

%% Note that you should use the \pkg{}, \proglang{} and \code{} commands.

<<prelim,echo=FALSE,print=FALSE>>=
options(width=50)
library(gcbd)
gcbd.version <- packageDescription("gcbd")$Version
gcbd.date <- packageDescription("gcbd")$Date
now.date <- strftime(Sys.Date(), "%B %d, %Y")
# create figures/ if not present
if ( ! (file.exists("figures") && file.info("figures")$isdir) ) dir.create("figures")
@
%


\makeatletter
\if@nojss
  Detailed comments from Roger Bivand, Rog\'{e}rio Brito, Allan
  Engelhardt, Bryan Lewis, James MacKinnon, Junji Nakano, Mark Seligman, Brian
  Smith and Luke Tierney are very gratefully acknowledged. All remaining errors
  are mine.

  \vfill
  This version corresponds to \pkg{gcbd} version \Sexpr{gcbd.version} and was
  typeset on \Sexpr{now.date}.
\fi
\makeatother

\pagebreak
\section[Introduction]{Introduction}

Analysts are often eager to reap the maximum performance from their computing
platforms.  A popular suggestion in recent years has been to consider
optimised basic linear algebra subprograms (BLAS).  Optimised BLAS libraries
have been included with some (commercial) analysis platforms for a decade
\citep{Moler:2000}, and have also been available for (at least some) Linux
distributions for an equally long time \citep{Maguire:1999}.  Setting BLAS up
can be daunting: the \proglang{R} language and environment devotes a detailed
discussion to the topic in its \textsl{Installation and Administration}
manual \citep[appendix A.3.1]{RCore:InstAdmin}.

Among the available BLAS implementations, several popular choices have
emerged. Atlas (an acronym for \textsl{Automatically Tuned Linear Algebra
  Software}) is popular as it has shown very good performance due to its
automated and CPU-specific tuning \citep{Whaley_Dongarra:1999,
  Whaley_Petitet:2005}. It is also licensed in such a way that it permits
redistribution leading to fairly wide availability of Atlas.\footnote{The
  Atlas FAQ lists Maple, Matlab and Mathematica as using Atlas, and
  mentions that GSL, Octave, R, Scientific Python, and Scilab can be used with
  Atlas.}  We deploy Atlas in both a single-threaded and a multi-threaded
configuration. Another popular BLAS implementation is Goto BLAS which is named
after its main developer, Kazushige Goto \citep{Goto_VanDeGeijin:2008}. While
`free to use', its license does not permit redistribution, putting the onus of
configuration, compilation and installation on the end-user.  Lastly, the
Intel Math Kernel Library (MKL), a commercial product, also includes an
optimised BLAS library.

Graphics processing units (GPUs) are a recent addition to the tool chain of
high-performance computing.  Originally designed for optimised
single-precision arithmetic to accelerate computing as performed by graphics
cards, these devices are increasingly used in numerical analysis.  Earlier
criticisms of insufficient floating-point precision or severe performance
penalties for double-precision calculations are being addressed by the newest
models. Dependence on particular vendors remains a concern: NVidia's CUDA
toolkit \citep{nVidia:2010} is currently still the preferred development choice,
whereas the newer OpenCL standard \citep{OpenCL:2010} may become a more
generic alternative that is independent of hardware vendors.
\citet{Brodtkorb_et_al_2010} provide an excellent recent survey.

What has been lacking is a comparison of the effective performance of these
alternatives.  This paper works towards filling this gap.  By analysing
performance across five different BLAS implementations---as well as a
GPU-based solution---we are able to provide a reasonably broad comparison.

Performance is measured as an end-user would experience it: we record
computing times from launching commands in the interactive \proglang{R}
environment \citep{RCore:R} to their completion.  While implemented in
\proglang{R}, these benchmark results are more general and valid beyond the
\proglang{R} system as there is only a very thin translation layer between
the higher-level commands and the underlying implementations (such as, say,
\code{dgemm} for double-precision matrix multiplications) in the respective
libraries.  This lack of (strong) dependence on the test environment makes
our results more generally applicable. However, \proglang{R} is very useful
as an environment to generate test data, execute the benchmarks, and to
collect the results which are subsequently analysed and visualised.

The rest of the paper is organised as follows. In the next section, the
technical background is briefly discussed. The implementation of our
benchmark tests is outlined in section 3. We provide results in section 4,
before a summary concludes in section 5.

\section{Background}

Basic Linear Algebra Subprograms (BLAS) provide an Application Programming
Interface (API) for linear algebra.  For a given task such as, say, a
multiplication of two conformant matrices, an interface is described via a
function declaration, in this case \code{sgemm} for single precision and
\code{dgemm} for double precision. The actual implementation becomes
interchangeable thanks to the API definition and can be supplied by different
approaches or algorithms.  This is one of the fundamental code design
features we use here to benchmark the performance differences between
implementations.
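
As a simple illustration (not part of the benchmark code itself), the
following \proglang{R} snippet times one double-precision matrix
multiplication; \proglang{R} delegates the actual work to the \code{dgemm}
routine of whichever BLAS library is active, so the elapsed time changes with
the BLAS implementation while the \proglang{R} code stays identical.

\begin{verbatim}
## Illustrative sketch (not from the gcbd package): the multiplication
## below is carried out by the dgemm routine of the active BLAS library.
n <- 1000
A <- matrix(rnorm(n * n), n, n)
B <- matrix(rnorm(n * n), n, n)
system.time(C <- A %*% B)   # elapsed time depends on the BLAS in use
\end{verbatim}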

A second key aspect is the difference between static and shared linking.  In
static linking, object code is taken from the underlying library and copied
into the resulting executable.  This has several key implications. First, the
executable becomes larger due to the copy of the binary code. Second, it
makes the executable marginally faster as the library code is present and no
additional look-up and subsequent redirection has to be performed. The actual
size of this performance difference is the subject of near-endless debate. We
should also note that dynamic linking usually amounts to only a small
load-time penalty combined with a function pointer redirection---the
computational effort is unchanged as the object code itself is identical.
Third, it makes the program
more robust as fewer external dependencies are required.  However, this last
point also has a downside: no changes in the underlying library will be
reflected in the binary unless a new build is executed.  Shared library
builds, on the other hand, result in smaller binaries that may run marginally
slower---but which can make use of different libraries without a rebuild.

That last feature is key here: it offers us the ability to use different BLAS
implementations in a `plug and play' mode which facilitates comparisons.
Because the BLAS interface is standardised and we have several alternative
implementations at our disposal, we can switch between these alternatives.
Doing so easily makes use of the package mechanism of Ubuntu, the
Debian-based Linux distribution we employ. However, a
simpler (yet less robust) approach would also be available. This technical
aspect is discussed further below.

The first available BLAS implementation stems from the original unoptimised
code on Netlib. On Debian-based systems, it is provided by the (source)
package \pkg{blas} \citep{Blackford_et_al:2002} and used with the
\pkg{lapack} package \citep{lapack}. This implementation is commonly referred
to as `reference BLAS' (and we will use `ref' as a shorthand) as it provides
a reference implementation.

The second and third BLAS implementations are provided by Atlas
\citep{Whaley_Dongarra:1999,Whaley_Petitet:2005} which stands for
\textsl{Automatically Tuned Linear Algebra Software} as it optimises its
performance and parameters (such as cache sizes) during its initial build.
Another notable aspect is its liberal licensing which permits wide
distribution. Consequently, Atlas is available on most if not all Linux
distributions. Windows binaries are also widely available.  We used two
different Atlas versions. The first one is the (pre-compiled) version in the
Ubuntu and Debian releases. It is based on Atlas 3.6.0 and built only for
single-threaded operation.  The second version is based on the current Atlas
development version 3.9.25, built for multi-threaded mode and locally
compiled.\footnote{A critical change enabling multi-threaded mode which we
  added was also filed as Debian bug report \#595326 and will be reflected in
  future Debian / Ubuntu packages.} We will refer to the second variant as
`Atlas39' and `Atl39' when we report results below.

The fourth BLAS implementation is provided by the Goto BLAS. Upon the
required user registration, these are freely available from the University of
Texas, albeit under a license that prohibits redistribution.  This prevents
inclusion into popular Linux distributions, a clear disadvantage for easy
installation and de-installation.  However, the contributed Debian repository
at the Institute of Statistical Mathematics in Japan provides a `helper
package' \citep{nakano_nakama:2009,Nakama:2010} which, given the required
registration information, downloads the source code from the University of
Texas site, and then configures and compiles the Goto BLAS in such a manner
that a local binary Debian package is produced---which is also optimised for
the local installation.  This permits us to use the Goto BLAS via the
resulting package as a fourth BLAS alternative.

The fifth available BLAS implementation is part of the Intel Math Kernel
Library (MKL), a commercial product. However, Ubuntu release 9.10
(``karmic'') contains a set of packages sponsored by Revolution Analytics
which comprises version 10.2 (dated March 2009) of the Intel MKL in a setup
directly usable by \proglang{R}. We use these packages here as a fifth set of
optimised BLAS.  The AMD Core Math Library (ACML) could provide a sixth BLAS
variant. However, given that Intel is currently not truly challenged by AMD
for high-performance CPUs, we considered analysing the Intel MKL to be a
sufficient representation of the vendor BLAS market.

The first \proglang{R} extension for graphics-processing units has been
implemented by the \pkg{gputools} package \citep*{cran:gputools}. It provides
a number of functions which use the GPU instead of the CPU. This can provide
a significant performance increase for `large enough' problems due to the
large number of cores---from several dozen to several hundred on newer
cards---on a GPU. However, because data has to be transferred from the CPU to
the GPU, a fixed cost in communications has to be borne by every invocation
of the GPU.  For sizable problems, this cost can be outweighed by the
benefits of the massively parallel GPU computation.  Exactly where the
indifference point lies beyond which GPU computing has an advantage is
unfortunately dependent on the particular problem and algorithm as well as
the given hardware and software combination.
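
To make this trade-off concrete, a small sketch along the following lines
contrasts a CPU-based and a GPU-based multiplication; here
\code{gpuMatMult()} is taken to be the \pkg{gputools} counterpart of a plain
matrix multiplication, and its timing includes the data transfer to and from
the device.

\begin{verbatim}
## Hedged sketch: CPU (BLAS) versus GPU (gputools) timing for one
## matrix multiplication; the GPU timing includes transfer overhead.
library(gputools)
n <- 2000
A <- matrix(rnorm(n * n), n, n)
system.time(A %*% A)            # CPU, via the active BLAS
system.time(gpuMatMult(A, A))   # GPU, including data transfer
\end{verbatim}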

A recent addition to the set of \proglang{R} packages is \pkg{magma}
\citep{cran:magma} which interfaces the \pkg{Magma} library project of the
same name \citep{Tomov_et_al:2009,Tomov_et_al:2010} (where we will capitalize
the name of the library to distinguish it from the \proglang{R} package). The
stated goal of the Magma project is to \textsl{develop a dense linear algebra
  library similar to LAPACK but for heterogeneous/hybrid architectures,
  starting with current ``Multicore+GPU'' systems}.  The current release of
Magma is version 0.2 which is available for Linux systems with NVidia's CUDA
toolkit. It provides LU, QR, and Cholesky factorizations as well as linear
solvers based on these along with implementations of several BLAS functions
including \code{dgemm}, \code{dgemv}, and \code{dsymv} (as well as their
single-precision counterparts).  Figure~\ref{fig:magma} illustrates the change in
relative workload between CPU and GPU when using Magma for a
(single-precision) QR decomposition.  It clearly depicts how the share of the
total computing effort that goes to the GPU increases as a function of the
matrix size.

\begin{figure}[t!]
  \centering
  \includegraphics[width=4in]{MagmaTimes.png}
  \caption{Breakdown of CPU and GPU computing percentages for
    single-precision QR decomposition using the hybrid Magma
    approach. (Source: \citet[p.7]{Tomov_et_al:2009})}
  \label{fig:magma}
\end{figure}

A flexible and hybrid approach has the potential to improve upon solutions
that use either the CPU or the GPU. However, the current version of the
\pkg{Magma} library was exhibiting stability problems when used with the GPU
card employed here (a Quadro FX 4800). Analysing the performance of \pkg{Magma}
relative to accelerated BLAS and other GPU-based approaches is therefore left
as a topic for future research.

Finally, an important, and possibly under-appreciated, topic is how to
balance explicitly parallel code that distributes load across cores with
multi-threaded BLAS.  Execution of code may already be parallelised in a
`coarse-grained' fashion across all local cores or CPUs, maybe via tools that
explicitly spawn multiple threads, maybe via compiler-driven directives as
for example with Open MP, or maybe via cluster-computing tools across a local
network.  If such a setup `booked' all available cores, a key assumption of
multi-threaded BLAS no longer holds: other cores are not idle but also busy. In
this case contention arises between the explicit parallelism and the implicit
parallelism from the multi-threaded BLAS, and performance is bound to
suffer. \citet{bivand2010} discusses these issues and provides several
illustrations using examples from data-intensive GIS applications.  The
simplest remedy is to withdraw one of the mechanisms for multi-threaded use by
limiting the number of cores to one.  This can be done for Goto BLAS using
the \code{GOTO_NUM_THREADS} environment variable. For the MKL one can use the
environment variable \code{OMP_NUM_THREADS} (of the underlying Open MP
implementation), or the R function \code{setMKLthreads()} of the
corresponding package. Atlas, on the other hand, fixes the number of cores
used at compile-time and cannot vary this setting at run-time.
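
A sketch of these settings is given below; note that such environment
variables are typically read when the BLAS library is initialised, so setting
them in the shell before launching \proglang{R} is the safer route.

\begin{verbatim}
## Sketch: restrict the implicit BLAS parallelism to a single thread
## (safer: export these variables in the shell before starting R).
Sys.setenv(GOTO_NUM_THREADS = "1")   # Goto BLAS
Sys.setenv(OMP_NUM_THREADS  = "1")   # Open MP based BLAS such as the MKL
## setMKLthreads(1)                  # alternative via the Revolution MKL package
\end{verbatim}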

\section{Implementation}

\subsection{Requirements}

In order to undertake the automated benchmarking, we need to be able to
switch between different implementations of the BLAS API.  As discussed
above, dynamic libraries are one possible solution that avoids having to
rebuild \proglang{R} explicitly for each library.  However, this also
requires that \proglang{R} itself is built with shared-library support, as
well as with support for external BLAS and LAPACK libraries.  This
is, however, the default configuration on Debian and Ubuntu systems.

The reference BLAS as well as Atlas have been available for essentially all
Debian and Ubuntu releases. Atlas 3.9.25 was packaged locally; the package and
helper scripts are available upon request. The Intel MKL is available for
Ubuntu following the Revolution R upload for Ubuntu release 9.10.

For Goto BLAS, we are using a helper script provided in the contributed
\texttt{gotoblas2-helper} package \citep{nakano_nakama:2009,Nakama:2010}.
This package arranges for a download of the Goto BLAS sources (given the
required account and password information for the University of Texas
software download center) and an automated Debian package build and
installation via the command \code{sudo /etc/init.d/gotoblas2-helper
  start}. Note that the initial invocation of this command will trigger a
build of the package which may take up to two hours.  While designed for
Debian, it also works perfectly on the Ubuntu systems used here.\footnote{The
  \texttt{/etc/init.d/gotoblas2-helper} script required one change from
  \texttt{/bin/sh} to \texttt{/bin/bash}.}

For GPU-based testing we require the \proglang{R} package \pkg{gputools}
which in turn requires support for CUDA and the NVidia SDK (as
well as appropriate NVidia hardware). Detailed installation instructions are
provided by the package so we will defer to these and assume that the
packages are in fact installed.  Helper scripts in our package will then
verify this availability while the benchmark is executed.

\subsection{Benchmark implementation}

The benchmarks described in this paper are produced by the package \pkg{gcbd}
which is part of the larger \pkg{gcb} project \citep*{rforge:gcb} on the
R-Forge hosting site \citep{RJournal:Theussl+Zeileis:2009}. The \pkg{gcbd}
package (where the acronym expands to `GPU/CPU Benchmarking on Deb-based
systems') contains a number of \proglang{R} helper functions as well as an
actual benchmarking script which is executed.

The helper functions fall into two groups: utilities, and benchmarks. The
utilities fall into several categories (a brief usage sketch follows the list):
\begin{itemize}
\item initial feature detection via a function \code{requirements()} which asserts
  that a number of testable features of the host operating system are met;
\item feature tests via the function \code{hasGputools()}
  allowing the benchmark script to bypass GPU-based tests in systems without
  a GPU;
\item installation and removal functions which interface the package
  management layer and install (or remove) the Atlas, Atlas39, MKL or Goto BLAS
  packages, respectively, which helps ensure that at any one point in time
  only one accelerated BLAS library package is present;
\item database creation where the results database (and table schema) is
  created if not already present;
\item recording of simulation results in the database.
\end{itemize}
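
The sketch below illustrates how these utilities are meant to be used; the
control flow is simplified, and the actual benchmark script in the package is
more elaborate.

\begin{verbatim}
## Illustrative use of the gcbd utility functions (simplified control flow)
library(gcbd)
requirements()            # assert that required host features are present
hasGpu <- hasGputools()   # TRUE only if gputools (and hence a GPU) is usable
if (hasGpu) {
    ## ... also run the GPU-based benchmarks ...
}
\end{verbatim}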

The benchmark functions can also be categorized:
\begin{itemize}
\item creation of random data for standard \pkg{Matrix} objects;
\item actual benchmarking code for
  \begin{itemize}
  \item matrix crossproducts;
  \item SV decomposition;
  \item QR decomposition;
  \item LU decomposition;
  \end{itemize}
  for BLAS and \pkg{gputools}, respectively.
\end{itemize}

For the QR decomposition, we set the flag \code{LAPACK=TRUE}. In
\proglang{R}, the default QR operation is provided by LINPACK which uses only
level 1 BLAS operations, whereas LAPACK can reap a larger benefit from
accelerated or optimised BLAS libraries.  For the LU decompositions, we use
the function from the \pkg{Matrix} package by \citet{cran:matrix}.
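
In essence, the timed operations correspond to the following \proglang{R}
expressions; this is a simplified sketch, as the package wraps these calls
with data generation, repetition and the recording of timings.

\begin{verbatim}
## Simplified sketch of the four timed operations on an n x n matrix X
library(Matrix)
n <- 1000
X <- matrix(rnorm(n * n), n, n)
system.time(crossprod(X))           # matrix crossproduct
system.time(qr(X, LAPACK = TRUE))   # QR decomposition via LAPACK
system.time(svd(X))                 # singular value decomposition
system.time(lu(Matrix(X)))          # LU decomposition from the Matrix package
\end{verbatim}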

\subsection{Benchmark script}

The benchmark execution can then be triggered by the script
\code{benchmark.r} in the sub-directory \code{scripts} of the \pkg{gcbd}
package. It is implemented as an executable \proglang{R} script which uses the
\pkg{getopt} package \citep{cran:getopt} for command-line parsing:

\begin{verbatim}
$ ./benchmark.r -h
Usage: benchmark.r [-[-verbose|v]] [-[-help|h]] [-[-nobs|n] <integer>]
                   [-[-runs|r] <integer>] [-[-benchmark|b] <character>]
    -v|--verbose      verbose operations, default is false
    -h|--help         help on options
    -n|--nobs         number of rows and columns in matrix, default is 250
    -r|--runs         number of benchmark runs, default is 30
    -b|--benchmark    benchmark to run (matmult [default], qr, svd, lu)
\end{verbatim}

Each benchmark experiment consists of $r$ runs for a matrix of size $n \times
n$ using the chosen benchmark---matrix cross product or one of
the QR, LU or SVD decompositions---over all five BLAS implementations and the
GPU package.  The run against the GPU package is optional and dependent on
the GPU package being present.

At the end of each benchmark experiment, the results are appended to a SQLite
database.
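
An illustrative sketch of this recording step is shown below; the table and
column names are illustrative rather than the actual schema used by
\pkg{gcbd}, and the result row holds dummy values.

\begin{verbatim}
## Illustrative sketch of appending one result row to an SQLite database
## (table name, column names and values are dummies, not gcbd's schema)
library(RSQLite)
con <- dbConnect(SQLite(), dbname = "gcbd.sqlite")
row <- data.frame(host = "i7_920", blas = "ref", type = "matmult",
                  nobs = 250, elapsed = 1.23)
dbWriteTable(con, "benchmark", row, append = TRUE)
dbDisconnect(con)
\end{verbatim}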

We use the `elapsed time' as measured by the \proglang{R} function
\code{system.time()}.  This measure is preferable to the sum of system time
and user time, which adds up to total CPU time---but without adjusting for
multiple threads or cores. Elapsed time correctly measures from start to
finish, whether one or multiple threads or cores are involved (but is
susceptible to being influenced by system load).
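
The distinction is easy to see in a small sketch: with a multi-threaded BLAS,
the sum of user and system time can exceed the elapsed wall-clock time, which
is why only the latter is recorded.

\begin{verbatim}
## Sketch: with a multi-threaded BLAS, user + system time may exceed
## the elapsed (wall-clock) time; only the elapsed component is recorded.
n <- 2000
A <- matrix(rnorm(n * n), n, n)
st <- system.time(A %*% A)
st[["elapsed"]]   # wall-clock seconds from start to finish
\end{verbatim}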

Using this script \code{benchmark.r}, users can collect benchmark results on
their systems.  These could be aggregated into a larger body of performance
data from which realistic performance numbers could be estimated for a much
wider variety of hardware configurations.  Doing so is beyond the scope of this
paper but a possible avenue for future work.

\subsection{Alternative implementation}

On platforms that do not have access to pre-built BLAS library packages, an
alternative approach could consist of locally installing the different
libraries into sub-directories of, say, \code{/opt/blas}. One could then use
the environment variable \verb|LD_LIBRARY_PATH| to select one of these
directories at a time.  However, such an approach places a larger burden on
the user of the benchmarking software as she would have to download,
configure, compile and install the BLAS
libraries, which is typically not a trivial step.

\subsection{Hardware and software considerations}
\label{subsec:hardware}

For benchmarking linear algebra performance, hardware and software aspects
matter greatly for the overall results. Different CPUs, different GPUs,
different memory types as well as different compilers (or operating systems)
may generate different performance profiles.

We obtained the results presented here on these two platforms:
\begin{itemize}
\item a four-core Intel i7 920, 2.67 GHz clock speed, 8192 KB CPU cache, 6 GB RAM,
  in hyper-threaded mode for eight visible cores, running Ubuntu 10.04 in 64-bit mode;
\item a dual-CPU four-core Intel Xeon X5570, 2.96 GHz clock speed, 8192 KB CPU
  cache, 16 GB RAM, in hyper-threaded mode
  for sixteen visible cores, running Ubuntu 10.04 in 64-bit mode.
\end{itemize}

The i7 system also has an NVidia Quadro FX 4800 GPU with 192 cores and 1.5 GB
of RAM. This card has a so-called `compute capability' of 1.3, whereas newer
models (`Fermi') have a compute capability of 2.0, signalling the availability
of more hardware features.

Other hardware platforms could be covered by users installing the
\pkg{gcbd} package on their systems and reporting results back by supplying
the resulting SQLite database files.

The software configuration was held constant by running on 64-bit Ubuntu
10.04 in both cases.

It should be noted that hyper-threading is a potential distraction. While it was
not possible to reconfigure the machines used here, it could be informative
to compare results on identical hardware with hyper-threading turned on and
off, respectively.

Lastly, comparing across operating systems would also be of interest. While
the test platform used here makes extensive use of the packaging system
available on Debian / Ubuntu, it would of course be possible to set up
something similar on, say, OS X to measure the performance of the vecLib system.

%\pagebreak
\section[Results]{Results}

<<data,print=FALSE>>=
i7 <- getBenchmarkData("i7_920")
xeon <- getBenchmarkData("xeon_X5570")
D <- subset(i7[,-c(1:2,5)], type=="matmult")
@

%\subsection{BLAS comparison}

We present the benchmark results (which are also included in the \pkg{gcbd}
package in the file \code{sql/gcbd.sqlite}).  We show the four different
benchmarks for the two test architectures (the i7 and the xeon systems
detailed in section~\ref{subsec:hardware}) in eight different panels of a
lattice plot \citep{lattice}. Each of these panels displays results for all
the available BLAS (and GPU) implementations, with the matrix dimension on
the horizontal axis and the elapsed time in seconds on the vertical axis. In
Figure~\ref{fig:lattice}, we first display the raw results with the elapsed
time capped at 30 seconds.  Figure~\ref{fig:logloglattice} then shows the
same results using logarithmic axes in order to better differentiate between
alternative implementations.\footnote{I am particularly grateful to Allan
  Engelhardt for suggesting to switch from a plot with logged y-axis to a
  log-log plot.}

\setkeys{Gin}{width=0.99\textwidth}
\begin{figure}[t!]
  \centering
<<Lattice,fig=TRUE,height=6.875,width=11>>=
figure_Lattice(titles=FALSE)
@
\caption{Benchmark results for BLAS and GPUs across four tests.}
\label{fig:lattice}
\end{figure}

\setkeys{Gin}{width=0.99\textwidth}
\begin{figure}[t!]
  \centering
<<LogLogLattice,fig=TRUE,height=6.875,width=11>>=
figure_LogLogLattice(titles=FALSE)
@
\caption{Benchmark results for BLAS and GPUs across four tests using logarithmic axes.}
\label{fig:logloglattice}
\end{figure}


\subsubsection{Matrix multiplication: i7 with GPU}

% \setkeys{Gin}{width=0.99\textwidth}
% \begin{figure}[t!]
%   \centering
% <<matmulti7,fig=TRUE,height=5.25,width=11>>=
% figure_MatMult_i7(i7)
% @
% \caption{Matrix multiplications on i7 with GPU.}
% \end{figure}

It is immediately apparent that the reference BLAS underperform very
significantly. Similarly, the single-threaded Atlas performance is also
dominated by the multi-threaded BLAS (Atlas39, Goto, MKL) and the GPU-based
gputools.

We also see (especially in the log/log plot) that the GPU-based solution
bears a clear penalty for smaller dimensions. It only crosses below the
single-threaded Atlas around $N=1000$ and approaches the best performing
multi-threaded BLAS at the very end for $N=5000$.  On the other hand, it
clearly exhibits a much shallower slope, implying a smaller time penalty as the
matrix size increases.

The three multi-threaded BLAS implementations are essentially
indistinguishable for this test and hardware platform.

\subsubsection{Matrix multiplication: Xeon}

% \begin{figure}[t!]
%   \centering
% <<matmultxeon,fig=TRUE,height=5.25,width=11>>=
% figure_MatMult_xeon(xeon)
% @
% \caption{Matrix multiplications on xeon.}
% \end{figure}

As above, the reference BLAS underperform significantly, and single-threaded
Atlas is dominated by all multi-threaded BLAS implementations. On the other
hand, on this platform the log/log plot offers a better differentiation
between the multi-threaded BLAS implementations. Goto is clearly ahead of its
competitors, although Atlas39 catches up for the largest matrices.  Atlas39
also beats MKL for essentially all sizes but cannot match Goto BLAS'
performance. MKL shows a lower slope but higher times for small matrices,
hinting at a suboptimal configuration for small problem sizes.

%\clearpage

\subsubsection{QR decomposition: i7 with GPU}

% \begin{figure}[t!]
%   \centering
% <<qri7,fig=TRUE,height=5.25,width=11>>=
% figure_QR_i7(i7)
% @
% \caption{QR decomposition on i7 with GPU.}
% \end{figure}

As before, reference BLAS are behind all other implementations. The
single-threaded Atlas is on par with the multi-threaded Atlas39---and the GPU
solution. Goto and MKL beat all others, with Goto dominating overall.  The
log/log plot once again illustrates the higher `startup cost' of the GPU
solution due to the communications cost being high (in relative terms) for
small matrices where the computations are still very fast.

\subsubsection{QR decomposition: Xeon}

% \begin{figure}[b!]
%   \centering
% <<qrxeon,fig=TRUE,height=5.25,width=11>>=
% figure_QR_xeon(xeon)
% @
% \caption{QR decomposition on xeon.}
% \end{figure}

Results fall basically into three groups. The Reference BLAS are behind
everybody. A second group of both Atlas variants and MKL forms the middle
with MKL having a slight edge over the Atlas libraries. However, Goto is
ahead of everybody (yet shows a somewhat surprising non-linearity in the
log/log plot).

%\clearpage

\subsubsection{SVD decomposition: i7 with GPU}

% \begin{figure}[t!]
%   \centering
% <<svdi7,fig=TRUE,height=5.25,width=11>>=
% figure_SVD_i7(i7)
% @
% \caption{SVD on i7 with GPU.}
% \end{figure}

For the SVD, we have the familiar ordering of Reference BLAS behind
single-threaded Atlas which is behind multi-threaded Atlas.  MKL, Goto and
GPU are all competitive, with the GPU lagging for small sizes but beating all
competitors for the largest matrix size tests.  In the log/log chart, the GPU
performance also demonstrates a much shallower slope, albeit with a much
higher intercept.

\subsubsection{SVD: Xeon}

% \begin{figure}[t!]
%   \centering
% <<svdxeon,fig=TRUE,height=5.25,width=11>>=
% figure_SVD_xeon(xeon)
% @
% \caption{SVD on xeon.}
% \end{figure}

On the Xeon, results are similar to those on the i7, with the main difference that the
multi-threaded Atlas39 is closer to its competitors, in particular the
MKL. Goto BLAS are once again ahead.


\subsubsection{LU decomposition: i7 with GPU}

% \begin{figure}[t!]
%   \centering
% <<lui7,fig=TRUE,height=5.25,width=11>>=
% figure_LU_i7(i7)
% @
% \caption{LU decomposition on i7 with GPU.}
% \end{figure}

LU decomposition results, using the function from the \pkg{Matrix} package
\citep{cran:matrix}, are similar to earlier results. Reference BLAS are by
far the slowest, single-threaded Atlas beats them clearly and loses equally
clearly to the multi-threaded BLAS. Multi-threaded Atlas is very close to
Goto and MKL, which are ever so slightly ahead and essentially
indistinguishable.  For this algorithm, no GPU-based performance numbers are
available as the current version of the \pkg{gputools} package does not
provide an LU decomposition.


\subsubsection{LU decomposition: Xeon}

% \begin{figure}[t!]
%   \centering
% <<luxeon,fig=TRUE,height=5.25,width=11>>=
% figure_LU_xeon(xeon)
% @
% \caption{LU decomposition on xeon.}
% \end{figure}

On the Xeon chip, we see once again a clearer separation between the three multi-threaded BLAS
implementations on the one hand and the single-threaded Atlas and the
Reference BLAS on the other hand.  Goto has a clear edge over MKL, which is
also overtaken by Atlas39 for medium-to-large size matrices. MKL also
exhibits some irregularity for small-to-medium sized matrices.

\subsubsection{Comparison}

The charts shown above, and in particular Figure~\ref{fig:logloglattice},
suggest a further analysis: a regression of the logarithm of the elapsed
times on the logarithm of the matrix dimension.  The estimates are shown in
Table~\ref{tab:logloganalysis} below. For these results, a higher slope
coefficient implies a larger increase in elapsed time per increase in matrix
dimension, and a non-linear one at that, as both variables are on logarithmic
scales.
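
The regressions themselves are straightforward; a minimal sketch with
synthetic data is shown below, while the actual estimates in
Table~\ref{tab:logloganalysis} are computed by the function
\code{loglogAnalysis()} in the \pkg{gcbd} package.

\begin{verbatim}
## Minimal sketch of one log/log regression using synthetic data:
## elapsed times growing with the cube of the matrix dimension.
nobs <- c(250, 500, 1000, 2000, 4000)
elapsed <- 1e-9 * nobs^3                  # synthetic, for illustration only
fit <- lm(log(elapsed) ~ log(nobs))
coef(fit)    # intercept (alpha) and slope (beta, here equal to 3)
\end{verbatim}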

<<table,echo=FALSE,print=FALSE>>=
C <- loglogAnalysis()[["coefs"]]
C[,-c(1,2)] <- format(round(C[,-c(1,2)], digits=2))
@

\begin{table}[b!]
  \begin{center}
    \begin{tabular*}{0.975\textwidth}{@{\extracolsep{\fill}} l r rr rr rr rr rr}
    %\begin{tabular}{lrrrrrr}
      \toprule
      Platform & BLAS    & \multicolumn{2}{c}{Mat.Mult} & \multicolumn{2}{c}{QR} & \multicolumn{2}{c}{SVD}  & \multicolumn{2}{c}{LU} \\
               &         & \multicolumn{1}{c}{$\alpha$} & \multicolumn{1}{c}{$\beta$}
               & \multicolumn{1}{c}{$\alpha$} & \multicolumn{1}{c}{$\beta$}
               & \multicolumn{1}{c}{$\alpha$} & \multicolumn{1}{c}{$\beta$}
               & \multicolumn{1}{c}{$\alpha$} & \multicolumn{1}{c}{$\beta$} \\
      \cmidrule(r){3-4} \cmidrule(r){5-6} \cmidrule(r){7-8} \cmidrule(r){9-10}
      \\[2pt]
      \multicolumn{1}{r}{i7}
               & Ref    & \Sexpr{C[2,3]} & \Sexpr{C[2,4]} & \Sexpr{C[3,3]} & \Sexpr{C[3,4]} & \Sexpr{C[4,3]} & \Sexpr{C[4,4]} & \Sexpr{C[1,3]} & \Sexpr{C[1,4]} \\
               & Atlas  & \Sexpr{C[2,5]} & \Sexpr{C[2,6]} & \Sexpr{C[3,5]} & \Sexpr{C[3,6]} & \Sexpr{C[4,5]} & \Sexpr{C[4,6]} & \Sexpr{C[1,5]} & \Sexpr{C[1,6]} \\
               & Atlas39& \Sexpr{C[2,7]} & \Sexpr{C[2,8]} & \Sexpr{C[3,7]} & \Sexpr{C[3,8]} & \Sexpr{C[4,7]} & \Sexpr{C[4,8]} & \Sexpr{C[1,7]} & \Sexpr{C[1,8]} \\
               & Goto   & \Sexpr{C[2,9]} & \Sexpr{C[2,10]}& \Sexpr{C[3,9]} & \Sexpr{C[3,10]}& \Sexpr{C[4,9]} & \Sexpr{C[4,10]}& \Sexpr{C[1,9]} & \Sexpr{C[1,10]}\\
               & MKL    & \Sexpr{C[2,11]}& \Sexpr{C[2,12]}& \Sexpr{C[3,11]}& \Sexpr{C[3,12]}& \Sexpr{C[4,11]}& \Sexpr{C[4,12]}& \Sexpr{C[1,11]}& \Sexpr{C[1,12]}\\
               & GPU    & \Sexpr{C[2,13]}& \Sexpr{C[2,14]}& \Sexpr{C[3,13]}& \Sexpr{C[3,14]}& \Sexpr{C[4,13]}& \Sexpr{C[4,14]}& & \\[4pt]

      \multicolumn{1}{r}{xeon}
               & Ref    & \Sexpr{C[6,3]} & \Sexpr{C[6,4]} & \Sexpr{C[7,3]} & \Sexpr{C[7,4]} & \Sexpr{C[8,3]} & \Sexpr{C[8,4]} & \Sexpr{C[5,3]} & \Sexpr{C[5,4]} \\
               & Atlas  & \Sexpr{C[6,5]} & \Sexpr{C[6,6]} & \Sexpr{C[7,5]} & \Sexpr{C[7,6]} & \Sexpr{C[8,5]} & \Sexpr{C[8,6]} & \Sexpr{C[5,5]} & \Sexpr{C[5,6]} \\
               & Atlas39& \Sexpr{C[6,7]} & \Sexpr{C[6,8]} & \Sexpr{C[7,7]} & \Sexpr{C[7,8]} & \Sexpr{C[8,7]} & \Sexpr{C[8,8]} & \Sexpr{C[5,7]} & \Sexpr{C[5,8]} \\
               & Goto   & \Sexpr{C[6,9]} & \Sexpr{C[6,10]}& \Sexpr{C[7,9]} & \Sexpr{C[7,10]}& \Sexpr{C[8,9]} & \Sexpr{C[8,10]}& \Sexpr{C[5,9]} & \Sexpr{C[5,10]}\\
               & MKL    & \Sexpr{C[6,11]}& \Sexpr{C[6,12]}& \Sexpr{C[7,11]}& \Sexpr{C[7,12]}& \Sexpr{C[8,11]}& \Sexpr{C[8,12]}& \Sexpr{C[5,11]}& \Sexpr{C[5,12]}\\
      \bottomrule
    \end{tabular*}
  \end{center}
  \caption{Comparison of intercepts $\alpha$ and slopes $\beta$ in log/log analysis of BLAS performance.}
  \label{tab:logloganalysis}
\end{table}

The results in Table~\ref{tab:logloganalysis} as well as the visualization of
different slope estimates in Figure~\ref{fig:slopeloglog}, formalise the
rank-ordering of BLAS implementations seen in the analysis of the individual
charts above.

\begin{itemize}
\item Unsurprisingly, Reference BLAS are clearly dominated by all other
  alternatives and exhibit the highest increase in elapsed time per increase in
  matrix dimension.
\item Single-threaded Atlas is clearly improving on the Reference
  BLAS, but is dominated by the multi-threaded libraries and
  the GPU-based solution; the QR decomposition is an exception where both Atlas
  variants and the MKL are comparable and just behind Goto.
\item Multi-threaded Atlas39 is roughly comparable to the MKL (and ahead for
  matrix multiplication) and just behind Goto BLAS.
\item The MKL BLAS perform well, generally on par with or ahead of Atlas39
  but trailing Goto BLAS for two of the four algorithms.
\item Goto BLAS are ahead for the QR and SVD tests, and on par or behind
  for matrix multiplication and LU decompositions.
\item GPU computing is seen to have the lowest slope, corresponding to the lowest cost
  per additional increase in matrix dimension---but this has to be balanced
  against the cost for small to medium sizes which can be seen from the
  corresponding analysis for intercepts (not shown but available in the
  package).
\end{itemize}

\begin{figure}[t!]
  \centering
<<loglogslopes,fig=TRUE,height=4.25,width=11>>=
figure_LogLogSlopes()
@
\caption{Comparison of slopes in log/log analysis of BLAS performance.}
\label{fig:slopeloglog}
\end{figure}


% \subsection{Magma: GPU and BLAS combined}

% \pkg{Magma}, as a hybrid system comprising both CPU and GPU-based computing, has a
% lot of potential for improving over solutions using just one of the
% processing units. In this section we are trying to measure just how much of
% this potential is already realised with the early versions of Magma.

% \subsubsection{Performance of magma and BLAS relative to BLAS}

% \begin{figure}[t!]
%   \centering
% < <magmammqr,fig=TRUE,height=5.25,width=11> >=
% figure_magma_MatMult_QR()
% @
% \caption{Matrix multiplication and QR decomposition with Magma}
% \end{figure}

% This chart illustrates a significant improvement for matrix multiplication in
% the case of the single-threaded Atlas libraries.  As matrix size increases,
% the ratio increases and approaches 3.5 indicating a 3.5-times speedup for
% standard Atlas when combined with \pkg{magma}.  However, for the
% already-multithreaded BLAS, the opposite effect is realized: combined
% performance is actually \textsl{slower} in the hybrid case. This result seems
% invariant to the choice of multi-threaded BLAS as all three implementations perform similarly
% poorly when combined with \pkg{magma}.  The other noteworthy effect is the
% variability of results as $N$ approaches 1000.

% Attempts to run a QR decomposition resulted in segmentation fault on the card
% used in these tests (Quadro FX 4800) once the matrix dimensions exceeded the
% (relatively modest) size of 64.\footnote{The same library and operating
%   system combination worked correctly on a newer card (Fermi C 1060) for
%   Brian Smith (personal communication). However, we were unable to test on
%   that platform.}  This has been reported to the \pkg{Magma} development
% team; hopefully a future library version will allow these tests to be
% performed.

% \begin{figure}[t!]
%   \centering
% < <magmasvdlu,fig=TRUE,height=5.25,width=11> >=
% figure_magma_SVD_LU()
% @
% \caption{SVD and LU decompositions with Magma}
% \end{figure}

% For the SVD decomposition, we see strong variability for all four BLAS
% libraries tested along with Magma. An initial performance penalty (below
% $N=500$) gets outweighed by a performance gain up to $N=1000$. However, once
% the matrix dimension increases, performance becomes indistinguishable from the
% use of just the accelerated BLAS. We should note this case differs from the
% others as \pkg{Magma} has not yet implemented an algorithm for the SVD. In
% that sense, this test provides an illustration of the potential cost to naive
% users and it is in fact an acceptable result that the ratio generally
% approaches one.

% For LU decomposition, we experience similar difficulties as for QR
% decomposition once the matrix dimension exceeded 2000.  For smaller matrices,
% results are promising for the single-threaded Atlas---similar to the case of matrix
% multiplication---and unimpressive for Goto and MKL.  In all cases small sizes
% lead to performance penalties. For medium-sized matrices Atlas39 manages a
% ratio of just above 1.0 indicating a small gain.  The opposite is true for
% Goto and MKL.

\section[Summary]{Summary}

We present benchmarking results comparing five different BLAS implementations
as well as a GPU-based approach for four different matrix computations on two
different hardware platforms.  We find reference BLAS to be dominated in all
cases. Single-threaded Atlas BLAS improves on the reference BLAS but loses to
multi-threaded BLAS.  Among the multi-threaded BLAS, we find the Goto BLAS dominate
the Intel MKL, with the single exception of the QR decomposition on the
Xeon-based system, which may reveal an error. The development version of
Atlas, when compiled in multi-threaded mode, is competitive with both Goto
BLAS and the MKL.  GPU computing is found to be compelling only for very
large matrix sizes.

Our benchmarking framework in the \pkg{gcbd} package can be employed by
others through the \proglang{R} packaging system, which could lead to a wider
set of benchmark results. These results could be helpful for next-generation
systems which may need to make heuristic choices about when to compute on the
CPU and when to compute on the GPU. A strong empirical basis may make these
heuristics more robust.

\section*{Computational details}

The results in this paper were obtained using \proglang{R}
\Sexpr{paste(R.Version()[6:7], collapse = ".")} with the packages
\pkg{RSQLite}, \pkg{DBI}, \pkg{getopt} and the now-archived package
\pkg{gputools}. \proglang{R} itself and all packages used are available from
CRAN at \url{http://CRAN.R-project.org/}. 

The following Ubuntu packages were used to provide the different BLAS
implementations: \pkg{libblas} 1.2-build1, \pkg{liblapack3gf} 3.2.1-2,
\pkg{libatlas3gf-base} 3.6.0-24ubuntu, \pkg{revolution-mkl} 3.0.0-1ubuntu1
(containing MKL 10.2 dated March 2009),  and \pkg{gotoblas2} 1.13-1 which was
built using \pkg{gotoblas2-helper} 0.1-12 by \citet{Nakama:2010}. Apart from
\pkg{gotoblas2-helper} and \pkg{gotoblas2}, all these packages are available
via every Ubuntu mirror.

\section*{Disclaimers}

NVidia provided the Quadro FX 4800 GPU card  for evaluation / research
purposes.  REvolution Computing (now Revolution Analytics) employed the
author as a paid consultant for the creation of the \pkg{revolution-mkl}
package and integration of REvolution R into Ubuntu 9.10.

\bibliography{gcbd}

\end{document}