The Apache Arrow C++ library provides rich, powerful features for working with columnar data. The arrow R package provides both a low-level interface to the C++ library and some higher-level, R-flavored tools for working with it. This vignette provides an overview of how the pieces fit together, and it describes the conventions that the classes and methods follow in R.
The arrow package lets you work efficiently with large, multi-file datasets using dplyr methods. See vignette("dataset", package = "arrow") for an overview.
arrow provides some simple functions for using the Arrow C++ library to read and write files. These functions drop into your normal R workflow without requiring any knowledge of the Arrow C++ library, and their names and arguments follow the conventions of popular R packages, particularly readr. The readers return data.frames (or, if you use the tibble package, objects that act like tbl_dfs), and the writers take data.frames.
Importantly, arrow provides basic read and write support for the Apache Parquet columnar data file format.
library(arrow)
df <- read_parquet("path/to/file.parquet")
Just as you can read, you can write Parquet files:
write_parquet(df, "path/to/different_file.parquet")
The arrow package also includes a faster and more robust implementation of the Feather file format, providing read_feather() and write_feather(). This implementation depends on the same underlying C++ library as the Python version does, resulting in more reliable and consistent behavior across the two languages, as well as improved performance. arrow also by default writes the Feather V2 format (the Arrow IPC file format), which supports a wider range of data types, as well as compression.
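As a minimal sketch of the readers and writers described above, the following round-trips a small data.frame through both Parquet and Feather. The example data and the temporary file paths are assumptions made for illustration, and the sketch assumes an arrow build with Parquet support.

```r
library(arrow)

# Hypothetical example data; any data.frame would do.
df <- data.frame(id = 1:3, label = c("a", "b", "c"))

# Temporary files keep the sketch away from real paths.
parquet_path <- tempfile(fileext = ".parquet")
feather_path <- tempfile(fileext = ".feather")

# The writers take a data.frame and a destination.
write_parquet(df, parquet_path)
write_feather(df, feather_path)

# The readers return data.frame-like objects.
from_parquet <- read_parquet(parquet_path)
from_feather <- read_feather(feather_path)
```

Both readers should give back the same columns that were written, so either format can serve as a drop-in replacement for other on-disk formats in an R workflow.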
For CSV and line-delimited JSON, there are read_csv_arrow() and read_json_arrow(), respectively. read_csv_arrow() currently has fewer parsing options than some other readers, so it may not handle every CSV format variation found in the wild. For the files it can read, however, it is often significantly faster than other R CSV readers, such as base::read.csv, readr::read_csv, and data.table::fread.
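A short sketch of the CSV reader, assuming a small hypothetical file written with base R. It reads the file once as a data.frame and once as an Arrow Table by passing as_data_frame = FALSE.

```r
library(arrow)

# Write a small hypothetical CSV file with base R.
csv_path <- tempfile(fileext = ".csv")
write.csv(
  data.frame(x = 1:3, y = c("a", "b", "c")),
  csv_path,
  row.names = FALSE
)

# Default: the result is a data.frame-like object.
df <- read_csv_arrow(csv_path)

# Keep the data in Arrow memory instead of converting it to R.
tab <- read_csv_arrow(csv_path, as_data_frame = FALSE)
```

Keeping the result as a Table can be useful when the data will be passed on to other Arrow functions rather than inspected directly in R.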
Using reticulate, arrow lets you share data between R and Python (pyarrow) efficiently, enabling you to take advantage of the vibrant ecosystem of Python packages that build on top of Apache Arrow. See vignette("python", package = "arrow") for details.
The arrow package also provides many lower-level bindings to the C++ library, which enable you to access and manipulate Arrow objects. You can use these to build connectors to other applications and services that use Arrow. One example is Spark: the sparklyr package has support for using Arrow to move data to and from Spark, yielding significant performance gains.
Arrow defines the following classes for representing metadata:
| Class | Description | How to create an instance | 
|---|---|---|
| DataType | attribute controlling how values are represented | functions in help("data-type") | 
| Field | a character string name and a DataType | field(name, type) | 
| Schema | list of Fields | schema(...) | 
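
To illustrate the metadata classes in the table above, here is a minimal sketch that builds a DataType, a Field, and a Schema. The field names x and y are hypothetical.

```r
library(arrow)

# A DataType comes from one of the type functions.
int_type <- int32()

# A Field pairs a name with a DataType.
f <- field("x", int_type)

# A Schema can be built from name = type pairs...
s1 <- schema(x = int32(), y = utf8())

# ...or from Field objects.
s2 <- schema(f, field("y", utf8()))

s1$names
```

Both forms should describe the same two columns, so the choice between them is mostly a matter of convenience.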
Arrow defines the following classes for representing zero-dimensional (scalar), one-dimensional (array/vector-like), and two-dimensional (tabular/data frame-like) data:
| Dim | Class | Description | How to create an instance | 
|---|---|---|---|
| 0 | Scalar | single value and its DataType | Scalar$create(value, type) | 
| 1 | Array | vector of values and its DataType | Array$create(vector, type) | 
| 1 | ChunkedArray | vectors of values and their DataType | ChunkedArray$create(..., type) or alias chunked_array(..., type) | 
| 2 | RecordBatch | list of Arrays with a Schema | RecordBatch$create(...) or alias record_batch(...) | 
| 2 | Table | list of ChunkedArrays with a Schema | Table$create(...), alias arrow_table(...), or arrow::read_*(file, as_data_frame = FALSE) | 
| 2 | Dataset | list of Tables with the same Schema | Dataset$create(sources, schema) or alias open_dataset(sources, schema) | 
Each of these is defined as an R6 class in the arrow R package and corresponds to a class of the same name in the Arrow C++ library. The arrow package provides a variety of R6 and S3 methods for interacting with instances of these classes.
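The following sketch creates one in-memory instance of each data class listed above (Dataset is omitted because it needs files on disk). All values are hypothetical, and the type argument is left out so that arrow infers the types.

```r
library(arrow)

# 0 dimensions: a single value with its type.
sc <- Scalar$create(1L)

# 1 dimension: one contiguous vector of values.
arr <- Array$create(c(1.5, 2.5, NA))

# 1 dimension: several chunks that share one type.
ca <- chunked_array(1:3, 4:6)

# 2 dimensions: named columns of equal length.
rb <- record_batch(x = 1:3, y = c("a", "b", "c"))
tab <- arrow_table(x = 1:3, y = c("a", "b", "c"))

# Each object is an R6 instance that also responds to familiar S3 generics.
length(ca)
dim(tab)
```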
For convenience, the arrow package also defines several synthetic classes that do not exist in the C++ library, including:
- ArrowDatum: inherited by Scalar, Array, and ChunkedArray
- ArrowTabular: inherited by RecordBatch and Table
- ArrowObject: inherited by all Arrow objects

Arrow has a rich data type system that includes direct parallels with R’s data types and much more.
In the tables, entries with a - are not currently implemented.
| R type | Arrow type | 
|---|---|
| logical | boolean | 
| integer | int32 | 
| double (“numeric”) | float64 [1] | 
| character | utf8 [2] | 
| factor | dictionary | 
| raw | uint8 | 
| Date | date32 | 
| POSIXct | timestamp | 
| POSIXlt | struct | 
| data.frame | struct | 
| list [3] | list | 
| bit64::integer64 | int64 | 
| hms::hms | time32 | 
| difftime | duration | 
| vctrs::vctrs_unspecified | null | 
1: float64 and double are the same concept and data type in Arrow C++; however, only float64() is used in arrow because the function double() already exists in base R.
2: If the character vector exceeds 2 GB of strings, it will be converted to a large_utf8 Arrow type.
3: Only lists where all elements are the same type are able to be translated to Arrow list type (which is a “list of” some type).
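One way to check the R-to-Arrow mapping above is to create an Array from an R vector and compare its type with the expected one. The helper same_type() below is a hypothetical convenience function written for this sketch, not part of arrow. Each comparison is expected to be TRUE if the table holds for your version of the package.

```r
library(arrow)

# Hypothetical helper: does converting `x` to an Array yield `expected`?
same_type <- function(x, expected) {
  Array$create(x)$type$Equals(expected)
}

results <- c(
  logical   = same_type(c(TRUE, NA), boolean()),
  integer   = same_type(1:3, int32()),
  double    = same_type(c(1.5, 2), float64()),
  character = same_type(c("a", "b"), utf8()),
  raw       = same_type(as.raw(1:3), uint8()),
  Date      = same_type(as.Date("2020-01-01"), date32())
)
results

# The inferred type can be overridden by passing `type` explicitly.
as_int64 <- Array$create(1:3, type = int64())
as_int64$type
```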
| Arrow type | R type | 
|---|---|
| boolean | logical | 
| int8 | integer | 
| int16 | integer | 
| int32 | integer | 
| int64 | integer [1] | 
| uint8 | integer | 
| uint16 | integer | 
| uint32 | integer [1] | 
| uint64 | integer [1] | 
| float16 | - [2] | 
| float32 | double | 
| float64 | double | 
| utf8 | character | 
| large_utf8 | character | 
| binary | arrow_binary [3] | 
| large_binary | arrow_large_binary [3] | 
| fixed_size_binary | arrow_fixed_size_binary [3] | 
| date32 | Date | 
| date64 | POSIXct | 
| time32 | hms::hms | 
| time64 | hms::hms | 
| timestamp | POSIXct | 
| duration | difftime | 
| decimal | double | 
| dictionary | factor [4] | 
| list | arrow_list [5] | 
| large_list | arrow_large_list [5] | 
| fixed_size_list | arrow_fixed_size_list [5] | 
| struct | data.frame | 
| null | vctrs::vctrs_unspecified | 
| map | arrow_list [5] | 
| union | - [2] | 
1: These integer types may contain values that exceed the range of R’s integer type (32-bit signed integer). When they do, uint32 and uint64 are converted to double (“numeric”) and int64 is converted to bit64::integer64. When all values fit, they are converted to R integer by default. To disable this downcast, so that int64 always yields a bit64::integer64 vector, set options(arrow.int64_downcast = FALSE).
2: Some Arrow data types do not currently have an R equivalent and will raise an error if cast to or mapped to via a schema.
3: arrow*_binary classes are implemented as lists of raw vectors.
4: Due to the limitation of R factors, Arrow dictionary values are coerced to string when translated to R if they are not already strings.
5: arrow*_list classes are implemented as subclasses of vctrs_list_of with a ptype attribute set to what an empty Array of the value type converts to.
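The int64 behavior described in footnote 1 can be sketched as follows. The values are hypothetical, and the sketch assumes the bit64 package is installed, which arrow relies on for 64-bit integers.

```r
library(arrow)

# An int64 Array whose values all fit in R's 32-bit integer range.
small <- Array$create(1:3, type = int64())

# An int64 Array with a value that does not fit in a 32-bit integer.
big <- Array$create(bit64::as.integer64("1099511627776"))

# Default behavior: downcast when every value fits.
downcast_class <- class(as.vector(small))
big_class <- class(as.vector(big))

# Disable the downcast, convert again, then restore the previous option.
old <- options(arrow.int64_downcast = FALSE)
kept_class <- class(as.vector(small))
options(old)

c(downcast = downcast_class, big = big_class, kept = kept_class)
```

Disabling the downcast can be helpful when code downstream needs a stable column class regardless of the values in a particular file.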
Arrow supports custom key-value metadata attached to Schemas. When we convert a data.frame to an Arrow Table or RecordBatch, the package stores any attributes() attached to the columns of the data.frame in the Arrow object’s Schema. These attributes are stored under the “r” key; you can assign additional string metadata under any other key you wish, like x$metadata$new_key <- "new value".
This metadata is preserved when writing the table to Feather or Parquet, and when reading those files into R, or when calling as.data.frame() on a Table/RecordBatch, the column attributes are restored to the columns of the resulting data.frame. This means that custom data types, including haven::labelled, vctrs annotations, and others, are preserved when doing a round-trip through Arrow.
Note that the attributes() stored in $metadata$r are only understood by R. If you write a data.frame with haven columns to a Feather file and read that in Pandas, the haven metadata won’t be recognized there. (Similarly, Pandas writes its own custom metadata, which the R package does not consume.) You are free, however, to define custom metadata conventions for your application and assign any (string) values you want to other metadata keys. For more details, see the documentation for schema().
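A minimal sketch of the metadata round trip described above. The column attribute units and the key new_key are hypothetical names chosen for illustration.

```r
library(arrow)

# A data.frame with a custom attribute on one column (hypothetical).
df <- data.frame(x = 1:3)
attr(df$x, "units") <- "cm"

# Converting to a Table stores the R attributes under the "r" metadata key.
tab <- arrow_table(df)

# Additional string metadata can go under any other key.
tab$metadata$new_key <- "new value"
metadata_keys <- names(tab$metadata)

# Converting back to a data.frame restores the column attributes.
df2 <- as.data.frame(tab)
restored_units <- attr(df2$x, "units")
```

Only R reads the content under the "r" key, so values meant for other languages are better stored as plain strings under keys of your own.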
C++ is an object-oriented language, so the core logic of the Arrow library is encapsulated in classes and methods. In the R package, these classes are implemented as R6 classes, most of which are exported from the namespace.
In order to match the C++ naming conventions, the R6 classes are in TitleCase, e.g. RecordBatch. This makes it easy to look up the relevant C++ implementations in the code or documentation. To simplify things in R, the C++ library namespaces are generally dropped or flattened; that is, where the C++ library has arrow::io::FileOutputStream, it is just FileOutputStream in the R package. One exception is for the file readers, where the namespace is necessary to disambiguate. So arrow::csv::TableReader becomes CsvTableReader, and arrow::json::TableReader becomes JsonTableReader.
Some of these classes are not meant to be instantiated directly; they may be base classes or other kinds of helpers. For those that you should be able to create, use the $create() method to instantiate an object. For example, rb <- RecordBatch$create(int = 1:10, dbl = as.numeric(1:10)) will create a RecordBatch. Many of these factory methods that an R user might most often encounter also have a snake_case alias, in order to be more familiar for contemporary R users. So record_batch(int = 1:10, dbl = as.numeric(1:10)) would do the same as RecordBatch$create() above.
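The equivalence between the R6 factory method and its snake_case alias can be checked directly. This sketch reuses the column names from the example above and compares the two results with the $Equals() method.

```r
library(arrow)

# R6 factory method, matching the C++ class name.
rb1 <- RecordBatch$create(int = 1:10, dbl = as.numeric(1:10))

# snake_case alias with the same arguments.
rb2 <- record_batch(int = 1:10, dbl = as.numeric(1:10))

# Both calls should produce RecordBatches with equal schemas and data.
same <- rb1$Equals(rb2)
same

rb1$schema
```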
The typical user of the arrow R package may never deal directly with the R6 objects. We provide more R-friendly wrapper functions as a higher-level interface to the C++ library. An R user can call read_parquet() without knowing or caring that they’re instantiating a ParquetFileReader object and calling the $ReadTable() method on it. The classes are there and available to the advanced programmer who wants fine-grained control over how the C++ library is used.