The RAthena package aims to make it easier to work with data stored in AWS Athena. RAthena package attempts to provide three levels of interacting with AWS Athena:
AWS Athena backend utilising the AWS SDK paws. This includes configuring AWS Athena Work Groups to assuming different roles within AWS when connecting to AWS Athena.RAthena, by providing a DBI interface to AWS Athena. Users are able to interact with AWS Athena utilising familiar functions and methods they have used for other Databases from R.dplyr is coming more popular, RAthena aims to give dplyr a seamless interface into AWS Athena.RAthena:As RAthena utilising the python AWS SDK boto3, Python 3+ is required. Please install Python 3+ either by Python or Python Anaconda. To install RAthena:
# cran version
install.packages("RAthena")
# Dev version
remotes::install_github("dyfanjones/RAthena")Next is to install Python boto3. This can be done either by RAthena’s installation method:
Or pip method:
pip install boto3
If RAthena doesn’t pick up boto3 after using install_boto(), please consider specifying the python environment.install_boto() creates RAthena environment. This is either a Python virtual environment or a conda environment depending on your system.
library(DBI)
# Specify python conda environment and force reticulate to use it
reticulate::use_condaenv("RAthena", required = TRUE)
# Or specify python virtual environment and force reticulate to use it
reticulate::use_virtualenv("RAthena", required = TRUE)
con <- dbConnect(RAthena::athena())Note: Python environments are not required if boto3 is either in the root Python or if R and Python are in their own environment (for example conda environment).
library(DBI)
con <- dbConnect(RAthena::athena())
# Get metadata
dbGetInfo(con)
# $profile_name
# [1] "default"
#
# $s3_staging
# [1] ######## NOTE: Please don't share your S3 bucket to the public
#
# $dbms.name
# [1] "default"
#
# $work_group
# [1] "primary"
#
# $poll_interval
# NULL
#
# $encryption_option
# NULL
#
# $kms_key
# NULL
#
# $expiration
# NULL
#
# $region_name
# [1] "eu-west-1"
#
# $boto3
# [1] "1.11.5"
#
# $RAthena
# [1] "1.7.1"
# create table to AWS Athena
dbWriteTable(con, "iris", iris)
dbGetQuery(con, "select * from iris limit 10")
# Info: (Data scanned: 860 Bytes)
# sepal_length sepal_width petal_length petal_width species
# 1: 5.1 3.5 1.4 0.2 setosa
# 2: 4.9 3.0 1.4 0.2 setosa
# 3: 4.7 3.2 1.3 0.2 setosa
# 4: 4.6 3.1 1.5 0.2 setosa
# 5: 5.0 3.6 1.4 0.2 setosa
# 6: 5.4 3.9 1.7 0.4 setosa
# 7: 4.6 3.4 1.4 0.3 setosa
# 8: 5.0 3.4 1.5 0.2 setosa
# 9: 4.4 2.9 1.4 0.2 setosa
# 10: 4.9 3.1 1.5 0.1 setosalibrary(dplyr)
athena_iris <- tbl(con, "iris")
athena_iris %>%
select(species, sepal_length, sepal_width) %>%
head(10) %>%
collect()
# Info: (Data scanned: 860 Bytes)
# # A tibble: 10 x 3
# species sepal_length sepal_width
# <chr> <dbl> <dbl>
# 1 setosa 5.1 3.5
# 2 setosa 4.9 3
# 3 setosa 4.7 3.2
# 4 setosa 4.6 3.1
# 5 setosa 5 3.6
# 6 setosa 5.4 3.9
# 7 setosa 4.6 3.4
# 8 setosa 5 3.4
# 9 setosa 4.4 2.9
# 10 setosa 4.9 3.1