The hardware and bandwidth for this mirror is donated by dogado GmbH, the Webhosting and Full Service-Cloud Provider. Check out our Wordpress Tutorial.
If you wish to report a bug, or if you are interested in having us mirror your free-software or open-source project, please feel free to contact us at mirror[@]dogado.de.
The goal of nanoarrow is to provide minimal useful bindings to the Arrow C Data and Arrow C Stream interfaces using the nanoarrow C library.
You can install the released version of nanoarrow from CRAN with:
install.packages("nanoarrow")
You can install the development version of nanoarrow from GitHub with:
# install.packages("remotes")
::install_github("apache/arrow-nanoarrow/r") remotes
If you can load the package, you’re good to go!
library(nanoarrow)
The Arrow C Data and Arrow C Stream interfaces are comprised of three
structures: the ArrowSchema
which represents a data type of
an array, the ArrowArray
which represents the values of an
array, and an ArrowArrayStream
, which represents zero or
more ArrowArray
s with a common ArrowSchema
.
All three can be wrapped by R objects using the nanoarrow R package.
Use infer_nanoarrow_schema()
to get the ArrowSchema
object that corresponds to a given R vector type; use
as_nanoarrow_schema()
to convert an object from some other
data type representation (e.g., an arrow R package DataType
like arrow::int32()
); or use na_XXX()
functions to construct them.
infer_nanoarrow_schema(1:5)
#> <nanoarrow_schema int32>
#> $ format : chr "i"
#> $ name : chr ""
#> $ metadata : list()
#> $ flags : int 2
#> $ children : list()
#> $ dictionary: NULL
as_nanoarrow_schema(arrow::schema(col1 = arrow::float64()))
#> <nanoarrow_schema struct>
#> $ format : chr "+s"
#> $ name : chr ""
#> $ metadata : list()
#> $ flags : int 0
#> $ children :List of 1
#> ..$ col1:<nanoarrow_schema double>
#> .. ..$ format : chr "g"
#> .. ..$ name : chr "col1"
#> .. ..$ metadata : list()
#> .. ..$ flags : int 2
#> .. ..$ children : list()
#> .. ..$ dictionary: NULL
#> $ dictionary: NULL
na_int64()
#> <nanoarrow_schema int64>
#> $ format : chr "l"
#> $ name : chr ""
#> $ metadata : list()
#> $ flags : int 2
#> $ children : list()
#> $ dictionary: NULL
Use as_nanoarrow_array()
to convert an object to an
ArrowArray object:
as_nanoarrow_array(1:5)
#> <nanoarrow_array int32[5]>
#> $ length : int 5
#> $ null_count: int 0
#> $ offset : int 0
#> $ buffers :List of 2
#> ..$ :<nanoarrow_buffer validity<bool>[0][0 b]> ``
#> ..$ :<nanoarrow_buffer data<int32>[5][20 b]> `1 2 3 4 5`
#> $ dictionary: NULL
#> $ children : list()
as_nanoarrow_array(data.frame(col1 = c(1.1, 2.2)))
#> <nanoarrow_array struct[2]>
#> $ length : int 2
#> $ null_count: int 0
#> $ offset : int 0
#> $ buffers :List of 1
#> ..$ :<nanoarrow_buffer validity<bool>[0][0 b]> ``
#> $ children :List of 1
#> ..$ col1:<nanoarrow_array double[2]>
#> .. ..$ length : int 2
#> .. ..$ null_count: int 0
#> .. ..$ offset : int 0
#> .. ..$ buffers :List of 2
#> .. .. ..$ :<nanoarrow_buffer validity<bool>[0][0 b]> ``
#> .. .. ..$ :<nanoarrow_buffer data<double>[2][16 b]> `1.1 2.2`
#> .. ..$ dictionary: NULL
#> .. ..$ children : list()
#> $ dictionary: NULL
You can use as.vector()
or as.data.frame()
to get the R representation of the object back:
<- as_nanoarrow_array(data.frame(col1 = c(1.1, 2.2)))
array as.data.frame(array)
#> col1
#> 1 1.1
#> 2 2.2
Even though at the C level the ArrowArray is distinct from the
ArrowSchema, at the R level we attach a schema wherever possible. You
can access the attached schema using
infer_nanoarrow_schema()
:
infer_nanoarrow_schema(array)
#> <nanoarrow_schema struct>
#> $ format : chr "+s"
#> $ name : chr ""
#> $ metadata : list()
#> $ flags : int 0
#> $ children :List of 1
#> ..$ col1:<nanoarrow_schema double>
#> .. ..$ format : chr "g"
#> .. ..$ name : chr "col1"
#> .. ..$ metadata : list()
#> .. ..$ flags : int 2
#> .. ..$ children : list()
#> .. ..$ dictionary: NULL
#> $ dictionary: NULL
The easiest way to create an ArrowArrayStream is from a list of
arrays or objects that can be converted to an array using
as_nanoarrow_array()
:
<- basic_array_stream(
stream list(
data.frame(col1 = c(1.1, 2.2)),
data.frame(col1 = c(3.3, 4.4))
) )
You can pull batches from the stream using the
$get_next()
method. The last batch will return
NULL
.
$get_next()
stream#> <nanoarrow_array struct[2]>
#> $ length : int 2
#> $ null_count: int 0
#> $ offset : int 0
#> $ buffers :List of 1
#> ..$ :<nanoarrow_buffer validity<bool>[0][0 b]> ``
#> $ children :List of 1
#> ..$ col1:<nanoarrow_array double[2]>
#> .. ..$ length : int 2
#> .. ..$ null_count: int 0
#> .. ..$ offset : int 0
#> .. ..$ buffers :List of 2
#> .. .. ..$ :<nanoarrow_buffer validity<bool>[0][0 b]> ``
#> .. .. ..$ :<nanoarrow_buffer data<double>[2][16 b]> `1.1 2.2`
#> .. ..$ dictionary: NULL
#> .. ..$ children : list()
#> $ dictionary: NULL
$get_next()
stream#> <nanoarrow_array struct[2]>
#> $ length : int 2
#> $ null_count: int 0
#> $ offset : int 0
#> $ buffers :List of 1
#> ..$ :<nanoarrow_buffer validity<bool>[0][0 b]> ``
#> $ children :List of 1
#> ..$ col1:<nanoarrow_array double[2]>
#> .. ..$ length : int 2
#> .. ..$ null_count: int 0
#> .. ..$ offset : int 0
#> .. ..$ buffers :List of 2
#> .. .. ..$ :<nanoarrow_buffer validity<bool>[0][0 b]> ``
#> .. .. ..$ :<nanoarrow_buffer data<double>[2][16 b]> `3.3 4.4`
#> .. ..$ dictionary: NULL
#> .. ..$ children : list()
#> $ dictionary: NULL
$get_next()
stream#> NULL
You can pull all the batches into a data.frame()
by
calling as.data.frame()
or as.vector()
:
<- basic_array_stream(
stream list(
data.frame(col1 = c(1.1, 2.2)),
data.frame(col1 = c(3.3, 4.4))
)
)
as.data.frame(stream)
#> col1
#> 1 1.1
#> 2 2.2
#> 3 3.3
#> 4 4.4
After consuming a stream, you should call the release method as soon as you can. This lets the implementation of the stream release any resources (like open files) it may be holding in a more predictable way than waiting for the garbage collector to clean up the object.
The nanoarrow package implements as_nanoarrow_schema()
,
as_nanoarrow_array()
, and
as_nanoarrow_array_stream()
for most arrow package types.
Similarly, it implements arrow::as_arrow_array()
,
arrow::as_record_batch()
,
arrow::as_arrow_table()
,
arrow::as_record_batch_reader()
,
arrow::infer_type()
, arrow::as_data_type()
,
and arrow::as_schema()
for nanoarrow objects such that you
can pass equivalent nanoarrow objects into many arrow functions and vice
versa.
These binaries (installable software) and packages are in development.
They may not be fully stable and should be used with caution. We make no claims about them.
Health stats visible at Monitor.