Table$to_arrow_dataset

Table$to_arrow_dataset(max_results=NULL, variables=NULL, batch_preprocessor=NULL, max_parallelization=parallelly::availableCores()) → Arrow Dataset

Returns an Arrow Dataset representing a table on Redivis. Arrow datasets are backed by files on disk rather than in memory, allowing you to work with a table without loading its contents into memory. The files used by the dataset are stored in your operating system's temp directory.
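
For example, the returned dataset can be queried lazily with dplyr verbs, and only the final result is collected into memory. In the sketch below, the user, dataset, table, and variable names are illustrative:

library(arrow)
library(dplyr)

table <- redivis::user("demo")$dataset("some_dataset")$table("some_table")
arrow_ds <- table$to_arrow_dataset()

# Filtering and selection are evaluated by Arrow against the on-disk files;
# collect() materializes only the matching rows as a tibble
result <- arrow_ds %>%
  filter(year == 2020) %>%
  select(id, year, name) %>%
  collect()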

Since the underlying files for Arrow datasets are stored on the filesystem, you should remove them once you're done to prevent excess disk utilization. The following command removes the temp files associated with the Arrow dataset:

arrow_ds <- redivis_table$to_arrow_dataset()
# ... do work ...
# Remove files to clean up:
sapply(arrow_ds$files, unlink)
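
Since unlink() accepts a character vector of paths, unlink(arrow_ds$files) is an equivalent one-liner.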

Parameters:

max_results : int, default NULL The maximum number of records to load into the Arrow dataset. If not specified, the entire table will be loaded.

variables : character vector, default NULL The specific variables to return, e.g., variables = c("name", "date"). If not specified, all variables in the table will be returned.

batch_preprocessor : function, default NULL Function used to preprocess the data, invoked for each batch of records as it is initially loaded. This can be helpful in reducing the size of the data before the final table is loaded. The function accepts one argument, an Arrow RecordBatch, and must return an Arrow RecordBatch or NULL (see the sketch after this parameter list). If you prefer to work with the data solely in a streaming manner, see Table$to_arrow_batch_reader().

max_parallelization : int, default parallelly::availableCores() The maximum parallelization when loading the table. Uses the future::multicore strategy when supported, falling back to future::multisession if not.
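
For example, a batch_preprocessor can drop unneeded rows from each batch before it is written to disk. The sketch below assumes hypothetical id and score variables and relies on the arrow package's row subsetting of RecordBatch objects:

filter_batch <- function(batch) {
  # batch is an Arrow RecordBatch; keep only rows with score > 0.5
  keep <- which(as.vector(batch$score) > 0.5)
  filtered <- batch[keep, ]
  # Returning NULL drops an empty batch entirely
  if (filtered$num_rows == 0) NULL else filtered
}

arrow_ds <- redivis_table$to_arrow_dataset(
  max_results = 1e6,              # stop after 1,000,000 records
  variables = c("id", "score"),   # only load these variables
  batch_preprocessor = filter_batch,
  max_parallelization = 4
)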

Returns:

Arrow Dataset
