Table.to_dask_dataframe

Table.to_dask_dataframe(max_results=None, *, variables=None, progress=True, batch_preprocessor=None, max_parallelization=os.cpu_count()) → dask.DataFrame

Returns a representation of the table as a dask.DataFrame, which can be used for parallel processing and larger-than-memory analysis. The underlying dask dataframe is backed by a Parquet file on disk, so loading a table with this method does not consume significant memory. The Parquet file is stored in your operating system's temp directory, unless the REDIVIS_TMPDIR environment variable is set.
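A minimal sketch of a call, assuming the redivis client is installed and authenticated. The table reference passed to the helper is a placeholder, and the helper name load_table_as_dask is hypothetical. REDIVIS_TMPDIR is set before the client is imported as a precaution, since this page does not say when the variable is read.

```python
import os
import tempfile

# Optional: send the backing Parquet file to a scratch directory of your choice
# instead of the operating system's default temp directory.
scratch_dir = tempfile.mkdtemp(prefix="redivis_")
os.environ["REDIVIS_TMPDIR"] = scratch_dir


def load_table_as_dask(table_reference, variables=None, max_results=None):
    """Hypothetical helper: return a Redivis table as a dask dataframe."""
    # Imported here so the sketch can be read and loaded without the client;
    # calling the helper requires the redivis package and valid credentials.
    import redivis

    table = redivis.table(table_reference)
    return table.to_dask_dataframe(
        max_results=max_results,
        variables=variables,
        progress=False,
    )
```

Calling load_table_as_dask with your own table reference returns the dask dataframe. No rows are pulled into memory until a computation is requested.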

Parameters:

max_results : int, default None
The maximum number of rows to return. If not specified, all rows in the table will be read.

variables : list<str>, default None
A list of variable names to read, improving performance when not all variables are needed. If unspecified, all variables will be included in the returned dataframe. Variable names are case-insensitive, though the names in the results will reflect each variable's true casing. The order of the columns returned will correspond to the order of names in this list.

progress : bool, default True
Whether to show a progress bar.

batch_preprocessor : function, default None
Function used to preprocess the data, invoked for each batch of records as they are initially loaded. This can help reduce the size of the data before it is loaded into a dataframe. The function accepts one argument, a pyarrow.RecordBatch, and must return a pyarrow.RecordBatch or None. If you prefer to work with the data only in a streaming manner, see Table.to_arrow_batch_iterator().

max_parallelization : int, default os.cpu_count()
The maximum number of threads used when loading the table.

Returns:

dask.DataFrame
