Upload

class Upload

Uploads are the interface for bringing data into Redivis. They are associated with a particular table, and can be created on any table belonging to an unreleased version of a dataset. Multiple uploads can be added to a table, in which case they are "stacked" together (equivalent to a SQL union; mixed schemas are supported).
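
For instance, a minimal sketch of stacking two delimited files onto one table (the user, dataset, and file names are hypothetical):

import redivis

table = redivis.user("user_name").dataset("dataset_name", version="next").table("table_name")

# Each upload is stacked onto the table in the unreleased version
with open("2023.csv", "rb") as f:
    table.upload("2023.csv").create(f, type="delimited")
with open("2024.csv", "rb") as f:
    table.upload("2024.csv").create(f, type="delimited")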

Attributes

properties: A dict containing the API resource representation of the upload. This will only be populated after certain methods are called (see below), and will otherwise be None.

Methods

create(data=None, *, type="delimited", transfer_specification=None, delimiter=None, schema=None, metadata=None, has_header_row=True, skip_bad_records=False, has_quoted_newlines=False, quote_character=None, escape_character=None, allow_jagged_rows=False, if_not_exists=False, rename_on_conflict=False, replace_on_conflict=False, remove_on_fail=False, wait_for_finish=True, raise_on_fail=True)

Creates a new upload on a table and sends the provided data. The table must belong to an unreleased version, otherwise the upload will fail. After calling create, the properties attribute will be fully populated.

Parameters:

  • data (file, string): The data to upload. Required unless 1) the upload is of type="stream", in which case data must be omitted (and sent later via upload.insert_rows()); or 2) a transfer_specification is provided. Can be an open file object, a string, or another file-like io stream.

  • type (str): The type of file being uploaded. A list of valid types can be found in the upload.post API documentation. If no type is provided, the type will be inferred based on any file extension in the upload's name, or an error will be thrown if the file extension isn't recognized.

  • transfer_specification (dict<sourceType, sourcePath, identity>): Used for transferring files from an external source, such as S3 or a URL, rather than uploading directly. The values provided should match the specification for transferSpecification in the upload.post payload.

  • schema (list<dict<name, type>>, optional): Only relevant for uploads of type stream. Defines an initial schema that will be validated on subsequent calls to insert_rows(). Takes the form: [{ "name": "var_name", "type": "integer"}, ...]

  • metadata (dict<name, dict<label, description, valueLabels>>, optional): Provide optional metadata for the variables in the file. This parameter is a dict mapping each variable name to the metadata for that variable, which is a dict containing any of "label": str, "description": str, and "valueLabels": {"value": "label"}. Variable names are matched case-insensitively.

  • delimiter (str, optional): Only relevant for delimited type, the character used as the delimiter in the data. If not specified, will be automatically inferred by scanning the first 10MB of the file.

  • has_header_row (bool, default True): Only relevant for delimited type; whether the first row of the data is a header containing variable names.

  • has_quoted_newlines (bool, default False): Only relevant for delimited type. Set to True if there are line breaks within any of the data values in the file, at the tradeoff of substantially reduced import performance.

  • quote_character (str, default None): Only applicable for delimited type. The character used to escape fields that contain the delimiter (most often " for compliant delimited files). If set to None, Redivis will attempt to auto-infer the quote character by scanning the first 10MB of the file.

  • escape_character (str, default None): Only applicable for delimited type. The character that precedes any occurrences of the quote character when it should be treated as its literal value, rather than the start or end of a quote sequence (typically the escape character matches the quote character, but it is sometimes a backslash \). If set to None, Redivis will attempt to auto-infer the escape character by scanning the first 10MB of the file.

  • allow_jagged_rows (bool, default False): Whether to allow rows that have more or fewer columns than the header row. Use caution when setting this to True, as jagged rows often suggest a parsing issue; ignoring those errors could lead to data corruption.

  • if_not_exists (bool, default False): Only create the upload if an upload with this name doesn't already exist; otherwise, the existing upload will be returned.

  • rename_on_conflict (bool, default False): By default, creating an upload with the same name as one that already exists for the particular table + version will raise an error. If set to True, a new upload will be created, with a counter added to its name to ensure name uniqueness across all uploads on the current version of the table. This option will be ignored if if_not_exists == True. Only one of rename_on_conflict and replace_on_conflict may be True.

  • replace_on_conflict (bool, default False): By default, creating an upload with the same name as one that already exists for the particular table + version will raise an error. If set to True, the previous upload with the same name will be deleted, and then this upload will be created. This option will be ignored if if_not_exists == True. Only one of rename_on_conflict and replace_on_conflict may be True.

  • skip_bad_records (bool, default False): Whether to ignore invalid or unparsable records. If False, the upload will fail if it encounters any bad records. If True, the badRecordsCount attribute will be set in the upload properties.

  • remove_on_fail (bool, default False): If True, the upload will automatically be deleted if the import fails.

  • wait_for_finish (bool, default True): If True, wait for the upload to be fully imported before returning. If False, will return as soon as the data has been transferred to Redivis, but before it has been fully validated and processed. When False, remove_on_fail is ignored.

  • raise_on_fail (bool, default True): Whether to raise an exception if the upload fails.

Returns: self
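
For uploads of type "stream", creation might look like the following sketch; the upload name and schema here are hypothetical, and rows are sent afterwards via insert_rows():

# Create an empty streaming upload; data is omitted for type="stream"
upload = table.upload("events").create(
    type="stream",
    schema=[
        {"name": "id", "type": "integer"},
        {"name": "event", "type": "string"},
    ],
    if_not_exists=True,  # reuse an existing upload with this name
)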

delete()

Deletes the upload. Will raise an error if called on an upload belonging to a released version.

Returns: void

exists()

Check whether the upload exists.

Returns: bool

get()

Fetches the upload, after which upload.properties will contain a dict with entries corresponding to the properties on the upload resource definition. Will raise an error if the upload does not exist.

Returns: self
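
A short sketch combining exists() and get(), assuming a table reference as in the examples at the end of this page:

upload = table.upload("data.csv")
if upload.exists():
    upload.get()  # populates upload.properties from the API
    print(upload.properties["name"])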

insert_rows(rows, *, update_schema=False)

Insert rows into the upload. Can only be called on unreleased uploads of type "stream". Should be called at most once per second, per upload; for increased performance try batching as many rows as is reasonable into a single request, up to a limit of 10MB per request.

Parameters:

  • rows (list<dict<varname, val>>): The rows to insert. A list of dicts, with each dict representing a single row, where the keys are the variable names, and the values are the value for that variable in that row. E.g., [{ "var1": 1, "var2": "foo"}, { "var1": None, "var2": "bar" }]

  • update_schema (bool, default False): Whether to automatically update the schema as new rows come in, relaxing variable types and adding new variables. If False, an error will be thrown if any of the rows in the insert request would cause a schema update.

Returns: <insertRows response>
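
Continuing the hypothetical streaming upload created above, a sketch of batching several rows into a single request:

rows = [
    {"id": 1, "event": "start"},
    {"id": 2, "event": "stop"},
]
# Batch as many rows as is reasonable into one call (up to 10MB per request)
upload.insert_rows(rows, update_schema=True)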

list_variables(max_results)

Returns a list of variables associated with the upload. Each entry in the list will be an instance of the Variable class, whose properties are populated with the values on the variable.list resource definition.

Parameters:

  • max_results (int, optional): If specified, will only return up to max_results variables.

Returns: list<class<Variable>>
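
For example, printing each variable's name and type (the keys follow the variable.list resource definition):

for variable in upload.list_variables(max_results=100):
    print(variable.properties["name"], variable.properties["type"])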

list_rows(max_results, *, variables, progress=True)

Returns a list of rows in the upload. Rows are only available for unreleased uploads; calling this method on a released upload will throw an error.

Parameters:

  • max_results (int, optional): The maximum number of rows to return. If not specified, all rows in the upload will be read.

  • variables (list<str>, optional): A list of variable names to read, improving performance when not all variables are needed. If unspecified, all variables will be represented in the returned rows. Variable names are case-insensitive, though the names in the results will reflect the variable's true casing. The order of the columns returned will correspond to the order of names in this list.

  • progress (bool, default True): Whether to show progress bar.
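
A brief sketch reading a subset of variables (the variable names are hypothetical):

# Only var1 and var2 are read; names are matched case-insensitively
for row in upload.list_rows(max_results=100, variables=["var1", "var2"]):
    print(row)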

to_dataframe(max_results, *, variables, geography_variable, progress=True)

Returns a representation of the upload as a dataframe. Only available for unreleased uploads; calling on a released upload will throw an error.

Parameters:

  • max_results (int, optional): The maximum number of rows to return. If not specified, all rows in the upload will be read.

  • variables (list<str>, optional): A list of variable names to read, improving performance when not all variables are needed. If unspecified, all variables will be represented in the returned rows. Variable names are case-insensitive, though the names in the results will reflect the variable's true casing. The order of the columns returned will correspond to the order of names in this list.

  • geography_variable (str, optional): Only relevant if one of your variables is of type "geography". The variable to use as the geopandas geometry. If unset, the first geography variable in the dataframe will be used. If set to None, a normal pandas.DataFrame will be returned instead, with all geography variables stored as their WKT string representation.

  • progress (bool, default True): Whether to show progress bar.

Returns: pandas.DataFrame | geopandas.GeoDataFrame. See the reading data guide for more details.
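
For example, a sketch loading a sample of the upload into a dataframe (variable names hypothetical):

df = upload.to_dataframe(
    max_results=1000,
    variables=["var1", "var2"],
    progress=False,  # suppress the progress bar
)
print(df.head())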

Examples

import redivis

dataset = redivis.user("user_name").dataset("dataset_name", version="next")
table = dataset.table("table_name")

with open("data.csv", "rb") as file:
    upload = table.upload("data.csv").create(
        file, 
        type="delimited",
        remove_on_fail=True,    # Remove the upload if a failure occurs
        wait_for_finish=True,   # Wait for the upload to finish processing
        raise_on_fail=True      # Raise an error on failure
    )
