morpheus.io.utils
IO utilities.
Functions
| cudf_string_cols_exceed_max_bytes | Checks a cudf DataFrame for string columns that exceed a maximum number of bytes and thus need to be truncated by calling truncate_string_cols_by_bytes. |
| filter_null_data | Filters out null rows in a dataframe's 'data' column, if it exists. |
| get_csv_reader | Return the appropriate CSV reader based on the execution mode. |
| get_json_reader | Return the appropriate JSON reader based on the execution mode. |
| get_parquet_reader | Return the appropriate Parquet reader based on the execution mode. |
| truncate_string_cols_by_bytes | Truncates all string columns in a dataframe to a maximum number of bytes. |
- cudf_string_cols_exceed_max_bytes(df, column_max_bytes)
Checks a cudf DataFrame for string columns that exceed a maximum number of bytes and thus need to be truncated by calling truncate_string_cols_by_bytes.
This function uses the cudf Series.str.byte_count() method, which pandas lacks, and can avoid a costly call to truncate_string_cols_by_bytes.
- Parameters:
- df: DataFrameType
The dataframe to check.
- column_max_bytes: dict[str, int]
A mapping of string column names to the maximum number of bytes for each column.
- Returns:
- bool
True if truncation is needed, False otherwise.
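The check depends on byte length rather than character count, and the two differ for non-ASCII text. The following is a minimal pure-Python sketch of that comparison, assuming UTF-8 encoding; it is not the Morpheus implementation. The name `any_value_exceeds_max_bytes` and the plain `dict` of lists standing in for a dataframe are hypothetical and used only for illustration.

```python
def any_value_exceeds_max_bytes(columns: dict[str, list[str]],
                                column_max_bytes: dict[str, int]) -> bool:
    """Return True if any string in a listed column is longer than its byte limit."""
    for name, max_bytes in column_max_bytes.items():
        for value in columns.get(name, []):
            # Compare the UTF-8 encoded size, not the number of characters.
            if len(value.encode("utf-8")) > max_bytes:
                return True
    return False


# "héllo" is 5 characters but 6 bytes in UTF-8, since "é" takes 2 bytes.
columns = {"name": ["héllo", "abc"]}
needs_truncation = any_value_exceeds_max_bytes(columns, {"name": 5})
fits = any_value_exceeds_max_bytes(columns, {"name": 6})
```

With a limit of 5 bytes the check reports that truncation is needed, even though no value has more than 5 characters; with a limit of 6 bytes it does not.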
- filter_null_data(x, column_name='data')
Filters out null rows in a dataframe's 'data' column, if it exists.
- Parameters:
- x: DataFrameType
The dataframe to filter.
- column_name: str, default 'data'
The column name to filter on.
- get_csv_reader(selector: Literal['cudf', 'pandas'])
- get_csv_reader(selector: ExecutionMode)
Return the appropriate CSV reader based on the execution mode.
- get_json_reader(selector: Literal['cudf', 'pandas'])
- get_json_reader(selector: ExecutionMode)
Return the appropriate JSON reader based on the execution mode.
- get_parquet_reader(selector: Literal['cudf', 'pandas'])
- get_parquet_reader(selector: ExecutionMode)
Return the appropriate Parquet reader based on the execution mode.
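The three reader functions share one pattern: a selector, given either as a package name or as an execution mode, picks the dataframe package whose reader is returned. The sketch below shows one way such a dispatch could look. It is an illustration under stated assumptions, not the Morpheus source: the `ExecutionMode` enum here is a local stand-in, the GPU-to-cudf and CPU-to-pandas mapping is assumed, and `resolve_backend` and `get_reader` are hypothetical helper names.

```python
import importlib
from enum import Enum
from typing import Any, Callable, Union


class ExecutionMode(str, Enum):
    """Hypothetical stand-in for the execution mode enum referenced above."""
    GPU = "GPU"
    CPU = "CPU"


def resolve_backend(selector: Union[str, ExecutionMode]) -> str:
    """Map a selector to the name of the dataframe package to use."""
    if isinstance(selector, ExecutionMode):
        # Assumed mapping: GPU execution uses cudf, CPU execution uses pandas.
        return "cudf" if selector is ExecutionMode.GPU else "pandas"
    if selector in ("cudf", "pandas"):
        return selector
    raise ValueError(f"Unsupported selector: {selector!r}")


def get_reader(selector: Union[str, ExecutionMode], kind: str) -> Callable[..., Any]:
    """Return read_csv, read_json, or read_parquet from the chosen package."""
    # Import lazily so that CPU-only environments never need cudf installed.
    module = importlib.import_module(resolve_backend(selector))
    return getattr(module, f"read_{kind}")
```

Under these assumptions, `get_reader("pandas", "csv")` would return `pandas.read_csv`, provided pandas is installed. Resolving the package name separately from importing it keeps the selection logic testable without either package present.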
- truncate_string_cols_by_bytes(df, column_max_bytes, warn_on_truncate=True)
Truncates all string columns in a dataframe to a maximum number of bytes. This operation is performed in-place on the dataframe.
- Parameters:
- df: DataFrameType
The dataframe to truncate.
- column_max_bytes: dict[str, int]
A mapping of string column names to the maximum number of bytes for each column.
- warn_on_truncate: bool, default True
Whether to log a warning when truncating a column.
- Returns:
- bool
True if truncation was performed, False otherwise.
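Truncating by bytes rather than characters raises one subtlety: a cut can land in the middle of a multi-byte character. The following pure-Python sketch shows one way to handle that, assuming UTF-8 encoding. It is not the Morpheus implementation; `truncate_to_max_bytes`, `truncate_columns`, and the `dict` of lists standing in for a dataframe are hypothetical names for illustration.

```python
def truncate_to_max_bytes(value: str, max_bytes: int) -> str:
    """Truncate a string so its UTF-8 encoding is at most max_bytes long."""
    encoded = value.encode("utf-8")
    if len(encoded) <= max_bytes:
        return value
    # errors="ignore" drops a trailing partial multi-byte character, if any.
    return encoded[:max_bytes].decode("utf-8", errors="ignore")


def truncate_columns(columns: dict[str, list[str]],
                     column_max_bytes: dict[str, int]) -> bool:
    """Truncate listed columns in place; return True if anything changed."""
    performed = False
    for name, max_bytes in column_max_bytes.items():
        values = columns.get(name)
        if values is None:
            continue
        for i, value in enumerate(values):
            shortened = truncate_to_max_bytes(value, max_bytes)
            if shortened != value:
                values[i] = shortened
                performed = True
    return performed


columns = {"name": ["héllo", "abc"]}
changed = truncate_columns(columns, {"name": 2})
changed_again = truncate_columns(columns, {"name": 2})
```

In this sketch a 2-byte limit cuts "héllo" inside the 2-byte "é", so the partial character is dropped and only "h" remains. The second call finds nothing left to shorten and returns False, mirroring the boolean return value described above.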