morpheus.io.utils#

IO utilities.

Functions

cudf_string_cols_exceed_max_bytes(df, ...)

Checks a cudf DataFrame for string columns that exceed a maximum number of bytes and thus need to be truncated by calling truncate_string_cols_by_bytes.

filter_null_data(x[, column_name])

Filters out null rows in a dataframe's 'data' column if it exists.

get_csv_reader()

Return the appropriate CSV reader based on the execution mode.

get_json_reader()

Return the appropriate JSON reader based on the execution mode.

get_parquet_reader()

Return the appropriate Parquet reader based on the execution mode.

truncate_string_cols_by_bytes(df, ...[, ...])

Truncates all string columns in a dataframe to a maximum number of bytes.

cudf_string_cols_exceed_max_bytes(df, column_max_bytes)[source]#

Checks a cudf DataFrame for string columns that exceed a maximum number of bytes and thus need to be truncated by calling truncate_string_cols_by_bytes.

This function uses the cudf Series.str.byte_count() method, which pandas lacks, to check byte lengths up front and so avoid a costly, unnecessary call to truncate_string_cols_by_bytes.

Parameters:
df: DataFrameType

The dataframe to check.

column_max_bytes: dict[str, int]

A mapping of string column names to the maximum number of bytes for each column.

Returns:
bool

True if truncation is needed, False otherwise.
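To illustrate what "exceeds a maximum number of bytes" means, here is a minimal sketch in plain Python. It is not the Morpheus implementation: the helper name `any_column_exceeds_max_bytes` is hypothetical, a dict of lists stands in for the dataframe, and UTF-8 encoding is assumed. The point it shows is that byte length and character count differ for non-ASCII text.

```python
def any_column_exceeds_max_bytes(columns, column_max_bytes):
    """Return True if any value in a listed column is longer than its byte limit.

    `columns` is a plain dict of lists used as a stand-in for a dataframe
    (an assumption for this sketch, not the Morpheus data type).
    """
    for name, max_bytes in column_max_bytes.items():
        for value in columns.get(name, []):
            # Skip nulls; measure the UTF-8 encoded length, not the character count.
            if value is not None and len(value.encode("utf-8")) > max_bytes:
                return True
    return False


columns = {"name": ["ana", "zoë"], "city": ["Oslo", None]}

# "zoë" has 3 characters but occupies 4 bytes in UTF-8, so a 3-byte limit is exceeded.
print(any_column_exceeds_max_bytes(columns, {"name": 3}))
print(any_column_exceeds_max_bytes(columns, {"name": 4, "city": 4}))
```

A check like this is cheap compared with rewriting the columns, which is why it is useful to run before deciding whether truncation is needed at all.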

filter_null_data(x, column_name='data')[source]#

Filters out null rows in a dataframe’s ‘data’ column if it exists.

Parameters:
x: DataFrameType

The dataframe to fix.

column_name: str, default ‘data’

The column name to filter on.
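The filtering behavior described above can be sketched in plain Python. This is an illustration only, not the Morpheus code: the helper `drop_null_rows` is hypothetical, and a list of dicts stands in for the dataframe. It shows the two cases the description implies: rows with a null in the column are dropped, and input without that column is left alone.

```python
def drop_null_rows(rows, column_name="data"):
    """Return the rows whose `column_name` value is not None.

    `rows` is a list of dicts used as a stand-in for a dataframe
    (an assumption for this sketch). If no row has the column,
    the input is returned unchanged.
    """
    if not any(column_name in row for row in rows):
        return rows
    return [row for row in rows if row.get(column_name) is not None]


rows = [{"data": "a"}, {"data": None}, {"data": "b"}]
print(drop_null_rows(rows))                 # the middle row is dropped
print(drop_null_rows([{"other": None}]))    # no 'data' column, so nothing is filtered
```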

get_csv_reader(
selector: Literal['cudf', 'pandas'],
) → Callable[..., DataFrameType][source]#
get_csv_reader(
selector: ExecutionMode,
) → Callable[..., DataFrameType]

Return the appropriate CSV reader based on the execution mode.

get_json_reader(
selector: Literal['cudf', 'pandas'],
) → Callable[..., DataFrameType][source]#
get_json_reader(
selector: ExecutionMode,
) → Callable[..., DataFrameType]

Return the appropriate JSON reader based on the execution mode.

get_parquet_reader(
selector: Literal['cudf', 'pandas'],
) → Callable[..., DataFrameType][source]#
get_parquet_reader(
selector: ExecutionMode,
) → Callable[..., DataFrameType]

Return the appropriate Parquet reader based on the execution mode.
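The CSV, JSON, and Parquet getters share one shape: a selector goes in and a reader callable comes out. The sketch below shows one way such a dispatch could look. It is an assumption-laden illustration, not the Morpheus source: `resolve_reader_name` and `load_reader` are hypothetical helpers, only the string selectors 'cudf' and 'pandas' are handled (the ExecutionMode overload is left out), and the reader names are the `read_csv`, `read_json`, and `read_parquet` functions that both pandas and cudf provide.

```python
import importlib

# Reader function name per file format; both pandas and cudf expose these names.
_READER_NAMES = {"csv": "read_csv", "json": "read_json", "parquet": "read_parquet"}


def resolve_reader_name(selector, file_format):
    """Map a selector and file format to a (module name, function name) pair."""
    if selector not in ("cudf", "pandas"):
        raise ValueError(f"Unknown selector: {selector!r}")
    return selector, _READER_NAMES[file_format]


def load_reader(selector, file_format):
    """Import the selected library lazily and return its reader callable."""
    module_name, func_name = resolve_reader_name(selector, file_format)
    # Importing here means cudf is only required when it is actually selected.
    module = importlib.import_module(module_name)
    return getattr(module, func_name)


print(resolve_reader_name("pandas", "csv"))
# With pandas installed, load_reader("pandas", "csv") would return pandas.read_csv.
```

Returning a callable rather than reading the file directly lets the caller pass whatever reader-specific keyword arguments it needs.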

truncate_string_cols_by_bytes(
df,
column_max_bytes,
warn_on_truncate=True,
)[source]#

Truncates all string columns in a dataframe to a maximum number of bytes. This operation is performed in-place on the dataframe.

Parameters:
df: DataFrameType

The dataframe to truncate.

column_max_bytes: dict[str, int]

A mapping of string column names to the maximum number of bytes for each column.

warn_on_truncate: bool, default True

Whether to log a warning when truncating a column.

Returns:
bool

True if truncation was performed, False otherwise.
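Truncating by bytes rather than by characters has one subtlety: a cut can land in the middle of a multi-byte character. The sketch below illustrates in-place truncation with a boolean result in plain Python. It is not the Morpheus implementation: `truncate_utf8` and `truncate_columns_in_place` are hypothetical names, a dict of lists stands in for the dataframe, UTF-8 is assumed, and dropping a trailing partial character is this sketch's own choice.

```python
def truncate_utf8(value, max_bytes):
    """Shorten `value` so its UTF-8 encoding is at most `max_bytes` long."""
    encoded = value.encode("utf-8")
    if len(encoded) <= max_bytes:
        return value
    # errors="ignore" drops a partial multi-byte character left at the cut point.
    return encoded[:max_bytes].decode("utf-8", errors="ignore")


def truncate_columns_in_place(columns, column_max_bytes):
    """Truncate the listed columns in place; return True if anything changed.

    `columns` is a dict of lists standing in for a dataframe (an assumption).
    """
    truncated = False
    for name, max_bytes in column_max_bytes.items():
        values = columns.get(name)
        if values is None:
            continue  # column not present, nothing to do
        for i, value in enumerate(values):
            if value is None:
                continue  # leave nulls untouched
            shortened = truncate_utf8(value, max_bytes)
            if shortened != value:
                values[i] = shortened
                truncated = True
    return truncated


columns = {"name": ["ana", "zoë"]}
print(truncate_columns_in_place(columns, {"name": 3}))
print(columns)  # "zoë" (4 bytes) becomes "zo"; "ana" (3 bytes) is unchanged
```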