morpheus.io.utils
IO utilities.
Functions
cudf_string_cols_exceed_max_bytes(df, ...)
Checks a cudf DataFrame for string columns that exceed a maximum number of bytes and thus need to be truncated by calling truncate_string_cols_by_bytes.
filter_null_data(x[, column_name])
Filters out null rows in a dataframe's 'data' column if it exists.
get_csv_reader(selector)
Return the appropriate CSV reader based on the execution mode.
get_json_reader(selector)
Return the appropriate JSON reader based on the execution mode.
get_parquet_reader(selector)
Return the appropriate Parquet reader based on the execution mode.
truncate_string_cols_by_bytes(df, ...[, ...])
Truncates all string columns in a dataframe to a maximum number of bytes.
- cudf_string_cols_exceed_max_bytes(df, column_max_bytes)
Checks a cudf DataFrame for string columns that exceed a maximum number of bytes and thus need to be truncated by calling truncate_string_cols_by_bytes. This function uses the cudf Series.str.byte_count() method, which pandas lacks, and can avoid a costly call to truncate_string_cols_by_bytes.
- Parameters
- df: DataFrameType
The dataframe to check.
- column_max_bytes: dict[str, int]
A mapping of string column names to the maximum number of bytes for each column.
- Returns
- bool
True if truncation is needed, False otherwise.
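The check above is about encoded bytes rather than characters. The following is a minimal sketch of that idea using pandas as a stand-in, so it runs without cudf or a GPU. The names `utf8_byte_count` and `string_cols_exceed_max_bytes_sketch` are hypothetical and are not part of Morpheus; the real function relies on cudf's `Series.str.byte_count()` instead of the per-value encoding shown here.

```python
import pandas as pd


def utf8_byte_count(value) -> int:
    """Return the UTF-8 encoded length of a string; non-strings (e.g. nulls) count as 0."""
    return len(value.encode("utf-8")) if isinstance(value, str) else 0


def string_cols_exceed_max_bytes_sketch(df: pd.DataFrame, column_max_bytes: dict) -> bool:
    """Hypothetical pandas stand-in: True if any listed column holds an over-long value."""
    for column, max_bytes in column_max_bytes.items():
        byte_counts = df[column].map(utf8_byte_count)
        if (byte_counts > max_bytes).any():
            return True
    return False


# "héllo" is 5 characters but 6 bytes in UTF-8, so a 5-byte limit is exceeded.
frame = pd.DataFrame({"name": ["abc", "héllo"]})
needs_truncation = string_cols_exceed_max_bytes_sketch(frame, {"name": 5})
fits_in_six = not string_cols_exceed_max_bytes_sketch(frame, {"name": 6})
```

A cheap check like this lets a caller skip the truncation pass entirely when every value already fits.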
- filter_null_data(x, column_name='data')
Filters out null rows in a dataframe’s ‘data’ column if it exists.
- Parameters
- x: DataFrameType
The dataframe to filter.
- column_name: str, default ‘data’
The column name to filter on.
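As an illustration of the behavior described above, here is a minimal pandas sketch. The helper name `filter_null_rows_sketch` is hypothetical, and the choice to return the input unchanged when the column is missing is an assumption drawn from the "if it exists" wording, not a confirmed detail of the Morpheus implementation.

```python
import pandas as pd


def filter_null_rows_sketch(df: pd.DataFrame, column_name: str = "data") -> pd.DataFrame:
    """Hypothetical stand-in: drop rows whose `column_name` value is null."""
    if column_name not in df.columns:
        # Assumption: with no such column there is nothing to filter.
        return df
    return df[df[column_name].notna()]


frame = pd.DataFrame({"data": ["a", None, "c"], "id": [1, 2, 3]})
filtered = filter_null_rows_sketch(frame)          # keeps the rows with id 1 and 3
untouched = filter_null_rows_sketch(frame, "nope")  # column absent, frame returned as-is
```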
- get_csv_reader(selector: Literal['cudf', 'pandas']) → Callable[..., DataFrameType]
- get_csv_reader(selector: morpheus.config.ExecutionMode) → Callable[..., DataFrameType]
Return the appropriate CSV reader based on the execution mode.
- get_json_reader(selector: Literal['cudf', 'pandas']) → Callable[..., DataFrameType]
- get_json_reader(selector: morpheus.config.ExecutionMode) → Callable[..., DataFrameType]
Return the appropriate JSON reader based on the execution mode.
- get_parquet_reader(selector: Literal['cudf', 'pandas']) → Callable[..., DataFrameType]
- get_parquet_reader(selector: morpheus.config.ExecutionMode) → Callable[..., DataFrameType]
Return the appropriate Parquet reader based on the execution mode.
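The three reader getters above share one shape: a selector goes in and a callable that produces a dataframe comes out. The sketch below shows that dispatch pattern for the string selectors only. The name `pick_csv_reader_sketch` is hypothetical, the lazy `cudf` import is an assumption made so the example runs on CPU-only machines, and the `ExecutionMode` overload is not covered here.

```python
import io
from typing import Any, Callable

import pandas as pd


def pick_csv_reader_sketch(selector: str) -> Callable[..., Any]:
    """Hypothetical dispatch: map a 'cudf' or 'pandas' selector to a CSV reader."""
    if selector == "pandas":
        return pd.read_csv
    if selector == "cudf":
        # Imported lazily so environments without a GPU never need cudf.
        import cudf
        return cudf.read_csv
    raise ValueError(f"Unknown selector: {selector!r}")


# The caller uses the returned reader the same way regardless of backend.
reader = pick_csv_reader_sketch("pandas")
table = reader(io.StringIO("a,b\n1,2\n3,4\n"))
```

Returning the reader instead of reading the file directly lets calling code stay identical across CPU and GPU execution.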
- truncate_string_cols_by_bytes(df, column_max_bytes, warn_on_truncate=True)
Truncates all string columns in a dataframe to a maximum number of bytes. This operation is performed in-place on the dataframe.
- Parameters
- df: DataFrameType
The dataframe to truncate.
- column_max_bytes: dict[str, int]
A mapping of string column names to the maximum number of bytes for each column.
- warn_on_truncate: bool, default True
Whether to log a warning when truncating a column.
- Returns
- bool
True if truncation was performed, False otherwise.
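To make byte-based truncation concrete, here is a minimal pandas sketch that mirrors the documented signature and return value. The names `truncate_utf8` and `truncate_by_bytes_sketch` are hypothetical. UTF-8 encoding and the choice to drop a multi-byte character that would be cut in half are assumptions of this sketch, not confirmed details of the Morpheus implementation.

```python
import logging

import pandas as pd

logger = logging.getLogger(__name__)


def truncate_utf8(value, max_bytes: int):
    """Cut a string to at most `max_bytes` UTF-8 bytes; non-strings pass through."""
    if not isinstance(value, str):
        return value
    raw = value.encode("utf-8")
    if len(raw) <= max_bytes:
        return value
    # errors="ignore" drops a trailing character that was split by the cut.
    return raw[:max_bytes].decode("utf-8", errors="ignore")


def truncate_by_bytes_sketch(df: pd.DataFrame, column_max_bytes: dict,
                             warn_on_truncate: bool = True) -> bool:
    """Hypothetical stand-in: truncate listed columns in place, report if anything changed."""
    performed_truncation = False
    for column, max_bytes in column_max_bytes.items():
        too_long = df[column].map(
            lambda v: isinstance(v, str) and len(v.encode("utf-8")) > max_bytes)
        if too_long.any():
            if warn_on_truncate:
                logger.warning("Truncating column '%s' to %d bytes", column, max_bytes)
            # Assigning back to the column modifies the caller's dataframe.
            df[column] = df[column].map(lambda v: truncate_utf8(v, max_bytes))
            performed_truncation = True
    return performed_truncation


frame = pd.DataFrame({"name": ["abc", "héllo"], "other": ["x", "y"]})
changed = truncate_by_bytes_sketch(frame, {"name": 2})
unchanged = truncate_by_bytes_sketch(frame, {"name": 10})
```

With a 2-byte limit, "héllo" becomes "h" rather than "hé", because "é" occupies two bytes and only its first byte would fit.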