nv_ingest_client.cli.util package

Submodules

nv_ingest_client.cli.util.click module

class nv_ingest_client.cli.util.click.ClientType(*values)

Bases: str, Enum

Enum for specifying client types.

Members (each of type str):
  • REST = 'REST' – Represents a REST client.

  • REDIS = 'REDIS' – Represents a Redis client.

  • KAFKA = 'KAFKA' – Represents a Kafka client.

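Because ClientType derives from both str and Enum, its members can be looked up by value and compare equal to plain strings. The sketch below uses a local stand-in class that mirrors the documented members instead of importing the real one; it illustrates str-mixin enum behavior under that assumption and is not the library's source.

```python
from enum import Enum


class ClientType(str, Enum):
    """Local stand-in mirroring the documented members (not the library class)."""

    REST = "REST"
    REDIS = "REDIS"
    KAFKA = "KAFKA"


# Look up a member by its string value, e.g. when parsing a CLI option.
client = ClientType("REST")

# The str mixin lets members compare equal to plain strings.
is_rest = client == "REST"

# Values that are not members raise ValueError.
try:
    ClientType("GRPC")
    unknown_accepted = True
except ValueError:
    unknown_accepted = False
```

The same pattern applies to any enum declared with str as a mixin base.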
class nv_ingest_client.cli.util.click.LogLevel(*values)

Bases: str, Enum

Enum for specifying logging levels.

Members (each of type str):
  • DEBUG = 'DEBUG' – Debug logging level.

  • INFO = 'INFO' – Informational logging level.

  • WARNING = 'WARNING' – Warning logging level.

  • ERROR = 'ERROR' – Error logging level.

  • CRITICAL = 'CRITICAL' – Critical logging level.

nv_ingest_client.cli.util.click.click_match_and_validate_files(
ctx: Context,
param: Parameter,
value: List[str],
) -> List[str]

Matches and validates files based on the provided file source patterns.

Parameters:
  • ctx (click.Context) – The Click context.

  • param (click.Parameter) – The parameter associated with the file matching option.

  • value (List[str]) – A list of file source patterns to match against.

Returns:

A list of matching file paths. If no files match, an empty list is returned.

Return type:

List[str]
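Pattern matching of this kind can be sketched with the standard glob module. The helper name match_files, the use of recursive globbing, and the de-duplication step are assumptions made for illustration; the documented function may behave differently in those details.

```python
import glob
import os
from typing import List


def match_files(patterns: List[str]) -> List[str]:
    """Expand glob patterns and keep only paths that are regular files."""
    matched: List[str] = []
    for pattern in patterns:
        # recursive=True lets "**" match nested directories.
        for path in glob.glob(pattern, recursive=True):
            # Skip directories and paths already matched by an earlier pattern.
            if os.path.isfile(path) and path not in matched:
                matched.append(path)
    return matched
```

As in the description above, a pattern that matches nothing contributes no entries, so the result can be an empty list.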

nv_ingest_client.cli.util.click.click_validate_batch_size(
ctx: Context,
param: Parameter,
value: int,
) -> int

Validates that the batch size is at least 1.

Parameters:
  • ctx (click.Context) – The Click context.

  • param (click.Parameter) – The parameter associated with the batch size option.

  • value (int) – The batch size value provided.

Returns:

The validated batch size.

Return type:

int

Raises:

click.BadParameter – If the batch size is less than 1.
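The check itself is small enough to sketch without Click installed. The function below keeps the (ctx, param, value) callback shape but raises ValueError as a stand-in for click.BadParameter; the function name and the error message are illustrative assumptions.

```python
def validate_batch_size(ctx, param, value: int) -> int:
    """Click-style callback: return the value unchanged if it is at least 1."""
    if value < 1:
        # The documented function raises click.BadParameter here;
        # ValueError is a stand-in so the sketch has no dependencies.
        raise ValueError("Batch size must be >= 1.")
    return value
```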

nv_ingest_client.cli.util.click.click_validate_file_exists(
ctx: Context,
param: Parameter,
value: str | List[str] | None,
) -> List[str]

Validates that the given file(s) exist.

Parameters:
  • ctx (click.Context) – The Click context.

  • param (click.Parameter) – The parameter associated with the file option.

  • value (Union[str, List[str], None]) – A file path or a list of file paths.

Returns:

A list of validated file paths.

Return type:

List[str]

Raises:

click.BadParameter – If any file path does not exist.
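A minimal sketch of the normalize-then-check pattern follows. It assumes that None yields an empty list, which the description above does not state, and it raises ValueError in place of click.BadParameter so that it runs without Click. The helper name is hypothetical.

```python
import os
from typing import List, Union


def validate_file_exists(value: Union[str, List[str], None]) -> List[str]:
    """Normalize the input to a list and check that every path exists."""
    if value is None:
        # Assumption: no input means nothing to validate.
        return []
    paths = [value] if isinstance(value, str) else list(value)
    for path in paths:
        if not os.path.exists(path):
            # Stand-in for click.BadParameter.
            raise ValueError(f"File does not exist: {path}")
    return paths
```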

nv_ingest_client.cli.util.click.click_validate_task(
ctx: Context,
param: Parameter,
value: List[str],
) -> Dict[str, CaptionTask | DedupTask | EmbedTask | ExtractTask | FilterTask | InfographicExtractionTask | SplitTask | StoreEmbedTask | StoreTask | UDFTask]

Validates and processes task definitions provided as strings.

Each task definition should be in the format “<task_id>:<json_options>”. If the separator ‘:’ is missing, an empty JSON options dictionary is assumed. The function uses a schema check (via check_schema) for validation and instantiates the corresponding task.

Parameters:
  • ctx (click.Context) – The Click context.

  • param (click.Parameter) – The parameter associated with the task option.

  • value (List[str]) – A list of task strings to validate.

Returns:

A dictionary mapping task IDs to their corresponding task objects.

Return type:

Dict[str, TaskType]

Raises:

click.BadParameter – If any task fails validation (including malformed JSON) or if duplicate tasks are detected.
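The string format described above can be illustrated with a small parser. This sketch covers only the splitting, the empty-options default, and duplicate detection; it does not perform the schema check or build task objects, and it raises ValueError rather than click.BadParameter. The helper name and the option keys in the usage example are hypothetical.

```python
import json
from typing import Any, Dict, List


def split_task_strings(values: List[str]) -> Dict[str, Dict[str, Any]]:
    """Map each task ID to its parsed options, rejecting duplicates."""
    tasks: Dict[str, Dict[str, Any]] = {}
    for raw in values:
        # Split on the first ':' only, since the JSON itself contains colons.
        task_id, sep, options_str = raw.partition(":")
        # No separator means no options were supplied.
        options = json.loads(options_str) if sep else {}
        if task_id in tasks:
            raise ValueError(f"Duplicate task detected: {task_id}")
        tasks[task_id] = options
    return tasks


parsed = split_task_strings(['extract:{"document_type": "pdf"}', "dedup"])
```

Splitting with str.partition rather than str.split(":") keeps the JSON payload intact even when it contains nested objects.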

nv_ingest_client.cli.util.click.debug_print_click_options(ctx: Context) -> None

Retrieves all options from the Click context and pretty prints them.

Parameters:

ctx (click.Context) – The Click context object from which to retrieve the command options.

nv_ingest_client.cli.util.click.parse_task_options(
task_id: str,
options_str: str,
) -> Dict[str, Any]

Parse the task options string as JSON.

Parameters:
  • task_id (str) – The identifier of the task for which options are being parsed.

  • options_str (str) – The string containing JSON options.

Returns:

The parsed options as a dictionary.

Return type:

Dict[str, Any]

Raises:

ValueError – If the JSON string is malformed. The error message names the task, gives the error details (e.g., the expected property format), and shows the input that was provided.
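One way to produce an error of this kind is to catch json.JSONDecodeError and re-raise with added context. The sketch below is an assumption about the approach, not the library's implementation; the helper name and the exact message wording are illustrative.

```python
import json
from typing import Any, Dict


def parse_options(task_id: str, options_str: str) -> Dict[str, Any]:
    """Parse options_str as JSON, wrapping decode errors with context."""
    try:
        return json.loads(options_str)
    except json.JSONDecodeError as exc:
        # Include the task, the decoder's message and position, and the raw input.
        raise ValueError(
            f"Invalid JSON for task '{task_id}': {exc.msg} "
            f"(line {exc.lineno}, column {exc.colno}). Input was: {options_str}"
        ) from exc
```

A common trigger is unquoted property names, for example {chunk_size: 300} instead of {"chunk_size": 300}.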

nv_ingest_client.cli.util.click.pre_process_dataset(
dataset_json: str,
shuffle_dataset: bool,
) -> List[str]

Loads a dataset from a JSON file and optionally shuffles the list of files.

Parameters:
  • dataset_json (str) – The path to the dataset JSON file.

  • shuffle_dataset (bool) – Whether to shuffle the dataset before processing.

Returns:

The list of file paths from the dataset. If ‘shuffle_dataset’ is True, the list will be shuffled.

Return type:

List[str]

Raises:

click.BadParameter – If the dataset file is not found or if its contents are not valid JSON.
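The load-and-shuffle behavior can be sketched as follows. The layout of the dataset file is not documented here, so the "files" key is purely hypothetical, as is the helper name; ValueError stands in for click.BadParameter.

```python
import json
import random
from typing import List


def load_dataset_files(dataset_json: str, shuffle_dataset: bool) -> List[str]:
    """Read file paths from a dataset JSON file, optionally shuffled."""
    try:
        with open(dataset_json, "r", encoding="utf-8") as f:
            data = json.load(f)
    except FileNotFoundError as exc:
        # Stand-in for click.BadParameter.
        raise ValueError(f"Dataset file not found: {dataset_json}") from exc
    except json.JSONDecodeError as exc:
        raise ValueError(f"Invalid JSON in dataset file: {dataset_json}") from exc

    # "files" is a hypothetical key; the real dataset layout may differ.
    files = list(data.get("files", []))
    if shuffle_dataset:
        random.shuffle(files)  # shuffles the list in place
    return files
```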

nv_ingest_client.cli.util.processing module#

nv_ingest_client.cli.util.processing.get_valid_filename(name: Any) -> str

Return a sanitized version of the given filename.

This function, adapted from Django (django/django), converts the input string to a form that is safe to use as a filename. It trims leading and trailing spaces, replaces remaining spaces with underscores, and removes any characters that are not alphanumeric, dashes, underscores, or dots.

Parameters:

name (Any) – The input value to be converted into a valid filename. It will be converted to a string.

Returns:

A sanitized string that can be used as a filename.

Return type:

str

Raises:

ValueError – If a valid filename cannot be derived from the input.

Examples

>>> get_valid_filename("john's portrait in 2004.jpg")
'johns_portrait_in_2004.jpg'
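The sanitization steps described above can be sketched with a single regular expression. This is a stand-in written from the description, not a copy of the library function: the name sanitize_filename is hypothetical, and rejecting ".", "..", and the empty string is an assumption about when a valid filename "cannot be derived".

```python
import re
from typing import Any


def sanitize_filename(name: Any) -> str:
    """Trim, replace spaces with underscores, and drop unsafe characters."""
    s = str(name).strip().replace(" ", "_")
    # Keep only word characters (letters, digits, underscore), dashes, and dots.
    s = re.sub(r"[^-\w.]", "", s)
    if s in {"", ".", ".."}:
        raise ValueError(f"Could not derive a file name from {name!r}")
    return s
```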
nv_ingest_client.cli.util.processing.report_overall_speed(
total_pages_processed: int,
start_time_ns: int,
total_files: int,
) -> None

Report the overall processing speed based on the number of pages and files processed.

This function calculates the total elapsed time from the start of processing and reports the throughput in terms of pages and files processed per second.

Parameters:
  • total_pages_processed (int) – The total number of pages processed.

  • start_time_ns (int) – The nanosecond timestamp marking the start of processing.

  • total_files (int) – The total number of files processed.

Notes

The function converts the elapsed time from nanoseconds to seconds and logs the overall throughput.
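The throughput arithmetic can be sketched as a pure function. Unlike the documented function, this version takes an explicit end timestamp and returns the rates instead of logging them; both choices, and the helper name, are assumptions made so that the example is deterministic.

```python
import time
from typing import Tuple


def overall_throughput(
    total_pages: int, total_files: int, start_time_ns: int, end_time_ns: int
) -> Tuple[float, float]:
    """Return (pages per second, files per second) for the elapsed interval."""
    elapsed_s = (end_time_ns - start_time_ns) / 1e9  # nanoseconds -> seconds
    if elapsed_s <= 0:
        # Avoid dividing by zero when no time has elapsed.
        return 0.0, 0.0
    return total_pages / elapsed_s, total_files / elapsed_s


start = time.time_ns()
# Pretend processing took exactly 3 seconds.
pages_per_s, files_per_s = overall_throughput(120, 6, start, start + 3_000_000_000)
```

With 120 pages and 6 files over 3 seconds, the rates are 40 pages and 2 files per second.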

nv_ingest_client.cli.util.processing.report_stage_statistics(
stage_elapsed_times: defaultdict,
total_trace_elapsed: float,
abs_elapsed: float,
) -> None

Reports statistics for each processing stage: the average, median, and total time spent, and each stage's percentage of the total processing time.

Parameters:
  • stage_elapsed_times (defaultdict(list)) – A defaultdict containing lists of elapsed times for each processing stage, in nanoseconds.

  • total_trace_elapsed (float) – The total elapsed time across all processing stages, in nanoseconds.

  • abs_elapsed (float) – The absolute elapsed time from the start to the end of processing, in nanoseconds.

Notes

This function logs the average, median, and total time for each stage, along with the percentage of total computation. It also calculates and logs the unresolved time, if any, that is not accounted for by the recorded stages.
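The per-stage figures can be sketched with the standard statistics module. This version returns a dictionary instead of logging, converts nanoseconds to milliseconds for readability, and assumes that percentages are taken relative to total_trace_elapsed; the helper name and those choices are illustrative assumptions.

```python
from collections import defaultdict
from statistics import mean, median
from typing import Dict, List


def stage_summary(
    stage_elapsed_times: Dict[str, List[float]],
    total_trace_elapsed: float,
    abs_elapsed: float,
) -> Dict[str, Dict[str, float]]:
    """Summarize per-stage timings (ns in, ms out) plus unresolved time."""
    summary: Dict[str, Dict[str, float]] = {}
    for stage, times in stage_elapsed_times.items():
        total = sum(times)
        summary[stage] = {
            "avg_ms": mean(times) / 1e6,
            "median_ms": median(times) / 1e6,
            "total_ms": total / 1e6,
            # Assumed denominator: time summed across all traced stages.
            "percent": 100.0 * total / total_trace_elapsed,
        }
    # Wall-clock time not attributed to any recorded stage.
    summary["unresolved"] = {"total_ms": (abs_elapsed - total_trace_elapsed) / 1e6}
    return summary


timings = defaultdict(list)
timings["extract"] += [1_000_000, 3_000_000]
timings["embed"] += [4_000_000]
result = stage_summary(timings, total_trace_elapsed=8_000_000, abs_elapsed=10_000_000)
```

Here each stage accounts for 50% of the traced time, and 2 ms of wall-clock time remains unresolved.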

nv_ingest_client.cli.util.processing.report_statistics(
start_time_ns: int,
stage_elapsed_times: defaultdict,
total_pages_processed: int,
total_files: int,
) -> None

Aggregate and report statistics for the entire processing session.

This function calculates the absolute elapsed time from the start of processing to the current time and the total time taken by all stages. It then reports detailed stage statistics along with overall processing throughput.

Parameters:
  • start_time_ns (int) – The nanosecond timestamp marking the start of processing.

  • stage_elapsed_times (defaultdict) – A defaultdict where each key is a processing stage (str) and each value is a list of elapsed times (int, in nanoseconds) for that stage.

  • total_pages_processed (int) – The total number of pages processed during the session.

  • total_files (int) – The total number of files processed during the session.

Notes

The function calls report_stage_statistics to log detailed timing information per stage, then calls report_overall_speed to log the overall throughput.
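The inputs to a session-level report can be assembled as shown below. The record helper and the stage names are hypothetical; the sketch only shows how a defaultdict of nanosecond samples yields the two totals that the description above refers to.

```python
import time
from collections import defaultdict

stage_elapsed_times = defaultdict(list)  # stage name -> list of durations in ns


def record(stage: str, start_ns: int, end_ns: int) -> None:
    """Append one elapsed-time sample for a stage."""
    stage_elapsed_times[stage].append(end_ns - start_ns)


session_start_ns = time.time_ns()
record("extract", 0, 2_000_000)
record("extract", 0, 4_000_000)
record("embed", 0, 1_000_000)

# Total time across all stages.
total_trace_elapsed = sum(sum(samples) for samples in stage_elapsed_times.values())
# Absolute wall-clock time since the session started.
abs_elapsed = time.time_ns() - session_start_ns
```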

nv_ingest_client.cli.util.system module#

nv_ingest_client.cli.util.system.configure_logging(logger, log_level: str)

Configures the logging level based on a log_level string.

Parameters:
  • logger (logging.Logger) – The logger to configure.

  • log_level (str) – The logging level as a string, expected to be one of ‘DEBUG’, ‘INFO’, ‘WARNING’, ‘ERROR’, ‘CRITICAL’.
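Mapping a level name to the standard logging constants can be sketched as follows. The helper name, the case-insensitive lookup, the ValueError on unknown names, and the logger name in the usage lines are assumptions; the documented function may handle invalid input differently or configure handlers as well.

```python
import logging

_LEVELS = {
    "DEBUG": logging.DEBUG,
    "INFO": logging.INFO,
    "WARNING": logging.WARNING,
    "ERROR": logging.ERROR,
    "CRITICAL": logging.CRITICAL,
}


def set_level_from_string(logger: logging.Logger, log_level: str) -> None:
    """Set the logger's level from one of the documented level names."""
    try:
        level = _LEVELS[log_level.upper()]
    except KeyError:
        raise ValueError(f"Invalid log level: {log_level}") from None
    logger.setLevel(level)


demo_logger = logging.getLogger("nv_ingest_client.demo")  # hypothetical logger name
set_level_from_string(demo_logger, "WARNING")
```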

Module contents