nv_ingest_api.internal.transform package#
Submodules#
nv_ingest_api.internal.transform.caption_image module#
- nv_ingest_api.internal.transform.caption_image.transform_image_create_vlm_caption_internal(
- df_transform_ledger: DataFrame,
- task_config: BaseModel | Dict[str, Any],
- transform_config: Any,
- execution_trace_log: Dict[str, Any] | None = None,
- ) → DataFrame
Extracts and adds captions for image content in a DataFrame using a vision-language model (VLM) API.
This function updates the ‘metadata’ column for rows where the content type is “image”. It uses configuration values from task_config (or falls back to transform_config defaults) to determine the API key, prompt, endpoint URL, and model name for caption generation. The generated captions are added under the ‘image_metadata.caption’ key in the metadata.
- Parameters:
df_transform_ledger (pd.DataFrame) – The input DataFrame containing image data. Each row must have a ‘metadata’ column with at least the ‘content’ and ‘content_metadata’ keys.
task_config (Union[BaseModel, Dict[str, Any]]) – Configuration parameters for caption extraction. If provided as a Pydantic model, it will be converted to a dictionary. Expected keys include “api_key”, “prompt”, “endpoint_url”, and “model_name”.
transform_config (Any) – A configuration object providing default values for caption extraction. It should have attributes: api_key, prompt, endpoint_url, and model_name.
execution_trace_log (Optional[Dict[str, Any]], default=None) – Optional trace information for debugging or logging purposes.
- Returns:
The updated DataFrame with generated captions added to the ‘image_metadata.caption’ field within the ‘metadata’ column for each image row.
- Return type:
pd.DataFrame
- Raises:
Exception – Propagates any exception encountered during the caption extraction process, with added context.
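A minimal usage sketch, assuming a single image row. The metadata layout, endpoint URL, and model name below are illustrative placeholders, not documented defaults; since transform_config only needs the four documented attributes, a SimpleNamespace stands in for the real configuration object:

```python
from types import SimpleNamespace

import pandas as pd

from nv_ingest_api.internal.transform.caption_image import (
    transform_image_create_vlm_caption_internal,
)

# One ledger row; the exact metadata layout is an assumption based on the
# documented 'content' / 'content_metadata' requirements.
ledger = pd.DataFrame(
    [
        {
            "metadata": {
                "content": "<base64-encoded image>",    # placeholder payload
                "content_metadata": {"type": "image"},  # marks the row as image content
                "image_metadata": {},                   # caption is written here
            }
        }
    ]
)

# Stand-in defaults with the four documented attributes; endpoint and model
# are assumed values for illustration only.
defaults = SimpleNamespace(
    api_key="YOUR_API_KEY",
    prompt="Caption the content of this image:",
    endpoint_url="https://vlm.example.com/v1/chat/completions",
    model_name="example/vlm-model",
)

result_df = transform_image_create_vlm_caption_internal(
    df_transform_ledger=ledger,
    task_config={},  # empty: fall back to transform_config defaults
    transform_config=defaults,
)
print(result_df.loc[0, "metadata"]["image_metadata"].get("caption"))
```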
nv_ingest_api.internal.transform.embed_text module#
- nv_ingest_api.internal.transform.embed_text.transform_create_text_embeddings_internal(
- df_transform_ledger: DataFrame,
- task_config: Dict[str, Any],
- transform_config: TextEmbeddingSchema = TextEmbeddingSchema(api_key='api_key', batch_size=4, embedding_model='nvidia/llama-3.2-nv-embedqa-1b-v2', embedding_nim_endpoint='http://embedding:8000/v1', encoding_format='float', httpx_log_level=LogLevel.WARNING, input_type='passage', raise_on_failure=False, truncate='END'),
- execution_trace_log: Dict | None = None,
- ) → Tuple[DataFrame, Dict] [source]#
Generates text embeddings for supported content types (TEXT, STRUCTURED, IMAGE, AUDIO) from a pandas DataFrame using asynchronous requests.
This function ensures that even if the extracted content is empty or None, the embedding field is explicitly created and set to None.
- Parameters:
df_transform_ledger (pd.DataFrame) – The DataFrame containing content for embedding extraction.
task_config (Dict[str, Any]) – Dictionary containing task properties (e.g., filter error flag).
transform_config (TextEmbeddingSchema, optional) – Validated configuration for text embedding extraction.
execution_trace_log (Optional[Dict], optional) – Optional trace information for debugging or logging (default is None).
- Returns:
- A tuple containing:
The updated DataFrame with embeddings applied.
A dictionary with trace information.
- Return type:
Tuple[pd.DataFrame, Dict]
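A minimal usage sketch, assuming a single text row; where the embedding lands in the row (the ‘embedding’ field of ‘metadata’) is inferred from the description above, and the schema import path comes from the signature:

```python
import pandas as pd

from nv_ingest_api.internal.schemas.transform.transform_text_embedding_schema import (
    TextEmbeddingSchema,
)
from nv_ingest_api.internal.transform.embed_text import (
    transform_create_text_embeddings_internal,
)

# One text row; the row shape is illustrative.
ledger = pd.DataFrame(
    [
        {
            "document_type": "text",
            "metadata": {
                "content": "NV-Ingest extracts and transforms document content.",
                "content_metadata": {"type": "text"},
            },
        }
    ]
)

# Override only the fields that differ from the defaults in the signature.
config = TextEmbeddingSchema(
    api_key="YOUR_API_KEY",
    embedding_nim_endpoint="http://embedding:8000/v1",
)

df_out, trace_info = transform_create_text_embeddings_internal(
    df_transform_ledger=ledger,
    task_config={},  # no per-task overrides in this sketch
    transform_config=config,
)
print(df_out.loc[0, "metadata"].get("embedding"))
```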
nv_ingest_api.internal.transform.split_text module#
- nv_ingest_api.internal.transform.split_text.transform_text_split_and_tokenize_internal(
- df_transform_ledger: DataFrame,
- task_config: Dict[str, Any],
- transform_config: TextSplitterSchema,
- execution_trace_log: Dict[str, Any] | None,
- ) → DataFrame
Internal function to split and tokenize text in a ledger DataFrame.
This function extracts text from documents that match filter criteria based on source types, splits the text into chunks using the specified tokenizer, and rebuilds the document records with the split text. The resulting DataFrame contains both split and unsplit documents.
- Parameters:
df_transform_ledger (pd.DataFrame) – DataFrame containing documents to be processed. Expected to have columns ‘document_type’ and ‘metadata’, where ‘metadata’ includes a ‘content’ field and nested source information.
task_config (dict) –
- Dictionary with task-specific configuration. Expected keys include:
“tokenizer”: Tokenizer identifier or path.
“chunk_size”: Maximum number of tokens per chunk.
“chunk_overlap”: Number of tokens to overlap between chunks.
- “params”: A sub-dictionary that may contain:
“hf_access_token”: Hugging Face access token.
“split_source_types”: List of source types to filter for splitting.
transform_config (TextSplitterSchema) – Configuration object providing default values for text splitting parameters.
execution_trace_log (Optional[dict]) – Optional dictionary for logging execution trace information; may be None.
- Returns:
DataFrame with processed documents. Documents with text matching the filter are split into chunks, and then merged with those that do not match the filter.
- Return type:
pd.DataFrame
- Raises:
ValueError – If the text splitting or tokenization process fails.
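A minimal usage sketch. The import path for TextSplitterSchema is assumed by analogy with the embedding schema module above, the nested source-type location in ‘metadata’ is illustrative, and execution_trace_log is passed explicitly because the signature shows no default:

```python
import pandas as pd

from nv_ingest_api.internal.schemas.transform.transform_text_splitter_schema import (
    TextSplitterSchema,  # import path assumed, not confirmed by these docs
)
from nv_ingest_api.internal.transform.split_text import (
    transform_text_split_and_tokenize_internal,
)

# One text row; the nested source information layout is an assumption.
ledger = pd.DataFrame(
    [
        {
            "document_type": "text",
            "metadata": {
                "content": "A long passage of text to be chunked ...",
                "source_metadata": {"source_type": "text"},
            },
        }
    ]
)

task_config = {
    "tokenizer": "bert-base-uncased",  # any Hugging Face tokenizer id or path
    "chunk_size": 512,                 # max tokens per chunk
    "chunk_overlap": 64,               # tokens shared between adjacent chunks
    "params": {
        "hf_access_token": None,        # only needed for gated models
        "split_source_types": ["text"], # which source types to split
    },
}

df_out = transform_text_split_and_tokenize_internal(
    df_transform_ledger=ledger,
    task_config=task_config,
    transform_config=TextSplitterSchema(),  # assumes usable defaults
    execution_trace_log=None,
)
print(len(df_out), "rows after splitting")
```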