modifiers.llm_pii_modifier

Module Contents

Classes

LLMInference

A class for redacting PII via LLM inference

LLMPiiModifier

The entry point to the LLM-based PII de-identification module. It is used together with the Modify functionality.

API

class modifiers.llm_pii_modifier.LLMInference(
base_url: str,
api_key: str | None,
model: str,
system_prompt: str,
)

A class for redacting PII via LLM inference

Initialization

infer(text: str) -> list[dict[str, str]]

Invoke LLM to get PII entities
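The return type suggests one dictionary per detected entity. The following is a minimal sketch of how a raw model reply could be turned into that shape. It assumes the model answers with a JSON list whose items carry `entity_type` and `entity_text` keys; those key names and the `parse_entities` helper are hypothetical and are not confirmed by this page.

```python
import json


def parse_entities(raw_response: str) -> list[dict[str, str]]:
    """Parse a JSON list of entity dicts, silently dropping malformed items.

    The "entity_type" / "entity_text" keys are an assumption made for
    illustration; the real module may use a different schema.
    """
    try:
        data = json.loads(raw_response)
    except json.JSONDecodeError:
        # The model did not return valid JSON: treat it as "no entities found".
        return []
    if not isinstance(data, list):
        return []

    entities = []
    for item in data:
        if (
            isinstance(item, dict)
            and isinstance(item.get("entity_type"), str)
            and isinstance(item.get("entity_text"), str)
        ):
            entities.append(
                {
                    "entity_type": item["entity_type"],
                    "entity_text": item["entity_text"],
                }
            )
    return entities
```

Returning an empty list on malformed output is one possible design choice: it leaves the document unmodified rather than failing the whole batch.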

class modifiers.llm_pii_modifier.LLMPiiModifier(
base_url: str,
api_key: str | None = None,
model: str = 'meta/llama-3.1-70b-instruct',
system_prompt: str | None = None,
pii_labels: list[str] | None = None,
language: str = 'en',
)

Bases: nemo_curator.modifiers.DocumentModifier

This class is the entry point to the LLM-based PII de-identification module. It works with the Modify functionality as shown below:

    import dask.dataframe
    import pandas as pd

    # Modify, DocumentDataset, LLMPiiModifier, and PII_LABELS are assumed
    # to be imported or defined before this point.
    dataframe = pd.DataFrame(
        {"text": ["Sarah and Ryan went out to play", "Jensen is the CEO of NVIDIA"]}
    )
    dd = dask.dataframe.from_pandas(dataframe, npartitions=1)
    dataset = DocumentDataset(dd)

    modifier = LLMPiiModifier(
        # Endpoint for the user's NIM
        base_url="http://0.0.0.0:8000/v1",
        api_key="API KEY (if needed)",
        model="meta/llama-3.1-70b-instruct",
        # The user may engineer a custom prompt if desired
        system_prompt=None,
        pii_labels=PII_LABELS,
        language="en",
    )

    modify = Modify(modifier)
    modified_dataset = modify(dataset)
    modified_dataset.df.to_json("output_files/*.jsonl", lines=True, orient="records")

Initialization

Initialize the LLMPiiModifier

Args:
    base_url (str): The base URL for the user's NIM.
    api_key (Optional[str]): The API key for the user's NIM, if needed. Default is None.
    model (str): The model to use for the LLM. Default is "meta/llama-3.1-70b-instruct".
    system_prompt (Optional[str]): The system prompt to feed into the LLM. If None, a default system prompt is used. The default prompt has been fine-tuned for "meta/llama-3.1-70b-instruct".
    pii_labels (Optional[List[str]]): The PII labels to identify and remove from the text. See the documentation for the full list of PII labels. Default is None, which means all PII labels will be used.
    language (str): The language to use for the LLM. Default is "en" for English. For non-English text, providing a custom system prompt is recommended.
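The defaults described above can be sketched as a small stand-in configuration object. This is only an illustration of the documented rules (None for `pii_labels` means all labels; a custom prompt is recommended for non-English text). The `PiiConfig` class, its two helper methods, and the `ALL_LABELS` list are hypothetical and are not part of the library.

```python
from __future__ import annotations

from dataclasses import dataclass

# Hypothetical subset of labels, for illustration only.
# The real list is in the PII label documentation.
ALL_LABELS = ["name", "email", "phone_number"]


@dataclass
class PiiConfig:
    """Stand-in mirroring the documented constructor arguments."""

    base_url: str
    api_key: str | None = None
    model: str = "meta/llama-3.1-70b-instruct"
    system_prompt: str | None = None
    pii_labels: list[str] | None = None
    language: str = "en"

    def resolved_labels(self) -> list[str]:
        # None means "use every known label".
        if self.pii_labels is None:
            return list(ALL_LABELS)
        return list(self.pii_labels)

    def needs_custom_prompt(self) -> bool:
        # The docs recommend a custom prompt for non-English text.
        return self.language != "en" and self.system_prompt is None


cfg = PiiConfig(base_url="http://0.0.0.0:8000/v1", pii_labels=["email"])
print(cfg.resolved_labels())  # only the requested label is kept
```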

load_inferer() -> modifiers.llm_pii_modifier.LLMInference

Helper function to load the LLM

modify_document(text: str) -> str
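This page does not describe how the text is rewritten. As a rough sketch, a redaction step could replace each detected entity with a placeholder built from its label. The `redact` helper, the `{{label}}` placeholder format, and the `entity_type` / `entity_text` keys below are all assumptions made for illustration, not the module's confirmed behavior.

```python
def redact(text: str, entities: list[dict[str, str]]) -> str:
    """Replace every occurrence of each entity's text with a {{label}} placeholder."""
    # Longest entities first, so a short entity cannot split a longer one.
    ordered = sorted(entities, key=lambda e: len(e["entity_text"]), reverse=True)
    for entity in ordered:
        if not entity["entity_text"]:
            continue  # str.replace with an empty pattern would corrupt the text
        placeholder = "{{" + entity["entity_type"] + "}}"
        text = text.replace(entity["entity_text"], placeholder)
    return text


entities = [
    {"entity_type": "name", "entity_text": "Sarah"},
    {"entity_type": "name", "entity_text": "Ryan"},
]
print(redact("Sarah and Ryan went out to play", entities))
# prints: {{name}} and {{name}} went out to play
```

Plain string replacement is the simplest option. It does not guard against an entity string that also appears inside an already inserted placeholder, which a production implementation would need to handle.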