modifiers.async_llm_pii_modifier
Module Contents

Classes

AsyncLLMInference: A class for redacting PII via asynchronous LLM inference

AsyncLLMPiiModifier: The entry point to using the LLM-based PII de-identification module. It works with the Modify functionality.
API
- class modifiers.async_llm_pii_modifier.AsyncLLMInference(
- base_url: str,
- api_key: str | None,
- model: str,
- system_prompt: str,
- )
A class for redacting PII via asynchronous LLM inference
Initialization
- async infer(text: str) → list[dict[str, str]]
Invoke LLM to get PII entities
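`infer` returns the detected PII entities as a list of dictionaries. A minimal sketch of consuming such a result to redact a string — the `entity_type`/`entity_text` key names and the `{{label}}` placeholder format are illustrative assumptions, not confirmed by this reference:

```python
# Hypothetical shape of the list returned by AsyncLLMInference.infer();
# the "entity_type" and "entity_text" key names are assumptions.
entities = [
    {"entity_type": "name", "entity_text": "Sarah"},
    {"entity_type": "name", "entity_text": "Ryan"},
]

def redact(text: str, entities: list[dict[str, str]]) -> str:
    # Replace every occurrence of each detected entity with its label
    for entity in entities:
        text = text.replace(entity["entity_text"], "{{" + entity["entity_type"] + "}}")
    return text

print(redact("Sarah and Ryan went out to play", entities))
# -> {{name}} and {{name}} went out to play
```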
- class modifiers.async_llm_pii_modifier.AsyncLLMPiiModifier(
- base_url: str,
- api_key: str | None = None,
- model: str = 'meta/llama-3.1-70b-instruct',
- system_prompt: str | None = None,
- pii_labels: list[str] | None = None,
- language: str = 'en',
- max_concurrent_requests: int | None = None,
- )
Bases:
nemo_curator.modifiers.DocumentModifier
This class is the entry point to using the LLM-based PII de-identification module. It works with the Modify functionality as shown below:

```python
dataframe = pd.DataFrame({"text": ["Sarah and Ryan went out to play", "Jensen is the CEO of NVIDIA"]})
dd = dask.dataframe.from_pandas(dataframe, npartitions=1)
dataset = DocumentDataset(dd)

modifier = AsyncLLMPiiModifier(
    # Endpoint for the user's NIM
    base_url="http://0.0.0.0:8000/v1",
    api_key="API KEY (if needed)",
    model="meta/llama-3.1-70b-instruct",
    # The user may engineer a custom prompt if desired
    system_prompt=None,
    pii_labels=PII_LABELS,
    language="en",
    max_concurrent_requests=10,
)

modify = Modify(modifier)
modified_dataset = modify(dataset)
modified_dataset.df.to_json("output_files/*.jsonl", lines=True, orient="records")
```
Initialization
Initialize the AsyncLLMPiiModifier
Args:
- base_url (str): The base URL for the user's NIM.
- api_key (Optional[str]): The API key for the user's NIM, if needed. Default is None.
- model (str): The model to use for the LLM. Default is "meta/llama-3.1-70b-instruct".
- system_prompt (Optional[str]): The system prompt to feed into the LLM. If None, a default system prompt is used. The default prompt has been fine-tuned for "meta/llama-3.1-70b-instruct".
- pii_labels (Optional[List[str]]): The PII labels to identify and remove from the text. See the documentation for the full list of PII labels. Default is None, which means all PII labels will be used.
- language (str): The language to use for the LLM. Default is "en" for English. If non-English, it is recommended to provide a custom system prompt.
- max_concurrent_requests (Optional[int]): The maximum number of concurrent requests to make to the LLM. Default is None, which means no limit.
- batch_redact(
- text: pandas.Series,
- pii_entities_lists: list[list[dict[str, str]]],
- )
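A hypothetical sketch of how batch redaction over a pandas Series might work: each row's text is paired with its entity list, and every detected entity string is replaced with a placeholder. The `{{label}}` placeholder format and the entity key names are assumptions, and `batch_redact_sketch` is not the library implementation:

```python
import pandas as pd

def batch_redact_sketch(
    text: pd.Series,
    pii_entities_lists: list[list[dict[str, str]]],
) -> pd.Series:
    # Pair each document with its per-row entity list and redact in place
    redacted = []
    for doc, entities in zip(text, pii_entities_lists):
        for entity in entities:
            # Replace every occurrence of the entity text with its label
            doc = doc.replace(entity["entity_text"], "{{" + entity["entity_type"] + "}}")
        redacted.append(doc)
    return pd.Series(redacted, index=text.index)

texts = pd.Series(["Jensen is the CEO of NVIDIA"])
entities = [[{"entity_type": "name", "entity_text": "Jensen"}]]
print(batch_redact_sketch(texts, entities).iloc[0])
# -> {{name}} is the CEO of NVIDIA
```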
- async call_inferer(
- text: pandas.Series,
- inferer: modifiers.async_llm_pii_modifier.AsyncLLMInference,
- )
- load_inferer() → modifiers.async_llm_pii_modifier.AsyncLLMInference
Helper function to load the asynchronous LLM
- modify_document(text: pandas.Series) → pandas.Series
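Processing a Series of documents asynchronously means fanning each document out to the inferer, with `max_concurrent_requests` capping in-flight calls. A sketch of that fan-out pattern using a stub in place of `AsyncLLMInference` — the real class calls a NIM endpoint, so the stub and its hard-coded behavior here are purely illustrative:

```python
import asyncio
import pandas as pd

class StubInferer:
    """Stand-in for AsyncLLMInference; the real class queries an LLM endpoint."""
    async def infer(self, text: str) -> list[dict[str, str]]:
        # Pretend the LLM flags "Sarah" as a name whenever it appears
        return [{"entity_type": "name", "entity_text": "Sarah"}] if "Sarah" in text else []

async def gather_entities(text: pd.Series, inferer: StubInferer, max_concurrent: int = 10):
    sem = asyncio.Semaphore(max_concurrent)  # cap concurrent requests

    async def one(doc: str) -> list[dict[str, str]]:
        async with sem:
            return await inferer.infer(doc)

    # Launch one task per document and collect results in order
    return await asyncio.gather(*(one(doc) for doc in text))

texts = pd.Series(["Sarah and Ryan went out to play", "Jensen is the CEO of NVIDIA"])
entities = asyncio.run(gather_entities(texts, StubInferer()))
print(entities[0])  # -> [{'entity_type': 'name', 'entity_text': 'Sarah'}]
print(entities[1])  # -> []
```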