LLM Column Classification

LLM column classification analyzes column names (like customer_email or phone_num) and sample values to identify what type of PII each column contains, if any.

The clearer the column names, the better the classification performs. Even when column names are vague, the classifier can often still identify the entity type by relying on the sample values.
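To make the two signals concrete, here is a minimal sketch of how a column name and a handful of sample values might be combined into a single text input for a classifier. The helper `build_column_prompt`, its `max_samples` parameter, and the prompt wording are all hypothetical and chosen for illustration; the actual input format used by the product is not described here.

```python
def build_column_prompt(column_name, sample_values, max_samples=5):
    """Format one column's name and a few samples as text (hypothetical helper)."""
    # Cap the number of samples so the input stays short.
    samples = [str(v) for v in sample_values[:max_samples]]
    lines = [
        f"Column name: {column_name}",
        "Sample values:",
    ]
    lines.extend(f"- {s}" for s in samples)
    lines.append("Which PII entity type, if any, does this column contain?")
    return "\n".join(lines)


prompt = build_column_prompt("cust_phone", ["555-1234", "555-5678", "555-9012"])
print(prompt)
```

Both the header and the values appear in the same input, which is why a vague header such as `col_1` can still be classified when its samples are distinctive.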

When to Use LLM Classification

LLM classification works best when:

You have many columns and want to identify which ones contain PII
You need to classify entire columns rather than individual values
Your data is in tabular format
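If your tabular data arrives as rows rather than columns, it first needs to be pivoted into per-column samples. The sketch below shows one way to do that in plain Python, assuming rows are dictionaries keyed by column name. The function `sample_columns` and its `max_samples` parameter are illustrative names, not part of any documented API.

```python
def sample_columns(rows, max_samples=3):
    """Collect up to max_samples non-empty values per column from row dicts."""
    columns = {}
    for row in rows:
        for name, value in row.items():
            values = columns.setdefault(name, [])
            # Skip empty cells; they carry no signal for classification.
            if value not in (None, "") and len(values) < max_samples:
                values.append(value)
    return columns


rows = [
    {"cust_phone": "555-1234", "user_mail": "john@email.com"},
    {"cust_phone": "555-5678", "user_mail": ""},
    {"cust_phone": "555-9012", "user_mail": "bob@site.net"},
]
columns = sample_columns(rows)
```

The result maps each column name to a short list of values, the same shape used in the examples on this page.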

Examples

Column Analysis

# Input data structure
columns = {
    "cust_phone": ["555-1234", "555-5678", "555-9012"],
    "user_mail": ["john@email.com", "jane@company.org", "bob@site.net"],
    "full_nm": ["John Smith", "Jane Doe", "Bob Johnson"]
}

# LLM will classify:
# - "cust_phone" as PHONE_NUMBER
# - "user_mail" as EMAIL
# - "full_nm" as PERSON

Unclear Column Names

# Input with ambiguous headers
columns = {
    "col_1": ["john@email.com", "jane@company.org"],
    "field_a": ["John Smith", "Jane Doe"], 
    "data_3": ["555-1234", "555-5678"]
}

# LLM will analyze sample values and classify:
# - "col_1" as EMAIL (based on email patterns in samples)
# - "field_a" as PERSON (based on name patterns in samples)
# - "data_3" as PHONE_NUMBER (based on phone patterns in samples)
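To illustrate why sample values alone can carry enough signal, the sketch below uses simple regular expressions as a stand-in for the classifier. This is not how the LLM works; it is a deliberately narrow, rule-based approximation. The `PATTERNS` table and `guess_entity` function are hypothetical, and the patterns only cover the formats shown in this example.

```python
import re

# Deliberately simple patterns for illustration only (an assumption, not the
# product's logic); real data needs far broader coverage.
PATTERNS = {
    "EMAIL": re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$"),
    "PHONE_NUMBER": re.compile(r"^\d{3}-\d{4}$"),
    "PERSON": re.compile(r"^[A-Z][a-z]+ [A-Z][a-z]+$"),
}


def guess_entity(sample_values):
    """Return the first label whose pattern matches every sample, else None."""
    if not sample_values:
        return None
    for label, pattern in PATTERNS.items():
        if all(pattern.match(str(v)) for v in sample_values):
            return label
    return None


columns = {
    "col_1": ["john@email.com", "jane@company.org"],
    "field_a": ["John Smith", "Jane Doe"],
    "data_3": ["555-1234", "555-5678"],
}
guesses = {name: guess_entity(values) for name, values in columns.items()}
print(guesses)
# {'col_1': 'EMAIL', 'field_a': 'PERSON', 'data_3': 'PHONE_NUMBER'}
```

Fixed rules like these break quickly on unfamiliar formats (international phone numbers, single-word names), which is the kind of case where a classifier that also reads the column name and generalizes across value formats is more useful.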