HeavyIQ LLM_TRANSFORM

View as Markdown

Note: The LLM_TRANSFORM operator requires HeavyIQ to be configured, otherwise the operator will error.

LLM_TRANSFORM can be highly useful for use cases such as the following:

  • Data cleanup: for example, standardizing messy date strings into a standardized format that can be casted to a HeavyDB DATE type
  • Classification: for example, classifying users from Slack logs as humans or bots, or credit card purchases as food, travel, entertainment, etc
  • Sentiment analysis: for example, determining whether the sentiment of tweets is postive, neutral, or negative
  • Entity extraction: return all persons/companies/emails etc from a text string
  • Limited fact lookup: LLM_TRANSFORM can be used to add factual information, for example to return the county seat of a county, or the headquarters city of a company. Note that care should be taken to check for hallucinations, particularly for more esoteric facts.

Syntax

1LLM_TRANSFORM(
2 string_variable_or_expression,
3 'transform_instruction',
4 ['output_choices_or_regex']
5)

Parameters

  • string_variable_or_expression (mandatory): The input string variable or expression that the LLM will process. This can be any valid string data type or expression.
  • transform_instruction (mandatory): A string literal that specifies the transformation instruction for the LLM. This should be a clear and concise command or query directed to the LLM.
  • output_choices_or_regex (optional): A string literal that can either specify a set of pipe (|) delimited output choices or a regular expression (regex) bounded by forward slashes (/). This parameter constrains the possible outputs of the transformation. Note that for regexes, ^ and $ special characters to denote the start and end of the string are not supported.

Example with Output

1heavysql> SELECT messy_date, TRY_CAST(LLM_TRANSFORM(messy_date, 'Extract the date of the event in YYYY-MM-DD format', '/\d{4}-\d{2}-\d{2}/') AS DATE) AS nice_date FROM date_strs;
2messy_date|nice_date
3November 3, 2008|2008-11-03
42008-11-04|2008-11-04
5Nov 5, 2008|2008-11-05
606.11.2008|2008-11-06
77. November 2008|2008-11-07

Additional Example Usage

Basic usage without output constraints

1SELECT
2 state_name,
3 LLM_TRANSFORM(
4 state_name,
5 'Return the capital of this US state'
6 ) AS state_capital
7FROM
8 usa_states;

Constrain output to a set of discrete choices (must be pipe-delimited)

1SELECT
2 LLM_TRANSFORM(
3 user_name,
4 'Return whether this user is a human or a bot',
5 'human|bot'
6 ) AS user_type,
7 COUNT(*) AS num_users
8FROM
9 slack_logs
10GROUP BY
11 user_type
12ORDER BY
13 num_users DESC;
1SELECT
2 DATE_TRUNC(DAY, tweet_created) AS "day",
3 LLM_TRANSFORM(
4 text_,
5 'Return the sentiment of this tweet as positive, neutral, or negative',
6 'positive|neutral|negative'
7 ) AS sentiment,
8 COUNT(*) AS num_tweets
9FROM
10 tweets_airlines
11WHERE
12 text_ ILIKE '%delta%'
13GROUP BY
14 "day",
15 sentiment
16ORDER BY
17 "day",
18 sentiment;

Constrain output to a regex and cast to decimal (regex must be bounded by ’/’)

1UPDATE
2 weather
3SET
4 cleaned_temperature = TRY_CAST(
5 LLM_TRANSFORM(
6 temperature_str,
7 'Return a formatted decimal string with one decimal point for this temperature, i.e. 81.2',
8 '/-?\d+.\d/'
9 ) AS DECIMAL(4, 1)
10 );

Run LLM_TRANSFORM twice to generate a longitude and latitude pair, each constrained by a regex.

1SELECT
2 port_of_unlading,
3 TRY_CAST(
4 LLM_TRANSFORM(
5 port_of_unlading,
6 'Return the longitude of this port',
7 '/-?\d{1,3}.\d{4}/'
8 ) AS DOUBLE
9 ) AS longitude,
10 TRY_CAST(
11 LLM_TRANSFORM(
12 port_of_unlading,
13 'Return the latitude of this port',
14 '/-?\d{1,2}.\d{4}/'
15 ) AS DOUBLE
16 ) AS latitude,
17 COUNT(*) AS num_containers
18FROM
19 bill_of_lading_dec_2014_apr_2018
20GROUP BY
21 port_of_unlading
22ORDER BY
23 num_containers DESC;

Restrictions and Performance Notes

  • By default, LLM_TRANSFORM is restricted to run on up to 1,000 unique string inputs per operator call. This limit can be changed at HeavyDB startup via the configuration flag --llm-transform-max-num-unique-value. Note that if a table had a column containing US state names with only 50 unique values, LLM_TRANSFORM will be allowed to run since 50 is lower than the default 1,000 unique string limit.
  • The database will only run LLM_TRANSFORM once per unique input to allow for performant use of the operator on large tables with low cardinality columns.
  • The database will try to push down any filters below an LLM_TRANSFORM call, so that LLM_TRANFORM needs to be called on the minimal set of inputs possible. Hence, for a table with a billion rows that has other filters (WHERE clauses) that filter the table to 20 rows, LLM_TRANSFORM would only be called on the unique strings from those 20 rows. Hence filtering large tables can be a technique to allow LLM_TRANSFORM to not hit the limit specified by --llm-transform-max-num-unique-value(by default 1,000). Note that we do not yet push down filters if LLM_TRANSFORM itself is used as a filter.
  • Currently embedded pipe characters (|) for output choice constraints, or leading or trailing forward slashes (/) for output regex contraints are not allowed as they cannot be escaped, but this is planned to be addressed in a future release.

\