---
description: >-
  The LLM_TRANSFORM operator allows you to leverage the power of the HeavyIQ
  Large Language Model directly from SQL to declaratively specify data
  transforms using natural language.
---

# HeavyIQ LLM_TRANSFORM

**Note:** The LLM_TRANSFORM operator requires HeavyIQ to be configured; otherwise, the operator will error.

LLM_TRANSFORM can be highly useful for use cases such as the following:

* Data cleanup: for example, standardizing messy date strings into a consistent format that can be cast to a HeavyDB DATE type.
* Classification: for example, classifying users from Slack logs as humans or bots, or credit card purchases as food, travel, entertainment, and so on.
* Sentiment analysis: for example, determining whether the sentiment of tweets is positive, neutral, or negative.
* Entity extraction: for example, returning all persons, companies, or email addresses from a text string (a hedged sketch appears under Additional Example Usage below).
* Limited fact lookup: LLM_TRANSFORM can be used to add factual information, for example to return the county seat of a county or the headquarters city of a company. Take care to check for hallucinations, particularly for more esoteric facts.

**Syntax**

```sql
LLM_TRANSFORM(
  string_variable_or_expression,
  'transform_instruction',
  ['output_choices_or_regex']
)
```

**Parameters**

* **string_variable_or_expression** (mandatory): The input string variable or expression that the LLM will process. This can be any valid string data type or expression.
* **transform_instruction** (mandatory): A string literal that specifies the transformation instruction for the LLM. This should be a clear and concise command or query directed to the LLM.
* **output_choices_or_regex** (optional): A string literal that either specifies a set of pipe (|) delimited output choices or a regular expression (regex) bounded by forward slashes (/). This parameter constrains the possible outputs of the transformation. Note that for regexes, the `^` and `$` special characters denoting the start and end of the string are not supported.

**Example with Output**

```sql
heavysql> SELECT
  messy_date,
  TRY_CAST(
    LLM_TRANSFORM(
      messy_date,
      'Extract the date of the event in YYYY-MM-DD format',
      '/\d{4}-\d{2}-\d{2}/'
    ) AS DATE
  ) AS nice_date
FROM date_strs;

messy_date|nice_date
November 3, 2008|2008-11-03
2008-11-04|2008-11-04
Nov 5, 2008|2008-11-05
06.11.2008|2008-11-06
7. November 2008|2008-11-07
```

**Additional Example Usage**

Basic usage without output constraints:

```sql
SELECT
  state_name,
  LLM_TRANSFORM(
    state_name,
    'Return the capital of this US state'
  ) AS state_capital
FROM usa_states;
```

Constrain output to a set of discrete choices (must be pipe-delimited):

```sql
SELECT
  LLM_TRANSFORM(
    user_name,
    'Return whether this user is a human or a bot',
    'human|bot'
  ) AS user_type,
  COUNT(*) AS num_users
FROM slack_logs
GROUP BY user_type
ORDER BY num_users DESC;
```

```sql
SELECT
  DATE_TRUNC(DAY, tweet_created) AS "day",
  LLM_TRANSFORM(
    text_,
    'Return the sentiment of this tweet as positive, neutral, or negative',
    'positive|neutral|negative'
  ) AS sentiment,
  COUNT(*) AS num_tweets
FROM tweets_airlines
WHERE text_ ILIKE '%delta%'
GROUP BY "day", sentiment
ORDER BY "day", sentiment;
```
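The entity-extraction use case listed earlier follows the same pattern. The following is a minimal sketch, assuming a hypothetical `support_emails` table with a `body` text column (both names are illustrative only); the regex loosely constrains the output to an email-like token:

```sql
-- Hypothetical table and column names, for illustration only
SELECT
  body,
  LLM_TRANSFORM(
    body,
    'Return the email address mentioned in this text',
    '/\S+@\S+/'
  ) AS extracted_email
FROM support_emails;
```

As with the other regex-constrained examples, the pattern is bounded by forward slashes and avoids the unsupported `^` and `$` anchors.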
Constrain output to a regex and cast to decimal (the regex must be bounded by '/'):

```sql
UPDATE weather
SET cleaned_temperature = TRY_CAST(
  LLM_TRANSFORM(
    temperature_str,
    'Return a formatted decimal string with one decimal point for this temperature, i.e. 81.2',
    '/-?\d+.\d/'
  ) AS DECIMAL(4, 1)
);
```

Run LLM_TRANSFORM twice to generate a longitude and latitude pair, each constrained by a regex:

```sql
SELECT
  port_of_unlading,
  TRY_CAST(
    LLM_TRANSFORM(
      port_of_unlading,
      'Return the longitude of this port',
      '/-?\d{1,3}.\d{4}/'
    ) AS DOUBLE
  ) AS longitude,
  TRY_CAST(
    LLM_TRANSFORM(
      port_of_unlading,
      'Return the latitude of this port',
      '/-?\d{1,2}.\d{4}/'
    ) AS DOUBLE
  ) AS latitude,
  COUNT(*) AS num_containers
FROM bill_of_lading_dec_2014_apr_2018
GROUP BY port_of_unlading
ORDER BY num_containers DESC;
```

**Restrictions and Performance Notes**

* By default, `LLM_TRANSFORM` is restricted to run on up to 1,000 unique string inputs per operator call. This limit can be changed at HeavyDB startup via the configuration flag `--llm-transform-max-num-unique-value`. For example, a column containing US state names has only 50 unique values, so `LLM_TRANSFORM` can run on it regardless of the table's row count, because 50 is below the default limit of 1,000 unique strings.
* The database runs `LLM_TRANSFORM` only once per unique input, which keeps the operator performant on large tables with low-cardinality columns.
* The database tries to push any filters down below an `LLM_TRANSFORM` call so that `LLM_TRANSFORM` is invoked on the minimal set of inputs possible. For example, on a table with a billion rows where other filters (WHERE clauses) reduce the result to 20 rows, `LLM_TRANSFORM` is called only on the unique strings from those 20 rows. Filtering large tables is therefore a useful technique for staying under the limit set by `--llm-transform-max-num-unique-value` (by default, 1,000); see the sketch after this list. Note that filters are not yet pushed down when `LLM_TRANSFORM` itself is used as a filter.
* Currently, embedded pipe characters (`|`) in output choice constraints and leading or trailing forward slashes (`/`) in output regex constraints are not allowed because they cannot be escaped; this is planned to be addressed in a future release.
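To illustrate the filter-pushdown behavior described above, the following is a minimal sketch assuming a hypothetical `support_tickets` table with a `ticket_text` column (both names are illustrative only). The ILIKE filter is evaluated before the transform, so `LLM_TRANSFORM` only sees the unique `ticket_text` values that survive the filter:

```sql
-- Hypothetical table and column names, for illustration only
SELECT
  LLM_TRANSFORM(
    ticket_text,
    'Classify this support ticket as billing, technical, or other',
    'billing|technical|other'
  ) AS ticket_category,
  COUNT(*) AS num_tickets
FROM support_tickets
WHERE ticket_text ILIKE '%refund%'
GROUP BY ticket_category
ORDER BY num_tickets DESC;
```

Narrowing the input in this way helps keep the number of unique strings under the `--llm-transform-max-num-unique-value` limit.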