HEAVY.AI supports four algorithms: linear regression, random forest regression, gradient boosted tree (GBT) regression, and decision tree regression. Creation and training of the models can be accomplished both via a CREATE MODEL statement as well as via invocation of dedicated table functions. Inference using the models can be accomplished via a row-wise ML_PREDICT operator, or via dedicated table functions.
Creating a regression model is accomplished via the CREATE MODEL statement.
We will step through the various parts and options of the statement above, giving concrete examples along the way.
Like CREATE TABLE, CREATE MODEL allows for:
CREATE MODEL (without qualifiers): Creates a named model. If a model already exists by the same name, the statement will throw an error message to that effect.Other things to take note of when using CREATE MODEL include:
my_model_123, just like a SQL table or column name.LINEAR_REG, RANDOM_FOREST_REG, DECISION_TREE_REG, or GBT_REG.revenue * 2.0 ) that are not column references (i.e. revenue) must be aliased, i.e. revenue * 2.0 AS revenue2xDatetimes and timestamp values can be extracted to either continuous or categorical values if needed:
Geo columns are not directly supported, but operators returning types above from a geo column or from columnar longitude and latitude work as expected
Standard Options
EVAL_FRACTION: (optionally aliased as DATA_SPLIT_EVAL_FRACTION): Specifies the proportion of the dataset to be withheld from the training data, and allows the EVALUATE MODEL command to be run on the evaluation set without explicit specification (see the EVALUATE MODEL section below for more details). Note that EVAL_FRACTION must be >= 0.0 and < 1.0. Default value is 0.0 (no evaluation hold-out set).DATA_SPLIT_TRAIN_FRACTION): Specifies the proportion of the dataset to be used for training. The most common use case for specifying TRAIN_FRACTION is when training a model over a large amount of data, such that specifying a TRAIN_FRACTION of less than 1 will speed up training, at some cost in model accuracy. Note that TRAIN_FRACTION must be >= 0.0 and <= 1.0.CAT_TOP_K: For models with categorical predictors, this option specifies the top-k number of attributes that will be one-hot encoded from each, based on the attribute’s frequency of occurrence in the training dataset. Note that the default value for CAT_TOP_K is 10, so only the 10 most-frequent categorical values are considered in modeling unless this is adjusted.
CAT_TOP_K is set to 3, and a categorical predictor column of us_state has 20 rows for ‘CA’, 15 rows for ‘TX’, 12 rows for ‘NY’, 10 rows for ‘WA’, and 8 rows for ‘FL”, then one-hot encoded columns will be generated for ‘CA’, ‘TX’, and ‘NY’.CAT_MIN_FRACTION described immediately below. For a categorical attribute to be one-hot encoded, the attribute must also have a column frequency greater or equal to CAT_MIN_FRACTION.CAT_MIN_FRACTION: For models with categorical predictors, this option specifies the minimum frequency an attribute must be represented in a categorical column to be one-hot encoded as a predictor.
CAT_TOP_K, such that for a categorical attribute to be one-hot encoded, it must be both among the top-k attributes for a column, as well as have a frequency of occurrence >= CAT_MIN_FRACTION.CAT_MIN_FRACTION is 0.01, meaning that only categorical attributes that make up at least 1% of their input column will be one-hot encoded (assuming they are also among the top CAT_TOP_K attributes of that column in terms of frequency).Currently registered model names can be accessed via the SHOW MODELS command. If the configuration flag --restrict-ml-model-metadata-to-superusers is set to false (the default), any user can execute this command, otherwise it is restricted to superusers only.
SHOW MODEL DETAILS
Metadata for currently registered models can be accessed via the SHOW MODEL DETAILS command.
If SHOW MODEL DETAILS is run without a list of table names, metadata for all registered models will be returned.
If SHOW MODEL DETAILS is executed with a list of one or more model names, then just the metadata for those model names will be returned.
If the configuration flag --restrict-ml-model-metadata-to-superusers is set to false (the default), any user can execute this command, otherwise it is restricted to superusers only.
Metadata for model features, including regression coefficients for linear regression models and feature importance scores for random forest regression models, can be displayed by executing SHOW MODEL FEATURE DETAILS <model_name>;
If you have superuser access, you can view relevant model metadata by querying the information_schema.ml_models table. The table schema is as follows:
The ml_models system table can be queried as follows:
Evaluating a trained model can be done with the EVALUATE MODEL statement, which returns the model’s R-squared (or R2) score. R-squared is a goodness-of-fit measure for regression models. This statistic indicates the percentage of the variance in the dependent variable that the independent variables explain collectively.
In the future, other additional metrics, such as Mean Squared Error (MSE), and Mean Absolute Error (MAE) may also be returned. Note: Additional metrics can be computed today in SQL from the values returned by ML_PREDICT described below if desired.
If CREATE MODEL was executed with a specified EVAL_FRACTION, then EVALUATE MODEL can be run without a specified query (i.e. simply EVALUATE MODEL <model_name>), which will run the model on the specified proportion of the training dataset held out for evaluation (i.e. the “test set”).
If EVAL_FRACTION was not specified in the CREATE MODEL statement, or if you wish to test the model’s performance on a different evaluation set, a specific evaluation query can be provided. It is expected that the order of the columns should be <predicted_col>, <categorical predictor cols>, <continuous predictor cols> , and that the arguments specified should semantically match the variables the model was trained on.
For example, the following workflow shows training a random forest regression model to predict prices of Florida real estate parcels for single-family homes with a defined 20% test set, and then evaluating the model performance on that held-out test dataset, and then separately on a dataset of Georgia real estate parcels of single-family homes.
In this example, 88% of the variance in the Florida dataset can be explained by the model (based on 20% hold-out values). The same model applied entirely outside of its training domain explains 75% of the sale price variance in Georgia.
Model inference, or using the model to predict values, can be performed by using the ML_PREDICT operator, which can be run anywhere a normal SQL operator is expected (i.e. it is executed row-wise and is not a table function). Syntax is as follows:
Note the following:
model_name is always the first argument to ML_PREDICT, and should be enclosed in single quotes.CREATE MODE, should never be provided to ML_PREDICT.Example