Linear Regression

Summary

Linear regression is a supervised learning algorithm that predicts a continuous target variable by fitting a linear function of the input features (a straight line when there is a single feature). It is commonly used to predict numeric outcomes from one or more input variables. Its advantages include simplicity, interpretability, and fast computation. Its disadvantages include the assumption of a linear relationship between the features and the target, which may not always hold, and sensitivity to both outliers and collinear features, which can affect the model’s performance and reported accuracy.
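To make the idea of fitting a line concrete, here is a minimal Python sketch of ordinary least squares for a single feature. It is illustrative only: the data values are made up, `fit_line` is a hypothetical helper, and nothing here is claimed to describe how HeavyML fits its models internally.

```python
def fit_line(xs, ys):
    """Closed-form least-squares fit of y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope is the covariance of x and y divided by the variance of x.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    # The fitted line always passes through the point (mean_x, mean_y).
    intercept = mean_y - slope * mean_x
    return intercept, slope


# Made-up data lying exactly on y = 1 + 2x.
feature = [1.0, 2.0, 3.0, 4.0]
target = [3.0, 5.0, 7.0, 9.0]
intercept, slope = fit_line(feature, target)
print(intercept, slope)
```

With several features the same principle applies, but the fit is usually solved with matrix methods rather than this one-variable formula.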

Note: If you are not sure that your data meet these requirements, consider exploring the data first, for example with scatter plots that visualize pairwise relationships, and applying transformations where needed. Alternatively, HeavyML makes it simple to first try, or compare against, a regression model with fewer assumptions, such as a random forest.
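Alongside scatter plots, one quick numeric check is the Pearson correlation coefficient, which measures the strength of a linear association between two columns. The Python sketch below is a rough illustration under stated assumptions: `pearson` is a hypothetical helper, the column values are invented, and the data would need to be exported from the database first.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


# Invented sample values for two predictors and the target.
living_area = [900.0, 1200.0, 1500.0, 2100.0, 2600.0]
acres = [0.10, 0.15, 0.18, 0.30, 0.34]
sale_price = [120000.0, 150000.0, 185000.0, 240000.0, 310000.0]

# Feature vs. target: a value near 0 suggests a weak linear relationship.
print(pearson(living_area, sale_price))
# Feature vs. feature: a value near +1 or -1 hints at collinearity.
print(pearson(living_area, acres))
```

A coefficient near zero does not rule out a nonlinear relationship, so this check complements a scatter plot rather than replacing it.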

Example

CREATE OR REPLACE MODEL florida_parcels_sale_prc_lr OF TYPE LINEAR_REG AS
SELECT
  SALEPRC1,
  PARUSEDESC,
  CNTYNAME,
  ACRES,
  TOTLVGAREA,
  EFFYRBLT,
  SALEYR1
FROM
  florida_parcels_2020 WITH (CAT_TOP_K=20, EVAL_FRACTION=0.2);

Linear Regression Options

With the exception of the general options listed above, the linear regression model type accepts no options.

Model Evaluation

Like all other regression model types, the model r2 can be obtained via the EVALUATE MODEL command. If the model was created with a specified EVAL_FRACTION, the model r2 score can be obtained on that test holdout set via the following:

EVALUATE MODEL florida_parcels_sale_prc_lr;

r2
0.08852338436213192
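The page does not spell out how the r2 score is computed. Assuming it is the usual coefficient of determination, 1 - SS_res / SS_tot, the calculation can be sketched in Python as follows; `r2_score` here is a hypothetical helper written for illustration, not a HeavyML function, and the numbers are made up.

```python
def r2_score(actual, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_actual = sum(actual) / len(actual)
    # Residual sum of squares: error left over after the model's predictions.
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    # Total sum of squares: error of always predicting the mean.
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return 1.0 - ss_res / ss_tot


actual = [1.0, 2.0, 3.0]
print(r2_score(actual, [1.0, 2.0, 3.0]))  # perfect predictions
print(r2_score(actual, [2.0, 2.0, 2.0]))  # always predicting the mean
print(r2_score(actual, [3.0, 2.0, 1.0]))  # worse than predicting the mean
```

Under this definition, a score of about 0.09 would mean the model explains roughly 9% of the variance in the holdout set, and a score can be negative when the model does worse than predicting the mean.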

If no EVAL_FRACTION was specified in the CREATE MODEL command, or if EVAL_FRACTION was specified but you wish to evaluate the model on a different dataset, you can specify the evaluation query explicitly as follows:

EVALUATE MODEL florida_parcels_sale_prc_lr ON
SELECT
  SALEPRC1,
  PARUSEDESC,
  CNTYNAME,
  ACRES,
  TOTLVGAREA,
  EFFYRBLT,
  SALEYR1
FROM
  florida_parcels_2020
WHERE
  PARUSEDESC = 'CONDOMINIUMS';

r2
0.03106805209354901

The relatively low R2 scores obtained for the linear regression model are not unusual for complex multivariate relationships. As noted above, in such cases it is likely worth trying a random forest or gradient-boosted tree (GBT) regression model, as these models can be dramatically more accurate (for the example above, a simple random forest model achieved an R2 score above 0.87).

Model Prediction/Inference

Once a linear regression model is created, it can, like all other regression model types, be used for prediction via the row-wise ML_PREDICT operator, which takes the model name (in quotes) and a list of independent predictor variables semantically matching the ordered list of variables the model was trained on.

SELECT
  SALEPRC1 AS actual_sales_price,
  ML_PREDICT(
    'florida_parcels_sale_prc_lr',
    PARUSEDESC,
    CNTYNAME,
    ACRES,
    TOTLVGAREA,
    EFFYRBLT,
    SALEYR1
  ) AS predicted_sales_price
FROM
  florida_parcels_2020
WHERE
  SALEPRC1 BETWEEN 100000 AND 500000
LIMIT
  10;

actual_sales_price|predicted_sales_price
211000|-30912.60198199749
152400|6559.390672445297
164000|35608.10665637255
153900|56984.85121244192
143500|52565.25603222847
144000|64931.58916777372
140000|79256.96579062939
160000|90230.21915191412
162000|56753.09885531664
107000|80915.15436685085
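To get a feel for how far these predictions are from the actual prices, the ten rows shown above can be summarized with a few lines of Python. This is only a sketch over the displayed sample, which is too small to be representative of the model as a whole.

```python
# (actual_sales_price, predicted_sales_price) pairs copied from the output above.
rows = [
    (211000, -30912.60198199749),
    (152400, 6559.390672445297),
    (164000, 35608.10665637255),
    (153900, 56984.85121244192),
    (143500, 52565.25603222847),
    (144000, 64931.58916777372),
    (140000, 79256.96579062939),
    (160000, 90230.21915191412),
    (162000, 56753.09885531664),
    (107000, 80915.15436685085),
]

# Residual = actual - predicted; a positive value means the model under-predicted.
residuals = [actual - predicted for actual, predicted in rows]
mae = sum(abs(r) for r in residuals) / len(residuals)
under_predicted = sum(1 for r in residuals if r > 0)

print(f"Mean absolute error: {mae:.2f}")
print(f"Under-predicted rows: {under_predicted} of {len(rows)}")
```

On these ten rows every prediction falls below the actual price, and the mean absolute error is roughly 104,000, which fits with the low R2 scores reported earlier.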

Related Methods

A list of predictors for a trained linear regression model, along with their associated coefficients, can be obtained by executing the linear_reg_coefs table function, as shown in the following example:

SELECT * FROM TABLE(linear_reg_coefs(model_name=>'florida_parcels_sale_prc_lr'));

coef_idx|feature|sub_coef_idx|sub_feature|coef
0|intercept|1|NULL|313541950.3062068
1|PARUSEDESC|1|SINGLE FAMILY|-812725.0483721431
1|PARUSEDESC|2|CONDOMINIUMS|145061.7512208006
1|PARUSEDESC|3|VACANT RESIDENTIAL|-696124.6904133513
1|PARUSEDESC|4|MOBILE HOMES|-793766.3806761563
1|PARUSEDESC|5|MULTI-FAMILY - FEWER THAN 10 UNITS|-680141.6707517173
1|PARUSEDESC|6|RESIDENTIAL COMMON ELEMENTS / AREAS|-200166.7979917362
2|CNTYNAME|1|MIAMI-DADE|-14247.32188426969
2|CNTYNAME|2|BROWARD|11489.81580546272
2|CNTYNAME|3|PALM BEACH|4753.91666378783
2|CNTYNAME|4|LEE|-187796.8228951865
2|CNTYNAME|5|HILLSBOROUGH|1906136.710600209
2|CNTYNAME|6|ORANGE|14400.40299565677
2|CNTYNAME|7|PINELLAS|340509.3627915879
2|CNTYNAME|8|DUVAL|-6297.904384635765
2|CNTYNAME|9|POLK|51934.57494268686
2|CNTYNAME|10|BREVARD|-121257.6655694735
2|CNTYNAME|11|VOLUSIA|-14876.60197626388
2|CNTYNAME|12|SARASOTA|-80945.57973416231
2|CNTYNAME|13|COLLIER|-130752.9965216073
2|CNTYNAME|14|PASCO|27699.23056136876
2|CNTYNAME|15|MARION|-60143.98878492894
2|CNTYNAME|16|CHARLOTTE|-71174.58469642041
2|CNTYNAME|17|MANATEE|38203.76868561743
2|CNTYNAME|18|LAKE|45848.48937542497
2|CNTYNAME|19|SEMINOLE|64744.92416669091
2|CNTYNAME|20|OSCEOLA|228911.8974889433
3|ACRES|1|NULL|-7317.375525209893
4|TOTLVGAREA|1|NULL|101.4888021198855
5|EFFYRBLT|1|NULL|3995.56463628857
6|SALEYR1|1|NULL|-158867.0479846989
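The coefficients above can be read as the terms of a sum. The Python sketch below shows one way a prediction could be assembled from them, under two assumptions that the page does not confirm: categorical features are one-hot encoded, and a category without its own coefficient contributes nothing. Only a subset of the categorical coefficients is copied in, `predict` is a hypothetical helper, and the parcel values are invented.

```python
INTERCEPT = 313541950.3062068

# Subset of the categorical coefficients from the output above.
CATEGORICAL = {
    "PARUSEDESC": {
        "SINGLE FAMILY": -812725.0483721431,
        "CONDOMINIUMS": 145061.7512208006,
    },
    "CNTYNAME": {
        "MIAMI-DADE": -14247.32188426969,
        "BROWARD": 11489.81580546272,
    },
}

# Numeric features have a single coefficient each.
NUMERIC = {
    "ACRES": -7317.375525209893,
    "TOTLVGAREA": 101.4888021198855,
    "EFFYRBLT": 3995.56463628857,
    "SALEYR1": -158867.0479846989,
}


def predict(row):
    """Intercept + matching category coefficients + coef * value per numeric feature."""
    total = INTERCEPT
    for feature, coefs in CATEGORICAL.items():
        # Assumption: a category with no coefficient of its own adds 0.
        total += coefs.get(row[feature], 0.0)
    for feature, coef in NUMERIC.items():
        total += coef * row[feature]
    return total


# Invented parcel, not taken from the dataset.
parcel = {
    "PARUSEDESC": "CONDOMINIUMS",
    "CNTYNAME": "BROWARD",
    "ACRES": 0.1,
    "TOTLVGAREA": 1200,
    "EFFYRBLT": 1990,
    "SALEYR1": 2019,
}
print(predict(parcel))
```

Because the numeric features are in raw units, the large intercept is mostly offset by the SALEYR1 term for any realistic sale year, so the intercept alone should not be read as a typical price.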