# Linear Regression **Summary** Linear regression models are a type of supervised learning algorithm that predict a continuous target variable by fitting a straight line to the input features. They are commonly used for predicting numeric outcomes based on one or more input variables. Advantages of linear regression include simplicity, interpretability, and fast computation time. However, disadvantages include the assumption of a linear relationship between variables, which may not always hold, and sensitivity to both outliers and collinear features, which can affect the model's performance and reported accuracy. Note: If you are not sure that your data meet these requirements, you may want to first do some exploratory visualizations and transformations using scatter plots to visualize pairwise relationships. Altneratively, HeavyML makes it simple with to try first or compare a regression model with fewer assumptions, such as Random Forests. **Example** ```sql CREATE OR REPLACE MODEL florida_parcels_sale_prc_lr OF TYPE LINEAR_REG AS SELECT SALEPRC1, PARUSEDESC, CNTYNAME, ACRES, TOTLVGAREA, EFFYRBLT, SALEYR1 FROM florida_parcels_2020 with (CAT_TOP_K=20, EVAL_FRACTION=0.2); ``` **Linear Regression Options** With the exception of the general options listed above, the linear regression model type accepts no options. **Model Evaluation** Like all other regression model types, the model r2 can be obtained via the `EVALUATE MODEL` command. If the model was created with a specified `EVAL_FRACTION`, the model r2 score can be obtained on that test holdout set via the following: ```sql EVALUATE MODEL florida_parcels_sale_prc_lr; r2 0.08852338436213192 ``` If no `EVAL_FRACTION` was specified in the `CREATE MODEL` command, or if `EVAL_FRACTION` was specified but you wish to evaluate the model on a different dataset, you can specify the evaluation query explicitly as follows: ```sql EVALUATE MODEL florida_parcels_sale_prc_lr ON SELECT SALEPRC1, PARUSEDESC, CNTYNAME, ACRES, TOTLVGAREA, EFFYRBLT, SALEYR1 FROM florida_parcels_2020 WHERE PARUSEDESC = 'CONDOMINIUMS'; r2 0.03106805209354901 ``` The relatively low R2 scores obtained for the linear regression model are not atypical for complex multi-variate relations. As noted above, in such cases, it will likely be worth trying a random forest or Gradient-Boosted Tree (GBT) regression model, as the accuracy of these models can be dramatically higher in many cases (in the example above, a simple random forest model achieved an R2 score above 0.87). **Model Prediction/Inference** Once a linear regression model is created, it can, like all other regression model types, be used for prediction via the row-wise `ML_PREDICT` operator, which takes the model name (in quotes) and a list of independent predictor variables semantically matching the ordered list of variables the model was trained on. ```sql SELECT SALEPRC1 as actual_sales_price, ML_PREDICT( 'florida_parcels_sale_prc_lr', PARUSEDESC, CNTYNAME, ACRES, TOTLVGAREA, EFFYRBLT, SALEYR1 ) AS predicted_sales_price FROM florida_parcels_2020 WHERE SALEPRC1 BETWEEN 100000 AND 500000 limit 10; actual_sales_price|predicted_sales_price 211000|-30912.60198199749 152400|6559.390672445297 164000|35608.10665637255 153900|56984.85121244192 143500|52565.25603222847 144000|64931.58916777372 140000|79256.96579062939 160000|90230.21915191412 162000|56753.09885531664 107000|80915.15436685085 ``` **Related Methods** A list of predictors for a trained linear regression model, along with their associated coefficients, can be obtained by executing the `linear_reg_coefs` table function, as shown in the following example; ```sql SELECT * FROM TABLE(linear_reg_coefs(model_name=>'florida_parcels_sale_prc_lr')); coef_idx|feature|sub_coef_idx|sub_feature|coef 0|intercept|1|NULL|313541950.3062068 1|PARUSEDESC|1|SINGLE FAMILY|-812725.0483721431 1|PARUSEDESC|2|CONDOMINIUMS|145061.7512208006 1|PARUSEDESC|3|VACANT RESIDENTIAL|-696124.6904133513 1|PARUSEDESC|4|MOBILE HOMES|-793766.3806761563 1|PARUSEDESC|5|MULTI-FAMILY - FEWER THAN 10 UNITS|-680141.6707517173 1|PARUSEDESC|6|RESIDENTIAL COMMON ELEMENTS / AREAS|-200166.7979917362 2|CNTYNAME|1|MIAMI-DADE|-14247.32188426969 2|CNTYNAME|2|BROWARD|11489.81580546272 2|CNTYNAME|3|PALM BEACH|4753.91666378783 2|CNTYNAME|4|LEE|-187796.8228951865 2|CNTYNAME|5|HILLSBOROUGH|1906136.710600209 2|CNTYNAME|6|ORANGE|14400.40299565677 2|CNTYNAME|7|PINELLAS|340509.3627915879 2|CNTYNAME|8|DUVAL|-6297.904384635765 2|CNTYNAME|9|POLK|51934.57494268686 2|CNTYNAME|10|BREVARD|-121257.6655694735 2|CNTYNAME|11|VOLUSIA|-14876.60197626388 2|CNTYNAME|12|SARASOTA|-80945.57973416231 2|CNTYNAME|13|COLLIER|-130752.9965216073 2|CNTYNAME|14|PASCO|27699.23056136876 2|CNTYNAME|15|MARION|-60143.98878492894 2|CNTYNAME|16|CHARLOTTE|-71174.58469642041 2|CNTYNAME|17|MANATEE|38203.76868561743 2|CNTYNAME|18|LAKE|45848.48937542497 2|CNTYNAME|19|SEMINOLE|64744.92416669091 2|CNTYNAME|20|OSCEOLA|228911.8974889433 3|ACRES|1|NULL|-7317.375525209893 4|TOTLVGAREA|1|NULL|101.4888021198855 5|EFFYRBLT|1|NULL|3995.56463628857 6|SALEYR1|1|NULL|-158867.0479846989 ```