Step #2: Develop Model

Now that we have our data in place, let’s set up everything we need to run executions and train a model.

Inspect Compute Environment

From the blue menu on the left, click the cube icon labelled Environments. Compute Environments are a Domino abstraction based on Docker images – effectively a snapshot of a machine at a given point in time – which allow you to:
- Rerun code from months or years ago and know that the underlying packages/dependencies are unchanged.
- Collaborate while being sure that you and your teammates are using the exact same software environment.
- Cache a large number of packages so you don’t have to wait for them to install the first time you (or a coworker) run your code.
Pick the default Domino Standard Environment and then click Duplicate Environment at the top right. This allows you to modify an existing environment without affecting any projects that may be using it.
Click on the title and rename the duplicate environment WineQuality and then click Update.
Now click Edit Definition in the top right. This will allow us to paste the following into the Dockerfile Instructions section. Inspect these commands to see the package installation and configurations that are being done.
```
# Switch to root user for elevated permissions
USER root

# Install Python packages
RUN pip install h2o auto-sklearn==0.14.6 matplotlib seaborn jsonify flask pyuwsgi requests-oauthlib
RUN pip install dominodatalab-data domino_data_capture

RUN conda config --add channels conda-forge
RUN conda install uwsgi -y
```
- Scroll down to Pluggable Workspaces Tools. This is the area in the Compute Environment where IDEs are made available for end users.
- Scroll down to the Run Setup Scripts section and add the following scripts: a pre-run script that executes when an execution starts and a post-run script that executes when an execution terminates.
Prerun
```
echo "Printing all python package versions: "

pip freeze
```
Postrun
```
echo "Have a great day :)"
```
- Click Build at the bottom of the page to create a new revision with your changes and wait for the Build Status to have a checkmark. You can open the Build Logs link in another tab if you want to see what is happening.
- Finally, navigate to the Projects tab, where you would see any projects that use this compute environment. Then click Projects in the sidebar to navigate back to your project.
In your new project, go to the Settings tab to view the default Hardware Tier and Compute Environment. Ensure they are set to Small and WineQuality respectively.
Click into the Workspaces tab to prepare for the next section.
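
Once the build succeeds, a quick check from a notebook cell can confirm that the packages added in the Dockerfile instructions are actually present. The sketch below is not part of the lab: `installed_versions` is a hypothetical helper built on the standard library's `importlib.metadata`, and the package list simply mirrors a few names from the `pip install` lines above.

```python
from importlib import metadata


def installed_versions(packages):
    """Return {package: version string, or None if it is not installed}."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


# Names taken from the Dockerfile instructions; adjust the list as needed.
for name, version in installed_versions(["h2o", "auto-sklearn", "seaborn"]).items():
    print(f"{name}: {version or 'not installed'}")
```

This reports the same kind of information as the `pip freeze` pre-run script, limited to the packages you care about.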

Exploring Workspaces

Workspaces are interactive sessions in Domino where you can use any of the IDEs you have configured in your Compute Environment or the terminal.

In the top right corner click Create New Workspace.
- Type a name for the Workspace in the Workspace Name cell. Next, click through the available Compute Environments in the Workspace Environment dropdown and ensure that WineQuality is selected.
- Select JupyterLab as the Workspace IDE.
- Click the Hardware Tier dropdown to browse all available hardware configurations - ensure that Small is selected.
- Click Launch now.
Once the workspace has launched, create a new Python notebook by clicking the Python 3 (ipykernel) button.
Once your notebook has loaded, click Data in the blue menu on the left, then select the Data Source we added in the previous section.
- Copy the provided code snippet into your notebook and run the cell.

After running the code snippet, copy the code below into the next cell.

```
from io import StringIO
import pandas as pd

# object_store comes from the Data Source snippet run in the previous cell
s = str(object_store.get("WineQualityData.csv"), 'utf-8')
data = StringIO(s)

df = pd.read_csv(data)
df.head()
```
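
The cell above relies on `object_store.get(...)` returning the file contents as bytes, which are decoded to text and wrapped in a file-like object for `pd.read_csv`. The sketch below reproduces that pattern outside Domino. The `raw` bytes and their column names are made up for illustration and stand in for the real object store response.

```python
from io import StringIO

import pandas as pd

# Made-up bytes standing in for object_store.get("WineQualityData.csv").
raw = b"type,alcohol,quality\nred,9.4,5\nwhite,10.1,6\n"

text = str(raw, "utf-8")                 # decode the bytes to a Python string
sample_df = pd.read_csv(StringIO(text))  # StringIO makes the string behave like a file

print(sample_df.shape)
```

Passing `io.BytesIO(raw)` to `pd.read_csv` should work as well and skips the explicit decode step.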

Now, copy the code snippets below cell-by-cell and run the cells to visualize and prepare the data! (You can click on the + icon to add a blank cell after the current cell).

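```
import seaborn as sns
import matplotlib.pyplot as plt

# Encode wine type as a numeric flag so it appears in the correlation matrix
df['is_red'] = df.type.apply(lambda x: int(x == 'red'))

fig = plt.figure(figsize=(10, 10))
# Restrict to numeric columns; the string column 'type' cannot be correlated
sns.heatmap(df.select_dtypes('number').corr(), annot=True, fmt='.1g')
```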

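```
# Correlation of each numeric feature with quality, sorted ascending
corr_values = df.select_dtypes('number').corr().sort_values(by='quality')['quality'].drop('quality', axis=0)
# Keep features whose absolute correlation exceeds 0.08
important_feats = corr_values[abs(corr_values) > 0.08]
print(important_feats)

sns.set_theme(style="darkgrid")
plt.figure(figsize=(16, 5))
plt.title('Feature Importance for Wine Quality')
plt.ylabel('Pearson Correlation')
# Pass x and y by keyword; newer seaborn versions reject positional data arguments
sns.barplot(x=important_feats.keys(), y=important_feats.values, palette='seismic_r')
```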

```
# Plot the distribution of each selected feature and of the target
for i in list(important_feats.keys()) + ['quality']:
    plt.figure(figsize=(8, 5))
    plt.title('Histogram of {}'.format(i))
    sns.histplot(df[i], kde=True)
```
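
The feature selection above keeps columns whose absolute Pearson correlation with `quality` exceeds 0.08. For reference, the coefficient that `df.corr()` computes by default is the covariance of two columns divided by the product of their standard deviations. A minimal pure-Python sketch of that calculation follows. The numbers are made up and `select_features` is a hypothetical helper, not part of the lab.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def select_features(columns, target, threshold=0.08):
    """Keep features whose |r| with the target is above the threshold."""
    scores = {name: pearson(values, target) for name, values in columns.items()}
    return {name: r for name, r in scores.items() if abs(r) > threshold}


# Made-up toy data, not the wine dataset.
quality = [5, 6, 7, 8]
features = {
    "alcohol": [9.0, 10.0, 11.0, 12.0],  # rises with quality
    "acidity": [0.8, 0.6, 0.4, 0.2],     # falls with quality
    "noise": [1.0, -1.0, -1.0, 1.0],     # no linear relationship with quality
}
print(select_features(features, quality))
```

Because the filter uses the absolute value, strongly negative correlations are kept alongside strongly positive ones.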

Finally, write your data to a Domino Dataset. Domino Datasets are read-write folders that let you efficiently store large amounts of data and share it across projects. They also support access controls and read-only snapshots, but those features are outside the scope of this lab. Run the following cell to save the data:
```
import os

# Save the prepared data to this project's local Domino Dataset
path = '/domino/datasets/local/{}/WineQualityData.csv'.format(os.environ.get('DOMINO_PROJECT_NAME'))
df.to_csv(path, index=False)
```
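
One caveat with the cell above: if `DOMINO_PROJECT_NAME` is not set (for example, when the notebook runs outside Domino), `os.environ.get` returns `None` and the path silently becomes `/domino/datasets/local/None/...`. A slightly more defensive sketch is shown below. `dataset_path` is a hypothetical helper, not a Domino API.

```python
import os


def dataset_path(filename, root="/domino/datasets/local"):
    """Build the path to a file in the current project's local Domino Dataset."""
    project = os.environ.get("DOMINO_PROJECT_NAME")
    if not project:
        # Fail loudly instead of writing to a folder literally named "None".
        raise RuntimeError("DOMINO_PROJECT_NAME is not set; are you running inside Domino?")
    return os.path.join(root, project, filename)
```

With this helper, the write becomes `df.to_csv(dataset_path("WineQualityData.csv"), index=False)`.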
Your notebook should now be populated like the display below.
Rename your notebook to EDA_code.ipynb by right-clicking the file name, then click the save icon.

Syncing Files

Files in a workspace are separate from the files in your project. Now that we’ve finished working on our notebook, we want to sync our latest work.

To do so click on the File Changes tab in the top left corner of your screen.
Enter an informative but brief commit message such as Completed EDA notebook and click Sync All Changes.
Click the Domino logo in the upper-left corner of the blue menu and select the Project page. Then select your project and click Files in the blue sidebar.
Notice that the latest commit will reflect the commit message you just logged and you can see EDA_code.ipynb in your file directory.
Click on your notebook to view it. At the top of your screen, click Link to Goal in the dropdown, then select the goal you created in Defining Project Goals.
Now navigate to Overview and the Manage tab to see your linked notebook.
Click the ellipses on the goal to mark the goal as complete.

Run and Track Experiments

Now it’s time to train our models!

We are taking a three-pronged approach and building a model in sklearn (Python), xgboost (R), and an AutoML ensemble model (h2o).

First, navigate back to your JupyterLab workspace tab. In your file browser go into the scripts folder and inspect multitrain.py.
- Check out the code in the script and comments describing the purpose of each line of code.
- You can also check out any of the training scripts that multitrain.py will call.
Now switch into your other browser tab to return to your Domino Project.
Navigate to the Jobs page.
- Domino Jobs are non-interactive workloads that run a script in your project. While Workspaces stay open for an arbitrary amount of time (within admin-defined limits), Jobs will free up the underlying compute resources as soon as the script execution is finished.
Click on Run.
Type in the following command in the File Name section of the Start a Job pop-up window. Click on Start to run the job.
```
scripts/multitrain.py
```
- Watch as three job runs appear. You may see them in the starting, running, or completed state.
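
Three runs appear presumably because the single `scripts/multitrain.py` job starts a separate job for each training script. The script itself is not reproduced here, so the following is only a sketch of that fan-out pattern. `launch_training_jobs` and `fake_start_job` are hypothetical, the stand-in does not call Domino, and two of the three script names are placeholders.

```python
from typing import Callable, Dict, List


def launch_training_jobs(start_job: Callable[[str], str], scripts: List[str]) -> Dict[str, str]:
    """Start one job per training script and return {script: job id}."""
    return {script: start_job(script) for script in scripts}


def fake_start_job(command: str) -> str:
    """Stand-in for a real job-submission call; it only fabricates an id."""
    return "job:" + command


# Only sklearn_model_train.py is named in this lab; the other two are placeholders.
training_scripts = [
    "scripts/sklearn_model_train.py",
    "scripts/r_model_train.R",     # hypothetical name
    "scripts/h2o_model_train.py",  # hypothetical name
]

for script, job_id in launch_training_jobs(fake_start_job, training_scripts).items():
    print(script, "->", job_id)
```

Injecting the submission function keeps the fan-out logic testable without a Domino deployment. In a real script it would be replaced by whatever call your environment uses to start a job.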
Click into the sklearn_model_train.py job run.
In the Details tab of the Job, note that the Compute Environment and Hardware Tier are tracked to document not only who ran the experiment and when, but what versions of the code, software, and hardware were executed.
Click on the Results tab of the Job. Scroll down to view the visualizations and other outputs.
We’ve now trained three models and it is time to select which model we’d like to deploy.
Refresh the page. Inspect the table and graph to understand the R^2 value and Mean Squared Error (MSE) for each model. From our results it looks like the sklearn model is the best candidate to deploy.
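
For readers less familiar with the two metrics: MSE is the average squared difference between actual and predicted values (lower is better), and R^2 is one minus the ratio of the residual sum of squares to the total sum of squares (closer to 1 is better). A small pure-Python sketch with made-up numbers follows. The function names are illustrative and the values are not results from this lab.

```python
def mse(y_true, y_pred):
    """Mean squared error: average of squared residuals."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot


# Made-up quality scores and predictions.
actual = [5, 6, 7, 8]
predicted = [5.5, 5.5, 7.5, 7.5]

print("MSE:", mse(actual, predicted))
print("R^2:", r_squared(actual, predicted))
```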
In the next section we will deploy the model we trained here!