Azure from Scratch, Part 11: Getting Started with Azure Machine Learning (2) - From Model Creation to Evaluation
Introduction
In the previous article, I introduced the basic usage of Azure Machine Learning Studio (MLStudio). Finding the module, dragging and dropping, and modifying the properties was easier than I thought.
This time, we will train a machine learning model using the workspace created last time and the sample dataset provided with it. We will also use the model to make predictions and assess its accuracy.
The "Adult Census Income Binary Classification dataset" used in the previous article records, for each person, attributes such as age, educational background, and gender, together with whether their income is above $50,000 or not. The income value is in the rightmost column, named "income".
We will use this data to train a machine learning model that, given a person's attribute values, classifies their income as above $50,000 or not.
Building a machine learning model from data for which the answer is already known, as we do this time, is called "supervised learning", and a model built by supervised learning to sort data into classes is called a classification model. This time we will create such a classification model.
First, let's check the state created in the previous article.
Taking a sample dataset called 'Adult Census Income Binary Classification dataset' as input, I used the 'Select Columns in Dataset' module to remove unwanted columns. Then I used the "Split Data" module to split the data into 75% and 25%.
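For readers who think more easily in code, the two preprocessing steps above can be approximated in plain Python. This is only a rough sketch, not how MLStudio implements the modules: the toy rows, the column names other than "income", the helper names select_columns and split_data, and the shuffling with a fixed seed are all assumptions made for illustration.

```python
import random


def select_columns(rows, keep):
    """Keep only the named columns (a stand-in for removing unwanted columns)."""
    return [{name: row[name] for name in keep} for row in rows]


def split_data(rows, fraction=0.75, seed=0):
    """Shuffle the rows and split them into a first part and a remainder."""
    shuffled = rows[:]  # copy so the caller's list is left untouched
    random.Random(seed).shuffle(shuffled)  # fixed seed: an assumption, for repeatability
    cut = int(len(shuffled) * fraction)
    return shuffled[:cut], shuffled[cut:]


# Hypothetical toy rows; only the "income" column name comes from the article.
rows = [
    {"row_id": i, "age": 20 + i, "income": ">50K" if i % 3 == 0 else "<=50K"}
    for i in range(20)
]

cleaned = select_columns(rows, keep=["age", "income"])
train_rows, test_rows = split_data(cleaned, fraction=0.75)
print(len(train_rows), len(test_rows))
```

The first part plays the role of the 75% training output and the second part the role of the 25% output that is kept aside for prediction.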
Let's run the whole thing here once. Click the RUN icon at the bottom of the screen to run. If successful, a green tick will be displayed in the "Select Columns in Dataset" module and "Split Data" module.
Place the training model
MLStudio provides several training modules that output the result of training, and which one you use depends on the type of model. Here we use "Train Model".
[1] Place "Train Model"
Enter "Train Model" in the search window of the module palette on the left side of the screen. Drag and drop the "Train Model" module that appears in the results onto the canvas.
[Note] AzureML training modules
In AzureML, besides "Train Model", the training modules include "Train Cluster Model", "Train Anomaly Detection Model", "Train Matchbox Recommender Model", "Sweep Clustering", and "Tune Model Hyperparameters". "Train Cluster Model" is used for clustering models. "Train Anomaly Detection Model" is used for anomaly detection models. "Train Matchbox Recommender Model" is the training module for a recommendation model called Matchbox Recommender. "Sweep Clustering" and "Tune Model Hyperparameters" are both very useful modules that automatically adjust hyperparameters and output the trained model. Hyperparameters are parameters that a person sets for training, and the results can differ greatly depending on their values. "Sweep Clustering" is used for clustering, and "Tune Model Hyperparameters" is used for regression and classification. (See the following note for the difference between clustering, regression, and classification.) What all of these modules have in common is that they take an algorithm and training data as input and output a trained model. Note that these modules are not themselves trained models; they output trained models.
[2] Connect with lines
Connect the training data to the "Train Model" module you placed. "Train Model" has two input ports (〇) on its top side. The 〇 on the left takes the algorithm, and the 〇 on the right takes the training data. Since the lower-left output ① of the "Split Data" module is set to output 75% of the input data, connect ① to the upper-right 〇 of "Train Model" with a line.
[3] Specify the actual value (answer to the predicted result)
The training data is now connected to "Train Model", but as it stands "Train Model" cannot do its job, because it does not know which column of the data holds the actual values (the answers to the predicted results). There are two types of machine learning: "supervised learning" and "unsupervised learning". In "supervised learning", a model is trained using data that includes the actual values (the answers to the predicted results) as input. Therefore, in "Train Model", we need to specify which column holds the actual values.
[Note] "Supervised Learning" and "Unsupervised Learning"
Machine learning models can be broadly divided into "supervised learning" and "unsupervised learning". The supervision here refers to the actual values (the answers to the predicted results). "Supervised learning" is roughly divided into regression and classification, while clustering belongs to "unsupervised learning". Regression is used to predict a numerical value from the given data, while classification is used to assign data to predefined classes. Clustering is used when you want to automatically group data into clusters without any predefined criteria.
In our data, the column that holds the actual value (the answer to the predicted result) is the income column. Select the "Train Model" module you placed and click "Launch column selector" in the properties on the right. The "Select Columns" dialog opens, so move the income column to the "SELECTED COLUMNS" side and click the check mark in the lower right. Please refer to the previous article for details on how to use "Select Columns".
AzureML comes with a variety of ready-made algorithms. Which algorithm to use is determined by the result you want. The result we want this time is whether the income is above $50,000 or not. When the desired answer is one of two options, A or B, we use a binary classification algorithm.
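To make the role of the actual-value column concrete, here is a minimal Python sketch of what specifying the income column amounts to: the chosen column becomes the answer, and every other column becomes an input. The helper name separate_label, the example rows, and the label strings "<=50K" and ">50K" are assumptions for illustration only.

```python
def separate_label(rows, label_column):
    """Split each row into its input features and its known answer."""
    features = [
        {name: value for name, value in row.items() if name != label_column}
        for row in rows
    ]
    labels = [row[label_column] for row in rows]
    return features, labels


# Hypothetical rows; only the "income" column name comes from the article.
rows = [
    {"age": 25, "education": "HS-grad", "income": "<=50K"},
    {"age": 47, "education": "Masters", "income": ">50K"},
]

features, labels = separate_label(rows, label_column="income")
print(features)
print(labels)
```

In unsupervised learning such as clustering there would be no labels list at all, only the features.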
AzureML provides several algorithms for binary classification, but this time we will use "Two-Class Boosted Decision Tree". "Two-Class Boosted Decision Tree" is one of the binary classification algorithms provided by AzureML, and is a type of algorithm called a decision tree.
Note: When the desired answer is one of two options, A or B, a binary classification algorithm is used. When there are three or more possible answers, the task is called multiclass (multinomial) classification, and the applicable algorithms differ. (Of course, a multiclass classification algorithm can also be used for binary classification.)
[1] Place "Two-Class Boosted Decision Tree"
Enter "two-class" in the search window of the module palette on the left side of the screen. Drag and drop the "Two-Class Boosted Decision Tree" module that appears in the results onto the canvas.
[2] Connect with lines
Connect the placed “Two-Class Boosted Decision Tree” and “Train Model” with a line. The circle on the left side of "Train Model" accepts the input of the algorithm.
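The wiring above (an untrained algorithm on the left input, training data on the right input, a trained model as output) can be sketched in Python as follows. This is a deliberately tiny stand-in and not the boosted decision tree that AzureML uses: ThresholdStump learns a single threshold on one numeric column. All class names, function names, and data values here are hypothetical.

```python
class TrainedStump:
    """A trained model: predicts from one learned threshold."""

    def __init__(self, feature, threshold, positive, negative):
        self.feature = feature
        self.threshold = threshold
        self.positive = positive
        self.negative = negative

    def predict(self, row):
        value = row[self.feature]
        return self.positive if value >= self.threshold else self.negative


class ThresholdStump:
    """An untrained algorithm: it knows how to learn but has learned nothing yet."""

    def __init__(self, feature, positive, negative):
        self.feature = feature
        self.positive = positive
        self.negative = negative

    def fit(self, rows, label_column):
        # Try every observed value as a threshold and keep the one
        # that classifies the most training rows correctly.
        best_threshold, best_correct = None, -1
        for candidate in sorted({row[self.feature] for row in rows}):
            correct = sum(
                (row[self.feature] >= candidate) == (row[label_column] == self.positive)
                for row in rows
            )
            if correct > best_correct:
                best_threshold, best_correct = candidate, correct
        return TrainedStump(self.feature, best_threshold, self.positive, self.negative)


def train_model(algorithm, rows, label_column):
    """Left input: algorithm. Right input: training data. Output: trained model."""
    return algorithm.fit(rows, label_column)


# Hypothetical training rows.
training_rows = [
    {"age": 22, "income": "<=50K"},
    {"age": 28, "income": "<=50K"},
    {"age": 35, "income": "<=50K"},
    {"age": 41, "income": ">50K"},
    {"age": 50, "income": ">50K"},
    {"age": 58, "income": ">50K"},
]

algorithm = ThresholdStump(feature="age", positive=">50K", negative="<=50K")
model = train_model(algorithm, training_rows, label_column="income")
print(model.threshold, model.predict({"age": 30}), model.predict({"age": 45}))
```

The point of the sketch is the separation of roles: the algorithm object carries no learned state, and only the object returned by train_model can make predictions.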
"Train Model" outputs the trained model. Use the output trained model to actually make predictions. There are several types of modules that make predictions using trained models in AzureML, but this time we will use a module called "Score Model".
Note: "Module" generally means a component. Here it refers to the components that make up an AzureML experiment.
[1] Place "Score Model"
Enter "Score Model" in the search window of the module palette on the left side of the screen. Drag and drop the "Score Model" module that appears in the results onto the canvas.
Note: In addition to "Score Model", the modules that make predictions using trained models include "Score Matchbox Recommender" and "Assign Data to Clusters".
[2] Connect with lines
The "Score Model" module takes a trained model and the data for prediction as input. The 〇 on the left of its top side accepts the trained model, and the 〇 on the right of its top side accepts the data for prediction.
Therefore, connect the 〇 on the left side of the upper side to the output of "Train Model", and the 〇 on the right side of the upper side to ② on the right side of the lower side of the "Split Data" module.
[3] Run and check the prediction results
Now you are ready to go. Click the "RUN" icon at the bottom of the screen to run it, and when it finishes successfully, right-click the ○ at the bottom of "Score Model" and click "Visualize".
Look at the far right of the results displayed by Visualize. The income column is the value originally present in the data for prediction. The Scored Label column is the predicted result. The Scored Probabilities column is the probability that the row belongs to one particular class of the two classes in binary classification, and 0.5 is used as the boundary for deciding which class is assigned. In this case, the two classes are income above $50,000 and income of $50,000 or less. The higher the probability of being classified as above $50,000, the closer Scored Probabilities is to 1.
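The relationship between the probability column and the predicted label can be illustrated with a few lines of Python. This is a sketch under assumptions: the function name scored_label, the label strings, and the choice that a probability of exactly 0.5 counts as the higher class are not taken from AzureML.

```python
def scored_label(probability, threshold=0.5, positive=">50K", negative="<=50K"):
    """Turn a probability for the positive class into a predicted class label."""
    # Treating a value equal to the threshold as positive is an assumption.
    return positive if probability >= threshold else negative


# Hypothetical probabilities for four people.
probabilities = [0.03, 0.48, 0.51, 0.97]
for p in probabilities:
    print(p, scored_label(p))
```

Values such as 0.48 and 0.51 end up in different classes even though the model is almost equally unsure about both, which is why the probability column is worth reading alongside the label.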
Note: Some prediction results are blank. This may be because the input contains missing values that caused the prediction to fail (for example, the native-country column is blank). Prediction can also fail for reasons other than missing values, such as insufficient training data, but handling missing values is very important. In machine learning, cleansing, the process of fixing such data inconsistencies in advance, is very important. AzureML has a "Clean Missing Data" module for filling in or removing missing values, and its cleaning rules can be reapplied to other data with the "Apply Transformation" module.
How accurate were the predictions? There are various metrics for evaluating models in machine learning. AzureML has a dedicated module for calculating these metrics, called "Evaluate Model". Let's use it to check the accuracy.
Note: In addition to "Evaluate Model", there are "Evaluate Recommender" and "Cross Validate Model" modules for model evaluation. Normally "Evaluate Model" is used, but "Evaluate Recommender" is used to evaluate the results of Matchbox Recommender, a model that makes recommendations. "Cross Validate Model" not only validates the accuracy of the model but also indicates how representative the dataset is and how sensitive the model is to variations in the data.
[1] Place "Evaluate Model"
Enter "evaluate" in the search window of the module palette on the left side of the screen. Drag and drop the "Evaluate Model" module that appears in the results onto the canvas.
[2] Connect with lines
Connect the lower output 〇 of "Score Model" to the upper-left 〇 of "Evaluate Model" with a line.
[3] Run selected modules only
Now let's run it. Until now, we have executed the experiment by clicking the "RUN" button at the bottom of the screen, which re-executes every module on the canvas from beginning to end. In the current state, however, "Score Model" has already been executed, so it is enough to execute only "Evaluate Model". To execute only part of the experiment like this, right-click the module you want to execute and click "RUN SELECTED".
The running time of MLStudio affects billing, so use this wisely.
[4] Check accuracy
Let's see the output of "Evaluate Model". Right-click the output 〇 of "Evaluate Model" and click "Visualize".
You will then see graphs and metric values such as the Receiver Operating Characteristic (ROC) curve, the confusion matrix, the F1 score, and the AUC (Area Under the Curve) value.
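As a rough illustration of what the AUC value summarizes, the area under the ROC curve equals the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one. The Python sketch below computes it that way by comparing every positive with every negative. It is not how "Evaluate Model" is implemented, and the function name and sample values are hypothetical.

```python
def auc(actual, scores, positive):
    """Fraction of (positive, negative) pairs ranked correctly; ties count as half."""
    positive_scores = [s for a, s in zip(actual, scores) if a == positive]
    negative_scores = [s for a, s in zip(actual, scores) if a != positive]
    wins = 0.0
    for p in positive_scores:
        for n in negative_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positive_scores) * len(negative_scores))


# Hypothetical actual classes and scored probabilities.
actual = [">50K", ">50K", "<=50K", ">50K", "<=50K", "<=50K"]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]
print(round(auc(actual, scores, positive=">50K"), 3))
```

A value of 1 means every positive is ranked above every negative, while a value around 0.5 means the ranking is no better than chance.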
To interpret the model evaluation correctly, we need to understand what the numbers mean. I will omit the detailed explanation here, but the F1 score, which is the harmonic mean of precision and recall, is one useful guideline. The closer the F1 score is to 1, the better the model's predictions.
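For reference, the F1 value can be reproduced by hand from the four counts of the confusion matrix. The following Python sketch shows the arithmetic. The function names and the sample labels are hypothetical, and which class counts as positive is an assumption (here, income above $50,000).

```python
def confusion_counts(actual, predicted, positive):
    """Count true positives, false positives, false negatives, and true negatives."""
    tp = fp = fn = tn = 0
    for a, p in zip(actual, predicted):
        if p == positive and a == positive:
            tp += 1
        elif p == positive:
            fp += 1
        elif a == positive:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn


def f1_score(tp, fp, fn):
    """Harmonic mean of precision and recall; 0.0 when it is undefined."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Hypothetical actual values and predictions for eight people.
actual = [">50K", ">50K", ">50K", "<=50K", "<=50K", "<=50K", "<=50K", ">50K"]
predicted = [">50K", ">50K", "<=50K", "<=50K", "<=50K", ">50K", "<=50K", ">50K"]

tp, fp, fn, tn = confusion_counts(actual, predicted, positive=">50K")
print(tp, fp, fn, tn, f1_score(tp, fp, fn))
```

Because F1 ignores true negatives, it can be more informative than plain accuracy when one class is much rarer than the other.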
Note: The Microsoft Azure website has a page explaining model evaluation, so please refer to it for details. In addition to ROC, other curves can be displayed by clicking Precision/Recall or Lift.
Finally
This time we actually trained a model and made predictions with it, and we also saw how to check the accuracy of the model we created. Most of the work is done just by dragging and dropping modules and connecting them with lines. As you can see, machine learning can be done very quickly this way.
Next time, we will publish the machine learning model created this time as a web service.
WINGS Project Written by Takashi Uesaka (Nextscape Co., Ltd.) / Supervised by Yoshihiro Yamada