Diabetes Prediction Model

This blog deals with predicting diabetes by analysing four supervised learning models: Logistic Regression, K-Nearest Neighbors, Random Forest Classifier, and Support Vector Machine Classifier.

The dataset was uploaded from Kaggle. To find the link to the Diabetes dataset, Click Here.

The major goal of this blog is to achieve the best accuracy possible so that we can predict which people are affected by diabetes and help them reduce their risk by consulting a doctor immediately.

My web application for Diabetes Prediction is: Diabetes Predictor App

Importing necessary libraries for the project
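
Below is a minimal sketch of the imports this walkthrough relies on (pandas, NumPy, Matplotlib, seaborn, and scikit-learn); the exact set in the original notebook may differ.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report
```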

Importing the Diabetes dataset from Kaggle
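
A sketch of the loading step, assuming the Kaggle file is saved locally as diabetes.csv:

```python
# Load the Pima Indians Diabetes dataset (file name is an assumption)
df = pd.read_csv("diabetes.csv")
print(df.shape)
print(df.head())
```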

Counting values and plotting a graph of the diabetic and non-diabetic people present in the dataset
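
A rough sketch of this step, assuming the target column is named Outcome (1 = diabetic, 0 = non-diabetic):

```python
# Count diabetic (1) and non-diabetic (0) entries and plot the counts
print(df["Outcome"].value_counts())

sns.countplot(x="Outcome", data=df)
plt.title("Non-Diabetic (0) vs Diabetic (1)")
plt.show()
```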

Comparison of Outcome with Glucose
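
One way to make this comparison (the choice of a box plot is my assumption):

```python
# Compare glucose levels between the two outcome groups
sns.boxplot(x="Outcome", y="Glucose", data=df)
plt.title("Glucose by Outcome")
plt.show()
```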

Now, finding the blood pressure levels and ages of the entries in the dataset that have diabetes
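
A minimal sketch of this filtering step (the variable name is my assumption):

```python
# Keep only the rows with diabetes and summarise blood pressure and age
diabetic = df[df["Outcome"] == 1]
print(diabetic[["BloodPressure", "Age"]].describe())
```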

Now, plotting a pair plot to check BloodPressure, Glucose, Age, BMI, SkinThickness, Pregnancies and Insulin with respect to Outcome
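
This can be done with seaborn's pairplot, colouring the points by Outcome:

```python
# Pair plot of all features, coloured by Outcome
sns.pairplot(df, hue="Outcome")
plt.show()
```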

Now, we will plot all the columns for the people having diabetes
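
A simple way to sketch this, plotting histograms of every column for the diabetic subset defined above:

```python
# Histograms of every column for the diabetic subset
diabetic.hist(figsize=(12, 10))
plt.tight_layout()
plt.show()
```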

Printing the correlation matrix to show positive or negative correlations between the columns present in the dataset
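
A minimal sketch using a seaborn heatmap (the colour map is my choice):

```python
# Correlation matrix between all columns, shown as a heatmap
sns.heatmap(df.corr(), annot=True, cmap="coolwarm")
plt.show()
```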

Now, let's split the data into training and test sets and check
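
A sketch of the split; the 80/20 ratio and random seed are my assumptions, not necessarily what the original notebook used:

```python
# Separate features and target, then hold out a test set
X = df.drop("Outcome", axis=1)
y = df["Outcome"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
```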

Here, I will evaluate different models to find the best fit

Building Logistic Regression Model

A Logistic Regression model is a powerful supervised machine learning algorithm used for binary classification problems, where the target is categorical.

The Logistic Regression model uses an activation function called the sigmoid function so that the output can be converted to a categorical value.

The sigmoid makes the linear output non-linear: it squashes every value into the range (0, 1) instead of letting arbitrarily large values pass through.

If Z goes towards +infinity, Sigmoid(Z) = 1 / (1 + e^(-Z)) approaches 1 and the prediction becomes 1; if Z goes towards -infinity, Sigmoid(Z) approaches 0 and the prediction becomes 0.
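
A minimal sketch of fitting the logistic regression model with scikit-learn (max_iter is my assumption):

```python
# Fit a logistic regression classifier and check its test accuracy
log_reg = LogisticRegression(max_iter=1000)
log_reg.fit(X_train, y_train)
print("Logistic Regression accuracy:", accuracy_score(y_test, log_reg.predict(X_test)))
```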

Building KNN Classifier

The KNN algorithm learns from labelled data; once the model is trained, it can classify new, unlabelled points as well as labelled ones.

This model assumes that similar things exist in close proximity: if objects lie near the test object, the model treats them as similar.

Think of it like "a flock of birds."

In KNN, K is usually chosen to be odd (1, 3, 5, 7, and so on) so that the votes cannot tie in a binary classification.
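
A minimal sketch of the KNN model; the value K = 7 is an assumption and should be tuned:

```python
# Fit a K-Nearest Neighbors classifier with an odd K
knn = KNeighborsClassifier(n_neighbors=7)
knn.fit(X_train, y_train)
print("KNN accuracy:", accuracy_score(y_test, knn.predict(X_test)))
```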

Building Support Vector Classifier

A Support Vector Machine creates a line (or hyperplane) that separates the data into different classes.

In SVM, we find the points closest to the line separating the two classes. These points are the 'support vectors', and the distance between them and the separating line is called the margin.

The hyperplane for which the margin is maximum is the best hyperplane.
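
A minimal sketch of the support vector classifier; the linear kernel is my assumption:

```python
# Fit a Support Vector Classifier
svm = SVC(kernel="linear")
svm.fit(X_train, y_train)
print("SVM accuracy:", accuracy_score(y_test, svm.predict(X_test)))
```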

Building Random Forest Classifier

The Random Forest classifier is an ensemble of a large number of individual decision trees; each tree makes its own class prediction, and the class with the most votes becomes our model's prediction.

Random Forest classifiers are fast and simple. Moreover, the algorithm is a flexible tool and a great one to train early in the model development process.
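
A minimal sketch of the random forest model; the number of trees is my assumption:

```python
# Fit a Random Forest classifier
rf = RandomForestClassifier(n_estimators=100, random_state=42)
rf.fit(X_train, y_train)
print("Random Forest accuracy:", accuracy_score(y_test, rf.predict(X_test)))
```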

Now, we will compare all the models and check which has the best accuracy.

As we can see, the accuracies of the models are as follows:
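
A sketch of how the four fitted models can be compared on the same test set:

```python
# Print the test accuracy of every model side by side
models = {"Logistic Regression": log_reg, "KNN": knn, "SVM": svm, "Random Forest": rf}
for name, model in models.items():
    print(f"{name}: {accuracy_score(y_test, model.predict(X_test)):.4f}")
```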

Comparing all the models, we can see that Logistic Regression and Support Vector Machine performed really well. By applying hyperparameter tuning, we can improve the accuracy of the model further.

Hyperparameter tuning using GridSearchCV

Here we choose a set of optimal hyperparameters for our model so that the learning algorithm gives results with better accuracy.
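
A sketch of the grid search over the logistic regression model; the parameter grid and cross-validation settings are my assumptions:

```python
# Search for the best regularisation strength and solver
param_grid = {"C": [0.01, 0.1, 1, 10, 100], "solver": ["liblinear", "lbfgs"]}
grid = GridSearchCV(LogisticRegression(max_iter=1000), param_grid, cv=5, scoring="accuracy")
grid.fit(X_train, y_train)
print("Best parameters:", grid.best_params_)
print("Best cross-validation accuracy:", grid.best_score_)
best_model = grid.best_estimator_
```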

As we can see, the accuracy improves to 83.77%.

Model Evaluation

Checking Confusion matrix

Checking accuracy score

Checking Classification Report
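
A sketch of these three checks on the held-out test set, using the tuned model from the grid search:

```python
# Evaluate the tuned model on the test set
y_pred = best_model.predict(X_test)
print(confusion_matrix(y_test, y_pred))
print("Accuracy:", accuracy_score(y_test, y_pred))
print(classification_report(y_test, y_pred))
```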

This step saves the model to an output file, which can later be used to load the model in an application.
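
A minimal sketch of the save step with pickle; the file name is my assumption:

```python
# Save the trained model so the web application can load it later
import pickle

with open("diabetes_model.pkl", "wb") as f:
    pickle.dump(best_model, f)
```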

Hence, our model based on the Logistic Regression algorithm is ready with 83.77% accuracy.

Why

There are several reasons why I chose Diabetes Prediction as the project for my application.

Contribution

Learned about hyperparameter tuning using GridSearchCV and then implemented it.

Checked the accuracy of all the candidate classifiers and then improved the accuracy of the chosen model.

Challenges

Procedure to deploy the application through the free hosting platform Heroku

  1. Once the model is built, we need to dump it with pickle so that the pickle output file can be used in app.py to link the model with the UI and help create the web page (a sketch of such an app.py appears below).

  2. Create a free account on Heroku.

  3. On the dashboard, select "Create new app".

  4. Link it with your GitHub repository and add the necessary files. Click here to see my repository, which has all the necessary files.

  5. Select "Connect to GitHub"; once the connection is established, you can go ahead with a manual deployment.

  6. Click on Deploy. You can check the steps to deploy using Heroku here.

For deployment we need a Procfile as well as a requirements.txt file for Heroku; both are present in the linked GitHub repository.
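
For reference, here is a rough sketch of what the app.py mentioned in step 1 could look like, assuming a Flask app; the file names, route, and form handling are my assumptions, not the exact application code:

```python
# app.py -- minimal Flask sketch that serves predictions from the pickled model
import pickle

import numpy as np
from flask import Flask, render_template, request

app = Flask(__name__)
model = pickle.load(open("diabetes_model.pkl", "rb"))  # assumed file name

@app.route("/predict", methods=["POST"])
def predict():
    # Collect the eight feature values from the form, in training-column order
    features = [float(x) for x in request.form.values()]
    prediction = model.predict(np.array(features).reshape(1, -1))[0]
    return render_template("index.html", prediction_text=f"Outcome: {prediction}")

if __name__ == "__main__":
    app.run()
```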

References