This blog deals with predicting diabetes by comparing four supervised learning models: Logistic Regression, K-Nearest Neighbors, Random Forest Classifier, and Support Vector Machine Classifier.
The dataset was uploaded from Kaggle; to find the link to the Diabetes dataset, Click Here.
The main goal of this blog is to achieve the best possible accuracy, so that we can identify the people who are affected by diabetes and help them reduce their risk by consulting a doctor immediately.
My web application for diabetes prediction is: Diabetes Predictor App
Importing necessary libraries for the project
import pandas as pd
import numpy as np
np.random.seed(42) #fix the random seed for reproducibility
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
#Importing models needed
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, classification_report
from sklearn.metrics import precision_score, recall_score, f1_score
from sklearn.metrics import plot_roc_curve
#Suppressing warning messages in the output
from warnings import filterwarnings
filterwarnings("ignore")
Importing the Diabetes dataset from Kaggle
data = pd.read_csv("diabetes.csv")
print(data.info())
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 768 entries, 0 to 767
Data columns (total 9 columns):
 #   Column                    Non-Null Count  Dtype
---  ------                    --------------  -----
 0   Pregnancies               768 non-null    int64
 1   Glucose                   768 non-null    int64
 2   BloodPressure             768 non-null    int64
 3   SkinThickness             768 non-null    int64
 4   Insulin                   768 non-null    int64
 5   BMI                       768 non-null    float64
 6   DiabetesPedigreeFunction  768 non-null    float64
 7   Age                       768 non-null    int64
 8   Outcome                   768 non-null    int64
dtypes: float64(2), int64(7)
memory usage: 54.1 KB
None
Counting and plotting the number of diabetic and non-diabetic people present in the dataset
count=data["Outcome"].value_counts()
count.plot(kind="bar", color=["Orange","Blue"])
plt.xticks(np.arange(2),("Non-Diabetic" , "Diabetic"));
Comparison of Outcome with Glucose
pd.crosstab(data.Glucose[::15],data.Outcome).plot(kind="bar",figsize=(18,8),color=["Orange","Blue"])
plt.ylabel("PEOPLE")
plt.xlabel("GLUCOSE")
#plt.xticks(rotation=0)
plt.legend(["Non-Diabetic","Diabetic"])
Now, plotting blood pressure against age, separated into the diabetic and non-diabetic entries in the dataset
plt.figure(figsize=(10,6))
#Plotting Scatter graph of People with Positive and Negative Diabetes
#Positive Data
plt.scatter(data.Age[data.Outcome==1],data.BloodPressure[data.Outcome==1],c="Blue")
#Negative Data
plt.scatter(data.Age[data.Outcome==0],data.BloodPressure[data.Outcome==0],c="Orange")
#ADDING INFORMATION IN THE GRAPH
plt.title("Diabetes with respect to Age and BloodPressure")
plt.xlabel("Age")
plt.ylabel("BloodPressure")
plt.legend(["Diabetic","Non-Diabetic"])
Now, pair-plotting BloodPressure, Glucose, Age, BMI, SkinThickness, Pregnancies and Insulin against the Outcome
sns.set(style="ticks", color_codes=True)
sns.pairplot(data,hue="Outcome",palette="gnuplot");
Now, we will plot histograms of all the columns for the people having diabetes
fig, axis = plt.subplots(nrows= 4, ncols=2, figsize=(12,10))
fig.tight_layout(pad=3)
Diabetic = data.Outcome ==1
axis[0,0].set_title('Glucose')
axis[0,0].hist(data.Glucose[Diabetic])
axis[0,1].set_title('BloodPressure')
axis[0,1].hist(data.BloodPressure[Diabetic])
axis[1,0].set_title('Age')
axis[1,0].hist(data.Age[Diabetic])
axis[1,1].set_title('BMI')
axis[1,1].hist(data.BMI[Diabetic])
axis[2,0].set_title('DiabetesPedigreeFunction')
axis[2,0].hist(data.DiabetesPedigreeFunction[Diabetic])
axis[2,1].set_title('Insulin')
axis[2,1].hist(data.Insulin[Diabetic])
axis[3,0].set_title('SkinThickness')
axis[3,0].hist(data.SkinThickness[Diabetic])
axis[3,1].set_title('Pregnancies')
axis[3,1].hist(data.Pregnancies[Diabetic])
corr_data=data.corr()
corr_data
|  | Pregnancies | Glucose | BloodPressure | SkinThickness | Insulin | BMI | DiabetesPedigreeFunction | Age | Outcome |
|---|---|---|---|---|---|---|---|---|---|
| Pregnancies | 1.000000 | 0.129459 | 0.141282 | -0.081672 | -0.073535 | 0.017683 | -0.033523 | 0.544341 | 0.221898 |
| Glucose | 0.129459 | 1.000000 | 0.152590 | 0.057328 | 0.331357 | 0.221071 | 0.137337 | 0.263514 | 0.466581 |
| BloodPressure | 0.141282 | 0.152590 | 1.000000 | 0.207371 | 0.088933 | 0.281805 | 0.041265 | 0.239528 | 0.065068 |
| SkinThickness | -0.081672 | 0.057328 | 0.207371 | 1.000000 | 0.436783 | 0.392573 | 0.183928 | -0.113970 | 0.074752 |
| Insulin | -0.073535 | 0.331357 | 0.088933 | 0.436783 | 1.000000 | 0.197859 | 0.185071 | -0.042163 | 0.130548 |
| BMI | 0.017683 | 0.221071 | 0.281805 | 0.392573 | 0.197859 | 1.000000 | 0.140647 | 0.036242 | 0.292695 |
| DiabetesPedigreeFunction | -0.033523 | 0.137337 | 0.041265 | 0.183928 | 0.185071 | 0.140647 | 1.000000 | 0.033561 | 0.173844 |
| Age | 0.544341 | 0.263514 | 0.239528 | -0.113970 | -0.042163 | 0.036242 | 0.033561 | 1.000000 | 0.238356 |
| Outcome | 0.221898 | 0.466581 | 0.065068 | 0.074752 | 0.130548 | 0.292695 | 0.173844 | 0.238356 | 1.000000 |
Plotting the correlation matrix as a heatmap to show the positive or negative correlation between the columns present in the dataset
fig,axis = plt.subplots(figsize=(15, 10))
axis = sns.heatmap(corr_data,annot =True , fmt=".2f")
Now, let's split the data into training and test sets
pip install scikit-learn
from sklearn.model_selection import train_test_split
data = data.sample(frac=1).reset_index(drop=True) #shuffle the rows (train_test_split will also shuffle)
data_x=data.drop("Outcome",axis=1)
data_y=data["Outcome"]
#print(data_x)
#print(data_y)
train_x,test_x,train_y,test_y = train_test_split(data_x,data_y,test_size=0.2)
Here, I will evaluate different models to find the best fit.
Logistic Regression is a powerful supervised machine learning algorithm used for binary classification problems, where the target is categorical.
The Logistic Regression model uses an activation function called the sigmoid function so that the output can be converted to categorical values.
The sigmoid squashes any real-valued input into the range (0, 1), which we can read as the probability of the positive class.
As Z goes towards +infinity, Sigmoid(Z) approaches 1 and the prediction becomes 1; as Z goes towards -infinity, Sigmoid(Z) approaches 0 and the prediction becomes 0.
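To make this concrete, here is a minimal sketch of the sigmoid (my own illustration, not part of the original notebook):

#A minimal sketch (not from the original notebook) of the sigmoid described above
def sigmoid(z):
    #squashes any real-valued input into the range (0, 1)
    return 1 / (1 + np.exp(-z))
print(sigmoid(10))   #close to 1, so the prediction is 1
print(sigmoid(-10))  #close to 0, so the prediction is 0
print(sigmoid(0))    #exactly 0.5, the decision boundary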
from sklearn.linear_model import LogisticRegression
#Model Building
model_lr= LogisticRegression(random_state=0)
model_lr.fit(train_x,train_y)
#Model Evaluation (note: this overwrites model_lr with its test accuracy)
model_lr = model_lr.score(test_x,test_y)
model_lr
0.8311688311688312
KNN is an algorithm that learns from labelled data; once the model is trained, it can classify new, unlabelled examples as well.
The model assumes that similar things exist in close proximity: if objects lie near the test object, it treats them as similar, like "birds of a feather flock together".
In KNN, K is usually chosen odd (1, 3, 5, 7, and so on) so that majority voting cannot end in a tie; see the sketch below.
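As a quick illustration of how K might be chosen (a hedged sketch of mine, not part of the original notebook), one can scan the odd values and compare test accuracies:

#A hedged sketch (my own addition): scanning odd values of K to see how
#n_neighbors affects the test accuracy before settling on a value.
from sklearn.neighbors import KNeighborsClassifier
for k in range(1, 16, 2):  #odd values 1, 3, 5, ..., 15
    knn = KNeighborsClassifier(n_neighbors=k)
    knn.fit(train_x, train_y)
    print("K =", k, "-> test accuracy =", knn.score(test_x, test_y))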
from sklearn.neighbors import KNeighborsClassifier
#Building Model
model_knn = KNeighborsClassifier()
model_knn.fit(train_x,train_y)
#Model Evaluation
model_knn = model_knn.score(test_x,test_y)
model_knn
0.7792207792207793
A Support Vector Machine creates a line (hyperplane) separating the different classes of data.
In SVM, we find the points closest to the line separating both classes. These points are the 'support vectors', and the distance between these vectors and the line is called the margin.
The hyperplane for which the margin is maximum is the better hyperplane.
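One practical note: SVMs are sensitive to feature scales, so standardizing the features first can help the margin-based fit. This variant is my own hedged sketch, not part of the original pipeline:

#A hedged variant (my own sketch, not the original pipeline):
#standardize the features before fitting the SVM.
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
scaled_svm = make_pipeline(StandardScaler(), SVC())
scaled_svm.fit(train_x, train_y)
print(scaled_svm.score(test_x, test_y))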
from sklearn import svm
#Model Building
model_svm = svm.SVC()
model_svm.fit(train_x,train_y)
#Model Evaluation
model_svm = model_svm.score(test_x,test_y)
model_svm
0.8116883116883117
A Random Forest classifier uses an ensemble of a large number of individual decision trees; each tree makes its own class prediction, and the class with the most votes becomes our model's prediction.
Random forest classifiers are fast and simple. Moreover, the random forest is a flexible tool and a great algorithm to train early in the model development process.
from sklearn.ensemble import RandomForestClassifier
#Model Building
model_rfc = RandomForestClassifier()
model_rfc.fit(train_x,train_y)
#Model Evaluation
model_rfc = model_rfc.score(test_x,test_y)
model_rfc
0.7857142857142857
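As a hedged aside (my own sketch, not part of the original notebook), a fitted random forest exposes feature_importances_, which shows which columns drive the trees' votes:

#model_rfc above now holds the accuracy score, so we refit a fresh forest here
rfc = RandomForestClassifier(random_state=42)
rfc.fit(train_x, train_y)
importances = pd.Series(rfc.feature_importances_, index=data_x.columns)
print(importances.sort_values(ascending=False))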
Now, we will compare all the models and check which one's accuracy is best.
As we can see below, the accuracies of the models are as follows:
Logistic Regression : 83.12%
K Nearest Neighbors : 77.92%
Support Vector Machine : 81.17%
Random Forest Classifier : 78.57%
comparison_model = pd.DataFrame({"LogisticRegression": model_lr,
                                 "K Nearest Neighbors": model_knn,
                                 "Support Vector Machine": model_svm,
                                 "Random Forest Classifier": model_rfc},
                                index=["Accuracy"])
print(comparison_model)
comparison_model.T.plot.bar(figsize=(15, 10))
          LogisticRegression  ...  Random Forest Classifier
Accuracy            0.831169  ...                  0.785714

[1 rows x 4 columns]
On comparing all the models, we can see that Logistic Regression and Support Vector Machine performed really well. By trying hyperparameter tuning, we can improve the accuracy of the model further.
Hyperparameter tuning using GridSearchCV
Here we search over a set of candidate parameters to find the combination that lets the model achieve better accuracy.
from sklearn.model_selection import GridSearchCV
model_lr_grid= {'C': np.logspace(-4,4,30),
"solver":["liblinear"]}
#Setting up Grid
model_lr_set= GridSearchCV(LogisticRegression(),
param_grid = model_lr_grid,
cv =5,
verbose = True)
#Fitting GridSearchCV
model_lr_set.fit(train_x,train_y)
model_score = model_lr_set.score(test_x,test_y)
print(model_score*100)
Fitting 5 folds for each of 30 candidates, totalling 150 fits
83.76623376623377
As we can see, the accuracy improved to 83.77%.
Model Evaluation
prediction = model_lr_set.predict(test_x)
prediction
array([0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 1, 0, 1, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 1, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 1, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0])
Checking Confusion matrix
sns.set(font_scale=3)
#rows of confusion_matrix are the true labels, columns are the predictions
sns.heatmap(confusion_matrix(test_y, prediction), annot=True, fmt='g')
plt.ylabel("True Label")
plt.xlabel("Predicted Label")
Checking accuracy score
from sklearn.metrics import accuracy_score
acc_score=accuracy_score(test_y,prediction)
print(acc_score*100)
83.76623376623377
Checking Classification Report
report=classification_report(test_y,prediction)
print(report)
              precision    recall  f1-score   support

           0       0.85      0.95      0.89       111
           1       0.80      0.56      0.66        43

    accuracy                           0.84       154
   macro avg       0.82      0.75      0.78       154
weighted avg       0.83      0.84      0.83       154
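The precision_score, recall_score and f1_score functions imported at the top of the notebook can also be called directly on the predictions; for example:

#Checking the individual metrics with the functions imported earlier
print("Precision:", precision_score(test_y, prediction))
print("Recall:", recall_score(test_y, prediction))
print("F1 score:", f1_score(test_y, prediction))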
Finally, we save the model to an output file that can later be used to load the model inside an application.
import pickle
#Saving our trained model to a file so that we can connect it with the Application
pickle.dump(model_lr_set, open("Diabetes_Pred.pkl", "wb"))
model_loaded = pickle.load(open("Diabetes_Pred.pkl", "rb"))
model_loaded.predict(test_x)
model_loaded.score(test_x,test_y)
0.8376623376623377
Hence, our model, built with the Logistic Regression algorithm, is ready with 83.77% accuracy.
There are several reasons that made me choose diabetes prediction as the project for my application.
Firstly, diabetes is a very common disease that is often taken lightly among people, which should not be the case at all.
Secondly, cases of diabetes are increasing rapidly nowadays, one reason being stress, due to which it is now spreading even among people of our age.
Furthermore, I have always wanted to create awareness about the seriousness of this ailment. I hope that with this small start I can build something that helps me achieve that aim.
- Learned about hyperparameter tuning using GridSearchCV and then implemented it.
- Checked the accuracy with all the candidate classifiers and then made the model more accurate.
- Researched the relationships between all the features present in the dataset.
- Saved the trained model's output to a file.
- Googled how to deploy the model in a web application.
Once the model is built, we need to dump it with pickle so that the pickled output file can be used in app.py, linking the model with the UI and helping create the web page.
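For illustration, here is a minimal, hypothetical app.py sketch showing how the pickled model could be wired to a web form; the route name and field names are my assumptions, not the exact code behind the Diabetes Predictor App:

#A hypothetical app.py sketch (route and form field names are assumptions)
import pickle
from flask import Flask, request

app = Flask(__name__)
model = pickle.load(open("Diabetes_Pred.pkl", "rb"))

@app.route("/predict", methods=["POST"])
def predict():
    #assumes the form posts the 8 features in the training column order
    columns = ["Pregnancies", "Glucose", "BloodPressure", "SkinThickness",
               "Insulin", "BMI", "DiabetesPedigreeFunction", "Age"]
    features = [float(request.form[c]) for c in columns]
    outcome = model.predict([features])[0]
    return "Diabetic" if outcome == 1 else "Non-Diabetic"

if __name__ == "__main__":
    app.run()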
1. Create a free account on Heroku.
2. On the dashboard, select Create new app.
3. Link it with your GitHub repository and add the necessary files; click here to see my repository, which has all the necessary files.
4. Select Connect to GitHub; once the connection is established, you can go ahead with a manual deployment.
5. Click on Deploy. You can check the steps to deploy using Heroku here.
For deployment, Heroku needs a Procfile as well as a requirements.txt file, both of which are present in my GitHub repository linked above; a rough sketch of both follows below.
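As a rough illustration (assuming the Flask entry point is app.py with the application object named app), a typical Heroku Procfile contains a single line such as:

web: gunicorn app:app

and requirements.txt simply lists the packages the app depends on, for example:

Flask
gunicorn
scikit-learn
pandas
numpy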