Predicting Diabetes with Neural Networks

Mark Subra · Published in Analytics Vidhya
4 min read · Oct 24, 2020


Diabetes is a chronic medical condition estimated to affect 415 million people worldwide. Around 5 million deaths a year are attributed to diabetes-related complications, and diabetes is a common comorbidity in COVID-19 deaths.

The early stages of diabetes are often asymptomatic, which makes early detection and diagnosis difficult. Poor diet, lack of exercise, and excess body weight are significant risk factors. These are linked to the onset of type 2 diabetes, which results from the body's gradual resistance to insulin. Type 1 diabetes is different: it results from the body's inability to produce sufficient insulin in the first place.

Type 2 diabetes can often be prevented or reversed if diagnosed early. Using a dataset of patients' vital statistics and known diagnoses, we can train a neural network to predict diabetes, then use it to make predictions on new patients.

Pima Indians Diabetes Dataset

The Pima Indians are a group of Native Americans living in Arizona who have been widely studied due to their genetic predisposition to diabetes; the incidence of type 2 diabetes among the Pima Indians is the highest in the world. The dataset is hosted on Kaggle courtesy of the UCI Machine Learning Repository. It is a sample of measurements collected from female patients, along with an indication of whether each patient developed diabetes within five years of the measurement.
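To follow along, here is a minimal sketch of loading the data with pandas. The local filename diabetes.csv is an assumption (it is the default name of the Kaggle download):

import pandas as pd

# Load the Pima Indians Diabetes dataset (assumed to be saved locally
# as diabetes.csv, the default name of the Kaggle download)
df = pd.read_csv('diabetes.csv')

print(df.shape)  # (768, 9): 8 feature columns plus the Outcome column
print(df['Outcome'].value_counts())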

Initial Observations

The dataset has eight feature columns along with the outcome, a binary classification target. Plotting histograms of the dataframe shows the shape of each distribution and also reveals values that cannot plausibly be 0, such as BMI, blood pressure, and blood glucose.
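As a sketch (not the author's notebook), the histograms and the suspicious zeros can be inspected like this, continuing from the df loaded above:

import matplotlib.pyplot as plt

# Histograms of every column show each distribution at a glance
df.hist(figsize=(12, 10))
plt.tight_layout()
plt.show()

# Count the physiologically impossible zeros in each affected column
for col in ['Glucose', 'BloodPressure', 'SkinThickness', 'Insulin', 'BMI']:
    print(col, (df[col] == 0).sum())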

Before going any further, the missing or 0 values need to be replaced or removed, and the data needs to be scaled for the neural network to work effectively. Data cleaning and preprocessing are a whole other subject, which I will cover in a future post.
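Still, here is a rough sketch of what that preprocessing might look like, producing the X_train and X_test arrays used below. Median imputation, a 20% test split, and standard scaling are my assumptions, not necessarily the author's exact steps (the 491-sample training count in the logs suggests a further validation split was also carved out):

import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Treat the impossible zeros as missing and fill with the column median
zero_cols = ['Glucose', 'BloodPressure', 'SkinThickness', 'Insulin', 'BMI']
df[zero_cols] = df[zero_cols].replace(0, np.nan)
df[zero_cols] = df[zero_cols].fillna(df[zero_cols].median())

X = df.drop(columns='Outcome').values
y = df['Outcome'].values

# Hold out 20% of the data for testing (assumed split)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

# Scale the features so the network trains effectively
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)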

Building the Neural Network

The neural network is built with Keras using the Sequential class.

Below is my code for the neural network:

from keras.models import Sequential
from keras.layers import Dense

model = Sequential()

# First hidden layer: 32 units, ReLU, expecting the 8 input features
model.add(Dense(32, activation='relu', input_dim=8))

# Second hidden layer
model.add(Dense(16, activation='relu'))

# Output layer: one sigmoid unit for binary classification
model.add(Dense(1, activation='sigmoid'))

# Compile the model
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
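As an optional sanity check (not in the original post), model.summary() confirms the architecture; the parameter counts follow directly from the layer sizes:

model.summary()
# Dense(32): 8*32 + 32  = 288 parameters
# Dense(16): 32*16 + 16 = 528 parameters
# Dense(1):  16*1 + 1   =  17 parameters
# Total: 833 trainable parameters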

# Train for 200 epochs
model.fit(X_train, y_train, epochs=200)
Epoch 1/200
491/491 [==============================] - 2s 3ms/step - loss: 0.6517 - accuracy: 0.6375
.
.
.
Epoch 200/200
491/491 [==============================] - 0s 77us/step - loss: 0.2057 - accuracy: 0.9165

After 200 epochs of training, accuracy on the training data improved from 0.6375 to 0.9165. My full code can be found here.

Results Analysis

The following code evaluates the model's predictions on both the training and test sets:

# Train and test accuracy
scores = model.evaluate(X_train, y_train)
print("Training Accuracy: %.2f%%\n" % (scores[1] * 100))
scores = model.evaluate(X_test, y_test)
print("Testing Accuracy: %.2f%%\n" % (scores[1] * 100))
491/491 [==============================] - 1s 1ms/step
Training Accuracy: 91.45%
154/154 [==============================] - 0s 645us/step
Testing Accuracy: 79.87%

import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.metrics import confusion_matrix

# Threshold the sigmoid outputs at 0.5 to get class labels
# (model.predict_classes has been removed from recent Keras releases)
y_test_pred = (model.predict(X_test) > 0.5).astype(int)
cm = confusion_matrix(y_test, y_test_pred)

ax = sns.heatmap(cm, annot=True,
                 xticklabels=['No Diabetes', 'Diabetes'],
                 yticklabels=['No Diabetes', 'Diabetes'],
                 cbar=False, cmap='Blues')
ax.set_xlabel('Prediction')
ax.set_ylabel('Actual')
plt.show()
Figure: confusion matrix of test-set predictions, showing false positives and false negatives

The test set accuracy was 79.87%, which is not bad for a simple neural network, but it could certainly be improved; the gap between training accuracy (91.45%) and test accuracy (79.87%) also suggests the model is overfitting. There were only 9 false positives but 22 false negatives, and in a medical screening context false negatives (missed diagnoses) are the costlier error. This dataset has only eight features, which may not be enough to make truly accurate predictions. Feature engineering is more likely to be useful than increasing the complexity of the neural network.
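As a quick extension (not in the original post), sklearn's classification_report makes that error trade-off explicit by reporting per-class precision and recall alongside accuracy:

from sklearn.metrics import classification_report

# Per-class precision and recall expose the false-negative problem
# in a way overall accuracy does not
print(classification_report(y_test, y_test_pred,
                            target_names=['No Diabetes', 'Diabetes']))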
