Kaggle: Predict the occurrence of diabetes

This project presents a code/kernel used in a Kaggle competition promoted by Data Science Academy in January 2019.

The goal of the competition was to create a Machine Learning model to predict the occurrence of diabetes.

Data source: National Institute of Diabetes and Digestive and Kidney Diseases

Competition page: kaggle.com/c/competicao-dsa-machine-learnin..


Below, the EDA is presented along with the source code used to perform the data pre-processing and transformation and to build the machine learning models.

Exploratory Data Analysis

Data fields:

  • num_gestacoes - Number of times pregnant
  • glicose - Plasma glucose concentration in oral glucose tolerance test
  • pressao_sanguinea - Diastolic blood pressure in mm Hg
  • grossura_pele - Thickness of the triceps skinfold in mm
  • insulina - Insulin (mu U / ml)
  • bmi - Body mass index measured by weight in kg / (height in m) ^ 2
  • indice_historico - Diabetes History Index (Pedigree Function)
  • idade - Age in years
  • classe - Class (0 - did not develop disease / 1 - developed disease)

Loading the data

# Importing packages
import numpy as np 
import pandas as pd 
import seaborn as sns
from scipy import stats
import matplotlib.pyplot as plt
import warnings
warnings.filterwarnings("ignore", category=FutureWarning)
%matplotlib inline

# Loading the data
data = pd.read_csv('data/dataset_treino.csv')
test_data = pd.read_csv('data/dataset_teste.csv')
data.head(5)
id num_gestacoes glicose pressao_sanguinea grossura_pele insulina bmi indice_historico idade classe
0 1 6 148 72 35 0 33.6 0.627 50 1
1 2 1 85 66 29 0 26.6 0.351 31 0
2 3 8 183 64 0 0 23.3 0.672 32 1
3 4 1 89 66 23 94 28.1 0.167 21 0
4 5 0 137 40 35 168 43.1 2.288 33 1

Data overview

# General statistics
data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 600 entries, 0 to 599
Data columns (total 10 columns):
id                   600 non-null int64
num_gestacoes        600 non-null int64
glicose              600 non-null int64
pressao_sanguinea    600 non-null int64
grossura_pele        600 non-null int64
insulina             600 non-null int64
bmi                  600 non-null float64
indice_historico     600 non-null float64
idade                600 non-null int64
classe               600 non-null int64
dtypes: float64(2), int64(8)
memory usage: 47.0 KB

All 8 predictor variables (features) are quantitative (numerical), and we have 600 observations to build the prediction model.

The only qualitative column is the label (classe), where:

  • 0 - do not have the disease
  • 1 - have the disease

Data Cleaning

Checking if there are missing values

# If the result is False, there is no missing value
data.isnull().values.any()
False
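
If we want a per-column breakdown instead of a single True/False answer, pandas' isnull().sum() gives the number of missing values in each column (a quick complementary check, not strictly needed here):

# Optional: number of missing values per column (all zeros for this dataset)
data.isnull().sum()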

Computing statistics for each column

data.describe()
id num_gestacoes glicose pressao_sanguinea grossura_pele insulina bmi indice_historico idade classe
count 600.000000 600.000000 600.000000 600.000000 600.000000 600.000000 600.000000 600.000000 600.000000 600.000000
mean 300.500000 3.820000 120.135000 68.681667 20.558333 79.528333 31.905333 0.481063 33.278333 0.346667
std 173.349358 3.362009 32.658246 19.360226 16.004588 116.490583 8.009638 0.337284 11.822315 0.476306
min 1.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.078000 21.000000 0.000000
25% 150.750000 1.000000 99.000000 64.000000 0.000000 0.000000 27.075000 0.248000 24.000000 0.000000
50% 300.500000 3.000000 116.000000 70.000000 23.000000 36.500000 32.000000 0.384000 29.000000 0.000000
75% 450.250000 6.000000 140.000000 80.000000 32.000000 122.750000 36.525000 0.647000 40.000000 1.000000
max 600.000000 17.000000 198.000000 122.000000 99.000000 846.000000 67.100000 2.420000 81.000000 1.000000

From the table above, we can see zero values in almost all columns. For some of these columns zero makes sense, like num_gestacoes and classe, but for others, like pressao_sanguinea or bmi, a zero value definitely doesn't make sense.

After reading some papers about the variables in the dataset, I found that some columns can have values very close to zero (e.g. grossura_pele), but others cannot be zero at all.

The following columns cannot have a zero value:

  • glicose
  • pressao_sanguinea
  • bmi

Let's see the number of the occurrences of zero values for all columns:

# Compute the number of occurrences of a zero value 
features = ['num_gestacoes', 'glicose', 'pressao_sanguinea', 'grossura_pele', 'insulina', 'bmi', 'indice_historico', 'idade']
for c in features:
    counter = len(data[data[c] == 0])    
    print('{} - {}'.format(c, counter))
num_gestacoes - 93
glicose - 5
pressao_sanguinea - 28
grossura_pele - 175
insulina - 289
bmi - 9
indice_historico - 0
idade - 0

We can also see that the insulina column has 289 zero values, which corresponds to about 48% of the training data.

Let's remove the observations with zero values in the selected columns (glicose, pressao_sanguinea, and bmi).

# Removing observations with zero value
data_cleaned = data.copy()   
for c in ['glicose', 'pressao_sanguinea', 'bmi']:
    data_cleaned = data_cleaned[data_cleaned[c] != 0]

data_cleaned.shape
(564, 10)

The final number of observations was 564.

Let's compute some statistics again:

data_cleaned.describe()
id num_gestacoes glicose pressao_sanguinea grossura_pele insulina bmi indice_historico idade classe
count 564.000000 564.000000 564.000000 564.000000 564.000000 564.000000 564.000000 564.000000 564.000000 564.000000
mean 300.664894 3.845745 121.354610 72.049645 21.432624 84.406028 32.367199 0.483294 33.448582 0.340426
std 173.410435 3.349287 31.130992 12.261552 15.809953 118.432015 6.974710 0.337668 11.868844 0.474273
min 1.000000 0.000000 44.000000 24.000000 0.000000 0.000000 18.200000 0.078000 21.000000 0.000000
25% 150.750000 1.000000 99.000000 64.000000 0.000000 0.000000 27.300000 0.250500 24.000000 0.000000
50% 298.500000 3.000000 116.000000 72.000000 23.500000 49.000000 32.000000 0.389000 29.000000 0.000000
75% 450.250000 6.000000 141.250000 80.000000 33.000000 130.000000 36.600000 0.648250 41.000000 1.000000
max 600.000000 17.000000 198.000000 122.000000 99.000000 846.000000 67.100000 2.420000 81.000000 1.000000

Checking outliers

Let's check if there are outliers in the data.

First, we'll use a set of boxplots, one for each column.

# One boxplot per feature column (skipping id and classe)
fig, axes = plt.subplots(2, 4, figsize=(20, 8))
for ax, column in zip(axes.flatten(), data_cleaned.columns[1:-1]):
    sns.boxplot(x=data_cleaned[column], ax=ax)

[Figure: boxplots of each feature column]

We can see some possible outliers for almost all columns (the isolated points in the plots).

The outliers can either be mistakes or just natural variance. For now, let's treat all of them as mistakes.

To remove these outliers we can use Z-Score or IQR (Interquartile Range).

The Z-score of a value is the signed number of standard deviations by which it lies above or below the mean: z = (x - mean) / standard deviation. Computing Z-scores re-scales and centers the data (mean 0, standard deviation 1), so we can look for points that lie too far from zero. In most cases a threshold of 3 is used: if the absolute Z-score of a value is greater than 3, that observation is treated as an outlier.

Let's use the Z-score function defined in the Scipy library to detect the outliers:

# Compute the Z-Score for each column

print(data_cleaned.shape)

z = np.abs(stats.zscore(data_cleaned))    
data_cleaned = data_cleaned[(z < 3).all(axis=1)]   

print(data_cleaned.shape)
(564, 10)
(531, 10)

Using Z-Score, 33 observations were removed.
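
For comparison, an IQR-based filter (the other approach mentioned earlier) could look like the sketch below. It is not used in this kernel, and the 1.5 multiplier is just the usual convention:

# Sketch of an IQR (interquartile range) outlier filter - not used in this kernel
Q1 = data_cleaned[features].quantile(0.25)
Q3 = data_cleaned[features].quantile(0.75)
IQR = Q3 - Q1
outlier_mask = ((data_cleaned[features] < Q1 - 1.5 * IQR) |
                (data_cleaned[features] > Q3 + 1.5 * IQR)).any(axis=1)
data_cleaned_iqr = data_cleaned[~outlier_mask]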

Let's see the boxplots again:

# Boxplots again, after removing the outliers
fig, axes = plt.subplots(2, 4, figsize=(20, 8))
for ax, column in zip(axes.flatten(), data_cleaned.columns[1:-1]):
    sns.boxplot(x=data_cleaned[column], ax=ax, palette="Set2")

[Figure: boxplots of each feature column after outlier removal]

The data is much cleaner now. There are still some isolated points in the boxplots, but not all of them are outliers; for example, insulina values higher than 400 are acceptable in people with diabetes.

Checking the balance of the dataset

Let's check the distribution of examples for each label:

data_cleaned.classe.value_counts().plot(kind='bar');

[Figure: bar plot of the number of examples per class]

data_cleaned.classe.value_counts(normalize=True)
0    0.676083
1    0.323917
Name: classe, dtype: float64

From the figure above, we see that most of our examples are of people who do not have the disease. More specifically, about 68% of the examples correspond to healthy people.

As the dataset is imbalanced, let's use a method to reduce the imbalance between the classes.

I use the SMOTE over-sampling method. SMOTE (Synthetic Minority Oversampling Technique) synthesizes new elements for the minority class based on those that already exist. It works by randomly picking a point from the minority class, computing its k-nearest neighbors, and adding synthetic points along the segments between the chosen point and its neighbors.

from imblearn.over_sampling import SMOTE

# Select the columns with features
features = ['num_gestacoes', 'glicose', 'pressao_sanguinea', 'grossura_pele', 'insulina', 'bmi', 'indice_historico', 'idade']
X = data_cleaned[features]
# Select the columns with labels
Y = data_cleaned['classe']

smote = SMOTE(sampling_strategy=1.0, k_neighbors=4)
# Note: in recent imbalanced-learn versions, fit_sample() has been renamed to fit_resample()
X_sm, y_sm = smote.fit_sample(X, Y)

print(X_sm.shape[0] - X.shape[0], 'new synthetic points')
data_cleaned_oversampled = pd.DataFrame(X_sm, columns=data.columns[1:-1])
data_cleaned_oversampled['classe'] = y_sm
data_cleaned_oversampled['id'] = range(1,len(y_sm)+1)

for c in ['num_gestacoes', 'glicose', 'pressao_sanguinea', 'grossura_pele', 'insulina', 'idade']:
    data_cleaned_oversampled[c] = data_cleaned_oversampled[c].apply(lambda x: int(x))

data_cleaned_oversampled.classe.value_counts().plot(kind='bar');
187 new synthetic points

[Figure: bar plot of the number of examples per class after SMOTE over-sampling]

Training the Model

Now that the data is cleaned, let's use a machine learning model to predict whether or not a person has diabetes.

The metric used in this competition is the accuracy score.

Using Decision Tree

from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.metrics import confusion_matrix
from sklearn import tree

# Select the columns with features
features = ['num_gestacoes', 'glicose', 'pressao_sanguinea', 'grossura_pele', 'insulina', 'bmi', 'indice_historico', 'idade']
X = data_cleaned_oversampled[features]
# Select the columns with labels
Y = data_cleaned_oversampled['classe']

# Perform the training and testing 100 times with different seeds and compute the mean accuracy.
# Store the accuracy of each run
accuracies = []
for i in range(100):
    # Splitting the dataset into train and test
    X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.3, random_state=i)

    # Create and train the model
    clf = tree.DecisionTreeClassifier(criterion='entropy', random_state=i, max_depth=4)
    clf.fit(X_train, y_train)

    # Performing predictions on the test dataset
    y_pred = clf.predict(X_test)
    # Computing accuracy
    accuracies.append(accuracy_score(y_test, y_pred)*100)

print('Accuracy is ', np.mean(accuracies))
Accuracy is  73.35648148148148
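
As a side note (not part of the submission pipeline), we can inspect how much importance the last trained tree gives to each feature through scikit-learn's feature_importances_ attribute; the snippet below assumes clf still holds the model from the final iteration of the loop above.

# Optional: feature importances from the last trained Decision Tree
importances = pd.Series(clf.feature_importances_, index=features).sort_values(ascending=False)
print(importances)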

Using Logistic Regression (LR)

from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix

# Select the columns with features
features = ['num_gestacoes', 'glicose', 'pressao_sanguinea', 'grossura_pele', 'insulina', 'bmi', 'indice_historico', 'idade']

# For LR, the over-sampled dataset decreased the model accuracy, so I chose not to use it.
X = data_cleaned[features]
# Select the columns with labels
Y = data_cleaned['classe']

# Perform the training and testing 100 times with different seeds and compute the mean accuracy.
# Store the accuracy of each run
accuracies = []
for i in range(100):
    # Splitting the data
    X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.3, random_state=i)
    LR_model = LogisticRegression(class_weight={1: 1.15})
    LR_model.fit(X_train, y_train)

    # Testing
    y_pred = LR_model.predict(X_test)
    accuracies.append(accuracy_score(y_test, y_pred)*100)

    # Print the report only for the last split (disabled here)
    if i == 99:
        pass
        #print(classification_report(y_test, y_pred))
        #print(confusion_matrix(y_true=y_test, y_pred=y_pred))

print('Accuracy is ', np.mean(accuracies))
Accuracy is  76.725

Feature selection

The large number of zero values in the grossura_pele and insulina columns may be impairing the performance of the model. Since insulin is an important parameter in the evaluation of diabetes, let's replace the zero values with the mean of the non-zero values for both columns.

As Logistic Regression (LR) performs better than the Decision Tree, I'll use only LR from now on.

# Replacing the zeros by the mean of the non-zero values
data_cleaned_no_zeros = data_cleaned.copy()

for c in ['grossura_pele', 'insulina']:
    feature_avg = data_cleaned_no_zeros[data_cleaned_no_zeros[c] > 0][c].mean()
    data_cleaned_no_zeros[c] = np.where(data_cleaned_no_zeros[c] != 0, data_cleaned_no_zeros[c], feature_avg)

# Select the columns with features
features = ['num_gestacoes', 'glicose', 'pressao_sanguinea', 'grossura_pele', 'insulina', 'bmi', 'indice_historico', 'idade']
X = data_cleaned_no_zeros[features]
# Select the columns with labels
Y = data_cleaned_no_zeros['classe']

# Perform the training and testing 100 times with different seeds and compute the mean accuracy score.
# Store the accuracy of each run
accuracies = []
for i in range(100):
    # Splitting the data
    X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.3, random_state=i)
    LR_model = LogisticRegression(class_weight={1: 1.1})
    LR_model.fit(X_train, y_train)

    # Testing
    y_pred = LR_model.predict(X_test)
    accuracies.append(accuracy_score(y_test, y_pred)*100)

    # Print the report only for the last split
    if i == 99:
        print(classification_report(y_test, y_pred))
        print(confusion_matrix(y_true=y_test, y_pred=y_pred))

print('Accuracy is ', np.mean(accuracies))
              precision    recall  f1-score   support

           0       0.83      0.91      0.87       110
           1       0.75      0.60      0.67        50

   micro avg       0.81      0.81      0.81       160
   macro avg       0.79      0.75      0.77       160
weighted avg       0.81      0.81      0.81       160

[[100  10]
 [ 20  30]]
Accuracy is  76.5625

The performance decreased when the zero values were replaced by the mean in the grossura_pele and insulina columns, so this replacement is not used in the final model.

Performing the Final Prediction

For the final model, I chose to use LR with the cleaned dataset.

# Create and train the model with all data
model=LogisticRegression(class_weight={1:1.1})
model.fit(data_cleaned[features],data_cleaned['classe'])

# Get the kaggle test data
X_test = test_data[features]
# Make the prediction 
prediction = model.predict(X_test)

# Add the predictions to the dataframe 
test_data['classe'] = prediction

# Create the submission file
test_data.loc[:,['id', 'classe']].to_csv('submission.csv', encoding='utf-8', index=False)

Source code

The solution is also available on GitHub: https://github.com/cpatrickalves/kaggle-diabetes-prediction

How to use

  • You will need Python 3.5+ to run the code.
  • Python can be downloaded from python.org.
  • You have to install some Python packages; in a command prompt/terminal, run: pip install jupyterlab scikit-learn pandas seaborn matplotlib scipy imbalanced-learn
  • Once you have installed the required packages, just clone/download this project: git clone https://github.com/cpatrickalves/kaggle-diabetes-prediction

  • Access the project folder in command prompt/Terminal and run the following command: jupyter-lab

  • Then open the kernel file.