Naive Bayes is a simple probabilistic classifier based on Bayes' theorem. It can be used as an alternative to logistic regression (binary or multinomial). It assumes strong (conditional) independence among the predictors given the class. It is particularly suited when the dimensionality of the inputs is high. Despite its simplicity, Naive Bayes can often outperform more sophisticated classification methods.
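For reference, with class label y and predictors x_1, ..., x_p, the conditional-independence assumption reduces Bayes' theorem to

\[
P(y \mid x_1, \ldots, x_p) \;\propto\; P(y)\prod_{i=1}^{p} P(x_i \mid y),
\qquad
\hat{y} = \arg\max_{y} \, P(y)\prod_{i=1}^{p} P(x_i \mid y),
\]

so only the class priors P(y) and the per-predictor class-conditional distributions P(x_i | y) need to be estimated from the data.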
Data Description: A bank possesses demographic and transactional data of its loan customers. The objective is to predict whether a customer applying for a loan will be a defaulter or not. Sample size is 700. Independent variables: Age group, Years at current address, Years at current employer, Debt-to-income ratio, Credit card debt, Other debts. Dependent variable: Defaulter (=1 if defaulter, 0 otherwise). The status is observed after the loan is disbursed.
Importing and Readying the Data for Modeling; Model Fitting:
bankloan<-read.csv("BANK LOAN.csv",header=T)
head(bankloan)
## SN AGE EMPLOY ADDRESS DEBTINC CREDDEBT OTHDEBT DEFAULTER
## 1 1 3 17 12 9.3 11.36 5.01 1
## 2 2 1 10 6 17.3 1.36 4.00 0
## 3 3 2 15 14 5.5 0.86 2.17 0
## 4 4 3 15 14 2.9 2.66 0.82 0
## 5 5 1 2 0 17.3 1.79 3.06 1
## 6 6 3 5 5 10.2 0.39 2.16 0
str(bankloan)
## 'data.frame': 700 obs. of 8 variables:
## $ SN : int 1 2 3 4 5 6 7 8 9 10 ...
## $ AGE : int 3 1 2 3 1 3 2 3 1 2 ...
## $ EMPLOY : int 17 10 15 15 2 5 20 12 3 0 ...
## $ ADDRESS : int 12 6 14 14 0 5 9 11 4 13 ...
## $ DEBTINC : num 9.3 17.3 5.5 2.9 17.3 10.2 30.6 3.6 24.4 19.7 ...
## $ CREDDEBT : num 11.36 1.36 0.86 2.66 1.79 ...
## $ OTHDEBT : num 5.01 4 2.17 0.82 3.06 ...
## $ DEFAULTER: int 1 0 0 0 1 0 0 0 1 0 ...
bankloan$AGE<-factor(bankloan$AGE)
install.packages("e1071")
library(e1071)
## Warning: package 'e1071' was built under R version 3.6.2
riskmodel<-naiveBayes(DEFAULTER~EMPLOY+ADDRESS+DEBTINC+CREDDEBT,
data=bankloan)
riskmodel
##
## Naive Bayes Classifier for Discrete Predictors
##
## Call:
## naiveBayes.default(x = X, y = Y, laplace = laplace)
##
## A-priori probabilities:
## Y
## 0 1
## 0.7385714 0.2614286
##
## Conditional probabilities:
## EMPLOY
## Y [,1] [,2]
## 0 9.508704 6.663741
## 1 5.224044 5.542946
##
## ADDRESS
## Y [,1] [,2]
## 0 8.945841 7.000621
## 1 6.393443 5.925208
##
## DEBTINC
## Y [,1] [,2]
## 0 8.679304 5.615197
## 1 14.727869 7.902798
##
## CREDDEBT
## Y [,1] [,2]
## 0 1.245397 1.422238
## 1 2.423770 3.232645
Interpretation: The a-priori probabilities are simply the class proportions in the data: 73.86% non-defaulters and 26.14% defaulters. For each numeric predictor, the conditional probability table reports the class-conditional mean ([,1]) and standard deviation ([,2]), since naiveBayes() assumes a Gaussian distribution for numeric predictors. For example, defaulters have fewer average years with their current employer (5.22 vs. 9.51) and a higher average debt-to-income ratio (14.73 vs. 8.68) than non-defaulters.
Note: In R, both continuous and categorical variables can be used in the naiveBayes() function.
Naïve Bayes methods differ in the assumptions made about the distribution of the independent variables. Python's sklearn offers several such methods; the two explored here are Gaussian Naïve Bayes and Multinomial Naïve Bayes.
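Gaussian Naïve Bayes, for instance, assumes each continuous predictor is normally distributed within each class, so for every predictor-class pair only a mean and a variance are estimated:

\[
P(x_i \mid y) = \frac{1}{\sqrt{2\pi\sigma_{iy}^{2}}}\,
\exp\!\left(-\frac{(x_i-\mu_{iy})^{2}}{2\sigma_{iy}^{2}}\right).
\]

These are exactly the class-conditional means and standard deviations printed in the R output above. Multinomial Naïve Bayes instead models counts or frequencies of discrete features.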
Case 1: Continuous data. Here the same BANK LOAN data is used.
Import necessary libraries:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.naive_bayes import GaussianNB, MultinomialNB
from sklearn.metrics import confusion_matrix, f1_score, precision_score, recall_score, accuracy_score, roc_curve, roc_auc_score
Import data and split the data into train and test for modelling:
bankloan = pd.read_csv("BANK LOAN.csv")
bankloan1 = bankloan.drop(['SN','AGE'], axis = 1)
X = bankloan1.loc[:,bankloan1.columns != 'DEFAULTER']
y = bankloan1.loc[:, 'DEFAULTER']
X_train, X_test, y_train, y_test = train_test_split(X, y,
test_size=0.30,
random_state = 999)
Note: Here, the categorical variable (AGE) is removed from the data because GaussianNB assumes all predictors are continuous; unlike R's naiveBayes(), a single scikit-learn model cannot handle mixed (continuous and categorical) data.
NBmodel = GaussianNB()
NBmodel.fit(X_train, y_train)
## GaussianNB(priors=None, var_smoothing=1e-09)
Interpretation: The model is fitted with priors=None, so the class priors are estimated from the class proportions in the training data, and a per-class mean and variance are estimated for each predictor. The var_smoothing parameter adds a small portion of the largest predictor variance to all variances for numerical stability.
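The fitted model can then be assessed on the held-out test set with the metrics imported earlier. A minimal sketch (all names as defined above; the exact scores depend on the random split):
y_pred = NBmodel.predict(X_test)              # predicted class labels (0/1)
y_prob = NBmodel.predict_proba(X_test)[:, 1]  # predicted probability of default
print(confusion_matrix(y_test, y_pred))
print("Accuracy :", accuracy_score(y_test, y_pred))
print("Precision:", precision_score(y_test, y_pred))
print("Recall   :", recall_score(y_test, y_pred))
print("AUC      :", roc_auc_score(y_test, y_prob))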
Case 2: Categorical data
Data description:
A company has a comprehensive database of its past and present workforce, with information on their demographics, education, experience and hiring background, as well as their work profile. The management wishes to see if this data can be used for predictive analysis, to control attrition levels. Objective: To develop an employee churn model, here via Naive Bayes. Sample size is 83. Independent variables: Gender, Experience level (<3, 3-5 and >5 years), Function (Marketing, Finance, Client Servicing (CS)) and Source (internal or external). Dependent variable: Status (=1 if the employee left within 18 months from the joining date).
Import data and split the data into train and test for modelling:
empdata = pd.read_csv("EMPLOYEE CHURN DATA.csv")
empdata1 = empdata.loc[:, empdata.columns != 'sn']
empdata1.head()
## status function exp gender source
## 0 1 CS <3 M external
## 1 1 CS <3 M external
## 2 1 CS >=3 and <=5 M internal
## 3 1 CS >=3 and <=5 F internal
## 4 1 CS <3 M internal
empdata2 = pd.get_dummies(empdata1)
empdata2.head()
## status function_CS ... source_external source_internal
## 0 1 1 ... 1 0
## 1 1 1 ... 1 0
## 2 1 1 ... 0 1
## 3 1 1 ... 0 1
## 4 1 1 ... 0 1
##
## [5 rows x 11 columns]
X_emp = empdata2.loc[:,empdata2.columns != 'status']
y_emp = empdata2.loc[:, 'status']
MNBmodel = MultinomialNB(alpha = 0)
MNBmodel.fit(X_emp, y_emp)
## MultinomialNB(alpha=0, class_prior=None, fit_prior=True)
##
## C:\Users\SANKHYA\ANACON~1\lib\site-packages\sklearn\naive_bayes.py:508: UserWarning: alpha too small will result in numeric errors, setting alpha = 1.0e-10
## 'setting alpha = %.1e' % _ALPHA_MIN)
Interpretation: alpha is the additive (Laplace/Lidstone) smoothing parameter of MultinomialNB. Setting alpha = 0 disables smoothing, so scikit-learn warns that this can cause numeric errors (zero probabilities for feature-class combinations unseen in training) and internally clips the value to 1.0e-10.
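As with the Gaussian model, predictions can be read off the fitted object. A minimal sketch, scored on the training data itself since no train/test split was made for this small sample:
y_emp_pred = MNBmodel.predict(X_emp)   # predicted churn status (0/1)
print(confusion_matrix(y_emp, y_emp_pred))
print("Accuracy:", accuracy_score(y_emp, y_emp_pred))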
Python's Scikit-learn library has separate Naive Bayes models for specific use cases: GaussianNB for continuous predictors, MultinomialNB for count data, BernoulliNB for binary features and, in recent versions, CategoricalNB for categorical features. R has a single function that handles all types of data (continuous, categorical and mixed), whereas in Python the model must be chosen to match the data type.
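For illustration, the employee churn data could also be modelled directly with CategoricalNB instead of dummy coding plus MultinomialNB. A sketch, assuming scikit-learn >= 0.22 (the version that introduced CategoricalNB), with empdata1 as defined above and CNBmodel a hypothetical name:
from sklearn.preprocessing import OrdinalEncoder
from sklearn.naive_bayes import CategoricalNB

# CategoricalNB expects integer-coded categories rather than dummy columns
enc = OrdinalEncoder()
X_cat = enc.fit_transform(empdata1.loc[:, empdata1.columns != 'status'])

CNBmodel = CategoricalNB(alpha=1)   # alpha=1 applies Laplace smoothing
CNBmodel.fit(X_cat, empdata1['status'])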