Naive Bayes is a simple probabilistic classifier based on Bayes' theorem. It can be used as an alternative to logistic regression (binary or multinomial). It assumes strong (conditional) independence among the predictors given the class. It is particularly suited when the dimensionality of the inputs is high. Despite its simplicity, Naive Bayes can often outperform more sophisticated classification methods.
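For reference, with class label y and predictors x_1, ..., x_p, the conditional-independence assumption reduces Bayes' theorem to

\[
P(y \mid x_1, \ldots, x_p) \;\propto\; P(y)\prod_{i=1}^{p} P(x_i \mid y),
\qquad
\hat{y} = \arg\max_{y} \, P(y)\prod_{i=1}^{p} P(x_i \mid y),
\]

so only the class priors P(y) and the per-predictor class-conditional distributions P(x_i | y) need to be estimated from the data.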
Data Description: A bank possesses demographic and transactional data of its loan customers. The objective is to predict whether a customer applying for a loan will be a defaulter or not. Sample size is 700. Independent variables: Age group, Years at current address, Years at current employer, Debt-to-income ratio, Credit card debt, Other debts. Dependent variable: Defaulter (=1 if defaulter, 0 otherwise). The status is observed after the loan is disbursed.
Importing and Readying the Data for Modeling; Model Fitting:
bankloan<-read.csv("BANK LOAN.csv",header=T)
head(bankloan)
## SN AGE EMPLOY ADDRESS DEBTINC CREDDEBT OTHDEBT DEFAULTER
## 1 1 3 17 12 9.3 11.36 5.01 1
## 2 2 1 10 6 17.3 1.36 4.00 0
## 3 3 2 15 14 5.5 0.86 2.17 0
## 4 4 3 15 14 2.9 2.66 0.82 0
## 5 5 1 2 0 17.3 1.79 3.06 1
## 6 6 3 5 5 10.2 0.39 2.16 0
str(bankloan)
## 'data.frame': 700 obs. of 8 variables:
## $ SN : int 1 2 3 4 5 6 7 8 9 10 ...
## $ AGE : int 3 1 2 3 1 3 2 3 1 2 ...
## $ EMPLOY : int 17 10 15 15 2 5 20 12 3 0 ...
## $ ADDRESS : int 12 6 14 14 0 5 9 11 4 13 ...
## $ DEBTINC : num 9.3 17.3 5.5 2.9 17.3 10.2 30.6 3.6 24.4 19.7 ...
## $ CREDDEBT : num 11.36 1.36 0.86 2.66 1.79 ...
## $ OTHDEBT : num 5.01 4 2.17 0.82 3.06 ...
## $ DEFAULTER: int 1 0 0 0 1 0 0 0 1 0 ...
bankloan$AGE<-factor(bankloan$AGE)
install.packages("e1071")
library(e1071)
## Warning: package 'e1071' was built under R version 3.6.2
riskmodel<-naiveBayes(DEFAULTER~EMPLOY+ADDRESS+DEBTINC+CREDDEBT,
data=bankloan)
riskmodel
##
## Naive Bayes Classifier for Discrete Predictors
##
## Call:
## naiveBayes.default(x = X, y = Y, laplace = laplace)
##
## A-priori probabilities:
## Y
## 0 1
## 0.7385714 0.2614286
##
## Conditional probabilities:
## EMPLOY
## Y [,1] [,2]
## 0 9.508704 6.663741
## 1 5.224044 5.542946
##
## ADDRESS
## Y [,1] [,2]
## 0 8.945841 7.000621
## 1 6.393443 5.925208
##
## DEBTINC
## Y [,1] [,2]
## 0 8.679304 5.615197
## 1 14.727869 7.902798
##
## CREDDEBT
## Y [,1] [,2]
## 0 1.245397 1.422238
## 1 2.423770 3.232645
Interpretation: The a-priori probabilities are simply the class proportions in the data: 73.86% non-defaulters and 26.14% defaulters. For each numeric predictor, the conditional probability table reports the class-conditional mean ([,1]) and standard deviation ([,2]), since naiveBayes() assumes a Gaussian distribution for numeric predictors. For example, defaulters have fewer average years with their current employer (5.22 vs. 9.51) and a higher average debt-to-income ratio (14.73 vs. 8.68) than non-defaulters.
Note: In R, both continuous and categorical variables can be used in the naiveBayes() function.
Naïve Bayes methods differ in the assumptions made about the distribution of the independent variables. Python's sklearn offers several such methods; the two explored here are Gaussian Naïve Bayes and Multinomial Naïve Bayes.
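Gaussian Naïve Bayes, for instance, assumes each continuous predictor is normally distributed within each class, so for every predictor-class pair only a mean and a variance are estimated:

\[
P(x_i \mid y) = \frac{1}{\sqrt{2\pi\sigma_{iy}^{2}}}\,
\exp\!\left(-\frac{(x_i-\mu_{iy})^{2}}{2\sigma_{iy}^{2}}\right).
\]

These are exactly the class-conditional means and standard deviations printed in the R output above. Multinomial Naïve Bayes instead models counts or frequencies of discrete features.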
Case 1: Continuous data. Here the same BANK LOAN data is used.
Import necessary libraries:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.naive_bayes import GaussianNB, MultinomialNB
from sklearn.metrics import confusion_matrix, f1_score, precision_score, recall_score, accuracy_score, roc_curve, roc_auc_score
Import data and split the data into train and test for modelling:
bankloan = pd.read_csv("BANK LOAN.csv")
bankloan1 = bankloan.drop(['SN','AGE'], axis = 1)
X = bankloan1.loc[:,bankloan1.columns != 'DEFAULTER']
y = bankloan1.loc[:, 'DEFAULTER']
X_train, X_test, y_train, y_test = train_test_split(X, y,
test_size=0.30,
random_state = 999)
Note: Here, the categorical variable (AGE) is removed from the data because GaussianNB assumes all predictors are continuous; unlike R's naiveBayes(), a single scikit-learn model cannot handle mixed (continuous and categorical) data.
NBmodel = GaussianNB()
NBmodel.fit(X_train, y_train)
## GaussianNB(priors=None, var_smoothing=1e-09)
Interpretation: The model is fitted with priors=None, so the class priors are estimated from the class proportions in the training data, and a per-class mean and variance are estimated for each predictor. The var_smoothing parameter adds a small portion of the largest predictor variance to all variances for numerical stability.
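The fitted model can then be assessed on the held-out test set with the metrics imported earlier. A minimal sketch (all names as defined above; the exact scores depend on the random split):
y_pred = NBmodel.predict(X_test)              # predicted class labels (0/1)
y_prob = NBmodel.predict_proba(X_test)[:, 1]  # predicted probability of default
print(confusion_matrix(y_test, y_pred))
print("Accuracy :", accuracy_score(y_test, y_pred))
print("Precision:", precision_score(y_test, y_pred))
print("Recall   :", recall_score(y_test, y_pred))
print("AUC      :", roc_auc_score(y_test, y_prob))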
Case 2: Categorical data
Data description:
A company has a comprehensive database of its past and present workforce, with information on their demographics, education, experience and hiring background, as well as their work profile. The management wishes to see if this data can be used for predictive analysis, to control attrition levels. Objective: To develop an employee churn model, here via Naive Bayes. Sample size is 83. Independent variables: Gender, Experience level (<3, 3-5 and >5 years), Function (Marketing, Finance, Client Servicing (CS)) and Source (internal or external). Dependent variable: Status (=1 if the employee left within 18 months from the joining date).
Import data and split the data into train and test for modelling:
empdata = pd.read_csv("EMPLOYEE CHURN DATA.csv")
empdata1 = empdata.loc[:, empdata.columns != 'sn']
empdata1.head()
## status function exp gender source
## 0 1 CS <3 M external
## 1 1 CS <3 M external
## 2 1 CS >=3 and <=5 M internal
## 3 1 CS >=3 and <=5 F internal
## 4 1 CS <3 M internal
empdata2 = pd.get_dummies(empdata1)
empdata2.head()
## status function_CS ... source_external source_internal
## 0 1 1 ... 1 0
## 1 1 1 ... 1 0
## 2 1 1 ... 0 1
## 3 1 1 ... 0 1
## 4 1 1 ... 0 1
##
## [5 rows x 11 columns]
X_emp = empdata2.loc[:,empdata2.columns != 'status']
y_emp = empdata2.loc[:, 'status']
MNBmodel = MultinomialNB(alpha = 0)
MNBmodel.fit(X_emp, y_emp)
## MultinomialNB(alpha=0, class_prior=None, fit_prior=True)
##
## C:\Users\SANKHYA\ANACON~1\lib\site-packages\sklearn\naive_bayes.py:508: UserWarning: alpha too small will result in numeric errors, setting alpha = 1.0e-10
## 'setting alpha = %.1e' % _ALPHA_MIN)
Interpretation: alpha is the additive (Laplace/Lidstone) smoothing parameter of MultinomialNB. Setting alpha = 0 disables smoothing, so scikit-learn warns that this can cause numeric errors (zero probabilities for feature-class combinations unseen in training) and internally clips the value to 1.0e-10.
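As with the Gaussian model, predictions can be read off the fitted object. A minimal sketch, scored on the training data itself since no train/test split was made for this small sample:
y_emp_pred = MNBmodel.predict(X_emp)   # predicted churn status (0/1)
print(confusion_matrix(y_emp, y_emp_pred))
print("Accuracy:", accuracy_score(y_emp, y_emp_pred))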
Python's Scikit-learn library has separate Naive Bayes models for specific use cases: GaussianNB for continuous predictors, MultinomialNB for count data, BernoulliNB for binary features and, in recent versions, CategoricalNB for categorical features. R has a single function that handles all types of data (continuous, categorical and mixed), whereas in Python the model must be chosen to match the data type.
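For illustration, the employee churn data could also be modelled directly with CategoricalNB instead of dummy coding plus MultinomialNB. A sketch, assuming scikit-learn >= 0.22 (the version that introduced CategoricalNB), with empdata1 as defined above and CNBmodel a hypothetical name:
from sklearn.preprocessing import OrdinalEncoder
from sklearn.naive_bayes import CategoricalNB

# CategoricalNB expects integer-coded categories rather than dummy columns
enc = OrdinalEncoder()
X_cat = enc.fit_transform(empdata1.loc[:, empdata1.columns != 'status'])

CNBmodel = CategoricalNB(alpha=1)   # alpha=1 applies Laplace smoothing
CNBmodel.fit(X_cat, empdata1['status'])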