Naive Bayes classifier

Simply, bayesian theory is a process of reviewing other possibilities. For example, suppose we want to go to the outside. In this scenario, If we see the dark clouds in the sky after the first step into the outside, we “probably” go back and take the umbrella after reviewing for the probability of rain. So in this case, the previously assigned low probability for rain situation has been revised and a much higher probability value has been assigned to the rain probability with the appearance of clouds. In conclusion, we went to take the umbrella.

Naive Bayes Classifier is a method for the making classification in Bayes Theorem. It basically calculate the all probabilities and it classifies by the high probability values. Suppose we have a data set which called X; and we don’t know the X’s classification. Suppose X={X1,X2,X3,…Xn} and suppose this data has number of class: C1,C2….Cm . Now the mathematical formula for the determining of classification is

\[P(C/X)= P(X/Ci).P(Ci)/P(X) \]

Now let’s show this classifier as the code. First, I would like to create a data set where included nothing but random values. Suppose we are trying to build a data set of environmental conditions that can determine whether an animal will survive or not. In this case, our “live” variable which has included “yes” or “no” values will be our dependent variable all other variables will be eventually independent variables. You can change those variable names or given values it depends on your imagination. For now, let’s continue.

#for weather
df<-data.frame(
  
  Summary=sample(c("sunny","rainy","other"),size = 1000,replace = T)
)

# Let's make our data realistic.

for (i in which(df$Summary=="sunny")) {

    df$Temperature[i]<-"hot"

}

for (i in which(df$Summary=="rainy")) {

    df$Temperature[i]<-"cold"

}

for (i in which(df$Summary=="other")) {

    df$Temperature[i]<-"mild"

}
#other independent variables
df$wind<-sample(c("strong","weak"),size=1000,replace = T)
df$predators<-sample(c("high","low"),size=1000,replace = T)
#dependent variable
df$live<-sample(c("yes","no"),size=1000,replace = T)

We need those libraries for the further analysis

library(caret)
## Loading required package: lattice
## Loading required package: ggplot2
library(e1071)
library(caTools)

We could use those functions in order to split our data set for test and train data set.

sample<-sample.split(df,SplitRatio = 0.75)
train<-subset(df,sample==TRUE)
test<-subset(df,sample==FALSE)

Note: We could have done this task without using any function.But note that, your dataset must be randomized and not arranged or sorted by any variable.

After this, now we can build our model. We will use e1071 library for this task. I assigned the “live” value as the dependent variable and all other columns will be the independent variable.

model<-naiveBayes(live~.,data=train)
model
## 
## Naive Bayes Classifier for Discrete Predictors
## 
## Call:
## naiveBayes.default(x = X, y = Y, laplace = laplace)
## 
## A-priori probabilities:
## Y
##        no       yes 
## 0.5333333 0.4666667 
## 
## Conditional probabilities:
##      Summary
## Y         other     rainy     sunny
##   no  0.3125000 0.3281250 0.3593750
##   yes 0.3142857 0.3500000 0.3357143
## 
##      Temperature
## Y          cold       hot      mild
##   no  0.3281250 0.3593750 0.3125000
##   yes 0.3500000 0.3357143 0.3142857
## 
##      wind
## Y        strong      weak
##   no  0.4968750 0.5031250
##   yes 0.5071429 0.4928571
## 
##      predators
## Y          high       low
##   no  0.5125000 0.4875000
##   yes 0.5107143 0.4892857

Above, each feature or variable’s conditional probability is calculated individually by the model. The apriori probabilities are also produced, which show how our data is distributed. As we can see, the apriori probability of the animal’s survival is %51. And another section of our model is “Conditional probabilities” which give us probabilities of animal surviving by depending on other variables; for example, if there is a the high density of predators our animal’s chance of living is %50.

Now we can use our model and data for the prediction. In this way, we can test our model’ prediction accuracy.

pred<-predict(model,test)
table(pred)
## pred
##  no yes 
## 328  72
table(test$live)
## 
##  no yes 
## 205 195

In above, It is seen that 256 records in the prediction, are estimated as “Yes” and 144 as “No”. On the other hand, in test data, 196 records are in the “yes” and 204 records are in the “no” class.

#confusion matrix
table(test$live,pred)
##      pred
##        no yes
##   no  164  41
##   yes 164  31

Now we can calculate our model’s accuracy. The result has shown that our model prediction accuracy is %50 in other words, our model classification percent is 50.

acc<-((75+127)/(75+129+127+69))
acc
## [1] 0.505

Model evaluation with function:

Below, we can see the accuracy which we already calculated in traditional ways. In addition, there is a Kappa Value which returns the ratio of similarity between predicted and actual values. We want high values near to 1 as much as possible but since we generate our data in random conditions, the low value of kappa not surprising.

confusionMatrix(as.factor(test$live),pred)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  no yes
##        no  164  41
##        yes 164  31
##                                           
##                Accuracy : 0.4875          
##                  95% CI : (0.4375, 0.5377)
##     No Information Rate : 0.82            
##     P-Value [Acc > NIR] : 1               
##                                           
##                   Kappa : -0.0417         
##                                           
##  Mcnemar's Test P-Value : <2e-16          
##                                           
##             Sensitivity : 0.5000          
##             Specificity : 0.4306          
##          Pos Pred Value : 0.8000          
##          Neg Pred Value : 0.1590          
##              Prevalence : 0.8200          
##          Detection Rate : 0.4100          
##    Detection Prevalence : 0.5125          
##       Balanced Accuracy : 0.4653          
##                                           
##        'Positive' Class : no              
##