Research Express@NCKU - Articles Digest (Volume 19 Issue 7)

Alternative Prior Assumptions for Improving the Performance of Naïve Bayesian Classifiers
Tzu-Tsung Wong
Associate Professor, Institute of Information Management, College of Management, National Cheng Kung University
tzutsung@mail.ncku.edu.tw
Data Mining and Knowledge Discovery, Vol. 18, No. 2, 183-213, 2009

Naïve Bayesian classifiers are a widely used classification technique because of their high computational efficiency. From the viewpoint of classification accuracy, the naïve Bayesian classifier also works well and sometimes outperforms other classification tools. Its operation rests on two essential assumptions. The first is that all attributes are independent given the class value; this is called the conditional independence assumption. Past studies have shown that this assumption has little impact on the accuracy of the naïve Bayesian classifier, because accuracy is evaluated with the 0-1 loss function; i.e., the loss for a correct prediction is 0, and 1 otherwise. The second assumption is that, given the class value, the random vector for the probabilities of the possible values of an attribute follows a Dirichlet distribution; this is called the Dirichlet assumption. This assumption can increase the classification accuracy of the naïve Bayesian classifier without slowing down its computation. However, in a Dirichlet random vector, every pair of variables must be negatively correlated, and all variables must have the same normalized variance. These two properties of the Dirichlet distribution are called the negative-correlation and the equal-confidence requirements, respectively. This study investigates the impact of the Dirichlet assumption on the performance of the naïve Bayesian classifier.
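The conditional independence assumption can be illustrated with a minimal numerical sketch (all class names, attribute values, and probabilities below are hypothetical, not taken from the paper): the likelihood of an instance factors into a product of per-attribute conditional probabilities.

```python
# A minimal sketch of the conditional independence assumption
# (all numbers are hypothetical): the joint likelihood of an
# instance factors into per-attribute conditional probabilities.
priors = {"c1": 0.6, "c2": 0.4}           # P(class)
cond = {                                  # P(attribute value | class)
    "c1": {"a1": 0.7, "b1": 0.2},
    "c2": {"a1": 0.3, "b1": 0.8},
}

def posterior(instance):
    """Normalized class probabilities for one instance of attribute values."""
    scores = dict(priors)
    for c in scores:
        for v in instance:
            scores[c] *= cond[c][v]       # independence: multiply conditionals
    total = sum(scores.values())
    return {c: s / total for c, s in scores.items()}

probs = posterior(["a1", "b1"])
predicted = max(probs, key=probs.get)     # class with the largest probability
```

Here the prior favors c1, but the attribute conditionals tip the prediction to c2, showing how the factored likelihood and the class prior combine under the assumption.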
For any given new instance, the naïve Bayesian classifier calculates the probability of every possible class value under the conditional independence and Dirichlet assumptions; the class value with the largest probability becomes the predicted class of the new instance. According to the Dirichlet assumption, the random vector corresponding to an attribute for any given class value follows a Dirichlet distribution. The available data are used to update the Dirichlet distributions for all attributes, and the estimates needed for calculating the classification probabilities are derived from the updated distributions. The Dirichlet assumption is therefore critical to the performance of the naïve Bayesian classifier.

Compositional data are nonnegative, and their sum cannot be larger than one. Market shares and probabilities are two examples of compositional data, and the methods for processing such data are collectively called compositional data analysis. The Dirichlet distribution is the most popular prior for Bayesian analysis of compositional data. Its advantages are that the computation of its moments is simple, the order of the variables is arbitrary, and it is conjugate to multinomial sampling. To investigate the impact of the Dirichlet assumption on the naïve Bayesian classifier, we adopted two multivariate distributions that generalize the Dirichlet distribution in some respects: the generalized Dirichlet and the Liouville distributions. In a Liouville random vector, the variables are either all positively or all negatively correlated, so the Liouville distribution can relax the negative-correlation requirement. The generalized Dirichlet distribution, by contrast, can relax both the negative-correlation and the equal-confidence requirements.
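The conjugate update that produces the probability estimates can be sketched as follows (the parameter and count values are illustrative, not from the paper): with a Dirichlet(alpha_1, ..., alpha_k) prior and observed counts n_1, ..., n_k for the values of one attribute within one class, the posterior mean estimate of P(value_j | class) is (n_j + alpha_j) / (n + sum(alpha)).

```python
# Sketch of the Dirichlet-multinomial conjugate update that yields the
# probability estimates used by the classifier (values are illustrative).
def dirichlet_posterior_mean(counts, alphas):
    """Posterior mean of each value's probability after observing counts."""
    total = sum(counts) + sum(alphas)
    return [(n + a) / total for n, a in zip(counts, alphas)]

counts = [8, 1, 1]                        # hypothetical within-class counts
# A noninformative prior: all Dirichlet parameters equal (here 1.0 each).
est = dirichlet_posterior_mean(counts, [1.0, 1.0, 1.0])
```

Because the Dirichlet is conjugate to multinomial sampling, this update is a closed-form expression, which is the source of the classifier's computational efficiency.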
Since these two multivariate distributions are also conjugate to multinomial sampling, they are likewise appropriate priors for naïve Bayesian classifiers, and both include the Dirichlet distribution as a special case. However, the computational complexity for the moments of the Liouville and the generalized Dirichlet distributions is slightly higher.

A prior for naïve Bayesian classifiers generally possesses two characteristics. First, an appropriate prior should be noninformative; i.e., the prior probabilities of all possible values of an attribute are equal. This characteristic makes the setting of priors much easier in practice. Low confidence is the second characteristic, which implies that a prediction is primarily determined by the available data. We therefore designed 30 noninformative and unconfident priors for each of the generalized Dirichlet and the Liouville distributions. The correlations among the variables are gradually changed to cover both positive and negative values, and the normalized variances of the variables are controlled to exhibit different confidence levels. These settings for the 60 generalized Dirichlet and Liouville priors help us investigate the impact of the Dirichlet assumption. The 60 noninformative and unconfident priors were tested on 18 data sets chosen from an Internet database. In general, the generalized Dirichlet distribution has the best performance among the three distribution families, while the Liouville distribution has the worst. Since the generalized Dirichlet distribution can relax the two requirements of the Dirichlet distribution, it is a more flexible prior for naïve Bayesian classifiers. Since the variables in a random vector must be nonnegative and their sum cannot be larger than one, it is inappropriate to assume that all variables are positively correlated.
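The two Dirichlet requirements that the alternative priors relax can be checked directly from the Dirichlet's closed-form moments. A small sketch (with an arbitrary example parameter vector, not one from the study):

```python
# For Dirichlet(alpha) with a0 = sum(alpha), the closed-form moments give:
#   Cov(X_i, X_j) = -alpha_i * alpha_j / (a0**2 * (a0 + 1))  -> always negative
#   Var(X_i) / (E[X_i] * (1 - E[X_i])) = 1 / (a0 + 1)        -> same for all i
# i.e., the negative-correlation and equal-confidence requirements.
def dirichlet_moments(alpha):
    a0 = sum(alpha)
    mean = [a / a0 for a in alpha]
    var = [a * (a0 - a) / (a0 ** 2 * (a0 + 1)) for a in alpha]
    norm_var = [v / (m * (1.0 - m)) for v, m in zip(var, mean)]
    cov01 = -alpha[0] * alpha[1] / (a0 ** 2 * (a0 + 1))
    return mean, norm_var, cov01

mean, norm_var, cov01 = dirichlet_moments([2.0, 3.0, 5.0])  # arbitrary example
```

Whatever the parameters, the pairwise covariance is negative and every variable shares the same normalized variance 1/(a0 + 1); the generalized Dirichlet distribution is not bound by either constraint.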
That is why the Liouville distribution has the worst performance in this case. Since the priors are all unconfident, their resulting classification accuracies do not differ greatly.

To fully understand the impact of the Dirichlet assumption on the performance of naïve Bayesian classifiers, this study also proposes ways to set noninformative and confident Dirichlet, generalized Dirichlet, and Liouville priors. The confidence level is gradually increased to search for the parameter settings that result in the highest prediction accuracy. When the confidence level of a prior is too high, a prediction no longer depends mainly on the available data. Since a noninformative prior assigns the same probability to all possible values of an attribute, an overly confident prior overrides the information contained in the available data for classification. Thus, the classification accuracy decreases when the confidence level of a prior is too high, which implies that the search for the best parameter settings is tractable.

The noninformative and confident priors were also tested on the 18 data sets, and the best Dirichlet, generalized Dirichlet, and Liouville priors for each data set were found. According to the resulting classification accuracies, the best generalized Dirichlet priors outperform the best Dirichlet priors in 17 of the 18 data sets, and the two have the same performance in only one data set. The best generalized Dirichlet priors achieve higher accuracies than the best Liouville priors in 16 of the 18 data sets, and the same performance in the other two. This again demonstrates that the generalized Dirichlet distribution is the best among the three multivariate distributions. In three of the 18 data sets, the classification accuracies of the best generalized Dirichlet priors can be three percent higher than those of the best Dirichlet and the best Liouville priors.
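The search over confidence levels described above can be sketched as a grid over the equivalent sample size of a symmetric Dirichlet prior; the toy data set, the train/test split, and the candidate strengths below are all hypothetical, and this is only a simplified stand-in for the study's procedure.

```python
from collections import Counter, defaultdict

# Toy categorical training data (hypothetical): ([attribute values], class).
train = [(["a", "x"], 0), (["a", "x"], 0), (["a", "y"], 0),
         (["b", "x"], 0), (["b", "y"], 1), (["b", "y"], 1)]
test = [(["b", "y"], 1), (["a", "x"], 0)]

def train_nb(data, alpha, n_values=2):
    """Naive Bayes with a symmetric Dirichlet prior of parameter alpha."""
    class_counts = Counter(c for _, c in data)
    cond = defaultdict(Counter)           # (class, attribute index) -> counts
    for xs, c in data:
        for i, v in enumerate(xs):
            cond[(c, i)][v] += 1

    def predict(xs):
        best, best_score = None, -1.0
        for c, n_c in class_counts.items():
            score = n_c / len(data)
            for i, v in enumerate(xs):
                # posterior mean estimate under the symmetric Dirichlet prior
                score *= (cond[(c, i)][v] + alpha) / (n_c + alpha * n_values)
            if score > best_score:
                best, best_score = c, score
        return best

    return predict

def accuracy(alpha):
    predict = train_nb(train, alpha)
    return sum(predict(xs) == c for xs, c in test) / len(test)

# Gradually raise the prior's confidence (equivalent sample size); an overly
# confident prior swamps the data, so accuracy eventually drops.
best_alpha = max([0.1, 1.0, 10.0, 1000.0], key=accuracy)
```

On this toy data an unconfident prior classifies both test instances correctly, while a very confident one pushes every conditional estimate toward uniform and lets the class prior dominate, mirroring the degradation the study observes at high confidence levels.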
Only one of the 18 best Liouville priors allows the variables to be all positively correlated, and in that case the correlation coefficients are very close to zero. This implies that most realistic data sets do not support variables that are all positively correlated, and hence the Liouville distribution cannot greatly improve the performance of naïve Bayesian classifiers. However, 16 of the 18 best generalized Dirichlet priors allow some, but not all, variables to be positively correlated. This demonstrates that a multivariate distribution allowing some variables to be positively correlated should be a more appropriate prior for naïve Bayesian classifiers. In addition, the ratios of the maximal to the minimal normalized variance for the 18 best generalized Dirichlet priors lie between 4 and 32, which suggests that allowing the variables to have different normalized variances can benefit the prediction accuracy of the naïve Bayesian classifier. Since the generalized Dirichlet distribution can relax the two requirements of the Dirichlet distribution, it is the best prior for naïve Bayesian classifiers among the three multivariate distributions. This study therefore concludes that the Dirichlet assumption does restrict the possibility of improving the performance of the naïve Bayesian classifier.