Coronary Artery Disease Prediction Using Decision Trees and Multinomial Naïve Bayes with k-Fold Cross Validation

Endang S Kresnawati, Yulia Resti, Bambang Suprihatin, ...
e-ISSN: 2656-7245

Coronary artery disease has been the leading cause of death worldwide for at least two decades (2000-2019) and saw the largest increase in mortality over that period compared with other causes of death. Predicting coronary artery disease early from medical data benefits not only patients but also the stability of a country's economy. This paper discusses the prediction of coronary artery disease risk using two statistical learning methods, Multinomial Naïve Bayes and Decision Tree, with 10-fold cross-validation, where numerical variables are discretized to obtain categorical variables. The results show that the Decision Tree method performs better than the Multinomial Naïve Bayes method in predicting coronary artery disease. The Decision Tree method achieved an accuracy of 99.63%, a sensitivity of 100%, a specificity of 99.33%, a precision of 99.23%, and a negative predictive value of 100%. These measures indicate that the Decision Tree method is appropriate for predicting coronary artery disease, including on independent data (other coronary artery disease data with the same predictor variables). The results of this study also show that discretizing the numerical variables with references different from those used in previous studies can improve the performance of the method in predicting coronary artery disease.

In high-income countries, coronary artery disease has long been a major contributor to the overall disease burden, alongside stroke and cancer. The burden of this disease is also increasing in middle-income and low-income countries. Early detection of coronary artery disease based on medical data benefits not only patients but also economic stability (WHO, 2019). Purushottam et al. (2016) predicted coronary artery disease using the same dataset, but they filled in the missing data using the AllPossible-MV algorithm (Alcalá-Fdez et al., 2009; Alcalá-Fdez et al., 2011). They proposed several machine learning methods, namely Support Vector Machine (SVM), Decision Tree C4.5 Algorithm, Neural Network (NN), PART, Multilayer Perceptron (MLP), Radial Basis Function (RBF), TSEAFS, and Efficient System. The highest accuracy under 10-fold cross-validation was 86.3%, achieved by the Efficient System method, followed by RBF (78.53%), TSEAFS (77.45%), NN (76.47%), C4.5 Decision Tree (73.53%), PART (73.53%), and SVM (70.59%). Chowdary et al. (2020) also predicted coronary artery disease using the same dataset, but they converted some of the categorical data to numeric type. They implemented a large number of machine learning methods, namely Logistic Regression, Random Forest, Decision Tree, Gaussian Naïve Bayes, Binomial Naïve Bayes, Multinomial Naïve Bayes, K-Nearest Neighbor, Artificial Neural Network, and Voting of Logistic Regression and K-Nearest Neighbor (VLRAKN). Their findings show that the VLRAKN method has the highest accuracy, at 89%, obtained with a split validation scheme that uses 67% of the data as training data and 33% as test data. The other methods have accuracy rates between 80% and 88%. Their work also evaluates the prediction methods by sensitivity, specificity, precision, and F-measure, and the VLRAKN method has the highest value on all of these measures.
Multinomial Naïve Bayes and Decision Trees are two of the most popular and easy-to-understand classification methods. The Multinomial Naïve Bayes method uses Bayes' theorem to make its decisions, where each predictor variable must be categorical, following a multinomial distribution if it has more than two categories and a binomial distribution if it has only two (Chen & Fu, 2018). The Decision Tree method uses a tree-structured representation in which each node describes a variable, each branch describes a value of that variable, and each leaf describes a class. Decision Trees achieve a fairly high level of accuracy in various cases (Santoso, 2012).
This study discusses risk prediction for coronary artery disease, which can also be called early detection of heart disease, by implementing two statistical learning methods, namely Multinomial Naïve Bayes and Decision Trees, with 10-fold cross-validation as the model validation technique. The novelty of this study is the technique for categorizing five numerical variables in the research data, namely age (years), cholesterol level (mg/dl), fasting blood sugar level (mg/dl), maximum heart rate (bpm), and oldpeak (mV), using criteria different from those of Purushottam et al. (2016), David and Belcy (2018), and Riani et al. (2019).
The categorization of the five numerical variables is based on valid references that specifically discuss these numerical variables. In addition, in this study, the missing data was not included in the data processing because the majority of the data was incomplete. The performance of the two statistical learning methods is then measured based on the level of accuracy, sensitivity, specificity, precision, and negative predictive value (NPV). This performance measure is very important in practice, because it guides the choice of learning method or model, and provides a measure of the quality of the method or model that is finally selected, including for independent data (Hastie et al., 2009).

METHOD
The steps in this study are presented in Figure 1. The research data are the Heart Disease data from the Cleveland Clinic Foundation, which were donated as public data to the Center for ... The dependent variable is heart disease status, which consists of two categories, namely patients who have heart disease and patients who do not have heart disease. The predictor variables consist of two personal variables, age and gender, and 11 variables from medical examination results.

Figure 1. Research Methodology
The dependent variable is denoted as $Y_j$, $j$ = no, yes, where "no" represents patients who do not have heart disease and "yes" represents patients who have heart disease. The thirteen independent variables are denoted as $X_d$, $d = 1, 2, 3, \cdots, 13$, namely age ($X_1$), sex ($X_2$), type of chest pain ($X_3$), blood pressure at rest ($X_4$), cholesterol level ($X_5$), fasting blood sugar level > 120 mg/dl ($X_6$), ECG results at rest ($X_7$), maximum heart rate ($X_8$), exercise-induced angina-type chest pain ($X_9$), oldpeak or ST segment obtained from exercise relative to rest ($X_{10}$), ST segment slope ($X_{11}$), number of major vessels stained by fluoroscopy ($X_{12}$), and thalassemia ($X_{13}$). Thalassemia is a categorical variable with the categories three (normal), six (permanent disability), and seven (temporary disability). The dependent variable is categorical with the categories no and yes.

In the prediction process, the data are divided into two parts, namely training data and test data. The training data are used to build a prediction/classification learning model, while the test data are used to validate the previously built model. Cross-validation with many folds was chosen as the model validation method in this study because it has a small bias (Rodríguez et al., 2010). The division refers to Burger (2018).

The statistical learning methods used to build a predictive model of heart disease status in this work are the Multinomial Naïve Bayes method (Pan et al., 2018) and the Decision Tree (Han et al., 2012). These two methods often carry out prediction/classification tasks with a high degree of accuracy (Retnasari and Rahmawati, 2017).
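The Multinomial Naïve Bayes method is based on Bayes' theorem. Each observation is assigned to the group with the maximum posterior probability, which is obtained as the product of the prior probability and the likelihood. Let $P(X_{dk} \mid Y_{\text{yes}})$ and $P(X_{dk} \mid Y_{\text{no}})$ be the likelihoods of category $k$ of variable $X_d$ for patients diagnosed with heart disease and without heart disease, respectively, which are written as

$$P(X_{dk} \mid Y_j) = \frac{n(X_{dk} \mid Y_j) + 1}{n(X_d \mid Y_j) + m_d}, \qquad j = \text{no}, \text{yes}.$$

The posterior probability for $Y_j$ is

$$P(Y_j \mid X_1, \ldots, X_{13}) \propto P(Y_j) \prod_{d=1}^{13} P(X_{dk} \mid Y_j),$$

where the prior probability of each group is defined as

$$P(Y_j) = \frac{n(Y_j) + 1}{n + J}.$$

For the group of patients diagnosed with heart disease, $n(X_{dk} \mid Y_j)$ is the number of patients diagnosed with heart disease in variable $X_d$ with category $k$, $n(X_d \mid Y_j)$ is the number of patients diagnosed with heart disease in variable $X_d$, $n(Y_j)$ is the number of patients diagnosed with heart disease, $m_d$ is the number of categories in the variable $X_d$, $n$ is the total number of patients, and $J$ is the total number of groups in the study.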
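The classification rule described above can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' implementation: it assumes add-one smoothing of both the prior and the likelihood, and the function names (`fit`, `predict`) and the two-variable toy data are hypothetical.

```python
from collections import Counter, defaultdict
from math import log

def fit(rows, labels):
    """Count group sizes and, per group, how often each category occurs."""
    class_counts = Counter(labels)
    cat_counts = defaultdict(Counter)  # cat_counts[(d, y)][v]: rows of group y with X_d == v
    categories = defaultdict(set)      # categories[d]: categories observed for X_d
    for row, y in zip(rows, labels):
        for d, v in enumerate(row):
            cat_counts[(d, y)][v] += 1
            categories[d].add(v)
    return class_counts, cat_counts, categories

def predict(row, model):
    """Return the group with the largest log-posterior (up to a constant)."""
    class_counts, cat_counts, categories = model
    n, n_groups = sum(class_counts.values()), len(class_counts)
    scores = {}
    for y, n_y in class_counts.items():
        score = log((n_y + 1) / (n + n_groups))          # smoothed prior (assumption)
        for d, v in enumerate(row):
            m_d = len(categories[d])
            score += log((cat_counts[(d, y)][v] + 1) / (n_y + m_d))  # smoothed likelihood
        scores[y] = score
    return max(scores, key=scores.get)

# Toy data (hypothetical): chest pain type and exercise-induced angina.
rows = [
    ("asymptomatic", "yes"), ("asymptomatic", "yes"), ("asymptomatic", "no"),
    ("typical", "no"), ("typical", "no"), ("typical", "yes"),
]
labels = ["yes", "yes", "yes", "no", "no", "no"]
model = fit(rows, labels)
print(predict(("asymptomatic", "yes"), model))
print(predict(("typical", "no"), model))
```

Working in log space avoids numerical underflow when the product runs over many predictor variables; the ranking of the groups is unchanged.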
The Decision Tree is a classification method with a tree structure similar to a flow chart (Figure 3), where each internal node represents a test on a variable, each branch represents an outcome of the test, and each leaf node holds a class, while the topmost node is called the root node. The concept of a decision tree is to partition the data based on the highest gain value so that a decision tree is formed, which is then used to form decision rules using IF-THEN logic (Han et al., 2012).

Figure 3. Decision Tree Model
The main steps in constructing a decision tree are: first, selecting a variable as the root; second, creating a branch for each value of that variable; third, dividing the cases in each branch into classes; and fourth, repeating the process for each branch until all cases in each branch have the same class. The variable chosen as the root is the one with the highest information gain among all variables.
Before the highest gain value can be found, the entropy of every value of the variable is calculated first. Entropy measures the heterogeneity of the sample data with respect to the class. Once the entropy of the sample data is known, the variable that reduces it the most is the most influential one for classifying the data. This reduction is referred to as information gain.
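Entropy and Information Gain are obtained using (6)-(8), respectively.

$$\text{Entropy}(S) = -\sum_{j=1}^{J} p_j \log_2 p_j \qquad (6)$$

$$\text{Entropy}(S_{dk}) = -\sum_{j=1}^{J} p_{j \mid dk} \log_2 p_{j \mid dk} \qquad (7)$$

$$\text{Gain}(S, X_d) = \text{Entropy}(S) - \sum_{k=1}^{m_d} \frac{n_{dk}}{n}\, \text{Entropy}(S_{dk}) \qquad (8)$$

where $S$ is the set of all patients, $S_{dk}$ is the subset of patients in the $k$-th category of the predictor variable $X_d$, and $n$, $J$, $n_{dk}$, $m_d$, $p_j$, and $p_{j \mid dk}$ are, respectively, the total number of patients, the number of patient groups, the total number of patients in the $k$-th category of the predictor variable $X_d$, the number of categories in the variable $X_d$, the prior probability of the $j$-th group, and the prior probability of the $j$-th group in the $k$-th category of the predictor variable $X_d$.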
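The construction steps and the gain criterion can be sketched as a small recursive procedure. This is an illustrative ID3-style sketch under simple assumptions (all variables categorical, ties broken arbitrarily, no pruning), not the exact algorithm used in the study; the function names and the toy data are hypothetical.

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy (base 2) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def information_gain(rows, labels, d):
    """Entropy of the labels minus the weighted entropy after splitting on variable d."""
    n = len(labels)
    remainder = 0.0
    for v in set(row[d] for row in rows):
        subset = [y for row, y in zip(rows, labels) if row[d] == v]
        remainder += len(subset) / n * entropy(subset)
    return entropy(labels) - remainder

def build_tree(rows, labels, variables):
    """Return a class label (leaf) or a pair (variable index, {value: subtree})."""
    if len(set(labels)) == 1 or not variables:
        return Counter(labels).most_common(1)[0][0]   # pure node, or majority class
    best = max(variables, key=lambda d: information_gain(rows, labels, d))
    rest = [d for d in variables if d != best]
    branches = {}
    for v in set(row[best] for row in rows):
        pairs = [(r, y) for r, y in zip(rows, labels) if r[best] == v]
        sub_rows, sub_labels = zip(*pairs)
        branches[v] = build_tree(list(sub_rows), list(sub_labels), rest)
    return (best, branches)

# Toy data (hypothetical): variable 0 is chest pain type, variable 1 is sex.
rows = [("asymptomatic", "male"), ("asymptomatic", "female"),
        ("typical", "male"), ("typical", "female")]
labels = ["yes", "yes", "no", "no"]
print(information_gain(rows, labels, 0), information_gain(rows, labels, 1))
tree = build_tree(rows, labels, [0, 1])
print(tree)
```

In this toy set the chest pain variable separates the two classes completely, so it is chosen as the root and each branch ends in a leaf, which reads directly as an IF-THEN rule such as "IF chest pain is asymptomatic THEN yes".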
After the heart disease status had been predicted using the Multinomial Naïve Bayes and Decision Tree methods, the performance of the two methods was evaluated. For medical data, the performance of prediction results can be evaluated using accuracy, sensitivity, specificity, precision, and negative predictive value (NPV) (Maniruzzaman et al., 2017), as shown in equations (9)-(13), based on the confusion matrix in Table 2 (Gathak, 2017; Burger, 2018).
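The five measures follow directly from the four cells of a binary confusion matrix. The sketch below uses the standard definitions, with "yes" (has heart disease) taken as the positive class; the function name and the counts are hypothetical and are not results from this study.

```python
def ratio(num, den):
    """Divide safely; a measure is undefined (NaN) when its denominator is zero."""
    return num / den if den else float("nan")

def performance(tp, fn, fp, tn):
    """Compute the five measures from confusion-matrix counts.

    tp: predicted yes, actually yes    fn: predicted no, actually yes
    fp: predicted yes, actually no     tn: predicted no, actually no
    """
    return {
        "accuracy": ratio(tp + tn, tp + fn + fp + tn),
        "sensitivity": ratio(tp, tp + fn),   # true positive rate
        "specificity": ratio(tn, tn + fp),   # true negative rate
        "precision": ratio(tp, tp + fp),     # positive predictive value
        "npv": ratio(tn, tn + fn),           # negative predictive value
    }

# Hypothetical counts for one test fold, for illustration only.
measures = performance(tp=40, fn=10, fp=5, tn=45)
for name, value in measures.items():
    print(f"{name}: {value:.4f}")
```

Under 10-fold cross-validation, these measures would typically be computed once per fold and then summarized, for example by their mean and standard deviation.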

RESULT
This research begins by discretizing the numeric predictor variables based on the references presented in Table 3. Next, the data are randomly divided into 10 folds and then separated into training data and test data, as illustrated in Figure 1. Furthermore, Table 6 presents the prediction results for fold 1 to fold 10 using the Multinomial Naïve Bayes method based on each learning model. Prediction results using the Decision Tree method for fold 1 to fold 10 based on each learning model are presented in Table 7.
Table 6. Prediction of Heart Disease Status using Multinomial Naïve Bayes

DISCUSSION
Coronary artery disease has been predicted using many methods. This study proposes two statistical learning methods to predict coronary artery disease, namely Multinomial Naïve Bayes and Decision Trees. Both methods require all variables to be categorical, so the numerical variables in the research data are first discretized to obtain categorical variables. The discretization results for the five numerical variables presented in Table 3 show that age ($X_1$), blood pressure at rest ($X_4$), and cholesterol level ($X_5$) each have three categories, while maximum heart rate ($X_8$) and oldpeak or ST segment obtained from exercise relative to rest ($X_{10}$) each have two categories. This discretization differs from those in Purushottam et al. (2016), David and Belcy (2018), Riani et al. (2019) (only the variable $X_{10}$ is the same), and Chowdary et al. (2020). Unlike Purushottam et al. (2016), who filled in the missing data, this study did not include missing data.
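The discretization step can be sketched as a lookup of each numeric value against a sorted list of cut-points. The cut-points and category labels below are hypothetical placeholders chosen only to show the mechanics (three categories for age, two for maximum heart rate); the criteria actually used in this study are those of Table 3 and are not reproduced here.

```python
from bisect import bisect_right

# Hypothetical cut-points and labels, for illustration only.
CUTS = {
    "age": ([45, 60], ["low", "medium", "high"]),   # three categories
    "max_heart_rate": ([150], ["low", "high"]),     # two categories
}

def discretize(value, edges, labels):
    """Map a number to the label of the interval it falls in.

    Intervals are closed on the left: value < edges[0] gets labels[0],
    edges[0] <= value < edges[1] gets labels[1], and so on.
    """
    if len(labels) != len(edges) + 1:
        raise ValueError("need exactly one more label than edges")
    return labels[bisect_right(edges, value)]

def discretize_record(record, cuts=CUTS):
    """Return a copy of the record with each variable listed in cuts replaced by its category."""
    out = dict(record)
    for name, (edges, labels) in cuts.items():
        if name in out:
            out[name] = discretize(out[name], edges, labels)
    return out

patient = {"age": 52, "max_heart_rate": 162, "sex": "male"}
print(discretize_record(patient))
```

Keeping the cut-points in a single table makes it straightforward to swap in criteria from a different reference and compare the resulting model performance.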
Randomly dividing the data into 10 folds and then separating them into training data for model learning and test data for model validation, as presented in Table 4, shows that every fold has the same size, for both the training data, which contain nine folds, and the test data, which contain one fold. However, the numbers of patients diagnosed with and without heart disease are not exactly equal. In the 1st and 3rd folds, the number of patients diagnosed with heart disease is larger than the number diagnosed without it, while in the other folds the opposite holds.
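A random 10-fold split of this kind, together with the class counts in each test fold, can be sketched as follows. This is a minimal, unstratified sketch; the seed, the function names, and the label vector are hypothetical and do not reproduce the split in Table 4.

```python
import random
from collections import Counter

def k_fold_indices(n, k=10, seed=0):
    """Shuffle the indices 0..n-1 and deal them into k folds of near-equal size."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]

def train_test_splits(folds):
    """Yield (train, test) index lists; each fold is used once as test data."""
    for i, test in enumerate(folds):
        train = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train, test

# Hypothetical labels, for illustration only.
labels = ["yes"] * 23 + ["no"] * 27
folds = k_fold_indices(len(labels), k=10)
for fold_no, (train, test) in enumerate(train_test_splits(folds), start=1):
    counts = Counter(labels[j] for j in test)
    print(fold_no, len(train), len(test), dict(counts))
```

Because the shuffle ignores the class labels, the yes/no balance varies from fold to fold, which matches the behaviour described above; a stratified split would be one way to even it out.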
Model learning using the Multinomial Naïve Bayes method shows that each variable has a different probability (likelihood) in each category, as presented in Table 5 for the 1st learning data. The same holds for the other nine learning data. In the learning model using the Decision Tree method, as shown in Figure 4, the root node is the type of chest pain variable ($X_3$), which has four categories, and each of the four categories leads to a different number of internal nodes and leaves. The typical angina category has the fewest internal nodes and leaves compared with the atypical angina, non-anginal pain, and asymptomatic categories, while the asymptomatic category has the most.
The performance measures of the two methods as presented in Table 6 and

CONCLUSION
This study succeeded in predicting the risk of coronary artery disease using two statistical learning methods, namely Multinomial Naïve Bayes and Decision Trees. Numerical variables in the research data were discretized to obtain categorical variables by referring to valid sources. The learning model validation technique used is 10-fold cross validation. The results showed that the performance measures of the Decision Tree method were more consistent, higher, and had a relatively smaller standard deviation than the Multinomial Naïve Bayes method. These results indicate that the performance of the Decision Tree method is better than the Multinomial Naïve Bayes method in predicting coronary artery disease. The results of this study also indicate that differences in reference in discretizing numerical variables can affect the performance of the method in predicting the risk of coronary artery disease.