Prediction of Liver Diseases by Using Few Machine Learning Based Approaches

Advancement in medical science has always been one of the most vital concerns of the human race. As technology progresses, modern techniques and equipment are increasingly applied to treatment. Machine learning techniques are now widely used in medical science to improve accuracy. In this work, we constructed computational models for accurate liver disease prediction. We used several efficient classification algorithms, namely Random Forest, Perceptron, Decision Tree, K-Nearest Neighbors (KNN), and Support Vector Machine (SVM), to predict liver diseases. Our work provides an implementation of hybrid model construction and a comparative analysis aimed at improving prediction performance. First, the classification algorithms were applied to the original liver patient dataset collected from the UCI repository. Then we analyzed and adjusted the features to improve the performance of our predictor and carried out a comparative analysis among the classifiers. We found that the KNN algorithm outperformed all other techniques when feature selection was applied.


INTRODUCTION
Researchers in the healthcare sector face the challenging task of predicting diseases from voluminous medical databases, and data mining techniques have become essential in healthcare. Data mining tools and techniques, including classification, clustering and association rule mining for finding frequent patterns, are applied to medical data for disease prediction. Among these, classification techniques are especially valued in medical diagnosis and disease prediction (Ramana et al., 2011). Chronic liver disease is a leading cause of death and affects a large number of people worldwide. It is caused by a combination of certain substances that damage the liver (Rahman et al., 2019). The liver is the largest internal organ in the human body; it plays a major role in metabolism and serves several vital functions. It weighs about 3 lb (1.36 kg). The liver supports almost every organ in the body and is vital for survival. Liver disease may not show any symptoms at an early stage, or the symptoms may be mild, such as minor sickness and weakness. Symptoms depend to some extent on the type and severity of the liver disease. Liver diseases are diagnosed based on the liver function test (Karthik et al., 2011). Classification techniques are widely applied in automatic medical diagnosis. Liver problems are not easily detected at an early stage because the liver continues to function normally even when it is damaged (Liu and Huang, 2008). Early diagnosis of liver problems improves patients' survival rate. Liver disease is often diagnosed by analyzing the enzyme levels in the blood (Schiff et al., 2007).

Review of Literature
Several researchers have previously worked on liver disease diagnosis and reported the accuracy of different machine learning algorithms using different tools. Rajeswari and Reena (2010) performed data classification on liver disorders and reported results obtained with the Naive Bayes, FT Tree and KStar algorithms. Dhamodharan (2014) predicted three major liver diseases, namely liver cancer, cirrhosis and hepatitis, from their distinct symptoms, using the Naïve Bayes and FT Tree algorithms. The two algorithms were compared purely on classification accuracy, and the experimental results showed that Naïve Bayes predicted the diseases with higher accuracy than FT Tree. Ramana et al. (2011) considered the Naïve Bayes classifier, C4.5, the back-propagation neural network and Support Vector Machines, and evaluated them on four criteria: accuracy, precision, sensitivity and specificity. Karthik et al. (2011) worked in three phases. In the first phase, an ANN was used to classify liver disease. In the second phase, rough set rule induction using the LEM (Learn by Example) algorithm was applied to generate classification rules. In the third phase, fuzzy rules were applied to identify the type of liver disease.
Aneeshkumar and Venkateswaran (2012) applied classification and found that the overall performance of the C4.5 decision tree was better than that of Naive Bayes. Rajeswari and Reena (2010) (see also Pahariyavohra et al., 2014) used Naive Bayes, KStar and FT Tree to analyze liver disease. The dataset was taken from UCI and consists of 345 instances and 7 attributes. A 10-fold cross-validation test was performed using the WEKA tool. Naïve Bayes achieved 96.52% accuracy in 0 sec, and FT Tree achieved 97.10% accuracy in 0.2 sec. Paul R Harper reported that there is not necessarily a single best classification tool; instead, the best-performing algorithm depends on the features of the dataset being examined. Ramana et al. (2012) proposed a modified rotation forest algorithm with a multilayer perceptron classification algorithm and a random subset feature selection method for the UCI liver dataset.
Rosalina and Noraziah (2010) predicted hepatitis prognosis using SVM and the Wrapper Method. From the experimental outcome, they observed that the accuracy rate was maintained at a lower clinical lab test cost and with lower execution time. They achieved this goal by combining the Wrapper Method with SVM. Among the most influential work in microarray analysis is that of Rifkin et al. (2003), who used a Support Vector Machine to predict, with about 80% accuracy, the origin of tumors from samples collected at Massachusetts General and other medical institutions.

METHODOLOGY:
This research addresses the features of liver disease data, including the process of feature selection applied to the dataset and its effect on the performance of the constructed models. A comparative analysis of classification algorithms is performed to improve the accuracy of liver patient prediction, both with and without feature selection. The paper addresses questions that help clarify various aspects of liver patient classification. The work shows that feature selection, the process of choosing a subset of the most relevant features for model construction, is highly significant: applying feature selection to the ILPD (Indian Liver Patient Dataset) before a classification algorithm increases the performance of that algorithm. It also shows that the train/test split size affects the accuracy of the different algorithms. In this paper, five classification algorithms, Decision Tree, Perceptron, Support Vector Machine, Random Forest and K-Nearest Neighbors, are compared on the ILPD.

B. Proposed Algorithms
Decision Tree Classifier -Our first algorithm is the Decision Tree classifier, one of the most widely used machine learning algorithms to date. Decision trees are applied to both classification and regression problems. One reason for choosing the Decision Tree classifier over other classifiers is that decision trees mimic the way humans reason, so it is quite simple to understand the data and reach sound conclusions or interpretations. A decision tree is a tree made up of nodes, where each node represents a feature (attribute), each link (branch) represents a decision, otherwise known as a rule, and each leaf represents an outcome, which is a categorical or continuous value. The idea is to create a tree for the entire dataset and obtain an outcome at every leaf (Russell, 2002; Gulia et al., 2014).
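
The node, branch and leaf structure described above can be illustrated with a minimal sketch. It is not the implementation used in this study: the Gini impurity criterion, the depth limit, the function names and the toy feature values are all assumptions made for illustration.

```python
from collections import Counter

def gini(labels):
    """Gini impurity of a list of class labels (0.0 means a pure node)."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def best_split(rows, labels):
    """Return the (feature index, threshold) with the lowest weighted Gini impurity."""
    best_f, best_t, best_score = None, None, float("inf")
    n = len(rows)
    for f in range(len(rows[0])):
        for t in sorted({r[f] for r in rows}):
            left = [y for r, y in zip(rows, labels) if r[f] <= t]
            right = [y for r, y in zip(rows, labels) if r[f] > t]
            if not left or not right:
                continue  # a split must send samples to both branches
            score = (len(left) * gini(left) + len(right) * gini(right)) / n
            if score < best_score:
                best_f, best_t, best_score = f, t, score
    return best_f, best_t

def build_tree(rows, labels, depth=0, max_depth=3):
    """Recursively build a tree; each leaf stores the majority class."""
    majority = Counter(labels).most_common(1)[0][0]
    if depth == max_depth or gini(labels) == 0.0:
        return {"leaf": majority}
    f, t = best_split(rows, labels)
    if f is None:
        return {"leaf": majority}
    left = [(r, y) for r, y in zip(rows, labels) if r[f] <= t]
    right = [(r, y) for r, y in zip(rows, labels) if r[f] > t]
    return {
        "feature": f,
        "threshold": t,
        "left": build_tree([r for r, _ in left], [y for _, y in left], depth + 1, max_depth),
        "right": build_tree([r for r, _ in right], [y for _, y in right], depth + 1, max_depth),
    }

def predict(tree, row):
    """Follow the decision rules from the root down to a leaf."""
    while "leaf" not in tree:
        branch = "left" if row[tree["feature"]] <= tree["threshold"] else "right"
        tree = tree[branch]
    return tree["leaf"]

# Toy data (made-up values): two numeric features per patient, label 1 = liver patient.
rows = [[0.7, 20], [0.9, 25], [3.2, 80], [4.1, 95]]
labels = [0, 0, 1, 1]
tree = build_tree(rows, labels)
print(predict(tree, [0.8, 22]), predict(tree, [3.5, 90]))
```

Each internal dictionary plays the role of a node testing one feature, the `left` and `right` entries are the branches, and the `leaf` entries hold the outcome.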

Perceptron -Perceptron is a single-layer neural network and a linear classifier (binary).
K-Nearest Neighbor Algorithm -The K-Nearest Neighbor algorithm (KNN) is one of the most widely used supervised learning algorithms and has been applied in many data mining applications. It classifies an object based on the closest training samples in the feature space: the object is assigned the class held by the majority of its neighbors. The neighbors are taken from a set of objects for which the correct classification is known (Russell, 2002).
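
The majority-vote rule described above can be sketched in a few lines. This is an illustrative sketch only: the Euclidean distance, the value k = 3, the function name `knn_predict` and the toy data are assumptions, not details taken from this study.

```python
import math
from collections import Counter

def knn_predict(train_rows, train_labels, query, k=3):
    """Classify `query` by majority vote among its k nearest training samples."""
    # Pair each training sample's Euclidean distance to the query with its label.
    distances = sorted(
        (math.dist(row, query), label)
        for row, label in zip(train_rows, train_labels)
    )
    # Count the labels of the k closest samples and return the most common one.
    votes = Counter(label for _, label in distances[:k])
    return votes.most_common(1)[0][0]

# Toy data (made-up values): two clusters in a two-dimensional feature space.
train_rows = [[1.0, 1.0], [1.2, 0.8], [0.9, 1.1], [4.0, 4.2], [4.1, 3.9], [3.8, 4.0]]
train_labels = [0, 0, 0, 1, 1, 1]

print(knn_predict(train_rows, train_labels, [1.1, 1.0]))
print(knn_predict(train_rows, train_labels, [4.0, 4.0]))
```

Because the vote depends on raw distances, features measured on very different scales would normally be rescaled before applying KNN.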

Support Vector Machines (SVM) -
SVM is a learning system that uses a hypothesis space of linear functions in a high-dimensional feature space, trained with a learning algorithm from optimization theory that implements a learning bias derived from statistical learning theory (Russell, 2002).

C. Performance Metrics
To assess the results of the study accurately, performance metrics other than accuracy alone are also reported in the results section. These metrics gave a clear indication of which results were better across the different folds and splits of the dataset.
Performance parameters are the most important factor in comparing classifiers and identifying the best one. The performance metrics applied include accuracy, precision, recall and F-score. These parameters are calculated from a confusion matrix, which is produced at every step of classification (Russell, 2002). The confusion matrix and these parameters are described as follows:
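
As a minimal sketch of how these four metrics follow from the confusion matrix of a binary classifier, the standard definitions are used here: accuracy = (TP + TN) / (TP + TN + FP + FN), precision = TP / (TP + FP), recall = TP / (TP + FN), and F-score = 2 · precision · recall / (precision + recall). The function names and the example labels are hypothetical and are not results from this study.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Count true/false positives and negatives for a binary problem."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t != positive and p != positive)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != positive and p == positive)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p != positive)
    return tp, tn, fp, fn

def classification_metrics(tp, tn, fp, fn):
    """Compute accuracy, precision, recall and F-score from confusion-matrix counts."""
    total = tp + tn + fp + fn
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f_score = 2 * precision * recall / denom if denom else 0.0
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f_score": f_score}

# Made-up labels for illustration: 1 = liver patient, 0 = non-patient.
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0]

counts = confusion_counts(y_true, y_pred)
metrics = classification_metrics(*counts)
print(counts, metrics)
```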

D. Model Evaluation
To carry out the evaluation and analysis, we first preprocessed the dataset by removing null values and converting textual features to numerical values. Then we visually examined the relationships between the different features in the dataset. After processing, the dataset was split into a training set (75%) and a testing set (25%) for model construction.
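
The preprocessing and 75/25 split described above might look like the following sketch. The record layout, the column names (`gender`, `total_bilirubin`, `label`), the Male/Female encoding, the random seed and the sample values are all assumptions for illustration; the actual pipeline used in this study may differ.

```python
import random

def preprocess(records):
    """Drop records with missing values and encode the textual gender field as 0/1."""
    cleaned = []
    for rec in records:
        if any(value is None for value in rec.values()):
            continue  # remove rows that contain null values
        row = dict(rec)
        row["gender"] = 1 if row["gender"] == "Male" else 0
        cleaned.append(row)
    return cleaned

def split_train_test(rows, test_fraction=0.25, seed=42):
    """Shuffle a copy of the rows and split it into training and testing sets."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]

# Made-up records for illustration; one has a missing value and is dropped.
records = [
    {"age": 65, "gender": "Female", "total_bilirubin": 0.7, "label": 1},
    {"age": 62, "gender": "Male", "total_bilirubin": 10.9, "label": 1},
    {"age": 58, "gender": "Male", "total_bilirubin": 1.0, "label": 1},
    {"age": 46, "gender": "Male", "total_bilirubin": None, "label": 1},
    {"age": 26, "gender": "Female", "total_bilirubin": 0.9, "label": 0},
    {"age": 29, "gender": "Female", "total_bilirubin": 0.9, "label": 0},
    {"age": 17, "gender": "Male", "total_bilirubin": 0.9, "label": 0},
    {"age": 55, "gender": "Male", "total_bilirubin": 0.7, "label": 1},
    {"age": 57, "gender": "Male", "total_bilirubin": 0.6, "label": 1},
]

cleaned = preprocess(records)
train, test = split_train_test(cleaned)
print(len(cleaned), len(train), len(test))
```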
After the training data was used to build each model, the testing data was applied to measure its performance. We then adjusted the features to obtain better results: we computed the correlation between each pair of features and omitted one of the two when they were linearly related. The reduced dataset was split again and fed to the algorithms. Lastly, we measured whether the performance of each algorithm had improved.
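
The correlation-based feature adjustment described above can be sketched as follows. The use of the Pearson coefficient, the 0.9 threshold, the rule of keeping the first feature of each correlated pair, and the column names and values are assumptions for illustration, since the exact criterion is not stated here.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient between two equal-length numeric lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    std_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    std_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    if std_x == 0 or std_y == 0:
        return 0.0  # a constant column has no defined correlation
    return cov / (std_x * std_y)

def drop_correlated(columns, threshold=0.9):
    """Keep a feature only if it is not strongly correlated with one already kept."""
    kept = []
    for name in columns:
        if all(abs(pearson(columns[name], columns[k])) < threshold for k in kept):
            kept.append(name)
    return kept

# Made-up feature columns for illustration.
columns = {
    "total_bilirubin": [0.7, 1.0, 3.2, 4.1, 7.3],
    "direct_bilirubin": [0.2, 0.3, 1.5, 2.0, 3.6],
    "age": [65, 30, 45, 72, 28],
}

selected = drop_correlated(columns)
print(selected)
```

In this toy example the two bilirubin columns are almost perfectly linearly related, so only the first is kept alongside `age`.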

RESULTS AND DISCUSSION:
After processing the dataset, we fed it into the above-mentioned machine learning classifiers one by one. First, we ran the classifiers on the processed dataset, assessed the results and carried out a comparative analysis among the classifiers. Then we ran the classifiers again after selecting useful features to improve their performance. We conducted an elaborate experiment on all the classifiers mentioned above and found KNN to be the best-performing classifier. Performance was compared among these classification algorithms before and after applying feature selection. From the analysis of each algorithm, we can say that the highest accuracy was initially obtained with the Support Vector Machine, as reported in Table 2 and also shown in

CONCLUSION AND FUTURE DIRECTION:
This work presents an approach that can be used for hybrid model construction in community health services. These classification algorithms can also be applied to the prediction and classification of other major diseases, such as cardiac disease and diabetes. More than one dataset may be used for a better approach and comparison. Another direction is to examine whether new algorithms yield any improvement over the techniques used in this work.
More techniques for increasing accuracy may be applied. The wrapper method may be applied to remove noise from the dataset.
Classification rules and disease identification techniques may also be generated using other efficient algorithms. Our work has certain limitations, as the model underperformed, with lower accuracy than expected. In future, the inclusion of deep learning methods may improve our results further.

ACKNOWLEDGEMENT:
Many thanks to the co-author, who provided assistance with the analysis and writing needed to conduct this research successfully.

CONFLICTS OF INTEREST:
The authors declare they have no competing interests with respect to the research.