Students Performance Measurement and Prediction Based on Academic Features Through the Machine Learning



INTRODUCTION:
We are living in the age of data. Like all modern organizations, educational institutions collect a large volume of data about their students. All educational institutions work in a strongly competitive environment. Most of this data is used only for simple queries and traditional reports, and a significant amount of it is left unused. Data mining is the process of discovering useful and meaningful patterns and information in a large amount of data. It provides tools and technology that support the knowledge discovery process. It is a step-by-step process that runs from integrating data from different sources, through mining useful patterns and information, to representing the results in a way that is easy for humans to interpret.
In this research, with the help of data mining and machine learning methodologies, we try to find useful patterns and information in the available data about students. We try to identify the factors that affect students' results and overall performance. In the field of academics, data mining is very effective in discovering valuable information which can be used for profiling students based on their academic records (Suhem et al., 2012). Predictive information gives us valuable time in which we can take measures such as counseling and additional coaching to improve the performance of students whose results are declining. Data mining, generally defined as the process of discovering meaningful patterns in large quantities of data, offers a great variety of techniques, methods, and tools for a thorough analysis of available data in various fields (Dorina K., 2012). Through our research, we can get insight into students' preferred teaching methods and fields. The findings can help us to improve education quality and technical skills. The research aims to measure students' performance to support their career development and to track the factors affecting that performance. It can also give us an idea of the dependencies and patterns among those factors.
The motivation behind our research lies in the large possibilities and goals that we can achieve through it. The education sector plays a key role in the development of an individual as well as a nation. Hence continuous improvement and development of this sector is a must. Gathering important information about students and the factors that affect their results and performance can be very useful. It can help us to find interesting patterns and to use them to improve student performance. Various tools, technologies and methods are available due to the advancement of science. With these tools and technologies, the regular performance of students can be monitored. Based on this type of research, students can become aware of their present and future academic condition; hence we are motivated to work in this field.
Many institutions still manage students' information manually, and significant data are left unused. They gather various information about students but do not use it for further analysis or any kind of prediction. We have to use this large amount of data productively and to our advantage. Manual management of student data limits its usability. Every student has a preferred and suitable subject or sector to study, but for many reasons not all students can choose it. Some students are forced to study subjects or fields they do not favor, which limits their potential. Data analysis can help in understanding their skillsets and potential, which can eventually guide them. Not all students fall into the same category. There is always a gap between good, average, and weak students. Because this gap persists, weaker students struggle to catch up. Guidance is lacking, as some students need to be handled differently from others. For this reason, good students usually continue to perform well while weak students keep lagging behind. We pressure them to study well and achieve good results in exams. If we focus on their potential instead, they might have a better future. Students can achieve their highest potential if their talents and skillsets are discovered in the early stages and they are properly guided. Objectives describe what we expect to achieve from the research. The main objectives of the research are listed here.

Students' performance measurement
In this work, we collect the information from each student's track record and analyze it to see whether the student's performance is increasing or decreasing.

Finding students' potentiality
By analyzing students' track records and all informative data, we can learn about a student's potentiality.

Achieve steps of learning approach
The findings of this work will help teachers to plan the steps of the learning approach, so that they can direct the right effort to each student.

Improving education quality and technical skills
By finding out each student's potentiality, we can improve educational quality and technical skills through this approach. That will help students to build a better future.

Factors affecting students' performance
Through our analysis we find the factors affecting students' performance and take the necessary steps according to the findings. This can help both teachers and students to make use of the factors that affect those students. We also look for dependencies and patterns among those factors. Therefore the research will not only help us to improve students' performance but also bring out their full potential. It will help to develop new teaching methods according to student types and provide insight into student sentiment. That may decrease the number of student dropouts. As the results later in the paper show, the research achieved satisfactory performance. The remainder of this paper is organized as follows. In Section 2, we discuss students' sentiment. Sections 3 and 4 present the methodology and the result analysis of the research; conclusions are discussed in Section 5.

Students' sentiment analysis
Students' sentiment analysis is an analysis process that measures students' emotions, attitudes, or opinions based on students' data. In this study, we evaluate students' responses to their academic studies, such as how well they do in their class tests and final exam. These responses determine the outcome in their result. We compare their present result with their past result to evaluate their progress or decline. We use their numeric information, namely attendance, performance marks, class-test marks, and exam marks, to predict their result and its status. We can evaluate students' sentiment from their academic activities. Therefore, if students do not perform according to expectations, we can consult with them to learn whether they are facing any problems or difficulties. Students are a crucial asset for the long-term development of any nation. If we give them proper guidelines and show them the best paths, they will be well placed to bring prosperity to both their society and the nation.

Datasheet attributes
Initially, to analyze students' sentiment, we need data consisting of their various features. For this purpose, we used an xlsx file to store the data. We prepared two datasheets covering two years for 1000 students in four subjects. As attributes we entered Student ID, Name and, per subject, Class Attended, Attendance_percentage, Attendance, Class Test (CT_1, CT_2, CT_3), Exam, Grade, and Status. Then we converted the Excel file into a CSV file, a plain text file that contains a list of data and is used for exchanging data between different tools. We stored students' information digitally and preprocessed it for better quality analysis. If any student's information was missing, we filled it in with a missing-value algorithm; missing data can be filled with mean or median values. We normalized the data so that all students' data lie in a common range, for better use and reduced complexity. Then we classified the data and visualized it to find the desired outcome.

Problem finding and necessary measures
Through our analysis using various methods, we look for the problems behind students' bad results, depression and inattentiveness. After analyzing the data, we can get various insights into the factors that are affecting their performance. This research explores as well the possibility of identifying the key indicators in the small dataset, which will be utilized in creating the prediction model, using visualization and clustering algorithms (Lubna M. A. Z., 2019). With our analysis, we can identify the students who are performing unexpectedly. With this information, we can work with each such student to get an idea of his or her problems or issues. Some students fail or get bad results because they do not like a subject, some do not understand it, some are not interested in studying, some have family issues, and some need guidance. Throughout the work, we can find out what problems they are facing and give them suggestions. If the problems are solely based on their academics, we can provide them with extra classes, tutoring, individual tasks, etc. If they have family issues, we can monitor them so that we can give them extra care. If they feel depressed, we can give them counseling. If their parents are in a bad financial state, we can help to motivate them and provide them with a waiver or scholarship opportunities. Our findings can also help to understand their sentiment towards their studies. This can help them to bring out their potential and individual skillsets. Students will also be more cheerful when they are working on what they love and admire.

METHODOLOGY:
The objective of our proposed methodology is to find useful patterns and valuable information. To accomplish this, we have to go through a step-by-step process from data preparation to knowledge representation. The existing methodologies of students' information management and performance measurement are as follows.

Manual Evaluation
The amount of data in the world is increasing day by day. Therefore it is now hard to handle this data manually. It is not feasible for anyone to record this data in notes or other handwritten forms. Such a process also limits the usability of the data.

Excel Datasheet
Excel is good software for recording data. But Excel cannot handle a vast amount of data properly, and it is quite hard to extract some important information from an Excel sheet. This is the era of big data, and we need to extract information that can help us make predictions for further inquiry. Excel also has limited visualization and knowledge representation methods.

Limited Visualization
We need to increase our visualization and utilization of the data. It can help us in decision making for the future. Limited visualization limits human interpretation and a better understanding of the data.

Descriptive Information
In this research, we are trying to find out the potentiality of data from the findings. Unused data can also help us if we use it properly and extract the necessary information from it. The methods currently in use give us descriptive information for particular findings. They do not provide predictive information, which can be very beneficial to our goal. Our overall methodology is shown in the diagram in Fig. 1. It represents the steps through which we reach our expected outcome. First, we go through the data integration phase; then we preprocess the data by applying various techniques. After that, we apply different machine learning algorithms to extract knowledge and finally carry out knowledge representation.

Data integration
Data integration is a process in which data from different heterogeneous sources is combined into a single source that provides a unified view of the total data. It is a significant step in a variety of situations such as data analysis and data processing. Our dataset consists of academic records with various attributes for 1000 students. Each subject has its own data such as attendance, class performance, class test, exam, result, grade, etc. These are the attributes commonly used in universities for students' academic records. Our research explores as well the possibility of identifying the key indicators in the small dataset, which will be utilized in creating the prediction model, using visualization and clustering algorithms (Lubna M. A. Z., 2019). In our dataset, we integrated per-subject attendance from different sources, such as data from different course teachers. From the attendance, we calculated the per-subject attendance marks and percentages. Teachers give students performance marks based on their performance in the classroom, and we integrated these performance marks into our dataset. Each subject has three class-test marks, and their average is entered into the dataset. Final exam marks are also entered into the dataset.
We calculated the final result from the attendance marks, performance marks, class-test average marks and final exam marks. Based on the final result, we assigned each student a grade and the outcome pass or fail. In this manner, we combined the data of 4 subjects into our dataset for the 1st year and the 2nd year. Fig. 2 shows the first-year data of one subject out of the four in our main dataset.

Data preprocessing
Data preprocessing is the process of preparing raw data and making it suitable for data mining and machine learning. It is the first and a crucial step in creating a machine learning model. If the dataset contains much redundant or irrelevant data, knowledge discovery from the dataset becomes difficult and inefficient. Data preprocessing is a step-by-step process. The steps are:
1) Gather the dataset
2) Import the necessary libraries
3) Find the missing values and handle them
4) Encode the categorical data
5) Remove noise or outliers
6) Scale the values
While working on a large amount of data, we often encounter missing values. They can also arise during a data transaction. In our dataset, we handle the missing values with median values. Missing values have a significant effect on the observations, so they need to be handled; otherwise we may get unwanted results. Missing values can also be handled manually, but that consumes valuable time. The empty values can also be filled with mean values. We must remember that values filled in this way are estimates and will not be exact. Then we normalized the values in our dataset to the 0-1 range, which puts all attributes on a common scale. Min-max normalization is the simplest method and consists of rescaling each feature to a predefined range. The equation for min-max normalization is given below.
$$v' = \frac{v - min_A}{max_A - min_A}\,(new\_max_A - new\_min_A) + new\_min_A$$
Where,
$v'$ = Normalized value
$v$ = Value to be normalized
$min_A$ = Minimum value of that attribute
$max_A$ = Maximum value of that attribute
$new\_min_A$ = New minimum value of that attribute
$new\_max_A$ = New maximum value of that attribute
Another data preprocessing step is the analysis and removal of noise or outliers. This can be done through clustering: the clusters can show us whether any noise or outliers are present in the dataset.
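The median filling and min-max normalization described above can be sketched in a few lines of Python. This is a minimal illustration and not the authors' code: the function names `fill_missing_with_median` and `min_max_normalize`, and the use of `None` to mark a missing entry, are assumptions made for the example.

```python
from statistics import median


def fill_missing_with_median(values):
    """Replace missing entries (None, an assumed marker) with the median
    of the observed entries of the same attribute."""
    observed = [v for v in values if v is not None]
    fill = median(observed)
    return [fill if v is None else v for v in values]


def min_max_normalize(values, new_min=0.0, new_max=1.0):
    """Rescale one attribute to [new_min, new_max] with the min-max formula."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # A constant attribute carries no spread; map it to new_min.
        return [new_min for _ in values]
    return [(v - lo) / (hi - lo) * (new_max - new_min) + new_min for v in values]


# Hypothetical class-test marks with one missing entry.
marks = [10, None, 30, 20]
normalized = min_max_normalize(fill_missing_with_median(marks))
```

Each attribute (column) is normalized separately, so attendance and exam marks end up on the same 0-1 scale even though their raw ranges differ.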

Algorithms
After data preprocessing, we have consistent data in our hands. Now we can apply various kinds of algorithms using specific attributes, according to the pattern we are looking for. In data mining and machine learning, there are generally two types of algorithms: unsupervised learning and supervised learning. An unsupervised learning algorithm learns patterns and information from unlabeled data. This method is appropriate when hidden patterns or groups need to be discovered in a dataset without human intervention. In our research, we implemented the K-means clustering method as our unsupervised learning algorithm. K-means clustering is the process of grouping a set of data points into several groups such that objects in the same group are more similar to each other than to objects in other groups. In K-means clustering, the value of K defines the number of clusters. K-means tries to make the intra-cluster data points as similar as possible while keeping the clusters as different as possible. The algorithm assigns data points to clusters such that the sum of the squared distances between each data point and its cluster centroid is at a minimum. The k-means algorithm works as follows:
1) Specify the number of clusters
2) Initialize the centroids by first shuffling the dataset and then randomly selecting k data points for the centroids without replacement
3) Repeat the following steps until the centroids no longer change:
a) Compute the sum of the squared distance between the data points and all centroids
b) Assign each data point to the nearest cluster
c) Compute the centroid of each cluster by taking the average of the data points that belong to it
K-means can give us an idea of the data we are dealing with and also provide insight into the dependencies between the attributes. It can also show us whether any outliers or noise are present in the dataset. So, it is a useful process for outlier analysis as well as noise removal. There is one more thing to keep in mind while doing K-means clustering. The value of k determines how many clusters we will get, so we need to choose the optimal k value, which is hard for a human to select. That is why we used the Elbow Method to select the k value. This method gives us an idea of what the optimal number of clusters would be, based on the sum of squared distances (SSE) between the data points and their assigned cluster centroids. K is chosen at the spot where the SSE starts to flatten out and form an elbow. Using this method, we can hope to get the best k value according to our dataset (
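The K-means steps and the elbow method described above can be sketched in plain Python as follows. This is an illustrative sketch under stated assumptions, not the implementation used in the study: the function names, the fixed `seed`, and the `max_iter` cap are hypothetical, and the loop stops when the cluster assignments stop changing, which implies that the centroids have stopped changing too.

```python
import random


def squared_distance(p, q):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeans(points, k, seed=0, max_iter=100):
    """Cluster `points` (a list of tuples) into k groups.

    Returns (centroids, labels, sse), where sse is the sum of squared
    distances between each point and its assigned centroid.
    """
    rng = random.Random(seed)
    # Pick k distinct data points as the initial centroids.
    centroids = rng.sample(points, k)
    labels = None
    for _ in range(max_iter):
        # Assign each point to its nearest centroid.
        new_labels = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        if new_labels == labels:
            break  # assignments are stable, so the centroids are too
        labels = new_labels
        # Recompute each centroid as the mean of its member points.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = tuple(sum(col) / len(members) for col in zip(*members))
    sse = sum(squared_distance(p, centroids[lab]) for p, lab in zip(points, labels))
    return centroids, labels, sse


def elbow_curve(points, k_values, seed=0):
    """SSE for each candidate k; the 'elbow' is where the curve flattens."""
    return {k: kmeans(points, k, seed)[2] for k in k_values}
```

For a two-attribute view such as normalized class attendance against result, `points` would be a list of `(attendance, result)` tuples. Plotting the values returned by `elbow_curve` against k and looking for the bend is one way to pick k; because the initial centroids are random, repeating the run with several seeds is a common precaution.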

Knowledge representation and visualization
Knowledge representation is a technique for human interpretation through the experience of past problems. It can be viewed from different perspectives.
The usefulness of the knowledge and information depends on how they are represented. It is not just transferring information but also attempts to ensure that students receiving the visual information understand greater insights and perspectives surrounding a given subject.
To make data more understandable for humans and pull insights from it, effective visualization is a must.
In this research, we used a scatter plot as it was suitable for our needs. A scatter plot is a visualization method that displays the values of two different variables as points. The data for each point is represented by its horizontal (x) and vertical (y) position on the visualization. We used this plotting technique to find the dependency among the attributes through the x and y axes. It can give us insight into how the features are related to one another and also show the groupings among the data as well as outliers.

Necessary measures
After implementing the various algorithms, we can hope to find out the features that are directly linked to students' performance.Clustering methods can help to find out groupings of the students' performance or features as well as to get knowledge about the dependencies among the attributes.Hence, this can help us to pinpoint the features that are influencing students' performance.From the findings, the authority can focus on those features and take necessary measures to improve them which will eventually provide a better outcome for the students.
Through the classification and regression methods, we can predict whether a student is likely to pass or fail. It is possible to provide timely warning and support to low-achieving students and advise high-performing students (Raheela et al., 2017). We may even predict, with notable accuracy, how many marks or points students will get according to their academic records, which is very important for the students. With this information, we can offer various measures to students who are at risk of failing or performing badly. The measures can be additional classes, individual tutoring, more effective teaching methods, counseling to gain focus, etc.

RESULTS:
In the results part, we concentrate on the outcome of the research and present the findings obtained after implementing the algorithms. We compared the actual results with our expected results by implementing the K-means clustering, K-nearest neighbor classification and multiple linear regression algorithms.
Here, we selected the class attended (S11_Class_Attended) and result (S11_Result) attributes of the 1st-year 1st subject from our main dataset to cluster them. The updated dataset is given in Fig. 3. The value of k determines how many clusters we will get. Therefore, we need to choose the optimum k value, which is difficult for a human to select. Hence, we used the Elbow method and chose k=3, as it lies at the elbow point of the graph shown in Fig. 4.
We chose the elbow point to determine the k value. After normalization, the clusters are better formed and easier to distinguish. From Fig. 5 we can observe that students with better class attendance have better results. We can see that the S11 class attendance and the S11 result have an almost linear relation. This information not only indicates the importance of class attendance for a better result but also suggests that this attribute deserves further analysis. We can also see the groupings of the data points, which show that the majority of the students have good or average class attendance and accordingly good or average results and performance. We applied a similar process to various subjects of the same and different years to validate the above information about the class attendance attribute. From Fig. 6 and Fig. 7, we can observe that they show patterns and characteristics similar to Fig. 5. This supports the information we gained from our clusters for the class attendance feature. Following the same procedure, we collected information and insight into the other attributes. Every cluster can give us useful information that is beneficial for our objective. The clusters can also provide us with information on the factors that influence a student's performance.
From Fig. 8 we can observe that students' class performance and results form scrambled clusters that are not distinguishable from one another. We can see that the S11 class performance and the S11 result have a nonlinear relation. This suggests that class performance is relatively less important for better results than the other features. Looking deeper, a likely reason is that very few students are responsive to their lecturers or teachers. Students are often shy or afraid, as some of them are introverts. We can also see the groupings of the data points, which show that the majority of the students have good or average class performance marks, with one outlier. From Fig. 10 we can observe that students with better class test marks have better results. We can see that the S11 class test and the S11 result have an almost linear relation. This information not only indicates the importance of class test marks for a better result but also suggests that this attribute deserves further analysis. We can also see the groupings of the data points, which show that the majority of the students have good or average class test marks and accordingly good or average results.
We applied a similar process to various subjects of the same and different years to validate the above information about the class test attribute. For example, the cluster of S21 class test marks and results is shown in Fig. 11. It shows a similar pattern and result. From Fig. 12 we observed that students with better exam marks have better results. We can see that the S11 exam and the S11 result have a linear relation. We know that exams carry the highest marks in students' overall results. This information not only indicates the importance of exam marks for better results but also suggests that this attribute deserves further analysis. We can also see the groupings of the data points, which show that the majority of the students have good or average exam marks and accordingly good or average results. We applied a similar process to various subjects of the same and different years to validate the above information about the exam attribute. For example, the cluster of S13 exam marks and results is shown in Fig. 13, which shows a similar pattern and result. All the clusters were used to identify and verify significant features of a subject and their effect on the result. We also found groupings of the students through those graphs.
There is another way to use the clusters to find more information and knowledge. We used clusters to compare students' performance over two years, which gave us insight into whether students' performance is improving or declining. We also compared various subject features or attributes to get information on their status.
From Fig. 14 we can identify the difference between the attendance of the first year and the second year. We can look into why a student's attendance percentage has changed relative to the previous year. Similarly, from Fig. 15 we can identify the difference between the class test marks of the first year and the second year. We can look into why a student's test marks have changed relative to the previous year. Then we can provide feedback according to the changes.
From Fig. 16 we can identify the difference between the exam marks of the first year and the second year. We can look into why a student's exam marks have changed relative to the previous year. We can take the necessary steps and provide feedback according to the findings to improve future exams.
From Fig. 17 we can identify the difference between the results of the first year and the second year. We can look into why a student's result has changed relative to the previous year. We can take the necessary steps and provide feedback according to the findings to improve future results and work on the issues that are causing the changes.
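The year-over-year comparisons above are made visually in the paper; a numeric version of the same idea can be sketched as below. The function names, the student IDs, and the decline threshold of 10 points are all hypothetical choices made for the illustration and are not taken from the study.

```python
def yearly_change(first_year, second_year):
    """Second-year value minus first-year value for each student present in both years.

    Both arguments map a student ID to one attribute value
    (for example attendance percentage or exam mark).
    """
    return {
        sid: second_year[sid] - first_year[sid]
        for sid in first_year
        if sid in second_year
    }


def flag_declines(changes, threshold=-10.0):
    """IDs of students whose value dropped by at least |threshold| points (assumed cutoff)."""
    return sorted(sid for sid, delta in changes.items() if delta <= threshold)


# Hypothetical attendance percentages for three students.
first = {"S001": 80.0, "S002": 65.0, "S003": 70.0}
second = {"S001": 60.0, "S002": 70.0, "S003": 68.0}
declining = flag_declines(yearly_change(first, second))
```

The same two functions apply unchanged to class test marks, exam marks, or final results, so one list of flagged students can be produced per attribute and used to decide who should receive feedback first.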

Classification
Classification techniques are used to build an educational model based on knowledge discovery in databases to predict learner behaviors. We implemented the K Nearest Neighbor (KNN) classification method to predict students' results. As KNN predicts categorical labels, we used various academic features to predict whether a student is likely to pass or fail according to his or her available data. In this part of the research, we predicted whether a student will pass (1) or fail (0) according to academic attributes such as attendance, class performance and class test. These three attributes can be recorded before the final exam. This means there is sufficient time to analyze the data before the exam and take the necessary measures. If the prediction given in Fig. 19 shows that a student is likely to fail according to his or her current statistics, various measures such as extra classes, counseling, and different teaching methods can be applied, which will hopefully benefit the students and the research objective. We can also find out the problems and issues that the students might be facing and provide a solution to them. It will also give the authorities insight into how to cope with these problems or prevent them in the future. Fig. 19 shows the actual exam status and the predicted exam status. Our KNN classification achieved an accuracy of 99%.
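A K Nearest Neighbor pass/fail prediction of the kind described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the paper does not state the value of k or the distance measure, so `k=3` and squared Euclidean distance are assumptions, as are the function names and the sample data.

```python
from collections import Counter


def knn_predict(train_X, train_y, query, k=3):
    """Predict the label of `query` by majority vote among its k nearest
    training points (squared Euclidean distance, an assumed choice)."""
    distances = sorted(
        (sum((a - b) ** 2 for a, b in zip(x, query)), label)
        for x, label in zip(train_X, train_y)
    )
    votes = Counter(label for _, label in distances[:k])
    return votes.most_common(1)[0][0]


def accuracy(train_X, train_y, test_X, test_y, k=3):
    """Fraction of test students whose pass (1) / fail (0) status is predicted correctly."""
    hits = sum(
        knn_predict(train_X, train_y, x, k) == y for x, y in zip(test_X, test_y)
    )
    return hits / len(test_y)


# Hypothetical normalized (attendance, class performance, class test) rows.
train_X = [(0.9, 0.8, 0.85), (0.8, 0.9, 0.8), (0.85, 0.7, 0.9),
           (0.3, 0.4, 0.2), (0.2, 0.3, 0.35), (0.4, 0.2, 0.3)]
train_y = [1, 1, 1, 0, 0, 0]  # 1 = pass, 0 = fail
status = knn_predict(train_X, train_y, (0.82, 0.8, 0.8))
```

Because KNN relies on distances, the features should be on a common scale before they are compared, which is one reason the min-max normalization step matters here. An odd k avoids tied votes in a two-class problem.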

Regression
As our KNN classifier predicts class labels rather than continuous values, we used multiple linear regression to predict continuous values such as exam marks. This method uses multiple independent variables to predict a dependent variable. We predicted students' exam marks according to their various academic features. By predicting their exam marks before the exam takes place, we can take the necessary measures according to the findings. These measures might be very beneficial for their outcome, and the findings can also help us to gather significant patterns and knowledge. Fig. 20 shows the attributes that we worked on.
Here, academic attributes such as attendance, class performance, and class test average, which are available before the exam, serve as independent variables. We used them to predict the exam marks, which are the dependent variable. From Fig. 20 we observed that exam marks are continuous, so multiple linear regression was a suitable choice for our objective. The attribute with the highest coefficient value has the most influence on the prediction, provided the attributes are on a common scale. We can see that most of the differences between the actual and predicted values are very small, although there are a few exceptions where the difference is relatively large. According to our findings, we can take various measures for those students who are likely to get low or unexpected marks in the exam. This can give us insight into their shortcomings and clear the path to overcoming them. We can use our findings to develop more effective teaching methods and to avoid mistakes in the future for a better outcome. It will hopefully help students to perform better and stay on track, which will bring out their full potential.
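Multiple linear regression as described above can be sketched without external libraries by solving the least-squares normal equations. This is an illustrative sketch only: the paper does not say how the model was fitted, so the solver, the function names, and the toy data are assumptions, and the input columns are assumed not to be linearly dependent.

```python
def fit_linear_regression(X, y):
    """Least-squares fit of y = b0 + b1*x1 + ... + bn*xn.

    Solves the normal equations (X^T X) b = X^T y with Gaussian elimination.
    Returns [b0, b1, ..., bn], where b0 is the intercept.
    """
    rows = [[1.0] + list(x) for x in X]  # prepend the intercept column
    n = len(rows[0])
    A = [[sum(r[i] * r[j] for r in rows) for j in range(n)] for i in range(n)]
    c = [sum(r[i] * t for r, t in zip(rows, y)) for i in range(n)]
    M = [A[i] + [c[i]] for i in range(n)]  # augmented matrix
    for col in range(n):
        # Partial pivoting for numerical stability.
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for j in range(col, n + 1):
                M[r][j] -= factor * M[col][j]
    coef = [0.0] * n
    for i in reversed(range(n)):  # back substitution
        s = sum(M[i][j] * coef[j] for j in range(i + 1, n))
        coef[i] = (M[i][n] - s) / M[i][i]
    return coef


def predict(coef, x):
    """Predicted exam mark for one student's feature tuple."""
    return coef[0] + sum(b * v for b, v in zip(coef[1:], x))


def mean_absolute_error(actual, predicted):
    """Average absolute gap between actual and predicted marks."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)
```

With rows of (attendance, class performance, class test average) as `X` and exam marks as `y`, `fit_linear_regression` returns one coefficient per attribute plus an intercept. The `mean_absolute_error` helper gives a single number for the gap between actual and predicted marks that the paper otherwise inspects visually.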

CONCLUSION:
The education sector is undoubtedly one of the most significant sectors for any nation.The development of this sector is essential for long-term economic and national prosperity.Like all other organizations, educational organizations also store large amounts of data that contain various features.Through analyzing these data, we can find interesting patterns and knowledge which might be beneficial to the development of the education sector.Our research is based on measurement and prediction methods that may influence the students' performance and sentiment.
We can evaluate students in various ways and look into their data from different perspectives. Through these methods, we can gather various knowledge and patterns and provide feedback or necessary measures according to the findings. Taking the necessary steps or measures helps students to stay on track and achieve their best possible outcome. It can help us to identify the issues and problems that affect students' performance and get an idea of how to overcome them in the future. There are many potential applications for our research. The future scope of our research is as follows:
1) Development of effective teaching methods
2) Maximization of educational institutes' efficiency and resource allocation
3) Improvement in the decision-making process for higher education and career
4) Potentiality to identify effects of various other attributes on students' performance
5) Potentiality to track performance throughout a student's career and establish effective student profiling
6) Identification of undesirable behaviors and early dropouts of students to provide appropriate advising or counseling

ACKNOWLEDGEMENT:
We express our warm thanks to the Department of Computer Science and Engineering at the Bangladesh Army University of Engineering and Technology (BAUET) for providing the opportunities and support that helped this research reach a successful conclusion.

Fig. 2: First year data of subject one.

Fig. 5: Cluster of S11 class attendance and result.
Fig. 5 above shows the clusters without the normalization process, which is why they look scrambled. After normalization, the clusters are better formed and easier to distinguish. From Fig. 5 we can observe that students with better class attendance have better results. The S11 class attendance and the S11 result have an almost linear relationship. This information not only supports the importance of class attendance for a better result but also shows that this attribute has potential for further analysis. The groupings of the data points also show that the majority of students have good or average class attendance and, correspondingly, good or average results and performance. We applied a similar process to various subjects from the same and different years to validate the above observations about the class attendance attribute.
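The effect of normalization on clustering can be sketched as follows. This is an illustration under stated assumptions, not the study's code: it assumes min-max scaling and k-means with k = 3, neither of which is confirmed by the text here, and the attendance counts and results are made-up values. The helper names `min_max_normalize` and `kmeans` are hypothetical.

```python
def min_max_normalize(column):
    """Rescale a list of numbers to the range [0, 1]."""
    lo, hi = min(column), max(column)
    if hi == lo:
        return [0.0 for _ in column]  # constant column carries no information
    return [(v - lo) / (hi - lo) for v in column]


def kmeans(points, k, iterations=100):
    """Plain k-means. Uses the first k points as starting centroids,
    a deterministic simplification chosen for this sketch."""
    centroids = [tuple(p) for p in points[:k]]
    labels = None
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        new_labels = [
            min(range(k),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centroids[c])))
            for p in points
        ]
        if new_labels == labels:
            break  # assignments stopped changing, so the result is stable
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = tuple(sum(d) / len(members) for d in zip(*members))
    return labels, centroids


# Hypothetical data: classes attended (count) and result (grade point).
attended = [38, 35, 12, 30, 15, 36, 28, 10]
result = [3.8, 3.5, 2.0, 3.0, 2.2, 3.6, 2.9, 1.8]

# Without scaling, attendance (tens) would dominate the distance over
# result (units); scaling puts both attributes on the same footing.
scaled = list(zip(min_max_normalize(attended), min_max_normalize(result)))
labels, centroids = kmeans(scaled, k=3)
print(labels)
```

Because the two attributes are on very different numeric ranges, scaling both to [0, 1] before computing distances is what lets each attribute contribute comparably to the grouping.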

Fig. 8: Cluster of S11 class performance and result.
We then applied a similar process to various subjects from the same and different years to validate the above observations about the class performance attribute. For example, the cluster of S13 class performance and result is given in Fig. 9. It shows a similar pattern and result.

Fig. 14: Comparison between first year (above) and second year attendance.

Fig. 15: Comparison between first year (above) and second year class test.

Fig. 16: Comparison between first year (above) and second year exam.

Fig. 17: Comparison between first year (above) and second year result.

Fig. 21 represents the actual values (x-axis) and the predicted values (y-axis) of the exam. This figure gives an idea of the regression model's accuracy. We observed that the relationship between the actual values and the predicted values is almost linear, which means the predicted values are close to the actual values. There may be some outliers, which is acceptable in this case because the values are continuous. Therefore, we can predict the exam values with reasonable accuracy and take proper steps according to the findings. During clustering, we saw that the exam had the most linear relationship with the overall result; the result depends on it more strongly than on any other attribute. In this manner, we can get an idea of the overall result through exam prediction. Through the same process, we can include various other factors to identify their significance and use them to predict the result or see their influence on it.
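The visual comparison of actual and predicted values can be complemented with summary numbers. The sketch below computes the coefficient of determination (R²) and the mean absolute error; these metrics are our suggestion rather than something reported in the study, and the marks are made-up values.

```python
def r_squared(actual, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean) ** 2 for a in actual)
    if ss_tot == 0:
        raise ValueError("actual values are constant; R^2 is undefined")
    return 1.0 - ss_res / ss_tot


def mean_absolute_error(actual, predicted):
    """Average size of the prediction error, in the same unit as the marks."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


# Hypothetical actual and predicted exam marks.
actual_marks = [52, 41, 60, 27, 47, 44]
predicted_marks = [50.5, 42.0, 58.0, 29.5, 46.0, 45.5]

print(f"R^2: {r_squared(actual_marks, predicted_marks):.3f}")
print(f"MAE: {mean_absolute_error(actual_marks, predicted_marks):.2f} marks")
```

An R² close to 1 corresponds to the points in an actual-versus-predicted plot lying near the diagonal, while the mean absolute error expresses the typical miss in marks.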

Fig. 22 represents the actual exam marks, the predicted exam marks, and their differences. As shown in Fig. 22, the actual and predicted values are very close to each other despite being continuous values.
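A table of actual marks, predicted marks, and their differences can be built with a few lines of code. The sketch below also marks students whose predicted mark falls below a pass mark or whose prediction misses by more than a tolerance. The function name `compare_marks`, the `pass_mark` and `tolerance` values, and the sample marks are all hypothetical choices for illustration.

```python
def compare_marks(actual, predicted, pass_mark=40.0, tolerance=5.0):
    """Return one row per student with the difference and two flags.

    at_risk:    the predicted mark is below the (assumed) pass mark.
    large_miss: the prediction is off by more than the (assumed) tolerance.
    """
    rows = []
    for student, (a, p) in enumerate(zip(actual, predicted), start=1):
        difference = a - p
        rows.append({
            "student": student,
            "actual": a,
            "predicted": p,
            "difference": difference,
            "at_risk": p < pass_mark,
            "large_miss": abs(difference) > tolerance,
        })
    return rows


# Hypothetical marks for four students.
table = compare_marks([52, 38, 60, 30], [50, 41, 48, 35])
for row in table:
    print(row)
```

The `at_risk` flag depends only on the prediction, so it can be computed before the exam; the `large_miss` flag needs the actual mark and is therefore useful for reviewing the model afterwards.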