Classification
Classification is the process of grouping similar data items together and separating dissimilar items into different groups. For example: dividing students into different grade classes (A+, A, B+, …) based on their obtained marks. In the context of data mining, it is the process of building a model capable of distinguishing data into different classes based on their similarities and dissimilarities.
It is used in many areas, such as medicine (grouping patients based on their symptoms), sales (categorizing customers), finance (categorizing loan applicants), etc.
First, we gather an unclassified dataset (structured or unstructured) and pass it to a classifier (the model) to obtain a classified dataset. Various classification algorithms can be used for this purpose, and the appropriate one can be chosen based on your requirements.
Learning and Testing of Classification
The steps involved in learning and testing of classification are given below:
- Data preparation
Gather the necessary data and pre-process it if needed. Split the data into training and testing sets.
- Model selection
Choose an appropriate algorithm based on your requirements (e.g. logistic regression, decision tree, random forest). Tune the hyperparameters properly for effective and optimal performance of the model.
- Model training
Fit the classification model to the training data so that it learns the underlying patterns and insights.
- Model evaluation
Evaluate the model on the test set using metrics such as accuracy, efficiency, speed and more.
- Model optimization
If the results of the model are not satisfactory, optimize it through tasks like feature engineering, choosing a different algorithm, or tuning the hyperparameters further.
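The steps above can be sketched in code. This is a minimal, illustrative workflow using a toy nearest-centroid classifier and made-up data; a real project would typically use a library such as scikit-learn, and the split here does no shuffling for brevity.

```python
# Sketch of the learn/test workflow: prepare -> train -> evaluate.
# All data and the nearest-centroid "model" are invented for illustration.

def split(data, labels, test_ratio=0.25):
    """Data preparation: split into training and testing sets (no shuffling)."""
    n_test = int(len(data) * test_ratio)
    return (data[n_test:], labels[n_test:]), (data[:n_test], labels[:n_test])

def train(points, labels):
    """Model training: learn one centroid per class from the training data."""
    centroids = {}
    for label in set(labels):
        members = [p for p, l in zip(points, labels) if l == label]
        centroids[label] = tuple(sum(c) / len(members) for c in zip(*members))
    return centroids

def predict(centroids, point):
    """Assign the class whose centroid is nearest to the point."""
    return min(centroids,
               key=lambda lab: sum((a - b) ** 2
                                   for a, b in zip(centroids[lab], point)))

def accuracy(centroids, points, labels):
    """Model evaluation: fraction of test instances predicted correctly."""
    correct = sum(predict(centroids, p) == y for p, y in zip(points, labels))
    return correct / len(points)

# Toy 2-D dataset: two well-separated clusters.
X = [(0.1, 0.2), (8.0, 8.1), (0.0, 0.4), (8.2, 7.9), (0.3, 0.1), (7.9, 8.3)]
y = ["neg", "pos", "neg", "pos", "neg", "pos"]
(train_X, train_y), (test_X, test_y) = split(X, y, test_ratio=0.33)
model = train(train_X, train_y)
print(accuracy(model, test_X, test_y))  # → 1.0 on this separable toy data
```

If the evaluation score were poor, the optimization step would mean revisiting the features, the algorithm, or its hyperparameters.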
Decision Tree Induction
A decision tree is a tree in which each branch node represents a choice among a number of alternatives and each terminal node represents a decision or classification. In classification, a decision tree is a classifier that classifies an instance by starting from the root node and following branches until a terminal node is reached.
Terminologies
- Root node: The starting node of the decision tree, which gets split into further nodes.
- Decision node: A node that splits further into sub-nodes.
- Terminal node: A node that cannot be split into further sub-nodes.
- Splitting: The process of dividing a node into sub-nodes.
- Pruning: The process of removing the sub-nodes of a decision node; it is the opposite of splitting.
- Child node: A sub-node divided from a single node (the parent node).
Example
From the above figure, we can see how the training data is turned into a decision tree model and how decisions are made from it. A decision tree also helps visualize the data, which leads to a better understanding of it.
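A trained tree can be read as a chain of nested conditions. The sketch below hand-codes a tiny tree of the kind the figure depicts; the features ("age", "income"), thresholds, and class labels are invented for illustration, since a real tree would be induced from training data.

```python
# A hand-built decision tree: branch nodes are if-statements,
# terminal nodes are return values. All names here are made up.

def classify(instance):
    """Walk from the root node to a terminal node, returning a class."""
    if instance["age"] < 30:                 # root node: split on age
        if instance["income"] == "high":     # decision node: split on income
            return "buys"                    # terminal node
        return "does_not_buy"                # terminal node
    return "buys"                            # terminal node

print(classify({"age": 25, "income": "high"}))  # → buys
print(classify({"age": 25, "income": "low"}))   # → does_not_buy
```

Decision tree induction algorithms (such as ID3 or CART) choose these splits automatically by measuring how well each candidate split separates the classes.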
Merits of decision tree
- Interpretability: Transparent and easy to interpret model
- Robustness: Handles both numerical and categorical features and is relatively insensitive to outliers
- Versatility: Can perform both classification and regression tasks
Demerits of decision tree
- Overfitting: A decision tree may overfit the data, especially when the tree grows too deep
- Bias: Splits tend to favor features with a larger number of unique values
- Instability: Small changes in the training data can produce a very different tree
- Cost: Training can be resource-intensive on large datasets due to the time and complexity involved
Bayesian Classification
A Bayesian network is a powerful supervised machine learning model that makes predictions using the principles of Bayes' Theorem. It is a probabilistic graphical model that represents knowledge about an uncertain domain, where each node corresponds to a random variable and each edge represents a conditional dependency between variables.
Bayesian Theorem
It is a fundamental result in probability that relates the conditional probabilities of two events A and B: P(A|B) = P(B|A) · P(A) / P(B), where P(A|B) is the posterior, P(B|A) the likelihood, P(A) the prior, and P(B) the evidence.
Bayesian classification predicts the label of an instance by calculating the posterior probability of each class given the instance's features. The classifier is trained on a labelled dataset consisting of feature vectors and their corresponding labels.
From this data, the classifier estimates the prior probability of each class and the conditional probabilities of the features given each class. For a new instance, it computes the posterior probability of each class using Bayes' Theorem, and the class with the highest posterior probability becomes the predicted label.
This classification often assumes that the features are independent of each other given the class label (the "naive" assumption), which makes the calculation simpler and more efficient.
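The counting involved can be sketched in a few lines. This is a minimal naive Bayes on a made-up categorical dataset: priors and likelihoods are estimated by counting, the naive assumption lets us multiply per-feature likelihoods, and a simple smoothing term (an implementation choice here, not the only option) keeps unseen feature values from zeroing out the product.

```python
# Minimal naive Bayes sketch on invented "play tennis"-style data.
from collections import Counter, defaultdict

def train_nb(rows, labels):
    """Count class priors and per-feature value frequencies per class."""
    priors = Counter(labels)
    likelihoods = defaultdict(Counter)  # (feature index, label) -> value counts
    for row, label in zip(rows, labels):
        for i, value in enumerate(row):
            likelihoods[(i, label)][value] += 1
    return priors, likelihoods, len(labels)

def predict_nb(model, row):
    """Pick the class with the highest (naive) posterior score."""
    priors, likelihoods, n = model
    best_label, best_score = None, -1.0
    for label, prior_count in priors.items():
        score = prior_count / n  # P(class)
        for i, value in enumerate(row):
            counts = likelihoods[(i, label)]
            # Add-one smoothing so unseen values don't zero out the product.
            score *= (counts[value] + 1) / (prior_count + len(counts) + 1)
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Toy dataset: (outlook, windy) -> play?
X = [("sunny", "no"), ("sunny", "yes"), ("rainy", "yes"),
     ("rainy", "no"), ("sunny", "no")]
y = ["yes", "no", "no", "yes", "yes"]
model = train_nb(X, y)
print(predict_nb(model, ("sunny", "no")))  # → yes
```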
Rule Based Classification
Rule based classification is a straightforward approach to building a classification model in which predictions are made from a set of pre-defined rules.
In rule based classification, the rules are either defined manually by domain experts or extracted from the training data. They usually take the form of "IF-THEN" statements, where the antecedent (IF part) states conditions on the feature values and the consequent (THEN part) gives the predicted class.
Although this kind of classification is interpretable and flexible, it struggles to capture complex relationships between features.
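A rule-based classifier can be sketched as an ordered list of (antecedent, consequent) pairs plus a default class. The loan-application rules and feature names below are invented for illustration; a real rule set would come from domain experts or a rule-induction algorithm.

```python
# Sketch of rule-based classification: each rule is an IF-THEN pair,
# evaluated in order; the first matching antecedent decides the class.
# Rules and features are made up for illustration.

RULES = [
    (lambda x: x["income"] == "low" and x["debt"] == "high", "reject"),
    (lambda x: x["income"] == "high", "approve"),
    (lambda x: x["credit_history"] == "good", "approve"),
]

def classify(applicant, default="reject"):
    """Return the consequent of the first rule whose antecedent matches."""
    for antecedent, consequent in RULES:
        if antecedent(applicant):
            return consequent
    return default  # fallback class when no rule fires

print(classify({"income": "high", "debt": "high", "credit_history": "bad"}))   # → approve
print(classify({"income": "low", "debt": "high", "credit_history": "good"}))   # → reject
```

Note that rule order matters: the first rule shadows the later ones whenever its antecedent is satisfied.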
Linear Regression
Linear regression based classification uses a linear regression model to make binary or multi-class predictions. As shown in the above figure of binary classification, values above the threshold (the line) are classified as positive and values below it are classified as negative.
This type of classification is simple, efficient and easy to implement, but it is sensitive to feature scaling and can only model simple decision rules. It is therefore suitable only for well-behaved classification problems whose decision boundary can be approximated by a linear function.
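The threshold idea can be sketched directly. The weights and bias below are hard-coded, made-up values standing in for coefficients that a real linear regression fit would produce:

```python
# Sketch of linear classification: compute a linear score w·x + b and
# threshold it. The weights, bias, and threshold are assumed values.

WEIGHTS = [0.8, -0.5]   # one coefficient per feature (invented)
BIAS = 0.1
THRESHOLD = 0.0

def predict(features):
    """Linear score thresholded into a binary label."""
    score = sum(w * x for w, x in zip(WEIGHTS, features)) + BIAS
    return "positive" if score > THRESHOLD else "negative"

print(predict([2.0, 1.0]))   # 0.8*2.0 - 0.5*1.0 + 0.1 = 1.2 → positive
print(predict([0.0, 2.0]))   # 0.0 - 1.0 + 0.1 = -0.9 → negative
```

Geometrically, the set of points where the score equals the threshold is the line (or hyperplane) drawn in the figure; classification only works well when such a line actually separates the classes.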