A Guide to Data Pre-processing
If you have no idea about data mining, please refer to this blog first.
OK, I am assuming that you know a little about databases, data mining and data warehouses.
What is Data Pre-processing?
As mentioned in the earlier blog, data pre-processing is a crucial step in any data mining task. It is the process of preparing raw data so that only relevant, usable data is fed into the mining step.
The data we store in databases, flat files (e.g. text files, images, videos) and other sources is messy and, on its own, gives no meaningful information that can be utilized. It is often incomplete, inconsistent and hard to interpret, so a model cannot make sense of it directly.
What does data pre-processing do?
- Eliminates noisy, missing and redundant data.
- Improves the model’s accuracy, speed and efficiency.
- Reduces computational time.
- Enables better understanding and analysis of the data.
Steps in Data Pre-processing:
The four major steps involved in data pre-processing are:
1. Data Cleaning
Just as we remove dirt and dust from our house, datasets must be cleared of unnecessary and irrelevant data. Handling such data early avoids later problems such as an inaccurate or slow model.
Mostly, noisy, inaccurate and missing data are handled in this step. To gather accurate data in the first place, we should extract it only from trusted and reliable sources. So how do we handle noisy data?
Let’s see what noisy data actually means:
If we represent 2D data as points on a graph, we might get something like this: the data is denser in some regions and sparser in others.
If we group the dense data together, we get clusters of somewhat similar points. The points excluded from such groups are called noisy data: they differ markedly from the rest and are likely to cause incorrect output, so removing them improves the model’s accuracy. This way of identifying and removing noisy data is called clustering. Other methods for this task include binning and regression.
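Here is a minimal sketch of clustering-based noise removal using DBSCAN from scikit-learn. The toy data and the eps/min_samples values are purely illustrative and would need tuning on a real dataset.

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Toy 2D data: two dense groups plus a few scattered (noisy) points.
rng = np.random.default_rng(42)
dense_a = rng.normal(loc=[0, 0], scale=0.3, size=(50, 2))
dense_b = rng.normal(loc=[5, 5], scale=0.3, size=(50, 2))
scattered = rng.uniform(low=-3, high=8, size=(5, 2))
X = np.vstack([dense_a, dense_b, scattered])

# DBSCAN assigns the label -1 to points that belong to no dense cluster.
labels = DBSCAN(eps=0.7, min_samples=5).fit_predict(X)
X_clean = X[labels != -1]

print(f"kept {len(X_clean)} of {len(X)} points, "
      f"dropped {np.sum(labels == -1)} noisy points")
```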
Now, how do we handle missing data?
In the figure above, we can see that some of the data is incomplete/missing. The two common ways of handling this situation are:
- Ignore the missing data: If you have a large dataset, you can simply drop a small number of records with missing values; their effect on the final output will be negligible.
- Fill the missing data: You can fill in the missing values from other data sources, derive them with a mathematical expression where possible, or estimate a probable value using the mean, median or mode, as sketched below.
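A small sketch of both options using pandas; the columns and values here are made up for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "age":    [25, np.nan, 31, 40, np.nan],
    "salary": [50000, 62000, np.nan, 58000, 61000],
})

# Option 1: ignore (drop) rows with missing values -- reasonable when
# only a small fraction of a large dataset is affected.
df_dropped = df.dropna()

# Option 2: fill the gaps with a probable value, e.g. the median or mean.
df_filled = df.fillna({"age": df["age"].median(),
                       "salary": df["salary"].mean()})

print(df_dropped)
print(df_filled)
```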
2. Data Integration
Data integration is the process of gathering data from multiple heterogeneous sources (the web, databases, flat files) and combining it into a single coherent data store with a unified view.
While performing data integration, a few problems may arise:
- Schema Integration: Different data sources may use different schemas to store their data, which makes it hard to integrate them under a single unified schema. The data and its sources must be understood thoroughly to perform proper integration.
- Data Quality: Data should be integrated only from reliable and trusted sources to avoid quality issues in the combined dataset.
- Data Access: You must have the proper access rights and permissions to extract data from each source.
- Outdated Data: Outdated data must be removed from the dataset unless it is explicitly required.
- Entity Identification Problem: The integration system should be able to recognise the same entity/object across multiple databases and relate the records to each other (see the sketch after this list).
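A hedged sketch of schema integration and entity identification with pandas. The two tables and their column names (cust_id vs. customer_id) are invented for illustration.

```python
import pandas as pd

# Source 1: an export from a relational database.
db_customers = pd.DataFrame({
    "cust_id": [1, 2, 3],
    "name":    ["Asha", "Bikash", "Chandra"],
})

# Source 2: a flat file that uses a different schema for the same entities.
csv_orders = pd.DataFrame({
    "customer_id": [1, 2, 2],
    "order_total": [120.0, 75.5, 43.2],
})

# Schema integration: rename columns into a unified schema, then join on
# the key that identifies the same entity in both sources.
csv_orders = csv_orders.rename(columns={"customer_id": "cust_id"})
integrated = db_customers.merge(csv_orders, on="cust_id", how="left")

print(integrated)
```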
3. Data Reduction
Data reduction is the process of reducing the volume of the original data in terms of dimensionality or size. Even though the quantity is reduced, the quality of the data should not be compromised. Let’s see how it’s done:
- Dimension Reduction: When we come across weak (unimportant), outdated or redundant attributes, we reduce the dimensionality by removing them (a small sketch follows this list).
- Data Compression: A technique for shrinking data to a smaller size using various encoding techniques and algorithms. The two types of data compression are:
- Lossy compression: Reduces data quality in order to compress it further, yet the data remains understandable and usable enough to retrieve information from it. JPEG images and MP3 audio files are examples of lossy compression.
- Lossless compression: Preserves the data exactly during compression, so the compressed file can be restored to the original. It usually achieves a smaller size reduction than lossy compression. ZIP files, RAR archives and PNG images are examples of lossless compression.
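A minimal sketch of two of the ideas above on synthetic data: dimension reduction using PCA (a standard technique, chosen here as one possible example) and lossless compression using Python’s zlib module.

```python
import zlib
import numpy as np
from sklearn.decomposition import PCA

# Dimension reduction: project 10-dimensional data down to 3 principal components.
X = np.random.default_rng(0).normal(size=(200, 10))
X_reduced = PCA(n_components=3).fit_transform(X)
print(X.shape, "->", X_reduced.shape)   # (200, 10) -> (200, 3)

# Lossless compression: the original bytes can be recovered exactly.
raw = b"data pre-processing " * 1000
compressed = zlib.compress(raw)
assert zlib.decompress(compressed) == raw
print(len(raw), "bytes ->", len(compressed), "bytes")
```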
4. Data Transformation
Data transformation is the process of converting data from one format or representation to another. It is a crucial step in data pre-processing. Some methods of data transformation are:
- Normalization: The process of scaling the features (properties) of the dataset to a common scale, for example the range 0 to 1. It is used when the features have vastly different ranges, and it helps improve the efficiency and speed of machine learning algorithms. Common examples are min-max scaling and z-score normalization.
- Discretization: The process of converting continuous features into discrete or categorical ones. It is useful when the data is fed to machine learning algorithms that work better with discrete inputs, or when the relationship between a feature and the target variable is non-linear.
- Data Aggregation: The process of combining multiple data points into a single summary value, which is useful when working with large datasets. Common aggregation functions include sum, average, count, minimum and maximum. Aggregation can be performed at different levels, such as by group, by time period or by location, depending on the structure and characteristics of the dataset. A short sketch of these three transformations follows.
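A short sketch of normalization, discretization and aggregation using pandas and scikit-learn; the "sales" table, its columns and the bin edges are hypothetical.

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

sales = pd.DataFrame({
    "region": ["east", "east", "west", "west"],
    "amount": [120.0, 300.0, 80.0, 450.0],
    "age":    [22, 37, 51, 64],
})

# Normalization: min-max scale "amount" into the [0, 1] range.
sales["amount_scaled"] = MinMaxScaler().fit_transform(sales[["amount"]]).ravel()

# Discretization: bin the continuous "age" feature into categories.
sales["age_group"] = pd.cut(sales["age"], bins=[0, 30, 50, 100],
                            labels=["young", "middle", "senior"])

# Aggregation: summarise multiple rows per group into single values.
summary = sales.groupby("region")["amount"].agg(["sum", "mean", "count"])

print(sales)
print(summary)
```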
This concludes our basic guide to the data pre-processing task.