Data cleansing, also known as data cleaning or data scrubbing, is the process of detecting and correcting (or removing) corrupt or inaccurate records in a record set, table, or database. The goal is to identify incomplete, incorrect, inaccurate, or irrelevant parts of the data and then replace, modify, or delete that dirty or coarse data. Data cleansing is an essential step in preparing raw data for machine learning (ML) and business intelligence (BI) applications.
In practice, data cleansing may involve removing typographical errors or validating and correcting values against a known list of entities. The validation may be strict, such as rejecting any address that does not have a valid postal code, or it may rely on fuzzy (approximate) string matching, such as correcting records that partially match existing, known records. Some data cleansing solutions clean data by cross-checking it against a validated data set. A common data cleansing practice is data enhancement, where data is made more complete by adding related information. Data cleansing may also involve harmonization (or normalization) of data, which is the process of bringing data with varying file formats, naming conventions, and similar differences into a consistent form.
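The difference between strict validation and fuzzy matching can be illustrated with a minimal Python sketch. It is not a reference implementation: the record layout, the five-digit postal code format, the `KNOWN_CITIES` list, the `clean_record` helper, and the 0.8 similarity cutoff are all assumptions chosen for illustration. The fuzzy step uses `difflib.get_close_matches` from the Python standard library.

```python
import re
from difflib import get_close_matches

# Hypothetical reference list of known, validated city names.
KNOWN_CITIES = ["Berlin", "Hamburg", "Munich", "Cologne"]

# Assumed postal code format for this example: exactly five ASCII digits.
POSTAL_CODE = re.compile(r"[0-9]{5}")


def clean_record(record, cutoff=0.8):
    """Return a cleaned copy of the record, or None if strict validation fails."""
    # Strict validation: reject the record if the postal code is malformed.
    postal = record.get("postal_code", "").strip()
    if not POSTAL_CODE.fullmatch(postal):
        return None

    # Harmonization: trim whitespace and use one capitalization convention.
    city = record.get("city", "").strip().title()

    # Fuzzy matching: replace the city with the closest known name,
    # but only if the similarity reaches the cutoff.
    matches = get_close_matches(city, KNOWN_CITIES, n=1, cutoff=cutoff)
    if matches:
        city = matches[0]

    return {"city": city, "postal_code": postal}


records = [
    {"city": "berlin ", "postal_code": "10115"},   # needs harmonization
    {"city": "Hamburgg", "postal_code": "20095"},  # typo, fixed by fuzzy match
    {"city": "Munich", "postal_code": "8033"},     # invalid postal code, rejected
]

cleaned = [c for c in (clean_record(r) for r in records) if c is not None]
```

In this sketch the first two records are corrected and kept, while the third is dropped because its postal code fails the strict check. A city that does not resemble any known name closely enough is left unchanged rather than guessed at; whether to keep, flag, or reject such values depends on the application.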
Data cleansing is important because it improves data quality, which in turn reduces errors and customer and employee frustration, increases productivity, and improves data analysis and decision-making. Clean and accurate data is particularly crucial for training ML models, because poor training datasets can lead to erroneous predictions in deployed models.