What is data cleaning?

Data cleaning, also known as data cleansing or data scrubbing, is the process of detecting and correcting (or removing) corrupt or inaccurate records in a record set, table, or database. The goal is to identify the incomplete, incorrect, inaccurate, or irrelevant parts of the data and then replace, modify, or delete them. Data cleaning is an important early step in the data analytics process: the data is prepared and validated before the core analysis begins.

Typical tasks include removing typographical errors, validating and correcting values against a known list of entities, and cross-checking against a validated data set. More broadly, the process involves removing unwanted observations (such as duplicates or irrelevant records), handling outliers, fixing structural errors, standardizing formats, dealing with missing information, and validating the results. The exact steps vary from dataset to dataset, and tools such as MS Excel and programming languages like Python can help.

The benefits of data cleaning include improved data quality; more accurate, consistent, and reliable information for decision-making in an organization; and greater cost-effectiveness.
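As a rough illustration, the steps described above can be sketched in plain Python. Everything in this example is an assumption made for demonstration: the sample records, the field names (`id`, `country`, `age`), the reference list `KNOWN_COUNTRIES`, and the plausibility bounds in `AGE_RANGE` are all hypothetical. Treating an implausible value as missing and filling it with the median is only one of several reasonable choices, not a prescribed method.

```python
from statistics import median

# Hypothetical raw records; every value here is made up for illustration.
RAW = [
    {"id": 1, "country": " germany ", "age": "34"},
    {"id": 2, "country": "Germany", "age": "29"},
    {"id": 2, "country": "Germany", "age": "29"},   # exact duplicate
    {"id": 3, "country": "FRANCE", "age": ""},       # missing age
    {"id": 4, "country": "Frnace", "age": "41"},     # typo, not in the known list
    {"id": 5, "country": "France", "age": "410"},    # implausible age
]

KNOWN_COUNTRIES = {"Germany", "France"}  # assumed reference list of valid entities
AGE_RANGE = (0, 120)                     # assumed plausibility bounds for age


def clean(records):
    """Return (cleaned_rows, rejected_rows) for a list of dict records."""
    seen, cleaned, rejected = set(), [], []
    for rec in records:
        # Fix structural errors: trim whitespace and standardize capitalization.
        country = rec["country"].strip().title()
        age_text = rec["age"].strip()

        # Remove unwanted observations: skip duplicates found after standardizing.
        key = (rec["id"], country, age_text)
        if key in seen:
            continue
        seen.add(key)

        # Validate against the known list; set unknown values aside for review.
        if country not in KNOWN_COUNTRIES:
            rejected.append(rec)
            continue

        # Parse the age; treat empty or out-of-range values as missing.
        age = int(age_text) if age_text.isdigit() else None
        if age is not None and not (AGE_RANGE[0] <= age <= AGE_RANGE[1]):
            age = None

        cleaned.append({"id": rec["id"], "country": country, "age": age})

    # Deal with missing information: fill gaps with the median of the valid ages.
    valid_ages = [r["age"] for r in cleaned if r["age"] is not None]
    if valid_ages:
        fill = median(valid_ages)
        for r in cleaned:
            if r["age"] is None:
                r["age"] = fill
    return cleaned, rejected


cleaned_rows, rejected_rows = clean(RAW)
```

Rows that fail validation are collected in `rejected_rows` rather than silently dropped, so they can be corrected by hand or cross-checked against a validated data set later.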