Why implement a Data Cleaning process?

10 February 2021

Data cleaning · Governance · Security


The democratization of office tools and the daily need to collaborate and analyze data have driven an explosion of documents and file-based storage: unstructured data now makes up a significant share of an organization's data assets. It must therefore be included in the data mapping process. But before it can be analyzed, it needs to be cleaned.

What is Data Cleaning?

Data Cleaning is the process of cleaning data before analyzing it. Its goal is to identify data that is outdated, incomplete, corrupted, or duplicated within an information system. This data is then removed from the data catalog so that it does not degrade the accuracy of the stored data.
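As an illustration of the duplicate-detection part of this definition, the sketch below groups files by content hash so that byte-for-byte copies can be reviewed before removal. The function name and the choice of SHA-256 are assumptions for this example, not something prescribed by the article.

```python
import hashlib
from pathlib import Path


def find_duplicates(root: str) -> dict[str, list[Path]]:
    """Group files under `root` by content hash.

    Any hash group with more than one entry contains byte-for-byte
    duplicates, which are candidates for a cleaning review.
    """
    groups: dict[str, list[Path]] = {}
    for path in Path(root).rglob("*"):
        if path.is_file():
            digest = hashlib.sha256(path.read_bytes()).hexdigest()
            groups.setdefault(digest, []).append(path)
    # Keep only groups that actually contain duplicates.
    return {h: paths for h, paths in groups.items() if len(paths) > 1}
```

In practice such a scan would feed a review step rather than delete files directly, since two identical files may both be legitimately needed.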

An exploding volume of data in the cloud and elsewhere...

According to IDC studies, global data volume is expected to reach 175 zettabytes (one zettabyte equals one billion terabytes!) by 2025... while, in parallel, less than 0.5% of this data is estimated to be analyzed.

Data is stored on dedicated office servers, in the cloud, and on personal equipment (computers, external hard drives, and so on).

Therefore, it is necessary to regularly clean one's information system in order to:

  • Facilitate compliance with regulations such as the GDPR (by reducing the number of sensitive sources)
  • Minimize exposure to cyber risks
  • Limit environmental impact (digital storage requires servers, data centers, network equipment, and so on, whose ecological footprint is currently rather high)

Tips for implementing a data file cleaning process (data cleansing / data cleaning):

  • Every organization holds sensitive data, whether related to its own activity (intellectual property, know-how, etc.) or to its customers, citizens, or users (personal data, contracts, etc.). Raising teams' awareness of the risks associated with this data is the first step toward a good cleaning strategy, sometimes referred to as "IT hygiene".
  • Each organization must be able to easily identify at-risk data. The first lever, detecting obsolete files, is generally the most radical and effective: how many files more than five years old are really necessary for an organization to function?
  • Once obsolete files have been deleted, actions can be prioritized by data sensitivity level. Classifying files according to their risk level makes it possible to prioritize, minimize the analysis workload, and focus cleaning actions where they matter most.
  • Put in place a security policy to limit access to sensitive files (security at the storage level, privilege management).
  • Automating the process with dedicated tools is the only guarantee that it is actually applied: given the volumes involved, classification and file analysis cannot be carried out exhaustively and effectively by hand.
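The obsolete-file detection step mentioned in the tips above can be sketched as follows. The five-year cutoff echoes the question raised earlier; using the file's modification time as a proxy for obsolescence is an assumption of this example.

```python
import time
from pathlib import Path

# Approximate five years in seconds (leap days ignored for simplicity).
FIVE_YEARS = 5 * 365 * 24 * 3600


def find_obsolete(root: str, max_age_seconds: int = FIVE_YEARS) -> list[Path]:
    """Return files under `root` not modified within `max_age_seconds`.

    Modification time is only a heuristic: a rarely modified file may
    still be read often, so results should feed a review, not deletion.
    """
    cutoff = time.time() - max_age_seconds
    return [
        p
        for p in Path(root).rglob("*")
        if p.is_file() and p.stat().st_mtime < cutoff
    ]
```

A real deployment would typically combine this age filter with access logs or business rules before anything is removed.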

It is also advisable to rely on "smart" solutions that implement algorithms enabling regular audits and monitoring of the information assets. Beyond raising employee awareness, the identification of sensitive data needs to be centralized, and classification and file cleaning processes defined, so that specific security measures (backup, deletion, logging, access control, etc.) can be applied.
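As a minimal sketch of the classification idea discussed above, the snippet below labels a piece of text "sensitive" if it matches simple patterns. The patterns and labels are hypothetical examples for illustration, not a vetted detection ruleset.

```python
import re

# Hypothetical detection patterns; a real deployment would use a
# curated, regularly audited ruleset covering many more data types.
PATTERNS = {
    "email": re.compile(r"\b[\w.+-]+@[\w-]+\.[a-zA-Z]{2,}\b"),
    "confidential_marker": re.compile(r"\bconfidential\b", re.IGNORECASE),
}


def classify(text: str) -> str:
    """Return 'sensitive' if any pattern matches, else 'ordinary'."""
    hits = [name for name, pattern in PATTERNS.items() if pattern.search(text)]
    return "sensitive" if hits else "ordinary"
```

Classification output like this is what allows cleaning actions to be prioritized by risk level rather than applied uniformly.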