Data Discovery: How and Why Does It Improve the Data Mapping of an Information System?

16 February 2023

Stéphane Le Lionnais

Governance Mapping Strategy Data Governance


The phenomenon of cyberattacks is global. Every year, the number of cyberattacks against companies, local communities, and hospitals continues to increase. Victims of computer hacking typically report an intrusion into their information system, a blocked operation, or data theft.

According to a recent study by the cybersecurity company Surfshark covering the last quarter of 2022, France ranks first in data leak density: on average, 212 leaked accounts per 1,000 inhabitants.


To combat this phenomenon, the European Union (EU) has legislated to establish a reliable and secure digital environment on its territory. As a result, teams responsible for securing the information system and/or protecting personal data are under growing pressure to comply with complex regulations (the NIS 1 and 2 "Network and Information Security" directives, the GDPR "General Data Protection Regulation", etc.).

But concretely, how can a CISO or DPO identify, rank, and categorize data according to its level of sensitivity and criticality?

Likewise, data valuation has become a new structural issue for any organization. Data is now a fundamental asset, yet managing its lifecycle remains complex. Without a clear knowledge of that data, valuing it can resemble a quest for the Holy Grail: teams dedicated to data valuation or innovation spend a significant part of their time simply searching for the right information.


To meet these new requirements, it is therefore crucial that these teams build up their knowledge of the data they handle.

It is in this context that the notion of data discovery and classification has emerged. Data discovery simplifies navigation through data and makes it more accessible to all users. In this article, we will explore how and why data discovery improves the data mapping of an information system, and why a data governance approach requires a good understanding of data.

The Challenges of Data Cataloging

Until now, data cataloging methodologies have generally relied on manual processes. Given the human resources required, organizations struggle to keep the data catalog up to date, a difficulty compounded by the growing complexity of regulations, technologies, and disparate data source formats.

Data discovery tools such as MyDataCatalogue now enable teams responsible for securing, valuing, or protecting data to better understand the available data by providing context and conceptualizing it automatically. Retrieving context, such as the source of the data, enables a quick first automatic classification by associating a sensitivity/criticality level with the applications used, the users who created the data, and the storage location.
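To make this concrete, here is a minimal sketch of such a metadata-based first pass. The source names and sensitivity levels are illustrative assumptions, not the actual logic of MyDataCatalogue or any specific product:

```python
# Sensitivity level assumed from the application a dataset comes from
# (illustrative mapping, to be defined by each organization).
SOURCE_SENSITIVITY = {
    "hr_payroll": "high",       # personal and salary data
    "crm": "medium",            # customer contact details
    "public_website": "low",    # already-public content
}

def classify_by_context(dataset: dict) -> str:
    """Derive a first sensitivity level from a dataset's metadata alone."""
    source = dataset.get("source_application", "")
    return SOURCE_SENSITIVITY.get(source, "unknown")

datasets = [
    {"name": "employees_2023.csv", "source_application": "hr_payroll"},
    {"name": "press_releases", "source_application": "public_website"},
]

for ds in datasets:
    print(ds["name"], "->", classify_by_context(ds))
```

The point of the sketch is that no file content is read at this stage: the classification comes entirely from context (here, the source application), which is why it is fast but must later be validated.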

Data discovery does indeed improve data mapping projects by automating classification as far as possible. Metadata analysis alone, however, is not always enough: content analysis, which examines the data itself through regular expressions (matching social security numbers or credit card numbers, for example), algorithms, or learning models, becomes essential to validate and complete classifications based solely on metadata.
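The regular-expression part of content analysis can be sketched as follows. The two patterns below are deliberately simplified illustrations; a real tool would apply stricter rules and validation such as a Luhn checksum for card numbers:

```python
import re

# Simplified detection patterns (illustrative only).
PATTERNS = {
    # 13 to 16 digits, optionally separated by spaces or dashes
    "credit_card": re.compile(r"\b(?:\d[ -]?){13,16}\b"),
    # French social security number: 13 digits plus a 2-digit key (simplified)
    "ssn_fr": re.compile(r"\b[12]\d{12}\s?\d{2}\b"),
}

def scan_content(text: str) -> set:
    """Return the set of sensitive-data categories detected in a text."""
    return {label for label, rx in PATTERNS.items() if rx.search(text)}

print(scan_content("Card: 4111 1111 1111 1111"))  # {'credit_card'}
```

In practice, this kind of scan runs over sampled file or database content and its findings are combined with the metadata-based classification to confirm or correct it.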

This approach is becoming less and less complex to implement: existing solutions natively include preconfigured rules to automatically identify payment card information (PCI), personally identifiable information (PII), and data covered by other security standards.

Organizations must also lay the foundation for data mapping and understand the processing associated with their data (whether personal or not, sensitive or simply business-related).

Most organizations will start with semi-automated methods to establish this mapping and improve it through collaborative services and automated data discovery.

The Importance of Data Discovery: An Example

Most data discovery solutions only identify two key attributes: the location and the type of data. However, the data stored by an organization is generally formatted differently (even inconsistently) across application sources, so it is essential to conceptualize it for interpretation.

To illustrate this, let's take the example of an organization that wants to control its IS and align it as closely as possible with its strategy (an urbanized-IS approach). In this approach to data control, organizations usually begin by implementing data repositories. For security and return-on-investment reasons, the first repository to be addressed is generally the third-party repository.

When implementing this type of repository, the difficulty encountered is not a technical or IT problem; it lies in collaboration and in realigning data between business and governance teams.

Depending on the business, a third party may be called a client, user, supplier, collaborator, store, etc. Once the repository solution is defined and chosen, how do you find and locate the applications and data sources that handle third parties and need to be connected to the service? There are two ways to solve this problem:

  • The first (the most time-consuming and therefore the most resource-intensive) is to interview the various business referents in your organization to build the mapping manually.

  • The second (the fastest and least expensive) is to use automated data discovery tools such as MyDataCatalogue that will identify and classify data by concepts regardless of the semantics or language used in business applications.
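The second approach can be pictured as mapping the many business labels, whatever their language, onto a single canonical concept. The synonym list below is a hypothetical example, not the actual vocabulary shipped with MyDataCatalogue:

```python
# Many business labels (in several languages) mapped to one canonical
# concept, here "third_party" (illustrative synonym list).
SYNONYMS = {
    "third_party": {
        "client", "customer", "user", "utilisateur", "supplier",
        "fournisseur", "collaborator", "collaborateur", "store", "magasin",
    },
}

def concept_of(label: str):
    """Map a field or table label to its canonical concept, if known."""
    normalized = label.strip().lower()
    for concept, labels in SYNONYMS.items():
        if normalized in labels:
            return concept
    return None

# Field names discovered in disparate application sources
for field in ("Client", "FOURNISSEUR", "order_id"):
    print(field, "->", concept_of(field))
```

Once every source is tagged this way, listing the applications that must feed the third-party repository becomes a simple query over the discovered concepts rather than a round of interviews.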


In conclusion, the traditional data cataloging approach must indeed be rethought in a collaborative mode. A data catalog without data discovery functionality can be worse than no data catalog at all.