Dawizz a Blueway company

How to free organizations from the stifling grip of Dark Data?

8 February 2023


GDPR, Data cleaning, Cybersecurity


The research firm Gartner helped popularize the term "Dark Data". In an article dating back to September 2017, Sony Shetty provided the following definition: "The information heritage that organizations collect, process, and store during their daily work but no longer use thereafter".

With the massive digital transformation of society, and the proliferation of digital exchanges that comes with it, it is clear that the amount of dark data has multiplied in recent years. But what is the extent of the phenomenon exactly?

The multifaceted nature of data production and storage in organizations makes the phenomenon difficult to measure precisely. Several studies have nevertheless attempted to quantify it, estimating the share of dark data in the world's data assets at over 50%: 52% according to Statista in 2020, and up to 65% according to the Digital Decarb website.

Dark data now makes up the majority of what IT systems hold, and its share is still growing. Why? Simply because the digitization of businesses and administrations is not yet complete, and because the multiplication of data sources drives exponential growth in the number of interactions organizations maintain, and therefore in the volume of data they generate.


In fact, the proliferation of what is also known as "cold data" poses not one problem, but several: security, financial, and - last but not least - environmental. If "dark data" is the share of data that the organization that produced it has not been able to classify, then it is more than likely that a non-negligible proportion of this shadow zone contains latent risks, which will surface suddenly on the occasion of an unforeseeable exposure: hacking, external audit, internal malice, etc.

These risks themselves are of various natures, depending on what the data contains:

  • Regulatory risk in the event of non-compliance with increasingly numerous frameworks (we immediately think of the GDPR, but almost every economic sector adds its own regulatory layers).
  • Competitive risk in the event of sensitive information of an R&D or commercial nature.
  • Reputational risk, especially if the data concerns individuals, etc.

The materialization of any of these risks inevitably results in a financial loss, directly in the event of a fine for non-compliance with the rules, but also indirectly through the degradation of the competitive position.

It should also be noted that the general attitude towards the responsibility of organizations in the event of data leaks has radically changed over time. Until a few years ago, a company or administration that suffered an incident resulting in data loss was seen as a victim. Today, such an incident is perceived as evidence that adequate precautions were not taken, and expectations are rising from authorities such as ANSSI in France, as well as from customers and public opinion.

Even in the absence of a data leak, the mere retention of dark data generates costs for organizations that, although often poorly identified or not identified at all, are nonetheless considerable. Several layers of cost add up to an overall economic inefficiency: energy consumption of server rooms, construction and maintenance of the corresponding premises, oversized hardware fleets, storage space billed by hosting companies, etc.

A figure? 2 billion euros per month! This is the estimate made by the research firm IDC for all companies worldwide.

But in the context of the climate emergency, to which all members of society must respond, the most unacceptable cost is perhaps the environmental cost of dark data. Mobilizing considerable resources to keep available data that will never be used is indeed pure waste.

At the forefront of these resources, we obviously think - and rightly so - of the electricity needed to power data centers: in an article dated August 2019, Charlotte Trueman wrote in Computerworld that the electrical consumption of data centers had already exceeded 3% of global consumption (i.e., more than the total consumption of the UK!). But electricity is unfortunately not the only resource consumed needlessly: in a publication from May 2021, contributors to IOPscience estimated that data centers are among the ten most water-intensive industries in the US, due to their cooling needs.

And for France? In its June 2020 report, the independent expert collective GreenIT.fr estimated that data centers accounted for 13% of the total electricity consumption related to digital technology. By adopting a conservative estimate of 50% of dark data in the data hosted on these servers, we arrive at more than one million tons of CO2 emitted "for nothing" each year, equivalent to about 1.5 million round-trip flights between Paris and New York (according to the calculator made available on the civil aviation government website).
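As a quick sanity check on the equivalence above, the two figures quoted in the paragraph can be divided to see what emission per round trip they imply. This is only a sketch: the inputs are the article's rounded numbers, the function name is illustrative, and the resulting per-trip value is derived here, not taken from the civil aviation calculator.

```python
# Figures quoted in the paragraph above (both are rounded estimates).
WASTED_CO2_TONNES = 1_000_000   # "more than one million tons of CO2" per year
ROUND_TRIPS = 1_500_000         # "about 1.5 million round-trip flights"


def implied_tonnes_per_round_trip(total_tonnes: float, round_trips: float) -> float:
    """Return the CO2 per Paris-New York round trip implied by the two figures."""
    if round_trips <= 0:
        raise ValueError("round_trips must be positive")
    return total_tonnes / round_trips


per_trip = implied_tonnes_per_round_trip(WASTED_CO2_TONNES, ROUND_TRIPS)
print(f"Implied emission: {per_trip:.2f} t CO2 per round trip")
```

With these inputs the implied value is roughly two thirds of a ton of CO2 per round trip, which is the order of magnitude the comparison relies on.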

Dark Data metrics representation

So, what to do?

The first prerequisite for solving a problem is to recognize that it exists. Still limited in the workplace a few years ago, awareness has now become widespread, and many organizations have included digital sobriety in the objectives of their architecture and IT operations. More and more have put indicators in place, and some are now publishing improvement goals. This is the case, for example, of Société Générale, which has committed to reducing greenhouse gas emissions by 50% between 2019 and 2025. In the public sphere, the ANCT has recently decided to include data lifecycle management among the levers that local authorities are invited to consider when establishing their responsible digital roadmap.

Under these conditions, can we afford to be optimistic and predict the short-term disappearance of the (almost) bottomless pit of dark data? In reality, it is not so simple. The first factor of resistance comes from reflexes deeply rooted in organizations, at both the management and employee level, which could be called the "you never know" syndrome.

When in doubt, even data that is clearly of no interest for future use is retained, and often duplicated as soon as it is produced. This behavior de facto condemns the data to endless wandering in the cloud, because no one will take the time to look for it there, and a new layer of useless data will quickly cover the previous one.

Behaviors can - and must - change, but that is not enough. The desire to tackle the dark data issue head-on faces many difficulties. A study conducted by TRUE Global Intelligence for the software vendor Splunk sought to characterize them more precisely. Asked in early 2019 about their perception of the main obstacles, a panel of 1,300 IT decision-makers across 7 countries, including France, the US, and the UK, produced the following ranking:

  • The amount of data involved: 39% of respondents.
  • The lack of necessary skills: 34%.
  • The lack of availability of resources: 32%.
  • The difficulty of coordinating between departments: 28%.

And yet, we cannot accuse these decision-makers of lacking motivation, since 77% of them consider searching for and finding dark data in their organization to be a top priority.

However, and this is what allows for a certain optimism in a situation that is currently only getting worse, solutions are now available to help organizations address the problem effectively. Data discovery platforms offered by Dawizz and other vendors make it possible to better catalog the content of data assets and to identify "cold data" more easily.
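To make the idea of spotting "cold data" concrete, here is a minimal sketch that flags files that have been neither read nor modified for a given period. It is an illustration only, not how Dawizz or any other platform works: the function name `find_cold_files`, the one-year default threshold, and the reliance on file system timestamps are all assumptions (access times in particular are not reliably updated on every system).

```python
import time
from pathlib import Path


def find_cold_files(root, max_idle_days=365.0, now=None):
    """Return (path, size_in_bytes) for files idle longer than max_idle_days.

    "Idle" is approximated by the most recent of the last-access and
    last-modification timestamps reported by the file system.
    """
    now = time.time() if now is None else now
    cutoff = now - max_idle_days * 86_400  # seconds in a day
    cold = []
    for path in Path(root).rglob("*"):
        if not path.is_file():
            continue
        stat = path.stat()
        last_used = max(stat.st_atime, stat.st_mtime)
        if last_used < cutoff:
            cold.append((path, stat.st_size))
    return cold


if __name__ == "__main__":
    # Example: report candidates under the current directory.
    candidates = find_cold_files(".")
    total_bytes = sum(size for _, size in candidates)
    print(f"{len(candidates)} cold files, {total_bytes} bytes")
```

A list like this is only a starting point for review: age alone says nothing about legal retention obligations or business value, which is why the classification work described in the article still matters.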

With our MyDataCatalogue trial process platform, we have sought to push the scope and automation of the data cataloging process as far as possible: programmable scans, field name recognition algorithms, application of criteria at both the metadata and data level, for structured and unstructured data alike, multilingual glossaries, and handling of synonyms.
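The combination of field name recognition, multilingual glossaries, and synonyms can be illustrated with a small sketch. Everything below is hypothetical and is not MyDataCatalogue's implementation: the glossary entries, the concept names, and the functions `normalize` and `classify_field` are invented for the example, and the matching is a simple exact lookup on normalized names.

```python
import re
import unicodedata

# Hypothetical glossary: concept -> known names and synonyms (English and French).
GLOSSARY = {
    "email": {"email", "e-mail", "mail", "courriel", "adresse mail"},
    "first_name": {"first name", "given name", "prénom"},
    "last_name": {"last name", "surname", "nom", "nom de famille"},
    "birth_date": {"birth date", "date of birth", "dob", "date de naissance"},
}


def normalize(name: str) -> str:
    """Lowercase, strip accents, split camelCase, and unify separators."""
    decomposed = unicodedata.normalize("NFKD", name)
    no_accents = "".join(c for c in decomposed if not unicodedata.combining(c))
    spaced = re.sub(r"(?<=[a-z0-9])(?=[A-Z])", " ", no_accents)
    return re.sub(r"[^a-z0-9]+", " ", spaced.lower()).strip()


# Reverse index: normalized synonym -> concept.
INDEX = {
    normalize(synonym): concept
    for concept, synonyms in GLOSSARY.items()
    for synonym in synonyms
}


def classify_field(name: str):
    """Return the glossary concept for a column or field name, or None."""
    return INDEX.get(normalize(name))


if __name__ == "__main__":
    for column in ["E_MAIL", "dateOfBirth", "PRÉNOM", "customer_id"]:
        print(column, "->", classify_field(column))
```

A real catalog would typically go further, for example with fuzzy matching or by inspecting the values themselves, but even this lookup shows why a shared glossary lets the same personal-data concept be recognized across differently named columns.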


As noted in the TRUE Global Intelligence survey mentioned above, one of the obstacles to implementing a data governance policy aimed at reducing the volume of dark data is the difficulty of organizing interaction between departments that often have different cultures and objectives. With the Collaborative Data Cleaning service, Dawizz has taken this reality into account by defining an intuitive interface for each department that allows it to perform its task effectively and quickly: data management, business users, and IT each contribute to the final result. You can find a more complete description of how this ready-to-use solution works by reading this article produced on the occasion of World CleanUp Day.

The Crédit Agricole Group is one of the users of Collaborative Data Cleaning, with several Regional Banks already having deployed the service or preparing to do so. The feedback from the Normandy Regional Bank illustrates the benefits that come from a well-equipped data cleaning campaign, with this joint testimony from the CDO and the DPO: "The development of a user interface optimized for our use cases made it easy to deploy the service, resulting in the liberation of 95% of disk space used in the targeted area. In addition to our primary goals of compliance and security, file deletion allows us to contribute to the bank's CSR objectives in terms of the positive carbon impact of our actions."

Like the Lernaean Hydra of Greek mythology, dark data has an unfortunate tendency to regenerate as it is destroyed. That's why it's important to relaunch cleaning campaigns regularly, so that stocks do not build up again, and to monitor the results in order to identify areas of the organization whose cleaning efforts are less effective than average and support them with targeted awareness-raising actions.