Leveraging metadata and AI to extract value from data

Data is ‘the new uranium’: an incredibly valuable resource, but very dangerous when unmanaged and left to decay. Organisations create and capture huge volumes of unstructured information and dark data, and often don’t know where it is stored or who is doing what to it. Discover how to overcome this challenge and extract significant value from data, and the pivotal role metadata and Artificial Intelligence play in achieving data outcomes.

In a world of AI and content sprawl, does anyone still care about metadata? It turns out, yes. Corporations and governments have shifted their focus to treating information and records as a strategic asset. If data is ‘the new oil’ then metadata is critical to effectively understand, manage and extract value from it.

However, data is also ‘the new uranium’: an incredibly valuable resource, but very dangerous when unmanaged and left to decay. As a result, the last five years have seen an increase in both data protection and minimisation legislation (to manage data risk) and data exploitation strategies (to manage data value). This second shift has occurred at the highest levels: the UK Ministry of Defence Data Strategy, Digital Strategy, and Cyber Resilience Strategy, for example, all commit to managing data as a strategic asset so that it can be exploited at scale and speed. Metadata is a key pillar of this outcome, with Defence committing to:

  • organising, consolidating and cataloguing its data to enable visibility, consistency and re-use across organisations
  • contributing to the delivery of an enterprise data catalogue solution, a map of Defence data assets

To create catalogues, we need metadata. So what’s been stopping us?

The volume problem

Organisations create and capture incredible volumes of unstructured information, more now than ever before. At its simplest, unstructured information includes documents, emails, images, and other content where each meaningful ‘element’ of the record is not held in its own discrete ‘field’ (as it would be in a traditional database, which is a structured system).
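
To make the distinction concrete, here is a minimal sketch in Python (the names and values are invented for illustration): the same invoice expressed once as a structured record with discrete fields, and once as unstructured free text.

```python
# Structured: each meaningful element sits in its own discrete field,
# so it can be queried, filtered, and validated directly.
structured_record = {
    "invoice_number": "INV-1042",
    "supplier": "Acme Pty Ltd",
    "amount_aud": 1870.00,
    "due_date": "2024-03-31",
}

# Unstructured: the same elements are buried in free text. Nothing can be
# queried or reported on until something (a human or an AI) reads the text
# and extracts them.
unstructured_record = (
    "Hi team, attached is the Acme Pty Ltd invoice (INV-1042) for $1,870, "
    "which needs to be paid before the end of March."
)

print(structured_record["amount_aud"])    # direct field access: 1870.0
print("INV-1042" in unstructured_record)  # only a string search is possible: True
```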

One terabyte of storage can hold around 85 million documents, roughly the equivalent of a third of the US Library of Congress. The average large organisation could hold 250 TB of unstructured data: 83 Libraries of Congress, or more than 20 billion documents. And if an average document is 20 pages, it takes a human an hour to read just three of them.
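
A rough back-of-the-envelope calculation, using the estimates above and an assumed 1,800-hour working year, shows why human reading cannot close this gap:

```python
# Back-of-the-envelope arithmetic using the figures above (all are estimates).
docs_per_tb = 85_000_000        # ~85 million documents per terabyte
estate_tb = 250                 # average large organisation
docs_read_per_hour = 3          # ~20 pages per document, ~1 page per minute
working_hours_per_year = 1_800  # assumed full-time working year

total_docs = docs_per_tb * estate_tb
person_years = total_docs / docs_read_per_hour / working_hours_per_year

print(f"{total_docs:,} documents")           # 21,250,000,000 documents
print(f"~{person_years:,.0f} person-years")  # ~3,935,185 person-years of reading
```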

But to categorise something, you need to know what it’s about, and that means you do need to read it. Humans simply can’t read content at this scale. So we have to turn to AI to do that reading and analysis, and for the ‘generative AI’ task of creating the metadata. But there’s a caveat: it has to be Explainable AI. That means we can’t use Machine Learning, neural nets, or large language models (LLMs) to help us, not when the decisions we make in relying on that categorisation could affect an individual (for example, a decision to keep, share, destroy, or restrict a record).

The variety problem

We also need AI to deal with the variety of content we hold. Even if we did hire enough humans to read all our unstructured content, they would all need to be subject matter experts in dozens or even hundreds of laws, policies, and business functions. They would need to know how to categorise the content for national security sensitivity, under privacy laws, and under records retention and disposal schedules. They would also need to categorise it under perhaps hundreds of other types of policy and legislation, as well as under functional groupings. The data has to be catalogued in all the ways that people want to find, use, and report on it, and there are as many of those ways as there are teams.

So again we need AI, which can ‘be the SME’ for thousands of different policies and regulations. We use Rules as Code (RaC) AI for this purpose, because it scales across any type of regulatory or functional classification, does not rely on staff effort to train and supervise it, and is explainable and transparent, as Ethical AI laws require. But there is something else we have to do to succeed with variety: we have to centralise our categorisation system. We can’t modify millions of documents to add dozens of tags or labels to them directly in their source systems. Because of the variety of classification types, systems that only allow a few labels per item can’t work for our purpose of protecting, preserving, and exploiting our content. We have to use a manage-in-place approach, and capture all the metadata centrally.
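
As an illustration only, and not a description of any particular product’s implementation, the sketch below shows the general shape of the rules-as-code idea: rules written as explicit, inspectable logic, evaluated against content that stays in its source system, with the resulting categories and supporting evidence captured in a central register. All rule IDs, patterns, and paths here are hypothetical.

```python
import re

# Each rule encodes an obligation or category as explicit, inspectable logic.
# The rule IDs, categories, and patterns below are invented for illustration.
RULES = [
    {"id": "SEC-01",  "category": "security:commercial-in-confidence",
     "pattern": r"\bcommercial[- ]in[- ]confidence\b"},
    {"id": "PRIV-04", "category": "privacy:personal-information",
     "pattern": r"\bdate of birth\b|\bpassport number\b"},
    {"id": "RET-12",  "category": "retention:finance-7-years",
     "pattern": r"\binvoice\b|\bpurchase order\b"},
]

def classify(doc_id: str, text: str) -> list[dict]:
    """Evaluate every rule against the text; each hit records which rule
    fired and the text that triggered it, so the outcome is explainable."""
    results = []
    for rule in RULES:
        match = re.search(rule["pattern"], text, re.IGNORECASE)
        if match:
            results.append({
                "doc_id": doc_id,            # the content stays in its source system
                "rule_id": rule["id"],
                "category": rule["category"],
                "evidence": match.group(0),  # why the rule fired
            })
    return results

# Metadata is captured centrally (manage in place): the register holds the
# categories and evidence, while the document itself is never modified.
central_register: list[dict] = []
central_register += classify(
    "sharepoint://finance/2023/inv-1042.docx",
    "Commercial-in-confidence: invoice INV-1042, includes supplier date of birth.",
)
for entry in central_register:
    print(entry)
```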

The velocity problem

Related to the other two problems is the rate of change. Content does not stand still. We have to categorise and manage content from the moment it is captured or created, all the way through to its end of life. During that time, its content and context will change repeatedly, the categories that apply to it will change, and the obligations and functional priorities we want to categorise against will change constantly too. This is another reason why requiring humans to add and update labels can’t scale, and why traditional AI approaches like ML can’t scale either: we don’t have time to re-train and re-supervise our AI engine every time the category types change.
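
To illustrate the point, the short sketch below (again with invented rules and documents) shows what happens under a rules-as-code approach when an obligation changes: the rule set is edited and re-evaluated immediately, and there is no model to retrain.

```python
import re

rules = {"RET-12": r"\binvoice\b"}  # current retention rule (hypothetical)
indexed_text = {
    "doc-1": "Invoice for Q3 services.",
    "doc-2": "Purchase order approval chain.",
}

def reclassify() -> dict:
    # Re-evaluate every rule against the already-indexed text.
    return {doc: [rid for rid, pat in rules.items()
                  if re.search(pat, text, re.IGNORECASE)]
            for doc, text in indexed_text.items()}

print(reclassify())  # {'doc-1': ['RET-12'], 'doc-2': []}

# A new obligation is published: add the rule and re-run. Nothing is retrained.
rules["RET-19"] = r"\bpurchase order\b"
print(reclassify())  # {'doc-1': ['RET-12'], 'doc-2': ['RET-19']}
```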

So what is the future of metadata?

First, the future of data is metadata. Without categorisation, we can’t control, understand, or exploit our content.

Second, the metadata has to be centralised, in a manage-in-place model. There are too many categories, changing too often, to apply labels and tags at the source.

Third, the categorisation has to use AI. And that AI has to be transparent, explainable, and scalable. Rules as Code is the only non-disruptive, sustainable AI type for this purpose. Castlepoint RaC generates metadata using Ethical AI, ensuring a high degree of accuracy, at scale, with complete traceability and defensibility.

This modern approach has been rapidly adopted across governments and corporates, and is proving highly effective in transforming them to meet the data outcomes everyone is focused on for 2024 and beyond: extracting as much value as possible from data, one of our most valuable assets, while preserving and protecting it so that bad actors can’t do the same. Compliance, security, privacy, audit, and discovery all force-multiply with an enterprise-wide, AI-generated data catalogue.