Overcoming complexity with AI-powered information autoclassification

While automation has transformed the world by boosting organisational efficiencies, the complex and multifaceted function of information classification remains largely unautomated. There is a good reason for that. In this article, we explore the challenges organisations face in classifying dark and unstructured data, and the solutions that can overcome those hurdles.

Automation has huge transformative potential and has been an integral part of achieving operational and cost efficiencies within organisations for decades, if not centuries. As we enter the fourth industrial revolution, the automation landscape is evolving at an unprecedented pace. It is becoming more sophisticated and intelligent, taking over the repetitive and mundane tasks that used to be done by humans, but at larger scale, with better accuracy, and at faster speeds, while delivering data-driven insights. Automated content creation with Generative AI, automated process management, automated mailouts, and automated spam filters are just a few examples.

But while automation has allowed organisations to increase throughput, minimise human error, and reduce time to market, it lacks cognitive and problem-solving abilities, especially where complexity is involved. This is one reason why information classification largely remains unautomated within corporate and government organisations: it is complex, multifaceted, and has significant ethical implications.

So, let’s explore what automated information classification actually means, what challenges organisations face, and what the possible solutions are.

First up, what do we mean by ‘classification’?

This is the process, traditionally done by people, of determining what a set of information is about. We use classification to help us organise and categorise data, in order to find it easily, use it efficiently, and protect it effectively. Three main types of classification are required for digital records:

  1. Security classification (categorising by sensitivity and risk). For example, ‘this data is Secret’, or ‘this data has Personally Identifiable Information’.
  2. Compliance classification (categorising by applicable rule or policy). For example, ‘this data needs to be retained for 15 years from date of last action’.
  3. Functional classification (categorising semantically). For example, ‘this data is about grant funding’, or ‘this data is about Simon’.

Almost every piece of digital information needs some kind of classification before we can get value from it or manage its risk, and classification is not as simple as it sounds. It can be multifaceted and complex. For instance, one email can be about Simon, and also be about grant funding, and contain PII, and need to be kept for 15 years. That is not a straightforward classification exercise, and it has ethical implications as well.
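To make that multifaceted picture concrete, here is a minimal sketch in Python. The Record and Classification structures are our own illustration, not any particular product’s schema; the point is that a single email can legitimately carry several classifications across all three facets at once:

```python
from dataclasses import dataclass, field

@dataclass
class Classification:
    facet: str   # 'security', 'compliance', or 'functional'
    value: str   # the category applied under that facet

@dataclass
class Record:
    title: str
    classifications: list[Classification] = field(default_factory=list)

# One email, four classifications spanning all three facets
email = Record(
    title="RE: Simon's grant application",
    classifications=[
        Classification("security",   "Contains PII"),
        Classification("compliance", "Retain 15 years from last action"),
        Classification("functional", "Grant funding"),
        Classification("functional", "Simon"),
    ],
)

# Any downstream decision (access, retention, discovery) has to
# consider every facet together, not just one label.
for c in email.classifications:
    print(f"{c.facet}: {c.value}")
```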

You might think classification of content is already automated. There are plenty of technologies offering some kind of automation for classification. And we definitely do need to use technology to classify information: there is too much volume, variety, and velocity of content for humans to classify manually, and manual classification is prone to error. But when considering these solutions, we need to evaluate and understand just how automated the process really is. Because there’s automatic classification, and then there’s autoclassification.

What’s the difference between ‘automatic classification’ and ‘autoclassification’?

In the old days of paper files, classification was centralised. The papers all got sent to the records team, who stamped the front of the manila folder with security markings, and filed it away in the right category of documents.

When digital recordkeeping began, the job of stamping and filing was decentralised, and the individuals creating the content were asked to add labels and use file plans, instead of stamps and filing cabinets. The same outcome, but no longer done by specialists. And by the time everyone had a word processor on their desk, we were creating a lot more content every day. Inevitably, the quality of classification degraded, and in many cases, was no longer done at all. This is what led to the situation most enterprises find themselves in now — huge amounts of ‘dark data’, uncategorised, unprotected, non-compliant, and undiscoverable.

When classification automation joined the fray, the focus was on the scale problem. Computers needed to somehow apply the right categories to all that messy data. But the computers could not know what categories to apply; they had to be told, by humans. Many solutions offering ‘automatic classification’ are really offering automated application of manual classifications, at scale. You need to tell the system what classification to apply to bulk data, based on a rules engine, a file plan, or a naming convention. With the exception of some limited use cases, such as systems that automatically detect PII or PCI data in documents or emails, you will probably find that your computer does not tell you what classification should be applied. Automatic classification systems can tell you that a document has a ‘PII’ category, but not that it has a ‘Simon’ category specifically, or a ‘grant management’ category, or a ‘retain for 15 years’ category. Not unless you tell them first, by adding a label, for example, or using a file plan.
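To illustrate what that first wave looks like under the hood, here is a minimal, hypothetical sketch of a rules engine. Every label it produces exists only because a human wrote the pattern first; the patterns shown are simplified for illustration, not production-grade detectors:

```python
import re

# Human-authored rules: the system only 'knows' what these patterns encode.
RULES = [
    ("PII",              re.compile(r"\b\d{3}-\d{2}-\d{4}\b")),        # crude US-SSN-style pattern
    ("Payment card",     re.compile(r"\b(?:\d[ -]?){13,16}\b")),       # crude PCI-style pattern
    ("Grant management", re.compile(r"\bgrant (funding|application)\b", re.I)),
]

def classify(text: str) -> list[str]:
    """Return every rule label whose pattern matches the text."""
    return [label for label, pattern in RULES if pattern.search(text)]

print(classify("Simon's grant application, SSN 123-45-6789"))
# ['PII', 'Grant management']
```

Note what is missing: nothing in this engine can discover that the document is about Simon, or that it must be retained for 15 years, unless a human writes a rule saying so.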

But what about AI?

Yes! Artificial Intelligence goes beyond the first wave of automatic classification systems, to start telling you what your content is about, and what that means in terms of its categorisation. But some AI classification tools are more ‘auto’ than others. Supervised Machine Learning, for example, still requires humans to tell it what to do. It needs to be fed curated data at scale, trained on that data, and supervised by humans through that training process. This can work well for limited use cases, but is much harder to scale for those multifaceted classification lenses, because it’s so human-intensive and subject to error (just like the old days). It also has another critical issue, in that it’s inherently not transparent or explainable (which it has to be, if you are using it in any kind of risky or regulated context). This also effectively rules out generative AI like ChatGPT, Bard, or Copilot for autoclassification purposes in a corporate or government context.
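For contrast, here is a minimal sketch of the supervised approach, using scikit-learn with an invented, toy-sized labelled set. Note how humans must curate and label every training example before the model can classify anything, and how the result offers no citable explanation:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# The human-intensive part: every example must be curated and labelled
# by a person before training. Real systems need thousands of these.
texts  = ["grant funding approved for Q3", "grant application received",
          "payroll run for June", "June salary payments processed"]
labels = ["grant funding", "grant funding", "payroll", "payroll"]

model = make_pipeline(TfidfVectorizer(), LogisticRegression())
model.fit(texts, labels)

# The model predicts a label, but cannot explain *why* in terms a
# regulator would accept: the learned weights are not a human-readable rule.
print(model.predict(["new grant funding round announced"]))  # expected: ['grant funding']
```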

Solution: true autoclassification, without the impacts

Organisations have huge volume, variety, and velocity of data. This data has inherent risk, and unexploited value.

Humans cannot keep up with the vast and diverse data estate. Sensitive data is uncontrolled. Regulatory obligations in handling are not met. Important data is lost or allowed to decay.

Attempts to overcome this challenge by placing the burden on users to categorise information, or by training and supervising unexplainable AIs, carry too much impact and risk, and they fail.

This is why Castlepoint uses true autoclassification: not metadata labels, rules engines, file plans, or Machine Learning that needs to be trained. We classify (more accurately than a human) using explainable AI, at huge scale. Castlepoint pioneered Regulation as Code for classifying data ethically, efficiently, and effectively, and disrupted the first wave of automatic classification technology with a new and sustainable approach: one that can actually keep pace with the rapid rate of content capture, with an increasingly volatile threat environment, with the fast rate of regulatory change, and with a highly competitive global market where every bit of value needs to be extracted from your data estate.
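As an illustration of the Regulation as Code concept (this is our own simplified sketch, not Castlepoint’s actual implementation), a retention obligation can be expressed as citable, machine-readable data, so every classification decision traces back to a named rule rather than an opaque model weight:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class RetentionRule:
    authority: str      # the instrument the rule comes from
    clause: str         # citable location within it
    trigger: str        # event that starts the retention clock
    retain_years: int

# Hypothetical rule for illustration; real rules come from real instruments.
RULE = RetentionRule(
    authority="Records Disposal Schedule (example)",
    clause="Class 4.2",
    trigger="date of last action",
    retain_years=15,
)

def explain(rule: RetentionRule) -> str:
    """Every decision is explainable: cite the rule, not a model weight."""
    return (f"Retain for {rule.retain_years} years from {rule.trigger}, "
            f"per {rule.authority}, {rule.clause}.")

print(explain(RULE))
```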

If you want true autoclassification, rather than automatic application of manual classification, contact our team, or take a deeper dive into explainable autoclassification with our free masterclass video.