AI is not Data

This may be a controversial post for some, since the data analytics industry has seemingly done its best to blur the line between the data domain and today’s Artificial Intelligence (AI) boom. That said, board directors and executives will be best served by understanding how these domains differ, so that their organisations can take advantage of AI advances.

First, let’s wind the clock back to the first time that I remember using a mass-market AI technology. I was still at school, and used a program called WordStar to write my essays. If I misspelled a word, it could detect this, and would suggest the word that I meant to type, just like a professional copy editor might.

It’s an everyday, boring use case, but spell checking is AI, and has been around for decades. These days the technology is more sophisticated, showing a wavy red line under suspect words in real-time as they are typed.
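
To make the idea concrete, here is a minimal sketch of how a spell checker of that era could work: look words up in a dictionary, and suggest the closest matches for anything it doesn’t recognise. This is an illustration of the general technique only, not how WordStar (or any modern word processor) actually implements it, and the tiny dictionary is obviously made up.

```python
# A toy dictionary-based spell checker with "closest match" suggestions.
# Illustrative only; real spell checkers use much larger dictionaries and
# more sophisticated matching than this.
import difflib

DICTIONARY = ["the", "quick", "brown", "fox", "jumps", "over", "lazy", "dog"]

def check_word(word: str) -> list[str]:
    """Return suggested corrections for a word, or [] if it is spelled correctly."""
    if word.lower() in DICTIONARY:
        return []
    # Suggest the closest dictionary words by string similarity.
    return difflib.get_close_matches(word.lower(), DICTIONARY, n=3, cutoff=0.6)

print(check_word("qiuck"))  # ['quick']
```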

There are two points I want to make from this very boring example.

Firstly, the technical definition of AI is very broad. Anything a computer does that is equivalent to something a smart human could do is considered to be AI. Simply put, if it is a sign of intelligence when a human does it, it is called artificial intelligence when a computer does it.

Secondly, when the word AI is used, it very much refers to the technology of the moment. Things that were considered AI in the past are no longer what people mean when they use the word AI. I started my career working on AI in the 1990s, and at the time, when people referred to deploying an AI system for a business, they were generally referring to an expert system. An expert system is built by having a programmer sit down with an expert, find out all the rules that the expert uses to do their job, and code those rules into a computer, allowing that job to be automated.
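
In code, that rule-capturing approach can be as simple as the following toy sketch. The loan-triage scenario, the rules and the thresholds are all hypothetical, and real expert systems used dedicated rule engines rather than plain if-statements, but the principle is the same: the expert’s knowledge is written down as rules and then applied automatically.

```python
# A toy illustration of the expert-system approach: hand-coded rules
# standing in for an expert's judgement. Scenario and thresholds are
# entirely made up for illustration.

def assess_loan_application(income: float, existing_debt: float, years_employed: int) -> str:
    """Apply hand-coded rules an expert might use to triage a loan application."""
    if income <= 0:
        return "decline"
    if existing_debt / income > 0.5:   # Rule 1: too much existing debt
        return "decline"
    if years_employed < 1:             # Rule 2: employment history too short
        return "refer to human underwriter"
    return "approve"                   # Rule 3: otherwise, approve

print(assess_loan_application(income=80_000, existing_debt=20_000, years_employed=3))  # approve
```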

Since then, AI has been used to refer to different technologies over time. Not that long ago, when deep learning took off, the term AI came to refer to that. For example, AlphaGo, a deep learning-based system, beat a top professional at the game of Go in 2016, showing that deep learning was the ascendant AI technology at the time. Now, with OpenAI’s demonstrations of the DALL-E image generator in January 2021 and ChatGPT in November 2022, Generative AI (GenAI) is typically what people are referring to when they talk about AI.

What does this have to do with data?

Generative AI is used for creative work, such as composing a paragraph of text on a given subject, writing lines of code for a software function, or seamlessly removing an object from a photo, and in its application it is very different from other types of machine learning. Normally a machine learning model is specific to a given organisation, so it has been trained on confidential data from that organisation. Generative AI models require far more data to train than any one organisation typically has, and hence have been trained on everything that is on the Internet and then some. As a result, many applications of Generative AI don’t require any organisational data to start providing value. I can get ChatGPT to summarise a meeting transcript without needing to give it any previous meeting transcripts to learn from.
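
For the technically inclined, here is a minimal sketch of what that looks like in practice, assuming the OpenAI Python client (v1+) with an API key already set in the environment; the model name and transcript are illustrative. The point is what isn’t there: no training step, and no organisational data beyond the transcript itself.

```python
# Zero-shot summarisation with a hosted GenAI model. Assumes the OpenAI
# Python client (v1+) and OPENAI_API_KEY set in the environment; the model
# name is an illustrative choice. No training data is supplied.
from openai import OpenAI

client = OpenAI()

def summarise_transcript(transcript: str) -> str:
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # illustrative model choice
        messages=[
            {"role": "system", "content": "Summarise meeting transcripts as concise bullet points."},
            {"role": "user", "content": transcript},
        ],
    )
    return response.choices[0].message.content

print(summarise_transcript("Alice: Let's ship the report Friday. Bob: I'll draft it by Wednesday."))
```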

This harks back to my first AI experience, where my word processor would suggest spelling corrections, and I didn’t need to give it any data to learn from before it could do that (although I did eventually add some Australian words to its dictionary). But in the intervening years, many organisations were sold predictive analytics solutions that required data warehouses or data lakes, took in years of proprietary data, and created business insights from them. This type of Predictive AI does require data, but when people talk about AI now, they are usually not referring to that type.

Why does this matter?

Predictive AI benefits from copious quantities of clean, well-organised data. Producing this requires data analysts and data engineers, a lot of storage, and careful governance. Data governance and ethics frameworks need to be built into business processes so that organisations make appropriate use of this data. This data is a honeypot for hackers, so it requires good cybersecurity practices to keep it out of their hands. All of this is expensive, and it slows down new applications of AI to ensure they are done responsibly.

Generative AI doesn’t require any in-house data for many of its valuable applications. The most data-like application is RAG (Retrieval-Augmented Generation), which uses a ChatGPT-type system in conjunction with a document repository. It works more like a search engine than a data warehouse, so it isn’t using the type of data that Predictive AI usually relies on. Documents, software source code, images, videos and sound files are the main inputs for Generative AI applications. As a result, there is no need for exactly the same data platforms, ethics and governance controls, or cybersecurity protection. However, with the speed of change occurring in Generative AI, organisations that wish to gain the most from it will benefit from innovation or experimentation frameworks.
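
To show how little the RAG pattern resembles a traditional data platform, here is a simplified sketch: retrieve the most relevant documents for a question, then build a prompt that passes them to a generative model as context. The keyword-overlap retriever is a stand-in for the embedding and vector search used in production systems, and the documents are made up; the resulting prompt would be sent to a GenAI model such as the one in the earlier sketch.

```python
# A simplified sketch of Retrieval-Augmented Generation (RAG):
# 1) retrieve relevant documents, 2) build a prompt containing them.
# The retriever below is a crude keyword-overlap stand-in for real
# embedding/vector search, and the documents are illustrative.

def retrieve(question: str, documents: list[str], top_k: int = 2) -> list[str]:
    """Rank documents by keyword overlap with the question and keep the top few."""
    q_words = set(question.lower().split())
    scored = sorted(documents, key=lambda d: len(q_words & set(d.lower().split())), reverse=True)
    return scored[:top_k]

def build_prompt(question: str, context_docs: list[str]) -> str:
    """Combine retrieved documents and the question into a single prompt."""
    context = "\n".join(f"- {doc}" for doc in context_docs)
    return f"Answer using only the context below.\n\nContext:\n{context}\n\nQuestion: {question}"

docs = [
    "The leave policy allows 20 days of annual leave per year.",
    "Expense claims must be submitted within 30 days.",
]
question = "How many days of annual leave do staff get?"
prompt = build_prompt(question, retrieve(question, docs))
print(prompt)  # This prompt would then be sent to a GenAI model.
```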

In fact, an organisation may choose to have different areas look after each. One area may be responsible for data and predictive analytics, and the other may be responsible for AI and innovation. They will have different cultures, skill mixes and capital needs. There’s also a risk that putting these areas together will result in one or both being less successful because of the clash between them.

For example, choosing to upgrade a Microsoft 365 licence to gain access to Copilot features in Teams, Word and PowerPoint should not be treated as a data project, but as an innovation project. (See how the Australian government did this.) Similarly, deciding whether to use GenAI features in Adobe or Canva products is not a data project.

There are still many governance and risk-related aspects to work through with GenAI projects, but these are often different considerations to those covered for a Predictive AI project using private or confidential data. If a single AI governance process is to be used to vet all AI projects, a key question is whether every project will need to be assessed on all the aspects relevant to both Predictive and Generative AI, or whether projects will be assessed only on the relevant aspects, and how those will be identified.

The fact that GenAI is not tied to internal data is also apparent in the proliferation of “shadow AI”, where employees use AI tools on their personal devices or through personal accounts in order to access AI capabilities not provided by their employer. When was the last time an internal data repository was integrated with a third-party service at no cost? Never. Shadow AI typically isn’t held back by the need for data assets to be integrated, because GenAI doesn’t use them.

In conclusion, today’s AI projects (meaning GenAI) are not data projects. Different skills, platforms and controls are required to get value from Predictive AI’s data-oriented projects and Generative AI’s generally document-oriented but data-free projects. Don’t fall for data analytics industry hype that they are the same, or you could end up paying additional costs while missing the benefits of the latest AI wave.