Why the T in ChatGPT is AI's biggest breakthrough - and greatest risk

AI companies hope that feeding ever more data to their models will continue to boost performance, eventually leading to human-level intelligence. Behind this hope is the "transformer", a key breakthrough in AI, but what happens if it fails to deliver?

When ChatGPT first took the world by storm in 2022, its capabilities were so impressive that people happily looked past its awkward name. Yet hidden within those initials lies a key breakthrough responsible for sending artificial intelligence rocketing these past few years – and potentially a limitation that could see it crashing back to Earth.

A good neural network architecture is vital when developing artificial intelligence (Image: Shutterstock)


GPT stands for generative pre-trained transformer, and it is the last word that matters most. The transformer was introduced in a 2017 paper by a team at Google, built around a mechanism called “self-attention”. This means that when a model is given a string of words, it doesn’t consider each one in isolation, but instead weighs the links between all the words it has been fed, “transforming” the whole input into a new output.
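
To make that description concrete, here is a deliberately stripped-down sketch of single-head self-attention, written in Python with numpy. It is an illustration of the idea only: real transformers add learned query, key and value projections, multiple attention heads and positional information, none of which appear here.

```python
import numpy as np

def self_attention(x):
    """Minimal single-head self-attention sketch.

    x: array of shape (n_words, d), one vector per word.
    Real transformers learn separate query/key/value projections;
    here we reuse x itself to keep the core idea visible.
    """
    d = x.shape[-1]
    scores = x @ x.T / np.sqrt(d)            # how strongly each word links to every other word
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)   # softmax: each row sums to 1
    return weights @ x                        # each output mixes all inputs, "transforming" the sequence

# Toy example: three "words" represented by random 4-dimensional vectors
np.random.seed(0)
sentence = np.random.randn(3, 4)
print(self_attention(sentence).shape)  # (3, 4): one new vector per word
```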

This design yielded huge success when given enough computing power and data, leading to the surprising jump in apparent reasoning and language capabilities behind today’s AI tools. What’s more, using ever greater amounts of computing power and data seems to keep improving a transformer’s performance. This “scaling law” is why AI companies have predicted that their models will get better and better, with OpenAI claiming that its upcoming GPT-5 model will have the reasoning capabilities of a PhD student, approaching the field’s grand goal of artificial general intelligence (AGI): a machine capable of anything a human can do.
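
The scaling laws in question are empirical power-law fits to training runs, not guarantees. A commonly reported form, popularised by DeepMind’s 2022 “Chinchilla” analysis, relates a model’s loss to its parameter count and the number of training tokens; the symbols below follow that general shape, with the constants fitted from experiments rather than asserted here.

```latex
% Empirical neural scaling law, in the form popularised by Hoffmann et al. (2022):
% loss L falls as a power law in model size N and number of training tokens D.
L(N, D) \;\approx\; E \;+\; \frac{A}{N^{\alpha}} \;+\; \frac{B}{D^{\beta}}
% E: irreducible loss;  A, B, \alpha, \beta: constants fitted to training runs.
```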


Outside observers are more sceptical, however. OpenAI has seen a number of high-profile departures this month, with prominent AI researcher Gary Marcus posting on X that people leaving the firm would be “inconceivable if AGI or even just major profitability were close”.


More fundamentally, some researchers question whether these all-important scaling laws can continue unimpeded, especially as most of the available training data, largely scraped from the internet, has already been gathered. “You need more knowledge, you need more text,” says Sepp Hochreiter at Johannes Kepler University Linz in Austria. There is some hope that using AI to generate “synthetic” data might push through this barrier, but others are sceptical.


Even if transformers continue to follow these scaling laws, there could still be fundamental problems with their design. One is that they lack an internal memory, says Hochreiter, which we know is central to the way that human intelligence works.

Transformers must also repeatedly look back at data they have already seen. Processing and generating long text sequences can therefore require enormous computational resources as the AI scans back and forth. This isn’t as much of a problem when working with short emails or simple questions, but it makes transformers ill-suited for working with much longer text sequences, like books or large data sets, says Yoon Kim at the Massachusetts Institute of Technology. “Transformers are just fundamentally inefficient and are ill-equipped for these types of applications.”
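
To put rough numbers on that inefficiency: plain self-attention compares every token with every other token, so the work and memory grow roughly with the square of the sequence length. The back-of-the-envelope sketch below counts only the pairwise attention scores in a single layer, ignoring everything else a real model does, and the token counts are illustrative guesses.

```python
def attention_pairs(sequence_length):
    """Number of token-to-token comparisons one plain self-attention layer makes."""
    return sequence_length ** 2

for n in (500, 5_000, 100_000):   # roughly: a short email, a long article, a book
    print(f"{n:>7} tokens -> {attention_pairs(n):>15,} pairwise scores per layer")
```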


Researchers hope to solve these problems with alternative neural network architectures, such as a model developed by Hochreiter and his colleagues called extended long short-term memory (xLSTM). Its predecessor, known simply as LSTM, was the de facto AI language architecture before transformers took over, but Hochreiter says the new version, which has many more artificial neurons and a redesigned memory, produces results comparable to transformers while being much more efficient and better able to remember past states.
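
The efficiency argument rests on the recurrent pattern that the LSTM family shares: the model carries a single fixed-size state forward through the text, so each new token costs the same amount of work however long the sequence already is, instead of rescanning everything seen so far. The sketch below shows that pattern with a bare-bones recurrent network; it is far simpler than an LSTM’s gated cell, and is not xLSTM itself, whose memory design differs.

```python
import numpy as np

def recurrent_readout(tokens, d_state=8):
    """Schematic recurrent processing (the pattern behind the LSTM family).

    The contrast with attention: one fixed-size state is carried forward,
    so per-token cost does not grow with the length of the history.
    """
    rng = np.random.default_rng(0)
    W_in = rng.standard_normal((d_state, tokens.shape[-1]))
    W_state = rng.standard_normal((d_state, d_state))
    state = np.zeros(d_state)            # the "memory" of everything seen so far
    for x in tokens:                      # one pass, no re-reading of earlier tokens
        state = np.tanh(W_in @ x + W_state @ state)
    return state

print(recurrent_readout(np.random.randn(1000, 4)).shape)  # (8,): constant-size summary
```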

Sam Altman and OpenAI aim to build a machine capable of human-level reasoning (Image: SeongJoon Cho/Bloomberg via Getty Images)


But these models, while offering advantages over the transformer, aren’t radical departures, says Kim. “These aren’t very different models from one another. If you go high-level enough, they’re essentially the same model.”


A deeper problem with both transformers and their alternatives is whether they are “Turing complete”, which means they are able to run any algorithm or compute anything that any other computer can. This might seem like a technical point, says Razvan Pascanu at Google DeepMind, but if future AI systems are to be as reliable and capable as we want them to be, then demonstrating that they are Turing complete is important.


Take a simple problem: adding two numbers. Because transformers work by extrapolating from their training data, they are good at adding together numbers they have seen many times, but can produce inaccurate results for numbers not found in their training data.


While transformer arithmetic is improving as models scale up, it still isn’t clear whether they have the capability to add in a Turing-complete way, as any other computer would. “When we build these systems, we’re trying to make them approximate some function we care about, and usually this is what’s going on,” says Pascanu. “But for certain things, like addition, you might care about representing the addition algorithm exactly — you don’t care about approximating addition.”
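
As an illustration of what representing the algorithm exactly means, the digit-by-digit carrying procedure taught in school works for numbers of any length, with no training data involved, whereas a model that interpolates between memorised examples carries no such guarantee. A minimal sketch of that exact procedure:

```python
def add_digit_strings(a, b):
    """Grade-school column addition on decimal strings.

    An exact algorithm: it handles inputs of any length, including numbers
    never seen before, which is the guarantee a pattern-matching
    approximation of addition lacks.
    """
    result, carry = [], 0
    # Walk from the least significant digit, carrying as we go
    for da, db in zip(reversed(a.zfill(len(b))), reversed(b.zfill(len(a)))):
        carry, digit = divmod(int(da) + int(db) + carry, 10)
        result.append(str(digit))
    if carry:
        result.append(str(carry))
    return "".join(reversed(result))

print(add_digit_strings("987654321987654321", "123456789123456789"))  # 1111111111111111110
```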

So what is the solution? Finding an AI architecture that can perform as well as the transformer, while also solving the current problems with efficiency and memory, is no easy task. “It’s really hard to find good neural network architectures that work,” says Jonathan Frankle at AI company Databricks. “It’s not like you can just sit down at a whiteboard and think up a new way to do this and it works. There are a lot of grad students and people who have thrown a lot of time and energy into failed efforts to just improve the transformer, let alone to come up with something whole-cloth new and different.”


This means that if the transformer fails to live up to the heady promises of AI firms, or if the scaling laws don’t follow their observed trajectory, it isn’t obvious where an alternative might come from. The commercial pressures at big AI companies leave little room for experimentation, says Frankle, and with transformers so entrenched in current AI systems, it is unclear whether the industry can change course now.


“Even if you do come out with something that’s very interesting and exciting, the standard of proof for you to convince someone to switch over to that, to take the risk on their billion-dollar training run to use your thing, is exceedingly high,” he says.
