Google's new neural-net LLM architecture separates memory components to control exploding capacity and compute costs


A new neural-network architecture developed by researchers at Google could solve one of the biggest challenges for large language models (LLMs): extending their memory at inference time without exploding the costs of memory and compute. Called Titans, the architecture enables models to find and store at inference time small bits of information that are important in long sequences.

Titans combines traditional LLM attention blocks with “neural memory” layers that enable models to handle both short- and long-term memory tasks efficiently. According to the researchers, LLMs that use neural long-term memory can scale to millions of tokens and outperform both classic LLMs and alternatives such as Mamba while having many fewer parameters.

Attention layers and linear models

The classic transformer architecture used in LLMs employs the self-attention mechanism to compute the relations between tokens. This is an effective technique for learning complex and granular patterns in token sequences. However, as the sequence length grows, the computing and memory costs of calculating and storing attention increase quadratically.
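
To make the quadratic growth concrete, here is a toy illustration (not from the paper): naive self-attention materializes an n-by-n score matrix, so doubling the sequence length quadruples the compute and memory for that step.

```python
import torch

def attention_scores(x: torch.Tensor) -> torch.Tensor:
    # x: (n, d) token embeddings; returns an (n, n) score matrix
    return (x @ x.T) / x.size(-1) ** 0.5

for n in (1024, 2048, 4096):
    x = torch.randn(n, 64)
    # n*n entries: ~1.0M, ~4.2M, ~16.8M (4x per doubling of n)
    print(n, attention_scores(x).numel())
```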

More recent proposals offer alternative architectures that have linear complexity and can scale without exploding memory and computation costs. However, the Google researchers argue that linear models do not show competitive performance compared to classic transformers, because they compress their contextual data and tend to miss important details.

The ideal architecture, they suggest, should have distinct memory components that can be coordinated to use existing knowledge, memorize new facts, and learn abstractions from their context.

“We argue that in an effective learning paradigm, similar to (the) human brain, there are separate but interconnected modules, each of which is responsible for a role that is critical to the learning process,” the researchers write.

Long-term neural memory

“Memory is a confederation of systems, e.g., short-term, working, and long-term memory, each serving a different function with different neural structures, and each capable of operating independently,” the researchers write.

To address the shortcoming in current language models, the researchers propose a “neural long-term memory” module that learns new information at inference time without the inefficiencies of the full attention mechanism. Instead of storing information during training, the neural memory module learns a function that can memorize new facts at inference time and dynamically adapt the memorization process based on the data it encounters. This solves the generalization problem that other neural network architectures suffer from.
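
As an intuition pump, here is a minimal sketch of what gradient-based test-time memorization can look like, assuming a small MLP memory and an associative key-value reconstruction loss. The class, method names, and hyperparameters are illustrative, not Google's implementation:

```python
import torch
import torch.nn as nn

class NeuralMemory(nn.Module):
    """Toy long-term memory: an MLP whose weights are updated at inference
    time so it learns to map keys to values (illustrative sketch only)."""

    def __init__(self, dim: int):
        super().__init__()
        self.mem = nn.Sequential(
            nn.Linear(dim, dim), nn.SiLU(), nn.Linear(dim, dim)
        )
        self.to_k = nn.Linear(dim, dim)  # key projection
        self.to_v = nn.Linear(dim, dim)  # value projection

    def memorize(self, x: torch.Tensor, lr: float = 1e-2) -> torch.Tensor:
        """One test-time update: nudge the memory weights to store x."""
        with torch.enable_grad():  # works even inside a no_grad inference loop
            k, v = self.to_k(x), self.to_v(x)
            loss = (self.mem(k) - v).pow(2).mean()  # associative memory loss
            grads = torch.autograd.grad(loss, list(self.mem.parameters()))
        with torch.no_grad():
            for p, g in zip(self.mem.parameters(), grads):
                p -= lr * g  # a gradient step is the act of memorizing
        return loss.detach()

    def recall(self, q: torch.Tensor) -> torch.Tensor:
        """Read from memory by querying the learned function."""
        return self.mem(self.to_k(q))
```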

To decide which bits of information are worth storing, the neural memory module uses the concept of “surprise.” The more a sequence of tokens differs from the kind of information stored in the model's weights and existing memory, the more surprising it is and thus worth memorizing. This allows the module to make efficient use of its limited memory and store only bits of data that add useful information to what the model already knows.

To handle very long sequences of data, the neural memory module has an adaptive forgetting mechanism that allows it to remove information that is no longer needed, which helps manage the memory's limited capacity.
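
In the paper, the gradient of the memory's loss serves as the “surprise” signal, accumulated with momentum so that surprise can carry over across tokens, while a decay term implements forgetting. The sketch below wires both ideas into one update step; for simplicity the gates are fixed scalars here, whereas the paper makes them data-dependent:

```python
import torch

@torch.no_grad()
def surprise_update(params, grads, surprise_state,
                    eta: float = 0.9,     # how much past surprise carries over
                    theta: float = 0.01,  # step size on momentary surprise
                    alpha: float = 0.05): # forgetting rate for old memory
    """Surprise-driven memory write with forgetting (illustrative values).
    grads are gradients of the associative loss: the larger they are, the
    more the input deviates from what the memory already stores."""
    for p, g, s in zip(params, grads, surprise_state):
        s.mul_(eta).add_(g, alpha=-theta)  # S_t = eta*S_{t-1} - theta*grad
        p.mul_(1 - alpha).add_(s)          # M_t = (1-alpha)*M_{t-1} + S_t
```

A caller would keep one zero-initialized surprise_state tensor per memory parameter and pass in the gradients computed in a memorize step like the one sketched above.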

This neural memory module can complement the attention mechanism of current transformer models, which the researchers describe as “short-term memory modules, attending to the current context window size.” By contrast, they write, “our neural memory with the ability to continuously learn from data and store it in its weights can play the role of a long-term memory.”

Titan architecture

Example of the Titans architecture (source: arXiv)

The researchers describe Titans as a family of models that incorporate existing transformer blocks with neural memory modules. A Titans model has three key components: a “core” module, which acts as short-term memory and uses the classic attention mechanism to attend to the current segment of input tokens that the model is processing; a “long-term memory” module, which uses the neural memory architecture to store information beyond the current context; and a “persistent memory” module, learnable parameters that remain fixed after training and store time-independent knowledge.

The researchers suggest different ways to connect the three components. But in general, the main advantage of this architecture is enabling the attention and memory modules to complement each other. For example, the attention layers can use the historical and current context to determine which parts of the current context window should be stored in long-term memory. Meanwhile, long-term memory provides historical knowledge that is not present in the current attention context, as the sketch below illustrates.
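
One plausible way to wire the three components together, loosely in the spirit of the paper's “memory as context” variant and reusing the hypothetical NeuralMemory class from the earlier sketch (a simplification, not Google's released code):

```python
import torch
import torch.nn as nn

class TitansStyleBlock(nn.Module):
    """Sketch: persistent tokens and a long-term memory readout are
    prepended to the current segment before short-term attention runs."""

    def __init__(self, dim: int, n_heads: int, n_persist: int = 16):
        super().__init__()
        # Persistent memory: learned in training, fixed at inference.
        self.persist = nn.Parameter(torch.randn(n_persist, dim) * 0.02)
        # Core module: classic attention acting as short-term memory.
        self.attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)
        # Long-term memory: the test-time-updated module sketched earlier.
        self.long_term = NeuralMemory(dim)

    def forward(self, segment: torch.Tensor) -> torch.Tensor:
        # segment: (batch, seq, dim), the current chunk of tokens.
        b = segment.size(0)
        recalled = self.long_term.recall(segment)  # historical knowledge
        persist = self.persist.unsqueeze(0).expand(b, -1, -1)
        context = torch.cat([persist, recalled, segment], dim=1)
        out, _ = self.attn(segment, context, context)
        self.long_term.memorize(segment)  # write current chunk to memory
        return out
```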

The researchers ran small-scale tests of Titan models, ranging from 170 million to 760 million parameters, on a diverse range of tasks, including language modeling and long-sequence language tasks. They compared the performance of Titans against transformer-based models, linear models such as Mamba, and hybrid models such as Samba.

Titans (red line) outperforms other models, including GPT-4, on long-sequence tasks in few-shot and fine-tuned settings (source: arXiv)

Titans showed strong performance on language modeling and outperformed both transformer and linear models of similar sizes.

The performance difference is especially pronounced on long-sequence tasks, such as “needle in a haystack,” where the model must retrieve a bit of information from a very long sequence, and BABILong, where the model must reason across facts distributed in very long documents. In fact, on these tasks, Titans outperformed models with orders of magnitude more parameters, including GPT-4 and GPT-4o-mini, and a Llama-3 model enhanced with retrieval-augmented generation (RAG).

Moreover, the researchers were able to extend the context window of Titans to 2 million tokens while keeping memory costs at a modest level.

The models still need to be tested at larger sizes, but the paper's results show that the researchers have not yet hit the ceiling of Titans' potential.

What does it mean for enterprise applications?

With Google being a leader in long-context models, we can expect this technique to find its way into private and open models such as Gemini and Gemma.

With LLMs supporting longer context windows, there is growing potential for creating applications where you squeeze new knowledge into your prompt instead of using techniques such as RAG. The development cycle for building and iterating over prompt-based applications is much faster than that of complex RAG pipelines. Meanwhile, architectures like Titans can help reduce inference costs for very long sequences, making it possible for companies to deploy LLM applications for more use cases.

Google plans to release PyTorch and JAX code for training and evaluating Titans models.


