Meta proposes new scalable memory layers that improve knowledge, reduce hallucinations


As enterprises continue to adopt large language models (LLMs) in various applications, one of the key challenges they face is improving the models' factual knowledge and reducing hallucinations. In a new paper, researchers at Meta AI propose "scalable memory layers," which could be one of several possible solutions to this problem.

Scalable memory layers add more parameters to LLMs to increase their learning capacity without requiring additional compute resources. The architecture is useful for applications where you can spare extra memory for factual knowledge but also want the inference speed of nimbler models.

Dense layers vs. memory layers

Traditional language models use "dense layers" to encode vast amounts of information in their parameters. In dense layers, all parameters are used at close to full capacity and are mostly activated at the same time during inference. Dense layers can learn complex functions, but increasing their capacity requires additional computational and energy resources.

In contrast, simple factual knowledge can be stored much more efficiently and interpretably in simpler layers with associative memory architectures. This is what memory layers do. They use sparse activations and key-value lookup mechanisms to encode and retrieve knowledge. Memory layers take up more memory than dense layers, but because they only use a small portion of their parameters at a time, they are much more computationally efficient.
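To make the idea concrete, here is a minimal sketch in PyTorch-style Python of a key-value memory layer with a sparse top-k lookup. It is illustrative only: the names (MemoryLayer, num_keys, top_k) are made up for this sketch, and Meta's actual implementation differs in many details.

import torch
import torch.nn as nn
import torch.nn.functional as F


class MemoryLayer(nn.Module):
    # Minimal sketch of a key-value memory layer with a sparse top-k lookup.
    # Illustrative only; not Meta's implementation.

    def __init__(self, dim: int, num_keys: int = 1024, top_k: int = 8):
        super().__init__()
        self.keys = nn.Parameter(torch.randn(num_keys, dim) * 0.02)    # learnable keys
        self.values = nn.Parameter(torch.randn(num_keys, dim) * 0.02)  # learnable values
        self.top_k = top_k

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, dim) hidden states acting as queries
        scores = x @ self.keys.t()                          # similarity to every key
        topk_scores, topk_idx = scores.topk(self.top_k, dim=-1)
        weights = F.softmax(topk_scores, dim=-1)            # normalize only the selected scores
        selected_values = self.values[topk_idx]             # gather just the top-k values
        # Weighted sum of the few retrieved values; all other parameters stay untouched
        return (weights.unsqueeze(-1) * selected_values).sum(dim=-2)

The key point the sketch captures is that only top_k of the num_keys value vectors participate in any given lookup, which is what makes the operation sparse.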

Memory layers have existed for several years but are rarely used in modern deep learning architectures, in part because they are not optimized for current hardware accelerators.

Current frontier LLMs usually use some form of "mixture of experts" (MoE) architecture, which uses a mechanism vaguely similar to memory layers. An MoE model is composed of many smaller expert components that specialize in specific tasks. At inference time, a routing mechanism determines which experts are activated based on the input sequence. PEER, an architecture recently developed by Google DeepMind, extends MoE to millions of experts, providing more granular control over the parameters that are activated at inference time.
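As a rough illustration of the routing idea, the toy top-1 MoE block below sends each token to its single best-scoring expert. The names (TinyMoE, router, experts) are invented for this sketch; this is not PEER and not how production MoE layers are implemented.

import torch
import torch.nn as nn
import torch.nn.functional as F


class TinyMoE(nn.Module):
    # Toy top-1 mixture-of-experts block to illustrate routing.

    def __init__(self, dim: int, num_experts: int = 4):
        super().__init__()
        self.router = nn.Linear(dim, num_experts)  # scores each token against every expert
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(dim, 4 * dim), nn.ReLU(), nn.Linear(4 * dim, dim))
            for _ in range(num_experts)
        ])

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (num_tokens, dim); each token is routed to its single best-scoring expert
        gate = F.softmax(self.router(x), dim=-1)
        expert_idx = gate.argmax(dim=-1)
        out = torch.zeros_like(x)
        for i, expert in enumerate(self.experts):
            mask = expert_idx == i
            if mask.any():
                # Only the chosen expert's parameters are used for these tokens
                out[mask] = gate[mask, i].unsqueeze(-1) * expert(x[mask])
        return out

As with memory layers, the total parameter count can grow with the number of experts while the per-token compute stays roughly constant.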

Upgrading memory layers

Memory layers are light on compute but heavy on memory, which presents specific challenges for current hardware and software frameworks. In their paper, the Meta researchers propose several modifications that solve these challenges and make it possible to use memory layers at scale.

Memory layers can store knowledge across multiple GPUs in parallel without slowing down the model (source: arXiv)

First, the researchers configured the memory layers for parallelization, distributing them across several GPUs to store millions of key-value pairs without changing other layers in the model. They also implemented a special CUDA kernel for handling high-memory-bandwidth operations. And they developed a parameter-sharing mechanism that supports a single set of memory parameters across multiple memory layers within a model. This means that the keys and values used for lookups are shared across layers.
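The parameter-sharing idea can be sketched as follows: several memory layers reference one shared pool of keys and values instead of each holding its own copy. This is a rough approximation under assumed names (SharedKV, SharedMemoryLayer), not the paper's code, and it leaves out the GPU parallelization and the custom CUDA kernel entirely.

import torch
import torch.nn as nn
import torch.nn.functional as F


class SharedKV(nn.Module):
    # One pool of keys and values, meant to be referenced by several memory layers.
    def __init__(self, dim: int, num_keys: int):
        super().__init__()
        self.keys = nn.Parameter(torch.randn(num_keys, dim) * 0.02)
        self.values = nn.Parameter(torch.randn(num_keys, dim) * 0.02)


class SharedMemoryLayer(nn.Module):
    # Each layer keeps only a lightweight per-layer query projection
    # and looks everything up in the shared key-value store.
    def __init__(self, dim: int, shared_kv: SharedKV, top_k: int = 8):
        super().__init__()
        self.query_proj = nn.Linear(dim, dim)
        self.shared_kv = shared_kv  # the same object is passed to every layer
        self.top_k = top_k

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        q = self.query_proj(x)
        scores = q @ self.shared_kv.keys.t()
        topk_scores, topk_idx = scores.topk(self.top_k, dim=-1)
        weights = F.softmax(topk_scores, dim=-1)
        return (weights.unsqueeze(-1) * self.shared_kv.values[topk_idx]).sum(dim=-2)


# Three memory layers share one key-value store, so adding layers
# does not multiply the number of memory parameters.
kv = SharedKV(dim=512, num_keys=100_000)
memory_layers = nn.ModuleList([SharedMemoryLayer(512, kv) for _ in range(3)])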

These changes make it possible to implement memory layers within LLMs without slowing down the model.

“Memory layers with their sparse operations are a great complement to dense networks, providing greater capacity for knowledge acquisition while being light on computation,” the researchers write. “They can be scaled efficiently, and provide users with an attractive new way to trade off memory with compute.”

To test memory layers, the researchers modified Llama models, replacing one or more dense layers with a shared memory layer. They compared the memory-enhanced models against the dense LLMs as well as MoE and PEER models on a number of tasks, including factual question answering, scientific and common-sense world knowledge, and coding.
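Conceptually, the swap described above might look like the snippet below, where the dense feed-forward sub-layer of selected transformer blocks is replaced with one shared memory layer. The attribute names (blocks, mlp) are placeholders, not Llama's actual module paths.

import torch.nn as nn


def replace_ffn_with_memory(model: nn.Module, layer_indices, shared_memory_layer: nn.Module) -> nn.Module:
    for i in layer_indices:
        # The same shared memory layer object replaces each selected dense FFN,
        # so the extra capacity comes from one common key-value store.
        model.blocks[i].mlp = shared_memory_layer
    return model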

A 1.3B memory model (solid line) trained on 1 trillion tokens outperforms a 7B model (dashed line) on factual question-answering tasks as it receives more memory parameters (source: arXiv)

Their results show that memory models improve significantly over dense baselines and compete with models that use 2X to 4X more compute. They also match the performance of MoE models with the same compute budget and parameter count. The memory models' gains are especially pronounced on tasks that require factual knowledge. For example, on factual question answering, a memory model with 1.3 billion parameters approaches the performance of Llama-2-7B, which was trained on twice as many tokens and with 10X more compute.

Moreover, the researchers found that the benefits of memory models remain consistent with model size as they scaled their experiments from 134 million to 8 billion parameters.

“Given these findings, we strongly advocate that memory layers should be integrated into all next-generation AI architectures,” the researchers write, adding that there is still much more room for improvement. “In particular, we hope that new learning methods can be developed to push the effectiveness of these layers even further, enabling less forgetting, fewer hallucinations and continual learning.”


