Making Transformers Simpler and Faster

Transformers have become the backbone behind many recent advances in AI, powering systems like ChatGPT for natural language tasks. Yet the standard transformer architecture carries a number of components that add complexity and computational cost. In a new paper, researchers Bobby He and Thomas Hofmann explore how to streamline transformers by removing unnecessary components, making them simpler, faster, and more practical.

The core idea is that several aspects of the standard transformer block—the primary building block transformers are made of—can be simplified or removed entirely without hampering performance. Specifically, He and Hofmann identify and eliminate excess baggage in terms of 1) skip connections, 2) value and projection parameters, 3) sequential sub-blocks, and 4) normalization layers.  

Skip connections are links between layers that help information flow across the network. The researchers find these can be discarded in each transformer layer's attention and feedforward sub-blocks. The key is to initialize the attention mechanism with a strong identity component, so that each token initially attends mostly to itself and retains its own information as signals pass through the network.
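To make this concrete, here is a minimal sketch in PyTorch of what an identity-biased attention sub-block could look like. The class name and the learnable scalars alpha and beta are illustrative choices rather than the paper's notation; the point is only that the token-mixing map starts out close to the identity, so the skip connection around attention is no longer needed to preserve each token's own signal.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class IdentityBiasedAttention(nn.Module):
    """Single-head self-attention whose token-mixing map starts as the identity.

    Sketch only: mixes the softmax attention matrix with the identity matrix
    using learnable scalars, so at initialization each token mostly keeps its
    own representation even though there is no skip connection.
    """

    def __init__(self, d_model: int):
        super().__init__()
        self.q = nn.Linear(d_model, d_model, bias=False)
        self.k = nn.Linear(d_model, d_model, bias=False)
        self.alpha = nn.Parameter(torch.tensor(1.0))  # weight on the identity, 1 at init
        self.beta = nn.Parameter(torch.tensor(0.0))   # weight on attention, 0 at init

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, d_model)
        scores = self.q(x) @ self.k(x).transpose(-2, -1) / x.shape[-1] ** 0.5
        attn = F.softmax(scores, dim=-1)              # (batch, seq_len, seq_len)
        eye = torch.eye(x.shape[1], device=x.device)  # identity over token positions
        mix = self.alpha * eye + self.beta * attn     # identity-dominated at init
        return mix @ x                                # no residual/skip added back
```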

The value and projection parameters transform representations as they enter and exit the multi-head attention module in each layer. Surprisingly, He and Hofmann show these transform matrices can be fixed to the identity without hurting results, eliminating half of the weight matrices, and their associated matrix multiplications, in the attention sub-block.
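As a rough illustration (single-head, unmasked attention, with hypothetical weight names), fixing the value and output projection to the identity means the attention sub-block needs only the query and key matrices:

```python
import torch
import torch.nn.functional as F

def standard_attention(x, w_q, w_k, w_v, w_o):
    """Standard single-head attention: four weight matrices (query, key, value, output)."""
    scores = (x @ w_q) @ (x @ w_k).transpose(-2, -1) / x.shape[-1] ** 0.5
    return (F.softmax(scores, dim=-1) @ (x @ w_v)) @ w_o

def simplified_attention(x, w_q, w_k):
    """Value and output projection fixed to the identity: only query and key remain."""
    scores = (x @ w_q) @ (x @ w_k).transpose(-2, -1) / x.shape[-1] ** 0.5
    return F.softmax(scores, dim=-1) @ x  # the tokens themselves serve as the values
```

Dropping the value and output matrices is where the parameter and matrix-multiplication savings in the attention sub-block come from.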

Similarly, the standard transformer computes the attention and feedforward sub-blocks sequentially, one after the other. By computing them in parallel on the same input instead, the skip connections linking the sub-blocks become unnecessary.
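A sketch of the parallel arrangement, assuming a standard PyTorch attention and feedforward pair rather than the authors' exact block, might read:

```python
import torch.nn as nn

class ParallelBlock(nn.Module):
    """Attention and feedforward computed on the same input, then summed.

    Contrast with the sequential form ffn(attn(x)): here neither sub-block
    feeds the other, so no skip connection is needed to carry information
    between them.
    """

    def __init__(self, d_model: int, n_heads: int):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ffn = nn.Sequential(
            nn.Linear(d_model, 4 * d_model),
            nn.GELU(),
            nn.Linear(4 * d_model, d_model),
        )

    def forward(self, x):
        attn_out, _ = self.attn(x, x, x)   # attention branch
        return attn_out + self.ffn(x)      # feedforward branch on the same input
```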

Finally, normalization layers, which help keep activations in a stable range, can also be removed with the proper architectural adjustments, albeit with a small loss in per-step training speed.

Together, these modifications lead to a radically simplified transformer block that matches, if not exceeds, the performance and efficiency of the original complex block. For example, the simplified model attains roughly 15% faster training throughput and uses 15% fewer parameters, yielding practical savings.
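Putting the pieces together, a block in the spirit of these simplifications might look like the sketch below: identity-biased attention, no value or output projections, parallel sub-blocks, no skip connections, and no normalization layer. It is an illustration under the same assumptions as the earlier snippets, not the authors' released code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SimplifiedBlock(nn.Module):
    """Sketch of a transformer block combining the four simplifications."""

    def __init__(self, d_model: int):
        super().__init__()
        self.q = nn.Linear(d_model, d_model, bias=False)
        self.k = nn.Linear(d_model, d_model, bias=False)
        self.alpha = nn.Parameter(torch.tensor(1.0))  # identity weight at init
        self.beta = nn.Parameter(torch.tensor(0.0))   # attention weight at init
        self.ffn = nn.Sequential(
            nn.Linear(d_model, 4 * d_model),
            nn.GELU(),
            nn.Linear(4 * d_model, d_model),
        )

    def forward(self, x):
        # Attention branch: identity-biased at init, no value or output
        # projection, no skip connection, no normalization layer.
        scores = self.q(x) @ self.k(x).transpose(-2, -1) / x.shape[-1] ** 0.5
        mix = self.alpha * torch.eye(x.shape[1], device=x.device) \
              + self.beta * F.softmax(scores, dim=-1)
        # Feedforward branch computed in parallel on the same input.
        return mix @ x + self.ffn(x)
```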

Real-World Impact

The work has both theoretical and practical implications. On the theory side, it reveals limitations of current tools such as signal propagation theory for explaining and designing networks, motivating more nuanced theories that also capture training dynamics. Practically, simpler transformer architectures translate into significant efficiency gains in AI systems, reducing the computational demands of deploying large language models.

For example, consider CPROMPT.AI, a platform that lets everyday users build and share custom AI applications easily. The apps tap into capabilities like text generation from prompts, with no coding needed. Simpler and faster transformers make it possible to deploy more powerful capabilities to more people at lower cost, which matters as advanced AI diffuses across society.

He and Hofmann’s simplifications also complement the work of other researchers pursuing efficient transformers, bringing us closer to practical models at the scale and accuracy needed to push AI forward. While recent models boast hundreds of billions of parameters, streamlined architectures could pack comparable performance into packages sized for broad access and impact.

The quest for AI that is not only capable but accessible and responsible continues. Reducing transformer complexity provides one path to more efficient, economical, and beneficial AI development and deployment.

The key facts from the paper

  • Skip connections in both the attention and feedforward sub-blocks of transformers can be removed without hampering performance by initializing attention with a strong identity component.
  • The value and projection parameters in multi-head attention are unnecessary and can be fixed to the identity matrix.
  • By parallelizing the attention and feedforward computations, the sequential ordering of sub-blocks can also be eliminated.
  • Simplifying transformers in these ways yields models with 15% higher throughput and 15% fewer parameters.
  • Limitations of signal propagation theories for neural network design are revealed, motivating more refined dynamic theories.

Glossary

  • Skip connection - a connection that lets information skip over one or more layers by adding a layer's input directly to a later layer's output.
  • Value parameters - weights that transform token representations into the values combined by the attention mechanism.
  • Projection parameters - weights that transform attention outputs before they leave the attention module.
  • Sequential sub-blocks - the standard ordering in which the attention sub-block is computed before the feedforward sub-block.
  • Normalization layer - a layer that keeps activation values in a stable range.