Overview of Large Language Models: From Transformer Architecture to Prompt Engineering

Published on February 24, 2023

AI-based conversational agents such as ChatGPT and Bard have skyrocketed in popularity recently. These and many other language models are competing to dominate a new technological frontier, and they are entering our daily lives through browsers and communication platforms. With the industry evolving so quickly, it can be hard to keep up, and deciding which product to use or invest in can be challenging. The key to staying ahead is to understand the underlying technology: knowing how models such as GPT and BERT work will help you navigate the rapidly changing landscape of language models.

At the heart of this technology lies the Transformer architecture, a deep learning model that has redefined the way we process natural language text thanks to its remarkable efficiency. In this article, we dive into the details of the Transformer and explore its history of modification and improvement. By the end, you'll have a solid grasp of the technology driving today's language models.

The model that made the difference

A new generation of powerful language models began with a breakthrough in 2017, when the landmark paper "Attention Is All You Need" introduced a new neural network architecture called the Transformer. This encoder-decoder architecture, composed of stacks of transformer layers and depicted below, quickly became popular for Natural Language Processing (NLP) problems.

*Figure 1: (a) In the encoder-decoder architecture, the input sequence is first encoded into a state vector, which is then used to decode the output sequence. (b) A transformer layer; the encoder and decoder modules are built from stacks of transformer layers. Source: Holistic AI*

Its use of attention mechanisms and parallel processing set this model apart from traditional Convolutional Neural Networks (CNNs) and recurrent Long Short-Term Memory (LSTM) networks. The network processes all elements of a sequence in parallel and uses attention layers, loosely inspired by how human attention focuses on the most relevant information.

The attention mechanism captures relationships between words in the text, even when they are far apart, which makes it much more efficient at processing long sequences. The parallel nature of the architecture takes full advantage of graphics processors, and the attention layer largely eliminates the forgetting problem that plagues recurrent networks.
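
To make the idea concrete, here is a minimal sketch of scaled dot-product attention, the operation at the core of an attention layer. It uses plain Python lists and made-up two-dimensional vectors; the function names `softmax` and `attention` are illustrative choices, not taken from any particular library, and a real model would also apply learned projections to obtain the queries, keys, and values.

```python
import math

def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention(queries, keys, values):
    """Scaled dot-product attention over lists of equal-length vectors."""
    d_k = len(keys[0])
    outputs, all_weights = [], []
    for q in queries:
        # Similarity of this query to every key, scaled by sqrt(d_k).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in keys]
        weights = softmax(scores)
        # The output is the weighted average of the value vectors.
        out = [sum(w * v[j] for w, v in zip(weights, values)) for j in range(len(values[0]))]
        outputs.append(out)
        all_weights.append(weights)
    return outputs, all_weights

# Toy example: three tokens with 2-dimensional vectors (made-up numbers).
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
outputs, weights = attention(tokens, tokens, tokens)
for row in weights:
    print([round(w, 2) for w in row])
```

Each row of `weights` says how strongly one token attends to every other token. Because every row is computed independently of the others, the whole sequence can be processed in parallel, which is what suits this operation to graphics processors.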

In the diagram below, you can see the activations of an attention layer. A single attention layer can contain many attention heads. These activations represent the significant associations learned by the model during training:

*Figure 2: Connections made by the model between elements of the text. These associations were learned during training.*

The ingestion of information

The question then arises of how to train this architecture on the language modeling task. Because the attention layer observes every element of the sequence, training would be weak if the model could already see the output it is asked to predict. To resolve this, there are two approaches:

*Figure 3: Language modeling approaches. (a) Masked Language Modeling predicts hidden words in the sequence. (b) Causal Language Modeling predicts the next word in the sequence. Source: Holistic AI*

The two approaches are Masked Language Modeling (MLM), used by BERT, and Causal Language Modeling (CLM), used by GPT. Proposed by researchers at Google and OpenAI respectively, these models represented a significant leap forward in NLP technology. Because of their massive size, ranging from millions to billions of parameters, only companies with the computational power to scale could handle them. MLM uses the encoder module and masks some of its inputs, challenging the model to fill in the gaps. CLM, in contrast, predicts the next element in the sequence, using a masked attention layer in the decoder to avoid observing future data during training.
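
The masked attention used for causal language modeling can be sketched as follows. This is an illustration under simplifying assumptions: the raw attention scores are all set to zero so that only the effect of the mask is visible, and the helper names `causal_mask` and `masked_softmax` are hypothetical.

```python
import math

def causal_mask(n):
    """mask[i][j] is True when position i is allowed to attend to position j."""
    return [[j <= i for j in range(n)] for i in range(n)]

def masked_softmax(scores, allowed):
    """Softmax over the allowed positions only; blocked positions get weight 0."""
    peak = max(s for s, ok in zip(scores, allowed) if ok)
    exps = [math.exp(s - peak) if ok else 0.0 for s, ok in zip(scores, allowed)]
    total = sum(exps)
    return [e / total for e in exps]

tokens = ["the", "cat", "sat", "down"]
mask = causal_mask(len(tokens))

# Uniform raw scores, so the effect of the mask is easy to see.
raw_scores = [0.0] * len(tokens)
weights = [masked_softmax(raw_scores, row) for row in mask]
for token, row in zip(tokens, weights):
    print(token, [round(w, 2) for w in row])
```

The first token can only attend to itself, the second to the first two positions, and so on. Every weight above the diagonal is zero, so no position can look at the tokens it is being trained to predict.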

Despite their impressive ability to extract knowledge, each model had its limitations. MLM could relate information from the entire sequence, but it calculated errors on only about 15% of the sequence (the masked positions). CLM, on the other hand, could learn from the whole output sequence but could only use causal (left-to-right) information. In addition, both models had to be modified and fine-tuned before they could be used for specific tasks.
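
The difference in training signal can be illustrated with a small sketch that builds one training example for each objective and counts how many positions receive a target. It is a simplification: every selected token is replaced with a `[MASK]` placeholder, whereas BERT's actual recipe sometimes substitutes a random token or leaves the token unchanged. The function names, the `seed` parameter, and the 20-token stand-in sentence are illustrative assumptions.

```python
import random

MASK = "[MASK]"

def mlm_example(tokens, mask_rate=0.15, seed=0):
    """Hide a random subset of tokens; only hidden positions get a target."""
    rng = random.Random(seed)
    n_masked = max(1, round(len(tokens) * mask_rate))
    hidden = set(rng.sample(range(len(tokens)), n_masked))
    inputs = [MASK if i in hidden else t for i, t in enumerate(tokens)]
    targets = [t if i in hidden else None for i, t in enumerate(tokens)]
    return inputs, targets

def clm_example(tokens):
    """Every position except the last predicts the token that follows it."""
    return tokens[:-1], tokens[1:]

tokens = [f"w{i}" for i in range(20)]  # stand-in for a 20-token sentence
mlm_inputs, mlm_targets = mlm_example(tokens)
clm_inputs, clm_targets = clm_example(tokens)

mlm_supervised = sum(t is not None for t in mlm_targets)
clm_supervised = len(clm_targets)
print(mlm_supervised, clm_supervised)  # 3 19
```

With 20 tokens, the masked objective produces targets for 3 positions, while the causal objective produces a target for 19 of them.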

The generalisation of AI

The power of these language models was clear from their ability to generalise from limited examples. However, they had to be adapted to specific tasks to be useful in practical applications. This was a challenge, because the traditional approach of modifying the structure and fine-tuning the last layers was not scalable enough for commercial solutions. Researchers and engineers therefore sought a new approach in which the model generalises over task instructions: it takes a natural language instruction and its parameters as input and carries out the desired task in the output sequence. This is where models like GPT-3 and T5 came into prominence.
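
As a rough illustration of passing a task to the model as text, the sketch below assembles a prompt from an instruction, optional solved examples, and a new input. The `Input:`/`Output:` layout and the `build_prompt` helper are assumptions made for this example, not a format required by GPT-3 or T5.

```python
def build_prompt(instruction, examples, query):
    """Assemble a text prompt: task description, solved examples, then the new input."""
    lines = [instruction, ""]
    for source, target in examples:
        lines.append(f"Input: {source}")
        lines.append(f"Output: {target}")
        lines.append("")
    lines.append(f"Input: {query}")
    lines.append("Output:")
    return "\n".join(lines)

instruction = "Translate English to French."

# Zero-shot: the instruction alone, with no solved examples.
zero_shot = build_prompt(instruction, [], "cheese")

# Few-shot: the same instruction plus a handful of solved examples.
few_shot = build_prompt(
    instruction,
    [("sea otter", "loutre de mer"), ("peppermint", "menthe poivrée")],
    "cheese",
)
print(few_shot)
```

In both cases the model's weights stay untouched. The task is specified entirely in the input text, and the model is expected to continue the sequence after the final `Output:`.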

*Figure 5: In-context learning settings used with GPT-3 to perform a task with a language model.*

With these improvements, just as the growth in computing power was described by Moore's law, the trend towards ever larger parameter counts in these models seems to represent a new version of that law:

*Figure 6: Evolution of the number of parameters of language models over the years.*

Making language models bigger doesn't make them inherently better at following user intent. For example, large language models can generate output that is false, toxic, or useless to the user. In other words, these models are not aligned with their users.

One more step: Prompt engineering

At this stage, the technology needs greater precision in meeting the user's requests. To achieve this, technologies like InstructGPT and LaMDA align their language models with the user's intent, using fine-tuning and reinforcement learning strategies based on human feedback. LaMDA also extends the strategy to query external knowledge sources.

*Figure 7: How LaMDA handles groundedness through interactions with an external information retrieval system.*

LaMDA-Base returns a draft answer in the first call, followed by sequential calls to the LaMDA-Research model. Whether the next step queries the information retrieval system or responds to the user is determined by the first word of the LaMDA-Research output (such as TS), which identifies the next recipient.
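
The routing described here can be sketched as a small loop. Everything in this example is hypothetical: the three callables are stand-ins for the real models and retrieval system, the output format `"<recipient>, <content>"` is assumed, and so is the convention that `TS` addresses the retrieval system while `User` addresses the user.

```python
def answer_turn(user_message, base_model, research_model, toolset, max_calls=4):
    """Route each research-model output by its first word until it addresses the user."""
    draft = base_model(user_message)
    transcript = [("User", user_message), ("Base", draft)]
    for _ in range(max_calls):
        recipient, _, content = research_model(transcript).partition(", ")
        if recipient == "User":
            return content  # final answer for the user
        if recipient != "TS":
            break  # unknown recipient: stop rather than guess
        # Send the query to the retrieval system and record what came back.
        transcript.append(("Research", content))
        transcript.append(("TS", toolset(content)))
    return draft  # fall back to the draft if no final answer was produced

# Made-up stand-ins so the loop can run end to end.
queries_sent = []

def fake_base(message):
    return "It was built in 1890."

def fake_research(transcript):
    if not any(speaker == "TS" for speaker, _ in transcript):
        return "TS, Eiffel Tower construction date"
    return "User, It was completed in 1889."

def fake_toolset(query):
    queries_sent.append(query)
    return "Eiffel Tower: completed in 1889"

reply = answer_turn("When was the Eiffel Tower built?", fake_base, fake_research, fake_toolset)
print(reply)
```

In this toy run the draft is checked against one retrieval result before the revised answer goes to the user. The `max_calls` limit is an added safeguard for the sketch, so the loop always terminates.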

InstructGPT is trained in three steps: (1) supervised fine-tuning (SFT), (2) reward model (RM) training, and (3) reinforcement learning via proximal policy optimization (PPO) against this reward model.
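
For step (2), reward models of this kind are commonly trained on human comparisons between two responses, using a pairwise loss that pushes the reward of the preferred response above the reward of the rejected one. The sketch below shows that loss for a single comparison. The scalar rewards are made-up numbers standing in for the outputs of a reward model, and the function name is illustrative.

```python
import math

def pairwise_reward_loss(reward_preferred, reward_rejected):
    """-log(sigmoid(r_preferred - r_rejected)) for one human comparison."""
    margin = reward_preferred - reward_rejected
    # -log(sigmoid(x)) is equal to log(1 + exp(-x)).
    return math.log1p(math.exp(-margin))

# Made-up rewards: correct ranking, a tie, and a wrong ranking.
for preferred, rejected in [(2.0, 0.0), (0.0, 0.0), (0.0, 2.0)]:
    loss = pairwise_reward_loss(preferred, rejected)
    print(preferred, rejected, round(loss, 3))
```

The loss is small when the reward model already ranks the preferred response higher, equals log 2 (about 0.69) when it cannot tell the two apart, and grows when it ranks them the wrong way round. The PPO step then optimises the language model against the rewards this model produces.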

InstructGPT and LaMDA power the ChatGPT and Bard AI services respectively. Both teams are currently working on reducing toxicity and improving the veracity of responses. In terms of applications, the announced integrations into many platforms and services, together with other generative models such as DALL·E 2 and Imagen (text-to-image models) and MusicLM (music generation from text), will start a new era of unprecedented applications.
