InstructGPT is the successor to the GPT-3 large language model (LLM) developed by OpenAI. InstructGPT incorporates reinforcement learning from human feedback into the GPT model to make it more reliable.
From GPT-3 to InstructGPT
GPT-3 was an incredible turning point for the current AI paradigm, in which machine learning models can be turned into general-purpose engines via an architecture called the transformer.

Along the way, OpenAI figured out a few other components that could be plugged in to make these large language models more effective.
Indeed, in-context learning via prompting and learning from human feedback proved effective additions, which OpenAI used to move from its GPT model to what would later become InstructGPT.
InstructGPT, thus, is the underlying stack that sits beneath ChatGPT. Its core difference from GPT is that InstructGPT uses a human feedback approach in the fine-tuning process: once the GPT model has been pre-trained, humans show it a set of desired outputs through the InstructGPT framework.
In the InstructGPT framework, humans iterate on a much smaller dataset, acting in a few ways.
First, by producing the desired output and comparing it with what GPT generates.
Second, by labeling and ranking the outputs coming from GPT.
Third, by feeding that human feedback back to the GPT model to instruct it toward the desired outcome on narrower tasks and types of questions.
This is how we get (mostly) from the GPT model to the InstructGPT model, which has now become a standard within OpenAI’s technology.
Understanding InstructGPT
InstructGPT is the result of an overhaul of the GPT-3 language model. Responding to user complaints about GPT-3, creator OpenAI made the new and improved model:
- Better at following English instructions.
- Less inclined to spread misinformation (more truthful), and
- Less likely to produce toxic results or those that reflect harmful sentiments.

The problem with GPT-3 arose because it was trained to predict the next word from a large dataset and not to safely perform the task the user wanted. To address the problem, OpenAI used a technique known as reinforcement learning from human feedback (RLHF).
With reinforcement learning, an AI agent learns to make decisions by performing actions in an environment and receiving feedback as rewards or penalties.
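To make this loop concrete, here is a minimal, purely illustrative sketch in Python: a toy agent chooses among three actions, the environment returns a reward, and the agent updates its estimate of each action's value. The environment, payout probabilities, and epsilon-greedy policy are all assumptions for illustration and have nothing to do with OpenAI's actual systems.

```python
# Toy sketch of the reinforcement learning loop: an agent acts, the
# environment returns a reward, and the agent updates its value estimates.
# The 3-armed bandit and its payout probabilities are invented for illustration.
import random

payout_probs = [0.2, 0.5, 0.8]       # hypothetical environment: reward chance per action
value_estimates = [0.0, 0.0, 0.0]    # the agent's learned estimate of each action's value
counts = [0, 0, 0]

for step in range(1000):
    # Epsilon-greedy policy: mostly exploit the best-known action, sometimes explore.
    if random.random() < 0.1:
        action = random.randrange(3)
    else:
        action = value_estimates.index(max(value_estimates))

    # Environment feedback: reward of 1 with the action's payout probability, else 0.
    reward = 1.0 if random.random() < payout_probs[action] else 0.0

    # Incremental update of the agent's value estimate for the chosen action.
    counts[action] += 1
    value_estimates[action] += (reward - value_estimates[action]) / counts[action]

print(value_estimates)  # estimates converge toward the true payout probabilities
```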
OpenAI shared back in 2017 how this process was instrumental in developing safe AI systems. The same methodology also proved effective at making these AI systems far more capable on specific tasks.
To be sure, reinforcement learning from human feedback wasn’t a discovery of OpenAI but an achievement of academia.
Yet what the OpenAI team excelled at was scaling this approach.
Back then, the team at OpenAI trained an agent with around 900 bits of feedback from a human evaluator to teach it to do a backflip.
Of course, that may not seem like a huge achievement for such a simple and narrow task, and yet it was the embryonic stage of what would later become something like ChatGPT.
The three steps in InstructGPT training
The RLHF process can best be described as a 3-step feedback cycle between the person, reinforcement learning, and the model's understanding of the goal.

Figure: the three-step RLHF training process (Source: OpenAI).
To better understand this process, let's explain each step.
Step 1 – Collect human-written demonstration data and train a supervised policy
Once a prompt has been sampled from a dataset, a labeler demonstrates desirable output behavior. Prompts can be submitted by GPT-3 users, and OpenAI researchers also guide labelers with written instructions, informal conversation, and feedback on specific examples where necessary.
Then, the data are used to refine GPT-3 by training supervised learning baselines.
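As a rough sketch of what this supervised fine-tuning step looks like in code, the snippet below fine-tunes GPT-2 (used here as a small, publicly available stand-in for GPT-3) on a single invented prompt-demonstration pair using the standard next-token prediction loss. It is illustrative only and not OpenAI's actual training code.

```python
# Minimal sketch of Step 1: supervised fine-tuning on a human demonstration.
# GPT-2 stands in for GPT-3, whose weights are not public; the
# (prompt, demonstration) pair is invented for illustration.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)

# One hypothetical labeler-written demonstration for a sampled prompt.
prompt = "Explain the moon landing to a 6 year old.\n"
demonstration = "People went to the moon in a rocket and walked on it."

# Standard next-token prediction loss, but computed on prompt + human demonstration.
tokens = tokenizer(prompt + demonstration, return_tensors="pt")
outputs = model(input_ids=tokens["input_ids"], labels=tokens["input_ids"])
outputs.loss.backward()
optimizer.step()
optimizer.zero_grad()
```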
Step 2 – Collect comparison data and train the reward model
Next, a dataset of human-labeled comparisons between two outputs on a larger set of prompts is collected. Several model outputs are sampled from a prompt, and the labeler ranks each output from best to worst.
The reward model (RM) is then trained on this dataset to predict which output OpenAI's labelers prefer.
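A minimal sketch of how such a reward model can be trained on one comparison is shown below. The toy bag-of-words scorer and the example outputs are assumptions; the key idea is the pairwise loss -log(sigmoid(r_preferred - r_rejected)), which pushes the preferred output's score above the rejected one's.

```python
# Minimal sketch of Step 2: training a reward model on a human comparison.
# A toy bag-of-words scorer stands in for a language model with a scalar head.
import torch
import torch.nn.functional as F

vocab = {"moon": 0, "rocket": 1, "walked": 2, "cheese": 3, "aliens": 4}

def featurize(text: str) -> torch.Tensor:
    # Toy bag-of-words featurizer standing in for a transformer encoder.
    vec = torch.zeros(len(vocab))
    for word in text.lower().split():
        if word in vocab:
            vec[vocab[word]] += 1.0
    return vec

reward_head = torch.nn.Linear(len(vocab), 1)  # maps features to a scalar reward
optimizer = torch.optim.AdamW(reward_head.parameters(), lr=1e-3)

# One hypothetical comparison: labelers preferred output A over output B.
preferred = "People went to the moon in a rocket and walked on it"
rejected = "The moon is made of cheese and guarded by aliens"

r_preferred = reward_head(featurize(preferred))
r_rejected = reward_head(featurize(rejected))

# Pairwise ranking loss encoding the labeler's preference.
loss = -F.logsigmoid(r_preferred - r_rejected).mean()
loss.backward()
optimizer.step()
```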
Step 3 – Use the reward model as a reward function to fine-tune the GPT-3 policy
In step three, a new prompt is sampled from the dataset, the policy generates an output, and the reward model calculates a reward for it. That reward is then maximized with OpenAI's Proximal Policy Optimization (PPO) algorithm.
The result is that InstructGPT is much better at following instructions.
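The sketch below illustrates the core of this step: a PPO-style clipped objective computed from placeholder per-token log-probabilities and a reward-model score, with a KL penalty against the supervised model folded into the reward to keep the policy from drifting too far. All tensor values and coefficients are invented for illustration; this is not OpenAI's implementation.

```python
# Minimal sketch of Step 3: PPO's clipped objective pushing the policy toward
# outputs the reward model scores highly. All numbers below are placeholders.
import torch

eps = 0.2  # PPO clipping range (a common default, assumed here)

# Placeholder per-token log-probabilities for one generated response.
logprobs_new = torch.tensor([-1.1, -0.7, -2.0], requires_grad=True)  # current policy
logprobs_old = torch.tensor([-1.0, -0.9, -1.8])                      # policy that sampled the output
logprobs_sft = torch.tensor([-1.0, -0.8, -1.9])                      # supervised (Step 1) baseline
reward_model_score = torch.tensor(0.9)                               # scalar reward from the RM

# Reward = RM score minus a KL penalty that discourages drifting from the supervised model.
kl_penalty = 0.1 * (logprobs_new.detach() - logprobs_sft).sum()
advantage = reward_model_score - kl_penalty  # crude stand-in for a proper advantage estimate

# PPO clipped surrogate objective.
ratio = torch.exp(logprobs_new - logprobs_old)
unclipped = ratio * advantage
clipped = torch.clamp(ratio, 1 - eps, 1 + eps) * advantage
loss = -torch.min(unclipped, clipped).mean()
loss.backward()  # gradients would update the policy's parameters
```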
InstructGPT vs. GPT-3
InstructGPT is the model of choice for OpenAI's labelers despite having more than 100x fewer parameters than the GPT-3 model on which it is based.
The company also noted that "at the same time, we show that we don't have to compromise on GPT-3's capabilities, as measured by our model's performance on academic NLP evaluations."
InstructGPT models were in beta mode on the API for over twelve months and are now its default language models. Moving forward, OpenAI believes that model refinement with humans in the loop is the most effective way to improve reliability and safety.
Key takeaways
- InstructGPT is the successor to the GPT-3 large language model (LLM) developed by OpenAI. It was developed in response to user complaints about the toxic or harmful results generated by GPT-3.
- To address the problem, OpenAI used a technique known as reinforcement learning from human feedback (RLHF). The process is best described as a 3-step feedback cycle between a human, reinforcement learning, and the model's understanding of the goal.
- It is worth noting that InstructGPT is the model of choice for OpenAI's labelers despite having more than 100x fewer parameters than GPT-3.
Connected AI Concepts
- Deep Learning vs. Machine Learning
- OpenAI Organizational Structure
- Stability AI Ecosystem