Proximal Policy Optimization (PPO): The Key to LLM Alignment
Modern policy gradient algorithms and their application to language models…

Recent AI research has revealed that reinforcement learning (RL) — reinforcement learning from human feedback (RLHF) in particular — is a key component of training large language models (LLMs). However, many AI practitioners (admittedly) avoid the use of RL due to several factors, including a lack of familiarity with RL or a preference for supervised learning techniques. There are valid arguments against the use of RL; e.g., the curation of human preference data is expensive and RL can be data inefficient. However, we should not avoid using RL simply due to a lack of understanding or familiarity! These techniques are not difficult to grasp and, as shown by a variety of recent papers, can massively benefit LLM performance.
This overview is part three in a series that aims to demystify RL and how it is used to train LLMs. Although we have mostly covered fundamental ideas related to RL up until this point, we will now dive into the algorithm that lays the foundation for language model alignment — Proximal Policy Optimization (PPO) [2]. As we will see, PPO works well and is incredibly easy to understand and use, making it a desirable algorithm from a practical perspective. For these reasons, PPO was the algorithm originally selected for the implementation of RLHF used by OpenAI to align InstructGPT [6]. Shortly after, the popularization of InstructGPT’s sister model — ChatGPT — led both RLHF and PPO to become highly popular.
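Before diving into the details, it helps to see the core idea of PPO in code. The sketch below implements the per-sample clipped surrogate objective from the PPO paper [2]; the function name and the scalar (single-sample) formulation are illustrative choices of mine, not part of any library.

```python
# A minimal sketch of PPO's clipped surrogate objective.
# Illustrative only: function name and scalar formulation are assumptions.

def ppo_clipped_objective(ratio: float, advantage: float, eps: float = 0.2) -> float:
    """Per-sample PPO objective: min(r * A, clip(r, 1 - eps, 1 + eps) * A).

    ratio:     pi_new(a|s) / pi_old(a|s), the policy probability ratio
    advantage: estimated advantage A(s, a) for the sampled action
    eps:       clipping range limiting how far the policy can move per update
    """
    clipped_ratio = max(1.0 - eps, min(1.0 + eps, ratio))
    # Taking the min makes the objective a pessimistic (lower) bound,
    # removing any incentive to push the ratio outside [1 - eps, 1 + eps].
    return min(ratio * advantage, clipped_ratio * advantage)

# Ratio within the clipping range: clipping has no effect.
print(ppo_clipped_objective(1.1, 2.0))  # 2.2
# Large ratio with positive advantage: the update is capped at 1.2 * 2.0.
print(ppo_clipped_objective(2.0, 2.0))  # 2.4
```

In practice this objective is averaged over a batch and maximized with gradient ascent, but the clipping logic — the piece that makes PPO "proximal" — is exactly this simple.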
Background Information
In this series, we are currently learning about reinforcement learning (RL) fundamentals with the goal of understanding the mechanics of language model alignment. More specifically, we want to learn exactly how reinforcement learning from human feedback (RLHF) works. Given that many AI practitioners tend to avoid RL because they are more familiar with supervised learning, deeply understanding RLHF will add a new tool to any practitioner’s belt. Plus, research has demonstrated that RLHF is a pivotal aspect of the alignment process [8] — just using supervised fine-tuning (SFT) is not enough; see below.