LLM APIs: Completion and Chat-Completion
Some subtleties on Completion and Chat-Completion APIs, especially for Local LLMs
Transformer-based language models are fundamentally next-token predictors, so naturally all LLM APIs today provide at least a completion endpoint. If an LLM is a next-token predictor, how could it possibly be used to generate a response to a question or instruction, or to engage in a conversation with a human user? This is where the idea of "chat-completion" comes in. This post is a refresher on chat-completion, along with some interesting details on how it is implemented in practice.
Language Models as Next-token Predictors
A Language Model is essentially a "next-token prediction" model, and so all LLMs today provide a "completion" endpoint, typically at /completions under the base URL. The endpoint simply takes a prompt and returns a completion (i.e., a continuation).
A typical prompt sent to a completion endpoint might look like this:
The capital of Belgium is
and the LLM will return a completion like this:
Brussels.
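For concreteness, here is a minimal sketch of calling such a completion endpoint, assuming an OpenAI-compatible server; the base URL and model name below are placeholders, not part of any particular service:

```python
import requests

# Placeholder base URL for an OpenAI-compatible server (e.g., a local model server).
BASE_URL = "http://localhost:8000/v1"

response = requests.post(
    f"{BASE_URL}/completions",
    json={
        "model": "my-local-model",  # placeholder model name
        "prompt": "The capital of Belgium is",
        "max_tokens": 10,
        "temperature": 0,
    },
)
# OpenAI-compatible servers return the continuation under choices[0]["text"].
print(response.json()["choices"][0]["text"])  # e.g. " Brussels."
```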
But interacting with a completion endpoint is not very natural or useful: you cannot give instructions or ask questions; instead, you would always need to formulate your input as a prompt whose natural continuation is your desired output. For example, if you wanted the LLM to identify all proper nouns in a sentence, you would format it as the following prompt:
Chat-To-Prompt Example: Chat/Instruction converted to a completion prompt.
User: Here is a sentence; the Assistant's task is to identify all proper nouns.
"Jack lives in Bosnia, and Jill lives in Belgium."
Assistant:
The natural continuation of this prompt would be a response listing the proper nouns, something like:
Jack, Bosnia, Jill, Belgium are all proper nouns.
This seems sensible in theory, but a "base" LLM that performs well on completions may not perform well on these kinds of prompts. The reason is that during its training, it may not have been exposed to very many examples of this type of prompt-response pair. So how can an LLM be improved to perform well on these kinds of prompts?
Instruction-tuned, Aligned LLMs
This brings us to the heart of the innovation behind the wildly popular ChatGPT: it uses an enhancement of GPT-3 that was explicitly fine-tuned on instructions (and dialogs more generally); this is referred to as instruction fine-tuning, or IFT for short. In addition to fine-tuning on instructions/dialogs, the models behind ChatGPT (i.e., GPT-3.5-Turbo and GPT-4) are further tuned to produce responses that align with human preferences (i.e., responses that are helpful and safe), using a procedure called Reinforcement Learning from Human Feedback (RLHF). See this OpenAI InstructGPT Paper for details on these techniques and references to the original papers that introduced these ideas. Another recommended read is Sebastian Raschka's post on RLHF and related techniques.
For convenience, we refer to the combination of IFT and RLHF as chat-tuning. A chat-tuned LLM can be expected to perform well on prompts such as the one in the Chat-To-Prompt Example above. These types of prompts are still unnatural, however, so as a convenience, some API servers for chat-tuned LLMs also provide a "chat-completion" endpoint (typically /chat/completions under the base URL), which allows the user to interact with the model in a natural dialog, which might look like this (the portions in square brackets indicate who is generating the text):
[User] What is the capital of Belgium?
[Assistant] The capital of Belgium is Brussels.
or
[User] In the text below, find all proper nouns:
Jack lives in Bosnia, and Jill lives in Belgium.
[Assistant] Jack, Bosnia, Jill, Belgium are all proper nouns.
[User] Where does Jack live?
[Assistant] Jack lives in Bosnia.
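Under an OpenAI-style chat-completion API, an interaction like this is expressed as a list of role-tagged messages rather than a raw prompt string. A minimal sketch, again assuming an OpenAI-compatible server with placeholder base URL and model name:

```python
import requests

# Placeholder base URL and model name for an OpenAI-compatible server.
BASE_URL = "http://localhost:8000/v1"

response = requests.post(
    f"{BASE_URL}/chat/completions",
    json={
        "model": "my-local-model",
        "messages": [
            {"role": "user", "content": "What is the capital of Belgium?"},
        ],
        "temperature": 0,
    },
)
# The reply comes back as a structured assistant message, not a raw text continuation.
print(response.json()["choices"][0]["message"]["content"])
```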
Chat Completion Endpoints: under the hood
How could this work, given that LLMs are fundamentally next-token predictors? This is a convenience provided by the LLM API service (e.g., from OpenAI or local model server libraries): when a user submits a chat to the chat-completion endpoint, under the hood the server converts the instructions and multi-turn chat history into a single string, with annotations indicating user and assistant turns, and ending with something like Assistant: as in the Chat-To-Prompt Example above.
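To make this concrete, here is a simplified, hypothetical sketch of that conversion; it is not the exact logic of any particular server, and as discussed next, real servers should use model-specific templates:

```python
def chat_to_prompt(messages: list[dict]) -> str:
    """Naively flatten a chat history into a single completion prompt.

    Illustrative only; real servers should use the model's own prompt template.
    """
    role_labels = {"system": "", "user": "User: ", "assistant": "Assistant: "}
    lines = [role_labels[m["role"]] + m["content"] for m in messages]
    # End with the assistant marker so the model's continuation is the reply.
    lines.append("Assistant:")
    return "\n".join(lines)


print(chat_to_prompt([{"role": "user", "content": "What is the capital of Belgium?"}]))
# User: What is the capital of Belgium?
# Assistant:
```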
Now the subtle detail to note here is this:
It matters how the dialog (instructions plus chat history) is converted into a single prompt string. Converting to a single prompt by simply concatenating the instructions and chat history using an "intuitive" format (e.g., indicating user and assistant turns with User: and Assistant: prefixes) can work; however, most local LLMs are trained on a _specific_ prompt format. So if we format chats in a different way, we may get odd or inferior results.
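For example, Llama-2 chat models expect a prompt format roughly like the one sketched below (shown for a single user turn with a system message); the exact template should always be taken from the model's documentation:

```python
def llama2_prompt(system: str, user: str) -> str:
    """Approximate Llama-2 chat format for a single user turn.

    The exact template (special tokens, spacing) should be taken from the model's
    documentation; this sketch only shows how different it is from a naive
    "User:/Assistant:" concatenation.
    """
    return f"[INST] <<SYS>>\n{system}\n<</SYS>>\n\n{user} [/INST]"


print(llama2_prompt("You are a helpful assistant.", "What is the capital of Belgium?"))
```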
This means that if an LLM server library wants to provide a chat-completion endpoint for a local model, it needs to convert the chat history to a single prompt using the specific formatting rules of the model. For example, the oobabooga/text-generation-webui library has an extensive set of chat formatting templates for a variety of models, and its model server auto-detects the format template from the model name.
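The idea behind such auto-detection is simple: match the model name against known patterns and pick the corresponding template. The sketch below is purely illustrative and is not the actual detection logic of oobabooga/text-generation-webui or any other library:

```python
# Hypothetical name-to-template lookup; real libraries maintain much larger tables.
TEMPLATE_PATTERNS = {
    "llama-2": "llama2-chat",
    "vicuna": "vicuna-v1.1",
    "mistral": "mistral-instruct",
}


def detect_template(model_name: str) -> str:
    name = model_name.lower()
    for pattern, template in TEMPLATE_PATTERNS.items():
        if pattern in name:
            return template
    return "generic-user-assistant"  # fall back to a naive format


print(detect_template("TheBloke/Llama-2-7B-Chat-GGUF"))  # -> "llama2-chat"
```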
There are more interesting details on prompt formatting, which we explore in the full article on the Langroid blog. Langroid is an open-source, agent-oriented LLM application framework, and support for locally-running LLMs was recently added.