Why Does ChatGPT Not Give the Answer All at Once?

1. Introduction

In this tutorial, we’ll explain why ChatGPT generates its answer piece by piece instead of giving the whole response at once. We’ll briefly introduce ChatGPT and then look at how it produces output and how much computation that takes.

2. What Is ChatGPT in Simple Terms?

ChatGPT is a chat application that can understand and generate human-like text based on the input it receives. It’s built on the GPT (generative pre-trained transformer) architecture, which is a neural network with an attention mechanism. As a chatbot, it interacts with users, responding to a variety of questions and offering information across various domains.

ChatGPT can understand more than 95 natural languages. In addition, it has a basic understanding of programming code, enabling it to comprehend and generate code for certain tasks. The current free version of ChatGPT works only with text, whereas the paid version, based on GPT-4, can accept a prompt containing both text and images. This means that GPT-4 can also understand and interpret information from images.

3. Why Does ChatGPT Not Give the Answer All at Once?

To answer this question, we need to get familiar with how ChatGPT works. We have a separate article about that, and here, we’ll cover only the parts that are useful for answering our question.

First of all, since ChatGPT has a neural network in the background, it works with numerical vectors and matrices. It means that the input text prompt is converted first into tokens and then into a numerical matrix, the computation of the neural network is performed, and the output of the neural network is converted back into text. The answer to our question lies in the nature of the GPT architecture itself and the way it generates the output.
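As a rough illustration of the text-to-numbers round trip described above, here is a minimal sketch in Python. Everything in it is an assumption made for the example: the tiny vocabulary, the whitespace tokenizer, and the two-dimensional embedding values are invented, and real GPT models use subword tokenizers with tens of thousands of tokens and learned embeddings.

```python
# Hypothetical toy vocabulary; index 0 is reserved for unknown tokens.
VOCAB = ["<unk>", "why", "is", "the", "sky", "blue", "?"]
TOKEN_TO_ID = {tok: i for i, tok in enumerate(VOCAB)}

# Made-up embedding table: one small numeric vector per token ID.
EMBEDDINGS = [[float(i), float(i) * 0.5] for i in range(len(VOCAB))]

def encode(text):
    """Split on whitespace and map each token to its integer ID."""
    return [TOKEN_TO_ID.get(tok, 0) for tok in text.lower().split()]

def embed(ids):
    """Look up one vector per token ID, giving a numerical matrix."""
    return [EMBEDDINGS[i] for i in ids]

def decode(ids):
    """Map token IDs back to text."""
    return " ".join(VOCAB[i] for i in ids)

ids = encode("Why is the sky blue ?")
matrix = embed(ids)   # one row per token; this is what the network computes on
text = decode(ids)    # converting IDs back into readable text
```

The network itself is left out here; the sketch only shows the conversions that happen before and after it.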

3.1. How Does ChatGPT Generate an Answer?

The last layer in the GPT neural network has a softmax function. The softmax function produces a probability distribution for the next token, and based on this distribution, the next token is selected.

It means that after all the computations, ChatGPT outputs a single token. The sampling of this token is performed from the probability distribution generated by the softmax function.
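To make the sampling step more concrete, the following is a minimal sketch of softmax followed by sampling, using only the Python standard library. The four-token vocabulary and the logit values are invented for illustration and don't come from any real model.

```python
import math
import random

def softmax(logits):
    """Turn raw scores (logits) into probabilities that sum to 1."""
    peak = max(logits)  # subtracting the max keeps exp() numerically stable
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def sample_token(vocab, logits, rng):
    """Draw one token at random, weighted by its softmax probability."""
    probs = softmax(logits)
    return rng.choices(vocab, weights=probs, k=1)[0]

# Hypothetical vocabulary and made-up logits for the next position.
vocab = ["cat", "dog", "the", "<end>"]
logits = [2.0, 1.0, 0.5, -1.0]

probs = softmax(logits)
token = sample_token(vocab, logits, random.Random(0))
```

Because the token is sampled rather than always taken as the most probable one, the same prompt can lead to different answers on different runs.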

A single token represents a word, subword, punctuation mark, or character. A ChatGPT response will, in most cases, consist of multiple such tokens. This means that, in addition to being probabilistic, ChatGPT is also an autoregressive system. It generates an answer token by token, based on the input prompt together with all the tokens generated so far:

[Figure: ChatGPT token generation]

Basically, after the first iteration, the output token is incorporated back into the input prompt, and this cycle continues until the output token is equal to the end token.
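The feedback loop described above can be sketched as a short Python generator. This is only a toy under stated assumptions: the TOY_MODEL table is a hypothetical stand-in for the neural network, its probabilities are made up, and it looks only at the last token, whereas a real GPT model conditions on the whole context.

```python
import random

END = "<end>"

# Hypothetical stand-in for the network: for each last token, a made-up
# probability distribution over the next token.
TOY_MODEL = {
    "<start>": {"the": 1.0},
    "the": {"cat": 0.6, "dog": 0.4},
    "cat": {"sleeps": 0.7, END: 0.3},
    "dog": {"sleeps": 0.5, END: 0.5},
    "sleeps": {END: 1.0},
}

def next_token(context, rng):
    """Sample one token given the context (the toy uses only the last token)."""
    dist = TOY_MODEL[context[-1]]
    return rng.choices(list(dist), weights=list(dist.values()), k=1)[0]

def generate(prompt, rng, max_tokens=20):
    """Yield tokens one at a time, feeding each one back into the context."""
    context = list(prompt)
    for _ in range(max_tokens):
        token = next_token(context, rng)
        if token == END:       # stop when the end token is produced
            break
        context.append(token)  # the output becomes part of the next input
        yield token

answer = list(generate(["<start>"], random.Random(0)))
```

Since generate yields each token as soon as it is sampled, a caller can display the partial answer immediately, which is the same reason a chat interface can show text while the rest is still being computed.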

3.2. How Much Processing Power Does ChatGPT Need?

In addition to being autoregressive, ChatGPT is a large language model (LLM) with around 175 billion parameters, so it requires a lot of computing power to run.

According to Professor Tom Goldstein, we need at least five A100 GPUs with 80GB of VRAM each to load the ChatGPT model and text. We know that a 3-billion-parameter LLM needs around 6ms to generate a single token on an A100 GPU. If we scale that up linearly to the size of ChatGPT, which is roughly 58 times larger, it will take around 350ms for a single token on an A100 GPU.

A likely choice for ChatGPT on the Azure cloud would be an 8-GPU server. Now, if we assume that this server is 8 times faster than a single GPU, we get around 44ms per token. That’s roughly 23 tokens, or about 17 words, per second.
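The estimate above is simple enough to reproduce in a few lines of Python. The sketch below assumes linear scaling with parameter count and a perfect 8x speedup, as the text does; the 0.75 words-per-token ratio is a rule of thumb we assume here to convert tokens to words, not a measured value.

```python
def scaled_ms_per_token(base_ms, base_params_b, target_params_b):
    """Scale per-token latency linearly with the number of parameters."""
    return base_ms * target_params_b / base_params_b

# 6 ms per token for a 3B-parameter model, scaled to 175B parameters.
single_gpu_ms = scaled_ms_per_token(6.0, 3.0, 175.0)

# Assumed ideal speedup on an 8-GPU server.
server_ms = single_gpu_ms / 8

tokens_per_second = 1000 / server_ms

# Assumed rule of thumb: one token is about 0.75 English words.
words_per_second = tokens_per_second * 0.75

print(f"{single_gpu_ms:.0f} ms/token on one GPU")
print(f"{server_ms:.2f} ms/token on the server")
print(f"{tokens_per_second:.1f} tokens/s, {words_per_second:.1f} words/s")
```

Real deployments won't scale this cleanly, so these numbers are best read as an order-of-magnitude estimate.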

This estimate also leaves out many factors, such as the total number of requests (probably a few million daily), how the load is distributed and optimized during periods of high and low volume, and the utilization, availability, and stability of resources.

It’s likely that ChatGPT operates on a limited number of servers. When there’s a high demand, servers are overloaded, so an additional delay is added.

4. Conclusion

In this article, we’ve explained why ChatGPT doesn’t give its answer all at once. In one sentence, ChatGPT generates the answer token by token, and this process demands significant computational power and takes a certain amount of time to complete.
