1. Introduction

In this tutorial, we’ll analyze some of the best large language models currently available. We’ll describe several of the most popular models, highlighting their features, strengths, and weaknesses. We’ll focus exclusively on generative systems based on LLMs, since comparing models built for different purposes wouldn’t be meaningful.

By the end of this article, readers should have a clearer idea of which LLM might best suit their needs.

2. What Are Large Language Models?

Large Language Models (LLMs) are advanced artificial intelligence systems that understand and generate human-like text. They are trained on vast amounts of text data to learn patterns, structures, and nuances of language.

LLMs process and generate text using deep learning techniques, particularly neural network architectures such as the Transformer. These models have anywhere from a few hundred million to more than a trillion parameters, which is why we call them large.
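To make this concrete, here’s a minimal sketch of text generation with a Transformer-based model using the Hugging Face transformers library. The tiny gpt2 model is used only as a lightweight stand-in for a much larger modern LLM:

```python
# A minimal sketch of text generation with a Transformer-based language model,
# using the Hugging Face transformers library.
# gpt2 is a small, older model used here purely as an illustrative stand-in.
from transformers import pipeline

generator = pipeline("text-generation", model="gpt2")

prompt = "Large language models are"
outputs = generator(prompt, max_new_tokens=30, num_return_sequences=1)

print(outputs[0]["generated_text"])
```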

2.1. What Are the Practical Applications of Large Language Models?

LLMs have a wide range of applications across various industries and domains. Some of them include:

  • Chatbots – LLMs can converse with users, answer questions, provide customer support, and assist with diverse tasks.
  • Language translation – LLMs are very effective at translating text between different languages with high accuracy.
  • Text summarization – these models can condense long text documents into concise summaries while preserving key information and meaning.
  • Knowledge extraction and discovery – in addition to summarizing text, LLMs analyze lengthy documents to extract valuable information and insights. They can also brainstorm and act as collaborative partners for problem-solving tasks.
  • Code generation – LLMs can generate code snippets or assist developers in writing software by understanding natural language descriptions of programming tasks.

3. Analysis of Top Large Language Models

In this article, we’ll describe some of the most popular LLM systems currently available. We won’t explain the algorithms behind these models, as many aren’t open-source and little detailed information about their architecture is publicly available. Our focus will be on LLMs as accessible platforms primarily used as chatbots for various purposes.

To select the best models, we’ll use the current ranking on the LMSYS Chatbot Arena Leaderboard, a crowdsourced open platform for evaluating LLMs. It collects human preference votes and ranks models based on their performance using an Elo rating system.

Users can participate in the ranking process by evaluating and voting on the performance of different LLMs. After the user enters a prompt, the system randomly selects two models, both of which process the prompt and respond anonymously. The user can then vote on which model performed better.
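Conceptually, each vote updates the two models’ ratings much like a chess match updates player ratings. The sketch below shows the classic Elo update formula for a single head-to-head vote; it illustrates the idea, not LMSYS’s exact implementation:

```python
# A sketch of a single Elo update after one head-to-head vote.
# This is the textbook Elo formula, not LMSYS's exact implementation.
def elo_update(rating_a: float, rating_b: float, a_won: bool, k: float = 32.0):
    # Expected score of model A against model B
    expected_a = 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))
    score_a = 1.0 if a_won else 0.0
    new_a = rating_a + k * (score_a - expected_a)
    new_b = rating_b + k * ((1.0 - score_a) - (1.0 - expected_a))
    return new_a, new_b

# Example: two models start at 1000, and model A wins the vote
print(elo_update(1000.0, 1000.0, a_won=True))  # (1016.0, 984.0)
```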

Only after voting does the system reveal the names of the models:

Large Language Models - LLM Arena

The world of LLMs is highly competitive, with many new models appearing every month. As a result, the field remains dynamic and constantly changing.

Here, we’ll present some of the most powerful models currently available. However, this list may not always be up-to-date, as new models, updates, or patches may appear every week. Still, certain model families and platforms will likely maintain their leading positions over a longer period.

3.1. GPT by OpenAI

OpenAI is a leading artificial intelligence research laboratory that aims to develop and promote user-friendly AI systems. One of its notable creations is ChatGPT, a pioneering LLM based on the GPT architecture, designed to engage in human-like conversations and assist users with various tasks. ChatGPT made history as the fastest-growing app upon its release, attracting over 100 million monthly users within just two months and surpassing popular platforms like TikTok and Instagram.

The most powerful LLMs by OpenAI are listed below, followed by a short usage sketch:

  • gpt-4-turbo-2024-04-09 – GPT-4 Turbo with Vision model. Vision requests can now use JSON mode and function calling. It has a context window of 128k tokens and returns a maximum of 4,096 output tokens. It’s trained with data up to December 2023
  • gpt-4-1106-preview – GPT-4 Turbo preview model featuring improved instruction following, JSON mode, reproducible outputs, parallel function calling, and more. It has a context window of 128k tokens, returns a maximum of 4,096 output tokens and is trained with data up to April 2023
  • gpt-4-0125-preview – GPT-4 Turbo preview model intended to reduce cases of “laziness” where the model doesn’t complete a task. It has a context window of 128k tokens and returns a maximum of 4,096 output tokens. It’s trained with data up to December 2023
  • gpt-4-0613 – Snapshot of GPT-4 from June 13th, 2023, with improved function calling support. It’s the recommended replacement for the retired model gpt-4-0314. It has a context window of 8,192 tokens and is trained with data up to September 2021
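As a usage illustration, here’s a minimal sketch of calling one of these snapshots through the official openai Python SDK (v1 interface); the prompt is made up, and max_tokens simply caps the output well below the 4,096-token limit mentioned above:

```python
# A minimal sketch of calling a GPT-4 Turbo snapshot via the official openai
# Python SDK. Assumes the OPENAI_API_KEY environment variable is set.
from openai import OpenAI

client = OpenAI()

response = client.chat.completions.create(
    model="gpt-4-turbo-2024-04-09",  # snapshot name from the list above
    messages=[
        {"role": "system", "content": "You are a concise assistant."},
        {"role": "user", "content": "Explain what a context window is in two sentences."},
    ],
    max_tokens=256,  # output cap; these models return at most 4,096 output tokens
)

print(response.choices[0].message.content)
```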

3.2. Claude by Anthropic

Anthropic is an artificial intelligence startup founded by former members of OpenAI in 2021. Since then, it has raised funding from many VC funds and corporations, including Amazon and Google. Anthropic focuses on creating reliable AI systems, with a strong emphasis on AI safety and ethical considerations. These models are available on claude.ai and the Claude API, accessible in over 150 countries.

The most powerful LLMs by Anthropic are:

  • Claude 3 Opus – the most intelligent model by Anthropic, and as powerful as the gpt-4-turbo-2024-04-09 and gpt-4-1106-preview models according to the LMSYS ranking. It can process a wide range of visual formats, including photos, charts, graphs, and technical diagrams. It supports 200k tokens in one input, and for some customers, it can go up to 1 million tokens
  • Claude 3 Sonnet – slightly weaker than Opus but still in the top 5 by the LMSYS ranking. It strikes an ideal balance between intelligence and speed, particularly for enterprise workloads. It’s also more affordable than other models with similar intelligence
  • Claude 3 Haiku – the fastest model from the Claude family and in the top 10 by the LMSYS ranking. It supports 200k input tokens like the other Claude models and is ideal for cost-sensitive tasks

3.3. Gemini by Google

Gemini is a family of LLMs created by Google DeepMind. These LLMs are multimodal, which means the models can process information from multiple modalities, including text, images, audio, and video. Gemini can tackle many interesting problems. One of them is reasoning across different modalities over very long contexts, such as an entire movie: long-context understanding of a whole movie is an experimental capability that Google researchers tested with Gemini 1.5 Pro.

The most powerful LLMs by Google are:

  • Gemini Ultra – the most capable and largest model for highly complex tasks. It doesn’t have an LMSYS rank for unknown reasons. Google states that this is the first model to outperform human experts on the MMLU benchmark and that it outperforms GPT-4 on most common LLM benchmarks
  • Gemini Pro 1.0 – among the top five models by the LMSYS ranking. It’s available online as the default model in the Gemini web app

3.4. Mistral by Mistral AI

Mistral AI is a French company founded in April 2023 by former Meta and Google DeepMind employees. It produces open-source LLMs, emphasizing the importance of open-source software as a response to proprietary models.

The most powerful LLMs by Mistral AI are:

  • Mistral Large – Top-tier reasoning for high-complexity tasks. One of the best LLMs currently available
  • Mixtral 8x22B Instruct – one of the most powerful open-source models. It has a 64k context window, and it’s fluent in English, French, Italian, German, and Spanish and strong in code

3.5. Llama by Meta

Llama (Large Language Model Meta AI) is a family of autoregressive LLMs released by Meta AI starting in February 2023. Meta released all models as open source, with weights available online, which makes them very popular in the community. Llama models are trained on a wide variety of datasets, including web pages, open-source GitHub repositories, Wikipedia in 20 different languages, public domain books, LaTeX source code from ArXiv papers, and Stack Exchange question-answer pages.

The most powerful LLMs by Meta are:

  • Llama 3 70b Instruct – the most powerful open-source model, currently ranked in the top 5 in the LMSYS arena. It has 70 billion parameters and a context window of 8k tokens
  • Llama 3 8b Instruct – a smaller but still powerful Llama model with 8 billion parameters

4. Comparison of the Top Large Language Models

For the comparison, we’ll use the LMSYS ranking and some common LLM benchmarks reported by the companies. The LMSYS ranking is dynamic, and the numbers change daily. Because of that, we’ll use “top 5,” “top 10,” and “top 15” categories as a measurement.

Some of the common LLM benchmarks will include:

  • Massive Multitask Language Understanding (MMLU) – MMLU serves as a standardized way to assess AI performance on tasks ranging from simple math to complex legal reasoning. It covers 57 subjects across STEM, the humanities, the social sciences, and more, ranging in difficulty from an elementary level to an advanced professional level
  • HellaSwag – a benchmark designed to evaluate the commonsense reasoning abilities of language models. The dataset consists of sentences, each followed by several candidate endings, and the model must pick the most plausible continuation based on context and commonsense reasoning
  • MATH – includes a dataset of 12,500 mathematics problems covering various subjects, including algebra, calculus, statistics, geometry, and linear algebra
  • HumanEval – a benchmark designed to evaluate the functional correctness of code generated by LLMs. It measures the probability that generated code passes a set of unit tests (a minimal sketch of this idea follows the list)
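To illustrate the idea behind HumanEval, the toy sketch below executes a hypothetical model-generated function and checks it against a couple of unit tests. The real harness runs generated code in a sandbox and reports pass@k over many samples:

```python
# A toy illustration of HumanEval-style functional correctness:
# execute model-generated code and check it against unit tests.
# The real benchmark sandboxes execution and reports pass@k over many samples.

generated_code = """
def add(a, b):
    return a + b
"""

def passes_unit_tests(code: str) -> bool:
    namespace = {}
    try:
        exec(code, namespace)                    # "run" the generated solution
        assert namespace["add"](2, 3) == 5       # unit test 1
        assert namespace["add"](-1, 1) == 0      # unit test 2
        return True
    except Exception:
        return False

print(passes_unit_tests(generated_code))  # True if the generated code is correct
```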

In addition to these benchmarks, different prompting techniques are used during evaluation. Most commonly, we can distinguish:

  • 0-shot – we ask a question without giving the model any examples
  • 1-shot – we provide the model with a single solved example. For instance, “Using Example 1 as a reference, answer Question 1”
  • k-shot – the same as 1-shot but using k examples (see the sketch after this list)
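To make the difference tangible, here’s a small illustrative sketch that builds a 0-shot and a 1-shot prompt for a made-up sentiment-classification task; a k-shot prompt would simply prepend k solved examples:

```python
# Illustrative 0-shot vs. 1-shot prompts.
# The task and examples are made up purely for demonstration.

zero_shot = (
    "Classify the sentiment of this review as positive or negative.\n"
    "Review: The battery died after two days.\n"
    "Sentiment:"
)

one_shot = (
    "Classify the sentiment of each review as positive or negative.\n"
    "Review: I love how light this laptop is.\nSentiment: positive\n"  # the single solved example
    "Review: The battery died after two days.\n"
    "Sentiment:"
)

# A k-shot prompt simply prepends k such solved examples before the new query.
print(zero_shot)
print(one_shot)
```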

The table below compares the presented models:

| Model | Input context window | Maximum output tokens | Release date (m-d-y) | Price per 1M input tokens | Price per 1M output tokens | LMSYS rank | MMLU (5-shot) | HellaSwag (10-shot) | MATH (4-shot) | HumanEval (0-shot) |
|---|---|---|---|---|---|---|---|---|---|---|
| gpt-4-turbo-2024-04-09 | 128k | 4,096 | 04-09-2024 | $10 | $30 | Top 5 | – | – | – | – |
| gpt-4-1106-preview | 128k | 4,096 | 11-06-2023 | $10 | $30 | Top 5 | – | – | – | – |
| gpt-4-0125-preview | 128k | 4,096 | 01-25-2024 | $10 | $30 | Top 5 | – | – | – | – |
| gpt-4-0613 | 8,192 | 8,192 | 06-13-2023 | $30 | $60 | Top 15 | – | – | – | – |
| Claude 3 Opus | 200k | 4,096 | 04-03-2024 | $15 | $75 | Top 5 | 86.8% | 95.4% | 61.0% | 84.9% |
| Claude 3 Sonnet | 200k | 4,096 | 04-03-2024 | $3 | $15 | Top 5 | 79.0% | 89.0% | 40.5% | 73.0% |
| Claude 3 Haiku | 200k | 4,096 | 04-13-2024 | $0.25 | $1.25 | Top 10 | 75.2% | 85.9% | 40.9% | 75.9% |
| Gemini Ultra | 32.8k | 8,192 | – | – | – | – | 83.7% | 87.8% | 53.2% | 74.4% |
| Gemini Pro 1.0 | 32.8k | 8,192 | 12-13-2023 | $0.13 | $0.38 | Top 5 | 71.8% | 84.7% | 32.6% | 67.7% |
| Mistral Large | 32k | 4,096 | 02-26-2024 | $8 | $8 | Top 15 | 81.2% | 89.2% | – | 45.1% |
| Mixtral 8x22B Instruct | 64k | – | 04-17-2024 | open-source | open-source | Top 15 | 77.75% | 88.5% | – | 45.1% |
| Llama 3 70b Instruct | 8k | 8k | 04-18-2024 | open-source | open-source | Top 5 | 82.0% | – | 50.4% | 81.7% |
| Llama 3 8b Instruct | 8k | 8k | 04-18-2024 | open-source | open-source | Top 15 | 68.4% | – | 30.0% | 62.2% |

A dash (–) marks values that aren’t publicly reported or don’t apply.

Notice that the GPT-4 models in this table don’t have values for the common benchmarks, even though many papers from other LLM vendors compare their results against GPT-4. That’s because the GPT-4 version evaluated on these benchmarks in the original paper is outdated and has been retired by OpenAI.

5. Conclusion

In this article, we’ve presented some of the most powerful LLMs and platforms currently available and compared them using model parameters, cost, and popular LLM benchmarks.

As we’ve seen, there are many different language models out there, each designed with different goals in mind. Some are extremely powerful, some are inexpensive, and some are free and open for anyone to use, so there’s a suitable choice for almost every need.

As time passes, we’ll likely see even more new models appear, giving us even more options depending on our needs and budget.