1. Overview

Ollama is open-source software designed to run Large Language Models (LLMs) locally.

In this tutorial, we’ll see how to install and use Ollama on a Linux system with an NVIDIA GPU. We’ll use apt, but we can adapt the commands to other package managers.

2. Ollama’s Key Advantages

Ollama offers a variety of generative AI functionalities, depending on the chosen model:

  • Question-and-answer
  • Text completion
  • Text summarization
  • Translation
  • Text classification
  • Creation of embeddings, which are numerical representations of text used to find similar texts and organize them into groups
  • Image, audio, and video capabilities

Additionally, we can integrate Ollama with other applications using its REST API or dedicated libraries:

  • C++ → ollama-hpp
  • PHP → Ollama PHP
  • JavaScript → Ollama JavaScript Library
  • Java → LangChain4j
  • Python → Ollama Python Library
  • R → ollama-r
  • Ruby → ollama-ai
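
For example, once Ollama is running (we’ll install it in the next section), its REST API listens on port 11434 by default, and a single curl call is enough to list the locally installed models; this is just a minimal sketch of what the API offers:

$ curl http://localhost:11434/api/tags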

In terms of privacy, Ollama stands out because it works completely offline, giving us full control over our data and execution environment.

3. Installing Ollama

Before we continue, let’s take a look at the minimum hardware requirements, which depend on the number of parameters (in billions). Simply put, parameters are settings or rules that a model adjusts as it learns to improve its performance. The more parameters a model has, the more detailed and accurate it can be in understanding and generating human-like language.

For example, llama3:8b has 8 billion parameters. We indicate the number of parameters by using abbreviations such as 7B, 13B or 30B after the model name.

3.1. Hardware Requirements

Ollama puts the CPU and GPU under heavy load, which can lead to overheating, so a good cooling system is a must. These are the minimum requirements for decent performance:

  • CPU → recent Intel or AMD CPU
  • RAM → minimum 16GB to effectively handle 7B parameter models
  • Disk space → at least 50GB to accommodate Ollama, a model like llama3:8b and the Open WebUI web interface
  • Additional disk space → we can install as many models as we want; the size in GB of each of them is listed in the model library
  • GPU → not required but strongly recommended; we should check the list of supported GPUs

As a rule of thumb, the minimum number of GBs of VRAM should be half the number of billions of model parameters:

  • 8B model → 4GB VRAM
  • 16B model → 8GB VRAM
  • 32B model → 16GB VRAM
  • 64B model → 32GB VRAM

Larger models (13B+) require higher-end hardware that isn’t yet available in consumer computers, with few exceptions. Conversely, very small models (0.5B, 1.5B, or 3B) can run on limited hardware. For reference, GPT-3 is a 175B model and requires at least five NVIDIA A100 GPUs with 80GB of VRAM each.

Our test machine has 16GB of RAM and an NVIDIA GPU with 4GB of VRAM, just enough to run the llama3:8b LLM.
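
Before moving on, we can quickly verify these numbers on our own machine with a few standard commands; the nvidia-smi query only works if an NVIDIA driver is already installed, which we’ll take care of in the next section:

$ free -h | grep Mem
$ df -h /
$ nvidia-smi --query-gpu=name,memory.total --format=csv,noheader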

3.2. Configuring NVIDIA

CUDA (Compute Unified Device Architecture) is a parallel computing platform and API created by NVIDIA for general-purpose computing, known as GPGPU (General-Purpose Computing on Graphics Processing Units). The nvidia-cuda-toolkit package is essential for Ollama to use an NVIDIA GPU as it provides the necessary tools and libraries for CUDA.

When installing the nvidia-cuda-toolkit package, we may encounter apt dependency conflicts with certain versions of NVIDIA drivers. For example, nvidia-driver-470 is incompatible with the nvidia-cuda-toolkit, causing it to be removed during the toolkit installation. If we have a similar problem, we should install a newer driver, such as nvidia-driver-535:

$ sudo apt install nvidia-driver-535
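
On Ubuntu and its derivatives, we can also ask the system which driver versions are available and recommended for our GPU before picking one manually, using the ubuntu-drivers tool from the ubuntu-drivers-common package:

$ ubuntu-drivers devices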

After performing this driver update, we need to restart the computer. Then let’s see if the driver works with CUDA support and direct rendering, which allows applications to interact directly with the GPU:

$ nvidia-smi
[...]
| NVIDIA-SMI 535.183.01    Driver Version: 535.183.01    CUDA Version: 12.2
[...]
$ glxinfo | grep "direct rendering"
direct rendering: Yes

We’re ready to install the CUDA toolkit:

$ sudo apt install nvidia-cuda-toolkit

If the installation was successful, the CUDA compiler driver should be available:

$ nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
[...]

Lastly, let’s install nvtop:

$ sudo apt install nvtop

We’ll use nvtop to monitor how Ollama uses our CPU, GPU, RAM and VRAM.
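
Launching it requires no arguments; it shows per-GPU utilization and memory graphs together with the list of processes currently using the GPU:

$ nvtop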

3.3. Installing Ollama

Installing Ollama is very simple. If we did the previous steps correctly, the installer automatically configures NVIDIA GPU support:

$ curl -fsSL https://ollama.com/install.sh | sh
>>> Installing ollama to /usr/local/bin...
>>> Creating ollama user...
>>> Adding ollama user to render group...
>>> Adding ollama user to video group...
>>> Adding current user to ollama group...
>>> Creating ollama systemd service...
>>> Enabling and starting ollama service...
Created symlink /etc/systemd/system/default.target.wants/ollama.service → /etc/systemd/system/ollama.service.
>>> NVIDIA GPU installed.

If the message NVIDIA GPU installed doesn’t appear, we need to double-check that the NVIDIA driver and nvidia-cuda-toolkit are installed correctly, and then repeat the installation of Ollama.
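
Since the installer creates a systemd service, we can also confirm that it’s running and, optionally, search its logs for GPU-related messages; the exact wording of these log lines may vary between Ollama versions:

$ systemctl is-active ollama
active
$ journalctl -u ollama | grep -i -e cuda -e gpu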

3.4. Installing and Testing a Large Language Model

This command runs the llama3:8b model. If it’s not already installed, it downloads it automatically:

$ ollama run llama3:8b

After the installation, let’s give it a try.

While the model generates a response, we can use nvtop to watch hardware usage in real time. If we have multiple GPUs, we can monitor them simultaneously or select only one by pressing F2.
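
Besides the interactive chat, ollama run also accepts the prompt directly as an argument, which is handy for quick tests or shell scripts; the prompt below is just an example:

$ ollama run llama3:8b "Write a sentence about Linux"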

By default, Ollama streams the response one word at a time instead of returning it all at once. However, we can change this behavior using the stream option of the Ollama API. When set to false, Ollama returns the JSON response as a single, complete output after processing the entire request, instead of incrementally returning it in real-time chunks:

$ curl -X POST http://localhost:11434/api/generate -d '{
  "model": "llama3:8b",
  "prompt":"Write a sentence about Linux",
  "stream": false 
}' | jq .

[...]
{
  "model": "llama3:8b",
  [...]
  "response": "Linux is an open-source operating system that has become incredibly popular
among developers and tech enthusiasts due to its flexibility, customizability, and vast
community of users who contribute to its growth and improvement.",
  [...]
}

When integrating Ollama into more complex applications, the “stream”: false option is important for tasks that require analyzing the full response at once, such as generating reports or documents. It also simplifies handling under unreliable network conditions, since there is only one complete payload to receive.
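
For instance, in a shell script we can keep only the generated text by asking jq for the response field of the non-streamed JSON:

$ curl -s -X POST http://localhost:11434/api/generate -d '{
  "model": "llama3:8b",
  "prompt": "Write a sentence about Linux",
  "stream": false
}' | jq -r .response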

As a final note, the llama3:8b model was trained on data covering more than 30 languages, but about 95% of that data was English. If we plan to work primarily in a language other than English, we should look for a more appropriate model.

3.5. Ollama Web Interface

Open WebUI is a user-friendly graphical interface for Ollama, with a layout very similar to ChatGPT. This command makes it run on port 8080 with NVIDIA support, assuming we installed Ollama as in the previous steps:

$ sudo docker run -d --network=host --gpus all -v open-webui:/app/backend/data \
-e OLLAMA_BASE_URL=http://127.0.0.1:11434 --name open-webui \
--restart always ghcr.io/open-webui/open-webui:cuda

This makes Open WebUI available at http://localhost:8080. However, if port 8080 is already in use and we want to change the port to, say, 8090, there are a few more steps.
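
Before changing anything, we can check whether another process is already listening on port 8080, for example with the ss utility from the iproute2 package:

$ sudo ss -tlnp | grep ':8080'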

First, let’s stop the open-webui Docker container and rename or delete it to avoid conflicts:

$ sudo docker stop open-webui
$ sudo docker rename open-webui open-webui-old

Next, let’s create a file called Dockerfile in our working directory:

FROM ghcr.io/open-webui/open-webui:cuda

# Change the port configuration
ENV PORT=8090

# Start the service with the new port
CMD ["bash", "start.sh"]

Now we’re ready to create our custom Docker image:

$ sudo docker build -t open-webui-custom .
[...]
 => => naming to docker.io/library/open-webui-custom

Once the image is built, we can run it:

$ sudo docker run -d --network=host --gpus all -v open-webui:/app/backend/data \
-e OLLAMA_BASE_URL=http://127.0.0.1:11434 --name open-webui \
--restart always open-webui-custom

From now on, Open WebUI is available at http://localhost:8090 even after a computer restart.
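
To confirm that the custom container is up and answering on the new port, we can list it with Docker and probe the port with curl; the container may need a few seconds after startup before it responds:

$ sudo docker ps --filter name=open-webui
$ curl -I http://localhost:8090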

The first time we open it, we have to create an account. Since Open WebUI allows multiple accounts to use the same Ollama installation, different users can manage their own chats and settings. Importantly, all data processed through Open WebUI remains completely local, ensuring that our privacy is fully protected.

Let’s give it a try. If we need to change the OLLAMA_BASE_URL, we can go to Settings, Admin Panel, Connections, Ollama API. In addition, we can take a look at the Open WebUI troubleshooting guide for any connection errors.

4. Conclusion

In this article, we explored how to install and use Ollama on a Linux system equipped with an NVIDIA GPU.

We started by understanding the main benefits of Ollama, then reviewed the hardware requirements and configured the NVIDIA GPU with the necessary drivers and CUDA toolkit.

After successfully installing Ollama, we tested the llama3:8b model and discussed the possibility of changing the response generation behavior using the stream setting.

Finally, we set up Open WebUI, a user-friendly graphical interface for managing Ollama, ensuring a seamless integration.