Gamer's Edge AI Inference
Written by Dominik Pantůček on 2025-04-24
Last time we showed how to run (or more like crawl) LLM inference on common CPUs. It can be rather sluggish, though. That is why we dug deeper and tried to use commodity hardware - a gaming laptop - to speed things up! It is actually very interesting how much even a gaming GPU can increase the performance of AI tasks.
Firstly, it was a great relief to learn that LLaMa.cpp has decent support for offloading computations to the GPU. Secondly, Ubuntu comes with all the necessary libraries for NVIDIA GPUs - that means CUDA support works out of the box.
CUDA stands for Compute Unified Device Architecture and is available on many newer NVIDIA GPUs. It allows general-purpose computing on graphics processing units (GPGPU), and typical AI inference engines try to leverage it whenever possible.
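Whether a given machine is ready for CUDA can be verified easily - assuming the proprietary NVIDIA driver is installed, the nvidia-smi utility bundled with it lists the detected devices along with the driver and supported CUDA versions:

nvidia-smi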
To install the required development libraries, we follow our previous steps and add a few more packages to the mix:
apt install git build-essential nvidia-cuda-toolkit libcurl4-openssl-dev
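To double-check that the toolkit itself landed correctly, the CUDA compiler it ships can report its version:

nvcc --version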
Yes, that is all. At least those were all the adjustments needed to build LLaMa.cpp with CUDA support. The actual build process is - again - only slightly different from what we have already seen:
cmake -B build -DGGML_CUDA=ON -DLLAMA_CURL=ON
cmake --build build --config Release -j 16
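A quick sanity check that the resulting binary really picked up CUDA support is to look for the CUDA runtime among its shared library dependencies (the exact library names may differ between llama.cpp and CUDA versions):

ldd build/bin/llama-cli | grep -i cuda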
Once compiled, it can be run the same way as before:
./build/bin/llama-cli -hf 'TheDrummer/Cydonia-24B-v2.1-GGUF'
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
Device 0: NVIDIA GeForce RTX 4080 Laptop GPU, compute capability 8.9, VMM: yes
build: 4935 (960e7260) with cc (Ubuntu 13.3.0-6ubuntu2~24.04) 13.3.0 for x86_64-linux-gnu
main: llama backend init
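As the log shows, the GPU was detected right away. The number of model layers offloaded to it can also be set explicitly with the -ngl (--n-gpu-layers) option, which is useful when the whole model does not fit into VRAM. A value high enough to cover all layers simply offloads everything:

./build/bin/llama-cli -hf 'TheDrummer/Cydonia-24B-v2.1-GGUF' -ngl 99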
And after about 30 seconds of startup time it can be used as well:
> Hello, what are your capabilities and how can you assist us?
The inference is really fast - after about 20 seconds we see our familiar output:
Hello! I'm a versatile AI assistant designed to help with a wide range of tasks. Here are some of my capabilities and how I can assist you:
Information and Knowledge:
- Answer questions and provide explanations on a wide range of topics.
- Offer summaries of complex topics.
- Help with educational content and homework.
Writing Assistance:
- Assist with writing essays, reports, and other documents.
- Provide suggestions for improving grammar, style, and clarity.
- Help with creative writing, such as short stories or poetry.
...
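To confirm that it is really the GPU doing the heavy lifting, its utilization can be watched from a second terminal while the model generates:

watch -n 1 nvidia-smi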
The response is really good! But how does the GPU compare to the CPU for the usage we intend? The benchmarks make it pretty clear:
| Engine | Startup (mm:ss) | Inference (mm:ss) |
|--------|-----------------|-------------------|
| CPU    | 01:02           | 05:58             |
| GPU    | 00:31           | 00:22             |
We have to admit - the results are blazing fast. In the future we plan to test the same model on a dedicated AI GPU and see what the limits are.
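For anyone who wants to reproduce such measurements, llama.cpp also builds a dedicated benchmarking tool called llama-bench, which reports prompt-processing and token-generation speeds separately. A minimal invocation could look like this (the model path is just a placeholder - point it at a local GGUF file):

./build/bin/llama-bench -m path/to/model.gguf -ngl 99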
Hope you enjoyed this artificial ride through GPU-based inference (pun intended) and see ya next time!