AI Inference on CPU

Written by Dominik Pantůček on 2025-04-10

llm

One of our current projects has to use an LLM for extracting structured data from rather unstructured and noisy input. Such an endeavor typically requires a specialized GPU for decent inference times, and it would be wise to test the solution on a cheaper setup before buying such expensive hardware. It turns out it is possible to test even large models on a CPU.


Although it would be really nice to start playing with an LLM on a dedicated AI GPU like an NVIDIA H100, such an entrée comes with a price tag that makes it worth double-checking whether buying that particular hardware is a good idea in the first place. Luckily, our hypervisor cluster has quite some unallocated resources, both in spare CPU cores and in expendable RAM. Storage space is currently a complete non-issue.

So we allocated 16 CPU cores, 160 GB of RAM and 500 GB of storage for our little experiment. The installation of Ubuntu 24.04 LTS went smoothly and it was possible to start playing around with some LLMs. But how?
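Before installing anything else, it is worth a quick sanity check that the VM really sees what was allocated. A minimal sketch using standard tools (the exact output will of course differ on your setup):

nproc      # should report the 16 CPU cores
free -h    # should report roughly 160 GB of RAM
df -h /    # should show the 500 GB of storage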

When it comes to running LLM inference on an ordinary CPU, the llama.cpp project seems to be a good place to start. Installing required packages, downloading and compiling is rather straightforward:

# disable swap so the model stays resident in RAM
swapoff -a
sed -i -e 's@^/swap@#&@' /etc/fstab
# build dependencies
apt install git build-essential libcurl4-openssl-dev
git clone https://github.com/ggml-org/llama.cpp.git
cd llama.cpp
# configure with curl support and build with 16 parallel jobs
cmake -B build -DLLAMA_CURL=ON
cmake --build build --config Release -j 16

The -DLLAMA_CURL=ON option is necessary to enable the feature that downloads models from the Hugging Face repository automatically: you can just pass a model name as a command-line argument and, if it is not cached locally, llama.cpp will download it for you.
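If you prefer to manage the model files yourself, a downloaded GGUF file can also be loaded directly with the -m option instead of -hf. A minimal sketch, assuming you already have the file on disk (the path below is just a placeholder, not a real location):

# point llama-cli at a local GGUF file instead of a Hugging Face repository
./build/bin/llama-cli -m /path/to/model.gguf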

Running the engine as a chatbot requires just firing up llama-cli with appropriate arguments. We chose an LLM well suited for conversations, as suggested by our friends, and this is what we tried:

./build/bin/llama-cli -hf 'TheDrummer/Cydonia-24B-v2.1-GGUF'
register_backend: registered backend CPU (1 devices)
register_device: registered device CPU (Intel(R) Xeon(R) CPU E5-2620 v4 @ 2.10GHz)

...
... lots and lots of system and model information
...

== Running in interactive mode. ==
 - Press Ctrl+C to interject at any time.
 - Press Return to return control to the AI.
 - To return control without starting a new line, end your input with '/'.
 - If you want to submit another line, end your input with '\'.
 - Not using system message. To change it, set a different value via -sys PROMPT


>

What can it actually do? And how fast? Let's ask it exactly this question.

> Hello, what are your capabilities and how can you assist us?

It took at least 5 minutes before an answer started being written to the terminal. One word at a time - or part of one - every 1 or 2 seconds. In the end, the reply was:

Hello! I'm a text-based AI model, and I'm here to assist you in a variety of ways. Here are some of my capabilities:

  1. Information and Knowledge: I can provide information on a wide range of topics, including but not limited to science, history, literature, technology, and more. I strive to generate accurate and helpful responses to the best of my ability.

  2. Writing Assistance: I can help with writing tasks, such as drafting emails, composing essays, creating reports, and generating ideas for creative writing projects. I can also provide feedback on grammar, style, and structure.

  3. Language Translation: I can translate words or phrases from one language to another to the best of my ability. However, please note that my translations may not be perfect and it's always a good idea to have a native speaker review important translations.

  4. Coding Help: I can assist with coding questions, explain concepts, spot potential issues with snippets, and suggest improvements. I have experience with various programming languages.

  5. Learning and Education: I can explain complex concepts in a simple way, create practice problems, and provide learning resources. I can also help with homework and study questions across various subjects.

  6. Ideas and Brainstorming: I can generate ideas, suggest alternatives, and help brainstorm solutions to problems or challenges you may be facing.

  7. Entertainment: I can tell jokes, share interesting facts, play text-based games, and engage in casual conversation.

Here are a few things I can't do:

  1. I can't browse the internet or access real-time information.

  2. I can't provide personal advice or make decisions for you.

  3. I can't guarantee the accuracy of my responses, as I generate them based on patterns I've learned from large amounts of text data.

  4. I can't feel emotions or have personal experiences.

To get the most out of my assistance, please provide as much context and detail as possible in your questions or requests. I'm here to help, so don't hesitate to ask if you have any questions!

Funnily enough, the output is already in a Markdown flavor compatible with our website publishing system. That is definitely a cool feature!

This testing setup is definitely not something for real-time conversations. But it allows us to run test prompts and analyze the results later on, which is good enough to assess the model's abilities and compare them against its stated resource requirements. The most important resource is apparently the memory and its bandwidth.
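To put rough numbers on that, the llama.cpp build also produces a benchmarking tool, llama-bench, next to llama-cli. A minimal sketch, assuming the model GGUF has already been downloaded to a local path (the path below is a placeholder for illustration only):

# reports prompt-processing and token-generation throughput in tokens per second
./build/bin/llama-bench -m /path/to/model.gguf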

With everything in place, we were able to run our test prompt to see how long it takes to load the model, process the prompt and finish writing the answer to the terminal:

time ./build/bin/llama-cli -hf 'TheDrummer/Cydonia-24B-v2.1-GGUF' --multiline-input -st <input.txt
...
real    5m51.726s
user    92m20.735s
sys     0m3.790s

Yes, it requires a few more arguments to make it stop after processing the prompt (the -st one) and we had to enable --multiline-input as our test prompt contains large and structured data. Still, we can see that it took almost 6 minutes to finish the job. Not that bad on a 10-year-old CPU.

We could not resist and had to try running the infamous DeepSeek-R1 model on the same task:

time ./build/bin/llama-cli -hf 'unsloth/DeepSeek-R1-GGUF:UD-IQ1_S' --multiline-input -st <input.txt
...
real    74m14.133s
user    673m28.989s
sys     5m50.581s

To our surprise, this is still well below our threshold for including it in further tests. The plan is to batch a few test prompts overnight (see the sketch below) and then look at the results. That will allow us to decide whether the expensive hardware is worth the result. And yes, running the latter LLM in almost real time requires much better hardware than our test VM with an old CPU.
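A minimal sketch of such an overnight batch, assuming the test prompts live as plain-text files in a prompts/ directory (the directory layout and file names are just an illustration):

mkdir -p results
for f in prompts/*.txt; do
    out="results/$(basename "$f" .txt).out"
    # one single-turn run per prompt; the answer and llama.cpp's own timing info end up in the .out file
    ./build/bin/llama-cli -hf 'TheDrummer/Cydonia-24B-v2.1-GGUF' \
        --multiline-input -st <"$f" >"$out" 2>&1
done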

Hope you liked this modern AI experiment with an LLM on an old CPU and see ya next time!