How I Created My Own AI Supercomputer to Run Ollama.Cloud Services – Using CPU-Only Nodes with Full RAM Sharing
Posted on: July 9, 2025
Author: Tom Dings
Over the past weeks, I’ve been exploring ways to improve how I run local AI services, especially those built on models provided by Ollama.Cloud. While there are many interesting projects floating around GitHub, most are focused on GPU clusters or offer only limited performance on CPU-only setups.
That changed when I stumbled upon the Exo GitHub repository, an open-source framework for distributed inference of large language models across a cluster of machines. It was a great starting point. But I didn’t stop there.
My Custom CPU-Based AI Cluster Setup
Based on the Exo project and its architectural ideas, I developed my own method to create a fully working clustered AI backend for Ollama.Cloud — using CPU-only servers.
Here’s what makes my setup unique:
- It uses all available CPU cores and threads on each node, rather than the single core many existing tools are limited to.
- It shares the total available memory across all nodes, allowing me to run larger models that wouldn’t fit on a single machine.
- I can plug in as many servers as I want; adding nodes increases throughput roughly linearly for many parallel workloads.
- It’s optimized to work with Ollama.Cloud API endpoints, making it ideal for chat-based, RAG-powered, and streaming tasks.
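To make the memory-sharing point concrete, here is a minimal sketch (not my production code) of the core idea: a model’s memory footprint is split across nodes in proportion to each node’s free RAM. The node names and sizes below are illustrative only.

```python
# Sketch: divide a model's memory footprint across nodes in
# proportion to each node's free RAM. Names and sizes are
# illustrative, not a real cluster inventory.

def shard_by_ram(model_gb, nodes):
    """Return {node: gigabytes assigned}, proportional to free RAM."""
    total = sum(nodes.values())
    if model_gb > total:
        raise ValueError("model does not fit in the pooled RAM")
    return {name: model_gb * ram / total for name, ram in nodes.items()}

nodes = {"nuc": 16, "pi-1": 8, "pi-2": 8, "pi-3": 8}  # free GB per node
shards = shard_by_ram(20, nodes)
# The NUC, with twice the RAM of a Pi, carries twice the share.
```

The same proportional split generalizes to any mix of machines, which is what makes running a 20 GB model on a pool of 8 GB devices possible at all.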
Why Not Use GPUs?
I’m often asked: why not just use GPU hardware?
The answer is simple: my colocated, dedicated, and virtual private servers do not have GPUs available. These environments are designed for reliability and general-purpose workloads, not for graphical processing.
Of course, it’s technically possible to rent GPU instances from third-party providers, and for some specialized AI tasks that might make sense. But the way I’ve designed and optimized my AI services, there’s currently no need for GPU acceleration. The performance I’m getting with CPU-only nodes — especially with full core/thread usage and shared memory pooling — is more than sufficient for the workloads I’m running.
This also keeps costs predictable and allows me to scale the cluster with standard hardware that I already own or rent.
How My System Works
I built a lightweight node script and controller that:
- Accepts LLM requests from any client (e.g., a chatbot or frontend)
- Splits tasks dynamically across CPU threads and memory pools from all connected servers
- Handles load balancing, memory allocation, and fallback logic
- Sends requests to Ollama.Cloud, using the shared power of the cluster to handle larger contexts, faster streaming, and parallel requests
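As a rough illustration of the load-balancing and fallback logic (not my actual controller code; the endpoints are invented, using Ollama’s default port 11434), a round-robin dispatcher that skips unhealthy nodes could look like this:

```python
from itertools import cycle

class Dispatcher:
    """Minimal round-robin dispatcher with fallback.

    A sketch of the balancing idea only; a real controller would
    also split work across threads and pool memory. The endpoints
    used below are hypothetical."""

    def __init__(self, endpoints):
        self.endpoints = list(endpoints)
        self._ring = cycle(self.endpoints)

    def pick(self, is_healthy):
        # Try each node at most once per call, skipping failed ones.
        for _ in range(len(self.endpoints)):
            endpoint = next(self._ring)
            if is_healthy(endpoint):
                return endpoint
        raise RuntimeError("no healthy nodes available")

d = Dispatcher(["http://10.0.0.1:11434",
                "http://10.0.0.2:11434",
                "http://10.0.0.3:11434"])
up = {"http://10.0.0.1:11434", "http://10.0.0.3:11434"}
first = d.pick(lambda e: e in up)    # node 1 is healthy
second = d.pick(lambda e: e in up)   # node 2 is down, falls back to node 3
```

The health check is passed in as a function so the dispatch logic stays independent of how node liveness is actually probed.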
It scales horizontally — just add another machine, and it joins the pool, contributing CPU threads and memory instantly.
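The hot-join step can be as simple as a node announcing its address and capacity to the controller, which immediately folds it into the pool. A sketch, with class and field names of my own invention:

```python
class NodePool:
    """Tracks cluster members and aggregate capacity (illustrative only)."""

    def __init__(self):
        self.nodes = {}

    def join(self, addr, threads, ram_gb):
        # A new machine announces itself and instantly adds capacity.
        self.nodes[addr] = {"threads": threads, "ram_gb": ram_gb}

    def capacity(self):
        return (sum(n["threads"] for n in self.nodes.values()),
                sum(n["ram_gb"] for n in self.nodes.values()))

pool = NodePool()
pool.join("192.168.1.10", threads=8, ram_gb=16)  # e.g. the NUC
pool.join("192.168.1.11", threads=4, ram_gb=8)   # e.g. a Pi 5
threads, ram = pool.capacity()
```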
Real-World Setup
Here’s a simple example of what I’m running:
- 1x Intel NUC (16 GB RAM, 8-core CPU)
- 3x Raspberry Pi 5 (8 GB each)
- 2x Old laptops running Ubuntu Server
- Optionally connected to a VPS to route API traffic or process larger LLM responses
All nodes are linked over the LAN and tuned for low latency and efficient bandwidth use. Pooling RAM lets me handle much larger prompts and context-heavy tasks than any single machine could on its own.
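For reference, an inventory like the one above can be described in a small config. The hostnames and addresses here are made up, and the laptop specs are assumptions for illustration, since I didn’t list them:

```python
# Hypothetical inventory mirroring the hardware listed above.
# Hostnames/addresses are invented; laptop specs are assumed.
CLUSTER = [
    {"host": "nuc-01",    "addr": "192.168.1.10", "cores": 8, "ram_gb": 16},
    {"host": "pi5-01",    "addr": "192.168.1.11", "cores": 4, "ram_gb": 8},
    {"host": "pi5-02",    "addr": "192.168.1.12", "cores": 4, "ram_gb": 8},
    {"host": "pi5-03",    "addr": "192.168.1.13", "cores": 4, "ram_gb": 8},
    {"host": "laptop-01", "addr": "192.168.1.14", "cores": 4, "ram_gb": 8},
    {"host": "laptop-02", "addr": "192.168.1.15", "cores": 4, "ram_gb": 8},
]

pooled_ram = sum(n["ram_gb"] for n in CLUSTER)    # shared memory pool
pooled_cores = sum(n["cores"] for n in CLUSTER)   # total CPU cores
```

Even this modest mix of hardware yields a pool well beyond what any single node could offer.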
Why This Matters
This setup finally allows me to run production-level AI inference workloads using only CPU-based hardware — no GPUs, no expensive cloud clusters.
This setup is ideal for developers, researchers, and hobbyists who:
- Want to run Ollama.Cloud-based AI without high-end hardware
- Need to scale up AI services in a modular and affordable way
- Are looking for flexibility and full control over resources
This method gives you the power of a self-hosted AI supercomputer — made from recycled PCs, Raspberry Pis, and cloud VMs.
Bonus: Based on Great Open-Source Foundations
Credit where credit is due: the Exo GitHub repository laid the foundation that inspired my solution. Their modular memory-based architecture and node communication structure gave me the initial framework to adapt and build something tailored to my needs and goals.
Want to Try It Too?
If you’re interested in trying this method or integrating it with your own Ollama.Cloud-powered apps, feel free to reach out — I’ll be happy to help or share more about how the internal logic works.
Together, we can build smarter, faster, and more open AI clusters — without needing GPUs or breaking the bank.
Tags: AI, OllamaCloud, ClusterAI, CPUNodes, DistributedLLM, OpenSource, ExoInspired, TomDingsAI