Why I Build Single-Machine ML Systems
Introduction
You’ve been there: what started as a simple ML experiment now requires three different cloud services, two IAM roles, and a dedicated DevOps engineer just to change a hyperparameter. The “quick start” in the cloud has become a maintenance nightmare.
The Real Cost of Cloud Complexity
Most computational startups follow the same playbook: start in the cloud because “everyone else is doing it.” The promise is seductive — infinite scale, managed services, enterprise-grade infrastructure from day one. But here’s what actually happens: your team spends more time wrestling with Kubernetes configs than improving your models.
The cloud gets you moving fast initially, but that speed comes with hidden complexity debt. Services become entangled. Projects become interdependent through shared infrastructure. What should be a simple model update now requires coordinating changes across multiple cloud components. I’ve seen teams where 40% of engineering time goes to maintaining cloud glue code instead of advancing their core science.
My Single-Machine Approach: 6 Core Principles
After years of consulting with ML teams, I’ve developed a contrarian approach that keeps computational startups focused on what matters: the science, not the infrastructure.
1. One Environment, All Projects
Every project runs in the same standardized notebook environment. No custom Docker images, no project-specific cloud configurations. Just a consistent Linux system with common ML libraries. When your data scientist switches between projects, they’re working in familiar territory.
Pro tip: I maintain a single Ansible playbook that provisions any new machine with the exact same software stack. New team member? New hardware? Same environment in 20 minutes.
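A complementary trick, sketched below: a tiny script that fails fast when a machine has drifted from the standard stack. The pinned versions and the Python 3.11 check are placeholders for illustration, not a prescribed stack:

```python
# Illustrative sketch: verify a machine matches the pinned stack
# before running experiments. The pins below are invented examples.
import importlib.metadata
import sys

PINNED = {"numpy": "1.26.4", "torch": "2.3.1"}  # example pins

def check_environment() -> list[str]:
    """Return a list of mismatches between this machine and the pins."""
    problems = []
    if sys.version_info[:2] != (3, 11):  # assumed standard interpreter
        problems.append(f"Python {sys.version_info[:2]} is not 3.11")
    for pkg, want in PINNED.items():
        try:
            have = importlib.metadata.version(pkg)
        except importlib.metadata.PackageNotFoundError:
            problems.append(f"{pkg} not installed")
            continue
        if have != want:
            problems.append(f"{pkg} {have} does not match pin {want}")
    return problems

if __name__ == "__main__":
    for problem in check_environment():
        print("MISMATCH:", problem)
```

Run it at the top of any long job and a misconfigured box surfaces in seconds instead of three hours into training.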
2. Build for Big Single Machines
Instead of distributing work across cloud services, I design each project to make full use of one powerful machine. Modern workstations support 128GB+ of RAM, multiple GPUs, and terabytes of fast storage. Most ML workloads fit comfortably on well-configured hardware.
Pro tip: A $15k workstation with dual RTX 4090s often outperforms $50k worth of cloud GPU hours over a year, with no recurring compute bills. I built my own system with 5 RTX 3090s, 500GB of memory, and 15TB of storage for around $10k, though fair warning: building custom rigs takes significant time and isn't for everyone.
3. Local Development, Production-Ready Code
I build projects that run identically on development laptops and production servers. The same Python environment, the same file paths, the same dependencies. No “it works on my cloud instance” debugging sessions.
Pro tip: Use relative paths and environment variables for everything. Your code should work whether it’s running on a MacBook or a 48-core server.
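Here's a minimal sketch of what that looks like in practice; the DATA_DIR variable and the parquet layout are my own placeholders, not from any specific project:

```python
# Hedged sketch of portable path handling. DATA_DIR and the
# parquet layout are illustrative placeholders.
import os
from pathlib import Path

# Resolve the data root from an environment variable, falling back
# to a path relative to this file, so the same code runs unchanged
# on a MacBook or a 48-core server.
DATA_DIR = Path(os.environ.get("DATA_DIR", Path(__file__).parent / "data"))

def dataset_path(name: str) -> Path:
    """Return the path to a named dataset, wherever this runs."""
    path = DATA_DIR / f"{name}.parquet"
    if not path.exists():
        raise FileNotFoundError(f"expected dataset at {path}")
    return path
```

Export DATA_DIR once per machine, in a shell profile or a systemd unit, and nothing in the code changes between environments.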
4. Minimal External Dependencies
Projects depend only on the base Linux system and standard ML libraries. No cloud-specific APIs, no managed databases that lock you into a vendor ecosystem. Your models should be portable across any infrastructure.
Pro tip: SQLite handles more than you think. I’ve seen production systems serving millions of predictions with simple file-based databases.
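As a sketch of how far the standard library alone gets you, here's a minimal prediction log built on Python's built-in sqlite3 module; the table schema is illustrative:

```python
# Minimal sketch of a file-based prediction log with SQLite.
# The schema and file name are invented for illustration.
import sqlite3

conn = sqlite3.connect("predictions.db")
# WAL mode lets concurrent readers coexist with a single writer,
# which covers many single-machine serving workloads.
conn.execute("PRAGMA journal_mode=WAL")
with conn:  # commits the schema change
    conn.execute("""
        CREATE TABLE IF NOT EXISTS predictions (
            id INTEGER PRIMARY KEY,
            input_hash TEXT NOT NULL,
            score REAL NOT NULL,
            created_at TEXT DEFAULT CURRENT_TIMESTAMP
        )
    """)

def log_prediction(input_hash: str, score: float) -> None:
    with conn:  # implicit transaction, committed on success
        conn.execute(
            "INSERT INTO predictions (input_hash, score) VALUES (?, ?)",
            (input_hash, score),
        )

log_prediction("abc123", 0.97)
```

No database server to provision, no connection pool to tune, and the entire datastore is one file you can rsync or back up with cp.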
5. Native Linux Scaling
When you outgrow a single machine, Linux-native projects scale naturally. Your code already runs on standard Linux — adding machines means replicating your environment, not re-architecting for cloud services. SSH, rsync, and simple job queues handle most coordination needs.
Pro tip: Start with tools like GNU Parallel or simple Python multiprocessing. You can distribute work across machines without complex orchestration frameworks.
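For the multiprocessing route, here's a minimal sketch; score_config stands in for whatever per-item work your project does (a training run, an evaluation, a simulation), and the grid and scoring are invented for illustration:

```python
# Minimal sketch: fan work out across local cores with the
# standard library. score_config is a placeholder task.
from multiprocessing import Pool

def score_config(params: dict) -> tuple[dict, float]:
    # Stand-in for real work, e.g. train and evaluate one config.
    return params, params["lr"] * params["batch_size"]

if __name__ == "__main__":
    grid = [{"lr": lr, "batch_size": bs}
            for lr in (1e-4, 1e-3, 1e-2)
            for bs in (32, 64)]
    with Pool() as pool:  # defaults to one worker per core
        results = pool.map(score_config, grid)
    best = max(results, key=lambda r: r[1])
    print("best config:", best[0])
```

The same shape extends to several machines without a framework: rsync the inputs to each host, launch the script over SSH, and merge the result files.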
6. Hardware as Code
I build and configure machines myself, but document everything. Parts lists, BIOS settings, driver versions — all version controlled. This makes hardware failures recoverable and scaling predictable.
Pro tip: Keep a detailed build log for each machine. When something breaks at 2 AM, you’ll thank yourself for documenting that obscure CUDA driver configuration.
When This Approach Works Best
This isn’t anti-cloud dogma — it’s about choosing the right tool for your situation. Single-machine systems excel when:
- Your team is 2-10 people focused on computational research
- Compliance is critical and data locality matters
- Cost predictability is more important than infinite scale
- Your ML workloads fit on powerful individual machines
- Development speed matters more than operational complexity
The sweet spot is computational startups where the core business is developing better algorithms, not managing distributed infrastructure.
The Surprising Benefits
Beyond simplicity, this approach delivers unexpected advantages:
Faster iteration cycles. No waiting for cloud resources or fighting with service quotas. Your GPU is always available.
Predictable costs. Hardware depreciates, but it doesn’t surprise you with a $10k cloud bill because someone left a training job running.
True portability. Your entire ML pipeline fits in a Git repo and runs anywhere Linux does. Customer wants on-premises deployment? No problem.
Better debugging. When everything runs on one machine, you can inspect every process, monitor every resource, and debug with standard Unix tools.
Knowledge that compounds. Your team learns versatile Linux skills — bash, systemd, ssh, monitoring tools — that transfer across projects and companies. These are Lego blocks for computational work. Compare that to burning time debugging idiosyncratic cloud frameworks that may not exist in five years.
Making the Transition
If you’re drowning in cloud complexity, here’s how I help teams transition:
Start with a single powerful development machine that mirrors your current cloud setup. Migrate one project at a time. Document what works and what doesn't. Most teams are surprised by how much simpler their workflow becomes.
You can even run this approach in the cloud initially — rent bare metal instances instead of managed services. Get the simplicity benefits while using your existing cloud credits.
The Bottom Line
The cloud isn’t evil, but it’s not always optimal. For computational startups focused on advancing science rather than building distributed systems, single-machine ML can be a competitive advantage. Your team spends time on research, not infrastructure. Your costs are predictable. Your code is portable.
When your business is computation, sometimes the simplest solution is the smartest one.
Building efficient ML systems for computational startups? I help teams escape cloud complexity and focus on what matters. [Let’s talk about your infrastructure challenges.]