Self-hosted LLMs: running your own inference infrastructure

Written by Daniel Burkhardt Cerigo

When does it make sense to run your own LLM inference infrastructure instead of paying per-token to third-party APIs like OpenAI or Anthropic? And how do you execute it once you’ve decided to?

I gave a talk and ran a half-day hands-on workshop on the topic.

If you want to get a grasp on the topic but you’re not a developer, then watch the talk. If you’re a developer who actually wants to set up, tune, and manage inference servers yourself, then go to the workshop section below.

The talk

Data Science Festival Big Birthday Bash 2026, 16th May 2026, London

View the slides

Talk takeaways

  1. Ability to apply a concrete decision framework to evaluate Third-party vs Self-hosted inference for any LLM application.
  2. Brief intro to setting up a basic inference server.

Recording

The workshop

AI in Production 2026, 4-5th June 2026, Newcastle Upon Tyne

If you want to set up, tune, and manage inference servers yourself, you need to understand how (decoder-only) transformer models actually work internally. Understanding what a KV-cache is, and why it is critically important for inference, is a good measure of where you need to get to: if you can understand that, you’ll be able to readily understand the other aspects of inference optimisation you run into. Without this knowledge you can’t even understand the most important run arguments for inference engines like vLLM, so you can’t expect to tune or manage an inference server effectively. The following slide has some good references. Work through them with an LLM, get explanations of the content, and have it quiz you to check your understanding.
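To make the KV-cache idea concrete, here is a minimal sketch of single-head attention during token-by-token decoding, written in plain Python. It is a toy under stated assumptions, not how any real inference engine is implemented: the 2x2 weights, the four input embeddings, and all function names are made up for illustration, and real decoding would generate each token from the previous output rather than read it from a list. The point it shows is that keys and values for earlier tokens never change, so recomputing them at every step is wasted work.

```python
import math

# Made-up 2x2 projection weights for a single attention head (illustrative only).
W_Q = [[0.5, -0.2], [0.1, 0.3]]
W_K = [[0.4, 0.0], [-0.3, 0.2]]
W_V = [[1.0, 0.5], [0.0, -1.0]]


def project(weights, vector):
    """Multiply a weight matrix by a token embedding."""
    return [sum(w * x for w, x in zip(row, vector)) for row in weights]


def attend(query, keys, values):
    """Scaled dot-product attention of one query over all keys/values so far."""
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    peak = max(scores)  # subtract the max for a numerically stable softmax
    weights = [math.exp(score - peak) for score in scores]
    total = sum(weights)
    return [
        sum(w * value[i] for w, value in zip(weights, values)) / total
        for i in range(len(values[0]))
    ]


def decode_without_cache(embeddings):
    """Recompute keys and values for the whole prefix at every step."""
    outputs, projections = [], 0
    for step in range(1, len(embeddings) + 1):
        prefix = embeddings[:step]
        keys = [project(W_K, x) for x in prefix]
        values = [project(W_V, x) for x in prefix]
        projections += 2 * step  # one key and one value projection per prefix token
        outputs.append(attend(project(W_Q, prefix[-1]), keys, values))
    return outputs, projections


def decode_with_cache(embeddings):
    """Project each token once and keep its key and value in a cache."""
    outputs, projections = [], 0
    key_cache, value_cache = [], []
    for x in embeddings:
        key_cache.append(project(W_K, x))
        value_cache.append(project(W_V, x))
        projections += 2  # only the newest token is projected
        outputs.append(attend(project(W_Q, x), key_cache, value_cache))
    return outputs, projections


# Hypothetical embeddings standing in for four decoded tokens.
embeddings = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, -0.5]]
slow_outputs, slow_projections = decode_without_cache(embeddings)
fast_outputs, fast_projections = decode_with_cache(embeddings)

# Both strategies must produce the same attention outputs.
same = all(
    math.isclose(a, b, abs_tol=1e-12)
    for slow, fast in zip(slow_outputs, fast_outputs)
    for a, b in zip(slow, fast)
)
print(same, slow_projections, fast_projections)
```

For these four tokens the uncached loop performs 20 key/value projections and the cached loop performs 8, with identical outputs. The uncached count grows quadratically with sequence length while the cached count grows linearly, at the price of holding every past key and value in memory.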

The workshop was a 3h15m afternoon session. It covered the decision framework for Third-party vs Self-hosted, applied it to some worked example LLM applications, then got hands-on with deploying an inference server using current leading open-source technologies and trying out server tuning/optimisation. I focused on the aspects specific to LLM inference and mostly assumed knowledge of the dev-ops aspects, as those are already covered many times elsewhere. The interesting parts are the challenges and opportunities specific to transformer architectures.

Workshop takeaways

  1. Ability to apply a concrete decision framework to evaluate Third-party vs Self-hosted inference for any LLM application.
  2. Knowledge and skills to set up a basic inference server.
  3. Practice connecting your understanding of transformer architecture to server optimisation decisions.
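As one small illustration of what takeaway 3 can look like in practice, the sketch below estimates how much memory the KV-cache needs per token and how many tokens fit in a given budget. It is a back-of-envelope calculation, not the workshop's own material: the model shape (32 layers, 8 key/value heads, head dimension 128, 2-byte values) and the 16 GiB cache budget are hypothetical, and real engines add their own overheads and may store the cache in other precisions.

```python
def kv_cache_bytes_per_token(num_layers, num_kv_heads, head_dim, bytes_per_value=2):
    """Bytes of KV-cache one token occupies across all layers.

    The leading 2 accounts for storing both a key and a value vector.
    """
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_value


def max_cached_tokens(cache_budget_bytes, bytes_per_token):
    """Whole tokens that fit in the memory set aside for the KV-cache."""
    return cache_budget_bytes // bytes_per_token


# Hypothetical model shape and memory budget (assumptions for illustration).
NUM_LAYERS = 32
NUM_KV_HEADS = 8
HEAD_DIM = 128
GIB = 1024 ** 3
CACHE_BUDGET = 16 * GIB  # memory assumed left over after loading the weights
CONTEXT_LENGTH = 8192    # assumed maximum tokens per sequence

per_token = kv_cache_bytes_per_token(NUM_LAYERS, NUM_KV_HEADS, HEAD_DIM)
token_capacity = max_cached_tokens(CACHE_BUDGET, per_token)
full_length_sequences = token_capacity // CONTEXT_LENGTH
print(per_token, token_capacity, full_length_sequences)
```

Under these assumptions each token costs 131072 bytes (128 KiB) of cache, the budget holds 131072 tokens, and that is enough for 16 concurrent sequences at the full 8192-token context. This is the kind of reasoning that sits behind engine settings for maximum context length and the fraction of GPU memory to use, such as vLLM's `--max-model-len` and `--gpu-memory-utilization`.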

Resources

Using those resources you can work through the workshop plan yourself. Takeaway/goal 3 is the most important for a technical practitioner. If you can achieve this yourself then you’re well placed.

If you’re thinking about self-hosting, or just starting to grapple with leveraging AI internally in your org, drop me an email and I’d be happy to talk!

June 12, 2026

datavaluepeople is a group of artificial intelligence experts. Through applied machine learning, building automated systems, advising, and education, we create value for businesses, organisations, and humans. Drop us an email to speak to us about how we could work with your organisation, or if you are interested in joining our team.
