I pay 50 cents for ChatGPT Plus

Naseer Ahmed Khan

As a developer, I've always hated paying fixed subscription fees for tools I only use occasionally. ChatGPT Plus costs 20 USD/month no matter how much you use it. But LLMs don't really need to work that way - they're built on tokens, not time. So I decided to set up a ChatGPT clone that bills me only for the tokens I use.

The chat application

For the chat application, I am using LibreChat, an open source AI conversation app that supports nice features like MCP servers, web search, RAG & image generation. The LibreChat repository comes with a Docker Compose file which you need to host on a server. I could have rented a “cheap” VPS for 5 dollars a month, but that's too rich for my taste, so I decided to set up a home server, for free ✨

A delightful side tangent

Inspired by this Fireship video, I decided to look into Coolify, an open source alternative to popular tools like Vercel, Netlify & Railway. You can link any repository to a Coolify project and it manages the container orchestration and reverse proxy configuration for you. It also supports push-to-deploy and automatic SSL certificate management 🎉.

[Image: Coolify deployments]

Having set up such a lovely tool for self-hosting all my side projects, I needed a way to monitor the CPU and memory utilization of all my containers. Coolify provides a very simple version of this, but I wanted something as comprehensive as possible. So one thing led to another and I set up Prometheus, cAdvisor, Node Exporter & Grafana to get real-time metrics for container resource usage, I/O operations, disk utilization & even my CPU's temperature.
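As a taste of what this stack exposes, here's a minimal sketch that pulls per-container CPU usage straight from Prometheus's HTTP API. The localhost:9090 address and the 5m rate window are assumptions for illustration; adjust them to your own setup.

```python
import requests

# Assumed Prometheus address; change it to wherever your instance runs.
PROMETHEUS_URL = "http://localhost:9090/api/v1/query"

# cAdvisor exports cumulative per-container CPU counters;
# rate() turns them into cores used over the last 5 minutes.
query = 'rate(container_cpu_usage_seconds_total[5m])'

resp = requests.get(PROMETHEUS_URL, params={"query": query}, timeout=10)
resp.raise_for_status()

for result in resp.json()["data"]["result"]:
    container = result["metric"].get("name", "<unnamed>")
    cores = float(result["value"][1])  # value is a [timestamp, "value"] pair
    print(f"{container}: {cores:.3f} cores")
```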

[Image: Prometheus dashboard]
  • Was this necessary to set up the ChatGPT clone? No.
  • Did I have more than one container to monitor when I set it up? No.
  • Is it the coolest looking thing ever and perhaps my favorite side tangent? YES.

How the chat application talks to AI

LibreChat gives you the ability to configure API tokens for major model providers like OpenAI, Anthropic, & Google. If you want to run models on your own infra, you can also integrate with AWS Bedrock or Azure OpenAI. But all this seemed too tedious to me. The obvious choice was OpenRouter, a service that puts almost every model (including the open source ones) behind an API interface fully compatible with OpenAI's API, which means any tool designed to work with OpenAI's v1 APIs can work with OpenRouter.
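That compatibility is the whole trick: point any OpenAI client at OpenRouter's base URL and it just works. Here's a minimal sketch using the official Python SDK (the model ID and prompt are just examples):

```python
from openai import OpenAI

# Same SDK you'd use for OpenAI; only the base URL and key change.
client = OpenAI(
    base_url="https://openrouter.ai/api/v1",
    api_key="sk-or-...",  # your OpenRouter API key
)

response = client.chat.completions.create(
    model="openai/gpt-oss-120b",  # any OpenRouter model ID works here
    messages=[{"role": "user", "content": "Why do LLMs bill per token?"}],
)
print(response.choices[0].message.content)
```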

After I set this up, the rough architecture looked like this:

[Image: architecture diagram]

Which model to choose?

Now that I have access to almost every model available, how do I choose which one to use? Ideally I want

  • A smart model that can answer most of my questions
  • A model with decently high tokens per second
  • A model with a reasonable price for its speed and intelligence.

Simply picking the model with the lowest price per token is not the best strategy, because the cheapest model may be too dumb.

Another strategy is to divide a model's performance on popular benchmarks by its price to calculate an “intelligence per dollar” metric, but how do you calculate the “price”? The official APIs give you two numbers - the price per input token & the price per output token. A non-reasoning model with a higher cost per output token may still be cheaper to run than a reasoning model with a lower cost per output token, simply because the reasoning model produces far more output tokens. Those output tokens are then passed back in as input tokens on every subsequent turn so the model can “remember” the context of the conversation.
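A toy calculation makes this concrete. The prices and token counts below are invented for illustration, not real quotes:

```python
def turn_cost(in_tok, out_tok, price_in, price_out):
    """Cost of a single turn; prices are USD per million tokens."""
    return (in_tok * price_in + out_tok * price_out) / 1e6

# Non-reasoning model: pricier output tokens, but short answers.
plain = turn_cost(in_tok=1_000, out_tok=500, price_in=0.30, price_out=2.00)

# Reasoning model: cheaper output tokens, but thousands of "thinking" tokens
# that also get re-sent as input context on every later turn.
reasoning = turn_cost(in_tok=1_000, out_tok=6_000, price_in=0.20, price_out=0.80)

print(f"plain:     ${plain:.4f}")      # $0.0013
print(f"reasoning: ${reasoning:.4f}")  # $0.0050 - nearly 4x, despite cheaper tokens
```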

The ideal way to calculate this is what Artificial Analysis does, which is to run a bunch of popular benchmarks and track the price it costs to run them. They then create an “intelligence per dollar” graph using this price.
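In spirit, the metric boils down to a single division. The scores and run costs below are placeholders, not Artificial Analysis data:

```python
# (benchmark score, USD to run the full benchmark suite) - placeholder values
models = {
    "model-a": (60.0, 2.00),
    "model-b": (55.0, 0.40),
    "model-c": (70.0, 12.00),
}

ranked = sorted(
    ((name, score / cost) for name, (score, cost) in models.items()),
    key=lambda pair: pair[1],
    reverse=True,
)
for name, ipd in ranked:
    print(f"{name}: {ipd:.1f} intelligence points per dollar")
# model-b wins despite the lowest score, because it's so cheap to run.
```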

[Image: Artificial Analysis intelligence-per-dollar chart]

This graph was what I used to decide that, as of writing this post, GPT OSS 120B and Grok 4 Fast are the most cost-efficient models to use. This is why my monthly costs are as low as 50 cents.
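For a sanity check on that number, here's a back-of-the-envelope sketch. The per-token prices and usage figures are assumptions for illustration; check OpenRouter for current rates:

```python
# Assumed illustrative prices (USD per million tokens) - not real quotes.
price_in, price_out = 0.10, 0.50

# Rough guess at a month of personal chat usage, context re-sends included.
monthly_in, monthly_out = 3_000_000, 400_000

bill = (monthly_in * price_in + monthly_out * price_out) / 1e6
print(f"${bill:.2f}/month")  # $0.50 under these assumptions
```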