There are many LLMs out there – from ChatGPT, Bard, and Claude to Llama, Mistral, and many others. By default, you might be using ChatGPT, but every once in a while you might get frustrated and wonder about other options. There are plenty of choices, but no clear method for picking one.
Here is a simple guide to find the right LLM for you, based on the criteria that matter the most — accuracy, context length, and speed.
The best way to evaluate which model fits your use case is to chat with the candidates live and compare their answers directly.
There are very good resources on the internet that host these models so you can test them live, side by side. One of my favorites is https://chat.lmsys.org/. To compare models, go to the Arena (side-by-side) tab – you can pick any two models you'd like, give them the same prompts, and see which one you like better.
Plug in some prompts that are common in your field and compare the responses for accuracy. You'll be able to tell pretty quickly which model performs better.
Beyond the accuracy of generation, there are two other very important factors to consider: context length and speed.
If you need to upload or copy-paste long context into the model, it is crucial to pick a model with a long enough context length. As of this writing, GPT-4's and Mistral's max context length is 32k tokens, while GPT-3.5 Turbo's is 4k. This matters, because if your input exceeds the context length, it can lead to errors, silent truncation, or hallucinations.
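Before pasting a long document into a model, it helps to estimate whether it will fit. Here is a minimal sketch using the common rule of thumb of roughly 4 characters per token for English text; this is an approximation, not a real tokenizer (for exact counts, use the model's own tokenizer, e.g. the tiktoken library for OpenAI models):

```python
# Rough sketch: estimate whether a document fits a model's context window.
# The ~4 characters-per-token ratio is a rule of thumb, not a tokenizer.

def estimate_tokens(text: str) -> int:
    # Approximate token count from character count.
    return len(text) // 4

def fits_in_context(text: str, context_limit: int) -> bool:
    # True if the estimated token count is within the model's limit.
    return estimate_tokens(text) <= context_limit

doc = "word " * 10_000                  # a ~50,000-character document
print(fits_in_context(doc, 4_000))      # 4k-context model: False
print(fits_in_context(doc, 32_000))     # 32k-context model: True
```

This also explains the failure mode above: a ~50k-character document is roughly 12k tokens, which overflows a 4k window but fits comfortably in a 32k one.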
Speed affects user experience A LOT. Some models start streaming sooner and generate tokens faster than others, so be sure to pay close attention to this as part of your decision.
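If you want to compare speed with numbers rather than by feel, the two figures worth measuring are time-to-first-token (how long before anything appears) and tokens per second (how fast the rest streams). Here is a small sketch of how you might measure both; `stream_completion` is a hypothetical stand-in for whatever streaming client you actually use, simulated here with artificial delays:

```python
# Sketch: measure time-to-first-token and throughput for a streaming model.
import time

def stream_completion(prompt: str):
    # Hypothetical placeholder for a real streaming API client:
    # yields tokens one at a time with a small artificial delay.
    for token in ["Hello", ",", " ", "world", "!"]:
        time.sleep(0.01)
        yield token

def measure_stream(prompt: str):
    start = time.perf_counter()
    first_token_latency = None
    n_tokens = 0
    for _ in stream_completion(prompt):
        if first_token_latency is None:
            # Time until the very first token arrives.
            first_token_latency = time.perf_counter() - start
        n_tokens += 1
    total = time.perf_counter() - start
    return first_token_latency, n_tokens / total  # (seconds, tokens/sec)

latency, tps = measure_stream("Say hello")
```

Run the same prompt through each candidate model and compare the two numbers: a low first-token latency makes the model feel responsive, while high tokens-per-second matters for long answers.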