The importance of evals.

This short article explains why evaluations are the most crucial step when custom fine-tuning LLMs, and provides a high-level overview of the process involved.

November 2024 - Dan Austin

Before starting, be sure that you're climbing the right mountain.

How can you take meaningful steps forward if you don't know the destination?

Evaluations provide clarity on what a custom AI's capabilities would be in a perfect world, so by defining them, you define success.

What are evals?

You can think of evals as a set of tests which the AI needs to pass, just like unit tests in traditional software. For each test, you know ahead of time what the input is and what the perfect output for that input looks like.

Each input portrays a single scenario or edge case that you expect to happen in production. These are normally highly specific to the domain or task that you are working on, and the number of evals required can vary widely depending on the scope and complexity of your intended use case.

When you have evals covering the most important scenarios of your intended task, you can take an LLM and have it take your tests.
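To make the unit-test analogy concrete, here is a minimal sketch of what an eval set might look like in Python. Everything in it is an assumption made for illustration: the `EvalCase` structure, the refund scenarios, the `stub_model` function standing in for a real LLM call, and the case-insensitive exact match used for scoring (real projects often need fuzzier scoring).

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass(frozen=True)
class EvalCase:
    """One scenario: a known input and the perfect output for it."""
    name: str
    prompt: str
    expected: str


def passes(case: EvalCase, model: Callable[[str], str]) -> bool:
    # Assumption: a case-insensitive exact match is good enough for scoring.
    return model(case.prompt).strip().lower() == case.expected.strip().lower()


# Hypothetical evals for a hypothetical customer-support task.
EVALS: List[EvalCase] = [
    EvalCase("refund_within_window", "Order placed 5 days ago. Refund allowed?", "yes"),
    EvalCase("refund_outside_window", "Order placed 45 days ago. Refund allowed?", "no"),
]


def stub_model(prompt: str) -> str:
    # Stand-in for a real LLM call; it always gives the same answer.
    return "Yes"


# Have the "model" take the test and record pass or fail per eval.
results: Dict[str, bool] = {case.name: passes(case, stub_model) for case in EVALS}
```

Each entry in `results` maps an eval's name to whether the model passed it, in the same way a unit test report lists passing and failing tests.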

How to use evals to fine-tune LLMs successfully

Step 1: Define your mountain

The first step is to scope exactly what your custom LLM should do in a perfect world. Each part of the scope is then converted to an eval.

This is not to say you need to know everything up front. At a minimum, you just need to know the key scenarios.

Evals are never finished: you will keep expanding them as you want new capabilities over time.

Step 2: Discover where you are

The next step is to allow your LLM to take the test. This shows you exactly what your LLM is good at and where it fails.

It is wise to run multiple foundation models and fine-tuned models through your evals at this stage to see which one performs best for your use case.
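Comparing candidate models on the same evals can be sketched as follows. This is only an illustration under stated assumptions: each model is represented by a plain Python function, `model_a` and `model_b` are made-up stubs rather than real models, and scoring is a simple exact match.

```python
from typing import Callable, Dict, List, Tuple

Eval = Tuple[str, str]          # (input, perfect output)
Model = Callable[[str], str]    # assumption: a model is a prompt -> text function


def pass_rate(model: Model, evals: List[Eval]) -> float:
    """Fraction of evals for which the model returns the expected output."""
    if not evals:
        raise ValueError("at least one eval is required")
    passed = sum(1 for prompt, expected in evals if model(prompt).strip() == expected)
    return passed / len(evals)


def rank_models(models: Dict[str, Model], evals: List[Eval]) -> List[Tuple[str, float]]:
    """Score every model on the same evals and sort best first."""
    scores = {name: pass_rate(model, evals) for name, model in models.items()}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Toy evals and two hypothetical stand-in models.
EVALS: List[Eval] = [("2 + 2", "4"), ("3 + 3", "6"), ("Capital of France?", "Paris")]
ANSWERS = dict(EVALS)


def model_a(prompt: str) -> str:
    return ANSWERS.get(prompt, "")   # answers every toy eval correctly


def model_b(prompt: str) -> str:
    return "4"                       # only ever right by accident


ranking = rank_models({"model_b": model_b, "model_a": model_a}, EVALS)
```

Because every model takes the same test, the ranking is a like-for-like comparison rather than an impression formed from a few hand-picked prompts.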

Step 3: Analyse how to reach the summit

After running the evals, you will have an objective view of where your model performs well and where it performs poorly.
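One way to turn raw pass/fail results into that kind of insight is to group them by category. The sketch below assumes each eval has been tagged with a category label; the labels and results shown are invented for illustration.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def pass_rate_by_category(results: List[Tuple[str, bool]]) -> Dict[str, float]:
    """Given (category, passed) pairs, return the pass rate per category."""
    totals: Dict[str, int] = defaultdict(int)
    passed: Dict[str, int] = defaultdict(int)
    for category, ok in results:
        totals[category] += 1
        passed[category] += int(ok)
    return {category: passed[category] / totals[category] for category in totals}


# Hypothetical results from one eval run.
results = [
    ("refunds", True),
    ("refunds", True),
    ("shipping", False),
    ("shipping", True),
    ("legal", False),
]

breakdown = pass_rate_by_category(results)

# Categories ordered from weakest to strongest.
weakest_first = sorted(breakdown, key=breakdown.get)
```

In this made-up run the model handles refunds well, is inconsistent on shipping, and fails on legal questions, which shows where the gaps are.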

Now it can be determined whether the next best step is to continue prompt engineering, or to curate data and fine-tune the model.

Step 4: Take the next step

If prompt engineering plateaus and your evals no longer improve, or you are playing whack-a-mole with different evals after writing different prompts, it is time to fine-tune.
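Both warning signs can be checked against your eval history. The sketch below is a rough heuristic, not an established rule: the `window` and `min_gain` thresholds are arbitrary assumptions, and the function names are invented for this example.

```python
from typing import Dict, List


def has_plateaued(pass_rates: List[float], window: int = 3, min_gain: float = 0.02) -> bool:
    """True if the last `window` prompt iterations beat the earlier best by less than `min_gain`."""
    if len(pass_rates) <= window:
        return False  # not enough history to judge
    recent_best = max(pass_rates[-window:])
    earlier_best = max(pass_rates[:-window])
    return recent_best - earlier_best < min_gain


def regressions(previous: Dict[str, bool], current: Dict[str, bool]) -> List[str]:
    """Evals that passed with the previous prompt but fail with the current one."""
    return sorted(name for name, ok in previous.items() if ok and not current.get(name, False))


# Hypothetical pass rates after six rounds of prompt changes.
plateaued = has_plateaued([0.5, 0.6, 0.7, 0.7, 0.71, 0.7])
still_improving = has_plateaued([0.3, 0.4, 0.5, 0.6, 0.7, 0.8])

# Whack-a-mole: fixing eval "c" broke eval "b".
broken = regressions(
    {"a": True, "b": True, "c": False},
    {"a": True, "b": False, "c": True},
)
```

A flat pass rate suggests a plateau, and a non-empty regression list after most prompt changes is the whack-a-mole pattern described above.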

First, we curate data which fills in the knowledge gaps identified in step 3. Then we fine-tune and restart from step 2 until our evals pass.
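The loop of running evals, curating data, and fine-tuning can be sketched in outline as below. This is a toy under heavy assumptions: the "model" is just a dictionary of known answers, `curate` and `fine_tune` are hypothetical stand-ins for real data curation and training, and `max_rounds` is an arbitrary safety limit.

```python
from typing import Callable, Dict, List, Tuple

Model = Dict[str, str]  # toy assumption: a model is a lookup of prompt -> answer


def improve_until_passing(
    model: Model,
    run_evals: Callable[[Model], List[str]],
    curate: Callable[[List[str]], Dict[str, str]],
    fine_tune: Callable[[Model, Dict[str, str]], Model],
    max_rounds: int = 5,
) -> Tuple[Model, int, List[str]]:
    """Repeat: run evals, curate data for the failures, fine-tune. Stop when all evals pass."""
    rounds = 0
    failures = run_evals(model)
    while failures and rounds < max_rounds:
        model = fine_tune(model, curate(failures))
        rounds += 1
        failures = run_evals(model)
    return model, rounds, failures


# Toy evals: each prompt has one perfect answer.
EVALS = {"q1": "a1", "q2": "a2"}


def run_evals(model: Model) -> List[str]:
    # Returns the prompts the model currently gets wrong.
    return [prompt for prompt, expected in EVALS.items() if model.get(prompt) != expected]


def curate(failures: List[str]) -> Dict[str, str]:
    # Pretend we can only curate data for one gap per round.
    return {prompt: EVALS[prompt] for prompt in failures[:1]}


def fine_tune(model: Model, data: Dict[str, str]) -> Model:
    # Stand-in for real training: merge the curated examples into the model.
    return {**model, **data}


final_model, rounds_used, remaining_failures = improve_until_passing(
    {}, run_evals, curate, fine_tune
)
```

The evals act as the stopping condition here: the loop ends when nothing fails, or when the round limit is reached and the remaining failures are returned for inspection.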

Summary

Evaluations define success for an AI fine-tuning project and serve as the north star that every effort is directed towards. It is therefore vital that they are defined and that they cover the key use cases of your domain or task.

At AiTuning, we write custom evaluations for AI fine-tuning all the time. It is the hardest part of the project, and an area which requires careful work. Once they have been created, however, you can rest assured that you are making progress along the fine-tuning journey as more and more evals pass, and that you have a compass guiding you along the way.

Please get in touch if you'd like to learn a little more or have any questions.

Best,
Dan Austin