Fine-Tuned Todo List LLM

LLM fine-tuned with synthetic data to convert speech into structured todo lists.

March 2024 - Daniel Austin

Using synthetic data, AiTuning fine-tuned an LLM for a todo list app, achieving a 10x performance increase on our evals over the base model, a 5x speed increase over GPT-4-Turbo, and a 20x cost reduction.

The Journey

Our journey began by identifying a valuable use case: adding todos.

We envisioned a model that could be seamlessly integrated into a todo list app, allowing users to input tasks by speech with varying dates, times, and complexity: from a simple "I need to do x today" to something as intricate as "I need to do x, y, z... and I want to do x every third Wednesday of the month in the evening as a high priority, remind me 1 hour before. As for y, it is due Friday, but I will do it Wednesday morning..."

To ensure our model's adaptability, we studied the APIs of well-known todo list apps and defined a custom YAML schema for our model's output, letting us plug into any of those APIs effortlessly.
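To make the idea concrete, here is a minimal sketch of what such an output might look like. The real schema was not published, so every field name below is our own illustration, not AiTuning's actual format:

```python
import yaml  # pip install pyyaml

# A hypothetical todo in the spirit of the schema described above; the real
# field names were not published, so everything here is illustrative only.
example = """
todos:
  - title: "Submit expense report"
    priority: high
    due: "next friday"          # relative dates stay symbolic (see below)
    reminder: "-1h"             # 1 hour before the due time
    repeat: "monthly:wed:3"     # every third Wednesday of the month
    subtasks:
      - title: "Collect receipts"
"""

todo = yaml.safe_load(example)["todos"][0]
print(todo["title"], "->", todo["due"])  # Submit expense report -> next friday
```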

We also realised early on that LLMs were not very good at calendar date reasoning. Even with today's date injected into the prompt, the model could not reliably return the correct due date when the user said something like "next Friday" or "in 3 weeks and 2 days". So our YAML schema provided a way for these due dates to be expressed symbolically, and a simple piece of software parsed those expressions into exact dates. Essentially, we used the LLM for what it is great at (interpreting language) and software for what it is great at (rule-based calendar reasoning).
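A minimal sketch of that deterministic resolver, assuming an expression grammar of our own invention ("next <weekday>", "in N weeks and M days"); the real parser and its grammar were not published:

```python
import re
from datetime import date, timedelta

WEEKDAYS = ["monday", "tuesday", "wednesday", "thursday",
            "friday", "saturday", "sunday"]

def resolve(expr: str, today: date) -> date:
    """Resolve a symbolic date expression emitted by the LLM into an exact date."""
    expr = expr.lower().strip()
    if expr.startswith("next "):
        target = WEEKDAYS.index(expr.removeprefix("next "))
        delta = (target - today.weekday() - 1) % 7 + 1  # always strictly in the future
        return today + timedelta(days=delta)
    m = re.fullmatch(r"in (?:(\d+) weeks?)?(?: and )?(?:(\d+) days?)?", expr)
    if m and (m.group(1) or m.group(2)):
        return today + timedelta(weeks=int(m.group(1) or 0),
                                 days=int(m.group(2) or 0))
    raise ValueError(f"unrecognised expression: {expr!r}")

print(resolve("next friday", date(2024, 3, 4)))            # 2024-03-08
print(resolve("in 3 weeks and 2 days", date(2024, 3, 4)))  # 2024-03-27
```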

Next, we created an initial set of model evaluations (evals) covering all known edge cases, each paired with its ideal YAML output. Prompt engineering followed, iteratively editing prompts until we hit the performance ceiling of state-of-the-art models like GPT-4 (41% on our task).
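An eval harness of this kind can be very small. Here is one possible shape, assuming exact structural match as the pass criterion (the article does not publish its actual scoring rule):

```python
import yaml

def run_evals(model, cases):
    """Score a model on (transcript, gold_yaml) pairs. An eval passes only if
    the parsed structures match exactly. Returns accuracy plus the failures,
    which feed the next data batch. `model` is any callable: prompt -> str."""
    failures = []
    for transcript, gold_yaml in cases:
        try:
            got = yaml.safe_load(model(transcript))
        except yaml.YAMLError:
            got = None  # unparseable output counts as a failure
        if got != yaml.safe_load(gold_yaml):
            failures.append((transcript, got))
    accuracy = 1 - len(failures) / len(cases)
    return accuracy, failures
```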

The Data Engine

Inspired by OpenAI cofounder Andrej Karpathy, we developed a data engine. The process: fine-tune a foundation model, run the evals, analyse failed evals for knowledge gaps, create new data to fill those gaps, and repeat until all evals pass.
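In code, the loop might look like this. Every name here is ours, not AiTuning's; `fine_tune`, `evaluate`, and `make_data_for` stand in for the fine-tuning API, the eval harness (e.g. `run_evals` above), and the human-in-the-loop data creation step:

```python
def data_engine(base_model, dataset, fine_tune, evaluate, make_data_for):
    """One possible shape for the data engine loop; a sketch, not the real code.
    fine_tune: (base, rows) -> model; evaluate: model -> (accuracy, failures);
    make_data_for: failures -> new human-reviewed rows."""
    while True:
        model = fine_tune(base_model, dataset)  # always restart from the base model
        accuracy, failures = evaluate(model)
        if not failures:                        # all evals pass: done
            return model, accuracy
        dataset = dataset + make_data_for(failures)  # next batch of 50-100 rows
```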

To create our data, humans wrote the inputs, whilst GPT-4-Turbo & Claude Opus generated the outputs. Every output was then stringently reviewed and edited by a human to ensure conformity to our YAML schema and best practices. As we developed intuition for the logical rules in our data, we also built validation rules in traditional software to safeguard data quality.
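The actual validation rules were not published; these examples, written against the hypothetical schema sketched earlier, show the kind of mechanical check that catches bad synthetic outputs before they reach a training batch:

```python
def validate_row(row: dict) -> list[str]:
    """Return a list of schema violations for one data point (empty = clean).
    Illustrative rules only, mirroring the hypothetical schema above."""
    errors = []
    for todo in row.get("todos", []):
        if not todo.get("title"):
            errors.append("todo missing title")
        if todo.get("priority") not in (None, "low", "medium", "high"):
            errors.append(f"unknown priority: {todo.get('priority')!r}")
        if "reminder" in todo and "due" not in todo:
            errors.append("reminder set without a due date")
    return errors
```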

After fine-tuning with a new batch of 50-100 high-quality data points, we ran the evals and analysed each failed output. This revealed the cases the model didn't yet understand, guiding what we needed in the next data batch. It is worth noting that we put time into reproducing every failure with minimal noise, which allowed us to pinpoint exactly what the model needed to learn, or where errors were hiding in our training data.

After creating each new batch of data, we appended it to the existing dataset and fine-tuned the foundation model again from scratch. Thanks largely to our extremely high-quality data, we did not need a large quantity of tokens, so the compute cost of each run was negligible. And because we knew precisely what data to generate in each batch, we saw clear performance jumps with each new fine-tuned model (5-20% gains per batch).

After 2-3 batches like this, our fine-tuned model outperformed expensive state-of-the-art models, so we switched to generating synthetic data with our own fine-tuned model, still scrutinising and editing each output before including it in the next training batch.

Over time, each batch improved the fine-tuned model, which meant less time spent by a human editing outputs, which translated to faster data batch creation, which led to better models, which led to even less human editing... and so on.

The Triumph

Our fine-tuned model achieved 96% accuracy, outperforming its non-specialised base model by over 10x and state-of-the-art models like GPT-4 by over 2x.

Not only that, but our model is around 20x cheaper and 5x faster than GPT-4. We're confident that fine-tuning specialised AI models is the future, and we can't wait to fine-tune even more for our clients.

Limitations and Possibilities

Currently, our model only supports adding todos. However, completing, updating, and deleting todos, creating lists, and fetching existing todos are all achievable with additional fine-tuning for maximum accuracy. Our model can handle up to 10 tasks with varying levels of detail, including exact and inexact times, dates & partial dates, priorities, reminders, repeating tasks, and subtasks.

We're extremely excited about the potential of fine-tuning specialised AI models and look forward to pushing the boundaries of what's possible.