Build vs. Buy
It's actually build, then buy
🎯 tldr
As GenAI matures, companies will need to build internal muscles to augment and build on top of vendored AI solutions. But first, let's talk evals.

President and co-founder of OpenAI sharing some thoughts way back in the early days
Evals are nothing more than evaluation metrics for LLMs. Nothing new here. The challenge is that with LLMs it's harder than with traditional ML to determine whether an output is accurate. And as GenAI proliferates to the non-technical masses, it's more important than ever to be able to trust our models. As I've said before, people will start turning off their brains as GenAI gets better, and that's a recipe for disaster.
🧠 What Happened
I wanted to share a write-up by a data company (hex.tech) on how they evaluated GPT-4.1 against GPT-4o. Instead of relying on vanity metrics like overall pass rate, which don't tell you where or why the LLM succeeds or fails, they advocate a funnel-based evaluation approach.
The core idea behind the funnel is to break down a complex task into sequential, semantically meaningful stages, then evaluate the model's performance at each of those stages. This decomposition gives you insight into where the model is struggling.
While Hex's domain is text-to-SQL accuracy, the funnel approach generalizes to other use cases. For the Hex team, GPT-4.1 ultimately performed 1.9 times better than GPT-4o on the bottom-of-funnel question, "Did the query return the correct data?" But rather than simply accept that result as is, they wanted to know where the LLM broke down to see if they could improve it. They ran the following set of evals on ~450 tables:
Did our RAG system find the tables needed for the agent to write a correct query?
Did the agent hallucinate any non-existent tables in its query?
Did the agent choose the right tables from the set retrieved when writing the query?
Did the agent get confused and mismatch any columns to the wrong table?
Did the agent hallucinate any columns?
Did the query run without errors?
Did the query return the correct data? <-- this is the original pass/fail question

The Hex team comparing GPT-4o against GPT-4.1 on funnel metrics
Using this funnel view, they could easily determine where GPT-4.1 over- or underperformed against GPT-4o (see Stage 3 and Stage 6). Not only does this add visibility into how the model operates, it also enables you to go in, make the necessary tweaks, and re-run the evals to see if results improve. It's a more complex process, and it requires more work on your part to break down the problem, but it's critical for adding visibility into the black box of GenAI in general.
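To make the funnel concrete, here is a minimal sketch in Python of what a stage-by-stage eval loop could look like. The field names and stage checks are hypothetical stand-ins (Hex hasn't published their harness, and I've collapsed their seven questions into five checks for brevity); the point is that each example is scored at every stage, and a failure at one stage short-circuits the later ones.

```python
from collections import Counter

# Hypothetical stage checks for a text-to-SQL funnel; real checks would
# inspect retrieval output, the generated SQL, and execution results.
STAGES = [
    ("retrieved_needed_tables", lambda ex: set(ex["needed_tables"]) <= set(ex["retrieved_tables"])),
    ("no_hallucinated_tables",  lambda ex: set(ex["query_tables"]) <= set(ex["schema_tables"])),
    ("chose_right_tables",      lambda ex: set(ex["needed_tables"]) <= set(ex["query_tables"])),
    ("query_ran",               lambda ex: ex["ran_ok"]),
    ("correct_result",          lambda ex: ex["result_correct"]),
]

def eval_example(example):
    """Score one example stage by stage, stopping at the first failure."""
    results = {}
    for name, check in STAGES:
        passed = bool(check(example))
        results[name] = passed
        if not passed:
            break  # later stages are meaningless once an earlier one fails
    return results

def funnel_report(examples):
    """Per-stage pass rate over all examples that reached each stage."""
    passes, totals = Counter(), Counter()
    for ex in examples:
        for stage, ok in eval_example(ex).items():
            totals[stage] += 1
            passes[stage] += ok
    return {stage: passes[stage] / totals[stage] for stage in totals}
```

A report like `{"retrieved_needed_tables": 0.95, ..., "correct_result": 0.62}` tells you which stage to attack first, which is exactly the visibility a single pass/fail number can't give you.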
💡 Why It Matters
Back to my tldr: this is an example of building an internal tool (funnel-based evals in this case) to augment a vendored solution. I believe this is going to happen more and more as companies want more control over off-the-shelf models, especially during these early years of GenAI. Unlike modern SaaS, which is a mature space and requires little customization, we need to assume vendored AI solutions are all 20% effective (at best) and come with a higher downside risk. How can we get them to a consistently usable state (>80% effective)? This will require building internal muscles to augment solutions.
If you're working with a vendor LLM (say for data analysis, customer support, summarization, etc.), a headline pass rate like "80% accurate" might hide serious performance gaps.
The funnel approach helps:
Identify which stage the model consistently fails at
Pinpoint whether the failure is due to user prompt issues, model limitations, or vendor updates
Justify feature changes, prompt rewrites, or vendor swaps based on data, not gut feel
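On the "data, not gut feel" point: once you have per-stage pass rates for two models, the comparison that justifies a swap (or a prompt rewrite) is a simple per-stage diff. A minimal sketch, with invented numbers and stage names purely for illustration:

```python
def stage_deltas(rates_a, rates_b):
    """Per-stage difference in pass rate (model B minus model A)."""
    return {stage: round(rates_b[stage] - rates_a[stage], 3) for stage in rates_a}

# Invented per-stage pass rates for two models (not Hex's real numbers):
model_a = {"retrieval": 0.92, "right_tables": 0.70, "query_ran": 0.85, "correct": 0.40}
model_b = {"retrieval": 0.92, "right_tables": 0.88, "query_ran": 0.86, "correct": 0.76}

print(stage_deltas(model_a, model_b))
```

A diff like this localizes the win: here model B's end-to-end gain comes almost entirely from table selection, which is a far more actionable finding than "B is better overall."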
✅ What It Means for You
Two things:
We need to make sure we have data scientists and AI specialists involved in vendor decisions and integrations. Assume solutions will not work out of the box - what role should our team be expected to play?
Even if we're not training models, our team is responsible for integrating them, and understanding their limitations is critical. With funnel-based thinking, we can:
Ask smarter questions of our vendors ("Where is the model failing: retrieval or reasoning?" "If we notice a problem within the process, can we ask for a feature enhancement?").
Guide internal product teams to add logging around intermediate steps - this is a step towards observability in GenAI, just like we have in the data and software space. Again, augmenting solutions.
Use structured feedback to iterate on prompts, RAG pipelines, or fallback logic.
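As a sketch of what logging around intermediate steps could look like: emit one structured record per pipeline stage, keyed by request, so funnel-style evals can later be computed straight from the logs. The pipeline, stage names, and fields below are illustrative, not any vendor's actual API.

```python
import json
import logging
import time

logging.basicConfig(level=logging.INFO, format="%(message)s")
log = logging.getLogger("llm_pipeline")

def log_stage(request_id, stage, **fields):
    """Emit (and return) one structured JSON record per pipeline stage."""
    record = {"request_id": request_id, "stage": stage, "ts": time.time(), **fields}
    log.info(json.dumps(record))
    return record

# Illustrative calls at each step of a hypothetical text-to-SQL request:
log_stage("req-123", "retrieval", tables=["orders", "users"])
log_stage("req-123", "query_generated", sql="SELECT ...")
log_stage("req-123", "execution", ok=True, row_count=42)
```

Because every record carries the request id and stage name, grouping logs by request reconstructs each example's path through the funnel after the fact, even for a vendored model you can't open up.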
👀 What We're Watching
As I've been talking to more early AI companies, a lot of them say the same thing: they're spending more time integrating and cleaning customer data than they expected. I'm also going to continue following eval companies to see how their offerings evolve, and whether some best-in-class frameworks (such as funnels) rise to the top.