How to Evaluate an Agentic AI Consultant (Before You Waste Six Figures)

The agentic AI consulting market is flooded right now. Since early 2025, it seems every generalist data services firm has added "agentic AI" to its website. The gap between "we've built agents" and "we've run agents in production for six months" is enormous, and it rarely shows up in a proposal deck.

An agentic workflow engagement that goes wrong is not cheap. Discovery alone can run $20,000 to $40,000. A full pipeline build is six figures. If the system that gets handed to you isn't built for unattended operation -- if it works in a demo but breaks when your data is messy or your volume doubles -- you've spent real money on something you can't trust.

This guide is for the person who is already past "should we explore agentic AI?" and is now deciding who to bring in. It covers what production experience actually looks like, the red flags that a polished proposal can hide, the specific questions you should ask on an eval call, and an honest look at when you should do it yourself instead.

What Production Experience Actually Looks Like

There is a specific difference between a consultant who has demo'd agentic systems and one who has shipped them. Demo systems run under controlled conditions: clean data, known inputs, a human watching the terminal. Production systems run unattended on schedules -- at 2 AM, against messy real-world data, with no one watching.

The operational concerns that separate the two are not glamorous: error handling when an upstream API returns a malformed response, state recovery when a graph node fails halfway through a job, human review gates for outputs that fall below a confidence threshold, monitoring that tells you whether last night's run completed and what it produced. A consultant who has not built for these conditions will deliver something that works in their demo environment and requires constant babysitting in yours.
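To make the state recovery concern concrete, here is a minimal sketch in Python of a step runner that checkpoints after every completed step, so a rerun resumes at the step that failed instead of starting over. The function name run_with_checkpoints, the JSON checkpoint file, and the requirement that state be JSON-serializable are all assumptions made for illustration. A real orchestration framework would handle this differently, and a production version would also need atomic writes and checkpoint cleanup.

```python
import json
from pathlib import Path


def run_with_checkpoints(steps, state, checkpoint_path):
    """Run (name, function) steps in order, saving state after each one.

    Hypothetical helper: if a checkpoint file exists, steps already recorded
    as done are skipped, so a rerun resumes at the step that failed.
    """
    path = Path(checkpoint_path)
    if path.exists():
        saved = json.loads(path.read_text())
        state, done = saved["state"], saved["done"]
    else:
        done = []

    for name, fn in steps:
        if name in done:
            continue  # finished in an earlier run
        state = fn(state)  # may raise; the file still holds the last good state
        done.append(name)
        path.write_text(json.dumps({"state": state, "done": done}))
    return state
```

If the second of two steps raises on a malformed upstream response, the next scheduled run skips the first step and retries only the second. A useful eval-call question is where the consultant's systems keep the equivalent of this checkpoint file.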

One concrete signal: ask whether they have ever operated an agent past the first 90 days of production. The first month is when you build. Months two and three are when you discover that your assumptions about the data were wrong, that the prompt that worked in development produces hallucinations on edge cases in production, and that the orchestration logic you wrote for the happy path does not handle partial failures gracefully. A consultant who has been through that cycle has experience that is genuinely hard to fake. One who has not will cite GitHub repos and conference talks.

Ask for case studies with real metrics: what the system processed per run, what the error rate was before and after validation was added, how long the handoff took and what broke in the first week. Good answers include embarrassing numbers. Consultants who have only built successes have not built complex enough systems.

Red Flags

These patterns should make you slow down rather than eliminate a candidate immediately. Each one deserves a follow-up question.

No production deployments, only proof-of-concept work. POCs are valuable, but they are a different skill than running a system that operates overnight without supervision.
"We do everything" positioning. Orchestration frameworks, prompt engineering, data pipelines, MLOps, security, compliance. A team that claims deep expertise in every adjacent field has shallow expertise in all of them.
Demo-only portfolio. Beautiful diagrams and GitHub repos with low commit counts, no filed issues, and no evidence of iterating on real failures.
No validation methodology. When you ask how they ensure the agent's outputs are correct, you get an answer about prompt engineering and testing. That is necessary but not sufficient. Production agentic systems need independent verification -- a checker layer that can catch the cases where the maker was confidently wrong.
Vague engagement structure. "We will scope it out once we understand your needs" is fine for the first call. It is a red flag if it is still the answer when you are two weeks into discovery.

Questions to Ask and What Good Answers Look Like

The following questions take about 45 minutes to work through. They are not adversarial -- they are designed to give a competent consultant room to show depth. A team with real production experience will answer these with specifics. A team without it will answer with frameworks and principles.

"Walk me through a production agentic system you have shipped. What does it do, what was the volume, and what broke in the first month?"

A strong answer has scale (number of items processed per run, schedule frequency, time in production) and at least one honest failure story. The failure story is the important part: it shows they stayed through the hard part and built the fix. An answer that is only a success narrative is not informative.

"How do you handle validation in an agentic pipeline?"

This question separates consultants who have thought about accuracy from ones who have not. The honest answer is that LLMs are confident and wrong often enough that you cannot trust them unsupervised on consequential decisions. A good answer describes some form of independent verification -- a maker-checker pattern where one layer produces a result and a second layer independently verifies it, with disagreements flagged for human review rather than resolved automatically. The specific implementation will vary, but the underlying logic should be the same: do not let the system grade its own homework.

In the financial pipeline we built at Labyrinth, the maker-checker pattern caught a tax calculation error where the code was wrong and the LLM was right. That kind of cross-validation only has value because the two layers are genuinely independent and because disagreements go to a human rather than defaulting to one side. A consultant who has never built this kind of validation layer has not operated an agentic system on regulated data.
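The maker-checker logic can be sketched in a few lines of Python. This is an illustrative sketch under stated assumptions, not the pipeline described above: the names Verdict and maker_checker, the numeric tolerance, and the idea that both layers return a single float are all hypothetical. The point it shows is the routing rule: agreement passes, and disagreement goes to a human without either side winning by default.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Verdict:
    value: Optional[float]  # accepted result, or None when escalated
    needs_review: bool
    maker_value: float
    checker_value: float


def maker_checker(
    record: dict,
    maker: Callable[[dict], float],
    checker: Callable[[dict], float],
    tolerance: float = 0.01,  # assumed threshold for "the layers agree"
) -> Verdict:
    """Accept a result only when two independent layers agree.

    On disagreement neither layer wins; the record is flagged for a human.
    """
    made = maker(record)
    checked = checker(record)
    if abs(made - checked) <= tolerance:
        return Verdict(made, False, made, checked)
    return Verdict(None, True, made, checked)
```

In use, maker might be deterministic code and checker an LLM call, or the reverse. What matters is that the two do not share logic, so that one layer's mistake is unlikely to be repeated by the other.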

"What orchestration framework do you use and when would you choose something different?"

For complex agentic workflows -- conditional routing, stateful multi-step processing, human-in-the-loop checkpoints -- LangGraph is, in our experience, the most capable option available right now. A good answer uses LangGraph or something comparable and explains why. It also explains when simpler options (Airflow, Prefect, plain Python) are the right choice instead. A consultant who reaches for a complex orchestration framework for every problem, or who cannot explain the tradeoffs, has not done the calibration work that production systems require.

"What does your monitoring setup look like for an autonomous agent fleet?"

You want to hear: session logs with tool call counts and file access records, automated alerting on error conditions, a human review queue for low-confidence outputs, and a clear escalation path when something unexpected happens at 3 AM. The answer should also include something about what the agent actually did during a session -- which files it read, which steps it completed, where it spent its turns. An agent you cannot introspect is an agent you cannot improve.
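As a rough illustration of what such a session record might capture, here is a small Python sketch. The class name SessionLog, its fields, and the 0.8 confidence threshold are assumptions chosen for the example. A real setup would persist these records and wire needs_alert into whatever alerting system is already in place.

```python
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class SessionLog:
    """Hypothetical per-session record of what an agent actually did."""

    session_id: str
    tool_calls: Counter = field(default_factory=Counter)
    files_read: list = field(default_factory=list)
    errors: list = field(default_factory=list)
    review_queue: list = field(default_factory=list)

    def record_tool_call(self, tool, path=None):
        self.tool_calls[tool] += 1
        if path is not None:
            self.files_read.append(path)

    def record_error(self, message):
        self.errors.append(message)

    def record_output(self, item_id, confidence, threshold=0.8):
        # Low-confidence outputs are queued for a human, not shipped.
        if confidence < threshold:
            self.review_queue.append(item_id)

    def summary(self):
        return {
            "session_id": self.session_id,
            "tool_calls": dict(self.tool_calls),
            "files_read": sorted(set(self.files_read)),
            "error_count": len(self.errors),
            "needs_alert": bool(self.errors),
            "review_queue": list(self.review_queue),
        }
```

A summary like this, written at the end of every run, answers the morning-after questions: did the run complete, what did it touch, and what is waiting for a person.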

"What went wrong on a past project, and how did you handle it?"

This is the most diagnostic question on the list. Everyone has a war story. The content of the story matters less than the self-awareness it demonstrates: did they catch the failure early or late, did they own it or externalize it, and did they build something systematic to prevent recurrence? A consultant who cannot name a concrete failure has not been honest with you or with themselves.

How to Structure the Engagement

Whether or not you end up using an external consultant, this is the structure that works for agentic AI projects:

Discovery comes first and it deserves real time. A week is usually not enough. You need to understand the data sources, the quality characteristics of each one, where ambiguity lives in the problem, and what the human review capacity looks like for cases the system cannot handle automatically. Skipping this step produces systems that are architecturally correct and practically brittle.

Architecture design should happen before any code is written. The graph structure of the workflow, the state management approach, where validation layers go, how partial failures are handled -- these decisions are hard to change after implementation starts. A good consultant will produce an architecture document and walk you through it before they open an editor.

The build phase should have validation built in from the start, not added at the end. Validation is not a QA pass at the handoff; it is a structural component of the workflow. If the engagement plan describes testing as a phase that comes after the build, push back.

Handoff planning should start in week two. Who owns the system after the engagement ends, what does the runbook look like, who gets paged at 3 AM, and what is the support window? The answers to these questions should be in writing before the project is half done.

When DIY Is the Right Call

Hiring a consultant is worth the overhead when the problem is complex enough, the stakes are high enough, and the team genuinely lacks the experience to do it well. None of those conditions is always true.

If your team already has production experience with a graph orchestration framework, the consultant may not be adding value proportional to cost. If the problem is well-scoped and the data is clean, the "agentic AI" framing may be more complexity than the actual problem needs. A deterministic pipeline with explicit validation logic is easier to operate and debug than an LLM-based one, and for many data engineering tasks it is the right tool.

A hybrid approach often makes sense: a short engagement to review the architecture and validate the approach, with the build staying in-house. This gives you an outside perspective on the structural decisions without the full overhead of an external build. Ask whether the consultant offers this option; a good one will not push you toward a longer engagement than your problem warrants.

One Last Thought

The agentic AI consulting space will likely consolidate over the next 18 months. Once it does, the firms with real production experience and honest case studies will be easier to find. For now, the filtering is on you. Ask the hard questions, push on the case studies, and weight the answers that include failure stories and honest tradeoffs more heavily than the polished ones.

If you want to talk through your specific situation -- the data, the problem, and whether an outside engagement makes sense -- reach out at /contact. We are happy to tell you when the answer is to do it yourself.