It was February 3, 2022. ChatGPT didn’t exist yet. Most ops leaders were still debating whether AI was ready for enterprise use. We went live with an AI model that would handle 100% of our incoming support ticket classification — replacing a process that had previously consumed the equivalent of six full-time employees.

This is what I learned. Not the version that ends up in a renewal deck, but the actual experience of scoping it, building it, training it on real-world data for three months before it hit the numbers we needed, and living with it once it did.

Before I get into the AI work, one prerequisite is worth calling out. Before any AI model could do intelligent distribution, we had already rebuilt our training framework and restructured the team into seven skills-based tiers. That tiering work wasn’t just an organizational improvement; it was the foundation the AI model depended on. The model needed to understand both the complexity of the ticket and the availability and capability of each agent before it could route effectively. Without the tiering in place first, the AI would have been distributing work into an undifferentiated pool. I cover that project in detail in “Why We Rebuilt Our Support Training Into 7 Tiers Before Touching AI”, but the sequencing mattered, so it’s worth flagging here.

The Problem We Were Actually Solving

Before getting into the implementation, it’s worth being precise about what the problem was — because how you define the problem determines whether AI is even the right answer.

We had a high-volume support operation. Every incoming ticket required classification before it could be routed to the right team. Someone had to read the ticket, determine what type of issue it was, tag it appropriately, and send it on its way. That sounds simple. At scale, with hundreds of tickets per day, it consumed six FTEs whose entire job was reading and tagging. Not resolving. Not supporting customers. Reading and tagging.

This is exactly the kind of problem AI is built for: high volume, repetitive pattern recognition, well-defined output categories. If your problem looks like this, AI is probably worth evaluating. If your problem is more ambiguous — nuanced customer interactions, complex judgment calls, situations requiring empathy — the calculus is different. If you’re thinking about where AI fits in the broader contact centre picture, this post on CXMaster covers the agentic AI landscape with a similarly sceptical eye.

We selected Forethought as the platform and designed two concurrent models.

Two Models, Not One

Most people think about AI ticket triage as a single model. We ran two simultaneously, addressing different parts of the same workflow.

The Category Triage Model classified every incoming ticket into the correct support category, routing it to the right team automatically. This was the high-volume model — 49,563 correct predictions in year one at a cost of $3.23 per ticket categorized, saving $160,000 annually.

The Approve/Decline Model handled a specific workflow type that required a binary decision on every submission. More precise, lower volume — 5,258 correct predictions at $7.61 per decision, saving $40,000 annually.

Combined: $200,000 saved in year one. ROI of 4.48x on the platform cost. Both models at 89% accuracy with 100% ticket coverage.
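For anyone who wants to check the arithmetic behind those headline figures, here is a minimal sketch. It assumes the per-ticket and per-decision dollar figures are the value attributed to each correct prediction, and that ROI means savings divided by platform cost; neither definition is spelled out above, and the platform cost itself is not stated, so the last number is only what those assumptions would imply.

```python
# Figures quoted in the post.
category_correct = 49_563
category_unit_value = 3.23   # USD; ASSUMED to be value per correct prediction
decision_correct = 5_258
decision_unit_value = 7.61   # USD; same assumption

category_savings = category_correct * category_unit_value
decision_savings = decision_correct * decision_unit_value
total_savings = category_savings + decision_savings

# The platform cost is not given in the post. This is only the cost that a
# 4.48x ROI on roughly $200,000 would imply if ROI = savings / cost.
roi_multiple = 4.48
implied_platform_cost = 200_000 / roi_multiple

print(f"Category Triage Model:  ${category_savings:,.0f}")
print(f"Approve/Decline Model:  ${decision_savings:,.0f}")
print(f"Combined:               ${total_savings:,.0f}")
print(f"Implied platform cost:  ${implied_platform_cost:,.0f}")
```

Under those assumptions the two products land within a few dollars of $160,000 and $40,000, which is consistent with the rounded totals quoted above.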

Those numbers look clean in a slide. The work behind them was not — and the numbers didn’t appear on day one.

You can see the full project breakdown, including the financial model, in the AI Ticket Triage case study on DataDrivenOps.

The Three Months Nobody Talks About

We went live on February 3, 2022. The models were trained on historical data and performing well in testing. What the initial deployment revealed was something every AI practitioner knows and few vendor decks mention: real-world data is messier than your training set.

The first few weeks of live operation surfaced ticket types and phrasings the model hadn’t encountered in training. Edge cases. Regional variations in how customers described the same issue. New product features that generated new categories of questions the historical data didn’t cover.

For approximately three months after go-live, we ran a continuous improvement cycle — reviewing misclassifications, identifying patterns, and feeding real-world data back into the models. Each retraining cycle improved accuracy incrementally. Steady, measurable progress toward the numbers we’d targeted rather than dramatic overnight improvement.

By month three, both models had stabilized at 89% accuracy. That’s when the economics in the renewal deck started to reflect what we were actually experiencing.

The lesson: plan for a training runway, not a launch day. Budget three to six months of active model improvement before you treat the implementation as complete. If your implementation plan ends at go-live, you’re only halfway done.

The Part Nobody Puts in the Vendor Deck

Data quality is the actual project. The AI implementation took weeks. The data preparation that made it possible took significantly longer.

Every AI model is trained on historical data. Ours was trained on historical ticket data — which meant every ticket classified by a human over the previous years. The problem: humans are inconsistent. The same ticket type had been tagged a dozen different ways by different agents. Abbreviations, typos, deprecated categories, classifications that reflected how a team used to work rather than how it currently operated.

Before a single line of model configuration was written, we audited, cleaned, and standardized that historical data. This is unglamorous work. It doesn’t show up in demos. But skip it and you’re training your model on noise. This is closely connected to the broader data governance work we do in the DataDrivenOps Data Governance pillar — the discipline is the same whether you’re feeding a BI dashboard or an AI model.
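To make the cleanup step concrete, here is a rough sketch of the kind of label standardization involved. It is not the tooling we used; the category names, the alias table, and the function names are all hypothetical, and a real alias table would come out of the audit itself.

```python
import re
from typing import Optional

# Hypothetical canonical taxonomy and alias table (illustrative values only).
CANONICAL = {"billing", "login", "shipping"}
ALIASES = {
    "bill": "billing",
    "billng": "billing",     # typo
    "payments": "billing",   # deprecated category folded into billing
    "log in": "login",
    "signin": "login",
}

def normalize_label(raw: str) -> Optional[str]:
    """Map a raw historical tag to a canonical category, or None if unknown."""
    # Lowercase, trim, and collapse whitespace, underscores, and hyphens.
    key = re.sub(r"[\s_\-]+", " ", raw.strip().lower())
    key = ALIASES.get(key, key)
    return key if key in CANONICAL else None

def clean_history(rows):
    """Split (text, raw_label) rows into training rows and rows needing review."""
    cleaned, needs_review = [], []
    for text, raw_label in rows:
        label = normalize_label(raw_label)
        if label is None:
            needs_review.append((text, raw_label))
        else:
            cleaned.append((text, label))
    return cleaned, needs_review
```

The point of the second list is that unknown tags are set aside for a person to resolve rather than silently guessed at, so the noise does not leak back into the training set.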

Then there’s the 89% problem. 89% accuracy sounds excellent. In production, it means 11% of tickets are misclassified. At our volumes, that represented real tickets going to the wrong team, generating rework, and occasionally frustrating customers. The question is not whether 89% is good — it is. The question is what you do with the 11%.

We built a confidence threshold layer. Any prediction the model scored below a set confidence level was routed to a human review queue rather than automatically applied. This was the single best design decision we made. Don’t build AI triage that forces a prediction on every ticket. Build AI triage that knows when to ask for help.
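The idea is simple enough to sketch in a few lines. This is an illustration of the pattern, not the platform's API: the `Prediction` type, the queue names, and the 0.80 cutoff are all assumptions made for the example, and the right threshold depends on your own accuracy and review capacity.

```python
from dataclasses import dataclass

@dataclass
class Prediction:
    ticket_id: str
    category: str
    confidence: float  # model score in [0, 1]

# Hypothetical cutoff; the actual value we used is not given in this post.
CONFIDENCE_THRESHOLD = 0.80

def route(prediction: Prediction, threshold: float = CONFIDENCE_THRESHOLD) -> str:
    """Return the destination queue for a single prediction."""
    if prediction.confidence >= threshold:
        # Confident enough: apply the predicted category automatically.
        return f"team:{prediction.category}"
    # Otherwise the model "asks for help" by deferring to a person.
    return "human_review"

if __name__ == "__main__":
    for p in (Prediction("T-1", "billing", 0.93), Prediction("T-2", "login", 0.55)):
        print(p.ticket_id, "->", route(p))
```

Raising the threshold sends more tickets to the review queue and fewer wrong ones to the teams; lowering it does the opposite. That trade-off is the dial to tune during the training runway.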

Finally: the org change was harder than the technology. Six people had been spending their days reading and tagging tickets. When the AI took over that work, we had to figure out what those six people would do instead. This is not a technology problem. It’s a leadership problem. The human side of technology transitions is something I write about more on CXMaster’s Leadership & Career section — the pattern repeats across every major operational change.

What We’d Do Differently

Start the retraining pipeline on day one. Don’t wait for the renewal cycle. From the first week you have a growing dataset of edge cases. Build the retraining cadence into the implementation plan from the start. Our plan allowed for up to two retrains per year, and I’d have started that process earlier.

Instrument the 11% from the beginning. We tracked overall accuracy. We should have tracked which categories had lower accuracy and which misclassifications had the highest downstream cost. That granularity would have accelerated the improvement cycle significantly.
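A sketch of what that instrumentation could look like, assuming you log the corrected category alongside the predicted one for each ticket. The function names and the idea of a per-confusion rework cost table are hypothetical, and the cost figures would have to come from your own operation.

```python
from collections import Counter, defaultdict

def per_category_accuracy(records):
    """records: iterable of (true_category, predicted_category) pairs."""
    totals, correct = Counter(), Counter()
    for true_cat, pred_cat in records:
        totals[true_cat] += 1
        if true_cat == pred_cat:
            correct[true_cat] += 1
    return {cat: correct[cat] / totals[cat] for cat in totals}

def costliest_confusions(records, rework_cost, default_cost=1.0):
    """Rank (true, predicted) error pairs by total assumed rework cost."""
    cost = defaultdict(float)
    for true_cat, pred_cat in records:
        if true_cat != pred_cat:
            pair = (true_cat, pred_cat)
            cost[pair] += rework_cost.get(pair, default_cost)
    return sorted(cost.items(), key=lambda item: item[1], reverse=True)
```

The first function shows which categories drag the overall number down; the second shows which specific mix-ups cost the most to fix, which is not always the same list.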

Involve the receiving teams earlier. The teams who received routed tickets had opinions about the classification taxonomy we didn’t fully incorporate until after launch. Earlier involvement produces better categories and faster adoption.

Is AI Ticket Triage Right for You?

It depends on three things. Volume — you need enough that cost-per-ticket savings compound into something meaningful. Category clarity — if your taxonomy is fuzzy, fix that first, because AI will reproduce your ambiguity at scale. And data quality — run a quick audit of your historical classifications before committing to an implementation.

One more prerequisite based on what we learned: make sure your team structure and routing logic are solid before you introduce AI. We were fortunate the skills-based tiering work happened first. If we’d tried to implement AI triage into an undifferentiated support team, the distribution problem would have remained even with perfect classification.

AI got a lot louder in 2023. But the fundamentals that made this work in February 2022 haven’t changed: clean data, a realistic multi-month training runway, a plan for the edge cases, and honest leadership through the organizational change. The technology is the easy part. It always is.