April 29, 2025

Le Juge - Redefining AI Ticket Scoring with Human-Like Precision

Le Juge: when multiple AIs unite to deliver rock-solid ticket evaluations, slashing false positives and boosting confidence.

Feature

Picture this: you’re sipping your morning coffee, eyeing an endless queue of tickets, and wondering, “Which of these can our AI actually nail?” We’ve all been there—trying to trust a single “good/bad” score from an LLM, only to find it drifting way off what a human would say. No joke, that mismatch can sink your automation ambitions.

That’s why we built Le Juge, a next-gen evaluation engine that marries multiple LLM judges, clever clustering, and targeted data moves to mirror human judgment—so you know exactly how many tickets you can hand off to automation with confidence.

From One Score to Many: Why We Needed Le Juge

When we first rolled out our AI support evaluator, it slapped a single score on every answer. It was fast, sure—but reliable? Not so much. We tracked a worrying gap between annotator ratings and what the model spat out: false positives (answers the AI rated “great” but human annotators flagged as poor) hovered around 8–10%. One annotator even saw 22%! Ouch.

We realized three things:

  1. A lone LLM judge is brittle.

  2. Some ticket categories—refunds, scheduling—tripped up the model more.

  3. Clustering similar questions could help isolate problem areas.

We needed a richer, more precise approach. Enter cumulative scoring, multi-model voting, data augmentation, and context-aware clustering. Voilà: Le Juge.

How Le Juge Works Its Magic

Le Juge isn’t a single monolith—it’s a pipeline of steps that transform raw ticket/answer pairs into a reliable “solve rate” for your client.

Dual-Dataset Cross-Validation

We introduced a second client dataset and had humans annotate 200 additional questions. By scoring both datasets in parallel, we cross-validate performance and ensure Le Juge learns generalizable patterns.

Ensemble of LLM Judges + Vote Strategies

We call three to five LLMs per answer—GPT-4o, GPT-4.1, Gemini 2.5 Pro, Claude Sonnet, Mistral—and ask each for a 1–4 quality rating.

Then we aggregate:

  • MINIMUM vote for high-precision use cases

  • AVERAGE vote when recall is king

  • MAJORITY vote for balance

In our trials, the MINIMUM strategy with a five-model ensemble hit a Cohen’s Kappa of 0.74 (substantial agreement) and an F₀.₅ score of 0.89, edging closer to human consistency.
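The three strategies above boil down to a few lines of aggregation logic. Here is a minimal sketch, assuming each judge returns an integer rating from 1 to 4; the function name and interface are illustrative, not Le Juge’s actual API.

```python
# Combine per-judge ratings into a single score using one of three strategies.
from collections import Counter
from statistics import mean

def aggregate(ratings: list[int], strategy: str = "minimum") -> float:
    """Aggregate 1-4 quality ratings from an ensemble of LLM judges."""
    if strategy == "minimum":   # high precision: the harshest judge wins
        return float(min(ratings))
    if strategy == "average":   # high recall: smooth over outlier judges
        return mean(ratings)
    if strategy == "majority":  # balanced: the most common rating wins
        return float(Counter(ratings).most_common(1)[0][0])
    raise ValueError(f"unknown strategy: {strategy}")

ratings = [4, 3, 4, 2, 4]  # five judges scoring one answer
print(aggregate(ratings, "minimum"))   # 2.0
print(aggregate(ratings, "average"))   # 3.4
print(aggregate(ratings, "majority"))  # 4.0
```

Note how a single skeptical judge drags the MINIMUM score down to 2.0 while MAJORITY still reports 4.0—exactly the precision/recall dial described above.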


Category Oversampling & Re-annotation

We pinpointed high-variance categories—Refunds/Returns, Appointment Scheduling, Rescheduling, Order Status—and oversampled them. By re-annotating more tickets in those buckets, Le Juge learned to shrink false positives in the toughest spots.
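Oversampling itself is simple: repeat tickets from the troublesome categories so they carry more weight in evaluation. The category names below come from the post; the ticket layout and duplication factor are assumptions for illustration.

```python
# Duplicate tickets from high-variance categories before evaluation.
import random

HIGH_VARIANCE = {"Refunds/Returns", "Appointment Scheduling",
                 "Rescheduling", "Order Status"}

def oversample(tickets: list[dict], factor: int = 2, seed: int = 0) -> list[dict]:
    """Return tickets with high-variance categories repeated `factor` times."""
    extra = [t for t in tickets if t["category"] in HIGH_VARIANCE]
    boosted = tickets + extra * (factor - 1)
    random.Random(seed).shuffle(boosted)  # avoid ordering artifacts
    return boosted

tickets = [{"id": 1, "category": "Order Status"},
           {"id": 2, "category": "Billing"}]
print(len(oversample(tickets)))  # 3: the Order Status ticket appears twice
```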

Tagline-Powered Clustering

Instead of lumping all tickets together, we auto-generate short “taglines” (e.g., “late delivery,” “account reset”) via a small LLM prompt. Then we embed and cluster by tagline, sometimes applying PCA for tighter groupings. That way, we can analyze performance per cluster and spot niche weaknesses.
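The embed-reduce-cluster step looks roughly like this. In production the taglines would be embedded with an LLM embedding model; here TF-IDF vectors stand in so the example runs offline, and TruncatedSVD plays the role of PCA on the sparse matrix. All taglines except the two from the post are made up.

```python
# Embed taglines, reduce dimensionality, then cluster similar tickets.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.cluster import KMeans

taglines = ["late delivery", "delayed package", "account reset",
            "password reset", "refund request", "refund status"]

vectors = TfidfVectorizer().fit_transform(taglines)          # stand-in embedding
reduced = TruncatedSVD(n_components=3, random_state=0).fit_transform(vectors)
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(reduced)

for tag, label in zip(taglines, labels):
    print(label, tag)
```

Once each ticket carries a cluster label, computing per-cluster precision is a simple group-by—which is how a weak cluster like “order status” gets spotted.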

Prompt Optimization & Additional Evaluators

We experimented with dozens of prompts—explicit rubrics, examples, tone instructions—to see which correlated best with human scores. We also added evaluators from different providers, ensuring we’re not betting on one vendor’s quirks.

What We Learned—And Why It Matters

No kidding, building Le Juge felt like leveling up in a game.

Here are our top takeaways:

  • Conservative votes win trust. The MINIMUM strategy gave us 96.7% precision in one experiment—perfect when you absolutely can’t misclassify a poor answer as good.

  • Recall vs. precision is a dial. Switching to AVERAGE votes pumped recall to 98%, but precision dropped. That trade-off is crucial to tune based on your support flow.

  • Cluster insights unlock targeted improvements. Once we saw that the “order status” cluster lagged, we added 50 more of those tickets and cut false positives in half in that niche.

  • Prompts matter more than you think. A small tweak (“Rate how helpful this answer is on a scale of 1–4, referencing policy XYZ”) increased Cohen’s Kappa by 0.05 on average.

  • Multiple evaluators smooth out biases. Some LLMs lean generous, others strict. A judicious ensemble irons out extremes.

  • Visual dashboards drive action. We output a per-client CSV and interactive charts—solved percentage by category, confusion matrices, ROC curves—so both data science and support ops can decide next steps.
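All the metrics quoted in this post—precision, Cohen’s Kappa, F₀.₅—are one-liners with scikit-learn. The labels below are toy data (1 = good answer, 0 = poor), not our real annotations:

```python
# Compute agreement metrics between human annotators and Le Juge verdicts.
from sklearn.metrics import cohen_kappa_score, fbeta_score, precision_score

human = [1, 0, 1, 1, 0, 1, 0, 1]  # annotator verdicts
judge = [1, 0, 1, 0, 0, 1, 0, 1]  # Le Juge verdicts after voting

precision = precision_score(human, judge)
kappa = cohen_kappa_score(human, judge)
f_half = fbeta_score(human, judge, beta=0.5)  # weighs precision over recall

print(f"precision {precision:.3f}, kappa {kappa:.3f}, F0.5 {f_half:.3f}")
```

Here the judge misses one good answer but never blesses a bad one, so precision is perfect while kappa and F₀.₅ take a small hit—the same conservative profile the MINIMUM strategy produces.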

The Cool Part: Predicting Your “Solve Rate”

Here’s the real mic drop: Le Juge doesn’t just grade answers—it tells you how many tickets your AI can actually handle. By applying a confidence threshold (we settled on average score ≥ 2.5), you instantly get:

  • Total tickets

  • Tickets auto-solvable

  • Category breakdown

  • Estimated human hours saved
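The report above reduces to a threshold filter plus a group-by. A minimal sketch, assuming each ticket carries its averaged judge score; the field names and the minutes-per-ticket figure are illustrative assumptions:

```python
# Apply the >= 2.5 average-score threshold and summarize the solve rate.
from collections import Counter

tickets = [
    {"category": "Order Status",    "avg_score": 3.2},
    {"category": "Refunds/Returns", "avg_score": 2.1},
    {"category": "Order Status",    "avg_score": 2.8},
    {"category": "Billing",         "avg_score": 1.9},
]

THRESHOLD = 2.5
MINUTES_PER_TICKET = 6  # assumed average human handling time

solvable = [t for t in tickets if t["avg_score"] >= THRESHOLD]
solve_rate = len(solvable) / len(tickets)

print("total tickets:   ", len(tickets))
print("auto-solvable:   ", len(solvable))
print("solve rate:      ", f"{solve_rate:.0%}")
print("by category:     ", dict(Counter(t["category"] for t in solvable)))
print("est. hours saved:", len(solvable) * MINUTES_PER_TICKET / 60)
```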

Imagine pitching that to a customer: “With our automation, we can reliably close 37% of your support tickets without a single human click.” Boom. Instant ROI conversation.

Bringing It All Together

Le Juge was born from one simple truth: if you can’t trust your evaluation, you can’t scale automation. By layering multiple LLM judges, clustering tickets, enriching datasets, and meticulously tuning prompts, we built a tool that speaks human. Every metric—from Precision, Recall, Cohen’s Kappa, F₀.₅ score, to AUC—is aligned with real annotator judgments.

And because we document and visualize it all, your team can iterate faster: add more data, adjust aggregation, or refine prompts based on cluster reports.

Trust me, this isn’t academic fluff. It’s the difference between saying “our AI handles 10% of tickets” and “our AI reliably handles 36% of your worst tickets.” And that clarity is gold when you’re driving automation adoption.

Ready to see Le Juge in action? Reach out, and let’s map your support landscape—identify easy wins, target stubborn clusters, and scale automation with confidence. After all, life’s too short for blind trust in a single score.

Let Le Juge judge for you.

Logo

Empowering Your Support,
Enhancing Your Success, Every Step of the Way.

Linkedin
x.com

© 2025 Quivr. All rights reserved.
