Note: Our customers are increasingly interested in the details behind AI models, so this is a more technical article than usual.
From Extremes to Nuance
When we first launched Balto’s sentiment analysis over a year ago, we deliberately kept it simple. Calls were labeled as either positive or negative, surfacing conversations at the extremes: the most satisfied and the most frustrated.
This worked for several reasons: extremes are revealing, they have an outsized impact, they’re easy for AI to detect, and customers told us they didn’t want the “noise” of “average” conversations, which makes it harder to see the conversations that really matter.
But customer feedback and model improvements made one thing clear: single labels don’t capture the reality of most conversations. True coaching value lies in the nuance, the subtle shifts in tone and emotion that shape everyday interactions.
Why One Label Isn’t Enough
Imagine a customer calling in furious and threatening to cancel their service. The agent resolves the issue, apologizes, and offers a discount. By the end, the customer’s tone shifts to something far more positive. The customer even apologizes for being rude at the beginning of the conversation.
Now, labeling this conversation with “positive sentiment” might seem obvious, but is it? Any one label oversimplifies. It could be positive, because the result is positive. Or maybe it’s negative, because the negative start to the conversation might be the most revealing and helpful to learn from. Or maybe it is actually neutral, since the positive and negative roughly balance each other out?
And that’s a simple conversation. Longer, more complex calls make the problem worse. Ultimately, what matters isn’t just the outcome; it’s the entire journey.
That’s why we advanced our sentiment models: to track sentiment as it evolves within a call, rather than assign one catch-all label.
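To make the labeling problem concrete, here is a minimal sketch in Python. The scores are invented for illustration and use a 1-9 scale where 5 is neutral; the thresholds in `label` are assumptions, not Balto's actual cutoffs. The point is only that the same call earns three different labels depending on which summary you pick.

```python
from statistics import mean

# Hypothetical per-window scores for the call described above:
# a furious opening, a resolution, and a warm close.
scores = [1, 2, 2, 3, 5, 7, 8, 8, 9]

def label(score: float) -> str:
    """Collapse a score into one coarse label (thresholds are illustrative)."""
    if score < 4:
        return "negative"
    if score > 6:
        return "positive"
    return "neutral"

by_outcome = label(scores[-1])    # judge the call by how it ended
by_worst = label(min(scores))     # judge it by its lowest moment
by_average = label(mean(scores))  # judge it by the overall average

print(by_outcome, by_worst, by_average)  # positive negative neutral
```

All three answers are defensible, which is exactly why a single label loses information that the full trajectory keeps.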
How Our New Sentiment Model Works
The short version: Balto now measures sentiment every ~800 milliseconds, assigns a calibrated 1–9 score, and generates a time-series graph showing how sentiment rises and falls throughout the conversation.
The technical version: Our three-stage pipeline balances accuracy, scalability, and efficiency.

Step 1. Seed a Large LLM
- Carefully designed instructions guide a large model to generate thousands of high-quality sentiment labels.
- The result is a foundation of nuanced, “reasoned” examples that general-purpose models don’t deliver out of the box.
Put Simply: We start by teaching a really smart AI to label sentiment on a few thousand conversations, so we have clear, high-quality examples to build from. From these examples, we can actually develop entire synthetic (fake) conversations to train the model even further.
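For readers who want to picture the seeding step, here is a rough sketch in Python. Everything in it is an assumption made for illustration: the instruction text, the `SCORE:` reply format, and the helper names (`build_prompt`, `parse_score`, `seed_labels`) are hypothetical, and `fake_model` stands in for a real LLM call so the example runs on its own.

```python
import re
from typing import Callable, Optional

# Hypothetical instruction text; the real prompt is not published here.
INSTRUCTIONS = (
    "Rate the customer's sentiment in the utterance below on a 1-9 scale, "
    "where 1 is extreme negativity, 5 is neutral, and 9 is extreme positivity. "
    "Explain your reasoning in one sentence, then finish with 'SCORE: <n>'."
)

def build_prompt(utterance: str) -> str:
    return f"{INSTRUCTIONS}\n\nUtterance: {utterance}"

def parse_score(reply: str) -> Optional[int]:
    """Pull a 1-9 score out of a reply; return None if it is missing or invalid."""
    match = re.search(r"SCORE:\s*([1-9])\b", reply)
    return int(match.group(1)) if match else None

def seed_labels(utterances, ask_model: Callable[[str], str]):
    """Label each utterance, dropping replies that cannot be parsed."""
    labeled = []
    for text in utterances:
        score = parse_score(ask_model(build_prompt(text)))
        if score is not None:
            labeled.append({"text": text, "score": score})
    return labeled

# Stand-in for a real LLM call so the sketch runs without any service.
def fake_model(prompt: str) -> str:
    return "The customer is threatening to cancel. SCORE: 2"

print(seed_labels(["I want to cancel right now."], fake_model))
# [{'text': 'I want to cancel right now.', 'score': 2}]
```

Asking for a short explanation before the score is one common way to get “reasoned” labels, and discarding unparseable replies keeps the seed set clean.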
Step 2. Distill into an ~8B model
- That seed is distilled into an ~8B-parameter model.
- It auto-labels tens of millions of utterances, scaling sentiment judgment to the size of real-world call data.
Put Simply: We take those first examples and train a smaller, faster AI that can label tens of millions of call snippets on its own, so the system can handle the huge scale of real-world conversations. We’ve always prioritized speed – so we have sentiment models designed to surface results quickly, and models designed to prioritize accuracy.
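The idea behind distillation can be shown with a deliberately tiny toy in Python. This is not our ~8B model or its training procedure: the keyword-based `teacher` is a hypothetical stand-in for the large seed model, and the "student" is just an average score per word. What it does illustrate is the core move, where the student learns only from the teacher's labels and never from human ones.

```python
from collections import defaultdict

NEUTRAL = 5.0

def teacher(utterance: str) -> int:
    """Stand-in for the large seed model: a hypothetical keyword rule."""
    text = utterance.lower()
    if "cancel" in text or "terrible" in text:
        return 2
    if "thank" in text or "great" in text:
        return 8
    return 5

def train_student(utterances):
    """Fit a tiny student on the teacher's labels: average score per word."""
    totals, counts = defaultdict(float), defaultdict(int)
    for utterance in utterances:
        score = teacher(utterance)
        for word in utterance.lower().split():
            totals[word] += score
            counts[word] += 1
    return {word: totals[word] / counts[word] for word in totals}

def student_predict(word_scores, utterance: str) -> float:
    """Average the learned scores of known words; fall back to neutral."""
    known = [word_scores[w] for w in utterance.lower().split() if w in word_scores]
    return sum(known) / len(known) if known else NEUTRAL

unlabeled = [
    "this is terrible service",
    "i want to cancel",
    "thank you so much",
    "that is great news",
    "my account number is ready",
]
student = train_student(unlabeled)
print(student_predict(student, "cancel my service"))  # 3.0
```

Once trained, the student scores new text without calling the teacher at all, which is what makes labeling tens of millions of utterances affordable.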
Step 3. Deploy a compact production model
- A smaller runtime model is trained on the distilled data.
- It sustains ~2,500 requests per second, delivering new sentiment scores every ~800 ms.
- It’s LLM-agnostic, so we can swap in newer backbones as they improve without starting over.
Put Simply: We then build an even smaller, super-fast AI that can handle thousands of calls per second and update sentiment nearly in real time. It’s very flexible too, so we can plug in newer AI tech as it comes out without rebuilding everything from scratch. This last point is very important – LLMs are moving quickly, and you don’t want to be stuck with a vendor that is locked into one particular LLM architecture.
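One common way to keep a system from being tied to a single model is to code against a small interface. The Python sketch below shows that pattern under stated assumptions: `SentimentBackbone`, `SentimentService`, and both placeholder backbones are hypothetical names invented for this example, not Balto's internal design.

```python
from typing import Protocol

class SentimentBackbone(Protocol):
    """Anything that can turn an utterance into a score (hypothetical interface)."""
    def score(self, utterance: str) -> int: ...

class KeywordBackbone:
    """Placeholder model; a real deployment would wrap a trained network."""
    def score(self, utterance: str) -> int:
        return 2 if "cancel" in utterance.lower() else 5

class AlwaysNeutralBackbone:
    """A second placeholder, standing in for a newer model."""
    def score(self, utterance: str) -> int:
        return 5

class SentimentService:
    def __init__(self, backbone: SentimentBackbone) -> None:
        self.backbone = backbone

    def score_window(self, utterance: str) -> int:
        # Clamp so every backbone honors the same 1-9 contract.
        return max(1, min(9, self.backbone.score(utterance)))

service = SentimentService(KeywordBackbone())
print(service.score_window("I want to cancel"))  # 2

# Swapping the model is a one-line change; callers are untouched.
service.backbone = AlwaysNeutralBackbone()
print(service.score_window("I want to cancel"))  # 5
```

Because the rest of the system only ever calls `score_window`, replacing the model behind it does not ripple outward.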
This layered design gives us the reasoning strength of LLMs, the scalability of a mid-size annotator, and the efficiency of a production-ready runtime. Basically, we balance speed and accuracy.
Our 1–9 Sentiment Scale
Behind the scenes, we label conversations on a scale from 1 to 9:
- 5 = Neutral — most conversations hover around a 5.
- 1 = Extreme negativity — think profanity, hostility, escalation, yelling.
- 9 = Really, really happy — think genuine enthusiasm or gratitude.
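If you think in data structures, a scored call is simply a list of timestamped points. Here is a minimal Python sketch, assuming one score per ~800 ms window; the names `SentimentPoint`, `to_series`, and `describe` are made up for this example and are not part of any Balto API.

```python
from dataclasses import dataclass

WINDOW_SECONDS = 0.8  # the ~800 ms cadence described in this article

@dataclass(frozen=True)
class SentimentPoint:
    seconds: float  # offset from the start of the call
    score: int      # 1 (most negative) to 9 (most positive)

    def __post_init__(self) -> None:
        if not 1 <= self.score <= 9:
            raise ValueError(f"score must be between 1 and 9, got {self.score}")

def to_series(scores):
    """Attach a timestamp to each score, one per window."""
    return [SentimentPoint(round(i * WINDOW_SECONDS, 1), s) for i, s in enumerate(scores)]

def describe(score: int) -> str:
    """Name the three anchor points defined above; other values fall in between."""
    anchors = {1: "extreme negativity", 5: "neutral", 9: "really, really happy"}
    return anchors.get(score, "between anchors")

series = to_series([5, 4, 2, 5, 8])
print(series[2])        # SentimentPoint(seconds=1.6, score=2)
print(describe(5))      # neutral
```

Validating the range at construction time means nothing downstream ever has to handle a score outside 1-9.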
Visually, it looks like this:

Now, you can:
- Sentiment Graphs show a curve across each conversation. Instead of scanning random samples, you’ll see emotional trajectories that you can coach and QA against.
- Click-to-Jump takes you straight to the lowest and highest sentiment points. We break these moments out as snippets for your review.
- Filters, Overlays, and Fluctuations highlight specific events within a conversation, based on your use case, such as:
- Conversations with negative starts but with positive turnarounds later.
- Conversations with outsized sentiment swings.
- Conversations that end on a positive note.
- Conversations that end on a negative note.
- Agents with especially positive or negative sentiment.
- Frequency of successful turnarounds.
- Number of 9s per agent per month.
The list goes on and on. You can configure these sentiment views to track and answer all kinds of questions.
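A few of the filters above are easy to picture as small functions over a call's score series. The Python sketch below is illustrative only: the function names, the three-window comparison in `is_turnaround`, and the sample scores are all assumptions, and each function expects a non-empty list of 1-9 scores.

```python
NEUTRAL = 5

def lowest_and_highest(scores):
    """Indexes of the lowest and highest points, for jump-to-moment review."""
    low = min(range(len(scores)), key=scores.__getitem__)
    high = max(range(len(scores)), key=scores.__getitem__)
    return low, high

def is_turnaround(scores, opening=3):
    """Negative start, positive finish: compare the first and last few windows."""
    head, tail = scores[:opening], scores[-opening:]
    return sum(head) / len(head) < NEUTRAL < sum(tail) / len(tail)

def swing(scores):
    """Size of the gap between the best and worst moments."""
    return max(scores) - min(scores)

def ends_positive(scores):
    return scores[-1] > NEUTRAL

call = [2, 1, 3, 4, 5, 6, 8, 8, 9]
print(lowest_and_highest(call))  # (1, 8)
print(is_turnaround(call), swing(call), ends_positive(call))  # True 8 True
```

Agent-level views, such as the frequency of successful turnarounds, then come down to counting how often these checks pass across an agent's calls.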
If you have any specific use cases you would like to implement, please contact your customer success manager.
What’s Next
We’re continuing to evolve the model. Coming soon:
- Updated score calibration for even more precise distributions.
- Better categorical overlays to more accurately flag turnarounds or positive closes.
- Real-time alerts that notify supervisors of sentiment dips mid-call.
- Agent analytics to better track individual agent performance over time.
Thank you for being a Balto customer. As always, please reach out if we can help with anything.
Chris Kontes
Chris Kontes is the Co-Founder of Balto. Over the past nine years, he’s helped grow the company by leading teams across enterprise sales, marketing, recruiting, operations, and partnerships. From Balto’s start as the first agent assist technology to its evolution into a full contact center AI platform, Chris has been part of every stage of the journey—and has seen firsthand how much the company and the industry have changed along the way.
