Balto’s 1–9 Sentiment Model for Smarter Call Insights

Note: Our customers are increasingly interested in the details behind AI models, so this is a more technical article than usual.

From Extremes to Nuance

When we first launched Balto’s sentiment analysis over a year ago, we deliberately kept it simple. Calls were labeled as either positive or negative, surfacing conversations at the extremes: the most satisfied and the most frustrated.

This worked for several reasons: extremes are revealing, they have an outsized impact, and they’re easy for AI to detect. Customers also told us they didn’t want the “noise” of “average” conversations, which make it harder to see the conversations that really matter.

But customer feedback and model improvements made one thing clear: single labels don’t capture the reality of most conversations. True coaching value lies in the nuance, the subtle shifts in tone and emotion that shape everyday interactions.

Why One Label Isn’t Enough

Imagine a customer calling in furious and threatening to cancel their service. The agent resolves the issue, apologizes, and offers a discount. By the end, the customer’s tone has shifted to something far more positive, and the customer even apologizes for being rude at the beginning of the conversation.

Now, labeling this conversation with “positive sentiment” might seem obvious, but is it? Any one label oversimplifies. It could be positive, because the result is positive. It could be negative, because the negative start to the conversation might be the most revealing and helpful to learn from. Or it could be neutral, since the positive and negative roughly balance each other out.

And that’s a simple conversation. Longer, more complex calls make the problem worse. Ultimately, what matters isn’t just the outcome; it’s the entire journey.

That’s why we advanced our sentiment models: to track sentiment as it evolves within a call, rather than assigning one catch-all label.

How Our New Sentiment Model Works

The short version: Balto now measures sentiment every ~800 milliseconds, assigns a calibrated 1–9 score, and generates a time-series graph showing how sentiment rises and falls throughout the conversation.
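To make the difference between one label and a time series concrete, here is a minimal Python sketch. The `SentimentPoint` type, the sample scores, and the fixed 800 ms spacing are illustrative assumptions, not Balto’s actual data format.

```python
from dataclasses import dataclass
from statistics import mean

SAMPLE_INTERVAL_MS = 800  # approximate spacing quoted above; assumed fixed here


@dataclass(frozen=True)
class SentimentPoint:
    offset_ms: int  # time since the start of the call
    score: int      # 1 (very negative) to 9 (very positive); 5 is neutral


def to_series(scores: list[int]) -> list[SentimentPoint]:
    """Attach a timestamp to each score, one every SAMPLE_INTERVAL_MS."""
    return [SentimentPoint(i * SAMPLE_INTERVAL_MS, s) for i, s in enumerate(scores)]


# Hypothetical call: furious opening, resolution, grateful close.
series = to_series([1, 2, 2, 3, 5, 6, 7, 8, 8])

single_label = mean(p.score for p in series)  # one number for the whole call
first, last = series[0].score, series[-1].score

print(f"average={single_label:.1f} start={first} end={last}")
# prints: average=4.7 start=1 end=8
```

The average lands near neutral even though the call opened at 1 and closed at 8, which is exactly the information a single label hides.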

The technical version: Our three-stage pipeline balances accuracy, scalability, and efficiency.

Infographic explaining Balto’s new sentiment model in three steps: seeding a large LLM with high-quality sentiment labels, distilling it into an ~8B model to scale auto-labeling of utterances, and deploying a compact production model that processes ~2,500 requests per second, delivers sentiment scores every ~800 ms, and remains LLM-agnostic.

Step 1. Seed a Large LLM

Carefully designed instructions guide a large model to generate thousands of high-quality sentiment labels.
The result is a foundation of nuanced, “reasoned” examples that general-purpose models don’t deliver out of the box.

Put Simply: We start by teaching a really smart AI to label sentiment on a few thousand conversations, so we have clear, high-quality examples to build from. From these examples, we can actually develop entire synthetic (fake) conversations to train the model even further.
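As a rough illustration of what a seeding step like this could involve, the sketch below builds a labeling prompt and validates a reply. The instruction text, the JSON reply shape, and the function names are all hypothetical; the actual prompt and model are not described in this article, and no model is called here.

```python
import json

# Hypothetical instruction text; the real instructions are not published.
LABEL_INSTRUCTIONS = (
    "Rate the customer's sentiment in the utterance on a 1-9 scale "
    "(1 = extreme negativity, 5 = neutral, 9 = strong positivity). "
    'Reply with JSON: {"score": <int>, "reason": "<one sentence>"}'
)


def build_prompt(utterance: str) -> str:
    """Combine the fixed instructions with one utterance to label."""
    return f"{LABEL_INSTRUCTIONS}\n\nUtterance: {utterance}"


def parse_label(raw_reply: str) -> dict:
    """Validate a model reply; reject anything outside the 1-9 scale."""
    reply = json.loads(raw_reply)
    score = reply["score"]
    if not isinstance(score, int) or not 1 <= score <= 9:
        raise ValueError(f"score out of range: {score!r}")
    return {"score": score, "reason": str(reply.get("reason", ""))}


# A reply of the requested shape, written by hand for this example.
example = parse_label('{"score": 2, "reason": "Customer threatens to cancel."}')
print(example["score"], example["reason"])
```

Keeping the short reason next to each score is one way to capture the “reasoned” quality of the seed examples, and strict validation keeps malformed labels out of the seed set.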

Step 2. Distill into an ~8B model

That seed is distilled into an ~8B-parameter model.
It auto-labels tens of millions of utterances, scaling sentiment judgment to the size of real-world call data.

Put Simply: We take those first examples and train a smaller, faster AI that can label tens of millions of call snippets on its own, so the system can handle the huge scale of real-world conversations. Because we’ve always prioritized speed, we maintain both models designed to surface results quickly and models designed to prioritize accuracy.
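The general teacher-student pattern behind a step like this can be shown with a toy example. Everything below is a stand-in: the keyword-based “teacher” takes the place of the large seeded model, and the per-word “student” takes the place of a neural network. It only illustrates the idea of training a smaller model on labels a larger one produced, not how Balto’s ~8B model is built.

```python
from collections import defaultdict
from statistics import mean

NEUTRAL = 5.0
POSITIVE_WORDS = {"thanks", "great", "perfect"}   # toy vocabulary
NEGATIVE_WORDS = {"cancel", "terrible", "angry"}  # toy vocabulary


def teacher_label(utterance: str) -> int:
    """Stand-in for the large seeded model: a crude keyword scorer."""
    words = set(utterance.lower().split())
    score = 5 + 2 * len(words & POSITIVE_WORDS) - 2 * len(words & NEGATIVE_WORDS)
    return max(1, min(9, score))


def train_student(utterances: list[str]) -> dict[str, float]:
    """Fit a tiny per-word model on labels produced by the teacher."""
    seen = defaultdict(list)
    for text in utterances:
        label = teacher_label(text)  # the auto-labeling step
        for word in set(text.lower().split()):
            seen[word].append(label)
    return {word: mean(labels) for word, labels in seen.items()}


def student_predict(model: dict[str, float], utterance: str) -> float:
    """Average the learned per-word scores; unknown words count as neutral."""
    words = utterance.lower().split()
    if not words:
        return NEUTRAL
    return mean(model.get(w, NEUTRAL) for w in words)


unlabeled = ["thanks that is great", "i want to cancel", "this is terrible",
             "okay that is fine", "perfect thanks"]
student = train_student(unlabeled)
print(student_predict(student, "great thanks"), student_predict(student, "cancel"))
```

The student never sees a human label; it learns only from what the teacher said, which is what lets the labeling scale to very large volumes of utterances.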

Step 3. Deploy a compact production model

A smaller runtime model is trained on the distilled data.
It sustains ~2,500 requests per second, delivering new sentiment scores every ~800 ms.
It’s LLM-agnostic, so we can swap in newer backbones as they improve without starting over.

Put Simply: We then build an even smaller, super-fast AI that can handle thousands of requests per second and update sentiment nearly in real time. It’s very flexible too, so we can plug in newer AI tech as it comes out without rebuilding everything from scratch. This last point is very important: LLMs are moving quickly, and you don’t want to be stuck with a vendor that is locked into one particular LLM architecture.
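A quick back-of-the-envelope calculation shows what these figures could mean in practice. It assumes each live call issues exactly one scoring request per ~800 ms interval, which the article does not state, so treat the results as rough estimates.

```python
REQUESTS_PER_SECOND = 2_500  # approximate throughput quoted above
SCORE_INTERVAL_MS = 800      # approximate spacing between scores

# Assumption: one request yields one score for one live call.
# Each call then sends 1000 / 800 = 1.25 requests per second.
concurrent_calls = REQUESTS_PER_SECOND * SCORE_INTERVAL_MS / 1000

# Number of scores produced over a hypothetical 10-minute call.
scores_per_10_min_call = 10 * 60 * 1000 // SCORE_INTERVAL_MS

print(round(concurrent_calls), scores_per_10_min_call)
# prints: 2000 750
```

Under that assumption, the quoted throughput corresponds to roughly 2,000 calls scored at the same time, and a 10-minute call yields about 750 points on its sentiment graph.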

This layered design gives us the reasoning strength of LLMs, the scalability of a mid-size annotator, and the efficiency of a production-ready runtime. Basically, we balance speed and accuracy.

Our 1–9 Sentiment Scale

Behind the scenes, we score sentiment on a scale from 1 to 9:

5 = Neutral: most conversations hover around a 5.
1 = Extreme negativity: think profanity, hostility, escalation, yelling.
9 = Strong positivity: think genuine enthusiasm or gratitude.

Visually, it looks like this:

Image of Balto’s 1–9 sentiment scale. The middle value, 5, represents neutral conversations. The lower end, 1, indicates extreme negativity such as hostility or yelling. The upper end, 9, indicates strong positivity such as genuine enthusiasm or gratitude.
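In code, a scale like this might be mapped to coarse bands as in the sketch below. Only the meanings of 1, 5, and 9 come from this article; the band names and the cut-offs for the scores in between are assumptions made for illustration.

```python
def describe(score: int) -> str:
    """Map a 1-9 score to a coarse band.

    Only 1, 5, and 9 are defined in the article; the cut-offs for the
    bands in between are assumptions.
    """
    if not 1 <= score <= 9:
        raise ValueError("score must be between 1 and 9")
    if score <= 2:
        return "strongly negative"
    if score <= 4:
        return "negative"
    if score == 5:
        return "neutral"
    if score <= 7:
        return "positive"
    return "strongly positive"


for s in (1, 5, 9):
    print(s, describe(s))
```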

Now, you can:

Sentiment Graphs show the sentiment curve across each conversation. Instead of scanning random samples, you’ll see emotional trajectories that you can coach and QA against.
Click-to-Jump takes you straight to the lowest and highest sentiment points in a conversation. We break these moments out as snippets for your review.
Filters, Overlays, and Fluctuations highlight specific events within a conversation, based on your use case, such as:

Conversations with negative starts but with positive turnarounds later.
Conversations with outsized sentiment swings.
Conversations that end on a positive note.
Conversations that end on a negative note.
Agents with especially positive or negative sentiment.
Frequency of successful turnarounds.
Number of 9s per agent per month.
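Several of the call-level filters above reduce to simple checks on a list of scores. The sketch below is one possible way to express them; the window size, the thresholds around neutral, the fixed 800 ms spacing, and all function names are assumptions rather than Balto’s definitions. It expects a non-empty list of scores.

```python
from statistics import mean

NEUTRAL = 5
INTERVAL_MS = 800  # assumed fixed spacing between scores


def opening(scores: list[int], window: int = 3) -> float:
    """Average of the first few scores in a call."""
    return mean(scores[:window])


def closing(scores: list[int], window: int = 3) -> float:
    """Average of the last few scores in a call."""
    return mean(scores[-window:])


def ends_positive(scores: list[int]) -> bool:
    """True when the call closes above neutral."""
    return closing(scores) > NEUTRAL


def is_turnaround(scores: list[int]) -> bool:
    """Negative start followed by a positive finish."""
    return opening(scores) < NEUTRAL < closing(scores)


def swing(scores: list[int]) -> int:
    """Distance between the highest and lowest score in the call."""
    return max(scores) - min(scores)


def jump_points_ms(scores: list[int]) -> tuple[int, int]:
    """Offsets of the first lowest and first highest score, for click-to-jump."""
    low = scores.index(min(scores))
    high = scores.index(max(scores))
    return low * INTERVAL_MS, high * INTERVAL_MS


call = [1, 2, 2, 3, 5, 6, 7, 8, 8]  # hypothetical scores for one call
print(is_turnaround(call), swing(call), jump_points_ms(call))
# prints: True 7 (0, 5600)
```

Averaging over a short window at each end, instead of reading only the first and last score, makes the start and end checks less sensitive to a single noisy measurement.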

The list goes on and on. You can configure these sentiment views to track and answer all kinds of questions.
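Agent-level questions, such as the number of 9s per agent per month or how often negative starts are turned around, are simple aggregations over per-call scores. The records, agent names, and the first-score and last-score definition of a turnaround below are hypothetical and exist only to show the shape of the calculation.

```python
from collections import Counter, defaultdict

# Hypothetical records: (agent, month as "YYYY-MM", scores for one call).
calls = [
    ("ana", "2025-01", [5, 6, 9, 9]),
    ("ana", "2025-01", [2, 3, 6, 7]),
    ("ben", "2025-01", [5, 5, 4, 3]),
    ("ana", "2025-02", [5, 9, 5, 5]),
]

nines_per_agent_month = Counter()
turnarounds = defaultdict(lambda: [0, 0])  # agent -> [turnarounds, negative starts]

for agent, month, scores in calls:
    nines_per_agent_month[(agent, month)] += scores.count(9)
    if scores[0] < 5:           # the call opened negative
        turnarounds[agent][1] += 1
        if scores[-1] > 5:      # and it closed positive
            turnarounds[agent][0] += 1

print(dict(nines_per_agent_month))
print(dict(turnarounds))
```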

If you have any specific use cases you would like to implement, please contact your customer success manager.

What’s Next

We’re continuing to evolve the model. Coming soon:

Updated score calibration for even more precise distributions.
Better categorical overlays to more accurately flag turnarounds or positive closes.
Real-time alerts that notify supervisors of sentiment dips mid-call.
Agent analytics to better track individual agent performance over time.

Thank you for being a Balto customer. As always, please reach out if we can help with anything.
