What’s the difference between calibration and Inter-Rater Reliability? Part 2 of a 3-Part Series on IRR


In my 14 years in the call center industry, I have visited call centers in nearly every industry imaginable. Along the way, I've come across many different examples of calibration, each intended to reduce the risk that customer service poses to the organization:

  • A group of Quality Assurance (QA) folks sitting in a room listening to calls and then discussing them,
  • A group of agents sharing their opinions with QAs on how they think their calls should be graded,
  • QAs and agents debating the attributes that separate an excellent call from a good, mediocre, or bad call,
  • A lead QA, manager, trainer, consultant, or client instructing members of the QA team on how to evaluate calls,
  • A lead QA, manager, or trainer playing examples of (pre-selected) good and bad calls.

While these may be common call center practices, they are far from best practices. To drive long-term improvement in the consistency and accuracy of your QA team, the outcome of any calibration process must be quantifiable, repeatable, and actionable.

Inter-Rater Reliability (IRR) versus Calibration



Inter-rater reliability studies are more than structured or unstructured conversations; they demand a rigorous approach to quantitative measurement. An IRR study requires that an adequate number of calls be monitored, given the size of the Quality Assurance team, the variability in scoring, the complexity of the calls, the complexity of the monitoring form, and so on. It also requires that call scoring be completed individually (in seclusion, if possible). While discussion is key to reducing scoring variability within any Quality Assurance team, scoring and the discussion of scoring variations must be separate activities conducted at different points in time.

Inter-Rater Reliability testing aims to answer two key questions:

  1. “How consistent are we in scoring calls?” and,
  2. “Are we evaluating calls in the right way?”

In other words, the goal of IRR is certainly to ensure that each member of the Quality Assurance staff grades calls consistently with his or her peers. However, a high degree of consistency among the members of the Quality Assurance staff does not necessarily ensure that calls are being scored correctly in view of organizational goals and objectives. A further step is needed to ensure that call scoring is conducted with regard to brand image, organizational goals, corporate objectives, and so on. This step requires that a member of the management team take part in each IRR study, acting as the standard of proper scoring for each call.
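The distinction between the two questions can be sketched in a few lines of code. This is a minimal illustration with made-up pass/fail scores (the rater names, calls, and numbers are all hypothetical): three QA raters score the same five calls, and a management rater supplies the reference scoring. Agreement among the QAs is measured separately from agreement with the standard.

```python
# Hypothetical data: pass/fail scores (1 = pass) from three QA raters
# and a management "gold standard" on the same five calls. Illustrates
# why high agreement among raters does not guarantee correct scoring.

def percent_agreement(a, b):
    """Share of calls on which two raters gave the same score."""
    return sum(x == y for x, y in zip(a, b)) / len(a)

qa1 = [1, 1, 0, 1, 0]
qa2 = [1, 1, 0, 1, 0]
qa3 = [1, 1, 0, 1, 1]
gold = [1, 0, 0, 1, 0]   # management's reference scoring

# Question 1: "How consistent are we in scoring calls?"
print(percent_agreement(qa1, qa2))   # 1.0
print(percent_agreement(qa1, qa3))   # 0.8

# Question 2: "Are we evaluating calls in the right way?"
print(percent_agreement(qa1, gold))  # 0.8
```

Here QA1 and QA2 agree perfectly with each other yet both miss the standard on call 2, which is exactly the gap the management participant is there to expose.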

Efforts to attain high degrees of inter-rater reliability are necessary to ensure fairness to the agents whose calls are being evaluated. Your agents deserve to know, with a high level of confidence, that their monitored calls will be scored consistently no matter which member of the Quality Assurance team scores them, and that those scores are correct. Without valid and reliable methods of evaluating rep performance, you risk making bad decisions based on faulty data; you risk lowering the morale of your agents through the very efforts meant to improve it; and you open yourself to possible lawsuits for wrongful termination or discriminatory promotion and reward practices. You, too, need to know that your quality monitoring scores give reliable insight into the performance of your call center and into the performance of your agents on any individual call.

Sample Reports from Inter-Rater Reliability Study




Based on the figures above, it is clear that the members of the QA team are relatively equal in scoring accuracy (defect rate), but that QA#1 struggles to accurately score in an area that is critical not only to the internal perception of agent performance but to the customer experience as well (auto-fail questions). QA#1 also tends to be the most consistent in his or her scoring with the remaining members of the team (correlation). From a call perspective, it is clear that calls 6 and 10 included scenarios or situations that were difficult for the QA team to accurately assess. Improving on the current standing may mean redefining what qualifies as an excellent, good, or poor call; adding exemptions or special circumstances to the scoring guidelines; or simply adhering better to the scoring guidelines that already exist.
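The two views in the report above, per-rater correlation with peers and per-call difficulty, can be computed directly from a rater-by-call score matrix. The sketch below uses invented percentage scores for three raters on six calls (all names and numbers are hypothetical, not from the study): it averages each rater's Pearson correlation with the rest of the team, then flags calls whose scores vary widely across raters as candidates for discussion.

```python
# Hypothetical per-call percentage scores from three QA raters.
from statistics import mean, pstdev

def pearson(x, y):
    """Pearson correlation between two raters' score lists."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)

scores = {
    "QA1": [90, 85, 70, 95, 60, 88],
    "QA2": [92, 80, 72, 93, 85, 86],
    "QA3": [88, 84, 68, 96, 58, 90],
}

# Each rater's average correlation with the rest of the team.
for name, vals in scores.items():
    peers = [pearson(vals, v) for n, v in scores.items() if n != name]
    print(name, round(mean(peers), 2))

# Calls whose scores spread widely across raters deserve discussion.
calls = list(zip(*scores.values()))
for i, call in enumerate(calls, start=1):
    if pstdev(call) > 5:
        print(f"Call {i}: spread of {pstdev(call):.1f} points")
```

With this toy data, call 5 stands out (one rater scored it 25 points above the others), which is the same kind of finding the study attributes to calls 6 and 10.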

Does your calibration process deliver results that are this quantifiable and specific?

A few tips from our Inter-Rater Reliability Standard Operating Procedure:

  1. Include in your IRR studies any individual who may monitor and provide feedback to agents on calls, regardless of their title or department.
  2. Each IRR should include scoring by an individual outside the QA team who has responsibility for call quality as well as visibility into how the call center fits with larger organizational objectives.
  3. Make sure each IRR includes a sufficient sample size – 10 calls at minimum!

About Jim Rembach

Jim Rembach is a panel expert with the Customer Experience Professionals Association (CXPA) and an SVP for Customer Relationship Metrics (CRM). Jim spent many years in contact center operations and leverages this experience to help others. He is a certified Emotional Intelligence (EQ) practitioner and a frequently quoted industry expert. Call Jim at 336-288-8226 if you need help with customer-centric enhancements.

November 17th, 2010 | Inter-Rater Reliability (IRR)