Inter-Rater Reliability (IRR)
Inter-Rater Reliability (IRR) is a more effective way to calibrate scorers in a contact center quality monitoring program. Traditional call calibration is a poor method of evaluating quality monitoring analysts.
Show me the money! Part 3 of a 3-Part Series on IRR
Based on the previous two posts on this topic, you’ve probably come to the conclusion that Inter-Rater Reliability (IRR) is a far more rigorous process than the one you and your team currently use, and that it may require more time than you’ve been allocating to calibration. In many cases, these conclusions are valid.
However, when executed correctly and on an ongoing basis, IRR is not a cost but a process that delivers value (savings in both direct and indirect costs) back to the organization by enhancing the accuracy and efficiency of the feedback process to enable long-term performance improvement.
The following scenario plays out weekly, if not daily, in nearly every call center. Agent Jack receives feedback about his calls that were monitored the previous week. He is praised for his positive demeanor, “can do attitude” and creative problem-solving skills, which allowed him to solve customer problems without placing customers on hold or transferring them to other departments. Jack is pleased but confused by the misalignment with the previous week’s assessment, which was hardly glowing. The prior feedback highlighted that his call opening was incorrect, that his calls were longer than those of his peers and that he failed to capture email addresses on 3 of the 5 calls monitored. To him, the difference in feedback did not reflect the performance he remembers. So, Jack decides to raise this issue with his supervisor as well as the company’s Human Resources department, launching a rather lengthy and time-consuming process of research, gap analysis and documentation.
The evaluation disconnect can be dramatically reduced with a process that aligns what is most important to customers and organizational profitability with what your Quality Assurance (QA) team listens for, so that everyone on the QA team reflects the same organizational priorities in their monitoring feedback. How much direct and indirect cost would be saved if your QA team were so consistent in their feedback that hours of time were saved each week because agents no longer perceived inconsistency in the performance feedback they were receiving? Imagine if, instead of researching and documenting gaps in performance assessments, your QA and Coaching teams could actually focus on improving performance. All of these benefits are real outcomes of Inter-Rater Reliability testing.
The example below is based on a call center that supports a customer-base of 350,000. The average revenue per sale is $85 and the average lifetime value of each of these customers is $4,000. Given the financial impact of every customer, a 5% improvement in three key areas – customer defection, recommendation rate and repurchase rate – generated an additional $4.8 million in revenue.
- A 5% improvement (decrease) in the number of customers who defect allows the company to retain an additional 1,102 of their own customers, generating over $4 million based on the lifetime value of a customer.
- A 5% improvement (increase) in recommendations of the company’s products would yield an additional $53,253 in sales assuming only 5% of the customers who indicated they would recommend the company actually do so.
In this example, we are using a customer-base of 350,000 customers. Of these, 71.6% fall into the ‘delight category’ when it comes to ‘likelihood to recommend’, which is 250,600 customers. If only 5% of these people who said they would recommend actually do so and the recommendation converts to a sale, there will be 12,530 new sales. To convert this into revenue figures, multiply 12,530 recommendations by the estimated revenue per sale ($85) to yield $1,065,050 generated through recommendations.
In order to determine the revenue increase generated by a 5% improvement in recommendation rate, find the number of additional customers that would now fall into the ‘delight category’ (250,600 x 1.05 = 263,130) and once again assume that only 5% will actually make a recommendation that generates a sale. Now there are 13,157 customers (13,156.5 before rounding) instead of the baseline of 12,530. Multiply the unrounded 13,156.5 customers by the estimated revenue per sale ($85) to arrive at $1,118,303. Subtracting the baseline figure of $1,065,050, the difference is $53,253 additional revenue resulting from a 5% improvement in likelihood to recommend.
The same logic applies to the next calculation.
- A 5% improvement (increase) in future (re)purchases of the company’s products would yield an additional $369,123 in sales assuming only 35% of the customers who indicated they would make another purchase from the company actually do so.
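The arithmetic behind these figures can be sketched in a few lines of Python. This is only an illustration of the calculation described above: the constant and function names are made up for the example, the 5% improvement is treated as a relative increase (which is what reproduces the figures in the text), and the retention and repurchase gains are taken directly from the bullets because their underlying rates are not stated here.

```python
# Figures taken from the example in the text; all names are illustrative.
CUSTOMER_BASE = 350_000
DELIGHT_SHARE = 0.716      # share in the 'delight category' for likelihood to recommend
FOLLOW_THROUGH = 0.05      # share of those customers whose recommendation becomes a sale
REVENUE_PER_SALE = 85      # dollars
LIFETIME_VALUE = 4_000     # dollars per retained customer
IMPROVEMENT = 0.05         # assumed to be a 5% relative improvement


def recommendation_revenue(delighted_customers: float) -> float:
    """Revenue from recommendations that convert to a sale."""
    return delighted_customers * FOLLOW_THROUGH * REVENUE_PER_SALE


baseline_delighted = CUSTOMER_BASE * DELIGHT_SHARE           # about 250,600
improved_delighted = baseline_delighted * (1 + IMPROVEMENT)  # about 263,130

baseline_revenue = recommendation_revenue(baseline_delighted)
improved_revenue = recommendation_revenue(improved_delighted)
recommendation_gain = improved_revenue - baseline_revenue

# Quoted from the text rather than re-derived (underlying rates are not given).
retention_gain = 1_102 * LIFETIME_VALUE
repurchase_gain = 369_123

total_gain = retention_gain + recommendation_gain + repurchase_gain

print(f"Recommendation gain: ${recommendation_gain:,.2f}")
print(f"Total gain:          ${total_gain:,.2f}")
```

The three gains add up to roughly $4.83 million, in line with the $4.8 million quoted for the example.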
A 5% improvement across all call center agents is attainable within a single year with the proper investment of staff and time. This company supports its 350,000 customers with approximately 300 full-time call center agents. These agents are managed by five full-time QA analysts and two performance coaches (at an annual cost of $250,000) as well as a team of call center supervisors. Whether achieving a 5% improvement in call center performance requires one or two years (the timeframe required by a majority of Customer Relationship Metrics’ business partners), the ROI exceeds 800%. However, it is important to note that once the improvement is achieved, ongoing investment in staff (QAs and coaches) will be needed to maintain performance and/or continue driving performance improvements.
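One way to sanity-check the ROI claim is a simple net-gain-over-cost calculation. The sketch below assumes that the $250,000 annual staff cost is the only cost counted and that the $4.8 million revenue gain is realized once, whether the improvement takes one year or two; the text does not spell out its ROI formula, so the function here is an assumption.

```python
def roi_percent(gain: float, cost: float) -> float:
    """Simple ROI: net gain divided by cost, expressed as a percentage."""
    return (gain - cost) / cost * 100


REVENUE_GAIN = 4_800_000     # additional revenue quoted in the text
ANNUAL_STAFF_COST = 250_000  # annual QA and coaching cost quoted in the text

for years in (1, 2):
    cost = ANNUAL_STAFF_COST * years
    print(f"{years} year(s): ROI = {roi_percent(REVENUE_GAIN, cost):.0f}%")
```

Under these assumptions the two-year case comes out at about 860% and the one-year case is higher still, which is consistent with an ROI that exceeds 800% in either timeframe.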
What’s the difference between calibration and Inter-Rater Reliability? Part 2 of a 3-Part Series on IRR
In my 14 years in the call center industry, I have had many occasions to visit call centers in nearly every industry imaginable. I’ve come across different examples of calibration, each intended to reduce risk to the organization from customer service:
- A group of Quality Assurance (QA) folks sitting in a room listening to calls and then discussing them,
- A group of agents sharing their opinions with QAs on how they think their calls should be graded,
- QAs and agents debating the attributes that separate a good call from an excellent call, from a mediocre or bad call,
- A lead QA, manager, trainer, consultant or client instructing members of the QA team on how to evaluate calls,
- A lead QA, manager or trainer playing examples of (pre-selected) good and bad calls.
While these may be common call center practices, they are far from best practices. In order to drive long-term improvement in the consistency and accuracy of your QA team, the outcome of any calibration process must be quantifiable, repeatable and actionable.
Inter-Rater Reliability (IRR) versus Calibration
Inter-rater reliability studies are more than structured or unstructured conversations. IRR studies demand a rigorous approach to quantitative measurement. IRR studies require that an adequate number of calls be monitored, given the size of the Quality Assurance team, variability in scoring, the complexity of calls, complexity of the monitoring form, etc. Inter-rater reliability testing also requires that call scoring be completed individually (in seclusion if possible). While discussion is key in reducing scoring variability within any Quality Assurance team, scoring and discussion of scoring variations must become separate activities which are conducted at different points in time.
Inter-Rater Reliability testing aims to answer two key questions:
1. “How consistent are we in scoring calls?” and,
2. “Are we evaluating calls in the right way?”
In other words, one goal of IRR is certainly to ensure that each member of the Quality Assurance staff is grading calls consistently with his / her peers. However, a high degree of consistency between the members of the Quality Assurance staff does not necessarily ensure that calls are being scored correctly in view of organizational goals and objectives. A further step is needed to ensure that call scoring is conducted with reference to brand image, organizational goals, corporate objectives, etc. This step requires that a member of the management team take part in each IRR study, acting as the standard of proper scoring for each call.
Efforts to attain high degrees of inter-rater reliability are necessary to ensure fairness to your agents whose calls are being evaluated. Your agents deserve to know, with a high level of confidence, that their monitored calls will be scored consistently, no matter which member of the Quality Assurance team scores them. And they need to know that their calls are scored accurately. Without valid and reliable methods of evaluating agent performance, you risk making bad decisions because you are basing them on faulty data; you risk lowering the morale of your agents through your very efforts to improve it; you open yourself to possible lawsuits for wrongful termination or discriminatory promotion and reward practices. You, too, need to know that your quality monitoring scores give reliable insight about the performance of your call center and about the performance of your agents on any individual call.
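The two questions can be made concrete with a small Python sketch. Everything in it is hypothetical: the scorer names, the pass/fail scores and the choice of simple percent agreement as the measure are assumptions for illustration, not the method used in any particular IRR study. Consistency is measured between pairs of QAs, while accuracy is measured against the management scorer acting as the standard.

```python
from itertools import combinations

# Hypothetical pass/fail scores (1 = pass, 0 = fail) for five calls.
# "standard" is the management scorer acting as the reference.
standard = [1, 0, 1, 1, 0]
qa_scores = {
    "QA1": [1, 0, 1, 0, 0],
    "QA2": [1, 1, 1, 1, 0],
    "QA3": [1, 0, 1, 1, 0],
}


def agreement_rate(a: list[int], b: list[int]) -> float:
    """Share of calls on which two scorers gave the same score."""
    matches = sum(x == y for x, y in zip(a, b))
    return matches / len(a)


# Question 1: how consistent are we? Pairwise agreement between QAs.
consistency = {
    (r1, r2): agreement_rate(qa_scores[r1], qa_scores[r2])
    for r1, r2 in combinations(qa_scores, 2)
}

# Question 2: are we scoring the right way? Defect rate against the standard.
defect_rate = {
    rater: 1 - agreement_rate(scores, standard)
    for rater, scores in qa_scores.items()
}

for pair, rate in consistency.items():
    print(f"{pair[0]} vs {pair[1]}: {rate:.0%} agreement")
for rater, rate in defect_rate.items():
    print(f"{rater}: {rate:.0%} defect rate against the standard")
```

In this made-up data, QA1 and QA2 have the same defect rate against the standard yet agree with each other on only 60% of calls, which is why consistency and accuracy need to be measured separately.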
Sample Reports from Inter-Rater Reliability Study
Based on the figures above, it is very clear that the members of the QA team are relatively equal in scoring accuracy (defect rate) but that QA#1 struggles to accurately score in an area that is critical not only to the internal perception of agent performance but to the customer experience as well (auto-fail questions). QA#1 also tends to be the most consistent in his / her scoring with the remaining members of the team (correlation). From a call perspective, it is clear that calls 6 and 10 included scenarios or situations that were difficult for the QA team to accurately assess. Improving upon the current standing may mean redefining what qualifies as excellent, good or poor, adding exemptions or special circumstances to the scoring guidelines or simply adhering more closely to the scoring guidelines that already exist.
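As a rough illustration of how figures like these might be produced, the Python sketch below computes each scorer's correlation with the rest of the team and flags the call with the widest spread of scores. The scores are invented, and both the Pearson correlation against the team average and the max-minus-min spread are assumed choices for the example rather than the measures behind the reports discussed here.

```python
from math import sqrt
from statistics import mean

# Hypothetical call scores (0-100); each list holds one scorer's scores for calls 1-4.
scores = {
    "QA1": [90, 70, 85, 60],
    "QA2": [88, 75, 80, 40],
    "QA3": [92, 72, 83, 75],
}


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length score lists."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


# Correlation of each scorer with the average of the remaining scorers.
correlation_with_team = {}
for rater, own in scores.items():
    others = [s for r, s in scores.items() if r != rater]
    team_avg = [mean(call) for call in zip(*others)]
    correlation_with_team[rater] = pearson(own, team_avg)

# Spread of scores per call; a large spread flags a call that was hard to assess.
spread_per_call = [max(call) - min(call) for call in zip(*scores.values())]
hardest_call = spread_per_call.index(max(spread_per_call)) + 1  # 1-based call number

for rater, r in correlation_with_team.items():
    print(f"{rater}: correlation with rest of team = {r:.2f}")
print(f"Call with the widest scoring spread: call {hardest_call}")
```

A call that stands out on spread is a natural candidate for the follow-up discussion of scoring guidelines.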
Does your calibration process deliver results that are this quantifiable and specific?
A few tips from our Inter-Rater Reliability Standard Operating Procedure:
1. Include in your IRR studies any individual who may monitor and provide feedback to agents on calls, regardless of their title or department.
2. Each IRR should include scoring by an individual outside of the QA team who has responsibility for call quality as well as visibility into how the call center fits with larger organizational objectives.
3. Make sure each IRR includes a sufficient sample size – 10 calls at minimum!
Is your survey calibration process destroying agent morale? Part 1 of a 3-Part Series on IRR
As the economy has contracted over the past two years, many organizations have focused on minimizing costs by reducing (if not eliminating) ongoing training, quality initiatives, hiring, promotion, etc. The result has been a decline in employee engagement, which has had a direct and measurable impact on the way employees treat your most valuable asset – your customers.
According to the Gallup Q12 employee-engagement survey, the following questions represent the largest drivers of employee engagement (correlating to employee productivity, customer loyalty and bottom-line growth):
1. Do I know what is expected of me at work?
2. Do I have the materials and equipment I need to do my work right?
3. At work, do I have the opportunity to do what I do best every day?
4. In the last seven days, have I received recognition or praise for doing good work?
5. Does my supervisor, or someone at work, seem to care about me as a person?
6. Is there someone at work who encourages my development?
7. At work, do my opinions seem to count?
8. Does the mission/purpose of my company make me feel my job is important?
9. Are my co-workers committed to doing quality work?
10. Do I have a best friend at work?
11. In the last six months, has someone at work talked to me about my progress?
12. This last year, have I had opportunities at work to learn and grow?
Your quality assurance team, and more specifically your calibration process, has a direct impact on questions 1, 3, 4, 6, 11 and 12 above. How do you think Agent Joe feels when QA Jerry tells him he did great on his last call and QA Ben gives him a mediocre call monitoring score the very next week on a nearly identical call? This scenario of inconsistency is exactly what kills employee engagement, along with the credibility of your Quality Assurance team.
I am not suggesting that you eliminate calibration, because consistency and accuracy are vital components of any high-functioning team. But in order to (quantifiably) attain high levels of accuracy and consistency within your Quality Assurance team, you must leap beyond the current calibration process to a more rigorous and measurable process. The risk from inadequate calibration is substantial to the organization and to you as a leader. Reduce risk by engaging in what we at Customer Relationship Metrics (CRM) call Inter-Rater Reliability (IRR).
In the next post of this series I will explain the difference between calibration and IRR. But in the meantime, I’ll leave you with some food for thought:
- A highly tenured QA team conducting their very first Inter-Rater Reliability test yielded a 32.32% failure rate in accuracy of call scoring. Such a failure rate is hardly unusual.
- This same team reached agreement (consistency) in call scoring only 56.5% of the time (and this is WITH well-defined scoring instructions).
- The area in which the least consistency in scoring existed was in skills collectively called customer service or soft skills. These skills are key in driving a positive customer experience and brand loyalty.
- This highly tenured team is made up of only four Quality Assurance agents. Our research has shown that the larger the QA team, the higher the initial failures in accuracy and consistency.





