Challenges in international assessment

By: Louise Badham

Article originally published in International School Magazine – Spring 2023 (School Management Plus).

The International Baccalaureate (IB) offers educational programmes to students in more than 5,000 schools in over 150 different countries around the world.

The rich diversity of the schools and students experiencing an IB education—through the Primary Years, Middle Years, Diploma and Career-related programmes—is one of the greatest strengths and joys of the organisation. In the Diploma Programme, the largest of the IB programmes, there were 188 different first languages and 212 first nationalities (IBO, 2022) represented amongst the student cohort who took their final exams in May 2022. This extraordinary representation of languages and cultures from around the globe is something of which the IB is immensely proud—and is a tangible example of the IB’s mission of international-mindedness in practice.

Yet, when it comes to offering formal assessments, this wonderful diversity also presents the organisation with a unique set of challenges. It raises important questions, such as: How do we ensure that students taking exams in different languages face the same level of challenge? How do we write exam questions that are culturally representative and inclusive for students all around the world? How do we translate markschemes consistently if key terms like ‘adequate’ and ‘good’ mean different things in different languages? How do we guarantee that final grades reflect equivalent levels of attainment, regardless of the language in which the exams were taken? With formal, summative assessments playing such a central role in students’ lives, and determining the next step in their academic or professional careers, these difficult questions need to be asked. The answers are often complicated and messy, but then, the answers to the most interesting and important questions usually are.

And so, the IB’s assessment staff are continuously wrestling with these thorny questions and exploring how the IB can make assessment practices as fair, valid and reliable, and as linguistically and culturally inclusive, as possible. Recent studies by the assessment research team, for example, have looked into whether exam questions in the Diploma Programme’s Biology, Physics and Chemistry courses are ‘lost in translation’ (McGrane et al., 2021), that is, whether the level of demand changes when questions are translated into other languages. In another study, student work is being translated between six different languages to investigate whether word limits in coursework affect students’ performance in languages that are more ‘word hungry’ than others.

We’ve also investigated whether traditional approaches to assessment—where examiners review a student’s piece of work and assign it a mark—are always the most suitable when we need to compare student work produced in different languages. Instead of English-speaking examiners marking work in English and Spanish-speaking examiners marking work in Spanish, before the final marks are compared … would it be possible for bilingual examiners to assess work from both languages at the same time? The short answer is, yes, they can! To an extent. And … it’s complicated and messy.

We asked experienced examiners from the Diploma Programme’s Language A: Literature course to use a method known as ‘comparative judgement’ to review pairs of literary essays from a previous exam session (Badham & Furlong, 2022). Instead of assigning marks, each examiner reviewed around 100 pairs of essays and, in each case, simply made a decision about which essay was ‘better’. The results from multiple decisions were used to rank order the responses and calculate final results. It is notoriously difficult for examiners to come to a common agreement on subjective essay-based responses such as those needed to test skills of literary analysis. So comparative judgement can be a really useful way to get around this, not least because every response is seen multiple times by different examiners, which helps to make the final decision sufficiently reliable.
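To make the idea of turning many ‘which is better?’ decisions into a rank order more concrete, here is a minimal sketch in Python. It assumes a Bradley-Terry model, which is commonly used for comparative judgement; the article does not say which statistical model the study used, so this is an illustration and not a description of the IB’s method. The essay labels and the judgements themselves are invented for the example.

```python
from collections import defaultdict


def bradley_terry(judgements, iterations=500, tol=1e-10):
    """Estimate a relative strength per essay from (winner, loser) pairs.

    Uses the standard iterative update for the Bradley-Terry model.
    Assumes every essay wins and loses at least once, and that no essay
    is ever compared with itself.
    """
    essays = sorted({essay for pair in judgements for essay in pair})
    if not essays:
        return {}

    wins = defaultdict(int)         # times each essay was judged 'better'
    comparisons = defaultdict(int)  # times each unordered pair was judged
    for winner, loser in judgements:
        wins[winner] += 1
        comparisons[frozenset((winner, loser))] += 1

    strength = {essay: 1.0 for essay in essays}
    for _ in range(iterations):
        updated = {}
        for essay in essays:
            denominator = 0.0
            for pair, count in comparisons.items():
                if essay in pair:
                    (other,) = pair - {essay}
                    denominator += count / (strength[essay] + strength[other])
            updated[essay] = wins[essay] / denominator
        # Rescale so strengths average 1; only their ratios are meaningful.
        scale = len(essays) / sum(updated.values())
        updated = {essay: value * scale for essay, value in updated.items()}
        change = max(abs(updated[e] - strength[e]) for e in essays)
        strength = updated
        if change < tol:
            break
    return strength


# Hypothetical judgements: each tuple is (essay judged better, other essay).
# 'EN' and 'ES' labels are invented stand-ins for English and Spanish essays.
judgements = [
    ("ES-1", "EN-1"), ("ES-1", "EN-1"), ("EN-1", "ES-1"),
    ("ES-1", "ES-2"), ("ES-1", "ES-2"),
    ("ES-1", "EN-2"),
    ("EN-1", "ES-2"), ("EN-1", "ES-2"), ("ES-2", "EN-1"),
    ("EN-1", "EN-2"), ("EN-1", "EN-2"),
    ("ES-2", "EN-2"), ("ES-2", "EN-2"), ("EN-2", "ES-2"),
]

strengths = bradley_terry(judgements)
ranking = sorted(strengths, key=strengths.get, reverse=True)
print(ranking)
```

Because each essay’s estimated strength depends on which essays it was compared with as well as how often it ‘won’, a single scale can place essays side by side even when no one examiner has seen all of them.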

The next question was whether this approach could work when bilingual examiners were asked to compare work written in different languages. Our Language A: Literature examiners were therefore asked to review pairs of essays in English, pairs of essays in Spanish—and pairs of essays where one was in English and the other in Spanish. Information from over 4,000 examiner judgements was used to rank order the student responses, from strongest to weakest—both within each language, and across languages. Finally, we asked examiners for feedback on how well they believed this whole process worked.

Interestingly, when considering only the numbers, using comparative judgement bilingually across languages seemed pretty successful! Overall, the examiners generally agreed with each other about which essays were stronger and which were weaker. There were some slight indications that the bilingual judgements were a little less reliable but, on the whole, the method seemed to work. We could, in theory, have generated a mark for each student’s piece of work from the bilingual comparisons that, from a statistical point of view, would be considered a ‘reliable’ result.

But the examiners’ feedback showed it was not quite so simple. Whilst they saw the potential for cross-language standardisation as an advantage, there were also many challenges. Most found making bilingual judgements more difficult than traditional marking, as thinking in two languages at once was intellectually demanding. And, on a large scale, the method would require numerous examiners with very high levels of bilingualism, which would be a huge challenge in terms of recruitment. It would also require a way to measure and check examiners’ proficiency in both languages, to ensure that all examiners could access and understand student responses equally well in each.

Examiners also found intriguing academic differences in the way students in each language wrote their essays. There were differences in style, and in the ways in which students structured their essays in English A: Literature compared to Spanish A: Literature. They also noticed that Spanish A: Literature students tended to take a more contextual approach (for example, commenting on how aspects such as authors’ biographical details might have influenced the writing of the texts), whereas English A: Literature students seemed more likely to analyse texts from a technical point of view, such as by focusing on formal literary devices.

All of this raises a number of interesting questions: What are the most appropriate ways to compare results from different language versions of IB assessments? How do we design assessments that allow for culturally different approaches to the same subject? How can we gather meaningful evidence on how academic skills are represented and understood in different linguistic and cultural groups? When should different language versions be considered variants within one academic subject—and when are they separate subjects in their own right? The answers, of course, will be complicated and messy. But only by continuing to ask and investigate these complex questions can we strive to offer the fairest and most valid assessments to our linguistically and culturally diverse IB community.


Badham L & Furlong A (2022) Summative assessments in a multilingual context: What comparative judgment reveals about comparability across different languages in Literature. International Journal of Testing, 23(2), pp.111-134.

IBO (2022) The IB Diploma Programme Statistical Bulletin – May 2022 Examination Session. Cardiff: International Baccalaureate Organisation. Available at

McGrane J, Kayton H, Double K, Woore R & El Masri Y (2021) Is Science Lost in Translation? Language Effects in the International Baccalaureate Diploma Programme Science Assessments. Final Report. Oxford University Centre for Educational Assessment (OUCEA). Available at: