Resaro has partnered with GovTech Singapore to evaluate whether their automated testimonial tool, Appraiser, is unbiased with respect to gender and race. This article was originally published by GovTech on Medium.
Read on for GovTech's perspective on our bias audit process, and learn how independent third-party testing can help build trust in AI solutions.
Introduction
Across the world, organisations are experimenting with the powerful text generation capabilities of Large Language Models (“LLMs”) for various use cases, from improving knowledge management for their employees to building more personalised chatbots for their customers. Here at GovTech's Data Science and Artificial Intelligence Division (DSAID), we focus on how we can leverage LLMs to further our goal of building tech for the public good.
One such product is Appraiser, a tool which uses LLMs to help our teachers become more efficient and effective in writing personalised testimonials for their students. These testimonials provide valuable insights into a student's qualities and contributions beyond their academic results, and are important for applications to universities or scholarships. However, teachers often spend many hours writing a personalised testimonial for each student, time they could otherwise use to prepare for classes or mark assignments.
Appraiser helps teachers by using LLMs to create an initial draft of each student's testimonial based on factual accolades and draft descriptors provided by the teacher. Teachers then review the LLM-generated draft and enhance it with additional details before submitting it for further vetting and clearance by the school heads. This saves them time and improves the overall quality of the testimonials. We worked closely with the Ministry of Education (MOE) and teachers to test different prototypes and to ensure that Appraiser provides personalised, high-quality drafts that they can work with.
However, one concern we had was whether using LLMs to generate draft testimonials might introduce biases. Recent research has highlighted that LLMs can implicitly perpetuate gender and racial biases learnt from the large amounts of Internet data they are trained on. To ensure that Appraiser was safe before putting it out for wider use, we set out to evaluate whether the LLM underlying Appraiser did indeed generate draft testimonials containing implicit gender or racial biases.
To do so, we partnered with Resaro, an independent, third-party AI assurance firm, to assess the fairness of Appraiser's generated draft testimonials. By shedding light on the challenges and considerations involved in assessing fairness in AI-generated outputs, we aim to contribute to ongoing discussions on the responsible development and deployment of AI in education.
Reviewing Existing Research
The social science literature has shown that there are significant gender biases in letters of recommendation, reference letters, and other professional documents.
For example, Trix and Psenka (2003) found that recommendation letters written for female applicants for medical faculty positions tended to portray the women as teachers, while the men were depicted as researchers. Khan et al. (2021) examined gender bias in doctor residency reference letters and discovered significant differences in gendered adjectives: women were often described as “delightful” or “compassionate”, whereas men were characterised as “leaders” or “exceptional”. Dutt et al. (2016) investigated recommendation letters for postdoctoral fellowships and identified discrepancies in letter length and tone.
Since LLMs have been trained on a large corpus of historical documents, it is unsurprising that researchers have found that LLM-generated outputs perpetuate such stereotypes. A UNESCO study found “unequivocal evidence of prejudice against women”, especially in how they were portrayed and which careers they were associated with, and Bloomberg found that OpenAI's ChatGPT ranked similar resumes differently depending on the race and gender implied by the name on the profile. More relevant to our problem, Wan et al. (2023) explored gender bias in LLM-generated reference letters and found that both GPT-4 and open-source Llama models produce biased outputs.
In Appraiser, teachers are asked to indicate the gender of the student so that the correct pronouns are used for names where the gender is not obvious, while race is not indicated at all. However, both the Bloomberg and Wan et al. (2023) studies highlight how LLMs can infer gender and race from names, which could introduce bias into the generated testimonials. While various mitigation measures could be applied to alleviate such biases, we wanted to first conduct a robust study to rigorously measure any differences in the generated testimonials across gender and race.
Methodology
Inspired by Wan et al. (2023), we examined whether there were any differences in the language style (how it is written) and lexical content (what is written) of the generated testimonials across gender and race, for students with similar qualities and achievements.
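To give a concrete sense of what a lexical-content comparison can involve, the sketch below counts “communal” versus “agentic” trait words, of the kind highlighted in the studies above, and compares two groups of testimonials with an odds ratio. The word lists and function names are illustrative placeholders only, not the actual evaluation code.

```python
from collections import Counter

# Illustrative (non-exhaustive) word lists, loosely following the "communal"
# vs "agentic" adjective categories used in the bias literature cited above.
COMMUNAL_WORDS = {"caring", "compassionate", "warm", "helpful", "delightful"}
AGENTIC_WORDS = {"exceptional", "confident", "ambitious", "independent", "leader"}

def trait_counts(testimonials):
    """Count communal and agentic trait words across a list of testimonials."""
    counts = Counter(communal=0, agentic=0)
    for text in testimonials:
        for word in text.lower().split():
            word = word.strip(".,;:!?\"'")
            if word in COMMUNAL_WORDS:
                counts["communal"] += 1
            elif word in AGENTIC_WORDS:
                counts["agentic"] += 1
    return counts

def agentic_odds_ratio(group_a, group_b):
    """Odds ratio of agentic-to-communal language between two groups of
    testimonials (e.g. those generated for male vs female names).
    Add-one smoothing avoids division by zero on small samples."""
    a, b = trait_counts(group_a), trait_counts(group_b)
    return ((a["agentic"] + 1) / (a["communal"] + 1)) / ((b["agentic"] + 1) / (b["communal"] + 1))
```

An odds ratio well above or below 1 would suggest that one group is systematically described in more agentic (or more communal) terms than the other.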
First, we generated synthetic student profiles across different genders and races, which we would use to generate testimonials with Appraiser. We mixed different combinations of personal attributes, CCA records, and academic achievements, and tagged each combination to one of eight names indicative of a specific gender (male or female) and race (Chinese, Malay, Indian, Eurasian). This allowed us to isolate the effect of changing the name, and thus the gender and/or race, on the language style and lexical content of the generated testimonial. The attributes were carefully selected to represent a wide range of student profiles. A sample input and its accompanying generated output are shown below.
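As a rough illustration of how such counterfactual profiles can be constructed, the sketch below crosses every attribute combination with every name. The names and attributes here are invented placeholders rather than the ones used in the evaluation.

```python
from itertools import product

# Placeholder names and attributes, invented for illustration; the actual study
# used its own curated lists of names, CCA records, and achievements.
NAMES = {
    ("female", "chinese"): "Tan Mei Ling",
    ("male", "chinese"): "Lim Wei Jie",
    ("female", "malay"): "Nur Aisyah",
    ("male", "malay"): "Muhammad Irfan",
    ("female", "indian"): "Priya Rajan",
    ("male", "indian"): "Arjun Kumar",
    ("female", "eurasian"): "Rachel de Souza",
    ("male", "eurasian"): "Marcus Pereira",
}
PERSONAL_ATTRIBUTES = ["resilient and hardworking", "creative and inquisitive"]
CCA_RECORDS = ["captain of the basketball team", "member of the robotics club"]
ACHIEVEMENTS = ["top in level for mathematics", "distinction at the science fair"]

def build_profiles():
    """Cross every attribute combination with every name, so that within a
    combination the only thing that varies is the (gender, race) implied
    by the name."""
    profiles = []
    for attrs, cca, achievement in product(PERSONAL_ATTRIBUTES, CCA_RECORDS, ACHIEVEMENTS):
        for (gender, race), name in NAMES.items():
            profiles.append({
                "name": name,
                "gender": gender,
                "race": race,
                "personal_attributes": attrs,
                "cca_record": cca,
                "achievement": achievement,
            })
    return profiles

profiles = build_profiles()  # 2 x 2 x 2 attribute combinations x 8 names = 64 profiles
```

Because every name sees exactly the same attribute combinations, any systematic difference in the resulting testimonials can be attributed to the name, and hence to the gender and race it signals.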