Testing Dataset
16. To ensure comprehensive testing, SIA and Resaro collaborated to create a diverse dataset that combined real-world examples with carefully generated synthetic data, covering a broad range of scenarios. To begin, SIA provided access to the questions used in its internal testing, as well as the ground-truth retrieved documents and answers for these. Resaro expanded this dataset in two ways:
- Synthetic Data Generation: Resaro used its internal LLM-based synthetic data pipeline to generate additional synthetic questions and answers. This expanded the breadth of the dataset and allowed for additional testing across a wider range of potential scenarios.
- Data Perturbation: Resaro perturbed the questions to test the application’s robustness. These perturbations, such as synonyms and misspellings, were designed to mirror real-world usage scenarios rather than adversarial cases, which were already mitigated by the application’s guardrails.
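As an illustration of the perturbations described above, the following minimal Python sketch applies either a synonym swap or a character transposition to a question. The synonym map, function names and example question are illustrative assumptions rather than the actual pipeline used by Resaro.

```python
import random

# Illustrative synonym map; in practice this would be built per domain
SYNONYMS = {
    "baggage": ["luggage"],
    "refund": ["reimbursement"],
    "flight": ["plane ticket"],
}

def swap_synonyms(question: str, rng: random.Random) -> str:
    """Replace known words with a synonym to mimic natural phrasing variation."""
    words = question.split()
    return " ".join(
        rng.choice(SYNONYMS[w.lower()]) if w.lower() in SYNONYMS else w
        for w in words
    )

def introduce_typo(question: str, rng: random.Random) -> str:
    """Swap two adjacent characters in one word to mimic a common misspelling."""
    words = question.split()
    candidates = [i for i, w in enumerate(words) if len(w) > 3]
    if not candidates:
        return question
    i = rng.choice(candidates)
    w = words[i]
    j = rng.randrange(len(w) - 1)
    words[i] = w[:j] + w[j + 1] + w[j] + w[j + 2:]
    return " ".join(words)

def perturb(question: str, seed: int = 0) -> str:
    """Apply one randomly chosen, non-adversarial perturbation."""
    rng = random.Random(seed)
    return rng.choice([swap_synonyms, introduce_typo])(question, rng)

print(perturb("How do I request a refund for my flight?", seed=1))
```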
17. The final dataset comprised 2,800 questions, offering a comprehensive mix of real-world and synthetic data. Approximately two-thirds of the dataset was fully synthetic, while the remaining one-third was partially synthetic, based on perturbations of SIA’s internal testing dataset.
Testing Methodology
18. Application testing comprised two stages, which were carried out in relation to both the non-perturbed questions (to identify baseline performance) and the perturbed questions (to assess robustness):
- Correctness of the Final Output: Resaro first measured generation correctness, assessing whether the final output contained all the information from the ground-truth answer, and only the information from the ground-truth answer.
- Testing of Individual Components: Resaro then used other metrics to diagnose specific issues in the retrieval or synthesis stages that contributed to the application’s performance in generating the final output.
First Stage: Baseline Performance – Testing of the Final Output
19. The first stage focused on measuring the correctness of the application’s final output, providing insights into its alignment with ground-truth data. This stage comprised the following four steps:
- Decomposition of Responses and Ground-Truth Answers: Both the generated responses and ground-truth answers were broken down into distinct claims4 using an LLM.
- Precision5 Assessment: Each generated claim was checked by a judge LLM to verify that it was supported by a ground-truth claim, ensuring no extraneous or fabricated information was included.
- Recall6 Assessment: Each ground-truth claim was checked by a judge LLM to verify its presence in the generated response, ensuring completeness.
- F1 Score7 Calculation: Precision and recall scores were combined into an F1 score. This provided a single measure of the balance between the completeness (recall) and accuracy (precision) of the generated response.
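A minimal sketch of steps two to four is shown below, assuming the claim decomposition has already been performed and that a helper `judge_supports(claim, reference_claims)` wraps a call to a judge LLM; the helper and its interface are illustrative assumptions rather than the actual implementation used in this exercise.

```python
from typing import Callable, List

def claim_scores(
    generated_claims: List[str],
    ground_truth_claims: List[str],
    judge_supports: Callable[[str, List[str]], bool],
) -> dict:
    """Compute claim-level precision, recall and F1 for one question.

    judge_supports(claim, reference_claims) is assumed to return True if a
    judge LLM finds the claim supported by at least one reference claim.
    """
    # Precision: share of generated claims supported by the ground truth
    supported = sum(judge_supports(c, ground_truth_claims) for c in generated_claims)
    precision = supported / len(generated_claims) if generated_claims else 0.0

    # Recall: share of ground-truth claims present in the generated response
    found = sum(judge_supports(c, generated_claims) for c in ground_truth_claims)
    recall = found / len(ground_truth_claims) if ground_truth_claims else 0.0

    # F1: harmonic mean, balancing completeness (recall) and accuracy (precision)
    f1 = (
        2 * precision * recall / (precision + recall)
        if (precision + recall) > 0
        else 0.0
    )
    return {"precision": precision, "recall": recall, "f1": f1}
```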
Second Stage: Deep Dive into Baseline Performance – Testing of Individual Components
20. The F1 score from the first stage provided a high-level measure of the application’s performance but lacked the granularity needed to provide actionable insights into specific components within the application pipeline. For example, precision challenges, such as the inclusion of incorrect or extraneous information in the output, could stem from issues like hallucination (i.e., content not grounded in retrieved documents) or retrieval of irrelevant documents. However, the F1 score alone would not shed light on this, highlighting the need for other metrics.
21. The second stage of testing therefore aimed to test the individual components of the application using three additional metrics:
- Generation Faithfulness: Assessed whether the claims in the final response were grounded in the retrieved documents, helping to identify instances of unsupported or fabricated information.
- Retrieval Precision: Assessed whether irrelevant or unrelated documents were retrieved during the information retrieval stage.
- Retrieval Recall: Measured the extent to which all necessary documents were retrieved, identifying omissions that could lead to incomplete responses.
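The following minimal sketch illustrates the two retrieval metrics above, assuming each question has a set of ground-truth document IDs; the function and variable names are illustrative. Generation faithfulness would be computed analogously to the claim-level precision sketched earlier, but with the judge LLM checking claims against the retrieved documents rather than the ground-truth answer.

```python
from typing import Iterable, Set

def retrieval_precision(retrieved_ids: Iterable[str], relevant_ids: Set[str]) -> float:
    """Share of retrieved documents that are actually relevant (penalises noise)."""
    retrieved = list(retrieved_ids)
    if not retrieved:
        return 0.0
    return sum(doc_id in relevant_ids for doc_id in retrieved) / len(retrieved)

def retrieval_recall(retrieved_ids: Iterable[str], relevant_ids: Set[str]) -> float:
    """Share of ground-truth documents that were retrieved (penalises omissions)."""
    if not relevant_ids:
        return 0.0
    retrieved = set(retrieved_ids)
    return len(retrieved & relevant_ids) / len(relevant_ids)

# Example: three documents retrieved, two of which are in the ground-truth set of three
print(retrieval_precision(["d1", "d2", "d9"], {"d1", "d2", "d3"}))  # 2/3 ≈ 0.67
print(retrieval_recall(["d1", "d2", "d9"], {"d1", "d2", "d3"}))     # 2/3 ≈ 0.67
```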
First and Second Stages: Robustness Testing – Perturbed Questions
22. This portion of the testing process examined the application’s ability to handle perturbed questions that reflect real-world variations in user input, such as misspellings and the use of synonyms. Robustness testing offered a valuable perspective on the application’s capacity to manage such diverse inputs, effectively complementing the baseline performance test.
Applying the Testing Results and Continual Testing & Monitoring
23. The metrics from testing provide actionable insights:
- The end-to-end test of the final output highlighted the performance of the application in a human-interpretable manner. For example, a generation recall score of 0.95 shows that 95% of the claims in the ground-truth answers were found in the generated answers.
- The detailed test of individual components could show which part of the application to focus on for further development. For example, if the retrieval pipeline performs poorly in robustness tests, it may need to be enhanced to deal with diversity in user queries, such as by fine-tuning the re-ranking model. A sketch of this kind of diagnostic reading follows this list.
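As an illustration of such diagnostic reading, the following minimal sketch maps component-level metrics to possible follow-up actions; the rules and the threshold value are hypothetical simplifications rather than decision logic used in this exercise.

```python
def diagnose(metrics: dict, threshold: float = 0.8) -> list:
    """Map component-level metrics to hypothetical, simplified follow-up actions."""
    actions = []
    if metrics["retrieval_recall"] < threshold:
        actions.append("Necessary documents are being missed, which can lead to incomplete responses.")
    if metrics["retrieval_precision"] < threshold:
        actions.append("Irrelevant documents are being retrieved; the retrieval pipeline (e.g. the re-ranking model) may need enhancement.")
    if metrics["generation_faithfulness"] < threshold:
        actions.append("The output contains claims not grounded in the retrieved documents (possible hallucination).")
    return actions

print(diagnose({"retrieval_recall": 0.9, "retrieval_precision": 0.7, "generation_faithfulness": 0.85}))
```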
24. Moving from the metrics to a go/no-go decision on deployment requires thresholds. These need to be determined by the organisation based on the nature of the application and its tolerance for error; a sketch of such a threshold check follows the list below. In making a deployment decision, organisations may consider:
- Conducting a systematic analysis of error patterns to identify potential failure modes, assessing how these could affect end users and determining whether the impact is acceptable. In the context of generation correctness, error patterns include omissions (i.e. not including relevant details in the output) and ungrounded responses (i.e. output is not based on the retrieved data).
- Assessing whether the metrics used accurately reflect their real-world requirements and adjusting their interpretation accordingly. For example, a low F1 score driven by poor recall performance may be acceptable in applications where precision is the priority.8
- Benchmarking results against industry standards, prior versions of applications or alternative solutions to contextualise performance.
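As an illustration of how thresholds might translate the metrics into a go/no-go decision, a minimal sketch follows; the metric names and threshold values are hypothetical and would need to be set by each organisation according to its own tolerance for error.

```python
# Hypothetical thresholds; an organisation would set these based on its own risk tolerance
THRESHOLDS = {
    "generation_recall": 0.90,      # completeness of the final output
    "generation_precision": 0.95,   # absence of extraneous or fabricated information
    "retrieval_recall": 0.85,       # all necessary documents retrieved
}

def deployment_decision(metrics: dict):
    """Return a go/no-go flag plus the list of metrics that fell below threshold."""
    failures = [
        name for name, threshold in THRESHOLDS.items()
        if metrics.get(name, 0.0) < threshold
    ]
    return (not failures, failures)

go, failures = deployment_decision(
    {"generation_recall": 0.95, "generation_precision": 0.93, "retrieval_recall": 0.88}
)
print(go, failures)  # False ['generation_precision'] -- precision fell below its threshold
```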
25. While pre-deployment testing helps to validate that the application meets the organisation’s own internal safety standards, continuous testing of the application’s output and behaviour post-deployment is equally important, as it can help detect emerging issues in real time and enable timely intervention. These tests should be run whenever material changes are made to the application, to check whether the changes have affected safety and performance. Even if no changes are made, however, such testing remains important. For example, it allows for the detection of data drift, where the real-world data that the application encounters during deployment differs from the training data it was built on. Testing for data drift can be performed by capturing a share of the production data that is as different as possible from the pre-deployment data and drafting ground-truth responses to test against the actual responses of the application; a sketch of this sampling step is shown below.
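A minimal sketch of the drift-sampling idea follows. It assumes query embeddings are available from some embedding model and selects the production queries least similar to the pre-deployment set; the selection rule is an illustrative assumption, not the method used in this exercise.

```python
import numpy as np

def select_drift_candidates(
    production_embeddings: np.ndarray,   # shape (n_prod, d): embeddings of production queries
    reference_embeddings: np.ndarray,    # shape (n_ref, d): embeddings of pre-deployment queries
    k: int = 50,
) -> np.ndarray:
    """Return indices of the k production queries most dissimilar to the pre-deployment set.

    Dissimilarity is measured via cosine similarity to the nearest reference query,
    so queries that look unlike anything seen pre-deployment are surfaced for
    ground-truth drafting and re-testing.
    """
    prod = production_embeddings / np.linalg.norm(production_embeddings, axis=1, keepdims=True)
    ref = reference_embeddings / np.linalg.norm(reference_embeddings, axis=1, keepdims=True)
    similarity = prod @ ref.T            # cosine similarity matrix
    nearest = similarity.max(axis=1)     # best match in the reference set
    return np.argsort(nearest)[:k]       # lowest similarity = most drifted
```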
Future Areas of Work
26. The current testing methodology provided a structured framework for testing the output correctness of a RAG-based website search application. By combining various metrics such as precision, recall and generation faithfulness, the approach offered valuable insights into the application’s ability to provide accurate and complete responses, at both the output and component levels.
27. The test design laid a foundation for future iterations that could: (a) account for other factors in user satisfaction, such as clarity, relevance, and fluency; (b) take into account the relative importance of the claims in the ground-truth data; and (c) account for LLMs’ tendency to generate varying outputs for the same query. These aspects, elaborated upon below, represent opportunities for a more comprehensive evaluation of the application’s performance.
28. Capturing linguistic quality and user experience: Responses that are correct but poorly structured or difficult to understand can undermine an application’s utility. Additional metrics such as linguistic quality, relevance or user experience are thus vital to obtaining a more comprehensive understanding of the application’s overall performance. That said, there will still be a need for ground-truth-based metrics. User feedback is useful for understanding user experience, but given that users are unlikely to have ground-truth information, it is unlikely to accurately capture the correctness of a response.
29. Prioritising claims: Incorporating a measure of relative importance between claims would facilitate a prioritisation of the most critical claims, allowing the testing process to better align with human judgment and provide more meaningful insights into the application’s performance.
30. Executing multiple runs: Since LLMs are probabilistic and can generate varying outputs for the same query, conducting multiple runs per input can help improve the statistical robustness of the testing process. Averaging performance across multiple runs would provide a more reliable measure of the application’s output correctness, reducing the impact of response variability. Additionally, multiple runs allow for the identification of inputs that the application consistently struggles to get right, highlighting particularly problematic queries that may warrant greater focus.
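A minimal sketch of how multiple runs per input might be aggregated follows, assuming a `generate_fn` that calls the application and a `score_fn` such as the claim-level F1 sketched earlier; both helpers and the 0.5 cut-off for flagging consistently poor inputs are illustrative assumptions.

```python
from statistics import mean
from typing import Callable, List

def multi_run_scores(
    question: str,
    ground_truth: str,
    generate_fn: Callable[[str], str],       # calls the application once for the question
    score_fn: Callable[[str, str], float],   # e.g. the claim-level F1 sketched earlier
    runs: int = 5,
) -> dict:
    """Score one question over several runs to account for output variability."""
    scores: List[float] = [score_fn(generate_fn(question), ground_truth) for _ in range(runs)]
    return {
        "mean_score": mean(scores),                            # more reliable point estimate
        "min_score": min(scores),                              # worst-case behaviour
        "consistently_poor": all(s < 0.5 for s in scores),     # hypothetical cut-off for flagging hard inputs
    }
```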
Conclusion
31. This exercise demonstrated that LLM application testing, complemented with 3P testing, is an important step in ensuring overall AI safety and reliability. The structured approach shared here serves as a reference point for future testers and contributes to global efforts to standardise LLM application testing. Nonetheless, opportunities for methodological improvement remain, given that the science of AI testing is nascent and evolving quickly. Testing should also not be a one-time exercise; it must be conducted periodically and complemented by continuous monitoring to address evolving challenges and maintain application safety and reliability over time.
1 Bengio, Y., Mindermann, S., Privitera, D., Besiroglu, T., Bommasani, R., Casper, S., Choi, Y., Fox, P., Garfinkel, B., Goldfarb, D., Heidari, H., Ho, A., Kapoor, S., Khalatbari, L., Longpre, S., Manning, S., Mavroudis, V., Mazeika, M., Michael, J., … Zeng, Y. (2025). International AI Safety Report. arXiv. https://doi.org/10.48550/arXiv.2501.17805
2 As set out in Bengio et al. (2025), risk identification and assessment form the initial stages of AI risk management practices. Bengio, Y., Mindermann, S., Privitera, D., Besiroglu, T., Bommasani, R., Casper, S., Choi, Y., Fox, P., Garfinkel, B., Goldfarb, D., Heidari, H., Ho, A., Kapoor, S., Khalatbari, L., Longpre, S., Manning, S., Mavroudis, V., Mazeika, M., Michael, J., … Zeng, Y. (2025). International AI Safety Report. arXiv. https://doi.org/10.48550/arXiv.2501.17805
3 We referred to the risk taxonomy in Slattery et al. (2024). Slattery, P., Saeri, A. K., Grundy, E. A. C., Graham, J., Noetel, M., Uuk, R., Dao, J., Pour, S., Casper, S., & Thompson, N. (2024). The AI risk repository: A comprehensive meta-review, database, and taxonomy of risks from artificial intelligence. arXiv. https://doi.org/10.48550/arXiv.2408.12622
4 Claims are distinct pieces of information or factual statements. For example, a generated response from the application might be “Travellers require a visa for entry and their passport must be valid for at least six months”. This response might be broken down into two claims: (1) A visa is required for entry and (2) The passport must be valid for at least six months.
5 Precision measures the accuracy of a system’s outputs by assessing how many of its results are correct. Imagine searching for “red shoes” online. High precision means that most of the results are actually red shoes, not sandals.
6 Recall measures the completeness of a system’s outputs by assessing how many of the total possible correct results it successfully identifies. Using the same example, high recall means the search engine found most of the red shoes available online, even if it also showed other red items.
7 The F1 score combines precision and recall into a single score. It measures the overall accuracy of a system, balancing showing only relevant results (i.e. precision) with finding all the relevant results (i.e. recall). A high F1 score for red shoes would mean that most of the results shown are red shoes and that most of the red shoes available were found.
8 For example, in email spam filters, while high recall (i.e. catching all spam emails) is beneficial, it is generally less harmful to let some spam emails through than to misclassify critical messages as spam and risk them being overlooked.