Evaluating the Performance of ChindaLLM, the Thai AI Chatbot Assistant
1. Introduction to ChindaLLM
This evaluation examines the performance of ChindaLLM, the Thai AI Chatbot Assistant, which is designed to deliver highly accurate responses across a wide range of documents. Built on the latest version of OpenThaiGPT, the most advanced Thai LLM available, ChindaLLM provides precise answers in 10 languages, improving internal communication and customer service within organizations.
Key features include:
- Retrieval-Augmented Generation (RAG) with multi-document handling
- Integration through LINE, Messenger, and website interfaces
- Custom functions and tool calls using agent-based AI
- Support for multiple models and personalization
- Web search capabilities
- Multimodal support
- Text-to-speech, speech-to-text, and image generation
The system was evaluated on three datasets: TyDiQA, XQuAD, and iapp_wiki_qa_squad. These datasets consist of Thai extractive question-answer pairs with their accompanying contexts, totaling 2,695 samples.
Evaluation Setting
Model Configuration
- Model: OpenThaiGPT1.5 7B
- Temperature: 0.2
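As a reference point, here is a minimal sketch of how such a configuration might be queried through Hugging Face transformers. The checkpoint id and the prompt contents are illustrative assumptions, not the production setup.

```python
# Minimal sketch of querying an OpenThaiGPT 1.5 7B instruct checkpoint at
# temperature 0.2. The model id and prompt are illustrative assumptions.
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "openthaigpt/openthaigpt1.5-7b-instruct"  # assumed checkpoint name

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(MODEL_ID, device_map="auto")

messages = [
    {"role": "system", "content": "Answer using only the provided context."},
    {"role": "user", "content": "Context: ...\n\nQuestion: ..."},
]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

output = model.generate(
    inputs,
    max_new_tokens=256,
    do_sample=True,
    temperature=0.2,  # evaluation setting listed above
)
print(tokenizer.decode(output[0][inputs.shape[-1]:], skip_special_tokens=True))
```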
We collect all unique contexts in the dataset and store them in a retrieval index. For each question, the system retrieves the top k most relevant documents and checks whether they include the ground-truth document for that question.
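As an illustration, here is a minimal retrieval sketch under assumed components; the report does not name the embedding model or index that was actually used.

```python
# Illustrative retrieval sketch: embed every unique context once, then return
# the top-k contexts for a question by cosine similarity. The encoder name is
# an assumption; the actual embedding model is not specified in the report.
import numpy as np
from sentence_transformers import SentenceTransformer

encoder = SentenceTransformer("intfloat/multilingual-e5-base")  # assumed encoder

def build_index(examples):
    """Collect every unique context and embed it (normalized for cosine similarity)."""
    contexts = sorted({ex["context"] for ex in examples})
    embeddings = encoder.encode(contexts, normalize_embeddings=True)
    return contexts, embeddings

def retrieve(question, contexts, embeddings, k=10):
    """Return the top-k contexts most similar to the question."""
    q = encoder.encode([question], normalize_embeddings=True)[0]
    order = np.argsort(-(embeddings @ q))[:k]
    return [contexts[i] for i in order]
```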
For P@k, a question scores 1 if any document in the retrieved set matches the ground truth and 0 otherwise; the scores are then averaged across all questions.
For MRR@k, the score depends on the ground-truth document's rank: a question scores 1/rank if the document appears in the retrieved set, and 0 if it does not; the scores are then averaged across all questions. Finally, the retrieved documents and the question are used to generate an answer for comparison.
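The two scoring rules can be written as a short sketch; the function and variable names below are ours, not those of the evaluation harness.

```python
# Illustrative sketch: given, for each question, the ranked list of retrieved
# document ids and the ground-truth document id, compute P@k and MRR@k.
def precision_at_k(retrieved_ids, gold_id, k):
    """1 if the ground-truth document appears in the top-k results, else 0."""
    return 1.0 if gold_id in retrieved_ids[:k] else 0.0

def mrr_at_k(retrieved_ids, gold_id, k):
    """1/rank of the ground-truth document within the top-k results, else 0."""
    for rank, doc_id in enumerate(retrieved_ids[:k], start=1):
        if doc_id == gold_id:
            return 1.0 / rank
    return 0.0

def evaluate(runs, k=10):
    """Average both metrics over all questions.

    `runs` is a list of (retrieved_ids, gold_id) pairs, one per question.
    """
    p = sum(precision_at_k(r, g, k) for r, g in runs) / len(runs)
    mrr = sum(mrr_at_k(r, g, k) for r, g in runs) / len(runs)
    return {f"P@{k}": p, f"MRR@{k}": mrr}

# Example: the gold document is ranked 2nd for the first question, 1st for the second.
print(evaluate([(["d3", "d7", "d1"], "d7"), (["d2", "d9"], "d2")], k=10))
# -> {'P@10': 1.0, 'MRR@10': 0.75}
```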
Evaluation Datasets
TyDiQA (763 samples): A question answering benchmark covering 11 typologically diverse languages, including Thai, with 204K question-answer pairs. It contains language phenomena that would not be found in English-only corpora. To provide a realistic information-seeking task and avoid priming effects, questions are written by people and the data is collected directly in each language without the use of translation. This evaluation dataset is part of the Thai Sentence Embedding Leaderboard.
XQuAD (1,190 samples): A benchmark for evaluating cross-lingual question answering performance. It consists of a subset of 240 paragraphs and 1,190 question-answer pairs from the development set of SQuAD v1.1, together with professional translations into ten languages: Spanish, German, Greek, Russian, Turkish, Arabic, Vietnamese, Thai, Chinese, and Hindi. Consequently, the dataset is entirely parallel across 11 languages. This evaluation dataset is part of the Thai Sentence Embedding Leaderboard.
iapp_wiki_qa_squad (742 samples): An extractive question answering dataset built from Thai Wikipedia articles, adapted from the original iapp-wiki-qa-dataset to SQuAD format. This evaluation dataset is part of the Thai LLM Leaderboard.
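For reference, these evaluation sets can typically be loaded through the Hugging Face `datasets` library. The dataset ids, configs, splits, and the Thai filter below are assumptions and may differ from the exact setup used in this evaluation.

```python
# Assumed Hugging Face dataset ids/configs; verify against the hubs before use.
from datasets import load_dataset

# TyDiQA gold-passage task; keep only the Thai portion.
tydiqa = load_dataset("tydiqa", "secondary_task", split="validation")
tydiqa_th = tydiqa.filter(lambda ex: ex["id"].startswith("thai"))  # assumed id prefix

# XQuAD Thai split.
xquad_th = load_dataset("xquad", "xquad.th", split="validation")

# Thai Wikipedia extractive QA in SQuAD format.
iapp = load_dataset("iapp_wiki_qa_squad", split="validation")
```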
Key Metrics Overview
- P@K (Precision at K): Measures how often the retrieval system returns the ground-truth document within the top K retrieved documents, averaged across all questions.
- MRR@K (Mean Reciprocal Rank at K): Extends P@K by also accounting for the rank at which the ground-truth document is retrieved: each question scores 1/rank if the document appears in the top K retrieved documents and 0 otherwise, averaged across all questions.
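In symbols (our notation: N evaluation questions, ground-truth document d_i* for question q_i, and rank_i its position in the retrieved list):

$$
\mathrm{P@K} = \frac{1}{N}\sum_{i=1}^{N}\mathbb{1}\!\left[d_i^{*}\in \mathrm{top}_K(q_i)\right],
\qquad
\mathrm{MRR@K} = \frac{1}{N}\sum_{i=1}^{N}\frac{\mathbb{1}\!\left[d_i^{*}\in \mathrm{top}_K(q_i)\right]}{\mathrm{rank}_i}
$$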
Evaluation Results
Precision@K
| Dataset | Precision@1 | Precision@5 | Precision@10 |
|---|---|---|---|
| TyDiQA | 0.8912 | 0.9879 | 0.9934 |
| XQuAD | 0.9059 | 0.9916 | 0.9941 |
| iapp_wiki_qa_squad | 0.9286 | 0.9663 | 0.9784 |
MRR@K
| Dataset | MRR@10 |
|---|---|
| TyDiQA | 0.9343 |
| XQuAD | 0.9439 |
| iapp_wiki_qa_squad | 0.9446 |
Conclusion
- Precision@1 is highest on iapp_wiki_qa_squad (0.9286) → The retrieval system places the correct relevant document in the 1st position more frequently on this dataset than on the others.
- Precision@5 and Precision@10 are close to 1.0 on all evaluation datasets → The top 5 and top 10 retrieved documents almost always contain the correct relevant document.
- On XQuAD, Precision@5 and Precision@10 are the highest (0.9916 and 0.9941) → The retrieval system most reliably includes the relevant document in its top results on this dataset.
- MRR@10 is highest on iapp_wiki_qa_squad (0.9446) → Relevant documents are most often retrieved in the 1st or 2nd position.
- XQuAD follows with an MRR@10 of 0.9439 → very close to iapp_wiki_qa_squad.
- TyDiQA has the lowest MRR@10 (0.9343) → Although this is the lowest value among the three datasets, it is still very high, with relevant documents frequently appearing in the 1st or 2nd position.