Datasets
We believe in contributing to the AI research community by sharing high-quality datasets. Below you can find datasets we've created and made available for research purposes.
Available Datasets on Hugging Face 🤗
All our open datasets are available on Hugging Face 🤗
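Because the datasets are hosted on Hugging Face, they can usually be pulled with the `datasets` library. The sketch below is one way to peek at the first few rows of a dataset. The helper name `preview_dataset`, its `loader` parameter, and the default split are assumptions made for illustration. The exact repository ID, split names, and any required configuration name are not stated on this page and should be taken from each dataset's Hugging Face card.

```python
from itertools import islice


def preview_dataset(repo_id, split="train", n=3, loader=None):
    """Return the first n rows of a dataset split as a list of dicts.

    repo_id is the Hugging Face repository ID copied from the dataset card.
    loader is a hypothetical hook for swapping in a stub; by default the
    function uses datasets.load_dataset.
    """
    if loader is None:
        # Imported lazily so the helper can be defined without the library installed.
        from datasets import load_dataset
        loader = load_dataset
    ds = loader(repo_id, split=split)
    # Iterating a split yields one dict per example.
    return list(islice(ds, n))


# Example call (placeholder ID, replace with the real one from the dataset card):
# rows = preview_dataset("<org>/<dataset-name>", split="test")
```

Some datasets only publish a `test` or `validation` split, so the `split` argument may need adjusting per dataset.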
Evaluation & Benchmarking
OpenThaiEval
Comprehensive evaluation dataset for Thai language models covering various tasks and domains.
📊 Model evaluation and benchmarking
OpenAI HumanEval-TH
Thai translation of OpenAI's HumanEval dataset for evaluating code generation capabilities in a Thai context.
📊 Code generation evaluation
Mathematics & Reasoning
AIME 2024-TH
Thai version of the American Invitational Mathematics Examination (AIME) 2024 problems for testing mathematical reasoning.
📊 Mathematical problem-solving evaluation
Math-500-TH
Collection of 500 mathematics problems in Thai for training and evaluating mathematical reasoning capabilities.
📊 Math problem-solving training and evaluation
AIMO Validation AIME-TH
Validation set for AI Mathematical Olympiad problems in Thai.
📊 Advanced mathematical reasoning evaluation
Training Datasets
Thai-R1-Distill-SFT
Supervised fine-tuning dataset distilled from reasoning models for the Thai language.
📊 Fine-tuning language models with reasoning capabilities
Code Generation Lite-TH
Lightweight dataset for training and evaluating code generation in a Thai context.
📊 Code generation model training
Specialized Domains
⭐ Thai Handwriting Dataset
Extensive collection of Thai handwritten text samples for OCR and handwriting recognition.
📊 Handwriting recognition, OCR training
- Various writing styles
- Diverse handwriting samples
- Ground truth annotations
RAG Thai Laws
Comprehensive collection of Thai legal documents optimized for Retrieval-Augmented Generation (RAG) systems.
📊 Legal AI systems, RAG applications
- Legal texts and regulations
- Pre-processed for RAG applications
- Structured legal information
Dataset Guidelines
📋 Usage Terms
- All datasets are provided for research purposes only
- Commercial use requires explicit permission
- Please cite our work when using these datasets
- Respect privacy and ethical guidelines
📚 How to Cite
@dataset{iapp_datasets_2024,
  author = {iApp Technology Research Team},
  title = {Dataset Name},
  year = {2024},
  publisher = {iApp Technology},
  url = {https://iapp.co.th/researches/datasets}
}
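The entry above uses `Dataset Name` as a placeholder title. As a small convenience, the sketch below fills in that placeholder for a specific dataset and prints a multi-line BibTeX entry. The function name `format_citation` is hypothetical, and every field value other than the title is copied unchanged from the entry above.

```python
def format_citation(dataset_name: str) -> str:
    """Build the BibTeX entry from this page with the title filled in."""
    fields = [
        ("author", "iApp Technology Research Team"),
        ("title", dataset_name),  # the only field that varies per dataset
        ("year", "2024"),
        ("publisher", "iApp Technology"),
        ("url", "https://iapp.co.th/researches/datasets"),
    ]
    body = ",\n".join(f"  {key} = {{{value}}}" for key, value in fields)
    return f"@dataset{{iapp_datasets_2024,\n{body}\n}}"


if __name__ == "__main__":
    print(format_citation("OpenThaiEval"))
```

When citing more than one dataset in the same bibliography, each entry needs its own citation key, so the shared key `iapp_datasets_2024` would have to be adjusted by hand.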
🤝 Contribute
We welcome contributions to our datasets. If you have:
- Corrections or improvements to existing datasets
- New data to contribute
- Suggestions for new datasets
Please contact our research team.
🚀 Upcoming Datasets
We're actively working on releasing more datasets:
- Thai Legal Text Corpus
- Thai Medical Terminology Dataset
- Thai Sentiment Analysis Dataset
- Multi-dialect Thai Speech Dataset
📄 License Information
Different datasets come with different licenses. Please review the license terms for each dataset before use.
For commercial licensing inquiries, please contact us.
We're committed to advancing AI research in Thailand through open collaboration and data sharing.