Name | Task type | Train | Dev | Test | Characteristic | Link
--- | --- | --- | --- | --- | --- | ---
MIRACL | IR | 868 | 213 | - | Multilingual dataset | https://huggingface.co/datasets/miracl/miracl
KLUE | QA | 17554 | 5841 | - | Korean version of GLUE | https://github.com/KLUE-benchmark/KLUE
KorQuAD v2 | QA | 83486 | 10165 | - | Korean version of SQuAD | https://korquad.github.io/
News Article Machine Reading Comprehension Data (뉴스기사 기계독해데이터) | MRC | 200K | | | AI Hub | https://aihub.or.kr/aihubdata/data/view.do?currMenu=115&topMenu=100&aihubDataSe=realm&dataSetSn=577
Administrative Document Machine Reading Comprehension Data (행정문서 기계독해데이터) | MRC | 205K | | | AI Hub | https://aihub.or.kr/aihubdata/data/view.do?currMenu=115&topMenu=100&aihubDataSe=realm&dataSetSn=569
Book Material Machine Reading Comprehension Data (도서자료 기계독해데이터) | MRC | 500K | | | Short context, AI Hub | https://aihub.or.kr/aihubdata/data/view.do?currMenu=115&topMenu=100&aihubDataSe=realm&dataSetSn=92
General Knowledge Data (일반상식 데이터) | QA | 150K | | | AI Hub | https://aihub.or.kr/aihubdata/data/view.do?currMenu=115&topMenu=100&aihubDataSe=realm&dataSetSn=106
TyDi QA | QA | 10981 | 1698 | 1722 | | https://github.com/google-research-datasets/tydiqa