
[dataset] Korean Information Retrieval Datasets

| Name | Task type | Train | Dev | Test | Characteristics | Link |
| --- | --- | --- | --- | --- | --- | --- |
| MIRACL | IR | 868 | 213 | - | Multilingual dataset | https://huggingface.co/datasets/miracl/miracl |
| KLUE | QA | 17554 | 5841 | - | Korean version of GLUE | https://github.com/KLUE-benchmark/KLUE |
| KorQuAD v2 | QA | 83486 | 10165 | - | Korean version of SQuAD | https://korquad.github.io/ |
| News Article Machine Reading Comprehension Data (뉴스기사 기계독해데이터) | MRC | 200K | | | AI Hub | https://aihub.or.kr/aihubdata/data/view.do?currMenu=115&topMenu=100&aihubDataSe=realm&dataSetSn=577 |
| Administrative Document Machine Reading Comprehension Data (행정문서 기계독해데이터) | MRC | 205K | | | AI Hub | https://aihub.or.kr/aihubdata/data/view.do?currMenu=115&topMenu=100&aihubDataSe=realm&dataSetSn=569 |
| Book Material Machine Reading Comprehension Data (도서자료 기계독해데이터) | MRC | 500K | | | Short context, AI Hub | https://aihub.or.kr/aihubdata/data/view.do?currMenu=115&topMenu=100&aihubDataSe=realm&dataSetSn=92 |
| General Knowledge Data (일반상식 데이터) | QA | 150K | | | AI Hub | https://aihub.or.kr/aihubdata/data/view.do?currMenu=115&topMenu=100&aihubDataSe=realm&dataSetSn=106 |
| TyDi QA | QA | 10981 | 1698 | 1722 | Google | https://github.com/google-research-datasets/tydiqa |

 
