IndoNLP/indonlu
A natural language understanding benchmark for the Indonesian language, featuring the IndoBERT and IndoBERT-lite pre-trained models trained on over 20GB of text.
IndoNLU is a collection of natural language understanding (NLU) resources for Bahasa Indonesia that covers 12 downstream tasks. It provides code to reproduce the benchmark results, along with large pre-trained models, including IndoBERT and IndoBERT-lite, trained on approximately 4 billion words from the Indo4B corpus (over 20GB of text). The project serves both as a benchmark for evaluating Indonesian language understanding and as a source of ready-to-use Indonesian language models.
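As a rough illustration of what "ready-to-use" can look like in practice, the sketch below loads an IndoBERT checkpoint and computes contextual embeddings for one sentence. It assumes the `transformers` and `torch` packages are installed and that the weights are available on the Hugging Face Hub under the identifier `indobenchmark/indobert-base-p1`; neither the identifier nor the helper name `embed_sentence` comes from the description above, so check them against the repository before relying on them.

```python
"""Sketch: load an IndoBERT checkpoint and embed one Indonesian sentence."""

# Assumed Hub identifier; it is not given in the description above.
MODEL_NAME = "indobenchmark/indobert-base-p1"


def embed_sentence(text, model_name=MODEL_NAME):
    """Return the last-layer hidden states for `text` as a torch tensor.

    Hypothetical helper written for this example; it is not part of IndoNLU.
    """
    # Imported lazily so the module can be read or imported without
    # the heavy dependencies being present.
    import torch
    from transformers import AutoModel, BertTokenizer

    # Both calls download and cache the files on first use.
    tokenizer = BertTokenizer.from_pretrained(model_name)
    model = AutoModel.from_pretrained(model_name)
    model.eval()  # inference mode: disables dropout

    # Tokenize into PyTorch tensors (input_ids, attention_mask, ...).
    inputs = tokenizer(text, return_tensors="pt")
    with torch.no_grad():  # no gradients needed for feature extraction
        outputs = model(**inputs)

    # Shape: (batch=1, sequence_length, hidden_size)
    return outputs.last_hidden_state
```

Calling `embed_sentence("aku adalah anak gembala")` would fetch the weights on first use and return one embedding vector per token. For the benchmark's downstream tasks, a task-specific head would still need to be fine-tuned on top of these representations.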