DH and NLP Lab

university

https://language.ml

language-ml


AI & ML interests

Digital Humanities and Natural Language Processing Lab

language-ml-lab's activity

kargaranamir

authored 3 papers 3 months ago

How Transliterations Improve Crosslingual Alignment

Paper • 2409.17326 • Published Sep 25, 2024 • 1

GlotCC: An Open Broad-Coverage CommonCrawl Corpus and Pipeline for Minority Languages

Paper • 2410.23825 • Published Oct 31, 2024 • 3

MEXA: Multilingual Evaluation of English-Centric LLMs via Cross-Lingual Alignment

Paper • 2410.05873 • Published Oct 8, 2024 • 3

mahsaamani

updated a model 5 months ago

language-ml-lab/KurdBert

Fill-Mask • Updated Sep 1, 2024 • 1

kargaranamir

posted an update 8 months ago

Introducing GlotCC: a new 2TB corpus based on an early 2024 CommonCrawl snapshot with data for 1000+ languages.

🤗 corpus v1: cis-lmu/GlotCC-V1
🐱 pipeline v3: https://github.com/cisnlp/GlotCC

More details? Stay tuned for our upcoming paper.
More data? In the next version, we plan to include additional snapshots of CommonCrawl.

Limitation: Because low-resource languages appear far less often than high-resource ones, some very low-resource languages have only a few sentences available. Still, English alone accounts for 750GB in this version, and the top 200 languages remain well represented in our data (see the attached plot; we write an index label for every 20 languages, so the 10th index corresponds to the 200th language).
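
For reading the plot, the index convention described above can be written as a small helper. This is only a sketch: the function name `index_to_rank` and the assumption that index labels start at 1 are illustrative and do not come from the GlotCC pipeline.

```python
# One index label is written for every 20 languages (as stated in the post).
LANGUAGES_PER_INDEX = 20


def index_to_rank(index: int) -> int:
    """Map a plot index label to the language rank it stands for.

    Assumes index labels start at 1 (hypothetical; not confirmed by the post).
    """
    if index < 1:
        raise ValueError("index labels are assumed to start at 1")
    return index * LANGUAGES_PER_INDEX


# The post's own example: the 10th index is the 200th language.
print(index_to_rank(10))
```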