Wikipedia's Treasure Trove: Advancing Machine Learning with Diverse Data

Community Article Published June 3, 2024


Wikipedia has long been a valuable source of training and evaluation data for machine learning models. Its openly licensed data is easy to access and process, and has the potential for real-world applications and impact. While Wikipedia lies at the heart of the Wikimedia projects, the Wikimedia community maintains a diverse array of projects, each serving a unique purpose in the global ecosystem of knowledge dissemination and collaboration. The projects contain data in a wide range of modalities: text, image, audio, and structured data are all maintained by a large, international community, which helps ensure the quality of the data for its users.

Similar to Wikimedia, Hugging Face centers community-driven work, collaboration, and accessibility of information. While a lot of Wikimedia data is already available on Hugging Face, large parts of the data are yet to be explored. With this community-created data, high-quality models that reflect a wide range of perspectives could be created, which can benefit the community at large. So here is your chance to create more high-quality datasets from already existing data, build more projects, and contribute to the landscape of ML datasets.

Why Wikimedia Data for ML?

  1. Rich, Diverse Content: Wikipedia articles cover a broad range of topics, from science and technology to arts and culture, providing a varied corpus for training and fine-tuning ML models across domains. Moreover, the multilingual nature of Wikipedia gives access to knowledge in a large number of languages, making it an invaluable resource for building inclusive and globally relevant AI systems.
  2. Multimodal Data: Wikimedia datasets go beyond textual content and include images as well as structured knowledge graphs (KGs) from projects like Wikidata. This enables researchers to explore novel approaches to tasks like image captioning, entity linking, and multimodal learning, thereby enriching the capabilities of AI systems to understand and interpret information across different modalities.
  3. Community-Curated and Openly Licensed: One of the key strengths of Wikimedia datasets is their collaborative and community-driven nature. Articles on Wikipedia and the data on the other projects are authored, edited, and curated by a global community of volunteers, ensuring a diverse range of perspectives and expertise. Moreover, Wikimedia content is released under open licenses such as Creative Commons, making it freely accessible for reuse and redistribution, in line with the principles of open science and knowledge sharing.

Respecting the Community

Central to working with a community is its consent, which means respecting its wishes about which data is reused for model training and how. Data about the community itself, such as edit history, the policies of the different language Wikipedias, and readership data, should be treated with particular care. For example, the English Wikipedia community has written about how Wikipedia is not a laboratory. An important point is that one should not disrupt the community and its projects, e.g., by importing generated articles without the consent of the community. With editing data in particular, it is important to respect editors’ wishes to opt out of research.

More Wikimedia data on Hugging Face - How?

Adding datasets to Hugging Face is easy. If you already have a dataset related to Wikimedia, consider uploading it so that other people can build on your work. If you would like to find out more about the diversity of Wikimedia data and what you could build with it, check out the material from our ICWSM 2024 tutorial on Wikimedia data. The tutorial differentiates between modeling content and modeling behavior, i.e., between data generated by the community as content, such as in this dataset on Hugging Face of Wikipedia articles across languages, and data generated in community interactions, such as this dataset on Hugging Face of policy use in Articles for Deletion discussions on three language Wikipedias. Adding more of this type of data can make a huge difference to the accessibility of the data and to what we can build in the ML community, especially when it covers more modalities, such as the audio and image data on Wikimedia Commons.

If you create a new Wikimedia dataset, consider adding the wikimedia tag and adding it to this collection of the Hugging Face community’s Wikimedia datasets to make the dataset more findable.
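
One way to add the tag is through the YAML metadata block at the top of the dataset card (the README.md file of a dataset repository). The following is a minimal sketch using only the Python standard library. The helper name build_dataset_card, the title, the license identifier, and the language code are illustrative assumptions rather than values from this article; in particular, check which license actually applies to the Wikimedia data you reuse.

```python
def build_dataset_card(title, tags, license_id, languages):
    """Return README.md text with a YAML metadata block for a dataset repo.

    Hypothetical helper for illustration; it only formats text and does
    not talk to the Hugging Face Hub.
    """
    lines = ["---"]
    lines.append(f"license: {license_id}")
    lines.append("language:")
    lines.extend(f"- {lang}" for lang in languages)
    lines.append("tags:")
    lines.extend(f"- {tag}" for tag in tags)
    lines.append("---")
    lines.append("")
    lines.append(f"# {title}")
    return "\n".join(lines) + "\n"


# Placeholder values: replace them with the details of your own dataset.
card = build_dataset_card(
    title="My Wikimedia dataset",
    tags=["wikimedia"],          # the tag suggested above
    license_id="cc-by-sa-4.0",   # assumption; depends on the source data
    languages=["en"],            # assumption; list the languages you cover
)
print(card)
```

Saving the resulting text as README.md in the dataset repository should make the tag appear on the dataset page. The tag can also be added by editing the dataset card directly on the Hub.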

The Hugging Face community has already made significant advancements in converting available data into datasets, and I am looking forward to seeing what you will build next.

Image credits
Title image
Shaking hands: Vectorstall, CC BY-SA 4.0, via Wikimedia Commons; Wikipedia Logo: Wikimedia Foundation, CC BY-SA 3.0, via Wikimedia Commons; Gradient background: JOGOS Public Assets, CC BY-SA 4.0, via Wikimedia Commons