Christopher Schröder

cschroeder

https://github.com/webis-de/small-text

AI & ML interests

NLP, Active Learning, Text Representations, PyTorch

cschroeder's activity

replied to their post 4 days ago

Just a quick note: I will not enter any ideological debates here again.

First off, I think this is a non-issue regardless of which license we use. This is first and foremost a scientific study, and the dataset we’re producing is more of a byproduct—its main purpose is to help other researchers verify our findings. It seems like there might be some misconceptions about this dataset: Think of it as a table of answer codes. It is not a text dataset and therefore not interesting or useful for LLM training (or similar).

Second, we made this decision because the survey doesn’t have any funding and relies on people generously sharing their opinions (without compensation). Given the growing skepticism around data collection, we wanted to be especially careful not to discourage users from participating. Our primary goal is to conduct a study with a population as diverse as possible, and we did not want to lose potential participants who might be less inclined to give away their data without compensation.

posted an update 6 days ago

Here’s just one of the many exciting questions from our survey. If these topics resonate with you and you have experience working on supervised learning with text (i.e., supervised learning in Natural Language Processing), we warmly invite you to participate!

Survey: https://bildungsportal.sachsen.de/umfragen/limesurvey/index.php/538271
Estimated time required: 5–15 minutes
Deadline for participation: January 12, 2025

—

❤️ We’re seeking responses from across the globe! If you know 1–3 people who might qualify for this survey—particularly those in different regions—please share it with them. We’d really appreciate it!

#NLProc #ActiveLearning #ML

2 replies

posted an update 17 days ago

💡𝗟𝗼𝗼𝗸𝗶𝗻𝗴 𝗳𝗼𝗿 𝘀𝘂𝗽𝗽𝗼𝗿𝘁: 𝗛𝗮𝘃𝗲 𝘆𝗼𝘂 𝗲𝘃𝗲𝗿 𝗵𝗮𝗱 𝘁𝗼 𝗼𝘃𝗲𝗿𝗰𝗼𝗺𝗲 𝗮 𝗹𝗮𝗰𝗸 𝗼𝗳 𝗹𝗮𝗯𝗲𝗹𝗲𝗱 𝗱𝗮𝘁𝗮 𝘁𝗼 𝗱𝗲𝗮𝗹 𝘄𝗶𝘁𝗵 𝗮𝗻 𝗡𝗟𝗣 𝘁𝗮𝘀𝗸?

Are you working on Natural Language Processing tasks, and have you faced a lack of labeled data before? 𝗪𝗲 𝗮𝗿𝗲 𝗰𝘂𝗿𝗿𝗲𝗻𝘁𝗹𝘆 𝗰𝗼𝗻𝗱𝘂𝗰𝘁𝗶𝗻𝗴 𝗮 𝘀𝘂𝗿𝘃𝗲𝘆 to explore the strategies used to address this bottleneck, especially in the context of recent advancements, including but not limited to large language models.

The survey is non-commercial and conducted solely for academic research purposes. The results will contribute to an open-access publication that also benefits the community.

👉 With only 5–15 minutes of your time, you would greatly help us investigate which strategies the #NLP community uses to overcome a lack of labeled data.

❤️How you can help even more: If you know others working on supervised learning and NLP, please share this survey with them—we’d really appreciate it!

Survey: https://bildungsportal.sachsen.de/umfragen/limesurvey/index.php/538271
Estimated time required: 5–15 minutes
Deadline for participation: January 12, 2025

#NLP #ML

liked a Space 24 days ago

FineWeb-c - Annotation

liked a model about 1 month ago

PleIAs/celadon

Text Classification • Updated Nov 3, 2024 • 97 • 26

posted an update about 1 month ago

🐣 New release: small-text v2.0.0.dev1

With small language models on the rise, the new version of small-text has been long overdue! Despite the generative AI hype, many real-world tasks still rely on supervised learning, which in turn depends on labeled data.

Highlights:
- Four new query strategies: Try even more combinations than before.
- Vector indices integration: HNSW and KNN indices are now available via a unified interface and can easily be used within your code.
- Simplified installation: We dropped the torchtext dependency and cleaned up a lot of interfaces.
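
To illustrate what a query strategy does in an active learning loop, here is a minimal, library-independent sketch of least-confidence sampling. It is not small-text's API: the function name `least_confidence_query` and the example probabilities are hypothetical and chosen only for illustration.

```python
from typing import List, Sequence


def least_confidence_query(probabilities: Sequence[Sequence[float]],
                           num_samples: int) -> List[int]:
    """Return indices of the examples whose top class probability is lowest."""
    # Confidence of an example = probability of its most likely class.
    confidences = [max(row) for row in probabilities]
    # Rank indices by ascending confidence, so the most uncertain come first.
    ranked = sorted(range(len(confidences)), key=lambda i: confidences[i])
    return ranked[:num_samples]


# Hypothetical class probabilities for four unlabeled texts (binary task).
unlabeled_probs = [
    [0.95, 0.05],
    [0.55, 0.45],
    [0.70, 0.30],
    [0.51, 0.49],
]
selected = least_confidence_query(unlabeled_probs, num_samples=2)
print(selected)  # indices of the two texts to send to a human annotator
```

The selected texts would be labeled next and added to the training set before the model is retrained; other query strategies differ mainly in how they rank the unlabeled pool.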

GitHub: https://github.com/webis-de/small-text

👂 Try it out for yourself! We are eager to hear your feedback.
🔧 Share your small-text applications and experiments in the newly added showcase section.
🌟 Support the project by leaving a star on the repo!

#activelearning #nlproc #machinelearning

upvoted a collection about 1 month ago

Models for dataset curation

Collection

9 items • Updated 29 days ago • 17

replied to their post about 2 months ago

Paper (at HF): https://huggingface.co/papers/2406.09206
Paper (in the ACL Anthology): https://aclanthology.org/2024.emnlp-main.669/
Code: https://github.com/chschroeder/self-training-for-sample-efficient-active-learning

upvoted a paper about 2 months ago

Self-Training for Sample-Efficient Active Learning for Text Classification with Pre-Trained Language Models

Paper • 2406.09206 • Published Jun 13, 2024 • 1

updated a collection about 2 months ago

Active Learning Papers

Collection

An opinionated collection of papers on active learning. • 5 items • Updated Nov 10, 2024

posted an update about 2 months ago

#EMNLP2024 is happening soon! Unfortunately, I will not be on site, but I will present our poster virtually on Wednesday, Nov 13 (7:45 EST / 13:45 CET) in Virtual Poster Session 2.

In this work, we leverage self-training in an active learning loop in order to train small language models with even less data. Hope to see you there!

1 reply

liked a model 2 months ago

numind/NuExtract-1.5

Text Generation • Updated Nov 18, 2024 • 9.33k • 180

upvoted a collection 2 months ago

OpenCulture

Collection

A multilingual dataset of public domain books and newspapers. • 27 items • Updated Nov 6, 2024 • 121

upvoted a collection 3 months ago

EU20-Benchmarks

Collection

Evaluation Benchmarks for 20 European languages. • 5 items • Updated Oct 11, 2024 • 7

reacted to tomaarsen's post with 🔥 4 months ago

I've just shipped the Sentence Transformers v3.1.1 patch release, fixing the hard negatives mining utility for some models. This utility is extremely useful to get more performance out of your embedding training data.

⛏ Hard negatives are texts that are rather similar to some anchor text (e.g. a query), but are not the correct match. They're difficult for a model to distinguish from the correct answer, often resulting in a stronger model after training.
mine_hard_negatives docs: https://sbert.net/docs/package_reference/util.html#sentence_transformers.util.mine_hard_negatives
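
The idea of a hard negative can be shown with a small, dependency-free sketch. This is not the Sentence Transformers `mine_hard_negatives` utility: the helper `pick_hard_negatives` and the toy two-dimensional embeddings are hypothetical and only illustrate the selection rule (most similar candidate that is not the correct match).

```python
import math
from typing import List, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two vectors (0.0 if either has zero norm)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def pick_hard_negatives(anchor: Sequence[float],
                        candidates: Sequence[Sequence[float]],
                        positive_index: int,
                        num_negatives: int = 1) -> List[int]:
    """Return indices of the non-positive candidates most similar to the anchor."""
    scored = [(cosine(anchor, cand), i)
              for i, cand in enumerate(candidates)
              if i != positive_index]  # never return the correct match
    scored.sort(reverse=True)  # highest similarity first
    return [i for _, i in scored[:num_negatives]]


# Toy embeddings (assumed values, not produced by a real model).
anchor = [1.0, 0.0]
candidates = [
    [0.9, 0.1],   # index 0: the correct match (positive)
    [0.8, 0.6],   # index 1: similar but wrong, a hard negative
    [0.0, 1.0],   # index 2: unrelated, an easy negative
    [-1.0, 0.0],  # index 3: opposite direction, an easy negative
]
hard = pick_hard_negatives(anchor, candidates, positive_index=0)
print(hard)
```

In practice the embeddings would come from an embedding model, and a similarity margin or rank cutoff is often applied so that unlabeled true matches are not picked up as negatives by mistake.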

🔓 Beyond that, this release removes the numpy<2 restriction from v3.1.0. This was previously required for Windows as not all third-party libraries were updated to support numpy v2. With Sentence Transformers, you can now choose v1 or v2 of numpy.

Check out the full release notes here: https://github.com/UKPLab/sentence-transformers/releases/tag/v3.1.1

I'm looking forward to releasing v3.2, I have some exciting things planned 🚀

replied to do-me's post 4 months ago

Did not know text-splitter yet, thanks!

reacted to do-me's post with 👀 4 months ago

What are your favorite text chunkers/splitters?
Mine are:
- https://github.com/benbrandt/text-splitter (Rust/Python, battle-tested, Wasm version coming soon)
- https://github.com/umarbutler/semchunk (Python, really performant but some issues with huge docs)

I tried the huge Jina AI regex, but it failed for my (admittedly messy) documents, e.g. from EUR-LEX. Their free segmenter API is really cool but unfortunately times out on my huge docs (~100 pages): https://jina.ai/segmenter/

Also, I tried to write a Vanilla JS chunker with a simple, adjustable hierarchical logic (inspired by the above). I think it does a decent job for the few lines of code: https://do-me.github.io/js-text-chunker/
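
For readers unfamiliar with hierarchical chunking, the general idea can be sketched in a few lines of Python. This is not the code behind any of the tools linked above: the separator order in `SEPARATORS` and the function `chunk_text` are assumptions made for illustration. The text is split on the coarsest separator first, and only pieces that are still too long are split on the next finer one.

```python
from typing import List

# Assumed hierarchy, coarse to fine: paragraphs, lines, sentences, words.
SEPARATORS = ["\n\n", "\n", ". ", " "]


def chunk_text(text: str, max_chars: int, level: int = 0) -> List[str]:
    """Split text into chunks of at most max_chars characters."""
    if len(text) <= max_chars:
        return [text] if text.strip() else []
    if level >= len(SEPARATORS):
        # No separator left: fall back to a hard cut.
        return [text[i:i + max_chars] for i in range(0, len(text), max_chars)]

    sep = SEPARATORS[level]
    chunks: List[str] = []
    current = ""
    for piece in text.split(sep):
        candidate = piece if not current else current + sep + piece
        if len(candidate) <= max_chars:
            current = candidate  # keep merging small pieces
            continue
        if current:
            chunks.append(current)
            current = ""
        if len(piece) <= max_chars:
            current = piece
        else:
            # Piece is still too long: recurse with the next finer separator.
            chunks.extend(chunk_text(piece, max_chars, level + 1))
    if current:
        chunks.append(current)
    return chunks


sample = ("First paragraph here.\n\n"
          "Second paragraph is a bit longer than the first one.\n\n"
          "Third.")
chunks = chunk_text(sample, max_chars=40)
for chunk in chunks:
    print(repr(chunk))
```

Changing the separator list or `max_chars` is what makes such a scheme adjustable; production chunkers typically also count tokens rather than characters.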

Happy to hear your thoughts!

1 reply

upvoted an article 4 months ago

Article

AI Policy @🤗: Open ML Considerations in the EU AI Act

Jul 24, 2023

• 2

reacted to gaodrew's post with 🔥 4 months ago

We used Hugging Face Trainer to fine-tune DeBERTa-v3-base for Personally Identifiable Information (PII) detection, achieving 99.44% overall accuracy (98.27% recall for PII detection).

Please try our model (Colab Quickstart available) and let us know what you think:
iiiorg/piiranha-v1-detect-personal-information

3 replies

liked a model 4 months ago

iiiorg/piiranha-v1-detect-personal-information

Token Classification • Updated Sep 13, 2024 • 43.2k • 158