arxiv:2101.09763

Analysing the Noise Model Error for Realistic Noisy Label Data

Published on Jan 24, 2021

Authors:

Abstract

Distant and weak supervision allow to obtain large amounts of labeled training data quickly and cheaply, but these automatic annotations tend to contain a high amount of errors. A popular technique to overcome the negative effects of these noisy labels is <PRE_TAG>noise modelling</POST_TAG> where the underlying noise process is modelled. In this work, we study the quality of these estimated noise models from the theoretical side by deriving the expected error of the noise model. Apart from evaluating the theoretical results on commonly used synthetic noise, we also publish NoisyNER, a new noisy label dataset from the NLP domain that was obtained through a realistic distant supervision technique. It provides seven sets of labels with differing noise patterns to evaluate different noise levels on the same instances. Parallel, clean labels are available making it possible to study scenarios where a small amount of gold-standard data can be leveraged. Our theoretical results and the corresponding experiments give insights into the factors that influence the noise model estimation like the noise distribution and the sampling technique.

View arXiv page View PDF Add to collection

Community

Upload images, audio, and videos by dragging in the text input, pasting, or clicking here.

Tap or paste here to upload images

· Sign up or log in to comment

Upvote

Models citing this paper 0

No model linking this paper

Cite arxiv.org/abs/2101.09763 in a model README.md to link it from this page.

Datasets citing this paper 1

Spaces citing this paper 0

No Space linking this paper

Cite arxiv.org/abs/2101.09763 in a Space README.md to link it from this page.

Collections including this paper 0

No Collection including this paper

Add this paper to a collection to link it from this page.