Strategies for long documents - document-level context?

#2
by krumeto - opened

First of all, congrats on the great model!

Do you have any recommendations for handling documents longer than the max token length of the underlying model? I've tried document-level context and it has worked fine, but I also got: `WARNING:span_marker.modeling:This model was trained without document-level context: inference with document-level context may cause decreased performance.`
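For reference, this is roughly how I'm invoking it (a minimal sketch; the checkpoint name is a placeholder for this repo, and I'm assuming the `datasets`-based `predict` input described in the SpanMarker docs):

```python
from datasets import Dataset
from span_marker import SpanMarkerModel

# Placeholder checkpoint name: replace with this repo's model.
model = SpanMarkerModel.from_pretrained("your/span-marker-checkpoint")

# Pre-tokenized sentences from one document. The document_id/sentence_id
# columns tell predict() to add neighbouring sentences as context, which
# is what triggers the warning for models trained without it.
dataset = Dataset.from_dict({
    "tokens": [
        ["Amelia", "Earhart", "flew", "her", "Lockheed", "Vega", "5B", "across", "the", "Atlantic", "."],
        ["She", "disappeared", "over", "the", "Pacific", "in", "1937", "."],
    ],
    "document_id": [0, 0],
    "sentence_id": [0, 1],
})

entities = model.predict(dataset)
```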

Would you know what the basis is for the potential performance decrease with document-level context?

Hi @krumeto !

Regarding document-level context: you can check Tom's thesis; there's a link to it on the GitHub page. Short summary: adjacent context sentences are added as additional text tokens but without span-marker pairs. That means text tokens can attend to each other, but the start/end markers only attend to tokens from the target sentence, not to the context tokens. In theory this setup improves performance, although the thesis only reports results comparing no-context vs. context on CoNLL03. So the token layout (check the figures in the thesis, they are very insightful for understanding the inner workings of the architecture) does not differ between the context and no-context setups except for the extra text tokens. As a consequence, when the tokens are spread out to build the embedding matrices passed to the encoder, they overflow across more samples, meaning training/inference time increases (see table 3.6 of the thesis).

Now, answering your questions:

  1. I'm dealing with long texts in my work too. What I do is simply split by sentences, without document-level context, and process each sentence independently. After inference, I group the detected entities by document (see the sketch after this list).
  2. That message is just preventive. The model has not been trained with those additional adjacent context sentences, so it is not "prepared" for them, but, as I said before, internally almost nothing changes and nothing prevents you from doing so. It's just that this model has been neither trained nor tested in that scenario; maybe it works fine as is, maybe it doesn't.
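To make point 1 concrete, here's a minimal sketch of that workflow. NLTK's sentence splitter is just my choice here (any splitter works), and the checkpoint name is again a placeholder:

```python
import nltk
from span_marker import SpanMarkerModel

# "punkt_tab" is needed on newer NLTK versions, "punkt" on older ones.
nltk.download("punkt", quiet=True)
nltk.download("punkt_tab", quiet=True)

# Placeholder checkpoint: replace with the model you're actually using.
model = SpanMarkerModel.from_pretrained("your/span-marker-checkpoint")

documents = {
    "doc-1": "Amelia Earhart flew her Lockheed Vega 5B across the Atlantic. "
             "She disappeared over the Pacific in 1937.",
}

entities_by_document = {}
for doc_id, text in documents.items():
    # Split so each sentence fits within the encoder's max length.
    sentences = nltk.sent_tokenize(text)
    # predict() on a list of sentences returns one list of entities per sentence.
    predictions = model.predict(sentences)
    # Regroup the per-sentence entities under their source document.
    entities_by_document[doc_id] = [ent for sent_entities in predictions for ent in sent_entities]

print(entities_by_document["doc-1"])
```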

Hopefully this sheds a bit of light on your doubts.

Cheers!

guishe changed discussion status to closed
