oroszgy committed on
Commit
d32f26e
1 Parent(s): 16ee18e

Update spacy pipeline to 3.5.1

README.md CHANGED
@@ -14,69 +14,69 @@ model-index:
14
  metrics:
15
  - name: NER Precision
16
  type: precision
17
- value: 0.8581105169
18
  - name: NER Recall
19
  type: recall
20
- value: 0.8463431786
21
  - name: NER F Score
22
  type: f_score
23
- value: 0.8521862277
24
  - task:
25
  name: TAG
26
  type: token-classification
27
  metrics:
28
  - name: TAG (XPOS) Accuracy
29
  type: accuracy
30
- value: 0.9649265515
31
  - task:
32
  name: POS
33
  type: token-classification
34
  metrics:
35
  - name: POS (UPOS) Accuracy
36
  type: accuracy
37
- value: 0.9634910761
38
  - task:
39
  name: MORPH
40
  type: token-classification
41
  metrics:
42
  - name: Morph (UFeats) Accuracy
43
  type: accuracy
44
- value: 0.9308067758
45
  - task:
46
  name: LEMMA
47
  type: token-classification
48
  metrics:
49
  - name: Lemma Accuracy
50
  type: accuracy
51
- value: 0.9738780978
52
  - task:
53
  name: UNLABELED_DEPENDENCIES
54
  type: token-classification
55
  metrics:
56
  - name: Unlabeled Attachment Score (UAS)
57
  type: f_score
58
- value: 0.8116901329
59
  - task:
60
  name: LABELED_DEPENDENCIES
61
  type: token-classification
62
  metrics:
63
  - name: Labeled Attachment Score (LAS)
64
  type: f_score
65
- value: 0.7417545434
66
  - task:
67
  name: SENTS
68
  type: token-classification
69
  metrics:
70
  - name: Sentences F-Score
71
  type: f_score
72
- value: 0.9710467706
73
  ---
74
  Core Hungarian model for HuSpaCy. Components: tok2vec, senter, tagger, morphologizer, lemmatizer, parser, ner
75
 
76
  | Feature | Description |
77
  | --- | --- |
78
  | **Name** | `hu_core_news_md` |
79
- | **Version** | `3.5.0` |
80
  | **spaCy** | `>=3.5.0,<3.6.0` |
81
  | **Default Pipeline** | `tok2vec`, `senter`, `tagger`, `morphologizer`, `lookup_lemmatizer`, `lemmatizer`, `lemma_smoother`, `parser`, `ner` |
82
  | **Components** | `tok2vec`, `senter`, `tagger`, `morphologizer`, `lookup_lemmatizer`, `lemmatizer`, `lemma_smoother`, `parser`, `ner` |
@@ -108,18 +108,18 @@ Core Hungarian model for HuSpaCy. Components: tok2vec, senter, tagger, morpholog
108
  | `TOKEN_P` | 99.86 |
109
  | `TOKEN_R` | 99.93 |
110
  | `TOKEN_F` | 99.89 |
111
- | `SENTS_P` | 97.10 |
112
- | `SENTS_R` | 97.10 |
113
- | `SENTS_F` | 97.10 |
114
- | `TAG_ACC` | 96.49 |
115
- | `POS_ACC` | 96.35 |
116
- | `MORPH_ACC` | 93.08 |
117
- | `MORPH_MICRO_P` | 96.72 |
118
- | `MORPH_MICRO_R` | 95.96 |
119
- | `MORPH_MICRO_F` | 96.34 |
120
- | `LEMMA_ACC` | 97.39 |
121
- | `DEP_UAS` | 81.17 |
122
- | `DEP_LAS` | 74.18 |
123
- | `ENTS_P` | 85.81 |
124
- | `ENTS_R` | 84.63 |
125
- | `ENTS_F` | 85.22 |
 
14
  metrics:
15
  - name: NER Precision
16
  type: precision
17
+ value: 0.8572441922
18
  - name: NER Recall
19
  type: recall
20
+ value: 0.849859353
21
  - name: NER F Score
22
  type: f_score
23
+ value: 0.8535357994
24
  - task:
25
  name: TAG
26
  type: token-classification
27
  metrics:
28
  - name: TAG (XPOS) Accuracy
29
  type: accuracy
30
+ value: 0.9624844483
31
  - task:
32
  name: POS
33
  type: token-classification
34
  metrics:
35
  - name: POS (UPOS) Accuracy
36
  type: accuracy
37
+ value: 0.9631543688
38
  - task:
39
  name: MORPH
40
  type: token-classification
41
  metrics:
42
  - name: Morph (UFeats) Accuracy
43
  type: accuracy
44
+ value: 0.928892717
45
  - task:
46
  name: LEMMA
47
  type: token-classification
48
  metrics:
49
  - name: Lemma Accuracy
50
  type: accuracy
51
+ value: 0.9728255669
52
  - task:
53
  name: UNLABELED_DEPENDENCIES
54
  type: token-classification
55
  metrics:
56
  - name: Unlabeled Attachment Score (UAS)
57
  type: f_score
58
+ value: 0.8127597439
59
  - task:
60
  name: LABELED_DEPENDENCIES
61
  type: token-classification
62
  metrics:
63
  - name: Labeled Attachment Score (LAS)
64
  type: f_score
65
+ value: 0.743681905
66
  - task:
67
  name: SENTS
68
  type: token-classification
69
  metrics:
70
  - name: Sentences F-Score
71
  type: f_score
72
+ value: 0.9787709497
73
  ---
74
  Core Hungarian model for HuSpaCy. Components: tok2vec, senter, tagger, morphologizer, lemmatizer, parser, ner
75
 
76
  | Feature | Description |
77
  | --- | --- |
78
  | **Name** | `hu_core_news_md` |
79
+ | **Version** | `3.5.1` |
80
  | **spaCy** | `>=3.5.0,<3.6.0` |
81
  | **Default Pipeline** | `tok2vec`, `senter`, `tagger`, `morphologizer`, `lookup_lemmatizer`, `lemmatizer`, `lemma_smoother`, `parser`, `ner` |
82
  | **Components** | `tok2vec`, `senter`, `tagger`, `morphologizer`, `lookup_lemmatizer`, `lemmatizer`, `lemma_smoother`, `parser`, `ner` |
 
108
  | `TOKEN_P` | 99.86 |
109
  | `TOKEN_R` | 99.93 |
110
  | `TOKEN_F` | 99.89 |
111
+ | `SENTS_P` | 98.21 |
112
+ | `SENTS_R` | 97.55 |
113
+ | `SENTS_F` | 97.88 |
114
+ | `TAG_ACC` | 96.25 |
115
+ | `POS_ACC` | 96.32 |
116
+ | `MORPH_ACC` | 92.89 |
117
+ | `MORPH_MICRO_P` | 96.49 |
118
+ | `MORPH_MICRO_R` | 95.78 |
119
+ | `MORPH_MICRO_F` | 96.14 |
120
+ | `LEMMA_ACC` | 97.28 |
121
+ | `DEP_UAS` | 81.28 |
122
+ | `DEP_LAS` | 74.37 |
123
+ | `ENTS_P` | 85.72 |
124
+ | `ENTS_R` | 84.99 |
125
+ | `ENTS_F` | 85.35 |
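
A quick way to sanity-check the 3.5.1 figures above (a minimal sketch, assuming the `hu_core_news_md` wheel from this repository is installed in the active environment):

```python
# Minimal usage sketch; assumes the packaged wheel has been installed, e.g.
#   pip install hu_core_news_md-any-py3-none-any.whl
import spacy

nlp = spacy.load("hu_core_news_md")
doc = nlp("Orbán Viktor 2023-ban Budapesten tárgyalt.")
for token in doc:
    print(token.text, token.lemma_, token.pos_, token.dep_)
print([(ent.text, ent.label_) for ent in doc.ents])
```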
config.cfg CHANGED
@@ -1,8 +1,8 @@
1
  [paths]
2
- parser_model = "models/hu_core_news_md-parser-3.5.0/model-best"
3
- ner_model = "models/hu_core_news_md-ner-3.5.0/model-best"
4
- lemmatizer_lookups = "models/hu_core_news_md-lookup-lemmatizer-3.5.0"
5
- tagger_model = "models/hu_core_news_md-tagger-3.5.0/model-best"
6
  train = null
7
  dev = null
8
  vectors = null
 
1
  [paths]
2
+ parser_model = "models/hu_core_news_md-parser-3.5.1/model-best"
3
+ ner_model = "models/hu_core_news_md-ner-3.5.1/model-best"
4
+ lemmatizer_lookups = "models/hu_core_news_md-lookup-lemmatizer-3.5.1"
5
+ tagger_model = "models/hu_core_news_md-tagger-3.5.1/model-best"
6
  train = null
7
  dev = null
8
  vectors = null
edit_tree_lemmatizer.py ADDED
@@ -0,0 +1,465 @@
1
+ from functools import lru_cache
2
+
3
+ from typing import cast, Any, Callable, Dict, Iterable, List, Optional
4
+ from typing import Sequence, Tuple, Union
5
+ from collections import Counter
6
+ from copy import deepcopy
7
+ from itertools import islice
8
+ import numpy as np
9
+
10
+ import srsly
11
+ from thinc.api import Config, Model, SequenceCategoricalCrossentropy, NumpyOps
12
+ from thinc.types import Floats2d, Ints2d
13
+
14
+ from spacy.pipeline._edit_tree_internals.edit_trees import EditTrees
15
+ from spacy.pipeline._edit_tree_internals.schemas import validate_edit_tree
16
+ from spacy.pipeline.lemmatizer import lemmatizer_score
17
+ from spacy.pipeline.trainable_pipe import TrainablePipe
18
+ from spacy.errors import Errors
19
+ from spacy.language import Language
20
+ from spacy.tokens import Doc, Token
21
+ from spacy.training import Example, validate_examples, validate_get_examples
22
+ from spacy.vocab import Vocab
23
+ from spacy import util
24
+
25
+
26
+ TOP_K_GUARDRAIL = 20
27
+
28
+
29
+ default_model_config = """
30
+ [model]
31
+ @architectures = "spacy.Tagger.v2"
32
+
33
+ [model.tok2vec]
34
+ @architectures = "spacy.HashEmbedCNN.v2"
35
+ pretrained_vectors = null
36
+ width = 96
37
+ depth = 4
38
+ embed_size = 2000
39
+ window_size = 1
40
+ maxout_pieces = 3
41
+ subword_features = true
42
+ """
43
+ DEFAULT_EDIT_TREE_LEMMATIZER_MODEL = Config().from_str(default_model_config)["model"]
44
+
45
+
46
+ @Language.factory(
47
+ "trainable_lemmatizer_v2",
48
+ assigns=["token.lemma"],
49
+ requires=[],
50
+ default_config={
51
+ "model": DEFAULT_EDIT_TREE_LEMMATIZER_MODEL,
52
+ "backoff": "orth",
53
+ "min_tree_freq": 3,
54
+ "overwrite": False,
55
+ "top_k": 1,
56
+ "overwrite_labels": True,
57
+ "scorer": {"@scorers": "spacy.lemmatizer_scorer.v1"},
58
+ },
59
+ default_score_weights={"lemma_acc": 1.0},
60
+ )
61
+ def make_edit_tree_lemmatizer(
62
+ nlp: Language,
63
+ name: str,
64
+ model: Model,
65
+ backoff: Optional[str],
66
+ min_tree_freq: int,
67
+ overwrite: bool,
68
+ top_k: int,
69
+ overwrite_labels: bool,
70
+ scorer: Optional[Callable],
71
+ ):
72
+ """Construct an EditTreeLemmatizer component."""
73
+ return EditTreeLemmatizer(
74
+ nlp.vocab,
75
+ model,
76
+ name,
77
+ backoff=backoff,
78
+ min_tree_freq=min_tree_freq,
79
+ overwrite=overwrite,
80
+ top_k=top_k,
81
+ overwrite_labels=overwrite_labels,
82
+ scorer=scorer,
83
+ )
84
+
85
+
86
+ # _f = open("lemmatizer.log", "w")
87
+ # def debug(*args):
88
+ # _f.write(" ".join(args) + "\n")
89
+ def debug(*args):
90
+ pass
91
+
92
+
93
+ class EditTreeLemmatizer(TrainablePipe):
94
+ """
95
+ Lemmatizer that lemmatizes each word using a predicted edit tree.
96
+ """
97
+
98
+ def __init__(
99
+ self,
100
+ vocab: Vocab,
101
+ model: Model,
102
+ name: str = "trainable_lemmatizer",
103
+ *,
104
+ backoff: Optional[str] = "orth",
105
+ min_tree_freq: int = 3,
106
+ overwrite: bool = False,
107
+ top_k: int = 1,
108
+ overwrite_labels,
109
+ scorer: Optional[Callable] = lemmatizer_score,
110
+ ):
111
+ """
112
+ Construct an edit tree lemmatizer.
113
+
114
+ backoff (Optional[str]): backoff to use when the predicted edit trees
115
+ are not applicable. Must be an attribute of Token or None (leave the
116
+ lemma unset).
117
+ min_tree_freq (int): prune trees that are applied less than this
118
+ frequency in the training data.
119
+ overwrite (bool): overwrite existing lemma annotations.
120
+ top_k (int): try to apply at most the k most probable edit trees.
121
+ """
122
+ self.vocab = vocab
123
+ self.model = model
124
+ self.name = name
125
+ self.backoff = backoff
126
+ self.min_tree_freq = min_tree_freq
127
+ self.overwrite = overwrite
128
+ self.top_k = top_k
129
+ self.overwrite_labels = overwrite_labels
130
+
131
+ self.trees = EditTrees(self.vocab.strings)
132
+ self.tree2label: Dict[int, int] = {}
133
+
134
+ self.cfg: Dict[str, Any] = {"labels": []}
135
+ self.scorer = scorer
136
+ self.numpy_ops = NumpyOps()
137
+
138
+ def get_loss(
139
+ self, examples: Iterable[Example], scores: List[Floats2d]
140
+ ) -> Tuple[float, List[Floats2d]]:
141
+ validate_examples(examples, "EditTreeLemmatizer.get_loss")
142
+ loss_func = SequenceCategoricalCrossentropy(normalize=False, missing_value=-1)
143
+
144
+ truths = []
145
+ for eg in examples:
146
+ eg_truths = []
147
+ for (predicted, gold_lemma, gold_pos, gold_sent_start) in zip(
148
+ eg.predicted,
149
+ eg.get_aligned("LEMMA", as_string=True),
150
+ eg.get_aligned("POS", as_string=True),
151
+ eg.get_aligned_sent_starts(),
152
+ ):
153
+ if gold_lemma is None:
154
+ label = -1
155
+ else:
156
+ form = self._get_true_cased_form(
157
+ predicted.text, gold_sent_start, gold_pos
158
+ )
159
+ tree_id = self.trees.add(form, gold_lemma)
160
+ # debug(f"@get_loss: {predicted}/{gold_pos}[{gold_sent_start}]->{form}|{gold_lemma}[{tree_id}]")
161
+ label = self.tree2label.get(tree_id, 0)
162
+ eg_truths.append(label)
163
+
164
+ truths.append(eg_truths)
165
+
166
+ d_scores, loss = loss_func(scores, truths)
167
+ if self.model.ops.xp.isnan(loss):
168
+ raise ValueError(Errors.E910.format(name=self.name))
169
+
170
+ return float(loss), d_scores
171
+
172
+ def predict(self, docs: Iterable[Doc]) -> List[Ints2d]:
173
+ if self.top_k == 1:
174
+ scores2guesses = self._scores2guesses_top_k_equals_1
175
+ elif self.top_k <= TOP_K_GUARDRAIL:
176
+ scores2guesses = self._scores2guesses_top_k_greater_1
177
+ else:
178
+ scores2guesses = self._scores2guesses_top_k_guardrail
179
+ # The behaviour of *_scores2guesses_top_k_greater_1()* is efficient for values
180
+ # of *top_k>1* that are likely to be useful when the edit tree lemmatizer is used
181
+ # for its principal purpose of lemmatizing tokens. However, the code could also
182
+ # be used for other purposes, and with very large values of *top_k* the method
183
+ # becomes inefficient. In such cases, *_scores2guesses_top_k_guardrail()* is used
184
+ # instead.
185
+ n_docs = len(list(docs))
186
+ if not any(len(doc) for doc in docs):
187
+ # Handle cases where there are no tokens in any docs.
188
+ n_labels = len(self.cfg["labels"])
189
+ guesses: List[Ints2d] = [self.model.ops.alloc2i(0, n_labels) for _ in docs]
190
+ assert len(guesses) == n_docs
191
+ return guesses
192
+ scores = self.model.predict(docs)
193
+ assert len(scores) == n_docs
194
+ guesses = scores2guesses(docs, scores)
195
+ assert len(guesses) == n_docs
196
+ return guesses
197
+
198
+ def _scores2guesses_top_k_equals_1(self, docs, scores):
199
+ guesses = []
200
+ for doc, doc_scores in zip(docs, scores):
201
+ doc_guesses = doc_scores.argmax(axis=1)
202
+ doc_guesses = self.numpy_ops.asarray(doc_guesses)
203
+
204
+ doc_compat_guesses = []
205
+ for i, token in enumerate(doc):
206
+ tree_id = self.cfg["labels"][doc_guesses[i]]
207
+ form: str = self._get_true_cased_form_of_token(token)
208
+ if self.trees.apply(tree_id, form) is not None:
209
+ doc_compat_guesses.append(tree_id)
210
+ else:
211
+ doc_compat_guesses.append(-1)
212
+ guesses.append(np.array(doc_compat_guesses))
213
+
214
+ return guesses
215
+
216
+ def _scores2guesses_top_k_greater_1(self, docs, scores):
217
+ guesses = []
218
+ top_k = min(self.top_k, len(self.labels))
219
+ for doc, doc_scores in zip(docs, scores):
220
+ doc_scores = self.numpy_ops.asarray(doc_scores)
221
+ doc_compat_guesses = []
222
+ for i, token in enumerate(doc):
223
+ for _ in range(top_k):
224
+ candidate = int(doc_scores[i].argmax())
225
+ candidate_tree_id = self.cfg["labels"][candidate]
226
+ form: str = self._get_true_cased_form_of_token(token)
227
+ if self.trees.apply(candidate_tree_id, form) is not None:
228
+ doc_compat_guesses.append(candidate_tree_id)
229
+ break
230
+ doc_scores[i, candidate] = np.finfo(np.float32).min
231
+ else:
232
+ doc_compat_guesses.append(-1)
233
+ guesses.append(np.array(doc_compat_guesses))
234
+
235
+ return guesses
236
+
237
+ def _scores2guesses_top_k_guardrail(self, docs, scores):
238
+ guesses = []
239
+ for doc, doc_scores in zip(docs, scores):
240
+ doc_guesses = np.argsort(doc_scores)[..., : -self.top_k - 1 : -1]
241
+ doc_guesses = self.numpy_ops.asarray(doc_guesses)
242
+
243
+ doc_compat_guesses = []
244
+ for token, candidates in zip(doc, doc_guesses):
245
+ tree_id = -1
246
+ for candidate in candidates:
247
+ candidate_tree_id = self.cfg["labels"][candidate]
248
+
249
+ form: str = self._get_true_cased_form_of_token(token)
250
+
251
+ if self.trees.apply(candidate_tree_id, form) is not None:
252
+ tree_id = candidate_tree_id
253
+ break
254
+ doc_compat_guesses.append(tree_id)
255
+
256
+ guesses.append(np.array(doc_compat_guesses))
257
+
258
+ return guesses
259
+
260
+ def set_annotations(self, docs: Iterable[Doc], batch_tree_ids):
261
+ for i, doc in enumerate(docs):
262
+ doc_tree_ids = batch_tree_ids[i]
263
+ if hasattr(doc_tree_ids, "get"):
264
+ doc_tree_ids = doc_tree_ids.get()
265
+ for j, tree_id in enumerate(doc_tree_ids):
266
+ if self.overwrite or doc[j].lemma == 0:
267
+ # If no applicable tree could be found during prediction,
268
+ # the special identifier -1 is used. Otherwise the tree
269
+ # is guaranteed to be applicable.
270
+ if tree_id == -1:
271
+ if self.backoff is not None:
272
+ doc[j].lemma = getattr(doc[j], self.backoff)
273
+ else:
274
+ form = self._get_true_cased_form_of_token(doc[j])
275
+ lemma = self.trees.apply(tree_id, form) or form
276
+ # debug(f"@set_annotations: {doc[j]}/{doc[j].pos_}[{doc[j].is_sent_start}]->{form}|{lemma}[{tree_id}]")
277
+ doc[j].lemma_ = lemma
278
+
279
+ @property
280
+ def labels(self) -> Tuple[int, ...]:
281
+ """Returns the labels currently added to the component."""
282
+ return tuple(self.cfg["labels"])
283
+
284
+ @property
285
+ def hide_labels(self) -> bool:
286
+ return True
287
+
288
+ @property
289
+ def label_data(self) -> Dict:
290
+ trees = []
291
+ for tree_id in range(len(self.trees)):
292
+ tree = self.trees[tree_id]
293
+ if "orig" in tree:
294
+ tree["orig"] = self.vocab.strings[tree["orig"]]
295
+ if "subst" in tree:
296
+ tree["subst"] = self.vocab.strings[tree["subst"]]
297
+ trees.append(tree)
298
+ return dict(trees=trees, labels=tuple(self.cfg["labels"]))
299
+
300
+ def initialize(
301
+ self,
302
+ get_examples: Callable[[], Iterable[Example]],
303
+ *,
304
+ nlp: Optional[Language] = None,
305
+ labels: Optional[Dict] = None,
306
+ ):
307
+ validate_get_examples(get_examples, "EditTreeLemmatizer.initialize")
308
+
309
+ if self.overwrite_labels:
310
+ if labels is None:
311
+ self._labels_from_data(get_examples)
312
+ else:
313
+ self._add_labels(labels)
314
+
315
+ # Sample for the model.
316
+ doc_sample = []
317
+ label_sample = []
318
+ for example in islice(get_examples(), 10):
319
+ doc_sample.append(example.x)
320
+ gold_labels: List[List[float]] = []
321
+ for token in example.reference:
322
+ if token.lemma == 0:
323
+ gold_label = None
324
+ else:
325
+ gold_label = self._pair2label(token.text, token.lemma_)
326
+
327
+ gold_labels.append(
328
+ [
329
+ 1.0 if label == gold_label else 0.0
330
+ for label in self.cfg["labels"]
331
+ ]
332
+ )
333
+
334
+ gold_labels = cast(Floats2d, gold_labels)
335
+ label_sample.append(self.model.ops.asarray(gold_labels, dtype="float32"))
336
+
337
+ self._require_labels()
338
+ assert len(doc_sample) > 0, Errors.E923.format(name=self.name)
339
+ assert len(label_sample) > 0, Errors.E923.format(name=self.name)
340
+
341
+ self.model.initialize(X=doc_sample, Y=label_sample)
342
+
343
+ def from_bytes(self, bytes_data, *, exclude=tuple()):
344
+ deserializers = {
345
+ "cfg": lambda b: self.cfg.update(srsly.json_loads(b)),
346
+ "model": lambda b: self.model.from_bytes(b),
347
+ "vocab": lambda b: self.vocab.from_bytes(b, exclude=exclude),
348
+ "trees": lambda b: self.trees.from_bytes(b),
349
+ }
350
+
351
+ util.from_bytes(bytes_data, deserializers, exclude)
352
+
353
+ return self
354
+
355
+ def to_bytes(self, *, exclude=tuple()):
356
+ serializers = {
357
+ "cfg": lambda: srsly.json_dumps(self.cfg),
358
+ "model": lambda: self.model.to_bytes(),
359
+ "vocab": lambda: self.vocab.to_bytes(exclude=exclude),
360
+ "trees": lambda: self.trees.to_bytes(),
361
+ }
362
+
363
+ return util.to_bytes(serializers, exclude)
364
+
365
+ def to_disk(self, path, exclude=tuple()):
366
+ path = util.ensure_path(path)
367
+ serializers = {
368
+ "cfg": lambda p: srsly.write_json(p, self.cfg),
369
+ "model": lambda p: self.model.to_disk(p),
370
+ "vocab": lambda p: self.vocab.to_disk(p, exclude=exclude),
371
+ "trees": lambda p: self.trees.to_disk(p),
372
+ }
373
+ util.to_disk(path, serializers, exclude)
374
+
375
+ def from_disk(self, path, exclude=tuple()):
376
+ def load_model(p):
377
+ try:
378
+ with open(p, "rb") as mfile:
379
+ self.model.from_bytes(mfile.read())
380
+ except AttributeError:
381
+ raise ValueError(Errors.E149) from None
382
+
383
+ deserializers = {
384
+ "cfg": lambda p: self.cfg.update(srsly.read_json(p)),
385
+ "model": load_model,
386
+ "vocab": lambda p: self.vocab.from_disk(p, exclude=exclude),
387
+ "trees": lambda p: self.trees.from_disk(p),
388
+ }
389
+
390
+ util.from_disk(path, deserializers, exclude)
391
+ return self
392
+
393
+ def _add_labels(self, labels: Dict):
394
+ if "labels" not in labels:
395
+ raise ValueError(Errors.E857.format(name="labels"))
396
+ if "trees" not in labels:
397
+ raise ValueError(Errors.E857.format(name="trees"))
398
+
399
+ self.cfg["labels"] = list(labels["labels"])
400
+ trees = []
401
+ for tree in labels["trees"]:
402
+ errors = validate_edit_tree(tree)
403
+ if errors:
404
+ raise ValueError(Errors.E1026.format(errors="\n".join(errors)))
405
+
406
+ tree = dict(tree)
407
+ if "orig" in tree:
408
+ tree["orig"] = self.vocab.strings[tree["orig"]]
409
+ if "subst" in tree:
410
+ tree["subst"] = self.vocab.strings[tree["subst"]]
411
+
412
+ trees.append(tree)
413
+
414
+ self.trees.from_json(trees)
415
+
416
+ for label, tree in enumerate(self.labels):
417
+ self.tree2label[tree] = label
418
+
419
+ def _labels_from_data(self, get_examples: Callable[[], Iterable[Example]]):
420
+ # Count corpus tree frequencies in ad-hoc storage to avoid cluttering
421
+ # the final pipe/string store.
422
+ vocab = Vocab()
423
+ trees = EditTrees(vocab.strings)
424
+ tree_freqs: Counter = Counter()
425
+ repr_pairs: Dict = {}
426
+ for example in get_examples():
427
+ for token in example.reference:
428
+ if token.lemma != 0:
429
+ form = self._get_true_cased_form_of_token(token)
430
+ # debug("_labels_from_data", str(token) + "->" + form, token.lemma_)
431
+ tree_id = trees.add(form, token.lemma_)
432
+ tree_freqs[tree_id] += 1
433
+ repr_pairs[tree_id] = (form, token.lemma_)
434
+
435
+ # Construct trees that make the frequency cut-off using representative
436
+ # form - token pairs.
437
+ for tree_id, freq in tree_freqs.items():
438
+ if freq >= self.min_tree_freq:
439
+ form, lemma = repr_pairs[tree_id]
440
+ self._pair2label(form, lemma, add_label=True)
441
+
442
+ @lru_cache()
443
+ def _get_true_cased_form(self, token: str, is_sent_start: bool, pos: str) -> str:
444
+ if is_sent_start and pos != "PROPN":
445
+ return token.lower()
446
+ else:
447
+ return token
448
+
449
+ def _get_true_cased_form_of_token(self, token: Token) -> str:
450
+ return self._get_true_cased_form(token.text, token.is_sent_start, token.pos_)
451
+
452
+ def _pair2label(self, form, lemma, add_label=False):
453
+ """
454
+ Look up the edit tree identifier for a form/label pair. If the edit
455
+ tree is unknown and "add_label" is set, the edit tree will be added to
456
+ the labels.
457
+ """
458
+ tree_id = self.trees.add(form, lemma)
459
+ if tree_id not in self.tree2label:
460
+ if not add_label:
461
+ return None
462
+
463
+ self.tree2label[tree_id] = len(self.cfg["labels"])
464
+ self.cfg["labels"].append(tree_id)
465
+ return self.tree2label[tree_id]
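
The module above registers the `trainable_lemmatizer_v2` factory that backs the packaged `lemmatizer` component. A minimal sketch of wiring it into a fresh pipeline, assuming `edit_tree_lemmatizer.py` is importable so the factory gets registered (the component still has to be initialized on training examples before it can predict):

```python
# Sketch only: add the custom edit-tree lemmatizer to a blank Hungarian pipeline.
import spacy
import edit_tree_lemmatizer  # noqa: F401 -- importing registers "trainable_lemmatizer_v2"

nlp = spacy.blank("hu")
nlp.add_pipe(
    "trainable_lemmatizer_v2",
    config={"backoff": "orth", "min_tree_freq": 3, "top_k": 3, "overwrite_labels": True},
)
# nlp.initialize(...) with gold-lemma Examples is required before calling nlp(...).
```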
hu_core_news_md-any-py3-none-any.whl CHANGED
@@ -1,3 +1,3 @@
1
  version https://git-lfs.github.com/spec/v1
2
- oid sha256:b5b32840f8af0edb059f49e7042fc2c3675ae777dbabfcddbf3e294b6d515a75
3
- size 126883509
 
1
  version https://git-lfs.github.com/spec/v1
2
+ oid sha256:49cf69019ae9ecd344dfa914af018ed5f02263beb6194967a21a85fa66460896
3
+ size 126875360
lemma_postprocessing.py ADDED
@@ -0,0 +1,113 @@
1
+ """
2
+ This module contains various rule-based components aiming to improve on baseline lemmatization tools.
3
+ """
4
+
5
+ import re
6
+ from typing import List, Callable
7
+
8
+ from spacy.lang.hu import Hungarian
9
+ from spacy.pipeline import Pipe
10
+ from spacy.tokens import Token
11
+ from spacy.tokens.doc import Doc
12
+
13
+
14
+ @Hungarian.component(
15
+ "lemma_case_smoother",
16
+ assigns=["token.lemma"],
17
+ requires=["token.lemma", "token.pos"],
18
+ )
19
+ def lemma_case_smoother(doc: Doc) -> Doc:
20
+ """Smooth lemma casing by POS.
21
+
22
+ DEPRECATED: This is not needed anymore, as the lemmatizer is now case-insensitive.
23
+
24
+ Args:
25
+ doc (Doc): Input document.
26
+
27
+ Returns:
28
+ Doc: Output document.
29
+ """
30
+ for token in doc:
31
+ if token.is_sent_start and token.tag_ != "PROPN":
32
+ token.lemma_ = token.lemma_.lower()
33
+
34
+ return doc
35
+
36
+
37
+ class LemmaSmoother(Pipe):
38
+ """Smooths lemma by fixing common errors of the edit-tree lemmatizer."""
39
+
40
+ _DATE_PATTERN = re.compile(r"(\d+)-j?[éá]?n?a?(t[őó]l)?")
41
+ _NUMBER_PATTERN = re.compile(r"(\d+([-,/_.:]?(._)?\d+)*%?)")
42
+
43
+ # noinspection PyUnusedLocal
44
+ @staticmethod
45
+ @Hungarian.factory("lemma_smoother", assigns=["token.lemma"], requires=["token.lemma", "token.pos"])
46
+ def create_lemma_smoother(nlp: Hungarian, name: str) -> "LemmaSmoother":
47
+ return LemmaSmoother()
48
+
49
+ def __call__(self, doc: Doc) -> Doc:
50
+ rules: List[Callable] = [
51
+ self._remove_exclamation_marks,
52
+ self._remove_question_marks,
53
+ self._remove_date_suffixes,
54
+ self._remove_suffix_after_numbers,
55
+ ]
56
+
57
+ for token in doc:
58
+ for rule in rules:
59
+ rule(token)
60
+
61
+ return doc
62
+
63
+ @classmethod
64
+ def _remove_exclamation_marks(cls, token: Token) -> None:
65
+ """Removes exclamation marks from the lemma.
66
+
67
+ Args:
68
+ token (Token): The original token.
69
+ """
70
+
71
+ if "!" != token.lemma_:
72
+ exclamation_mark_index = token.lemma_.find("!")
73
+ if exclamation_mark_index != -1:
74
+ token.lemma_ = token.lemma_[:exclamation_mark_index]
75
+
76
+ @classmethod
77
+ def _remove_question_marks(cls, token: Token) -> None:
78
+ """Removes question marks from the lemma.
79
+
80
+ Args:
81
+ token (Token): The original token.
82
+ """
83
+
84
+ if "?" != token.lemma_:
85
+ question_mark_index = token.lemma_.find("?")
86
+ if question_mark_index != -1:
87
+ token.lemma_ = token.lemma_[:question_mark_index]
88
+
89
+ @classmethod
90
+ def _remove_date_suffixes(cls, token: Token) -> None:
91
+ """Fixes the suffixes of dates.
92
+
93
+ Args:
94
+ token (Token): The original token.
95
+ """
96
+
97
+ if token.pos_ == "NOUN":
98
+ match = cls._DATE_PATTERN.match(token.lemma_)
99
+ if match is not None:
100
+ token.lemma_ = match.group(1) + "."
101
+
102
+ @classmethod
103
+ def _remove_suffix_after_numbers(cls, token: Token) -> None:
104
+ """Removes suffixes after numbers.
105
+
106
+ Args:
107
+ token (Token): The original token.
108
+ """
109
+
110
+ if token.pos_ == "NUM":
111
+ match = cls._NUMBER_PATTERN.match(token.text)
112
+ if match is not None:
113
+ token.lemma_ = match.group(0)
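
To make the date rule concrete, here is a standalone check of `_DATE_PATTERN` on a typical Hungarian date form (a throwaway sketch, independent of the pipeline):

```python
# Reproduces LemmaSmoother._remove_date_suffixes on its own.
import re

_DATE_PATTERN = re.compile(r"(\d+)-j?[éá]?n?a?(t[őó]l)?")

match = _DATE_PATTERN.match("5-én")  # as in "május 5-én" ("on May 5th")
if match is not None:
    print(match.group(1) + ".")      # -> "5.", the lemma the smoother assigns
```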
lemmatizer/model CHANGED
@@ -1,3 +1,3 @@
1
  version https://git-lfs.github.com/spec/v1
2
- oid sha256:3541d553cb13886db87e71979c900dbb609ac823aa1c8961e218d9b104eff70c
3
  size 11282980
 
1
  version https://git-lfs.github.com/spec/v1
2
+ oid sha256:e929a0bc8b59054f5ae0fc3ef376ac59a2bc5182a064769ae6a3af5242231489
3
  size 11282980
lookup_lemmatizer.py ADDED
@@ -0,0 +1,132 @@
1
+ import re
2
+ from collections import defaultdict
3
+ from operator import itemgetter
4
+ from pathlib import Path
5
+ from re import Pattern
6
+ from typing import Optional, Callable, Iterable, Dict, Tuple
7
+
8
+ from spacy.lang.hu import Hungarian
9
+ from spacy.language import Language
10
+ from spacy.lookups import Lookups, Table
11
+ from spacy.pipeline import Pipe
12
+ from spacy.pipeline.lemmatizer import lemmatizer_score
13
+ from spacy.tokens import Token
14
+ from spacy.tokens.doc import Doc
15
+
16
+ # noinspection PyUnresolvedReferences
17
+ from spacy.training.example import Example
18
+ from spacy.util import ensure_path
19
+
20
+
21
+ class LookupLemmatizer(Pipe):
22
+ """
23
+ LookupLemmatizer learn `(token, pos, morph. feat) -> lemma` mappings during training, and applies them at prediction
24
+ time.
25
+ """
26
+
27
+ _number_pattern: Pattern = re.compile(r"\d")
28
+
29
+ # noinspection PyUnusedLocal
30
+ @staticmethod
31
+ @Hungarian.factory(
32
+ "lookup_lemmatizer",
33
+ assigns=["token.lemma"],
34
+ requires=["token.pos"],
35
+ default_config={"scorer": {"@scorers": "spacy.lemmatizer_scorer.v1"}, "source": ""},
36
+ )
37
+ def create(nlp: Language, name: str, scorer: Optional[Callable], source: str) -> "LookupLemmatizer":
38
+ return LookupLemmatizer(None, source, scorer)
39
+
40
+ def train(self, sentences: Iterable[Iterable[Tuple[str, str, str, str]]], min_occurrences: int = 1) -> None:
41
+ """
42
+
43
+ Args:
44
+ sentences (Iterable[Iterable[Tuple[str, str, str, str]]]): Sentences to learn the mappings from
45
+ min_occurrences (int): mappings occurring fewer times than this threshold are not learned
46
+
47
+ """
48
+
49
+ # Lookup table which maps (upos, form) to (lemma -> frequency),
50
+ # e.g. `{ ("NOUN", "alma"): { "alma" : 99, "alom": 1} }`
51
+ lemma_lookup_table: Dict[Tuple[str, str], Dict[str, int]] = defaultdict(lambda: defaultdict(int))
52
+
53
+ for sentence in sentences:
54
+ for token, pos, feats, lemma in sentence:
55
+ token = self.__mask_numbers(token)
56
+ lemma = self.__mask_numbers(lemma)
57
+ feats_str = ("|" + feats) if feats else ""
58
+ key = (token, pos + feats_str)
59
+ lemma_lookup_table[key][lemma] += 1
60
+ lemma_lookup_table = dict(lemma_lookup_table)
61
+
62
+ self._lookups = Lookups()
63
+ table = Table(name="lemma_lookups")
64
+
65
+ lemma_freq: Dict[str, int]
66
+ for (form, pos), lemma_freq in dict(lemma_lookup_table).items():
67
+ most_freq_lemma, freq = sorted(lemma_freq.items(), key=itemgetter(1), reverse=True)[0]
68
+ if freq >= min_occurrences:
69
+ if form not in table:
70
+ # lemma by pos
71
+ table[form]: Dict[str, str] = dict()
72
+ table[form][pos] = most_freq_lemma
73
+
74
+ self._lookups.set_table(name=f"lemma_lookups", table=table)
75
+
76
+ def __init__(
77
+ self,
78
+ lookups: Optional[Lookups] = None,
79
+ source: Optional[str] = None,
80
+ scorer: Optional[Callable] = lemmatizer_score,
81
+ ):
82
+ self._lookups: Optional[Lookups] = lookups
83
+ self.scorer = scorer
84
+ self.source = source
85
+
86
+ def __call__(self, doc: Doc) -> Doc:
87
+ assert self._lookups is not None, "Lookup table should be initialized first"
88
+
89
+ token: Token
90
+ for token in doc:
91
+ lemma_lookup_table = self._lookups.get_table(f"lemma_lookups")
92
+ masked_token = self.__mask_numbers(token.text)
93
+
94
+ if masked_token in lemma_lookup_table:
95
+ lemma_by_pos: Dict[str, str] = lemma_lookup_table[masked_token]
96
+ feats_str = ("|" + str(token.morph)) if str(token.morph) else ""
97
+ key = token.pos_ + feats_str
98
+ if key in lemma_by_pos:
99
+ if masked_token != token.text:
100
+ # If the token contains numbers, we need to replace the numbers in the lemma as well
101
+ token.lemma_ = self.__replace_numbers(lemma_by_pos[key], token.text)
102
+ pass
103
+ else:
104
+ token.lemma_ = lemma_by_pos[key]
105
+ return doc
106
+
107
+ # noinspection PyUnusedLocal
108
+ def to_disk(self, path, exclude=tuple()):
109
+ assert self._lookups is not None, "Lookup table should be initialized first"
110
+
111
+ path: Path = ensure_path(path)
112
+ path.mkdir(exist_ok=True)
113
+ self._lookups.to_disk(path)
114
+
115
+ # noinspection PyUnusedLocal
116
+ def from_disk(self, path, exclude=tuple()) -> "LookupLemmatizer":
117
+ path: Path = ensure_path(path)
118
+ lookups = Lookups()
119
+ self._lookups = lookups.from_disk(path=path)
120
+ return self
121
+
122
+ def initialize(self, get_examples: Callable[[], Iterable[Example]], *, nlp: Language = None) -> None:
123
+ lookups = Lookups()
124
+ self._lookups = lookups.from_disk(path=self.source)
125
+
126
+ @classmethod
127
+ def __mask_numbers(cls, token: str) -> str:
128
+ return cls._number_pattern.sub("0", token)
129
+
130
+ @classmethod
131
+ def __replace_numbers(cls, lemma: str, token: str) -> str:
132
+ return cls._number_pattern.sub(lambda match: token[match.start()], lemma)
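
The two number helpers above are what make the lookup keys digit-agnostic; a small standalone illustration (not part of the pipeline):

```python
# Mask digits when building the lookup key, then copy the original digits back
# into the predicted lemma.
import re

number_pattern = re.compile(r"\d")

def mask_numbers(text: str) -> str:
    return number_pattern.sub("0", text)

def replace_numbers(lemma: str, token: str) -> str:
    return number_pattern.sub(lambda m: token[m.start()], lemma)

print(mask_numbers("2023-ban"))             # -> "0000-ban" (one key for any year)
print(replace_numbers("0000", "2023-ban"))  # -> "2023"
```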
meta.json CHANGED
@@ -1,7 +1,7 @@
1
  {
2
  "lang":"hu",
3
  "name":"core_news_md",
4
- "version":"3.5.0",
5
  "description":"Core Hungarian model for HuSpaCy. Components: tok2vec, senter, tagger, morphologizer, lemmatizer, parser, ner",
6
  "author":"SzegedAI, MILAB",
7
  "email":"gyorgy@orosz.link",
@@ -1273,80 +1273,85 @@
1273
  "token_p":0.998565417,
1274
  "token_r":0.9993300153,
1275
  "token_f":0.9989475698,
1276
- "sents_p":0.9710467706,
1277
- "sents_r":0.9710467706,
1278
- "sents_f":0.9710467706,
1279
- "tag_acc":0.9649265515,
1280
- "pos_acc":0.9634910761,
1281
- "morph_acc":0.9308067758,
1282
- "morph_micro_p":0.9672095642,
1283
- "morph_micro_r":0.9595616674,
1284
- "morph_micro_f":0.9633704375,
1285
  "morph_per_feat":{
1286
  "Definite":{
1287
- "p":0.9633363886,
1288
  "r":0.9808679421,
1289
- "f":0.9720231214
1290
  },
1291
  "PronType":{
1292
- "p":0.9740331492,
1293
- "r":0.9729580574,
1294
- "f":0.9734953065
1295
  },
1296
  "Case":{
1297
- "p":0.973421263,
1298
- "r":0.9624580123,
1299
- "f":0.9679085941
1300
  },
1301
  "Degree":{
1302
- "p":0.9190391459,
1303
- "r":0.8594009983,
1304
- "f":0.8882201204
1305
  },
1306
  "Number":{
1307
- "p":0.9858179976,
1308
- "r":0.9785486844,
1309
- "f":0.9821698907
1310
  },
1311
  "Mood":{
1312
- "p":0.9290393013,
1313
- "r":0.94345898,
1314
- "f":0.9361936194
1315
  },
1316
  "Person":{
1317
- "p":0.9577114428,
1318
- "r":0.9498355263,
1319
- "f":0.9537572254
1320
  },
1321
  "Tense":{
1322
- "p":0.9650655022,
1323
  "r":0.9767955801,
1324
- "f":0.9708951126
1325
  },
1326
  "VerbForm":{
1327
- "p":0.952853598,
1328
- "r":0.9238171612,
1329
- "f":0.9381107492
1330
  },
1331
  "Voice":{
1332
- "p":0.9596774194,
1333
- "r":0.9734151329,
1334
- "f":0.9664974619
1335
  },
1336
  "Number[psor]":{
1337
- "p":0.9696969697,
1338
- "r":0.9572649573,
1339
- "f":0.9634408602
1340
  },
1341
  "Person[psor]":{
1342
- "p":0.9668109668,
1343
- "r":0.9557774608,
1344
- "f":0.9612625538
1345
  },
1346
  "NumType":{
1347
- "p":0.9191176471,
1348
- "r":0.9146341463,
1349
- "f":0.9168704156
1350
  },
1351
  "Reflex":{
1352
  "p":1.0,
@@ -1362,121 +1367,116 @@
1362
  "p":0.0,
1363
  "r":0.0,
1364
  "f":0.0
1365
- },
1366
- "Poss":{
1367
- "p":1.0,
1368
- "r":1.0,
1369
- "f":1.0
1370
  }
1371
  },
1372
- "lemma_acc":0.9738780978,
1373
- "dep_uas":0.8116901329,
1374
- "dep_las":0.7417545434,
1375
  "dep_las_per_type":{
1376
  "det":{
1377
- "p":0.8646734854,
1378
- "r":0.875,
1379
- "f":0.8698060942
1380
  },
1381
  "amod:att":{
1382
- "p":0.8457752256,
1383
- "r":0.8430089943,
1384
- "f":0.8443898444
1385
  },
1386
  "nsubj":{
1387
- "p":0.6973293769,
1388
- "r":0.734375,
1389
- "f":0.7153729072
1390
  },
1391
  "advmod:mode":{
1392
- "p":0.5777262181,
1393
  "r":0.6102941176,
1394
- "f":0.5935637664
1395
  },
1396
  "nmod:att":{
1397
- "p":0.7375201288,
1398
- "r":0.7762711864,
1399
- "f":0.7563996697
1400
  },
1401
  "obl":{
1402
- "p":0.7434554974,
1403
- "r":0.7668766877,
1404
- "f":0.7549844927
1405
  },
1406
  "obj":{
1407
- "p":0.8758949881,
1408
- "r":0.8247191011,
1409
- "f":0.849537037
1410
  },
1411
  "root":{
1412
- "p":0.8106904232,
1413
- "r":0.8106904232,
1414
- "f":0.8106904232
1415
  },
1416
  "cc":{
1417
- "p":0.6863157895,
1418
- "r":0.6863157895,
1419
- "f":0.6863157895
1420
  },
1421
  "conj":{
1422
- "p":0.4556213018,
1423
- "r":0.48125,
1424
- "f":0.4680851064
1425
  },
1426
  "advmod":{
1427
- "p":0.847826087,
1428
- "r":0.8210526316,
1429
- "f":0.8342245989
1430
  },
1431
  "flat:name":{
1432
- "p":0.8591549296,
1433
- "r":0.8551401869,
1434
- "f":0.8571428571
1435
  },
1436
  "appos":{
1437
- "p":0.4310344828,
1438
- "r":0.2659574468,
1439
- "f":0.3289473684
1440
  },
1441
  "advcl":{
1442
- "p":0.2989690722,
1443
- "r":0.2959183673,
1444
- "f":0.2974358974
1445
  },
1446
  "advmod:tlocy":{
1447
- "p":0.6905829596,
1448
- "r":0.6695652174,
1449
- "f":0.6799116998
1450
  },
1451
  "ccomp:obj":{
1452
- "p":0.2619047619,
1453
  "r":0.3333333333,
1454
- "f":0.2933333333
1455
  },
1456
  "mark":{
1457
- "p":0.8066666667,
1458
- "r":0.7658227848,
1459
- "f":0.7857142857
1460
  },
1461
  "compound:preverb":{
1462
- "p":0.9339622642,
1463
- "r":0.9082568807,
1464
- "f":0.9209302326
1465
  },
1466
  "advmod:locy":{
1467
- "p":0.75,
1468
- "r":0.46875,
1469
- "f":0.5769230769
1470
  },
1471
  "cop":{
1472
- "p":0.8518518519,
1473
- "r":0.5609756098,
1474
- "f":0.6764705882
1475
  },
1476
  "nmod:obl":{
1477
- "p":0.2368421053,
1478
- "r":0.225,
1479
- "f":0.2307692308
1480
  },
1481
  "advmod:to":{
1482
  "p":0.0,
@@ -1484,84 +1484,89 @@
1484
  "f":0.0
1485
  },
1486
  "obj:lvc":{
1487
- "p":0.3333333333,
1488
  "r":0.0833333333,
1489
- "f":0.1333333333
1490
  },
1491
  "ccomp:obl":{
1492
- "p":0.5,
1493
- "r":0.4375,
1494
- "f":0.4666666667
1495
  },
1496
  "iobj":{
1497
- "p":0.4,
1498
- "r":0.2666666667,
1499
- "f":0.32
1500
- },
1501
- "dep":{
1502
- "p":0.0,
1503
- "r":0.0,
1504
- "f":0.0
1505
  },
1506
- "xcomp":{
1507
- "p":0.8611111111,
1508
- "r":0.8378378378,
1509
- "f":0.8493150685
1510
  },
1511
  "case":{
1512
- "p":0.9195979899,
1513
- "r":0.9336734694,
1514
- "f":0.9265822785
1515
  },
1516
  "csubj":{
1517
- "p":0.6666666667,
1518
- "r":0.3243243243,
1519
- "f":0.4363636364
1520
  },
1521
  "parataxis":{
1522
- "p":0.375,
1523
  "r":0.1232876712,
1524
- "f":0.1855670103
1525
  },
1526
  "nummod":{
1527
- "p":0.5824175824,
1528
- "r":0.5698924731,
1529
- "f":0.5760869565
1530
  },
1531
- "acl":{
1532
- "p":0.4615384615,
1533
- "r":0.3333333333,
1534
- "f":0.3870967742
1535
  },
1536
  "advmod:tto":{
1537
- "p":0.6666666667,
1538
- "r":0.2,
1539
- "f":0.3076923077
1540
  },
1541
  "nmod":{
1542
- "p":0.3333333333,
1543
- "r":0.0909090909,
1544
- "f":0.1428571429
1545
- },
1546
- "aux":{
1547
- "p":0.9090909091,
1548
- "r":0.8333333333,
1549
- "f":0.8695652174
1550
  },
1551
  "advmod:tfrom":{
1552
  "p":0.0,
1553
  "r":0.0,
1554
  "f":0.0
1555
  },
1556
  "goeswith":{
1557
  "p":0.0,
1558
  "r":0.0,
1559
  "f":0.0
1560
  },
1561
  "compound":{
1562
- "p":0.9285714286,
1563
  "r":0.975,
1564
- "f":0.9512195122
1565
  },
1566
  "obl:lvc":{
1567
  "p":0.0,
@@ -1573,6 +1578,11 @@
1573
  "r":0.0,
1574
  "f":0.0
1575
  },
1576
  "nsubj:lvc":{
1577
  "p":0.0,
1578
  "r":0.0,
@@ -1583,48 +1593,38 @@
1583
  "r":0.1666666667,
1584
  "f":0.2857142857
1585
  },
1586
- "ccomp":{
1587
- "p":0.0,
1588
- "r":0.0,
1589
- "f":0.0
1590
- },
1591
  "advmod:que":{
1592
  "p":1.0,
1593
- "r":0.75,
1594
- "f":0.8571428571
1595
- },
1596
- "ccomp:pred":{
1597
- "p":0.0,
1598
- "r":0.0,
1599
- "f":0.0
1600
  }
1601
  },
1602
- "ents_p":0.8581105169,
1603
- "ents_r":0.8463431786,
1604
- "ents_f":0.8521862277,
1605
  "ents_per_type":{
1606
  "ORG":{
1607
- "p":0.8835616438,
1608
- "r":0.8970792768,
1609
- "f":0.8902691511
1610
  },
1611
  "PER":{
1612
- "p":0.8852163462,
1613
- "r":0.8799283154,
1614
- "f":0.8825644098
1615
  },
1616
  "LOC":{
1617
- "p":0.8632326821,
1618
- "r":0.84375,
1619
- "f":0.853380158
1620
  },
1621
  "MISC":{
1622
- "p":0.6888888889,
1623
- "r":0.6156028369,
1624
- "f":0.6501872659
1625
  }
1626
  },
1627
- "speed":1651.4495157666
1628
  },
1629
  "sources":[
1630
  {
 
1
  {
2
  "lang":"hu",
3
  "name":"core_news_md",
4
+ "version":"3.5.1",
5
  "description":"Core Hungarian model for HuSpaCy. Components: tok2vec, senter, tagger, morphologizer, lemmatizer, parser, ner",
6
  "author":"SzegedAI, MILAB",
7
  "email":"gyorgy@orosz.link",
 
1273
  "token_p":0.998565417,
1274
  "token_r":0.9993300153,
1275
  "token_f":0.9989475698,
1276
+ "sents_p":0.9820627803,
1277
+ "sents_r":0.9755011136,
1278
+ "sents_f":0.9787709497,
1279
+ "tag_acc":0.9624844483,
1280
+ "pos_acc":0.9631543688,
1281
+ "morph_acc":0.928892717,
1282
+ "morph_micro_p":0.9648917749,
1283
+ "morph_micro_r":0.9578427159,
1284
+ "morph_micro_f":0.9613543239,
1285
  "morph_per_feat":{
1286
  "Definite":{
1287
+ "p":0.9589416058,
1288
  "r":0.9808679421,
1289
+ "f":0.9697808535
1290
  },
1291
  "PronType":{
1292
+ "p":0.9741331866,
1293
+ "r":0.9768211921,
1294
+ "f":0.9754753376
1295
  },
1296
  "Case":{
1297
+ "p":0.9733840304,
1298
+ "r":0.9610748864,
1299
+ "f":0.9671902963
1300
  },
1301
  "Degree":{
1302
+ "p":0.9179170344,
1303
+ "r":0.8652246256,
1304
+ "f":0.8907922912
1305
  },
1306
  "Number":{
1307
+ "p":0.9834515366,
1308
+ "r":0.9760348584,
1309
+ "f":0.9797291614
1310
  },
1311
  "Mood":{
1312
+ "p":0.9142236699,
1313
+ "r":0.933481153,
1314
+ "f":0.923752057
1315
  },
1316
  "Person":{
1317
+ "p":0.9505766063,
1318
+ "r":0.9490131579,
1319
+ "f":0.9497942387
1320
  },
1321
  "Tense":{
1322
+ "p":0.9598262758,
1323
  "r":0.9767955801,
1324
+ "f":0.9682365827
1325
  },
1326
  "VerbForm":{
1327
+ "p":0.9554822754,
1328
+ "r":0.9294306335,
1329
+ "f":0.9422764228
1330
  },
1331
  "Voice":{
1332
+ "p":0.9519038076,
1333
+ "r":0.9713701431,
1334
+ "f":0.9615384615
1335
  },
1336
  "Number[psor]":{
1337
+ "p":0.9719764012,
1338
+ "r":0.9387464387,
1339
+ "f":0.9550724638
1340
  },
1341
  "Person[psor]":{
1342
+ "p":0.9705014749,
1343
+ "r":0.9386590585,
1344
+ "f":0.9543147208
1345
  },
1346
  "NumType":{
1347
+ "p":0.9209876543,
1348
+ "r":0.9097560976,
1349
+ "f":0.9153374233
1350
+ },
1351
+ "Poss":{
1352
+ "p":0.75,
1353
+ "r":1.0,
1354
+ "f":0.8571428571
1355
  },
1356
  "Reflex":{
1357
  "p":1.0,
 
1367
  "p":0.0,
1368
  "r":0.0,
1369
  "f":0.0
1370
  }
1371
  },
1372
+ "lemma_acc":0.9728255669,
1373
+ "dep_uas":0.8127597439,
1374
+ "dep_las":0.743681905,
1375
  "dep_las_per_type":{
1376
  "det":{
1377
+ "p":0.86328125,
1378
+ "r":0.8797770701,
1379
+ "f":0.8714511041
1380
  },
1381
  "amod:att":{
1382
+ "p":0.8241758242,
1383
+ "r":0.8585445626,
1384
+ "f":0.8410092111
1385
  },
1386
  "nsubj":{
1387
+ "p":0.7255813953,
1388
+ "r":0.73125,
1389
+ "f":0.7284046693
1390
  },
1391
  "advmod:mode":{
1392
+ "p":0.5872641509,
1393
  "r":0.6102941176,
1394
+ "f":0.5985576923
1395
  },
1396
  "nmod:att":{
1397
+ "p":0.8083941606,
1398
+ "r":0.7508474576,
1399
+ "f":0.7785588752
1400
  },
1401
  "obl":{
1402
+ "p":0.7533632287,
1403
+ "r":0.7560756076,
1404
+ "f":0.7547169811
1405
  },
1406
  "obj":{
1407
+ "p":0.8513513514,
1408
+ "r":0.8494382022,
1409
+ "f":0.8503937008
1410
  },
1411
  "root":{
1412
+ "p":0.8049327354,
1413
+ "r":0.7995545657,
1414
+ "f":0.8022346369
1415
  },
1416
  "cc":{
1417
+ "p":0.7052401747,
1418
+ "r":0.68,
1419
+ "f":0.6923901393
1420
  },
1421
  "conj":{
1422
+ "p":0.4658634538,
1423
+ "r":0.4833333333,
1424
+ "f":0.4744376278
1425
  },
1426
  "advmod":{
1427
+ "p":0.8144329897,
1428
+ "r":0.8315789474,
1429
+ "f":0.8229166667
1430
  },
1431
  "flat:name":{
1432
+ "p":0.871559633,
1433
+ "r":0.8878504673,
1434
+ "f":0.8796296296
1435
  },
1436
  "appos":{
1437
+ "p":0.3714285714,
1438
+ "r":0.2765957447,
1439
+ "f":0.3170731707
1440
  },
1441
  "advcl":{
1442
+ "p":0.3571428571,
1443
+ "r":0.2040816327,
1444
+ "f":0.2597402597
1445
  },
1446
  "advmod:tlocy":{
1447
+ "p":0.6991869919,
1448
+ "r":0.747826087,
1449
+ "f":0.7226890756
1450
  },
1451
  "ccomp:obj":{
1452
+ "p":0.2244897959,
1453
  "r":0.3333333333,
1454
+ "f":0.2682926829
1455
  },
1456
  "mark":{
1457
+ "p":0.7884615385,
1458
+ "r":0.7784810127,
1459
+ "f":0.7834394904
1460
  },
1461
  "compound:preverb":{
1462
+ "p":0.9509803922,
1463
+ "r":0.8899082569,
1464
+ "f":0.9194312796
1465
  },
1466
  "advmod:locy":{
1467
+ "p":0.7222222222,
1468
+ "r":0.40625,
1469
+ "f":0.52
1470
  },
1471
  "cop":{
1472
+ "p":0.7777777778,
1473
+ "r":0.512195122,
1474
+ "f":0.6176470588
1475
  },
1476
  "nmod:obl":{
1477
+ "p":0.175,
1478
+ "r":0.175,
1479
+ "f":0.175
1480
  },
1481
  "advmod:to":{
1482
  "p":0.0,
 
1484
  "f":0.0
1485
  },
1486
  "obj:lvc":{
1487
+ "p":0.2,
1488
  "r":0.0833333333,
1489
+ "f":0.1176470588
1490
  },
1491
  "ccomp:obl":{
1492
+ "p":0.5238095238,
1493
+ "r":0.34375,
1494
+ "f":0.4150943396
1495
  },
1496
  "iobj":{
1497
+ "p":0.3,
1498
+ "r":0.2,
1499
+ "f":0.24
1500
  },
1501
+ "acl":{
1502
+ "p":0.2772277228,
1503
+ "r":0.3888888889,
1504
+ "f":0.323699422
1505
  },
1506
  "case":{
1507
+ "p":0.9487179487,
1508
+ "r":0.943877551,
1509
+ "f":0.9462915601
1510
  },
1511
  "csubj":{
1512
+ "p":0.4827586207,
1513
+ "r":0.3783783784,
1514
+ "f":0.4242424242
1515
  },
1516
  "parataxis":{
1517
+ "p":0.3913043478,
1518
  "r":0.1232876712,
1519
+ "f":0.1875
1520
+ },
1521
+ "xcomp":{
1522
+ "p":0.84,
1523
+ "r":0.8513513514,
1524
+ "f":0.8456375839
1525
  },
1526
  "nummod":{
1527
+ "p":0.5647058824,
1528
+ "r":0.5161290323,
1529
+ "f":0.5393258427
1530
  },
1531
+ "dep":{
1532
+ "p":0.0,
1533
+ "r":0.0,
1534
+ "f":0.0
1535
+ },
1536
+ "aux":{
1537
+ "p":0.7272727273,
1538
+ "r":0.6666666667,
1539
+ "f":0.6956521739
1540
  },
1541
  "advmod:tto":{
1542
+ "p":0.75,
1543
+ "r":0.3,
1544
+ "f":0.4285714286
1545
  },
1546
  "nmod":{
1547
+ "p":0.0,
1548
+ "r":0.0,
1549
+ "f":0.0
1550
  },
1551
  "advmod:tfrom":{
1552
  "p":0.0,
1553
  "r":0.0,
1554
  "f":0.0
1555
  },
1556
+ "ccomp":{
1557
+ "p":0.0,
1558
+ "r":0.0,
1559
+ "f":0.0
1560
+ },
1561
  "goeswith":{
1562
  "p":0.0,
1563
  "r":0.0,
1564
  "f":0.0
1565
  },
1566
  "compound":{
1567
+ "p":1.0,
1568
  "r":0.975,
1569
+ "f":0.9873417722
1570
  },
1571
  "obl:lvc":{
1572
  "p":0.0,
 
1578
  "r":0.0,
1579
  "f":0.0
1580
  },
1581
+ "ccomp:pred":{
1582
+ "p":0.0,
1583
+ "r":0.0,
1584
+ "f":0.0
1585
+ },
1586
  "nsubj:lvc":{
1587
  "p":0.0,
1588
  "r":0.0,
 
1593
  "r":0.1666666667,
1594
  "f":0.2857142857
1595
  },
 
  "advmod:que":{
1597
  "p":1.0,
1598
+ "r":0.5,
1599
+ "f":0.6666666667
1600
  }
1601
  },
1602
+ "ents_p":0.8572441922,
1603
+ "ents_r":0.849859353,
1604
+ "ents_f":0.8535357994,
1605
  "ents_per_type":{
1606
  "ORG":{
1607
+ "p":0.9027777778,
1608
+ "r":0.8738989337,
1609
+ "f":0.8881036514
1610
  },
1611
  "PER":{
1612
+ "p":0.8675042833,
1613
+ "r":0.9074074074,
1614
+ "f":0.8870072993
1615
  },
1616
  "LOC":{
1617
+ "p":0.888384755,
1618
+ "r":0.8498263889,
1619
+ "f":0.8686779059
1620
  },
1621
  "MISC":{
1622
+ "p":0.6461318052,
1623
+ "r":0.6397163121,
1624
+ "f":0.6429080542
1625
  }
1626
  },
1627
+ "speed":2535.2452470079
1628
  },
1629
  "sources":[
1630
  {
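
All of the figures above live in `meta.json` and are exposed at runtime through `nlp.meta` (a short sketch, assuming the installed package carries the standard spaCy `performance` block shown here):

```python
# Read the packaged metadata instead of parsing meta.json by hand.
import spacy

nlp = spacy.load("hu_core_news_md")
print(nlp.meta["version"])                # expected: "3.5.1"
print(nlp.meta["performance"]["ents_f"])  # expected: ~0.8535
print(nlp.meta["performance"]["speed"])   # evaluation speed (words/second)
```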
morphologizer/model CHANGED
@@ -1,3 +1,3 @@
1
  version https://git-lfs.github.com/spec/v1
2
- oid sha256:aebba307a814e36fb5d32de0e25f4315867f4678e73ff2e85aceb5c41d3c0af3
3
  size 463022
 
1
  version https://git-lfs.github.com/spec/v1
2
+ oid sha256:56ea873c3ffc818958ecd60553379d277ae3b21f74170486ccfaf6f6d60d563f
3
  size 463022
ner/model CHANGED
@@ -1,3 +1,3 @@
1
  version https://git-lfs.github.com/spec/v1
2
- oid sha256:6404b8918a3691cadb17f69cd8fa4bccff7aff4b77ceb8e4dfbe2e3bc9d12a2c
3
  size 9791307
 
1
  version https://git-lfs.github.com/spec/v1
2
+ oid sha256:0f02a2f28c88dfb50af9a8555bbe2929abb3c9a5cf1d29d3b34a91526988b0c2
3
  size 9791307
parser/model CHANGED
@@ -1,3 +1,3 @@
1
  version https://git-lfs.github.com/spec/v1
2
- oid sha256:075883bd84113bcce6b9a425ee55d44b21fc964b2d2515b6625082872fab2195
3
  size 25601129
 
1
  version https://git-lfs.github.com/spec/v1
2
+ oid sha256:846779dc9fa38c7dae2730aaf1dc8a99bd24f62aa9ec283427e5382343422284
3
  size 25601129
senter/model CHANGED
@@ -1,3 +1,3 @@
1
  version https://git-lfs.github.com/spec/v1
2
- oid sha256:570ee5f2927cf1603436844338d191ec23ca484b811796c9df37d74dde80e0a6
3
  size 1237
 
1
  version https://git-lfs.github.com/spec/v1
2
+ oid sha256:dd16a43ec8c789c555386afbb199d4444466c2181c3dc5c4de56c9ca2b57685a
3
  size 1237
tagger/model CHANGED
@@ -1,3 +1,3 @@
1
  version https://git-lfs.github.com/spec/v1
2
- oid sha256:40bea158134db0033f7559059533e4c2d64792c1cc934a7fb4f414ed0c67ed28
3
  size 7297
 
1
  version https://git-lfs.github.com/spec/v1
2
+ oid sha256:04d3ab9953b81bf955e264f667dd1eacdf2ff3b319598560680df13b5ac80f75
3
  size 7297
tok2vec/model CHANGED
@@ -1,3 +1,3 @@
1
  version https://git-lfs.github.com/spec/v1
2
- oid sha256:7302c83d08f9da7b5388bb1f505e84ad0fb0125d4ac17c4ff3fc683d697400c9
3
  size 9659749
 
1
  version https://git-lfs.github.com/spec/v1
2
+ oid sha256:eb31d7b818409d19f80994951eabe64fc35391f363623cc461f61f2fffc39b4f
3
  size 9659749
vocab/strings.json CHANGED
@@ -1,3 +1,3 @@
1
  version https://git-lfs.github.com/spec/v1
2
- oid sha256:1296dbf1d9d909f4b2521e29174cab614107fe3ab0ed196ba474ee0c59101c5d
3
- size 6405774
 
1
  version https://git-lfs.github.com/spec/v1
2
+ oid sha256:7a7cffc79e121b8b25771ee3d13f9f35f7f2af63ee8cbb354d0ece1fdf03cf78
3
+ size 6405688