heack committed
Commit d09542b (parent: e9a4aa4)

Update README.md

Files changed (1): README.md (+51 −0)
README.md CHANGED
@@ -52,6 +52,57 @@ print(summary)
 
 包头警方发布一起利用AI实施电信诈骗典型案例:法人代表10分钟内被骗430万元
 ```
+
+## If you need a longer summary, refer to the following code:
+
+```python
+from transformers import MT5ForConditionalGeneration, T5Tokenizer
+
+model_heack = MT5ForConditionalGeneration.from_pretrained("heack/HeackMT5-ZhSum100k")
+tokenizer_heack = T5Tokenizer.from_pretrained("heack/HeackMT5-ZhSum100k")
+
+
+def _split_text(text, length):
+    chunks = []
+    start = 0
+    while start < len(text):
+        if len(text) - start > length:
+            pos_forward = start + length
+            pos_backward = start + length
+            pos = start + length
+            # Scan outward (forward and backward) from the target cut point,
+            # looking for punctuation within a window of 20 characters.
+            while (pos_forward < len(text)) and (pos_backward >= 0) and (pos_forward < 20 + pos) and (pos_backward + 20 > pos) and text[pos_forward] not in {'.', '。', ',', ','} and text[pos_backward] not in {'.', '。', ',', ','}:
+                pos_forward += 1
+                pos_backward -= 1
+            if pos_forward - pos >= 20 and pos_backward <= pos - 20:
+                # No punctuation nearby: cut at the target length.
+                pos = start + length
+            elif text[pos_backward] in {'.', '。', ',', ','}:
+                pos = pos_backward
+            else:
+                pos = pos_forward
+            chunks.append(text[start:pos + 1])
+            start = pos + 1
+        else:
+            chunks.append(text[start:])
+            break
+    # Combine the last chunk with the previous one if it is too short.
+    if len(chunks) > 1 and len(chunks[-1]) < 100:
+        chunks[-2] += chunks[-1]
+        chunks.pop()
+    return chunks
+
+
+def get_summary_heack(text, each_summary_length=150):
+    chunks = _split_text(text, 300)
+    summaries = []
+    for chunk in chunks:
+        inputs = tokenizer_heack.encode("summarize: " + chunk, return_tensors='pt', max_length=512, truncation=True)
+        summary_ids = model_heack.generate(inputs, max_length=each_summary_length, num_beams=4, length_penalty=1.5, no_repeat_ngram_size=2)
+        summary = tokenizer_heack.decode(summary_ids[0], skip_special_tokens=True)
+        summaries.append(summary)
+    return " ".join(summaries)
+```
+
 ## Credits
 This model is trained and maintained by KongYang from Shanghai Jiao Tong University. For any questions, please reach out to me at my WeChat ID: kongyang.