Update README.md
README.md CHANGED
verified: true
verifyToken: eyJhbGciOiJFZERTQSIsInR5cCI6IkpXVCJ9.eyJoYXNoIjoiOWI2NjVlYjgwYWJiMjcyMDUzMzEwNDNjZTMxMDM0MjAzMzk1ZmIwY2Q1ZDQ2Y2M5NDBlMDEzYzFkNWEyNzJmNiIsInZlcnNpb24iOjF9.iZ1Iy7FuWL4GH7LS5EylVj5eZRC3L2ZsbYQapAkMNzR_VXPoMGvoM69Hp-kU7gW55tmz2V4Qxhvoz9cM8fciBA
---
# LED-Based Summarization Model (Large): Condensing Extensive Information
<a href="https://colab.research.google.com/gist/pszemraj/3eba944ddc9fc9a4a1bfb21e83b57620/summarization-token-batching.ipynb">
<img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/>
</a>
This model is a fine-tuned version of [allenai/led-large-16384](https://huggingface.co/allenai/led-large-16384) on the `BookSum` dataset. It aims to generalize well and be useful in summarizing lengthy text for both academic and everyday purposes. Capable of handling up to 16,384 tokens per batch, this model provides effective summarization of large volumes of text.
- See the Colab demo linked above or try the [demo on Spaces](https://huggingface.co/spaces/pszemraj/summarize-long-text)
> **Note:** Due to inference API timeout constraints, outputs may be truncated before the full summary is returned (try the Python examples below or the demo).
---
## Basic Usage
To improve summary quality, use `encoder_no_repeat_ngram_size=3` when calling the pipeline object. This setting encourages the model to use new vocabulary and construct an abstractive summary; otherwise, it may simply compile the best _extractive_ summary from the input.
Load the model into a pipeline object:
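
The exact arguments for this step are not shown here, so the following is a minimal sketch assuming standard `transformers` pipeline usage; the device placement is an illustrative choice.

```python
import torch
from transformers import pipeline

# summarization pipeline for this checkpoint; uses the GPU if one is available
summarizer = pipeline(
    "summarization",
    "pszemraj/led-large-book-summary",
    device=0 if torch.cuda.is_available() else -1,
)
```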
Feed the text into the pipeline object:
```python
wall_of_text = "your words here"

# generation settings below are illustrative; tune them for your hardware
# and desired summary length
result = summarizer(
    wall_of_text,
    min_length=16,
    max_length=256,
    no_repeat_ngram_size=3,
    encoder_no_repeat_ngram_size=3,  # recommended above for abstractive output
    repetition_penalty=3.5,
    num_beams=4,
    early_stopping=True,
)
print(result[0]["summary_text"])
```
**Important:** For optimal summary quality, use the global attention mask when decoding, as demonstrated in [this community notebook](https://colab.research.google.com/drive/12INTTR6n64TzS4RrXZxMSXfrOd9Xzamo?usp=sharing); see the definition of `generate_answer(batch)`. A minimal sketch of the same pattern follows.
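
The snippet below is a hedged sketch of that pattern rather than code from the notebook: it loads the tokenizer and model directly and gives the first token global attention, which is the standard setup for LED-style models; the helper name `summarize_with_global_attention` and the generation settings are illustrative.

```python
import torch
from transformers import AutoTokenizer, LEDForConditionalGeneration

checkpoint = "pszemraj/led-large-book-summary"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = LEDForConditionalGeneration.from_pretrained(checkpoint)

def summarize_with_global_attention(text: str, max_new_tokens: int = 256) -> str:
    # tokenize up to the model's 16384-token context window
    inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=16384)
    # give the first token global attention, as is customary for LED
    global_attention_mask = torch.zeros_like(inputs["input_ids"])
    global_attention_mask[:, 0] = 1
    summary_ids = model.generate(
        inputs["input_ids"],
        attention_mask=inputs["attention_mask"],
        global_attention_mask=global_attention_mask,
        max_new_tokens=max_new_tokens,
        num_beams=4,
        no_repeat_ngram_size=3,
    )
    return tokenizer.decode(summary_ids[0], skip_special_tokens=True)
```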
If you're facing computing constraints, consider using the base version [`pszemraj/led-base-book-summary`](https://huggingface.co/pszemraj/led-base-book-summary). All generation parameters on the API here match those of the base model, enabling easy comparison between versions.
---
## Training Information
### Data
The model was trained on the [BookSum](https://arxiv.org/abs/2105.08209) dataset. During training, the `chapter` field was used as the input column and `summary_text` as the target output.
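
For orientation, the sketch below loads the data from the Hub and prints one input/target pair; the dataset id `kmfoda/booksum` is assumed to be the Hub mirror of BookSum used here.

```python
from datasets import load_dataset

# kmfoda/booksum is assumed to be the Hub copy of the BookSum dataset
booksum_train = load_dataset("kmfoda/booksum", split="train")

example = booksum_train[0]
print(example["chapter"][:500])   # model input: chapter text
print(example["summary_text"])    # training target: reference summary
```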
### Procedure
Training was completed on the BookSum dataset across 13+ epochs. Notably, the final four epochs combined the training and validation sets as 'train' to enhance generalization.
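
One reasonable way to reproduce that merged split with the `datasets` library (an assumption, since the exact preprocessing script is not part of this card) would be:

```python
from datasets import load_dataset, concatenate_datasets

booksum = load_dataset("kmfoda/booksum")  # assumed Hub id, as above
# merge train and validation into a single split for the final training epochs
train_plus_val = concatenate_datasets([booksum["train"], booksum["validation"]])
print(len(booksum["train"]), len(booksum["validation"]), len(train_plus_val))
```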
### Hyperparameters
The training process involved different settings across stages (an illustrative configuration sketch for the first stage follows the list):
- **Initial Three Epochs:** Low learning rate (5e-05), batch size of 1, 4 gradient accumulation steps, and a linear learning rate scheduler.
- **In-between Epochs:** Learning rate reduced to 4e-05, increased batch size to 2, 16 gradient accumulation steps, and switched to a cosine learning rate scheduler with a 0.05 warmup ratio.
- **Final Two Epochs:** Further reduced learning rate (2e-05), batch size reverted to 1, maintained gradient accumulation steps at 16, and continued with a cosine learning rate scheduler, albeit with a lower warmup ratio (0.03).
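
As referenced above, here is an illustrative `Seq2SeqTrainingArguments` sketch for the first stage; only the values named in the list come from this card, and everything else (output directory, seed, eval batch size, and so on) is a placeholder.

```python
from transformers import Seq2SeqTrainingArguments

# first-stage settings; values not listed in the card are placeholders
stage_one_args = Seq2SeqTrainingArguments(
    output_dir="./led-large-book-summary-stage1",
    learning_rate=5e-5,
    per_device_train_batch_size=1,
    per_device_eval_batch_size=1,
    gradient_accumulation_steps=4,
    num_train_epochs=3,
    lr_scheduler_type="linear",
    seed=42,
)
```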
### Versions
- Transformers 4.19.2
- PyTorch 1.11.0+cu113
- Datasets 2.2.2
- Tokenizers 0.12.1
---
## Simplified Usage with TextSum
To streamline the process of using this and other models, I've developed [a Python package utility](https://github.com/pszemraj/textsum) named `textsum`. This package offers simple interfaces for applying summarization models to text documents of arbitrary length.
Install TextSum:
```bash
pip install textsum
```
Then use it in Python with this model:
```python
from textsum.summarize import Summarizer

model_name = "pszemraj/led-large-book-summary"
summarizer = Summarizer(
    model_name_or_path=model_name,  # you can use any Seq2Seq model on the Hub
    token_batch_length=4096,  # tokens to batch summarize at a time, up to 16384
)

long_string = "This is a long string of text that will be summarized."
out_str = summarizer.summarize_string(long_string)
print(f"summary: {out_str}")
```
Currently implemented interfaces include a Python API, a Command-Line Interface (CLI), and a demo/web UI.
For detailed explanations and documentation, check the [README](https://github.com/pszemraj/textsum) or the [wiki](https://github.com/pszemraj/textsum/wiki).
---
## Related Models
Check out these other related models, also trained on the BookSum dataset:
- [Long-T5-tglobal-base](https://huggingface.co/pszemraj/long-t5-tglobal-base-16384-book-summary)
- [BigBird-Pegasus-Large-K](https://huggingface.co/pszemraj/bigbird-pegasus-large-K-booksum)
- [Pegasus-X-Large](https://huggingface.co/pszemraj/pegasus-x-large-book-summary)
- [Long-T5-tglobal-XL](https://huggingface.co/pszemraj/long-t5-tglobal-xl-16384-book-summary)
There are also other model variants trained on other datasets on my Hugging Face profile; feel free to try them out :)
|