pszemraj committed on
Commit
6a8855d
·
1 Parent(s): 078f22c

Update README.md

Files changed (1)
  1. README.md +67 -65
README.md CHANGED
@@ -347,28 +347,23 @@ model-index:
347
  verified: true
348
  verifyToken: eyJhbGciOiJFZERTQSIsInR5cCI6IkpXVCJ9.eyJoYXNoIjoiOWI2NjVlYjgwYWJiMjcyMDUzMzEwNDNjZTMxMDM0MjAzMzk1ZmIwY2Q1ZDQ2Y2M5NDBlMDEzYzFkNWEyNzJmNiIsInZlcnNpb24iOjF9.iZ1Iy7FuWL4GH7LS5EylVj5eZRC3L2ZsbYQapAkMNzR_VXPoMGvoM69Hp-kU7gW55tmz2V4Qxhvoz9cM8fciBA
349
  ---
350
-
351
- # Longformer Encoder-Decoder (LED) for Narrative-Esque Long Text Summarization
352
 
353
  <a href="https://colab.research.google.com/gist/pszemraj/3eba944ddc9fc9a4a1bfb21e83b57620/summarization-token-batching.ipynb">
354
  <img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/>
355
  </a>
356
 
357
- A fine-tuned version of [allenai/led-large-16384](https://huggingface.co/allenai/led-large-16384) on the `BookSum` dataset.
358
-
359
- Goal: a model that can generalize well and is useful in summarizing long text in academic and daily usage. The result works well on lots of text and can handle 16384 tokens/batch (_if you have the GPU memory to handle that_)
360
 
361
  - See the Colab demo linked above or try the [demo on Spaces](https://huggingface.co/spaces/pszemraj/summarize-long-text)
362
 
363
-
364
- > Note: the API is set to generate a max of 64 tokens for runtime reasons, so the summaries may be truncated (depending on the length of input text). For best results use python as below.
365
 
366
  ---
367
 
368
- # Usage - Basic
369
 
370
- - use `encoder_no_repeat_ngram_size=3` when calling the pipeline object to improve summary quality.
371
- - this forces the model to use new vocabulary and create an abstractive summary, otherwise it may compile the best _extractive_ summary from the input provided.
372
 
373
  Load the model into a pipeline object:
374
 
@@ -385,7 +380,7 @@ summarizer = pipeline(
385
  )
386
  ```
387
 
388
- - put words into the pipeline object:
389
 
390
  ```python
391
  wall_of_text = "your words here"
@@ -402,74 +397,81 @@ result = summarizer(
402
  )
403
  ```
404
 
 
405
 
406
- **Important:** To generate the best quality summaries, you should use the global attention mask when decoding, as demonstrated in [this community notebook here](https://colab.research.google.com/drive/12INTTR6n64TzS4RrXZxMSXfrOd9Xzamo?usp=sharing), see the definition of `generate_answer(batch)`.
407
 
408
- If having computing constraints, try the base version [`pszemraj/led-base-book-summary`](https://huggingface.co/pszemraj/led-base-book-summary)
409
- - all the parameters for generation on the API here are the same as [the base model](https://huggingface.co/pszemraj/led-base-book-summary) for easy comparison between versions.
410
 
411
- ## Training and evaluation data
412
 
413
- - the [booksum](https://arxiv.org/abs/2105.08209) dataset (this is what adds the `bsd-3-clause` license)
414
- - During training, the input text was the text of the `chapter`, and the output was `summary_text`
415
- - Eval results can be found [here](https://huggingface.co/datasets/autoevaluate/autoeval-staging-eval-project-kmfoda__booksum-79c1c0d8-10905463) with metrics on the sidebar.
416
 
417
- ## Training procedure
418
 
419
- - Training completed on the BookSum dataset for 13 total epochs
420
- - **The final four epochs combined the training and validation sets as 'train' in an effort to increase generalization.**
421
 
422
- ### Training hyperparameters
423
 
424
- #### Initial Three Epochs
425
 
426
- The following hyperparameters were used during training:
427
- - learning_rate: 5e-05
428
- - train_batch_size: 1
429
- - eval_batch_size: 1
430
- - seed: 42
431
- - distributed_type: multi-GPU
432
- - gradient_accumulation_steps: 4
433
- - total_train_batch_size: 4
434
- - optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
435
- - lr_scheduler_type: linear
436
- - num_epochs: 3
437
 
438
- #### In-between Epochs
 
 
439
 
440
- Unfortunately, don't have all records on-hand for middle epochs; the following should be representative:
441
 
442
- - learning_rate: 4e-05
443
- - train_batch_size: 2
444
- - eval_batch_size: 2
445
- - seed: 42
446
- - distributed_type: multi-GPU
447
- - gradient_accumulation_steps: 16
448
- - total_train_batch_size: 32
449
- - optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
450
- - lr_scheduler_type: cosine
451
- - lr_scheduler_warmup_ratio: 0.05
452
- - num_epochs: 6 (in addition to prior model)
453
 
454
- #### Final Two Epochs
455
 
456
- The following hyperparameters were used during training:
457
- - learning_rate: 2e-05
458
- - train_batch_size: 1
459
- - eval_batch_size: 1
460
- - seed: 42
461
- - distributed_type: multi-GPU
462
- - gradient_accumulation_steps: 16
463
- - total_train_batch_size: 16
464
- - optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
465
- - lr_scheduler_type: cosine
466
- - lr_scheduler_warmup_ratio: 0.03
467
- - num_epochs: 2 (in addition to prior model)
468
 
 
469
 
470
- ### Framework versions
471
 
472
- - Transformers 4.19.2
473
- - Pytorch 1.11.0+cu113
474
- - Datasets 2.2.2
475
- - Tokenizers 0.12.1
 
347
  verified: true
348
  verifyToken: eyJhbGciOiJFZERTQSIsInR5cCI6IkpXVCJ9.eyJoYXNoIjoiOWI2NjVlYjgwYWJiMjcyMDUzMzEwNDNjZTMxMDM0MjAzMzk1ZmIwY2Q1ZDQ2Y2M5NDBlMDEzYzFkNWEyNzJmNiIsInZlcnNpb24iOjF9.iZ1Iy7FuWL4GH7LS5EylVj5eZRC3L2ZsbYQapAkMNzR_VXPoMGvoM69Hp-kU7gW55tmz2V4Qxhvoz9cM8fciBA
349
  ---
350
+ # LED-Based Summarization Model (Large): Condensing Extensive Information
 
351
 
352
  <a href="https://colab.research.google.com/gist/pszemraj/3eba944ddc9fc9a4a1bfb21e83b57620/summarization-token-batching.ipynb">
353
  <img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/>
354
  </a>
355
 
356
+ This model is a fine-tuned version of [allenai/led-large-16384](https://huggingface.co/allenai/led-large-16384) on the `BookSum` dataset. It aims to generalize well and be useful in summarizing lengthy text for both academic and everyday purposes. Capable of handling up to 16,384 tokens per batch, this model provides effective summarization of large volumes of text.
 
 
357
 
358
  - See the Colab demo linked above or try the [demo on Spaces](https://huggingface.co/spaces/pszemraj/summarize-long-text)
359
 
360
+ > **Note:** Due to inference API timeout constraints, outputs may be truncated before the full summary is returned (try Python or the demo).
 
361
 
362
  ---
363
 
364
+ ## Basic Usage
365
 
366
+ To improve summary quality, use `encoder_no_repeat_ngram_size=3` when calling the pipeline object. This setting encourages the model to utilize new vocabulary and construct an abstractive summary.
 
367
 
368
  Load the model into a pipeline object:
369
 
 
380
  )
381
  ```
382
 
383
+ Feed the text into the pipeline object:
384
 
385
  ```python
386
  wall_of_text = "your words here"
 
397
  )
398
  ```
399
 
400
+ **Important:** For optimal summary quality, use the global attention mask when decoding, as demonstrated in [this community notebook](https://colab.research.google.com/drive/12INTTR6n64TzS4RrXZxMSXfrOd9Xzamo?usp=sharing); see the definition of `generate_answer(batch)`.
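
As a rough illustration of the global attention mask mentioned above: the Hugging Face LED documentation advises, for summarization, placing global attention only on the first `<s>` token. The sketch below builds such a mask with plain Python lists. The helper name `first_token_global_mask` is hypothetical and is not taken from the linked notebook.

```python
def first_token_global_mask(attention_mask):
    """Build a global attention mask: 1 on the first token of each row, 0 elsewhere.

    `attention_mask` is a batch of 0/1 rows, shaped like a tokenizer's output.
    """
    return [
        [1 if position == 0 else 0 for position in range(len(row))]
        for row in attention_mask
    ]


# Two padded sequences of length 5 (1 = real token, 0 = padding).
batch_attention_mask = [[1, 1, 1, 1, 1], [1, 1, 1, 0, 0]]
global_mask = first_token_global_mask(batch_attention_mask)
print(global_mask)
```

With tensors, the same mask would typically be created with `torch.zeros_like(input_ids)`, setting column 0 to 1, and then passed as `global_attention_mask=` to `model.generate(...)`.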
401
 
402
+ If you're facing computing constraints, consider using the base version [`pszemraj/led-base-book-summary`](https://huggingface.co/pszemraj/led-base-book-summary). All generation parameters on the API here match those of the base model, enabling easy comparison between versions.
403
 
404
+ ---
 
405
 
406
+ ## Training Information
407
 
408
+ ### Data
 
 
409
 
410
+ The model was trained on the [booksum](https://arxiv.org/abs/2105.08209) dataset. During training, the `chapter` column was the input, while `summary_text` was the output.
411
 
412
+ ### Procedure
 
413
 
414
+ Training was completed on the BookSum dataset across 13+ epochs. Notably, the final four epochs combined the training and validation sets as 'train' to enhance generalization.
415
 
416
+ ### Hyperparameters
417
 
418
+ The training process involved different settings across stages:
419
 
420
+ - **Initial Three Epochs:** Learning rate of 5e-05, batch size of 1, 4 gradient accumulation steps, and a linear learning rate scheduler.
421
+ - **In-between Epochs:** Learning rate reduced to 4e-05, batch size increased to 2, gradient accumulation steps increased to 16, and the scheduler switched to cosine with a 0.05 warmup ratio.
422
+ - **Final Two Epochs:** Further reduced learning rate (2e-05), batch size reverted to 1, maintained gradient accumulation steps at 16, and continued with a cosine learning rate scheduler, albeit with a lower warmup ratio (0.03).
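
To make the stage settings above easier to compare, the effective batch size per optimizer step can be estimated as per-device batch size × gradient accumulation steps × number of devices. This is a minimal sketch: the function name is hypothetical, and `num_devices=1` is an assumption chosen because it reproduces the `total_train_batch_size` values (4, 32, 16) listed in the previous revision of this README.

```python
def effective_batch_size(per_device_batch_size, gradient_accumulation_steps, num_devices=1):
    """Number of samples contributing to one optimizer step."""
    return per_device_batch_size * gradient_accumulation_steps * num_devices


# (per-device batch size, gradient accumulation steps) for each training stage.
stages = {
    "initial": (1, 4),
    "in-between": (2, 16),
    "final": (1, 16),
}

totals = {
    name: effective_batch_size(batch, accumulation)
    for name, (batch, accumulation) in stages.items()
}
for name, total in totals.items():
    print(f"{name}: {total}")
```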
423
 
424
+ ### Versions
425
 
426
+ - Transformers 4.19.2
427
+ - Pytorch 1.11.0+cu113
428
+ - Datasets 2.2.2
429
+ - Tokenizers 0.12.1
430
 
431
+ ---
432
 
433
+ ## Simplified Usage with TextSum
434
 
435
+ To streamline the process of using this and other models, I've developed [a Python package utility](https://github.com/pszemraj/textsum) named `textsum`. This package offers simple interfaces for applying summarization models to text documents of arbitrary length.
436
 
437
+ Install TextSum:
438
 
439
+ ```bash
+ pip install textsum
+ ```
442
+
443
+ Then use it in Python with this model:
444
+
445
+ ```python
+ from textsum.summarize import Summarizer
+
+ model_name = "pszemraj/led-large-book-summary"
+ summarizer = Summarizer(
+     model_name_or_path=model_name,  # you can use any Seq2Seq model on the Hub
+     token_batch_length=4096,  # tokens to batch summarize at a time, up to 16384
+ )
+ long_string = "This is a long string of text that will be summarized."
+ out_str = summarizer.summarize_string(long_string)
+ print(f"summary: {out_str}")
+ ```
457
+
458
+ Currently implemented interfaces include a Python API, a Command-Line Interface (CLI), and a demo/web UI.
459
+
460
+ For detailed explanations and documentation, check the [README](https://github.com/pszemraj/textsum) or the [wiki](https://github.com/pszemraj/textsum/wiki).
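
The `token_batch_length` argument above suggests that long inputs are summarized in fixed-size token batches. The following is only an illustrative sketch of that general idea, not `textsum`'s actual implementation: the helper `batch_tokens` is hypothetical, and plain integers stand in for token IDs.

```python
def batch_tokens(token_ids, batch_length):
    """Split a flat list of token IDs into consecutive batches of at most `batch_length`."""
    if batch_length <= 0:
        raise ValueError("batch_length must be positive")
    return [
        token_ids[start:start + batch_length]
        for start in range(0, len(token_ids), batch_length)
    ]


# Ten stand-in token IDs split into batches of up to four tokens each.
token_ids = list(range(10))
batches = batch_tokens(token_ids, batch_length=4)
print([len(batch) for batch in batches])
```

In a pipeline of this kind, each batch would typically be summarized on its own and the partial summaries joined afterwards; a larger batch length gives the model more context per pass at the cost of more GPU memory.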
461
+
462
+
463
+ ---
464
+
465
+ ## Related Models
466
+
467
+ Check out these other related models, also trained on the BookSum dataset:
468
+
469
+ - [Long-T5-tglobal-base](https://huggingface.co/pszemraj/long-t5-tglobal-base-16384-book-summary)
470
+ - [BigBird-Pegasus-Large-K](https://huggingface.co/pszemraj/bigbird-pegasus-large-K-booksum)
471
+ - [Pegasus-X-Large](https://huggingface.co/pszemraj/pegasus-x-large-book-summary)
472
+ - [Long-T5-tglobal-XL](https://huggingface.co/pszemraj/long-t5-tglobal-xl-16384-book-summary)
473
+
474
+ Other variants trained on other datasets are also available on my HF profile; feel free to try them out :)
475
+
476
+
477
+ ---