---
license: cc-by-4.0
datasets:
- abisee/cnn_dailymail
language:
- en
tags:
- NLP
- Text-Summarization
- CNN
---

# Seq2Seq Model with Attention for Text Summarization

This repository contains a Sequence-to-Sequence (Seq2Seq) model trained on the **CNN/DailyMail** dataset for abstractive text summarization. The model is built with Keras and leverages pre-trained GloVe embeddings for its word representations. It uses an encoder-decoder architecture with LSTM layers; an attention mechanism can be added on top to better capture long-range dependencies (see Future Work).

## Model Architecture

The model follows the classic encoder-decoder structure, which can be extended with attention to handle long sequences:

- **Embedding Layer**: Uses pre-trained GloVe embeddings (100-dimensional) for both the input (article) and target (summary) texts.
- **Encoder**: A bidirectional LSTM that encodes the input sequence. The forward and backward hidden and cell states are concatenated and passed to the decoder.
- **Decoder**: An LSTM initialized with the encoder's hidden and cell states to generate the target sequence (summary).
- **Attention Mechanism**: While the base code does not explicitly implement attention, this can be easily integrated to improve summarization by focusing on relevant parts of the input sequence during decoding.

### Embeddings

We use GloVe embeddings (100-dimensional) pre-trained on a large corpus of text data. The embedding matrix is constructed for both the input (text) and output (summary) using the GloVe embeddings.

```python
import numpy as np

# Build a word -> vector lookup from the pre-trained GloVe file
embedding_index = {}
embed_dim = 100
with open('../input/glove6b100dtxt/glove.6B.100d.txt') as f:
    for line in f:
        values = line.split()
        word = values[0]
        coefs = np.asarray(values[1:], dtype='float32')
        embedding_index[word] = coefs

# Embedding for input (articles)
t_embed = np.zeros((t_max_features, embed_dim))
for word, i in t_tokenizer.word_index.items():
    vec = embedding_index.get(word)
    if i < t_max_features and vec is not None:
        t_embed[i] = vec

# Embedding for output (summaries)
s_embed = np.zeros((s_max_features, embed_dim))
for word, i in s_tokenizer.word_index.items():
    vec = embedding_index.get(word)
    if i < s_max_features and vec is not None:
        s_embed[i] = vec

```
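The snippet above assumes that the tokenizers (`t_tokenizer`, `s_tokenizer`) and vocabulary limits (`t_max_features`, `s_max_features`) already exist. A minimal sketch of how they might be prepared with the Keras `Tokenizer` is shown below; the vocabulary caps, sequence lengths, and the `train_articles` / `train_summaries` lists are illustrative assumptions, not part of the original code:

```python
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

t_max_features = 80000   # illustrative vocabulary cap for articles
s_max_features = 10000   # illustrative vocabulary cap for summaries
maxlen_text = 800        # illustrative maximum article length in tokens
maxlen_summ = 150        # illustrative maximum summary length in tokens

# Fit one tokenizer on the articles and one on the summaries
t_tokenizer = Tokenizer(num_words=t_max_features)
t_tokenizer.fit_on_texts(train_articles)
s_tokenizer = Tokenizer(num_words=s_max_features)
s_tokenizer.fit_on_texts(train_summaries)

# Convert text to padded integer sequences
train_x = pad_sequences(t_tokenizer.texts_to_sequences(train_articles),
                        maxlen=maxlen_text, padding='post')
train_y = pad_sequences(s_tokenizer.texts_to_sequences(train_summaries),
                        maxlen=maxlen_summ, padding='post')
```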

## Encoder
A bidirectional LSTM encodes the input text. The forward and backward hidden and cell states are concatenated and passed as the initial states to the decoder.

```python
from tensorflow.keras.layers import Input, Embedding, LSTM, Bidirectional, Concatenate

latent_dim = 128
enc_input = Input(shape=(maxlen_text,))
enc_embed = Embedding(t_max_features, embed_dim, input_length=maxlen_text, weights=[t_embed], trainable=False)(enc_input)
enc_lstm = Bidirectional(LSTM(latent_dim, return_state=True))
enc_output, enc_fh, enc_fc, enc_bh, enc_bc = enc_lstm(enc_embed)

# Concatenate the forward and backward states
enc_h = Concatenate(axis=-1)([enc_fh, enc_bh])
enc_c = Concatenate(axis=-1)([enc_fc, enc_bc])
```

## Decoder
The decoder is an LSTM that takes the encoder's final states as the initial states to generate the output summary sequence.

```python
from tensorflow.keras.layers import Dense, TimeDistributed

dec_input = Input(shape=(None,))
dec_embed = Embedding(s_max_features, embed_dim, weights=[s_embed], trainable=False)(dec_input)
dec_lstm = LSTM(latent_dim * 2, return_sequences=True, return_state=True, dropout=0.3, recurrent_dropout=0.2)
dec_outputs, _, _ = dec_lstm(dec_embed, initial_state=[enc_h, enc_c])

# Dense layer with softmax activation for final output
dec_dense = TimeDistributed(Dense(s_max_features, activation='softmax'))
dec_output = dec_dense(dec_outputs)
```

## Model Summary
The full Seq2Seq model is compiled using sparse categorical crossentropy loss and the RMSProp optimizer.
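A minimal sketch of how the full model might be assembled and compiled, reusing the variable names from the code above:

```python
from tensorflow.keras.models import Model

# Full training model: (article tokens, shifted summary tokens) -> next summary tokens
model = Model([enc_input, dec_input], dec_output)
model.compile(optimizer='rmsprop', loss='sparse_categorical_crossentropy')
model.summary()
```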

### Model Visualization
A diagram of the model is generated using Keras' plot_model function:

![Seq2Seq Encoder-Decoder Model Architecture](./seq2seq_encoder_decoder.png)
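For reference, a call along these lines produces such a diagram (assuming the assembled model object is named `model`):

```python
from tensorflow.keras.utils import plot_model

# Save an architecture diagram with layer output shapes included
plot_model(model, to_file='seq2seq_encoder_decoder.png', show_shapes=True)
```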

## Training
The model is trained with early stopping (patience of 2 epochs on validation loss) to prevent overfitting. Training uses batches of 128 for at most 10 epochs, with a validation set for performance monitoring.

```python
from tensorflow import keras

early_stop = keras.callbacks.EarlyStopping(monitor='val_loss', mode='min', verbose=1, patience=2)
model.fit([train_x, train_y[:, :-1]], 
          train_y.reshape(train_y.shape[0], train_y.shape[1], 1)[:, 1:], 
          epochs=10, 
          callbacks=[early_stop], 
          batch_size=128, 
          verbose=2, 
          validation_data=([val_x, val_y[:, :-1]], val_y.reshape(val_y.shape[0], val_y.shape[1], 1)[:, 1:]))

```
## Dataset
The CNN/DailyMail dataset is used for training and validation. It contains news articles and their corresponding summaries, which makes it suitable for the text summarization task.

- Train set: Used to train the model on article-summary pairs.
- Validation set: Used for model performance evaluation and to apply early stopping.
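For example, the dataset can be loaded with the Hugging Face `datasets` library (the `3.0.0` configuration is the non-anonymized version commonly used for summarization):

```python
from datasets import load_dataset

# Load the non-anonymized CNN/DailyMail configuration
dataset = load_dataset("abisee/cnn_dailymail", "3.0.0")

train_articles = dataset["train"]["article"]
train_summaries = dataset["train"]["highlights"]
val_articles = dataset["validation"]["article"]
val_summaries = dataset["validation"]["highlights"]
```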
  
## Requirements
- Python 3.x
- Keras
- TensorFlow
- NumPy
- GloVe Embeddings

## How to Run
1. Download the CNN/DailyMail dataset and pre-trained GloVe embeddings.
2. Preprocess the dataset and prepare the embedding matrices.
3. Train the model using the provided code.
4. Evaluate the model on the validation set and generate summaries for new text inputs (a minimal decoding sketch follows below).
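Summary generation is not part of the training code above. A greedy-decoding sketch is given below; it assumes the decoder `Embedding` layer was kept as a named object (`dec_emb_layer`) when building the model, and that summaries were wrapped with `sostok` / `eostok` start and end tokens during preprocessing. Both are assumptions, not shown in the original code:

```python
import numpy as np
from tensorflow.keras.layers import Input
from tensorflow.keras.models import Model

# Encoder inference model: article tokens -> concatenated hidden/cell states
encoder_model = Model(enc_input, [enc_h, enc_c])

# Decoder inference model: one step at a time from the previous states
dec_h_in = Input(shape=(latent_dim * 2,))
dec_c_in = Input(shape=(latent_dim * 2,))
dec_emb_inf = dec_emb_layer(dec_input)  # assumed named Embedding layer
dec_out_inf, h_inf, c_inf = dec_lstm(dec_emb_inf, initial_state=[dec_h_in, dec_c_in])
dec_probs = dec_dense(dec_out_inf)
decoder_model = Model([dec_input, dec_h_in, dec_c_in], [dec_probs, h_inf, c_inf])

def decode_sequence(input_seq, max_summary_len=100):
    """Greedily decode a single tokenized, padded article into a summary string."""
    h, c = encoder_model.predict(input_seq, verbose=0)
    target = np.array([[s_tokenizer.word_index['sostok']]])  # assumed start token
    summary = []
    for _ in range(max_summary_len):
        probs, h, c = decoder_model.predict([target, h, c], verbose=0)
        token_id = int(np.argmax(probs[0, -1, :]))
        word = s_tokenizer.index_word.get(token_id, '')
        if word == 'eostok' or token_id == 0:  # assumed end token / padding
            break
        summary.append(word)
        target = np.array([[token_id]])
    return ' '.join(summary)
```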
   
## Results
The model generates abstractive summaries of news articles. You can tune the latent dimensions and embedding sizes, and add attention for improved performance.
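Using the decoding sketch from the How to Run section, generating a summary for one validation example might look like this (hypothetical usage; `val_x` is the padded validation input used during training):

```python
# Summarize the first validation article (already tokenized and padded)
sample = val_x[0:1]
print(decode_sequence(sample))
```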

## Future Work
* Attention Mechanism: Implementing Bahdanau or Luong attention for better results (a rough sketch is given below).
* Beam Search: Incorporating beam search for enhanced summary generation.
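As a starting point, here is a rough sketch of Bahdanau-style (additive) attention using Keras' built-in `AdditiveAttention` layer. Note that this changes the encoder to return its full output sequence (`return_sequences=True`), which differs from the encoder code shown earlier; it is an illustration, not the trained model:

```python
from tensorflow.keras.layers import (Input, Embedding, LSTM, Bidirectional, Concatenate,
                                     Dense, TimeDistributed, AdditiveAttention)

# Encoder now returns the full sequence so the decoder can attend over it
enc_input = Input(shape=(maxlen_text,))
enc_embed = Embedding(t_max_features, embed_dim, weights=[t_embed], trainable=False)(enc_input)
enc_seq, enc_fh, enc_fc, enc_bh, enc_bc = Bidirectional(
    LSTM(latent_dim, return_sequences=True, return_state=True))(enc_embed)
enc_h = Concatenate()([enc_fh, enc_bh])
enc_c = Concatenate()([enc_fc, enc_bc])

# Decoder as before, but its outputs are combined with an attention context
dec_input = Input(shape=(None,))
dec_embed = Embedding(s_max_features, embed_dim, weights=[s_embed], trainable=False)(dec_input)
dec_seq, _, _ = LSTM(latent_dim * 2, return_sequences=True, return_state=True)(
    dec_embed, initial_state=[enc_h, enc_c])

# Additive (Bahdanau-style) attention: decoder states attend over encoder outputs
context = AdditiveAttention()([dec_seq, enc_seq])
dec_concat = Concatenate()([dec_seq, context])
dec_output = TimeDistributed(Dense(s_max_features, activation='softmax'))(dec_concat)
```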

## Resources
- [Keras Documentation](https://keras.io/)
- [CNN/DailyMail Dataset](https://huggingface.co/datasets/cnn_dailymail)
- [GloVe Embeddings](https://nlp.stanford.edu/projects/glove/)