tags:
- NLP
- Text-Summarization
- CNN
---

# Seq2Seq Model with Attention for Text Summarization

This repository contains a Sequence-to-Sequence (Seq2Seq) model for text summarization, trained on the **CNN/DailyMail** dataset. The model is built with Keras and uses pre-trained GloVe embeddings for richer word representations. It follows an encoder-decoder architecture built on LSTM layers to capture long-range dependencies; attention can be layered on top, as discussed under Model Architecture below.

## Model Architecture

The model follows the classic encoder-decoder structure:

- **Embedding Layer**: Uses pre-trained GloVe embeddings (100-dimensional) for both the input (article) and target (summary) texts.
- **Encoder**: A bidirectional LSTM that encodes the input sequence; its forward and backward hidden states are concatenated.
- **Decoder**: An LSTM initialized with the encoder's hidden and cell states, used to generate the target sequence (summary).
- **Attention Mechanism**: The base code does not implement attention explicitly, but it can be integrated to improve summarization by focusing on the relevant parts of the input sequence during decoding.

### Embeddings

We use 100-dimensional GloVe embeddings pre-trained on a large text corpus. An embedding matrix is built from them for both the input (article) and output (summary) vocabularies.

```python
import numpy as np

# Load the pre-trained 100-dimensional GloVe vectors into a dictionary.
embedding_index = {}
embed_dim = 100
with open('../input/glove6b100dtxt/glove.6B.100d.txt') as f:
    for line in f:
        values = line.split()
        word = values[0]
        coefs = np.asarray(values[1:], dtype='float32')
        embedding_index[word] = coefs

# Embedding matrix for the input (articles)
t_embed = np.zeros((t_max_features, embed_dim))
for word, i in t_tokenizer.word_index.items():
    vec = embedding_index.get(word)
    if i < t_max_features and vec is not None:
        t_embed[i] = vec

# Embedding matrix for the output (summaries)
s_embed = np.zeros((s_max_features, embed_dim))
for word, i in s_tokenizer.word_index.items():
    vec = embedding_index.get(word)
    if i < s_max_features and vec is not None:
        s_embed[i] = vec
```
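
The code above assumes that the tokenizers (`t_tokenizer`, `s_tokenizer`), the vocabulary caps (`t_max_features`, `s_max_features`), and the padded sequences (`train_x`, `train_y`, `maxlen_text`) have already been prepared; they are not defined in this README. The following is only a rough sketch of one way to build them, with illustrative names and size limits that are not taken from the original code.

```python
# Illustrative preprocessing sketch (not from the original repository).
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences

t_max_features = 50000   # assumed vocabulary cap for articles
s_max_features = 30000   # assumed vocabulary cap for summaries
maxlen_text = 400        # assumed maximum article length (in tokens)
maxlen_summ = 50         # assumed maximum summary length (in tokens)

# Fit one tokenizer on the articles and one on the summaries.
t_tokenizer = Tokenizer(num_words=t_max_features)
t_tokenizer.fit_on_texts(train_articles)       # train_articles: list of article strings
s_tokenizer = Tokenizer(num_words=s_max_features)
s_tokenizer.fit_on_texts(train_summaries)      # train_summaries: list of summary strings

# Convert the texts to padded integer sequences for the encoder and decoder.
train_x = pad_sequences(t_tokenizer.texts_to_sequences(train_articles),
                        maxlen=maxlen_text, padding='post')
train_y = pad_sequences(s_tokenizer.texts_to_sequences(train_summaries),
                        maxlen=maxlen_summ, padding='post')
```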

## Encoder
A bidirectional LSTM encodes the input text. Its forward and backward hidden and cell states are concatenated and passed to the decoder as initial states.

```python
# Layer imports used by the encoder here and the decoder below.
from keras.layers import Input, Embedding, LSTM, Bidirectional, Concatenate, Dense, TimeDistributed

latent_dim = 128

# Encoder: frozen GloVe embeddings followed by a bidirectional LSTM.
enc_input = Input(shape=(maxlen_text,))
enc_embed = Embedding(t_max_features, embed_dim, input_length=maxlen_text, weights=[t_embed], trainable=False)(enc_input)
enc_lstm = Bidirectional(LSTM(latent_dim, return_state=True))
enc_output, enc_fh, enc_fc, enc_bh, enc_bc = enc_lstm(enc_embed)

# Concatenate the forward and backward states
enc_h = Concatenate(axis=-1)([enc_fh, enc_bh])
enc_c = Concatenate(axis=-1)([enc_fc, enc_bc])
```

## Decoder
The decoder is an LSTM that takes the encoder's final states as its initial states and generates the summary sequence.

```python
# Decoder: frozen summary embeddings, an LSTM seeded with the encoder states,
# and a time-distributed softmax over the summary vocabulary.
dec_input = Input(shape=(None,))
dec_embed = Embedding(s_max_features, embed_dim, weights=[s_embed], trainable=False)(dec_input)
dec_lstm = LSTM(latent_dim * 2, return_sequences=True, return_state=True, dropout=0.3, recurrent_dropout=0.2)
dec_outputs, _, _ = dec_lstm(dec_embed, initial_state=[enc_h, enc_c])

# Dense layer with softmax activation for the final output
dec_dense = TimeDistributed(Dense(s_max_features, activation='softmax'))
dec_output = dec_dense(dec_outputs)
```

## Model Summary
The full Seq2Seq model is compiled with sparse categorical crossentropy loss and the RMSProp optimizer.
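
A minimal sketch of how the training model might be assembled and compiled from the tensors defined above (variable names follow the snippets in this README):

```python
from keras.models import Model

# Wire the encoder and decoder into a single trainable graph.
model = Model([enc_input, dec_input], dec_output)
model.compile(loss='sparse_categorical_crossentropy', optimizer='rmsprop')
model.summary()
```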

### Model Visualization
A diagram of the model is generated with Keras' `plot_model` function:

![Seq2Seq Encoder-Decoder Model Architecture](./seq2seq_encoder_decoder.png)
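
For reference, a call along these lines produces the image above (the import path may be `keras.utils` or `keras.utils.vis_utils` depending on the Keras version):

```python
from keras.utils import plot_model

# Render the model graph, including layer output shapes, to a PNG file.
plot_model(model, to_file='seq2seq_encoder_decoder.png', show_shapes=True)
```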

## Training
The model is trained with early stopping to prevent overfitting. It is fit in batches of 128 samples for up to 10 epochs, with validation data used to monitor performance. The decoder input (`train_y[:, :-1]`) and the target (`train_y[:, 1:]`) are shifted by one position for teacher forcing.

```python
import keras

# Stop training when the validation loss stops improving.
early_stop = keras.callbacks.EarlyStopping(monitor='val_loss', mode='min', verbose=1, patience=2)

model.fit([train_x, train_y[:, :-1]],
          train_y.reshape(train_y.shape[0], train_y.shape[1], 1)[:, 1:],
          epochs=10,
          callbacks=[early_stop],
          batch_size=128,
          verbose=2,
          validation_data=([val_x, val_y[:, :-1]],
                           val_y.reshape(val_y.shape[0], val_y.shape[1], 1)[:, 1:]))
```

## Dataset
The CNN/DailyMail dataset is used for training and validation. It contains news articles paired with human-written summaries, which makes it well suited to the text summarization task; a loading sketch follows the list below.

- Train set: Used to train the model on article-summary pairs.
- Validation set: Used to evaluate model performance and to drive early stopping.
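
One convenient way to obtain the article/summary pairs (not part of the original code) is the Hugging Face `datasets` library; the variable names below match the illustrative preprocessing sketch earlier in this README.

```python
# Illustrative only: download CNN/DailyMail with the Hugging Face `datasets` library.
from datasets import load_dataset

dataset = load_dataset('cnn_dailymail', '3.0.0')
train_articles = dataset['train']['article']
train_summaries = dataset['train']['highlights']
val_articles = dataset['validation']['article']
val_summaries = dataset['validation']['highlights']
```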

## Requirements
- Python 3.x
- Keras
- TensorFlow
- NumPy
- GloVe embeddings (`glove.6B.100d`)

## How to Run
1. Download the CNN/DailyMail dataset and the pre-trained GloVe embeddings.
2. Preprocess the dataset and prepare the embedding matrices.
3. Train the model using the provided code.
4. Evaluate the model on the validation set and generate summaries for new text inputs (see the inference sketch below).
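
Generating summaries at inference time requires separate encoder and decoder models that feed the LSTM states back step by step. The sketch below is hypothetical and makes assumptions not present in the original code: the decoder `Embedding` layer is kept in a variable (here `dec_embedding_layer`) rather than applied inline, and the summary tokenizer reserves `sostok`/`eostok` as start and end tokens.

```python
# Hypothetical greedy-decoding sketch; dec_embedding_layer, 'sostok' and 'eostok'
# are assumptions, not names taken from the original code.
import numpy as np
from keras.models import Model
from keras.layers import Input

# Encoder inference model: padded article sequence -> initial decoder states.
encoder_model = Model(enc_input, [enc_h, enc_c])

# Decoder inference model: one token plus previous states -> probabilities and new states.
state_h_in = Input(shape=(latent_dim * 2,))
state_c_in = Input(shape=(latent_dim * 2,))
dec_emb_inf = dec_embedding_layer(dec_input)        # reuse the trained summary embedding layer
dec_out_inf, state_h, state_c = dec_lstm(dec_emb_inf, initial_state=[state_h_in, state_c_in])
dec_probs_inf = dec_dense(dec_out_inf)
decoder_model = Model([dec_input, state_h_in, state_c_in],
                      [dec_probs_inf, state_h, state_c])

def summarize(input_seq, max_summary_len=50):
    """Greedily decode one padded article sequence of shape (1, maxlen_text)."""
    h, c = encoder_model.predict(input_seq)
    target = np.array([[s_tokenizer.word_index['sostok']]])   # assumed start token
    words = []
    for _ in range(max_summary_len):
        probs, h, c = decoder_model.predict([target, h, c])
        token_id = int(np.argmax(probs[0, -1, :]))
        word = s_tokenizer.index_word.get(token_id, '')
        if word in ('', 'eostok'):                             # assumed end token
            break
        words.append(word)
        target = np.array([[token_id]])
    return ' '.join(words)
```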

## Results
The model generates abstractive summaries of news articles. You can tweak the latent dimensions and embedding sizes, and add an attention mechanism, to improve performance.

## Future Work
* Attention Mechanism: Implement Bahdanau or Luong attention for better results.
* Beam Search: Incorporate beam search for improved summary generation.

## Resources
- [Keras Documentation](https://keras.io/)
- [CNN/DailyMail Dataset](https://huggingface.co/datasets/cnn_dailymail)
- [GloVe Embeddings](https://nlp.stanford.edu/projects/glove/)