---
license: cc-by-4.0
datasets:
- abisee/cnn_dailymail
language:
- en
tags:
- NLP
- Text-Summarization
- CNN
---

# Seq2Seq Model with Attention for Text Summarization

This repository contains a Sequence-to-Sequence (Seq2Seq) model trained on the **CNN/DailyMail** dataset for abstractive text summarization. The model is built with Keras and uses pre-trained GloVe embeddings for its word representations. It follows an encoder-decoder architecture based on LSTM layers; an attention mechanism is not implemented in the base code, but is planned to better capture long-range dependencies (see Future Work).

## Model Architecture

The model follows the classic encoder-decoder structure:

- **Embedding Layers**: Pre-trained GloVe embeddings (100-dimensional) for both the input (article) and target (summary) texts.
- **Encoder**: A bidirectional LSTM that encodes the input sequence; the forward and backward states are concatenated.
- **Decoder**: An LSTM initialized with the encoder's hidden and cell states, used to generate the target sequence (summary).
- **Attention Mechanism**: Not implemented in the base code; it can be integrated so that the decoder focuses on the relevant parts of the input while decoding (a sketch is given under Future Work).

### Embeddings

We use 100-dimensional GloVe embeddings pre-trained on a large text corpus. An embedding matrix is built for both the input (article) and output (summary) vocabularies.

```python
import numpy as np

# Load the 100-dimensional GloVe vectors into a lookup dictionary.
embedding_index = {}
embed_dim = 100
with open('../input/glove6b100dtxt/glove.6B.100d.txt') as f:
    for line in f:
        values = line.split()
        word = values[0]
        coefs = np.asarray(values[1:], dtype='float32')
        embedding_index[word] = coefs

# t_tokenizer / s_tokenizer and t_max_features / s_max_features come from
# the preprocessing step (tokenizers fit on the articles and summaries).

# Embedding matrix for the input (articles)
t_embed = np.zeros((t_max_features, embed_dim))
for word, i in t_tokenizer.word_index.items():
    vec = embedding_index.get(word)
    if i < t_max_features and vec is not None:
        t_embed[i] = vec

# Embedding matrix for the output (summaries)
s_embed = np.zeros((s_max_features, embed_dim))
for word, i in s_tokenizer.word_index.items():
    vec = embedding_index.get(word)
    if i < s_max_features and vec is not None:
        s_embed[i] = vec
```

## Encoder

A bidirectional LSTM encodes the input text. The forward and backward hidden and cell states are concatenated and passed to the decoder as its initial states.

```python
from keras.layers import Input, Embedding, LSTM, Bidirectional, Concatenate

latent_dim = 128

# Encoder: frozen GloVe embeddings feeding a bidirectional LSTM.
enc_input = Input(shape=(maxlen_text,))
enc_embed = Embedding(t_max_features, embed_dim, input_length=maxlen_text,
                      weights=[t_embed], trainable=False)(enc_input)
enc_lstm = Bidirectional(LSTM(latent_dim, return_state=True))
enc_output, enc_fh, enc_fc, enc_bh, enc_bc = enc_lstm(enc_embed)

# Concatenate the forward and backward states so their size matches
# the decoder's latent dimension (2 * latent_dim).
enc_h = Concatenate(axis=-1)([enc_fh, enc_bh])
enc_c = Concatenate(axis=-1)([enc_fc, enc_bc])
```

## Decoder

The decoder is an LSTM that takes the encoder's final states as its initial states and generates the summary sequence token by token.

```python
from keras.layers import Dense, TimeDistributed

# The decoder's latent dimension is 2 * latent_dim to match the
# concatenated bidirectional encoder states.
dec_input = Input(shape=(None,))
dec_embed = Embedding(s_max_features, embed_dim, weights=[s_embed],
                      trainable=False)(dec_input)
dec_lstm = LSTM(latent_dim * 2, return_sequences=True, return_state=True,
                dropout=0.3, recurrent_dropout=0.2)
dec_outputs, _, _ = dec_lstm(dec_embed, initial_state=[enc_h, enc_c])

# Project each timestep onto the summary vocabulary.
dec_dense = TimeDistributed(Dense(s_max_features, activation='softmax'))
dec_output = dec_dense(dec_outputs)
```

## Model Summary

The full Seq2Seq model is compiled with sparse categorical crossentropy loss and the RMSProp optimizer.
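The assembly and compile step is not shown in the repository; a minimal sketch, assuming the `enc_input`, `dec_input`, and `dec_output` tensors defined above:

```python
from keras.models import Model

# Training model: article tokens plus teacher-forced summary tokens in,
# per-timestep distributions over the summary vocabulary out.
model = Model([enc_input, dec_input], dec_output)
model.compile(optimizer='rmsprop', loss='sparse_categorical_crossentropy')
model.summary()
```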

### Model Visualization

A diagram of the model is generated using Keras' `plot_model` function.
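
The exact call is not shown in the repository; it would look something like this, assuming the `model` object from the sketch above:

```python
from keras.utils import plot_model

# Write the architecture diagram shown below to disk.
plot_model(model, to_file='seq2seq_encoder_decoder.png', show_shapes=True)
```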

![Seq2Seq Encoder-Decoder Model Architecture](./seq2seq_encoder_decoder.png)

## Training

The model is trained with early stopping to prevent overfitting, using a batch size of 128 for at most 10 epochs, with a validation set for performance monitoring. Note the teacher forcing: the decoder input is the summary shifted right (`train_y[:, :-1]`) and the target is the summary shifted left (`[:, 1:]`).

```python
import keras

# Stop training once validation loss has not improved for 2 epochs.
early_stop = keras.callbacks.EarlyStopping(monitor='val_loss', mode='min',
                                           verbose=1, patience=2)
model.fit([train_x, train_y[:, :-1]],
          train_y.reshape(train_y.shape[0], train_y.shape[1], 1)[:, 1:],
          epochs=10,
          callbacks=[early_stop],
          batch_size=128,
          verbose=2,
          validation_data=([val_x, val_y[:, :-1]],
                           val_y.reshape(val_y.shape[0], val_y.shape[1], 1)[:, 1:]))
```

## Dataset

The CNN/DailyMail dataset is used for training and validation. It contains news articles paired with human-written highlights (summaries), which makes it well suited to the summarization task; a loading sketch follows the list below.

- Train set: used to train the model on article-summary pairs.
- Validation set: used to evaluate performance and drive early stopping.
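
A minimal loading and preprocessing sketch, assuming the Hugging Face `datasets` library, the dataset's `3.0.0` configuration, and illustrative values for the vocabulary and length caps:

```python
from datasets import load_dataset
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences

t_max_features = 40000   # article vocabulary cap (assumed)
maxlen_text = 400        # maximum article length in tokens (assumed)

# Articles live in the 'article' field, summaries in 'highlights'.
ds = load_dataset("abisee/cnn_dailymail", "3.0.0")
articles = ds["train"]["article"]

# Fit the article tokenizer used when building the embedding matrices,
# then convert and pad the articles to a fixed length.
t_tokenizer = Tokenizer(num_words=t_max_features)
t_tokenizer.fit_on_texts(articles)
train_x = pad_sequences(t_tokenizer.texts_to_sequences(articles),
                        maxlen=maxlen_text, padding='post')
```

The summary side (`s_tokenizer`, `train_y`) is prepared the same way over the `highlights` field.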

## Requirements

- Python 3.x
- Keras
- TensorFlow
- NumPy
- GloVe embeddings (glove.6B.100d)

## How to Run

1. Download the CNN/DailyMail dataset and the pre-trained GloVe embeddings.
2. Preprocess the dataset and prepare the embedding matrices.
3. Train the model using the provided code.
4. Evaluate the model on the validation set and generate summaries for new inputs (see the inference sketch below).
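
Generation needs separate inference models, since the trained network relies on teacher forcing. Below is a hedged sketch of the standard Keras seq2seq inference setup with greedy decoding. It assumes the decoder's Embedding layer was kept in a variable (here called `dec_emb_layer`, a hypothetical name) and that summaries were wrapped in start/end tokens ('sostok'/'eostok', also an assumption, since the preprocessing is not shown).

```python
import numpy as np
from keras.models import Model
from keras.layers import Input

# Encoder model: article tokens -> initial decoder states.
encoder_model = Model(enc_input, [enc_h, enc_c])

# Decoder model: one token plus previous states -> next-token
# distribution plus updated states.
state_h = Input(shape=(latent_dim * 2,))
state_c = Input(shape=(latent_dim * 2,))
dec_emb = dec_emb_layer(dec_input)          # hypothetical layer handle
dec_out, h, c = dec_lstm(dec_emb, initial_state=[state_h, state_c])
decoder_model = Model([dec_input, state_h, state_c],
                      [dec_dense(dec_out), h, c])

def decode_sequence(input_seq, max_summary_len=30):
    """Greedily decode one summary from a padded article sequence."""
    h, c = encoder_model.predict(input_seq, verbose=0)
    target = np.array([[s_tokenizer.word_index['sostok']]])  # start token (assumed)
    words = []
    for _ in range(max_summary_len):
        probs, h, c = decoder_model.predict([target, h, c], verbose=0)
        idx = int(np.argmax(probs[0, -1, :]))
        word = s_tokenizer.index_word.get(idx, '')
        if word == 'eostok' or not word:    # end token (assumed)
            break
        words.append(word)
        target = np.array([[idx]])          # feed the prediction back in
    return ' '.join(words)
```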

## Results

The model generates abstractive summaries of news articles. Performance can be improved by tuning the latent dimension and embedding size, and by adding an attention mechanism.

## Future Work

* Attention Mechanism: implement Bahdanau or Luong attention for better results (a sketch follows this list).
* Beam Search: incorporate beam search for higher-quality decoding, replacing the greedy loop in the inference sketch above.
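
Attention is not part of the repository's code; below is a minimal sketch of Bahdanau-style attention using Keras' built-in `AdditiveAttention` layer. It assumes the encoder is modified to return its full output sequence (`return_sequences=True`) and reuses the layer names from the code above.

```python
from keras.layers import AdditiveAttention, Bidirectional, Concatenate, LSTM

# The encoder must expose per-timestep outputs for attention to attend over.
enc_lstm = Bidirectional(LSTM(latent_dim, return_sequences=True, return_state=True))
enc_outputs, enc_fh, enc_fc, enc_bh, enc_bc = enc_lstm(enc_embed)
enc_h = Concatenate(axis=-1)([enc_fh, enc_bh])
enc_c = Concatenate(axis=-1)([enc_fc, enc_bc])

dec_outputs, _, _ = dec_lstm(dec_embed, initial_state=[enc_h, enc_c])

# Additive (Bahdanau-style) attention: decoder states query encoder outputs.
context = AdditiveAttention()([dec_outputs, enc_outputs])

# Concatenate the context vectors with the decoder outputs before the
# softmax projection over the summary vocabulary.
dec_concat = Concatenate(axis=-1)([dec_outputs, context])
dec_output = dec_dense(dec_concat)
```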

## Resources

- [Keras Documentation](https://keras.io/)
- [CNN/DailyMail Dataset](https://huggingface.co/datasets/cnn_dailymail)
- [GloVe Embeddings](https://nlp.stanford.edu/projects/glove/)