# **🐦TweetGPT - Data Pre-Processing**
Before training, we prepare and clean our raw tweets. The steps we performed are documented in this notebook.

## **Data import**
Our data consists of 912 JSON Lines (`.jl`) files containing tweets from all politicians in the German Parliament as of 2022.

For the analysis, we merged the `.jl` files into one CSV file and cleaned the text.

### Step 1: Install necessary libraries
In this step, we install the required libraries to process and clean the tweet data.


In [None]:
# Install necessary libraries
!pip install json_lines
!pip install textblob
!pip install google-colab
import csv
import json_lines
from tqdm import tqdm
import os
import pandas as pd
import re
import nltk
from textblob import TextBlob
import matplotlib.pyplot as plt
from google.colab import drive

### Step 2: Mount Google Drive
This step allows us to access files stored in Google Drive.

In [None]:
# Mount Google Drive
drive.mount('/content/drive/')

### Step 3: Display files in the specified directory 
List files in the directory to ensure that the data files are correctly located.

In [None]:
!ls "/content/drive/MyDrive/Colab Notebooks/ML4B/tweets_data"
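
To confirm that all files are present before converting them, the number of files can be compared against the expected 912. A minimal sketch, assuming the files carry the `.jl` extension; `count_jl_files` is a hypothetical helper and not part of the original notebook:

```python
import os

def count_jl_files(folder):
    """Count regular files in `folder` whose name ends with '.jl' (hypothetical helper)."""
    return sum(
        1
        for entry in os.listdir(folder)
        if entry.endswith(".jl") and os.path.isfile(os.path.join(folder, entry))
    )

# Example call with the folder listed above:
# count_jl_files("/content/drive/MyDrive/Colab Notebooks/ML4B/tweets_data")
```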

## **Data Conversion**
This function reads tweet data from the JSON Lines files in a folder and writes one row per tweet to a CSV file. Only records with an `http_status` of 200 are processed.

In [None]:
def tweets_to_csv(json_folder, csv_file):
    """Merge all JSON Lines files in json_folder into a single CSV file."""
    try:
        # Use a separate name for the file handle so csv_file keeps holding the path
        with open(csv_file, 'w', newline='', encoding='utf-8') as out_file:
            writer = csv.DictWriter(out_file, fieldnames=[
                "user_name", "Name", "Partei", "text", "hashtags", "mentions", "urls",
                "created_at", "conversation_id"
            ])
            writer.writeheader()

            for json_file in tqdm(os.listdir(json_folder)):
                with json_lines.open(os.path.join(json_folder, json_file)) as jl:
                    for json_data in jl:
                        # Skip records whose request did not succeed
                        if json_data.get('http_status') == 200:
                            account_name = json_data.get("account_name", "")
                            name = json_data.get("account_data", {}).get("Name", "")
                            partei = json_data.get("account_data", {}).get("Partei", "")
                            tweets = json_data.get("response", {}).get("data", [])

                            for tweet in tweets:
                                conversation_id = tweet.get("conversation_id", "")
                                tweet_text = tweet.get("text", "")
                                # Entities are optional, so fall back to empty lists
                                hashtags = [tag["tag"] for tag in tweet.get("entities", {}).get("hashtags", [])]
                                mentions = [mention["username"] for mention in tweet.get("entities", {}).get("mentions", [])]
                                urls = [url["expanded_url"] for url in tweet.get("entities", {}).get("urls", [])]
                                created_at = tweet.get("created_at", "")

                                writer.writerow({
                                    "user_name": account_name,
                                    "Name": name,
                                    "Partei": partei,
                                    "text": tweet_text,
                                    "hashtags": hashtags,
                                    "mentions": mentions,
                                    "urls": urls,
                                    "created_at": created_at,
                                    "conversation_id": conversation_id
                                })

        print("Extraction complete. Data saved to", csv_file)

    except Exception as e:
        print("Error:", e)
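
The nested lookups in `tweets_to_csv` imply a particular record layout. The sketch below applies the same extraction logic to a single synthetic record, so the expected structure is easy to see. The record contents are made up for illustration, and `extract_rows` is a hypothetical helper that mirrors the loop body rather than a function from this notebook:

```python
def extract_rows(record):
    """Turn one JSON Lines record into a list of flat row dicts (sketch)."""
    if record.get("http_status") != 200:
        return []  # failed requests are skipped, as in tweets_to_csv
    account = record.get("account_data", {})
    rows = []
    for tweet in record.get("response", {}).get("data", []):
        entities = tweet.get("entities", {})
        rows.append({
            "user_name": record.get("account_name", ""),
            "Name": account.get("Name", ""),
            "Partei": account.get("Partei", ""),
            "text": tweet.get("text", ""),
            "hashtags": [h["tag"] for h in entities.get("hashtags", [])],
            "mentions": [m["username"] for m in entities.get("mentions", [])],
            "urls": [u["expanded_url"] for u in entities.get("urls", [])],
            "created_at": tweet.get("created_at", ""),
            "conversation_id": tweet.get("conversation_id", ""),
        })
    return rows

# Synthetic record, invented purely for illustration
sample = {
    "http_status": 200,
    "account_name": "example_user",
    "account_data": {"Name": "Erika Mustermann", "Partei": "Beispielpartei"},
    "response": {"data": [{
        "text": "Hallo #Bundestag",
        "conversation_id": "1",
        "created_at": "2022-01-01T00:00:00.000Z",
        "entities": {"hashtags": [{"tag": "Bundestag"}]},
    }]},
}
rows = extract_rows(sample)
```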



Use the `tweets_to_csv` function to convert our data to CSV.

In [None]:
tweets_to_csv("/content/drive/MyDrive/Colab Notebooks/ML4B/tweets_data", "/content/drive/MyDrive/Colab Notebooks/ML4B/outputs/raw_bundestag_tweets.csv")
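
Because `hashtags`, `mentions`, and `urls` are Python lists, `csv.DictWriter` stores them as their string representation (for example `['Bundestag', 'Haushalt']`). If the lists are needed again later, they can be parsed back. A minimal sketch using only the standard library; `parse_list_cell` is a hypothetical helper name:

```python
import ast

def parse_list_cell(cell):
    """Parse a CSV cell like "['a', 'b']" back into a Python list (sketch)."""
    if not cell:
        return []
    value = ast.literal_eval(cell)  # evaluates Python literals only, no arbitrary code
    return list(value) if isinstance(value, (list, tuple)) else [value]

parsed = parse_list_cell("['Bundestag', 'Haushalt']")
empty = parse_list_cell("[]")
```

After loading the CSV with pandas, such a helper could be applied per column, for example with `df['hashtags'].apply(parse_list_cell)`, assuming every cell holds a string.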

## **Data Cleaning**

### Step 1: Filter Out Retweets
This function filters out retweets so that only original tweets remain in our dataset. This step is important for retaining the authenticity of each individual, because retweets contain text written by other users and would make a user's texts heterogeneous. For training we want low intra-class variance and high inter-class variance, meaning the texts should be as similar as possible within one user and as different as possible between users.

In [None]:
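def filter_csv(input_file, output_file, column_name, prefix):
    """Copy input_file to output_file, skipping rows whose column_name starts with prefix."""
    # Read and write with the same UTF-8 encoding that tweets_to_csv used
    with open(input_file, 'r', newline='', encoding='utf-8') as infile, \
            open(output_file, 'w', newline='', encoding='utf-8') as outfile:
        reader = csv.DictReader(infile)
        fieldnames = reader.fieldnames
        writer = csv.DictWriter(outfile, fieldnames=fieldnames)
        writer.writeheader()
        for row in reader:
            # Keep only rows that do not start with the given prefix
            if not row[column_name].startswith(prefix):
                writer.writerow(row)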

Use the `filter_csv` function to remove retweets from the dataset.

In [None]:
filter_csv("/content/drive/MyDrive/Colab Notebooks/ML4B/outputs/raw_bundestag_tweets.csv","/content/drive/MyDrive/Colab Notebooks/ML4B/outputs/bundestag_tweets_no_RT.csv",'text','RT')
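
Note that the plain prefix `'RT'` matches any text beginning with those two letters, so it would also drop an original tweet that starts with a word such as "RTL". A stricter check, sketched below, treats a tweet as a retweet only if it starts with `RT @`. That convention is an assumption about the data, and `is_retweet` is a hypothetical helper:

```python
def is_retweet(text):
    """Return True if the text looks like a classic retweet ("RT @user: ...")."""
    return text.startswith("RT @")

samples = [
    "RT @example_user: Wichtige Debatte heute",  # retweet
    "RTL berichtet heute aus dem Bundestag",     # original tweet starting with "RT"
    "Wichtige Debatte heute im Plenum",          # original tweet
]
kept = [s for s in samples if not is_retweet(s)]
```

The same effect can be obtained with the existing function by passing `'RT @'` as the `prefix` argument.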

### Step 2: Text cleaning
These functions clean the tweet text by fixing HTML entities and removing unwanted content.

In [None]:
def fix_text(text):
    # Replace HTML entity '&amp;' with '&'
    text = text.replace('&amp;', '&')
    # Replace HTML entity '&lt;' with '<'
    text = text.replace('&lt;', '<')
    # Replace HTML entity '&gt;' with '>'
    text = text.replace('&gt;', '>')
    return text
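
`fix_text` handles `&amp;`, `&lt;`, and `&gt;` explicitly. As an alternative sketch (not the approach used in this notebook), Python's standard library offers `html.unescape`, which decodes named and numeric HTML entities in one pass. `fix_text_unescape` is a hypothetical name:

```python
import html

def fix_text_unescape(text):
    """Decode HTML entities using the standard library (alternative sketch)."""
    return html.unescape(text)

decoded = fix_text_unescape("Bund &amp; Länder: Ausgaben &gt; Einnahmen")
```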

In [None]:
def clean_tweet(tweet, allow_new_lines=False):
    bad_start = ['http:', 'https:']
    for w in bad_start:
        tweet = re.sub(f" {w}\\S+", "", tweet)  # removes white space before url
        tweet = re.sub(f"{w}\\S+ ", "", tweet)  # in case a tweet starts with a url
        tweet = re.sub(f"\n{w}\\S+ ", "", tweet)  # in case the url is on a new line
        tweet = re.sub(f"\n{w}\\S+", "", tweet)  # in case the url is alone on a new line
        tweet = re.sub(f"{w}\\S+", "", tweet)  # any other case?
    tweet = re.sub(' +', ' ', tweet)  # replace multiple spaces with one space
    if not allow_new_lines:  # remove new lines
        tweet = ' '.join(tweet.split())
    return tweet.strip()
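
For the default case (`allow_new_lines=False`), the effect of `clean_tweet` can be illustrated with a compact sketch: strip every `http:` or `https:` URL, then normalise whitespace. `strip_urls` is a hypothetical helper for illustration only; it is not guaranteed to match `clean_tweet` on every input, for example when a URL is glued directly to the preceding word.

```python
import re

URL_PATTERN = re.compile(r"https?:\S+")

def strip_urls(tweet):
    """Remove URLs, then collapse all whitespace runs into single spaces."""
    without_urls = URL_PATTERN.sub("", tweet)
    return " ".join(without_urls.split())

example = strip_urls("Mehr dazu:  https://example.org/a\nhttp://example.org/b Danke!")
```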

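Here we flag tweets that consist almost entirely of URLs, mentions, or hashtags. A tweet counts as boring if fewer than three of its words are free of `http`, `@`, and `#`.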

In [None]:
def boring_tweet(tweet):
    "Check if this is a boring tweet"
    boring_stuff = ['http', '@', '#']
    not_boring_words = len([None for w in tweet.split() if all(bs not in w.lower() for bs in boring_stuff)])
    return not_boring_words < 3
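
To make the rule concrete, the sketch below restates it with an explicit word count and applies it to two made-up tweets. `count_content_words`, `is_boring`, and the `min_words` parameter are hypothetical names introduced for illustration:

```python
BORING_MARKERS = ("http", "@", "#")

def count_content_words(tweet):
    """Count words that contain no URL, mention, or hashtag marker."""
    return sum(
        1
        for word in tweet.split()
        if not any(marker in word.lower() for marker in BORING_MARKERS)
    )

def is_boring(tweet, min_words=3):
    """Same rule as boring_tweet: fewer than `min_words` content words."""
    return count_content_words(tweet) < min_words

# Invented example tweets
short_tweet = "Danke @example_user https://example.org/x"
long_tweet = "Heute debattieren wir im Bundestag #Haushalt"
```

Here `short_tweet` has a single content word ("Danke") and is flagged, while `long_tweet` has five and is kept.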


All cleaning functions are applied to the retweet-free tweets, and the result is saved as a new clean dataset. Boring tweets are flagged in a `boring` column rather than removed.

In [None]:
clean_tweets = pd.read_csv("/content/drive/MyDrive/Colab Notebooks/ML4B/outputs/bundestag_tweets_no_RT.csv")

clean_tweets['text'] = clean_tweets['text'].apply(fix_text)
clean_tweets['text'] = clean_tweets['text'].apply(clean_tweet)
clean_tweets['boring'] = clean_tweets['text'].apply(boring_tweet)

clean_tweets.to_csv("/content/drive/MyDrive/Colab Notebooks/ML4B/outputs/clean_tweets.csv", index=False)
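
The cell above only marks boring tweets in the `boring` column; it does not remove them. If they should be excluded before training, a boolean mask is enough. A minimal sketch on a tiny made-up dataframe (the rows are invented for illustration):

```python
import pandas as pd

demo = pd.DataFrame({
    "text": ["Danke @example_user", "Heute debattieren wir im Bundestag"],
    "boring": [True, False],
})

# Keep only rows where the flag is False, then drop the helper column
interesting = demo[~demo["boring"]].drop(columns="boring").reset_index(drop=True)
```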

## **Testing**
To ensure that our clean data still contains the majority of tweets from the raw dataset, we compare the shapes of the three dataframes.

In [None]:
raw_tweets = pd.read_csv("/content/drive/MyDrive/Colab Notebooks/ML4B/outputs/raw_bundestag_tweets.csv")
no_rt = pd.read_csv("/content/drive/MyDrive/Colab Notebooks/ML4B/outputs/bundestag_tweets_no_RT.csv")
clean_tweets = pd.read_csv("/content/drive/MyDrive/Colab Notebooks/ML4B/outputs/clean_tweets.csv")

In [None]:
raw_tweets.shape

In [None]:
no_rt.shape

In [None]:
clean_tweets.shape
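
Comparing the row counts as ratios makes it easier to see how much data each stage removes. A minimal sketch; `retention_report` is a hypothetical helper, and the counts in the example are made up:

```python
def retention_report(raw_rows, no_rt_rows, clean_rows):
    """Return the share of raw rows that survive each stage (sketch)."""
    if raw_rows <= 0:
        raise ValueError("raw_rows must be positive")
    return {
        "after_retweet_filter": no_rt_rows / raw_rows,
        "after_cleaning": clean_rows / raw_rows,
    }

# Made-up row counts, for illustration only
report = retention_report(1000, 600, 600)
```

In the notebook, the counts would come from `raw_tweets.shape[0]`, `no_rt.shape[0]`, and `clean_tweets.shape[0]`.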