# Question generation
This notebook is inspired by the [Question Generation tutorial](https://haystack.deepset.ai/tutorials/question-generation) from the Haystack documentation.

Here we use a collection of articles about Twin Peaks to generate a variety of questions about that awesome TV series!

The following steps are performed:
* load data
* create document store and write documents
* generate questions and save them

## Preliminary operations

In [1]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
# install dependencies
! pip install farm-haystack[faiss-gpu]==1.4.0

## Load data

In [4]:
import glob
import json
import os

In [7]:
DATA_DIRECTORY = '/content/drive/MyDrive/Colab Notebooks/wklp/data'

docs=[]

for json_file in glob.glob(f'{DATA_DIRECTORY}/*.json'):
    # keep only files of at least 5000 bytes, i.e. the longer articles
    if os.path.getsize(json_file) >= 5000:
        with open(json_file, 'r') as fin:
            json_content = json.load(fin)

        # convert to the dict format expected by the document store
        doc = {'content': json_content['text'],
               'meta': {'name': json_content['name'],
                        'url': json_content['url']}}
        docs.append(doc)
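
The loading cell above assumes that every JSON file carries `text`, `name` and `url` keys. As a minimal sketch, the same filter-and-convert step could be wrapped in a reusable function; `load_large_docs` and `min_bytes` are hypothetical names (not part of Haystack), and the sample files written at the bottom are made up purely to exercise the function.

```python
import glob
import json
import os
import tempfile


def load_large_docs(directory, min_bytes=5000):
    """Return Haystack-style dicts for JSON files of at least min_bytes bytes.

    Hypothetical helper: assumes each file has 'text', 'name' and 'url' keys.
    """
    docs = []
    # sorted() makes the document order reproducible across runs
    for json_file in sorted(glob.glob(os.path.join(directory, '*.json'))):
        if os.path.getsize(json_file) < min_bytes:
            continue  # skip short articles
        with open(json_file, 'r', encoding='utf-8') as fin:
            json_content = json.load(fin)
        docs.append({
            'content': json_content['text'],
            'meta': {'name': json_content['name'],
                     'url': json_content['url']},
        })
    return docs


# Made-up sample data, only to exercise the function
with tempfile.TemporaryDirectory() as tmp_dir:
    for name, text in [('small', 'x' * 10), ('large', 'y' * 6000)]:
        path = os.path.join(tmp_dir, f'{name}.json')
        with open(path, 'w', encoding='utf-8') as fout:
            json.dump({'text': text,
                       'name': name,
                       'url': f'https://example.org/{name}'}, fout)
    sample_docs = load_large_docs(tmp_dir)

print([d['meta']['name'] for d in sample_docs])
```

Keeping the size threshold as a parameter makes it easy to try other cut-offs without editing the loop.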

In [8]:
len(docs)

134

In [9]:
docs[5]

{'content': 'Part 5\nNot to be confused with Episode 5.\n"Part 5" is the fifth episode of the 2017 series of Twin Peaks and the thirty-fifth episode of the franchise as a whole. It aired on June 4, 2017.\nPlot\n"Case files."\n ―Dale Cooper\nGene and Jake sit in a car, the former on the phone with Lorraine, reporting on the situation with Dougie Jones. Frustrated, she sends the message "2" (leaving 159 characters to type) to her contact "ARGENT" which causes a device in Buenos Aires to ring and flash twice with its two red lights.\nConstance Talbot, Detective Macklay, and Detective Harrison observe the John Doe in the morgue. Talbot confirms the decapitation as the man\'s cause of death and presents a ring found inside the body. On it is an inscription that reads, "To Dougie, with love, Janey-E."\nCooper\'s doppelganger sits in his jail cell and correctly predicts that his food is coming. He takes his food and goes to the mirror, noting that BOB is still with him.\nAt his place of emplo

## Create document store ([FAISS](https://github.com/facebookresearch/faiss)) and write documents



In [10]:
from haystack.document_stores import FAISSDocumentStore

# the document store settings are those compatible with Embedding Retriever
document_store = FAISSDocumentStore(
    similarity="dot_product",
    embedding_dim=768)

INFO - haystack.modeling.model.optimization -  apex not found, won't use it. See https://nvidia.github.io/apex/
ERROR - root -  Failed to import 'magic' (from 'python-magic' and 'python-magic-bin' on Windows). FileTypeClassifier will not perform mimetype detection on extensionless files. Please make sure the necessary OS libraries are installed if you need this functionality.
INFO - haystack.telemetry -  Haystack sends anonymous usage data to understand the actual usage and steer dev efforts towards features that are most meaningful to users. You can opt-out at anytime by calling disable_telemetry() or by manually setting the environment variable HAYSTACK_TELEMETRY_ENABLED as described for different operating systems on the documentation page. More information at https://haystack.deepset.ai/guides/telemetry


In [11]:
# write documents
document_store.write_documents(docs)


Writing Documents:   0%|          | 0/134 [00:00<?, ?it/s]

In [23]:
len(document_store.get_all_documents())

134

## Generate questions and save them

In [16]:
from haystack.nodes import QuestionGenerator
from haystack.pipelines import QuestionGenerationPipeline
from haystack.utils import print_questions


In [19]:
OUTPUT_QUESTIONS_FILE='/content/drive/MyDrive/Colab Notebooks/wklp/questions.txt'

In [24]:
# Initialize Question Generator
question_generator = QuestionGenerator()

question_generation_pipeline = QuestionGenerationPipeline(question_generator)
for idx, document in enumerate(document_store):
    # report progress (percentage of documents processed) every 5 documents
    if idx % 5 == 0:
        print(idx / len(docs) * 100)
    results = question_generation_pipeline.run(documents=[document])

    # build a text block: document index, content preview, generated questions
    questions_for_doc = f'{idx}: {document.content[:100]}...\n' + '-'*15 + '\n'
    if "generated_questions" in results:
        for result in results["generated_questions"]:
            for question in result["questions"]:
                questions_for_doc += f" - {question}\n"
    # append to the output file, so partial results survive an interrupted run
    with open(OUTPUT_QUESTIONS_FILE, 'a+') as fo:
        fo.write(questions_for_doc)

INFO - haystack.modeling.utils -  Using devices: CUDA
INFO - haystack.modeling.utils -  Number of GPUs: 1


0.0
3.731343283582089
7.462686567164178
11.194029850746269
14.925373134328357
18.65671641791045
22.388059701492537
26.119402985074625
29.850746268656714
33.582089552238806
37.3134328358209
41.04477611940299
44.776119402985074
48.507462686567166
52.23880597014925
55.970149253731336
59.70149253731343
63.43283582089553
67.16417910447761
70.8955223880597
74.6268656716418
78.35820895522389
82.08955223880598
85.82089552238806
89.55223880597015
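
The text block written for each document can also be built by a small pure function, which makes the output format easy to check without running the model. This is only a sketch: `format_questions` and `preview_chars` are hypothetical names, and `fake_results` is a made-up dict that mimics the shape read by the loop above (`generated_questions` holding a list of dicts with a `questions` list).

```python
def format_questions(idx, content, results, preview_chars=100):
    """Build the text block written to the questions file for one document.

    Hypothetical helper; `results` is assumed to look like
    {'generated_questions': [{'questions': [...]}, ...]}.
    """
    # header: document index, content preview, separator line
    lines = [f'{idx}: {content[:preview_chars]}...', '-' * 15]
    for result in results.get('generated_questions', []):
        for question in result['questions']:
            lines.append(f' - {question}')
    return '\n'.join(lines) + '\n'


# Made-up results dict, standing in for the pipeline output
fake_results = {
    'generated_questions': [
        {'questions': ['Who is Dale Cooper?', 'Where is Twin Peaks?']},
    ]
}
block = format_questions(0, 'Part 5\nNot to be confused with Episode 5.',
                         fake_results)
print(block)
```

A document for which no questions were generated still gets its header and separator line, matching the behaviour of the loop.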


In [30]:
# inspect the questions generated for the last processed document
print(questions_for_doc)
print_questions(results)

124: Jerry Horne
Jeremy "Jerry" Horne was the playboy brother of Benjamin Horne and the uncle of Audrey a...
---------------
 -  Who was the playboy brother of Benjamin Horne?
 -  What was Jerry Horne's uncle's name?
 -  Who did Jerry watch in his childhood?
 -  Who did Jerry watch dance with with a flashlight?
 -  Who was the place kicker in the starting lineup of Twin Peaks High School football team of 1968?
 -  Who was a member of the Bookhouse Boys?
 -  When did Big Ed graduate from Gonzaga University?
 -  Who graduated last in his class at Gonzaga in 1974?
 -  What year did Jerry begin working with his brother to find business partners for the Ghostwood Development Project?
 -  What state had Jerry's license to practice law revoked?
 -  Who was the owner of the Canadian brothel and casino?
 -  When did Jerry come to the Great Northern Hotel after a trip to Paris?
 -  What did Jerry bring to Ben?
 -  Who brought a sandwich for Ben?
 -  Who told Ben about the murder of Laura Palmer?