Spaces: diegopacheco/gen-ai-multimodel-fun


diegopacheco committed on May 5

Commit edeaf50 • 1 Parent(s): ff067ae

v1 of the app


Files changed (7)

README.md +20 -12
audio.mp3 +0 -0
install-deps.sh +3 -0
main.py +61 -0
requirements.txt +19 -0
result.png +0 -0
run.sh +3 -0

README.md CHANGED

@@ -1,13 +1,21 @@
----
-title: Gen Ai Multimodel Fun
-emoji: 📊
-colorFrom: indigo
-colorTo: green
-sdk: gradio
-sdk_version: 4.29.0
-app_file: app.py
-pinned: false
-license: unlicense
----
-Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference

+### Result
+* Multi-models in action
+* Storytelling
+  * Given an image
+  * Generate the caption for the image
+  * Generate a background story for the text
+* Models and tools used:
+  * Salesforce/blip-image-captioning-base for image captioning
+  * gpt2 for text generation
+  * gTTS for text-to-speech; gTTS is a Python library and CLI tool to interface with Google Translate's text-to-speech API
+  * openai/whisper-large-v2 for speech recognition
+  * pipeline/sentiment-analysis task for sentiment analysis of the story text
+Result UI:
+<img src='result.png' />
+Audio Result:
+<audio controls>
+  <source src="audio.mp3" type="audio/mpeg">
+</audio>
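
As a reading aid, the stages listed in the README can be sketched as a chain of five callables. This is a minimal sketch, not the code in main.py: the `StoryResult` dataclass, the `run_story_pipeline` name, and the stand-in stage functions are all hypothetical. In the real app the stages would wrap BLIP, GPT-2, gTTS, Whisper, and the sentiment pipeline.

```python
from dataclasses import dataclass
from typing import Any, Callable


@dataclass
class StoryResult:
    """Hypothetical container for what each stage produced."""
    caption: str
    story: str
    audio_path: str
    transcript: str
    sentiment: Any


def run_story_pipeline(
    image: Any,
    caption_image: Callable[[Any], str],
    write_story: Callable[[str], str],
    speak: Callable[[str], str],
    transcribe: Callable[[str], str],
    classify: Callable[[str], Any],
) -> StoryResult:
    """Run image -> caption -> story -> audio -> transcript -> sentiment."""
    caption = caption_image(image)
    story = write_story(caption)
    audio_path = speak(story)            # returns the path of the saved audio
    transcript = transcribe(audio_path)
    sentiment = classify(transcript)     # sentiment is computed on the transcript
    return StoryResult(caption, story, audio_path, transcript, sentiment)


# Stand-in stages (assumptions for illustration only; no models are loaded).
result = run_story_pipeline(
    image=None,
    caption_image=lambda img: "a dog on a beach",
    write_story=lambda cap: cap + ". It was a sunny day.",
    speak=lambda text: "audio.mp3",
    transcribe=lambda path: "a dog on a beach it was a sunny day",
    classify=lambda text: [{"label": "POSITIVE"}],
)
print(result.caption)
```

Passing the stages in as arguments keeps the order of the steps visible and lets each one be swapped or tested without loading a model.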

audio.mp3 ADDED

Binary file (240 kB).

install-deps.sh ADDED

@@ -0,0 +1,3 @@

+#!/bin/bash
+
+pip install -r requirements.txt

main.py ADDED

@@ -0,0 +1,61 @@

+import os
+from PIL import Image
+from gtts import gTTS
+import torch
+import gradio as gr
+from torchvision.transforms import Compose, Resize, CenterCrop, ToTensor, Normalize
+from transformers import pipeline, GPT2LMHeadModel, GPT2Tokenizer
+
+def describe_photo(image):
+    image = Image.fromarray(image.astype('uint8'), 'RGB')
+    captioner = pipeline("image-to-text",model="Salesforce/blip-image-captioning-base")
+    results = captioner(image)
+    text = results[0]['generated_text']
+    print(f"Image caption is: {text}")
+    return text
+
+def generate_story(description):
+    model = GPT2LMHeadModel.from_pretrained("gpt2")
+    tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
+    inputs = tokenizer.encode(description + " [SEP] A funny and friendly story:", return_tensors='pt')
+    outputs = model.generate(input_ids=inputs,
+                             max_length=200,
+                             num_return_sequences=1,
+                             do_sample=True, temperature=0.7,  # temperature only applies when sampling
+                             no_repeat_ngram_size=2)
+    story = tokenizer.decode(outputs[0], skip_special_tokens=True)
+    return story
+
+def convert_to_audio(text):
+    tts = gTTS(text)
+    audio_file_path = "audio.mp3"
+    tts.save(audio_file_path)
+    return audio_file_path
+
+def audio_to_text(audio_file_path):
+    pipe = pipeline("automatic-speech-recognition", "openai/whisper-large-v2")
+    result = pipe(audio_file_path)  # use the argument instead of a hard-coded path
+    print(result)
+    return result['text']
+
+def sentiment_analysis(text):
+    sentiment_analyzer = pipeline("sentiment-analysis")
+    result = sentiment_analyzer(text)
+    print(result)
+    return result
+
+def app(image):
+    description = describe_photo(image)
+    story = generate_story(description)
+    audio_file = convert_to_audio(story)
+    transcribed_text = audio_to_text(audio_file)
+    sentiment = sentiment_analysis(transcribed_text)
+    return description,audio_file,transcribed_text, sentiment
+
+ui = gr.Interface(
+    fn=app,
+    inputs="image",
+    outputs=["text", "audio", "text", "text"],
+    title="Diego's Story Telling Multimodel LLM Gen AI"
+)
+ui.launch()
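
Each function in main.py calls `pipeline(...)` or `from_pretrained(...)` inside its body, so the models are instantiated again on every request. One common way to avoid that is to cache the loader. The sketch below is an assumption about how this could be done, not part of the commit: `make_cached_loader` is a hypothetical helper, and `fake_loader` stands in for `transformers.pipeline` so the example runs without downloading any model.

```python
from functools import lru_cache
from typing import Any, Callable, List, Optional, Tuple


def make_cached_loader(load: Callable[..., Any]) -> Callable[..., Any]:
    """Wrap a loader so each (task, model) pair is built at most once."""

    @lru_cache(maxsize=None)
    def cached(task: str, model: Optional[str] = None) -> Any:
        if model is None:
            return load(task)
        return load(task, model=model)

    return cached


# Stand-in for transformers.pipeline (hypothetical; records every real load).
calls: List[Tuple[str, Optional[str]]] = []


def fake_loader(task: str, model: Optional[str] = None) -> str:
    calls.append((task, model))
    return f"{task}:{model}"


get_pipeline = make_cached_loader(fake_loader)
first = get_pipeline("image-to-text", "Salesforce/blip-image-captioning-base")
second = get_pipeline("image-to-text", "Salesforce/blip-image-captioning-base")
print(first is second, len(calls))
```

With the real library, something like `get_pipeline = make_cached_loader(pipeline)` could be defined once at module level, assuming the pipeline objects are safe to reuse across requests.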

requirements.txt ADDED

@@ -0,0 +1,19 @@

+numpy
+transformers
+sentence-transformers
+seaborn
+torch
+torchvision
+matplotlib
+pandas
+scikit-learn
+nltk
+gensim
+tensorflow
+keras
+opencv-python
+fastapi
+uvicorn
+gTTS
+openai-clip
+gradio
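
requirements.txt lists 19 packages, while main.py imports from only a handful of modules. A small standard-library script can list a file's imported root modules for comparison. This is a sketch: `top_level_imports` is a hypothetical helper, and import names do not always match PyPI package names (for example, `PIL` is provided by Pillow), so the result is a starting point rather than a definitive dependency list.

```python
import ast
from typing import Set


def top_level_imports(source: str) -> Set[str]:
    """Return the root module names imported anywhere in the given source."""
    modules: Set[str] = set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            for alias in node.names:
                modules.add(alias.name.split(".")[0])
        elif isinstance(node, ast.ImportFrom) and node.module and node.level == 0:
            # level == 0 skips relative imports such as "from . import x"
            modules.add(node.module.split(".")[0])
    return modules


# The import block from main.py, copied from the diff above.
sample = """
import os
from PIL import Image
from gtts import gTTS
import torch
import gradio as gr
from torchvision.transforms import Compose, Resize, CenterCrop, ToTensor, Normalize
from transformers import pipeline, GPT2LMHeadModel, GPT2Tokenizer
"""
print(sorted(top_level_imports(sample)))
```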

result.png ADDED

run.sh ADDED

@@ -0,0 +1,3 @@

+#!/bin/bash
+
+python main.py