taskswithcode committed
Commit 07e062e
1 Parent(s): 56e7f3c
Files changed (3):
  1. imdb_sent.txt +2 -2
  2. run.sh +1 -1
  3. twc_embeddings.py +190 -0
imdb_sent.txt CHANGED
@@ -47,7 +47,7 @@ a mesmerizing film that certainly keeps your attention... Ben Daniels is fascina
  I hope this group of film-makers never re-unites.
  Unwatchable. You can't even make it past the first three minutes. And this is coming from a huge Adam Sandler fan!!1
  "One of the funniest movies made in recent years. Good characterization, plot and exceptional chemistry make this one a classic"
- "Add this little gem to your list of holiday regulars. It is<br /><br />sweet, funny, and endearing"
+ "Add this little gem to your list of holiday regulars. It is sweet, funny, and endearing"
  "no comment - stupid movie, acting average or worse... screenplay - no sense at all... SKIP IT!"
  "If you haven't seen this, it's terrible. It is pure trash. I saw this about 17 years ago, and I'm still screwed up from it."
  Absolutely fantastic! Whatever I say wouldn't do this underrated movie the justice it deserves. Watch it now! FANTASTIC!
@@ -56,7 +56,7 @@ Widow hires a psychopath as a handyman. Sloppy film noir thriller which doesn't
  The Fiendish Plot of Dr. Fu Manchu (1980). This is hands down the worst film I've ever seen. What a sad way for a great comedian to go out.
  "Obviously written for the stage. Lightweight but worthwhile. How can you go wrong with Ralph Richardson, Olivier and Merle Oberon."
  This movie turned out to be better than I had expected it to be. Some parts were pretty funny. It was nice to have a movie with a new plot.
- This movie is terrible. It's about some no brain surfin dude that inherits some company. Does Carrot Top have no shame?<br /><br />
+ This movie is terrible. It's about some no brain surfin dude that inherits some company. Does Carrot Top have no shame?
  Adrian Pasdar is excellent is this film. He makes a fascinating woman.
  "An unfunny, unworthy picture which is an undeserving end to Peter Sellers' career. It is a pity this movie was ever made."
  "The plot was really weak and confused. This is a true Oprah flick. (In Oprah's world, all men are evil and all women are victims.)"
run.sh CHANGED
@@ -1,2 +1,2 @@
- streamlit run app.py --server.port 80
+ streamlit run app.py --server.port 80 "1" "sim_app_examples.json" "sim_app_models.json"
 
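The updated command appends three positional arguments after the Streamlit target script. app.py is not part of this commit, so how they are consumed is an assumption, but they look like a default selection value plus the example and model configuration files for the similarity app. A minimal sketch of how app.py might read them from sys.argv (all names below are hypothetical):

import json
import sys

def load_app_config():
    # Hypothetical sketch; falls back to the values hard-coded in run.sh if the arguments are missing
    selection, examples_file, models_file = "1", "sim_app_examples.json", "sim_app_models.json"
    if len(sys.argv) >= 4:
        selection, examples_file, models_file = sys.argv[1:4]
    with open(examples_file) as fp:
        examples = json.load(fp)  # example sentences shown in the app (assumed)
    with open(models_file) as fp:
        models = json.load(fp)    # model registry used by the app (assumed)
    return selection, examples, models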
twc_embeddings.py CHANGED
@@ -1,4 +1,5 @@
  from transformers import AutoModel, AutoTokenizer
+ from transformers import AutoModelForCausalLM
  from scipy.spatial.distance import cosine
  import argparse
  import json
@@ -11,6 +12,195 @@ def read_text(input_file):
      return arr[:-1]
 
 
+ class CausalLMModel:
+     def __init__(self):
+         self.model = None
+         self.tokenizer = None
+         self.debug = False
+         print("In CausalLMModel Constructor")
+
+     def init_model(self,model_name = None):
+         # Get our models - The package will take care of downloading the models automatically
+         # For best performance: Muennighoff/SGPT-5.8B-weightedmean-nli-bitfit
+         if (self.debug):
+             print("Init model",model_name)
+         # For best performance: EleutherAI/gpt-j-6B
+         if (model_name is None):
+             model_name = "EleutherAI/gpt-neo-125M"
+         self.tokenizer = AutoTokenizer.from_pretrained(model_name)
+         self.model = AutoModelForCausalLM.from_pretrained(model_name)
+         self.model.eval()
+         self.prompt = 'Documents are searched to find matches with the same content.\nThe document "{}" is a good search result for "'
+
+     def compute_embeddings(self,input_data,is_file):
+         if (self.debug):
+             print("Computing embeddings for:", input_data[:20])
+         model = self.model
+         tokenizer = self.tokenizer
+
+         texts = read_text(input_data) if is_file == True else input_data
+         query = texts[0]
+         docs = texts[1:]
+
+         # Tokenize input texts
+         #print(f"Query: {query}")
+         scores = []
+         for doc in docs:
+             context = self.prompt.format(doc)
+
+             context_enc = tokenizer.encode(context, add_special_tokens=False)
+             continuation_enc = tokenizer.encode(query, add_special_tokens=False)
+             # Slice off the last token, as we take its probability from the one before
+             model_input = torch.tensor(context_enc+continuation_enc[:-1])
+             continuation_len = len(continuation_enc)
+             input_len, = model_input.shape
+
+             # [seq_len] -> [seq_len, vocab]
+             logprobs = torch.nn.functional.log_softmax(model(model_input)[0], dim=-1).cpu()
+             # [seq_len, vocab] -> [continuation_len, vocab]
+             logprobs = logprobs[input_len-continuation_len:]
+             # Gather the log probabilities of the continuation tokens -> [continuation_len]
+             logprobs = torch.gather(logprobs, 1, torch.tensor(continuation_enc).unsqueeze(-1)).squeeze(-1)
+             score = torch.sum(logprobs)
+             scores.append(score.tolist())
+         return texts,scores
+
+     def output_results(self,output_file,texts,scores,main_index = 0):
+         cosine_dict = {}
+         docs = texts[1:]
+         if (self.debug):
+             print("Total sentences",len(texts))
+         assert(len(scores) == len(docs))
+         for i in range(len(docs)):
+             cosine_dict[docs[i]] = scores[i]
+
+         if (self.debug):
+             print("Input sentence:",texts[main_index])
+         sorted_dict = dict(sorted(cosine_dict.items(), key=lambda item: item[1],reverse = True))
+         if (self.debug):
+             for key in sorted_dict:
+                 print("Document score for \"%s\" is: %.3f" % (key[:100], sorted_dict[key]))
+         if (output_file is not None):
+             with open(output_file,"w") as fp:
+                 fp.write(json.dumps(sorted_dict,indent=0))
+         return sorted_dict
+
+
+ class SGPTQnAModel:
+     def __init__(self):
+         self.model = None
+         self.tokenizer = None
+         self.debug = False
+         print("In SGPT Q&A Constructor")
+
+     def init_model(self,model_name = None):
+         # Get our models - The package will take care of downloading the models automatically
+         # For best performance: Muennighoff/SGPT-5.8B-weightedmean-nli-bitfit
+         if (self.debug):
+             print("Init model",model_name)
+         if (model_name is None):
+             model_name = "Muennighoff/SGPT-125M-weightedmean-msmarco-specb-bitfit"
+         self.tokenizer = AutoTokenizer.from_pretrained(model_name)
+         self.model = AutoModel.from_pretrained(model_name)
+         self.model.eval()
+         self.SPECB_QUE_BOS = self.tokenizer.encode("[", add_special_tokens=False)[0]
+         self.SPECB_QUE_EOS = self.tokenizer.encode("]", add_special_tokens=False)[0]
+
+         self.SPECB_DOC_BOS = self.tokenizer.encode("{", add_special_tokens=False)[0]
+         self.SPECB_DOC_EOS = self.tokenizer.encode("}", add_special_tokens=False)[0]
+
+     def tokenize_with_specb(self,texts, is_query):
+         # Tokenize without padding
+         batch_tokens = self.tokenizer(texts, padding=False, truncation=True)
+         # Add special brackets & pay attention to them
+         for seq, att in zip(batch_tokens["input_ids"], batch_tokens["attention_mask"]):
+             if is_query:
+                 seq.insert(0, self.SPECB_QUE_BOS)
+                 seq.append(self.SPECB_QUE_EOS)
+             else:
+                 seq.insert(0, self.SPECB_DOC_BOS)
+                 seq.append(self.SPECB_DOC_EOS)
+             att.insert(0, 1)
+             att.append(1)
+         # Add padding
+         batch_tokens = self.tokenizer.pad(batch_tokens, padding=True, return_tensors="pt")
+         return batch_tokens
+
+     def get_weightedmean_embedding(self,batch_tokens, model):
+         # Get the embeddings
+         with torch.no_grad():
+             # Get hidden state of shape [bs, seq_len, hid_dim]
+             last_hidden_state = self.model(**batch_tokens, output_hidden_states=True, return_dict=True).last_hidden_state
+
+         # Get weights of shape [bs, seq_len, hid_dim]
+         weights = (
+             torch.arange(start=1, end=last_hidden_state.shape[1] + 1)
+             .unsqueeze(0)
+             .unsqueeze(-1)
+             .expand(last_hidden_state.size())
+             .float().to(last_hidden_state.device)
+         )
+
+         # Get attn mask of shape [bs, seq_len, hid_dim]
+         input_mask_expanded = (
+             batch_tokens["attention_mask"]
+             .unsqueeze(-1)
+             .expand(last_hidden_state.size())
+             .float()
+         )
+
+         # Perform weighted mean pooling across seq_len: bs, seq_len, hidden_dim -> bs, hidden_dim
+         sum_embeddings = torch.sum(last_hidden_state * input_mask_expanded * weights, dim=1)
+         sum_mask = torch.sum(input_mask_expanded * weights, dim=1)
+
+         embeddings = sum_embeddings / sum_mask
+
+         return embeddings
+
+     def compute_embeddings(self,input_data,is_file):
+         if (self.debug):
+             print("Computing embeddings for:", input_data[:20])
+         model = self.model
+         tokenizer = self.tokenizer
+
+         texts = read_text(input_data) if is_file == True else input_data
+
+         queries = [texts[0]]
+         docs = texts[1:]
+         query_embeddings = self.get_weightedmean_embedding(self.tokenize_with_specb(queries, is_query=True), self.model)
+         doc_embeddings = self.get_weightedmean_embedding(self.tokenize_with_specb(docs, is_query=False), self.model)
+         return texts,(query_embeddings,doc_embeddings)
+
+     def output_results(self,output_file,texts,embeddings,main_index = 0):
+         # Calculate cosine similarities
+         # Cosine similarities are in [-1, 1]. Higher means more similar
+         query_embeddings = embeddings[0]
+         doc_embeddings = embeddings[1]
+         cosine_dict = {}
+         queries = [texts[0]]
+         docs = texts[1:]
+         if (self.debug):
+             print("Total sentences",len(texts))
+         for i in range(len(docs)):
+             cosine_dict[docs[i]] = 1 - cosine(query_embeddings[0], doc_embeddings[i])
+
+         if (self.debug):
+             print("Input sentence:",texts[main_index])
+         sorted_dict = dict(sorted(cosine_dict.items(), key=lambda item: item[1],reverse = True))
+         if (self.debug):
+             for key in sorted_dict:
+                 print("Cosine similarity with \"%s\" is: %.3f" % (key, sorted_dict[key]))
+         if (output_file is not None):
+             with open(output_file,"w") as fp:
+                 fp.write(json.dumps(sorted_dict,indent=0))
+         return sorted_dict
+
+
  class SimCSEModel:
      def __init__(self):
          self.model = None
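The two added classes expose the same init_model / compute_embeddings / output_results interface as the existing SimCSEModel: CausalLMModel ranks documents by the summed log-probability of the query as a continuation of a prompt built around each document, while SGPTQnAModel embeds query and documents separately with bracket special tokens and position-weighted mean pooling, then ranks by cosine similarity. The driver that instantiates these classes is not shown in this diff, so the following is only a rough usage sketch under that interface (input and output file names are illustrative):

from twc_embeddings import CausalLMModel, SGPTQnAModel

input_file = "imdb_sent.txt"  # first line is treated as the query, the rest as documents

# Causal-LM scoring (cross-encoder style): one forward pass per document
clm = CausalLMModel()
clm.init_model("EleutherAI/gpt-neo-125M")
texts, scores = clm.compute_embeddings(input_file, is_file=True)
clm.output_results("causal_lm_results.json", texts, scores)

# SGPT bi-encoder: embed query and documents once, then compare with cosine similarity
sgpt = SGPTQnAModel()
sgpt.init_model("Muennighoff/SGPT-125M-weightedmean-msmarco-specb-bitfit")
texts, embeddings = sgpt.compute_embeddings(input_file, is_file=True)
sgpt.output_results("sgpt_results.json", texts, embeddings)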