Commit 55f0ce3 by seanpedrickcase · 1 parent: 04a15c5

Can split passages into sentences. Improved embedding, LLM representation models, improved zero shot capabilities

.dockerignore ADDED
@@ -0,0 +1,24 @@
+ *.pyc
+ *.ipynb
+ *.zip
+ *.npz
+ *.csv
+ *.xlsx
+ *.xls
+ *.pkl
+ *.parquet
+ *.png
+ *.safetensors
+ *.json
+ *.html
+ *.log
+ *.spec
+ *.bin
+ .ipynb_checkpoints/*
+ old_code/*
+ model/*
+ output_model/*
+ data/*
+ build_deps/*
+ dist/*
+ build/*
.gitignore CHANGED
@@ -1,5 +1,6 @@
  *.pyc
  *.ipynb
+ *.zip
  *.npz
  *.csv
  *.xlsx
@@ -12,6 +13,7 @@
  *.html
  *.log
  *.spec
+ *.bin
  .ipynb_checkpoints/*
  old_code/*
  model/*
README.md CHANGED
@@ -14,8 +14,8 @@ license: apache-2.0
14
 
15
  Generate topics from open text in tabular data, based on [BERTopic](https://maartengr.github.io/BERTopic/). Upload a data file (csv, xlsx, or parquet), then specify the open text column that you want to use to generate topics. Click 'Extract topics' after you have selected the minimum similar documents per topic and maximum total topics. Duplicate this space, or clone to your computer to avoid queues here!
16
 
17
- Uses fast TF-IDF-based embeddings by default, which are fast but not very performant in terms of cluster. Change to [Mixedbread large v1](https://huggingface.co/mixedbread-ai/mxbai-embed-large-v1) model embeddings (512 dimensions, 8 bit quantisation) on the options page for topics of much higher quality, but slower processing time. If you have an embeddings .npz file previously made using this model, you can load this in at the same time to skip the first modelling step. If you have a pre-defined list of topics for zero-shot modelling, you can upload this as a csv file under 'I have my own list of topics...'. Further configuration options are available under the 'Options' tab. Topic representation with LLMs currently based on [Phi-3-mini-128k-instruct-GGUF](https://huggingface.co/QuantFactory/Phi-3-mini-128k-instruct-GGUF), which is quite slow on CPU, so use a GPU-enabled computer if possible, building from the requirements_gpu.txt file in the base folder.
18
 
19
  For small datasets, consider breaking up your text into sentences under 'Clean data' -> 'Split open text...' before topic modelling.
20
 
21
- I suggest [Wikipedia mini dataset](https://huggingface.co/datasets/rag-datasets/mini_wikipedia/tree/main/data) for testing the tool here, choose passages.parquet.
 
14
 
15
  Generate topics from open text in tabular data, based on [BERTopic](https://maartengr.github.io/BERTopic/). Upload a data file (csv, xlsx, or parquet), then specify the open text column that you want to use to generate topics. Click 'Extract topics' after you have selected the minimum similar documents per topic and maximum total topics. Duplicate this space, or clone to your computer to avoid queues here!
16
 
17
+ Uses TF-IDF-based embeddings by default, which are fast to compute but do not lead to high-quality clustering. Change to [mxbai-embed-large-v1](https://huggingface.co/mixedbread-ai/mxbai-embed-large-v1) model embeddings (512 dimensions) for better results at the cost of slower processing. If you have an embeddings .npz file previously made using this model, you can load this in at the same time to skip the first modelling step. If you have a pre-defined list of topics for zero-shot modelling, you can upload this as a csv file under 'I have my own list of topics...'. Further configuration options are available, such as maximum topics allowed and minimum documents per topic. Topic representation with LLMs is currently based on [Phi-3-mini-128k-instruct-GGUF](https://huggingface.co/QuantFactory/Phi-3-mini-128k-instruct-GGUF), which is quite slow on CPU, so use a GPU-enabled computer if possible, building from the requirements_gpu.txt file in the base folder.
18
 
19
  For small datasets, consider breaking up your text into sentences under 'Clean data' -> 'Split open text...' before topic modelling.
20
 
21
+ I suggest the [Wikipedia mini dataset](https://huggingface.co/datasets/rag-datasets/mini_wikipedia/tree/main/data) for testing the tool; choose the passages.parquet file for download.
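
The README text above mentions zero-shot modelling with a user-supplied topic list, and the app exposes a minimum-similarity slider for it. A minimal sketch of that thresholding idea, assuming cosine similarity over toy 3-dimensional vectors: the function names, the vectors, and the cutoffs below are illustrative only and are not taken from the repository or from BERTopic's internals.

```python
import math
from typing import Dict, Optional, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def assign_zero_shot_topic(
    doc_vec: Sequence[float],
    topic_vecs: Dict[str, Sequence[float]],
    min_similarity: float = 0.55,  # hypothetical default, mirroring the slider value
) -> Optional[str]:
    """Return the best-matching topic name, or None if no topic is similar enough."""
    best_name, best_score = None, -1.0
    for name, vec in topic_vecs.items():
        score = cosine_similarity(doc_vec, vec)
        if score > best_score:
            best_name, best_score = name, score
    return best_name if best_score >= min_similarity else None


# Toy embeddings (made up for illustration)
topic_vecs = {"sport": [1.0, 0.0, 0.0], "politics": [0.0, 1.0, 0.0]}
close_doc = [0.9, 0.1, 0.0]   # points almost exactly along "sport"
far_doc = [0.3, 0.3, 0.9]     # mostly unrelated to both topics

print(assign_zero_shot_topic(close_doc, topic_vecs))
print(assign_zero_shot_topic(far_doc, topic_vecs))
print(assign_zero_shot_topic(far_doc, topic_vecs, min_similarity=0.2))
```

With a high cutoff the unrelated document is left unassigned; lowering the cutoff pulls it into a predefined topic, which is the trade-off the slider label warns about.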
app.py CHANGED
@@ -6,10 +6,14 @@ import gradio as gr
6
  import pandas as pd
7
  import numpy as np
8
 
9
- from funcs.topic_core_funcs import pre_clean, extract_topics, reduce_outliers, represent_topics, visualise_topics, save_as_pytorch_model
10
- from funcs.helper_functions import initial_file_load, custom_regex_load
11
  from sklearn.feature_extraction.text import CountVectorizer
12
 
 
 
 
 
13
 
14
  # Gradio app
15
 
@@ -17,6 +21,7 @@ block = gr.Blocks(theme = gr.themes.Base())
17
 
18
  with block:
19
 
 
20
  data_state = gr.State(pd.DataFrame())
21
  embeddings_state = gr.State(np.array([]))
22
  embeddings_type_state = gr.State("")
@@ -26,18 +31,20 @@ with block:
26
  docs_state = gr.State()
27
  data_file_name_no_ext_state = gr.State()
28
  label_list_state = gr.State(pd.DataFrame())
29
- vectoriser_state = gr.State(CountVectorizer(stop_words="english", ngram_range=(1, 2), min_df=0.1, max_df=0.95))
 
 
30
 
31
  gr.Markdown(
32
  """
33
  # Topic modeller
34
  Generate topics from open text in tabular data, based on [BERTopic](https://maartengr.github.io/BERTopic/). Upload a data file (csv, xlsx, or parquet), then specify the open text column that you want to use to generate topics. Click 'Extract topics' after you have selected the minimum similar documents per topic and maximum total topics. Duplicate this space, or clone to your computer to avoid queues here!
35
 
36
- Uses fast TF-IDF-based embeddings by default, which are fast but not very performant in terms of cluster. Change to [Mixedbread large v1](https://huggingface.co/mixedbread-ai/mxbai-embed-large-v1) model embeddings (512 dimensions, 8 bit quantisation) on the options page for topics of much higher quality, but slower processing time. If you have an embeddings .npz file previously made using this model, you can load this in at the same time to skip the first modelling step. If you have a pre-defined list of topics for zero-shot modelling, you can upload this as a csv file under 'I have my own list of topics...'. Further configuration options are available under the 'Options' tab. Topic representation with LLMs currently based on [Phi-3-mini-128k-instruct-GGUF](https://huggingface.co/QuantFactory/Phi-3-mini-128k-instruct-GGUF), which is quite slow on CPU, so use a GPU-enabled computer if possible, building from the requirements_gpu.txt file in the base folder.
37
 
38
  For small datasets, consider breaking up your text into sentences under 'Clean data' -> 'Split open text...' before topic modelling.
39
 
40
- I suggest [Wikipedia mini dataset](https://huggingface.co/datasets/rag-datasets/mini_wikipedia/tree/main/data) for testing the tool here, choose passages.parquet.
41
  """)
42
 
43
  with gr.Tab("Load files and find topics"):
@@ -48,23 +55,34 @@ with block:
48
 
49
  with gr.Accordion("Clean data", open = False):
50
  with gr.Row():
51
- clean_text = gr.Dropdown(value = "No", choices=["Yes", "No"], multiselect=False, label="Clean data - remove html, numbers with > 1 digits, emails, postcodes (UK), custom regex.")
52
- drop_duplicate_text = gr.Dropdown(value = "No", choices=["Yes", "No"], multiselect=False, label="Remove duplicate text, drop < 50 char strings. May make old embedding files incompatible due to differing lengths.")
53
- anonymise_drop = gr.Dropdown(value = "No", choices=["Yes", "No"], multiselect=False, label="Anonymise data on file load. Personal details are redacted - not 100% effective. This is slow!")
54
- split_sentence_drop = gr.Dropdown(value = "No", choices=["Yes", "No"], multiselect=False, label="Split open text into sentences. Useful for small datasets.")
55
  with gr.Row():
56
- custom_regex = gr.UploadButton(label="Import custom regex file", file_count="multiple")
57
- gr.Markdown("""Import custom regex - csv table with one column of regex patterns with no header. Example pattern: (?i)roosevelt for case insensitive removal of this term.""")
58
  custom_regex_text = gr.Textbox(label="Custom regex load status")
59
  clean_btn = gr.Button("Clean data")
60
 
61
  with gr.Accordion("I have my own list of topics (zero shot topic modelling).", open = False):
62
  candidate_topics = gr.File(label="Input topics from file (csv). File should have at least one column with a header and topic keywords in cells below. Topics will be taken from the first column of the file. Currently not compatible with low-resource embeddings.")
63
- zero_shot_similarity = gr.Slider(minimum = 0.5, maximum = 1, value = 0.65, step = 0.001, label = "Minimum similarity value for document to be assigned to zero-shot topic.")
 
 
 
64
 
65
  with gr.Row():
66
- min_docs_slider = gr.Slider(minimum = 2, maximum = 1000, value = 5, step = 1, label = "Minimum number of similar documents needed to make a topic.")
67
- max_topics_slider = gr.Slider(minimum = 2, maximum = 500, value = 50, step = 1, label = "Maximum number of topics")
 
 
 
 
 
 
 
 
68
 
69
  with gr.Row():
70
  topics_btn = gr.Button("Extract topics", variant="primary")
@@ -78,12 +96,12 @@ with block:
78
  representation_type = gr.Dropdown(label = "Method for generating new topic labels", value="Default", choices=["Default", "MMR", "KeyBERT", "LLM"])
79
  represent_llm_btn = gr.Button("Change topic labels")
80
  with gr.Row():
81
- reduce_outliers_btn = gr.Button("Reduce outliers")
82
  save_pytorch_btn = gr.Button("Save model in Pytorch format")
83
 
84
  with gr.Tab("Visualise"):
85
  with gr.Row():
86
- visualisation_type_radio = gr.Radio(label="Visualisation type", choices=["Topic document graph", "Hierarchical view"])
87
  in_label = gr.Dropdown(choices=["Choose a column"], multiselect = True, label="Select column for labelling documents in output visualisations.")
88
  sample_slide = gr.Slider(minimum = 0.01, maximum = 1, value = 0.1, step = 0.01, label = "Proportion of data points to show on output visualisations.")
89
  legend_label = gr.Textbox(label="Custom legend column (optional, any column from the topic details output)", visible=False)
@@ -98,36 +116,43 @@ with block:
98
  with gr.Tab("Options"):
99
  with gr.Accordion("Data load and processing options", open = True):
100
  with gr.Row():
101
- seed_number = gr.Number(label="Random seed to use for dimensionality reduction.", minimum=0, step=1, value=42, precision=0)
102
  calc_probs = gr.Dropdown(label="Calculate all topic probabilities", value="No", choices=["Yes", "No"])
103
  with gr.Row():
104
- low_resource_mode_opt = gr.Dropdown(label = "Use low resource (TF-IDF) embeddings and processing.", value="Yes", choices=["Yes", "No"])
105
- embedding_super_compress = gr.Dropdown(label = "Round embeddings to three dp for smaller files with less accuracy.", value="No", choices=["Yes", "No"])
106
- with gr.Row():
107
  return_intermediate_files = gr.Dropdown(label = "Return intermediate processing files from file preparation.", value="Yes", choices=["Yes", "No"])
108
  save_topic_model = gr.Dropdown(label = "Save topic model to BERTopic format pkl file.", value="No", choices=["Yes", "No"])
109
 
110
  # Load in data. Update column names dropdown when file uploaded
111
- in_files.upload(fn=initial_file_load, inputs=[in_files], outputs=[in_colnames, in_label, data_state, output_single_text, topic_model_state, embeddings_state, data_file_name_no_ext_state, label_list_state])
 
 
 
112
 
113
  # Clean data
114
  custom_regex.upload(fn=custom_regex_load, inputs=[custom_regex], outputs=[custom_regex_text, custom_regex_state])
115
- clean_btn.click(fn=pre_clean, inputs=[data_state, in_colnames, data_file_name_no_ext_state, custom_regex_state, clean_text, drop_duplicate_text, anonymise_drop, split_sentence_drop], outputs=[output_single_text, output_file, data_state, data_file_name_no_ext_state], api_name="clean")
 
 
 
116
 
117
  # Extract topics
118
- topics_btn.click(fn=extract_topics, inputs=[data_state, in_files, min_docs_slider, in_colnames, max_topics_slider, candidate_topics, data_file_name_no_ext_state, label_list_state, return_intermediate_files, embedding_super_compress, low_resource_mode_opt, save_topic_model, embeddings_state, embeddings_type_state, zero_shot_similarity, seed_number, calc_probs, vectoriser_state], outputs=[output_single_text, output_file, embeddings_state, embeddings_type_state, data_file_name_no_ext_state, topic_model_state, docs_state, vectoriser_state, assigned_topics_state], api_name="topics")
119
 
120
  # Reduce outliers
121
- reduce_outliers_btn.click(fn=reduce_outliers, inputs=[topic_model_state, docs_state, embeddings_state, data_file_name_no_ext_state, assigned_topics_state, vectoriser_state, save_topic_model], outputs=[output_single_text, output_file, topic_model_state], api_name="reduce_outliers")
122
 
123
  # Re-represent topic labels
124
- represent_llm_btn.click(fn=represent_topics, inputs=[topic_model_state, docs_state, data_file_name_no_ext_state, low_resource_mode_opt, save_topic_model, representation_type, vectoriser_state], outputs=[output_single_text, output_file, topic_model_state], api_name="represent_llm")
125
 
126
  # Save in Pytorch format
127
  save_pytorch_btn.click(fn=save_as_pytorch_model, inputs=[topic_model_state, data_file_name_no_ext_state], outputs=[output_single_text, output_file], api_name="pytorch_save")
128
 
129
  # Visualise topics
130
- plot_btn.click(fn=visualise_topics, inputs=[topic_model_state, data_state, data_file_name_no_ext_state, low_resource_mode_opt, embeddings_state, in_label, in_colnames, legend_label, sample_slide, visualisation_type_radio, seed_number], outputs=[vis_output_single_text, out_plot_file, plot, plot_2], api_name="plot")
 
 
 
131
 
132
  # Launch the Gradio app
133
  if __name__ == "__main__":
 
6
  import pandas as pd
7
  import numpy as np
8
 
9
+ from funcs.topic_core_funcs import pre_clean, optimise_zero_shot, extract_topics, reduce_outliers, represent_topics, visualise_topics, save_as_pytorch_model, change_default_vis_col
10
+ from funcs.helper_functions import initial_file_load, custom_regex_load, ensure_output_folder_exists, output_folder, get_connection_params
11
  from sklearn.feature_extraction.text import CountVectorizer
12
 
13
+ min_word_occurence_slider_default = 0.01
14
+ max_word_occurence_slider_default = 0.95
15
+
16
+ ensure_output_folder_exists()
17
 
18
  # Gradio app
19
 
 
21
 
22
  with block:
23
 
24
+ original_data_state = gr.State(pd.DataFrame())
25
  data_state = gr.State(pd.DataFrame())
26
  embeddings_state = gr.State(np.array([]))
27
  embeddings_type_state = gr.State("")
 
31
  docs_state = gr.State()
32
  data_file_name_no_ext_state = gr.State()
33
  label_list_state = gr.State(pd.DataFrame())
34
+ vectoriser_state = gr.State(CountVectorizer(stop_words="english", ngram_range=(1, 2), min_df=min_word_occurence_slider_default, max_df=max_word_occurence_slider_default))
35
+
36
+ session_hash_state = gr.State("")
37
 
38
  gr.Markdown(
39
  """
40
  # Topic modeller
41
  Generate topics from open text in tabular data, based on [BERTopic](https://maartengr.github.io/BERTopic/). Upload a data file (csv, xlsx, or parquet), then specify the open text column that you want to use to generate topics. Click 'Extract topics' after you have selected the minimum similar documents per topic and maximum total topics. Duplicate this space, or clone to your computer to avoid queues here!
42
 
43
+ Uses TF-IDF-based embeddings by default, which are fast to compute but do not lead to high-quality clustering. Change to [mxbai-embed-large-v1](https://huggingface.co/mixedbread-ai/mxbai-embed-large-v1) model embeddings (512 dimensions) for better results at the cost of slower processing. If you have an embeddings .npz file previously made using this model, you can load this in at the same time to skip the first modelling step. If you have a pre-defined list of topics for zero-shot modelling, you can upload this as a csv file under 'I have my own list of topics...'. Further configuration options are available, such as maximum topics allowed and minimum documents per topic. Topic representation with LLMs is currently based on [Phi-3-mini-128k-instruct-GGUF](https://huggingface.co/QuantFactory/Phi-3-mini-128k-instruct-GGUF), which is quite slow on CPU, so use a GPU-enabled computer if possible, building from the requirements_gpu.txt file in the base folder.
44
 
45
  For small datasets, consider breaking up your text into sentences under 'Clean data' -> 'Split open text...' before topic modelling.
46
 
47
+ I suggest the [Wikipedia mini dataset](https://huggingface.co/datasets/rag-datasets/mini_wikipedia/tree/main/data) for testing the tool; choose the passages.parquet file for download.
48
  """)
49
 
50
  with gr.Tab("Load files and find topics"):
 
55
 
56
  with gr.Accordion("Clean data", open = False):
57
  with gr.Row():
58
+ clean_text = gr.Dropdown(value = "No", choices=["Yes", "No"], multiselect=False, label="Remove html, > 1 digit nums, emails, postcodes (UK).")
59
+ drop_duplicate_text = gr.Dropdown(value = "No", choices=["Yes", "No"], multiselect=False, label="Remove duplicate text, drop < 50 character strings.")
60
+ anonymise_drop = gr.Dropdown(value = "No", choices=["Yes", "No"], multiselect=False, label="Anonymise data on file load. Personal details are redacted - not 100% effective and slow!")
61
+ split_sentence_drop = gr.Dropdown(value = "No", choices=["Yes", "No"], multiselect=False, label="Split text into sentences. Useful for small datasets.")
62
  with gr.Row():
63
+ custom_regex = gr.UploadButton(label="Import custom regex removal file", file_count="multiple")
64
+ gr.Markdown("""Import custom regex - csv table with one column of regex patterns with no header. Strings matching this pattern will be removed. Example pattern: (?i)roosevelt for case insensitive removal of this term.""")
65
  custom_regex_text = gr.Textbox(label="Custom regex load status")
66
  clean_btn = gr.Button("Clean data")
67
 
68
  with gr.Accordion("I have my own list of topics (zero shot topic modelling).", open = False):
69
  candidate_topics = gr.File(label="Input topics from file (csv). File should have at least one column with a header and topic keywords in cells below. Topics will be taken from the first column of the file. Currently not compatible with low-resource embeddings.")
70
+
71
+ with gr.Row():
72
+ zero_shot_similarity = gr.Slider(minimum = 0.2, maximum = 1, value = 0.55, step = 0.001, label = "Minimum similarity value for document to be assigned to zero-shot topic. You may need to set this very low to get documents assigned to your topics!", scale=2)
73
+ zero_shot_optimiser_btn = gr.Button("Optimise settings to keep only zero-shot topics", scale=1)
74
 
75
  with gr.Row():
76
+ with gr.Accordion("Topic modelling settings - change documents per topic, max topics, frequency of terms", open = False):
77
+
78
+ with gr.Row():
79
+ min_docs_slider = gr.Slider(minimum = 2, maximum = 1000, value = 3, step = 1, label = "Minimum number of similar documents needed to make a topic.")
80
+ max_topics_slider = gr.Slider(minimum = 2, maximum = 500, value = 100, step = 1, label = "Maximum number of topics")
81
+ with gr.Row():
82
+ min_word_occurence_slider = gr.Slider(minimum = 0.001, maximum = 0.9, value = min_word_occurence_slider_default, step = 0.001, label = "Keep terms that appear in this minimum proportion of documents. Avoids creating topics with very uncommon words.")
83
+ max_word_occurence_slider = gr.Slider(minimum = 0.1, maximum = 1.0, value =max_word_occurence_slider_default, step = 0.01, label = "Keep terms that appear in less than this maximum proportion of documents. Avoids very common words in topic names.")
84
+
85
+ quality_mode_drop = gr.Dropdown(label = "Use high-quality transformers-based embeddings (slower)", value="No", choices=["Yes", "No"])
86
 
87
  with gr.Row():
88
  topics_btn = gr.Button("Extract topics", variant="primary")
 
96
  representation_type = gr.Dropdown(label = "Method for generating new topic labels", value="Default", choices=["Default", "MMR", "KeyBERT", "LLM"])
97
  represent_llm_btn = gr.Button("Change topic labels")
98
  with gr.Row():
99
+ reduce_outliers_btn = gr.Button("Reduce outliers (will create new topic labels)")
100
  save_pytorch_btn = gr.Button("Save model in Pytorch format")
101
 
102
  with gr.Tab("Visualise"):
103
  with gr.Row():
104
+ visualisation_type_radio = gr.Radio(label="Visualisation type", choices=["Topic document graph", "Hierarchical view"], value="Topic document graph")
105
  in_label = gr.Dropdown(choices=["Choose a column"], multiselect = True, label="Select column for labelling documents in output visualisations.")
106
  sample_slide = gr.Slider(minimum = 0.01, maximum = 1, value = 0.1, step = 0.01, label = "Proportion of data points to show on output visualisations.")
107
  legend_label = gr.Textbox(label="Custom legend column (optional, any column from the topic details output)", visible=False)
 
116
  with gr.Tab("Options"):
117
  with gr.Accordion("Data load and processing options", open = True):
118
  with gr.Row():
119
+ seed_number = gr.Number(label="Random seed to use in processing", minimum=0, step=1, value=42, precision=0)
120
  calc_probs = gr.Dropdown(label="Calculate all topic probabilities", value="No", choices=["Yes", "No"])
121
  with gr.Row():
122
+ embedding_super_compress = gr.Dropdown(label = "Round embeddings to three dp: smaller files but lower quality.", value="No", choices=["Yes", "No"])
 
 
123
  return_intermediate_files = gr.Dropdown(label = "Return intermediate processing files from file preparation.", value="Yes", choices=["Yes", "No"])
124
  save_topic_model = gr.Dropdown(label = "Save topic model to BERTopic format pkl file.", value="No", choices=["Yes", "No"])
125
 
126
  # Load in data. Update column names dropdown when file uploaded
127
+ in_files.upload(fn=initial_file_load, inputs=[in_files], outputs=[in_colnames, in_label, data_state, output_single_text, topic_model_state, embeddings_state, data_file_name_no_ext_state, label_list_state, original_data_state])
128
+
129
+ # When topic modelling column is chosen, change the default visualisation column to the same
130
+ in_colnames.change(fn=change_default_vis_col, inputs=[in_colnames],outputs=[in_label])
131
 
132
  # Clean data
133
  custom_regex.upload(fn=custom_regex_load, inputs=[custom_regex], outputs=[custom_regex_text, custom_regex_state])
134
+ clean_btn.click(fn=pre_clean, inputs=[data_state, in_colnames, data_file_name_no_ext_state, custom_regex_state, clean_text, drop_duplicate_text, anonymise_drop, split_sentence_drop], outputs=[output_single_text, output_file, data_state, data_file_name_no_ext_state, embeddings_state], api_name="clean")
135
+
136
+ # Optimise for keeping only zero-shot topics
137
+ zero_shot_optimiser_btn.click(fn=optimise_zero_shot, outputs=[quality_mode_drop, min_docs_slider, max_topics_slider, min_word_occurence_slider, max_word_occurence_slider, zero_shot_similarity])
138
 
139
  # Extract topics
140
+ topics_btn.click(fn=extract_topics, inputs=[data_state, in_files, min_docs_slider, in_colnames, max_topics_slider, candidate_topics, data_file_name_no_ext_state, label_list_state, return_intermediate_files, embedding_super_compress, quality_mode_drop, save_topic_model, embeddings_state, embeddings_type_state, zero_shot_similarity, calc_probs, vectoriser_state, min_word_occurence_slider, max_word_occurence_slider, split_sentence_drop, seed_number], outputs=[output_single_text, output_file, embeddings_state, embeddings_type_state, data_file_name_no_ext_state, topic_model_state, docs_state, vectoriser_state, assigned_topics_state], api_name="topics")
141
 
142
  # Reduce outliers
143
+ reduce_outliers_btn.click(fn=reduce_outliers, inputs=[topic_model_state, docs_state, embeddings_state, data_file_name_no_ext_state, assigned_topics_state, vectoriser_state, save_topic_model, split_sentence_drop, data_state], outputs=[output_single_text, output_file, topic_model_state], api_name="reduce_outliers")
144
 
145
  # Re-represent topic labels
146
+ represent_llm_btn.click(fn=represent_topics, inputs=[topic_model_state, docs_state, data_file_name_no_ext_state, quality_mode_drop, save_topic_model, representation_type, vectoriser_state, split_sentence_drop, data_state], outputs=[output_single_text, output_file, topic_model_state], api_name="represent_llm")
147
 
148
  # Save in Pytorch format
149
  save_pytorch_btn.click(fn=save_as_pytorch_model, inputs=[topic_model_state, data_file_name_no_ext_state], outputs=[output_single_text, output_file], api_name="pytorch_save")
150
 
151
  # Visualise topics
152
+ plot_btn.click(fn=visualise_topics, inputs=[topic_model_state, data_state, data_file_name_no_ext_state, quality_mode_drop, embeddings_state, in_label, in_colnames, legend_label, sample_slide, visualisation_type_radio, seed_number], outputs=[vis_output_single_text, out_plot_file, plot, plot_2], api_name="plot")
153
+
154
+ # Get session hash from connection parameters
155
+ block.load(get_connection_params, inputs=None, outputs=[session_hash_state])
156
 
157
  # Launch the Gradio app
158
  if __name__ == "__main__":
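
The new word-occurrence sliders in app.py share their defaults with the `min_df` and `max_df` arguments of the `CountVectorizer`, which scikit-learn treats as proportions of documents when they are floats. A pure-Python sketch of that document-frequency filter, assuming whitespace tokenisation and unigrams only: the `filter_vocabulary` helper is hypothetical and far simpler than scikit-learn's implementation.

```python
from collections import Counter
from typing import List


def filter_vocabulary(docs: List[str], min_df: float = 0.01, max_df: float = 0.95) -> List[str]:
    """Keep terms whose document frequency lies within [min_df, max_df] as proportions."""
    n_docs = len(docs)
    doc_freq = Counter()
    for doc in docs:
        # Count each term once per document (document frequency, not term frequency)
        doc_freq.update(set(doc.lower().split()))
    low, high = min_df * n_docs, max_df * n_docs
    return sorted(term for term, df in doc_freq.items() if low <= df <= high)


docs = ["the cat sat", "the dog sat", "the bird flew", "the fish swam"]

# "the" appears in every document, so max_df=0.95 removes it;
# min_df=0.5 additionally removes terms found in fewer than 2 of the 4 documents.
vocab_strict = filter_vocabulary(docs, min_df=0.5, max_df=0.95)
vocab_default = filter_vocabulary(docs)
print(vocab_strict)
print(vocab_default)
```

Raising the minimum proportion shrinks the vocabulary quickly on small datasets, which may be why the commit lowers the default from 0.1 to 0.01.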
funcs/anonymiser.py CHANGED
@@ -46,7 +46,7 @@ from presidio_anonymizer.entities import OperatorConfig
46
  # Function to Split Text and Create DataFrame using SpaCy
47
  def expand_sentences_spacy(df, colname, nlp=nlp):
48
  expanded_data = []
49
- df = df.reset_index(names='index')
50
  for index, row in df.iterrows():
51
  doc = nlp(row[colname])
52
  for sent in doc.sents:
 
46
  # Function to Split Text and Create DataFrame using SpaCy
47
  def expand_sentences_spacy(df, colname, nlp=nlp):
48
  expanded_data = []
49
+ df = df.drop('index', axis = 1, errors="ignore").reset_index(names='index')
50
  for index, row in df.iterrows():
51
  doc = nlp(row[colname])
52
  for sent in doc.sents:
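
The hunk is cut off before the loop body, so the rest of `expand_sentences_spacy` is not visible here. The changed line drops any existing `index` column first, presumably so that `reset_index(names='index')` does not fail when the data already carries one. As a rough illustration of the general idea (one output row per sentence, each keeping a pointer to its source row), here is a standard-library sketch that swaps spaCy's sentence segmenter for a naive regex split. The `expand_sentences_simple` name and the output keys are assumptions, not the repository's code.

```python
import re
from typing import Dict, List

# Naive boundary: whitespace that follows ., ! or ? (spaCy's segmenter is far more robust)
_SENTENCE_END = re.compile(r"(?<=[.!?])\s+")


def expand_sentences_simple(rows: List[Dict[str, str]], colname: str) -> List[Dict[str, object]]:
    """Return one record per sentence, remembering which input row it came from."""
    expanded = []
    for original_index, row in enumerate(rows):
        for sentence in _SENTENCE_END.split(row[colname].strip()):
            if sentence:
                expanded.append({"original_index": original_index, colname: sentence})
    return expanded


rows = [
    {"text": "Topic models group documents. They need enough rows!"},
    {"text": "One sentence only."},
]
sentences = expand_sentences_simple(rows, "text")
for record in sentences:
    print(record)
```

Keeping the source row position with every sentence is what allows sentence-level topics to be mapped back to the original passages later.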
funcs/bertopic_vis_documents.py CHANGED
@@ -22,7 +22,8 @@ from tqdm import tqdm
22
  import itertools
23
  import numpy as np
24
 
25
- # Shamelessly taken and adapted from Bertopic original implementation here (Maarten Grootendorst): https://github.com/MaartenGr/BERTopic/blob/master/bertopic/plotting/_documents.py
 
26
 
27
  def visualize_documents_custom(topic_model,
28
  docs: List[str],
@@ -168,16 +169,23 @@ def visualize_documents_custom(topic_model,
168
  df["y"] = embeddings_2d[:, 1]
169
 
170
  # Prepare text and names
 
171
  if isinstance(custom_labels, str):
172
  names = [[[str(topic), None]] + topic_model.topic_aspects_[custom_labels][topic] for topic in unique_topics]
173
  names = ["_".join([label[0] for label in labels[:4]]) for labels in names]
174
  names = [label if len(label) < 30 else label[:27] + "..." for label in names]
175
  elif topic_model.custom_labels_ is not None and custom_labels:
176
- print("Using custom labels: ", topic_model.custom_labels_)
177
- names = [topic_model.custom_labels_[topic + topic_model._outliers] for topic in unique_topics]
 
 
 
178
  else:
179
- print("Not using custom labels")
180
- names = [f"{topic} " + ", ".join([word for word, value in topic_model.get_topic(topic)][:3]) for topic in unique_topics]
 
 
 
181
 
182
  #print(names)
183
 
 
22
  import itertools
23
  import numpy as np
24
 
25
+
26
+ # The following is adapted from the original BERTopic implementation (Maarten Grootendorst): https://github.com/MaartenGr/BERTopic/blob/master/bertopic/plotting/_documents.py
27
 
28
  def visualize_documents_custom(topic_model,
29
  docs: List[str],
 
169
  df["y"] = embeddings_2d[:, 1]
170
 
171
  # Prepare text and names
172
+ trace_name_char_length = 60
173
  if isinstance(custom_labels, str):
174
  names = [[[str(topic), None]] + topic_model.topic_aspects_[custom_labels][topic] for topic in unique_topics]
175
  names = ["_".join([label[0] for label in labels[:4]]) for labels in names]
176
  names = [label if len(label) < 30 else label[:27] + "..." for label in names]
177
  elif topic_model.custom_labels_ is not None and custom_labels:
178
+ #print("Using custom labels: ", topic_model.custom_labels_)
179
+ #names = [topic_model.custom_labels_[topic + topic_model._outliers] for topic in unique_topics]
180
+ # Limit label length to trace_name_char_length (60) chars
181
+ names = [label[:trace_name_char_length] for label in (topic_model.custom_labels_[topic + topic_model._outliers] for topic in unique_topics)]
182
+
183
  else:
184
+ #print("Not using custom labels")
185
+ # Limit the keyword part of the label to trace_name_char_length (60) chars
186
+ names = [f"{topic} " + ", ".join([word for word, value in topic_model.get_topic(topic)][:3])[:trace_name_char_length] for topic in unique_topics]
187
+
188
+ #names = [f"{topic} " + ", ".join([word for word, value in topic_model.get_topic(topic)][:3]) for topic in unique_topics]
189
 
190
  #print(names)
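
The truncation added in this hunk can be sketched in isolation. Note that in the default branch the slice is applied to the joined keyword string before the topic number prefix is concatenated, so the final name can be slightly longer than the limit. The helper names below are hypothetical; only the 60-character limit and the "first three words" rule come from the diff.

```python
from typing import List

TRACE_NAME_CHAR_LENGTH = 60  # same value as trace_name_char_length in the diff


def default_trace_name(topic_id: int, top_words: List[str], max_chars: int = TRACE_NAME_CHAR_LENGTH) -> str:
    """Mimic the default branch: '<id> ' + first three keywords, keywords cut to max_chars."""
    # Subscription binds tighter than '+', so only the joined keywords are sliced
    return f"{topic_id} " + ", ".join(top_words[:3])[:max_chars]


def custom_trace_name(label: str, max_chars: int = TRACE_NAME_CHAR_LENGTH) -> str:
    """Mimic the custom-label branch: the whole label is cut to max_chars."""
    return label[:max_chars]


long_words = ["a" * 40, "b" * 40, "c" * 40, "ignored"]
name = default_trace_name(12, long_words)
print(len(name), name[:10])
print(len(custom_trace_name("x" * 100)))
```

If a hard cap on the full trace name is wanted, wrapping the whole concatenation in parentheses before slicing would be one way to get it.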
191
 
funcs/clean_funcs.py CHANGED
@@ -23,19 +23,27 @@ def initial_clean(texts, custom_regex, progress=gr.Progress()):
23
  text = text.str.replace_all(email_pattern_regex, ' ')
24
  text = text.str.replace_all(nums_two_more_regex, ' ')
25
  text = text.str.replace_all(postcode_pattern_regex, ' ')
 
 
 
 
 
 
 
 
26
 
27
  # Allow for custom regex patterns to be removed
28
  if len(custom_regex) > 0:
29
  for pattern in custom_regex:
30
  raw_string_pattern = r'{}'.format(pattern)
31
  print("Removing regex pattern: ", raw_string_pattern)
32
- text = text.str.replace_all(raw_string_pattern, ' ')
33
 
34
- text = text.str.replace_all(multiple_spaces_regex, ' ')
35
 
36
- text = text.to_list()
37
 
38
- return text
39
 
40
  def remove_hyphens(text_text):
41
  return re.sub(r'(\w+)-(\w+)-?(\w)?', r'\1 \2 \3', text_text)
 
23
  text = text.str.replace_all(email_pattern_regex, ' ')
24
  text = text.str.replace_all(nums_two_more_regex, ' ')
25
  text = text.str.replace_all(postcode_pattern_regex, ' ')
26
+ text = text.str.replace_all(multiple_spaces_regex, ' ')
27
+
28
+ text = text.to_list()
29
+
30
+ return text
31
+
32
+ def regex_clean(texts, custom_regex, progress=gr.Progress()):
33
+ texts = pl.Series(texts).str.strip_chars()
34
 
35
  # Allow for custom regex patterns to be removed
36
  if len(custom_regex) > 0:
37
  for pattern in custom_regex:
38
  raw_string_pattern = r'{}'.format(pattern)
39
  print("Removing regex pattern: ", raw_string_pattern)
40
+ texts = texts.str.replace_all(raw_string_pattern, ' ')
41
 
42
+ texts = texts.str.replace_all(multiple_spaces_regex, ' ')
43
 
44
+ texts = texts.to_list()
45
 
46
+ return texts
47
 
48
  def remove_hyphens(text_text):
49
  return re.sub(r'(\w+)-(\w+)-?(\w)?', r'\1 \2 \3', text_text)
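
The commit moves custom-pattern removal out of `initial_clean` into a separate `regex_clean` step. A standard-library sketch of the same sequence (strip, replace each pattern with a space, collapse runs of whitespace), using `re` instead of polars: the `regex_clean_simple` name and the whitespace pattern are assumptions, since `multiple_spaces_regex` is defined outside the lines shown.

```python
import re
from typing import List

# Assumed stand-in for the module-level multiple_spaces_regex, which is not shown in the diff
MULTIPLE_SPACES = re.compile(r"\s{2,}")


def regex_clean_simple(texts: List[str], custom_regex: List[str]) -> List[str]:
    """Remove every match of each custom pattern, then collapse repeated whitespace."""
    compiled = [re.compile(pattern) for pattern in custom_regex]
    cleaned = []
    for text in texts:
        text = text.strip()
        for pattern in compiled:
            text = pattern.sub(" ", text)
        cleaned.append(MULTIPLE_SPACES.sub(" ", text))
    return cleaned


# The example pattern from the app's help text: case-insensitive removal of one term
result = regex_clean_simple(
    ["  President Roosevelt spoke.  ", "No match here"],
    [r"(?i)roosevelt"],
)
print(result)
```

Replacing with a space rather than an empty string avoids gluing neighbouring words together, at the cost of the extra whitespace-collapsing pass.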
funcs/embeddings.py CHANGED
@@ -1,15 +1,41 @@
1
  import time
2
  import numpy as np
3
- from torch import cuda
4
 
5
- random_seed = 42
 
 
6
 
 
 
7
  if cuda.is_available():
8
  torch_device = "gpu"
 
 
 
9
  else:
10
  torch_device = "cpu"
 
11
 
12
- def make_or_load_embeddings(docs, file_list, embeddings_out, embedding_model, embeddings_super_compress, low_resource_mode_opt):
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
13
 
14
  # If no embeddings found, make or load in
15
  if embeddings_out.size == 0:
@@ -32,7 +58,7 @@ def make_or_load_embeddings(docs, file_list, embeddings_out, embedding_model, em
32
 
33
  # Custom model
34
  # If on CPU, don't resort to embedding models
35
- if low_resource_mode_opt == "Yes":
36
  print("Creating simplified 'sparse' embeddings based on TfIDF")
37
 
38
  # Fit the pipeline to the text data
@@ -41,13 +67,10 @@ def make_or_load_embeddings(docs, file_list, embeddings_out, embedding_model, em
41
  # Transform text data to embeddings
42
  embeddings_out = embedding_model.transform(docs)
43
 
44
- #embeddings_out = embedding_model.encode(sentences=docs, show_progress_bar = True, batch_size = 32)
45
-
46
- elif low_resource_mode_opt == "No":
47
  print("Creating dense embeddings based on transformers model")
48
 
49
- #embeddings_out = embedding_model.encode(sentences=docs, max_length=1024, show_progress_bar = True, batch_size = 32) # For Jina # #
50
- embeddings_out = embedding_model.encode(sentences=docs, show_progress_bar = True, batch_size = 32, precision="int8") # For large
51
 
52
  toc = time.perf_counter()
53
  time_out = f"The embedding took {toc - tic:0.1f} seconds"
 
1
  import time
2
  import numpy as np
3
+ from torch import cuda, backends, version
4
 
5
+ # Check for torch cuda
6
+ # If you want to disable cuda for testing purposes
7
+ #os.environ['CUDA_VISIBLE_DEVICES'] = '-1'
8
 
9
+ print("Is CUDA enabled? ", cuda.is_available())
10
+ print("Is cuDNN enabled? ", backends.cudnn.enabled)
11
  if cuda.is_available():
12
  torch_device = "gpu"
13
+ print("Cuda version installed is: ", version.cuda)
14
+ high_quality_mode = "Yes"
15
+ #os.system("nvidia-smi")
16
  else:
17
  torch_device = "cpu"
18
+ high_quality_mode = "No"
19
 
20
+ print("Device used is: ", torch_device)
21
+
22
+
23
+
24
+ def make_or_load_embeddings(docs: list, file_list: list, embeddings_out: np.ndarray, embedding_model, embeddings_super_compress: str, high_quality_mode_opt: str) -> np.ndarray:
25
+ """
26
+ Create or load embeddings for the given documents.
27
+
28
+ Args:
29
+ docs (list): List of documents to embed.
30
+ file_list (list): List of file names to check for existing embeddings.
31
+ embeddings_out (np.ndarray): Existing embeddings, if any; pass an empty array to create or load them.
32
+ embedding_model: Model used to generate embeddings.
33
+ embeddings_super_compress (str): Option to super compress embeddings ("Yes" or "No").
34
+ high_quality_mode_opt (str): Option for high quality mode ("Yes" or "No").
35
+
36
+ Returns:
37
+ np.ndarray: The generated or loaded embeddings.
38
+ """
39
 
40
  # If no embeddings found, make or load in
41
  if embeddings_out.size == 0:
 
58
 
59
  # Custom model
60
  # If on CPU, don't resort to embedding models
61
+ if high_quality_mode_opt == "No":
62
  print("Creating simplified 'sparse' embeddings based on TfIDF")
63
 
64
  # Fit the pipeline to the text data
 
67
  # Transform text data to embeddings
68
  embeddings_out = embedding_model.transform(docs)
69
 
70
+ elif high_quality_mode_opt == "Yes":
 
 
71
  print("Creating dense embeddings based on transformers model")
72
 
73
+ embeddings_out = embedding_model.encode(sentences=docs, show_progress_bar = True, batch_size = 32)#, precision="int8") # For large
 
74
 
75
  toc = time.perf_counter()
76
  time_out = f"The embedding took {toc - tic:0.1f} seconds"
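The renamed `high_quality_mode_opt` flag now selects between the two embedding paths: "No" fits and transforms with the sparse TF-IDF pipeline, "Yes" calls `encode` on the transformer model. The sketch below shows that dispatch in isolation. `SparseStandIn`, `DenseStandIn`, `choose_defaults` and `embed_sketch` are hypothetical names, and the stand-in models return toy vectors rather than real embeddings.

```python
from typing import List, Tuple


def choose_defaults(cuda_available: bool) -> Tuple[str, str]:
    """Mirror the module-level defaults: a CUDA device turns high quality mode on."""
    if cuda_available:
        return "gpu", "Yes"
    return "cpu", "No"


class SparseStandIn:
    """Hypothetical stand-in for the TF-IDF pipeline (fit, then transform)."""

    def fit(self, docs: List[str]) -> "SparseStandIn":
        return self

    def transform(self, docs: List[str]) -> List[List[float]]:
        return [[float(len(doc))] for doc in docs]


class DenseStandIn:
    """Hypothetical stand-in for a sentence-transformers style model."""

    def encode(self, sentences, show_progress_bar=False, batch_size=32):
        return [[float(len(s)), 1.0] for s in sentences]


def embed_sketch(docs, embedding_model, high_quality_mode_opt):
    if high_quality_mode_opt == "No":
        # Sparse path: fit the pipeline to the text, then transform it
        embedding_model.fit(docs)
        return embedding_model.transform(docs)
    if high_quality_mode_opt == "Yes":
        # Dense path: encode with the transformer model
        return embedding_model.encode(sentences=docs, show_progress_bar=False, batch_size=32)
    raise ValueError(f"Expected 'Yes' or 'No', got {high_quality_mode_opt!r}")


device, mode = choose_defaults(cuda_available=False)
print(device, mode, embed_sketch(["ab", "c"], SparseStandIn(), mode))
```

The sketch raises on an unexpected flag value as a design choice; the code in the diff uses `if`/`elif` with no `else`, so an unexpected value would skip both branches.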
funcs/helper_functions.py CHANGED
@@ -10,33 +10,70 @@ import numpy as np
10
  from bertopic import BERTopic
11
  from datetime import datetime
12
 
 
 
13
  today = datetime.now().strftime("%d%m%Y")
14
  today_rev = datetime.now().strftime("%Y%m%d")
15
 
16
- # Log terminal output: https://github.com/gradio-app/gradio/issues/2362
17
- class Logger:
18
- def __init__(self, filename):
19
- self.terminal = sys.stdout
20
- self.log = open(filename, "w")
 
 
 
 
 
21
 
22
- def write(self, message):
23
- self.terminal.write(message)
24
- self.log.write(message)
25
-
26
- def flush(self):
27
- self.terminal.flush()
28
- self.log.flush()
29
-
30
- def isatty(self):
31
- return False
 
 
 
 
 
 
 
 
 
 
 
 
 
 
32
 
33
- #sys.stdout = Logger("output.log")
 
 
 
 
34
 
35
- # def read_logs():
36
- # sys.stdout.flush()
37
- # with open("output.log", "r") as f:
38
- # return f.read()
39
 
 
 
 
 
 
 
 
 
 
 
 
 
40
 
41
  def detect_file_type(filename):
42
  """Detect the file type based on its extension."""
@@ -130,7 +167,7 @@ def initial_file_load(in_file):
130
 
131
 
132
  #The np.array([]) at the end is for clearing the embedding state when a new file is loaded
133
- return gr.Dropdown(choices=concat_choices), gr.Dropdown(choices=concat_choices), df, output_text, topic_model, embeddings, data_file_name_no_ext, custom_labels
134
 
135
  def custom_regex_load(in_file):
136
  '''
@@ -157,8 +194,6 @@ def custom_regex_load(in_file):
157
 
158
  return output_text, custom_regex
159
 
160
-
161
-
162
  def get_file_path_end(file_path):
163
  # First, get the basename of the file (e.g., "example.txt" from "/path/to/example.txt")
164
  basename = os.path.basename(file_path)
@@ -177,15 +212,7 @@ def get_file_path_end_with_ext(file_path):
177
 
178
  return filename_end
179
 
180
- def dummy_function(in_colnames):
181
- """
182
- A dummy function that exists just so that dropdown updates work correctly.
183
- """
184
- return None
185
-
186
  # Zip the above to export file
187
-
188
-
189
  def zip_folder(folder_path, output_zip_file):
190
  # Create a ZipFile object in write mode
191
  with zipfile.ZipFile(output_zip_file, 'w', zipfile.ZIP_DEFLATED) as zipf:
@@ -215,59 +242,121 @@ def delete_files_in_folder(folder_path):
215
  except Exception as e:
216
  print(f"Failed to delete {file_path}. Reason: {e}")
217
 
218
-
219
- def save_topic_outputs(topic_model, data_file_name_no_ext, output_list, docs, save_topic_model, progress=gr.Progress()):
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
220
 
221
- progress(0.7, desc= "Checking data")
222
-
223
- topic_dets = topic_model.get_topic_info()
224
 
225
- if topic_dets.shape[0] == 1:
226
- topic_det_output_name = "topic_details_" + data_file_name_no_ext + "_" + today_rev + ".csv"
227
- topic_dets.to_csv(topic_det_output_name)
228
- output_list.append(topic_det_output_name)
229
 
230
- return output_list, "No topics found, original file returned"
231
-
232
-
233
- progress(0.8, desc= "Saving output")
234
-
235
- topic_det_output_name = "topic_details_" + data_file_name_no_ext + "_" + today_rev + ".csv"
236
  topic_dets.to_csv(topic_det_output_name)
237
  output_list.append(topic_det_output_name)
238
 
239
- doc_det_output_name = "doc_details_" + data_file_name_no_ext + "_" + today_rev + ".csv"
240
- doc_dets = topic_model.get_document_info(docs)[["Document", "Topic", "Name", "Probability", "Representative_document"]]
241
- doc_dets.to_csv(doc_det_output_name)
242
- output_list.append(doc_det_output_name)
243
 
244
- if "CustomName" in topic_dets.columns:
245
- topics_text_out_str = str(topic_dets["CustomName"])
246
- else:
247
- topics_text_out_str = str(topic_dets["Name"])
248
- output_text = "Topics: " + topics_text_out_str
249
 
250
- # Save topic model to file
251
- if save_topic_model == "Yes":
252
- print("Saving BERTopic model in .pkl format.")
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
253
 
254
- folder_path = "output_model/"
 
 
 
255
 
256
- if not os.path.exists(folder_path):
257
- # Create the folder
258
- os.makedirs(folder_path)
259
 
260
- topic_model_save_name_pkl = folder_path + data_file_name_no_ext + "_topics_" + today_rev + ".pkl"# + ".safetensors"
261
- topic_model_save_name_zip = topic_model_save_name_pkl + ".zip"
262
 
263
- # Clear folder before replacing files
264
- #delete_files_in_folder(topic_model_save_name_pkl)
265
 
266
- topic_model.save(topic_model_save_name_pkl, serialization='pickle', save_embedding_model=False, save_ctfidf=False)
 
 
267
 
268
- # Zip file example
269
-
270
- #zip_folder(topic_model_save_name_pkl, topic_model_save_name_zip)
271
- output_list.append(topic_model_save_name_pkl)
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
272
 
273
- return output_list, output_text
 
10
  from bertopic import BERTopic
11
  from datetime import datetime
12
 
13
+ from typing import List, Tuple
14
+
15
  today = datetime.now().strftime("%d%m%Y")
16
  today_rev = datetime.now().strftime("%Y%m%d")
17
 
18
+ def get_or_create_env_var(var_name:str, default_value:str) -> str:
19
+ # Get the environment variable if it exists
20
+ value = os.environ.get(var_name)
21
+
22
+ # If it doesn't exist, set it to the default value
23
+ if value is None:
24
+ os.environ[var_name] = default_value
25
+ value = default_value
26
+
27
+ return value
28
 
29
+ # Retrieving or setting output folder
30
+ env_var_name = 'GRADIO_OUTPUT_FOLDER'
31
+ default_value = 'output/'
32
+
33
+ output_folder = get_or_create_env_var(env_var_name, default_value)
34
+ print(f'The value of {env_var_name} is {output_folder}')
35
+
36
+ def ensure_output_folder_exists():
37
+ """Checks if the 'output/' folder exists, creates it if not."""
38
+
39
+ folder_name = "output/"
40
+
41
+ if not os.path.exists(folder_name):
42
+ # Create the folder if it doesn't exist
43
+ os.makedirs(folder_name)
44
+ print(f"Created the 'output/' folder.")
45
+ else:
46
+ print(f"The 'output/' folder already exists.")
47
+
48
+ def get_connection_params(request: gr.Request):
49
+ '''
50
+ Get connection parameter values from request object.
51
+ '''
52
+ if request:
53
 
54
+ # print("Request headers dictionary:", request.headers)
55
+ # print("All host elements", request.client)
56
+ # print("IP address:", request.client.host)
57
+ # print("Query parameters:", dict(request.query_params))
58
+ print("Session hash:", request.session_hash)
59
 
60
+ if 'x-cognito-id' in request.headers:
61
+ out_session_hash = request.headers['x-cognito-id']
62
+ base_folder = "user-files/"
63
+ #print("Cognito ID found:", out_session_hash)
64
 
65
+ else:
66
+ out_session_hash = request.session_hash
67
+ base_folder = "temp-files/"
68
+ #print("Cognito ID not found. Using session hash as save folder.")
69
+
70
+ output_folder = base_folder + out_session_hash + "/"
71
+ #print("S3 output folder is: " + "s3://" + bucket_name + "/" + output_folder)
72
+
73
+ return out_session_hash
74
+ else:
75
+ print("No session parameters found.")
76
+ return ""
77
 
78
  def detect_file_type(filename):
79
  """Detect the file type based on its extension."""
 
167
 
168
 
169
  #The np.array([]) at the end is for clearing the embedding state when a new file is loaded
170
+ return gr.Dropdown(choices=concat_choices), gr.Dropdown(choices=concat_choices), df, output_text, topic_model, embeddings, data_file_name_no_ext, custom_labels, df
171
 
172
  def custom_regex_load(in_file):
173
  '''
 
194
 
195
  return output_text, custom_regex
196
 
 
 
197
  def get_file_path_end(file_path):
198
  # First, get the basename of the file (e.g., "example.txt" from "/path/to/example.txt")
199
  basename = os.path.basename(file_path)
 
212
 
213
  return filename_end
214
 
 
 
 
 
 
 
215
  # Zip the above to export file
 
 
216
  def zip_folder(folder_path, output_zip_file):
217
  # Create a ZipFile object in write mode
218
  with zipfile.ZipFile(output_zip_file, 'w', zipfile.ZIP_DEFLATED) as zipf:
 
242
  except Exception as e:
243
  print(f"Failed to delete {file_path}. Reason: {e}")
244
 
245
+ def save_topic_outputs(topic_model: BERTopic, data_file_name_no_ext: str, output_list: List[str], docs: List[str], save_topic_model: str, prepared_docs: pd.DataFrame, split_sentence_drop: str, output_folder: str = output_folder, progress: gr.Progress = gr.Progress()) -> Tuple[List[str], str]:
246
+ """
247
+ Save the outputs of a topic model to specified files.
248
+
249
+ Args:
250
+ topic_model (BERTopic): The topic model object.
251
+ data_file_name_no_ext (str): The base name of the data file without extension.
252
+ output_list (List[str]): List to store the output file names.
253
+ docs (List[str]): List of documents.
254
+ save_topic_model (str): Whether to also save the topic model to file ("Yes" or "No").
255
+ prepared_docs (pd.DataFrame): DataFrame containing prepared documents.
256
+ split_sentence_drop (str): Option to split sentences ("Yes" or "No").
257
+ output_folder (str, optional): Folder to save the output files. Defaults to output_folder.
258
+ progress (gr.Progress, optional): Progress tracker. Defaults to gr.Progress().
259
+
260
+ Returns:
261
+ Tuple[List[str], str]: A tuple containing the list of output file names and a status message.
262
+ """
263
 
264
+ progress(0.7, desc= "Checking data")
 
 
265
 
266
+ topic_dets = topic_model.get_topic_info()
 
 
 
267
 
268
+ if topic_dets.shape[0] == 1:
269
+ topic_det_output_name = output_folder + "topic_details_" + data_file_name_no_ext + "_" + today_rev + ".csv"
 
 
 
 
270
  topic_dets.to_csv(topic_det_output_name)
271
  output_list.append(topic_det_output_name)
272
 
273
+ return output_list, "No topics found, original file returned"
 
 
 
274
 
275
+ progress(0.8, desc= "Saving output")
 
 
 
 
276
 
277
+ topic_det_output_name = output_folder + "topic_details_" + data_file_name_no_ext + "_" + today_rev + ".csv"
278
+ topic_dets.to_csv(topic_det_output_name)
279
+ output_list.append(topic_det_output_name)
280
+
281
+ doc_det_output_name = output_folder + "doc_details_" + data_file_name_no_ext + "_" + today_rev + ".csv"
282
+
283
+ ## Check that the following columns exist in the dataframe, keep only the ones that exist
284
+ columns_to_check = ["Document", "Topic", "Name", "Probability", "Representative_document"]
285
+
286
+ doc_info = topic_model.get_document_info(docs)
287
+ doc_dets = doc_info[[column for column in columns_to_check if column in doc_info.columns]]
288
+
289
+ # If you have created a 'sentence split' dataset from the cleaning options, map these sentences back to the original document.
290
+ try:
291
+ if split_sentence_drop == "Yes":
292
+ doc_dets = doc_dets.merge(prepared_docs[['document_index']], how = "left", left_index=True, right_index=True)
293
+ doc_dets = doc_dets.rename(columns={"document_index": "parent_document_index"}, errors='ignore')
294
+
295
+ # 1. Group by Parent Document Index:
296
+ grouped = doc_dets.groupby('parent_document_index')
297
+
298
+ # 2. Aggregate Topics and Probabilities:
299
+ def aggregate_topics(group):
300
+ original_text = ' '.join(group['Document'])
301
+ topics = group['Topic'].tolist()
302
+
303
+ if 'Name' in group.columns:
304
+ topic_names = group['Name'].tolist()
305
+ else:
306
+ topic_names = None
307
 
308
+ if 'Probability' in group.columns:
309
+ probabilities = group['Probability'].tolist()
310
+ else:
311
+ probabilities = None # Or any other default value you prefer
312
 
313
+ return pd.Series({'Document':original_text, 'Topic numbers': topics, 'Topic names': topic_names, 'Probabilities': probabilities})
 
 
314
 
315
+ #result_df = grouped.apply(aggregate_topics).reset_index()
316
+ doc_det_agg = grouped.apply(lambda x: aggregate_topics(x)).reset_index()
317
 
318
+ # Join back original text
319
+ #doc_det_agg = doc_det_agg.merge(original_data[[in_colnames_list_first]], how = "left", left_index=True, right_index=True)
320
 
321
+ doc_det_agg_output_name = output_folder + "doc_details_agg_" + data_file_name_no_ext + "_" + today_rev + ".csv"
322
+ doc_det_agg.to_csv(doc_det_agg_output_name)
323
+ output_list.append(doc_det_agg_output_name)
324
 
325
+ except Exception as e:
326
+ print("Creating aggregate document details failed, error:", e)
327
+
328
+ # Save document details to file
329
+ doc_dets.to_csv(doc_det_output_name)
330
+ output_list.append(doc_det_output_name)
331
+
332
+
333
+ if "CustomName" in topic_dets.columns:
334
+ topics_text_out_str = str(topic_dets["CustomName"])
335
+ else:
336
+ topics_text_out_str = str(topic_dets["Name"])
337
+ output_text = "Topics: " + topics_text_out_str
338
+
339
+ # Save topic model to file
340
+ if save_topic_model == "Yes":
341
+ print("Saving BERTopic model in .pkl format.")
342
+
343
+ #folder_path = output_folder #"output_model/"
344
+
345
+ #if not os.path.exists(folder_path):
346
+ # Create the folder
347
+ # os.makedirs(folder_path)
348
+
349
+ topic_model_save_name_pkl = output_folder + data_file_name_no_ext + "_topics_" + today_rev + ".pkl"# + ".safetensors"
350
+ topic_model_save_name_zip = topic_model_save_name_pkl + ".zip"
351
+
352
+ # Clear folder before replacing files
353
+ #delete_files_in_folder(topic_model_save_name_pkl)
354
+
355
+ topic_model.save(topic_model_save_name_pkl, serialization='pickle', save_embedding_model=False, save_ctfidf=False)
356
+
357
+ # Zip file example
358
+
359
+ #zip_folder(topic_model_save_name_pkl, topic_model_save_name_zip)
360
+ output_list.append(topic_model_save_name_pkl)
361
 
362
+ return output_list, output_text
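When sentence splitting is enabled, `save_topic_outputs` maps each sentence back to its parent document and collects the per-sentence topics into lists. The sketch below shows the same grouping idea in plain Python with a dictionary instead of the pandas `groupby` used in the diff, so it runs without third-party packages. The function name `aggregate_by_parent` and the sample rows are assumptions for illustration.

```python
from collections import defaultdict


def aggregate_by_parent(rows):
    """Group sentence-level rows by parent document and collect topics per document."""
    grouped = defaultdict(list)
    for row in rows:
        grouped[row["parent_document_index"]].append(row)

    aggregated = []
    # Sorted keys imitate the default ordering of a pandas groupby
    for parent_index in sorted(grouped):
        group = grouped[parent_index]
        aggregated.append({
            "parent_document_index": parent_index,
            # Rejoin the sentences to approximate the original passage
            "Document": " ".join(r["Document"] for r in group),
            "Topic numbers": [r["Topic"] for r in group],
            # None stands in for a missing probability
            "Probabilities": [r.get("Probability") for r in group],
        })
    return aggregated


rows = [
    {"parent_document_index": 0, "Document": "The bus was late.", "Topic": 1, "Probability": 0.9},
    {"parent_document_index": 0, "Document": "The driver was kind.", "Topic": 2, "Probability": 0.8},
    {"parent_document_index": 1, "Document": "No comment.", "Topic": -1, "Probability": 0.0},
]
for doc in aggregate_by_parent(rows):
    print(doc)
```

One difference from the diff: the pandas version sets the whole `Probabilities` value to `None` when the column is absent, whereas this sketch records `None` per sentence.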
funcs/representation_model.py CHANGED
@@ -3,29 +3,26 @@ from bertopic.representation import LlamaCPP
3
  from llama_cpp import Llama
4
  from pydantic import BaseModel
5
  import torch.cuda
6
- from huggingface_hub import hf_hub_download, snapshot_download
7
 
8
  from bertopic.representation import KeyBERTInspired, MaximalMarginalRelevance, BaseRepresentation
9
- from funcs.prompts import capybara_prompt, capybara_start, open_hermes_prompt, open_hermes_start, stablelm_prompt, stablelm_start, phi3_prompt, phi3_start
10
-
11
- random_seed = 42
12
 
13
  chosen_prompt = phi3_prompt #open_hermes_prompt # stablelm_prompt
14
  chosen_start_tag = phi3_start #open_hermes_start # stablelm_start
15
 
 
16
 
17
  # Offload layers to the GPU when CUDA is available; otherwise run entirely on the CPU
18
- if torch.cuda.is_available():
19
- torch_device = "gpu"
20
  low_resource_mode = "No"
21
- n_gpu_layers = 100
22
- else:
23
- torch_device = "cpu"
24
  low_resource_mode = "Yes"
25
  n_gpu_layers = 0
26
 
27
- #low_resource_mode = "No" # Override for testing
28
-
29
  #print("Running on device:", torch_device)
30
  n_threads = torch.get_num_threads()
31
  print("CPU n_threads:", n_threads)
@@ -37,7 +34,7 @@ top_p: float = 1
37
  repeat_penalty: float = 1.1
38
  last_n_tokens_size: int = 128
39
  max_tokens: int = 500
40
- seed: int = 42
41
  reset: bool = True
42
  stream: bool = False
43
  n_threads: int = n_threads
@@ -83,15 +80,25 @@ llm_config = LLamacppInitConfigGpu(last_n_tokens_size=last_n_tokens_size,
83
  trust_remote_code=trust_remote_code)
84
 
85
  ## Create representation model parameters ##
86
- # KeyBERT
87
  keybert = KeyBERTInspired(random_state=random_seed)
88
- # MMR
89
  mmr = MaximalMarginalRelevance(diversity=0.5)
90
-
91
  base_rep = BaseRepresentation()
92
 
93
  # Find model file
94
- def find_model_file(hf_model_name, hf_model_file, search_folder, sub_folder):
 
 
 
 
 
 
 
 
 
 
 
 
 
95
  hf_loc = search_folder #os.environ["HF_HOME"]
96
  hf_sub_loc = search_folder + sub_folder #os.environ["HF_HOME"]
97
 
@@ -116,17 +123,27 @@ def find_model_file(hf_model_name, hf_model_file, search_folder, sub_folder):
116
 
117
  return found_file
118
 
 
 
 
 
 
 
 
 
 
 
 
119
 
120
- def create_representation_model(representation_type, llm_config, hf_model_name, hf_model_file, chosen_start_tag, low_resource_mode):
 
 
121
 
122
  if representation_type == "LLM":
123
  print("Generating LLM representation")
124
  # Use llama.cpp to load in model
125
 
126
- # del os.environ["HF_HOME"]
127
-
128
  # Check for HF_HOME environment variable and supply a default value if it's not found (typical location for huggingface models)
129
- # Get HF_HOME environment variable or default to "~/.cache/huggingface/hub"
130
  base_folder = "model" #"~/.cache/huggingface/hub"
131
  hf_home_value = os.getenv("HF_HOME", base_folder)
132
 
@@ -158,9 +175,10 @@ def create_representation_model(representation_type, llm_config, hf_model_name,
158
 
159
  print("Loading representation model with", llm_config.n_gpu_layers, "layers allocated to GPU.")
160
 
 
161
  llm = Llama(model_path=found_file, stop=chosen_start_tag, n_gpu_layers=llm_config.n_gpu_layers, n_ctx=llm_config.n_ctx,seed=seed) #**llm_config.model_dump())# rope_freq_scale=0.5,
162
  #print(llm.n_gpu_layers)
163
- print("Chosen prompt:", chosen_prompt)
164
  llm_model = LlamaCPP(llm, prompt=chosen_prompt)#, **gen_config.model_dump())
165
 
166
  # All representation models
@@ -180,15 +198,6 @@ def create_representation_model(representation_type, llm_config, hf_model_name,
180
  else:
181
  print("Generating default representation type")
182
  representation_model = {"Default":base_rep}
183
-
184
- # Deprecated example using CTransformers. This package is not really used anymore
185
- #model = AutoModelForCausalLM.from_pretrained('NousResearch/Nous-Capybara-7B-V1.9-GGUF', model_type='mistral', model_file='Capybara-7B-V1.9-Q5_K_M.gguf', hf=True, **vars(llm_config))
186
- #tokenizer = AutoTokenizer.from_pretrained("NousResearch/Nous-Capybara-7B-V1.9")
187
- #generator = pipeline(task="text-generation", model=model, tokenizer=tokenizer)
188
-
189
- # Text generation with Llama 2
190
- #mistral_capybara = TextGeneration(generator, prompt=capybara_prompt)
191
- #mistral_hermes = TextGeneration(generator, prompt=open_hermes_prompt)
192
 
193
  return representation_model
194
 
 
3
  from llama_cpp import Llama
4
  from pydantic import BaseModel
5
  import torch.cuda
6
+ from huggingface_hub import hf_hub_download
7
 
8
  from bertopic.representation import KeyBERTInspired, MaximalMarginalRelevance, BaseRepresentation
9
+ from funcs.embeddings import torch_device
10
+ from funcs.prompts import phi3_prompt, phi3_start
 
11
 
12
  chosen_prompt = phi3_prompt #open_hermes_prompt # stablelm_prompt
13
  chosen_start_tag = phi3_start #open_hermes_start # stablelm_start
14
 
15
+ random_seed = 42
16
 
17
  # Offload layers to the GPU when CUDA is available; otherwise run entirely on the CPU
18
+ print("torch device for representation functions:", torch_device)
19
+ if torch_device == "gpu":
20
  low_resource_mode = "No"
21
+ n_gpu_layers = -1 # i.e. all
22
+ else: # torch_device = "cpu"
 
23
  low_resource_mode = "Yes"
24
  n_gpu_layers = 0
25
 
 
 
26
  #print("Running on device:", torch_device)
27
  n_threads = torch.get_num_threads()
28
  print("CPU n_threads:", n_threads)
 
34
  repeat_penalty: float = 1.1
35
  last_n_tokens_size: int = 128
36
  max_tokens: int = 500
37
+ seed: int = random_seed
38
  reset: bool = True
39
  stream: bool = False
40
  n_threads: int = n_threads
 
80
  trust_remote_code=trust_remote_code)
81
 
82
  ## Create representation model parameters ##
 
83
  keybert = KeyBERTInspired(random_state=random_seed)
 
84
  mmr = MaximalMarginalRelevance(diversity=0.5)
 
85
  base_rep = BaseRepresentation()
86
 
87
  # Find model file
88
+ def find_model_file(hf_model_name: str, hf_model_file: str, search_folder: str, sub_folder: str) -> str:
89
+ """
90
+ Finds the specified model file within the given search folder and subfolder.
91
+
92
+ Args:
93
+ hf_model_name (str): The name of the Hugging Face model.
94
+ hf_model_file (str): The specific file name of the model to find.
95
+ search_folder (str): The base folder to start the search.
96
+ sub_folder (str): The subfolder within the search folder to look into.
97
+
98
+ Returns:
99
+ str: The path to the found model file, or None if the file is not found.
100
+ """
101
+
102
  hf_loc = search_folder #os.environ["HF_HOME"]
103
  hf_sub_loc = search_folder + sub_folder #os.environ["HF_HOME"]
104
 
 
123
 
124
  return found_file
125
 
126
+ def create_representation_model(representation_type: str, llm_config: LLamacppInitConfigGpu, hf_model_name: str, hf_model_file: str, chosen_start_tag: str, low_resource_mode: str) -> dict:
127
+ """
128
+ Creates a representation model based on the specified type and configuration.
129
+
130
+ Args:
131
+ representation_type (str): The type of representation model to create (e.g., "LLM", "KeyBERT").
132
+ llm_config (LLamacppInitConfigGpu): Configuration object for the LLM; its n_gpu_layers and n_ctx attributes are read when loading the model.
133
+ hf_model_name (str): The name of the Hugging Face model.
134
+ hf_model_file (str): The specific file name of the model to find.
135
+ chosen_start_tag (str): The start tag to use for the model.
136
+ low_resource_mode (str): Whether low resource mode is enabled ("Yes" or "No").
137
 
138
+ Returns:
139
+ dict: A dictionary containing the created representation model.
140
+ """
141
 
142
  if representation_type == "LLM":
143
  print("Generating LLM representation")
144
  # Use llama.cpp to load in model
145
 
 
 
146
  # Check for HF_HOME environment variable and supply a default value if it's not found (typical location for huggingface models)
 
147
  base_folder = "model" #"~/.cache/huggingface/hub"
148
  hf_home_value = os.getenv("HF_HOME", base_folder)
149
 
 
175
 
176
  print("Loading representation model with", llm_config.n_gpu_layers, "layers allocated to GPU.")
177
 
178
+ #llm_config.n_gpu_layers
179
  llm = Llama(model_path=found_file, stop=chosen_start_tag, n_gpu_layers=llm_config.n_gpu_layers, n_ctx=llm_config.n_ctx,seed=seed) #**llm_config.model_dump())# rope_freq_scale=0.5,
180
  #print(llm.n_gpu_layers)
181
+ #print("Chosen prompt:", chosen_prompt)
182
  llm_model = LlamaCPP(llm, prompt=chosen_prompt)#, **gen_config.model_dump())
183
 
184
  # All representation models
 
198
  else:
199
  print("Generating default representation type")
200
  representation_model = {"Default":base_rep}
 
 
 
 
 
 
 
 
 
201
 
202
  return representation_model
203
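After this change the representation module takes `torch_device` from `funcs.embeddings` and derives its llama.cpp settings from it: all layers are offloaded on a GPU (`n_gpu_layers = -1`), none on a CPU. A minimal sketch of that mapping is shown below. The `LlmInitSketch` dataclass and `config_for_device` function are hypothetical and stand in for the Pydantic config class used in the real module.

```python
from dataclasses import dataclass


@dataclass
class LlmInitSketch:
    """Hypothetical, reduced stand-in for the LLM init config in the diff."""
    n_gpu_layers: int
    low_resource_mode: str
    seed: int = 42


def config_for_device(torch_device: str, seed: int = 42) -> LlmInitSketch:
    if torch_device == "gpu":
        # -1 asks llama.cpp to offload every layer to the GPU
        return LlmInitSketch(n_gpu_layers=-1, low_resource_mode="No", seed=seed)
    # Anything else is treated as CPU: keep all layers in main memory
    return LlmInitSketch(n_gpu_layers=0, low_resource_mode="Yes", seed=seed)


print(config_for_device("gpu"))
print(config_for_device("cpu"))
```

Deriving both values from one device string keeps the embedding and representation modules from disagreeing about whether a GPU is in use.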
 
funcs/topic_core_funcs.py CHANGED
@@ -8,12 +8,17 @@ import numpy as np
8
  import time
9
  from bertopic import BERTopic
10
 
11
- from funcs.clean_funcs import initial_clean
 
 
 
12
  from funcs.anonymiser import expand_sentences_spacy
13
- from funcs.helper_functions import read_file, zip_folder, delete_files_in_folder, save_topic_outputs
14
- from funcs.embeddings import make_or_load_embeddings
15
  from funcs.bertopic_vis_documents import visualize_documents_custom, visualize_hierarchical_documents_custom, hierarchical_topics_custom, visualize_hierarchy_custom
 
16
 
 
17
 
18
  from sentence_transformers import SentenceTransformer
19
  from sklearn.pipeline import make_pipeline
@@ -22,27 +27,10 @@ from sklearn.feature_extraction.text import TfidfVectorizer
22
  import funcs.anonymiser as anon
23
  from umap import UMAP
24
 
25
- from torch import cuda, backends, version
26
-
27
- # Default seed, can be changed in number selection on options page
28
- random_seed = 42
29
-
30
- # Check for torch cuda
31
- # If you want to disable cuda for testing purposes
32
- #os.environ['CUDA_VISIBLE_DEVICES'] = '-1'
33
-
34
- print("Is CUDA enabled? ", cuda.is_available())
35
- print("Is a CUDA device available on this computer?", backends.cudnn.enabled)
36
- if cuda.is_available():
37
- torch_device = "gpu"
38
- print("Cuda version installed is: ", version.cuda)
39
- low_resource_mode = "No"
40
- #os.system("nvidia-smi")
41
- else:
42
- torch_device = "cpu"
43
- low_resource_mode = "Yes"
44
-
45
- print("Device used is: ", torch_device)
46
 
47
  today = datetime.now().strftime("%d%m%Y")
48
  today_rev = datetime.now().strftime("%Y%m%d")
@@ -54,7 +42,35 @@ embeddings_name = "mixedbread-ai/mxbai-embed-large-v1" #"BAAI/large-small-en-v1.
54
  hf_model_name = "QuantFactory/Phi-3-mini-128k-instruct-GGUF"#'second-state/stablelm-2-zephyr-1.6b-GGUF' #'TheBloke/phi-2-orange-GGUF' #'NousResearch/Nous-Capybara-7B-V1.9-GGUF'
55
  hf_model_file = "Phi-3-mini-128k-instruct.Q4_K_M.gguf"#'stablelm-2-zephyr-1_6b-Q5_K_M.gguf' # 'phi-2-orange.Q5_K_M.gguf' #'Capybara-7B-V1.9-Q5_K_M.gguf'
56
 
57
- def pre_clean(data, in_colnames, data_file_name_no_ext, custom_regex, clean_text, drop_duplicate_text, anonymise_drop, sentence_split_drop, progress=gr.Progress(track_tqdm=True)):
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
58
 
59
  output_text = ""
60
  output_list = []
@@ -64,7 +80,7 @@ def pre_clean(data, in_colnames, data_file_name_no_ext, custom_regex, clean_text
64
  if not in_colnames:
65
  error_message = "Please enter one column name to use for cleaning and finding topics."
66
  print(error_message)
67
- return error_message, None, data_file_name_no_ext, None, None
68
 
69
  all_tic = time.perf_counter()
70
 
@@ -77,17 +93,23 @@ def pre_clean(data, in_colnames, data_file_name_no_ext, custom_regex, clean_text
77
  clean_tic = time.perf_counter()
78
  print("Starting data clean.")
79
 
80
- data_file_name_no_ext = data_file_name_no_ext + "_clean"
81
 
82
- if not custom_regex.empty:
83
- data[in_colnames_list_first] = initial_clean(data[in_colnames_list_first], custom_regex.iloc[:, 0].to_list())
84
- else:
85
- data[in_colnames_list_first] = initial_clean(data[in_colnames_list_first], [])
86
 
87
  clean_toc = time.perf_counter()
88
  clean_time_out = f"Cleaning the text took {clean_toc - clean_tic:0.1f} seconds."
89
  print(clean_time_out)
90
 
 
 
 
 
 
 
 
 
91
  if drop_duplicate_text == "Yes":
92
  progress(0.3, desc= "Drop duplicates - remove short texts")
93
 
@@ -104,7 +126,8 @@ def pre_clean(data, in_colnames, data_file_name_no_ext, custom_regex, clean_text
104
  if anonymise_drop == "Yes":
105
  progress(0.6, desc= "Anonymising data")
106
 
107
- data_file_name_no_ext = data_file_name_no_ext + "_anon"
 
108
 
109
  anon_tic = time.perf_counter()
110
 
@@ -120,17 +143,19 @@ def pre_clean(data, in_colnames, data_file_name_no_ext, custom_regex, clean_text
120
  if sentence_split_drop == "Yes":
121
  progress(0.6, desc= "Splitting text into sentences")
122
 
123
- data_file_name_no_ext = data_file_name_no_ext + "_split"
 
124
 
125
  anon_tic = time.perf_counter()
126
 
127
  data = expand_sentences_spacy(data, in_colnames_list_first)
128
- data = data[data[in_colnames_list_first].str.len() >= 5] # Keep only rows with at least 5 characters
 
129
 
130
  anon_toc = time.perf_counter()
131
  time_out = f"Splitting text into sentences took {anon_toc - anon_tic:0.1f} seconds"
132
 
133
- out_data_name = data_file_name_no_ext + "_" + today_rev + ".csv"
134
  data.to_csv(out_data_name)
135
  output_list.append(out_data_name)
136
 
@@ -140,14 +165,84 @@ def pre_clean(data, in_colnames, data_file_name_no_ext, custom_regex, clean_text
140
 
141
  output_text = "Data clean completed."
142
 
143
- return output_text, output_list, data, data_file_name_no_ext
144
-
145
- def extract_topics(data, in_files, min_docs_slider, in_colnames, max_topics_slider, candidate_topics, data_file_name_no_ext, custom_labels_df, return_intermediate_files, embeddings_super_compress, low_resource_mode, save_topic_model, embeddings_out, embeddings_type_state, zero_shot_similarity, random_seed, calc_probs, vectoriser_state, progress=gr.Progress(track_tqdm=True)):
146
-
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
147
  all_tic = time.perf_counter()
148
 
149
  progress(0, desc= "Loading data")
150
 
 
 
151
  output_list = []
152
  file_list = [string.name for string in in_files]
153
 
@@ -170,10 +265,9 @@ def extract_topics(data, in_files, min_docs_slider, in_colnames, max_topics_slid
170
  # Check if embeddings are being loaded in
171
  progress(0.2, desc= "Loading/creating embeddings")
172
 
173
- print("Low resource mode: ", low_resource_mode)
174
 
175
- if low_resource_mode == "No":
176
- print("Using high resource embedding model")
177
 
178
  # Define a list of possible local locations to search for the model
179
  local_embeddings_locations = [
@@ -205,7 +299,7 @@ def extract_topics(data, in_files, min_docs_slider, in_colnames, max_topics_slid
205
  embeddings_type_state = "large"
206
 
207
  # UMAP model uses Bertopic defaults
208
- umap_model = UMAP(n_neighbors=15, n_components=5, min_dist=0.0, metric='cosine', low_memory=False, random_state=random_seed)
209
 
210
  else:
211
  print("Choosing low resource TF-IDF model.")
@@ -223,9 +317,9 @@ def extract_topics(data, in_files, min_docs_slider, in_colnames, max_topics_slid
223
 
224
  #umap_model = TruncatedSVD(n_components=5, random_state=random_seed)
225
  # UMAP model uses Bertopic defaults
226
- umap_model = UMAP(n_neighbors=15, n_components=5, min_dist=0.0, metric='cosine', low_memory=True, random_state=random_seed)
227
 
228
- embeddings_out = make_or_load_embeddings(docs, file_list, embeddings_out, embedding_model, embeddings_super_compress, low_resource_mode)
229
 
230
  # This is saved as a Gradio state object
231
  vectoriser_model = vectoriser_state
@@ -250,7 +344,7 @@ def extract_topics(data, in_files, min_docs_slider, in_colnames, max_topics_slid
250
 
251
  if calc_probs == True:
252
  topics_probs_out = pd.DataFrame(topic_model.probabilities_)
253
- topics_probs_out_name = "topic_full_probs_" + data_file_name_no_ext + "_" + today_rev + ".csv"
254
  topics_probs_out.to_csv(topics_probs_out_name)
255
  output_list.append(topics_probs_out_name)
256
 
@@ -258,20 +352,24 @@ def extract_topics(data, in_files, min_docs_slider, in_colnames, max_topics_slid
258
  print(error)
259
  print(fail_error_message)
260
 
261
- return fail_error_message, output_list, embeddings_out, embeddings_type_state, data_file_name_no_ext, None, docs, vectoriser_model, []
 
 
262
 
263
 
264
  # Do this if you have pre-defined topics
265
  else:
266
- if low_resource_mode == "Yes":
267
- error_message = "Zero shot topic modelling currently not compatible with low-resource embeddings. Please change this option to 'No' on the options tab and retry."
268
- print(error_message)
269
 
270
- return error_message, output_list, embeddings_out, embeddings_type_state, data_file_name_no_ext, None, docs, vectoriser_model, []
271
 
272
  zero_shot_topics = read_file(candidate_topics.name)
273
  zero_shot_topics_lower = list(zero_shot_topics.iloc[:, 0].str.lower())
274
 
 
 
275
 
276
  try:
277
  topic_model = BERTopic( embedding_model=embedding_model, #embedding_model_pipe, # for Jina
@@ -288,7 +386,7 @@ def extract_topics(data, in_files, min_docs_slider, in_colnames, max_topics_slid
288
 
289
  if calc_probs == True:
290
  topics_probs_out = pd.DataFrame(topic_model.probabilities_)
291
- topics_probs_out_name = "topic_full_probs_" + data_file_name_no_ext + "_" + today_rev + ".csv"
292
  topics_probs_out.to_csv(topics_probs_out_name)
293
  output_list.append(topics_probs_out_name)
294
 
@@ -296,14 +394,14 @@ def extract_topics(data, in_files, min_docs_slider, in_colnames, max_topics_slid
296
  print("An exception occurred:", error)
297
  print(fail_error_message)
298
 
299
- return fail_error_message, output_list, embeddings_out, embeddings_type_state, data_file_name_no_ext, None, docs, vectoriser_model, []
 
 
300
 
301
  # For some reason, zero shot topic modelling returns the assigned topics as an np.ndarray instead of a list. Convert it back here.
302
  if isinstance(assigned_topics, np.ndarray):
303
  assigned_topics = assigned_topics.tolist()
304
 
305
-
306
-
307
  # Zero shot modelling is a model merge, which wipes the c_tf_idf part of the resulting model completely. To get hierarchical modelling to work, we need to recreate this part of the model with the CountVectorizer options used to create the initial model. Since zero shot merges two models built on exactly the same set of documents, the vocabulary should be the same, so recreating the c_tf_idf component in this way shouldn't be a problem. Discussion here, with the code below based on Maarten's suggestion: https://github.com/MaartenGr/BERTopic/issues/1700
308
 
309
  # Get document info
@@ -312,16 +410,12 @@ def extract_topics(data, in_files, min_docs_slider, in_colnames, max_topics_slid
312
  documents_per_topic = doc_dets.groupby(['Topic'], as_index=False).agg({'Document': ' '.join})
313
 
314
  # Assign CountVectorizer to merged model
315
-
316
  topic_model.vectorizer_model = vectoriser_model
317
 
318
  # Re-calculate c-TF-IDF
319
  c_tf_idf, _ = topic_model._c_tf_idf(documents_per_topic)
320
  topic_model.c_tf_idf_ = c_tf_idf
321
 
322
- ###
323
-
324
-
325
  # Check we have topics
326
  if not assigned_topics:
327
  return "No topics found.", output_list, embeddings_out, embeddings_type_state, data_file_name_no_ext, topic_model, docs, vectoriser_model,[]
@@ -329,8 +423,14 @@ def extract_topics(data, in_files, min_docs_slider, in_colnames, max_topics_slid
329
  print("Topic model created.")
330
 
331
  # Tidy up topic label format a bit to have commas and spaces by default
332
- new_topic_labels = topic_model.generate_topic_labels(nr_words=3, separator=", ")
333
- topic_model.set_topic_labels(new_topic_labels)
 
 
 
 
 
 
334
 
335
  # Replace current topic labels if new ones loaded in
336
  if not custom_labels_df.empty:
@@ -342,18 +442,18 @@ def extract_topics(data, in_files, min_docs_slider, in_colnames, max_topics_slid
342
  print("Custom topics: ", topic_model.custom_labels_)
343
 
344
  # Outputs
345
- output_list, output_text = save_topic_outputs(topic_model, data_file_name_no_ext, output_list, docs, save_topic_model)
346
 
347
  # If you want to save your embedding files
348
  if return_intermediate_files == "Yes":
349
  print("Saving embeddings to file")
350
- if low_resource_mode == "Yes":
351
- embeddings_file_name = data_file_name_no_ext + '_' + 'tfidf_embeddings.npz'
352
  else:
353
  if embeddings_super_compress == "No":
354
- embeddings_file_name = data_file_name_no_ext + '_' + 'large_embeddings.npz'
355
  else:
356
- embeddings_file_name = data_file_name_no_ext + '_' + 'large_embeddings_compress.npz'
357
 
358
  np.savez_compressed(embeddings_file_name, embeddings_out)
359
 
@@ -365,7 +465,25 @@ def extract_topics(data, in_files, min_docs_slider, in_colnames, max_topics_slid
365
 
366
  return output_text, output_list, embeddings_out, embeddings_type_state, data_file_name_no_ext, topic_model, docs, vectoriser_model, assigned_topics
367
 
368
- def reduce_outliers(topic_model, docs, embeddings_out, data_file_name_no_ext, assigned_topics, vectoriser_model, save_topic_model, progress=gr.Progress(track_tqdm=True)):
369
 
370
  progress(0, desc= "Preparing data")
371
 
@@ -373,13 +491,9 @@ def reduce_outliers(topic_model, docs, embeddings_out, data_file_name_no_ext, as
373
 
374
  all_tic = time.perf_counter()
375
 
376
- # This step not necessary?
377
- #assigned_topics, probs = topic_model.fit_transform(docs, embeddings_out)
378
-
379
  if isinstance(assigned_topics, np.ndarray):
380
  assigned_topics = assigned_topics.tolist()
381
 
382
-
383
  # Reduce outliers if required, then update representation
384
  progress(0.2, desc= "Reducing outliers")
385
  print("Reducing outliers.")
@@ -397,20 +511,9 @@ def reduce_outliers(topic_model, docs, embeddings_out, data_file_name_no_ext, as
397
 
398
  print("Finished reducing outliers.")
399
 
400
- #progress(0.7, desc= "Replacing topic names with LLMs if necessary")
401
-
402
- #topic_dets = topic_model.get_topic_info()
403
-
404
- # # Replace original labels with LLM labels
405
- # if "LLM" in topic_model.get_topic_info().columns:
406
- # llm_labels = [label[0][0].split("\n")[0] for label in topic_model.get_topics(full=True)["LLM"].values()]
407
- # topic_model.set_topic_labels(llm_labels)
408
- # else:
409
- # topic_model.set_topic_labels(list(topic_dets["Name"]))
410
-
411
  # Outputs
412
  progress(0.9, desc= "Saving to file")
413
- output_list, output_text = save_topic_outputs(topic_model, data_file_name_no_ext, output_list, docs, save_topic_model)
414
 
415
  all_toc = time.perf_counter()
416
  time_out = f"All processes took {all_toc - all_tic:0.1f} seconds"
@@ -418,16 +521,35 @@ def reduce_outliers(topic_model, docs, embeddings_out, data_file_name_no_ext, as
418
 
419
  return output_text, output_list, topic_model
420
 
421
- def represent_topics(topic_model, docs, data_file_name_no_ext, low_resource_mode, save_topic_model, representation_type, vectoriser_model, progress=gr.Progress(track_tqdm=True)):
422
- from funcs.representation_model import create_representation_model, llm_config, chosen_start_tag
423
 
424
  output_list = []
425
 
426
  all_tic = time.perf_counter()
427
 
428
- progress(0.1, desc= "Loading model and creating new representation")
 
 
429
 
430
- representation_model = create_representation_model(representation_type, llm_config, hf_model_name, hf_model_file, chosen_start_tag, low_resource_mode)
431
 
432
  progress(0.3, desc= "Updating existing topics")
433
  topic_model.update_topics(docs, vectorizer_model=vectoriser_model, representation_model=representation_model)
@@ -439,7 +561,7 @@ def represent_topics(topic_model, docs, data_file_name_no_ext, low_resource_mode
439
  llm_labels = [label[0].split("\n")[0] for label in topic_dets["LLM"]]
440
  topic_model.set_topic_labels(llm_labels)
441
 
442
- label_list_file_name = data_file_name_no_ext + '_llm_topic_list_' + today_rev + '.csv'
443
 
444
  llm_labels_df = pd.DataFrame(data={"Label":llm_labels})
445
  llm_labels_df.to_csv(label_list_file_name, index=None)
@@ -452,7 +574,7 @@ def represent_topics(topic_model, docs, data_file_name_no_ext, low_resource_mode
452
 
453
  # Outputs
454
  progress(0.8, desc= "Saving outputs")
455
- output_list, output_text = save_topic_outputs(topic_model, data_file_name_no_ext, output_list, docs, save_topic_model)
456
 
457
  all_toc = time.perf_counter()
458
  time_out = f"All processes took {all_toc - all_tic:0.1f} seconds"
@@ -460,11 +582,51 @@ def represent_topics(topic_model, docs, data_file_name_no_ext, low_resource_mode
460
 
461
  return output_text, output_list, topic_model
462
 
463
- def visualise_topics(topic_model, data, data_file_name_no_ext, low_resource_mode, embeddings_out, in_label, in_colnames, legend_label, sample_prop, visualisation_type_radio, random_seed, progress=gr.Progress(track_tqdm=True)):
464
 
465
  progress(0, desc= "Preparing data for visualisation")
466
 
467
  output_list = []
 
468
  vis_tic = time.perf_counter()
469
 
470
 
@@ -500,30 +662,37 @@ def visualise_topics(topic_model, data, data_file_name_no_ext, low_resource_mode
500
  topic_model.set_topic_labels(labels)
501
 
502
  # Pre-reduce embeddings for visualisation purposes
503
- if low_resource_mode == "No":
504
- reduced_embeddings = UMAP(n_neighbors=15, n_components=2, min_dist=0.0, metric='cosine', random_state=random_seed).fit_transform(embeddings_out)
505
  else:
506
  reduced_embeddings = TruncatedSVD(2, random_state=random_seed).fit_transform(embeddings_out)
507
 
508
- progress(0.5, desc= "Creating visualisation (this can take a while)")
509
  # Visualise the topics:
510
 
511
- print("Creating visualisation")
512
-
513
- # "Topic document graph", "Hierarchical view"
514
 
515
  if visualisation_type_radio == "Topic document graph":
516
- topics_vis = visualize_documents_custom(topic_model, docs, hover_labels = label_list, reduced_embeddings=reduced_embeddings, hide_annotations=True, hide_document_hover=False, custom_labels=True, sample = sample_prop, width= 1200, height = 750)
 
517
 
518
- topics_vis_name = data_file_name_no_ext + '_' + 'vis_topic_docs_' + today_rev + '.html'
519
- topics_vis.write_html(topics_vis_name)
520
- output_list.append(topics_vis_name)
 
 
 
 
521
 
522
- topics_vis_2 = topic_model.visualize_heatmap(custom_labels=True, width= 1200, height = 1200)
 
523
 
524
- topics_vis_2_name = data_file_name_no_ext + '_' + 'vis_heatmap_' + today_rev + '.html'
525
- topics_vis_2.write_html(topics_vis_2_name)
526
- output_list.append(topics_vis_2_name)
 
 
 
527
 
528
  elif visualisation_type_radio == "Hierarchical view":
529
 
@@ -532,7 +701,7 @@ def visualise_topics(topic_model, data, data_file_name_no_ext, low_resource_mode
532
  # Print topic tree - may get encoding errors, so doing try except
533
  try:
534
  tree = topic_model.get_topic_tree(hierarchical_topics, tight_layout = True)
535
- tree_name = data_file_name_no_ext + '_' + 'vis_hierarchy_tree_' + today_rev + '.txt'
536
 
537
  with open(tree_name, "w") as file:
538
  # Write the string to the file
@@ -540,59 +709,71 @@ def visualise_topics(topic_model, data, data_file_name_no_ext, low_resource_mode
540
 
541
  output_list.append(tree_name)
542
 
543
- except Exception as error:
544
- print("An exception occurred when making topic tree document, skipped:", error)
 
 
545
 
546
 
547
  # Save new hierarchical topic model to file
548
- hierarchical_topics_name = data_file_name_no_ext + '_' + 'vis_hierarchy_topics_dist_' + today_rev + '.csv'
549
- hierarchical_topics.to_csv(hierarchical_topics_name, index = None)
550
- output_list.append(hierarchical_topics_name)
551
-
552
-
553
- #try:
554
- topics_vis, hierarchy_df, hierarchy_topic_names = visualize_hierarchical_documents_custom(topic_model, docs, label_list, hierarchical_topics, hide_annotations=True, reduced_embeddings=reduced_embeddings, sample = sample_prop, hide_document_hover= False, custom_labels=True, width= 1200, height = 750)
555
- topics_vis_2 = visualize_hierarchy_custom(topic_model, hierarchical_topics=hierarchical_topics, width= 1200, height = 750)
 
 
 
 
556
 
557
  # Write hierarchical topics levels to df
558
- hierarchy_df_name = data_file_name_no_ext + '_' + 'hierarchy_topics_df_' + today_rev + '.csv'
559
  hierarchy_df.to_csv(hierarchy_df_name, index = None)
560
  output_list.append(hierarchy_df_name)
561
 
562
  # Write hierarchical topics names to df
563
- hierarchy_topic_names_name = data_file_name_no_ext + '_' + 'hierarchy_topics_names_' + today_rev + '.csv'
564
  hierarchy_topic_names.to_csv(hierarchy_topic_names_name, index = None)
565
  output_list.append(hierarchy_topic_names_name)
566
 
567
- #except:
568
- # error_message = "Visualisation preparation failed. Perhaps you need more topics to create the full hierarchy (more than 10)?"
569
- # return error_message, output_list, None, None
570
 
571
- topics_vis_name = data_file_name_no_ext + '_' + 'vis_hierarchy_topic_doc_' + today_rev + '.html'
572
  topics_vis.write_html(topics_vis_name)
573
  output_list.append(topics_vis_name)
574
 
575
- topics_vis_2_name = data_file_name_no_ext + '_' + 'vis_hierarchy_' + today_rev + '.html'
576
  topics_vis_2.write_html(topics_vis_2_name)
577
  output_list.append(topics_vis_2_name)
578
 
579
  all_toc = time.perf_counter()
580
- time_out = f"Creating visualisation took {all_toc - vis_tic:0.1f} seconds"
581
- print(time_out)
 
 
582
 
583
- return time_out, output_list, topics_vis, topics_vis_2
 
 
584
 
585
- def save_as_pytorch_model(topic_model, data_file_name_no_ext , progress=gr.Progress(track_tqdm=True)):
 
 
 
 
 
 
 
586
 
587
  if not topic_model:
588
- return "No Pytorch model found.", None
 
589
 
590
  progress(0, desc= "Saving topic model in Pytorch format")
591
 
592
- output_list = []
593
-
594
-
595
- topic_model_save_name_folder = "output_model/" + data_file_name_no_ext + "_topics_" + today_rev# + ".safetensors"
596
  topic_model_save_name_zip = topic_model_save_name_folder + ".zip"
597
 
598
  # Clear folder before replacing files
@@ -600,9 +781,10 @@ def save_as_pytorch_model(topic_model, data_file_name_no_ext , progress=gr.Progr
600
 
601
  topic_model.save(topic_model_save_name_folder, serialization='pytorch', save_embedding_model=True, save_ctfidf=False)
602
 
603
- # Zip file example
604
-
605
  zip_folder(topic_model_save_name_folder, topic_model_save_name_zip)
606
  output_list.append(topic_model_save_name_zip)
607
 
608
- return "Model saved in Pytorch format.", output_list
 
 
 
8
  import time
9
  from bertopic import BERTopic
10
 
11
+ from typing import List, Type, Union
12
+ PandasDataFrame = Type[pd.DataFrame]
13
+
14
+ from funcs.clean_funcs import initial_clean, regex_clean
15
  from funcs.anonymiser import expand_sentences_spacy
16
+ from funcs.helper_functions import read_file, zip_folder, delete_files_in_folder, save_topic_outputs, output_folder
17
+ from funcs.embeddings import make_or_load_embeddings, torch_device
18
  from funcs.bertopic_vis_documents import visualize_documents_custom, visualize_hierarchical_documents_custom, hierarchical_topics_custom, visualize_hierarchy_custom
19
+ from funcs.representation_model import create_representation_model, llm_config, chosen_start_tag, random_seed
20
 
21
+ from sklearn.feature_extraction.text import CountVectorizer
22
 
23
  from sentence_transformers import SentenceTransformer
24
  from sklearn.pipeline import make_pipeline
 
27
  import funcs.anonymiser as anon
28
  from umap import UMAP
29
 
30
+ # Default options can be changed in number selection on options page
31
+ umap_n_neighbours = 15
32
+ umap_min_dist = 0.0
33
+ umap_metric = 'cosine'
 
34
 
35
  today = datetime.now().strftime("%d%m%Y")
36
  today_rev = datetime.now().strftime("%Y%m%d")
 
42
  hf_model_name = "QuantFactory/Phi-3-mini-128k-instruct-GGUF"#'second-state/stablelm-2-zephyr-1.6b-GGUF' #'TheBloke/phi-2-orange-GGUF' #'NousResearch/Nous-Capybara-7B-V1.9-GGUF'
43
  hf_model_file = "Phi-3-mini-128k-instruct.Q4_K_M.gguf"#'stablelm-2-zephyr-1_6b-Q5_K_M.gguf' # 'phi-2-orange.Q5_K_M.gguf' #'Capybara-7B-V1.9-Q5_K_M.gguf'
44
 
45
+ # When topic modelling column is chosen, change the default visualisation column to the same
46
+ def change_default_vis_col(in_colnames:List[str]):
47
+ '''
48
+ When topic modelling column is chosen, change the default visualisation column to the same
49
+ '''
50
+ if in_colnames:
51
+ return gr.Dropdown(value=in_colnames[0])
52
+ else:
53
+ return gr.Dropdown()
54
+
55
+ def pre_clean(data: pd.DataFrame, in_colnames: list, data_file_name_no_ext: str, custom_regex: pd.DataFrame, clean_text: str, drop_duplicate_text: str, anonymise_drop: str, sentence_split_drop: str, embeddings_state: dict, progress: gr.Progress = gr.Progress(track_tqdm=True)) -> tuple:
56
+ """
57
+ Pre-processes the input data by cleaning text, removing duplicates, anonymizing data, and splitting sentences based on the provided options.
58
+
59
+ Args:
60
+ data (pd.DataFrame): The input data to be cleaned.
61
+ in_colnames (list): List of column names to be used for cleaning and finding topics.
62
+ data_file_name_no_ext (str): The base name of the data file without extension.
63
+ custom_regex (pd.DataFrame): Custom regex patterns for initial cleaning.
64
+ clean_text (str): Option to clean text ("Yes" or "No").
65
+ drop_duplicate_text (str): Option to drop duplicate text ("Yes" or "No").
66
+ anonymise_drop (str): Option to anonymize data ("Yes" or "No").
67
+ sentence_split_drop (str): Option to split text into sentences ("Yes" or "No").
68
+ embeddings_state (dict): State of the embeddings.
69
+ progress (gr.Progress, optional): Progress tracker for the cleaning process.
70
+
71
+ Returns:
72
+ tuple: A tuple containing the error message (if any), cleaned data, updated file name, and embeddings state.
73
+ """
74
 
75
  output_text = ""
76
  output_list = []
 
80
  if not in_colnames:
81
  error_message = "Please enter one column name to use for cleaning and finding topics."
82
  print(error_message)
83
+ return error_message, None, data_file_name_no_ext, None, None, embeddings_state
84
 
85
  all_tic = time.perf_counter()
86
 
 
93
  clean_tic = time.perf_counter()
94
  print("Starting data clean.")
95
 
96
+ data[in_colnames_list_first] = initial_clean(data[in_colnames_list_first], [])
97
 
98
+ if '_clean' not in data_file_name_no_ext:
99
+ data_file_name_no_ext = data_file_name_no_ext + "_clean"
 
 
100
 
101
  clean_toc = time.perf_counter()
102
  clean_time_out = f"Cleaning the text took {clean_toc - clean_tic:0.1f} seconds."
103
  print(clean_time_out)
104
 
105
+ # Clean custom regex if exists
106
+ if not custom_regex.empty:
107
+ data[in_colnames_list_first] = regex_clean(data[in_colnames_list_first], custom_regex.iloc[:, 0].to_list())
108
+
109
+ if '_clean' not in data_file_name_no_ext:
110
+ data_file_name_no_ext = data_file_name_no_ext + "_clean"
111
+
112
+
113
  if drop_duplicate_text == "Yes":
114
  progress(0.3, desc= "Drop duplicates - remove short texts")
115
 
 
126
  if anonymise_drop == "Yes":
127
  progress(0.6, desc= "Anonymising data")
128
 
129
+ if '_anon' not in data_file_name_no_ext:
130
+ data_file_name_no_ext = data_file_name_no_ext + "_anon"
131
 
132
  anon_tic = time.perf_counter()
133
 
 
143
  if sentence_split_drop == "Yes":
144
  progress(0.6, desc= "Splitting text into sentences")
145
 
146
+ if '_split' not in data_file_name_no_ext:
147
+ data_file_name_no_ext = data_file_name_no_ext + "_split"
148
 
149
  anon_tic = time.perf_counter()
150
 
151
  data = expand_sentences_spacy(data, in_colnames_list_first)
152
+ data = data[data[in_colnames_list_first].str.len() >= 25] # Keep only rows with at least 25 characters
153
+ data.reset_index(inplace=True, drop=True)
154
 
155
  anon_toc = time.perf_counter()
156
  time_out = f"Anonymising text took {anon_toc - anon_tic:0.1f} seconds"
157
 
158
+ out_data_name = output_folder + data_file_name_no_ext + "_" + today_rev + ".csv"
159
  data.to_csv(out_data_name)
160
  output_list.append(out_data_name)
161
 
 
165
 
166
  output_text = "Data clean completed."
167
 
168
+ # Overwrite existing embeddings as they will likely have changed
169
+ return output_text, output_list, data, data_file_name_no_ext, np.array([])
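The suffix handling in `pre_clean` repeats the same "append only if missing" check for `_clean`, `_anon` and `_split`. A minimal sketch of how that check could be factored out, assuming a hypothetical helper `add_suffix_once` that is not part of this commit:

```python
def add_suffix_once(name: str, suffix: str) -> str:
    """Append suffix to name unless name already contains it.

    Hypothetical helper; mirrors the `if '_clean' not in ...` checks in pre_clean.
    """
    return name if suffix in name else name + suffix


# Apply the suffixes in the same order as pre_clean: clean, anonymise, split.
name = "survey"
for suffix in ("_clean", "_anon", "_split"):
    name = add_suffix_once(name, suffix)

# Running the same steps again on the result should leave the name unchanged.
rerun = name
for suffix in ("_clean", "_anon", "_split"):
    rerun = add_suffix_once(rerun, suffix)
```

Because the check is a substring test, a source file whose name already happens to contain `_clean` would never receive the suffix; the same is true of the code in the diff.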
170
+
171
+ def optimise_zero_shot():
172
+ """
173
+ Return options that optimise the topic model to keep only zero-shot topics as the main topics
174
+ """
175
+ return gr.Dropdown(value="Yes"), gr.Slider(value=2), gr.Slider(value=2), gr.Slider(value=0.01), gr.Slider(value=0.95), gr.Slider(value=0.55)
176
+
177
+ def extract_topics(
178
+ data: pd.DataFrame,
179
+ in_files: list,
180
+ min_docs_slider: int,
181
+ in_colnames: list,
182
+ max_topics_slider: int,
183
+ candidate_topics: list,
184
+ data_file_name_no_ext: str,
185
+ custom_labels_df: pd.DataFrame,
186
+ return_intermediate_files: str,
187
+ embeddings_super_compress: str,
188
+ high_quality_mode: str,
189
+ save_topic_model: str,
190
+ embeddings_out: np.ndarray,
191
+ embeddings_type_state: str,
192
+ zero_shot_similarity: float,
193
+ calc_probs: str,
194
+ vectoriser_state: CountVectorizer,
195
+ min_word_occurence_slider: float,
196
+ max_word_occurence_slider: float,
197
+ split_sentence_drop: str,
198
+ random_seed: int = random_seed,
199
+ output_folder: str = output_folder,
200
+ umap_n_neighbours:int = umap_n_neighbours,
201
+ umap_min_dist:float = umap_min_dist,
202
+ umap_metric:str = umap_metric,
203
+ progress: gr.Progress = gr.Progress(track_tqdm=True)
204
+ ) -> tuple:
205
+ """
206
+ Extract topics from the given data using various parameters and settings.
207
+
208
+ Args:
209
+ data (pd.DataFrame): The input data.
210
+ in_files (list): List of input files.
211
+ min_docs_slider (int): Minimum number of similar documents needed to make a topic.
212
+ in_colnames (list): List of column names to use for cleaning and finding topics.
213
+ max_topics_slider (int): Maximum number of topics.
214
+ candidate_topics (list): List of candidate topics.
215
+ data_file_name_no_ext (str): Data file name without extension.
216
+ custom_labels_df (pd.DataFrame): DataFrame containing custom labels.
217
+ return_intermediate_files (str): Whether to return intermediate files.
218
+ embeddings_super_compress (str): Whether to round embeddings to three decimal places.
219
+ high_quality_mode (str): Whether to use high quality (transformers based) embeddings.
220
+ save_topic_model (str): Whether to save the topic model.
221
+ embeddings_out (np.ndarray): Output embeddings.
222
+ embeddings_type_state (str): State of the embeddings type.
223
+ zero_shot_similarity (float): Zero-shot similarity threshold.
224
+ random_seed (int): Random seed for reproducibility.
225
+ calc_probs (str): Whether to calculate all topic probabilities.
226
+ vectoriser_state (CountVectorizer): Vectorizer state.
227
+ min_word_occurence_slider (float): Minimum word occurrence slider value.
228
+ max_word_occurence_slider (float): Maximum word occurrence slider value.
229
+ split_sentence_drop (str): Whether to split open text into sentences.
231
+ output_folder (str, optional): Output folder. Defaults to output_folder.
232
+ umap_n_neighbours (int): Nearest neighbours value for UMAP.
233
+ umap_min_dist (float): Minimum distance for UMAP.
234
+ umap_metric (str): Metric for UMAP.
235
+ progress (gr.Progress, optional): Progress tracker. Defaults to gr.Progress(track_tqdm=True).
236
+
237
+ Returns:
238
+ tuple: A tuple containing the output text, output list, embeddings, embeddings type state, data file name without extension, topic model, documents, vectoriser model, and assigned topics.
239
+ """
240
  all_tic = time.perf_counter()
241
 
242
  progress(0, desc= "Loading data")
243
 
244
+ vectoriser_state = CountVectorizer(stop_words="english", ngram_range=(1, 2), min_df=min_word_occurence_slider, max_df=max_word_occurence_slider)
245
+
246
  output_list = []
247
  file_list = [string.name for string in in_files]
248
 
 
265
  # Check if embeddings are being loaded in
266
  progress(0.2, desc= "Loading/creating embeddings")
267
 
 
268
 
269
+ if high_quality_mode == "Yes":
270
+ print("Using high quality embedding model")
271
 
272
  # Define a list of possible local locations to search for the model
273
  local_embeddings_locations = [
 
299
  embeddings_type_state = "large"
300
 
301
  # UMAP model uses BERTopic defaults unless changed on the options page
302
+ umap_model = UMAP(n_neighbors=umap_n_neighbours, n_components=5, min_dist=umap_min_dist, metric=umap_metric, low_memory=False, random_state=random_seed)
303
 
304
  else:
305
  print("Choosing low resource TF-IDF model.")
 
317
 
318
  #umap_model = TruncatedSVD(n_components=5, random_state=random_seed)
319
  # UMAP model uses BERTopic defaults unless changed on the options page
320
+ umap_model = UMAP(n_neighbors=umap_n_neighbours, n_components=5, min_dist=umap_min_dist, metric=umap_metric, low_memory=True, random_state=random_seed)
321
 
322
+ embeddings_out = make_or_load_embeddings(docs, file_list, embeddings_out, embedding_model, embeddings_super_compress, high_quality_mode)
323
 
324
  # This is saved as a Gradio state object
325
  vectoriser_model = vectoriser_state
 
344
 
345
  if calc_probs == True:
346
  topics_probs_out = pd.DataFrame(topic_model.probabilities_)
347
+ topics_probs_out_name = output_folder + "topic_full_probs_" + data_file_name_no_ext + "_" + today_rev + ".csv"
348
  topics_probs_out.to_csv(topics_probs_out_name)
349
  output_list.append(topics_probs_out_name)
350
 
 
352
  print(error)
353
  print(fail_error_message)
354
 
355
+ out_fail_error_message = '\n'.join([fail_error_message, str(error)])
356
+
357
+ return out_fail_error_message, output_list, embeddings_out, embeddings_type_state, data_file_name_no_ext, None, docs, vectoriser_model, []
358
 
359
 
360
  # Do this if you have pre-defined topics
361
  else:
362
+ #if high_quality_mode == "No":
363
+ # error_message = "Zero shot topic modelling currently not compatible with low-resource embeddings. Please change the high quality embeddings option to 'Yes' on the options tab and retry."
364
+ # print(error_message)
365
 
366
+ # return error_message, output_list, embeddings_out, embeddings_type_state, data_file_name_no_ext, None, docs, vectoriser_model, []
367
 
368
  zero_shot_topics = read_file(candidate_topics.name)
369
  zero_shot_topics_lower = list(zero_shot_topics.iloc[:, 0].str.lower())
370
 
371
+ print("Zero shot topics are:", zero_shot_topics_lower)
372
+
373
 
374
  try:
375
  topic_model = BERTopic( embedding_model=embedding_model, #embedding_model_pipe, # for Jina
 
386
 
387
  if calc_probs == True:
388
  topics_probs_out = pd.DataFrame(topic_model.probabilities_)
389
+ topics_probs_out_name = output_folder + "topic_full_probs_" + data_file_name_no_ext + "_" + today_rev + ".csv"
390
  topics_probs_out.to_csv(topics_probs_out_name)
391
  output_list.append(topics_probs_out_name)
392
 
 
394
  print("An exception occurred:", error)
395
  print(fail_error_message)
396
 
397
+ out_fail_error_message = '\n'.join([fail_error_message, str(error)])
398
+
399
+ return out_fail_error_message, output_list, embeddings_out, embeddings_type_state, data_file_name_no_ext, None, docs, vectoriser_model, []
400
 
401
  # For some reason, zero shot topic modelling returns the assigned topics as an np.ndarray instead of a list. Convert it back here.
402
  if isinstance(assigned_topics, np.ndarray):
403
  assigned_topics = assigned_topics.tolist()
404
 
 
 
405
  # Zero shot modelling is a model merge, which wipes the c_tf_idf part of the resulting model completely. To get hierarchical modelling to work, we need to recreate this part of the model with the CountVectorizer options used to create the initial model. Since zero shot merges two models built on exactly the same set of documents, the vocabulary should be the same, so recreating the c_tf_idf component in this way shouldn't be a problem. Discussion here, with the code below based on Maarten's suggestion: https://github.com/MaartenGr/BERTopic/issues/1700
406
 
407
  # Get document info
 
410
  documents_per_topic = doc_dets.groupby(['Topic'], as_index=False).agg({'Document': ' '.join})
411
 
412
  # Assign CountVectorizer to merged model
 
413
  topic_model.vectorizer_model = vectoriser_model
414
 
415
  # Re-calculate c-TF-IDF
416
  c_tf_idf, _ = topic_model._c_tf_idf(documents_per_topic)
417
  topic_model.c_tf_idf_ = c_tf_idf
418
 
 
 
 
419
  # Check we have topics
420
  if not assigned_topics:
421
  return "No topics found.", output_list, embeddings_out, embeddings_type_state, data_file_name_no_ext, topic_model, docs, vectoriser_model,[]
 
423
  print("Topic model created.")
424
 
425
  # Tidy up topic label format a bit to have commas and spaces by default
426
+ if not candidate_topics:
427
+ print("No zero shot topics found, so renaming topics with default labels")
428
+ new_topic_labels = topic_model.generate_topic_labels(nr_words=3, separator=", ")
429
+ topic_model.set_topic_labels(new_topic_labels)
430
+ if candidate_topics:
431
+ print("Custom labels:", topic_model.custom_labels_)
432
+ print("Topic labels:", topic_model.topic_labels_)
433
+ topic_model.set_topic_labels(topic_model.topic_labels_)
434
 
435
  # Replace current topic labels if new ones loaded in
436
  if not custom_labels_df.empty:
 
442
  print("Custom topics: ", topic_model.custom_labels_)
443
 
444
  # Outputs
445
+ output_list, output_text = save_topic_outputs(topic_model, data_file_name_no_ext, output_list, docs, save_topic_model, data, split_sentence_drop)
446
 
447
  # If you want to save your embedding files
448
  if return_intermediate_files == "Yes":
449
  print("Saving embeddings to file")
450
+ if high_quality_mode == "No":
451
+ embeddings_file_name = output_folder + data_file_name_no_ext + '_' + 'tfidf_embeddings.npz'
452
  else:
453
  if embeddings_super_compress == "No":
454
+ embeddings_file_name = output_folder + data_file_name_no_ext + '_' + 'large_embeddings.npz'
455
  else:
456
+ embeddings_file_name = output_folder + data_file_name_no_ext + '_' + 'large_embeddings_compress.npz'
457
 
458
  np.savez_compressed(embeddings_file_name, embeddings_out)
459
 
 
465
 
466
  return output_text, output_list, embeddings_out, embeddings_type_state, data_file_name_no_ext, topic_model, docs, vectoriser_model, assigned_topics
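The embeddings file name depends on two "Yes"/"No" options. Before this change, `low_resource_mode == "Yes"` selected the TF-IDF name, so after the rename the TF-IDF name should presumably correspond to `high_quality_mode == "No"`. A small sketch of that mapping, assuming a hypothetical helper `embeddings_file_name` (the commit builds the name inline rather than through a function):

```python
def embeddings_file_name(output_folder: str,
                         data_file_name_no_ext: str,
                         high_quality_mode: str,
                         embeddings_super_compress: str) -> str:
    """Build the .npz file name for saved embeddings (hypothetical helper)."""
    if high_quality_mode == "No":
        # Low resource path: TF-IDF based embeddings
        tail = "tfidf_embeddings.npz"
    elif embeddings_super_compress == "No":
        # Transformer embeddings at full precision
        tail = "large_embeddings.npz"
    else:
        # Transformer embeddings rounded to save space
        tail = "large_embeddings_compress.npz"
    return output_folder + data_file_name_no_ext + "_" + tail


tfidf_name = embeddings_file_name("output/", "survey_clean", "No", "No")
large_name = embeddings_file_name("output/", "survey_clean", "Yes", "No")
compressed_name = embeddings_file_name("output/", "survey_clean", "Yes", "Yes")
```

Keeping the three names distinct matters because a previously saved `.npz` file can be loaded back in to skip the embedding step, and a TF-IDF file mislabelled as a large-model file could be reused with the wrong settings.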
467
 
468
+ def reduce_outliers(topic_model: BERTopic, docs: List[str], embeddings_out: np.ndarray, data_file_name_no_ext: str, assigned_topics: Union[np.ndarray, List[int]], vectoriser_model: CountVectorizer, save_topic_model: str, split_sentence_drop: str, data: PandasDataFrame, progress: gr.Progress = gr.Progress(track_tqdm=True)) -> tuple:
469
+ """
470
+ Reduce outliers in the topic model and update the topic representation.
471
+
472
+ Args:
473
+ topic_model (BERTopic): The BERTopic topic model to be used.
474
+ docs (List[str]): List of documents.
475
+ embeddings_out (np.ndarray): Output embeddings.
476
+ data_file_name_no_ext (str): Data file name without extension.
477
+ assigned_topics (Union[np.ndarray, List[int]]): Assigned topics.
478
+ vectoriser_model (CountVectorizer): Vectorizer model.
479
+ save_topic_model (str): Whether to save the topic model.
480
+ split_sentence_drop (str): Dropdown result indicating whether sentences have been split.
481
+ data (PandasDataFrame): The input dataframe
482
+ progress (gr.Progress, optional): Progress tracker. Defaults to gr.Progress(track_tqdm=True).
483
+
484
+ Returns:
485
+ tuple: A tuple containing the output text, output list, and the updated topic model.
486
+ """
487
 
488
  progress(0, desc= "Preparing data")
489
 
 
491
 
492
  all_tic = time.perf_counter()
493
 
 
 
 
494
  if isinstance(assigned_topics, np.ndarray):
495
  assigned_topics = assigned_topics.tolist()
496
 
 
497
  # Reduce outliers if required, then update representation
498
  progress(0.2, desc= "Reducing outliers")
499
  print("Reducing outliers.")
 
511
 
512
  print("Finished reducing outliers.")
513
 
 
 
 
 
 
 
 
 
 
 
 
514
  # Outputs
515
  progress(0.9, desc= "Saving to file")
516
+ output_list, output_text = save_topic_outputs(topic_model, data_file_name_no_ext, output_list, docs, save_topic_model, data, split_sentence_drop)
517
 
518
  all_toc = time.perf_counter()
519
  time_out = f"All processes took {all_toc - all_tic:0.1f} seconds"
 
521
 
522
  return output_text, output_list, topic_model
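The `all_tic` / `all_toc` timing pattern recurs in several functions in this file. One possible way to factor it out is a context manager, sketched below; the `timed` helper and its `messages` argument are hypothetical and not part of the commit.

```python
import time
from contextlib import contextmanager


@contextmanager
def timed(label, messages):
    """Append '<label> took N seconds' to messages when the block exits."""
    tic = time.perf_counter()
    try:
        yield
    finally:
        toc = time.perf_counter()
        messages.append(f"{label} took {toc - tic:0.1f} seconds")


messages = []
with timed("All processes", messages):
    sum(range(1000))  # stand-in for the real work

print(messages[0])
```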
523
 
524
+ def represent_topics(topic_model: BERTopic, docs: List[str], data_file_name_no_ext: str, high_quality_mode: str, save_topic_model: str, representation_type: str, vectoriser_model: CountVectorizer, split_sentence_drop: str, data: PandasDataFrame, progress: gr.Progress = gr.Progress(track_tqdm=True)) -> tuple:
525
+ """
526
+ Represents topics using the specified representation model and updates the topic labels accordingly.
527
+
528
+ Args:
529
+ topic_model (BERTopic): The topic model to be used.
530
+ docs (List[str]): List of documents to be processed.
531
+ data_file_name_no_ext (str): The base name of the data file without extension.
532
+ high_quality_mode (str): Whether to use high quality (transformers based) embeddings.
533
+ save_topic_model (str): Whether to save the topic model.
534
+ representation_type (str): The type of representation model to be used.
535
+ vectoriser_model (CountVectorizer): The vectorizer model to be used.
536
+ split_sentence_drop (str): Dropdown result indicating whether sentences have been split.
537
+ data (PandasDataFrame): The input dataframe
538
+ progress (gr.Progress, optional): Progress tracker for the process. Defaults to gr.Progress(track_tqdm=True).
539
+
540
+ Returns:
541
+ tuple: A tuple containing the output text, output list, and the updated topic model.
542
+ """
543
 
544
  output_list = []
545
 
546
  all_tic = time.perf_counter()
547
 
548
+ # Load in representation model
549
+
550
+ progress(0.1, desc= "Loading model and creating new topic representation")
551
 
552
+ representation_model = create_representation_model(representation_type, llm_config, hf_model_name, hf_model_file, chosen_start_tag, high_quality_mode)
553
 
554
  progress(0.3, desc= "Updating existing topics")
555
  topic_model.update_topics(docs, vectorizer_model=vectoriser_model, representation_model=representation_model)
 
561
  llm_labels = [label[0].split("\n")[0] for label in topic_dets["LLM"]]
562
  topic_model.set_topic_labels(llm_labels)
563
 
564
+ label_list_file_name = output_folder + data_file_name_no_ext + '_llm_topic_list_' + today_rev + '.csv'
565
 
566
  llm_labels_df = pd.DataFrame(data={"Label":llm_labels})
567
  llm_labels_df.to_csv(label_list_file_name, index=None)
 
574
 
575
  # Outputs
576
  progress(0.8, desc= "Saving outputs")
577
+ output_list, output_text = save_topic_outputs(topic_model, data_file_name_no_ext, output_list, docs, save_topic_model, data, split_sentence_drop)
578
 
579
  all_toc = time.perf_counter()
580
  time_out = f"All processes took {all_toc - all_tic:0.1f} seconds"
 
582
 
583
  return output_text, output_list, topic_model
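The label clean-up in `represent_topics` keeps only the first line of the first entry in each "LLM" representation. The sketch below isolates that step with made-up sample data; `first_line_labels` is a hypothetical name, and the list-of-lists shape is an assumption based on the `label[0]` indexing in the diff.

```python
from typing import List, Sequence


def first_line_labels(llm_column: Sequence[Sequence[str]]) -> List[str]:
    """Keep the first line of the first entry of each LLM representation."""
    return [label[0].split("\n")[0] for label in llm_column]


# Made-up sample: each topic has a list of strings, the first holding the label.
sample = [["Housing costs\nExtra commentary", ""], ["Public transport", ""]]
labels = first_line_labels(sample)
print(labels)
```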
584
 
585
+ def visualise_topics(
586
+ topic_model: BERTopic,
587
+ data: pd.DataFrame,
588
+ data_file_name_no_ext: str,
589
+ high_quality_mode: str,
590
+ embeddings_out: np.ndarray,
591
+ in_label: List[str],
592
+ in_colnames: List[str],
593
+ legend_label: str,
594
+ sample_prop: float,
595
+ visualisation_type_radio: str,
596
+ random_seed: int = random_seed,
597
+ umap_n_neighbours: int = umap_n_neighbours,
598
+ umap_min_dist: float = umap_min_dist,
599
+ umap_metric: str = umap_metric,
600
+ progress: gr.Progress = gr.Progress(track_tqdm=True)
601
+ ) -> tuple:
602
+ """
603
+ Visualize topics using the provided topic model and data.
604
+
605
+ Args:
606
+ topic_model (BERTopic): The topic model to be used for visualization.
607
+ data (pd.DataFrame): The input data containing the documents.
608
+ data_file_name_no_ext (str): The base name of the data file without extension.
609
+ high_quality_mode (str): Whether to use high quality mode for embeddings.
610
+ embeddings_out (np.ndarray): The output embeddings.
611
+ in_label (List[str]): List of labels for the input data.
612
+ in_colnames (List[str]): List of column names in the input data.
613
+ legend_label (str): The label to be used in the legend.
614
+ sample_prop (float): The proportion of data to sample for visualization.
615
+ visualisation_type_radio (str): The type of visualization to be used.
616
+ random_seed (int, optional): Random seed for reproducibility. Defaults to random_seed.
617
+ umap_n_neighbours (int, optional): Number of neighbors for UMAP. Defaults to umap_n_neighbours.
618
+ umap_min_dist (float, optional): Minimum distance for UMAP. Defaults to umap_min_dist.
619
+ umap_metric (str, optional): Metric for UMAP. Defaults to umap_metric.
620
+ progress (gr.Progress, optional): Progress tracker for the process. Defaults to gr.Progress(track_tqdm=True).
621
+
622
+ Returns:
623
+ tuple: A tuple containing the output message, output list, and the two visualisation figures (topics_vis, topics_vis_2).
624
+ """
625
 
626
  progress(0, desc= "Preparing data for visualisation")
627
 
628
  output_list = []
629
+ output_message = []
630
  vis_tic = time.perf_counter()
631
 
632
 
 
662
  topic_model.set_topic_labels(labels)
663
 
664
  # Pre-reduce embeddings for visualisation purposes
665
+ if high_quality_mode == "Yes":
666
+ reduced_embeddings = UMAP(n_neighbors=umap_n_neighbours, n_components=2, min_dist=umap_min_dist, metric=umap_metric, random_state=random_seed).fit_transform(embeddings_out)
667
  else:
668
  reduced_embeddings = TruncatedSVD(2, random_state=random_seed).fit_transform(embeddings_out)
669
 
670
+ progress(0.3, desc= "Creating visualisations")
671
  # Visualise the topics:
672
 
673
+ print("Creating visualisations")
 
 
674
 
675
  if visualisation_type_radio == "Topic document graph":
676
+ try:
677
+ topics_vis = visualize_documents_custom(topic_model, docs, hover_labels = label_list, reduced_embeddings=reduced_embeddings, hide_annotations=True, hide_document_hover=False, custom_labels=True, sample = sample_prop, width= 1200, height = 750)
678
 
679
+ topics_vis_name = output_folder + data_file_name_no_ext + '_' + 'vis_topic_docs_' + today_rev + '.html'
680
+ topics_vis.write_html(topics_vis_name)
681
+ output_list.append(topics_vis_name)
682
+ except Exception as e:
683
+ print(e)
684
+ output_message = str(e)
685
+ return output_message, output_list, None, None
686
 
687
+ try:
688
+ topics_vis_2 = topic_model.visualize_heatmap(custom_labels=True, width= 1200, height = 1200)
689
 
690
+ topics_vis_2_name = output_folder + data_file_name_no_ext + '_' + 'vis_heatmap_' + today_rev + '.html'
691
+ topics_vis_2.write_html(topics_vis_2_name)
692
+ output_list.append(topics_vis_2_name)
693
+ except Exception as e:
694
+ print(e)
695
+ output_message.append(str(e))
696
 
697
  elif visualisation_type_radio == "Hierarchical view":
698
 
 
701
  # Print topic tree - may get encoding errors, so doing try except
702
  try:
703
  tree = topic_model.get_topic_tree(hierarchical_topics, tight_layout = True)
704
+ tree_name = output_folder + data_file_name_no_ext + '_' + 'vis_hierarchy_tree_' + today_rev + '.txt'
705
 
706
  with open(tree_name, "w") as file:
707
  # Write the string to the file
 
709
 
710
  output_list.append(tree_name)
711
 
712
+ except Exception as e:
713
+ new_out_message = "An exception occurred when making topic tree document, skipped:" + str(e)
714
+ output_message.append(str(new_out_message))
715
+ print(new_out_message)
716
 
717
 
718
  # Save new hierarchical topic model to file
719
+ try:
720
+ hierarchical_topics_name = output_folder + data_file_name_no_ext + '_' + 'vis_hierarchy_topics_dist_' + today_rev + '.csv'
721
+ hierarchical_topics.to_csv(hierarchical_topics_name, index = None)
722
+ output_list.append(hierarchical_topics_name)
723
+
724
+ topics_vis, hierarchy_df, hierarchy_topic_names = visualize_hierarchical_documents_custom(topic_model, docs, label_list, hierarchical_topics, hide_annotations=True, reduced_embeddings=reduced_embeddings, sample = sample_prop, hide_document_hover= False, custom_labels=True, width= 1200, height = 750)
725
+ topics_vis_2 = visualize_hierarchy_custom(topic_model, hierarchical_topics=hierarchical_topics, width= 1200, height = 750)
726
+ except Exception as e:
727
+ new_out_message = "An exception occurred when making hierarchical topic visualisation:" + str(e) + ". Maybe your model doesn't have enough topics to create a hierarchy?"
728
+ output_message.append(str(new_out_message))
729
+ print(new_out_message)
730
+ return new_out_message, output_list, None, None
731
 
732
  # Write hierarchical topics levels to df
733
+ hierarchy_df_name = output_folder + data_file_name_no_ext + '_' + 'hierarchy_topics_df_' + today_rev + '.csv'
734
  hierarchy_df.to_csv(hierarchy_df_name, index = None)
735
  output_list.append(hierarchy_df_name)
736
 
737
  # Write hierarchical topics names to df
738
+ hierarchy_topic_names_name = output_folder + data_file_name_no_ext + '_' + 'hierarchy_topics_names_' + today_rev + '.csv'
739
  hierarchy_topic_names.to_csv(hierarchy_topic_names_name, index = None)
740
  output_list.append(hierarchy_topic_names_name)
741
 
 
 
 
742
 
743
+ topics_vis_name = output_folder + data_file_name_no_ext + '_' + 'vis_hierarchy_topic_doc_' + today_rev + '.html'
744
  topics_vis.write_html(topics_vis_name)
745
  output_list.append(topics_vis_name)
746
 
747
+ topics_vis_2_name = output_folder + data_file_name_no_ext + '_' + 'vis_hierarchy_' + today_rev + '.html'
748
  topics_vis_2.write_html(topics_vis_2_name)
749
  output_list.append(topics_vis_2_name)
750
 
751
  all_toc = time.perf_counter()
752
+ output_message.append(f"Creating visualisation took {all_toc - vis_tic:0.1f} seconds")
753
+ print(output_message)
754
+
755
+ return '\n'.join(output_message), output_list, topics_vis, topics_vis_2
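`visualise_topics` collects error messages in a list and joins them with newlines in its final return, although one early-return branch assigns a plain string instead. A minimal sketch of the list-based pattern applied uniformly is shown below. The `run_steps` helper and the two stand-in step functions are hypothetical, not taken from the commit.

```python
from typing import Callable, List, Tuple


def run_steps(steps: List[Tuple[str, Callable[[], str]]]) -> Tuple[str, List[str]]:
    """Run each step, collecting output file names and any error messages."""
    output_list: List[str] = []
    messages: List[str] = []
    for name, step in steps:
        try:
            output_list.append(step())
        except Exception as e:
            # Record the failure and continue with the remaining steps.
            messages.append(f"{name} failed: {e}")
    return "\n".join(messages), output_list


def make_heatmap() -> str:
    return "vis_heatmap.html"


def make_tree() -> str:
    raise ValueError("not enough topics")


message, files = run_steps([("heatmap", make_heatmap), ("topic tree", make_tree)])
print(message)
print(files)
```

Keeping the message a list until the final join means every branch can call `append` without checking the variable's type.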
756
 
757
+ def save_as_pytorch_model(topic_model: BERTopic, data_file_name_no_ext:str, progress=gr.Progress(track_tqdm=True)):
758
+ """
759
+ Save the topic model in PyTorch format and zip the saved model folder.
760
 
761
+ Args:
762
+ topic_model (BERTopic): The BERTopic topic model to be used.
763
+ data_file_name_no_ext (str): Data file name without extension.
764
+ Returns:
765
+ tuple: A tuple containing the output text and output list.
766
+ """
767
+ output_list = []
768
+ output_message = ""
769
 
770
  if not topic_model:
771
+ output_message = "No topic model found."
772
+ return output_message, None
773
 
774
  progress(0, desc= "Saving topic model in Pytorch format")
775
 
776
+ topic_model_save_name_folder = output_folder + data_file_name_no_ext + "_topics_" + today_rev# + ".safetensors"
 
 
 
777
  topic_model_save_name_zip = topic_model_save_name_folder + ".zip"
778
 
779
  # Clear folder before replacing files
 
781
 
782
  topic_model.save(topic_model_save_name_folder, serialization='pytorch', save_embedding_model=True, save_ctfidf=False)
783
 
784
+ # Zip the saved model folder
 
785
  zip_folder(topic_model_save_name_folder, topic_model_save_name_zip)
786
  output_list.append(topic_model_save_name_zip)
787
 
788
+ output_message = "Model saved in Pytorch format."
789
+
790
+ return output_message, output_list
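The `zip_folder` helper called above is not shown in this part of the diff. A minimal sketch of what such a helper might look like, using only the standard library, follows. `zip_folder_sketch` is a hypothetical stand-in, and the real implementation may differ.

```python
import os
import tempfile
import zipfile


def zip_folder_sketch(folder_path: str, zip_path: str) -> str:
    """Write every file under folder_path into zip_path with relative names."""
    with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as zf:
        for root, _dirs, files in os.walk(folder_path):
            for name in files:
                full = os.path.join(root, name)
                # Store paths relative to the folder so the archive is portable.
                zf.write(full, os.path.relpath(full, folder_path))
    return zip_path


# Usage with a temporary folder standing in for the saved model directory.
with tempfile.TemporaryDirectory() as tmp:
    folder = os.path.join(tmp, "model_topics")
    os.makedirs(folder)
    with open(os.path.join(folder, "topics.json"), "w") as f:
        f.write("{}")
    zip_path = zip_folder_sketch(folder, os.path.join(tmp, "model_topics.zip"))
    with zipfile.ZipFile(zip_path) as zf:
        names = zf.namelist()

print(names)
```

Writing the archive next to the folder, rather than inside it, keeps the zip file from being swept into its own contents.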
requirements.txt CHANGED
@@ -1,8 +1,7 @@
1
- gradio
2
  transformers==4.41.2
3
  accelerate==0.26.1
4
  torch==2.3.1
5
- llama-cpp-python==0.2.79
6
  bertopic==0.16.2
7
  spacy==3.7.4
8
  en_core_web_sm @ https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.7.1/en_core_web_sm-3.7.1.tar.gz
@@ -13,4 +12,6 @@ presidio_analyzer==2.2.354
13
  presidio_anonymizer==2.2.354
14
  scipy==1.11.4
15
  polars==0.20.6
16
- numpy==1.26.4
 
 
 
1
+ gradio # Not specified version due to interaction with spacy - reinstall latest version after requirements.txt load
2
  transformers==4.41.2
3
  accelerate==0.26.1
4
  torch==2.3.1
 
5
  bertopic==0.16.2
6
  spacy==3.7.4
7
  en_core_web_sm @ https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.7.1/en_core_web_sm-3.7.1.tar.gz
 
12
  presidio_anonymizer==2.2.354
13
  scipy==1.11.4
14
  polars==0.20.6
15
+ sentence-transformers==3.0.1
16
+ llama-cpp-python==0.2.79 --extra-index-url https://abetlen.github.io/llama-cpp-python/whl/cpu
17
+ numpy==1.26.4
requirements_gpu.txt CHANGED
@@ -1,7 +1,6 @@
1
- gradio
2
  transformers==4.41.2
3
  accelerate==0.26.1
4
- torch==2.3.1
5
  bertopic==0.16.2
6
  spacy==3.7.4
7
  en_core_web_sm @ https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.7.1/en_core_web_sm-3.7.1.tar.gz
@@ -15,3 +14,4 @@ polars==0.20.6
15
  torch --index-url https://download.pytorch.org/whl/cu121
16
  llama-cpp-python==0.2.77 --extra-index-url https://abetlen.github.io/llama-cpp-python/whl/cu121
17
  numpy==1.26.4
 
 
1
+ gradio # Not specified version due to interaction with spacy - reinstall latest version after requirements.txt load
2
  transformers==4.41.2
3
  accelerate==0.26.1
 
4
  bertopic==0.16.2
5
  spacy==3.7.4
6
  en_core_web_sm @ https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.7.1/en_core_web_sm-3.7.1.tar.gz
 
14
  torch --index-url https://download.pytorch.org/whl/cu121
15
  llama-cpp-python==0.2.77 --extra-index-url https://abetlen.github.io/llama-cpp-python/whl/cu121
16
  numpy==1.26.4
17
+ sentence-transformers==3.0.1