joshEm committed on
Commit
61de65f
1 Parent(s): 3f45121

Add BERTopic model

README.md ADDED
@@ -0,0 +1,201 @@
+
+ ---
+ tags:
+ - bertopic
+ library_name: bertopic
+ pipeline_tag: text-classification
+ ---
+
+ # potloc-topic-model
+
+ This is a [BERTopic](https://github.com/MaartenGr/BERTopic) model.
+ BERTopic is a flexible and modular topic modeling framework that allows for the generation of easily interpretable topics from large datasets.
+
+ ## Usage
+
+ To use this model, please install BERTopic:
+
+ ```
+ pip install -U bertopic
+ ```
+
+ You can use the model as follows:
+
+ ```python
+ from bertopic import BERTopic
+ topic_model = BERTopic.load("joshEm/potloc-topic-model")
+
+ topic_model.get_topic_info()
+ ```
+
+ ## Topic overview
+
+ * Number of topics: 132
+ * Number of training documents: 10000
+
+ <details>
+ <summary>Click here for an overview of all topics.</summary>
+
+ | Topic ID | Topic Keywords | Topic Frequency | Label |
+ |----------|----------------|-----------------|-------|
+ | -1 | the - in - and - of - to | 10 | -1_the_in_and_of |
+ | 0 | computer - yahoo - windows - click - can | 3344 | 0_computer_yahoo_windows_click |
+ | 1 | god - jesus - bible - of - believe | 958 | 1_god_jesus_bible_of |
+ | 2 | bush - president - war - iraq - us | 487 | 2_bush_president_war_iraq |
+ | 3 | court - job - to - for - jail | 387 | 3_court_job_to_for |
+ | 4 | equation - solve - number - answer - numbers | 208 | 4_equation_solve_number_answer |
+ | 5 | foreheadnheadon - directly - apply - skin - ear | 162 | 5_foreheadnheadon_directly_apply_skin |
+ | 6 | song - lyrics - me - oh - music | 130 | 6_song_lyrics_me_oh |
+ | 7 | degree - school - university - college - courses | 123 | 7_degree_school_university_college |
+ | 8 | credit - account - mortgage - bank - loan | 122 | 8_credit_account_mortgage_bank |
+ | 9 | music - song - songs - rock - favorite | 111 | 9_music_song_songs_rock |
+ | 10 | friends - meet - friend - looking - woman | 109 | 10_friends_meet_friend_looking |
+ | 11 | sex - orgasm - sexual - ejaculation - men | 95 | 11_sex_orgasm_sexual_ejaculation |
+ | 12 | where - ebay - buy - shirt - find | 94 | 12_where_ebay_buy_shirt |
+ | 13 | usa - cup - world - win - team | 90 | 13_usa_cup_world_win |
+ | 14 | illegal - immigrants - mexico - illegals - immigration | 81 | 14_illegal_immigrants_mexico_illegals |
+ | 15 | ball - sport - bowling - play - tennis | 80 | 15_ball_sport_bowling_play |
+ | 16 | plants - plant - whales - species - seals | 78 | 16_plants_plant_whales_species |
+ | 17 | man - joke - said - he - she | 77 | 17_man_joke_said_he |
+ | 18 | him - he - me - guy - but | 72 | 18_him_he_me_guy |
+ | 19 | period - doctor - periods - pregnant - pregnancy | 65 | 19_period_doctor_periods_pregnant |
+ | 20 | study - math - test - sat - practice | 64 | 20_study_math_test_sat |
+ | 21 | water - solution - moles - reaction - grams | 64 | 21_water_solution_moles_reaction |
+ | 22 | flag - 1918 - of - was - the | 63 | 22_flag_1918_of_was |
+ | 23 | means - french - mi - word - mean | 63 | 23_means_french_mi_word |
+ | 24 | baseball - player - he - sox - pitcher | 62 | 24_baseball_player_he_sox |
+ | 25 | girls - guys - men - girl - women | 60 | 25_girls_guys_men_girl |
+ | 26 | questions - points - question - bored - answers | 60 | 26_questions_points_question_bored |
+ | 27 | sleep - feel - hours - depression - you | 57 | 27_sleep_feel_hours_depression |
+ | 28 | dna - genetic - blood - gene - cells | 54 | 28_dna_genetic_blood_gene |
+ | 29 | english - language - french - learn - spanish | 54 | 29_english_language_french_learn |
+ | 30 | search - find - name - looking - address | 53 | 30_search_find_name_looking |
+ | 31 | word - winter - words - letters - letter | 53 | 31_word_winter_words_letters |
+ | 32 | weight - calories - fat - diet - eat | 52 | 32_weight_calories_fat_diet |
+ | 33 | bowl - game - qb - team - usc | 47 | 33_bowl_game_qb_team |
+ | 34 | moon - time - horizon - day - sun | 47 | 34_moon_time_horizon_day |
+ | 35 | wwe - guerrero - tna - diva - cena | 45 | 35_wwe_guerrero_tna_diva |
+ | 36 | her - mom - gift - she - ideas | 44 | 36_her_mom_gift_she |
+ | 37 | him - he - sister - my - his | 44 | 37_him_he_sister_my |
+ | 38 | book - books - read - harlem - beard | 43 | 38_book_books_read_harlem |
+ | 39 | color - blue - sky - light - colors | 43 | 39_color_blue_sky_light |
+ | 40 | tax - taxes - unemployment - state - income | 42 | 40_tax_taxes_unemployment_state |
+ | 41 | alamo - movie - movies - trilogy - aka | 42 | 41_alamo_movie_movies_trilogy |
+ | 42 | show - watch - anime - episodes - tv | 42 | 42_show_watch_anime_episodes |
+ | 43 | insurance - health - disability - help - for | 42 | 43_insurance_health_disability_help |
+ | 44 | girl - her - she - likes - ask | 42 | 44_girl_her_she_likes |
+ | 45 | navy - military - army - marine - marines | 41 | 45_navy_military_army_marine |
+ | 46 | white - black - racist - blacks - racism | 40 | 46_white_black_racist_blacks |
+ | 47 | cheat - spouse - wife - her - she | 40 | 47_cheat_spouse_wife_her |
+ | 48 | he - him - likes - guy - me | 39 | 48_he_him_likes_guy |
+ | 49 | visa - passport - birth - us - citizen | 38 | 49_visa_passport_birth_us |
+ | 50 | marijuana - drug - weed - opium - test | 37 | 50_marijuana_drug_weed_opium |
+ | 51 | velocity - force - angle - cm - triangle | 37 | 51_velocity_force_angle_cm |
+ | 52 | cup - championship - world - player - euro | 36 | 52_cup_championship_world_player |
+ | 53 | nascar - racing - fight - gordon - sport | 33 | 53_nascar_racing_fight_gordon |
+ | 54 | celebrities - tom - celebrity - her - jolie | 32 | 54_celebrities_tom_celebrity_her |
+ | 55 | cricket - india - batsman - dravid - indian | 31 | 55_cricket_india_batsman_dravid |
+ | 56 | weight - eat - skinny - fat - healthy | 30 | 56_weight_eat_skinny_fat |
+ | 57 | eye - lenses - astigmatism - eyes - glasses | 30 | 57_eye_lenses_astigmatism_eyes |
+ | 58 | people - yourself - person - others - confidence | 29 | 58_people_yourself_person_others |
+ | 59 | stock - fund - shares - mutual - market | 29 | 59_stock_fund_shares_mutual |
+ | 60 | arsenal - liverpool - league - fans - celtic | 29 | 60_arsenal_liverpool_league_fans |
+ | 61 | warming - global - climate - ice - snow | 29 | 61_warming_global_climate_ice |
+ | 62 | her - she - friend - friends - me | 28 | 62_her_she_friend_friends |
+ | 63 | wave - frequency - electromagnetic - radar - antenna | 28 | 63_wave_frequency_electromagnetic_radar |
+ | 64 | dream - dreams - my - elevator - was | 27 | 64_dream_dreams_my_elevator |
+ | 65 | gauge - bullet - caliber - gun - barrel | 27 | 65_gauge_bullet_caliber_gun |
+ | 66 | pain - knee - elbow - tennis - shoulder | 27 | 66_pain_knee_elbow_tennis |
+ | 67 | beach - trail - resort - appalachian - shaw | 26 | 67_beach_trail_resort_appalachian |
+ | 68 | scam - home - quixtar - money - survey | 25 | 68_scam_home_quixtar_money |
+ | 69 | business - sell - idea - start - money | 25 | 69_business_sell_idea_start |
+ | 70 | tv - watch - channels - espn - cup | 25 | 70_tv_watch_channels_espn |
+ | 71 | abs - muscles - exercises - reps - muscle | 25 | 71_abs_muscles_exercises_reps |
+ | 72 | hair - shave - cut - pubic - trim | 25 | 72_hair_shave_cut_pubic |
+ | 73 | psychic - divination - astrology - cards - tarot | 25 | 73_psychic_divination_astrology_cards |
+ | 74 | number - phone - code - address - area | 24 | 74_number_phone_code_address |
+ | 75 | was - my - hit - ever - freezer | 24 | 75_was_my_hit_ever |
+ | 76 | trailers - trailer - dvd - media - wmp | 24 | 76_trailers_trailer_dvd_media |
+ | 77 | penis - condom - size - sex - inches | 24 | 77_penis_condom_size_sex |
+ | 78 | love - person - beloved - live - we | 24 | 78_love_person_beloved_live |
+ | 79 | war - world - countries - soviet - were | 24 | 79_war_world_countries_soviet |
+ | 80 | de - le - la - et - les | 24 | 80_de_le_la_et |
+ | 81 | job - jobs - guard - where - work | 24 | 81_job_jobs_guard_where |
+ | 82 | christmas - thanksgiving - holidays - celebrate - tree | 24 | 82_christmas_thanksgiving_holidays_celebrate |
+ | 83 | hepatitis - pneumonia - infections - vaccination - link | 23 | 83_hepatitis_pneumonia_infections_vaccination |
+ | 84 | peanuts - ibs - may - heartburn - bowel | 22 | 84_peanuts_ibs_may_heartburn |
+ | 85 | name - sarah - named - pronounced - my | 21 | 85_name_sarah_named_pronounced |
+ | 86 | kids - he - husband - him - cheated | 20 | 86_kids_he_husband_him |
+ | 87 | gas - oil - energy - kingdom - 2006 | 19 | 87_gas_oil_energy_kingdom |
+ | 88 | taller - tall - height - grow - short | 19 | 88_taller_tall_height_grow |
+ | 89 | estate - property - heirs - lien - damages | 19 | 89_estate_property_heirs_lien |
+ | 90 | aluminum - 68 - element - 212 - metal | 18 | 90_aluminum_68_element_212 |
+ | 91 | organization - management - organizational - behavior - leadership | 18 | 91_organization_management_organizational_behavior |
+ | 92 | melatonin - medication - effects - dosage - zofran | 18 | 92_melatonin_medication_effects_dosage |
+ | 93 | smoking - quit - smoke - smoked - session | 17 | 93_smoking_quit_smoke_smoked |
+ | 94 | nba - referees - game - heat - win | 17 | 94_nba_referees_game_heat |
+ | 95 | superman - doom - hero - vs - super | 17 | 95_superman_doom_hero_vs |
+ | 96 | skateboarding - skateboard - snowboard - snowboarding - gymnastics | 17 | 96_skateboarding_skateboard_snowboard_snowboarding |
+ | 97 | electron - quarks - neutrons - antimatter - particle | 16 | 97_electron_quarks_neutrons_antimatter |
+ | 98 | happy - happiness - life - rushhour - secret | 16 | 98_happy_happiness_life_rushhour |
+ | 99 | fart - farting - gas - embarrassing - flatus | 16 | 99_fart_farting_gas_embarrassing |
+ | 100 | teeth - tooth - dentist - gums - braces | 16 | 100_teeth_tooth_dentist_gums |
+ | 101 | scorpio - zodiac - libra - signs - cancers | 16 | 101_scorpio_zodiac_libra_signs |
+ | 102 | dog - wolf - sheep - horse - animal | 16 | 102_dog_wolf_sheep_horse |
+ | 103 | hiv - aids - virus - infected - blood | 16 | 103_hiv_aids_virus_infected |
+ | 104 | thanked - poem - poetry - she - were | 15 | 104_thanked_poem_poetry_she |
+ | 105 | force - motion - mass - momentum - rocket | 15 | 105_force_motion_mass_momentum |
+ | 106 | minister - president - kagame - prime - natchaba | 15 | 106_minister_president_kagame_prime |
+ | 107 | kiss - kissing - lips - gently - tongue | 15 | 107_kiss_kissing_lips_gently |
+ | 108 | pictures - photos - google - site - find | 15 | 108_pictures_photos_google_site |
+ | 109 | dating - age - old - young - 19 | 14 | 109_dating_age_old_young |
+ | 110 | ebay - sell - smc - selling - products | 14 | 110_ebay_sell_smc_selling |
+ | 111 | planets - sun - stars - star - earth | 14 | 111_planets_sun_stars_star |
+ | 112 | imports - trade - importing - oil - mobil | 14 | 112_imports_trade_importing_oil |
+ | 113 | questions - question - politics - reported - answers | 14 | 113_questions_question_politics_reported |
+ | 114 | gay - crackle - snap - girlfriend - marry | 14 | 114_gay_crackle_snap_girlfriend |
+ | 115 | gravity - sun - earth - rotating - force | 13 | 115_gravity_sun_earth_rotating |
+ | 116 | weaknesses - abt - strengths - interview - job | 13 | 116_weaknesses_abt_strengths_interview |
+ | 117 | love - eachother - fall - forget - deeply | 13 | 117_love_eachother_fall_forget |
+ | 118 | animals - pets - tv - cage - communicate | 13 | 118_animals_pets_tv_cage |
+ | 119 | flax - yogurt - nonfat - health - healthy | 13 | 119_flax_yogurt_nonfat_health |
+ | 120 | grants - grant - business - federal - entrepreneurs | 13 | 120_grants_grant_business_federal |
+ | 121 | idol - american - chris - win - favorite | 12 | 121_idol_american_chris_win |
+ | 122 | clubs - golf - hit - irons - iron | 12 | 122_clubs_golf_hit_irons |
+ | 123 | lottery - scam - scammer - money - international | 12 | 123_lottery_scam_scammer_money |
+ | 124 | address - email - presale - bart - michaels | 12 | 124_address_email_presale_bart |
+ | 125 | jones - her - she - stargate - reynolds | 11 | 125_jones_her_she_stargate |
+ | 126 | data - product - analysis - regression - marketing | 11 | 126_data_product_analysis_regression |
+ | 127 | cancer - cure - tumor - parasite - cancers | 11 | 127_cancer_cure_tumor_parasite |
+ | 128 | nba - paul - team - redick - kobe | 11 | 128_nba_paul_team_redick |
+ | 129 | autism - autistic - homeschooling - child - she | 10 | 129_autism_autistic_homeschooling_child |
+ | 130 | seller - nike - soccer - jersey - dynamo | 10 | 130_seller_nike_soccer_jersey |
+
+ </details>
+
+ ## Training hyperparameters
+
+ * calculate_probabilities: False
+ * language: english
+ * low_memory: False
+ * min_topic_size: 10
+ * n_gram_range: (1, 1)
+ * nr_topics: None
+ * seed_topic_list: None
+ * top_n_words: 10
+ * verbose: False
+ * zeroshot_min_similarity: 0.7
+ * zeroshot_topic_list: None
+
+ ## Framework versions
+
+ * Numpy: 1.23.5
+ * HDBSCAN: 0.8.33
+ * UMAP: 0.5.5
+ * Pandas: 1.5.3
+ * Scikit-Learn: 1.2.2
+ * Sentence-transformers: 2.2.2
+ * Transformers: 4.35.2
+ * Numba: 0.58.1
+ * Plotly: 5.15.0
+ * Python: 3.10.12
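Every row of the topic overview table in the README appears to build its `Label` the same way: the topic ID, then the first four keywords, joined by underscores. The sketch below reproduces that pattern and computes each topic's share of the 10000 training documents, using three rows copied from the table. `make_label` and `share` are hypothetical helper names for illustration, not part of the BERTopic API.

```python
# Three rows copied from the topic overview table:
# (topic ID, keywords as shown in the table, topic frequency).
ROWS = [
    (-1, "the - in - and - of - to", 10),
    (0, "computer - yahoo - windows - click - can", 3344),
    (1, "god - jesus - bible - of - believe", 958),
]

# "Number of training documents" as stated in the README.
N_TRAINING_DOCS = 10000


def make_label(topic_id: int, keywords: str, n_words: int = 4) -> str:
    """Join the topic ID and the first n_words keywords with underscores."""
    words = [word.strip() for word in keywords.split(" - ")]
    return "_".join([str(topic_id)] + words[:n_words])


def share(frequency: int, total: int = N_TRAINING_DOCS) -> float:
    """Fraction of the training documents assigned to a topic."""
    return frequency / total


if __name__ == "__main__":
    for topic_id, keywords, frequency in ROWS:
        label = make_label(topic_id, keywords)
        print(f"{label}: {share(frequency):.2%} of documents")
```

With the table's numbers, topic 0 alone accounts for 3344 of the 10000 documents, which is worth keeping in mind when reading the smaller topics.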
config.json ADDED
@@ -0,0 +1,17 @@
+ {
+   "calculate_probabilities": false,
+   "language": "english",
+   "low_memory": false,
+   "min_topic_size": 10,
+   "n_gram_range": [
+     1,
+     1
+   ],
+   "nr_topics": null,
+   "seed_topic_list": null,
+   "top_n_words": 10,
+   "verbose": false,
+   "zeroshot_min_similarity": 0.7,
+   "zeroshot_topic_list": null,
+   "embedding_model": "sentence-transformers/all-MiniLM-L6-v2"
+ }
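The keys in `config.json` match the training hyperparameters listed in the README, with two visible differences: `n_gram_range` is stored as a JSON list rather than a tuple, and there is an extra `embedding_model` entry. The following is a minimal sketch, using only the standard library, of reading such a config back into Python values. `load_config` is a hypothetical helper. Whether the resulting dictionary can be passed directly to a `BERTopic` constructor depends on the installed version, so that step is deliberately left out.

```python
import json

# Contents of config.json as added in this commit.
CONFIG_TEXT = """
{
  "calculate_probabilities": false,
  "language": "english",
  "low_memory": false,
  "min_topic_size": 10,
  "n_gram_range": [1, 1],
  "nr_topics": null,
  "seed_topic_list": null,
  "top_n_words": 10,
  "verbose": false,
  "zeroshot_min_similarity": 0.7,
  "zeroshot_topic_list": null,
  "embedding_model": "sentence-transformers/all-MiniLM-L6-v2"
}
"""


def load_config(text: str):
    """Return (hyperparameters, embedding model name) from a config string."""
    config = json.loads(text)
    # JSON has no tuple type, so restore the (min_n, max_n) tuple form
    # that the README shows for n_gram_range.
    config["n_gram_range"] = tuple(config["n_gram_range"])
    # Separate the embedding model name from the remaining hyperparameters.
    embedding_model = config.pop("embedding_model", None)
    return config, embedding_model


if __name__ == "__main__":
    params, embedding_model = load_config(CONFIG_TEXT)
    print(params)
    print(embedding_model)
```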
ctfidf.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:c605e6ed1e25a6350470804f2fcddbdfda9a732896a0ae7ff0dc5ad8de04a08c
+ size 2648404
ctfidf_config.json ADDED
The diff for this file is too large to render.
topic_embeddings.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:57922ca0e0dd31a10cdaa216b46e428d9a4a163a913f4f2922970f5b48816ab5
+ size 202840
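The two `.safetensors` entries in this commit are Git LFS pointer files, not the tensors themselves. Each holds three `key value` lines: the spec version, a SHA-256 object ID and the size in bytes. The sketch below parses such a pointer so the recorded size and digest can be compared with a downloaded file. `parse_lfs_pointer` is a hypothetical helper, and the sketch assumes the pointer contains only the three keys shown in the diff.

```python
# Pointer text for topic_embeddings.safetensors, copied from the diff.
POINTER_TEXT = """version https://git-lfs.github.com/spec/v1
oid sha256:57922ca0e0dd31a10cdaa216b46e428d9a4a163a913f4f2922970f5b48816ab5
size 202840
"""


def parse_lfs_pointer(text: str) -> dict:
    """Split a Git LFS pointer into version, hash algorithm, digest and size."""
    # Each non-empty line is "<key> <value>", separated by the first space.
    fields = dict(
        line.split(" ", 1) for line in text.splitlines() if line.strip()
    )
    # The oid value has the form "<algorithm>:<hex digest>".
    hash_algo, _, digest = fields["oid"].partition(":")
    return {
        "version": fields["version"],
        "hash_algo": hash_algo,
        "digest": digest,
        "size": int(fields["size"]),
    }


if __name__ == "__main__":
    pointer = parse_lfs_pointer(POINTER_TEXT)
    print(pointer["hash_algo"], pointer["size"])
```

One way to use this is to compare `pointer["size"]` with `os.path.getsize` and `pointer["digest"]` with `hashlib.sha256` of the downloaded file.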
topics.json ADDED
The diff for this file is too large to render.