Add BERTopic model
- README.md +201 -0
- config.json +17 -0
- ctfidf.safetensors +3 -0
- ctfidf_config.json +0 -0
- topic_embeddings.safetensors +3 -0
- topics.json +0 -0
README.md
ADDED
@@ -0,0 +1,201 @@
---
tags:
- bertopic
library_name: bertopic
pipeline_tag: text-classification
---

# potloc-topic-model

This is a [BERTopic](https://github.com/MaartenGr/BERTopic) model.
BERTopic is a flexible and modular topic modeling framework that allows for the generation of easily interpretable topics from large datasets.

## Usage

To use this model, please install BERTopic:

```
pip install -U bertopic
```

You can use the model as follows:

```python
from bertopic import BERTopic
topic_model = BERTopic.load("joshEm/potloc-topic-model")

topic_model.get_topic_info()
```

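Once the model is loaded, new documents can be assigned to the existing topics. The sketch below is an illustration, not part of the original model card: `assign_topics` is a hypothetical helper, and it assumes the loaded model exposes BERTopic's `transform` method and `topic_labels_` mapping with integer topic ids as keys.

```python
from typing import Any, List, Tuple


def assign_topics(topic_model: Any, docs: List[str]) -> List[Tuple[str, int, str]]:
    """Return (document, topic id, topic label) for each input document.

    Hypothetical helper: `topic_model` is expected to behave like a loaded
    BERTopic model (a `transform` method and a `topic_labels_` dict).
    """
    # transform returns (topics, probabilities); only the topic ids are used here.
    topics, _ = topic_model.transform(docs)
    labels = topic_model.topic_labels_
    # Fall back to "unknown" if a predicted id has no stored label.
    return [(doc, int(t), labels.get(int(t), "unknown")) for doc, t in zip(docs, topics)]
```

With the model from the snippet above, a call such as `assign_topics(topic_model, ["How do I reset my Windows password?"])` would return one tuple per document; topic `-1` is the outlier topic.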
## Topic overview

* Number of topics: 132
* Number of training documents: 10000

<details>
  <summary>Click here for an overview of all topics.</summary>

| Topic ID | Topic Keywords | Topic Frequency | Label |
|----------|----------------|-----------------|-------|
| -1 | the - in - and - of - to | 10 | -1_the_in_and_of |
| 0 | computer - yahoo - windows - click - can | 3344 | 0_computer_yahoo_windows_click |
| 1 | god - jesus - bible - of - believe | 958 | 1_god_jesus_bible_of |
| 2 | bush - president - war - iraq - us | 487 | 2_bush_president_war_iraq |
| 3 | court - job - to - for - jail | 387 | 3_court_job_to_for |
| 4 | equation - solve - number - answer - numbers | 208 | 4_equation_solve_number_answer |
| 5 | foreheadnheadon - directly - apply - skin - ear | 162 | 5_foreheadnheadon_directly_apply_skin |
| 6 | song - lyrics - me - oh - music | 130 | 6_song_lyrics_me_oh |
| 7 | degree - school - university - college - courses | 123 | 7_degree_school_university_college |
| 8 | credit - account - mortgage - bank - loan | 122 | 8_credit_account_mortgage_bank |
| 9 | music - song - songs - rock - favorite | 111 | 9_music_song_songs_rock |
| 10 | friends - meet - friend - looking - woman | 109 | 10_friends_meet_friend_looking |
| 11 | sex - orgasm - sexual - ejaculation - men | 95 | 11_sex_orgasm_sexual_ejaculation |
| 12 | where - ebay - buy - shirt - find | 94 | 12_where_ebay_buy_shirt |
| 13 | usa - cup - world - win - team | 90 | 13_usa_cup_world_win |
| 14 | illegal - immigrants - mexico - illegals - immigration | 81 | 14_illegal_immigrants_mexico_illegals |
| 15 | ball - sport - bowling - play - tennis | 80 | 15_ball_sport_bowling_play |
| 16 | plants - plant - whales - species - seals | 78 | 16_plants_plant_whales_species |
| 17 | man - joke - said - he - she | 77 | 17_man_joke_said_he |
| 18 | him - he - me - guy - but | 72 | 18_him_he_me_guy |
| 19 | period - doctor - periods - pregnant - pregnancy | 65 | 19_period_doctor_periods_pregnant |
| 20 | study - math - test - sat - practice | 64 | 20_study_math_test_sat |
| 21 | water - solution - moles - reaction - grams | 64 | 21_water_solution_moles_reaction |
| 22 | flag - 1918 - of - was - the | 63 | 22_flag_1918_of_was |
| 23 | means - french - mi - word - mean | 63 | 23_means_french_mi_word |
| 24 | baseball - player - he - sox - pitcher | 62 | 24_baseball_player_he_sox |
| 25 | girls - guys - men - girl - women | 60 | 25_girls_guys_men_girl |
| 26 | questions - points - question - bored - answers | 60 | 26_questions_points_question_bored |
| 27 | sleep - feel - hours - depression - you | 57 | 27_sleep_feel_hours_depression |
| 28 | dna - genetic - blood - gene - cells | 54 | 28_dna_genetic_blood_gene |
| 29 | english - language - french - learn - spanish | 54 | 29_english_language_french_learn |
| 30 | search - find - name - looking - address | 53 | 30_search_find_name_looking |
| 31 | word - winter - words - letters - letter | 53 | 31_word_winter_words_letters |
| 32 | weight - calories - fat - diet - eat | 52 | 32_weight_calories_fat_diet |
| 33 | bowl - game - qb - team - usc | 47 | 33_bowl_game_qb_team |
| 34 | moon - time - horizon - day - sun | 47 | 34_moon_time_horizon_day |
| 35 | wwe - guerrero - tna - diva - cena | 45 | 35_wwe_guerrero_tna_diva |
| 36 | her - mom - gift - she - ideas | 44 | 36_her_mom_gift_she |
| 37 | him - he - sister - my - his | 44 | 37_him_he_sister_my |
| 38 | book - books - read - harlem - beard | 43 | 38_book_books_read_harlem |
| 39 | color - blue - sky - light - colors | 43 | 39_color_blue_sky_light |
| 40 | tax - taxes - unemployment - state - income | 42 | 40_tax_taxes_unemployment_state |
| 41 | alamo - movie - movies - trilogy - aka | 42 | 41_alamo_movie_movies_trilogy |
| 42 | show - watch - anime - episodes - tv | 42 | 42_show_watch_anime_episodes |
| 43 | insurance - health - disability - help - for | 42 | 43_insurance_health_disability_help |
| 44 | girl - her - she - likes - ask | 42 | 44_girl_her_she_likes |
| 45 | navy - military - army - marine - marines | 41 | 45_navy_military_army_marine |
| 46 | white - black - racist - blacks - racism | 40 | 46_white_black_racist_blacks |
| 47 | cheat - spouse - wife - her - she | 40 | 47_cheat_spouse_wife_her |
| 48 | he - him - likes - guy - me | 39 | 48_he_him_likes_guy |
| 49 | visa - passport - birth - us - citizen | 38 | 49_visa_passport_birth_us |
| 50 | marijuana - drug - weed - opium - test | 37 | 50_marijuana_drug_weed_opium |
| 51 | velocity - force - angle - cm - triangle | 37 | 51_velocity_force_angle_cm |
| 52 | cup - championship - world - player - euro | 36 | 52_cup_championship_world_player |
| 53 | nascar - racing - fight - gordon - sport | 33 | 53_nascar_racing_fight_gordon |
| 54 | celebrities - tom - celebrity - her - jolie | 32 | 54_celebrities_tom_celebrity_her |
| 55 | cricket - india - batsman - dravid - indian | 31 | 55_cricket_india_batsman_dravid |
| 56 | weight - eat - skinny - fat - healthy | 30 | 56_weight_eat_skinny_fat |
| 57 | eye - lenses - astigmatism - eyes - glasses | 30 | 57_eye_lenses_astigmatism_eyes |
| 58 | people - yourself - person - others - confidence | 29 | 58_people_yourself_person_others |
| 59 | stock - fund - shares - mutual - market | 29 | 59_stock_fund_shares_mutual |
| 60 | arsenal - liverpool - league - fans - celtic | 29 | 60_arsenal_liverpool_league_fans |
| 61 | warming - global - climate - ice - snow | 29 | 61_warming_global_climate_ice |
| 62 | her - she - friend - friends - me | 28 | 62_her_she_friend_friends |
| 63 | wave - frequency - electromagnetic - radar - antenna | 28 | 63_wave_frequency_electromagnetic_radar |
| 64 | dream - dreams - my - elevator - was | 27 | 64_dream_dreams_my_elevator |
| 65 | gauge - bullet - caliber - gun - barrel | 27 | 65_gauge_bullet_caliber_gun |
| 66 | pain - knee - elbow - tennis - shoulder | 27 | 66_pain_knee_elbow_tennis |
| 67 | beach - trail - resort - appalachian - shaw | 26 | 67_beach_trail_resort_appalachian |
| 68 | scam - home - quixtar - money - survey | 25 | 68_scam_home_quixtar_money |
| 69 | business - sell - idea - start - money | 25 | 69_business_sell_idea_start |
| 70 | tv - watch - channels - espn - cup | 25 | 70_tv_watch_channels_espn |
| 71 | abs - muscles - exercises - reps - muscle | 25 | 71_abs_muscles_exercises_reps |
| 72 | hair - shave - cut - pubic - trim | 25 | 72_hair_shave_cut_pubic |
| 73 | psychic - divination - astrology - cards - tarot | 25 | 73_psychic_divination_astrology_cards |
| 74 | number - phone - code - address - area | 24 | 74_number_phone_code_address |
| 75 | was - my - hit - ever - freezer | 24 | 75_was_my_hit_ever |
| 76 | trailers - trailer - dvd - media - wmp | 24 | 76_trailers_trailer_dvd_media |
| 77 | penis - condom - size - sex - inches | 24 | 77_penis_condom_size_sex |
| 78 | love - person - beloved - live - we | 24 | 78_love_person_beloved_live |
| 79 | war - world - countries - soviet - were | 24 | 79_war_world_countries_soviet |
| 80 | de - le - la - et - les | 24 | 80_de_le_la_et |
| 81 | job - jobs - guard - where - work | 24 | 81_job_jobs_guard_where |
| 82 | christmas - thanksgiving - holidays - celebrate - tree | 24 | 82_christmas_thanksgiving_holidays_celebrate |
| 83 | hepatitis - pneumonia - infections - vaccination - link | 23 | 83_hepatitis_pneumonia_infections_vaccination |
| 84 | peanuts - ibs - may - heartburn - bowel | 22 | 84_peanuts_ibs_may_heartburn |
| 85 | name - sarah - named - pronounced - my | 21 | 85_name_sarah_named_pronounced |
| 86 | kids - he - husband - him - cheated | 20 | 86_kids_he_husband_him |
| 87 | gas - oil - energy - kingdom - 2006 | 19 | 87_gas_oil_energy_kingdom |
| 88 | taller - tall - height - grow - short | 19 | 88_taller_tall_height_grow |
| 89 | estate - property - heirs - lien - damages | 19 | 89_estate_property_heirs_lien |
| 90 | aluminum - 68 - element - 212 - metal | 18 | 90_aluminum_68_element_212 |
| 91 | organization - management - organizational - behavior - leadership | 18 | 91_organization_management_organizational_behavior |
| 92 | melatonin - medication - effects - dosage - zofran | 18 | 92_melatonin_medication_effects_dosage |
| 93 | smoking - quit - smoke - smoked - session | 17 | 93_smoking_quit_smoke_smoked |
| 94 | nba - referees - game - heat - win | 17 | 94_nba_referees_game_heat |
| 95 | superman - doom - hero - vs - super | 17 | 95_superman_doom_hero_vs |
| 96 | skateboarding - skateboard - snowboard - snowboarding - gymnastics | 17 | 96_skateboarding_skateboard_snowboard_snowboarding |
| 97 | electron - quarks - neutrons - antimatter - particle | 16 | 97_electron_quarks_neutrons_antimatter |
| 98 | happy - happiness - life - rushhour - secret | 16 | 98_happy_happiness_life_rushhour |
| 99 | fart - farting - gas - embarrassing - flatus | 16 | 99_fart_farting_gas_embarrassing |
| 100 | teeth - tooth - dentist - gums - braces | 16 | 100_teeth_tooth_dentist_gums |
| 101 | scorpio - zodiac - libra - signs - cancers | 16 | 101_scorpio_zodiac_libra_signs |
| 102 | dog - wolf - sheep - horse - animal | 16 | 102_dog_wolf_sheep_horse |
| 103 | hiv - aids - virus - infected - blood | 16 | 103_hiv_aids_virus_infected |
| 104 | thanked - poem - poetry - she - were | 15 | 104_thanked_poem_poetry_she |
| 105 | force - motion - mass - momentum - rocket | 15 | 105_force_motion_mass_momentum |
| 106 | minister - president - kagame - prime - natchaba | 15 | 106_minister_president_kagame_prime |
| 107 | kiss - kissing - lips - gently - tongue | 15 | 107_kiss_kissing_lips_gently |
| 108 | pictures - photos - google - site - find | 15 | 108_pictures_photos_google_site |
| 109 | dating - age - old - young - 19 | 14 | 109_dating_age_old_young |
| 110 | ebay - sell - smc - selling - products | 14 | 110_ebay_sell_smc_selling |
| 111 | planets - sun - stars - star - earth | 14 | 111_planets_sun_stars_star |
| 112 | imports - trade - importing - oil - mobil | 14 | 112_imports_trade_importing_oil |
| 113 | questions - question - politics - reported - answers | 14 | 113_questions_question_politics_reported |
| 114 | gay - crackle - snap - girlfriend - marry | 14 | 114_gay_crackle_snap_girlfriend |
| 115 | gravity - sun - earth - rotating - force | 13 | 115_gravity_sun_earth_rotating |
| 116 | weaknesses - abt - strengths - interview - job | 13 | 116_weaknesses_abt_strengths_interview |
| 117 | love - eachother - fall - forget - deeply | 13 | 117_love_eachother_fall_forget |
| 118 | animals - pets - tv - cage - communicate | 13 | 118_animals_pets_tv_cage |
| 119 | flax - yogurt - nonfat - health - healthy | 13 | 119_flax_yogurt_nonfat_health |
| 120 | grants - grant - business - federal - entrepreneurs | 13 | 120_grants_grant_business_federal |
| 121 | idol - american - chris - win - favorite | 12 | 121_idol_american_chris_win |
| 122 | clubs - golf - hit - irons - iron | 12 | 122_clubs_golf_hit_irons |
| 123 | lottery - scam - scammer - money - international | 12 | 123_lottery_scam_scammer_money |
| 124 | address - email - presale - bart - michaels | 12 | 124_address_email_presale_bart |
| 125 | jones - her - she - stargate - reynolds | 11 | 125_jones_her_she_stargate |
| 126 | data - product - analysis - regression - marketing | 11 | 126_data_product_analysis_regression |
| 127 | cancer - cure - tumor - parasite - cancers | 11 | 127_cancer_cure_tumor_parasite |
| 128 | nba - paul - team - redick - kobe | 11 | 128_nba_paul_team_redick |
| 129 | autism - autistic - homeschooling - child - she | 10 | 129_autism_autistic_homeschooling_child |
| 130 | seller - nike - soccer - jersey - dynamo | 10 | 130_seller_nike_soccer_jersey |

</details>

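In the table above, each value in the Label column is the topic ID followed by the first four topic keywords, joined with underscores. The short sketch below reproduces that pattern from a table row; `parse_keywords` and `default_label` are hypothetical helper names introduced for illustration and are not BERTopic functions.

```python
from typing import List


def parse_keywords(cell: str) -> List[str]:
    """Split a 'Topic Keywords' cell such as 'a - b - c' into a list of words."""
    return [word.strip() for word in cell.split(" - ")]


def default_label(topic_id: int, keywords: List[str], n_words: int = 4) -> str:
    """Join the topic id and the first n_words keywords with underscores."""
    return "_".join([str(topic_id)] + keywords[:n_words])


keywords = parse_keywords("computer - yahoo - windows - click - can")
print(default_label(0, keywords))  # prints 0_computer_yahoo_windows_click
```

The same rule applies to the outlier topic: `default_label(-1, parse_keywords("the - in - and - of - to"))` gives `-1_the_in_and_of`.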
## Training hyperparameters

* calculate_probabilities: False
* language: english
* low_memory: False
* min_topic_size: 10
* n_gram_range: (1, 1)
* nr_topics: None
* seed_topic_list: None
* top_n_words: 10
* verbose: False
* zeroshot_min_similarity: 0.7
* zeroshot_topic_list: None

## Framework versions

* Numpy: 1.23.5
* HDBSCAN: 0.8.33
* UMAP: 0.5.5
* Pandas: 1.5.3
* Scikit-Learn: 1.2.2
* Sentence-transformers: 2.2.2
* Transformers: 4.35.2
* Numba: 0.58.1
* Plotly: 5.15.0
* Python: 3.10.12
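
To compare a local environment against the framework versions listed above, a check along the following lines may help. It is a minimal sketch using only the standard library; the distribution names in `PINNED` (for example `umap-learn` for UMAP) are my assumption about the corresponding PyPI packages, and `version_mismatches` is a hypothetical helper.

```python
from importlib import metadata
from typing import Dict, Optional, Tuple

# Versions from the model card; the PyPI distribution names are assumptions.
PINNED: Dict[str, str] = {
    "numpy": "1.23.5",
    "hdbscan": "0.8.33",
    "umap-learn": "0.5.5",
    "pandas": "1.5.3",
    "scikit-learn": "1.2.2",
    "sentence-transformers": "2.2.2",
    "transformers": "4.35.2",
    "numba": "0.58.1",
    "plotly": "5.15.0",
}


def version_mismatches(pinned: Dict[str, str]) -> Dict[str, Tuple[str, Optional[str]]]:
    """Map each package whose installed version differs to (wanted, installed)."""
    mismatches: Dict[str, Tuple[str, Optional[str]]] = {}
    for name, wanted in pinned.items():
        try:
            installed: Optional[str] = metadata.version(name)
        except metadata.PackageNotFoundError:
            installed = None  # package is not installed at all
        if installed != wanted:
            mismatches[name] = (wanted, installed)
    return mismatches
```

Calling `version_mismatches(PINNED)` returns an empty dict when every listed package matches. A mismatch does not necessarily prevent the model from loading; it only flags a difference from the training environment.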
config.json
ADDED
@@ -0,0 +1,17 @@
{
  "calculate_probabilities": false,
  "language": "english",
  "low_memory": false,
  "min_topic_size": 10,
  "n_gram_range": [
    1,
    1
  ],
  "nr_topics": null,
  "seed_topic_list": null,
  "top_n_words": 10,
  "verbose": false,
  "zeroshot_min_similarity": 0.7,
  "zeroshot_topic_list": null,
  "embedding_model": "sentence-transformers/all-MiniLM-L6-v2"
}
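
The config above stores `n_gram_range` as a JSON list and keeps the embedding model name next to the hyperparameters. The sketch below shows one way these values could be turned into keyword arguments for a fresh, untrained `BERTopic` instance; `split_config` is a hypothetical helper, and the final constructor call is left as a comment because it needs `bertopic` and `sentence-transformers` installed.

```python
import json
from typing import Any, Dict, Tuple

# Contents of config.json as shown in this commit.
CONFIG_JSON = """
{
  "calculate_probabilities": false,
  "language": "english",
  "low_memory": false,
  "min_topic_size": 10,
  "n_gram_range": [1, 1],
  "nr_topics": null,
  "seed_topic_list": null,
  "top_n_words": 10,
  "verbose": false,
  "zeroshot_min_similarity": 0.7,
  "zeroshot_topic_list": null,
  "embedding_model": "sentence-transformers/all-MiniLM-L6-v2"
}
"""


def split_config(raw: str) -> Tuple[Dict[str, Any], str]:
    """Return (hyperparameter kwargs, embedding model name) from the raw JSON."""
    config = json.loads(raw)
    embedding_model = config.pop("embedding_model")
    # JSON has no tuple type, so the (1, 1) range is stored as a list.
    config["n_gram_range"] = tuple(config["n_gram_range"])
    return config, embedding_model


kwargs, embedding_model = split_config(CONFIG_JSON)

# Assumed usage (not executed here):
# from bertopic import BERTopic
# fresh_model = BERTopic(embedding_model=embedding_model, **kwargs)
```

This only recreates the settings; the fitted topics themselves live in the other files of the commit and are restored by `BERTopic.load`.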
ctfidf.safetensors
ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:c605e6ed1e25a6350470804f2fcddbdfda9a732896a0ae7ff0dc5ad8de04a08c
size 2648404
ctfidf_config.json
ADDED
The diff for this file is too large to render.
topic_embeddings.safetensors
ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:57922ca0e0dd31a10cdaa216b46e428d9a4a163a913f4f2922970f5b48816ab5
size 202840
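
The two `.safetensors` entries in this commit are Git LFS pointer files: three lines giving the spec version, a SHA-256 object ID, and the size in bytes of the real file. The following sketch parses such a pointer and checks downloaded bytes against it; `parse_lfs_pointer` and `matches_pointer` are hypothetical helpers written for illustration, using only the standard library.

```python
import hashlib
from typing import Dict


def parse_lfs_pointer(text: str) -> Dict[str, object]:
    """Parse a Git LFS pointer file into version, hash algorithm, digest, and size."""
    fields: Dict[str, str] = {}
    for line in text.strip().splitlines():
        key, _, value = line.partition(" ")
        fields[key] = value
    algo, _, digest = fields["oid"].partition(":")
    return {
        "version": fields["version"],
        "algo": algo,
        "digest": digest,
        "size": int(fields["size"]),
    }


def matches_pointer(data: bytes, pointer: Dict[str, object]) -> bool:
    """Check that data has the size and SHA-256 digest recorded in the pointer."""
    if pointer["algo"] != "sha256":
        raise ValueError("only sha256 object IDs are handled in this sketch")
    return len(data) == pointer["size"] and hashlib.sha256(data).hexdigest() == pointer["digest"]


pointer = parse_lfs_pointer(
    "version https://git-lfs.github.com/spec/v1\n"
    "oid sha256:57922ca0e0dd31a10cdaa216b46e428d9a4a163a913f4f2922970f5b48816ab5\n"
    "size 202840\n"
)
print(pointer["size"])  # prints 202840
```

To verify a downloaded `topic_embeddings.safetensors`, one could read the file in binary mode and pass its bytes to `matches_pointer` together with the parsed pointer.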
topics.json
ADDED
The diff for this file is too large to render.