Commit 0e28a9e, committed by KingKazma
1 Parent(s): bbc8490

Add BERTopic model

Files changed (4)
  1. README.md +193 -0
  2. config.json +15 -0
  3. topic_embeddings.safetensors +3 -0
  4. topics.json +0 -0
README.md ADDED
@@ -0,0 +1,193 @@
+
+ ---
+ tags:
+ - bertopic
+ library_name: bertopic
+ pipeline_tag: text-classification
+ ---
+
+ # cnn_dailymail_108_50000_25000_test
+
+ This is a [BERTopic](https://github.com/MaartenGr/BERTopic) model.
+ BERTopic is a flexible and modular topic modeling framework that allows for the generation of easily interpretable topics from large datasets.
+
+ ## Usage
+
+ To use this model, please install BERTopic:
+
+ ```
+ pip install -U bertopic
+ ```
+
+ You can use the model as follows:
+
+ ```python
+ from bertopic import BERTopic
+ topic_model = BERTopic.load("KingKazma/cnn_dailymail_108_50000_25000_test")
+
+ topic_model.get_topic_info()
+ ```
+
+ ## Topic overview
+
+ * Number of topics: 126
+ * Number of training documents: 11490
+
+ <details>
+ <summary>Click here for an overview of all topics.</summary>
+
+ | Topic ID | Topic Keywords | Topic Frequency | Label |
+ |----------|----------------|-----------------|-------|
+ | -1 | said - one - year - also - would | 5 | -1_said_one_year_also |
+ | 0 | madrid - real - barcelona - atletico - ronaldo | 6198 | 0_madrid_real_barcelona_atletico |
+ | 1 | isis - syria - islamic - group - iraqi | 263 | 1_isis_syria_islamic_group |
+ | 2 | baby - cancer - hospital - birth - mother | 189 | 2_baby_cancer_hospital_birth |
+ | 3 | fight - mayweather - pacquiao - manny - floyd | 177 | 3_fight_mayweather_pacquiao_manny |
+ | 4 | driver - car - vehicle - road - crash | 163 | 4_driver_car_vehicle_road |
+ | 5 | united - van - manchester - gaal - city | 156 | 5_united_van_manchester_gaal |
+ | 6 | labour - ukip - party - mr - miliband | 129 | 6_labour_ukip_party_mr |
+ | 7 | masters - woods - augusta - mcilroy - spieth | 117 | 7_masters_woods_augusta_mcilroy |
+ | 8 | england - test - wicket - cricket - anderson | 115 | 8_england_test_wicket_cricket |
+ | 9 | fashion - dress - model - collection - designer | 107 | 9_fashion_dress_model_collection |
+ | 10 | school - student - teacher - sexual - sex | 105 | 10_school_student_teacher_sexual |
+ | 11 | minute - wigan - goal - watford - league | 99 | 11_minute_wigan_goal_watford |
+ | 12 | chocolate - food - sugar - egg - chicken | 97 | 12_chocolate_food_sugar_egg |
+ | 13 | liverpool - sterling - rodgers - gerrard - raheem | 93 | 13_liverpool_sterling_rodgers_gerrard |
+ | 14 | celtic - rangers - scottish - inverness - mccall | 91 | 14_celtic_rangers_scottish_inverness |
+ | 15 | clinton - hillary - clintons - president - campaign | 90 | 15_clinton_hillary_clintons_president |
+ | 16 | lion - animal - zoo - elephant - bear | 89 | 16_lion_animal_zoo_elephant |
+ | 17 | dog - cat - animal - pet - owner | 89 | 17_dog_cat_animal_pet |
+ | 18 | saracens - rugby - clermont - wasps - toulon | 75 | 18_saracens_rugby_clermont_wasps |
+ | 19 | murray - djokovic - open - berdych - miami | 74 | 19_murray_djokovic_open_berdych |
+ | 20 | chelsea - mourinho - chelseas - hazard - league | 70 | 20_chelsea_mourinho_chelseas_hazard |
+ | 21 | villa - sherwood - benteke - ramsey - aston | 68 | 21_villa_sherwood_benteke_ramsey |
+ | 22 | property - home - house - room - estate | 68 | 22_property_home_house_room |
+ | 23 | arsenal - wenger - gunners - arsene - sanchez | 66 | 23_arsenal_wenger_gunners_arsene |
+ | 24 | planet - earth - solar - surface - moon | 66 | 24_planet_earth_solar_surface |
+ | 25 | hamilton - rosberg - race - mercedes - prix | 66 | 25_hamilton_rosberg_race_mercedes |
+ | 26 | bayern - guardiola - porto - munich - pep | 64 | 26_bayern_guardiola_porto_munich |
+ | 27 | wars - film - star - movie - trailer | 62 | 27_wars_film_star_movie |
+ | 28 | hughes - capitol - security - snowden - gyrocopter | 60 | 28_hughes_capitol_security_snowden |
+ | 29 | newcastle - sunderland - carver - game - fan | 58 | 29_newcastle_sunderland_carver_game |
+ | 30 | gray - police - baltimore - officer - grays | 56 | 30_gray_police_baltimore_officer |
+ | 31 | shot - shooting - police - gun - said | 56 | 31_shot_shooting_police_gun |
+ | 32 | flight - plane - passenger - airport - airline | 53 | 32_flight_plane_passenger_airport |
+ | 33 | nepal - earthquake - kathmandu - everest - quake | 53 | 33_nepal_earthquake_kathmandu_everest |
+ | 34 | prince - royal - duchess - princess - queen | 51 | 34_prince_royal_duchess_princess |
+ | 35 | hotel - island - resort - sea - room | 49 | 35_hotel_island_resort_sea |
+ | 36 | fire - blaze - flame - smoke - firefighter | 48 | 36_fire_blaze_flame_smoke |
+ | 37 | ship - vessel - boat - crew - titanic | 48 | 37_ship_vessel_boat_crew |
+ | 38 | mccoy - jockey - race - ride - ap | 47 | 38_mccoy_jockey_race_ride |
+ | 39 | chan - sukumaran - execution - bali - indonesian | 46 | 39_chan_sukumaran_execution_bali |
+ | 40 | anzac - gallipoli - war - australian - sbs | 45 | 40_anzac_gallipoli_war_australian |
+ | 41 | weight - size - stone - eating - food | 44 | 41_weight_size_stone_eating |
+ | 42 | migrant - boat - libya - mediterranean - italian | 42 | 42_migrant_boat_libya_mediterranean |
+ | 43 | iran - nuclear - deal - agreement - irans | 41 | 43_iran_nuclear_deal_agreement |
+ | 44 | shark - whale - fish - seal - dolphin | 41 | 44_shark_whale_fish_seal |
+ | 45 | manziel - nfl - game - sox - quarterback | 41 | 45_manziel_nfl_game_sox |
+ | 46 | yemen - saudi - houthi - houthis - rebel | 40 | 46_yemen_saudi_houthi_houthis |
+ | 47 | money - fraud - court - bank - account | 39 | 47_money_fraud_court_bank |
+ | 48 | cave - dinosaur - neanderthals - bone - researcher | 35 | 48_cave_dinosaur_neanderthals_bone |
+ | 49 | bruce - bobbi - bobby - jenner - kris | 35 | 49_bruce_bobbi_bobby_jenner |
+ | 50 | pardew - premier - ham - league - palace | 34 | 50_pardew_premier_ham_league |
+ | 51 | hernandez - lloyd - hernandezs - odin - murder | 34 | 51_hernandez_lloyd_hernandezs_odin |
+ | 52 | apple - watch - battery - iphone - samsung | 33 | 52_apple_watch_battery_iphone |
+ | 53 | law - marriage - religious - samesex - gay | 32 | 53_law_marriage_religious_samesex |
+ | 54 | tsarnaev - boston - dzhokhar - marathon - tamerlan | 32 | 54_tsarnaev_boston_dzhokhar_marathon |
+ | 55 | buckley - police - glasgow - murder - miss | 32 | 55_buckley_police_glasgow_murder |
+ | 56 | slager - scott - officer - charleston - taser | 31 | 56_slager_scott_officer_charleston |
+ | 57 | alshabaab - garissa - kenya - kenyan - somalia | 31 | 57_alshabaab_garissa_kenya_kenyan |
+ | 58 | marathon - running - race - runner - run | 29 | 58_marathon_running_race_runner |
+ | 59 | tax - labour - osborne - economy - balls | 29 | 59_tax_labour_osborne_economy |
+ | 60 | aldi - phone - per - tesco - app | 29 | 60_aldi_phone_per_tesco |
+ | 61 | dellinger - langlais - murder - fradeneck - body | 29 | 61_dellinger_langlais_murder_fradeneck |
+ | 62 | vault - gang - raid - thief - hatton | 28 | 62_vault_gang_raid_thief |
+ | 63 | lubitz - germanwings - flight - crash - plane | 27 | 63_lubitz_germanwings_flight_crash |
+ | 64 | janner - lord - saunders - abuse - public | 27 | 64_janner_lord_saunders_abuse |
+ | 65 | point - nba - playoff - scored - rebound | 27 | 65_point_nba_playoff_scored |
+ | 66 | kane - tottenham - pochettino - townsend - spurs | 27 | 66_kane_tottenham_pochettino_townsend |
+ | 67 | groening - camp - auschwitz - nazi - kor | 26 | 67_groening_camp_auschwitz_nazi |
+ | 68 | crucible - osullivan - selby - frame - doherty | 25 | 68_crucible_osullivan_selby_frame |
+ | 69 | korea - kim - korean - north - seoul | 25 | 69_korea_kim_korean_north |
+ | 70 | melbourne - australia - islamic - rally - australian | 25 | 70_melbourne_australia_islamic_rally |
+ | 71 | nhs - patient - gp - gps - ae | 25 | 71_nhs_patient_gp_gps |
+ | 72 | bates - harris - tulsa - deputy - taser | 25 | 72_bates_harris_tulsa_deputy |
+ | 73 | artist - art - paint - painting - colouring | 25 | 73_artist_art_paint_painting |
+ | 74 | south - johannesburg - africa - african - violence | 24 | 74_south_johannesburg_africa_african |
+ | 75 | phelps - ennishill - olympic - bolt - championships | 24 | 75_phelps_ennishill_olympic_bolt |
+ | 76 | scott - leeton - stephanie - ms - scotts | 23 | 76_scott_leeton_stephanie_ms |
+ | 77 | cancer - breast - prostate - gene - treatment | 23 | 77_cancer_breast_prostate_gene |
+ | 78 | clarkson - gear - bbc - hammond - top | 21 | 78_clarkson_gear_bbc_hammond |
+ | 79 | russian - putin - russia - ukraine - moscow | 21 | 79_russian_putin_russia_ukraine |
+ | 80 | marijuana - cannabis - drug - hemp - medical | 21 | 80_marijuana_cannabis_drug_hemp |
+ | 81 | emoji - app - user - facebook - instagram | 21 | 81_emoji_app_user_facebook |
+ | 82 | brain - memory - dementia - study - rat | 20 | 82_brain_memory_dementia_study |
+ | 83 | vaccine - vaccination - cough - whooping - autism | 19 | 83_vaccine_vaccination_cough_whooping |
+ | 84 | boko - haram - nigeria - buhari - nigerian | 19 | 84_boko_haram_nigeria_buhari |
+ | 85 | housing - tenant - buy - association - property | 19 | 85_housing_tenant_buy_association |
+ | 86 | benaud - cricket - richie - test - commentator | 18 | 86_benaud_cricket_richie_test |
+ | 87 | space - rocket - spacex - launch - astronaut | 18 | 87_space_rocket_spacex_launch |
+ | 88 | cuba - castro - obama - cuban - president | 18 | 88_cuba_castro_obama_cuban |
+ | 89 | kun - pingan - xie - baby - china | 18 | 89_kun_pingan_xie_baby |
+ | 90 | song - music - studio - abbey - manuscript | 17 | 90_song_music_studio_abbey |
+ | 91 | holpin - funeral - care - child - older | 17 | 91_holpin_funeral_care_child |
+ | 92 | diamond - underground - rock - cave - garnet | 17 | 92_diamond_underground_rock_cave |
+ | 93 | sydney - storm - ses - flooding - weather | 17 | 93_sydney_storm_ses_flooding |
+ | 94 | china - chinese - chinas - gao - organ | 17 | 94_china_chinese_chinas_gao |
+ | 95 | genocide - armenians - armenian - pope - ottoman | 16 | 95_genocide_armenians_armenian_pope |
+ | 96 | dunblane - murray - andy - cathedral - wedding | 16 | 96_dunblane_murray_andy_cathedral |
+ | 97 | weather - temperature - warm - sunshine - yesterday | 15 | 97_weather_temperature_warm_sunshine |
+ | 98 | crash - died - manyang - accident - guode | 15 | 98_crash_died_manyang_accident |
+ | 99 | brandt - dr - kimmy - franff - fredric | 14 | 99_brandt_dr_kimmy_franff |
+ | 100 | mchenry - britt - towing - espn - battilana | 14 | 100_mchenry_britt_towing_espn |
+ | 101 | population - cent - per - immigrant - country | 14 | 101_population_cent_per_immigrant |
+ | 102 | tornado - storm - weather - hail - severe | 13 | 102_tornado_storm_weather_hail |
+ | 103 | school - exam - pupil - math - english | 13 | 103_school_exam_pupil_math |
+ | 104 | klopp - dortmund - tuchel - borussia - bundesliga | 13 | 104_klopp_dortmund_tuchel_borussia |
+ | 105 | water - drought - california - state - snow | 12 | 105_water_drought_california_state |
+ | 106 | fifa - blatter - uefa - chess - football | 12 | 106_fifa_blatter_uefa_chess |
+ | 107 | koeman - southampton - saints - fuchs - season | 12 | 107_koeman_southampton_saints_fuchs |
+ | 108 | nuclear - reactor - radiation - plant - fukushima | 12 | 108_nuclear_reactor_radiation_plant |
+ | 109 | car - audi - electric - vehicle - motor | 11 | 109_car_audi_electric_vehicle |
+ | 110 | glitter - sex - nauru - sexual - mee | 11 | 110_glitter_sex_nauru_sexual |
+ | 111 | luke - search - eildon - shambrook - missing | 11 | 111_luke_search_eildon_shambrook |
+ | 112 | music - spotify - radio - streaming - service | 11 | 112_music_spotify_radio_streaming |
+ | 113 | pusok - deputy - officer - mcmahon - bernardino | 9 | 113_pusok_deputy_officer_mcmahon |
+ | 114 | valle - hilbert - gilberto - dmx - whittington | 8 | 114_valle_hilbert_gilberto_dmx |
+ | 115 | alcohol - oak - wine - drinking - bottle | 7 | 115_alcohol_oak_wine_drinking |
+ | 116 | dollar - g650 - jesus - catholic - god | 7 | 116_dollar_g650_jesus_catholic |
+ | 117 | hair - labium - cheryl - finn - woman | 7 | 117_hair_labium_cheryl_finn |
+ | 118 | okawa - oldest - guinness - weber - weaver | 7 | 118_okawa_oldest_guinness_weber |
+ | 119 | martinez - everton - mirallas - lennon - roberto | 6 | 119_martinez_everton_mirallas_lennon |
+ | 120 | karmel - diet - paleo - vitamin - health | 6 | 120_karmel_diet_paleo_vitamin |
+ | 121 | redman - wisconsin - badgers - basketball - wildcats | 5 | 121_redman_wisconsin_badgers_basketball |
+ | 122 | singh - india - jyoti - indian - protest | 5 | 122_singh_india_jyoti_indian |
+ | 123 | volcano - eruption - dune - linear - ash | 5 | 123_volcano_eruption_dune_linear |
+ | 124 | tahir - castner - mascitelli - easy - restraining | 5 | 124_tahir_castner_mascitelli_easy |
+
+ </details>
+
+ ## Training hyperparameters
+
+ * calculate_probabilities: True
+ * language: english
+ * low_memory: False
+ * min_topic_size: 10
+ * n_gram_range: (1, 1)
+ * nr_topics: None
+ * seed_topic_list: None
+ * top_n_words: 10
+ * verbose: False
+
+ ## Framework versions
+
+ * Numpy: 1.22.4
+ * HDBSCAN: 0.8.33
+ * UMAP: 0.5.3
+ * Pandas: 1.5.3
+ * Scikit-Learn: 1.2.2
+ * Sentence-transformers: 2.2.2
+ * Transformers: 4.31.0
+ * Numba: 0.57.1
+ * Plotly: 5.13.1
+ * Python: 3.10.12
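
The topic overview table in the README can also be read programmatically. The following is a minimal sketch, using only the Python standard library, that parses two rows copied from that table and computes each topic's share of the 11490 training documents. The helper name `parse_topic_row` is hypothetical and is not part of BERTopic.

```python
def parse_topic_row(row: str) -> dict:
    """Split one Markdown row of the topic overview table into typed fields."""
    # Drop the outer pipes, then split the remaining text into the four cells.
    cells = [cell.strip() for cell in row.strip().strip("|").split("|")]
    topic_id, keywords, frequency, label = cells
    return {
        "topic_id": int(topic_id),
        "keywords": keywords.split(" - "),
        "frequency": int(frequency),
        "label": label,
    }


# Two rows copied verbatim from the README table.
rows = [
    "| -1 | said - one - year - also - would | 5 | -1_said_one_year_also |",
    "| 0 | madrid - real - barcelona - atletico - ronaldo | 6198 | 0_madrid_real_barcelona_atletico |",
]

# Document count as stated in the README's "Topic overview" section.
N_TRAINING_DOCS = 11490

topics = [parse_topic_row(row) for row in rows]
shares = {t["topic_id"]: t["frequency"] / N_TRAINING_DOCS for t in topics}

for t in topics:
    print(f"topic {t['topic_id']:>3}: {t['frequency']:>5} docs "
          f"({shares[t['topic_id']]:.1%}) {t['keywords']}")
```

On these numbers, topic 0 alone accounts for roughly 54% of the training documents, while only 5 documents fall into topic -1, which BERTopic uses for outliers.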
config.json ADDED
@@ -0,0 +1,15 @@
+ {
+   "calculate_probabilities": true,
+   "language": "english",
+   "low_memory": false,
+   "min_topic_size": 10,
+   "n_gram_range": [
+     1,
+     1
+   ],
+   "nr_topics": null,
+   "seed_topic_list": null,
+   "top_n_words": 10,
+   "verbose": false,
+   "embedding_model": "sentence-transformers/all-MiniLM-L6-v2"
+ }
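
The config above mirrors the README's "Training hyperparameters" list, with two differences: `n_gram_range` is a JSON array rather than a Python tuple, and the file carries an extra `embedding_model` key. As a minimal sketch (standard library only, with the JSON text inlined rather than read from disk), the values could be recovered like this. The variable names are illustrative.

```python
import json

# Contents of config.json as added in this commit, inlined for the example.
CONFIG_TEXT = """
{
  "calculate_probabilities": true,
  "language": "english",
  "low_memory": false,
  "min_topic_size": 10,
  "n_gram_range": [1, 1],
  "nr_topics": null,
  "seed_topic_list": null,
  "top_n_words": 10,
  "verbose": false,
  "embedding_model": "sentence-transformers/all-MiniLM-L6-v2"
}
"""

config = json.loads(CONFIG_TEXT)

# JSON has no tuple type, so the README's (1, 1) is stored as [1, 1].
n_gram_range = tuple(config["n_gram_range"])

# The embedding model name sits next to the hyperparameters; separate the two.
embedding_model = config["embedding_model"]
hyperparameters = {k: v for k, v in config.items() if k != "embedding_model"}

print(n_gram_range)
print(embedding_model)
print(sorted(hyperparameters))
```

The nine remaining keys correspond one-to-one with the hyperparameters listed in the README.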
topic_embeddings.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:ae8403b0a69922d827e4c40a392ddf770eb2d9b083cedc73b310daa85dcec093
+ size 193624
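
The three lines above are a Git LFS pointer, not the tensor data itself. The sketch below parses the pointer with the standard library and runs a rough size check. The check rests on assumptions that the commit does not confirm: one embedding per row of the README topic table (126, including topic -1), 384 dimensions per embedding (the output size of all-MiniLM-L6-v2), and float32 storage. The helper name `parse_lfs_pointer` is hypothetical.

```python
POINTER_TEXT = """version https://git-lfs.github.com/spec/v1
oid sha256:ae8403b0a69922d827e4c40a392ddf770eb2d9b083cedc73b310daa85dcec093
size 193624
"""


def parse_lfs_pointer(text: str) -> dict:
    """Parse the 'key value' lines of a Git LFS pointer file."""
    fields = dict(line.split(" ", 1) for line in text.splitlines() if line)
    algorithm, digest = fields["oid"].split(":", 1)
    return {
        "version": fields["version"],
        "hash_algorithm": algorithm,
        "digest": digest,
        "size": int(fields["size"]),
    }


pointer = parse_lfs_pointer(POINTER_TEXT)

# Assumptions (not stated in the commit): 126 topics, 384 dims, 4-byte floats.
N_TOPICS, EMBEDDING_DIM, BYTES_PER_FLOAT32 = 126, 384, 4
expected_tensor_bytes = N_TOPICS * EMBEDDING_DIM * BYTES_PER_FLOAT32

# Whatever is left over would have to be file-format overhead.
overhead_bytes = pointer["size"] - expected_tensor_bytes

print(pointer["hash_algorithm"], pointer["size"])
print(expected_tensor_bytes, overhead_bytes)
```

Under those assumptions the tensor data would take 193536 bytes, leaving 88 bytes. That remainder would be consistent with a small safetensors header (an 8-byte length prefix followed by JSON metadata), though the commit itself does not confirm the layout.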
topics.json ADDED
The diff for this file is too large to render. See raw diff