Add BERTopic model
- README.md +185 -0
- config.json +15 -0
- topic_embeddings.safetensors +3 -0
- topics.json +0 -0
README.md
ADDED
@@ -0,0 +1,185 @@
---
tags:
- bertopic
library_name: bertopic
pipeline_tag: text-classification
---

# cnn_dailymail_6789_50000_25000_validation

This is a [BERTopic](https://github.com/MaartenGr/BERTopic) model.
BERTopic is a flexible and modular topic modeling framework that allows for the generation of easily interpretable topics from large datasets.

## Usage

To use this model, please install BERTopic:

```
pip install -U bertopic
```

You can use the model as follows:

```python
from bertopic import BERTopic
topic_model = BERTopic.load("KingKazma/cnn_dailymail_6789_50000_25000_validation")

topic_model.get_topic_info()
```
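
Beyond listing the topics, the loaded model can also label unseen text. The following is a minimal sketch, not part of the original model card: the helper name `assign_topics` is hypothetical, and it assumes `bertopic` and `sentence-transformers` are installed and that the repository can be downloaded from the Hub.

```python
def assign_topics(docs, repo_id="KingKazma/cnn_dailymail_6789_50000_25000_validation"):
    """Return a (topic_id, label) pair for each document in `docs`.

    Hypothetical helper for illustration; BERTopic is imported lazily so that
    defining the function does not require the library to be installed.
    """
    from bertopic import BERTopic

    topic_model = BERTopic.load(repo_id)

    # transform() assigns each document to its closest topic.
    topics, _probabilities = topic_model.transform(docs)

    # get_topic_info() returns a DataFrame with "Topic", "Count" and "Name" columns.
    info = topic_model.get_topic_info().set_index("Topic")
    return [(int(topic), info.loc[topic, "Name"]) for topic in topics]
```

Calling `assign_topics(["Lewis Hamilton won the race for Mercedes."])` would return one pair per input document; which topic is chosen depends on the embedding model and is not guaranteed here.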

## Topic overview

* Number of topics: 118
* Number of training documents: 13368

<details>
  <summary>Click here for an overview of all topics.</summary>

| Topic ID | Topic Keywords | Topic Frequency | Label |
|----------|----------------|-----------------|-------|
| -1 | said - one - year - also - time | 5 | -1_said_one_year_also |
| 0 | isis - syria - islamic - attack - group | 6535 | 0_isis_syria_islamic_attack |
| 1 | police - officer - shooting - ferguson - said | 452 | 1_police_officer_shooting_ferguson |
| 2 | labour - mr - party - election - tax | 415 | 2_labour_mr_party_election |
| 3 | flight - plane - pilot - aircraft - lubitz | 268 | 3_flight_plane_pilot_aircraft |
| 4 | car - driver - driving - road - crash | 224 | 4_car_driver_driving_road |
| 5 | hair - fashion - dress - model - look | 223 | 5_hair_fashion_dress_model |
| 6 | cricket - england - cup - world - pietersen | 205 | 6_cricket_england_cup_world |
| 7 | food - sugar - per - cent - product | 189 | 7_food_sugar_per_cent |
| 8 | clinton - email - obama - president - clintons | 188 | 8_clinton_email_obama_president |
| 9 | property - house - home - price - room | 186 | 9_property_house_home_price |
| 10 | rangers - celtic - scotland - ibrox - game | 165 | 10_rangers_celtic_scotland_ibrox |
| 11 | fight - pacquiao - mayweather - manny - floyd | 151 | 11_fight_pacquiao_mayweather_manny |
| 12 | england - nations - wales - ireland - six | 143 | 12_england_nations_wales_ireland |
| 13 | hamilton - mercedes - prix - race - rosberg | 135 | 13_hamilton_mercedes_prix_race |
| 14 | baby - birth - cancer - hospital - born | 126 | 14_baby_birth_cancer_hospital |
| 15 | fa - league - game - villa - bradford | 116 | 15_fa_league_game_villa |
| 16 | dog - animal - dogs - owner - pet | 114 | 16_dog_animal_dogs_owner |
| 17 | police - abuse - sexual - sex - child | 112 | 17_police_abuse_sexual_sex |
| 18 | madrid - ronaldo - barcelona - real - messi | 111 | 18_madrid_ronaldo_barcelona_real |
| 19 | chelsea - mourinho - terry - league - jose | 106 | 19_chelsea_mourinho_terry_league |
| 20 | eclipse - earth - mars - solar - sun | 101 | 20_eclipse_earth_mars_solar |
| 21 | kane - england - hodgson - lithuania - rooney | 100 | 21_kane_england_hodgson_lithuania |
| 22 | show - film - corden - host - noah | 95 | 22_show_film_corden_host |
| 23 | prince - royal - duchess - charles - queen | 92 | 23_prince_royal_duchess_charles |
| 24 | murray - wells - tennis - andy - 64 | 88 | 24_murray_wells_tennis_andy |
| 25 | putin - russian - nemtsov - moscow - russia | 82 | 25_putin_russian_nemtsov_moscow |
| 26 | netanyahu - iran - nuclear - israel - israeli | 80 | 26_netanyahu_iran_nuclear_israel |
| 27 | court - money - bank - fraud - stiviano | 80 | 27_court_money_bank_fraud |
| 28 | weight - size - fat - stone - diet | 76 | 28_weight_size_fat_stone |
| 29 | armstrong - race - olympic - uci - championships | 74 | 29_armstrong_race_olympic_uci |
| 30 | cheltenham - hurdle - horse - jockey - festival | 73 | 30_cheltenham_hurdle_horse_jockey |
| 31 | arsenal - wenger - monaco - giroud - arsenals | 73 | 31_arsenal_wenger_monaco_giroud |
| 32 | mcilroy - golf - masters - woods - round | 72 | 32_mcilroy_golf_masters_woods |
| 33 | watch - apple - device - google - user | 66 | 33_watch_apple_device_google |
| 34 | fraternity - university - sae - oklahoma - chapter | 65 | 34_fraternity_university_sae_oklahoma |
| 35 | united - van - gaal - manchester - arsenal | 62 | 35_united_van_gaal_manchester |
| 36 | chan - sukumaran - indonesian - bali - myuran | 61 | 36_chan_sukumaran_indonesian_bali |
| 37 | school - teacher - student - district - sexual | 58 | 37_school_teacher_student_district |
| 38 | sunderland - poyet - advocaat - johnson - april | 55 | 38_sunderland_poyet_advocaat_johnson |
| 39 | clarkson - bbc - gear - top - jeremy | 55 | 39_clarkson_bbc_gear_top |
| 40 | fire - building - blaze - explosion - firefighter | 48 | 40_fire_building_blaze_explosion |
| 41 | liverpool - gerrard - rodgers - steven - anfield | 46 | 41_liverpool_gerrard_rodgers_steven |
| 42 | patient - nhs - ae - cancer - care | 44 | 42_patient_nhs_ae_cancer |
| 43 | song - zayn - thicke - gayes - pharrell | 43 | 43_song_zayn_thicke_gayes |
| 44 | wedding - married - couple - jaclyn - love | 41 | 44_wedding_married_couple_jaclyn |
| 45 | car - vehicle - electric - model - jaguar | 41 | 45_car_vehicle_electric_model |
| 46 | nfl - borland - bowl - brady - super | 40 | 46_nfl_borland_bowl_brady |
| 47 | pellegrini - city - league - manchester - barcelona | 40 | 47_pellegrini_city_league_manchester |
| 48 | school - education - porn - sex - child | 39 | 48_school_education_porn_sex |
| 49 | bear - cub - tiger - deer - wildlife | 39 | 49_bear_cub_tiger_deer |
| 50 | gay - law - indiana - marriage - religious | 38 | 50_gay_law_indiana_marriage |
| 51 | india - rape - indian - documentary - singh | 37 | 51_india_rape_indian_documentary |
| 52 | boko - haram - nigeria - nigerian - nigerias | 36 | 52_boko_haram_nigeria_nigerian |
| 53 | ebola - sierra - leone - virus - liberia | 35 | 53_ebola_sierra_leone_virus |
| 54 | tsarnaev - dzhokhar - boston - tamerlan - tsarnaevs | 35 | 54_tsarnaev_dzhokhar_boston_tamerlan |
| 55 | ski - mountain - skier - rock - lift | 32 | 55_ski_mountain_skier_rock |
| 56 | robbery - armed - store - police - bank | 31 | 56_robbery_armed_store_police |
| 57 | roma - inter - juventus - serie - fiorentina | 30 | 57_roma_inter_juventus_serie |
| 58 | fifa - blatter - fa - qatar - cup | 29 | 58_fifa_blatter_fa_qatar |
| 59 | marijuana - drug - cannabis - colorado - lsd | 29 | 59_marijuana_drug_cannabis_colorado |
| 60 | everton - martinez - lukaku - dynamo - evertons | 27 | 60_everton_martinez_lukaku_dynamo |
| 61 | chelsea - racist - paris - train - football | 27 | 61_chelsea_racist_paris_train |
| 62 | durst - dursts - berman - orleans - robert | 27 | 62_durst_dursts_berman_orleans |
| 63 | basketball - ncaa - coach - tournament - game | 25 | 63_basketball_ncaa_coach_tournament |
| 64 | bayern - goal - muller - shakhtar - robben | 25 | 64_bayern_goal_muller_shakhtar |
| 65 | hotel - beach - cruise - ship - resort | 25 | 65_hotel_beach_cruise_ship |
| 66 | sherwood - villa - aston - tim - brom | 25 | 66_sherwood_villa_aston_tim |
| 67 | snow - inch - winter - weather - ice | 24 | 67_snow_inch_winter_weather |
| 68 | weather - temperature - rain - snow - expected | 24 | 68_weather_temperature_rain_snow |
| 69 | korean - korea - kim - north - lippert | 23 | 69_korean_korea_kim_north |
| 70 | hospital - doctor - mrs - fracture - patient | 23 | 70_hospital_doctor_mrs_fracture |
| 71 | rail - calais - parking - transport - train | 22 | 71_rail_calais_parking_transport |
| 72 | mls - lampard - orlando - city - york | 22 | 72_mls_lampard_orlando_city |
| 73 | jesus - stone - circle - ancient - stonehenge | 22 | 73_jesus_stone_circle_ancient |
| 74 | hernandez - lloyd - jenkins - hernandezs - lloyds | 21 | 74_hernandez_lloyd_jenkins_hernandezs |
| 75 | drug - cocaine - jailed - steroid - cannabis | 20 | 75_drug_cocaine_jailed_steroid |
| 76 | secret - clancy - service - agent - white | 20 | 76_secret_clancy_service_agent |
| 77 | homo - fossil - specie - ago - human | 20 | 77_homo_fossil_specie_ago |
| 78 | image - photographer - photograph - photo - landscape | 19 | 78_image_photographer_photograph_photo |
| 79 | parade - patricks - irish - st - green | 19 | 79_parade_patricks_irish_st |
| 80 | bale - wales - israel - coleman - gareth | 19 | 80_bale_wales_israel_coleman |
| 81 | di - maria - angel - united - manchester | 19 | 81_di_maria_angel_united |
| 82 | defence - greece - spending - greek - budget | 19 | 82_defence_greece_spending_greek |
| 83 | sleep - store - cent - per - kraft | 18 | 83_sleep_store_cent_per |
| 84 | student - johnson - virginia - charlottesville - university | 18 | 84_student_johnson_virginia_charlottesville |
| 85 | vanuatu - cyclone - vila - pam - port | 18 | 85_vanuatu_cyclone_vila_pam |
| 86 | cnn - transcript - student - news - roll | 18 | 86_cnn_transcript_student_news |
| 87 | nazi - anne - nazis - war - camp | 18 | 87_nazi_anne_nazis_war |
| 88 | attack - synagogue - hebdo - paris - charlie | 17 | 88_attack_synagogue_hebdo_paris |
| 89 | ham - west - tomkins - reid - kouyate | 16 | 89_ham_west_tomkins_reid |
| 90 | balotelli - mario - liverpool - italian - striker | 16 | 90_balotelli_mario_liverpool_italian |
| 91 | chinese - monk - buddhist - thailand - tourist | 15 | 91_chinese_monk_buddhist_thailand |
| 92 | snowden - gchq - intelligence - security - agency | 15 | 92_snowden_gchq_intelligence_security |
| 93 | pope - francis - naples - vatican - pontiff | 14 | 93_pope_francis_naples_vatican |
| 94 | starbucks - schultz - race - racial - campaign | 14 | 94_starbucks_schultz_race_racial |
| 95 | point - rebound - sweeney - playoff - scored | 14 | 95_point_rebound_sweeney_playoff |
| 96 | poldark - turner - demelza - aidan - drama | 13 | 96_poldark_turner_demelza_aidan |
| 97 | cuba - havana - cuban - us - castro | 13 | 97_cuba_havana_cuban_us |
| 98 | italy - conte - italian - eder - juventus | 13 | 98_italy_conte_italian_eder |
| 99 | richard - iii - leicester - king - iiis | 13 | 99_richard_iii_leicester_king |
| 100 | sena - hartman - child - shaday - sexual | 13 | 100_sena_hartman_child_shaday |
| 101 | gordon - bobbi - kristina - phil - dr | 12 | 101_gordon_bobbi_kristina_phil |
| 102 | jobs - lu - naomi - cook - business | 12 | 102_jobs_lu_naomi_cook |
| 103 | duckenfield - mr - gate - hillsborough - greaney | 11 | 103_duckenfield_mr_gate_hillsborough |
| 104 | huang - wang - chen - wife - china | 10 | 104_huang_wang_chen_wife |
| 105 | coin - coins - silver - cave - gold | 10 | 105_coin_coins_silver_cave |
| 106 | shark - whale - mola - crab - barbero | 10 | 106_shark_whale_mola_crab |
| 107 | gissendaner - execution - lethal - death - injection | 10 | 107_gissendaner_execution_lethal_death |
| 108 | book - handshake - word - author - app | 9 | 108_book_handshake_word_author |
| 109 | cosby - cosbys - thompson - welles - bill | 9 | 109_cosby_cosbys_thompson_welles |
| 110 | school - pupil - student - parent - computer | 9 | 110_school_pupil_student_parent |
| 111 | china - stopera - li - orange - chinese | 8 | 111_china_stopera_li_orange |
| 112 | tb - vaccine - disease - measles - meningitis | 8 | 112_tb_vaccine_disease_measles |
| 113 | neymar - brazil - willian - dunga - france | 8 | 113_neymar_brazil_willian_dunga |
| 114 | gomis - swansea - muamba - fabrice - bafetimbi | 7 | 114_gomis_swansea_muamba_fabrice |
| 115 | netflix - tv - content - screen - definition | 6 | 115_netflix_tv_content_screen |
| 116 | snake - eastern - redback - postlethwaite - woolworths | 6 | 116_snake_eastern_redback_postlethwaite |

</details>
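
In the table above, each Label appears to be the topic ID followed by the first four keywords, joined with underscores. The sketch below splits such a label back into its parts. The helper name `parse_label` is hypothetical, and the approach assumes no keyword itself contains an underscore, which holds for the rows listed here.

```python
def parse_label(label):
    """Split a label such as '3_flight_plane_pilot_aircraft' into (topic_id, keywords)."""
    topic_id, *keywords = label.split("_")
    # The outlier topic uses the ID -1, which int() parses directly.
    return int(topic_id), keywords


topic_id, keywords = parse_label("3_flight_plane_pilot_aircraft")
print(topic_id, keywords)
```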

## Training hyperparameters

* calculate_probabilities: True
* language: english
* low_memory: False
* min_topic_size: 10
* n_gram_range: (1, 1)
* nr_topics: None
* seed_topic_list: None
* top_n_words: 10
* verbose: False

## Framework versions

* Numpy: 1.23.5
* HDBSCAN: 0.8.33
* UMAP: 0.5.3
* Pandas: 1.5.3
* Scikit-Learn: 1.2.2
* Sentence-transformers: 2.2.2
* Transformers: 4.31.0
* Numba: 0.57.1
* Plotly: 5.15.0
* Python: 3.10.12
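
To compare a local environment against the framework versions listed above, something like the following could be used. It is a sketch rather than part of the model card: the helper names are hypothetical, and the keys in `PINNED` are the PyPI distribution names assumed to correspond to the listed libraries (for example `umap-learn` for UMAP).

```python
from importlib.metadata import PackageNotFoundError, version

# Versions copied from the "Framework versions" list, keyed by PyPI distribution name.
PINNED = {
    "numpy": "1.23.5",
    "hdbscan": "0.8.33",
    "umap-learn": "0.5.3",
    "pandas": "1.5.3",
    "scikit-learn": "1.2.2",
    "sentence-transformers": "2.2.2",
    "transformers": "4.31.0",
    "numba": "0.57.1",
    "plotly": "5.15.0",
}


def installed_version(name):
    """Return the installed version of a distribution, or None if it is missing."""
    try:
        return version(name)
    except PackageNotFoundError:
        return None


def find_mismatches(pinned=PINNED, lookup=installed_version):
    """Map each package whose installed version differs to (expected, found)."""
    mismatches = {}
    for name, expected in pinned.items():
        found = lookup(name)
        if found != expected:
            mismatches[name] = (expected, found)
    return mismatches


for name, (expected, found) in find_mismatches().items():
    print(f"{name}: expected {expected}, found {found}")
```

A mismatch does not necessarily mean the model will fail to load; the check only reports where the environment differs from the one recorded at training time.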
config.json
ADDED
@@ -0,0 +1,15 @@
{
  "calculate_probabilities": true,
  "language": "english",
  "low_memory": false,
  "min_topic_size": 10,
  "n_gram_range": [
    1,
    1
  ],
  "nr_topics": null,
  "seed_topic_list": null,
  "top_n_words": 10,
  "verbose": false,
  "embedding_model": "sentence-transformers/all-MiniLM-L6-v2"
}
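
The keys in config.json mirror the hyperparameters listed in the README, with one JSON-specific difference: `n_gram_range` is stored as a list rather than a tuple. The sketch below reads the configuration and converts it back. The helper name `bertopic_kwargs` is hypothetical, and the JSON is inlined here so the example runs on its own.

```python
import json

# Contents of config.json, inlined for a self-contained example.
CONFIG_JSON = """
{
  "calculate_probabilities": true,
  "language": "english",
  "low_memory": false,
  "min_topic_size": 10,
  "n_gram_range": [1, 1],
  "nr_topics": null,
  "seed_topic_list": null,
  "top_n_words": 10,
  "verbose": false,
  "embedding_model": "sentence-transformers/all-MiniLM-L6-v2"
}
"""


def bertopic_kwargs(config):
    """Return a copy of the config with JSON types mapped back to Python ones."""
    kwargs = dict(config)
    # JSON has no tuple type, so the (1, 1) range is serialized as [1, 1].
    kwargs["n_gram_range"] = tuple(kwargs["n_gram_range"])
    return kwargs


kwargs = bertopic_kwargs(json.loads(CONFIG_JSON))
print(kwargs["n_gram_range"], kwargs["embedding_model"])
```

Assuming BERTopic is installed, these keyword arguments could be passed as `BERTopic(**kwargs)` to configure a fresh, untrained model with the same settings; that would not reproduce the trained topics stored in this repository.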
topic_embeddings.safetensors
ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:1073d33bcc89cf9c86265355d9d499e319cef4dfdd8c992e8ad69ea46565f731
size 181336
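
The three lines above are a Git LFS pointer: the repository stores only the hash and byte size of topic_embeddings.safetensors, not the tensor data itself. As a sketch of how a downloaded copy could be checked against the pointer, with hypothetical helper names `parse_pointer` and `matches_pointer`:

```python
import hashlib

POINTER = """version https://git-lfs.github.com/spec/v1
oid sha256:1073d33bcc89cf9c86265355d9d499e319cef4dfdd8c992e8ad69ea46565f731
size 181336
"""


def parse_pointer(text):
    """Parse the 'key value' lines of a Git LFS pointer into a dict."""
    fields = dict(line.split(" ", 1) for line in text.strip().splitlines())
    algo, digest = fields["oid"].split(":", 1)
    return {
        "version": fields["version"],
        "algo": algo,
        "digest": digest,
        "size": int(fields["size"]),
    }


def matches_pointer(data, pointer):
    """Check that raw file bytes have the size and hash recorded in the pointer."""
    if len(data) != pointer["size"]:
        return False
    return hashlib.new(pointer["algo"], data).hexdigest() == pointer["digest"]


pointer = parse_pointer(POINTER)
print(pointer["algo"], pointer["size"])
```

With the real file on disk, `matches_pointer(open(path, "rb").read(), pointer)` would confirm that the download is complete and unmodified; `path` is whatever local location the file was saved to.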
topics.json
ADDED
The diff for this file is too large to render.