tags:
  - bertopic
library_name: bertopic
pipeline_tag: text-classification

respapers_topics

This is a BERTopic model. BERTopic is a flexible and modular topic modeling framework that allows for the generation of easily interpretable topics from large datasets.

This pre-trained model was built to demonstrate the use of a KeyBERT-inspired representation model within BERTopic.

The model was trained on ~30,000 research paper abstracts using the KeyBERTInspired representation method (from bertopic.representation). The dataset was downloaded from Kaggle, and its two subsets (train and test) were merged into a single dataset.

To access the complete code, you can visit this tutorial on my GitHub page: ResPapers

Usage

To use this model, please install BERTopic:

pip install -U bertopic

You can use the model as follows:

from bertopic import BERTopic
topic_model = BERTopic.load("CCatalao/respapers_topics")

topic_model.get_topic_info()
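
Beyond inspecting the topic table, a loaded model can assign topics to unseen abstracts via `transform`. The sketch below is a minimal illustration and not part of the original tutorial: `assign_topics` and `demo` are hypothetical helper names, the example abstract is made up, and depending on how the model was serialized, `BERTopic.load` may need an explicit `embedding_model=` argument before `transform` works.

```python
from typing import List, Sequence, Tuple


def assign_topics(topic_model, docs: Sequence[str]) -> List[Tuple[str, int]]:
    """Pair each document with the topic ID predicted by a fitted model.

    `topic_model` is any object exposing BERTopic's `transform(docs)`,
    which returns a `(topics, probabilities)` pair.
    """
    docs = list(docs)
    topics, _probabilities = topic_model.transform(docs)
    return [(doc, int(topic)) for doc, topic in zip(docs, topics)]


def demo() -> None:
    """Run against the published model (needs bertopic and network access)."""
    from bertopic import BERTopic  # imported lazily so the helper stays importable

    topic_model = BERTopic.load("CCatalao/respapers_topics")
    # Made-up abstract, used only for illustration.
    abstracts = ["We study spin transport in a frustrated quantum magnet."]
    for doc, topic in assign_topics(topic_model, abstracts):
        # Print the topic ID, its first three (word, score) pairs, and the text.
        print(topic, topic_model.get_topic(topic)[:3], doc)
```

Call `demo()` to try it against the real model. A topic ID of -1 means the document was treated as an outlier.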

To view the KeyBERT-inspired topic representation alongside the default one, use the following:

>>> topic_model.get_topic(0, full=True)
{'Main': [['spin', 0.01852648864225281],
  ['magnetic', 0.015019436257929909],
  ['phase', 0.013081733986038124],
  ['quantum', 0.012942253723133639],
  ['temperature', 0.012591407440537158],
  ['states', 0.011025582290837643],
  ['field', 0.010954775154251296],
  ['electron', 0.010168708734803916],
  ['transition', 0.009728560280580357],
  ['energy', 0.00937042795113575]],
 'KeyBERTInspired': [['quantum', 0.4072583317756653],
  ['phase transition', 0.35542067885398865],
  ['lattice', 0.34462833404541016],
  ['spin', 0.3268473744392395],
  ['magnetic', 0.3024371564388275],
  ['magnetization', 0.2868726849555969],
  ['phases', 0.27178525924682617],
  ['fermi', 0.26290175318717957],
  ['electron', 0.25709500908851624],
  ['phase', 0.23375216126441956]]}
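
To see how the two representations differ, the keyword lists printed above can be compared directly. This is a minimal sketch using those lists with the scores omitted; `compare_representations` is a hypothetical helper, not a BERTopic function.

```python
from typing import Dict, List, Sequence


def compare_representations(main: Sequence[str], other: Sequence[str]) -> Dict[str, List[str]]:
    """Split keywords into those shared by both lists and those unique to each."""
    main_set, other_set = set(main), set(other)
    return {
        "shared": [w for w in main if w in other_set],
        "only_main": [w for w in main if w not in other_set],
        "only_other": [w for w in other if w not in main_set],
    }


# Keywords for topic 0, copied from the output above.
main_words = ["spin", "magnetic", "phase", "quantum", "temperature",
              "states", "field", "electron", "transition", "energy"]
keybert_words = ["quantum", "phase transition", "lattice", "spin", "magnetic",
                 "magnetization", "phases", "fermi", "electron", "phase"]

diff = compare_representations(main_words, keybert_words)
print(diff["shared"])      # ['spin', 'magnetic', 'phase', 'quantum', 'electron']
print(diff["only_other"])  # ['phase transition', 'lattice', 'magnetization', 'phases', 'fermi']
```

For this topic, the KeyBERT-inspired list replaces generic words such as "temperature", "states", and "field" with more specific terms such as "phase transition" and "magnetization".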

Topic overview

  • Number of topics: 112
  • Number of training documents: 29961
Overview of all topics:
| Topic ID | Topic Keywords | Topic Frequency | Label |
| --- | --- | --- | --- |
| -1 | data - model - paper - time - based | 20 | -1_data_model_paper_time |
| 0 | spin - magnetic - phase - quantum - temperature | 12937 | 0_spin_magnetic_phase_quantum |
| 1 | mass - star - stars - 10 - stellar | 3048 | 1_mass_star_stars_10 |
| 2 | reinforcement - reinforcement learning - learning - policy - robot | 2564 | 2_reinforcement_reinforcement learning_learning_policy |
| 3 | logic - semantics - programs - automata - languages | 556 | 3_logic_semantics_programs_automata |
| 4 | neural - networks - neural networks - deep - training | 478 | 4_neural_networks_neural networks_deep |
| 5 | networks - community - network - social - nodes | 405 | 5_networks_community_network_social |
| 6 | word - translation - language - words - sentence | 340 | 6_word_translation_language_words |
| 7 | object - 3d - camera - pose - localization | 298 | 7_object_3d_camera_pose |
| 8 | classification - label - classifier - learning - classifiers | 294 | 8_classification_label_classifier_learning |
| 9 | convex - gradient - stochastic - convergence - optimization | 287 | 9_convex_gradient_stochastic_convergence |
| 10 | graphs - graph - vertices - vertex - edge | 284 | 10_graphs_graph_vertices_vertex |
| 11 | brain - neurons - connectivity - neural - synaptic | 273 | 11_brain_neurons_connectivity_neural |
| 12 | robots - robot - planning - control - motion | 255 | 12_robots_robot_planning_control |
| 13 | prime - numbers - polynomials - integers - zeta | 245 | 13_prime_numbers_polynomials_integers |
| 14 | tensor - rank - matrix - low rank - pca | 226 | 14_tensor_rank_matrix_low rank |
| 15 | power - energy - grid - renewable - load | 222 | 15_power_energy_grid_renewable |
| 16 | channel - power - mimo - interference - wireless | 219 | 16_channel_power_mimo_interference |
| 17 | adversarial - attacks - adversarial examples - attack - examples | 208 | 17_adversarial_attacks_adversarial examples_attack |
| 18 | gan - gans - generative - generative adversarial - adversarial | 200 | 18_gan_gans_generative_generative adversarial |
| 19 | media - social - twitter - users - social media | 196 | 19_media_social_twitter_users |
| 20 | posterior - monte - monte carlo - carlo - bayesian | 190 | 20_posterior_monte_monte carlo_carlo |
| 21 | estimator - estimators - regression - quantile - estimation | 189 | 21_estimator_estimators_regression_quantile |
| 22 | software - code - developers - projects - development | 178 | 22_software_code_developers_projects |
| 23 | regret - bandit - armed - arm - multi armed | 177 | 23_regret_bandit_armed_arm |
| 24 | omega - mathbb - solutions - boundary - equation | 177 | 24_omega_mathbb_solutions_boundary |
| 25 | numerical - scheme - mesh - method - order | 175 | 25_numerical_scheme_mesh_method |
| 26 | causal - treatment - outcome - effects - causal inference | 174 | 26_causal_treatment_outcome_effects |
| 27 | curvature - mean curvature - riemannian - ricci - metric | 164 | 27_curvature_mean curvature_riemannian_ricci |
| 28 | control - distributed - systems - consensus - agents | 156 | 28_control_distributed_systems_consensus |
| 29 | groups - group - subgroup - subgroups - finite | 153 | 29_groups_group_subgroup_subgroups |
| 30 | segmentation - images - image - convolutional - medical | 148 | 30_segmentation_images_image_convolutional |
| 31 | market - portfolio - asset - price - volatility | 144 | 31_market_portfolio_asset_price |
| 32 | recommendation - user - item - items - recommender | 138 | 32_recommendation_user_item_items |
| 33 | algebra - algebras - lie - mathfrak - modules | 131 | 33_algebra_algebras_lie_mathfrak |
| 34 | quantum - classical - circuits - annealing - circuit | 121 | 34_quantum_classical_circuits_annealing |
| 35 | moduli - varieties - projective - curves - bundles | 119 | 35_moduli_varieties_projective_curves |
| 36 | graph - embedding - node - graphs - network | 117 | 36_graph_embedding_node_graphs |
| 37 | codes - decoding - channel - code - capacity | 113 | 37_codes_decoding_channel_code |
| 38 | sparse - signal - recovery - sensing - measurements | 107 | 38_sparse_signal_recovery_sensing |
| 39 | knot - knots - homology - invariants - link | 103 | 39_knot_knots_homology_invariants |
| 40 | spaces - hardy - operators - mathbb - boundedness | 95 | 40_spaces_hardy_operators_mathbb |
| 41 | blockchain - security - privacy - authentication - encryption | 90 | 41_blockchain_security_privacy_authentication |
| 42 | turbulence - turbulent - flow - flows - reynolds | 89 | 42_turbulence_turbulent_flow_flows |
| 43 | privacy - differential privacy - private - differential - data | 86 | 43_privacy_differential privacy_private_differential |
| 44 | epidemic - disease - infection - infected - infectious | 83 | 44_epidemic_disease_infection_infected |
| 45 | citation - scientific - research - journal - papers | 82 | 45_citation_scientific_research_journal |
| 46 | surface - droplet - fluid - liquid - droplets | 81 | 46_surface_droplet_fluid_liquid |
| 47 | chemical - molecules - molecular - protein - learning | 79 | 47_chemical_molecules_molecular_protein |
| 48 | kähler - manifolds - manifold - complex - metrics | 77 | 48_kähler_manifolds_manifold_complex |
| 49 | games - game - players - nash - player | 74 | 49_games_game_players_nash |
| 50 | patients - patient - clinical - ehr - care | 73 | 50_patients_patient_clinical_ehr |
| 51 | music - musical - audio - chord - note | 70 | 51_music_musical_audio_chord |
| 52 | visual - shot - image - cnns - learning | 70 | 52_visual_shot_image_cnns |
| 53 | speaker - speech - end - recognition - speech recognition | 70 | 53_speaker_speech_end_recognition |
| 54 | cell - cells - tissue - active - tumor | 69 | 54_cell_cells_tissue_active |
| 55 | eeg - brain - signals - sleep - subjects | 69 | 55_eeg_brain_signals_sleep |
| 56 | fairness - fair - discrimination - decision - algorithmic | 67 | 56_fairness_fair_discrimination_decision |
| 57 | clustering - clusters - data - based clustering - cluster | 66 | 57_clustering_clusters_data_based clustering |
| 58 | relativity - black - solutions - einstein - spacetime | 65 | 58_relativity_black_solutions_einstein |
| 59 | mathbb - curves - elliptic - conjecture - fields | 62 | 59_mathbb_curves_elliptic_conjecture |
| 60 | stokes - navier - navier stokes - equations - stokes equations | 61 | 60_stokes_navier_navier stokes_equations |
| 61 | species - population - dispersal - ecosystem - populations | 60 | 61_species_population_dispersal_ecosystem |
| 62 | reconstruction - ct - artifacts - image - images | 58 | 62_reconstruction_ct_artifacts_image |
| 63 | algebra - algebras - mathcal - alpha - crossed | 58 | 63_algebra_algebras_mathcal_alpha |
| 64 | tiling - polytopes - set - polygon - polytope | 58 | 64_tiling_polytopes_set_polygon |
| 65 | mobile - video - network - latency - computing | 57 | 65_mobile_video_network_latency |
| 66 | latent - variational - vae - generative - inference | 55 | 66_latent_variational_vae_generative |
| 67 | players - game - team - player - teams | 54 | 67_players_game_team_player |
| 68 | genes - gene - cancer - expression - sequencing | 53 | 68_genes_gene_cancer_expression |
| 69 | forcing - kappa - definable - cardinal - zfc | 51 | 69_forcing_kappa_definable_cardinal |
| 70 | dna - protein - folding - proteins - molecule | 50 | 70_dna_protein_folding_proteins |
| 71 | spaces - space - metric - metric spaces - topology | 49 | 71_spaces_space_metric_metric spaces |
| 72 | speech - separation - source separation - enhancement - speaker | 49 | 72_speech_separation_source separation_enhancement |
| 73 | imaging - resolution - light - diffraction - phase | 47 | 73_imaging_resolution_light_diffraction |
| 74 | traffic - traffic flow - prediction - temporal - transportation | 46 | 74_traffic_traffic flow_prediction_temporal |
| 75 | climate - precipitation - sea - flood - extreme | 45 | 75_climate_precipitation_sea_flood |
| 76 | audio - sound - event detection - event - bird | 43 | 76_audio_sound_event detection_event |
| 77 | memory - storage - cache - performance - write | 40 | 77_memory_storage_cache_performance |
| 78 | wishart - matrices - eigenvalue - free - smallest | 39 | 78_wishart_matrices_eigenvalue_free |
| 79 | domain - domain adaptation - adaptation - transfer - target | 39 | 79_domain_domain adaptation_adaptation_transfer |
| 80 | glass - glasses - glassy - amorphous - liquids | 39 | 80_glass_glasses_glassy_amorphous |
| 81 | gpu - gpus - nvidia - code - performance | 38 | 81_gpu_gpus_nvidia_code |
| 82 | face - face recognition - facial - recognition - faces | 38 | 82_face_face recognition_facial_recognition |
| 83 | stock - market - price - financial - stocks | 37 | 83_stock_market_price_financial |
| 84 | reaction - flux - metabolic - growth - biochemical | 34 | 84_reaction_flux_metabolic_growth |
| 85 | fleet - routing - vehicles - ride - traffic | 34 | 85_fleet_routing_vehicles_ride |
| 86 | cooperation - evolutionary - game - social - payoff | 33 | 86_cooperation_evolutionary_game_social |
| 87 | students - courses - student - course - education | 33 | 87_students_courses_student_course |
| 88 | action - temporal - video - recognition - videos | 33 | 88_action_temporal_video_recognition |
| 89 | irreducible - group - mathcal - representations - let | 32 | 89_irreducible_group_mathcal_representations |
| 90 | phylogenetic - tree - trees - species - gene | 32 | 90_phylogenetic_tree_trees_species |
| 91 | processes - drift - asymptotic - estimators - stationary | 31 | 91_processes_drift_asymptotic_estimators |
| 92 | wave - waves - water - free surface - shallow water | 30 | 92_wave_waves_water_free surface |
| 93 | distributed - gradient - byzantine - communication - sgd | 30 | 93_distributed_gradient_byzantine_communication |
| 94 | voters - voting - election - voter - winner | 30 | 94_voters_voting_election_voter |
| 95 | gaussian process - gaussian - gp - process - gaussian processes | 30 | 95_gaussian process_gaussian_gp_process |
| 96 | mathfrak - gorenstein - ring - rings - modules | 29 | 96_mathfrak_gorenstein_ring_rings |
| 97 | motivic - gw - cohomology - dm - category | 29 | 97_motivic_gw_cohomology_dm |
| 98 | recurrent - lstm - rnn - recurrent neural - memory | 28 | 98_recurrent_lstm_rnn_recurrent neural |
| 99 | semigroup - semigroups - xy - ordered - pt | 27 | 99_semigroup_semigroups_xy_ordered |
| 100 | robot - robots - human - human robot - children | 25 | 100_robot_robots_human_human robot |
| 101 | categories - category - homotopy - functor - grothendieck | 25 | 101_categories_category_homotopy_functor |
| 102 | queue - queues - server - scheduling - customer | 24 | 102_queue_queues_server_scheduling |
| 103 | topic - topics - topic modeling - lda - documents | 24 | 103_topic_topics_topic modeling_lda |
| 104 | synchronization - oscillators - chimera - coupling - coupled | 24 | 104_synchronization_oscillators_chimera_coupling |
| 105 | stochastic - existence - equation - solutions - uniqueness | 24 | 105_stochastic_existence_equation_solutions |
| 106 | fractional - derivative - derivatives - integral - psi | 23 | 106_fractional_derivative_derivatives_integral |
| 107 | lasso - regression - estimator - estimators - bootstrap | 23 | 107_lasso_regression_estimator_estimators |
| 108 | soil - moisture - machine - resolution - seismic | 22 | 108_soil_moisture_machine_resolution |
| 109 | bayesian optimization - optimization - acquisition - bayesian - bo | 21 | 109_bayesian optimization_optimization_acquisition_bayesian |
| 110 | urban - city - mobility - cities - social | 21 | 110_urban_city_mobility_cities |
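
Each label in the overview is the topic ID followed by the topic's first four keywords, joined with underscores. The sketch below splits a label back into its parts; `parse_label` is a hypothetical helper and assumes the keywords themselves contain no underscores, which holds for the labels listed here.

```python
from typing import List, Tuple


def parse_label(label: str) -> Tuple[int, List[str]]:
    """Split a label such as '0_spin_magnetic_phase_quantum' into (topic_id, keywords)."""
    topic_id, *keywords = label.split("_")
    return int(topic_id), keywords


print(parse_label("2_reinforcement_reinforcement learning_learning_policy"))
# (2, ['reinforcement', 'reinforcement learning', 'learning', 'policy'])
print(parse_label("-1_data_model_paper_time"))
# (-1, ['data', 'model', 'paper', 'time'])
```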

Training Procedure

The model was trained as follows:

from bertopic import BERTopic
from sklearn.feature_extraction.text import CountVectorizer
from bertopic.representation import KeyBERTInspired

from sentence_transformers import SentenceTransformer
from umap import UMAP
from hdbscan import HDBSCAN

# Prepare sub-models
embedding_model = SentenceTransformer('sentence-transformers/all-mpnet-base-v2')
umap_model = UMAP(n_components=5, n_neighbors=50, random_state=42, metric="cosine", verbose=True)
hdbscan_model = HDBSCAN(min_samples=20, gen_min_span_tree=True, prediction_data=False, min_cluster_size=20)
vectorizer_model = CountVectorizer(stop_words="english", ngram_range=(1, 3), min_df=5)

# Representation models
representation_models = {"KeyBERTInspired": KeyBERTInspired()}

# Fit BERTopic
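topic_model = BERTopic(
    # embedding_model (defined above) is not passed in this snippet;
    # add embedding_model=embedding_model to use all-mpnet-base-v2.
    umap_model=umap_model,
    hdbscan_model=hdbscan_model,
    vectorizer_model=vectorizer_model,
    representation_model=representation_models,
    min_topic_size=10,    # not used when a custom hdbscan_model is passed
    n_gram_range=(1, 1),  # not used when a custom vectorizer_model is passed
    nr_topics=None,
    seed_topic_list=None,
    top_n_words=10,
    calculate_probabilities=False,
    language=None,
    verbose=True,
).fit(docs)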
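
The snippet above assumes `docs` is a list of abstract strings. The sketch below shows one way to build it by merging the train and test subsets, using only the standard library. The file names `train.csv` and `test.csv` and the column name `ABSTRACT` are assumptions about the downloaded dataset, not taken from the original tutorial.

```python
import csv
from pathlib import Path
from typing import Iterable, List, Union


def load_abstracts(paths: Iterable[Union[str, Path]], column: str = "ABSTRACT") -> List[str]:
    """Read `column` from each CSV file and return all non-empty values as one list."""
    docs: List[str] = []
    for path in paths:
        with open(path, newline="", encoding="utf-8") as handle:
            for row in csv.DictReader(handle):
                text = (row.get(column) or "").strip()
                if text:  # skip rows with a missing or blank abstract
                    docs.append(text)
    return docs


# Hypothetical usage; adjust the paths and column name to the actual files:
# docs = load_abstracts(["train.csv", "test.csv"])
```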

Training hyperparameters

  • calculate_probabilities: False
  • language: None
  • low_memory: False
  • min_topic_size: 10
  • n_gram_range: (1, 1)
  • nr_topics: None
  • seed_topic_list: None
  • top_n_words: 10
  • verbose: True

Framework versions

  • Numpy: 1.22.4
  • HDBSCAN: 0.8.33
  • UMAP: 0.5.3
  • Pandas: 1.5.3
  • Scikit-Learn: 1.2.2
  • Sentence-transformers: 2.2.2
  • Transformers: 4.29.2
  • Numba: 0.56.4
  • Plotly: 5.13.1
  • Python: 3.10.11
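
To check a local environment against the versions listed above, a small script along these lines may help. It is a sketch, not part of the original tutorial: `find_mismatches` is a hypothetical helper, and the keys are the PyPI distribution names assumed to correspond to the listed packages (for example `umap-learn` for UMAP).

```python
from importlib import metadata
from typing import Callable, Dict

# Versions listed above, keyed by PyPI distribution name.
EXPECTED_VERSIONS: Dict[str, str] = {
    "numpy": "1.22.4",
    "hdbscan": "0.8.33",
    "umap-learn": "0.5.3",
    "pandas": "1.5.3",
    "scikit-learn": "1.2.2",
    "sentence-transformers": "2.2.2",
    "transformers": "4.29.2",
    "numba": "0.56.4",
    "plotly": "5.13.1",
}


def find_mismatches(
    expected: Dict[str, str],
    get_version: Callable[[str], str] = metadata.version,
) -> Dict[str, str]:
    """Map each package whose installed version differs to what was found."""
    mismatches: Dict[str, str] = {}
    for name, wanted in expected.items():
        try:
            installed = get_version(name)
        except metadata.PackageNotFoundError:
            installed = "not installed"
        if installed != wanted:
            mismatches[name] = installed
    return mismatches


for name, installed in find_mismatches(EXPECTED_VERSIONS).items():
    print(f"{name}: expected {EXPECTED_VERSIONS[name]}, found {installed}")
```

A version mismatch does not necessarily break loading; the report is only a starting point when results differ from those shown here.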