asoria HF staff
tags:
  - bertopic
library_name: bertopic
pipeline_tag: text-classification

bertopic_github_dataset_viewer_issues

This is a BERTopic model. BERTopic is a flexible, modular topic modeling framework that generates easily interpretable topics from large datasets.

Usage

To use this model, please install BERTopic:

pip install -U bertopic

You can use the model as follows:

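from bertopic import BERTopic

# Download the model from the Hugging Face Hub and load it
topic_model = BERTopic.load("asoria/bertopic_github_dataset_viewer_issues")

# Return a DataFrame with one row per topic
topic_model.get_topic_info()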
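
Beyond listing topics, the loaded model can also assign topics to new text. The sketch below shows one way to do that. It assumes the embedding model reference was saved with the model, so that `transform` can embed new documents. `label_documents`, `demo`, and the sample issue title are hypothetical names introduced here for illustration.

```python
def label_documents(topic_model, docs):
    """Return (document, topic ID, topic label) for each input document."""
    # transform() returns the predicted topic IDs and (optionally) probabilities
    topics, _ = topic_model.transform(docs)
    labels = topic_model.topic_labels_  # maps topic ID -> label string
    return [
        (doc, int(topic), labels.get(int(topic), "unknown"))
        for doc, topic in zip(docs, topics)
    ]


def demo():
    """Load the published model and label one made-up issue title."""
    # Imported lazily; requires `pip install -U bertopic` and network access
    from bertopic import BERTopic

    topic_model = BERTopic.load("asoria/bertopic_github_dataset_viewer_issues")
    docs = ["Parquet export fails for large splits"]  # illustrative input only
    for doc, topic_id, label in label_documents(topic_model, docs):
        print(topic_id, label, "<-", doc)
```

Calling `demo()` downloads the model and prints the assigned topic for the sample title. A topic ID of -1 means the document was treated as an outlier.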

Topic overview

  • Number of topics: 78
  • Number of training documents: 3066
Overview of all topics (topic -1 is BERTopic's outlier topic, which collects documents not assigned to any cluster):
| Topic ID | Topic Keywords | Topic Frequency | Label |
|----------|----------------|-----------------|-------|
| -1 | jobs - datasets - cache - fix - pandas | 11 | -1_jobs_datasets_cache_fix |
| 0 | issue - viewer - dataset - for - bigsciencep3 | 534 | 0_issue_viewer_dataset_for |
| 1 | parquet - files - metadata - parquetanddatasetinfo - configparquetandinfo | 144 | 1_parquet_files_metadata_parquetanddatasetinfo |
| 2 | vulnerability - cryptography - dependencies - 4106 - update | 132 | 2_vulnerability_cryptography_dependencies_4106 |
| 3 | docs - doc - page - add - md | 109 | 3_docs_doc_page_add |
| 4 | rows - firstrows - row - truncated - response | 90 | 4_rows_firstrows_row_truncated |
| 5 | duckdb - index - splitduckdbindex - fts - try | 78 | 5_duckdb_index_splitduckdbindex_fts |
| 6 | hub - hubcache - timeout - datasethubcache - tags | 75 | 6_hub_hubcache_timeout_datasethubcache |
| 7 | audio - opus - extension - torchaudio - torch | 59 | 7_audio_opus_extension_torchaudio |
| 8 | filter - endpoint - isvalid - column - parameters | 54 | 8_filter_endpoint_isvalid_column |
| 9 | datasets - update - upgrade - dependency - to | 54 | 9_datasets_update_upgrade_dependency |
| 10 | docker - images - build - image - compose | 53 | 10_docker_images_build_image |
| 11 | cache - refresh - entries - entry - warm | 51 | 11_cache_refresh_entries_entry |
| 12 | mongo - mongodb - indexes - atlas - index | 48 | 12_mongo_mongodb_indexes_atlas |
| 13 | image - images - modality - support - pdf2image | 47 | 13_image_images_modality_support |
| 14 | unblock - block - blocked - blocklist - datasets | 46 | 14_unblock_block_blocked_blocklist |
| 15 | error - expected - xerrorcode - messages - catch | 44 | 15_error_expected_xerrorcode_messages |
| 16 | backfill - cron - job - time - move | 44 | 16_backfill_cron_job_time |
| 17 | jobs - waiting - job - finishedat - started | 44 | 17_jobs_waiting_job_finishedat |
| 18 | env - config - configs - vars - default | 41 | 18_env_config_configs_vars |
| 19 | gitpython - 3137 - 3141 - github - builddepsdev | 41 | 19_gitpython_3137_3141_github |
| 20 | assets - s3 - cachedassets - cached - fsspec | 40 | 20_assets_s3_cachedassets_cached |
| 21 | splitnamesfromstreaming - split - streaming - rename - names | 39 | 21_splitnamesfromstreaming_split_streaming_rename |
| 22 | statistics - stats - descriptive - splitdescriptivestatistics - class | 38 | 22_statistics_stats_descriptive_splitdescriptivestatistics |
| 23 | private - gated - datasets - public - gatedauto | 35 | 23_private_gated_datasets_public |
| 24 | metrics - healthcheck - port - adminmetrics - admin | 33 | 24_metrics_healthcheck_port_adminmetrics |
| 25 | steps - processing - step - triggers - graph | 32 | 25_steps_processing_step_triggers |
| 26 | ci - codecov - pr - fork - invalid | 31 | 26_ci_codecov_pr_fork |
| 27 | splits - split - list - configs - returned | 31 | 27_splits_split_list_configs |
| 28 | openapi - openapijson - spec - publish - spectral | 31 | 28_openapi_openapijson_spec_publish |
| 29 | queue - incremental - based - field - jobs | 31 | 29_queue_incremental_based_field |
| 30 | error - datasetwithscriptnotsupportederror - exist - no - datasetgenerationerror | 31 | 30_error_datasetwithscriptnotsupportederror_exist_no |
| 31 | ram - 5gb - heavy - reduce - overcommitment | 31 | 31_ram_5gb_heavy_reduce |
| 32 | workers - number - reduce - increase - heavy | 30 | 32_workers_number_reduce_increase |
| 33 | admin - ui - app - difficulty - prefix | 30 | 33_admin_ui_app_difficulty |
| 34 | chart - fixchart - helm - alb - featchart | 28 | 34_chart_fixchart_helm_alb |
| 35 | aiohttp - 386 - bump - 392 - 391 | 27 | 35_aiohttp_386_bump_392 |
| 36 | e2e - tests - test - ci - testmetrics | 27 | 36_e2e_tests_test_ci |
| 37 | huggingfacehub - upgrade - 0151 - version - branch | 27 | 37_huggingfacehub_upgrade_0151_version |
| 38 | test - tests - unit - pytestmemray - fixtures | 26 | 38_test_tests_unit_pytestmemray |
| 39 | webhook - webhooks - payload - visibility - hub | 26 | 39_webhook_webhooks_payload_visibility |
| 40 | migration - migrations - database - scripts - databases | 26 | 40_migration_migrations_database_scripts |
| 41 | refactor - dead - code - remove - abstractions | 25 | 41_refactor_dead_code_remove |
| 42 | retry - retryable - codes - every - createcommiterror | 25 | 42_retry_retryable_codes_every |
| 43 | log - logs - debug - level - crashes | 25 | 43_log_logs_debug_level |
| 44 | croissant - jsonld - fields - either - recordset | 25 | 44_croissant_jsonld_fields_either |
| 45 | pods - pod - number - scale - reverseproxy | 24 | 45_pods_pod_number_scale |
| 46 | scan - urls - spawning - presidio - optinouturls | 24 | 46_scan_urls_spawning_presidio |
| 47 | resources - feat - reduce - increase - production | 22 | 47_resources_feat_reduce_increase |
| 48 | download - manual - require - enum - extracted | 21 | 48_download_manual_require_enum |
| 49 | comment - issues - close - fix - tag | 20 | 49_comment_issues_close_fix |
| 50 | cache - entries - clean - hf - blocked | 19 | 50_cache_entries_clean_hf |
| 51 | worker - generic - workerjobtypesblocked - treccartools - dependencies | 19 | 51_worker_generic_workerjobtypesblocked_treccartools |
| 52 | datasetviewer - rename - datasetsserver - domain - server | 18 | 52_datasetviewer_rename_datasetsserver_domain |
| 53 | across - group - pip - directories - bump | 18 | 53_across_group_pip_directories |
| 54 | runner - runners - validation - job - parent | 18 | 54_runner_runners_validation_job |
| 55 | upgrade - datasets - feat - 221 - 1162dev0 | 18 | 55_upgrade_datasets_feat_221 |
| 56 | jwt - array - authorization - cookies - bypass | 18 | 56_jwt_array_authorization_cookies |
| 57 | allow - script - scriptbased - scripts - redpajamadata1t | 17 | 57_allow_script_scriptbased_scripts |
| 58 | unique - metrics - metric - cache - cron | 16 | 58_unique_metrics_metric_cache |
| 59 | aiohttp - libslibcommon - libslibapi - 386 - 385 | 16 | 59_aiohttp_libslibcommon_libslibapi_386 |
| 60 | pillow - 1001 - 1020 - bump - from | 16 | 60_pillow_1001_1020_bump |
| 61 | storage - disk - storageclient - storageadmin - client | 15 | 61_storage_disk_storageclient_storageadmin |
| 62 | resources - increase - 108010 - reduce - 2468 | 15 | 62_resources_increase_108010_reduce |
| 63 | poetry - dependabot - align - version - 20 | 14 | 63_poetry_dependabot_align_version |
| 64 | upgrade - datasets - 188 - pufanyimimicit - meaning | 14 | 64_upgrade_datasets_188_pufanyimimicit |
| 65 | auth - authentication - asynchronous - authcheck - 307 | 14 | 65_auth_authentication_asynchronous_authcheck |
| 66 | lock - locks - finishing - release - ttl | 14 | 66_lock_locks_finishing_release |
| 67 | nginx - proxy - reverse - reverseproxy - 1253 | 14 | 67_nginx_proxy_reverse_reverseproxy |
| 68 | orjson - 3915 - 390 - bump - from | 13 | 68_orjson_3915_390_bump |
| 69 | gradio - 3340 - 4110 - frontadminui - upgrade | 13 | 69_gradio_3340_4110_frontadminui |
| 70 | starlette - 0280 - 0362 - bump - 0231 | 13 | 70_starlette_0280_0362_bump |
| 71 | secrets - fixs3 - correct - secret - name | 13 | 71_secrets_fixs3_correct_secret |
| 72 | search - elastic - functionality - times - currently | 13 | 72_search_elastic_functionality_times |
| 73 | token - hftoken - app - secret - hf | 12 | 73_token_hftoken_app_secret |
| 74 | efs - nfs - mount - parquetmetadata - storage | 12 | 74_efs_nfs_mount_parquetmetadata |
| 75 | ruff - vscode - 045 - settings - ruffcache | 12 | 75_ruff_vscode_045_settings |
| 76 | kubernetes - kube - infrastructure - pdb - disruption | 12 | 76_kubernetes_kube_infrastructure_pdb |
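
Each label in the topic overview is the topic ID followed by the topic's first four keywords, joined with underscores. The sketch below splits a label back into its parts and reads keywords from a loaded model. `parse_label` and `top_keywords` are hypothetical helpers, not part of BERTopic. The code assumes that keywords contain no underscores (true for the labels listed here) and that `get_topic` returns a list of (word, score) pairs, or `False` for an unknown topic ID.

```python
def parse_label(label):
    """Split a label such as '5_duckdb_index_splitduckdbindex_fts' into (ID, keywords)."""
    # Split only on the first underscore so a negative ID like '-1' stays intact
    topic_id, _, rest = label.partition("_")
    return int(topic_id), rest.split("_")


def top_keywords(topic_model, topic_id, n=5):
    """Return up to n keywords for a topic, or an empty list if the topic is unknown."""
    # Assumption: get_topic() yields (word, score) pairs, or False when the ID is missing
    words = topic_model.get_topic(topic_id)
    if not words:
        return []
    return [word for word, _score in words[:n]]
```

For example, `top_keywords(topic_model, 1)` on the loaded model would be expected to return the keywords shown for topic 1, starting with `parquet`.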

Training hyperparameters

  • calculate_probabilities: False
  • language: english
  • low_memory: False
  • min_topic_size: 10
  • n_gram_range: (1, 1)
  • nr_topics: None
  • seed_topic_list: None
  • top_n_words: 10
  • verbose: False
  • zeroshot_min_similarity: 0.7
  • zeroshot_topic_list: None
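
The settings above correspond to keyword arguments of the `BERTopic` constructor. As a rough sketch, they could be collected and passed as shown below. This card does not list the embedding model or the UMAP and HDBSCAN settings, so the sketch assumes library defaults for those and is not expected to reproduce the published topics exactly. `HYPERPARAMETERS` and `build_model` are names introduced for illustration.

```python
# Values copied from the "Training hyperparameters" list
HYPERPARAMETERS = {
    "calculate_probabilities": False,
    "language": "english",
    "low_memory": False,
    "min_topic_size": 10,
    "n_gram_range": (1, 1),
    "nr_topics": None,
    "seed_topic_list": None,
    "top_n_words": 10,
    "verbose": False,
    "zeroshot_min_similarity": 0.7,
    "zeroshot_topic_list": None,
}


def build_model():
    """Create an untrained BERTopic instance with the listed settings."""
    # Imported lazily; requires `pip install -U bertopic`
    from bertopic import BERTopic

    # Assumption: embedding, UMAP, and HDBSCAN sub-models are left at their defaults
    return BERTopic(**HYPERPARAMETERS)
```

Training would then be `topics, probs = build_model().fit_transform(docs)`, where `docs` is a list of strings supplied by the user.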

Framework versions

  • Numpy: 1.26.4
  • HDBSCAN: 0.8.38.post1
  • UMAP: 0.5.6
  • Pandas: 2.1.4
  • Scikit-Learn: 1.5.2
  • Sentence-transformers: 3.1.1
  • Transformers: 4.44.2
  • Numba: 0.60.0
  • Plotly: 5.24.1
  • Python: 3.10.12
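
To compare a local environment against the versions listed above, something like the following could be used. It is only a sketch: the PyPI distribution names in `PINNED` (for example `umap-learn` for UMAP) are an assumed mapping from the names in the list, and `find_mismatches` is a hypothetical helper.

```python
from importlib.metadata import PackageNotFoundError, version

# Versions copied from the "Framework versions" list; keys are assumed PyPI names
PINNED = {
    "numpy": "1.26.4",
    "hdbscan": "0.8.38.post1",
    "umap-learn": "0.5.6",
    "pandas": "2.1.4",
    "scikit-learn": "1.5.2",
    "sentence-transformers": "3.1.1",
    "transformers": "4.44.2",
    "numba": "0.60.0",
    "plotly": "5.24.1",
}


def find_mismatches(pinned, lookup=version):
    """Return {package: (expected, installed)} for packages that differ or are missing."""
    mismatches = {}
    for name, expected in pinned.items():
        try:
            installed = lookup(name)
        except PackageNotFoundError:
            installed = None  # package is not installed at all
        if installed != expected:
            mismatches[name] = (expected, installed)
    return mismatches
```

Calling `find_mismatches(PINNED)` returns an empty dictionary when every listed package is installed at the listed version. A mismatch does not necessarily mean the model will fail to load.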