---
license: apache-2.0
base_model: bert-base-multilingual-cased
model-index:
- name: >-
bert-base-multilingual-cased-finetuned-openalex-topic-classification-title-abstract
results: []
pipeline_tag: text-classification
widget:
- text: >-
<TITLE> Cleavage of Structural Proteins during the Assembly of the Head of
Bacteriophage T4
- text: >-
<TITLE> From Louvain to Leiden: guaranteeing well-connected communities
<ABSTRACT> Community detection is often used to understand the structure
of large and complex networks. One of the most popular algorithms for
uncovering community structure is the so-called Louvain algorithm. We show
that this algorithm has a major defect that largely went unnoticed until
now: the Louvain algorithm may yield arbitrarily badly connected
communities. In the worst case, communities may even be disconnected,
especially when running the algorithm iteratively. In our experimental
analysis, we observe that up to 25% of the communities are badly connected
and up to 16% are disconnected. To address this problem, we introduce the
Leiden algorithm. We prove that the Leiden algorithm yields communities
that are guaranteed to be connected. In addition, we prove that, when the
Leiden algorithm is applied iteratively, it converges to a partition in
which all subsets of all communities are locally optimally assigned.
Furthermore, by relying on a fast local move approach, the Leiden
algorithm runs faster than the Louvain algorithm. We demonstrate the
performance of the Leiden algorithm for several benchmark and real-world
networks. We find that the Leiden algorithm is faster than the Louvain
algorithm and uncovers better partitions, in addition to providing
explicit guarantees.
---
# bert-base-multilingual-cased-finetuned-openalex-topic-classification-title-abstract
This model is a fine-tuned version of [bert-base-multilingual-cased](https://huggingface.co/bert-base-multilingual-cased) on a labeled dataset provided by CWTS: [CWTS Labeled Data].

This is NOT the full model used to tag OpenAlex works with a topic. For that, check out the following GitHub repo: OpenAlex Topic Classification. That repository also contains information about text preprocessing, modeling, testing, and deployment.
## Model description
The model was trained using the following input format, so it is recommended that inference inputs follow it as well:

```
"<TITLE> {insert-processed-title-here}\n<ABSTRACT> {insert-processed-abstract-here}"
```
The quickest way to use this model in Python is with the following code (assuming you have the transformers library installed):
```python
from transformers import pipeline

title = "{insert-processed-title-here}"
abstract = "{insert-processed-abstract-here}"

# top_k=10 returns the ten highest-scoring topics rather than only the top one
classifier = pipeline(
    model="OpenAlex/bert-base-multilingual-cased-finetuned-openalex-topic-classification-title-abstract",
    top_k=10,
)
classifier(f"""<TITLE> {title}\n<ABSTRACT> {abstract}""")
```
## Intended uses & limitations
The model is intended to be used as part of a larger model that also incorporates journal information and citation features. On its own, however, it works well for quickly generating a topic from only a title and abstract.

Since this model was fine-tuned from bert-base-multilingual-cased, the biases present in that base model will most likely show up in this model as well.
## Training procedure
### Training hyperparameters
The following hyperparameters were used during training:
- optimizer: Adam (beta_1=0.9, beta_2=0.999, epsilon=1e-08, amsgrad=False; no weight decay or gradient clipping; EMA disabled; jit_compile=True); see the reconstruction sketch after this list
- learning_rate: linear warmup to 6e-05 over the first 500 steps, then polynomial decay (power 1.0, i.e. linear) to 0.0 over 335,420 total steps
- training_precision: float32
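This configuration matches transformers' standard TF optimizer setup. A minimal sketch that reproduces an equivalent optimizer and schedule, assuming TensorFlow and transformers' `create_optimizer` helper (not necessarily the authors' actual training script):

```python
from transformers import create_optimizer

# Equivalent to the config above: Adam with linear warmup over 500 steps to
# 6e-05, then polynomial (power=1.0) decay to 0 over 335,420 steps;
# weight_decay_rate=0.0 yields plain Adam with no weight decay.
optimizer, lr_schedule = create_optimizer(
    init_lr=6e-5,
    num_train_steps=335_420,
    num_warmup_steps=500,
    adam_beta1=0.9,
    adam_beta2=0.999,
    adam_epsilon=1e-8,
    weight_decay_rate=0.0,
    power=1.0,
)
```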
### Training results
| Train Loss | Validation Loss | Train Accuracy | Epoch |
|:----------:|:---------------:|:--------------:|:-----:|
| 4.8075     | 3.6686          | 0.3839         | 0     |
| 3.4867     | 3.3360          | 0.4337         | 1     |
| 3.1865     | 3.2005          | 0.4556         | 2     |
| 2.9969     | 3.1379          | 0.4675         | 3     |
| 2.8489     | 3.0900          | 0.4746         | 4     |
| 2.7212     | 3.0744          | 0.4799         | 5     |
| 2.6035     | 3.0660          | 0.4831         | 6     |
| 2.4942     | 3.0737          | 0.4846         | 7     |
### Framework versions
- Transformers 4.35.2
- TensorFlow 2.13.0
- Datasets 2.15.0
- Tokenizers 0.15.0