---
license: apache-2.0
base_model: bert-base-multilingual-cased
model-index:
- name: >-
    bert-base-multilingual-cased-finetuned-openalex-topic-classification-title-abstract
  results: []
pipeline_tag: text-classification
widget:
- text: "<TITLE> From Louvain to Leiden: guaranteeing well-connected communities\n<ABSTRACT> Community detection is often used to understand the structure of large and complex networks. One of the most popular algorithms for uncovering community structure is the so-called Louvain algorithm. We show that this algorithm has a major defect that largely went unnoticed until now: the Louvain algorithm may yield arbitrarily badly connected communities. In the worst case, communities may even be disconnected, especially when running the algorithm iteratively. In our experimental analysis, we observe that up to 25% of the communities are badly connected and up to 16% are disconnected. To address this problem, we introduce the Leiden algorithm. We prove that the Leiden algorithm yields communities that are guaranteed to be connected. In addition, we prove that, when the Leiden algorithm is applied iteratively, it converges to a partition in which all subsets of all communities are locally optimally assigned. Furthermore, by relying on a fast local move approach, the Leiden algorithm runs faster than the Louvain algorithm. We demonstrate the performance of the Leiden algorithm for several benchmark and real-world networks. We find that the Leiden algorithm is faster than the Louvain algorithm and uncovers better partitions, in addition to providing explicit guarantees."
- text: "<TITLE> Cleavage of Structural Proteins during the Assembly of the Head of Bacteriophage T4"
- text: "<TITLE> NONE\n<ABSTRACT> Surface wave (SW) over-the-horizon (OTH) radars are not only widely used for ocean remote sensing, but they can also be exploited in integrated maritime surveillance systems. This paper represents the first part of the description of the statistical and spectral analysis performed on sea backscattered signals recorded by the oceanographic WEllen RAdar (WERA) system. Data were collected on May 13th 2008 in the Bay of Brest, France. The data statistical analysis, after beamforming, shows that for near range cells the signal amplitude fits well the Rayleigh distribution, while for far cells the data show a more pronounced heavy-tailed behavior. The causes can be traced in man-made (i.e. radio communications) and/or natural (i.e. reflections of the transmitted signal through the ionosphere layers, meteor trails) interferences."
---


# bert-base-multilingual-cased-finetuned-openalex-topic-classification-title-abstract

This model is a fine-tuned version of [bert-base-multilingual-cased](https://huggingface.co/bert-base-multilingual-cased), trained on a labeled dataset provided by [CWTS](https://www.cwts.nl/) ([CWTS Labeled Data](https://zenodo.org/records/10560276)). To see how CWTS labeled the data, check out the blog post [An open approach for classifying research publications](https://www.leidenmadtrics.nl/articles/an-open-approach-for-classifying-research-publications).

It was built to classify scholarly works against a fixed set of well-defined topics. Note that this is NOT the full model used to tag [OpenAlex](https://openalex.org/) works with topics; for that, see the [OpenAlex Topic Classification](https://github.com/ourresearch/openalex-topic-classification) GitHub repository. That repository also contains information about text preprocessing, modeling, testing, and deployment.

## Model description

The model was trained using the following input formats, so it is recommended that your data follow the same conventions:

Using both title and abstract:
"\<TITLE\> {insert-processed-title-here}\n\<ABSTRACT\> {insert-processed-abstract-here}"

Using only title:
"\<TITLE\> {insert-processed-title-here}"

Using only abstract:
"\<TITLE\> NONE\n\<ABSTRACT\> {insert-processed-abstract-here}"

The quickest way to use this model in Python is with the following code (assuming you have the transformers library installed):

```python
from transformers import pipeline

title = "{insert-processed-title-here}"
abstract = "{insert-processed-abstract-here}"

classifier = pipeline(
    model="OpenAlex/bert-base-multilingual-cased-finetuned-openalex-topic-classification-title-abstract",
    top_k=10,
    truncation=True,
    max_length=512,
)

classifier(f"<TITLE> {title}\n<ABSTRACT> {abstract}")
```
This will return the model's top 10 predictions. Each prediction carries two pieces of information:

1. Full topic label: the [OpenAlex](https://openalex.org/) topic ID followed by the topic name (e.g., "1048: Ecology and Evolution of Viruses in Ecosystems")
2. Model score: the model's confidence in that topic (e.g., 0.364)
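
Continuing the snippet above, the topic ID and the human-readable name can be split out of each label. This sketch assumes the pipeline's standard list-of-dicts output with `label` and `score` keys:

```python
results = classifier(f"<TITLE> {title}\n<ABSTRACT> {abstract}")

# Each entry looks like:
#   {"label": "1048: Ecology and Evolution of Viruses in Ecosystems",
#    "score": 0.364}
# Split on the first ": " to separate the OpenAlex topic ID from the name.
for result in results:
    topic_id, topic_name = result["label"].split(": ", 1)
    print(topic_id, topic_name, round(result["score"], 3))
```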

## Intended uses & limitations

The model is intended to be used as part of a larger model that also incorporates journal information and citation features. However, it also works well on its own for quickly assigning a topic based only on a title and/or abstract.

Since this model was fine-tuned from a pretrained BERT checkpoint, the biases present in that base model will most likely carry over to this model as well.

## Training procedure

### Training hyperparameters

The following hyperparameters were used during training:
- optimizer: Adam (beta_1=0.9, beta_2=0.999, epsilon=1e-08; no weight decay, no gradient clipping, jit_compile=True)
- learning rate: linear warmup to a peak of 6e-05 over the first 500 steps, followed by a linear (polynomial, power 1.0) decay to 0.0 over 335,420 total steps
- training_precision: float32
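
As a rough sketch (not the authors' exact training script), an equivalent optimizer and schedule can be recreated with the TensorFlow utilities shipped in `transformers`:

```python
from transformers import create_optimizer

# Adam with linear warmup to 6e-05 over the first 500 steps, then a linear
# (power=1.0) decay to 0.0 over 335,420 total steps, matching the
# hyperparameters listed above.
optimizer, lr_schedule = create_optimizer(
    init_lr=6e-05,
    num_train_steps=335420,
    num_warmup_steps=500,
)
```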

### Training results

| Train Loss | Validation Loss | Train Accuracy | Epoch |
|:----------:|:---------------:|:--------------:|:-----:|
| 4.8075     | 3.6686          | 0.3839         | 0     |
| 3.4867     | 3.3360          | 0.4337         | 1     |
| 3.1865     | 3.2005          | 0.4556         | 2     |
| 2.9969     | 3.1379          | 0.4675         | 3     |
| 2.8489     | 3.0900          | 0.4746         | 4     |
| 2.7212     | 3.0744          | 0.4799         | 5     |
| 2.6035     | 3.0660          | 0.4831         | 6     |
| 2.4942     | 3.0737          | 0.4846         | 7     |


### Framework versions

- Transformers 4.35.2
- TensorFlow 2.13.0
- Datasets 2.15.0
- Tokenizers 0.15.0