---
language:
- zh
tags:
- SequenceClassification
- MetaDis
- 古文
- 文言文
- ancient
- classical
- Biography
- 古代人物传记
license: cc-by-nc-sa-4.0
---
# <font color="IndianRed"> MetaDis (Classical Chinese Biographical Metadata Disambiguation)</font>
[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1UcyhdfX5_NuZ87XR1fUmMACJ7qY-nn-P#scrollTo=cd-iH6OLpIeV)
Download the <font color="IndianRed">template Excel sheet</font> from here: https://huggingface.co/cbdb/MetaDis/blob/main/template.xlsx
---
### <font color="IndianRed">MetaDis: Classical Chinese Biographical Metadata Disambiguation </font>
Welcome to the repository for MetaDis, a specialized model designed for disambiguating biographical metadata within Classical Chinese texts.
MetaDis addresses a common problem researchers encounter when studying historical texts: identifying individuals who share the same name. Do two occurrences of a name refer to the same person or to two different people? This is the question MetaDis seeks to answer.
MetaDis is loaded through the `AutoModelForNextSentencePrediction` class of the `transformers` library, which takes two sequences as input. The model outputs a binary label indicating whether the two sequences refer to the same person: 0 means 'not the same person', and 1 means 'the same person'.
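To make the 0/1 output more concrete, here is a minimal sketch of how two class scores (logits) can be turned into a label and a probability. It uses plain Python rather than the model itself; the helper name `label_from_logits` and the example scores are made up for illustration and are not part of MetaDis.

```python
import math

def label_from_logits(logits):
    """Map class scores [score_0, score_1] to (label, probability).

    Hypothetical helper for illustration; not part of MetaDis.
    """
    # Softmax, subtracting the maximum first for numerical stability
    top = max(logits)
    exps = [math.exp(x - top) for x in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    # The predicted label is the index of the largest score
    label = max(range(len(probs)), key=probs.__getitem__)
    return label, probs[label]

# Made-up scores: index 0 = 'not the same person', index 1 = 'the same person'
label, prob = label_from_logits([0.2, 1.5])
print("Same person" if label == 1 else "Different person", round(prob, 3))
```

The probability can serve as a rough confidence indicator when reviewing predictions, though how well it is calibrated for MetaDis is not something this sketch can show.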
---
### <font color="IndianRed">Input Data Formatting </font>
MetaDis performs best when its input follows the format of the data it was trained on, so we have defined an input format that mirrors the training data.
To help you prepare your data, we provide a template Excel (.xlsx) file. We recommend downloading this template and entering your data directly into it, so that your data matches the format of the model's training data.
To download our Excel data template, please click [here](https://huggingface.co/cbdb/MetaDis/blob/main/template.xlsx).
---
### <font color="IndianRed">Code Demonstration: Loading and Using MetaDis Model </font>
The following section demonstrates how to directly load the MetaDis model and use it for predicting whether two sets of biographical information refer to the same person or not.
Make sure the `transformers` and `torch` libraries are installed in your Python environment. If they are not, you can install them with pip:
```bash
pip install transformers torch
```
Now, let's load our model and make some predictions:
```python
# Import the tokenizer and model classes from Hugging Face Transformers
from transformers import AutoTokenizer, AutoModelForNextSentencePrediction
import torch

# Load the tokenizer and model
tokenizer = AutoTokenizer.from_pretrained("cbdb/MetaDis")
model = AutoModelForNextSentencePrediction.from_pretrained("cbdb/MetaDis")

# Define the pairs of biographical descriptions to compare
sentence1 = ['first biographical information of person name A', 'first biographical information of person name B']
sentence2 = ['second biographical information of person name A', 'second biographical information of person name B']

# Loop through each sentence pair
for s1, s2 in zip(sentence1, sentence2):
    # Tokenize the pair as a single two-segment input
    encoding = tokenizer(s1, s2, truncation=True, padding=True, return_tensors='pt')
    # Move the inputs to the device where the model is
    encoding = {key: value.to(model.device) for key, value in encoding.items()}
    # Run the model without tracking gradients (inference only)
    with torch.no_grad():
        outputs = model(**encoding)
    # Pick the class with the highest logit: 1 = same person, 0 = different
    preds = torch.argmax(outputs.logits, dim=-1)
    # Display the result
    if preds.item() == 1:
        print('Same person')
    else:
        print('Different person')
    print(s1, s2)
```
This code demonstration shows how you can load our MetaDis model, prepare inputs in the necessary format, and extract predictions to determine if the biographical details refer to the same person or different individuals. Remember to replace the example sentences with your own data.
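When working with many records, the `sentence1` and `sentence2` lists can be built programmatically. The sketch below groups records by name and pairs up every two records that share a name. The `(name, biography_text)` record layout and the `candidate_pairs` helper are assumptions for illustration only; the text of each record should still follow the format of the Excel template.

```python
from collections import defaultdict
from itertools import combinations

def candidate_pairs(records):
    """Return (text_a, text_b) pairs for records that share a name.

    records: iterable of (name, biography_text) tuples (hypothetical layout).
    """
    # Group the biography texts by person name
    by_name = defaultdict(list)
    for name, text in records:
        by_name[name].append(text)
    # Within each name, every two texts form one candidate pair
    pairs = []
    for texts in by_name.values():
        pairs.extend(combinations(texts, 2))
    return pairs

# Placeholder records; replace with your own data
records = [
    ("person name A", "first biographical information of person name A"),
    ("person name A", "second biographical information of person name A"),
    ("person name B", "first biographical information of person name B"),
]
pairs = candidate_pairs(records)
sentence1 = [a for a, _ in pairs]
sentence2 = [b for _, b in pairs]
print(sentence1, sentence2)
```

Names that occur only once produce no pairs, since there is nothing to disambiguate. The number of pairs grows quadratically with the number of records per name, so very common names may need additional filtering before prediction.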
---
### <font color="IndianRed">Authors </font>
Queenie Luo (queenieluo[at]g.harvard.edu)
<br>
Hongsu Wang
<br>
Peter Bol
<br>
CBDB Group
### <font color="IndianRed">License </font>
Copyright (c) 2023 CBDB
Except where otherwise noted, content on this repository is licensed under the Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License (CC BY-NC-SA 4.0).
To view a copy of this license, visit http://creativecommons.org/licenses/by-nc-sa/4.0/ or
send a letter to Creative Commons, PO Box 1866, Mountain View, CA 94042, USA.