	---
	extra_gated_heading: Access aimped/nlp-health-translation-base-zh-en on Hugging Face
	extra_gated_description: >-
	This form enables access to this model on Hugging Face after you have been
	granted access by Aimped. Please visit the [Aimped
	website](https://aimped.ai/) to sign up and accept our Terms of Use and
	Privacy Policy before submitting this form. Requests are processed within 1-2
	days.
	extra_gated_prompt: >-
	**Your Hugging Face account email address MUST match the email you provide on
	the Aimped website or your request will not be approved.**
	extra_gated_button_content: Submit
	extra_gated_fields:
	I agree to share my name, email address, and username with Aimped and confirm that I have already been granted download access on the Aimped website: checkbox
	license: cc-by-nc-4.0
	language:
	- en
	- zh
	metrics:
	- bleu
	pipeline_tag: translation
	widget:
	- text: >-
	门静脉血栓（PVT）是肝硬化常见并发症之一，通过升高门静脉压力，诱发或加重腹水、上消化道出血，甚至增加肝移植难度，影响患者预后。
	tags:
	- medical
	- translation
	- medical translation
	datasets:
	- aimped/medical-translation-test-set
	---

	<p align="center">
	<img src="https://raw.githubusercontent.com/ai-amplified/models/main/media/AimpedLogoDark.svg" alt="aimped logo" width="50%" height="50%"/>
	</p>

	# Description of the Model

	<p>
	Paper: <a href="https://arxiv.org/abs/2407.12126" style="text-decoration: underline; color: blue;">LLMs-in-the-loop Part-1: Expert Small AI Models for Bio-Medical Text Translation</a>
	</p>

	<p style="margin-bottom: 0in; text-align: justify; line-height: 1.3;"><span style="font-family: 'IBM Plex Sans', sans-serif; font-size: 16px;">The Medical Translation AI model is a specialized language model trained to translate medical documents accurately from Chinese to English. It is intended to give healthcare professionals, researchers, and others working in the medical field a reliable tool for translating a wide range of medical documents.</span></p>
	<p style="margin-bottom: 0in; text-align: justify; line-height: 1.3;">
	<span style="font-family: 'IBM Plex Sans', sans-serif; font-size: 16px;">The model is built on the
	<a href="https://github.com/Helsinki-NLP/Tatoeba-Challenge/blob/master/models/zho-eng/README.md" style="text-decoration: underline; color: blue;">Helsinki-NLP/MarianMT</a> neural translation architecture and was trained for more than two days on an A100 (24G RAM) GPU. To build a high-quality training corpus, we combined publicly available and proprietary datasets and enriched them with carefully curated text collected from online sources. Clinical and discharge reports from a range of healthcare institutions added further depth and diversity. This curation is central to the model's ability to produce accurate translations tailored to the medical domain.<br><br>The model can translate a wide range of healthcare documents, including medical reports, patient records, medication instructions, research manuscripts, and clinical trial documents, which simplifies and speeds up translation work in the medical field.</span>
	</p>
	<p style="margin-bottom: 0in; text-align: justify; line-height: 1.3;"><span style="font-family: 'IBM Plex Sans', sans-serif; font-size: 16px;">On our curated proprietary test set, the model outperforms leading translation systems from Google, Helsinki-NLP (Opus-MT/MarianMT), and DeepL.</span></p>

	<p><br></p>
	<p style="line-height: 1.3;"><strong style="font-family: 'IBM Plex Sans', sans-serif; background-color: transparent; text-align: justify; font-size: 16px;">Text Format Requirements: </strong><span style="font-family: 'IBM Plex Sans', sans-serif; background-color: transparent; text-align: justify; font-size: 16px;">The text to be translated must be well structured and grammatically correct, with proper paragraph and sentence boundaries. Spelling errors and formatting issues, such as a line break in the middle of a sentence, are not corrected automatically.</span><br></p>
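
Because mid-sentence line breaks are not corrected automatically, it can help to clean the input before translation. The following is a minimal sketch, not part of the Aimped tooling: the helper name `unwrap_lines` and the set of sentence-ending characters are assumptions chosen for illustration.

```python
import re

# Sentence-final punctuation for Chinese and English text (an assumption;
# extend this set if your documents use other terminators).
SENTENCE_END = "。！？；.!?;"


def unwrap_lines(text: str) -> str:
    """Join lines that were broken before the sentence ended.

    Blank lines are kept as paragraph separators.
    """
    paragraphs = re.split(r"\n\s*\n", text.strip())
    cleaned = []
    for paragraph in paragraphs:
        lines = [line.strip() for line in paragraph.splitlines() if line.strip()]
        merged = ""
        for line in lines:
            if merged and merged[-1] not in SENTENCE_END:
                # The previous line stopped mid-sentence: join without a space,
                # since Chinese text does not separate words with spaces.
                merged += line
            elif merged:
                # The previous line ended a sentence: keep the line break.
                merged += "\n" + line
            else:
                merged = line
        cleaned.append(merged)
    return "\n\n".join(cleaned)
```

A sentence that ends with a closing quotation mark after the full stop is not recognized as complete by this sketch, so review the output for such cases.
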
	<p style="line-height: 1.3; margin-bottom: 0in; text-align: justify;"><span style="font-family: 'IBM Plex Sans', sans-serif; font-size: 16px;"><br><strong>Character and Word Limits:</strong> Each translation request is limited to 5,000 characters, for both the user interface (UI) and the API. Requests that exceed this limit are not supported.<br><br><strong>Segmentation of Translation Text: </strong>If the text exceeds the character limit, divide it into segments and translate each one separately. Split at paragraph boundaries or topic headings where possible.<br><strong><br>API Requests:</strong> Make sure each API request stays within the size limit. Divide large data sets into smaller requests or process them sequentially.<br></span><br></p>
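
The segmentation advice above can be sketched as a small helper that packs whole paragraphs into segments no longer than the character limit. This is an illustrative sketch under stated assumptions: the function name `split_into_segments` is hypothetical, paragraphs are assumed to be separated by blank lines, and a paragraph that is itself longer than the limit is cut at the limit as a last resort.

```python
MAX_CHARS = 5000  # the 5K-character limit per translation request


def split_into_segments(text: str, max_chars: int = MAX_CHARS) -> list[str]:
    """Greedily pack paragraphs into segments of at most max_chars characters."""
    segments, current = [], ""
    for paragraph in text.split("\n\n"):
        paragraph = paragraph.strip()
        if not paragraph:
            continue
        # Last resort: a single oversized paragraph is cut at the limit.
        while len(paragraph) > max_chars:
            if current:
                segments.append(current)
                current = ""
            segments.append(paragraph[:max_chars])
            paragraph = paragraph[max_chars:]
        # Try to append the paragraph to the segment being built.
        candidate = f"{current}\n\n{paragraph}" if current else paragraph
        if len(candidate) <= max_chars:
            current = candidate
        else:
            segments.append(current)
            current = paragraph
    if current:
        segments.append(current)
    return segments
```

Each returned segment can then be submitted as its own translation request, and the translated segments joined in the original order.
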
	<p style="line-height: 1.3; text-align: justify;"><span style="font-family: 'IBM Plex Sans', sans-serif; font-size: 16px;"><strong>Limitations:</strong>
	The model was designed and trained specifically for the healthcare and biomedical domain. Outside that domain, translation quality may be noticeably lower, so consider this limitation before applying the model to other kinds of text.</span></p>

	## Why should you use Aimped API?

	To get started, you can use the open-source version of our models for research purposes. The models served through the Aimped API, however, are retrained on new data every three months, so they keep up with ongoing developments in healthcare and current medical terminology rather than being frozen at a knowledge cutoff. We also apply pre- and post-processing steps that improve translation quality, and our quality control ensures that each new version performs at least comparably to previous versions.

	## How to Use
	For best results, translate with the `text_translate` function from the `aimped` package, following the steps below.

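	- Install requirements
	```python
	# Notebook syntax: drop the leading "!" when running in a terminal.
	!pip install transformers
	!pip install sentencepiece
	!pip install aimped

	# Download the NLTK "punkt" sentence tokenizer data.
	import nltk
	nltk.download('punkt')
	```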
	- Import libraries
	```python
	from transformers import AutoTokenizer, AutoModelForSeq2SeqLM, pipeline
	from aimped.nlp.translation import text_translate
	import torch

	# Use the GPU when one is available, otherwise fall back to the CPU.
	device = "cuda" if torch.cuda.is_available() else "cpu"
	```
	- Load model
	```python
	# Access to this repository is gated, so log in to Hugging Face first.
	model_path = "aimped/nlp-health-translation-base-zh-en"
	tokenizer = AutoTokenizer.from_pretrained(model_path)
	model = AutoModelForSeq2SeqLM.from_pretrained(model_path)
	```

	```python
	translater = pipeline(
	    task="translation_zh_to_en",
	    model=model,
	    tokenizer=tokenizer,
	    device=device,
	    max_length=512,
	    num_beams=7,
	    early_stopping=False,
	    num_return_sequences=1,
	    do_sample=False,
	)
	```

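	- Use model
	```python
	sentence = "门静脉血栓（PVT）是肝硬化常见并发症之一，通过升高门静脉压力，诱发或加重腹水、上消化道出血，甚至增加肝移植难度，影响患者预后。"
	# text_translate takes a list of texts, the source language code, and the pipeline.
	translated_text = text_translate([sentence], source_lang="zh", pipeline=translater)
	print(translated_text)
	```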
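
When a document has been divided into several segments, they can be translated one after another and collected in order. The helper below is a hedged sketch, not part of the `aimped` package: `translate_sequentially` and its `translate_fn` parameter are hypothetical names, and `translate_fn` stands for any callable that maps one source string to one translated string.

```python
from typing import Callable, Iterable


def translate_sequentially(
    segments: Iterable[str],
    translate_fn: Callable[[str], str],
) -> list[str]:
    """Translate segments one at a time, preserving their order."""
    results = []
    for index, segment in enumerate(segments):
        try:
            results.append(translate_fn(segment))
        except Exception as error:
            # Report which segment failed so it can be retried on its own.
            raise RuntimeError(f"Translation failed for segment {index}") from error
    return results
```

One possible `translate_fn` is `lambda s: translater(s)[0]["translation_text"]`, which calls the `transformers` pipeline directly. That route bypasses whatever additional handling `text_translate` performs, so results may differ from the recommended usage.
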