---
library_name: transformers
license: apache-2.0
language:
- hi
pipeline_tag: token-classification
---

## Model Details

### BertWordPieceTokenizer

- WordPiece tokenizer for the Hindi language

#### Usage

```py
from transformers import AutoTokenizer

# load the pretrained Hindi tokenizer from the Hugging Face Hub
hi_tokenizer = AutoTokenizer.from_pretrained('krinal/BertWordPieceTokenizer-hi')
hi_str = "आज का सूर्य देखो, कितना प्यारा, कितना शीतल है"

# encode text into a list of token ids
encoded_str = hi_tokenizer.encode(hi_str)

# decode the token ids back into text
decoded_str = hi_tokenizer.decode(encoded_str)
```

#### Language

- hi

#### Training

- For training, see [Train BertWordPieceTokenizer](https://gist.github.com/kjdeveloper8/57d9e16848cd77df778804c9e2214a78)

#### Dataset

- Trained on BHAAV (a Hindi emotion analysis dataset)
- Dataset source: [Bhaav](https://github.com/midas-research/bhaav)
- Hindi text corpus (20,304 sentences)

#### Citation

```bibtex
@article{kumar2019bhaav,
  title={BHAAV-A Text Corpus for Emotion Analysis from Hindi Stories},
  author={Kumar, Yaman and Mahata, Debanjan and Aggarwal, Sagar and Chugh, Anmol and Maheshwari, Rajat and Shah, Rajiv Ratn},
  journal={arXiv preprint arXiv:1910.04073},
  year={2019}
}
```
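
#### Example: training sketch

The Training section points to an external gist. For readers who want a rough idea of what training a tokenizer of this kind can look like, here is a minimal sketch using the `tokenizers` library. It is not the author's training script: the three-sentence in-memory corpus is a hypothetical stand-in for BHAAV, and the settings (`vocab_size=200`, `min_frequency=1`, `lowercase=False`, `strip_accents=False`) are assumptions chosen for illustration. Accent stripping is turned off because BERT-style normalization can remove Devanagari combining marks.

```python
from tokenizers import BertWordPieceTokenizer

# Hypothetical stand-in corpus; the published tokenizer was trained on BHAAV.
corpus = [
    "आज का सूर्य देखो, कितना प्यारा, कितना शीतल है",
    "आज मौसम कितना प्यारा है",
    "सूर्य शीतल नहीं है",
]

# Assumed settings: keep the text as-is, since lowercasing is irrelevant for
# Devanagari and accent stripping can drop vowel signs.
tokenizer = BertWordPieceTokenizer(
    lowercase=False,
    strip_accents=False,
)

# Train a small WordPiece vocabulary directly from the in-memory sentences.
# vocab_size and min_frequency are illustrative values, not the author's.
tokenizer.train_from_iterator(
    corpus,
    vocab_size=200,
    min_frequency=1,
    show_progress=False,
)

# Encode a sentence and inspect the resulting word pieces and ids.
encoding = tokenizer.encode("आज का सूर्य देखो")
print(encoding.tokens)
print(encoding.ids)

# Decode the ids back into text.
print(tokenizer.decode(encoding.ids))
```

A real run would pass the BHAAV sentences (or file paths via `tokenizer.train(files=[...])`) instead of the toy list and use a much larger `vocab_size`.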