Danielus

danielus

AI & ML interests

None yet

Recent Activity

liked a Space 8 days ago
fffiloni/expression-editor
liked a model 15 days ago
allenai/olmOCR-7B-0225-preview
liked a Space 16 days ago
lllyasviel/LuminaBrush

Organizations

None yet

danielus's activity

New activity in webml-community/kokoro-webgpu about 1 month ago

output is bugged 100%

3
#2 opened about 1 month ago by
froilo
New activity in hexgrad/Kokoro-82M about 1 month ago

Feedback Italian voice

5
#90 opened about 1 month ago by
lucalani
reacted to hexgrad's post with 🤗 3 months ago
Post
3206
Tonight, Adam & Michael join the 82M Apache TTS model in hexgrad/Kokoro-82M
reacted to YerbaPage's post with 👀 3 months ago
Post
1424
Curated list of **Repository-level Code Generation** papers & benchmarks! 🔥

Stay ahead with the latest in:
✅ Repo-level Issue Resolution
✅ Repo-level Code Completion
✅ Datasets & Benchmarks

👉 Check it out: https://github.com/YerbaPage/Awesome-Repo-Level-Code-Generation 🔥
reacted to m-ric's post with 🔥 3 months ago
Post
2268
Potential paradigm shift in LLMs: new paper by Meta claims that we can get rid of tokenizers! 🥳

Current LLMs process text by first splitting it into tokens. They use a module called a "tokenizer" that -spl-it-s- th-e- te-xt- in-to- arbitrary tokens based on a fixed dictionary.
On the Hub you can find this dictionary in a model's files under tokenizer.json.

➡️ The most common form of this process is BPE (byte-pair encoding) tokenization. It is widely considered suboptimal: it breaks text into predefined chunks that often fail to capture the nuance of language. But it has been a necessary evil in language models since their inception.
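
To make the "predefined chunks" point concrete, here is a minimal sketch of how a BPE encoder applies a fixed list of merge rules to a single word. The function name `bpe_encode` and the three-entry merge list are made up for illustration; real tokenizers use far larger learned merge lists and add pre-tokenization steps that this toy skips.

```python
def bpe_encode(word: str, merges: list[tuple[str, str]]) -> list[str]:
    """Split `word` into tokens by applying merge rules in priority order.

    `merges` is a hypothetical, hand-written merge list; earlier entries
    have higher priority, as in standard BPE.
    """
    ranks = {pair: i for i, pair in enumerate(merges)}
    symbols = list(word)  # start from single characters
    while len(symbols) > 1:
        pairs = [(symbols[i], symbols[i + 1]) for i in range(len(symbols) - 1)]
        # Pick the adjacent pair with the best (lowest) merge rank.
        best = min(pairs, key=lambda p: ranks.get(p, float("inf")))
        if best not in ranks:
            break  # no known merge applies any more
        merged = []
        i = 0
        while i < len(symbols):
            if i < len(symbols) - 1 and (symbols[i], symbols[i + 1]) == best:
                merged.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                merged.append(symbols[i])
                i += 1
        symbols = merged
    return symbols


toy_merges = [("l", "o"), ("lo", "w"), ("e", "r")]
print(bpe_encode("lower", toy_merges))  # ['low', 'er']
print(bpe_encode("newer", toy_merges))  # ['n', 'e', 'w', 'er']
```

The same fixed rules chop "lower" into two tokens and "newer" into four, which is the kind of arbitrary splitting the post is complaining about.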

💥 In Byte Latent Transformer (BLT), Meta researchers propose an elegant solution by eliminating tokenization entirely, working directly with raw bytes while maintaining efficiency through dynamic "patches."
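
As a small illustration of what "working directly with raw bytes" means for the input side, the sketch below turns text into a sequence of integers in the range 0 to 255 using plain UTF-8 encoding, with no learned dictionary involved. The helper names `to_byte_ids` and `from_byte_ids` are hypothetical, and this shows only the input representation, not anything about the BLT model itself.

```python
def to_byte_ids(text: str) -> list[int]:
    """Encode text as a list of UTF-8 byte values (each in 0..255)."""
    return list(text.encode("utf-8"))


def from_byte_ids(ids: list[int]) -> str:
    """Invert to_byte_ids: rebuild the string from its byte values."""
    return bytes(ids).decode("utf-8")


ids = to_byte_ids("é")
print(ids)                  # [195, 169]  (one character, two bytes)
print(from_byte_ids(ids))   # é
```

Any string maps to bytes this way, so there are no out-of-vocabulary inputs, but sequences get longer than with tokens, which is why the grouping into patches matters.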

This had been tried before with different byte-level tokenizations, but it's the first time that an architecture of this type scales as well as BPE tokenization. And it could mean a real paradigm shift! 👏👏

🏗️ Architecture:
Instead of a lightweight tokenizer, BLT has a lightweight encoder that processes raw bytes into patches. The patches are then processed by the main heavy-duty transformer as usual (but on patches of bytes instead of tokens), before being converted back to bytes.

🧩 Dynamic Patching:
Instead of fixed tokens, BLT groups bytes based on their predictability (measured by entropy) - using more compute for complex sequences and efficiently handling simple ones. This allows efficient processing while maintaining byte-level understanding.
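
Here is a rough sketch of what entropy-based grouping could look like, under heavy assumptions. The post says only that bytes are grouped by predictability measured by entropy, so everything else here is illustrative: a bigram count model fit on the input itself stands in for whatever model supplies the entropy estimate, and a new patch starts whenever the estimated next-byte entropy exceeds a hypothetical `threshold`.

```python
import math
from collections import Counter, defaultdict


def next_byte_entropies(data: bytes) -> list[float]:
    """Estimated entropy (bits) of the byte at each position given the previous byte.

    Assumption: a bigram count model built from `data` itself is the
    stand-in predictor. Position 0 has no context and gets +inf.
    """
    follow = defaultdict(Counter)
    for prev, nxt in zip(data, data[1:]):
        follow[prev][nxt] += 1
    entropies = [float("inf")] if data else []
    for i in range(1, len(data)):
        counts = follow[data[i - 1]]
        total = sum(counts.values())
        entropies.append(
            -sum((c / total) * math.log2(c / total) for c in counts.values())
        )
    return entropies


def entropy_patches(data: bytes, threshold: float = 1.0) -> list[bytes]:
    """Group bytes into patches, starting a new patch where entropy > threshold."""
    if not data:
        return []
    ent = next_byte_entropies(data)
    patches, start = [], 0
    for i in range(1, len(data)):
        if ent[i] > threshold:  # hard-to-predict byte: open a new patch here
            patches.append(data[start:i])
            start = i
    patches.append(data[start:])
    return patches


print(entropy_patches(b"aaaaaaaa"))  # [b'aaaaaaaa']
print(entropy_patches(b"abacad"))    # [b'a', b'ba', b'ca', b'd']
```

Predictable runs collapse into one long patch, while positions where the next byte is uncertain produce many short patches, so more patches (and thus more compute) land on the harder parts of the sequence.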

I hope this breakthrough is confirmed and we can get rid of all the tokenizer stuff. It would make model handling easier!

Read their paper here 👉 https://dl.fbaipublicfiles.com/blt/BLT__Patches_Scale_Better_Than_Tokens.pdf
  • 2 replies
reacted to julien-c's post with 🔥 3 months ago
Post
10342
After some heated discussion 🔥, we clarify our intent re. storage limits on the Hub

TL;DR:
- public storage is free, and (unless blatant abuse) unlimited. We do ask that you consider upgrading to PRO and/or Enterprise Hub if possible
- private storage is paid above a significant free tier (1TB if you have a paid account, 100GB otherwise)

docs: https://huggingface.co/docs/hub/storage-limits

We continuously optimize our infrastructure to scale our storage for the coming years of growth in machine learning, to the benefit of the community 🔥

cc: @reach-vb @pierric @victor and the HF team