Xiangtai Li (LXT)
8 followers · 0 following
https://lxtgh.github.io/
Twitter: xtl994 · GitHub: lxtGH · LinkedIn: xiangtai-li-b1b76112a
AI & ML interests
Computer Vision, Multi-Modal Understanding, Generative AI
Recent Activity
reacted to merve's post · 1 day ago
ByteDance just dropped Sa2VA: a new family of vision LMs combining Qwen2VL/InternVL and SAM2, under an MIT license. https://huggingface.co/collections/ByteDance/sa2va-model-zoo-677e3084d71b5f108d00e093
> The models can do vision-language understanding and visual referrals (referring segmentation) for both images and videos.
> They come in 1B, 4B, and 8B sizes, using InternVL2.5 as the base architecture and Qwen2, Qwen2.5, or InternLM2 as the language model (depending on the checkpoint).
> The model is very interesting: it has a separate encoder for each modality (visual prompt, text prompt, image, and video), concatenates their outputs to feed into the LLM, and passes the output segmentation tokens to SAM2 to match text (captions or semantic classes) to masks.
> Their annotation pipeline is also interesting: they seem to use two open large vision LMs to refine the annotations, with different levels of description to provide consistency.
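As a quick way to try the checkpoints from the collection above, here is a minimal usage sketch through `transformers`. It assumes the remote-code API (`predict_forward` and the `prediction` / `prediction_masks` return keys) documented on the Sa2VA model cards; the exact method and key names are assumptions and may differ between checkpoints.

```python
# Minimal sketch: referring segmentation with a Sa2VA checkpoint via
# transformers' remote-code path. `predict_forward` and its return keys
# follow the pattern shown on the Sa2VA model cards, but they are custom
# code shipped with the checkpoint -- treat the names as assumptions.
import torch
from PIL import Image
from transformers import AutoModel, AutoTokenizer

model_id = "ByteDance/Sa2VA-4B"  # also available in 1B and 8B variants
model = AutoModel.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,  # Sa2VA ships its own modeling code
).eval().cuda()
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)

image = Image.open("example.jpg").convert("RGB")
# The "<image>" token prefixes the referring expression, per the model-card example.
result = model.predict_forward(
    image=image,
    text="<image>Please segment the person on the left.",
    tokenizer=tokenizer,
)
print(result["prediction"])             # text answer containing segmentation tokens
masks = result.get("prediction_masks")  # binary masks decoded by the SAM2 head
```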
authored a paper · 2 days ago
Sa2VA: Marrying SAM2 with LLaVA for Dense Grounded Understanding of Images and Videos
LXT's activity
liked a model · 3 days ago
ByteDance/Sa2VA-4B — Image-Text-to-Text • Updated about 12 hours ago • 399 downloads • 29 likes

liked a dataset · 22 days ago
zhangtao-whu/OMG-LLaVA — Updated Jul 3, 2024 • 829 downloads • 3 likes

liked a dataset · about 1 month ago
jianzongwu/MangaZero — Viewer • Updated about 1 month ago • 32.7k rows • 198 downloads • 21 likes

liked a model · 2 months ago
Collov-Labs/Monetico — Text-to-Image • Updated Oct 28, 2024 • 26 downloads • 65 likes

liked a Space · 3 months ago
Meissonic Flow — Running on Zero • 49 likes

liked a model · 3 months ago
MeissonFlow/Meissonic — Text-to-Image • Updated Dec 5, 2024 • 44 downloads • 98 likes

liked 2 models · 6 months ago
zhangtao-whu/OMG-LLaVA — Updated Jul 3, 2024 • 5 likes
PhoenixZ/MG-LLaVA — Updated Jun 26, 2024 • 7 likes

liked a Space · 6 months ago
FaceAdapter — Runtime error • 30 likes

liked a model · 7 months ago
openlm-research/open_llama_3b — Text Generation • Updated Jun 16, 2023 • 127k downloads • 154 likes

liked a dataset · 9 months ago
ILSVRC/imagenet-1k — Updated Jul 16, 2024 • 16.6k downloads • 439 likes

liked a Space · 12 months ago
OMG-SEG — Sleeping • 16 likes

liked a model · 12 months ago
LXT/OMG_Seg — Updated Jan 19, 2024 • 7 likes