Qinghong (Kevin) Lin

KevinQHLin

AI & ML interests

Vision-Language Model, Video Understanding, Human-AI Interaction

Recent Activity

updated a collection 1 day ago
ShowUI

Organizations

Show Lab

KevinQHLin's activity

New activity in showlab/ShowUI-2B 15 days ago

Quantized versions?

1
#7 opened 17 days ago by
SouthpawIN
New activity in showlab/ShowUI-2B 17 days ago

Agent Loop

6
#6 opened 18 days ago by
Maverick17
reacted to m-ric's post with ❤️👀 18 days ago
Post
1466
๐—ฆ๐—ต๐—ผ๐˜„๐—จ๐—œ: ๐—ฎ ๐˜€๐—บ๐—ฎ๐—น๐—น ๐—ฒ๐—ป๐—ฑ-๐˜๐—ผ-๐—ฒ๐—ป๐—ฑ ๐—ฎ๐—ด๐—ฒ๐—ป๐˜ ๐˜๐—ต๐—ฎ๐˜ ๐—ฐ๐—ฎ๐—ป ๐—ป๐—ฎ๐˜ƒ๐—ถ๐—ด๐—ฎ๐˜๐—ฒ ๐—ฎ๐—ป๐˜† ๐—จ๐—œ ๐—ฎ๐—ป๐—ฑ ๐—ผ๐˜‚๐˜๐—ฝ๐—ฒ๐—ฟ๐—ณ๐—ผ๐—ฟ๐—บ๐˜€ ๐—บ๐˜‚๐—ฐ๐—ต ๐—ฏ๐—ถ๐—ด๐—ด๐—ฒ๐—ฟ ๐˜€๐˜†๐˜€๐˜๐—ฒ๐—บ๐˜€! ๐Ÿ“ฒ

A team from NUS and Microsoft just released an agent that can act on any UI (desktop, Android, web) without needing additional text information. It works extremely well: they applied their method to a tiny Qwen2-VL-2B and managed to beat methods that use much more powerful vision models (like GPT-4V), without relying on any additional info (e.g., the DOM of a webpage) like previous methods did! 👏👏

They started from the idea that most existing methods rely heavily on text, which makes them less generalizable, while setting aside the rich UI structure that users actually rely on when navigating these interfaces.

โš™๏ธ They put several good ideas to work:

💡 Simplify screenshots to the max:
They aggressively prune the heavy visual content of UI screenshots by removing near-duplicate image patches (e.g., any large area of uniform color is reduced to a single small patch, while positional embeddings are preserved), then group patches belonging to the same GUI element to simplify even further.
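The pruning idea above can be sketched in a few lines. This is an illustrative toy, not the authors' implementation: it merges consecutive patches whose mean colors are nearly identical into one representative group, while remembering every patch's original position (standing in for the preserved positional embeddings). The `threshold` value and the data layout are assumptions made for the example.

```python
# Toy sketch of UI-guided patch pruning (NOT the ShowUI code):
# merge runs of near-identical patches into one representative group,
# keeping each patch's original position.

def patch_color(patch):
    """Mean (r, g, b) color of a patch given as a list of pixel tuples."""
    n = len(patch)
    return tuple(sum(px[c] for px in patch) / n for c in range(3))

def prune_patches(patches, threshold=5.0):
    """Group consecutive patches whose mean colors differ by less than
    `threshold` per channel; keep one representative color per group,
    plus the positions of all patches it absorbed."""
    groups = []
    for idx, patch in enumerate(patches):
        color = patch_color(patch)
        if groups and all(
            abs(a - b) < threshold
            for a, b in zip(color, groups[-1]["color"])
        ):
            # Redundant patch: drop its content, keep only its position.
            groups[-1]["positions"].append(idx)
        else:
            groups.append({"color": color, "positions": [idx]})
    return groups

# A uniform white toolbar (3 identical patches) followed by a dark button:
white = [(250, 250, 250)] * 4
dark = [(30, 30, 30)] * 4
groups = prune_patches([white, white, white, dark])
print(len(groups))             # 2 groups remain instead of 4 patches
print(groups[0]["positions"])  # [0, 1, 2]
```

A real implementation would operate on ViT patch embeddings rather than raw colors, but the effect is the same: vast same-color regions collapse to a single token while positions survive.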

💡 Build a truly generalist dataset:
To train a general UI agent, you need trajectories from every kind of UI, expressed in a common language. The authors merge datasets like OmniAct for desktop, Mind2Web for websites, and AMEX for Android trajectories to create a high-quality, diverse dataset.
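Expressing heterogeneous trajectories in a common language amounts to mapping each source's action records onto one shared schema. The sketch below is hypothetical: the dataset names are real, but the per-source record fields (`op`, `gesture`, `type`, etc.) and the unified schema are invented here for illustration.

```python
# Hypothetical sketch of unifying UI trajectories from different sources
# into one action schema; field names are invented for illustration.

def to_unified(source, record):
    """Map a source-specific action record to a shared schema:
    {"device": ..., "action": ..., "target": (x, y), "value": ...}."""
    if source == "mind2web":    # web trajectories
        return {"device": "web", "action": record["op"].lower(),
                "target": record["coords"], "value": record.get("text")}
    if source == "amex":        # Android trajectories
        return {"device": "android", "action": record["gesture"],
                "target": record["point"], "value": None}
    if source == "omniact":     # desktop trajectories
        return {"device": "desktop", "action": record["type"],
                "target": record["xy"], "value": record.get("input")}
    raise ValueError(f"unknown source: {source}")

# A web click at normalized screen coordinates becomes a unified record:
sample = to_unified("mind2web", {"op": "CLICK", "coords": (0.42, 0.17)})
print(sample["action"], sample["target"])  # click (0.42, 0.17)
```

Once every trajectory speaks this one vocabulary, a single model can be trained across desktop, web, and mobile data without per-platform heads.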

โžก๏ธ Nice results ensued:
They fine-tune a tiny Qwen-2-VL-2B on their method, and it reaches SOTA on several task (element identification, web navigation), even beating methods that either use additional info from the DOM or use much bigger VLMS like GPT-4v! ๐Ÿ†

And performance could certainly jump with a slightly bigger vision model. Let's hope the community builds this soon! 🚀

Paper added to my "Agents" collection 👉 m-ric/agents-65ba776fbd9e29f771c07d4e
reacted to m-ric's post with ❤️ 19 days ago
Post
1206
Need a measure of a GitHub repo's traction that's more reliable than GitHub star history (which is a bit too hype-driven)? 📈

โžก๏ธ I've made a Space to visualize PyPI downloads.

Try it here 👉 m-ric/package-download-history
  • 1 reply
New activity in showlab/ShowUI 24 days ago
New activity in showlab/ShowUI-2B 24 days ago
New activity in showlab/ShowUI-2B 25 days ago

License?

1
#4 opened 25 days ago by
fs-tom