
Mohamed Hisham Abdelzaher

MH0386

AI & ML interests

None yet


Organizations

Cairo University · AlphaSphere.AI

MH0386's activity

upvoted an article about 1 month ago

Use Models from the Hugging Face Hub in LM Studio

By yagilb
reacted to Symbol-LLM's post with 🚀 about 2 months ago
🚀 Excited to introduce a new member of the OS-Copilot family: OS-Atlas, an open-source foundation action model for GUI agents

📘 Paper: OS-ATLAS: A Foundation Action Model for Generalist GUI Agents (2410.23218)
🔗 Website: https://osatlas.github.io

😇 TL;DR: OS-Atlas offers:
1. State-of-the-Art GUI Grounding: Helps GUI agents accurately locate GUI elements.
2. Strong OOD Performance and Cross-platform Compatibility: Excels in out-of-domain agentic tasks across macOS, Windows, Linux, Android, and Web.
3. Complete Infrastructure for GUI Data Synthesis:
You can easily build your own OS agent on top of it! (A hedged usage sketch follows below.)
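
To make the grounding claim concrete, here is a minimal sketch of querying OS-Atlas for the location of a UI element. It assumes the OS-Copilot/OS-Atlas-Base-7B checkpoint on the Hub and the Qwen2-VL transformers interface the model is built on; the prompt wording and element name are illustrative, not the project's canonical format.

```python
# Hedged sketch: asking OS-Atlas to ground a GUI element in a screenshot.
# Assumes OS-Copilot/OS-Atlas-Base-7B follows the Qwen2-VL interface.
from transformers import Qwen2VLForConditionalGeneration, AutoProcessor
from PIL import Image

model_id = "OS-Copilot/OS-Atlas-Base-7B"
model = Qwen2VLForConditionalGeneration.from_pretrained(
    model_id, torch_dtype="auto", device_map="auto"
)
processor = AutoProcessor.from_pretrained(model_id)

screenshot = Image.open("screenshot.png")  # any GUI screenshot
messages = [{
    "role": "user",
    "content": [
        {"type": "image"},
        {"type": "text",
         "text": 'In this UI screenshot, where is the element "Settings button" (with bbox)?'},
    ],
}]
prompt = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(text=[prompt], images=[screenshot], return_tensors="pt").to(model.device)

out = model.generate(**inputs, max_new_tokens=64)
# The model is expected to answer with coordinates / a bounding box for the element.
print(processor.batch_decode(out[:, inputs["input_ids"].shape[1]:],
                             skip_special_tokens=True)[0])
```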

liked a Space 2 months ago
upvoted an article 3 months ago

Introducing Community Tools on HuggingChat

updated a Space 4 months ago
upvoted an article 9 months ago

Welcome Llama 3 - Meta's new open LLM

reacted to Jaward's post with 👍 9 months ago
Let's break down the technical details of Microsoft's mind-blowing lifelike audio-driven talking-faces framework VASA and its model VASA-1:

Summary of Summaries
- The paper introduces VASA, a framework for generating lifelike talking faces with appealing visual affective skills (VAS) from a single image and speech audio.
- Core innovations include a diffusion-based model for holistic generation of facial dynamics and head movements in an expressive, disentangled face latent space developed using video data.
- VASA-1 generates high-quality 512x512 videos at up to 40 FPS with low latency.
- Supports real-time generation of lifelike, emotive talking faces.

Summary of Overall Framework:
- Instead of directly generating video frames, VASA generates holistic facial dynamics and head motion in a latent space, conditioned on audio and optional control signals.
- To achieve this, the framework uses a face encoder-decoder to extract appearance and identity features and trains a Diffusion Transformer model to generate motion latent codes (sketched below).
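
At a high level, inference could look roughly like this. This is an illustrative sketch only: the code is not public, and face_encoder, face_decoder, and motion_diffusion are hypothetical stand-ins for the paper's components.

```python
# Hedged sketch of VASA's two-stage inference; all module names are assumptions.
import torch

def generate_talking_face(face_image, audio_features,
                          face_encoder, face_decoder, motion_diffusion):
    # 1) Extract the appearance volume and identity code once from the single image.
    appearance, identity = face_encoder(face_image)

    # 2) The Diffusion Transformer samples a sequence of head-pose + facial-dynamics
    #    latents conditioned on the audio (and optional control signals).
    motion_latents = motion_diffusion.sample(cond=audio_features)  # [T, latent_dim]

    # 3) Decode each motion latent with the fixed appearance/identity into a frame.
    frames = [face_decoder(appearance, identity, z) for z in motion_latents]
    return torch.stack(frames)  # [T, 3, 512, 512]
```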

Technical Method Details:
Expressive and Disentangled Face Latent Space Construction:
- Built on a 3D-aided face reenactment framework
- Decomposes the face into a 3D appearance volume, an identity code, a head pose, and facial dynamics latents
- Uses encoders to extract these latent factors from face images
- Applies additional losses to improve disentanglement (both sketched in code below):
  - Pairwise head pose and facial dynamics transfer loss
  - Face identity similarity loss for cross-identity pose/dynamics transfer
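
The two disentanglement losses might look roughly like the following. This is my own reading of the paper, not released code; enc, dec, and id_net are hypothetical placeholders for the latent encoders, the decoder, and a pretrained face-identification network.

```python
# Hedged sketches of the disentanglement losses; module names are assumptions.
import torch.nn.functional as F

def pairwise_transfer_loss(img_a, img_b, enc, dec):
    """img_a, img_b: two frames of the SAME subject. If pose/dynamics are
    disentangled from appearance, rendering A's appearance with B's motion
    should reconstruct frame B."""
    app_a, id_a, _, _ = enc(img_a)      # appearance + identity from frame A
    _, _, pose_b, dyn_b = enc(img_b)    # head pose + dynamics from frame B
    swapped = dec(app_a, id_a, pose_b, dyn_b)
    return F.l1_loss(swapped, img_b)

def identity_similarity_loss(img_src, img_drv, enc, dec, id_net):
    """Cross-identity transfer: drive the source subject with another person's
    pose/dynamics; the result should keep the source identity."""
    app, idc, _, _ = enc(img_src)
    _, _, pose, dyn = enc(img_drv)
    reenacted = dec(app, idc, pose, dyn)
    return 1.0 - F.cosine_similarity(id_net(reenacted),
                                     id_net(img_src), dim=-1).mean()
```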

Holistic Facial Dynamics Generation with Diffusion Transformer:
- Represents all facial movements (lip motion, expression, gaze, etc.) as a single latent sequence
- Applies a Diffusion Transformer model to generate the facial dynamics sequence
- The Diffusion Transformer is trained with a simplified denoising score-matching objective (a training-step sketch follows below)
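
The "simplified denoising score matching objective" is essentially the standard noise-prediction (epsilon-prediction) loss used in diffusion models. A training-step sketch, where shapes, the noise schedule, and the dit module are my assumptions:

```python
# Hedged sketch of one training step for the motion Diffusion Transformer.
import torch
import torch.nn.functional as F

def diffusion_training_step(dit, x0, audio_cond, alphas_cumprod):
    """x0: clean motion-latent window [B, T, D]; audio_cond: audio features [B, T, C];
    alphas_cumprod: cumulative noise schedule, one value per timestep."""
    B = x0.shape[0]
    t = torch.randint(0, len(alphas_cumprod), (B,), device=x0.device)
    a_bar = alphas_cumprod[t].view(B, 1, 1)

    # Forward diffusion: corrupt the clean latents with Gaussian noise at level t.
    noise = torch.randn_like(x0)
    x_t = a_bar.sqrt() * x0 + (1.0 - a_bar).sqrt() * noise

    # The transformer predicts the added noise from the noisy latents, timestep,
    # and audio condition; MSE against the true noise is the simplified objective.
    eps_pred = dit(x_t, t, audio_cond)
    return F.mse_loss(eps_pred, noise)
```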