# 🔉👄 Wav2Lip STUDIO Standalone

demo/demo1.mp4

## 💡 Description
This repository contains the standalone version of Wav2Lip Studio. It is an all-in-one solution: choose a video and a speech file (wav or mp3), and the tool generates a lip-synced video. It also offers face swap, voice cloning, and video translation with voice cloning (HeyGen-like). It improves the quality of the lip-sync videos generated by the [Wav2Lip tool](https://github.com/Rudrabha/Wav2Lip) by applying specific post-processing techniques.

![Illustration](demo/demo.png)
![Illustration](demo/demo1.png)

## 📖 Quick Index
* [🚀 Updates](#-updates)
* [🔗 Requirements](#-requirements)
* [💻 Installation](#-installation)
* [🐍 Tutorial](#tutorial)
* [🐍 Usage](#-usage)
* [👄 Keyframes Manager](#keyframes-manager)
* [👄 Input Video](#input-video)
* [📺 Examples](#-examples)
* [📖 Behind the scenes](#-behind-the-scenes)
* [💪 Quality tips](#-quality-tips)
* [⚠️ Noted Constraints](#-noted-constraints)
* [📝 To do](#-to-do)
* [😎 Contributing](#-contributing)
* [🙏 Appreciation](#-appreciation)
* [📝 Citation](#-citation)
* [📜 License](#-license)
* [☕ Support Wav2lip Studio](#-support-wav2lip-studio)

## 🚀 Updates

**2024.01.20 Major Update (Standalone version only)**
- ♻ Manage projects: added a feature to manage multiple projects
- 👪 Multiple face swap: can now swap multiple faces in one shot (see the Usage section)
- ⛔ Visible face restriction: the whole process now runs even if no face is detected in a frame
- 📺 Video size: works with high-resolution video input (tested with 1980x1080; should work with 4K, but slowly)
- 🔑 Keyframe manager: added a keyframe manager for better control of the video generation
- 🪐 coqui TTS integration: removed the bark integration and replaced it with coqui TTS (see the Usage section)
- 💬 Conversation: added a conversation feature with multiple speakers (see the Usage section)
- 🔈 Record your own voice: added a feature to record your own voice (see the Usage section)
- 👬 Clone voice: added a feature to clone voices from a video (see the Usage section)
- 🎏 Translate video: added a feature to translate a video with voice cloning (see the Usage section)
- 🔉 Volume amplifier for Wav2Lip: added a feature to amplify the volume of the audio sent to Wav2Lip (see the Usage section)
- 🕑 Added a delay before the speech audio starts
- 🚀 Speed-up: the process is now faster

**2023.09.13**
- 👪 Introduced face swap: facefusion integration (see the Usage section). **This feature is experimental**.

**2023.08.22**
- 👄 Introduced [bark](https://github.com/suno-ai/bark/) (see the Usage section). **This feature is experimental**.

**2023.08.20**
- 🚢 Introduced the GFPGAN model as an option.
- ▶ Added the feature to resume generation.
- 📏 Optimized to release memory post-generation.

**2023.08.17**
- 🐛 Fixed the purple lips bug

**2023.08.16**
- ⚡ Added Wav2lip and enhanced video output, with the option to download the one that's best for you, likely the "generated video".
- 🚢 Updated User Interface: introduced control over CodeFormer Fidelity.
- 👄 Removed image as input; [SadTalker](https://github.com/OpenTalker/SadTalker) is better suited for this.
- 🐛 Fixed a bug regarding the discrepancy between input and output video that incorrectly positioned the mask.
- 💪 Refined the quality process for greater efficiency.
- 🚫 Interruption will now generate videos if the process creates frames **2023.08.13** - ⚑ Speed-up computation - 🚒 Change User Interface : Add controls on hidden parameters - πŸ‘„ Only Track mouth if needed - πŸ“° Control debug - πŸ› Fix resize factor bug ## πŸ”— Requirements - FFmpeg : download it from the [official FFmpeg site](https://ffmpeg.org/download.html). Follow the instructions appropriate for your operating system, note ffmpeg have to be accessible from the command line. ## πŸ’» Installation # Windows Users 1.Install [Visual Studio](https://visualstudio.microsoft.com/fr/downloads/). During the install, make sure to include the Python and C++ packages in visual studio installer. ![Illustration](demo/visual_studio_1.png) ![Illustration](demo/visual_studio_2.png) 2. Install [python 3.10.11](https://www.python.org/downloads/release/python-31011/) 3. Install [git](https://git-scm.com/downloads) 4. Install [Cuda 11.8](https://developer.nvidia.com/cuda-11-8-0-download-archive) if not ever done. ![Illustration](demo/cuda.png) 6. Check python and git installation ```bash python --version git --version nvcc --version ``` Must return something like ```bash Python 3.10.11 git version 2.35.1.windows.2 nvcc: NVIDIA (R) Cuda compiler driver Copyright (c) 2005-2022 NVIDIA Corporation Built on Wed_Sep_21_10:41:10_Pacific_Daylight_Time_2022 Cuda compilation tools, release 11.8, V11.8.89 Build cuda_11.8.r11.8/compiler.31833905_0 ``` 7. if you have multiple Python version on your computer edit wav2lip-studio.bat and change the following line: ```bash REM set PYTHON="your python.exe path" ``` ```bash set PYTHON="your python.exe path" ``` 8. double click on wav2lip-studio.bat, that will install the requirements and download models # MACOS Users 1. Install python 3.9 ``` brew update brew install python@3.9 brew install git-lfs git-lfs install ``` 3. 
   Install the environment and requirements
   ```
   cd /YourWav2lipStudioFolder
   /opt/homebrew/bin/python3.9 -m venv venv
   ./venv/bin/python3.9 -m pip install torch==2.1.2 torchvision==0.16.2 torchaudio==2.1.2
   ./venv/bin/python3.9 -m pip install -r requirements.txt
   ./venv/bin/python3.9 -m pip install transformers==4.33.2
   ./venv/bin/python3.9 -m pip install numpy==1.24.4
   ```
   If this does not work, or if `pip install -r requirements.txt` takes too long, install the packages explicitly:
   ```
   ./venv/bin/python3.9 -m pip install inaSpeechSegmenter
   ./venv/bin/python3.9 -m pip install gradio==4.14.0 imutils==0.5.4 numpy opencv-python==4.8.0.76 scipy==1.11.2 requests==2.28.1 pillow==9.3.0 librosa==0.10.0 opencv-contrib-python==4.8.0.76 huggingface_hub==0.20.2 tqdm==4.66.1 cutlet==0.3.0 numba==0.57.1 imageio_ffmpeg==0.4.9 insightface==0.7.3 unidic==1.1.0 onnx==1.14.1 onnxruntime==1.16.0 psutil==5.9.5 lpips==0.1.4 GitPython==3.1.36 facexlib==0.3.0 gfpgan==1.3.8 gdown==4.7.1 pyannote.audio==3.1.1 TTS==0.21.2 openai-whisper==20231117 resampy==0.4.0 scenedetect==0.6.2 uvicorn==0.23.2 starlette==0.35.1 fastapi==0.109.0 fugashi
   ./venv/bin/python3.9 -m pip install torch==2.1.2 torchvision==0.16.2 torchaudio==2.1.2
   ./venv/bin/python3.9 -m pip install transformers==4.33.2
   ./venv/bin/python3.9 -m pip install numpy==1.24.4
   ```
3. Install the models
   ```
   git clone https://huggingface.co/numz/wav2lip_studio models
   ```
4. Launch the UI
   ```
   ./venv/bin/python3.9 wav2lip_studio.py
   ```

# All Users
1. pyannote.audio: you need to agree to share your contact information to access the pyannote models. To do so, visit both links:
   - [pyannote diarization-3.1 huggingface repository](https://huggingface.co/pyannote/speaker-diarization-3.1)
   - [pyannote segmentation-3.0 huggingface repository](https://huggingface.co/pyannote/segmentation-3.0)

   Fill in each field and click "Agree and access repository".

   ![Illustration](demo/hf_aggrement.png)
2. Create a Hugging Face access token:
   1. Log in to your account
   2.
      Go to [access tokens](https://huggingface.co/settings/token) in settings
   3. Create a new token in read mode
   4. Copy the token
   5. Paste it into the file api_keys.json
      ```json
      {
        "huggingface_token": "your token"
      }
      ```

## Tutorial
- [FR version](https://youtu.be/43Q8YASkcUA)
- [EN Version](https://youtu.be/B84A5alpPDc)

## 🐍 Usage
## PARAMETERS
1. Enter a project name and press Enter.
2. Choose a video (avi or mp4 format). Note that an avi file will not be displayed in the Video input, but processing will still work.
3. Face Swap (this takes time, so be patient):
   - **Face Swap**: choose the image of the faces you want to swap onto the faces in the video (multiple faces are now supported). The leftmost face has id 0.
4. **Resolution Divide Factor**: the resolution of the video is divided by this factor. The higher the factor, the faster the process, but the lower the resolution of the output video.
5. **Min Face Width Detection**: the minimum width of a face to detect. Lets you ignore small faces in the video.
6. **Align Faces**: straightens the head before sending it to Wav2Lip for processing.
7. **Keyframes On Speaker Change**: generates a keyframe when the speaker changes, which gives you better control over the video generation.
8. **Keyframes On Scene Change**: generates a keyframe when the scene changes, which gives you better control over the video generation.
9. Once the parameters above are set, click **Generate Keyframes**. See the [Keyframes manager](#keyframes-manager) section for more details.
10. Audio, 3 options:
    1. Put an audio file in the "Speech" input, or record one with the "Record" button.
    2. Generate audio with the text-to-speech [coqui TTS](https://github.com/coqui-ai/TTS) integration.
       1. Choose the language
       2. Choose the voice
       3. Write your speech in the "Prompt" text area, in text format or JSON format:
          1. Text format:
             ```bash
             Hello, my name is John. I am 25 years old.
             ```
          2.
             JSON format (you can ask ChatGPT to generate a discussion for you):
             ```bash
             [
               {
                 "start": 0.0,
                 "end": 3.0,
                 "text": "Hello, my name is John. I am 25 years old.",
                 "speaker": "arnold"
               },
               {
                 "start": 3.0,
                 "end": 4.0,
                 "text": "Oh really?",
                 "speaker": "female_01"
               },
               ...
             ]
             ```
    3. Input Video: lets you use the audio from the input video, voice cloning, and translation. See the [Input Video](#input-video) section for more details.
11. **Video Quality**:
    - **Low**: original Wav2Lip quality; fast but not very good.
    - **Medium**: better quality by applying post-processing to the mouth; slower.
    - **High**: better quality by applying post-processing and upscaling the mouth; slower.
12. **Wav2lip Checkpoint**: choose between 2 Wav2Lip models:
    - **Wav2lip**: original Wav2Lip model; fast but not very good.
    - **Wav2lip GAN**: better quality by applying post-processing to the mouth; slower.
13. **Face Restoration Model**: choose between 2 face restoration models:
    - **Code Former**:
      - A value of 0 offers higher quality but may significantly alter the person's facial appearance and cause noticeable flickering between frames.
      - A value of 1 provides lower quality but maintains the person's face more consistently and reduces frame flickering.
      - Using a value below 0.5 is not advised. Adjust this setting to achieve optimal results. Starting with a value of 0.75 is recommended.
    - **GFPGAN**: usually better quality.
14. **Volume Amplifier**: does not amplify the volume of the output audio. It amplifies the volume of the audio sent to Wav2Lip, which gives you better control over lip movement.

## KEYFRAMES MANAGER
![Illustration](demo/keyframes-manager.png)

Global parameters:

1. **Only Track The Mouth**: tracks only the mouth, removing other facial motions like those of the cheeks and chin.
2. **Only Show Speaker Face**: shows only the face of the speaker; the other faces are hidden.
3.
   **Frame Number**: a slider for moving between the frames of the video.
4. **Add Keyframe**: adds a keyframe at the current Frame Number.
5. **Remove Keyframe**: removes the keyframe at the current Frame Number.
6. **Keyframes**: a list of all the keyframes.

For each face on a keyframe:

1. **Face Id**: list of all the faces in the current keyframe.
2. **Speaker**: checkbox to mark the current Face Id as the speaker in the current keyframe.
3. **Face Swap Id**: checkbox to set the face swap id used for the current Face Id in the current keyframe.
4. **Mouth Mask Dilate**: dilates the mouth mask to cover more area around the mouth. The right value depends on the mouth size.
5. **Face Mask Erode**: erodes the face mask to remove some area around the face. The right value depends on the face size.
6. **Mask Blur**: blurs the mask to make it smoother. Try to keep it less than or equal to **Mouth Mask Dilate**.
7. **Padding sliders**: add padding around the head to avoid cutting it off in the video.

## Input Video
![Illustration](demo/input-video.png)

If there is no sound in the translated audio, the audio from the input video is used. This can be useful if the lip sync in the input video is bad.

Clone Voices:

1. **Number Of Speakers**: the number of speakers in the video. Helps the cloning step know how many voices to clone.
2. **Remove Background Sound Before Clone**: removes background noise/music before cloning.
3. **Clone Voices**: clones the voices from the input video.
4. **Voices**: list of the cloned voices.

Translation:

1. **Language**: target language for translating the input video.
2. **Whisper Model**: list of the Whisper models available for the translation. Choose between 5 models: the larger the model, the better the quality but the slower the process.
3. **Translate**: translates the input video to the selected language.
4. **Translation**: the translated text.
5. **Translated Audio**: the translated audio.
6.
   **Convert To Audio**: converts the translated text to translated audio.

## 📺 Examples

demo/demo2.mp4

demo/demo3.mp4

demo/demo4.mp4

demo/demo5.mp4

## 📖 Behind the scenes

This extension operates in several stages to improve the quality of Wav2Lip-generated videos:

1. **Generate face swap video**: the script first generates the face swap video if an image is provided in the "Face Swap" field. This operation takes time, so be patient.
2. **Generate a Wav2lip video**: the script then generates a low-quality Wav2Lip video using the input video and audio.
3. **Video Quality Enhancement**: the script creates a high-quality video from the low-quality video using the enhancer chosen by the user.
4. **Mask Creation**: the script creates a mask around the mouth and tries to keep other facial motions like those of the cheeks and chin.
5. **Video Generation**: the script then takes the high-quality mouth image and overlays it onto the original image, guided by the mouth mask.

## 💪 Quality tips
- Use a high-quality video as input.
- Use a video with a consistent frame rate. Occasionally, videos may exhibit unusual playback frame rates (not the standard 24, 25, 30, 60), which can lead to issues with the face mask.
- Use a high-quality audio file as input, without background noise or music. Clean the audio with a tool like [https://podcast.adobe.com/enhance](https://podcast.adobe.com/enhance).
- Dilate the mouth mask. This will help the model retain some facial motion and hide the original mouth.
- Keep Mask Blur at no more than twice the value of Mouth Mask Dilate. If you want to increase the blur, increase the value of Mouth Mask Dilate as well; otherwise the mouth will be blurred and the underlying mouth could be visible.
- Upscaling can improve the result, particularly around the mouth area, but it extends the processing time. Use this tutorial from Olivio Sarikas to upscale your video: [https://www.youtube.com/watch?v=3z4MKUqFEUk](https://www.youtube.com/watch?v=3z4MKUqFEUk).
  Ensure the denoising strength is set between 0.0 and 0.05, select the 'revAnimated' model, and use the batch mode. I'll create a tutorial for this soon.

## ⚠ Noted Constraints
- To speed up the process, try to keep the resolution under 1000x1000px and upscale after processing.
- If the initial phase is excessively lengthy, consider using the "resize factor" to decrease the video's dimensions.
- While there's no strict size limit for videos, larger videos will require more processing time. It's advisable to employ the "resize factor" to minimize the video size and then upscale the video once processing is complete.

## 📝 To do
- ✔️ Standalone version
- ✔️ Add a way to use a face swap image
- ✔️ Add the possibility to use a video for audio input
- ✔️ Convert avi to mp4. Avi is not shown in the video input, but the process works fine
- [ ] ComfyUI integration

## 😎 Contributing

We welcome contributions to this project. When submitting pull requests, please provide a detailed description of the changes. See [CONTRIBUTING](CONTRIBUTING.md) for more information.

## 🙏 Appreciation
- [Wav2Lip](https://github.com/Rudrabha/Wav2Lip)
- [CodeFormer](https://github.com/sczhou/CodeFormer)
- [Coqui TTS](https://github.com/coqui-ai/TTS)
- [facefusion](https://github.com/facefusion/facefusion)
- [Vocal Remover](https://github.com/tsurumeso/vocal-remover)

## ☕ Support Wav2lip Studio

This project is an open-source effort that is free to use and modify. I rely on the support of users to keep this project going and help improve it. If you'd like to support me, you can make a donation on my Patreon page. Any contribution, large or small, is greatly appreciated!

Your support helps me cover the costs of development and maintenance, and allows me to allocate more time and resources to enhancing this project. Thank you for your support!
[Patreon page](https://www.patreon.com/Wav2LipStudio)

## 📝 Citation
If you use this project in your own work, in articles, tutorials, or presentations, we encourage you to cite this project to acknowledge the efforts put into it.

To cite this project, please use the following BibTeX format:

```
@misc{wav2lip_uhq,
  author = {numz},
  title = {Wav2Lip UHQ},
  year = {2023},
  howpublished = {GitHub repository},
  publisher = {numz},
  url = {https://github.com/numz/sd-wav2lip-uhq}
}
```

## 📜 License
* The code in this repository is released under the MIT license as found in the [LICENSE file](LICENSE).
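
As a companion to the JSON prompt format described in the Usage section, here is a minimal sketch for sanity-checking a conversation prompt before pasting it into the "Prompt" text area. The helper name `validate_prompt` and the rules it enforces (all four keys present, positive duration, no overlap between consecutive segments, non-empty text) are assumptions made for illustration; Wav2Lip Studio is not documented to perform or require these checks.

```python
import json

# Keys used by the JSON prompt example in the Usage section.
REQUIRED_KEYS = {"start", "end", "text", "speaker"}


def validate_prompt(raw: str) -> list:
    """Parse a JSON prompt and check each segment (hypothetical helper)."""
    segments = json.loads(raw)
    if not isinstance(segments, list):
        raise ValueError("the prompt must be a JSON array of segments")
    previous_end = 0.0
    for index, segment in enumerate(segments):
        if not isinstance(segment, dict):
            raise ValueError(f"segment {index}: expected a JSON object")
        missing = REQUIRED_KEYS - set(segment)
        if missing:
            raise ValueError(f"segment {index}: missing keys {sorted(missing)}")
        start, end = float(segment["start"]), float(segment["end"])
        # Assumed rules: positive duration, and no overlap with the previous segment.
        if start < previous_end or end <= start:
            raise ValueError(f"segment {index}: invalid timing {start} -> {end}")
        if not str(segment["text"]).strip():
            raise ValueError(f"segment {index}: empty text")
        previous_end = end
    return segments


example = """
[
  {"start": 0.0, "end": 3.0, "text": "Hello, my name is John. I am 25 years old.", "speaker": "arnold"},
  {"start": 3.0, "end": 4.0, "text": "Oh really?", "speaker": "female_01"}
]
"""
print(len(validate_prompt(example)), "segments look consistent")
```

If a segment is malformed, the function raises a `ValueError` naming the offending segment index, which is usually easier to act on than a failure later in the generation.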