Add STG Support for Video Diffusion in CosyVoice Audio #1391
Open
primepake wants to merge 2 commits into FunAudioLLM:main from
Looks interesting, but may I ask, what are the effects of adding STG? Better voice cloning quality or better emotion?
Author
Yes, it will improve the model quality. For example, the flow matching in the flow model sometimes struggles to keep the speaker consistent: the voice identity can change from male to female within the same audio. With STG, this is improved.
May I ask, how is stg_applied_layers_idx set here?
This PR introduces Spatiotemporal Skip Guidance (STG) support to CosyVoice, inspired by the video diffusion framework from STGuidance. The changes enhance the text-to-speech pipeline by integrating STG, improving [e.g., generation quality, efficiency, or compatibility with diffusion-based workflows].
Changes Made
Updated cosyvoice/flow/decoder.py to [e.g., "incorporate stage-guided decoding logic for better alignment with diffusion processes"].
Modified cosyvoice/flow/flow_matching.py to [e.g., "adapt flow matching to support STG’s stage-based optimization"].
Motivation
The addition of STG support aims to [e.g., "leverage stage-guided diffusion techniques to enhance the quality and speed of speech synthesis, aligning CosyVoice with advanced video diffusion methodologies"]. This builds on the concepts from junhahyung/STGuidance, adapted for audio generation.
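To make the idea concrete, here is a minimal sketch of how an STG term might be combined with classifier-free guidance when computing the velocity for one flow matching step. It is not the code in this PR. It assumes that STG works by running the estimator a third time with the layers listed in `stg_applied_layers_idx` skipped, and then pushing the prediction away from that weakened output. The helper names `run_estimator` and `guided_velocity`, the `cfg_rate` and `stg_scale` parameters, and the residual-layer toy model are all hypothetical and chosen only for illustration.

```python
from typing import Callable, Iterable, List, Sequence

Vector = List[float]
# A toy "layer": takes the hidden state, the time step and a conditioning
# value, and returns a residual update. (Hypothetical stand-in for a real
# transformer block in the flow estimator.)
Layer = Callable[[Vector, float, float], Vector]


def run_estimator(
    layers: Sequence[Layer],
    x: Vector,
    t: float,
    cond: float,
    skip_idx: Iterable[int] = (),
) -> Vector:
    """Apply residual layers in order; skipped layers act as the identity."""
    skipped = frozenset(skip_idx)
    h = list(x)
    for i, layer in enumerate(layers):
        if i in skipped:
            continue  # assumption: "skipping" means bypassing the whole layer
        delta = layer(h, t, cond)
        h = [a + b for a, b in zip(h, delta)]
    return h


def guided_velocity(
    layers: Sequence[Layer],
    x: Vector,
    t: float,
    cond: float,
    cfg_rate: float,
    stg_scale: float,
    stg_applied_layers_idx: Sequence[int],
) -> Vector:
    """Combine CFG with an STG-style term (sketch under stated assumptions)."""
    v_cond = run_estimator(layers, x, t, cond)
    # Assumption: the unconditional branch is obtained by zeroing the condition.
    v_uncond = run_estimator(layers, x, t, 0.0)
    # Classifier-free guidance: extrapolate away from the unconditional output.
    v = [(1.0 + cfg_rate) * c - cfg_rate * u for c, u in zip(v_cond, v_uncond)]
    if stg_scale > 0.0 and len(stg_applied_layers_idx) > 0:
        # Perturbed branch: same conditioning, but selected layers are skipped.
        v_skip = run_estimator(layers, x, t, cond, stg_applied_layers_idx)
        # Extrapolate away from the weakened (layer-skipped) prediction.
        v = [g + stg_scale * (c - p) for g, c, p in zip(v, v_cond, v_skip)]
    return v
```

With `stg_scale` set to 0 or an empty `stg_applied_layers_idx`, the sketch reduces to plain classifier-free guidance, which makes it easy to compare outputs with and without STG. Which layer indices work best is an empirical choice and is not specified in this PR description.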