Abstract
Overview
Overall Framework of DGTalker. We design a disentangled navigation framework consisting of an anchor $w_{can}$, which encodes a global canonical expression for a specific identity, and two sets of learnable, orthogonal blendshapes $\{B_\text{exp}, B_\text{lip}\}$ containing $k_e$ and $k_l$ vectors, respectively. Each vector corresponds to a disentangled variation of the upper- or lower-face expression. The input audio is used to regress the coefficients of these blendshapes. To ensure effective learning, we randomly feed different audio inputs to the encoder and render the output images from two viewpoints; the corresponding masked ground-truth (GT) images are then used for supervision.
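A minimal PyTorch sketch of how this latent composition could look, under assumed shapes and placeholder module names (the audio-to-coefficient MLP and the orthogonality regularizer are illustrative, not the authors' implementation); the 3D-aware renderer, the two-viewpoint rendering, and the masked-GT supervision are omitted.

```python
import torch
import torch.nn as nn

class DisentangledNavigator(nn.Module):
    """Sketch: compose a latent code from a canonical anchor and two
    learnable blendshape bases whose coefficients are regressed from audio."""

    def __init__(self, latent_dim=512, k_e=10, k_l=20, audio_dim=256):
        super().__init__()
        # Global canonical expression code for one identity (the anchor w_can).
        self.w_can = nn.Parameter(torch.zeros(latent_dim))
        # Learnable blendshape bases for upper-face expression and lip motion.
        self.B_exp = nn.Parameter(torch.randn(k_e, latent_dim) * 0.01)
        self.B_lip = nn.Parameter(torch.randn(k_l, latent_dim) * 0.01)
        # Hypothetical audio encoder head that predicts both coefficient sets.
        self.audio_to_coeff = nn.Sequential(
            nn.Linear(audio_dim, 256), nn.ReLU(), nn.Linear(256, k_e + k_l)
        )
        self.k_e, self.k_l = k_e, k_l

    def orthogonality_loss(self):
        # Push the stacked basis vectors toward orthonormality
        # (an assumed regularizer, not necessarily the paper's exact one).
        B = torch.cat([self.B_exp, self.B_lip], dim=0)
        gram = B @ B.t()
        return ((gram - torch.eye(B.shape[0], device=B.device)) ** 2).mean()

    def forward(self, audio_feat):
        coeff = self.audio_to_coeff(audio_feat)              # (B, k_e + k_l)
        a_exp, a_lip = coeff[:, :self.k_e], coeff[:, self.k_e:]
        # Navigate the latent space: anchor plus coefficient-weighted bases.
        w = self.w_can + a_exp @ self.B_exp + a_lip @ self.B_lip
        return w, a_exp, a_lip
```

During training, the resulting latent $w$ would be rendered from two viewpoints and compared against the masked GT images, as described above.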

Visualization & Controllability
The learned components carry well-defined semantic meanings; we visualize how the generated face changes as the coefficients of the two blendshape sets vary.
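A sketch of how such a coefficient sweep could be produced, reusing the `DisentangledNavigator` sketch above; `render_frame` is a hypothetical stand-in for the 3D-aware generator.

```python
import torch

def sweep_blendshape(navigator, basis_name="B_lip", index=0,
                     values=torch.linspace(-2.0, 2.0, steps=5)):
    """Vary a single blendshape coefficient to inspect its semantic meaning."""
    basis = getattr(navigator, basis_name)        # (k, latent_dim)
    latents = []
    for v in values:
        # Anchor plus one scaled basis vector; all other coefficients stay zero.
        w = navigator.w_can + v * basis[index]
        latents.append(w)
    return torch.stack(latents)                   # feed these to the renderer

# frames = [render_frame(w) for w in sweep_blendshape(navigator, "B_exp", 2)]
```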

DGTalker can generate diverse talking expressions from the same speech content, providing superior controllability.
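Under the same assumptions as above, one way this controllability could be exercised is to keep the audio-regressed lip coefficients fixed (preserving the speech content) while resampling only the expression coefficients; the Gaussian sampling here is an illustrative choice, not necessarily the paper's.

```python
import torch

@torch.no_grad()
def diverse_expressions(navigator, audio_feat, n_samples=4, scale=1.0):
    """Same speech, different expressions: reuse a_lip, vary a_exp."""
    _, a_exp, a_lip = navigator(audio_feat)        # coefficients from audio
    latents = []
    for _ in range(n_samples):
        a_exp_new = scale * torch.randn_like(a_exp)   # assumed sampling scheme
        w = navigator.w_can + a_exp_new @ navigator.B_exp + a_lip @ navigator.B_lip
        latents.append(w)
    return torch.stack(latents)
```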
Comparison to SOTA
We design three settings to evaluate 3D-aware reconstruction quality and lip-audio synchronization: the self-driven setting, the 3D-aware self-driven setting, and the 3D-aware audio generalization setting.
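The concrete metrics are not listed in this section; purely as an illustration, a reconstruction-quality check in the self-driven settings could use a standard image metric such as PSNR (an assumed choice, not taken from the paper).

```python
import torch

def psnr(pred, gt, max_val=1.0):
    """Peak signal-to-noise ratio between a predicted and a ground-truth frame.

    Illustrative reconstruction metric only; the paper's exact evaluation
    protocol is not specified here."""
    mse = torch.mean((pred - gt) ** 2)
    return 10.0 * torch.log10(max_val ** 2 / mse)
```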
Citation (Coming Soon)
BibTeX code here