Demo Page for "User-Driven Voice Generation and Editing through Latent Space Navigation"
Voice reconstruction with varying number of principal components

These audio samples demonstrate how the number of principal components affects voice reconstruction fidelity. We list some VITS-generated voices that sound noticeably different when reconstructed from only eight components (N=8) rather than from all components (Full). The same voices generated by the two ECAPA-TDNN models, especially ECAPA-TDNN (NANSY), maintain relatively high fidelity even with this reduction.

ID | VITS (N=8, N=16, N=32, Full) | ECAPA-TDNN (SV) (N=8, N=16, N=32, Full) | ECAPA-TDNN (NANSY) (N=8, N=16, N=32, Full)
train-clean-360/288/131218/288_131218_000007_000002.wav
train-clean-360/2751/142363/2751_142363_000014_000002.wav
train-clean-360/3482/170452/3482_170452_000033_000002.wav
train-other-500/310/129055/310_129055_000034_000000.wav
train-other-500/432/122761/432_122761_000004_000002.wav
train-other-500/1051/133886/1051_133886_000082_000003.wav
train-other-500/2013/147610/2013_147610_000077_000001.wav
train-other-500/3641/134615/3641_134615_000025_000000.wav
train-other-500/6106/58195/6106_58195_000026_000002.wav
train-other-500/7764/106805/7764_106805_000011_000001.wav
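
The reconstruction compared above can be sketched as follows. This is a minimal illustration, assuming the principal components come from ordinary PCA over a matrix of speaker embeddings. The random data, the 64-dimensional embedding size, and the function names `fit_pca` and `reconstruct` are hypothetical stand-ins, not the authors' code.

```python
import numpy as np

def fit_pca(embeddings):
    """Return the mean, principal directions (as rows), and per-direction std."""
    mean = embeddings.mean(axis=0)
    centered = embeddings - mean
    # Rows of vt are the principal directions, ordered by decreasing variance.
    _, s, vt = np.linalg.svd(centered, full_matrices=False)
    sigma = s / np.sqrt(len(embeddings) - 1)
    return mean, vt, sigma

def reconstruct(z, mean, directions, n_components):
    """Keep only the first n_components coordinates of z and map back."""
    w = directions[:n_components]      # shape (n_components, dim)
    coords = (z - mean) @ w.T          # coordinates along each kept direction
    return mean + coords @ w

rng = np.random.default_rng(0)
# Hypothetical stand-in for real speaker embeddings: 500 speakers, 64 dims.
embeddings = rng.normal(size=(500, 64)) * np.linspace(3.0, 0.1, 64)
mean, directions, sigma = fit_pca(embeddings)

z = embeddings[0]
for n in (8, 16, 32, 64):              # 64 plays the role of "Full" here
    error = np.linalg.norm(z - reconstruct(z, mean, directions, n))
    print(f"N={n}: reconstruction error {error:.4f}")
```

In this sketch the error can only shrink as N grows, because each larger set of directions contains the smaller one. How audible the remaining error is depends on the speaker encoder, which is what the samples above compare.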
Effects of shifting speaker embeddings along each principal direction

These audio samples illustrate how shifting speaker embeddings along different principal directions affects the resulting voices. Notice that shifting speaker embeddings beyond the 10th principal direction induces significantly fewer changes in voice for the two ECAPA-TDNN models compared to VITS. Also, for ECAPA-TDNN (NANSY), only the first three principal directions are related to pitch changes. In contrast, for ECAPA-TDNN (SV), moving speaker embeddings along many nonconsecutive principal directions results in pitch changes: for example, directions 1, 3, 4, 5, 7, 9, 10, and 13 for the female voice below, and directions 2, 3, 4, 5, 6, 7, 10, 11, 12, and 15 for the male voice below.

Female voice ID: train-clean-100/7511/102419/7511_102419_000028_000012.wav

Index i | VITS (z - 4.0σ_i w_i, z + 4.0σ_i w_i) | ECAPA-TDNN (SV) (z - 4.0σ_i w_i, z + 4.0σ_i w_i) | ECAPA-TDNN (NANSY) (z - 4.0σ_i w_i, z + 4.0σ_i w_i)
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16

Male voice ID: train-clean-100/1040/133433/1040_133433_000115_000001.wav

Index i | VITS (z - 4.0σ_i w_i, z + 4.0σ_i w_i) | ECAPA-TDNN (SV) (z - 4.0σ_i w_i, z + 4.0σ_i w_i) | ECAPA-TDNN (NANSY) (z - 4.0σ_i w_i, z + 4.0σ_i w_i)
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
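
The column labels in the two tables above describe a shift of the speaker embedding z by four standard deviations along the i-th principal direction. A minimal sketch of that operation is given below. It assumes that w_i is the i-th unit-norm principal direction and that σ_i is the standard deviation of the embeddings along w_i. The random data and the names `principal_directions` and `shift_along` are hypothetical, not taken from the authors' implementation.

```python
import numpy as np

def principal_directions(embeddings):
    """Return unit principal directions w_i (as rows) and their std sigma_i."""
    centered = embeddings - embeddings.mean(axis=0)
    _, s, vt = np.linalg.svd(centered, full_matrices=False)
    return vt, s / np.sqrt(len(embeddings) - 1)

def shift_along(z, directions, sigma, i, scale=4.0):
    """Return (z - scale*sigma_i*w_i, z + scale*sigma_i*w_i) for a 1-based index i."""
    step = scale * sigma[i - 1] * directions[i - 1]
    return z - step, z + step

rng = np.random.default_rng(0)
# Hypothetical stand-in for real speaker embeddings: 500 speakers, 64 dims.
embeddings = rng.normal(size=(500, 64)) * np.linspace(3.0, 0.1, 64)
directions, sigma = principal_directions(embeddings)

z = embeddings[0]
# One (minus, plus) pair per table row, i = 1..16.
pairs = {i: shift_along(z, directions, sigma, i) for i in range(1, 17)}
z_minus, z_plus = pairs[3]
print(z_minus.shape, z_plus.shape)
```

Each shifted embedding would then be passed to the synthesizer in place of z. Because the directions are orthogonal, a shift along w_i leaves the coordinates along every other principal direction unchanged.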
Searching for the target voice

These audio samples are from the user studies. For each target voice (Easy-1/2, Hard-1/2/3/4/5), you can listen to user-searched voices (both with average and optimal initialization), and the simulated search results.

Sample: Easy-1. Original recording: . Resynthesized audio: .

Init Query 1 Query 2 Query 3 Query 4 Query 5 Query 6 Query 7 Query 8 Query 9 Query 10 Final
Simulated search (# PC=16, # Iter=2)
User search (average initialization)
User search (optimal initialization)

Sample: Easy-2. Original recording: . Resynthesized audio: .

Init Query 1 Query 2 Query 3 Query 4 Query 5 Query 6 Query 7 Query 8 Query 9 Query 10 Final
Simulated search (# PC=16, # Iter=2)
User search (average initialization)
User search (optimal initialization)

Sample: Hard-1. Original recording: . Resynthesized audio: .

Init Query 1 Query 2 Query 3 Query 4 Query 5 Query 6 Query 7 Query 8 Query 9 Query 10 Final
Simulated search (# PC=16, # Iter=2)
User search (average initialization)
User search (optimal initialization)

Sample: Hard-2. Original recording: . Resynthesized audio: .

Init Query 1 Query 2 Query 3 Query 4 Query 5 Query 6 Query 7 Query 8 Query 9 Query 10 Final
Simulated search (# PC=16, # Iter=2)
User search (average initialization)
User search (optimal initialization)

Sample: Hard-3. Original recording: . Resynthesized audio: .

Init Query 1 Query 2 Query 3 Query 4 Query 5 Query 6 Query 7 Query 8 Query 9 Query 10 Final
Simulated search (# PC=16, # Iter=2)
User search (average initialization)
User search (optimal initialization)

Sample: Hard-4. Original recording: . Resynthesized audio: .

Init Query 1 Query 2 Query 3 Query 4 Query 5 Query 6 Query 7 Query 8 Query 9 Query 10 Final
Simulated search (# PC=16, # Iter=2)
User search (average initialization)
User search (optimal initialization)

Sample: Hard-5. Original recording: . Resynthesized audio: .

Init Query 1 Query 2 Query 3 Query 4 Query 5 Query 6 Query 7 Query 8 Query 9 Query 10 Final
Simulated search (# PC=16, # Iter=2)
User search (average initialization)
User search (optimal initialization)
Voice editing within the latent space

These audio samples demonstrate the quality of the voice attribute editing. You can download the full set of samples used in the listening test here.

ID | Resynthesized Voice | v1 (low_pitch - high_pitch) | v2 (flat - expressive) | v3 (low_vol - high_vol) | v4 (relaxed - strained) | v5 (less_nasal - more_nasal) | v6 (bright - muffled)
Each attribute column vk holds two samples: z - 4.0σ_k v_k and z + 4.0σ_k v_k.
dev-clean/6345/93302/6345_93302_000037_000003.wav
dev-clean/2902/9008/2902_9008_000008_000004.wav
test-clean/121/127105/121_127105_000016_000000.wav
test-clean/1320/122612/1320_122612_000056_000003.wav
test-clean/1580/141083/1580_141083_000056_000005.wav
test-other/3764/168671/3764_168671_000003_000005.wav
test-clean/4077/13754/4077_13754_000010_000002.wav
test-clean/5105/28233/5105_28233_000019_000004.wav
test-clean/5683/32879/5683_32879_000027_000003.wav
test-clean/8224/274384/8224_274384_000017_000002.wav