Demo Page

▶ Voice reconstruction with varying number of principal components

These audio samples demonstrate how the number of principal components affects voice reconstruction fidelity. The voices listed here are ones whose VITS reconstruction sounds noticeably different when limited to eight components (N=8) compared to using all components (Full). The same voices reconstructed with the two ECAPA-TDNN models, especially ECAPA-TDNN (NANSY), retain relatively high fidelity even at N=8.

ID	VITS				ECAPA-TDNN (SV)				ECAPA-TDNN (NANSY)
ID	N=8	N=16	N=32	Full	N=8	N=16	N=32	Full	N=8	N=16	N=32	Full
train-clean-360/288/131218/288_131218_000007_000002.wav
train-clean-360/2751/142363/2751_142363_000014_000002.wav
train-clean-360/3482/170452/3482_170452_000033_000002.wav
train-other-500/310/129055/310_129055_000034_000000.wav
train-other-500/432/122761/432_122761_000004_000002.wav
train-other-500/1051/133886/1051_133886_000082_000003.wav
train-other-500/2013/147610/2013_147610_000077_000001.wav
train-other-500/3641/134615/3641_134615_000025_000000.wav
train-other-500/6106/58195/6106_58195_000026_000002.wav
train-other-500/7764/106805/7764_106805_000011_000001.wav
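
As a rough illustration of what limiting the number of principal components means, the sketch below fits PCA to a matrix of speaker embeddings and rebuilds one embedding from its top N components only. It is not the code behind this demo: the random `embeddings` matrix is a stand-in for real speaker embeddings, the 64-dimensional size is arbitrary, and `fit_pca` and `reconstruct` are hypothetical helper names.

```python
import numpy as np


def fit_pca(embeddings):
    """Return the mean and the principal directions (one per row), sorted by variance."""
    mean = embeddings.mean(axis=0)
    # Rows of vt are orthonormal principal directions of the centered data.
    _, _, vt = np.linalg.svd(embeddings - mean, full_matrices=False)
    return mean, vt


def reconstruct(z, mean, components, n=None):
    """Project z onto the first n principal directions (all of them if n is None)."""
    w = components if n is None else components[:n]
    return mean + (z - mean) @ w.T @ w


# Toy data (assumption): 200 random 64-dimensional "speaker embeddings".
rng = np.random.default_rng(0)
embeddings = rng.normal(size=(200, 64)) * np.linspace(3.0, 0.1, 64)

mean, components = fit_pca(embeddings)
z = embeddings[0]

# Reconstruction error for N=8, N=16, N=32 and Full (n=None).
errors = {
    n: float(np.linalg.norm(z - reconstruct(z, mean, components, n)))
    for n in (8, 16, 32, None)
}
```

In this toy setting the error can only shrink as N grows and vanishes for the full set of components. How audible the remaining error is depends on the speaker encoder, which is what the samples above let you compare.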

▶ Effects of shifting speaker embeddings along each principal direction

These audio samples illustrate how shifting speaker embeddings along different principal directions affects the resulting voices. Notice that shifting speaker embeddings beyond the 10th principal direction induces significantly fewer changes in voice for the two ECAPA-TDNN models than for VITS. Also, for ECAPA-TDNN (NANSY), only the first three principal directions are related to pitch changes. In contrast, for ECAPA-TDNN (SV), moving speaker embeddings along many nonconsecutive principal directions results in pitch changes: for example, directions 1, 3, 4, 5, 7, 9, 10, and 13 for the female voice below, and directions 2, 3, 4, 5, 6, 7, 10, 11, 12, and 15 for the male voice below.

Female voice ID: train-clean-100/7511/102419/7511_102419_000028_000012.wav

Index i	VITS		ECAPA-TDNN (SV)		ECAPA-TDNN (NANSY)
Index i	z-4.0σᵢwᵢ	z+4.0σᵢwᵢ	z-4.0σᵢwᵢ	z+4.0σᵢwᵢ	z-4.0σᵢwᵢ	z+4.0σᵢwᵢ
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16

Male voice ID: train-clean-100/1040/133433/1040_133433_000115_000001.wav

Index i	VITS		ECAPA-TDNN (SV)		ECAPA-TDNN (NANSY)
Index i	z-4.0σᵢwᵢ	z+4.0σᵢwᵢ	z-4.0σᵢwᵢ	z+4.0σᵢwᵢ	z-4.0σᵢwᵢ	z+4.0σᵢwᵢ
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
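
The column headers in the tables above describe a shift of the speaker embedding z by 4.0σ_i along the i-th principal direction w_i, in both the negative and positive sense. A minimal sketch of that operation follows. It assumes that w_i is a unit-length PCA direction and that σ_i is the standard deviation of the embeddings along it; the random `embeddings` matrix and the helper name `shift_embedding` are hypothetical and not taken from the demo's code.

```python
import numpy as np


def shift_embedding(z, w_i, sigma_i, scale=4.0):
    """Return (z - scale*sigma_i*w_i, z + scale*sigma_i*w_i)."""
    w_i = w_i / np.linalg.norm(w_i)  # make sure the direction has unit length
    offset = scale * sigma_i * w_i
    return z - offset, z + offset


# Toy data (assumption): 200 random 64-dimensional "speaker embeddings".
rng = np.random.default_rng(0)
embeddings = rng.normal(size=(200, 64)) * np.linspace(3.0, 0.1, 64)

# PCA via SVD: rows of vt are principal directions, sigmas their standard deviations.
mean = embeddings.mean(axis=0)
_, s, vt = np.linalg.svd(embeddings - mean, full_matrices=False)
sigmas = s / np.sqrt(len(embeddings) - 1)

z = embeddings[0]
i = 3  # 1-based index, as in the tables
z_minus, z_plus = shift_embedding(z, vt[i - 1], sigmas[i - 1])
```

Each shifted embedding would then be passed to the synthesizer in place of z. Because the principal directions are orthogonal, the shift leaves the coordinates of z along every other direction unchanged.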

▶ Searching for the target voice

These audio samples are from the user studies. For each target voice (Easy-1/2, Hard-1/2/3/4/5), you can listen to the voices found by users (with both average and optimal initialization) and to the simulated search results.

Sample: Easy-1. Original recording: . Resynthesized audio: .

	Init	Query 1	Query 2	Query 3	Query 4	Query 5	Query 6	Query 7	Query 8	Query 9	Query 10	Final
Simulated search (# PC=16, # Iter=2)
User search (average initialization)
User search (optimal initialization)

Sample: Easy-2. Original recording: . Resynthesized audio: .

	Init	Query 1	Query 2	Query 3	Query 4	Query 5	Query 6	Query 7	Query 8	Query 9	Query 10	Final
Simulated search (# PC=16, # Iter=2)
User search (average initialization)
User search (optimal initialization)

Sample: Hard-1. Original recording: . Resynthesized audio: .

	Init	Query 1	Query 2	Query 3	Query 4	Query 5	Query 6	Query 7	Query 8	Query 9	Query 10	Final
Simulated search (# PC=16, # Iter=2)
User search (average initialization)
User search (optimal initialization)

Sample: Hard-2. Original recording: . Resynthesized audio: .

	Init	Query 1	Query 2	Query 3	Query 4	Query 5	Query 6	Query 7	Query 8	Query 9	Query 10	Final
Simulated search (# PC=16, # Iter=2)
User search (average initialization)
User search (optimal initialization)

Sample: Hard-3. Original recording: . Resynthesized audio: .

	Init	Query 1	Query 2	Query 3	Query 4	Query 5	Query 6	Query 7	Query 8	Query 9	Query 10	Final
Simulated search (# PC=16, # Iter=2)
User search (average initialization)
User search (optimal initialization)

Sample: Hard-4. Original recording: . Resynthesized audio: .

	Init	Query 1	Query 2	Query 3	Query 4	Query 5	Query 6	Query 7	Query 8	Query 9	Query 10	Final
Simulated search (# PC=16, # Iter=2)
User search (average initialization)
User search (optimal initialization)

Sample: Hard-5. Original recording: . Resynthesized audio: .

	Init	Query 1	Query 2	Query 3	Query 4	Query 5	Query 6	Query 7	Query 8	Query 9	Query 10	Final
Simulated search (# PC=16, # Iter=2)
User search (average initialization)
User search (optimal initialization)

▶ Voice editing within the latent space

These audio samples demonstrate the quality of voice attribute editing. The full set of samples used in the listening test is available for download.

ID	Resynthesized Voice	v₁ low_pitch - high_pitch		v₂ flat - expressive		v₃ low_vol - high_vol		v₄ relaxed - strained		v₅ less_nasal - more_nasal		v₆ bright - muffled
ID	Resynthesized Voice	z-4.0σ₁v₁	z+4.0σ₁v₁	z-4.0σ₂v₂	z+4.0σ₂v₂	z-4.0σ₃v₃	z+4.0σ₃v₃	z-4.0σ₄v₄	z+4.0σ₄v₄	z-4.0σ₅v₅	z+4.0σ₅v₅	z-4.0σ₆v₆	z+4.0σ₆v₆
dev-clean/6345/93302/6345_93302_000037_000003.wav
dev-clean/2902/9008/2902_9008_000008_000004.wav
test-clean/121/127105/121_127105_000016_000000.wav
test-clean/1320/122612/1320_122612_000056_000003.wav
test-clean/1580/141083/1580_141083_000056_000005.wav
test-other/3764/168671/3764_168671_000003_000005.wav
test-clean/4077/13754/4077_13754_000010_000002.wav
test-clean/5105/28233/5105_28233_000019_000004.wav
test-clean/5683/32879/5683_32879_000027_000003.wav
test-clean/8224/274384/8224_274384_000017_000002.wav