These audio samples demonstrate how the number of principal components affects voice reconstruction fidelity. The voices listed here sound noticeably different under VITS when reconstruction is limited to eight components (N=8) rather than all components (Full). The same voices generated with the two ECAPA-TDNN models, especially ECAPA-TDNN (NANSY), retain relatively high fidelity even at N=8.
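As a rough illustration of what limiting the number of components means, the sketch below truncates a set of vectors to their top N principal components and maps them back to the original space. It assumes the reduction is a standard PCA over speaker embeddings, which this page does not spell out. The function name `pca_truncate`, the 64-dimensional embedding size, and the synthetic data are all hypothetical stand-ins, not the models or data behind the samples.

```python
import numpy as np

def pca_truncate(embeddings, n_components):
    """Project onto the top principal components, then map back."""
    mean = embeddings.mean(axis=0)
    centered = embeddings - mean
    # Rows of vt are principal directions, sorted by decreasing variance.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    basis = vt[:n_components]
    return centered @ basis.T @ basis + mean

rng = np.random.default_rng(0)
# Hypothetical stand-in for speaker embeddings: 200 speakers, 64 dimensions,
# with variance that decays across dimensions.
embeddings = rng.normal(size=(200, 64)) * np.linspace(3.0, 0.1, 64)

errors = {}
for n in (8, 16, 32, 64):  # 64 plays the role of "Full" in this toy setup
    approx = pca_truncate(embeddings, n)
    errors[n] = float(np.mean((embeddings - approx) ** 2))
    print(f"N={n}: mean squared reconstruction error = {errors[n]:.6f}")
```

In this toy setup the reconstruction error shrinks as N grows and vanishes, up to floating-point precision, when all components are kept. How much audible difference a given error produces depends on the model that consumes the embedding, which is what the samples let you judge by ear.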
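ID | VITS (N=8) | VITS (N=16) | VITS (N=32) | VITS (Full) | ECAPA-TDNN (SV) (N=8) | ECAPA-TDNN (SV) (N=16) | ECAPA-TDNN (SV) (N=32) | ECAPA-TDNN (SV) (Full) | ECAPA-TDNN (NANSY) (N=8) | ECAPA-TDNN (NANSY) (N=16) | ECAPA-TDNN (NANSY) (N=32) | ECAPA-TDNN (NANSY) (Full) |
---|---|---|---|---|---|---|---|---|---|---|---|---|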
train-clean-360/288/131218/288_131218_000007_000002.wav | ||||||||||||
train-clean-360/2751/142363/2751_142363_000014_000002.wav | ||||||||||||
train-clean-360/3482/170452/3482_170452_000033_000002.wav | ||||||||||||
train-other-500/310/129055/310_129055_000034_000000.wav | ||||||||||||
train-other-500/432/122761/432_122761_000004_000002.wav | ||||||||||||
train-other-500/1051/133886/1051_133886_000082_000003.wav | ||||||||||||
train-other-500/2013/147610/2013_147610_000077_000001.wav | ||||||||||||
train-other-500/3641/134615/3641_134615_000025_000000.wav | ||||||||||||
train-other-500/6106/58195/6106_58195_000026_000002.wav | ||||||||||||
train-other-500/7764/106805/7764_106805_000011_000001.wav |