Neutral-to-emotional voice conversion with cross-wavelet transform F0 using generative adversarial networks

Luo, Zhaojie; Chen, Jinhui; Takiguchi, Tetsuya; Ariki, Yasuo

https://hdl.handle.net/20.500.14094/90005712

Available files

File	Format	Size	Views	Description
90005712 (fulltext)	pdf	1.48 MB	8

Metadata

Metadata ID	90005712
Access rights	open access
Publication type	Version of Record
Title	Neutral-to-emotional voice conversion with cross-wavelet transform F0 using generative adversarial networks
Authors	Luo, Zhaojie ; Chen, Jinhui ; Takiguchi, Tetsuya ; Ariki, Yasuo
Author details	Luo, Zhaojie
	Chen, Jinhui (陳, 金輝; チン, キンキ); Affiliation: Center for Computational Social Science (計算社会科学研究センター); Author ID: A1176; Researcher ID: 1000050777810; KUID: https://kuid-rm-web.ofc.kobe-u.ac.jp/search/detail?systemId=74d8cdfb000c0b49520e17560c007669
	Takiguchi, Tetsuya (滝口, 哲也; タキグチ, テツヤ); Affiliation: Research Center for Urban Safety and Security (都市安全研究センター); Author ID: A1279; Researcher ID: 1000040397815; KUID: https://kuid-rm-web.ofc.kobe-u.ac.jp/search/detail?systemId=b3ec2a1710d8267b520e17560c007669
	Ariki, Yasuo (有木, 康雄; アリキ, ヤスオ); Affiliation: Research Center for Urban Safety and Security (都市安全研究センター); Author ID: A0260; Researcher ID: 1000010135519; KUID: https://kuid-rm-web.ofc.kobe-u.ac.jp/search/detail?systemId=09a784b8ffbc912c520e17560c007669
Journal	APSIPA Transactions on Signal and Information Processing
Volume (issue)	8
Pages	e10-e10
Publisher	Cambridge University Press
Publication date	2019-03-04
Date made available	2019-03-19
Abstract	In this paper, we propose a novel neutral-to-emotional voice conversion (VC) model that can effectively learn a mapping from neutral to emotional speech with limited emotional voice data. Although conventional VC techniques have achieved tremendous success in spectral conversion, the lack of representations in fundamental frequency (F0), which explicitly represents prosody information, is still a major limiting factor for emotional VC. To overcome this limitation, in our proposed model, we outline the practical elements of the cross-wavelet transform (XWT) method, highlighting how such a method is applied in synthesizing diverse representations of F0 features in emotional VC. The idea is (1) to decompose F0 into different temporal level representations using continuous wavelet transform (CWT); (2) to use XWT to combine different CWT-F0 features to synthesize interaction XWT-F0 features; and (3) to use both the CWT-F0 and corresponding XWT-F0 features to train the emotional VC model. Moreover, to better measure similarities between the converted and real F0 features, we applied a VA-GAN training model, which combines a variational autoencoder (VAE) with a generative adversarial network (GAN). In the VA-GAN model, VAE learns the latent representations of high-dimensional features (CWT-F0, XWT-F0), while the discriminator of the GAN can use the learned feature representations as a basis for a VAE reconstruction objective.
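Steps (1) and (2) of the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the Mexican hat mother wavelet, the five dyadic scales, the synthetic F0 contours, and the helper names `mexican_hat`, `cwt_f0`, and `xwt` are all assumptions made for illustration. The cross-wavelet transform is taken in its standard form, the CWT of one signal multiplied by the complex conjugate of the CWT of the other; which pairs of CWT-F0 features the paper actually combines is not stated in this record, so pairing a neutral and an emotional contour here is also an assumption.

```python
import numpy as np

def mexican_hat(t):
    """Mexican hat (Ricker) mother wavelet with unit-energy normalisation."""
    return (2.0 / (np.sqrt(3.0) * np.pi ** 0.25)) * (1.0 - t ** 2) * np.exp(-0.5 * t ** 2)

def cwt_f0(log_f0, scales):
    """CWT of a 1-D log-F0 contour; returns one row of coefficients per scale."""
    # Normalise to zero mean and unit variance before the transform.
    x = (log_f0 - log_f0.mean()) / (log_f0.std() + 1e-8)
    n = len(x)
    out = np.empty((len(scales), n))
    for i, s in enumerate(scales):
        half = int(np.ceil(5 * s))               # kernel covers +-5 scale units
        t = np.arange(-half, half + 1) / s
        kernel = mexican_hat(t) / np.sqrt(s)
        full = np.convolve(x, kernel, mode="full")
        out[i] = full[half:half + n]             # crop back to the input length
    return out

def xwt(w_x, w_y):
    """Cross-wavelet transform: W_x times the complex conjugate of W_y."""
    return w_x * np.conj(w_y)

# Synthetic F0 contours in Hz (assumed values, 200 voiced frames).
t = np.linspace(0.0, 2.0, 200)
neutral_f0 = 120.0 + 10.0 * np.sin(2.0 * np.pi * 1.0 * t)
emotional_f0 = 180.0 + 40.0 * np.sin(2.0 * np.pi * 1.5 * t)

scales = 2.0 ** np.arange(5)                     # 1, 2, 4, 8, 16 frames (assumed)
w_neutral = cwt_f0(np.log(neutral_f0), scales)
w_emotional = cwt_f0(np.log(emotional_f0), scales)
w_cross = xwt(w_neutral, w_emotional)

# Step (3) would feed both feature sets to the conversion model; here they
# are simply stacked along the scale axis.
features = np.concatenate([w_emotional, w_cross], axis=0)
```

Because the Mexican hat wavelet is real-valued, the conjugate has no effect in this sketch and the XWT reduces to an element-wise product; with a complex wavelet such as Morlet, the result would also carry relative phase between the two contours.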
Keywords	Continuous wavelet transform
	Emotional voice conversion
	Generative adversarial networks
	Variational autoencoder
	F0 features
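Category	Center for Computational Social Science (計算社会科学研究センター)
	Research Center for Urban Safety and Security (都市安全研究センター)
	Journal articles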
Rights	© The Authors, 2019.
Rights	This is an Open Access article, distributed under the terms of the Creative Commons Attribution licence (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted re-use, distribution, and reproduction in any medium, provided the original work is properly cited.

Resource type	journal article
Language	English
eISSN	2048-7703
Related information	DOI https://doi.org/10.1017/ATSIP.2019.3