Multimodal pre-train then transfer learning approach for speaker recognition

Published in: Multimedia Tools and Applications

Abstract

Cognitive science has established a strong correlation between faces and voices, as both kinds of information are processed along neuro-cognitive pathways that share a common structure. The face-voice association task has recently drawn the attention of the computer vision community with the introduction of large-scale face-voice data. Our work leverages this shared structure of faces and voices, together with the availability of large-scale face-voice data, to improve speaker recognition tasks including identification and verification. To this end, we propose novel multimodal systems, one with weight sharing and another without, that learn joint representations of the two modalities and thereby establish the face-voice association. Features extracted from the trained multimodal networks, which capture this association, are then used to perform speaker recognition. We evaluate the proposed multimodal networks on speaker recognition and face-voice association tasks on challenging benchmark datasets, including VoxCeleb1 and MAV-Celeb. Our results show that adding facial information improves speaker recognition performance.
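As a rough illustration of the pre-train-then-transfer pipeline summarised above, the following PyTorch sketch pairs a face branch and a voice branch, optionally sharing weights, trains them with a cross-modal objective to establish the face-voice association, and then reuses the frozen network as a feature extractor for speaker identification. The module names, embedding dimensions, and contrastive loss below are illustrative assumptions, not the authors' exact architecture or objective.

```python
# Minimal sketch of pre-train (face-voice association) then transfer (speaker recognition).
# Dimensions, heads, and the contrastive objective are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ProjectionHead(nn.Module):
    """Maps a modality-specific embedding into the shared face-voice space."""
    def __init__(self, in_dim, out_dim=256):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(in_dim, 512), nn.ReLU(), nn.Linear(512, out_dim))

    def forward(self, x):
        return F.normalize(self.net(x), dim=-1)

class FaceVoiceNet(nn.Module):
    """Two-branch network; share_weights=True uses one head for both modalities."""
    def __init__(self, face_dim=512, voice_dim=512, share_weights=False):
        super().__init__()
        self.face_head = ProjectionHead(face_dim)
        self.voice_head = self.face_head if share_weights else ProjectionHead(voice_dim)

    def forward(self, face_emb, voice_emb):
        return self.face_head(face_emb), self.voice_head(voice_emb)

def association_loss(f, v, temperature=0.07):
    """Symmetric cross-modal contrastive loss pulling matched face-voice pairs together."""
    logits = f @ v.t() / temperature
    targets = torch.arange(f.size(0), device=f.device)
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))

# Stage 1: pre-train on paired face/voice embeddings (random tensors stand in for real data).
model = FaceVoiceNet(share_weights=True)
opt = torch.optim.Adam(model.parameters(), lr=1e-4)
face_batch, voice_batch = torch.randn(32, 512), torch.randn(32, 512)
opt.zero_grad()
f, v = model(face_batch, voice_batch)
association_loss(f, v).backward()
opt.step()

# Stage 2: freeze the trained network and use its outputs as features
# for a downstream speaker identification / verification classifier.
with torch.no_grad():
    speaker_features = torch.cat(model(face_batch, voice_batch), dim=-1)
classifier = nn.Linear(speaker_features.size(-1), 1251)  # VoxCeleb1 has 1,251 identities
logits = classifier(speaker_features)
```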


Data Availability

In our experiments, we use the VoxCeleb1 dataset [3] to evaluate the proposed method. VoxCeleb1 is approximately gender balanced, with 55% of the speakers male, and the speakers span a wide range of ethnicities, accents, professions and ages. A data privacy notice is available on the official website of the dataset [69]. Specifically, we extract face embeddings using an Inception-ResNet-V1 network trained with triplet loss, similar to the work of Schroff et al. [62], and we extract audio embeddings (\(\textbf{e}_i\)) using an utterance-level aggregator [34] trained on a speaker recognition task.
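The snippet below is a minimal sketch of this embedding-extraction step, assuming the facenet_pytorch implementation of Inception-ResNet-V1 pretrained on VGGFace2; the authors' exact checkpoints are not specified here, and the utterance-level aggregator of [34] is left as a hypothetical placeholder rather than a real API.

```python
# Sketch of the embedding-extraction step; facenet_pytorch weights are an assumption,
# and the audio branch is a placeholder, not the authors' released model.
import torch
from facenet_pytorch import InceptionResnetV1

# Face branch: Inception-ResNet-V1 trained with a triplet-style objective [62].
face_model = InceptionResnetV1(pretrained='vggface2').eval()

def extract_face_embedding(face_crop: torch.Tensor) -> torch.Tensor:
    """face_crop: (3, 160, 160) aligned, normalized face image; returns a 512-d embedding."""
    with torch.no_grad():
        return face_model(face_crop.unsqueeze(0)).squeeze(0)

# Audio branch: an utterance-level aggregator trained for speaker recognition [34].
# Hypothetical placeholder: substitute the pretrained aggregator actually used.
def load_utterance_aggregator():
    raise NotImplementedError("Load the pretrained utterance-level aggregator here.")
```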

References

  1. Bai Z, Zhang XL (2021) Speaker recognition based on deep learning: an overview. Neural Netw 140:65–99

  2. Chung JS, Nagrani A, Zisserman A (2018) VoxCeleb2: deep speaker recognition. In: INTERSPEECH

  3. Nagrani A, Chung JS, Zisserman A (2017) VoxCeleb: a large-scale speaker identification dataset. In: INTERSPEECH

  4. Jung JW, Kim YJ, Heo HS, Lee BJ, Kwon Y, Chung JS (2022) Pushing the limits of raw waveform speaker recognition. In: Proc. Interspeech

  5. Stoll LL (2011) Finding difficult speakers in automatic speaker recognition. PhD thesis, EECS Department, University of California, Berkeley. http://www2.eecs.berkeley.edu/Pubs/TechRpts/2011/EECS-2011-152.html

  6. Reynolds DA, Quatieri TF, Dunn RB (2000) Speaker verification using adapted Gaussian mixture models. Digit Sig Process 10(1–3):19–41

  7. Kenny P (2005) Joint factor analysis of speaker and session variability: theory and algorithms. CRIM, Montreal (Report) CRIM-06/08–13 14(28–29):2

  8. Chatfield K, Simonyan K, Vedaldi A, Zisserman A (2014) Return of the devil in the details: delving deep into convolutional nets. In: British machine vision conference

  9. He K, Zhang X, Ren S, Sun J (2016) Deep residual learning for image recognition. In: Proceedings of the IEEE conference on computer vision and pattern recognition. pp 770–778

  10. Nagrani A, Albanie S, Zisserman A (2018) Seeing voices and hearing faces: cross-modal biometric matching. In: Proceedings of the IEEE conference on computer vision and pattern recognition. pp 8427–8436

  11. Nagrani A, Albanie S, Zisserman A (2018) Learnable pins: cross-modal embeddings for person identity. In: Proceedings of the European conference on computer vision (ECCV). pp 71–88

  12. Saeed MS, Nawaz S, Khan MH, Zaheer MZ, Nandakumar K, Yousaf MH, Mahmood A (2023) Single-branch network for multimodal training. In: ICASSP 2023–2023 IEEE international conference on acoustics, speech and signal processing (ICASSP). IEEE

  13. Horiguchi S, Kanda N, Nagamatsu K (2018) Face-voice matching using cross-modal embeddings. In: Proceedings of the 26th ACM international conference on multimedia. pp 1011–1019

  14. Nawaz S, Janjua MK, Gallo I, Mahmood A, Calefati A (2019) Deep latent space learning for cross-modal mapping of audio and visual signals. In: 2019 digital image computing: techniques and applications (DICTA). IEEE, pp 1–7

  15. Wen P, Xu Q, Jiang Y, Yang Z, He Y, Huang Q (2021) Seeking the shape of sound: an adaptive framework for learning voice-face association. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp 16347–16356

  16. Wen Y, Ismail MA, Liu W, Raj B, Singh R (2019) Disjoint mapping network for cross-modal matching of voices and faces. In: 7th International conference on learning representations, ICLR 2019, New Orleans, LA, USA

  17. Shah SH, Saeed MS, Nawaz S, Yousaf MH (2023) Speaker recognition in realistic scenario using multimodal data. In: 2023 3rd international conference on artificial intelligence (ICAI). IEEE, pp 209–213

  18. Saeed MS, Nawaz S, Khan MH, Javed S, Yousaf MH, Del Bue A (2022) Learning branched fusion and orthogonal projection for face-voice association. arXiv:2208.10238

  19. Nawaz S, Saeed MS, Morerio P, Mahmood A, Gallo I, Yousaf MH, Del Bue A (2021) Cross-modal speaker verification and recognition: a multilingual perspective. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp 1682–1691

  20. Albanie S, Nagrani A, Vedaldi A, Zisserman A (2018) Emotion recognition in speech using cross-modal transfer in the wild. In: Proceedings of the 26th ACM international conference on multimedia. pp 292–301

  21. Afouras T, Chung JS, Zisserman A (2018) The conversation: deep audio-visual speech enhancement. In: INTERSPEECH

  22. Koepke AS, Wiles O, Zisserman A (2018) Self-supervised learning of a facial attribute embedding from video. In: BMVC. pp 302

  23. Ellis AW (1989) Neuro-cognitive processing of faces and voices. In: Handbook of research on face processing. Elsevier, pp 207–215

  24. Kamachi M, Hill H, Lander K, Vatikiotis-Bateson E (2003) ‘Putting the face to the voice’: matching identity across modality. Curr Biol 13(19):1709–1714

  25. Kim C, Shin HV, Oh TH, Kaspar A, Elgharib M, Matusik W (2018) On learning associations of faces and voices. In: Asian conference on computer vision. Springer, pp 276–292

  26. Pruzansky S (1963) Pattern-matching procedure for automatic talker recognition. J Acoust Soc Am 35(3):354–358

  27. Dehak N, Kenny P, Dehak R, Glembek O, Dumouchel P, Burget L, Hubeika V, Castaldo F (2009) Support vector machines and joint factor analysis for speaker verification. In: 2009 IEEE international conference on acoustics, speech and signal processing. IEEE, pp 4237–4240

  28. Dehak N, Kenny PJ, Dehak R, Dumouchel P, Ouellet P (2011) Front-end factor analysis for speaker verification. IEEE Trans Audio Speech Lang Process 19(4):788–798

  29. Yapanel U, Zhang X, Hansen JH (2002) High performance digit recognition in real car environments. In: Seventh international conference on spoken language processing

  30. LeCun Y, Bengio Y, Hinton G (2015) Deep learning. Nature 521(7553):436–444

  31. Lei Y, Scheffer N, Ferrer L, McLaren M (2014) A novel scheme for speaker recognition using a phonetically-aware deep neural network. In: 2014 IEEE international conference on acoustics, speech and signal processing (ICASSP). IEEE, pp 1695–1699

  32. Snyder D, Garcia-Romero D, Sell G, Povey D, Khudanpur S (2018) X-vectors: Robust DNN embeddings for speaker recognition. In: 2018 IEEE International conference on acoustics, speech and signal processing (ICASSP). IEEE, pp 5329–5333

  33. Salman A, Chen K (2011) Exploring speaker-specific characteristics with deep learning. In: The 2011 international joint conference on neural networks. IEEE, pp 103–110

  34. Xie W, Nagrani A, Chung JS, Zisserman A (2019) Utterance-level aggregation for speaker recognition in the wild. In: ICASSP 2019-2019 IEEE international conference on acoustics, speech and signal processing (ICASSP). IEEE, pp 5791–5795

  35. Arandjelovic R, Gronat P, Torii A, Pajdla T, Sivic J (2016) NetVLAD: CNN architecture for weakly supervised place recognition. In: Proceedings of the IEEE conference on computer vision and pattern recognition. pp 5297–5307

  36. Zhong Y, Arandjelović R, Zisserman A (2019) GhostVLAD for set-based face recognition. In: Computer vision–ACCV 2018: 14th Asian conference on computer vision, Perth, Australia, December 2–6, 2018, Revised Selected Papers, Part II 14. Springer, pp 35–50

  37. Wang R, Ao J, Zhou L, Liu S, Wei Z, Ko T, Li Q, Zhang Y (2022) Multi-view self-attention based transformer for speaker recognition. In: ICASSP 2022–2022 IEEE international conference on acoustics, speech and signal processing (ICASSP). IEEE, pp 6732–6736

  38. India M, Safari P, Hernando J (2021) Double multi-head attention for speaker verification. In: ICASSP 2021-2021 IEEE international conference on acoustics, speech and signal processing (ICASSP). IEEE, pp 6144–6148

  39. Zhu H, Lee KA, Li H (2021) Serialized multi-layer multi-head attention for neural speaker embedding. In: Proc. Interspeech 2021. pp 106–110. https://doi.org/10.21437/Interspeech.2021-2210

  40. Wu CY, Hsu CC, Neumann U (2022) Cross-modal perceptionist: can face geometry be gleaned from voices? In: CVPR

  41. Wang J, Li C, Zheng A, Tang J, Luo B (2022) Looking and hearing into details: dual-enhanced Siamese adversarial network for audio-visual matching. IEEE Transactions on Multimedia

  42. Saeed MS, Khan MH, Nawaz S, Yousaf MH, Del Bue A (2022) Fusion and orthogonal projection for improved face-voice association. In: ICASSP 2022–2022 IEEE international conference on acoustics, speech and signal processing (ICASSP). IEEE, pp 7057–7061

  43. Baltrušaitis T, Ahuja C, Morency LP (2018) Multimodal machine learning: a survey and taxonomy. IEEE Trans Pattern Anal Mach Intell 41(2):423–443

  44. Vielzeuf V, Lechervy A, Pateux S, Jurie F (2018) Centralnet: a multilayer approach for multimodal fusion. In: Proceedings of the European conference on computer vision (ECCV) workshops

  45. Kiela D, Grave E, Joulin A, Mikolov T (2018) Efficient large-scale multi-modal classification. arXiv:1802.02892

  46. Kiela D, Firooz H, Mohan A, Goswami V, Singh A, Ringshia P, Testuggine D (2020) The hateful memes challenge: detecting hate speech in multimodal memes. Adv Neural Inf Process Syst 33:2611–2624

  47. Gallo I, Calefati A, Nawaz S (2017) Multimodal classification fusion in real-world scenarios. In: 2017 14th IAPR international conference on document analysis and recognition (ICDAR), vol 5. IEEE, pp 36–41

  48. Arshad O, Gallo I, Nawaz S, Calefati A (2019) Aiding intra-text representations with visual context for multimodal named entity recognition. In: 2019 international conference on document analysis and recognition (ICDAR). IEEE, pp 337–342

  49. Anderson P, He X, Buehler C, Teney D, Johnson M, Gould S, Zhang L (2018) Bottom-up and top-down attention for image captioning and visual question answering. In: Proceedings of the IEEE conference on computer vision and pattern recognition. pp 6077–6086

  50. Fukui A, Park DH, Yang D, Rohrbach A, Darrell T, Rohrbach M (2016) Multimodal compact bilinear pooling for visual question answering and visual grounding. arXiv:1606.01847

  51. Vinyals O, Toshev A, Bengio S, Erhan D (2015) Show and tell: a neural image caption generator. In: Proceedings of the IEEE conference on computer vision and pattern recognition. pp 3156–3164

  52. Yan L, Han C, Xu Z, Liu D, Wang Q (2023) Prompt learns prompt: exploring knowledge-aware generative prompt collaboration for video captioning. In: Proceedings of international joint conference on artificial intelligence (IJCAI). pp 1622–1630

  53. Yan L, Wang Q, Cui Y, Feng F, Quan X, Zhang X, Liu D (2022) GL-RG: global-local representation granularity for video captioning. arXiv:2205.10706

  54. Popattia M, Rafi M, Qureshi R, Nawaz S (2022) Guiding attention using partial order relationships for image captioning. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp 4671–4680

  55. Yan L, Liu D, Song Y, Yu C (2020) Multimodal aggregation approach for memory vision-voice indoor navigation with meta-learning. In: 2020 IEEE/RSJ international conference on intelligent robots and systems (IROS). IEEE, pp 5847–5854

  56. Nawaz S, Cavazza J, Del Bue A (2022) Semantically grounded visual embeddings for zero-shot learning. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp 4589–4599

  57. Wang L, Li Y, Lazebnik S (2016) Learning deep structure-preserving image-text embeddings. In: Proceedings of the IEEE conference on computer vision and pattern recognition. pp 5005–5013

  58. Nagrani A, Chung JS, Albanie S, Zisserman A (2020) Disentangled speech embeddings using cross-modal self-supervision. In: ICASSP 2020–2020 IEEE international conference on acoustics, speech and signal processing (ICASSP). IEEE, pp 6829–6833

  59. Hajavi A, Etemad A (2023) Audio representation learning by distilling video as privileged information. IEEE Transactions on Artificial Intelligence

  60. Nawaz S (2019) Multimodal representation and learning. PhD thesis, Università degli Studi dell’Insubria

  61. Szegedy C, Ioffe S, Vanhoucke V, Alemi AA (2017) Inception-v4, inception-resnet and the impact of residual connections on learning. In: Thirty-first AAAI conference on artificial intelligence

  62. Schroff F, Kalenichenko D, Philbin J (2015) FaceNet: a unified embedding for face recognition and clustering. In: Proceedings of the IEEE conference on computer vision and pattern recognition. pp 815–823

  63. Ioffe S, Szegedy C (2015) Batch normalization: accelerating deep network training by reducing internal covariate shift. In: International conference on machine learning. pp 448–456. PMLR

  64. Calefati A, Janjua MK, Nawaz S, Gallo I (2018) Git loss for deep face recognition. In: Proceedings of the British machine vision conference (BMVC)

  65. Sarı L, Singh K, Zhou J, Torresani L, Singhal N, Saraf Y (2021) A multi-view approach to audio-visual speaker verification. In: ICASSP 2021-2021 IEEE international conference on acoustics, speech and signal processing (ICASSP). IEEE, pp 6194–6198

  66. Zheng A, Hu M, Jiang B, Huang Y, Yan Y, Luo B (2021) Adversarial-metric learning for audio-visual cross-modal matching. IEEE Trans Multimedia 24:338–351

  67. Ning H, Zheng X, Lu X, Yuan Y (2021) Disentangled representation learning for cross-modal biometric matching. IEEE Trans Multimedia 24:1763–1774

  68. Deng J, Guo J, Xue N, Zafeiriou S (2019) ArcFace: additive angular margin loss for deep face recognition. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp 4690–4699

  69. VGG Dataset Privacy Notice, robots.ox.ac.uk. https://www.robots.ox.ac.uk/~vgg/terms/url-lists-privacy-notice.html. Accessed 01 Jan 2024

Author information

Corresponding author

Correspondence to Summaira Jabeen.

Ethics declarations

Conflicts of interest

The authors declare that they have no conflict of interest.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Jabeen, S., Amin, M.S. & Li, X. Multimodal pre-train then transfer learning approach for speaker recognition. Multimed Tools Appl (2024). https://doi.org/10.1007/s11042-024-18575-4
