Recently popularized self-supervised models offer a solution to the problem of low data availability through frugal transfer learning. We investigate the effectiveness of these multilingual acoustic models, in this case wav2vec 2.0 XLSR-53 and wav2vec 2.0 XLSR-128, on the transcription task for the Ewondo language (spoken in Cameroon). The experiments were conducted on 11 minutes of speech built from 103 read sentences. Despite the strong generalization capacity of multilingual acoustic models, preliminary results show that the distance between the languages embedded in XLSR (English, French, Spanish, German, Mandarin, . . . ) and Ewondo strongly impacts the performance of the transcription model. The best performances obtained are around 69% WER and 28.1% CER. These preliminary results are analyzed and interpreted in order to propose effective directions for improvement.
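The reported WER and CER are standard edit-distance metrics: the Levenshtein distance between the reference and hypothesis transcripts, normalized by the reference length, computed over words for WER and over characters for CER. A minimal sketch of how they are computed (the sentences below are illustrative, not drawn from the paper's Ewondo data):

```python
def levenshtein(ref, hyp):
    """Classic dynamic-programming edit distance between two token sequences."""
    m, n = len(ref), len(hyp)
    dp = list(range(n + 1))  # dp[j] = distance(ref[:i], hyp[:j]) for current i
    for i in range(1, m + 1):
        prev, dp[0] = dp[0], i
        for j in range(1, n + 1):
            cur = dp[j]
            dp[j] = min(dp[j] + 1,                          # deletion
                        dp[j - 1] + 1,                      # insertion
                        prev + (ref[i - 1] != hyp[j - 1]))  # substitution
            prev = cur
    return dp[n]

def wer(reference, hypothesis):
    """Word error rate: word-level edit distance / number of reference words."""
    ref_words = reference.split()
    return levenshtein(ref_words, hypothesis.split()) / len(ref_words)

def cer(reference, hypothesis):
    """Character error rate: character-level edit distance / reference length."""
    return levenshtein(list(reference), list(hypothesis)) / len(reference)
```

Because a single wrong word often differs from the reference by only a few characters, CER is typically much lower than WER on the same output, which is consistent with the gap between the 69% WER and 28.1% CER reported above.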