So who is really ordering that Christmas present from Amazon for delivery to Grandma that will be stolen from her porch? Or that pizza going to an address not on your contact list? Or that ecommerce buy placed over your smartphone?

Yes, now we have to worry about someone ripping us off by impersonating us in smart speaker and voice command orders.

Help me, Siri … Cortana … Alexa … Google Assistant [no name yet]?

Amid the havoc constantly being wreaked on technology users by hackers and ransomware writers, Finnish researchers declare in a new study that we have to be careful about who speaks to our gizmos.

“Skilful voice impersonators are able to fool state-of-the-art speaker recognition systems, as these systems generally aren’t efficient yet in recognizing voice modifications,” says the research team at the University of Eastern Finland.

“The vulnerability of speaker recognition systems poses significant security concerns.”


Their findings have just been published on ScienceDirect.

Voice hackers

“Voice attacks against speaker recognition can be done using technical means, such as voice conversion, speech synthesis and replay attacks,” they add.

“The scientific community is systematically developing techniques and countermeasures against technically generated attacks. However, voice modifications produced by a human, such as impersonation and voice disguise, cannot be easily detected with the developed countermeasures.”

So now we are getting “voice hackers.”

The warning comes as new reports indicate many people will be using more than one voice-driven app on their phones and U.S. consumers are buying millions of smart home Internet of Things devices – many of which can be commanded by voice. Making matters worse, some smart voices can now talk across platforms.


Inside the study

Participants in the study “were asked to modify their voices to fake their age, attempting to sound like an old person and like a child.” Impersonators “were able to fool automatic systems and listeners in mimicking some speakers.”

One “successful strategy for voice modification was to sound like a child, as both automatic systems’ and listeners’ performance degraded with this type of disguise,” the study says.

Here are the three basic points of how the study was conducted:

• “We study the effects of voice disguise on speaker verification on a corpus of 60 native Finnish speakers from acoustic and perceptual perspectives based on automatic speaker verification system performance.

• “Acoustic analyses with statistical tests reveal the difference in fundamental frequency and formant frequencies between natural and disguised voices.

• “The listening test with 70 subjects indicates the correspondence between perceptual and automatic speaker recognition evaluation.”

Well, are you going to give a second thought before you buy a smart speaker or place a voice order?

You’ve been warned!

FYI, the abstract of the study follows:

Acoustical and perceptual study of voice disguise by age modification in speaker verification

The task of speaker recognition is feasible when the speakers are co-operative or wish to be recognized. While modern automatic speaker verification (ASV) systems and some listeners are good at recognizing speakers from modal, unmodified speech, the task becomes notoriously difficult in situations of deliberate voice disguise when the speaker aims at masking his or her identity. We approach voice disguise from the perspective of acoustical and perceptual analysis using a self-collected corpus of 60 native Finnish speakers (31 female, 29 male) producing utterances in normal, intended young and intended old voice modes. The normal voices form a starting point and we are interested in studying how the two disguise modes impact the acoustical parameters and perceptual speaker similarity judgments.

First, we study the effect of disguise as a relative change in fundamental frequency (F0) and formant frequencies (F1 to F4) from modal to disguised utterances. Next, we investigate whether or not speaker comparisons that are deemed easy or difficult by a modern ASV system have a similar difficulty level for the human listeners. Further, we study affecting factors from listener-related self-reported information that may explain a particular listener’s success or failure in speaker similarity assessment.

Our acoustic analysis reveals a systematic increase in relative change in mean F0 for the intended young voices while for the intended old voices, the relative change is less prominent in most cases. Concerning the formants F1 through F4, 29% (for male) and 30% (for female) of the utterances did not exhibit a significant change in any formant value, while the remaining ∼70% of utterances had significant changes in at least one formant.

Our listening panel consists of 70 listeners, 32 native and 38 non-native, who listened to 24 utterance pairs selected using rankings produced by an ASV system. The results indicate that speaker pairs categorized as easy by our ASV system were also easy for the average listener. Similarly, the listeners made more errors in the difficult trials. The listening results indicate that target (same speaker) trials were more difficult for the non-native group, while the performance for the non-target pairs was similar for both native and non-native groups.
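For readers wondering what a "relative change in fundamental frequency" looks like in practice, here is a minimal Python sketch. It assumes the common definition (disguised minus modal, divided by modal); the abstract does not spell out the researchers' exact formula, and the function name and the pitch values below are made up purely for illustration.

```python
def relative_change(modal_hz: float, disguised_hz: float) -> float:
    """Change from the modal (normal) value to the disguised value,
    expressed as a fraction of the modal value.

    Assumed definition; the abstract does not give the exact formula.
    """
    if modal_hz <= 0:
        raise ValueError("modal value must be a positive frequency in Hz")
    return (disguised_hz - modal_hz) / modal_hz


# Hypothetical mean F0 values in Hz for one speaker (not from the study).
modal_f0 = 120.0  # normal voice
young_f0 = 180.0  # intended young (child-like) voice
old_f0 = 114.0    # intended old voice

print(f"young: {relative_change(modal_f0, young_f0):+.0%}")
print(f"old:   {relative_change(modal_f0, old_f0):+.0%}")
```

With these invented numbers, the child-like voice shifts pitch far more than the old-sounding one, which mirrors the pattern the abstract describes without reproducing any of its actual measurements.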

Read more about the study at: