LONG READ: What comms people need to know about deep fake audio 

I’d heard that the biggest single threat posed by deepfakes isn’t video at all. It’s deepfake audio.

Audio rather than video, the argument runs, is easier to create and easier to believe. 

Of course, I put this to one side, on the pile of things I’d look at when I got around to it.

Days later, and just hours before Labour leader Keir Starmer was due to make his speech to the Labour Party conference, audio was posted that appeared to be of him swearing at an aide for forgetting a tablet.

“F-king idiot,” he appears to say. “I f-king told you, didn’t I? F-ks sake. Bloody moron. No, I’m sick of it.”  

Here’s the clip.

The 25-second clip, posted to X, formerly Twitter, by an anonymous account with a track record of criticising Starmer, appears to show the politician caught in a private moment.

Online, critics of the opposition leader gleefully greeted the clip as evidence that he has feet of clay.

A follow-up tweet claimed to show a screengrab from the left-wing Skwawkbox website quoting a named audio engineer as saying the audio was authentic.

Well, then it must be the real deal, mustn’t it? 

But then Conservative MP Simon Clarke spoke out against the fake clip and asked people to ignore it. Sky News also debunked the recording, as did the Politico website, the Daily Mirror and others.

Ernest Hemingway wrote that everyone has a bullsh-t detector. 

I was detecting bullsh-t.

Interestingly, the Labour Party weren’t all out shooting the audio down. I can see three good reasons for this. First, how fast could the contentious audio be authoritatively fact-checked? It would take time. Secondly, what happens when we deny? Rebuttal should never repeat the lie, because repeating it just reinforces it.

Thirdly, a denial can breathe life into a story. What’s that? What’s the thing they don’t want me to hear? I’m off to hear it.

Experience shows the best source of online debunking is a trusted third party.

Here, it was a Conservative MP.  

But we’ve known this since The Guardian and the LSE’s excellent Reading the Riots research. In it, amid the riots, blogger Andy Mabbett debunks the rumour that Birmingham Children’s Hospital was on fire. That hospital, Andy pointed out, is right opposite Steelhouse Lane Police Station. That’s hardly going to happen, is it? As a result the rumour stopped being shared.

But it got me thinking: why audio? How easy is it to create? And can fake audio be spotted?

Why audio? It’s got a long history 

There’s a long track record of secret recordings tripping people up. Long before the internet was invented there was a cricketer called Ian Botham. Fans loved him. The cricket establishment hated him because, in good story fashion, he was a maverick who didn’t play by the rules.

So, when Ian Botham was secretly recorded at a charity event calling those who ran cricket ‘gin-slinging dodderers’, it didn’t go down well.

Nor did US politician Mitt Romney fare well when he was secretly recorded criticising the poor at a dinner for rich backers.

Google the term ‘secret recording’ and you’ll find Jose Mourinho, Donald Trump, Charles III and Nicolas Sarkozy.

In short, secret audio has history.

Deep fake audio also has an emerging history of influencing elections.

In the last couple of weeks, the Sudanese Civil War has been blighted by deep fake audio. More worryingly, the knife-edge Slovakian election saw the Progressive Slovakia party holding a narrow lead before faked audio circulated of its leader discussing how to rig the election.

When the ballots were counted, they had lost.

Those people warning that deep fake audio will have an impact have got a point.

A problem far, far away?

Of course, it’s tempting to look at this and dismiss it as a story about far, far away.

I don’t think it is. This is likely to drop into the inbox of public sector communications people.

First, a story.

In 2009, the far-right English Defence League came to Birmingham to protest. Twitter was still in its infancy and we were all working out how it could be used. A tweet from one of the group claimed that a white youth had been attacked by a gang of Asians. It was reposted. It sparked a flurry of accusations that increased tension, with sporadic fighting across the city centre.

At first, the police were baffled as to how to respond, until a police superintendent from Wolverhampton worked out that if he sat in the police’s Gold Control command centre with his smartphone he could monitor Twitter, and if a rumour was posted he could shoot it down in real time.

So, he did just that, successfully, when the EDL returned.

This example set a gold standard for handling online rumour. Use a trusted source to debunk it in real time. Not in tomorrow’s papers, but in a message within minutes.

Just as Twitter was weaponised to spread rumour, deep fakes will undoubtedly be used to cause trouble.

It doesn’t need that much imagination. 

For example, a school is facing protests on religious grounds from a section of the community about how it teaches sex education. How is that going to play out if deep fake audio is released of the headteacher abusing that religion’s holy book?

Or how about the council election where a candidate is the subject of a deep fake alleging bribery?

How hard is it to create audio like this? 

I thought I’d take a look.

What are social channels’ attitudes to deep fakes?

It’s a bit mixed.

Meta’s policy on deepfakes is that it removes a post when two conditions are met. The post must be:

“the product of artificial intelligence or machine learning, including deep learning techniques (e.g., a technical deepfake)”; and the post would mislead an average person to believe that “a subject of the video said words that they did not say.”

Meta

YouTube also takes steps in its community guidelines on manipulated content. Twitter has a similar policy on deepfakes.

The trouble is that complaining through the official channel can take time. Relying on that alone is the definition of bolting the stable door after the horse has bolted.

Creating deep fake audio is really easy 

I’m looking at this not so that anyone creates deep fake audio themselves, but so that they’re aware of the issue.

One of the largest AI audio platforms on the internet is resemble.ai. With it you can, firstly, create audio with some generic voices. Paula J, for example, has an English accent. Beth sounds like an American. There are more than 20 off-the-shelf examples.

Secondly, things start to get even more interesting when you explore the other functionality.

You can add your own voice to your resemble.ai account. So, I did. I recorded 25 voice clips of myself reading prepared text with which to train the tool. There is also the option of uploading audio, but to do this I had to explain why I wanted to. I did, and they said they’d email me back. After five minutes of waiting with no email, I pressed ahead and used a generic voice.

I’m sure resemble.ai would say that this is a step to stop potential bad actors. But I’m just as sure there are other tools out there that don’t do this.
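To give a sense of how low the barrier is, here’s a rough sketch of what the basic text-to-speech step can look like with Coqui TTS, a free open-source library. To be clear: this isn’t the tool I used, the text is a deliberately tame placeholder, and the exact model name and API may vary between versions. The point is that it’s a handful of lines.

```python
# A minimal sketch using Coqui TTS (pip install TTS). Not the tool described
# above; just an illustration of how little code off-the-shelf speech needs.
from TTS.api import TTS

# Load a stock English voice. The first run downloads the model;
# the exact model name may differ depending on the library version.
tts = TTS(model_name="tts_models/en/ljspeech/tacotron2-DDC")

# Type in some text and save the generated speech as a WAV file.
tts.tts_to_file(
    text="I literally told you, didn't I? I'm sick of it.",
    file_path="generated_speech.wav",
)
```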

The plan was for the AI tool to read back some of the script from the fake Keir Starmer clip.

So, painstakingly, I typed in about 15 seconds’ worth of the script.

“F-ing idiot. Have you got it? The f-ing tablet. F-ks sake. I literally told you. F-ks sake. Bloody moron.” 

Deep fake audio

I downloaded the audio.

Just to make sure, I also screen recorded the playback of the recording on my mobile phone.

Of course, this audio emerged in one burst without the required pauses, so editing to insert suitable gaps would be needed to make it sound more authentic.

I also needed some ambient background noise. So, I screen recorded 30 seconds of cafe audio from a clip I found on YouTube.

I then used the Kinemaster video editing app to put together a clip that sounded plausible to the untrained ear, adding a bit of distortion to the audio and changing the sound balance. None of this was hard to do.
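In fact, the audio assembly doesn’t even need a video app. Here’s a rough sketch of the same mixing job in a few lines of Python, assuming the pydub library and two placeholder files, generated_speech.wav and cafe_ambience.wav. It’s an illustration, not a recipe I followed.

```python
# A rough sketch of the editing step: add a pause, lay the speech over
# quiet cafe ambience, and export the mix. File names are placeholders.
from pydub import AudioSegment  # pip install pydub (also needs ffmpeg)

speech = AudioSegment.from_file("generated_speech.wav")
ambience = AudioSegment.from_file("cafe_ambience.wav")

# Insert a short pause part-way through so it sounds less like one burst.
pause = AudioSegment.silent(duration=800)  # milliseconds
edited = speech[:4000] + pause + speech[4000:]

# Turn the ambience down by 12 dB and loop/trim it to the speech length.
bed = ambience - 12
while len(bed) < len(edited):
    bed += ambience - 12
bed = bed[: len(edited)]

# Overlay the speech on the background bed and save the result.
bed.overlay(edited).export("plausible_clip.wav", format="wav")
```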

With a basic knowledge of video editing I was able to put the clip together from start to finish in less than half an hour.

Here it is.

And yes, that’s a stock pic of a couple arguing in a cafe.

How can I spot deep fake audio?

For images, there are tools out there. TinEye offers reverse image search. The BBC uses forensic analysis to work out if an image is fake. Again, the Hemingway filter can often come into play.

The shark pic posted after floods is such a common meme that it has its own Wikipedia page.

This shark pic was posted after Hurricane Sandy.

After all, if you’re feeling smug, look at how many people were taken in by a tweet alleging that Nigel Farage was once a punk.

For the benefit of the tape, no he wasn’t. 

For audio, there are annoyingly few tools. Resemble.ai says it has such a tool, but the webpage sits behind a wall that harvests contact details, with a promise to get in touch. If you need this in a hurry, that’s not the answer you’re looking for.

So what to do? 

Well, a piece on Ampere Industrial Security’s website talks about research published by researcher Siwei Lyu of the University at Buffalo, State University of New York. There is indeed such a researcher called Siwei Lyu, I found, but the piece doesn’t link to the source of the research. Does that mean it doesn’t exist? Spend any time looking at fakes online and you start to get a bit sceptical of the most ordinary-looking things.

The Keir Starmer example shows that yes, it will happen. But the debunking should probably come from a third party rather than yourself. 

That could be tricky.  
