Artificial Intelligence on the Biden Campaign

(Image: CouchPotato architecture slide)

Audio Classification with MFCCs

The best way to do any kind of classification on audio signals is by extracting a set of features known as Mel Frequency Cepstral Coefficients (MFCCs). Similar-sounding audio will have computationally similar MFCCs, and by processing the coefficients against pre-trained data, you can map a clip onto a pre-defined classification. I'm not going to pretend to fully understand the science and mathematics behind all of this (I am not a digital signal engineer by any stretch of the imagination), but the Python open source ecosystem is full of libraries that make extracting MFCCs incredibly simple. We ended up using librosa, which integrated nicely with numpy, which we would need for the next step.
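For context, extracting MFCCs with librosa really does take only a few lines. The sketch below is illustrative rather than our production code; the 16 kHz sample rate, 13 coefficients, and file name are choices I'm making here for the example:

import librosa
import numpy as np

# Load a short audio clip; sr=16000 resamples it to 16 kHz mono.
y, sr = librosa.load("debate_clip.wav", sr=16000, mono=True)

# Compute 13 MFCCs per frame; the result is a (13, n_frames) array.
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)

# Summarize the clip as a single feature vector by averaging over time.
feature_vector = np.mean(mfcc, axis=1)

Averaging over time gives one vector per clip, which is the kind of representation you can compare across speakers.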

Speaker Classification with Audio and Video

Since we knew our use case was specifically debates, we knew we would have both audio and video, which we could use as complementary sources of data to solve this problem. We revisited our assumption that real-time video processing would be too taxing and require too much infrastructure to be worth it. To that end, we got started on a plan to reduce the inference overhead. On the audio side, AWS Transcribe gave us a word-level transcript with timestamps and confidences, which looked something like this:

{
  "jobName": "xxx",
  "accountId": "xxx",
  "results": {
    "transcripts": [
      {
        "transcript": "<fully transcribed text>"
      }
    ],
    "items": [
      {
        "start_time": "1.11",
        "end_time": "1.28",
        "alternatives": [
          {
            "confidence": "0.9521",
            "content": "We're"
          }
        ],
        "type": "pronunciation"
      },
      {
        "start_time": "1.28",
        "end_time": "1.35",
        "alternatives": [
          {
            "confidence": "1.0",
            "content": "in"
          }
        ],
        "type": "pronunciation"
      },
      ...
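Each entry in items carries a word, its confidence, and its start and end times, which is what lets you line the transcript up against the video. A rough sketch of pulling word timings out of a result like the one above (the file name and the dictionary shape of the output are mine, not details from our actual pipeline):

import json

with open("transcribe_result.json") as f:
    result = json.load(f)

words = []
for item in result["results"]["items"]:
    # Punctuation items carry no timestamps, so keep only pronunciations.
    if item["type"] != "pronunciation":
        continue
    best = item["alternatives"][0]
    words.append({
        "word": best["content"],
        "confidence": float(best["confidence"]),
        "start": float(item["start_time"]),
        "end": float(item["end_time"]),
    })

# words is now a time-ordered list we can align against video frames.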

Flappy Lips

At this point, we were using OpenCV and dlib to find people within the frame, and that part was working well enough (except for the speed issue). The dlib library worked out great for us because its human face recognition neural network was far superior in inference speed compared with the alternatives. What we needed next was to figure out which of the faces in the frame represented the speaker.
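As a rough illustration of the face-detection side: the lip-movement heuristic below is my own reading of the section title, not necessarily what we shipped, but dlib's detector plus its 68-point landmark model can flag whose mouth is opening and closing from frame to frame:

import cv2
import dlib

detector = dlib.get_frontal_face_detector()
# The 68-point landmark model is downloaded separately from dlib.
predictor = dlib.shape_predictor("shape_predictor_68_face_landmarks.dat")

def mouth_openings(frame):
    """Return the vertical inner-lip gap for each face found in the frame."""
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    openings = []
    for face in detector(gray, 0):
        shape = predictor(gray, face)
        # Landmarks 62 and 66 are the top and bottom of the inner lips.
        gap = abs(shape.part(66).y - shape.part(62).y)
        openings.append((face, gap))
    return openings

Comparing those gaps across consecutive frames gives a crude "who is talking right now" signal that can be matched against the transcript timestamps.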

Speech Clustering

While we knew that MFCC feature analysis alone wasn't something we could (or should) properly build out, we did understand that MFCC features were reliable within a single broadcast. To put it more simply, we could figure out who a speaker was so long as the audio came from the same debate, because the recording conditions would be consistent throughout.
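In practice that amounts to nearest-neighbor classification over per-segment MFCC vectors. The sketch below uses scikit-learn's KNeighborsClassifier as a stand-in (the post doesn't name the library we used), with file names and speaker labels as placeholders for segments already labeled earlier in the same debate:

import numpy as np
import librosa
from sklearn.neighbors import KNeighborsClassifier

def segment_features(path):
    """Collapse one audio segment's MFCCs into a single feature vector."""
    y, sr = librosa.load(path, sr=16000, mono=True)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)
    return np.mean(mfcc, axis=1)

# Segments whose speaker we already know (e.g. confirmed by the video side).
labeled = [("seg_001.wav", "BIDEN"), ("seg_002.wav", "TRUMP"), ("seg_003.wav", "MODERATOR")]
X = np.array([segment_features(path) for path, _ in labeled])
speakers = [speaker for _, speaker in labeled]

knn = KNeighborsClassifier(n_neighbors=1).fit(X, speakers)

# Classify a new segment from the same broadcast.
print(knn.predict([segment_features("seg_042.wav")]))

Because all segments come from the same broadcast, the nearest labeled neighbor is usually the same voice, which is exactly the assumption described above.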

Conclusion

The final system was a combination of AWS Transcribe, video analysis with facial recognition, and k-NN clustering of MFCC features, which ended up giving us the most reliable results. We still had plenty of moments where the system could not figure out who the speaker was, and in those cases we just had it append an "UNKNOWN" label to the transcript.

Dan Woods

Dan Woods was CTO for Biden for President during the 2020 election. Previously he worked building tech for Hillary Clinton's 2016 presidential campaign.