How people and computers perceive sound
By JOSH
It’s so strange to think that the simplest, most intuitive things in our lives can be broken down into many tiny physical actions that we don’t understand. Anytime someone asks you a question, they transmit energy through the air in the form of waves. These waves have a certain timbre and pitch. They travel into our ears, get converted to an electrical nerve signal, get processed by neurons, and are registered as a question in our brains. Information has to jump through a lot of hoops in order to be understood by our minds.
Amazingly, computers have the task of arriving at the same conclusion, and with only a subset of the data. The goal is the same, but the tools used are quite different. Even so, the two have things in common. Convolutional layers in computer models were inspired by research on how the visual cortex processes what we see. They have been critical in computer vision, and they are a great example of how we can improve computer models by imitating biological processes. Sound processing has similarities of its own.
The first similarity is categorical perception: We perceive speech sounds categorically rather than continuously. This means that we don’t hear a mix of phonemes, or something in between two phonemes. It’s always one or the other. A great example of this is /b/ (“buh” sound) and /p/ (“puh” sound), where the main difference between the two sounds is voice onset time, the delay between the lips opening and the vocal cords starting to vibrate. The lips make the same moves, but if voicing starts earlier, it is /b/, and if it starts later, it is /p/. However, if voicing starts somewhere in the middle, we don’t perceive something in the middle. We still perceive one or the other: That is categorical perception.
We typically implement categorical perception in machine learning models as well. In the final layer of a neural network, the model assigns a probability to each phoneme, and we choose the phoneme with the highest probability. So we end up with only one, even though several phonemes may have been very close in the running. That is a similarity between machine learning algorithms and people.
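As a rough illustration of that last step, here is a minimal sketch in plain Python. The phoneme labels and raw scores are made up for the example, and the softmax function is one common way (assumed here, not taken from any particular model) to turn a final layer's scores into probabilities before picking the winner.

```python
import math


def softmax(scores):
    """Turn raw final-layer scores into probabilities that sum to 1."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def pick_phoneme(scores, labels):
    """Return the single most probable label, plus all the probabilities."""
    probs = softmax(scores)
    best = max(range(len(probs)), key=probs.__getitem__)
    return labels[best], probs


# Hypothetical scores: /b/ and /p/ are nearly tied, /d/ is far behind.
labels = ["/b/", "/p/", "/d/"]
scores = [2.1, 2.0, -1.0]

winner, probs = pick_phoneme(scores, labels)
print("winner:", winner)
for label, p in zip(labels, probs):
    print(f"{label}: {p:.3f}")
```

Even though /b/ and /p/ end up only a few percentage points apart, the output is a single phoneme. The near-tie is thrown away, much like a listener who hears one sound or the other.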
Another similarity is pitch interpretation. We naturally perceive pitch change on a roughly logarithmic scale. According to the mel scale (the perceptual pitch scale behind the mel-frequency cepstrum), we hear just about the same difference between 500 Hz and 1000 Hz as we do between 6500 Hz and 9500 Hz. The higher the pitch is, the harder it is to distinguish from similar pitches. While we can simulate this pitch scaling in computer models to emphasize what stands out to people, it is much harder to simulate the other information we have access to.
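The pitch comparison above can be checked with a few lines of Python. This is a sketch that assumes one widely used version of the mel formula, mel = 2595 * log10(1 + f / 700). The article does not say which variant it has in mind, and other variants give slightly different numbers. The function names are just for this example.

```python
import math


def hz_to_mel(freq_hz):
    """Convert a frequency in Hz to mels (assumed 2595 / 700 variant)."""
    return 2595.0 * math.log10(1.0 + freq_hz / 700.0)


def mel_distance(low_hz, high_hz):
    """Perceived pitch distance between two frequencies, in mels."""
    return hz_to_mel(high_hz) - hz_to_mel(low_hz)


low_gap = mel_distance(500, 1000)     # a 500 Hz wide gap, low in the range
high_gap = mel_distance(6500, 9500)   # a 3000 Hz wide gap, high in the range

print(f"500 Hz to 1000 Hz:  {low_gap:.1f} mels")
print(f"6500 Hz to 9500 Hz: {high_gap:.1f} mels")
```

Under this formula the two gaps come out nearly the same size in mels, even though the second one is six times wider in Hz.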
Computer algorithms mainly just have access to audio. It is possible to add images or video, but it is not common. People, on the other hand, have access to visuals. One particular psychology experiment showed how powerful visuals really are in interpreting sound. Subjects were presented with a video and a recording at the same time. The recording was “baba”, but the person in the video mouthed “fafa”. When people experienced this, they perceived the person in the video saying “fafa”. In essence, they heard what they saw. This is called the McGurk effect.
In a fabricated setting like the experiment, relying on visuals can be misleading. But in noisy environments where it is hard to make out every sound, it is an asset that helps humans understand verbal communication. By contrast, computers typically do not perceive what shape our mouths are making, can’t see what people are gesturing and don’t know that a nearby truck just drove through a puddle, making a big “shhhh” sound. And sometimes, the sound quality is just too poor for a computer to make out what is being communicated. Humans have these cues to help them understand sound, and computers have to solve the problem without them.
Computers and people are obviously different, most clearly in the fact that humans program machines. We are quite lucky not to have someone tweaking the weights in our brains so that we think what they think. Besides that, the brain has the power to turn all kinds of tiny energy patterns into experiences of thoughts, sights and sounds. Computers can record tiny energy patterns, but making something of them is much harder. Computers and humans will always work in different ways, but each, in its own fashion, reveals processes that shed more light on our everyday experiences.