Psychoacoustics, lossless, and what else I know about audio standards

Relatively recently, I came across an interesting video from the Gutenberg Smoking Room called “Psychoacoustics: Sound Illusions,” albeit a student one. The video inspired me to rummage through my student notes and materials...

To be honest, I didn’t really like the Audio Coding subject when I was a student at TU Ilmenau in the Communication and Signal Processing program - stress and youthful maximalism were doing their dark work. However, from the outside I more often heard the opposite point of view: “Cool subject, why are you complaining? One of your lecturers is Karlheinz Brandenburg himself - seize the moment!”

One of the main developers of the MP3 format, in case you didn't recognize him, posing with headphones. (image source)

As time passed, I, of course, revised my view on this subject. Knowledge at the intersection of digital signal processing, biology, physics and computer technology - that’s cool! The topic of the already mentioned psychoacoustics alone is worth it.

And then one day another adventurous thought came to my mind, and I said to myself: “Why not write a popular science article about audio coding? ‘For the little ones’, so to speak: for students like me.”

No sooner said than done.

Anatomy is horribly interesting

Before we talk about how exactly a person perceives sound, and what mathematical models can be used for this, let’s talk about the main thing: what allows a person to perceive sound in general?

Of course, the auditory system! To be precise, mainly the inner and middle ear and their specific components:

  • eardrum (tympanic membrane): passes air vibrations (sound waves) on to the auditory ossicles as mechanical vibrations;
  • auditory ossicles of the middle ear: malleus (hammer), incus (anvil), stapes (stirrup) - transmit mechanical vibrations to the cochlea;
  • cochlear structure: induces traveling waves along the length of the basilar membrane;
  • neural receptors: convert vibrations into chemical and electrical signals (they are connected along the entire length of the basilar membrane).

Fig. 2. The internal structure of the human ear.

Everything seems to be intuitive, provided that you have some school knowledge. The difficulty is usually only caused by the cochlea: what does this abstruse phrase mean: “induces traveling waves along the length of the basilar membrane”?

Paradoxical as it may seem, everything here is also quite simple. First, let’s list what the cochlea consists of:

  • there are fluids inside the cochlea: perilymph and endolymph;
  • there is also a basilar (basal, main) membrane inside;
  • hair cells (part of the organ of Corti) are attached to the basilar membrane.

The eardrum transmits sound vibrations to the ossicles of the middle ear; the ossicles transmit the vibrations to the perilymph and endolymph; the vibrating perilymph and endolymph set the basilar membrane in motion; and the movements of the basilar membrane make the hair cells produce signals that are passed on to nerve cells.

I suggest you read more here and here.

Fig. 3. The internal structure of the human ear: the basilar membrane in an “unfolded” form (link to the source of the illustration).

Due to the shape of the basilar membrane (tapering towards the base) and the fact that cells responsible for the perception of different frequencies are connected to different parts of this membrane, the cochlea is a nonlinear system with frequency selectivity.

What if you look at the cochlea through the eyes of digital signal processing?

From a DSP point of view, the cochlea is a bank of bandpass filters, and these filters overlap each other heavily.

Fig. 4. Tone responses in different places of the basilar membrane [1, p. 63].

What is shown in the picture:

  • A tone with a period of 1 ms, and therefore a frequency of 1 kHz (its time function is shown on the top plot), produces responses at five different locations on the basilar membrane (the five functions below, drawn opposite the corresponding locations on the membrane).
  • The maximum response corresponds to the middle of the membrane - where it responds to frequencies of 1 kHz (logical).
  • The minimum responses are at the edges of the membrane (x4, x2, x1 indicate how much the graphs have been enlarged for illustration).

Kind people have already drawn useful structural diagrams:

Fig. 5. Part of the perceptual model diagram (see PEMO Model) concerning the basilar membrane.

The overlapping filters are shown, in my opinion, very clearly.
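To make the "bank of overlapping bandpass filters" idea a bit more tangible, here is a toy sketch. Everything in it is my own assumption for illustration: the five center frequencies, the symmetric shape of the filters, and the 30 dB per octave slope are invented, and real cochlear filters are asymmetric. The only point is that a 1 kHz tone excites the channel tuned to 1 kHz the most, while neighboring channels still respond, just more weakly.

```python
import math

# Hypothetical center frequencies for five "places" on the membrane.
CENTERS_HZ = [250.0, 500.0, 1000.0, 2000.0, 4000.0]


def channel_gain_db(tone_hz, center_hz, slope_db_per_octave=30.0):
    """Toy band-pass response: 0 dB at the center frequency, falling off
    linearly in dB with the distance in octaves (an assumed shape)."""
    octaves_away = abs(math.log2(tone_hz / center_hz))
    return -slope_db_per_octave * octaves_away


def bank_response_db(tone_hz):
    """Gain of every channel of the toy bank for a pure tone at tone_hz."""
    return {center: channel_gain_db(tone_hz, center) for center in CENTERS_HZ}


if __name__ == "__main__":
    for center, gain in bank_response_db(1000.0).items():
        print(f"channel {center:6.0f} Hz: {gain:6.1f} dB")
```

Because the slopes are finite, every channel "hears" the tone to some degree, which is the overlap the diagrams show.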

At some point, researchers decided to put this knowledge about the cochlea as a filter bank into a simple and accessible model. In a series of listening experiments [1, p.82-85], they determined that:

  • the frequency groups into which the basilar membrane divides the audio signal have a fixed bandwidth;
  • the bandwidth of a frequency group depends nonlinearly on its center frequency.

Moreover, for convenience, it was agreed to treat the filters of our auditory system as rectangular.

All of the above was ultimately generalized into the concept of the Bark scale - a scale of critical bands (see RWTHxCA101 - Critical bands), the width of which depends nonlinearly on the center frequency:

Fig. 6. Bark scale.
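If you want to play with the numbers, below is a minimal sketch of the Hz-to-Bark mapping. It uses the analytic approximations usually attributed to Zwicker and Terhardt; this is my choice for illustration, and the figure above may well be based on a table or on a different fit. The function names are mine.

```python
import math


def hz_to_bark(f_hz):
    """Critical-band rate in Bark (Zwicker/Terhardt-style approximation)."""
    return 13.0 * math.atan(0.00076 * f_hz) + 3.5 * math.atan((f_hz / 7500.0) ** 2)


def critical_bandwidth_hz(f_hz):
    """Approximate width of the critical band centered at f_hz, in Hz."""
    return 25.0 + 75.0 * (1.0 + 1.4 * (f_hz / 1000.0) ** 2) ** 0.69


if __name__ == "__main__":
    for f in (100, 500, 1000, 4000, 10000):
        print(f"{f:6d} Hz -> {hz_to_bark(f):5.2f} Bark, "
              f"band about {critical_bandwidth_hz(f):7.1f} Hz wide")
```

The bandwidth stays at roughly 100 Hz for low frequencies and then grows quickly, which is exactly the nonlinear dependence on the center frequency mentioned above.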

Let's remember this fact, it will be useful to us later.

I couldn't help but share!

While I was looking for illustrations of the Bark scale, I came across this image:

bark scale by spooninglive


Okay, now we have a slightly better idea of what kind of system allows us to hear. Moreover, we found that the hearing organs are a nonlinear frequency-selective system. We even found out how its selectivity works in terms of the width of the critical bands.

But we have not yet said whether we hear all frequencies equally well. Perhaps there are some suitable experiments?

Threshold in silence

Of course, there are such experiments. Moreover, such experiments have been carried out for a long time. For example, Eberhard Zwicker describes one of them as follows [1, p. 63]:

The subject, registering the hearing threshold, is tasked with changing the sound pressure level using a switch so that the moments of barely noticeable appearance and disappearance of sound are noted with confidence. In this case, the recorder pen crosses out a zigzag stripe on the paper, consisting of vertical strokes, within which there will be those pressure values for which it is not certain whether a sound was heard or not.

Ultimately, 100 such measurements were collected from people of both sexes aged 20-25 years, and the average values were calculated.

Fig. 7. Averaged hearing threshold curves for young subjects with healthy hearing. [1, p. 64]

And then the median (the curve between the 10% and 90% curves in Fig. 7) was called the hearing threshold (or “threshold in silence”, also known as “threshold in quiet”) and was included in the standards (including our GOST).
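For those who like to see the statistics behind such curves, here is a small sketch of how a median and the 10% and 90% levels could be computed at a single test frequency. The readings are numbers I made up purely for illustration; they are not data from [1], and `summarize` is a hypothetical helper name.

```python
import statistics

# Made-up threshold readings (dB SPL) from several listeners at one frequency.
READINGS_DB = [2.0, 4.5, 3.0, 6.0, 1.5, 5.0, 3.5, 4.0, 2.5, 7.0, 3.0]


def summarize(levels_db):
    """Return the 10th percentile, the median and the 90th percentile."""
    deciles = statistics.quantiles(levels_db, n=10)  # nine cut points
    return {
        "p10": deciles[0],
        "median": statistics.median(levels_db),
        "p90": deciles[-1],
    }


if __name__ == "__main__":
    print(summarize(READINGS_DB))
```

Repeating this for every test frequency and connecting the points would give three curves of the kind shown in Fig. 7.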

Fig. 8. Threshold in quiet, hearing threshold, risk of damage, threshold of pain (source). Note that pain does not warn of danger in advance; it simply signals that hearing is already being harmed.

There is even a special formula for this:

where f is, as you might guess, the frequency in kilohertz.

Let's talk about the essence of the threshold of audibility once again: in order for any sound to be heard, it must exceed the value of the “threshold in silence.” That is, evolution has arranged everything in such a way that we are almost guaranteed to hear sounds near 2-4 kHz, however, we are almost as guaranteed not to hear too low and too high frequencies.
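To try this out numerically, here is a minimal sketch based on the approximation of the threshold in quiet that is commonly attributed to Terhardt, with the frequency taken in kilohertz inside the formula. I am assuming this well-known fit here; the curves in the standards are tabulated and differ in detail. The function names are mine.

```python
import math


def threshold_in_quiet_db(f_hz):
    """Approximate threshold in quiet, in dB SPL (Terhardt-style fit)."""
    f_khz = f_hz / 1000.0
    return (3.64 * f_khz ** -0.8
            - 6.5 * math.exp(-0.6 * (f_khz - 3.3) ** 2)
            + 1e-3 * f_khz ** 4)


def is_audible(f_hz, level_db_spl):
    """A tone is heard only if its level exceeds the threshold in quiet."""
    return level_db_spl > threshold_in_quiet_db(f_hz)


if __name__ == "__main__":
    for f in (50, 100, 1000, 3300, 10000, 18000):
        print(f"{f:6d} Hz: threshold about {threshold_in_quiet_db(f):6.1f} dB SPL")
    print("10 dB tone at 3 kHz audible:", is_audible(3000, 10.0))
    print("10 dB tone at 50 Hz audible:", is_audible(50, 10.0))
```

The dip of the curve around 2-4 kHz and the steep rise at both ends of the frequency range are the "almost guaranteed to hear" and "almost guaranteed not to hear" regions described above.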

Do you remember ultrasound on phones?

The threshold in silence, in the form shown in Fig. 7, is as a rule valid for an average group of young people. With age, the perception of high frequencies changes:

At one time, this fact, as far as I know, was behind the spread of the “ultrasonic” phone ringtone among teenagers: it was assumed that adults (for example, teachers) would not hear it, and therefore would not be irritated by the extraneous noise. Well, in the years of my youth, this idea brought nothing but “torture” of classmates with an annoying and intrusive sound in the middle of the lesson by a bunch of “passionaries”...

Why is the phrase “in silence” applied to this curve?

Because it is assumed that this is how people perceive sound in the absence of extraneous noise. When noise appears, the threshold will, as it were, “rise.” In the case of broadband noise, the picture will look like this:

Fig. 8. Levels of masking thresholds (the term will be discussed below) with white noise depending on the frequency of the test tone. The dotted line marks the slope of the curves at high frequencies. [2, p. 62]
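A rough way to reproduce the shape of these curves is sketched below. The numbers are assumptions on my part, taken from how such measurements are usually summarized: the masked threshold sits about 17 dB above the noise density level at low frequencies and rises by about 10 dB per decade above roughly 500 Hz. The threshold in quiet uses the Terhardt-style fit, again as an assumption, and all function names are hypothetical.

```python
import math


def threshold_in_quiet_db(f_hz):
    """Approximate threshold in quiet, in dB SPL (Terhardt-style fit)."""
    f_khz = f_hz / 1000.0
    return (3.64 * f_khz ** -0.8
            - 6.5 * math.exp(-0.6 * (f_khz - 3.3) ** 2)
            + 1e-3 * f_khz ** 4)


def white_noise_masked_threshold_db(f_hz, density_level_db):
    """Assumed model: flat up to 500 Hz, then rising 10 dB per decade."""
    rise_db = 10.0 * math.log10(f_hz / 500.0) if f_hz > 500.0 else 0.0
    return density_level_db + 17.0 + rise_db


def effective_threshold_db(f_hz, density_level_db):
    """The tone must exceed both the quiet threshold and the masked one."""
    return max(threshold_in_quiet_db(f_hz),
               white_noise_masked_threshold_db(f_hz, density_level_db))


if __name__ == "__main__":
    for f in (50, 200, 1000, 5000):
        print(f"{f:5d} Hz: quiet {threshold_in_quiet_db(f):5.1f} dB, "
              f"with noise {effective_threshold_db(f, 20.0):5.1f} dB")
```

Taking the maximum of the two curves is what the "raised" threshold means here: where the noise is weak relative to the threshold in quiet (for instance at very low frequencies), the threshold in quiet still decides what we hear.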

What about narrowband noise?

