Humans hear pretty much the same way that computers do.
There are two things in play:
1) The mechanical job of hearing. The ear takes the sound and, virtually instantaneously, splits it into phonemes – granular bits of sound that let us recognize that a B is different from a P, a D, or a T. A computer does this digitally; humans do the same thing by analog means.
Once the phonemes are split and identified, it comes to
2) Context. This is where language and linguistics play a role. If we hear the phonemes for HART, are we understanding it as HART (a deer) or HEART (the organ in your body)? That’s where cognition and context – that is, the mind’s supercomputer – come into play. Your brain puts the sounds into the sentence and figures out the word based on the language and context of what is being said.
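To give a rough feel for the digital side of step 1, here is a toy sketch: a computer can use a Fourier transform to split a mixed sound back into its component frequencies, the first stage before any pattern-matching against phonemes. The tone frequencies here are made-up illustrations, not real speech data.

```python
import numpy as np

# A toy "digital ear": mix two pure tones, then use an FFT to split the
# mixed signal back into its frequency components. This is only the first,
# mechanical stage -- real phoneme recognition builds on top of it.
sample_rate = 8000                        # samples per second (illustrative)
t = np.arange(sample_rate) / sample_rate  # one second of time points
signal = np.sin(2 * np.pi * 440 * t) + 0.5 * np.sin(2 * np.pi * 1200 * t)

spectrum = np.abs(np.fft.rfft(signal))
freqs = np.fft.rfftfreq(len(signal), d=1 / sample_rate)

# The two strongest components recover the tones that were mixed together.
peaks = sorted(freqs[np.argsort(spectrum)[-2:]])
print(peaks)  # the 440 Hz and 1200 Hz components
```

The ear does something comparable with hair cells tuned to different frequencies – an analog version of the same decomposition.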
The pitch isn’t really that important in either case. It’s the phonemes and context that make language understood.
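The context step can be sketched in a few lines too: the same sound maps to more than one written word, and surrounding words tip the balance. The word lists and cue words below are made-up illustrations, not a real lexicon or a real disambiguation algorithm.

```python
# A toy version of step 2: one phoneme string, two candidate words,
# and simple context cues to pick between them.
HOMOPHONES = {"hart": ["hart", "heart"]}          # same sound, two spellings
CUES = {
    "hart": {"deer", "stag", "forest", "hunt"},   # cues for the animal
    "heart": {"beat", "blood", "chest", "love"},  # cues for the organ
}

def disambiguate(phoneme_key, sentence_words):
    """Pick the homophone whose cue words overlap the sentence most."""
    candidates = HOMOPHONES[phoneme_key]
    scores = {w: len(CUES[w] & set(sentence_words)) for w in candidates}
    return max(scores, key=scores.get)

print(disambiguate("hart", ["a", "stag", "ran", "through", "the", "forest"]))  # hart
print(disambiguate("hart", ["the", "blood", "rushed", "to", "her", "chest"]))  # heart
```

Your brain does this far more richly, of course – with grammar, meaning, and expectation rather than a word list – but the shape of the problem is the same.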