Comprehension by passive synchronization
One idea is that oscillations align phases of high firing to the high-information segments of speech through entrainment. Imagine two church bells hanging a short distance apart. Tap one with a hammer and the other will ring too. In the same vein, networks in the brain wired to oscillate at roughly the rate of speech have a propensity to synchronize to incoming sentences, with excitable peaks aligning to sounds and inhibitory troughs aligning to silences. What you get is a homunculus-free process in which the brain organically samples the relevant segments of the waveform.
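The bell analogy can be made concrete with a toy model. The sketch below is a single Kuramoto-style phase oscillator driven by a perfectly periodic stimulus; the function name, the specific frequencies (a roughly syllable-rate 4–5 Hz), and the coupling strength are all illustrative choices of mine, not parameters from any particular model of speech processing. When the coupling is strong enough relative to the frequency mismatch, the oscillator locks to the stimulus with a fixed phase lag, which is the essence of entrainment.

```python
import math

def entrain(natural_hz, stimulus_hz, coupling, dt=0.001, seconds=20.0):
    """Toy entrainment model: a phase oscillator pulled toward a rhythmic drive.

    The oscillator's phase theta evolves as
        dtheta/dt = 2*pi*natural_hz + coupling * sin(phi_stim - theta),
    i.e. it runs at its natural rate plus a nudge toward the stimulus phase.
    Returns the stimulus-minus-oscillator phase difference at the end of the
    run, wrapped to (-pi, pi]. A stable, constant value means phase locking.
    """
    theta = 0.0
    steps = int(seconds / dt)
    for i in range(steps):
        t = i * dt
        phi = 2.0 * math.pi * stimulus_hz * t  # stimulus phase at time t
        theta += (2.0 * math.pi * natural_hz
                  + coupling * math.sin(phi - theta)) * dt
    phi_end = 2.0 * math.pi * stimulus_hz * seconds
    # Wrap the phase difference into (-pi, pi]
    return math.atan2(math.sin(phi_end - theta), math.cos(phi_end - theta))

# A 4 Hz oscillator driven at 4.5 Hz with strong coupling settles into a
# constant phase lag (analytically, asin(2*pi*0.5/coupling) ~ 0.40 rad here);
# with zero coupling it would drift past the stimulus indefinitely.
print(entrain(4.0, 4.5, coupling=8.0))
```

The point of the toy is the qualitative behavior, not the numbers: nothing "decides" to synchronize; the lag emerges from the dynamics alone, which is what makes the process homunculus-free.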
A rewarding aspect of science is the glimpses of beauty you see along the way, and this has been one of them for me. Through simple principles of physics and self-organization at the neural level, you get computational enrichment at the psychological level.
But the flip side of elegance is oversimplification. Recall the task at hand, which is to explain how the brain extracts linguistic structure from waveforms. For all its graceful simplicity, our homunculus-free process is also quite dull: it can tune oscillatory networks only to what is actually rhythmic in the waveform. Well, what is rhythmic in the waveform?
If we take a step back to peer across the river at the connecting structure psycholinguists have prepared for us, we appear to be off the mark. Not much of what linguists are talking about is linearly decodable from waveforms. Words and syllables might be — they occur every 200 milliseconds or so — but language comprehension is not a simple matter of concatenating words. As we hear a string of words and parse them over time, our minds impose highly nonlinear tree structures onto them, which requires much more than merely tacking one word onto the next.