The difference between listening to music through speakers and through headphones is striking.
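When listening to speakers, each ear receives both channels: the sound from each speaker reaches the far ear with a small delay, and both ears also receive indirect sound reflected by the room.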
When listening to headphones, each ear receives one and only one channel. No mixing, no delay, no indirect sound.
The latter means that the room acoustics are eliminated.
We can localize a sound.

If a speaker is off axis, the sound will reach the far ear a fraction of a millisecond later than the near ear.

This is the Interaural Time Difference (ITD).
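
To give a feel for the size of this delay, here is a minimal sketch based on Woodworth's spherical-head approximation, ITD = (r / c) * (theta + sin(theta)), which is a common textbook model and not something taken from this article. The head radius (8.75 cm), the speed of sound (343 m/s) and the function name `itd_seconds` are all assumptions chosen for illustration.

```python
import math

def itd_seconds(azimuth_deg, head_radius_m=0.0875, speed_of_sound=343.0):
    """Estimate the interaural time difference for a distant source.

    Uses Woodworth's spherical-head approximation, valid for azimuths
    between 0 (straight ahead) and 90 degrees (directly to one side).
    The default radius and speed of sound are assumed typical values.
    """
    theta = math.radians(azimuth_deg)
    return head_radius_m / speed_of_sound * (theta + math.sin(theta))

if __name__ == "__main__":
    for angle in (0, 30, 60, 90):
        print(f"{angle:2d} degrees off axis: {itd_seconds(angle) * 1000:.2f} ms")
```

Under these assumptions, a speaker 30 degrees off axis gives a delay of roughly a quarter of a millisecond, and a source directly to one side gives a little under 0.7 ms.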
The far ear is in the shadow of the head, so it hears the sound at a slightly lower volume.
The shadowing depends on frequency, so it also affects the frequency response at the far ear.

This is the Interaural Intensity Difference (IID).
These are the two primary cues we use for localization.
They are not sufficient on their own, as the pinna (the outer ear structure) also plays a part, but one thing is obvious: none of this happens when using headphones.
This is why stereo on headphones sounds like STEREO: the two channels
don't mix, and all the effects described above, the Head-Related Transfer
Function (HRTF), are absent.
You can simulate the HRTF by using a crossfeed.
Benjamin B. Bauer was one of the pioneers.
A famous article by him: "Stereophonic Earphones and Binaural Loudspeakers",
JAES Volume 9, Number 2, pp. 148-151, April 1961.
A crossfeed emulates the HRTF by taking the signal from one channel, equalizing and delaying it, and feeding it into the other channel, and vice versa.
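
The idea can be sketched in a few lines of Python with NumPy. This is only an illustration of the principle, not Bauer's circuit or the algorithm of any particular plug-in: the EQ is reduced to a simple first-order low-pass filter, and the delay (0.3 ms), cutoff (700 Hz) and level (-6 dB) are assumed values picked for the example. The function names are hypothetical.

```python
import math
import numpy as np

def one_pole_lowpass(x, cutoff_hz, sample_rate):
    """First-order low-pass filter, a crude stand-in for head shadowing."""
    a = math.exp(-2.0 * math.pi * cutoff_hz / sample_rate)
    y = np.empty(len(x), dtype=float)
    state = 0.0
    for i, sample in enumerate(x):
        state = (1.0 - a) * sample + a * state
        y[i] = state
    return y

def crossfeed(left, right, sample_rate=44100,
              delay_s=0.0003, cutoff_hz=700.0, level_db=-6.0):
    """Mix a filtered, delayed, attenuated copy of each channel into the other.

    All default parameter values are assumptions for illustration only.
    """
    left = np.asarray(left, dtype=float)
    right = np.asarray(right, dtype=float)
    delay = int(round(delay_s * sample_rate))  # delay in whole samples
    gain = 10.0 ** (level_db / 20.0)           # dB to linear gain

    def bleed(x):
        filtered = one_pole_lowpass(x, cutoff_hz, sample_rate)
        # Prepend zeros to delay the signal, then trim to the original length.
        delayed = np.concatenate([np.zeros(delay), filtered])[:len(x)]
        return gain * delayed

    return left + bleed(right), right + bleed(left)

# Usage: a click on the left channel only.
left = np.zeros(100)
left[0] = 1.0
right = np.zeros(100)
out_left, out_right = crossfeed(left, right)
```

In the usage example the click stays unchanged on the left, while the right channel receives a quieter, duller copy that starts 13 samples later. Because the two signals are summed, a real implementation would also need to lower the overall level to avoid clipping.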
Your media player might have one built in, or there may be a plug-in for it.
A VST plug-in and a lot more about headphones can be found here: BlogOhl
A VST plug-in simulating both the HRTF and the room acoustics is Isone Pro