General:

User modeling was identified as necessary for building modern, more efficient, and personalized music information retrieval systems, because many features of multimedia content delivery are perceptual and user-dependent.
Freemium models earn money through targeted ads. Targets can be found on the basis of interests, lifestyle, personality values, etc.
Older datasets like the last.fm Dataset-1k were either too small or lacked self-declared demographic data or identifiers linked to other databases, such as MusicBrainz identifiers (MBIDs). MLHD tries to address all of these issues at scale while keeping data quality high.
A listening history is a timeline of listening events, so analyzing it chronologically is interesting: we can observe when people consume music and which music they do or don't enjoy over time.

Data Collection Methodology:

Only listeners with at least two years of activity were recorded (to keep the data even across users).
Only listeners averaging at least 10 scrobbles per day were recorded. The threshold is arbitrary and filters out casual users.
Together these give a minimum of 7,300 (365 x 2 x 10) music logs per listener.
last.fm's internal user identifiers were passed as arguments to the API requests. These IDs are sequential, which lets a scraper sample users at random from the complete database instead of sampling them through their friends or an artist's top fans.
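
The activity thresholds above can be sketched as a simple eligibility check. This is only an illustration under assumptions: the function name is_eligible is hypothetical, timestamps are assumed to be Unix epoch seconds, and "two years of activity" is assumed to mean the span between a listener's first and last log. The notes do not say how the dataset authors computed it.

```python
SECONDS_PER_DAY = 86_400
MIN_ACTIVE_DAYS = 365 * 2                       # two years of activity
MIN_AVG_PER_DAY = 10                            # average scrobbles per day
MIN_LOGS = MIN_ACTIVE_DAYS * MIN_AVG_PER_DAY    # 7,300 logs


def is_eligible(timestamps):
    """Return True if a listener meets the activity thresholds.

    timestamps: list of Unix epoch seconds (assumed format).
    Activity span is measured from first to last log (an assumption).
    """
    if len(timestamps) < MIN_LOGS:
        return False
    span_days = (max(timestamps) - min(timestamps)) / SECONDS_PER_DAY
    if span_days < MIN_ACTIVE_DAYS:
        return False
    return len(timestamps) / span_days >= MIN_AVG_PER_DAY
```

A listener with 8,000 logs packed into 100 days fails the span check, and one with 800 logs over two years fails the minimum-log check.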

Each log is organized as the following tuple: <timestamp, artist-MBID, release-MBID, track-MBID>
Timestamp: synchronized to UTC.
MBIDs: 36-character UUIDs.
Glitched logs were removed: duplicates (same MBID and same timestamp) and logs with timestamps less than 30 s apart. => On average, 8% of a user's logs were duplicates and 1% were too close together.
58% of all logs in the dataset have full data (MBIDs for all three entities).
27 billion logs -> 583k listeners -> 555k unique artists -> 900k albums -> 7M tracks
Median number of logs per user = 35k
- My own finding: an average of 46,755.07 scrobbles per file in chunk 00.
Median age of the listening histories = 4.5 years.
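
The glitch-removal rule noted above (exact duplicates, then logs less than 30 s apart) can be sketched as follows. This is a minimal sketch, not the dataset authors' code: the name clean_logs is hypothetical, timestamps are assumed to be Unix epoch seconds, a "duplicate" is assumed to mean an identical tuple, and keeping the earlier of two too-close logs is also an assumption.

```python
MIN_GAP_SECONDS = 30


def clean_logs(logs):
    """Drop duplicate and too-close logs from one listener's history.

    logs: iterable of (timestamp, artist_mbid, release_mbid, track_mbid)
    tuples, with timestamp in Unix epoch seconds (assumed format).
    Returns (kept_logs, duplicate_count, too_close_count).
    """
    kept = []
    seen = set()
    duplicates = 0
    too_close = 0
    for log in sorted(logs, key=lambda entry: entry[0]):
        if log in seen:
            # Same timestamp and same MBIDs as an earlier log.
            duplicates += 1
            continue
        seen.add(log)
        if kept and log[0] - kept[-1][0] < MIN_GAP_SECONDS:
            # Less than 30 s after the previously kept log.
            too_close += 1
            continue
        kept.append(log)
    return kept, duplicates, too_close
```

Dividing the two counts by the number of input logs gives per-user rates that can be compared against the 8% and 1% averages in the notes.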