As recommendation systems take more of the steering wheel for music discovery, a fair question is: do algorithmically generated playlists actually look different from playlists curated by humans? This INFO 5304 final project (Cornell Tech, Spring 2026) tackles the question with two preregistered hypotheses, a real Spotify dataset, and proper statistical testing.
Working in a 3-person team, I owned the data-collection & cleaning pipeline (Phase 2) and contributed to the analysis side of Phase 4 — variance comparison, regression, and the standardized feature-importance interpretation.
Two hypotheses were registered up-front:
H1. AI-generated playlists are more homogeneous than user-curated playlists in audio features (danceability, energy, valence, tempo, acousticness).
H2. AI-generated playlists have lower artist diversity than user-curated playlists.
Each hypothesis got its own preregistered test — Levene's test for variance equality on H1, and Welch's t-tests on unique-artist count and repetition rate for H2 — at α = 0.05.
The merged dataset is the foundation everything else stands on. The pipeline turns two messy external sources into one analysis-ready CSV:
① Spotify Web API
Authenticated via Spotipy / OAuth and pulled 6 playlists (3 AI-generated + 3 user-curated, ~25 tracks
each → 150 tracks total). Captured playlist_type, playlist_id,
playlist_name, track_name, and artist_name.
② Kaggle Audio-Features Dataset
The Kaggle "Spotify Tracks Dataset (1921–2020)" (~160k songs) supplied precomputed audio
features — danceability, energy, valence, tempo, acousticness, plus loudness / speechiness /
instrumentalness / liveness — to compensate for limited direct API access to features.
③ Cleaning & Merge
Lowercased and trimmed names on both sides, normalized whitespace, joined on
(track_name, artist_name), de-duplicated, and exported
analysis_ready_data.csv. Match rate: ~53%, a realistic outcome when joining free-text keys
across two independently collected sources.
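The normalize-then-join step can be sketched in pandas. This is a minimal illustration with made-up rows, not the project's actual code; the `normalize` helper and the toy track/feature values are assumptions, while the column names (`track_name`, `artist_name`, `playlist_type`) come from the pipeline description above.

```python
import pandas as pd

def normalize(s: pd.Series) -> pd.Series:
    # Lowercase, trim, and collapse internal whitespace so join keys line up.
    return s.str.lower().str.strip().str.replace(r"\s+", " ", regex=True)

# Toy stand-ins for the two sources (values are illustrative only).
playlists = pd.DataFrame({
    "playlist_type": ["AI", "User"],
    "track_name": ["  Blinding Lights ", "bad guy"],
    "artist_name": ["The Weeknd", "Billie  Eilish"],
})
features = pd.DataFrame({
    "track_name": ["blinding lights", "bad guy"],
    "artist_name": ["the weeknd", "billie eilish"],
    "danceability": [0.51, 0.70],
    "energy": [0.73, 0.43],
})

for df in (playlists, features):
    df["track_name"] = normalize(df["track_name"])
    df["artist_name"] = normalize(df["artist_name"])

merged = (playlists
          .merge(features, on=["track_name", "artist_name"], how="inner")
          .drop_duplicates(subset=["track_name", "artist_name"]))
match_rate = len(merged) / len(playlists)
```

An inner join on the normalized `(track_name, artist_name)` pair is what caps the match rate: any remix suffix or "feat." tag on one side but not the other silently drops the row.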
H1 — Homogeneity / Variance
Per-feature variance computed separately for AI vs User playlists, then Levene's test for
equality of variances on each of the five features. Lower variance for the AI group across multiple
features supports H1.
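The per-feature test reduces to a single scipy call. A minimal sketch on synthetic data, where the AI group is given a deliberately tighter spread; the means, standard deviations, and sample sizes below are illustrative assumptions, not the project's measurements.

```python
import numpy as np
from scipy.stats import levene

rng = np.random.default_rng(0)
# Toy data: AI playlists with a tighter "energy" spread than user playlists.
ai_energy = rng.normal(loc=0.6, scale=0.05, size=75)
user_energy = rng.normal(loc=0.6, scale=0.15, size=75)

# Levene's test for equality of variances; a small p-value rejects equality.
stat, p = levene(ai_energy, user_energy)
lower_ai_var = ai_energy.var(ddof=1) < user_energy.var(ddof=1)
```

In the project this is repeated once per feature (danceability, energy, valence, tempo, acousticness) at α = 0.05, and the direction of the variance difference is checked separately from the test itself.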
H2 — Artist Diversity
Aggregated to playlist level: unique_artists and
artist_repetition_rate = 1 - unique_artists/total_tracks. Two
independent Welch's t-tests (unique-artist count and repetition rate) between AI and User
groups.
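The playlist-level aggregation and Welch's t-test can be sketched as follows. The tiny track table is fabricated for illustration (real playlists have ~25 tracks each); the column names match the pipeline, and `equal_var=False` is what turns scipy's t-test into Welch's variant.

```python
import pandas as pd
from scipy.stats import ttest_ind

# Toy track-level table (values made up; columns as in the pipeline).
tracks = pd.DataFrame({
    "playlist_id": ["a1"] * 4 + ["a2"] * 4 + ["u1"] * 4 + ["u2"] * 4,
    "playlist_type": ["AI"] * 8 + ["User"] * 8,
    "artist_name": ["w", "x", "y", "z", "p", "q", "r", "r",
                    "m", "m", "m", "n", "k", "k", "l", "j"],
})

# Aggregate to one row per playlist.
per_playlist = (tracks.groupby(["playlist_id", "playlist_type"])
                .agg(total_tracks=("artist_name", "size"),
                     unique_artists=("artist_name", "nunique"))
                .reset_index())
per_playlist["artist_repetition_rate"] = (
    1 - per_playlist["unique_artists"] / per_playlist["total_tracks"])

ai = per_playlist[per_playlist["playlist_type"] == "AI"]
user = per_playlist[per_playlist["playlist_type"] == "User"]
# equal_var=False gives Welch's t-test (no equal-variance assumption).
t_stat, p = ttest_ind(ai["artist_repetition_rate"],
                      user["artist_repetition_rate"], equal_var=False)
```

The same call is run a second time on `unique_artists`; both tests are read against α = 0.05.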
Regression — Feature Importance
OLS via statsmodels: is_ai ~ danceability + energy + valence + tempo + acousticness.
Then a standardized version (StandardScaler on features) so coefficients are directly comparable
across feature scales.
Supporting Visuals
Correlation heatmap on the five features, AI-vs-User mean comparison bars, and
valence-vs-energy / valence-vs-danceability scatter colored by playlist type to sanity-check what
the regression is picking up.
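The visuals reduce to a correlation matrix plus grouped scatters. A minimal matplotlib sketch on synthetic data (feature values and group labels are randomly generated here, purely to show the plotting structure):

```python
import matplotlib
matplotlib.use("Agg")  # headless rendering
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

rng = np.random.default_rng(2)
features = ["danceability", "energy", "valence", "tempo", "acousticness"]
df = pd.DataFrame(rng.random((60, 5)), columns=features)
df["playlist_type"] = np.where(rng.random(60) < 0.5, "AI", "User")

corr = df[features].corr()

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))
# Heatmap of pairwise feature correlations.
im = ax1.imshow(corr, vmin=-1, vmax=1, cmap="coolwarm")
ax1.set_xticks(range(5))
ax1.set_xticklabels(features, rotation=45)
ax1.set_yticks(range(5))
ax1.set_yticklabels(features)
fig.colorbar(im, ax=ax1)
ax1.set_title("Feature correlations")

# Scatter colored by playlist type.
for grp, sub in df.groupby("playlist_type"):
    ax2.scatter(sub["valence"], sub["energy"], label=grp, alpha=0.6)
ax2.set_xlabel("valence")
ax2.set_ylabel("energy")
ax2.legend()
ax2.set_title("Valence vs energy by playlist type")
```

The valence-vs-danceability panel follows the same pattern with the axes swapped to the other feature pair.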
✅ H1 — Reject Null
AI playlists are more homogeneous: lower variance across multiple audio features. Recommendation
pipelines do tighten the musical envelope.
❌ H2 — Fail to Reject
Counter-intuitive result: human-curated playlists actually showed higher artist repetition.
Users lean into familiar artists; AI playlists distribute across more artists.
📈 Strongest Predictor
After standardization, danceability is the strongest positive predictor of AI playlists.
Other features contribute weakly and inconsistently.
Bottom line: AI recommendation systems shape musical structure and consistency, but they don't inherently reduce artist diversity or limit exposure to different artists. R² remains modest, so audio features alone are not a full explanation; user behavior, context, and genre would need to enter the model to push further.
Partial Feature Coverage
Only ~53% of tracks matched the external Kaggle dataset; name-based joins are inherently noisy
(remixes, "feat." tags, capitalization variants). Effective sample for the regression is
smaller than 150.
Small & Narrow Sample
6 playlists / 150 tracks is enough to surface effects but not enough to generalize to all
AI-generated playlists, let alone all of Spotify's recommendation surfaces.
External Audio Features
Audio features come from a Kaggle dataset rather than directly from Spotify, so values may be
slightly stale or inconsistent with current API responses.
Within-Playlist Dependence
Tracks inside a playlist are not independent samples — a known caveat for the t-tests / regression
assumptions, flagged in the report.
3-person team for INFO 5304 — Jennifer Cheng (yc2932), Chi (Jaclyn) Pham (cqp4), Ruolan Chen (rc975). My contributions: Spotify API + Kaggle data pipeline (Phase 2), variance / regression analysis and standardized feature-importance interpretation (Phase 4).