LAION Unveils Largest Public Music Dataset for AI Research, Boosting Audio-Tech Advancements
November 20, 2024
LAION AI has launched LAION-DISCO-12M, the largest publicly available music dataset for audio AI research, comprising 12 million links to YouTube audio samples along with comprehensive metadata.
The metadata includes timestamps, descriptions, and keywords, which help researchers explore and contextualize the audio content.
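To illustrate what a links-plus-metadata dataset looks like in practice, here is a minimal Python sketch. It is not LAION's actual schema: the `TrackRecord` class, its field names (`video_id`, `description`, `keywords`), and the sample IDs are hypothetical placeholders chosen for illustration. The sketch only shows how a stored video ID can be turned into a link and how records might be filtered by keyword.

```python
from dataclasses import dataclass, field


@dataclass
class TrackRecord:
    """Hypothetical record layout; the real dataset's field names may differ."""
    video_id: str
    description: str = ""
    keywords: list[str] = field(default_factory=list)

    @property
    def url(self) -> str:
        # Build a link from the stored ID using the common YouTube watch-URL pattern.
        return f"https://www.youtube.com/watch?v={self.video_id}"


def filter_by_keyword(records, keyword):
    """Return the records whose keyword list contains `keyword`, ignoring case."""
    wanted = keyword.lower()
    return [r for r in records if wanted in (k.lower() for k in r.keywords)]


# Placeholder records, not real entries from the dataset.
records = [
    TrackRecord("aaaaaaaaaaa", "Live session", ["jazz", "live"]),
    TrackRecord("bbbbbbbbbbb", "Studio single", ["Pop"]),
]
jazz = filter_by_keyword(records, "JAZZ")
```

Because the dataset distributes links rather than audio files, a record like this stays small, and any audio retrieval is a separate step left to the researcher.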
DISCO-12M includes an expanded selection of artists, totaling 250,516, achieved by analyzing country charts and genre playlists.
It serves as an upgrade from the previous DISCO-10M, utilizing data sourced directly from YouTube Music to eliminate errors from manual matching with Spotify metadata.
The dataset provides significant scale and diversity, addressing limitations faced by existing audio datasets that often lack size and contextual data.
Initial tests on LAION-DISCO-12M have demonstrated a 15% accuracy improvement in music classification models compared to smaller datasets.
Researchers can leverage LAION-DISCO-12M for training large-scale transformer models in various applications, including music generation, audio classification, and audio-to-text translation.
The dataset aims to bridge the data gap between audio and other domains such as computer vision and natural language processing, facilitating advancements in audio and music technologies.
LAION envisions that this dataset will enhance audio AI technologies, improving features like music identification, content-based searches, and recommendation systems.
The availability of LAION-DISCO-12M represents a valuable resource for open, community-driven AI research, free from licensing fees and access restrictions.
Released under the Apache 2.0 license, the dataset is intended for academic research, and LAION discourages commercial applications to avoid copyright issues.
This release aligns with a Hamburg Regional Court ruling that permits data scraping for non-commercial scientific research, further legitimizing the dataset's use.
Summary based on 2 sources