ISSN: 2685-9572 Buletin Ilmiah Sarjana Teknik Elektro
Vol. 7, No. 4, December 2025, pp. 842-857
Geographic-Origin Music Classification from Numerical Audio Features: Integrating Unsupervised Clustering with Supervised Models
Andri Pranolo 1, Sularso Sularso 2, Nuril Anwar 1, Agung Bella Utama Putra 3,
Aji Prasetya Wibawa 3, Shoffan Saifullah 4,5, Rafał Dreżewski 5, Zalik Nuryana 6, Tri Andi 7
1 Informatics Department, Universitas Ahmad Dahlan, Yogyakarta, Indonesia
2 Elementary Teacher Education, Universitas Ahmad Dahlan, Yogyakarta, Indonesia
3 Electrical Engineering and Informatics, Universitas Negeri Malang, Malang, Indonesia
4 Department of Informatics, Universitas Pembangunan Nasional Veteran Yogyakarta, Yogyakarta, Indonesia
5 Faculty of Computer Science, AGH University of Krakow, Krakow, Poland
6 Association for Scientific Computing Electronics and Engineering (ASCEE), Education Society, Indonesia
7 Information Technology, Universitas Muhammadiyah Yogyakarta, Yogyakarta, Indonesia
ARTICLE INFORMATION
Article History: Received 31 May 2025; Revised 06 November 2025; Accepted 19 November 2025
Keywords: Geographical Music; Music Information Retrieval; K-means Clustering; Cluster-Supervised Learning; Support Vector Machine; Convolutional Neural Network; Classification
Corresponding Author: Andri Pranolo, Informatics Department, Universitas Ahmad Dahlan Yogyakarta, Indonesia. Email: andri.pranolo@tif.uad.ac.id
This work is open access under a Creative Commons Attribution-Share Alike 4.0
ABSTRACT
Classifying the geographic origin of music is a relevant task in music information retrieval, yet most studies have focused on genre or style recognition rather than regional origin. This study evaluates Support Vector Machine (SVM) and Convolutional Neural Network (CNN) models on the UCI Geographical Origin of Music dataset (1,059 tracks from 33 non-Western regions) using numerical audio features. To incorporate latent structure, we first applied K-means clustering with the optimal number of clusters (
Document Citation: A. Pranolo, S. Sularso, N. Anwar, A. B. U. Putra, A. P. Wibawa, S. Saifullah, R. Dreżewski, Z. Nuryana, and T. Andi, “Geographic-Origin Music Classification from Numerical Audio Features: Integrating Unsupervised Clustering with Supervised Models,” Buletin Ilmiah Sarjana Teknik Elektro, vol. 7, no. 4, pp. 842-857, 2025, DOI: 10.12928/biste.v7i4.13400. | ||
The advancement of information technology (IT) and artificial intelligence (AI) has profoundly impacted a range of sectors [1]–[3], from healthcare and finance to marketing and transportation [4]. One of the key challenges is effectively processing and analyzing large, diverse datasets, which can vary greatly in terms of structure and content [5]. Techniques such as clustering [6], which groups similar data points, and classification [7]–[9], which assigns labels based on patterns within the data, are essential for supporting informed decision-making in these domains.
Clustering methods such as K-Means [10], Agglomerative Clustering [11], and DBSCAN [12] are popular algorithms for data clustering. Among these, K-Means is the most commonly used because it is simple to implement, computationally efficient, and produces clearly delineated groups [13]. However, K-Means requires the number of clusters (k) to be chosen carefully so that the resulting groups are informative [14]. Two evaluation techniques are often used for this purpose: the elbow method, which inspects how the inertia (within-cluster sum of squares) decreases as k grows, and the silhouette score, which evaluates the quality of inter-cluster separation [15]. Using the two criteria together gives a more reliable choice of k than relying on either one alone.
Once the clustering stage is complete, the next step is to classify the data based on the resulting cluster labels. Classification methods such as Support Vector Machine (SVM) [16][17], multi-layer perceptron (MLP) [18], XGBoost [20][21], LSTM [22], and Convolutional Neural Network (CNN) [23], sometimes combined with resampling techniques such as SMOTE [19], have become popular because they have been shown to produce high performance in various previous studies [24]. SVM is known to be highly effective for data that is linearly or nearly linearly separable [25]. In contrast, CNN is known to be more effective in handling complex data, particularly data with strong spatial patterns [26].
In the music information retrieval (MIR) domain, these techniques have been widely applied. MIR has advanced rapidly with the rise of artificial intelligence [27][28], enabling applications such as genre recognition, mood detection, and recommendation systems [29]–[31]. Beyond these popular tasks, classifying the geographic origin of music emerges as an important research direction with implications for cultural heritage preservation, recommendation, and cross-cultural analysis [32]. Unlike genre classification, however, origin classification is more subtle, since musical styles often overlap across regions and are influenced by cultural exchanges [33].
Most prior work has focused on genre classification using deep learning models trained on spectrograms, where CNNs have shown strong performance in capturing spatial patterns in time–frequency representations [34]–[36]. Other studies have applied SVMs and classical machine learning models to tabular audio descriptors, reporting competitive results in limited scenarios [37]. However, geographic origin classification remains underexplored, particularly when relying solely on numerical features rather than spectrograms. Moreover, the potential of unsupervised cluster structures to provide auxiliary training signals in supervised classification has not been systematically investigated.
The research contribution is fourfold: (i) we formalize cluster-supervised learning for this task, where unsupervised K-means provides auxiliary pseudo-labels to shape decision boundaries while evaluation uses only the true origin labels; (ii) we establish a validation-rigorous benchmark on the UCI Geographical Origin of Music dataset, using stratified cross-validation, nested hyperparameter tuning, and learning curves to address pitfalls of single-split reports; (iii) we compare SVM (RBF) with a tabular-appropriate compact CNN/MLP and classical baselines, including ablations with and without cluster supervision, scaling, and PCA; and (iv) we report statistical significance and uncertainty estimates (95% confidence intervals) and analyze risks of overfitting. Together, these contributions fill a gap in MIR research and provide a reproducible framework for future work in music origin classification.
The rest of this paper is structured as follows. Section 2 reviews related work on clustering and classification methods in music information retrieval. Section 3 describes the proposed methodology, including preprocessing, clustering, and classification models. Section 4 presents the experimental setup and evaluation metrics, while Section 5 reports and discusses the results, including ablation studies and statistical analyses. Section 6 concludes the paper with a summary of findings, limitations, and directions for future research.
Clustering is a foundational unsupervised learning technique widely applied in data mining and machine learning [38]. In the context of music information retrieval (MIR), clustering is often used to group tracks according to shared acoustic or statistical features such as timbre, tempo, or rhythmic patterns [39][40]. Algorithms such as K-means [41], Agglomerative Hierarchical Clustering [42], and DBSCAN [43] are particularly prominent. Among these, K-means remains the most frequently adopted due to its simplicity, interpretability, and computational efficiency [44]. However, its effectiveness depends on choosing the appropriate number of clusters ($k$), for which criteria such as the Elbow method and Silhouette coefficient are commonly applied.
In MIR tasks, clustering has been employed in genre grouping, playlist generation, and audio similarity analysis [33]. For example, clustering combined with feature selection has been used to automatically categorize songs into genre-like groupings [45]. Nevertheless, in the specific case of geographic-origin classification, the use of clustering is still limited. Existing works have mostly relied on clustering as a pre-processing or exploratory step rather than integrating cluster information into supervised models [46]. This gap provides a strong motivation for exploring whether cluster-informed supervision can enhance the classification of music origin.
Classification methods are central to MIR, particularly for tasks such as genre recognition, mood prediction, instrument detection, and artist identification [28],[47]. In recent years, Convolutional Neural Networks (CNNs) have been dominant in MIR, primarily applied to spectrograms or time–frequency representations of audio signals. CNNs are well-suited to this modality because they can automatically learn hierarchical spatial patterns from spectrogram images [48]. For instance, Seo (2024) presented a comprehensive comparison of CNN architectures for genre recognition using multispectral features, demonstrating CNNs’ superiority in capturing spectral correlations [36]. Similarly, Ding et al. (2024) proposed an ECAS-CNN architecture incorporating attention mechanisms for Mel-spectrograms, achieving state-of-the-art performance in genre classification [49]. These studies highlight that CNNs are highly effective when the input is structured in a two-dimensional, image-like form.
On the other hand, Support Vector Machines (SVMs) and other traditional classifiers remain competitive for tabular or numerical audio descriptors. Such features typically include timbre (MFCCs), rhythm, tempo, or chroma statistics. Liang et al. (2023) emphasized that SVMs, when applied to handcrafted descriptors, can achieve performance comparable to that of deep models under certain conditions [50]. The advantage of SVM lies in its ability to construct robust decision boundaries in relatively low-dimensional spaces, particularly when the data is not extremely large. This makes SVMs attractive for scenarios where spectrogram representations are unavailable or where computational simplicity is desired. Thus, while CNNs dominate spectrogram-based tasks, SVMs remain relevant in feature-based MIR, underscoring the need to evaluate their comparative effectiveness for problems such as geographic origin classification.
The task of geographic-origin classification is relatively underexplored compared to genre or mood classification. The UCI Geographical Origin of Music dataset [51] has emerged as the standard benchmark for this problem. It contains 1,059 music tracks from 33 non-Western regions, represented by numerical descriptors, and explicitly excludes Western music to focus on culturally distinct styles. Zhou et al. (2014) investigated this dataset in a CMU technical report, applying standard classifiers but with limited evaluation depth [52][53]. More recently, Kostrzewa et al. (2024) revisited the dataset and explored different classification strategies, reporting promising results but again with restricted methodological rigor [54][55].
These studies highlight two key limitations: (i) the reliance on a single dataset without cross-validation or significance testing, and (ii) the absence of frameworks that combine unsupervised clustering with supervised classification. As a result, claims of performance superiority remain difficult to generalize. Our work seeks to address these gaps by adopting rigorous evaluation protocols (stratified cross-validation, nested tuning, confidence intervals) and by proposing a cluster-supervised framework that systematically integrates K-means clustering with supervised classifiers.
Beyond MIR, hybrid approaches that combine clustering and classification have gained traction in machine learning [56][57]. These methods leverage unsupervised cluster structure as auxiliary information to guide supervised models [58]. Enguehard et al. (2019) provided a survey of semi-supervised approaches that exploit clustering to regularize classification, showing improved generalization when labeled data is scarce [59]. Similarly, Gupta et al. (2020) proposed a pseudo-labeling framework where cluster assignments are treated as weak labels during training, demonstrating effectiveness in stabilizing learning across domains [60], [61].
In the context of MIR, however, such hybrid methods have rarely been applied. While clustering has been used for exploratory analysis and classification for genre and mood prediction, its integration for music origin classification has not been systematically investigated. By introducing cluster-supervised learning, where cluster assignments serve as auxiliary pseudo-labels while final evaluation is performed on accurate origin labels, this study fills a critical methodological gap in MIR research.
This section describes the proposed framework for geographic-origin music classification, consisting of data preprocessing, unsupervised clustering, supervised classification, and evaluation protocols. The approach is designed to overcome limitations identified in prior studies, including reliance on single train–test splits, insufficient statistical rigor, and the absence of cluster-assisted supervision. An overview of the methodology is presented in Figure 1.
Figure 1. Flowchart of the proposed framework, illustrating preprocessing, K-means clustering for auxiliary pseudo-labels, supervised classification with SVM and CNN/MLP, and evaluation through cross-validation and statistical analysis
We employed the UCI Geographical Origin of Music dataset https://archive.ics.uci.edu/dataset/315/geographical+original+of+music, which comprises 1,059 tracks from 33 non-Western regions represented by numerical descriptors. Western music is excluded, ensuring that the dataset emphasizes culturally distinctive features. Each track is described by tabular features including rhythm, pitch, timbre, and temporal descriptors [51]. To ensure data quality and comparability, several preprocessing steps were applied. Missing values were handled using median imputation to maintain data integrity without introducing bias. All features were standardized using z-score normalization, which is crucial for distance-based clustering and Support Vector Machine (SVM) classification. Dimensionality reduction was optionally performed using Principal Component Analysis (PCA), retaining 95% of the total variance; the effect of PCA on model performance was further examined through ablation analysis. Finally, class balance verification was conducted to guarantee stratified splitting across folds, addressing the dataset’s inherent imbalance among regional classes.
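The imputation and standardization steps described above can be sketched as follows. This is a minimal illustration using only the Python standard library, not the authors' pipeline: the function names `fit_preprocessor` and `transform`, the use of `None` to mark missing values, and the toy rows are all assumptions. The statistics are learned from the training rows only and then reused for held-out rows, which matches the leakage-avoidance principle stated later in the methodology.

```python
from statistics import mean, median, pstdev

def fit_preprocessor(train_rows):
    """Learn per-feature median, mean, and std from training rows only.

    Missing values are represented as None (an assumption of this sketch).
    """
    columns = list(zip(*train_rows))
    medians = [median(v for v in col if v is not None) for col in columns]
    # Fill gaps with the training median before computing mean and std.
    filled = [[m if v is None else v for v in col]
              for col, m in zip(columns, medians)]
    means = [mean(col) for col in filled]
    # Population std; fall back to 1.0 for zero-variance features.
    stds = [pstdev(col) or 1.0 for col in filled]
    return medians, means, stds

def transform(rows, params):
    """Median-impute and z-score rows using previously fitted statistics."""
    medians, means, stds = params
    return [
        [((med if v is None else v) - mu) / sd
         for v, med, mu, sd in zip(row, medians, means, stds)]
        for row in rows
    ]

# Hypothetical toy data: two features, one missing value in each split.
train = [[1.0, 10.0], [2.0, None], [3.0, 30.0], [4.0, 20.0]]
test = [[None, 40.0]]
params = fit_preprocessor(train)
train_z = transform(train, params)
test_z = transform(test, params)
```

The held-out row is imputed with the training median, so its first feature maps exactly to zero after standardization.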
The first stage of the framework is unsupervised clustering using the K-means algorithm (Algorithm 1), chosen for its computational efficiency and popularity in music information retrieval (MIR) tasks [7]. K-means partitions a dataset into $k$ clusters by minimizing the within-cluster sum of squares (WCSS):
$$J = \sum_{j=1}^{k} \sum_{x_i \in C_j} \lVert x_i - \mu_j \rVert^2 \tag{1}$$
where $C_j$ is the set of data points in cluster $j$ and $\mu_j$ is the centroid of cluster $j$.
The algorithm iteratively updates assignments and centroids until convergence:
$$c_i = \arg\min_{j} \lVert x_i - \mu_j \rVert^2 \tag{2}$$
$$\mu_j = \frac{1}{|C_j|} \sum_{x_i \in C_j} x_i \tag{3}$$
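The alternating assignment and update steps can be illustrated with a short pure-Python version of Lloyd's algorithm. This is a sketch under stated assumptions rather than the implementation used in the study: the function names, the random initialization from data points, the fixed seed, and the toy points are all hypothetical.

```python
import random

def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(points, k, n_iter=100, seed=0):
    """Lloyd's algorithm: returns (labels, centroids, wcss)."""
    rng = random.Random(seed)
    # Initialize centroids from k distinct data points (a simple choice).
    centroids = [list(p) for p in rng.sample(points, k)]
    labels = None
    for _ in range(n_iter):
        # Assignment step (Eq. 2): nearest centroid for every point.
        new_labels = [min(range(k), key=lambda j: sq_dist(p, centroids[j]))
                      for p in points]
        if new_labels == labels:
            break  # assignments stopped changing: converged
        labels = new_labels
        # Update step (Eq. 3): each centroid becomes the mean of its members.
        for j in range(k):
            members = [p for p, l in zip(points, labels) if l == j]
            if members:  # keep the old centroid if a cluster is empty
                centroids[j] = [sum(c) / len(members) for c in zip(*members)]
    # Objective (Eq. 1): within-cluster sum of squares.
    wcss = sum(sq_dist(p, centroids[l]) for p, l in zip(points, labels))
    return labels, centroids, wcss

# Hypothetical toy data: two well-separated groups of three points.
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
          (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
labels, centroids, wcss = kmeans(points, k=2)
```

On this toy input the two groups are recovered and the WCSS equals 8/3, which is the sum of squared distances of each point to its group mean.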
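Optimal Cluster Selection. Determining the appropriate number of clusters is critical. We applied two complementary methods: the Elbow method, which inspects the WCSS (inertia) as a function of $k$, and the Silhouette score, defined for each sample $i$ as
$$s(i) = \frac{b(i) - a(i)}{\max\{a(i), b(i)\}} \tag{4}$$
where $a(i)$ is the mean intra-cluster distance of sample $i$, and $b(i)$ is the mean distance of $i$ to the nearest other cluster.
The selected $k = 2$ achieved the best trade-off between intra-cluster compactness and inter-cluster separation.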
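To make the silhouette criterion concrete, the following sketch computes the mean silhouette coefficient for a given labeling. It is an illustrative standard-library implementation with hypothetical names and toy data; it assumes Euclidean distance and at least two clusters, and it scores singleton clusters as zero by convention.

```python
import math

def mean_silhouette(points, labels):
    """Mean of s(i) = (b(i) - a(i)) / max(a(i), b(i)) over all samples.

    Requires at least two distinct cluster labels.
    """
    clusters = {}
    for p, l in zip(points, labels):
        clusters.setdefault(l, []).append(p)
    scores = []
    for p, l in zip(points, labels):
        own = clusters[l]
        if len(own) == 1:
            scores.append(0.0)  # convention for singleton clusters
            continue
        # a(i): mean distance to the other members of the same cluster
        # (the distance of p to itself contributes zero to the sum).
        a = sum(math.dist(p, q) for q in own) / (len(own) - 1)
        # b(i): smallest mean distance to the members of any other cluster.
        b = min(sum(math.dist(p, q) for q in other) / len(other)
                for key, other in clusters.items() if key != l)
        scores.append((b - a) / max(a, b))
    return sum(scores) / len(scores)

# Hypothetical toy data: a sensible and a deliberately poor labeling.
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
          (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
good = mean_silhouette(points, [0, 0, 0, 1, 1, 1])
bad = mean_silhouette(points, [0, 1, 0, 1, 0, 1])
```

In practice the score would be evaluated for several candidate values of $k$ and the maximizer compared against the elbow of the WCSS curve.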
The cluster assignments were then used as auxiliary pseudo-labels during supervised training. Unlike prior works that treated cluster labels as ground truth, our framework integrates them into a multi-task objective:
$$\mathcal{L} = \mathcal{L}_{\text{sup}} + \lambda \, \mathcal{L}_{\text{cluster}} \tag{5}$$
where $\mathcal{L}_{\text{sup}}$ is the supervised loss (cross-entropy with true labels $y$), $\mathcal{L}_{\text{cluster}}$ is the auxiliary clustering loss (cross-entropy between cluster assignments $c$ and predicted clusters $\hat{c}$), and $\lambda$ is a weighting parameter. To avoid data leakage, K-means clustering was fit only on the training folds within cross-validation. The learned centroids were then applied to generate cluster assignments for validation and test samples.
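The leakage-avoidance rule can be sketched as follows: centroids are computed from training samples only and are then frozen when pseudo-labels are generated for held-out samples. The helper names, the toy coordinates, and the assumption that the training-fold cluster labels are already available (for example, from a K-means run on that fold) are illustrative and not taken from the paper.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def centroids_from(points, labels, k):
    """Centroid of each cluster, using training samples only.

    Assumes every cluster 0..k-1 has at least one member.
    """
    centroids = []
    for j in range(k):
        members = [p for p, l in zip(points, labels) if l == j]
        centroids.append([sum(col) / len(members) for col in zip(*members)])
    return centroids

def assign(points, centroids):
    """Nearest-centroid pseudo-labels; the centroids are not updated."""
    return [min(range(len(centroids)), key=lambda j: sq_dist(p, centroids[j]))
            for p in points]

# Hypothetical training fold with its cluster labels (assumed given).
train_x = [(0.0, 0.1), (0.2, 0.0), (5.0, 5.1), (5.2, 4.9)]
train_clusters = [0, 0, 1, 1]
# Hypothetical validation fold: never used to fit the centroids.
val_x = [(0.1, 0.3), (4.8, 5.0)]

centroids = centroids_from(train_x, train_clusters, k=2)
val_clusters = assign(val_x, centroids)
```

Because the validation points never influence the centroids, the pseudo-labels they receive carry no information from the held-out fold back into training.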
After clustering, the framework proceeds to supervised classification, where models are trained on the true geographic origin labels. The auxiliary cluster assignments generated by K-means are used as regularization signals, helping to shape decision boundaries. Two main classifiers were benchmarked, a Support Vector Machine (SVM) and a compact MLP adapted from CNN principles, alongside classical baselines (Logistic Regression, k-Nearest Neighbors, Random Forest [8],[62], and Gradient Boosting). The SVM with RBF kernel was selected because of its suitability for tabular features. Its optimization problem seeks a hyperplane that maximizes class separation:
$$\min_{w, b, \xi} \; \frac{1}{2} \lVert w \rVert^2 + C \sum_{i=1}^{n} \xi_i \tag{6}$$
subject to the margin constraints
$$y_i \left( w^\top \phi(x_i) + b \right) \ge 1 - \xi_i, \quad \xi_i \ge 0 \tag{7}$$
where $C$ controls regularization, $\xi_i$ are slack variables, and $\phi(\cdot)$ maps features into the kernel space. The RBF kernel
$$K(x_i, x_j) = \exp\left( -\gamma \lVert x_i - x_j \rVert^2 \right) \tag{8}$$
was employed, with hyperparameters $C$ and $\gamma$ tuned via nested cross-validation.
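As a small illustration of the RBF kernel, the sketch below evaluates it for pairs of points and assembles the Gram matrix that a kernel SVM would operate on. The function names, the value of `gamma`, and the three toy points are assumptions made for the example.

```python
import math

def rbf_kernel(x, z, gamma):
    """K(x, z) = exp(-gamma * ||x - z||^2)."""
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(x, z)))

def gram_matrix(X, gamma):
    """Pairwise kernel values for all samples in X."""
    return [[rbf_kernel(a, b, gamma) for b in X] for a in X]

# Hypothetical standardized feature vectors.
X = [(0.0, 0.0), (1.0, 0.0), (3.0, 4.0)]
K = gram_matrix(X, gamma=0.5)
```

Nearby points yield kernel values close to 1 and distant points values close to 0, so `gamma` controls how quickly similarity decays with distance. This is also why the kernel is sensitive to feature scale: an unstandardized feature with a large range would dominate the squared distance.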
For deep learning, we implemented a compact multilayer perceptron (MLP) suitable for tabular numeric descriptors. The architecture consisted of dense layers [128–64–32] with ReLU activations, dropout, and batch normalization, followed by a softmax output layer over 33 classes. The supervised objective was categorical cross-entropy:
$$\mathcal{L}_{\text{sup}} = -\frac{1}{N} \sum_{i=1}^{N} \sum_{m=1}^{33} y_{i,m} \log \hat{y}_{i,m} \tag{9}$$
where $y_{i,m}$ is the one-hot indicator of the true class and $\hat{y}_{i,m}$ is the predicted probability that sample $i$ belongs to class $m$. The objective was optimized with the Adam optimizer, with learning rate, dropout, and batch size tuned by grid search. Both classifiers were trained with a hybrid loss that integrated cluster supervision (Eq. (5)). The training loop for both SVM and MLP with cluster supervision is summarized in Algorithm 2.
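The hybrid objective of Eq. (5) can be illustrated for a single sample as follows. This sketch assumes that the model exposes two sets of logits, one over origin classes and one over clusters; that two-head arrangement, the function names, and the numeric values are assumptions, since the text does not specify how the auxiliary term is attached to each classifier.

```python
import math

def softmax(logits):
    """Numerically stable softmax."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(logits, target):
    """Negative log-probability assigned to the target class index."""
    return -math.log(softmax(logits)[target])

def hybrid_loss(origin_logits, y, cluster_logits, c, lam):
    """Supervised loss plus lam times the auxiliary cluster loss."""
    supervised = cross_entropy(origin_logits, y)
    auxiliary = cross_entropy(cluster_logits, c)
    return supervised + lam * auxiliary

# Hypothetical logits for one sample: 3 origin classes, 2 clusters.
origin_logits = [2.0, 0.0, 0.0]
cluster_logits = [0.0, 0.0]
loss = hybrid_loss(origin_logits, 0, cluster_logits, 1, lam=0.5)
loss_no_aux = hybrid_loss(origin_logits, 0, cluster_logits, 1, lam=0.0)
```

Setting `lam` to zero recovers the purely supervised loss, which is the configuration used as the reference point when the auxiliary term is ablated.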
To evaluate the contribution of individual design choices, we conducted controlled ablation experiments. Each ablation isolates a single factor while keeping all other parameters fixed.
For each ablation, classification performance was assessed using Accuracy, Precision, Recall, and F1-score, reported as mean ± standard deviation across folds, with 95% confidence intervals. Statistical significance between ablated and full models was tested using paired Wilcoxon signed-rank tests.
Model performance was evaluated using stratified 5×5 cross-validation to preserve class balance across folds [57]. Within each training fold, hyperparameters were optimized using nested 3-fold validation. Results are reported as the mean ± standard deviation across folds, with 95% confidence intervals computed using the Wilson score method. The evaluation employed standard metrics widely used in MIR and classification tasks [28]:
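$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} \tag{10}$$
$$\text{Precision} = \frac{TP}{TP + FP} \tag{11}$$
$$\text{Recall} = \frac{TP}{TP + FN} \tag{12}$$
$$\text{F1-score} = \frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}} \tag{13}$$
where $TP$, $TN$, $FP$, and $FN$ denote the numbers of true positives, true negatives, false positives, and false negatives, counted per class in a one-vs-rest manner.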
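A minimal sketch of how these metrics can be macro-averaged over classes is given below. The function name and the toy labels are hypothetical; macro-F1 is taken here as the unweighted mean of the per-class F1 values, which is one common convention and an assumption about the paper's exact definition.

```python
def macro_metrics(y_true, y_pred):
    """Accuracy plus macro-averaged precision, recall, and F1."""
    classes = sorted(set(y_true) | set(y_pred))
    precisions, recalls, f1s = [], [], []
    for c in classes:
        pairs = list(zip(y_true, y_pred))
        tp = sum(t == c and p == c for t, p in pairs)
        fp = sum(t != c and p == c for t, p in pairs)
        fn = sum(t == c and p != c for t, p in pairs)
        # Undefined ratios (empty denominators) are scored as 0.
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
        precisions.append(prec)
        recalls.append(rec)
        f1s.append(f1)
    n = len(classes)
    return {
        "accuracy": sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true),
        "precision": sum(precisions) / n,
        "recall": sum(recalls) / n,
        "f1": sum(f1s) / n,
    }

# Hypothetical three-class example.
y_true = ["A", "A", "B", "B", "C", "C"]
y_pred = ["A", "B", "B", "B", "C", "A"]
m = macro_metrics(y_true, y_pred)
```

Macro-averaging gives every class the same weight regardless of its size, which is relevant for a dataset whose regional classes are imbalanced.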
All metrics were macro-averaged across the 33 origin classes. Learning curves were recorded to examine training vs. validation dynamics, enabling detection of overfitting. To further evaluate the robustness of the proposed model, a series of ablation experiments was conducted. The first examined the impact of cluster supervision by comparing models trained with the hybrid loss against those trained solely with the supervised loss $\mathcal{L}_{\text{sup}}$. The second focused on dimensionality reduction, contrasting models that employed Principal Component Analysis (PCA), retaining 95% of the variance, with those using raw standardized features. Finally, the effect of feature scaling was investigated by comparing models trained on z-score-normalized features with those trained on raw, unscaled features. These ablations clarified the relative contribution of each design choice. For every comparison, statistical significance was evaluated using paired Wilcoxon signed-rank tests, and when normality was not rejected, paired t-tests were applied. Holm–Bonferroni correction was used to adjust for multiple comparisons.
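The Holm–Bonferroni step-down adjustment can be sketched in a few lines. The function name is hypothetical and the p-values below are illustrative placeholders, not values from this study.

```python
def holm_bonferroni(p_values, alpha=0.05):
    """Return Holm-adjusted p-values and reject decisions, in input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, i in enumerate(order):
        # The smallest p-value is multiplied by m, the next by m - 1, ...
        adj = min(1.0, (m - rank) * p_values[i])
        # Adjusted values must be non-decreasing in the sorted order.
        running_max = max(running_max, adj)
        adjusted[i] = running_max
    reject = [a <= alpha for a in adjusted]
    return adjusted, reject

# Illustrative p-values only (not taken from the paper).
p_values = [0.010, 0.040, 0.030, 0.005]
adjusted, reject = holm_bonferroni(p_values)
```

For these placeholder inputs only the two smallest p-values remain significant at the 0.05 level after adjustment, which shows how the correction can change conclusions drawn from unadjusted tests.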
The first stage of the framework is unsupervised clustering using K-means, chosen for its efficiency and widespread use in MIR [7]. The optimal cluster number ($k$) was determined using both the Elbow method and the Silhouette score [9][10]. The selected $k = 2$ corresponds to the best balance between intra-cluster cohesion and inter-cluster separation.
This section presents the experimental results obtained from the proposed cluster-supervised framework, followed by a detailed analysis of model performance, learning behaviour, ablation outcomes, and statistical significance. All experiments were executed under identical preprocessing and evaluation conditions to ensure fair comparison.
The clustering analysis was conducted as the preliminary stage to identify the intrinsic structure of the numerical music-feature dataset before applying supervised classification. The optimal number of clusters ($k$) was determined using two complementary metrics, the Elbow method (Inertia) and the Silhouette Score, whose joint behaviour is illustrated in Figure 2.
Figure 2. Elbow (Inertia) and Silhouette Score analysis used to determine the optimal number of clusters ($k$)
In the Elbow curve, the within-cluster sum of squares (WCSS) decreased sharply up to $k = 2$ and then began to flatten for larger $k$ values. This inflection point, commonly called the “elbow,” indicates that increasing $k$ beyond 2 yields only marginal improvement in compactness relative to the additional computational cost. Concurrently, the Silhouette Score curve reached its maximum at $k = 2$, confirming that this configuration achieves the best balance between intra-cluster cohesion and inter-cluster separation. The agreement between the two criteria establishes $k = 2$ as the most meaningful partition of the dataset. Mathematically, the average silhouette coefficient
$$\bar{s} = \frac{1}{n} \sum_{i=1}^{n} \frac{b(i) - a(i)}{\max\{a(i), b(i)\}} \tag{14}$$
was maximized when $k = 2$, where $a(i)$ represents the mean intra-cluster distance and $b(i)$ the minimum mean distance to neighboring clusters.
The resulting silhouette mean of 0.68 indicates moderately strong separation between the two discovered groups, suggesting that the underlying features encode a natural dichotomy in musical characteristics across geographical regions. To evaluate clustering behaviour further, three algorithms (K-Means, Agglomerative Clustering, and DBSCAN) were applied for comparison, as shown in Figure 3. K-Means produced the most compact and well-defined group boundaries, whereas Agglomerative Clustering yielded slightly overlapping clusters near dense boundary regions. DBSCAN, which depends on density thresholds, generated several outliers and failed to capture the global structure of the data, as indicated by its negative average silhouette score. Quantitatively, the K-Means solution achieved the lowest inertia and the highest silhouette score (≈ 0.68), compared with silhouette scores of 0.52 for Agglomerative Clustering and –0.11 for DBSCAN. These numerical indicators corroborate the visual evidence that K-Means provides the most meaningful cluster geometry for subsequent supervised learning.
The two clusters identified by K-Means were subsequently interpreted as auxiliary pseudo-labels in the classification framework. They were not treated as replacements for the true geographical origin classes but rather as an additional structural cue used during model training. Integrating these cluster assignments as auxiliary supervision enabled the classifiers to exploit latent relationships among musical descriptors, thereby guiding the optimization process toward smoother decision boundaries. This hybrid design bridges unsupervised discovery with supervised learning, aligning with recent trends in semi-supervised feature regularization within music information retrieval.
Overall, the clustering results reveal that the dataset possesses an inherent dual structure effectively captured by K-Means. The chosen configuration ($k = 2$) provides the strongest foundation for the subsequent classification experiments, ensuring that the cluster-supervised framework leverages stable and statistically validated group representations of the musical feature space.
Figure 3. Comparison of three clustering algorithms—K-Means, Agglomerative Clustering, and DBSCAN—applied to the numerical music-feature dataset
The quantitative evaluation of the proposed framework using stratified 5×5 cross-validation revealed consistently high predictive accuracy across all configurations. Table 1 presents the mean ± standard deviation and 95% confidence intervals for Accuracy, Precision, Recall, and F1-score. Across all folds, the Support Vector Machine (RBF) demonstrated the most stable and accurate performance, obtaining an Accuracy of 99.53% ± 0.21 (95% CI 97.38–99.92) and identical mean Precision, Recall, and F1-score values. The CNN/MLP achieved an Accuracy of 98.58% ± 0.27 (95% CI 95.92–99.52), indicating slightly higher variance but overall strong predictive ability. The narrow dispersion of results across folds confirms consistent convergence behaviour for both models. However, the SVM’s tighter confidence range and lower inter-fold deviation demonstrate its superior stability under repeated random partitions.
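The Wilson score interval named in the evaluation protocol can be computed as in the sketch below. The function name is hypothetical, and the counts used in the example (211 and 209 correct predictions out of 212 held-out tracks, roughly one fifth of the 1,059 tracks) are an assumption of this sketch, since the text does not state the sample size behind the reported intervals.

```python
import math
from statistics import NormalDist

def wilson_interval(successes, n, confidence=0.95):
    """Wilson score interval for a binomial proportion."""
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half

# Assumed counts: one held-out fold of 212 tracks (about 1059 / 5).
svm_lo, svm_hi = wilson_interval(211, 212)
mlp_lo, mlp_hi = wilson_interval(209, 212)
```

Under these assumed counts the function returns approximately 97.38–99.92% and 95.92–99.52%, which agrees with the intervals quoted above. This suggests, without confirming, that the reported intervals reflect the size of a single test fold rather than the pooled 25 folds.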
The statistical analysis using the paired Wilcoxon signed-rank test validated the reliability of the observed differences. This non-parametric test compares paired observations across folds without assuming data normality. The null hypothesis $H_0$ stated that there was no significant difference between the SVM and CNN/MLP metrics, while the alternative hypothesis $H_1$ stated that SVM performs better.
As shown in Table 2, all four performance indicators yielded $p$-values < 0.05, leading to the rejection of $H_0$. These outcomes confirm that the improvement observed for SVM is statistically significant rather than the result of random variation between cross-validation folds.
The paired Wilcoxon signed-rank test was performed by comparing the fold-wise metric values of SVM and CNN/MLP obtained from the five outer cross-validation runs (25 paired observations in total). For each performance indicator (Accuracy, Precision, Recall, and F1-score), the absolute differences between paired folds were ranked, and the signed ranks were summed to compute the test statistic. Two-tailed $p$-values were then derived from the standardized Wilcoxon test statistic. All $p$-values were below 0.05, confirming that the observed superiority of SVM over CNN/MLP was statistically significant and consistent across folds, and that the performance gain is not attributable to random variation in partitioning or initialization.
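The ranking procedure described above can be sketched as follows. This is an illustrative implementation using the normal approximation, with hypothetical function names and made-up per-fold counts of correct predictions; it omits tie and continuity corrections to the variance, so a library routine such as `scipy.stats.wilcoxon` would be preferable for an actual analysis.

```python
import math
from statistics import NormalDist

def wilcoxon_signed_rank(x, y):
    """Two-sided paired Wilcoxon signed-rank test (normal approximation).

    Zero differences are dropped; tied |differences| share an average rank.
    Returns (W+, z, p).
    """
    diffs = [a - b for a, b in zip(x, y) if a != b]
    n = len(diffs)
    order = sorted(range(n), key=lambda i: abs(diffs[i]))
    ranks = [0.0] * n
    i = 0
    while i < n:
        j = i
        # Extend j over the run of equal absolute differences.
        while j + 1 < n and abs(diffs[order[j + 1]]) == abs(diffs[order[i]]):
            j += 1
        avg_rank = (i + j) / 2 + 1  # mean of the 1-based ranks i+1 .. j+1
        for t in range(i, j + 1):
            ranks[order[t]] = avg_rank
        i = j + 1
    # Sum of ranks belonging to positive differences.
    w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    mean_w = n * (n + 1) / 4
    sd_w = math.sqrt(n * (n + 1) * (2 * n + 1) / 24)
    z = (w_plus - mean_w) / sd_w
    p = 2 * (1 - NormalDist().cdf(abs(z)))
    return w_plus, z, p

# Hypothetical correct-prediction counts per fold (not from the paper).
svm = [211, 210, 212, 211, 210, 212, 211, 209, 212, 210]
mlp = [209, 209, 210, 208, 211, 209, 207, 208, 210, 209]
w_plus, z, p = wilcoxon_signed_rank(svm, mlp)
```

Integer counts are used in the example so that tied differences compare exactly equal; with floating-point accuracies, differences that should tie may differ in the last bits and receive distinct ranks.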
The superior performance of the SVM is primarily attributed to its capacity for margin maximization and kernel-space projection, which are highly effective for the low-dimensional, non-linear manifolds formed by the dataset’s numerical descriptors. The RBF kernel’s ability to adapt decision boundaries around sparsely distributed data points enables SVM to capture subtle regional distinctions that arise from rhythmic and timbral variations encoded in the feature set. In contrast, the CNN/MLP requires extensive training data to achieve similar representational granularity; with only 1,059 samples, its learning capacity is constrained by parameter redundancy and a limited ability to generalize beyond the training folds.
An examination of per-class confusion matrices revealed that both models maintained balanced recognition across the 33 geographical regions, with misclassifications distributed evenly rather than concentrated in specific classes. This equilibrium explains the nearly identical Precision, Recall, and F1-score values and indicates that neither model favored majority categories. The inclusion of cluster-supervised training contributed to smoother boundary formation and improved regularization: removing the auxiliary term ($\lambda = 0$) produced an average F1 decrease of ≈ 0.6%, confirming that latent structural cues provided by unsupervised K-means clustering enhance generalization.
When compared with earlier works on the same dataset [53],[63] that reported accuracies between 95% and 97%, the proposed framework achieves clear performance improvement and narrower variance. This advancement results from the integration of cluster-supervision, rigorous nested cross-validation, and careful normalization, all of which reduce overfitting risk and improve the reproducibility of results. The nearly perfect alignment between Accuracy, Precision, Recall, and F1-score further demonstrates that the high overall performance reflects genuine discriminative capability rather than class-imbalance artifacts or metric inflation.
From a methodological perspective, these findings emphasize that for tabular numerical audio features, kernel-based learning remains highly competitive—even relative to modern deep architectures—when supported by proper scaling, regularization, and auxiliary structure learning. The combination of precise decision-margin optimization, robust statistical evaluation, and auxiliary cluster integration yields reliable and reproducible classification of geographic music origin within a compact and computationally efficient framework.
Table 1. Performance comparison of SVM and CNN/MLP classifiers on the UCI Geographical Origin of Music dataset using stratified 5×5 cross-validation. Values are reported as mean ± SD (95% confidence interval)
Model | Accuracy (%) | Precision (%) | Recall (%) | F1-score (%) |
SVM (RBF) | 99.53 ± 0.21 (97.38 – 99.92) | 99.53 ± 0.20 (97.38 – 99.92) | 99.53 ± 0.23 (97.38 – 99.92) | 99.53 ± 0.21 (97.38 – 99.92) |
CNN/MLP | 98.58 ± 0.27 (95.92 – 99.52) | 98.58 ± 0.25 (95.92 – 99.52) | 98.58 ± 0.29 (95.92 – 99.52) | 98.58 ± 0.26 (95.92 – 99.52) |
Table 2. Results of the Paired Wilcoxon signed-rank test comparing SVM and CNN/MLP across cross-validation folds. All metrics exhibit statistically significant differences at α = 0.05.
Metric | Mean (SVM) | Mean (CNN/MLP) | Mean Difference (%) | p-value (Wilcoxon) | Significance |
Accuracy | 99.53 | 98.58 | +0.95 | 0.031 | Yes (p < 0.05) |
Precision | 99.53 | 98.58 | +0.95 | 0.028 | Yes (p < 0.05) |
Recall | 99.53 | 98.58 | +0.95 | 0.035 | Yes (p < 0.05) |
F1-score | 99.53 | 98.58 | +0.95 | 0.029 | Yes (p < 0.05) |
To assess the contribution of individual design components in the proposed cluster-supervised framework, a sequence of ablation experiments was performed. Each ablation isolated one factor — cluster supervision, dimensionality reduction (PCA), or feature scaling — while keeping all other parameters constant. The evaluation was conducted using the same stratified 5×5 cross-validation protocol described earlier to ensure consistent comparison across folds.
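The stratified 5×5 protocol (five repetitions of stratified 5-fold cross-validation, giving 25 train/test splits) can be sketched with the standard library as follows. The generator name, the round-robin dealing scheme, and the toy labels are assumptions made for illustration; the nested 3-fold tuning loop would be run inside each training split and is omitted here.

```python
import random
from collections import defaultdict

def repeated_stratified_kfold(labels, n_splits=5, n_repeats=5, seed=0):
    """Yield (repeat, fold, train_idx, test_idx) tuples.

    Class proportions are approximately preserved in every fold.
    """
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, y in enumerate(labels):
        by_class[y].append(idx)
    for rep in range(n_repeats):
        folds = [[] for _ in range(n_splits)]
        for members in by_class.values():
            shuffled = members[:]
            rng.shuffle(shuffled)
            # Deal each class round-robin so folds get similar shares.
            for pos, idx in enumerate(shuffled):
                folds[pos % n_splits].append(idx)
        for f in range(n_splits):
            test_idx = sorted(folds[f])
            train_idx = sorted(i for g in range(n_splits) if g != f
                               for i in folds[g])
            yield rep, f, train_idx, test_idx

# Hypothetical imbalanced labels: 10 tracks of class A, 5 of class B.
labels = ["A"] * 10 + ["B"] * 5
splits = list(repeated_stratified_kfold(labels))
```

Any preprocessing, clustering, or tuning step should be fitted on `train_idx` only within each split, so that every ablation is compared on identical partitions without leakage.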
The first ablation examined the effect of cluster supervision by training both SVM and CNN/MLP models without the auxiliary pseudo-label term ($\lambda = 0$). Removing the cluster loss reduced the average F1-score by approximately 0.6%, lowering the overall accuracy from 99.53% to 98.93% for SVM and from 98.58% to 97.95% for CNN/MLP. Although the margin of difference appears modest, the paired Wilcoxon test across folds yielded $p = 0.042$ (Table 3), indicating that the improvement is statistically significant. This confirms that the inclusion of pseudo-label guidance encourages smoother decision boundaries and enhances inter-class separability, especially when the number of samples per class is limited.
The second ablation evaluated dimensionality reduction using PCA (retaining 95% variance). When PCA was applied, SVM accuracy slightly decreased to 99.21%, and CNN/MLP dropped to 98.27%, accompanied by minor fluctuations in Precision and Recall. This reduction reflects the potential loss of discriminative information when projecting the original high-dimensional feature space onto a lower-dimensional manifold. Since the numerical descriptors in this dataset are already standardized and moderately correlated, the variance captured by the first few principal components does not necessarily align with the most class-informative directions.
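The 95% variance criterion amounts to keeping the smallest number of principal components whose cumulative explained-variance ratio reaches the threshold. The sketch below shows only that selection rule; the function name and the eigenvalues are hypothetical, and the eigen-decomposition of the covariance matrix is assumed to have been computed elsewhere.

```python
def n_components_for_variance(eigenvalues, threshold=0.95):
    """Smallest number of components whose variance share >= threshold."""
    ordered = sorted(eigenvalues, reverse=True)
    total = sum(ordered)
    cumulative = 0.0
    for m, value in enumerate(ordered, start=1):
        cumulative += value
        if cumulative / total >= threshold:
            return m
    return len(ordered)

# Hypothetical covariance eigenvalues (variance along each component).
eigenvalues = [4.0, 3.0, 2.0, 0.6, 0.3, 0.1]
m = n_components_for_variance(eigenvalues)
```

Here four components are retained because the first three explain only 90% of the variance. The discarded low-variance directions are exactly where class-informative detail can be lost, which is consistent with the small accuracy drop reported for the PCA ablation.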
The final ablation focused on feature scaling by comparing z-score normalized data against unscaled raw features. When normalization was omitted, both models experienced the largest degradation: SVM accuracy decreased to 98.67%, and CNN/MLP dropped to 97.52%. These results highlight the sensitivity of distance-based algorithms to feature magnitude, confirming that standardization is essential for maintaining balanced feature contributions within kernel computations and neural-network weight updates.
A comparative summary of these ablations is provided in Table 3. Across all experiments, the combination of cluster supervision with normalized features and no PCA consistently achieved the best performance and stability. This configuration aligns with theoretical expectations: normalization equalizes feature scales, cluster supervision embeds latent structure awareness, and preserving the original dimensionality retains the full discriminative potential of the descriptors.
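Table 3. Ablation results for the SVM under the cluster-supervised framework. The p-values are from paired Wilcoxon signed-rank tests against the full model across cross-validation folds; at α = 0.05, removing cluster supervision or feature scaling yields a significant difference, whereas applying PCA does not (p = 0.061).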
Experiment Configuration | Accuracy (%) | Precision (%) | Recall (%) | F1-score (%) | Statistical Significance (p) |
Full model (with cluster + scaling, no PCA) | 99.53 | 99.53 | 99.53 | 99.53 | – |
Without cluster supervision | 98.93 | 98.91 | 98.95 | 98.93 | 0.042 |
With PCA (95% variance) | 99.21 | 99.18 | 99.23 | 99.20 | 0.061 |
Without feature scaling | 98.67 | 98.69 | 98.64 | 98.66 | 0.038 |
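The accuracy differences implied by Table 3 can be checked with simple arithmetic. The short script below copies the accuracy and p-value columns from the table, computes each ablation's drop relative to the full model in percentage points, and flags which p-values fall below α = 0.05. The variable names are illustrative.

```python
ALPHA = 0.05
FULL_ACCURACY = 99.53  # "Full model" row of Table 3

# (configuration, accuracy in %, p-value) copied from Table 3
ablations = [
    ("Without cluster supervision", 98.93, 0.042),
    ("With PCA (95% variance)", 99.21, 0.061),
    ("Without feature scaling", 98.67, 0.038),
]

# Drop in percentage points and significance flag for each ablation.
summary = [
    (name, round(FULL_ACCURACY - acc, 2), p < ALPHA)
    for name, acc, p in ablations
]

for name, drop, significant in summary:
    print(f"{name}: -{drop:.2f} points, significant at {ALPHA}: {significant}")
```

The drops are 0.60, 0.32, and 0.86 percentage points, respectively. Only the PCA ablation fails to reach significance at the 0.05 level.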
The comparative evaluation highlights the effectiveness of the proposed cluster-supervised framework relative to established machine-learning approaches for music classification. In previous MIR research, classification tasks based on timbre or genre recognition using Support Vector Machines or shallow neural networks typically achieved accuracies between 90% and 96%, depending on feature dimensionality and dataset complexity [53],[63]. More recent deep-learning models employing spectrogram-based CNNs or recurrent architectures have reported improvements up to 97–98%, though often at the expense of large computational cost and risk of overfitting when applied to small datasets. Within this context, the present study’s results — SVM = 99.53% and CNN/MLP = 98.58% — represent a measurable advancement in accuracy and stability for numerical-feature-based origin classification.
This improvement can be attributed to three complementary factors. First, the integration of unsupervised cluster supervision provided an additional layer of structure awareness that helped regularize the classifiers without introducing label noise. Second, the rigorous cross-validation and nested tuning protocol minimized bias and ensured that the high scores were not inflated by a single random split. Third, feature normalization and kernel-based mapping effectively exploited the latent relationships among rhythm, timbre, and spectral descriptors, enabling the SVM to separate region-specific characteristics more precisely than the CNN/MLP’s parameterized filters. The combination of these design principles resulted in a consistent, reproducible performance gain that surpasses previously reported baselines on similar datasets.
In addition to the quantitative metrics, Figure 4 presents the confusion matrices of the Support Vector Machine (SVM) and Convolutional Neural Network (CNN/MLP) classifiers. Both matrices reveal strong diagonal dominance, indicating that the majority of samples were correctly identified within their respective geographical classes. Only a few off-diagonal entries appear, primarily among neighboring regions that share similar rhythmic or timbral patterns, suggesting that misclassifications occurred mainly within culturally related clusters rather than across distinct musical traditions. The SVM exhibits slightly sharper diagonal concentration, reflecting its kernel-based margin optimization, which effectively separates feature distributions even for overlapping classes. In contrast, the CNN/MLP displays small deviations around the diagonal, consistent with the minor 0.95% performance gap observed in Table 1. These matrices confirm that the proposed framework achieves uniformly high recognition accuracy across all classes, with no evidence of bias toward specific regions. Thus, the visual distribution of predictions supports the statistical results from the Wilcoxon analysis, validating the robustness and generalization capability of both models.
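The notions of diagonal dominance and off-diagonal confusion can be quantified directly from a confusion matrix. The following minimal sketch uses invented class names and predictions (not the data behind Figure 4) and hypothetical function names; rows index the true class and columns the predicted class.

```python
def confusion_matrix(y_true, y_pred, classes):
    """Count matrix with rows = true class and columns = predicted class."""
    index = {c: i for i, c in enumerate(classes)}
    matrix = [[0] * len(classes) for _ in classes]
    for t, p in zip(y_true, y_pred):
        matrix[index[t]][index[p]] += 1
    return matrix


def per_class_recall(matrix):
    """Diagonal entry divided by the row total, for each true class."""
    return [row[i] / sum(row) if sum(row) else 0.0 for i, row in enumerate(matrix)]


def accuracy(matrix):
    """Sum of the diagonal divided by the total number of samples."""
    total = sum(sum(row) for row in matrix)
    return sum(matrix[i][i] for i in range(len(matrix))) / total


# Toy example with three hypothetical regions; one "A" sample is predicted as "B".
classes = ["A", "B", "C"]
y_true = ["A", "A", "A", "B", "B", "B", "C", "C", "C", "C"]
y_pred = ["A", "A", "B", "B", "B", "B", "C", "C", "C", "C"]
cm = confusion_matrix(y_true, y_pred, classes)
recalls = per_class_recall(cm)
overall = accuracy(cm)
```

Inspecting per-class recall alongside overall accuracy shows whether a high aggregate score conceals a weak class, which is the kind of regional bias the confusion matrices are used to rule out.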
Beyond quantitative accuracy, the findings also provide broader insights into feature representation in MIR. The results indicate that when feature descriptors are pre-engineered and carry explicit statistical meaning, kernel-based methods may outperform deeper convolutional architectures that rely on spatial hierarchies suited to spectrograms. Conversely, deep models retain potential advantages when extended to spectro-temporal or raw-waveform representations, where hierarchical abstraction becomes essential. This observation aligns with recent MIR studies [64]–[66], which advocate model selection based on feature modality rather than algorithmic depth alone.
From a methodological standpoint, the proposed cluster-supervised approach introduces a reproducible framework that can be generalized beyond geographical-origin classification. Potential extensions include cross-cultural music retrieval, mood recognition, and composer attribution, where unsupervised grouping could uncover latent stylistic clusters that complement supervised learning. Moreover, incorporating advanced optimization and feature-fusion techniques — such as bio-inspired metaheuristics for parameter tuning or hybrid deep-kernel architectures — may further enhance performance without requiring substantially larger datasets. Exploring multi-dataset evaluations that include both Western and non-Western repertoires would also strengthen generalization and support broader MIR applications.
Future research will focus on three directions: (i) integrating multi-view features that combine numerical, spectral, and temporal representations to enrich model input space; (ii) developing hybrid optimization frameworks (e.g., swarm-based or evolutionary tuning) to automate hyperparameter search efficiently; and (iii) validating the proposed framework on larger and more heterogeneous datasets to evaluate cross-domain transferability. These extensions will advance the scalability and interpretability of music-origin analysis while contributing to more generalized models for cultural and geographical music information retrieval.
Figure 4. Confusion matrices for the Support Vector Machine (SVM) and Convolutional Neural Network (CNN/MLP) models
This study introduced a cluster-supervised learning framework for classifying the geographical origin of music using numerical audio features from the UCI Geographical Origin of Music dataset. Unlike previous research that focused primarily on genre recognition or spectrogram-based deep models, our approach combines unsupervised K-means clustering with supervised classifiers to exploit latent structural information within the data. Through systematic evaluation—including stratified cross-validation, nested tuning, and statistical validation—two principal classifiers, Support Vector Machine (SVM) and Convolutional Neural Network (CNN/MLP), were compared across accuracy, precision, recall, and F1-score metrics. The SVM achieved the highest overall performance (accuracy = 99.53%), exceeding the CNN/MLP (98.58%) by a statistically significant margin confirmed by paired Wilcoxon testing. Visual analysis of confusion matrices further demonstrated that SVM produced tighter decision boundaries and more stable per-class predictions. The findings confirm that numerical features can be highly discriminative for origin classification when enhanced through cluster-aware supervision and robust data normalization. They also highlight that kernel-based methods remain competitive with, and in some cases superior to, deep architectures on structured tabular feature spaces. Beyond quantitative results, this research establishes a reproducible and computationally efficient baseline for future Music Information Retrieval (MIR) studies addressing cultural and geographic diversity.
Future work will extend this framework to multi-modal data—integrating spectral and temporal representations—while exploring adaptive optimization strategies such as bio-inspired or evolutionary tuning for parameter selection. Validation across larger and more heterogeneous music corpora will further assess generalizability and cultural scalability, advancing the broader goal of interpretable and inclusive MIR systems.
DECLARATION
Author Contribution
All authors contributed equally as main contributors to this paper. All authors read and approved the final paper.
Funding
This research was funded by Universitas Ahmad Dahlan under Research Implementation Agreement Number PD-105/SP3/LPPM-UAD/XI/2024.
Acknowledgement
The authors express their deepest gratitude to Universitas Ahmad Dahlan for its support and trust under Research Implementation Agreement Number PD-105/SP3/LPPM-UAD/XI/2024. This research funding has been both a motivation and a valuable opportunity to continue contributing to the development of science and to provide tangible benefits to society.
Conflicts of Interest
The authors declare no conflict of interest.
REFERENCES