INTRODUCTION
Prostate cancer (PCa) is among the most common malignancies in men worldwide and the second leading cause of cancer-related death. Conventional screening and diagnostic methods, including prostate-specific antigen testing, digital rectal examination, and transrectal ultrasound-guided biopsy, are either invasive or limited in accuracy. Multiparametric magnetic resonance imaging (mpMRI) has been introduced as a noninvasive tool for evaluating PCa with improved diagnostic performance. However, interpretation of lesion aggressiveness on mpMRI remains dependent on reader expertise and is subject to interobserver variability.
The integration of artificial intelligence and machine learning (ML) into medical imaging has accelerated advances in mpMRI-based PCa assessment. Computer-aided diagnosis systems can standardize interpretation, and deep learning approaches further reduce reading time while enhancing accuracy. Concurrently, radiomics (i.e., the high-throughput extraction of quantitative imaging features) has emerged as a cornerstone of precision oncology, as this approach captures subvisual tumor characteristics that may correlate with disease trajectory or treatment response.
Despite its promise, radiomics is highly sensitive to variability in image acquisition, preprocessing, and segmentation, which can compromise reproducibility and hinder clinical implementation [1-3]. The intraclass correlation coefficient (ICC) is a widely recognized metric used to quantify feature reliability across repeated measurements or varying conditions [4]. Selecting features with sufficiently high ICC values can improve model stability and strengthen interpretability.
Using the public PROSTATEx dataset, we systematically evaluated the reproducibility of prostate MRI radiomics features and assessed their influence on ML-based PCa classification. We first validated feature reliability with ICC, then trained multiple classifiers exclusively on reproducible features, while comparing several dimensionality reduction strategies. Our objective was to enhance diagnostic accuracy and consistency and to evaluate the feasibility of a reproducibility-centered pipeline for clinical translation.
MATERIALS AND METHODS
Study Cohort and Image Acquisition
This retrospective study used 82 subjects (41 PCa and 41 non-cancer controls) from the public PROSTATEx dataset [5-7]. All subjects underwent T2-weighted imaging (T2WI) and apparent diffusion coefficient (ADC) mapping, acquired on either 1.5T or 3.0T MRI scanners. T2WI provides detailed anatomic delineation of the prostate and surrounding soft tissues, whereas ADC reflects water diffusion, indirectly indicating tumor cellularity and infiltration. Preprocessing involved Gaussian denoising, intensity range standardization, and resizing to 256×256 pixels to ensure inter-case consistency. An overview of the workflow is presented in Fig. 1.
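The three preprocessing steps named above can be illustrated with a minimal sketch. The source does not report the filter width, the target intensity range, or the interpolation scheme, so `sigma=1.0`, min-max scaling to [0, 1], and linear interpolation are assumptions, and `preprocess_slice` is a hypothetical helper rather than the authors' code.

```python
import numpy as np
from scipy import ndimage


def preprocess_slice(image, sigma=1.0, size=(256, 256)):
    """Denoise, rescale, and resize one 2D slice.

    sigma, the [0, 1] range, and linear interpolation are illustrative
    assumptions; the source does not specify them.
    """
    image = np.asarray(image, dtype=np.float64)

    # Step 1: Gaussian denoising.
    smoothed = ndimage.gaussian_filter(image, sigma=sigma)

    # Step 2: intensity range standardization (min-max to [0, 1]).
    lo, hi = smoothed.min(), smoothed.max()
    if hi > lo:
        scaled = (smoothed - lo) / (hi - lo)
    else:
        scaled = np.zeros_like(smoothed)

    # Step 3: resize to the target grid with linear interpolation.
    factors = (size[0] / scaled.shape[0], size[1] / scaled.shape[1])
    resized = ndimage.zoom(scaled, factors, order=1)
    return np.clip(resized, 0.0, 1.0)


# Synthetic stand-in for an MRI slice (not real patient data).
rng = np.random.default_rng(0)
slice_2d = rng.normal(loc=300.0, scale=50.0, size=(320, 300))
out = preprocess_slice(slice_2d)
print(out.shape)
```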
Radiomics Feature Extraction and Processing
Radiomics features were extracted from both T2WI and ADC using the Standardized Environment for Radiomics Analysis (SERA) (Table 1). From tumor regions of interest, we computed 215 features per sequence (430 total), encompassing first-order statistics, morphology, and multiple texture families (e.g., GLCM [gray level co-occurrence matrix], GLRLM [gray level run length matrix], GLSZM [gray level size zone matrix], NGTDM [neighborhood gray tone difference matrix], GLDM [gray level dependence matrix]), along with filtered features derived from wavelet and Laplacian-of-Gaussian transforms. The feature categories and counts are summarized below. Features from T2WI and ADC were normalized and concatenated via early fusion.
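Early fusion here means normalizing each sequence's feature matrix and joining the two side by side, so that each subject is described by one 430-element vector. The sketch below assumes z-score normalization (the source says only "normalized") and uses random numbers in place of SERA output; `zscore` is a hypothetical helper.

```python
import numpy as np


def zscore(features):
    """Column-wise z-score; constant columns are left at zero."""
    mean = features.mean(axis=0)
    std = features.std(axis=0)
    std[std == 0] = 1.0  # avoid division by zero for constant features
    return (features - mean) / std


# Random placeholders for the per-sequence feature matrices
# (82 subjects x 215 features each, as in the text).
rng = np.random.default_rng(0)
t2wi_features = rng.normal(loc=5.0, scale=2.0, size=(82, 215))
adc_features = rng.normal(loc=-1.0, scale=0.5, size=(82, 215))

# Early fusion: normalize each block, then concatenate along the feature axis.
fused = np.hstack([zscore(t2wi_features), zscore(adc_features)])
print(fused.shape)
```

In a cross-validated setting, the means and standard deviations would be estimated on the training folds only and then applied to the held-out fold.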
Reproducibility Assessment and Feature Selection
Within each sequence (T2WI and ADC), we performed repeated segmentation and calculated ICC using a 2-way random-effects, absolute-agreement model [4]. Features with ICC≥0.75 in both sequences were retained as the core reproducible feature set. ICC values were classified as poor (<0.50), moderate (0.50–0.75), good (0.75–0.90), and excellent (>0.90) [4]. Fig. 2 shows the ICC distributions before filtering by category, and Table 2 summarizes counts and percentages. To reduce redundancy, we removed one feature from each pair whose absolute Pearson correlation exceeded 0.9 (|r|>0.9).
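For readers unfamiliar with the ICC model named above, the following sketch computes the two-way random-effects, absolute-agreement coefficient from ANOVA mean squares. It assumes the single-measurement form, often written ICC(2,1); the source does not say whether single or average measures were used. The function names and the handling of values that fall exactly on a category boundary are also assumptions. The example matrix is the small 6×4 ratings table from Shrout and Fleiss (1979), not study data.

```python
import numpy as np


def icc_2_1(ratings):
    """ICC(2,1): two-way random effects, absolute agreement, single measure.

    ratings: array of shape (n_subjects, k_repeats) for ONE feature,
    e.g. the feature value from each repeated segmentation.
    """
    y = np.asarray(ratings, dtype=float)
    n, k = y.shape
    grand = y.mean()

    # Sums of squares for subjects (rows), repeats (columns), and residual.
    ss_rows = k * np.sum((y.mean(axis=1) - grand) ** 2)
    ss_cols = n * np.sum((y.mean(axis=0) - grand) ** 2)
    ss_err = np.sum((y - grand) ** 2) - ss_rows - ss_cols

    msr = ss_rows / (n - 1)              # between-subject mean square
    msc = ss_cols / (k - 1)              # between-repeat mean square
    mse = ss_err / ((n - 1) * (k - 1))   # residual mean square

    return (msr - mse) / (msr + (k - 1) * mse + k * (msc - mse) / n)


def icc_category(icc):
    """Map an ICC value to the four reliability bands used in the text."""
    if icc < 0.50:
        return "poor"
    if icc < 0.75:
        return "moderate"
    if icc <= 0.90:  # assumption: exactly 0.90 is counted as "good"
        return "good"
    return "excellent"


# Textbook example (Shrout & Fleiss 1979): 6 targets rated by 4 judges.
example = np.array([
    [9, 2, 5, 8],
    [6, 1, 3, 2],
    [8, 4, 6, 8],
    [7, 1, 2, 6],
    [10, 5, 6, 9],
    [6, 2, 4, 7],
])
value = icc_2_1(example)
print(round(value, 2), icc_category(value))  # 0.29 poor
```

Applied per feature and per sequence, a feature would enter the core set only if the coefficient reaches 0.75 for both T2WI and ADC.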
Classifier Development and Evaluation
Using the selected feature set, we evaluated 7 dimensionality reduction techniques: principal component analysis (PCA), kernel PCA, linear discriminant analysis, Isomap, locally linear embedding (LLE), Laplacian eigenmaps, and an autoencoder. The reduced features were then used to train 10 classifiers: support vector machine (SVM), random forest, k-nearest neighbors, decision tree, logistic regression, neural network, XGBoost, LightGBM, CatBoost, and AdaBoost.
To avoid data leakage, all preprocessing steps, including correlation filtering, normalization, dimensionality reduction fitting, and hyperparameter tuning, were performed solely within the training folds, with transformations subsequently applied to validation folds. Model performance was estimated using nested cross-validation (inner and outer 5-fold) to minimize selection bias. The primary evaluation metrics were accuracy and area under the receiver operating characteristic curve (AUC), averaged over the outer folds. Comparative results are presented in Figs. 3 and 4 and summarized in Table 3.
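One way to realize this leakage-free design is to wrap every fitted step in a scikit-learn `Pipeline` and tune it inside an outer cross-validation loop. The sketch below is not the authors' implementation: it uses synthetic data, a single reducer and classifier (PCA with an SVM), an illustrative hyperparameter grid, and a hypothetical `CorrelationFilter` transformer that keeps the first feature of each highly correlated pair (the source does not say which member of a pair was dropped).

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.model_selection import GridSearchCV, StratifiedKFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC


class CorrelationFilter(BaseEstimator, TransformerMixin):
    """Hypothetical helper: drop the later feature of any pair with |r| > threshold."""

    def __init__(self, threshold=0.9):
        self.threshold = threshold

    def fit(self, X, y=None):
        corr = np.abs(np.corrcoef(X, rowvar=False))
        keep = []
        for j in range(corr.shape[0]):
            # Keep feature j only if it is not too correlated with any kept one.
            if all(corr[j, i] <= self.threshold for i in keep):
                keep.append(j)
        self.keep_ = np.array(keep)
        return self

    def transform(self, X):
        return np.asarray(X)[:, self.keep_]


# Synthetic stand-in for the 82-subject feature matrix (not study data).
X, y = make_classification(n_samples=82, n_features=40, n_informative=8,
                           weights=[0.5, 0.5], random_state=0)

# Every step is fitted on training folds only, because it lives in the pipeline.
pipeline = Pipeline([
    ("filter", CorrelationFilter(threshold=0.9)),
    ("scale", StandardScaler()),
    ("reduce", PCA(random_state=0)),
    ("clf", SVC()),
])
param_grid = {"reduce__n_components": [3, 5], "clf__C": [0.1, 1.0, 10.0]}  # assumed grid

inner = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)  # tuning
outer = StratifiedKFold(n_splits=5, shuffle=True, random_state=1)  # estimation

search = GridSearchCV(pipeline, param_grid, cv=inner, scoring="roc_auc")
outer_auc = cross_val_score(search, X, y, cv=outer, scoring="roc_auc")
print(outer_auc.mean())
```

Because `GridSearchCV` is itself the estimator passed to `cross_val_score`, hyperparameters are re-selected within each outer training split and the outer test fold never influences filtering, scaling, reduction, or tuning.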
RESULTS
Reproducibility of Radiomics Features
Of the 215 T2WI features, 66 (30.7%) demonstrated poor reproducibility (ICC<0.50), 34 (15.8%) were moderate, 63 (29.3%) were good, and 52 (24.2%) were excellent. Among the 215 ADC features, 67 (31.2%) were poor, 67 (31.2%) were moderate, 55 (25.6%) were good, and 26 (12.1%) were excellent. Accordingly, 115 T2WI features (63 good and 52 excellent) and 81 ADC features (55 good and 26 excellent) met the ICC≥0.75 threshold required for downstream analysis. Fig. 2 presents the prefiltering ICC distributions (category boxplots), and Table 2 summarizes the corresponding counts and percentages.
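The reported counts can be cross-checked with a few lines of arithmetic. The snippet below only restates the numbers given in the text; the variable names are illustrative.

```python
# Category counts as reported in the text.
counts = {
    "T2WI": {"poor": 66, "moderate": 34, "good": 63, "excellent": 52},
    "ADC": {"poor": 67, "moderate": 67, "good": 55, "excellent": 26},
}

summary = {}
for sequence, by_category in counts.items():
    total = sum(by_category.values())
    # Features at or above the 0.75 threshold are the good and excellent bands.
    retained = by_category["good"] + by_category["excellent"]
    percent = {cat: round(100 * n / total, 1) for cat, n in by_category.items()}
    summary[sequence] = {"total": total, "retained": retained, "percent": percent}
    print(sequence, total, retained, percent)
```

Both sequences sum to 215 features, with 115 (T2WI) and 81 (ADC) at or above the threshold, matching the figures above.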
Classification Performance
Across the dimensionality reduction methods, SVM, nearest neighbors (NN), and logistic regression exhibited relatively stable performance, achieving accuracies of 80%–84% (Table 3; Figs. 3 and 4). PCA provided the most consistent gains across classifiers, while kernel PCA and Laplacian eigenmaps also performed favorably. In contrast, Isomap and LLE occasionally degraded performance in certain ensemble models (e.g., LightGBM, AdaBoost), where accuracy dropped to 65%–75%. The k-NN classifier was particularly sensitive when trained with autoencoder-derived features.
AUC patterns closely mirrored accuracy trends. SVM maintained an AUC of approximately 0.83 across most reductions and achieved a maximum AUC of 0.85 in its best configuration (Fig. 4). Logistic regression and CatBoost also reached a peak AUC of 0.85, which is notable given the dataset size. NN achieved an AUC of 0.85 with PCA but fell toward 0.5 with LLE, indicating sensitivity to manifold representation. LightGBM occasionally produced near-random performance (AUC≈0.5) under certain nonlinear reductions.
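To make the "near-random" reading of AUC≈0.5 concrete, the sketch below computes AUC as the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative case, with ties counted as one half. The function name and the toy scores are illustrative and are not taken from the study.

```python
def auc_from_scores(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("both classes must be present")
    # A correctly ordered pair counts 1, a tie counts 0.5, otherwise 0.
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


labels = [0, 0, 1, 1]
partly_informative = auc_from_scores(labels, [0.1, 0.4, 0.35, 0.8])
uninformative = auc_from_scores(labels, [0.5, 0.5, 0.5, 0.5])
print(partly_informative, uninformative)  # 0.75 0.5
```

A model whose scores carry no class information ties or misorders about half of all pairs, which is why an AUC near 0.5 indicates chance-level discrimination.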
DISCUSSION
Using the PROSTATEx dataset, we developed ML models for PCa diagnosis based on T2WI and ADC radiomics while rigorously quantifying feature reproducibility. Restricting model training to ICC-validated features yielded moderate-to-high diagnostic performance (accuracy 80%–84%; AUC up to 0.85) despite the modest cohort size, supporting the effectiveness of a reproducibility-first approach over aggressive numerical optimization in small datasets.
Li et al. [8] integrated biparametric radiomics with clinical variables in a larger cohort using logistic regression as the primary classifier. In contrast, our balanced 41/41 cohort systematically evaluated a broader range of dimensionality reduction and classifier combinations, identifying PCA as a consistently robust option. Liu et al. [9] examined 1,576 mpMRI features using SelectKBest (Select K best features) and LASSO (least absolute shrinkage and selection operator) to distinguish indolent from aggressive disease. While their work emphasized risk stratification, our study focused on binary lesion detection at the diagnostic stage and, crucially, isolated the direct effect of reproducibility filtering. Prior studies have consistently reported that consensus segmentation and the exclusion of unstable features enhance reliability [3,4]; our findings extend this evidence, showing that emphasizing feature stability alone can substantially influence performance, even without multimodal or clinical data integration.
The study has several limitations. The sample size was relatively small (n=82), and the analysis was limited to T2WI and ADC sequences, excluding modalities such as dynamic contrast-enhanced T1-weighted imaging. Furthermore, external validation beyond nested cross-validation was not conducted. Future research will integrate automated segmentation, clinical and molecular covariates, and deep learning-based representations to improve generalizability. Validation across multi-institutional cohorts under a standardized, reproducibility-oriented pipeline will also be critical for translation into clinical practice.