INTRODUCTION
In middle-aged and older men, conditions such as benign prostatic hyperplasia (BPH), age-related diseases, neurological disorders, and hormonal changes frequently lead to lower urinary tract symptoms (LUTS), which significantly impact quality of life [1,2]. Key voiding symptoms, including a slow stream, hesitancy, and straining, primarily result from 2 causes: bladder outlet obstruction (BOO) and detrusor underactivity (DUA) [3]. Accurately distinguishing between these conditions is crucial for effective treatment, as BOO and DUA necessitate different management strategies. BOO is generally treated with medications such as alpha-blockers or 5-alpha reductase inhibitors, and in more severe cases, surgical options such as transurethral resection of the prostate may be considered [4]. In contrast, treatments for DUA may include bladder-emptying techniques such as intermittent catheterization or pharmacological methods to stimulate bladder contractions [5,6].
Despite these therapeutic distinctions, LUTS alone cannot differentiate between BOO and DUA, necessitating the use of invasive urodynamic studies (UDS) to measure real-time urine flow and detrusor pressure [7]. However, UDS poses discomfort and risks to patients, underscoring the need for noninvasive diagnostic alternatives [8]. This study introduces an artificial intelligence-based approach that utilizes the International Prostate Symptom Score (IPSS), uroflowmetry data, and transrectal ultrasound (TRUS) measurements to accurately differentiate between BOO and DUA. By comparing CatBoost and XGBoost models, we aim to develop a reliable, noninvasive diagnostic tool that can improve clinical decision-making and patient outcomes without relying on invasive urodynamic tests.
MATERIALS AND METHODS
Participants
The study participants were male patients aged 40 years and older who exhibited LUTS and visited the urology department between December 2006 and December 2020. To qualify for inclusion, participants were required to have undergone UDS, IPSS assessment, TRUS, and uroflowmetry to measure maximum flow rate and residual urine volume. Furthermore, only patients with a maximum flow rate of less than 15 mL/sec were included. BOO was defined as a bladder outlet index of 40 or greater, and DUA was defined as a bladder contractility index (BCI) of less than 100 [9].
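For readers reproducing this labeling step, a minimal sketch is shown below, assuming the pressure-flow study exports detrusor pressure at maximum flow (PdetQmax) and maximum flow rate (Qmax) under hypothetical column names; the index formulas (bladder outlet index = PdetQmax − 2·Qmax, BCI = PdetQmax + 5·Qmax) follow the standard urodynamic definitions.

```python
import pandas as pd

def label_patients(df: pd.DataFrame) -> pd.DataFrame:
    """Derive BOO and DUA labels from pressure-flow measurements.

    Assumes hypothetical columns 'pdet_qmax' (detrusor pressure at maximum
    flow, cmH2O) and 'qmax' (maximum flow rate, mL/sec).
    """
    out = df.copy()
    # Bladder outlet index: PdetQmax - 2 * Qmax; BOO if >= 40
    out["boo_index"] = out["pdet_qmax"] - 2 * out["qmax"]
    out["boo"] = (out["boo_index"] >= 40).astype(int)
    # Bladder contractility index (BCI): PdetQmax + 5 * Qmax; DUA if < 100
    out["bci"] = out["pdet_qmax"] + 5 * out["qmax"]
    out["dua"] = (out["bci"] < 100).astype(int)
    return out
```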
Collected Data
The dataset included a range of parameters, and the measurements are summarized in Table 1.
Feature Engineering
In addition to the previously mentioned data, further features were extracted from the IPSS questionnaire to enhance the understanding of patient symptoms. These additional features encompass the sum of voiding symptom scores from questions 1, 3, 5, and 6; the sum of storage symptom scores from questions 2, 4, and 7; the total sum of all symptom scores from questions 1 through 7; and the quality-of-life score derived from question 8.
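As a concrete illustration, the sketch below derives these IPSS features from per-question scores; the column names (ipss_q1 through ipss_q8) are assumptions rather than the study's actual schema.

```python
import pandas as pd

def add_ipss_features(df: pd.DataFrame) -> pd.DataFrame:
    """Add derived IPSS features; assumes hypothetical columns 'ipss_q1'..'ipss_q8'."""
    out = df.copy()
    voiding = ["ipss_q1", "ipss_q3", "ipss_q5", "ipss_q6"]   # voiding symptom items
    storage = ["ipss_q2", "ipss_q4", "ipss_q7"]              # storage symptom items
    out["ipss_voiding_sum"] = out[voiding].sum(axis=1)
    out["ipss_storage_sum"] = out[storage].sum(axis=1)
    out["ipss_total"] = out[voiding + storage].sum(axis=1)   # questions 1-7
    out["ipss_qol"] = out["ipss_q8"]                          # quality-of-life item
    return out
```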
From the TRUS imaging data, numerical features, including prostate volume, width, length, and height, as well as those of the transition zone, were extracted. These features were utilized during model training and played a critical role in evaluating the physical characteristics of the prostate associated with conditions such as BOO and DUA.
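The dimensions and volumes in this study were taken from the TRUS reports; where only the three orthogonal diameters are available, total and transition-zone volumes are commonly approximated with the prolate-ellipsoid formula, as in the illustrative sketch below (a general convention, not necessarily the measurement method used here).

```python
import math

def ellipsoid_volume(width_cm: float, height_cm: float, length_cm: float) -> float:
    """Approximate gland volume (mL) from three orthogonal TRUS diameters
    using the prolate-ellipsoid formula: pi/6 * width * height * length."""
    return math.pi / 6 * width_cm * height_cm * length_cm

# Example: a 4.5 x 3.8 x 4.0 cm prostate -> roughly 35.8 mL
print(round(ellipsoid_volume(4.5, 3.8, 4.0), 1))
```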
Model Development
In this study, 4 models were developed, comprising 2 CatBoost models and 2 XGBoost models. Each model was specifically designed to diagnose either BOO or DUA, conditions traditionally identified through invasive UDS. The CatBoost models utilized the CatBoostClassifier from the CatBoost library in Python. The categorical features incorporated into these models included variables such as cerebrovascular accident, diabetes, dementia, hypertension, IPSS, International Continence Society Male Short-Form, age group, and history of radical pelvic surgery.
Similarly, the XGBoost models employed the XGBClassifier from the XGBoost library in Python. To ensure a fair comparison between the 2 algorithms, the same set of categorical features used in the CatBoost models was applied to the XGBoost models. This approach facilitated a consistent evaluation of the models’ performance in diagnosing BOO and DUA.
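A minimal sketch of the two model setups, assuming a preprocessed feature table X, binary labels y, and a list cat_cols of categorical column names (all hypothetical); CatBoost consumes the categorical columns directly, whereas the one-hot encoding shown for XGBoost is our assumption, since the study does not specify its encoding.

```python
import pandas as pd
from catboost import CatBoostClassifier
from xgboost import XGBClassifier

def fit_models(X: pd.DataFrame, y, cat_cols):
    """Fit one CatBoost and one XGBoost classifier on the same feature set."""
    # CatBoost consumes the categorical columns directly via cat_features.
    cat_model = CatBoostClassifier(verbose=0, random_state=42)
    cat_model.fit(X, y, cat_features=cat_cols)

    # For XGBoost, the same categorical columns are one-hot encoded here
    # (the encoding choice is an assumption; the study does not specify it).
    X_ohe = pd.get_dummies(X, columns=cat_cols)
    xgb_model = XGBClassifier(eval_metric="logloss", random_state=42)
    xgb_model.fit(X_ohe, y)
    return cat_model, xgb_model, X_ohe
```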
Model Training and Evaluation
The models were trained on the dataset, which included appropriate preprocessing steps to handle categorical data, missing values, and normalization as needed. Each model’s effectiveness in diagnosing BOO and DUA was evaluated using standard performance metrics, including accuracy, precision, recall, F1-score, and area under the receiver operating characteristic curve (AUROC).
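For illustration, the reported metrics can be computed with scikit-learn as sketched below; the 0.5 probability cutoff and the existence of a held-out test set are assumptions, since the study does not detail its split strategy.

```python
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, roc_auc_score)

def evaluate(model, X_test, y_test, threshold: float = 0.5):
    """Compute the five reported metrics for one fitted binary classifier."""
    proba = model.predict_proba(X_test)[:, 1]
    pred = (proba >= threshold).astype(int)   # 0.5 cutoff is an assumption
    return {
        "accuracy": accuracy_score(y_test, pred),
        "precision": precision_score(y_test, pred),
        "recall": recall_score(y_test, pred),       # i.e., sensitivity
        "f1": f1_score(y_test, pred),
        "auroc": roc_auc_score(y_test, proba),
    }
```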
Our approach integrated patient-reported outcomes, objective flow measurements, and precise imaging-derived diagnostics to compare the performance of CatBoost and XGBoost models. This comparison aimed to determine which model offered greater accuracy and reliability in distinguishing between BOO and DUA, potentially reducing the need for invasive urodynamic tests.
RESULTS
In the study, a total of 4,817 patients were enrolled. Among them, 676 patients were diagnosed with both BOO and DUA, 1,058 with BOO only, and 2,335 with DUA only. The remaining 748 patients did not have a diagnosis of either BOO or DUA.
BOO Diagnosis
In Tables 2 and 3, we present the results of BOO diagnosis using CatBoost and XGBoost. XGBoost demonstrated superior overall performance compared to CatBoost, with an AUROC of 0.826 versus 0.809 for CatBoost. In terms of accuracy, XGBoost achieved a higher value of 0.755, while CatBoost reached 0.730. For sensitivity, CatBoost slightly outperformed XGBoost, scoring 0.767 versus 0.756, indicating a better ability to identify true positive cases. Both models had identical specificity values of 0.755, demonstrating similar effectiveness in detecting true negative cases. Precision was higher for XGBoost at 0.648, compared to 0.610 for CatBoost, suggesting that XGBoost produced fewer false positives. Lastly, XGBoost recorded a marginally better F1-score of 0.697, with CatBoost close behind at 0.680.
In Fig. 1, the 2 bar plots illustrate the key features utilized by the CatBoost and XGBoost models to predict BOO. The first plot reveals that the CatBoost model considers the maximum flow rate measured during uroflowmetry to be the most crucial feature, along with significant consideration given to prostate-specific antigen (PSA) and age. In contrast, the second plot highlights that the XGBoost model prioritizes the height of the transition zone as the most important feature, in addition to other anatomical measurements of the prostate and transition zone. The XGBoost model also underscores the significance of the maximum flow rate from uroflowmetry measurements. Overall, while the CatBoost model focuses on demographic and clinical measurements, the XGBoost model leans more toward detailed anatomical characteristics. Despite their different emphases, both models integrate these data types to predict BOO.
Due to the differing feature importance values, we developed an ensemble model that combines CatBoost and XGBoost to leverage their complementary strengths. This model was constructed by assigning different weights to the predictions from CatBoost and XGBoost, followed by calculating the weighted average of their outputs. Specifically, we assigned weights of 0.6 to CatBoost and 0.4 to XGBoost, based on their individual performance and contributions to the final prediction. This strategy is designed to capitalize on the unique contributions of each model, thereby enhancing the overall prediction accuracy for BOO. By integrating the distinct perspectives of both models, the ensemble method offers a more robust and comprehensive diagnostic tool. However, experiments that involved varying the weights showed that the ensemble model did not significantly enhance performance compared to the better-performing XGBoost model alone. As indicated in Table 4, the performance metrics of the ensemble model are very close to those of XGBoost. This lack of significant improvement can be attributed to the similarity in the prediction values of both models. Despite the different features emphasized by each model, their predictions are sufficiently similar, resulting in minimal gains from the ensemble approach.
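A minimal sketch of this weighted soft-voting ensemble, assuming the two fitted models and their respective test matrices are available; the variable names are hypothetical.

```python
def ensemble_predict(cat_model, xgb_model, X_cat, X_xgb,
                     w_cat: float = 0.6, w_xgb: float = 0.4):
    """Weighted average of the two models' predicted probabilities
    (0.6 for CatBoost and 0.4 for XGBoost, as described above)."""
    p_cat = cat_model.predict_proba(X_cat)[:, 1]
    p_xgb = xgb_model.predict_proba(X_xgb)[:, 1]
    return w_cat * p_cat + w_xgb * p_xgb

# Hypothetical usage:
# p_ens = ensemble_predict(cat_model, xgb_model, X_test, X_test_ohe)
```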
DUA Diagnosis
Tables 2 and 3 show the DUA diagnosis results using both models. XGBoost demonstrated superior performance in terms of AUROC, scoring 0.819 compared with 0.803 for CatBoost. Both models achieved similar accuracy, with CatBoost at 0.739 and XGBoost at 0.734. However, CatBoost exhibited a higher sensitivity of 0.807 compared to XGBoost’s 0.754, indicating that CatBoost was more effective at identifying true positive cases. Conversely, XGBoost outperformed CatBoost in specificity, scoring 0.701 compared to 0.621, which suggests that XGBoost was more adept at identifying true negative cases. In terms of precision, XGBoost again showed superior performance with a score of 0.813, versus CatBoost’s 0.786. Finally, XGBoost achieved a slightly higher F1-score of 0.782, while CatBoost recorded 0.712.
Feature importance plots for DUA prediction are illustrated in Fig. 2. The CatBoost model for DUA identifies uroflowmetry voiding time as the most crucial feature, closely followed by maximal flow rate and chart mean bladder capacity. Additionally, this model emphasizes the importance of anatomical features such as prostate height and transition zone height. In contrast, the XGBoost model for DUA assigns the greatest importance to transition zone height, with other significant features including uroflowmetry maximal flow rate and ICS_Q_I4. This model also considers prostate height and radical pelvic surgery as important factors. Both models underscore the significance of transition zone height, uroflowmetry maximal flow rate, and prostate height, indicating these features are essential for predicting DUA. However, there are notable differences in their focus: CatBoost prioritizes uroflowmetry voiding time and chart mean bladder capacity, whereas XGBoost places more emphasis on specific questionnaire items (ICS_Q_I4 and ICS_Q_V_Sum) and radical pelvic surgery.
Similar to the approach used for BOO, an ensemble model was created by combining CatBoost and XGBoost. However, as indicated in Table 4, the performance of the ensemble model did not improve.
DISCUSSION
In this study, we developed machine learning models to distinguish between BOO and non-BOO, as well as DUA and non-DUA, in male patients with LUTS. Utilizing a comprehensive dataset that includes uroflowmetry, ultrasound-derived parameters, and patient-reported outcomes, our CatBoost and XGBoost models demonstrated strong performance in classifying BOO and DUA.
There have been ongoing efforts to develop noninvasive methods as alternatives to invasive UDS for diagnosing BOO and DUA. Various noninvasive evaluation techniques have been explored, including the penile cuff test (PCT), bladder wall thickness (BWT), detrusor wall thickness (DWT), and intravesical prostatic protrusion (IPP). The PCT, for instance, provides a noninvasive alternative to pressure flow studies (PFS) by measuring isovolumetric bladder pressure during micturition. Although PCT has demonstrated a high negative predictive value for BOO and offers shorter procedure times than PFS, it is limited by a low positive predictive value and diagnostic uncertainty due to variability in patient responses and voided volumes [10,11]. Similarly, techniques such as BWT and DWT, while useful for measuring anatomical changes, are constrained by the absence of standardized protocols and defined cutoff values, which diminish their diagnostic accuracy. IPP, although offering valuable insights into prostate obstruction, is highly dependent on the operator, introducing an additional layer of variability in clinical practice. Most of these studies have focused on differentiating BOO from non-BOO, with less attention given to diagnosing DUA [12].
Recent advancements in AI technology within the field of urology have been notable [13]; however, there are relatively few studies that focus specifically on developing machine learning models for diagnosing BOO and DUA in male patients with LUTS. Bang et al. [14] utilized deep learning techniques, including convolutional neural networks (CNNs), to analyze uroflowmetry graphs and predict BOO and DUA. Despite this innovative approach, their models demonstrated only moderate performance, achieving AUROCs of approximately 73%. In a similar vein, Matsukawa et al. [15] created an AI-based diagnostic system for LUTS that depended exclusively on uroflowmetry data to classify BOO and DUA. Although this system reached an accuracy of 84%, its reliance on a single data source hindered its ability to comprehensively address the complex, multifactorial nature of LUTS. In a 2023 follow-up study, Matsukawa et al. [16] further explored the characteristics of uroflowmetry patterns, such as the initial peak flow rate, to improve differentiation between BOO and DUA. While they successfully identified significant patterns in the uroflowmetry data, their study did not incorporate advanced machine learning algorithms.
Our study improved diagnostic accuracy by integrating multiple data sources, including prostate ultrasound and patient-reported outcomes. This comprehensive dataset allows our models to provide a more nuanced assessment of bladder function and obstruction. Utilizing CatBoost and XGBoost, our approach leverages a broad set of clinical data and effectively handles both categorical and continuous variables, thus making our models adaptable to various patient profiles. The CatBoost models have shown high sensitivity, effectively identifying true positive cases, which is essential for initial screenings. Conversely, the XGBoost models exhibit higher specificity and precision, making them ideal for confirming diagnoses and reducing false positives. This distinction underscores the strengths of each model, depending on clinical priorities such as minimizing false negatives or false positives. Our findings indicate that incorporating CatBoost and XGBoost models into the diagnostic workflow for LUTS could significantly improve clinical decision-making. These models provide noninvasive, cost-effective, and patient-friendly alternatives to traditional protocols, potentially offering more accurate diagnoses and customized treatment plans.
The limitations of our study are as follows. First, our models were specifically designed to distinguish between BOO and non-BOO, as well as DUA and non-DUA, rather than directly differentiating between BOO and DUA. This presents a limitation in clinical settings where both conditions might coexist, as ideally, a single model that can differentiate between BOO and DUA would be more beneficial. Second, the inherent complexity of DUA makes its definition based solely on the BCI somewhat limited. Although the BCI is a common metric in clinical research, it may not adequately reflect the complex nature of DUA. Nevertheless, to maintain consistency with previous studies and to establish a clear baseline, we chose to define DUA using the most widely accepted BCI threshold found in the literature.
In conclusion, our machine learning-based approach to diagnosing BOO and DUA represents a significant advancement over previous studies, as it incorporates a broader array of clinical features and utilizes more sophisticated machine learning algorithms. This approach offers a promising noninvasive diagnostic tool that could improve clinical decision-making. Future research should aim to validate these models using larger datasets and incorporate more clinical and genetic data to further enhance their performance. The ongoing development of machine learning models shows great potential for transforming the diagnosis and management of BPH and other medical conditions.