
(DA-U)2Net: double attention U2Net for retinal vessel segmentation

Abstract

Background

Morphological changes in the retina are crucial and serve as valuable references in the clinical diagnosis of ophthalmic and cardiovascular diseases. However, the retinal vascular structure is complex, making manual segmentation time-consuming and labor-intensive.

Methods

This paper proposes a retinal vessel segmentation network that integrates feature channel attention and the Convolutional Block Attention Module (CBAM) within the U2Net model. First, a feature channel attention module is introduced into the RSU (Residual U-block) of U2Net, forming an Attention-RSU block that focuses on significant areas during feature extraction and suppresses the influence of noise. Second, a Spatial Attention Module (SAM) is introduced into the high-resolution modules of Attention-RSU to enrich feature extraction in both the spatial and channel dimensions, and a Channel Attention Module (CAM) is integrated into the low-resolution modules of Attention-RSU, where dual channel attention reduces the loss of detail. Finally, dilated convolution is applied during the upsampling and downsampling processes to expand the receptive field at low resolutions, allowing the model to better integrate contextual information.

Results

The evaluation across multiple clinical datasets demonstrated excellent performance on various metrics, with an accuracy (ACC) of 98.71%.

Conclusion

The proposed network is general, and we believe it can easily be extended to other medical image segmentation tasks in which large scale variation and complicated features are the main challenges.

Introduction

The morphological attributes of the retinal vasculature at the fundus, such as shape, length, curvature, and branching, are clinically important distinguishing signs [1]. Clinicians can study changes in the body's microvasculature on the basis of alterations in the morphology of the retinal vessels [2]. For example, the hardening, exudation, and constriction of retinal arterial vessels can reflect a trend toward higher blood pressure within the body [3]; conditions such as leakage, edema, and hemorrhage of the retinal vessels can serve as diagnostic criteria for diabetes and macular diseases [4,5,6]. The segmentation of retinal vessels is a necessary step to obtain the above critical information, and accurate segmentation results help doctors diagnose eye diseases precisely [7, 8]. Because the structure of retinal vessels is complex, manual segmentation is time-consuming, labor-intensive, and subject to subjective factors, making automated segmentation via computer technology a popular research topic.

With the continuous development of computer technology, its application to retinal segmentation has become a popular research topic among medical and computer science scholars both domestically and internationally [9, 10]. Koukounis et al. used a matched filtering algorithm to increase the difference between the vessels and the rest of the retina, thereby achieving retinal vessel segmentation [11]. However, some retinal vessels are small and dense, leading to detail loss with this algorithm. Fraz et al. used morphological reconstruction and Gabor filters for the multiscale localization and segmentation of retinal images [12]. However, morphological operations are prone to noise interference and binarization difficulties. Improvements in hardware technology have enabled deep learning methods to overcome the limitations of traditional approaches, demonstrating powerful capabilities in image processing tasks. Medical image segmentation based on deep learning has become a popular research direction in clinical diagnosis [13,14,15,16]. The fully convolutional network (FCN) was the first neural network for semantic segmentation tasks; it replaces fully connected layers with convolutional layers to achieve pixel-level segmentation [17, 18]. Ronneberger et al. proposed a U-shaped network structure, U-Net, consisting of an encoder composed of convolutional and pooling layers and a decoder composed of upsampling and deconvolution layers, with skip connections added at corresponding levels of the U-structure to effectively merge low-level and high-level features [19]. It has been widely applied in the field of medical image segmentation and has achieved good results. The attention mechanism, inspired by the human ability to select attention points on the basis of voluntary and involuntary cues, focuses the model on information strongly related to the task and has been widely applied in various neural network models [20]. Li et al. embedded an attention module into the U-Net structure for retinal vessel images, improving model accuracy [21].

However, owing to the irregular morphology, dense and disordered distribution, and unclear boundaries of retinal vessels, there is room for improvement in existing methods. To address issues such as incomplete extraction of the underlying features of the retina, susceptibility to background noise, and neglect of the relevance of contextual information in existing methods, this paper proposes a method that integrates feature channel attention mechanisms [22], the Convolutional Block Attention Module (CBAM) [23], and U2Net [24]. The contributions of this paper can be summarized as follows:

  • The backbone network is U2Net, and feature channel attention mechanisms are integrated between the decoders and encoders at corresponding levels of the RSU (Residual U-block) units in U2Net, focusing on salient regions during the extraction of underlying features.

  • During the sampling process at high-resolution levels, a feature space attention mechanism is introduced to form feature maps that are weighted in both the feature channel and feature space dimensions, enriching feature extraction from both the channel and spatial perspectives. Feature channel attention is added again to the output of the sampling process at low-resolution levels to form dual attention feature maps for fine areas, which focus more on detailed feature information and suppress noise.

  • The proposed design captures more contextual information across different scales, attends more closely to the features of the target regions, suppresses background noise, reduces information loss, and improves segmentation performance.

Related Work

U2Net

The U2Net network, proposed by Qin et al., is a two-level nested U-structure based on U-Net that is primarily used for salient object detection (SOD) tasks [25]; its architecture is shown in Fig. 1.

U2Net has three advantages in image segmentation:

  • Independent feature extraction by RSU modules at different levels, increasing the depth and breadth with which complex image features are extracted;

  • Effective fusion of output information from different levels, enabling the model to better integrate contextual information and improve model accuracy;

  • The model includes numerous pooling operations, preventing the computational cost from increasing despite the complex structure.

Fig. 1 Architecture of the U2Net network

Feature channel attention mechanism

The feature channel attention mechanism establishes interconnections between channels, acquires the importance coefficients of each channel through learning, and depends on these importance coefficients to enhance features with high coefficients while suppressing the expression of features with low coefficients.

Its working principle is illustrated in Fig. 2. The feature inputs xl and g undergo 1 × 1 convolutions to match their channel numbers and are then added together. This is followed sequentially by a ReLU activation, a 1 × 1 convolution, and a Sigmoid function to obtain attention allocation coefficients between 0 and 1. These coefficients are then applied to the corresponding parts of the feature map. Finally, the initial input xl is combined with the attention map to produce the final output.

Fig. 2 Architecture of the feature channel attention

CBAM attention module

The Convolutional Block Attention Module (CBAM) comprises two parts: the Spatial Attention Module (SAM) and the Channel Attention Module (CAM). As shown in Fig. 3, the SAM module processes the input X (H × W × D) with mean pooling and max pooling along the channel dimension to generate two feature maps of size H × W × 1. These maps then pass through a convolution with a kernel size of 7 to produce a new feature map, which, after a Sigmoid function, yields the final spatially weighted feature map. The CAM module is similar to the feature channel attention mechanism described previously: it applies mean pooling and max pooling to the input X, followed by two convolution operations and a ReLU, to generate a feature map that captures the importance coefficients of the channels. This map is then passed through a Sigmoid function and applied to the input features to produce the final channel-weighted feature map.

Fig. 3 Architecture of the CBAM attention module: CBAM contains two parts, the Spatial Attention Module (SAM) and the Channel Attention Module (CAM)

Within the CBAM attention mechanism, the spatial attention mechanism focuses on the positional information of targets, enhancing the spatial accuracy of segmentation tasks. Moreover, the channel attention mechanism, by leveraging both pooling methods, enriches the extraction of high-level features, thereby improving the accuracy of segmentation tasks in the feature dimension.
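For reference, a minimal PyTorch sketch of CAM and SAM follows; the reduction ratio of 16 and the shared two-layer MLP follow the original CBAM paper and are assumptions here, not settings reported in this work.

```python
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    """CAM sketch: spatial mean/max pooling, a shared two-layer MLP (1x1 convs with ReLU),
    summation, and a Sigmoid used to weight the channels."""
    def __init__(self, channels, reduction=16):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Conv2d(channels, channels // reduction, kernel_size=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, kernel_size=1),
        )
        self.sigmoid = nn.Sigmoid()

    def forward(self, x):
        avg = self.mlp(torch.mean(x, dim=(2, 3), keepdim=True))
        mx = self.mlp(torch.amax(x, dim=(2, 3), keepdim=True))
        return x * self.sigmoid(avg + mx)

class SpatialAttention(nn.Module):
    """SAM sketch: channel-wise mean and max maps (H x W x 1 each), a 7 x 7 convolution,
    and a Sigmoid used to weight the spatial positions."""
    def __init__(self, kernel_size=7):
        super().__init__()
        self.conv = nn.Conv2d(2, 1, kernel_size, padding=kernel_size // 2)
        self.sigmoid = nn.Sigmoid()

    def forward(self, x):
        avg = torch.mean(x, dim=1, keepdim=True)
        mx, _ = torch.max(x, dim=1, keepdim=True)
        return x * self.sigmoid(self.conv(torch.cat([avg, mx], dim=1)))
```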

Methodology

The structure of the retina is complex, contains a large amount of detailed feature information, and is susceptible to interference from background noise. Therefore, from a structural perspective, the network model needs to extract both high- and low-level feature information to the greatest extent possible, and functionally, it should focus on key parts within the feature information. Inspired by the rich feature extraction capabilities of the U2Net neural network and the principles of attention mechanisms, this paper proposes the (DA-U)2Net model, which is based on the main framework of U2Net and introduces attention mechanisms into the RSU (Residual U-Block) unit modules, thereby constructing a dual attention model.

(DA-U)2Net

As shown in Fig. 4, (DA-U)2Net includes six encoding parts, En_1, En_2, En_3, En_4, En_5, and En_6, and five decoding parts, De_1, De_2, De_3, De_4, and De_5. Among these, En_1 and De_1 use the DARSU-7 module; En_2 and De_2 use the DARSU-6 module; En_3 and De_3 use the DARSU-5 module; En_4 and De_4 use the DARSU-4 module; and En_5, En_6, and De_5 use the DARSU-4F module. Each DARSU module is a U-shaped network constructed with reference to U-Net that integrates two attention mechanisms at different levels, with the numbers 7, 6, 5, and 4 indicating the depth of the U-shaped network. Taking DARSU-4 as an example, it first performs three downsampling operations, followed by three upsampling operations; DARSU-5, DARSU-6, and DARSU-7 each add one more downsampling and upsampling operation in turn. DARSU-4F is a special module: because the resolution of the feature map is already significantly reduced after multiple downsampling steps, further downsampling would lower the resolution even more and cause the loss of detail during upsampling. Therefore, this module does not follow the aforementioned pattern but instead uses dilated convolutions in place of the upsampling and downsampling processes. Dilated convolution expands the receptive field of the convolutional kernel without changing the number of parameters, ensuring that the size of the output feature map remains unchanged [26]; it can capture information over different ranges by adjusting the dilation rate. Finally, the outputs from each upsampling RSU module are fused with the final output of the U-structure to obtain the final prediction map.

Fig. 4 Architecture of the (DA-U)2Net network
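The fusion of the decoder outputs into a final prediction map might look like the following sketch, which follows the original U2Net convention of single-channel side outputs merged by a 1 × 1 convolution; the specific layer choices here are assumptions rather than details reported in the paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SideOutputFusion(nn.Module):
    """Sketch: map each decoder output to a single-channel map, upsample to the input
    resolution, concatenate, and fuse with a 1 x 1 convolution (U2Net-style)."""
    def __init__(self, stage_channels):
        super().__init__()
        self.side_convs = nn.ModuleList(
            [nn.Conv2d(c, 1, kernel_size=3, padding=1) for c in stage_channels]
        )
        self.fuse = nn.Conv2d(len(stage_channels), 1, kernel_size=1)

    def forward(self, stage_features, out_size):
        sides = [
            F.interpolate(conv(f), size=out_size, mode="bilinear", align_corners=False)
            for conv, f in zip(self.side_convs, stage_features)
        ]
        fused = self.fuse(torch.cat(sides, dim=1))
        return torch.sigmoid(fused), [torch.sigmoid(s) for s in sides]
```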

U channel and spatial attention module

In U2Net, the RSU-7, RSU-6, RSU-5, and RSU-4 modules effectively extract features at different levels for the task, but the abundance of feature information may cause the network to overlook significant regions. To retain effective feature information in the context and improve the model’s accuracy, a dual attention mechanism is introduced into these four modules.

As illustrated in Fig. 5, taking DARSU-7 (Double Attention Residual U-block, DARSU) as an example, the encoding part on the left first passes through two 3 × 3 convolutions (Conv2d) with batch normalization (BN) and ReLU, followed by five downsampling operations (2 × 2 max pooling, DownSample), each of which includes Conv2d + BN + ReLU. In the decoding part on the right, the result from the previous level is processed together with the encoding output of the corresponding level through the feature channel attention mechanism, and the resulting feature map is then concatenated with the result from the previous level; each upsampling step likewise includes Conv2d + BN + ReLU. The processed feature maps capture multiscale features and, through the attention mechanism, focus more on salient features; they are then decoded step by step into higher-resolution feature maps, reducing the loss of detail during upsampling. The final concatenated output is fed into the SAM module of CBAM, effectively enhancing the model's sensitivity to spatial information and balancing detail features with global spatial features in the high-resolution feature maps. DARSU-6, DARSU-5, and DARSU-4 adopt the same architecture, except that the number of upsampling and downsampling layers decreases by one at each level.

Fig. 5 Architecture of the DARSU-7 module
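Each encoding and decoding step described above bundles Conv2d + BN + ReLU; a minimal sketch of that building block and of one downsampling step is given below (the function names and channel arguments are illustrative).

```python
import torch.nn as nn

def conv_bn_relu(in_ch, out_ch):
    """Conv2d + BatchNorm + ReLU unit used at every encoding/decoding step of a DARSU block."""
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1),
        nn.BatchNorm2d(out_ch),
        nn.ReLU(inplace=True),
    )

def down_step(in_ch, out_ch):
    """One encoder step: 2 x 2 max pooling followed by the Conv2d + BN + ReLU unit."""
    return nn.Sequential(nn.MaxPool2d(kernel_size=2, stride=2), conv_bn_relu(in_ch, out_ch))
```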

In the aforementioned module, it is important to note the feature channel attention mechanism within the skip connections, as shown in Fig. 6. Let the feature map from the previous layer, of size H1 × W1 × C1, become x after convolution; after downsampling it is halved to H1/2 × W1/2 × C1/2 and, after a further convolution, becomes g. Together with x, g is first passed through the feature channel attention mechanism (as described in the Feature channel attention mechanism section) to produce a new feature map. This new map is then concatenated with the upsampled result of g (of size H1 × W1 × C1) and serves as the input for the next layer. This process not only integrates contextual information into the new feature map but also suppresses the expression of features in less important regions.

Fig. 6 Attention mechanism structure in the skip connections

The input to the module is defined as g, with x representing the skip-connection features from the corresponding encoding level. The computation is as follows:

$$x_{\text{attention}} = \mathrm{Attention}(g, x)$$
(1)
$$x_{\text{upsampled}} = \mathrm{Upsample}(g, \mathrm{scale} = 2)$$
(2)
$$output = \mathrm{ConvBlock}(\mathrm{Concatenate}(x_{\text{attention}}, x_{\text{upsampled}}))$$
(3)
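Read as code, Eqs. (1)–(3) correspond to one decoder step. The sketch below assumes an attention gate such as the one sketched earlier and a Conv2d + BN + ReLU block, and it interpolates g to the size of x so that the two inputs can be gated together.

```python
import torch
import torch.nn.functional as F

def decoder_step(attention_gate, conv_block, g, x):
    """One attention skip connection following Eqs. (1)-(3).
    g: features from the previous (lower-resolution) level; x: skip features from the encoder."""
    g_up = F.interpolate(g, size=x.shape[-2:], mode="bilinear", align_corners=False)  # Eq. (2)
    x_attention = attention_gate(g_up, x)                                             # Eq. (1)
    return conv_block(torch.cat([x_attention, g_up], dim=1))                          # Eq. (3)

# Example wiring (channel sizes are illustrative):
# gate = AttentionGate(64, 64, 32)      # from the earlier sketch
# block = conv_bn_relu(128, 64)         # from the DARSU-7 sketch
# out = decoder_step(gate, block, g, x)
```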

U channel dual attention module

As illustrated in Fig. 7, DARSU-4F differs from the other DARSU modules. Because its input feature map has already undergone five downsampling operations and its resolution is greatly reduced, further downsampling would cause the loss of feature information. Therefore, in this module, dilated convolutions with dilation rates of 2, 4, and 8 are used at depths 3–5 to expand the receptive field. Moreover, owing to the low resolution at this level and the scarcity of spatial feature information, extracting spatial attention is unnecessary; instead, a CAM module is added to the final output to extract globally salient feature information. The entire module thus maximizes the retention of local and global salient feature information with a minimal increase in computational cost, reducing information loss.

Fig. 7 Architecture of DARSU-4F
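A minimal sketch of such a dilated convolution block is shown below; setting the padding equal to the dilation rate keeps the spatial size fixed, and the channel count of 64 is only illustrative.

```python
import torch
import torch.nn as nn

def dilated_conv_bn_relu(in_ch, out_ch, dilation):
    """3 x 3 dilated convolution that preserves the spatial size (padding = dilation),
    used in place of down/upsampling at the lowest-resolution levels of DARSU-4F."""
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=dilation, dilation=dilation),
        nn.BatchNorm2d(out_ch),
        nn.ReLU(inplace=True),
    )

# Dilation rates 2, 4, and 8 enlarge the receptive field without shrinking the feature map
stack = nn.Sequential(*[dilated_conv_bn_relu(64, 64, d) for d in (2, 4, 8)])
x = torch.randn(1, 64, 18, 18)
assert stack(x).shape == x.shape  # resolution is preserved
```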

Loss calculation

The loss function used in this paper combines Dice Loss and CrossEntropy Loss. Dice Loss is a common image segmentation loss function, with the following formula:

$$DiceLoss = 1 - \frac{{2\left| {X \cap Y} \right|}}{\left| X \right| + \left| Y \right|}$$
(4)

where \(\frac{{2\left| {X \cap Y} \right|}}{\left| X \right| + \left| Y \right|}\) is the Dice coefficient, which is primarily used to evaluate the similarity between two samples. In image segmentation tasks, \(X\) represents the true labels and \(Y\) represents the predicted labels; \(\left| {X \cap Y} \right|\) denotes the overlap between the true and predicted labels, and \(\left| X \right| + \left| Y \right|\) denotes the sum of the pixel values of the true and predicted labels. The pixel-wise form is as follows:

$$DiceLoss = 1 - \frac{2\sum_{i=1}^{N} y_i \widehat{y}_i}{\sum_{i=1}^{N} y_i + \sum_{i=1}^{N} \widehat{y}_i}$$
(5)

where \(y_{i}\) is the label value of pixel \(i\), \(\widehat{y}_i\) is the predicted value of pixel \(i\), and \(N\) is the total number of pixels.

Owing to the saturation problem of Dice Loss, particularly when segmenting detailed features, CrossEntropy Loss is also used to weight every pixel equally, effectively complementing Dice Loss. The CrossEntropy Loss formula is as follows:

$$L = - \sum_{i = 1}^{N} y_{i} \log y_{i}^{\prime}$$
(6)

where \(y_{i}\) is the label value of pixel \(i\) and \(y_{i}^{\prime}\) is the predicted value of pixel \(i\).
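A minimal sketch of the combined objective is given below; the equal 1:1 weighting of the two terms and the binary (vessel vs. background) formulation are assumptions, since the paper does not report the exact weighting.

```python
import torch
import torch.nn.functional as F

def dice_loss(probs, target, eps=1e-6):
    """Eq. (5): 1 - 2*sum(y*y_hat) / (sum(y) + sum(y_hat)), computed per image and averaged."""
    probs, target = probs.flatten(1), target.flatten(1)
    intersection = (probs * target).sum(dim=1)
    return (1 - (2 * intersection + eps) / (probs.sum(dim=1) + target.sum(dim=1) + eps)).mean()

def combined_loss(logits, target, dice_weight=1.0, ce_weight=1.0):
    """Dice Loss plus binary cross-entropy on the raw logits; target is a float 0/1 vessel mask."""
    probs = torch.sigmoid(logits)
    return (dice_weight * dice_loss(probs, target)
            + ce_weight * F.binary_cross_entropy_with_logits(logits, target))
```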

Experimental configurations

Experimental parameter settings

The (DA-U)2Net experiments were run on a system with a 16-vCPU Intel(R) Xeon(R) Platinum 8350C CPU @ 2.60 GHz and an RTX 3090 GPU (24 GB VRAM), using Python 3.8, PyTorch 1.13.0, and CUDA 11.6. The batch size is 2, the optimizer is Adam, the initial learning rate is 0.001, the weight decay is 1e−4, and the number of training epochs is 100.
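For reference, these settings translate into roughly the following PyTorch training setup; everything beyond the values stated above (the dataset objects, the model class, and the loss) is a placeholder.

```python
import torch

def build_optimizer(model):
    """Optimizer settings reported above: Adam, initial learning rate 1e-3, weight decay 1e-4."""
    return torch.optim.Adam(model.parameters(), lr=1e-3, weight_decay=1e-4)

# Batch size 2 and 100 epochs; the remaining DataLoader arguments are assumptions.
# loader = torch.utils.data.DataLoader(train_set, batch_size=2, shuffle=True)
# optimizer = build_optimizer(model)
# for epoch in range(100):
#     for images, labels in loader:
#         optimizer.zero_grad()
#         loss = combined_loss(model(images), labels)
#         loss.backward()
#         optimizer.step()
```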

Experimental data

Datasets

The DRIVE (Digital Retinal Images for Vessel Extraction) dataset was released in 2004 by the Image Sciences Institute, and features Dutch diabetic retinopathy data. This dataset includes 40 TIF color fundus images of size 565 × 584, with 40 GIF images manually annotated by two ophthalmologists, corresponding to 40 GIF mask images [27].

The CHASEDB1 (Challenging Database 1) dataset was released by Kingston University London, and features retinal image analysis data of 14 children’s left and right eyes. This dataset includes 28 JPG color fundus images of size 999 × 960, with 28 PNG images manually annotated by two ophthalmologists. CHASEDB1 does not come with corresponding mask images; we manually created the masks for this dataset [28].

The HRF (High-Resolution Fundus) dataset was released by the Ophthalmology Department at the University of California, San Diego (UCSD). This dataset includes 15 healthy fundus images, 15 diabetic retinopathy fundus images, and 15 glaucoma fundus images, with a resolution of 3504 × 2336 pixels. Each image corresponds to an expert manual segmentation result [29].

Preprocessing

Owing to the low contrast between the background and segmentation targets in fundus images and the presence of noise, preprocessing was conducted to improve the segmentation results. The preprocessing steps include the following:

  • Image grayscale transformation via the G channel to enhance the contrast between the background and the segmentation targets, followed by normalization [30];

  • Enhancement of images through contrast-limited adaptive histogram equalization [31];

  • Reduction of artifacts and enhancement of image information via gamma transformation. The images after each preprocessing step are shown in Fig. 8.
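A minimal OpenCV sketch of this pipeline is given below; the CLAHE clip limit, tile size, and gamma value are illustrative defaults, as the paper does not report the exact parameters.

```python
import cv2
import numpy as np

def preprocess(bgr_image, clip_limit=2.0, tile=(8, 8), gamma=1.2):
    """G-channel extraction, normalization, CLAHE, and gamma correction."""
    green = bgr_image[:, :, 1]                                        # G channel (OpenCV loads BGR)
    green = cv2.normalize(green, None, 0, 255, cv2.NORM_MINMAX).astype(np.uint8)
    clahe = cv2.createCLAHE(clipLimit=clip_limit, tileGridSize=tile)
    enhanced = clahe.apply(green)                                     # contrast-limited adaptive histogram equalization
    lut = ((np.arange(256) / 255.0) ** gamma * 255).astype(np.uint8)  # gamma transformation lookup table
    return cv2.LUT(enhanced, lut)
```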

Fig. 8 Typical images after each preprocessing step. a Original image; b G channel image; c Image after the CLAHE operation; d Image after gamma correction

Fig. 9 Process of data augmentation: Step 1, flipping images; Step 2, translating images. Includes the training images, manual annotations, and masks

Data augmentation

Due to the limited number of images in the DRIVE, CHASEDB1, and HRF datasets, there is a significant risk of overfitting during the training process. To address this, as shown in Fig. 9, images were initially augmented through horizontal flips, vertical flips, and combined horizontal and vertical flips, expanding the three datasets to 160, 112, and 144 images, respectively. Subsequently, the flipped images underwent horizontal and vertical translations with a translation coefficient of 50, ensuring that the integrity of the images was preserved. As a result, the DRIVE and CHASEDB1 datasets were further expanded to 800 and 560 images, respectively. However, due to the narrow margins on the left and right sides of the HRF dataset, horizontal and vertical translation augmentation was not applied.

Each dataset was divided into a training set and a testing set at a 4:1 ratio.
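The flip-based part of this augmentation can be sketched as follows; applying the same transform to the image, the manual annotation, and the mask keeps the three aligned, and the translation step would be handled analogously.

```python
import numpy as np

def flip_augment(image, label, mask):
    """Horizontal, vertical, and combined flips applied identically to the image,
    the manual annotation, and the field-of-view mask."""
    flips = (np.fliplr, np.flipud, lambda a: np.flipud(np.fliplr(a)))
    return [(f(image), f(label), f(mask)) for f in flips]  # 3 extra samples per original (x4 total)
```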

Evaluation metrics

This paper uses accuracy (Acc), specificity (SP), sensitivity (SE), the F1 score (F-measure) [7], and the mean intersection over union (MIoU) [32] to evaluate and analyze the model.
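These metrics are computed pixel-wise from the confusion matrix, as in the sketch below; treating MIoU as the average of the vessel and background IoU values is an assumption consistent with common practice.

```python
import numpy as np

def binary_metrics(pred, target):
    """Acc, SP, SE, F1, and MIoU for binary vessel masks (boolean or 0/1 arrays)."""
    pred, target = pred.astype(bool), target.astype(bool)
    tp = np.sum(pred & target)
    tn = np.sum(~pred & ~target)
    fp = np.sum(pred & ~target)
    fn = np.sum(~pred & target)
    acc = (tp + tn) / (tp + tn + fp + fn)
    sp = tn / (tn + fp)                     # specificity
    se = tp / (tp + fn)                     # sensitivity (recall)
    f1 = 2 * tp / (2 * tp + fp + fn)
    miou = (tp / (tp + fp + fn) + tn / (tn + fp + fn)) / 2
    return {"Acc": acc, "SP": sp, "SE": se, "F1": f1, "MIoU": miou}
```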

Results and discussion

Comparison with existing methods

The proposed model was compared with the classic retinal segmentation model U-Net and a widely used improved variant, R2UNet [33], across three datasets: DRIVE, CHASEDB1, and HRF. The segmentation results are shown in Figs. 10 and 11.

Fig. 10 Segmentation results of the different methods on DRIVE and CHASEDB1. a Original image, b Ground truth, c U-Net, d R2UNet, e Ours

Fig. 11 Segmentation results of the different methods on HRF. a Original image, b Ground truth, c U-Net, d R2UNet, e Ours

Compared with the gold standard segmentation results (Figs. 10(b) and 11(b)), the proposed model achieves continuous and clear blood vessel structures, while the segmentation results of the other two models are relatively sparse. As shown in the first row of Fig. 10, fine blood vessels and complex regions were successfully segmented by the proposed model, which accurately restored the complex structures consistent with the gold standard. In contrast, the other two algorithms showed issues such as vessel breakage or redundant vessels, failing to fully replicate the real vascular structures. Moreover, as illustrated in Fig. 11, the proposed model produces smoother and clearer segmentation boundaries. This advantage facilitates disease diagnosis, especially in medical contexts where precise segmentation is critical.

A comparison of the segmentation performance metrics of U-Net and R2UNet on the DRIVE, CHASEDB1, and HRF datasets is shown in Table 1. On the DRIVE dataset, except for the sensitivity (SE) metric, which is lower than that of the U-Net model, all metrics surpass those of both the U-Net and R2UNet models. On the CHASEDB1 dataset, all metrics exceed those of the U-Net model, and both the global accuracy (Acc) and sensitivity (SE) also outperform the R2UNet model; although the remaining metrics are slightly lower than those of R2UNet, they remain comparable. On the HRF dataset, except for the sensitivity (SE) metric, which is slightly lower than that of the R2UNet model, all metrics show significant improvement. The HRF dataset has more detailed features than the other two datasets, and the notable increase in accuracy further confirms the superior performance of the proposed model.

Table 1 Comparative experimental data of U-Net and R2UNet

Quantitative analysis of evaluation metrics compared with other state-of-the-art models

Tables 2, 3, and 4 compare the quantitative results of our model and existing methods on the evaluation metrics. The proposed model performs better than existing methods on the DRIVE and HRF datasets, with particularly significant improvements on HRF; however, its performance on the CHASEDB1 dataset is relatively weaker. Further analysis reveals that the DRIVE and HRF datasets are characterized by dense, fine vascular branches, whereas CHASEDB1 contains fewer and thicker vessels. This suggests that the proposed model is better suited to segmenting complex vascular structures but slightly less effective on simpler fundus structures. The inferred reason is that U2Net is designed to handle more complex structures, making it more effective for samples with many fine, layered structures; for simpler samples with fewer structures, overfitting may lead to a decline in the segmentation metrics.

Table 2 Quantitative result comparison of different models on the DRIVE
Table 3 Quantitative result comparison of different models on the CHASEDB1
Table 4 Quantitative result comparison of different models on the HRF

Ablation studies

To verify the effectiveness of the dual attention mechanism adopted in this paper, ablation studies were conducted on the DRIVE, CHASEDB1, and HRF datasets. The ablation models are as follows: 1) the Attention-U2Net model, which incorporates the channel attention mechanism into each RSU of U2Net, forming the A-RSU module that replaces the original RSU module; 2) the Attention-CBAM-U2Net model, which incorporates the SAM module of the CBAM attention mechanism into the A-RSU-6, A-RSU-5, and A-RSU-4 blocks of the Attention-U2Net model and introduces the CAM module into A-RSU-4F; and 3) to better verify the effectiveness of the model, the CBAM-U2Net model, which removes the channel attention mechanism between RSUs and retains only the CBAM attention mechanism.

Tables 5, 6, 7, and Fig. 12 present a comparison of the experimental results between the original model and different module models. The results revealed that simply integrating the attention mechanism into the RSU module or adding the CBAM module does not effectively improve segmentation efficiency. However, after the dual attention mechanism is integrated, most metrics significantly improve. This confirms that while a hierarchical structure can capture rich features, it does not improve segmentation efficiency on its own. In contrast, a network model that incorporates a dual attention mechanism can better utilize the rich information characteristics of U2Net. The introduction of the attention mechanism enables the model to focus more on significant areas of space and feature information, suppress the expression of less important regions, and achieve an improvement in detail segmentation.

Table 5 Comparison of the experimental performance of the model improvements on the DRIVE dataset
Table 6 Comparison of the experimental performance of the model improvements on the CHASEDB1 dataset
Table 7 Comparison of the experimental performance of the model improvements on the HRF dataset
Fig. 12 Comparison of the experimental performance of the model improvements on DRIVE, CHASEDB1, and HRF. a DRIVE dataset, b CHASEDB1 dataset, c HRF dataset

A comparison of the segmentation results of the ablation study models on the DRIVE, along with detailed segmentation slices, is shown in Fig. 13. The gold standard and our model reveal two finer capillaries within this area, whereas the other experimental models fail to detect these fine vessels in this location. Hence, the dual attention hierarchical model proposed in this paper demonstrates a superior ability to focus on detailed information, resulting in clearer segmentation outcomes.

Fig. 13 Detail comparison in the ablation study on the DRIVE dataset. a Original image, b Ground truth, c U-Net, d Attention-U2Net, e CBAM-U2Net, f Attention-CBAM-U2Net

To better demonstrate the effectiveness of the U-channel and spatial attention module and the U-channel dual attention module, we randomly selected an image from the DRIVE dataset for attention visualization in Fig. 14. Figure 14 illustrates: (a) the original image, (b) a grayscale image designed to more clearly display the vascular structures, (c) a model that lacks an attention mechanism, (d) a model that integrates only the channel attention mechanism, (e) a model that integrates only the CBAM attention mechanism, and (f) a model that integrates the dual attention mechanism. The heatmaps indicate that the model with dual attention focuses more effectively on the vascular regions.

Fig. 14 Heatmap comparison on the DRIVE dataset. a Original image, b Detail image, c U-Net, d Attention-U2Net, e CBAM-U2Net, f Attention-CBAM-U2Net

Discussion and evaluation

To demonstrate the model’s adaptability and reliability, this paper evaluates it on three clinical patient datasets: RBV, SMDG [45], and FIVES [46].

The Retina-Blood-Vessel (RBV) dataset consists of 100 high-resolution retinal fundus images captured with state-of-the-art imaging devices. Each image comes with pixel-level ground truth annotations that mark the exact locations of the blood vessels. These annotations facilitate the development and evaluation of advanced segmentation algorithms.

The Standardized Multi-channel Glaucoma Dataset (SMDG-19) is a standardized collection of 19 publicly available datasets. It includes complete retinal fundus images for glaucoma, along with associated image metadata, such as optic disc, cup, and vessel segmentations, as well as any provided single-instance textual metadata (e.g., gender and age). This dataset is currently the largest publicly available repository of glaucoma retinal images.

The FIVES (Fundus Image Vessel Segmentation) dataset contains 800 high-resolution color retinal images, each manually annotated pixel by pixel through standardized crowdsourcing by medical professionals. The data were sourced from 573 patients, aged 4 to 83 years, seen at the Ophthalmology Center of the Second Affiliated Hospital of Zhejiang University (SHAZU) between 2016 and 2021. The dataset contains a significant number of low-quality images, some of which are nearly impossible to interpret with the naked eye because of imaging issues. Effective segmentation of these images requires clinical experience, which makes this dataset valuable for demonstrating the model's adaptability and reliability.

After transfer learning with the proposed model on the three datasets, its performance was evaluated, with the results shown in Table 8. The data indicate a significant improvement across all metrics compared with the public datasets, with especially strong performance on the SMDG and FIVES datasets. The accuracy (ACC) was 0.9871, the specificity (SP) was 0.9930, the sensitivity (SE) was 0.9220, the mean intersection over union (MIoU) was 0.8989, and the F1 score was 0.8960, demonstrating the model's significant advantages in segmenting large-scale, complex clinical datasets.

Table 8 Comparative Experimental Results for RBV, SMDG, and FIVES

The FIVES dataset contains a large number of low-quality images, some of which are nearly indistinguishable due to capture issues. Clinical doctors must have extensive experience to achieve effective segmentation. As shown in the comparison in Fig. 15, the proposed model demonstrates superior performance even on low-quality images, with segmentation results closely aligned with the ground truth, indicating that its segmentation capability is on par with experienced clinicians.

Fig. 15 Segmentation results of the proposed model on low-quality data

The evaluation results on clinically challenging datasets provide stronger evidence of the model’s robustness and its ability to generalize across datasets, ensuring its applicability in real-world scenarios.

Conclusion

This paper proposes the integration of multi-attention mechanisms into the RSU submodules of the U2Net for retinal segmentation tasks. The proposed (DA-U)2Net effectively captures multiscale contextual information and promotes the fusion of features at different levels to obtain richer semantic feature representations. Specifically, for different RSU modules, the spatial attention module (SAM) is integrated into the outputs of higher-resolution level modules, achieving dual attention in both spatial and channel dimensions, which effectively extracts multiscale contextual information. In lower-resolution level modules, due to the scarcity of spatial information, the channel attention module (CAM) is integrated to prevent detail loss while simultaneously suppressing background noise. The model effectively utilizes the hierarchical network characteristics of U2Net to extract rich image information features. The integration of the dual attention mechanism enables the model to effectively suppress noise expression, while the combination of spatial and channel attention reduces the loss of key information, ultimately improving segmentation accuracy. Extensive comparative evaluations on three publicly available retinal fundus extended databases demonstrate the superiority of the proposed method over most existing approaches. Finally, the evaluation on three clinically challenging datasets shows a significant improvement across all metrics, particularly in the segmentation of low-quality images, which is close to the ground truth. This proves the model’s applicability in real-world settings.

The proposed model, however, has the following limitations: 1) the structure of the model is complex, leading to high computational costs; 2) the complexity of the model structure imposes high demands on the quantity and quality of images, which is particularly evident in its performance on the high-resolution HRF dataset compared with the other datasets; and 3) while the model achieves more precise segmentation for images rich in detail, its segmentation performance is poorer for images with sparse vascular distributions.

These points represent areas for future research improvement. Nonetheless, the proposed model introduces a novel stereoscopic architecture integrated with multiple attention mechanisms, paving the way for complex ocular vascular segmentation. It is believed that the model can be readily extended to other medical image segmentation tasks, particularly those facing significant challenges of large-scale variations and complex features.

Data availability

The datasets used in this study are available in the DRIVE Dataset, which can be accessed at https://drive.grand-challenge.org/, and the CHASEDB1 Dataset, accessible at https://researchdata.kingston.ac.uk/96/.

References

  1. Wu Y, Xia Y, Song Y, Zhang Y, Cai W. Nfn+: a novel network followed network for retinal vessel segmentation. Neural Netw. 2020;126:153–62.

  2. Liew G, Benitez-Aguirre P, Craig ME, Jenkins AJ, Hodgson LA, Kifley A, Mitchell P, Wong TY, Donaghue K. Progressive retinal vasodilation in patients with type 1 diabetes: a longitudinal study of retinal vascular geometry. Invest Ophthalmol Vis Sci. 2017;58(5):2503–9.

  3. Wong TY, Mitchell P. Hypertensive retinopathy. N Engl J Med. 2004;351(22):2310–7.

  4. Son J, Park SJ, Jung K-H. Towards accurate segmentation of retinal vessels and the optic disc in fundoscopic images with generative adversarial networks. J Digit Imaging. 2019;32(3):499–512.

  5. Joshua AO, Nelwamondo FV, Mabuza-Hocquet G. Blood vessel segmentation from fundus images using modified u-net convolutional neural network. J Image Graph. 2020;8(1):21–5.

  6. Morello CM. Etiology and natural history of diabetic retinopathy: an overview. Am J Health-Syst Pharm. 2007;64(17 Supplement 12):3–7.

  7. Wu H, Wang W, Zhong J, Lei B, Wen Z, Qin J. Scs-net: a scale and context sensitive network for retinal vessel segmentation. Med Image Anal. 2021;70:102025.

  8. Liming L, Longsong Z, Xin C, Jie Y, Xingang F. Ghost convolution adaptive retinal vessel segmentation algorithm. Opto-Electronic Eng. 2021;48(10):210291–301.

  9. Gulshan V, Peng L, Coram M, Stumpe MC, Wu D, Narayanaswamy A, Venugopalan S, Widner K, Madams T, Cuadros J, et al. Development and validation of a deep learning algorithm for detection of diabetic retinopathy in retinal fundus photographs. Jama. 2016;316(22):2402–10.

  10. Ting DSW, Cheung CYL, Lim G, Tan GSW, Quang ND, Gan A, Hamzah H, Garcia-Franco R, San Yeo IY, Lee SY, et al. Development and validation of a deep learning system for diabetic retinopathy and related eye diseases using retinal images from multiethnic populations with diabetes. Jama. 2017;318(22):2211–23.

  11. Koukounis D, Ttofis C, Papadopoulos A, Theocharides T. A high performance hardware architecture for portable, low-power retinal vessel segmentation. Integration. 2014;47(3):377–86.

  12. Fraz MM, Jahangir W, Zahid S, Hamayun MM, Barman SA. Multiscale segmentation of exudates in retinal images using contextual cues and ensemble classification. Biomed Signal Process Control. 2017;35:50–62.

  13. Chen J, Chen Y, Li W, et al. Application and prospect of deep learning in video object segmentation. Chin J Comput. 2021;44(3):609–31.

  14. Sun L, Shao W, Wang M, Zhang D, Liu M. High-order feature learning for multi-atlas based label fusion: application to brain segmentation with mri. IEEE Trans Image Process. 2019;29:2702–13.

  15. Sun L, Shao W, Zhang D, Liu M. Anatomical attention guided deep networks for roi segmentation of brain mr images. IEEE Trans Med Imaging. 2019;39(6):2000–12.

  16. Hesamian MH, Jia W, He X, Kennedy P. Deep learning techniques for medical image segmentation: achievements and challenges. J Digit Imaging. 2019;32:582–96.

  17. Long J, Shelhamer E, Darrell T. Fully convolutional networks for semantic segmentation. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2015:3431–3440.

  18. Sun J, Peng Y, Guo Y, Li D. Segmentation of the multimodal brain tumor image used the multi-pathway architecture method based on 3d fcn. Neurocomputing. 2021;423:34–45.

  19. Ronneberger O, Fischer P, Brox T. U-net: Convolutional networks for biomedical image segmentation[C].Medical image computing and computer-assisted intervention–MICCAI 2015: 18th international conference, Munich, Germany, October 5-9, 2015, proceedings, part III 18. Springer International Publishing, 2015: 234-241.

  20. Bahdanau, D., Cho, K., & Bengio, Y. Neural machine translation by jointly learning to align and translate[C]. 2015 International Conference on Learning Representations (ICLR), San Diego, CA, USA;  2015. https://doiorg.publicaciones.saludcastillayleon.es/10.48550/arXiv.1409.0473.

  21. Li D, Zhang Z. Improved u-net segmentation algorithm for the retinal blood vessel images. Acta Opt Sin. 2020;40(10):1010001.

  22. Hu J, Shen L, Sun G. Squeeze-and-excitation networks[C].Proceedings of the IEEE conference on computer vision and pattern recognition. 2018:7132-7141.

  23. Woo S, Park J, Lee JY, Kweon IS. Cbam: convolutional block attention module. In: Proceedings of the European Conference on Computer Vision (ECCV), 2018:3–19.

  24. Qin X, Zhang Z, Huang C, Dehghan M, Zaiane OR, Jagersand M. U2net: Going deeper with nested u-structure for salient object detection. Pattern Recogn. 2020;106:107404.

  25. Zhang J, Yu X, Li A, Song P, Liu B, Dai Y. Weakly-supervised salient object detection via scribble annotations. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020:12546–12555.

  26. Yu, F., & Koltun, V. Multi-scale context aggregation by dilated convolutions[C]. International Conference on Learning Representations (ICLR), San Juan, Puerto Rico; 2016. https://doiorg.publicaciones.saludcastillayleon.es/10.48550/arXiv.1511.07122.

  27. Staal J, Abràmoff MD, Niemeijer M, Viergever MA, Van Ginneken B. Ridge-based vessel segmentation in color images of the retina. IEEE Trans Med Imaging. 2004;23(4):501–9.

  28. Owen CG, Rudnicka AR, Mullen R, Barman SA, Monekosso D, Whincup PH, Ng J, Paterson C. Measuring retinal vessel tortuosity in 10-year-old children: validation of the computer-assisted image analysis of the retina (caiar) program. Invest Ophthalmol Vis Sci. 2009;50(5):2004–10.

  29. Odstrcilik J, Kolar R, Budai A, Hornegger J, Jan J, Gazarek J, Kubena T, Cernosek P, Svoboda O, Angelopoulou E. Retinal vessel segmentation by improved matched filtering: evaluation on a new high-resolution fundus image database. IET Image Proc. 2013;7(4):373–83.

  30. Soares JV, Leandro JJ, Cesar RM, Jelinek HF, Cree MJ. Retinal vessel segmentation using the 2-d gabor wavelet and supervised classification. IEEE Trans Med Imaging. 2006;25(9):1214–22.

  31. Pizer SM, Amburn EP, Austin JD, Cromartie R, Geselowitz A, Greer T, Haar Romeny B, Zimmerman JB, Zuiderveld K. Adaptive histogram equalization and its variations. Comput Vision Graph Image Process. 1987;39(3):355–68.

  32. Wu J, Xuan S. Scale-aware dense residual retinal vessel segmentation network with multi-output weighted loss. BMC Med Imaging. 2023;23(1):100.

  33. Alom M Z, Yakopcic C, Taha T M, et al. Nuclei segmentation with recurrent residual convolutional neural networks based U-Net (R2U-Net)[C]//NAECON 2018-IEEE National Aerospace and Electronics Conference. IEEE, 2018: 228-233.

  34. Ding W, Sun Y, Huang J, Ju H, Zhang C, Yang G, Lin CT. Rcarunet: retinal vessel segmentation network algorithm via novel rough attention mechanism. Inf Sci. 2024;657:120007.

  35. Azzopardi G, Strisciuglio N, Vento M, Petkov N. Trainable cosfire filters for vessel delineation with application to retinal images. Med Image Anal. 2015;19(1):46–57.

  36. Xie S, Tu Z. Holistically-nested edge detection. In: Proceedings of the IEEE International Conference on Computer Vision. 2015:1395–1403.

  37. Fraz MM, Barman SA, Remagnino P, Hoppe A, Basit A, Uyyanonvara B, Rudnicka AR, Owen CG. An approach to localize the retinal blood vessels using bit planes and centerline detection. Comput Methods Programs Biomed. 2012;108(2):600–16.

  38. Hu K, Zhang Z, Niu X, Zhang Y, Cao C, Xiao F, Gao X. Retinal vessel segmentation of color fundus images using multiscale convolutional neural network with an improved cross-entropy loss function. Neurocomputing. 2018;309:179–91.

  39. Mo J, Zhang L. Multi-level deep supervised networks for retinal vessel segmentation. Int J Comput Assist Radiol Surg. 2017;12:2181–93.

  40. Shin SY, Lee S, Yun ID, Lee KM. Deep vessel segmentation by learning graphical connectivity. Med Image Anal. 2019;58:101556.

  41. Yan Z, Yang X, Cheng KT. A three-stage deep learning model for accurate retinal vessel segmentation. IEEE J Biomed Health Inform. 2018;23(4):1427–36.

  42. Zhou Y, Chen Z, Shen H, Zheng X, Zhao R, Duan X. A refined equilibrium generative adversarial network for retinal vessel segmentation. Neurocomputing. 2021;437:118–30.

  43. Cherukuri V, Bg VK, Bala R, Monga V. Deep retinal image segmentation with regularization under geometric priors. IEEE Trans Image Process. 2019;29:2552–67.

  44. Soomro TA, Afifi AJ, Gao J, Hellwich O, Zheng L, Paul M. Strided fully convolutional neural network for boosting the sensitivity of retinal blood vessels segmentation. Expert Syst Appl. 2019;134:36–52.

  45. Kiefer R, Abid M, Steen J, Ardali MR, Amjadian E. A catalog of public glaucoma datasets for machine learning applications: a detailed description and analysis of public glaucoma datasets available to machine learning engineers tackling glaucoma-related problems using retinal fundus images and oct images. 2023:24–31.

  46. Jin K, Huang X, Zhou J, Li Y, Yan Y, Sun Y, Zhang Q, Wang Y, Ye J. Fives: a fundus image dataset for artificial intelligence based vessel segmentation. Sci Data. 2022;9(1):475.


Acknowledgements

The authors would like to thank the editor and reviewers for their valuable comments and suggestions.

Funding

This work was supported in part by the Project for the Top Ten Emerging Industries in Anhui Province [2022sdxx031]; in part by the Scientific Research Foundation of the Education Department of Anhui Province of China [2024AH051916]; and in part by the Middle-aged and Young Scientist Research Fund of Wannan Medical College [WK2023ZQNZ03].

Author information

Contributions

BC was responsible for designing the networks and experiments and for writing and revising the paper. ZYX and JSZ were responsible for providing design guidance and modification suggestions and for revising the paper. ZWQ was responsible for explaining the medical knowledge. All authors were directly involved in the study and read and approved the final manuscript.

Corresponding author

Correspondence to Zhengyuan Xu.

Ethics declarations

Ethics approval and consent to participate

Not applicable.

Consent for publication

Not applicable.

Competing interests

The authors declare no competing interests.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.

About this article

Cite this article

Chu, B., Zhao, J., Zheng, W. et al. (DA-U)2Net: double attention U2Net for retinal vessel segmentation. BMC Ophthalmol 25, 86 (2025). https://doiorg.publicaciones.saludcastillayleon.es/10.1186/s12886-025-03908-0

  • Received:

  • Accepted:

  • Published:

  • DOI: https://doiorg.publicaciones.saludcastillayleon.es/10.1186/s12886-025-03908-0

Keywords