Research | Open access

Hongda Li1,2, Yue Zhao1, Zeyang Bi1, Peng Hao2,3, Huarui Wu2,3 & Chunjiang Zhao2,3

Plant Methods volume 21, Article number: 30 (2025)
Abstract
Background
Cabbage is an important economic crop, and the growth status of its root system directly affects overall plant health and yield. To monitor the root growth status of cabbage seedlings during their growth period, this study proposes a new network architecture called Swin-Unet++. This architecture integrates the Swin Transformer module and residual networks, using attention mechanisms in place of traditional convolution operations for feature extraction and adopting the residual concept to fuse contextual information from different levels, thereby addressing the insufficient feature extraction of the thin, mesh-like roots of cabbage seedlings.
Results
Compared with other backbone high-precision semantic segmentation networks, Swin-Unet++ achieves superior segmentation results. The accuracy of Swin-Unet++ in root segmentation tasks reached 98.19%, with 60M model parameters and an average response time of 29.5 ms. Compared with the classic Unet network, the mIoU increased by 1.08%, verifying that the Swin Transformer and residual networks can accurately extract fine-grained root features. Furthermore, when root positions are located through contours on images produced by different semantic segmentation models, Swin-Unet++ gives the best localization. On the basis of the root pixels obtained from semantic segmentation, the calculated maximum root length, extension width, and root thickness were compared with actual measurements, yielding goodness-of-fit R² values of 94.82%, 94.43%, and 86.45%, respectively, verifying the effectiveness of this network in extracting the phenotypic traits of cabbage seedling roots.
Conclusions
The Swin-Unet++ framework developed in this study provides a new technique for the monitoring and analysis of cabbage root systems, ultimately leading to the development of an automated analysis platform that offers technical support for intelligent agriculture and efficient planting practices.
Introduction
Cabbage is a significant global economic crop. According to the 2020 statistics from the Food and Agriculture Organization (FAO) of the United Nations, the total cultivated area reached 3.77 million hectares, with an annual production of 96.39 million metric tons, and the overall industry value approximated 16.12 billion USD. This underscores its substantial agricultural importance and economic relevance [1]. Typically, cabbage seedlings are cultivated in plug trays and transplanted outdoors once they reach an appropriate growth stage. However, the transplant survival rate [2] and subsequent growth of cabbage largely depend on the growth condition of its root system. As the main organ for absorbing water and nutrients, the root system's health not only directly affects the overall growth of the plant but is also closely related to the final yield of cabbage [3]. Therefore, studying the growth status of the root system during the seedling stage can help improve cabbage seedling management, optimize transplant survival rates, and provide scientific guidance for subsequent field management.
The growth status of cabbage root systems is directly reflected in a series of phenotypic parameters, and the accurate analysis of these parameters is crucial for a deeper understanding of plant growth mechanisms [4, 5]. Traditional methods for studying plant root systems rely primarily on manual measurement and observation, which are time-consuming and prone to subjective errors. With the advancement of computer technology, researchers have begun to explore automated image processing using computer vision techniques. Early root segmentation methods were mainly based on edge detection [6] and threshold processing [7]. For example, binary threshold methods were used to segment the shape and root system of carrots, extracting phenotypic traits such as root contours and curvature [8]. These methods provided initial tools for the automated measurement of plant root systems, addressing the inefficiencies and errors associated with manual measurements. However, these approaches are sensitive to noise, especially when handling complex scenes or non-uniform lighting conditions, which can result in loss of edge information or false detections. Additionally, they rely heavily on parameter settings, making them difficult to generalize across different application scenarios. To overcome these issues, researchers have gradually shifted to deep learning-based image segmentation methods, driven by advancements in computational power and the rapid development of deep learning techniques. The introduction of convolutional neural networks (CNNs) has led to breakthrough progress in the automated analysis of plant root phenotypes. Deep learning models based on architectures like UNet [9] and SE-ResNet can achieve precise semantic segmentation of rice roots even in noisy environments [10]. Compared to traditional edge detection methods, these models are better at handling complex root structures and offer higher segmentation accuracy [11]. Subsequent research has improved the UNet architecture and incorporated image enhancement techniques, such as EnlightenGAN [12], to further enhance segmentation performance in challenging scenarios. For example, these improved methods can address issues like soil occlusion and uneven lighting, enabling precise segmentation and phenotypic extraction of soybean roots [13]. Moreover, the introduction of the AGSS (Anti-Gravity Stem Seeking) algorithm [14] has further resolved issues of discontinuities and broken segments in root images, significantly improving the restoration and completeness of root system images. This study focuses on the growth status of cabbage seedling roots: key phenotypic parameters are calculated from the overall root area, and semantic segmentation provides the pixel-level information necessary for these calculations, enabling root localization through pixel contours. Compared to instance segmentation, semantic segmentation has lower computational complexity and is more suitable for the practical, automated analysis of the growth status of individual whole roots.
Current backbone architectures for deep learning-based semantic segmentation can be broadly categorized into two types: CNN-based and Transformer-based models. Early representative works in CNN-based semantic segmentation include PSPNet [15] and DeepLabV3+ [16], which have strong local feature extraction capabilities. However, their fixed receptive fields hinder the ability to capture global contextual relationships. Root systems often have complex fibrous structures, such as overlapping roots or varying thicknesses, and CNNs alone struggle to segment boundary contours accurately. While models like YOLO-seg [17] have modified the receptive field design to address this limitation, they still fall short in modeling the long-range dependencies essential for a comprehensive understanding of complex root architectures. Transformer-based models, such as Segformer [18] and TransUnet [19], address these challenges by leveraging self-attention mechanisms, which offer exceptional capabilities for modeling long-range relationships. The pre-trained large model Grounded SAM [20] exhibits remarkable zero-shot generalization, adapting effectively to a variety of complex scenarios; however, its performance in segmenting the distal regions of roots remains limited, and its high computational cost renders it impractical for widespread use. More generally, Transformer-based models, while powerful, are prone to neglecting crucial fine-grained local features of the roots and suffer from significant computational overhead. The proposed Swin-Unet++ architecture mitigates these limitations by integrating the strengths of both approaches. The Swin Transformer component, with its hierarchical structure and window-based attention mechanism, effectively captures both local and global features while maintaining reasonable computational complexity. Meanwhile, the skip connections of the UNet [21, 22] architecture preserve fine-grained spatial information crucial for the accurate identification of root nodes.
The root system of cabbage is a typical fibrous root system, characterized by a thin and reticulated structure. This structural complexity makes root identification and analysis challenging, especially in image processing and automated analysis. The intricate morphology and fine root hairs often hinder traditional segmentation methods from accurately capturing the root system. Moreover, most existing research focuses on analyzing phenotypic traits of the above-ground parts, such as head diameter and leaf area [23, 24], while the extraction of root phenotypic parameters remains underdeveloped, limiting the ability to fully reflect cabbage growth status and its response to environmental factors. To systematically analyze the growth condition of cabbage seedling roots and reveal key root phenotypic traits, this paper designs the Swin-Unet++ network. The objective is to achieve accurate semantic segmentation of root and shoot regions in images, allowing the extraction of root phenotypic parameters from the segmented images. The network incorporates the Swin Transformer module and the concept of residual connections, replacing traditional convolutions with attention mechanisms [25, 26] for feature extraction. By fusing features across different layers, it addresses the problem of feature loss and enhances the global contextual relationships in images. This enables the model to capture critical root phenotypic parameters when processing segmented images. By revealing key phenotypic traits of root growth, this study not only provides a new method for monitoring cabbage growth but also advances the development of intelligent agriculture. The main contributions of this paper are as follows:
- 1.
A Swin-Unet++ semantic segmentation network architecture is proposed, which combines the Swin Transformer module with residual networks to address the issue of insufficient feature extraction for thin and light-colored roots. This effectively improves the semantic segmentation accuracy between root and stem-leaf regions.
- 2.
Based on the Swin-Unet++ cabbage seedling segmentation task, root pixels are accurately localized to extract phenotypic parameters of root growth: maximum root length, root extension width, root area, and root thickness. The effectiveness of Swin-Unet++ in phenotype analysis is verified, with the goodness of fit R² between the extracted parameters and actual measured values reaching 94.82%, 94.43%, and 86.45%, respectively.
- 3.
An automated platform for extracting cabbage seedling root phenotypic parameters was developed, providing a convenient tool for root phenotype research.
Materials and methods
Data collection
The cabbage seedlings used in this study were sourced from the Lvyuan Demonstration Base in Suning, Hebei. The cultivation cycle from sowing to transplantation was 35 days. During data collection, the seedlings were first removed from the plug tray and cleaned of substrate; the roots were then dried of moisture and combed flat before being placed on a white background board for photography. The smartphone camera was equipped with a CMOS sensor with a resolution of 4000 × 3000, which typically offers an optimal balance between image quality and power consumption. With a shutter speed of 1/100 s, an ISO setting of 191, and an aperture of f/1.79, the camera is well suited to typical lighting conditions, delivering high-quality images and effective background blur. The images were captured with a 35mm focal length lens, with the camera fixed at a height of 10cm from the ground to ensure clear imaging of the cabbage seedling roots. A 15cm scale bar was placed on the right side of each image as a length reference for the roots, facilitating subsequent image analysis. All data are of high quality and accurately reflect the growth status of cabbage root systems. The original images were divided into three regions: background, seedling roots, and stem-leaf parts. Annotation was performed manually using Label Studio [27], generating mask images with three values: 0, 1, and 2, representing the background, root, and stem-leaf regions, respectively. The image acquisition process is illustrated in Fig.1, which details each step of the image collection procedure.
Dataset acquisition and preparation process
Data augmentation
Data augmentation applies various transformations to the initial dataset to expand the data volume, aiming to enhance the robustness and generalization capability of the model. In this study, data augmentation is implemented through operations such as flipping, rotation, scaling, brightness enhancement, random distortion, contrast adjustment, and random noise [28], as illustrated in Fig.2. With a probability of 50%, an image undergoes horizontal flipping to enhance diversity. Each image is also randomly rotated (probability 1), by up to 2 degrees counterclockwise and up to 25 degrees clockwise. There is a 30% chance that the image will be scaled to 85% of its original size to prevent cropping. Brightness is always adjusted within the range of 0.7 to 1.2. Grid distortion is applied with an 80% probability, with both the grid width and height set to 10 and the amplitude set to 20 to amplify the deformation effect. Contrast is likewise always adjusted within the range of 0.7 to 1.2. Finally, random noise with a factor of 14 is introduced. These operations generate diversified data that differ from the original images, effectively increasing the model's adaptability to various conditions. All augmentations are applied synchronously to the single-channel label images.
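The settings above map naturally onto a pipeline-style augmentation library. The following is a minimal sketch assuming the open-source Augmentor package, whose parameter names happen to match the values reported here; the paper does not name the library, so the directory layout and exact calls are illustrative:

```python
import Augmentor

# Build an augmentation pipeline; the paired single-channel label masks
# receive the same geometric transforms via ground_truth().
p = Augmentor.Pipeline("data/images")        # hypothetical image directory
p.ground_truth("data/masks")                 # keep label masks in sync

p.flip_left_right(probability=0.5)                          # horizontal flip
p.rotate(probability=1.0, max_left_rotation=2,              # up to 2 deg CCW,
         max_right_rotation=25)                             # 25 deg CW
p.zoom(probability=0.3, min_factor=0.85,                    # 30% chance of
       max_factor=0.85)                                     # scaling to 85%
p.random_brightness(probability=1.0, min_factor=0.7, max_factor=1.2)
p.random_distortion(probability=0.8, grid_width=10,
                    grid_height=10, magnitude=20)           # grid distortion
p.random_contrast(probability=1.0, min_factor=0.7, max_factor=1.2)
# additive random noise (factor 14) is not a built-in Augmentor operation
# and would be applied in a separate post-processing step

p.sample(1000)   # generate the 1,000 augmented image/mask pairs
```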
After augmentation, the dataset contained an additional 1,000 images, and each image was standardized to 512 × 512 pixels. To ensure the reliability of model evaluation, we employed a stratified sampling strategy with an 8:1:1 ratio to partition the dataset into training, validation, and testing sets. The abundant training data enables the model to fully learn the complex feature patterns of the root system, while sufficient validation and testing data remain for final evaluation. Given the significant class imbalance in the dataset, with roots, stems, leaves, and background regions exhibiting disparate distributions, stratified sampling was applied to maintain consistent class distributions across the three subsets. To further assess the robustness of the results, we conducted 5-fold cross-validation on the combined training and validation sets, ensuring that the original stratified class distributions were preserved in each fold. This data augmentation strategy introduces more complex features into the dataset and expands its diversity without increasing the actual collection cost, enabling the model to perform more robustly against the complex backgrounds and noise interference encountered in real-world applications.
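For the split itself, a scikit-learn sketch of the stratified 8:1:1 partition and the 5-fold cross-validation could look as follows; the per-image stratification key (e.g., a binned root-pixel proportion) is our assumption, since the paper does not specify it:

```python
from sklearn.model_selection import StratifiedKFold, train_test_split

def split_8_1_1(paths, strata, seed=0):
    """Stratified 8:1:1 split of image paths into train/val/test lists."""
    train, rest, s_train, s_rest = train_test_split(
        paths, strata, test_size=0.2, stratify=strata, random_state=seed)
    val, test, _, _ = train_test_split(
        rest, s_rest, test_size=0.5, stratify=s_rest, random_state=seed)
    return train, val, test

def five_folds(train_val, strata, seed=0):
    """Stratified 5-fold CV indices over the combined train+val set."""
    skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=seed)
    return list(skf.split(train_val, strata))
```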
Data augmentation of cabbage seedling root images
Segmentation model structure
The semantic segmentation model for cabbage roots designed in this paper is the Swin-Unet++ model, with its architecture shown in Fig.3. The backbone of the model is based on the Swin Transformer module, with a maximum depth of 5. The linear embedding linearly encodes the vector corresponding to each pixel of the image into C dimensions. Each encoding layer consists of Patch Merging and two Swin Transformer Blocks, responsible for downsampling the image and increasing the number of channels. Each decoding layer consists of Patch Expanding and two Swin Transformer Blocks, responsible for upsampling the image and reducing the number of channels. The final Patch Expanding layer increases the size fourfold, restoring the image to its original dimensions. The Merging and Expanding operations replace the pooling and upsampling operations in CNNs, while Patch Partition uses the ViT [29] method for encoding, ensuring that the network preserves positional information. The network draws on the concept of residual connections, connecting adjacent feature layers to prevent loss of features. The network also performs pruning based on data characteristics, achieving feature fusion across different layers through the stacking of deep features at level L. Linear Projection employs a fully connected layer to transform the C-channel features into channels corresponding to the number of categories, with each layer containing pixel positions for the respective categories. The Skip-Connection adopts the idea of residual networks to transmit features directly between levels, so that fine-grained features from shallow layers can be combined directly with deep semantic features, enabling the recognition of finer structures in segmentation tasks. The combination of the Swin Transformer module and residual connections provides Swin-Unet++ with powerful feature extraction and multi-scale fusion capabilities. The hierarchical structure of the Swin Transformer allows the model to learn multi-scale features at different levels, while the residual connections ensure the effective integration of these features.
Swin-Unet++ model architecture diagram
Swin transformer block
The Swin Transformer Block [25] is based on the multi-head self-attention (MSA) mechanism. It reduces the computational complexity of global self-attention while maintaining precise modeling of local structural information. The block consists of two sub-modules, as shown in Fig.4, each containing a Layer Normalization (LN) layer and a Multi-Layer Perceptron (MLP). Block-1 incorporates the Window-based Multi-Head Self-Attention (W-MSA) module, which divides the image into multiple windows and computes self-attention within each window, thereby reducing computational cost. To allow information to pass between windows, Block-2 adopts the Shifted Window Multi-Head Self-Attention (SW-MSA) module: the window positions are shifted, and MSA is then performed within each new window. This enables the fusion of local and global information at different scales, allowing key features to be extracted even against complex backgrounds. The Swin Transformer has a strong capability for capturing global features and is well suited to high-resolution images, particularly excelling at handling noise and complex textures. The Swin Transformer blocks can be expressed as follows:
Swin transformer block
$$\hat{z}^{l}=\text{W-MSA}\left(\mathrm{LN}\left(z^{l-1}\right)\right)+z^{l-1}$$
(1)
$$\hat{z}^{l+1}=\text{SW-MSA}\left(\mathrm{LN}\left(z^{l}\right)\right)+z^{l}$$
(2)
The notation \(\hat{z}^{l}\) represents the output of the W-MSA module in the l-th block, and \(\hat{z}^{l+1}\) the output of the SW-MSA module in the (l+1)-th block; \(z^{l}\) denotes the block output after the subsequent LN and MLP layers. Self-attention is computed as follows:
$$\mathrm{Attention}(Q,K,V)=\mathrm{SoftMax}\left(\frac{QK^{T}}{\sqrt{d}}+B\right)V$$
(3)
Q, K, V ∈ \({R}^{{M}^{2}\times d}\) represent the Query, Key, and Value matrices, respectively, where M2 is the number of patches in a window and d is the dimension of the query or key. The values in B are taken from a learnable bias matrix \(\hat{B}\in{R}^{(2M-1)\times(2M-1)}\).
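Equation (3) is straightforward to express in code. Below is a minimal PyTorch sketch of windowed attention for a single head, with the relative position bias passed in as a precomputed M²×M² matrix (gathering the bias from the (2M−1)×(2M−1) table is omitted):

```python
import torch
import torch.nn.functional as F

def window_attention(q, k, v, rel_bias):
    """Eq. (3): softmax(Q K^T / sqrt(d) + B) V within each window.

    q, k, v:  (num_windows, M*M, d) tensors for one attention head.
    rel_bias: (M*M, M*M) relative position bias B.
    """
    d = q.size(-1)
    attn = q @ k.transpose(-2, -1) / d ** 0.5 + rel_bias
    return F.softmax(attn, dim=-1) @ v

# toy check: 4 windows of 7x7 patches, head dimension 32
q = k = v = torch.randn(4, 49, 32)
out = window_attention(q, k, v, torch.zeros(49, 49))
print(out.shape)  # torch.Size([4, 49, 32])
```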
Polynomial decay
For large models that require long training times, a fixed learning rate may lead to suboptimal performance in later stages. Polynomial decay adjusts how quickly the learning rate falls by modifying the decay exponent. In the initial phase of training, a larger learning rate helps to approach the global optimum quickly; in later stages, a smaller learning rate aids in fine-tuning the model parameters, preventing overshooting of the optimum and helping the model converge. The learning rate (lr) gradually decreases from the initial value (\({lr}_{initial}\)) to the minimum value (\({lr}_{min}\)). The polynomial decay formula is as follows:
$$lr=\left({lr}_{initial}-{lr}_{min}\right)\times\left(1-\frac{current\_iter}{max\_iters}\right)^{power}+{lr}_{min}$$
(4)
where max_iters is the total number of training steps, current_iter is the current step, and power is the decay exponent. Choosing an appropriate power is critical to the training process: larger values suit tasks that require slower decay, while smaller values can accelerate training and mitigate the risk of overfitting.
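Equation (4) can be implemented directly; a small sketch follows (the paper does not report its decay exponent, so power=0.9, a common choice for segmentation schedules, is assumed):

```python
def poly_lr(current_iter, max_iters, lr_initial=1e-2, lr_min=0.0, power=0.9):
    """Polynomial decay of Eq. (4); power=0.9 is an assumed default."""
    frac = 1.0 - current_iter / max_iters
    return (lr_initial - lr_min) * frac ** power + lr_min

# the learning rate shrinks smoothly from 1e-2 toward lr_min over 80,000 iters
print(poly_lr(0, 80_000), poly_lr(40_000, 80_000), poly_lr(79_999, 80_000))
```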
Evaluation metrics
Semantic segmentation evaluation
In semantic segmentation evaluation, determining whether each pixel is correctly classified is crucial. A commonly used tool is the confusion matrix, while the mean Intersection over Union (mIoU) measures the similarity between predicted results and ground truth, and later also between the predicted and annotated bounding boxes in root localization. pe denotes the probability of chance agreement, computed from the actual and predicted frequencies of each class. Accuracy, the Kappa coefficient, mIoU, and the Dice coefficient are used to assess the model's segmentation performance.
$$Acc=\frac{TP+TN}{TP+TN+FP+FN}\times 100\%$$
(5)
$$pe=\frac{\left(TP+FP\right)\times\left(TP+FN\right)+\left(TN+FN\right)\times\left(TN+FP\right)}{{\left(TP+TN+FP+FN\right)}^{2}}$$
(6)
$$Kappa=\frac{Acc-pe}{1-pe}$$
(7)
$${mIoU}=\frac{1}{N}\sum_{i=1}^{N}\frac{\left|{X}_{i}\cap {Y}_{i}\right|}{\left|{X}_{i}\cup {Y}_{i}\right|}$$
(8)
$$Dice=\frac{2\left|X\cap Y\right|}{\left|X\right|+\left|Y\right|}$$
(9)
Here X and Y denote the sets of pixels assigned to a given class in the predicted image and the ground truth image, respectively, and N is the number of classes.
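Equations (5)-(9) can all be read off a single confusion matrix. A NumPy sketch for the three classes used here (background, root, stem-leaf):

```python
import numpy as np

def segmentation_metrics(pred, gt, num_classes=3):
    """Acc, Kappa, mIoU, Dice from a confusion matrix (Eqs. 5-9),
    generalized to multiple classes; pred/gt are integer label arrays."""
    cm = np.bincount(num_classes * gt.ravel() + pred.ravel(),
                     minlength=num_classes ** 2).reshape(num_classes,
                                                         num_classes)
    total = cm.sum()
    acc = np.trace(cm) / total
    pe = (cm.sum(0) * cm.sum(1)).sum() / total ** 2   # chance agreement
    kappa = (acc - pe) / (1 - pe)
    inter = np.diag(cm)                                # per-class |X ∩ Y|
    union = cm.sum(0) + cm.sum(1) - inter              # per-class |X ∪ Y|
    miou = np.mean(inter / union)
    dice = np.mean(2 * inter / (cm.sum(0) + cm.sum(1)))
    return acc, kappa, miou, dice
```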
Phenotypic extraction evaluation metrics
The coefficient of determination (R²) and root mean square error (RMSE) are used to evaluate the accuracy of the root phenotypic parameters extracted from images against the actual values. RMSE measures the degree of difference between model predictions and actual observations. Using R² and RMSE together provides a comprehensive measure of model performance in the root phenotypic parameter extraction task.
$$RMSE=\sqrt{\frac{1}{n}\sum_{i=1}^{n}{\left({y}_{i}-{\hat{y}}_{i}\right)}^{2}}$$
(10)
$${R}^{2}=1-\frac{\sum_{i=1}^{n}{\left({y}_{i}-{\hat{y}}_{i}\right)}^{2}}{\sum_{i=1}^{n}{\left({y}_{i}-\bar{y}\right)}^{2}}$$
(11)
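A corresponding NumPy sketch of Eqs. (10)-(11) for comparing extracted parameters against manual measurements:

```python
import numpy as np

def rmse_r2(y_true, y_pred):
    """RMSE (Eq. 10) and R^2 (Eq. 11) for extracted vs. measured values."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    rmse = np.sqrt(np.mean((y_true - y_pred) ** 2))
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return rmse, 1.0 - ss_res / ss_tot
```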
Implementation details
The experimental environment for this study is Ubuntu 24.04 LTS. The hardware configuration includes an Intel(R) Xeon(R) Platinum 8260 CPU @ 2.40GHz, a single A100 GPU with 80GB of memory, and CUDA version 11.8. The deep learning framework is PyTorch, with all network models trained for 80,000 iterations. This study employs the Stochastic Gradient Descent (SGD) optimizer to update the model weights, with momentum set to 0.9 to accelerate convergence and avoid oscillations. SGD is a commonly used optimization algorithm that updates model parameters based on the gradient of the loss function, typically using a small batch of data at each iteration. The learning rate is set to 1×10⁻², a preliminary value determined through experimental tuning to balance training speed and convergence. To prevent overfitting, L2 regularization is employed to control model complexity, with the weight decay coefficient set to 4×10⁻⁵, ensuring good generalization on the test set. The batch size is set to 4; a smaller batch size helps reduce the variance of each parameter update, making training more stable.
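Put together, the reported configuration corresponds to an optimizer setup like the following sketch; the stand-in model and the training-loop body are placeholders, not the authors' code:

```python
import torch
import torch.nn as nn

def poly_lr(it, max_iters, lr0=1e-2, lr_min=0.0, power=0.9):
    # polynomial decay, Eq. (4); power=0.9 is an assumed value
    return (lr0 - lr_min) * (1 - it / max_iters) ** power + lr_min

model = nn.Conv2d(3, 3, 3)  # stand-in; substitute the Swin-Unet++ network
optimizer = torch.optim.SGD(model.parameters(),
                            lr=1e-2,            # initial learning rate
                            momentum=0.9,       # accelerates convergence
                            weight_decay=4e-5)  # L2 regularization

max_iters = 80_000
for it in range(max_iters):
    for g in optimizer.param_groups:
        g["lr"] = poly_lr(it, max_iters)
    # forward pass on a batch of 4 images, compute loss, loss.backward(),
    # optimizer.step(), optimizer.zero_grad()
```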
Experimental results and analysis
Model training
Figure5 shows how the loss, mIoU, and Acc on the training and validation sets change with the number of iterations during semantic segmentation training on cabbage seedling root images for the different architectures. At the early stage of training, the model parameters are not yet optimized, resulting in high loss and low mIoU, indicating insufficient segmentation accuracy. During the mid-stage, the loss function fluctuates within a certain range, reflecting the dynamic adjustment of parameters as the model seeks the global optimum. The mIoU first decreases and then increases, indicating that the model's preliminary understanding of the data is biased at this stage, leading to a temporary drop in performance. As training progresses, the model gradually corrects its parameters and performance improves; in particular, the prediction error for cabbage root features relative to the ground truth decreases over time. This process reflects the model moving from an unstable state towards convergence. At the final stage, when the loss no longer decreases significantly and the mIoU stabilizes at its highest value, the model is considered to have converged: its feature extraction and prediction capabilities reach a balance, and further training yields limited benefit. During training, the Upernet structure exhibited significant fluctuations in convergence, followed by FCN, while Unet and its variants converged relatively stably. No obvious overfitting occurred throughout training, indicating that the model possesses good generalization ability.
(a) Variation of mIoU with training iterations during different models training (b) Variation of Acc with training iterations during different models training (c) The training loss corresponding to the number of iterations (d) The validation loss corresponding to the number of iterations
Ablation experiments
To design a semantic segmentation architecture suited to root images of cabbage seedlings, this study focuses on the impact of different backbone networks as decoders on the model's feature extraction capability. Drawing inspiration from the TransUnet [19] architecture, the backbone adopts either a Swin Transformer or a CNN. Variations in the number of feature extraction layers are expected to affect the model's ability to capture multi-scale information, so depths of 4 and 5 layers are tested. The pruning (PR) operation averages the results of the L segmentation branches in the architecture diagram to obtain the final output (see the sketch below), reducing computational cost while maintaining the diversity and robustness of the multi-branch structure. To comprehensively evaluate these factors, ablation experiments analyze the effects of the backbone network, the number of feature extraction layers, and the Swin-Unet++ pruning operation on model performance.
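As described above, the pruning operation simply averages the outputs of the L segmentation branches; a one-function PyTorch sketch (branch_logits is assumed to be a list of (B, num_classes, H, W) tensors):

```python
import torch

def prune_average(branch_logits):
    """Pruning (PR): average the logits of the L segmentation branches
    to produce the final prediction map."""
    return torch.stack(branch_logits, dim=0).mean(dim=0)
```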
Table1 shows the segmentation accuracy of the different network structures. Compared to a CNN decoder, using the Swin Transformer Block as the decoder increased the accuracy of the 4- and 5-layer models by 6.08% and 5.74%, respectively. The Swin Transformer Block can capture fine root features, and with increasing depth, accuracy under the non-pruned configuration improved by 0.32%, attributable to the deeper model's ability to extract more detailed root features. Through its self-attention mechanism, the Swin Transformer Block also better captures spatial relationships and subtle differences, demonstrating stronger representational power. The pruning operation optimizes feature extraction at each layer by aggregating and averaging the deepest features of the network; with the Swin Transformer Block as decoder, it improved accuracy by 0.39% and 0.09% for the 4-layer and 5-layer models, respectively. With a CNN decoder, however, pruning not only failed to improve accuracy but decreased it, owing to the excessive feature gap between layers, which failed to compensate for the loss of contextual information.
As shown in Fig.6, in the segmentation map of cabbage seedlings, with the increase in the number of encoder and decoder layers, the Swin Transformer Block is able to extract increasingly finer features layer by layer. This is particularly evident in the complex structure of the cabbage seedling roots, where the classification of fine root pixels shows continuity, fully demonstrating the superior ability to capture fine-grained features. This deep-level feature extraction enables the thin and net-like structure of the roots to be clearly presented in the segmentation, addressing the issue of insufficient extraction of these subtle features in traditional methods.
Comparison of ablation experiments on cabbage seedling root segmentation performance
Comparison of high-precision semantic segmentation models
To verify the performance of the Swin-Unet++ network structure in the semantic segmentation of cabbage seedlings, this study conducted a comparative analysis against current high-precision semantic segmentation networks. Table2 presents the comparison of Swin-Unet++ with these models in terms of segmentation accuracy on cabbage seedling roots. Swin-Unet++ achieved the best performance across multiple metrics, including Accuracy (Acc), mIoU, Kappa, and Dice: Acc reached 98.19% and mIoU 86.69%, with 60M parameters and an inference time of 28.5ms. The Swin Transformer backbone outperformed the traditional CNN architecture. While the FCN model has the smallest FLOPs, its actual inference speed is not ideal, and its mIoU is 2.4% lower than the optimal model owing to its relatively simple structure and limited feature extraction and fusion capabilities. Unet++ has the fewest parameters, but its segmentation accuracy is still lower than Swin-Unet++, and the small parameter count may limit generalization and robustness. The Unet model has the fastest inference, but its mIoU is 1.08% lower than the optimal model, as its overly simple structure results in insufficient feature extraction. Compared to the Transformer-based Segformer, Swin-Unet++ achieves both higher accuracy and mIoU along with more efficient inference. The Swin Transformer captures multi-scale feature information through its local attention mechanism, while the residual connections aid multi-layer feature fusion, allowing more detailed features to be preserved in deeper networks.
To visually compare the segmentation performance of different network structures, this study conducted a visualization analysis using heatmaps. Figure7 shows the heatmap features of each high-precision semantic segmentation model in the cabbage seedling segmentation task. The feature visualizations of Segformer and Swin-Unet++ indicate that using a Transformer as the backbone is more effective in distinguishing fine root features. Swin-Unet++ exhibits a more balanced attention across the entire seedling, capturing detailed and rich global information. In contrast, the Unet and Unet++ models primarily focus on the pixels around the seedling roots, with Unet++ paying more attention to pixel features in areas of higher seedling density. The other networks show relatively dispersed attention on the cabbage seedling roots, failing to effectively capture critical details. This visualization analysis confirms that the Swin-Unet++ architecture surpasses current high-precision semantic segmentation models in accuracy and feature extraction capability.
Heatmaps of semantic segmentation of cabbage seedlings using different models
Extraction of root phenotypic parameters
Root localization methodology and performance comparison
Root localization in this study builds on the predictions of the segmentation model, using traditional computer-vision operations: the pixel contour of the root is extracted with OpenCV's findContours algorithm, and the minimum bounding rectangle around the extracted contour then gives the box enclosing the root. The localization performance of the different models is reported as mIoU values in Table3. Swin-Unet++ achieved the highest mIoU of 0.8109, while Segformer had the lowest at 0.7928, an improvement of 1.81%. Figure8 shows root localization based on the masks of the different segmentation models. Even under complex shadows, the segmentation maps can accurately locate the root position; however, for roots with lighter-colored tips, the models struggle to localize them accurately.
Root localization results based on the masks generated by different segmentation models
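The localization step can be reproduced with a few OpenCV calls; a sketch assuming the root class is labeled 1, as in the annotation scheme:

```python
import cv2
import numpy as np

def locate_root(mask):
    """Locate the root bounding box from a semantic mask.

    mask: 2D integer array from the segmentation model, where value 1
    marks root pixels. Returns (x, y, w, h) or None if no root is found.
    """
    root = (mask == 1).astype(np.uint8)
    contours, _ = cv2.findContours(root, cv2.RETR_EXTERNAL,
                                   cv2.CHAIN_APPROX_SIMPLE)
    if not contours:
        return None
    # merge all root contours, then take the minimum axis-aligned box
    pts = np.vstack([c.reshape(-1, 2) for c in contours])
    return cv2.boundingRect(pts)
```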
In this study, Swin-Unet++ was first employed as the segmentation network to perform semantic segmentation on the test set. Swin-Unet++ combines the powerful feature extraction of the Swin Transformer with the contextual linking advantages of residual connections, enabling precise segmentation of seedling roots. A schematic diagram of parameter extraction from the semantic segmentation results is shown in Fig.9, clearly illustrating the construction of the bounding box and the physical significance of each parameter. In the diagram, red pixels represent the root area, while green pixels indicate the background. The geometric parameters of the bounding box are then used for feature extraction: the height of the box represents the longest primary root length; the width of the box reflects the maximum root extension width; the number of root pixels within the box represents the total root area, i.e., the root coverage; and the root pixels along the upper boundary of the box are used to measure root thickness, corresponding to the average root diameter. The extraction of these parameters lays the foundation for subsequent quantitative analysis.
Phenotypic parameter extraction from cabbage root segmentation images
Cabbage seedling root phenotypic parameter evaluation
To validate the effectiveness of SwinUnet + + in extracting phenotypic features, we established a standardized image acquisition system. A fixed camera mounting platform was constructed at a height of 10cm above a white background board, ensuring consistent imaging conditions. The experimental setup is shown in Fig.1. A calibration ruler was placed in each image frame to establish a precise pixel-to-millimeter conversion ratio (C):
$$C=\frac{L\_actual}{L\_pixels}$$
(12)
where L_actual is the known length of the calibration ruler in millimeters, and L_pixels is its length in pixels.
After segmentation with Swin-Unet++, the phenotypic parameters were extracted using the following algorithms:
$$Length=\left(\max\left(y\_coordinates\right)-\min\left(y\_coordinates\right)\right)\times C$$
(13)
where y_coordinates represents the vertical positions of root pixels in the segmented image.
$$Extension=\left(\max\left(x\_coordinates\right)-\min\left(x\_coordinates\right)\right)\times C$$
(14)
where x_coordinates represents the horizontal positions of root pixels.
$$Area={N}_{pixels}\times {C}^{2}$$
(15)
where N_pixels is the total number of pixels classified as root tissue.
$$Thickness=\sum\left(inter\_pixel\_count\right)\times C$$
(16)
where inter_pixel_count represents the number of root pixels in each horizontal scan line near the root base.
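Combining Eqs. (12)-(16), the parameter extraction reduces to a few array operations on the segmented mask. In this sketch the number of scan lines used near the root base for thickness (base_rows) is our assumption, and the per-line pixel counts are averaged to report the mean diameter described above:

```python
import numpy as np

def root_phenotypes(mask, C, base_rows=5):
    """Derive root phenotypic parameters per Eqs. (12)-(16).

    mask: segmentation output where value 1 marks root pixels;
    C: mm-per-pixel conversion ratio from the calibration ruler (Eq. 12);
    base_rows: scan lines near the root base used for thickness (assumed).
    """
    ys, xs = np.nonzero(mask == 1)
    length = (ys.max() - ys.min()) * C            # Eq. (13)
    extension = (xs.max() - xs.min()) * C         # Eq. (14)
    area = len(ys) * C ** 2                       # Eq. (15)
    top = ys.min()                                # upper boundary of the box
    counts = [(ys == top + r).sum() for r in range(base_rows)]
    thickness = float(np.mean(counts)) * C        # Eq. (16), averaged
    return length, extension, area, thickness
```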
Building upon this, we compare the root length, extension width, surface area, and root thickness of the cabbage seedling roots obtained from the segmentation model against the actual measurements and conduct an error analysis. According to the data in Table3, Swin-Unet++ achieved the best performance on most metrics. The R² values for Length and Extension (> 0.94) indicate excellent consistency between manual and automated measurements of these parameters. The relatively lower R² value for Area (0.7406) suggests greater complexity in accurately measuring this parameter, while the R² value for root thickness reaches 0.8645, demonstrating a strong correlation. In Table3, “-” denotes cases where the predicted results deviated significantly from the actual measurements. The fitting curves between the phenotypic parameters extracted by the model and the actual values are shown in Fig.10. Swin-Unet++ demonstrated clear advantages in the segmentation of cabbage seedling roots, particularly in the prediction of key morphological traits.
Fitting of Swin-Unet++ Model Predicted Root Phenotypic Parameters to Actual Values: (a) Root Length, (b) Extension, (c) Root Thickness, (d) Root Area
Automated analysis of cabbage seedling root phenotypes
To achieve automated analysis of cabbage seedling root phenotypes, this study developed an automated analysis platform, as shown in Fig.11. The platform loads the pre-trained Swin-Unet++ model weights to perform automatic inference on input images. When a user uploads cabbage seedling images, the model performs image segmentation, accurately identifying the cabbage root regions. Based on the model predictions, the platform then automatically extracts and calculates key phenotypic parameters, including root length, root extension width, root area, and root thickness. This process enables efficient and accurate analysis of cabbage seedling root phenotypes, allowing users to quickly obtain essential data.
Automated analysis platform for cabbage seedling root phenotypes
Discussion
In existing work, traditional computer vision methods have been widely employed for root phenotype recognition. However, owing to lighting variations and occlusion, root images often exhibit discontinuities, complicating precise identification. Several studies have proposed root refinement algorithms to restore root pixels, but these typically address only the simple distinction between background and root, failing to precisely localize and identify roots within the entire plant [14]. Studies of cabbage root phenotypes have mainly examined mature roots [30]; because mature cabbage roots are relatively coarse, extracting them from images is less challenging than the more intricate task of extracting seedling roots. The Swin-Unet++ model demonstrates segmentation performance clearly superior to current state-of-the-art high-precision semantic segmentation networks. Testing on a cabbage seedling root image dataset shows that Swin-Unet++ achieves a mIoU of 86.69%, which is 2.61% and 2.57% higher than CCNet [31] and UPerNet [32], respectively. In determining the boundary mIoU score for roots, Swin-Unet++ surpasses Unet [9] by 0.39%. Compared to instance segmentation networks, these methods reduce the computational resources required for root feature recognition and localization, improving overall efficiency. Through its Swin Transformer Block, multi-scale feature extraction, and residual fusion, Swin-Unet++ effectively captures subtle variations in roots and delivers excellent segmentation on cabbage seedling root images. Additionally, data augmentation during training enhanced the diversity of the imaging conditions, preventing overfitting and improving the model's generalization capacity, which is especially evident in its robustness to complex conditions such as noise and non-uniform lighting. These advances provide a solid foundation for subsequent high-precision root phenotype extraction and contribute to a more accurate assessment of the robustness of cabbage seedlings.
However, despite the outstanding performance of Swin-Unet++ on high-resolution images, especially in capturing root details, the model has certain limitations. First, Swin-Unet++ was designed and trained on the morphology and image datasets of a specific plant's roots, so applying it to other plants may require adjusting its parameters or network structure. In addition, low-resolution images cause feature loss, especially of fine root structures, which in turn reduces segmentation accuracy; in the future, multimodal data fusion could compensate for this and improve segmentation of low-resolution images. Furthermore, although combing the roots flat during image capture effectively improves the accuracy of parameters such as root area, this operation increases the difficulty of data collection, especially for complex natural root morphologies. Future research could combine depth information or three-dimensional structural data from additional sensors to overcome the information loss inherent in relying solely on two-dimensional images. Moreover, image capture typically requires a fixed shooting height; for root images with significant scale variations, especially dense or morphologically complex root systems, this fixed-height approach may not adequately cover all root features, affecting segmentation accuracy. Future work could therefore explore more refined multi-scale feature fusion mechanisms to further improve the model's performance in scenarios with large scale variations or complex root morphologies.
Conclusion
In this study, we proposed an end-to-end semantic segmentation network, Swin-Unet++, for the automated analysis of root phenotypic information during the cabbage seedling stage. To refine the architecture, we conducted ablation experiments on model depth, decoder structure, and pruning operations, culminating in high-precision semantic segmentation of both the root system and stem-leaf regions. The model achieved an accuracy of 98.19%, a mIoU of 86.69%, a Kappa value of 90.37%, and a Dice coefficient of 92.38% in the root segmentation task, underscoring its superior performance. Using the segmented images to locate the root and determine growth-status phenotypic parameters, Swin-Unet++ performed best. To validate its practical effectiveness in parameter extraction, we calculated root length, root extension width, and root thickness, yielding R² values of 94.82%, 94.43%, and 86.45%, respectively. These findings corroborate the high accuracy of Swin-Unet++ in extracting root phenotypic parameters and address the current gap in phenotypic research, which predominantly focuses on above-ground traits, thereby contributing to a more comprehensive understanding of the entire growth cycle of cabbage. In future work, as the digitization and intelligence of agriculture deepen, Swin-Unet++ need not be limited to the automated parsing of root phenotype information, but can extend to the integration and intelligent management of multimodal data across the entire crop growth cycle. Swin-Unet++ can also serve as an agent tool for large models, further enhancing its adaptability and generalization across agricultural scenarios by combining multimodal data with multitask learning.
Data availability
No datasets were generated or analysed during the current study.
References
Li X, Wang Y, Cai C, Ji J, Han F, Zhang L et al. Large-scale gene expression alterations introduced by structural variation drive morphotype diversification in Brassica oleracea. Nat Genet. 2024;56.
Yang X, Hu Z, Liu Y, Xie X, Huang L, Zhang R, et al. Effect of pyrene-induced changes in root activity on growth of Chinese cabbage (Brassica campestris L.), and the health risks caused by pyrene in Chinese cabbage at different growth stages. Chem Biol Technol Agric. 2022;9:7.
Hazarika M, Saikia J, Phookan DB, Kumar P, Gujar D. Effect of different growing media on seedling quality and field performance of Cabbage (Brassica oleracea var. capitata L). Pharma Innov J. 2022;11:1493–7.
Xu H, Fu L, Li J, Lin X, Chen L, Zhong F, et al. A method for analyzing the phenotypes of Nonheading Chinese Cabbage leaves based on deep learning and OpenCV phenotype extraction. Agronomy. 2024;14:699.
Wu G, Liu Y, Hu B. Image-Based Measurement of Phenotypic Parameters of Chinese Cabbage. 2023 IEEE International Conference on Robotics and Biomimetics (ROBIO). IEEE; 2023. pp. 1–6.
Yu Q, Tang H, Zhu L, Zhang W, Liu L, Wang N. A method of cotton root segmentation based on edge devices. Front Plant Sci. 2023;14:1122833.
Lixuan S, Jia K, Nan W, Limin S. A new threshold segmentation method for cotton root images. J Hebei Univ (Natural Sci Edition). 2022;42:124.
Brainard SH, Bustamante JA, Dawson JC, Spalding EP, Goldman IL. A Digital Image-based phenotyping platform for analyzing Root shape attributes in Carrot. Front Plant Sci. 2021;12:690031.
Ronneberger O, Fischer P, Brox T. U-Net: Convolutional Networks for Biomedical Image Segmentation. arXiv; 2015.
Gong L, Du X, Zhu K, Lin C, Lin K, Wang T, et al. Pixel level segmentation of early-stage in-bag rice root for its architecture analysis. Comput Electron Agric. 2021;186:106197.
Li Y, Huang Y, Wang M, Zhao Y. An improved u-net-based in situ root system phenotype segmentation method for plants. Front Plant Sci. 2023;14:1115713.
Yu Q, Wang J, Tang H, Zhang J, Zhang W, Liu L, et al. Application of Improved UNet and EnglightenGAN for Segmentation and Reconstruction of in situ roots. Plant Phenomics. 2023;5:0066.
Chung YS, Lee U, Heo S, Silva RR, Na C-I, Kim Y. Image-based machine learning characterizes Root nodule in soybean exposed to Silicon. Front Plant Sci. 2020;11:520161.
Mingxuan Z, Wei L, Hui L, Ruinan Z, Yiming D. Anti-gravity stem-seeking restoration algorithm for maize seed root image phenotype detection. Comput Electron Agric. 2022;202:107337.
Zhao H, Shi J, Qi X, Wang X, Jia J. Pyramid scene parsing network. Proceedings of the IEEE conference on computer vision and pattern recognition. 2017. pp. 2881–90.
Chen L-C. Rethinking atrous convolution for semantic image segmentation. arXiv; 2017. Available from: https://doi.org/10.48550/arXiv.1706.05587
Paul A. Smart solutions for capsicum harvesting: unleashing the power of YOLO for Detection, Segmentation, growth stage classification, counting, and real-time mobile identification. Computers and Electronics in Agriculture; 2024.
Xie E, Wang W, Yu Z, Anandkumar A, Alvarez JM, Luo P. SegFormer: Simple and Efficient Design for Semantic Segmentation with Transformers. arXiv; 2021.
Chen J, Lu Y, Yu Q, Luo X, Adeli E, Wang Y et al. TransUNet: Transformers Make Strong Encoders for Medical Image Segmentation. arXiv; 2021.
Paul A, Machavaram R. Advanced segmentation models for automated capsicum peduncle detection in night-time greenhouse environments. Syst Sci Control Eng. 2024;12:2437162.
Zhou Z, Rahman Siddiquee MM, Tajbakhsh N, Liang J. UNet++: a nested U-Net architecture for medical image segmentation. In: Stoyanov D, Taylor Z, Carneiro G, Syeda-Mahmood T, Martel A, Maier-Hein L, et al., editors. Deep Learning in Medical Image Analysis and Multimodal Learning for Clinical Decision Support. Cham: Springer International Publishing; 2018. pp. 3–11.
Li B, Wu F, Liu S, Tang J, Li G, Zhong M, et al. CA-Unet++: an improved structure for medical CT scanning based on the unet + + Architecture. Int J Intell Syst. 2022;37:8814–32.
Lueling N, Reiser D, Straub J, Stana A, Griepentrog HW. Fruit volume and Leaf-Area Determination of Cabbage by a neural-network-based Instance Segmentation for different growth stages. Sensors. 2023;23:129.
Zhu Y, Wu H, Guo W, Wu X, et al. Identification method of kale leaf ball based on improved UperNet. Smart Agric. 2024;6:128–37.
Liu Z, Lin Y, Cao Y, Hu H, Wei Y, Zhang Z et al. Swin Transformer: Hierarchical Vision Transformer Using Shifted Windows. 2021. pp. 10012–22.
Liu Z, Hu H, Lin Y, Yao Z, Xie Z, Wei Y et al. Swin Transformer V2: Scaling Up Capacity and Resolution. 2022. pp. 12009–19.
HumanSignal/label-studio [Internet]. HumanSignal. 2025 [cited 2025 Jan 8]. Available from: https://github.com/HumanSignal/label-studio
Alomar K, Aysel HI, Cai X. Data augmentation in classification and segmentation: a survey and new strategies. J Imaging. 2023;9:46.
Han K, Wang Y, Chen H, Chen X, Guo J, Liu Z, et al. A Survey on Vision Transformer. IEEE Trans Pattern Anal Mach Intell. 2023;45:87–110.
Qiu F, Shao C, Zhou C, Yao L. A method for cabbage root posture recognition based on YOLOv5s. Heliyon. 2024;10:e31868.
Feng S, Zhuo Z, Pan D, Tian Q. CcNet: a cross-connected convolutional network for segmenting retinal vessels using multi-scale features. Neurocomputing. 2020;392:268–76.
Xiao T, Liu Y, Zhou B, Jiang Y, Sun J. Unified perceptual parsing for scene understanding. Proceedings of the European conference on computer vision (ECCV). 2018. pp. 418–34.
Funding
This work was supported by the National Key R&D projects—Development of a Fully Intelligent Cloud Platform for Open-Field Vegetable Management and Unmanned Production Validation (Project No. 2023YFD2001205).
Author information
Authors and Affiliations
College of Information and Electrical Engineering, China Agricultural University, Beijing, 100083, China
Hongda Li, Yue Zhao & Zeyang Bi
National Engineering Research Center for Information Technology in Agriculture, Beijing, 100101, China
Hongda Li, Peng Hao, Huarui Wu & Chunjiang Zhao
Information Technology Research Center, Beijing Academy of Agriculture and Forestry Sciences, Beijing, 100097, China
Peng Hao, Huarui Wu & Chunjiang Zhao
Contributions
HD.L: conception and design of the work; Y.Z and ZY.B: data acquisition and analysis; P.H: interpretation of data; HD.L: creation of the new software used in the work; HR.W and CJ.Z: drafting and substantive revision of the manuscript. All authors have read and approved the final manuscript.
Corresponding authors
Correspondence to Huarui Wu or Chunjiang Zhao.
Ethics declarations
Ethics approval and consent to participate
All the authors agreed to publish this manuscript.
Consent for publication
Consent and approval for publication were obtained from all the authors.
Competing interests
The authors declare no competing interests.
Additional information
Publisher’s note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/.
About this article
Cite this article
Li, H., Zhao, Y., Bi, Z. et al. Swin-Unet++: a study on phenotypic parameter analysis of cabbage seedling roots. Plant Methods 21, 30 (2025). https://doi.org/10.1186/s13007-025-01340-5
Keywords
- Cabbage
- Root phenotype
- Attention mechanism
- Semantic segmentation
- Unet
- Residual networks