SEMU-Net
Odile Liboiron-Ladouceur1†
[email protected]
1 Department of Electrical and Computer Engineering, McGill University, Montreal, QC, Canada
Abstract

Integrated silicon photonic devices, which manipulate light to transmit and process information on a silicon-on-insulator chip, are highly sensitive to structural variations. Minor deviations during nanofabrication (the precise process of building structures at the nanometer scale), such as over- or under-etching, corner rounding, and unintended defects, can significantly impact performance. To address these challenges, we introduce SEMU-Net, a comprehensive set of methods that automatically segments scanning electron microscope (SEM) images and uses them to train two deep neural network models based on U-Net and its variants. The predictor model anticipates fabrication-induced variations, while the corrector model adjusts the design to address these issues, ensuring that the final fabricated structures closely align with the intended specifications. Experimental results show that the segmentation U-Net reaches an average IoU score of 99.30%, while the corrector attention U-Net in a tandem architecture achieves an average IoU score of 98.67%.

1. Introduction

Integrated silicon photonic devices operate using light (photons) [8] rather than electrons to perform various functions through a silicon waveguide on a silicon-on-insulator (SOI) die. This approach offers significant advantages over traditional electronic devices [2]. Photonic devices can achieve lower latency, reduced heat generation, and minimal energy loss, making them ideal for high-bandwidth data transmission and processing tasks [11, 17]. This efficiency is crucial given the growing computational demands of AI and machine learning systems [23], which often require significant resources for managing and processing complex models [4, 6].

Integrated silicon photonic devices often underperform experimentally due to fabrication-induced structural variations [20]. Nanometer-scale deviations, such as over-etching, under-etching, and corner rounding, can significantly degrade device performance, efficiency, and functionality [11]. Research indicates that these variations tend to follow consistent patterns, which may allow them to be learned and mitigated [11].

With advancements in AI, and particularly convolutional neural networks (CNNs) [19], researchers have achieved remarkable improvements in computer vision tasks, including image [9, 25] and video recognition [30]. These architectures have enabled the development of models that automatically learn and extract features from visual data, enhancing accuracy, efficiency, and robustness in tasks such as object detection [24] and semantic segmentation [21].

In this paper, we introduce SEMU-Net (see Figure 1), a comprehensive set of methods designed to automate the segmentation of SEM images of integrated silicon photonic devices into binary classifications of silicon (the core waveguide confining light) and silica (the cladding material around the core). These segmented images, along with their corresponding design files in the Graphic Data System (GDS) format (a standard file format used in the photonics industry for representing the physical layout of devices), are then used to train two CNN models: the predictor and the corrector.

The segmentation model is designed to automatically segment SEM images of photonic devices. It takes SEM images as input and uses manually segmented images as ground-truth labels.
Figure 1. Overview of the SEMU-Net framework for improving the fabrication of integrated silicon photonic devices. (i) The segmentation
model converts SEM images into segmented SEM images. (ii) The corrector model is trained on the binarized SEM images together with their corresponding GDS design files, generating corrected GDS layouts that compensate for fabrication-induced variations. (iii)
The predictor model uses the corrected GDS files to predict the final fabricated structures, enabling pre-fabrication validation and further
refinement.
[Figure 3 diagram: symmetric encoder-decoder pyramids with filter widths 64, 128, 256, 512, and a 1024-channel bottleneck; attention gates (AG) appear on the decoder path of the corrector model.]
Figure 3. Overview of the segmentation, predictor, and corrector models, all based on the U-Net architecture. The corrector model
distinguishes itself by incorporating attention gates (labeled AG) in the decoder path, whereas the segmentation and predictor models do
not utilize attention gates. The segmentation model takes SEM images as input and generates segmented labels, while the corrector model
takes GDS designs as input and outputs corrected designs.
stacked together. The tandem corrector refines the device structure, generating a corrected design file, while the tandem predictor evaluates this corrected file, simulating its post-fabrication appearance. A loss function is then calculated by comparing the predicted output to the original SEM image, with the goal of minimizing this loss to ensure the corrected output closely matches the SEM image. The tandem predictor model's weights are frozen to maintain consistency, while the corrector model's weights are updated at each iteration to improve the correction process.
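To make this weight-freezing scheme concrete, the sketch below shows one corrector update in PyTorch. It is a minimal illustration under our own assumptions: the stand-in networks, optimizer settings, and per-step BCE loss are placeholders, not the authors' implementation.

```python
import torch
import torch.nn as nn

# Stand-ins for the two U-Net-style networks (illustrative only).
predictor = nn.Sequential(nn.Conv2d(1, 1, 3, padding=1), nn.Sigmoid())
corrector = nn.Sequential(nn.Conv2d(1, 1, 3, padding=1), nn.Sigmoid())

# Freeze the tandem predictor; only the corrector is updated.
for p in predictor.parameters():
    p.requires_grad = False
predictor.eval()

optimizer = torch.optim.AdamW(corrector.parameters(), lr=4e-4)
loss_fn = nn.BCELoss()

def tandem_step(design, target):
    """One corrector update: correct the design, simulate its fabrication
    with the frozen predictor, and penalize deviation from the target."""
    optimizer.zero_grad()
    corrected = corrector(design)       # corrected layout
    predicted = predictor(corrected)    # simulated post-fabrication shape
    loss = loss_fn(predicted, target)
    loss.backward()                     # gradients pass through the frozen
    optimizer.step()                    # predictor into the corrector only
    return loss.item()

design = torch.rand(1, 1, 64, 64)       # dummy binary design tile
print(tandem_step(design, (design > 0.5).float()))
```

Because the frozen predictor still propagates gradients to its input, the corrector learns to pre-distort the design so that the predicted fabrication outcome matches the intended structure.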
Early Stopping: Early stopping is a technique used during model training to prevent overfitting and improve efficiency [3]. We use this technique by monitoring the validation loss and halting training when no significant improvement is observed after a certain number of iterations.
Learning Rate Scheduler: The learning rate scheduler is a method used to adjust the learning rate during training to improve model convergence [7]. By gradually reducing the learning rate over time, the model can make finer adjustments as it approaches an optimal solution, helping to avoid overshooting minima in the loss function. We employ this technique to enhance training efficiency.
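The two mechanisms combine naturally in one loop. The sketch below uses PyTorch's ReduceLROnPlateau as the scheduler and a patience counter for early stopping; only the 3-epoch patience mirrors the configuration reported below, while the stand-in model, the stubbed train/validate helper, and the scheduler settings are assumptions.

```python
import torch

model = torch.nn.Linear(10, 1)
optimizer = torch.optim.AdamW(model.parameters(), lr=4e-4)
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(
    optimizer, mode="min", factor=0.5, patience=2)

def train_one_epoch_and_validate(model, optimizer):
    # Stand-in for a real train/validation pass; returns a fake loss here.
    return torch.rand(1).item()

best_val, stale_epochs, patience = float("inf"), 0, 3
for epoch in range(60):
    val_loss = train_one_epoch_and_validate(model, optimizer)
    scheduler.step(val_loss)           # reduce the LR when validation stalls
    if val_loss < best_val - 1e-5:     # meaningful improvement
        best_val, stale_epochs = val_loss, 0
        torch.save(model.state_dict(), "best.pt")
    else:
        stale_epochs += 1
        if stale_epochs >= patience:   # early stop after 3 stale epochs
            break
```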
Combined Loss Function: The combination of BCE and 0.5 Dice loss is used as a composite loss function to improve the correction performance [13]. BCE focuses on pixel-wise classification, penalizing incorrect predictions for each pixel, making it effective for distinguishing between foreground and background. Dice loss, on the other hand, measures the overlap between the predicted and ground-truth masks, emphasizing overall shape accuracy. By combining BCE with 0.5 Dice loss, the model benefits from both fine-grained pixel accuracy and robust shape matching, leading to more precise segmentation results.
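The text specifies the composite loss only as BCE plus 0.5 times the Dice loss; the following is a minimal sketch consistent with that description (the smoothing constant eps and the batch-mean reduction are our choices):

```python
import torch
import torch.nn.functional as F

def dice_loss(pred, target, eps=1e-6):
    """Soft Dice loss, 1 - 2|P∩T|/(|P|+|T|): penalizes poor shape overlap."""
    pred, target = pred.flatten(1), target.flatten(1)
    inter = (pred * target).sum(dim=1)
    total = pred.sum(dim=1) + target.sum(dim=1)
    return (1 - (2 * inter + eps) / (total + eps)).mean()

def combined_loss(pred, target):
    """BCE for per-pixel accuracy plus 0.5 x Dice for shape fidelity."""
    return F.binary_cross_entropy(pred, target) + 0.5 * dice_loss(pred, target)

pred = torch.rand(2, 1, 32, 32)                   # probability maps
mask = (torch.rand(2, 1, 32, 32) > 0.5).float()   # binary ground truth
print(combined_loss(pred, mask).item())
```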
Hyperparameter Tuning: Hyperparameter tuning is crucial in optimizing model performance; it involves adjusting the parameters that control the learning process, such as the learning rate, batch size, and network architecture [27]. In our approach, we conduct a thorough search to optimize the hyperparameter set for all three models in our study. This process ensures that each model is configured with proper parameters to enhance overall performance and accuracy.

4. Experiments

In this section, we evaluate each component of our proposed SEMU-Net framework. We present quantitative results for both the segmentation and correction models, comparing their performance using IoU as the metric. To rigorously assess the corrector model, we develop a custom benchmark comprising several hundred structure images featuring various shapes, including stars, gratings, circles, holes, and more (see Figure 4).
4.1. Experiment Setup

Dataset: For our dataset, we rely on Applied Nanotools Inc., a reputable integrated photonics foundry, for nanofabrication and SEM imaging to acquire high-quality data. Due to the high costs and lengthy timelines associated with nanofabrication, our training structures are designed to yield a large, varied dataset with minimal chip space and imaging time.

The generated patterns are fabricated on a 220 nm thick silicon-on-insulator (SOI) platform using electron-beam lithography through a silicon photonic multi-project wafer service provided by Applied Nanotools Inc. After lithography and etching, SEM images with a resolution of 1 nm/pixel are taken of each pattern. The GDS and SEM images are then segmented with the help of our segmentation model, cropped, aligned, and prepared for training of the predictor and corrector models.

Finally, we augment the binarized SEM and GDS images to artificially increase the dataset size and use these augmented images to train our predictor and corrector models.
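The specific augmentations are not enumerated in the text. The sketch below shows one label-preserving scheme we might assume for paired design/SEM tiles: the same random flip or 90-degree rotation is applied to both images so their geometric correspondence is preserved.

```python
import numpy as np

def augment_pair(gds, sem, rng=np.random.default_rng()):
    """Apply the same random flip/90-degree rotation to a (GDS, SEM) pair,
    preserving the alignment between design and fabricated image."""
    k = int(rng.integers(4))                 # rotate by 0/90/180/270 degrees
    gds, sem = np.rot90(gds, k), np.rot90(sem, k)
    if rng.random() < 0.5:                   # random horizontal flip
        gds, sem = np.fliplr(gds), np.fliplr(sem)
    if rng.random() < 0.5:                   # random vertical flip
        gds, sem = np.flipud(gds), np.flipud(sem)
    return gds.copy(), sem.copy()

gds = np.zeros((128, 128), dtype=np.uint8); gds[32:96, 32:96] = 1
sem = gds.copy()                             # dummy aligned pair
aug_gds, aug_sem = augment_pair(gds, sem)
print(aug_gds.shape, (aug_gds == aug_sem).all())
```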
Computational Resources: The training and testing of our models are conducted on a single GeForce RTX 4090 GPU, which provides the computational power needed to handle the large datasets and operations involved in our SEMU-Net framework efficiently. This GPU allows for faster training times and enables us to experiment with various model configurations, ensuring robust and accurate results.
Evaluation Metrics: To evaluate the performance of our models, we use IoU as the primary metric. IoU is widely used in segmentation tasks and measures the overlap between the predicted segmentation and the ground truth. In our experiments, we calculate the IoU over per-pixel classifications, determining whether each pixel has been correctly categorized as silicon or silicon dioxide. A higher IoU score indicates better model performance, with an IoU of 1.0 representing perfect segmentation.
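A reference implementation of this per-pixel IoU on binary masks could look as follows (standard formulation; the convention of scoring two empty masks as 1.0 is our choice):

```python
import numpy as np

def iou(pred, truth):
    """IoU of two binary masks (silicon = 1, silica = 0):
    |intersection| / |union|; 1.0 means perfect agreement."""
    pred, truth = pred.astype(bool), truth.astype(bool)
    union = np.logical_or(pred, truth).sum()
    if union == 0:                        # both masks empty: count as perfect
        return 1.0
    return float(np.logical_and(pred, truth).sum() / union)

truth = np.zeros((64, 64), dtype=np.uint8); truth[16:48, 16:48] = 1
pred = np.zeros_like(truth); pred[18:48, 16:48] = 1   # slightly eroded edge
print(f"IoU = {iou(pred, truth):.4f}")
```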
Training Configurations: The segmentation model is trained on image slices of size 256×256 for 50 epochs with a batch size of 32 to ensure robust evaluation; it uses BCE as the loss function and a learning rate of 0.0001. The predictor and corrector models, in contrast, are trained on larger 2048×2048 slices for 20 epochs, with early stopping after 3 epochs if no improvement in validation loss is observed. A batch size of 2 and a validation split of 20% are used, together with a learning rate of 0.0004 and the AdamW optimizer to ensure efficient training. Data augmentation is applied to enhance dataset diversity, and these models are trained with the combination of BCE and 0.5 Dice loss. For the tandem corrector model, the original training configuration is maintained, but the number of epochs is extended from 20 to 60 for more comprehensive learning.
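For convenience, the reported settings can be collected in one place. The snippet below only restates the values given above; the dictionary layout and key names are ours, not taken from the authors' code.

```python
# Reported training settings, gathered for reference (structure is ours).
CONFIGS = {
    "segmentation": {
        "slice_size": (256, 256), "epochs": 50, "batch_size": 32,
        "loss": "BCE", "learning_rate": 1e-4,
    },
    "predictor_and_corrector": {
        "slice_size": (2048, 2048), "epochs": 20, "batch_size": 2,
        "early_stopping_patience": 3, "validation_split": 0.2,
        "loss": "BCE + 0.5 * Dice", "learning_rate": 4e-4,
        "optimizer": "AdamW", "augmentation": True,
    },
    "tandem_corrector": {
        "epochs": 60,   # otherwise identical, trained three times longer
    },
}
```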
4.2. The Segmentation Model

In this section, we compare the performance of our segmentation U-Net model with two baseline models: the segment anything model (SAM) [14] fine-tuned on our dataset, and a traditional threshold-based model. The goal is to evaluate the effectiveness of each model in segmenting SEM images of photonic devices.
Baseline Models: (1) The attention U-Net extends the traditional U-Net architecture by incorporating attention mechanisms into the skip connections between the encoder and decoder. These attention gates focus on relevant features from the encoder while suppressing irrelevant information. The attention U-Net improves segmentation accuracy by dynamically highlighting important regions in the feature maps, which helps in handling more complex and varied structures. (2) SAM is a large-scale image segmentation model developed by Meta; it is based on the vision transformer (ViT) architecture and pre-trained on a vast dataset. For our experiments, we chose the heavy model ViT-H, which is then fine-tuned on our dataset to adapt its general segmentation capabilities to the specific characteristics of SEM images. This fine-tuning involves updating the model parameters to improve performance on our particular task. (3) The traditional threshold-based model is implemented following the workflow shown in Figure 5. This simple method is used as a baseline to evaluate the improvements achieved by more advanced models like U-Net and SAM. Despite its simplicity, this model provides a useful comparison point for assessing the effectiveness of more sophisticated approaches.
Results: The results of the comparison between the U-Net, SAM, and the traditional threshold-based model are summarized in Table 1, which reports the segmentation performance of each model through its average, maximum, and minimum IoU. The U-Net model outperforms all the others, achieving the highest IoU score; the threshold-based method comes second, followed by SAM. Two example SEM images are used to visualize the segmentation results of the different models in Figure 6.

Despite its state-of-the-art status and extensive pre-training on a large dataset, SAM's performance in this context is relatively low. SAM struggles to achieve optimal results due to its unfamiliarity with the specific segmentation task of SEM images. The model's generalization capabilities are limited when applied to a highly specialized dataset, indicating that fine-tuning SAM alone is insufficient for achieving top performance in this domain.
The U-Net’s superior performance can be attributed to
its straightforward architecture and effectiveness in han- These results highlight the effectiveness of the tandem
dling complex segmentation tasks. Its symmetric encoder- architecture in improving the model’s accuracy and consis-
decoder structure with skip connections allows it to capture tency across different configurations. Figure 4 illustrates
both high-level contextual information and fine-grained de- the performance comparison between the attention U-Net in
tails. This high accuracy demonstrates the model’s robust- tandem configuration (representing our best performance)
ness and suitability for SEM image segmentation, where and the original U-Net across three sample images from our
precise delineation of features is crucial. custom dataset.
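To illustrate the memory trade-off noted in Section 4.3 (reduced filter sizes so that 2048×2048 slices fit in GPU memory), the schematic U-Net below exposes the channel width as a single parameter. It is not the architecture of Figure 3, only a sketch showing how halving the base width roughly quarters the parameter count.

```python
import torch
import torch.nn as nn

def block(c_in, c_out):
    # Two 3x3 conv + ReLU layers, the standard U-Net building block.
    return nn.Sequential(
        nn.Conv2d(c_in, c_out, 3, padding=1), nn.ReLU(inplace=True),
        nn.Conv2d(c_out, c_out, 3, padding=1), nn.ReLU(inplace=True))

class TinyUNet(nn.Module):
    """Schematic 3-level U-Net whose width is set by `base`."""
    def __init__(self, base=16):
        super().__init__()
        self.enc1, self.enc2 = block(1, base), block(base, 2 * base)
        self.bottleneck = block(2 * base, 4 * base)
        self.up2 = nn.ConvTranspose2d(4 * base, 2 * base, 2, stride=2)
        self.dec2 = block(4 * base, 2 * base)
        self.up1 = nn.ConvTranspose2d(2 * base, base, 2, stride=2)
        self.dec1 = block(2 * base, base)
        self.head = nn.Conv2d(base, 1, 1)
        self.pool = nn.MaxPool2d(2)

    def forward(self, x):
        e1 = self.enc1(x)
        e2 = self.enc2(self.pool(e1))
        b = self.bottleneck(self.pool(e2))
        d2 = self.dec2(torch.cat([self.up2(b), e2], dim=1))
        d1 = self.dec1(torch.cat([self.up1(d2), e1], dim=1))
        return torch.sigmoid(self.head(d1))

for base in (16, 32):
    n = sum(p.numel() for p in TinyUNet(base).parameters())
    print(f"base={base}: {n / 1e6:.2f}M parameters")
```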
[Figure 5 diagram: pre-processing, Canny/Otsu contour detection, post-processing (pad, crop, dilate).]
Figure 5. Workflow of the threshold-based segmentation model. A combination of Canny and Otsu’s methods is used to detect the contours
and further segment the pre-processed SEM images.
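The figure names only the pipeline stages, so the following OpenCV sketch is one plausible realization of that workflow; the blur kernel, the Otsu-derived Canny thresholds, and the contour-filling step are our assumptions rather than the authors' parameters.

```python
import cv2
import numpy as np

def threshold_segment(sem_gray):
    """Plausible Figure 5 pipeline: denoise, Otsu threshold, Canny edges to
    recover boundaries, then dilate and fill contours to produce a binary
    silicon/silica mask. All parameters are illustrative."""
    blur = cv2.GaussianBlur(sem_gray, (5, 5), 0)              # pre-processing
    otsu_t, mask = cv2.threshold(blur, 0, 255,
                                 cv2.THRESH_BINARY + cv2.THRESH_OTSU)
    edges = cv2.Canny(blur, 0.5 * otsu_t, otsu_t)             # Otsu-derived limits
    edges = cv2.dilate(edges, np.ones((3, 3), np.uint8))      # close small gaps
    contours, _ = cv2.findContours(edges, cv2.RETR_EXTERNAL,
                                   cv2.CHAIN_APPROX_SIMPLE)
    filled = mask.copy()
    cv2.drawContours(filled, contours, -1, 255, thickness=cv2.FILLED)
    return (filled > 0).astype(np.uint8)                      # post-processing

sem = (np.random.rand(256, 256) * 255).astype(np.uint8)       # dummy SEM tile
print(threshold_segment(sem).mean())
```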
Model # Parameters (M) Average IoU (%) Max IoU (%) Min IoU (%)
U-Net 7.94 99.30 99.71 98.35
SAM (ViT-H) 636 88.35 96.90 43.48
Threshold Model NA 96.54 98.59 86.16
Table 1. Comparison of the segmentation model performance on the custom benchmark. To ensure consistency, each experiment is run
five times due to random dataset ordering and weight initializations, and the median result is taken.
Model # Parameters (M) Average IoU (%) Min IoU (%) Median IoU (%)
U-Net 1.94 92.38 6.12 97.01
Attention U-Net 2.01 93.55 72.15 97.62
Residual Attention U-Net 2.64 93.64 46.73 96.42
U-Net++ 0.56 92.63 66.64 95.19
U-Net (tandem) 3.88 97.87 71.39 98.90
Attention U-Net (tandem) 4.02 98.67 88.86 99.51
Residual Attention U-Net (tandem) 5.28 96.74 65.91 98.64
U-Net++ (tandem) 1.13 93.98 58.55 96.58
Table 2. Comparison of the corrector model performance on the custom benchmark. To ensure consistency, each experiment is run five
times due to random dataset ordering and weight initializations, and the median result is taken.
Figure 6. Performance comparison between the U-Net, SAM, and threshold models on two sample SEM images. Each image is processed through the three models, and the segmentation results are compared with the corresponding ground truth.
[8] Po Dong, Young-Kai Chen, Guang-Hua Duan, and David Neilson. Silicon photonic devices and integrated circuits. Nanophotonics, 3, 2014.
[9] Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, and Neil Houlsby. An image is worth 16x16 words: Transformers for image recognition at scale, 2021.
[10] Leon A. Gatys, Alexander S. Ecker, and Matthias Bethge. Image style transfer using convolutional neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2414–2423, 2016.
[11] Dusan Gostimirovic, Dan-Xia Xu, Odile Liboiron-Ladouceur, and Yuri Grinberg. Deep learning-based prediction of fabrication-process-induced structural variations in nanophotonic devices. ACS Photonics, 9(8):2623–2633, 2022.
[12] Phillip Isola, Jun-Yan Zhu, Tinghui Zhou, and Alexei A. Efros. Image-to-image translation with conditional adversarial networks, 2018.
[13] Shruti Jadon. A survey of loss functions for semantic segmentation. In 2020 IEEE Conference on Computational Intelligence in Bioinformatics and Computational Biology (CIBCB), pages 1–7. IEEE, 2020.
[14] Alexander Kirillov, Eric Mintun, Nikhila Ravi, Hanzi Mao, Chloe Rolland, Laura Gustafson, Tete Xiao, Spencer Whitehead, Alexander C. Berg, Wan-Yen Lo, Piotr Dollár, and Ross Girshick. Segment anything, 2023.
[15] Zeqin Lu, Jaspreet Jhoja, Jackson Klein, Xu Wang, Amy Liu, Jonas Flueckiger, James Pond, and Lukas Chrostowski. Performance prediction for silicon photonics integrated circuits with layout-dependent correlated manufacturing variability. Optics Express, 25(9):9712–9733, 2017.
[16] Zhen-Liang Ni, Gui-Bin Bian, Xiao-Hu Zhou, Zeng-Guang Hou, Xiao-Liang Xie, Chen Wang, Yan-Jie Zhou, Rui-Qi Li, and Zhen Li. RAUNet: Residual attention U-Net for semantic segmentation of cataract surgical instruments, 2019.
[17] Shupeng Ning, Hanqing Zhu, Chenghao Feng, Jiaqi Gu, Zhixing Jiang, Zhoufeng Ying, Jason Midkiff, Sourabh Jain, May H. Hlaing, David Z. Pan, and Ray T. Chen. Photonic-electronic integrated circuits for high-performance computing and AI accelerators, 2024.
[18] Ozan Oktay, Jo Schlemper, Loic Le Folgoc, Matthew Lee, Mattias Heinrich, Kazunari Misawa, Kensaku Mori, Steven McDonagh, Nils Y. Hammerla, Bernhard Kainz, Ben Glocker, and Daniel Rueckert. Attention U-Net: Learning where to look for the pancreas, 2018.
[19] Keiron O'Shea and Ryan Nash. An introduction to convolutional neural networks, 2015.
[20] Alexander Y. Piggott, Eric Y. Ma, Logan Su, Geun Ho Ahn, Neil V. Sapra, Dries Vercruysse, Andrew M. Netherton, Akhilesh S. P. Khope, John E. Bowers, and Jelena Vučković. Inverse-designed photonics for semiconductor foundries. ACS Photonics, 7(3):569–575, 2020.
[21] Olaf Ronneberger, Philipp Fischer, and Thomas Brox. U-Net: Convolutional networks for biomedical image segmentation, 2015.
[22] Martin F. Schubert, Alfred K. C. Cheung, Ian A. D. Williamson, Aleksandra Spyra, and David H. Alexander. Inverse design of photonic devices with strict foundry fabrication constraints. ACS Photonics, 9(7):2327–2336, 2022.
[23] Shawn Yohanes Siew, Bo Li, Feng Gao, H. Y. Zheng, Weipeng Zhang, Pengfei Guo, S. W. Xie, A. Song, Bo Dong, L. W. Luo, Chao Li, Xianshu Luo, and GuoQiang Lo. Review of silicon photonics technology and platform development. Journal of Lightwave Technology, 39(13):4374–4389, 2021.
[24] Farhana Sultana, Abu Sufian, and Paramartha Dutta. A Review of Object Detection Models Based on Convolutional Neural Network, pages 1–16. Springer Singapore, 2020.
[25] Ren Wu, Shengen Yan, Yi Shan, Qingqing Dang, and Gang Sun. Deep image: Scaling up image recognition, 2015.
[26] Fisher Yu and Vladlen Koltun. Multi-scale context aggregation by dilated convolutions, 2016.
[27] Tong Yu and Hong Zhu. Hyper-parameter optimization: A review of algorithms and applications, 2020.
[28] Zongwei Zhou, Md Mahfuzur Rahman Siddiquee, Nima Tajbakhsh, and Jianming Liang. UNet++: A nested U-Net architecture for medical image segmentation, 2018.
[29] Jun-Yan Zhu, Taesung Park, Phillip Isola, and Alexei A. Efros. Unpaired image-to-image translation using cycle-consistent adversarial networks, 2020.
[30] Yi Zhu, Xinyu Li, Chunhui Liu, Mohammadreza Zolfaghari, Yuanjun Xiong, Chongruo Wu, Zhi Zhang, Joseph Tighe, R. Manmatha, and Mu Li. A comprehensive study of deep video action recognition, 2020.