July 29, 2022

Using Neural Architecture Search to Achieve Panoptic Segmentation in a Mobility Environment

Note: Woven Planet became Woven by Toyota on April 1, 2023.

By Koichiro Yamaguchi, Yuki Kawana, Takaaki Tagawa and Yusuke Yachide, Staff Engineer, Engineer, Senior Engineer and Senior Manager


The future of mobility, in which vehicles of all types can be safe, efficient and as highly automated as the uses may require, is highly dependent on cutting-edge software.

Arene, a software platform that Woven Planet has been developing, aims to enable true state-of-the-art mobility programming as the basis for next-generation vehicles. With such innovative technology, we are all dedicated to realizing Woven Planet’s vision of “Mobility to Love, Safety to Live” and providing mobility solutions that benefit all people worldwide.

Because automated driving will be a key part of mobility’s future, the Arene AI team at Woven Planet has been developing a neural architecture search (NAS) framework that automatically searches for deep neural network (DNN) architectures that meet the computational resource constraints of the target hardware.

Neural network (NN) architectures, which will enable autonomous vehicles to make sense of the complex environments in which they operate, are key to mobility’s future. NAS uses machine learning to discover and design NN architectures automatically, which speeds up development compared with painstakingly creating and testing each candidate by hand.

In this article, we will present the results of our study on NAS for achieving an NN architecture optimized for a form of computer vision known as panoptic segmentation for use in road environments.

Recently, machine learning (ML) models based on DNNs have come to play an important role in recognizing environments. Building an automated driving system requires solving various types of tasks, including object detection for vehicles and pedestrians, detection of road regions and lanes, depth estimation, and traffic light recognition.

A typical solution for diverse tasks is to develop a DNN model specialized for each task and to run all DNN models independently, as shown on the left side of Figure 1 below. With independent, task-specific DNN models, it is possible to optimize the DNN architecture for each task and to simplify model training.

But there are constraints to this approach, which requires running inference on many independent NN models in real time in the in-vehicle computing environment. The computational resources and the time budget that can be assigned to each task-specific DNN model are limited. In general, because there is a tradeoff between model size and performance, it is hard to improve model performance under such limited resources, and it is difficult to allocate resources effectively to individual tasks.

Another possible solution is to adopt a multi-task model whose structure consists of a shared network and task-specific branch modules, as shown on the right side of Figure 1. Because part of the network is shared across tasks, its features are computed once rather than once per task, so a larger share of the computational resources can be allocated to the shared network and inference becomes considerably more efficient. It is difficult, however, to manually design a shared network architecture that can adapt to various tasks under hardware constraints (e.g., supported operations and inference latency).

Figure 1. Network structures for multiple tasks. (images are from DDAD(*) dataset[6])
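
To make the resource argument concrete, here is a rough back-of-the-envelope sketch in Python. The function names and the cost figures are hypothetical and only illustrate the accounting; they are not measurements from our system.

```python
def independent_cost(backbone_cost, head_cost, num_tasks):
    """Total cost when every task runs its own full model."""
    return num_tasks * (backbone_cost + head_cost)


def shared_cost(backbone_cost, head_cost, num_tasks):
    """Total cost when one shared network feeds lightweight task heads."""
    return backbone_cost + num_tasks * head_cost


# Hypothetical per-frame costs in milliseconds for four tasks.
budget_independent = independent_cost(backbone_cost=8.0, head_cost=2.0, num_tasks=4)
budget_shared = shared_cost(backbone_cost=8.0, head_cost=2.0, num_tasks=4)
print(budget_independent, budget_shared)  # 40.0 16.0
```

Under these assumed numbers, the shared design frees up 24 ms of the same 40 ms budget, which could instead be spent on a larger shared network.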

In this study, we investigate the use of a NAS method to find an optimal shared network architecture for the multi-task problem. We adopt the panoptic segmentation task [1], which unites two different types of pixel-labeling tasks and is applicable to our target multi-task setting, automated driving systems. Because our in-house NAS algorithm searches for a network architecture while taking hardware constraints into account, we optimize a shared network module for panoptic segmentation with regard to both performance and inference latency on the hardware.

Problem Setting

In this study, we aim to optimize a network architecture for panoptic segmentation in terms of segmentation performance and inference latency. Panoptic segmentation is a task that simultaneously solves semantic segmentation and instance segmentation and produces a single, unified output.

Semantic segmentation works by assigning a class label to each pixel; it places objects into broad categories. Instance segmentation, on the other hand, detects and segments each instance of individual objects, such as vehicles and pedestrians, by assigning an instance ID to each pixel.

Figure 2 shows an example of panoptic segmentation. Semantic segmentation classifies all pixels into classes. Although it can segment any type of class, including car, road, building, and vegetation, all cars are assigned to the same class, so it cannot identify the region of each individual instance.

Instance segmentation, on the other hand, detects each distinct object, although it does not categorize the objects into classes. Panoptic segmentation unifies these two segmentation tasks: it assigns both a class label and an instance ID to each pixel. In an automated driving system, it is useful to obtain such information about both the static environment (e.g., road and sidewalk regions) and objects of interest (e.g., cars and pedestrians).

Figure 2. Panoptic segmentation. (images are from DDAD dataset[6])
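
As a small illustration of what "a class label and an instance ID per pixel" can look like in practice, the following Python sketch merges a semantic map and an instance map into a single panoptic map. The encoding `class_id * LABEL_DIVISOR + instance_id` is one common convention and is an assumption here, as are the class IDs and the helper names; the article does not specify the encoding we used.

```python
LABEL_DIVISOR = 1000  # assumed upper bound on instances per class
THING_CLASSES = {2}   # hypothetical: class 2 ("car") has countable instances


def merge_to_panoptic(semantic, instance):
    """Combine per-pixel class labels and instance IDs into one panoptic map."""
    panoptic = []
    for sem_row, inst_row in zip(semantic, instance):
        row = []
        for cls, inst in zip(sem_row, inst_row):
            # Background ("stuff") classes such as road carry no instance ID.
            inst_id = inst if cls in THING_CLASSES else 0
            row.append(cls * LABEL_DIVISOR + inst_id)
        panoptic.append(row)
    return panoptic


def decode(panoptic_id):
    """Recover (class_id, instance_id) from a panoptic ID."""
    return divmod(panoptic_id, LABEL_DIVISOR)


semantic = [[0, 2, 2], [0, 2, 2]]  # 0 = road, 2 = car
instance = [[0, 1, 2], [0, 1, 2]]  # two separate cars
panoptic = merge_to_panoptic(semantic, instance)
print(panoptic)          # [[0, 2001, 2002], [0, 2001, 2002]]
print(decode(2002))      # (2, 2)
```

The two cars share the class label 2 but keep distinct instance IDs, while the road pixels carry only a class label.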

Although most existing methods for panoptic segmentation adopt a shared backbone network to extract feature maps, they use separate, hand-crafted branches for the two tasks. Auto-panoptic [2] automates the design process by applying a NAS algorithm to find an architecture that improves the performance of panoptic segmentation. However, in Auto-panoptic, the semantic segmentation and instance segmentation branches are still separated after the backbone. Moreover, when choosing an architecture, the Auto-panoptic NAS considers only performance, not runtime. In this study, we searched for a shared network module to jointly optimize the performance and the inference latency.

Proposed Method

We adopted Panoptic-DeepLab [3], a widely used method that achieves high panoptic segmentation performance, as our baseline. Its architecture consists of a shared ResNet-50 backbone, two decoders, and two heads, as shown at the top of Figure 3.

We unified the two decoders and built a shared network that includes the backbone and decoder modules, as shown at the bottom of Figure 3. To produce the semantic and instance segmentation outputs, lightweight heads for the two tasks were added after the shared network. The final panoptic segmentation output was generated by applying the same post-processing method that is used by Panoptic-DeepLab.

Figure 3. Network architectures for panoptic segmentation. (images are from DDAD dataset[6])
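
For readers unfamiliar with the Panoptic-DeepLab post-processing, its core idea is that the instance head predicts object centers and, for each pixel, an offset pointing to its center; each "thing" pixel is then assigned to the nearest predicted center. The Python sketch below is a heavily simplified illustration of that grouping step only. The function name and data layout are hypothetical, and steps such as center extraction from the heatmap and majority voting for class labels are omitted.

```python
def group_pixels(centers, offsets, thing_mask):
    """Assign each thing pixel to its nearest predicted center.

    centers:    list of (y, x) center coordinates
    offsets:    2D grid of (dy, dx) offsets predicted per pixel
    thing_mask: 2D grid of booleans, True where the pixel is a thing class
    Returns a 2D grid of instance IDs (0 means "no instance").
    """
    instance_ids = []
    for y, (off_row, mask_row) in enumerate(zip(offsets, thing_mask)):
        id_row = []
        for x, ((dy, dx), is_thing) in enumerate(zip(off_row, mask_row)):
            if not is_thing or not centers:
                id_row.append(0)
                continue
            # Move the pixel by its predicted offset, then pick the closest center.
            py, px = y + dy, x + dx
            dists = [(py - cy) ** 2 + (px - cx) ** 2 for cy, cx in centers]
            id_row.append(dists.index(min(dists)) + 1)
        instance_ids.append(id_row)
    return instance_ids


centers = [(0, 0), (0, 3)]
offsets = [[(0, 0), (0, -1), (0, 1), (0, 0)]]
thing_mask = [[True, True, False, True]]
print(group_pixels(centers, offsets, thing_mask))  # [[1, 1, 0, 2]]
```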

The shared network was optimized by a latency-aware, gradient-based NAS algorithm [4]. As the loss function for the architecture search, we combined a latency loss with the panoptic loss. In Panoptic-DeepLab, the panoptic loss is computed as a weighted sum of the segmentation and instance losses, so we added a loss with a target latency constraint [5] to it. By minimizing this weighted sum of segmentation, instance, and latency losses, our NAS searched for a shared network architecture that improves panoptic segmentation performance while accounting for the inference latency on the hardware.
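
The following Python sketch shows one way such a combined loss could be assembled. It is a minimal illustration under stated assumptions, not our in-house implementation: the expected latency is estimated from a per-operation latency lookup table weighted by a softmax over architecture parameters, in the spirit of [4], and the latency term penalizes relative deviation from a target latency, in the spirit of [5]. All function names, weights, and numbers are hypothetical.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def expected_latency(arch_logits, latency_table):
    """Sum over layers of the softmax-weighted candidate-op latencies."""
    total = 0.0
    for logits, latencies in zip(arch_logits, latency_table):
        probs = softmax(logits)
        total += sum(p * lat for p, lat in zip(probs, latencies))
    return total


def latency_loss(latency, target):
    """Penalize relative deviation from the target latency (assumed form)."""
    return abs(latency / target - 1.0)


def search_loss(seg_loss, inst_loss, latency, target,
                w_seg=1.0, w_inst=1.0, w_lat=0.1):
    """Weighted sum of segmentation, instance, and latency losses."""
    return (w_seg * seg_loss
            + w_inst * inst_loss
            + w_lat * latency_loss(latency, target))


# Two searchable layers, each with two candidate ops (latencies in ms).
arch_logits = [[0.0, 0.0], [0.0, 0.0]]
latency_table = [[1.0, 3.0], [2.0, 4.0]]
lat = expected_latency(arch_logits, latency_table)  # 2.0 + 3.0 = 5.0
total = search_loss(seg_loss=1.0, inst_loss=0.5, latency=lat, target=4.0)
print(lat, round(total, 3))  # 5.0 1.525
```

Because the expected latency is a differentiable function of the architecture parameters, a gradient-based search can trade accuracy against latency within a single objective.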


We compared our NAS model against the baseline, Panoptic-DeepLab, internally using our own datasets. Our NAS model achieved performance comparable to that of Panoptic-DeepLab, while its lower latency is an advantage in terms of the time budget and the use of computational resources.


In this article, we have presented a study on NAS for panoptic segmentation, a multi-task problem that unifies semantic and instance segmentation. In our experiments, we have shown that the NAS model can achieve performance comparable to the baseline with half the inference time.

Although we adopted panoptic segmentation as the target task in this study, we will apply NAS to other types of multi-task problems as a next step. In order to optimize a network architecture for diverse multiple tasks, we will investigate a method to adaptively construct shared networks and separate modules. We are also integrating the NAS algorithm with abstraction APIs for hardware profiling in order to make it easy to support new types of hardware. We also plan to pursue hardware/software co-design that takes various types of recognition tasks into account.

A Call for Collaboration

This study was done during an internship on the Arene AI team last year. Woven Planet is running the internship program for students this year as well. Please check it out. Arene AI is also looking for new members with skills in ML algorithms and MLOps. If you are interested, please apply for our open positions!


  1. A. Kirillov, K. He, R. Girshick, C. Rother, and P. Dollar. Panoptic segmentation. The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2019.

  2. Y. Wu, G. Zhang, H. Xu, X. Liang, and L. Lin. Auto-panoptic: Cooperative multi-component architecture search for panoptic segmentation. arXiv preprint arXiv:2010.16119, 2020.

  3. B. Cheng, M. D. Collins, Y. Zhu, T. Liu, T. S. Huang, H. Adam, and L.-C. Chen. Panoptic-DeepLab: A simple, strong, and fast baseline for bottom-up panoptic segmentation. The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2020.

  4. B. Wu, X. Dai, P. Zhang, Y. Wang, F. Sun, Y. Wu, Y. Tian, P. Vajda, Y. Jia, and K. Keutzer. FBNet: Hardware-aware efficient ConvNet design via differentiable neural architecture search. The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2019.

  5. Y. Hu, X. Wu, and R. He. TF-NAS: Rethinking three search freedoms of latency-constrained differentiable neural architecture search. The European Conference on Computer Vision (ECCV), 2020.

  6. V. Guizilini, R. Ambrus, S. Pillai, A. Raventos, and A. Gaidon. 3D packing for self-supervised monocular depth estimation. The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2020.

(*) The use of the DDAD dataset in this blog is expressly licensed by its owner separately from the Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License under which the DDAD dataset is publicly made available.