September 27, 2021

Large Scale Data Annotation Pipeline With Global Dataset Governance Policy in Arene


Note: Woven Planet became Woven by Toyota on April 1, 2023.


By Akira Wakatsuki, Takaaki Tagawa, Yusuke Yachide and Yuta Tsuzuki, Senior Engineers, Senior Manager and Engineer

Introduction

Arene aims to enable true state-of-the-art mobility programming as the basis for next-generation vehicles. With such innovative technology, we are all dedicated to bringing our vision of “Mobility to Love, Safety to Live” to life, and to providing mobility solutions that benefit all people worldwide. As part of the whole Arene team, the Arene AI team at Woven Alpha, Inc. (Woven Alpha), an operating company of Woven Planet Holdings, Inc. (Woven Planet), has been developing machine learning (ML) and annotation infrastructure and using it to provide the annotation data service on Arene. One of Arene AI’s missions is to establish a standard ML platform for all of its partners. A common platform enables the sharing of resources and of every ML engineer’s outcomes, which naturally leads to collaboration, which in turn contributes to productization and ultimately to the safety of vehicles and people. Today, we would like to give you more details about our multi-purpose, large-volume annotation operations (AnnoOps) for mass production.

Our current focus is automated driving software development, and ML technologies appear in many automated driving applications. Although we have high expectations for ML technology, each team has different expectations and requirements depending on its development phase and target task.

For example:

I. Annotation quality: One team wants relatively lower-quality annotation of a small dataset as soon as possible for a trial, whereas another team wants high-quality annotation of a large-scale dataset for production.

II. Annotation process: One team wants to annotate bounding boxes and then masks for only a subset of those boxes, whereas another team wants to annotate objects in an image and a lidar point cloud at the same timestamp and link them.

III. Annotation delivery: One team wants annotated data for a certain geographic location, whereas another team wants annotated data for a certain time period.

We have to fulfill these diverse annotation requirements, across tasks and volumes, to satisfy all of our customers. We also have to run stable, efficient, and systematic operations to provide a reliable service at scale.

We have been working hard on solving problems I. through III. We found that II. was one of the largest bottlenecks in our pipeline for handling the increasing volume and diversity of annotation tasks. We therefore focused on “II. Annotation process” to improve our pipeline, and we would like to explain how we approached this problem.

Issues in Our Old Annotation Process

Automated driving software development requires handling a variety of annotations. We work on a wide variety of annotation tasks requested by customers from many different domains. We also work with multiple annotation vendors to meet customer requirements such as high quality, huge volume, and so on. As a result, our annotation process quickly became complex and the operational overhead became significant, which led to the following issues:

  1. We had to manage different data formats and annotation rules for various customers.

  2. We had to manage different pipelines to handle various annotation requests, annotation tasks, vendors, and task-dependent processes.

  3. The annotation result format depends on the annotation vendor, and we had to convert each vendor’s format into one that each of our customers could accept.

In 1., for instance, we accept annotation requests and data from customers in various ways due to customer limitations or needs, such as different storage systems, data locations, and communication channels. It quickly became difficult to manage this complexity as the number of customers increased.

In 2., we need to handle specific configurations, options, and vendor APIs for each annotation request. These are tightly coupled to the vendor and the task, so we had to construct a different pipeline for each combination. This quickly became complicated and error prone as the number of customers and vendors increased.

In 3., the annotation result formats depend on the annotation vendors, and we had to convert each format into what our customers wanted to accept. As with 2., this quickly became complicated and error prone as the number of customers and vendors increased, as the sketch below illustrates.
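To make the coupling concrete, here is a minimal sketch of what the old approach implies. All vendor and customer formats below are hypothetical, invented only for illustration: the point is that every (vendor, customer) pair needs its own converter, so N vendors and M customer formats mean N x M converters to write and maintain.

```python
# Hypothetical illustration of the old, tightly coupled approach:
# one converter per (vendor, customer) pair.

def convert_vendor_a_for_customer_x(raw: dict) -> dict:
    # Vendor A nests boxes under "labels"; customer X wants a flat list.
    return {"boxes": [label["bbox"] for label in raw["labels"]]}

def convert_vendor_b_for_customer_x(raw: dict) -> dict:
    # Vendor B uses corner points; customer X wants [x, y, width, height].
    boxes = []
    for obj in raw["objects"]:
        x1, y1, x2, y2 = obj["corners"]
        boxes.append([x1, y1, x2 - x1, y2 - y1])
    return {"boxes": boxes}

# ...and a further converter again for customer Y, customer Z, and so on,
# for every vendor. Each new customer or vendor multiplies the work.
```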

Our New Annotation Process

We implemented a new annotation process to solve the issues above and to handle various annotation tasks, vendors, and customers at scale.

The annotation process proceeds in the following phases:

  1. Receive target data for annotation from customers.

  2. Request annotations from annotation vendors.

  3. Obtain annotation results from vendors and deliver them to customers.

The above phases are linked with the aforementioned problems 1.–3.

In 1., we receive target data from each customer, and each customer requests different annotation tasks. In other words, there is a diversity of relationships between customers and annotation tasks.

In 2., each annotation task goes through a different format and process depending on which annotation vendor the request is submitted to. There is a different relationship between each annotation task and each vendor process.

In 3., the annotation results arrive in a different format from each vendor and need to be delivered as a different ML dataset for each customer. There is a diversity of relationships between annotation results and the dataset deliverables.

There are variations in each phase of the annotation process, and our new annotation process aims to manage them while minimizing complexity. We introduced configurations to manage the relations and a unified format to manage the datasets.


Figure 1. Our new annotation process overview

We decoupled the dependencies according to the three phases and defined the following three types of relation parameters to handle the dependencies between customers, annotations, and datasets:

A. Customer project parameters: Relations between customers, the datasets they provide, and annotation tasks.

B. Annotation vendor process parameters: Relations between annotation tasks and vendor processes.

C. Delivery parameters: Relations between annotation vendors and the ML dataset deliverable formats for customers.

Each parameter set A., B., and C. above captures the diversity of 1., 2., and 3., respectively. Importantly, we standardized the schema of each parameter set, so that we could construct an automated process that requires only these parameters and runs without human intervention.
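As a rough illustration of this idea, the sketch below models the three parameter sets as plain records and the whole pipeline as one generic driver that consumes them. Every name and field here is hypothetical, chosen to show the shape of the design rather than our actual schema:

```python
from dataclasses import dataclass, field

# Illustrative parameter schemas only; the real schemas are internal.

@dataclass
class CustomerProjectParams:      # A. customer <-> customer data <-> tasks
    customer_id: str
    input_uri: str                # where the customer's raw data arrives
    annotation_tasks: list = field(default_factory=list)  # e.g. ["2d_bbox"]

@dataclass
class VendorProcessParams:        # B. annotation task <-> vendor process
    vendor_id: str
    task: str
    request_options: dict = field(default_factory=dict)   # vendor-specific

@dataclass
class DeliveryParams:             # C. vendor results <-> customer deliverable
    customer_id: str
    output_format: str            # e.g. "dgp"
    output_uri: str

# Stubs standing in for the platform's internals.
def ingest(uri: str) -> list:
    return []                     # phase 1: fetch the customer's raw data

def submit_to_vendor(vendor: VendorProcessParams, data: list) -> list:
    return []                     # phase 2: call the vendor's API

def convert_to_unified(results: list) -> list:
    return results                # normalize vendor output (see DGP below)

def deliver(dataset: list, delivery: DeliveryParams) -> None:
    pass                          # phase 3: write to delivery.output_uri

def run_annotation_pipeline(project: CustomerProjectParams,
                            vendor: VendorProcessParams,
                            delivery: DeliveryParams) -> None:
    """One generic driver: only the three parameter sets vary per request."""
    data = ingest(project.input_uri)
    results = submit_to_vendor(vendor, data)
    deliver(convert_to_unified(results), delivery)
```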

In addition, we introduced a unified dataset management format in the annotation platform, so that once we convert data into this unified format we no longer depend on the data formats provided by the customers. We call this dataset management format the Dataset Governance Policy (DGP). DGP comes from Toyota Research Institute (TRI) in the U.S.A., our joint-development partner. In line with Arene AI’s ambitious mission described at the beginning, DGP has spread across Woven Planet as a whole, and its scope is expanding further. DGP itself is also open source and continues to evolve.
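The conversion itself can be pictured as one normalizer per vendor, all producing the same canonical record. The sketch below is purely illustrative: the field names are invented for this example, and the actual schema of the unified format is defined in the open-source DGP repository.

```python
from typing import Callable

# Hypothetical normalizers: one per vendor, all emitting the same
# unified record. Downstream delivery code reads only this record,
# so N vendors and M customers need N + M pieces of code, not N x M.

def normalize_vendor_a(raw: dict) -> dict:
    # Vendor A nests labels; lift them into the unified "boxes_2d" field.
    return {"boxes_2d": [label["bbox"] for label in raw.get("labels", [])]}

def normalize_vendor_b(raw: dict) -> dict:
    # Vendor B uses corner points; convert to [x, y, width, height].
    boxes = []
    for obj in raw.get("objects", []):
        x1, y1, x2, y2 = obj["corners"]
        boxes.append([x1, y1, x2 - x1, y2 - y1])
    return {"boxes_2d": boxes}

NORMALIZERS: dict = {
    "vendor_a": normalize_vendor_a,
    "vendor_b": normalize_vendor_b,
}

def convert_to_unified(vendor_id: str, raw: dict) -> dict:
    """Route a vendor result through its normalizer into the unified format."""
    return NORMALIZERS[vendor_id](raw)
```

With this split, adding a vendor means writing one normalizer, and adding a customer means writing one delivery step against the unified format, instead of touching every existing pipeline.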

As a result, we can handle diverse customer annotation projects without losing scalability: we maintain only the parameters listed above, regardless of the diversity of the customers’ data formats.

Operational Improvement

We introduced the new annotation process into our real operations, and it improved them significantly.

First, the new process eliminated the operational steps caused by differences in vendors and annotation tasks. The current operation requires only a single script to start the annotation process. The simplified operations reduced both onboarding and day-to-day operational effort, and enabled multiple operators to manage the annotation operations.
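To give a feel for what “a single script” means in practice, here is a hypothetical command-line entry point. The flags and IDs are invented for illustration; the operator supplies only references to pre-registered parameter sets, and the pipeline resolves everything else from configuration.

```python
import argparse

def main() -> None:
    # Hypothetical single entry point: the operator passes only the IDs of
    # pre-registered parameter sets; no vendor- or task-specific steps remain.
    parser = argparse.ArgumentParser(description="Start an annotation run.")
    parser.add_argument("--project", required=True,
                        help="customer project parameter set ID (A)")
    parser.add_argument("--vendor", required=True,
                        help="annotation vendor process parameter set ID (B)")
    parser.add_argument("--delivery", required=True,
                        help="delivery parameter set ID (C)")
    args = parser.parse_args()

    # Look up each parameter set in the configuration store and hand them
    # to a generic driver like run_annotation_pipeline() sketched above.
    print(f"Starting run: {args.project} / {args.vendor} / {args.delivery}")

if __name__ == "__main__":
    main()
```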

Second, we eliminated the manual work of associating customer data with annotation tasks every time. Once we assign the customer project parameters, our annotation process automatically handles the data coming from customers.

Third, we only need to update the relation parameters to modify the annotation process, which eliminates the manual work needed to handle the effects of such changes. This improved our operational efficiency significantly and reduced human errors by up to 90%.

Given such flexibility and scalability, we can easily expand our service to new customers and vendors. We plan to handle all annotation requests in Woven Planet in the near future.

Summary

In this article, we explained our new annotation process for managing diverse datasets and annotations at scale for mass production. We decoupled the annotation process into three phases, and each phase is handled automatically based on standardized parameters and a unified dataset format.

We successfully improved and scaled up the annotation workflow based on the new annotation pipeline. We will continue to improve the pipeline to accommodate all the annotation processes in Woven Planet.

Thank you for reading. We are now looking for new members to join us. If you are interested in annotation, please apply for the open positions here!