Note: Woven Planet became Woven by Toyota on April 1, 2023.
Data plays a key role in the development of Arene, which aims to enable true state-of-the-art mobility programming as the basis for next-generation vehicles. When you build a data platform, it’s crucial to understand what kind of data you’ll be ingesting and how your users plan to interact with it. Knowing this lets you build mechanisms into your platform that give you the control you need to minimize cost and maximize efficiency. In this post, I’ll talk about how our legacy data platform at Woven Planet Holdings, Inc. (Woven Planet) would have benefitted from a better understanding of the data and the users beforehand, and what lessons we will be taking into the next generation of Big Data at Woven Planet.
The first data platform that Woven Planet built ingested double-digit petabytes of data over the course of two years. Our users were successfully able to find the data they needed, and were therefore able to build and deploy various Automated Driving / Advanced Driving Support System (AD/ADAS) models to Toyota Motor Corporation vehicles that ultimately increased customer safety and mobility. By all accounts, this platform was a success. After two years in production, we’ve had the chance to question the initial assumptions we made and learn what we could have done even better. Let’s look at some examples.
Perhaps the most telling example is the way our platform handled images recorded by our testing fleet. The data ingestion pipeline received video that was recorded by various cameras across the fleet and then extracted low-quality images from this video at a low sampling rate. These low-quality images, referred to as lossy images, were fast to extract and cheap to store, so we believed this was an effective way to populate a catalog of all available images. Our users then sifted through this catalog to find the images they needed for their model development, and asked the data platform to re-extract the same images at a sampling rate and image quality suitable for machine learning (ML) training. We refer to these images as lossless images.
With this approach, our platform has extracted billions of lossy images. However, our users so far have only requested a small percentage of those as lossless images. That means that a majority of the work we put into extracting lossy images, and a majority of the cost of storing them, could have been avoided if we had had more clarity about what kind of data our users wanted and what their access patterns would be. Put simply, there was room to improve.
Benefits and Drawbacks
Although wasteful, the approach described above did have some benefits. First, it allowed us to immediately add value for our users, who had an urgent need to start interacting with the data.
Next, because all the lossy images were extracted up front, our users never had to wait to download or preview the images. This provided a slick user experience with zero delay, since the only bottleneck for previewing an image was the user’s network speed.
However, let’s think about the problems that this approach creates. First and foremost is cost. Keeping hundreds of millions of images in the active storage tier of S3 isn’t cheap, nor is running various Elastic MapReduce (EMR) clusters to extract images that nobody is interested in. This approach might work for small datasets, but it simply doesn’t scale for the future needs of Woven Planet.
Further, it would be cost-prohibitive to extract multiple combinations of image quality and sampling rate up front, so our users are stuck with just one combination. If the 1 FPS frame rate meets their needs, that’s great! If not, then they are out of luck. In fact, once you start tracking objects at high speeds, 1 FPS just doesn’t cut it.
Our ‘working hard’ approach didn’t scale well, so let’s take a look at a ‘working smart’ approach instead. After analyzing two years of usage reports, we arrived at a key insight: we had unintentionally coupled data discovery with data extraction.
Our metadata catalog tells our users what kind of data we have by providing indexes around map attributes, weather data, time of day, etc. The image catalog then ties that metadata to all the lossy images, so our users can use the metadata to find and view lossy images. But perhaps it isn’t necessary to create the image catalog up front.
Instead, we decided to explore on-demand image extraction. Our hypothesis is that our users will use the metadata to find the scenes they want, and then ask our platform to extract the images they want with the attributes they want (quality, format, compression, frame rate).
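To make the idea concrete, here is a minimal sketch of what an on-demand extraction request could look like. It is an illustration only, not our platform’s actual API: the `ExtractionRequest` type, its field names, and the `frame_timestamps` helper are all hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class ExtractionRequest:
    """Hypothetical request a user submits after finding a scene via metadata."""
    scene_id: str      # identifier returned by the metadata catalog (assumed)
    fps: int           # frames to extract per second of video
    image_format: str  # e.g. "jpg" or "png"
    lossless: bool     # True for ML-training quality, False for previews


def frame_timestamps(request: ExtractionRequest, scene_duration_s: float) -> List[float]:
    """Return the video offsets (in seconds) at which frames would be extracted."""
    if request.fps <= 0:
        raise ValueError("fps must be positive")
    total_frames = int(scene_duration_s * request.fps)
    return [i / request.fps for i in range(total_frames)]


# The same 2-second scene yields 2 frames at 1 FPS and 20 frames at 10 FPS.
preview = ExtractionRequest("scene-1", fps=1, image_format="jpg", lossless=False)
training = ExtractionRequest("scene-1", fps=10, image_format="png", lossless=True)
print(len(frame_timestamps(preview, 2.0)), len(frame_timestamps(training, 2.0)))
```

The point of the sketch is that no frames exist until a request names a scene and its attributes, so the amount of extraction work follows what users actually ask for.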
Benefits and Drawbacks
As owners of the platform, the main benefit is clearly the cost. Moving to on-demand image extraction allows us to do the least amount of work necessary to meet our users’ needs. It is more resource- and cost-efficient than the ‘working hard’ approach.
Next, it also gives users significantly more control over what kind of images they want. User A might look at the metadata catalog and request 1 FPS / lossy / JPG, while User B might look at the same metadata and instead request 10 FPS / lossless / PNG. By decoupling our metadata from our image extraction, we allow ourselves to serve multiple users with the same underlying data.
However, this approach isn’t perfect. As with all things in life, there are tradeoffs. Since images are extracted ‘on-demand’, users will experience some latency while they wait for the extraction to complete. There is abundant room for optimizations such as optimistic extractions, caching and so on, but it is inevitable that some users will experience some delay, and that’s the challenge we look forward to solving in the next iteration of our data platform. It might not be easy, but it will definitely be fun!
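As one illustration of the caching idea mentioned above, the sketch below keys extraction results by the full set of requested attributes, so a repeated request skips the extraction latency while a different combination triggers new work. Everything here is hypothetical: `CachingExtractor`, `fake_extract`, and the output path format are stand-ins for illustration, not part of our platform.

```python
from typing import Callable, Dict, Tuple

# Hypothetical cache key: (scene_id, fps, image_format, lossless)
RequestKey = Tuple[str, int, str, bool]


class CachingExtractor:
    """Runs an extraction only the first time a given combination is requested."""

    def __init__(self, extract_fn: Callable[[RequestKey], str]):
        self._extract_fn = extract_fn
        self._cache: Dict[RequestKey, str] = {}
        self.extraction_count = 0  # number of real (slow) extractions performed

    def get(self, key: RequestKey) -> str:
        if key not in self._cache:
            self.extraction_count += 1
            self._cache[key] = self._extract_fn(key)
        return self._cache[key]


def fake_extract(key: RequestKey) -> str:
    """Stand-in for the real extraction job; returns a made-up output location."""
    scene_id, fps, image_format, lossless = key
    quality = "lossless" if lossless else "lossy"
    return f"extracted/{scene_id}/{fps}fps-{quality}.{image_format}"


extractor = CachingExtractor(fake_extract)
user_a = extractor.get(("scene-1", 1, "jpg", False))   # User A: 1 FPS / lossy / JPG
user_b = extractor.get(("scene-1", 10, "png", True))   # User B: 10 FPS / lossless / PNG
repeat = extractor.get(("scene-1", 1, "jpg", False))   # served from the cache
print(extractor.extraction_count)
```

In this sketch only the first request for each combination pays the extraction cost; a real system would also need eviction and storage-tier decisions, which are left out here.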
Work hard in the beginning, but constantly look for ways to work smart. Working hard is often easier to set up, quickly creates short-term value for users, and gets you precious feedback on the validity of your assumptions. However, we must be willing to use this feedback to transition towards working smart.
Take the time to understand what kind of data you have, and how your users wish to interact with it. Further, break down your users’ use cases and understand which ones are the 99% that you want to optimize for, and which are the 1% you’re willing to delay until a future version of your platform. Had we known that 99% of our users would only be interested in a small percentage of our data, we would have made significantly different decisions early on.
Create decoupled components that work well together. By decoupling data discovery from data extraction, we moved away from a one-size-fits-all approach and instead allowed our users to interact with data on their own terms.
Decoupling data discovery from data extraction is also a vital first step towards our vision of a data mesh. As new pieces of metadata are added to our catalog, users don’t have to wait for all the metadata to be added. Instead, we see a world where users can target specific metadata they are interested in, and trigger extraction, or even other actions based on that metadata.
As we build the next-generation data platform within Arene, we will be experimenting with this approach. We’re excited to see how our ideas play out! If the challenges described above sound interesting to you, then come join us here!