Note: Woven Planet became Woven by Toyota on April 1, 2023.
The Arene Platform is being developed to accelerate the creation of value towards a future where ‘mobility’ replaces ‘transportation.’ As software pioneers, we provide a modern Developer Kit and a Vehicle Runtime with APIs that enable coders to program vehicles in ways never before possible. Making this platform a reality requires many supporting technologies behind the scenes. Sahir Contractor, a senior engineer on the Arene team at Woven Alpha, manages the Arene Data Platform’s engine, the core component that empowers data customers to achieve their objectives.
Service Level Agreements (SLA)
One of our team's responsibilities is to provide high-quality training data to the various Machine Learning (ML) teams at Woven Alpha. We ingest terabytes of videos recorded by multiple cameras mounted on our testing vehicles, along with various streams of sensor data such as CAN, LiDAR, and RADAR. When new videos become available, the team will:
Extract relevant images from these videos
Associate these images with in-vehicle sensor readings
Associate these images with extra-vehicular environmental information
Make all this data searchable for our ML teams
Crucially, for this data to be useful to our Artificial Intelligence engineers, we must deliver it to them within 18 hours of when we first receive the videos and sensor streams.
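The 18-hour window can be expressed as a simple deadline check. The following is a minimal sketch; the function names and the example timestamps are illustrative assumptions, not part of our pipeline.

```python
from datetime import datetime, timedelta, timezone

# Delivery window stated in the SLA above.
SLA_WINDOW = timedelta(hours=18)


def sla_deadline(received_at: datetime) -> datetime:
    """Latest acceptable delivery time for data received at `received_at`."""
    return received_at + SLA_WINDOW


def meets_sla(received_at: datetime, delivered_at: datetime) -> bool:
    """True when the data was delivered within the SLA window."""
    return delivered_at <= sla_deadline(received_at)


# Hypothetical timestamps, for illustration only.
received = datetime(2023, 1, 2, 6, 0, tzinfo=timezone.utc)
on_time = meets_sla(received, received + timedelta(hours=9))
late = meets_sla(received, received + timedelta(hours=23))
```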
Problems
Our team inherited a workflow that looked like this:
Run a script, twice a week, to check if any new videos were available for processing
If yes, run another script to spin up sufficient Amazon Elastic MapReduce (EMR) clusters to handle the processing
Wait until clusters are bootstrapped
Once the clusters are ready, run another script to load balance the videos across the clusters
Wait until all submissions are accepted
Wait 10–24 hours until all the processing is complete
Since there were no notifications about the status of the clusters, the oncall would run another script to check if the work was completed
Once completed, the oncall would run another script to test that the new data was ingested correctly
And, if we’re lucky, the oncall would remember to run one last script to terminate the clusters so that we don’t pay for Amazon Elastic Compute Cloud (EC2) instances that aren’t being used
As tedious as this looks, note that this flow is actually the happy case. Each step can fail, and every failure leads into a separate recovery workflow. Further, this flow was complicated enough that only one other person on the team besides the oncall knew how to execute it.
Moreover, because the oncall ran this flow only twice a week, we almost never hit the 18-hour SLA that our customers wanted.
You might have also noticed that several different scripts need to be executed in a particular order for this flow to work. Transferring state between these scripts and coordinating their execution is a labor-intensive, error-prone, and painfully boring task that we no longer wanted humans to do. This led us to AWS Step Functions (SFN)!
Integration with SFN
We had prior experience working with SFN, so we knew that it was a good fit for this problem. Our first step was to identify all the pieces that needed to fit together before SFN could automate them. We broke down our workflow into four main steps:
Check if there’s new work to do
Do the work
Validate results
Inform users of results
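The four steps above map naturally onto a state machine. The sketch below builds a skeleton definition in the Amazon States Language as a Python dictionary. The state names are hypothetical, the `Pass` states are placeholders for the real tasks, and the first step (checking for new work) is assumed to be represented by whatever starts the execution rather than by a state of its own.

```python
import json
from typing import Any, Dict, List


def build_pipeline_skeleton() -> Dict[str, Any]:
    """Skeleton state machine; Pass states stand in for the real tasks."""
    return {
        "Comment": "Skeleton of the ingestion pipeline (illustrative only)",
        "StartAt": "DoTheWork",
        "States": {
            "DoTheWork": {"Type": "Pass", "Next": "ValidateResults"},
            "ValidateResults": {"Type": "Pass", "Next": "InformUsers"},
            "InformUsers": {"Type": "Pass", "End": True},
        },
    }


def execution_order(definition: Dict[str, Any]) -> List[str]:
    """Follow the Next pointers from StartAt until a state marked End."""
    order = []
    name = definition["StartAt"]
    while True:
        order.append(name)
        state = definition["States"][name]
        if state.get("End"):
            return order
        name = state["Next"]


# The JSON string is what would be supplied as the state machine definition.
definition_json = json.dumps(build_pipeline_skeleton(), indent=2)
```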
1. Check if there’s new work to do
This part was the easiest. We have another set of scripts that are responsible for uploading new videos when they become available. All we had to do was invoke SFN with the paths of all the videos that were just uploaded. This removes the need to run SFN on a fixed schedule, and also guarantees that new videos will be processed as soon as they are uploaded.
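As a rough sketch of what that invocation could look like, the function below starts an execution with the uploaded video paths as input. It assumes a boto3 Step Functions client is passed in; the `video_paths` input key and the function name are hypothetical.

```python
import json
from typing import Any, Sequence


def start_pipeline(sfn_client: Any, state_machine_arn: str,
                   video_paths: Sequence[str]) -> str:
    """Start one state machine execution for a batch of uploaded videos.

    `sfn_client` is expected to behave like boto3.client("stepfunctions").
    The "video_paths" key is an assumed input shape, not our actual schema.
    """
    response = sfn_client.start_execution(
        stateMachineArn=state_machine_arn,
        input=json.dumps({"video_paths": list(video_paths)}),
    )
    return response["executionArn"]
```

The upload script would call `start_pipeline` once after each batch of uploads, which is what removes the need for a fixed schedule.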
2. Do the work
This part was relatively easy too. We already had scripts for creating our EMR clusters with a particular configuration, and SFN supports configuring EMR via its native Domain Specific Language (DSL), the Amazon States Language, so we could just copy/paste that configuration into SFN directly.
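To illustrate, the sketch below builds three states that use the Step Functions service integrations for EMR: create a cluster, submit a step, and terminate the cluster. The state names, the `spark-submit` command, and the S3 path are hypothetical; `cluster_config` stands for the existing cluster configuration that gets copied in.

```python
from typing import Any, Dict


def emr_states(cluster_config: Dict[str, Any], next_state: str) -> Dict[str, Any]:
    """States that create an EMR cluster, run one step, and tear it down."""
    return {
        "CreateCluster": {
            "Type": "Task",
            # ".sync" makes the state wait until the cluster is ready.
            "Resource": "arn:aws:states:::elasticmapreduce:createCluster.sync",
            "Parameters": cluster_config,
            "ResultPath": "$.cluster",
            "Next": "ProcessVideos",
        },
        "ProcessVideos": {
            "Type": "Task",
            # ".sync" makes the state wait until the step finishes.
            "Resource": "arn:aws:states:::elasticmapreduce:addStep.sync",
            "Parameters": {
                "ClusterId.$": "$.cluster.ClusterId",
                "Step": {
                    "Name": "process-videos",
                    "ActionOnFailure": "CONTINUE",
                    "HadoopJarStep": {
                        "Jar": "command-runner.jar",
                        # Hypothetical job; the real command is not shown here.
                        "Args": ["spark-submit",
                                 "s3://example-bucket/jobs/extract_frames.py"],
                    },
                },
            },
            "ResultPath": "$.step",
            "Next": "TerminateCluster",
        },
        "TerminateCluster": {
            "Type": "Task",
            "Resource": "arn:aws:states:::elasticmapreduce:terminateCluster",
            "Parameters": {"ClusterId.$": "$.cluster.ClusterId"},
            "ResultPath": "$.termination",
            "Next": next_state,
        },
    }
```

Because termination is a state of its own, it no longer depends on someone remembering to run a script.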
3. Validate results
Once the previous step completes, results are written to predefined S3 buckets. We already had Glue jobs that scan these S3 buckets to ensure that all newly written data is valid and discoverable by our users. Since the SFN DSL has native support for talking to AWS Glue, all we had to do was include a few lines in our state machine to synchronously invoke the Glue jobs and wait until they complete.
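Those few lines could look roughly like the states generated below, which run a list of Glue jobs one after another using the synchronous Glue integration. The job names and the `Validate_` prefix are hypothetical.

```python
from typing import Any, Dict, Sequence


def glue_validation_states(job_names: Sequence[str],
                           next_state: str) -> Dict[str, Any]:
    """Chain one synchronous Glue task per job, ending at `next_state`."""
    states: Dict[str, Any] = {}
    for index, job_name in enumerate(job_names):
        is_last = index == len(job_names) - 1
        following = next_state if is_last else f"Validate_{job_names[index + 1]}"
        states[f"Validate_{job_name}"] = {
            "Type": "Task",
            # ".sync" makes the state wait for the Glue job run to complete.
            "Resource": "arn:aws:states:::glue:startJobRun.sync",
            "Parameters": {"JobName": job_name},
            "ResultPath": f"$.validation_{index}",
            "Next": following,
        }
    return states
```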
4. Inform users of results
We created a Lambda that publishes information to a list of SNS topics, which in turn publish to various Slack channels. By invoking this Lambda at the very end of the state machine, we were able to easily notify our stakeholders about how much new data was uploaded, and how to find it.
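A minimal sketch of such a Lambda is shown below. It assumes a boto3 SNS client is injected, and the event fields `new_image_count` and `dataset_location` are hypothetical names for the information passed in by the state machine.

```python
from typing import Any, Callable, Dict, Sequence


def format_summary(event: Dict[str, Any]) -> str:
    """Build the human-readable message from the (assumed) event fields."""
    return (f"{event['new_image_count']} new images are available at "
            f"{event['dataset_location']}")


def make_handler(sns_client: Any,
                 topic_arns: Sequence[str]) -> Callable[[Dict[str, Any], Any], Dict[str, int]]:
    """Create a Lambda handler that publishes the summary to every topic.

    `sns_client` is expected to behave like boto3.client("sns").
    """
    def handler(event: Dict[str, Any], context: Any) -> Dict[str, int]:
        message = format_summary(event)
        for topic_arn in topic_arns:
            sns_client.publish(
                TopicArn=topic_arn,
                Subject="New training data available",
                Message=message,
            )
        return {"notified_topics": len(topic_arns)}

    return handler
```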
Results
We lowered our average data delivery time from ~100 hours to ~9 hours, with some deliveries completing in as little as ~4 hours. Our p100 (worst-case scenario) dropped to 23 hours.
We have completely eliminated the need for our oncall to engage with this pipeline. We now have strict guarantees that new data will automatically be processed as soon as it is available, along with monitors/alarms at every step of the pipeline to alert us if the automation fails.
Failure recovery has become simpler since we can model failure states and recovery actions directly in the State Machine. For example, failures to create a cluster will automatically retry. Failures to submit jobs to a cluster will automatically retry. Jobs that run longer than we are comfortable with will automatically notify our oncall. In the case of job failures, logs are automatically extracted from S3 and made available to our oncall for debugging. Additionally, since the SFN UI shows the input and output at every step, it is also simpler than before to find where an error started or which input caused it.
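In the Amazon States Language, this kind of recovery is expressed with `Retry` and `Catch` fields on a task state. The helper below is a sketch that adds both to an existing state; the intervals, attempt counts, and the name of the failure-handling state are assumed values, not our production settings.

```python
from typing import Any, Dict


def with_failure_handling(state: Dict[str, Any], on_failure_state: str,
                          max_attempts: int = 3) -> Dict[str, Any]:
    """Return a copy of `state` that retries, then routes to a failure state."""
    hardened = dict(state)
    hardened["Retry"] = [
        {
            "ErrorEquals": ["States.TaskFailed"],
            "IntervalSeconds": 30,   # wait before the first retry
            "MaxAttempts": max_attempts,
            "BackoffRate": 2.0,      # double the wait on each later retry
        }
    ]
    hardened["Catch"] = [
        {
            # Once retries are exhausted, hand the error to a recovery state.
            "ErrorEquals": ["States.ALL"],
            "ResultPath": "$.error",
            "Next": on_failure_state,
        }
    ]
    return hardened


create_cluster = {
    "Type": "Task",
    "Resource": "arn:aws:states:::elasticmapreduce:createCluster.sync",
    "Next": "ProcessVideos",
}
resilient_create_cluster = with_failure_handling(create_cluster, "NotifyOncall")
```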
The code also serves as documentation, thereby removing the need for us to remember to keep documentation updated as the code evolves. Ramping up new hires is much simpler since they can just view the State Machine visualization in AWS to understand how data flows from one step to the next.
Below is a diagram of what our State Machine looks like.
Lessons Learned
Monitoring Long Running Jobs
The metrics published by SFN on your behalf make it easy to measure the duration of a particular step after it has finished. However, if you want to be notified while a step is still running that it has exceeded X hours, you need to write that code yourself.
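One possible shape for that code is a periodic check that lists running executions and flags the ones older than a threshold. The sketch below works at the level of whole executions rather than individual steps, which is a simplification, and it assumes a boto3 Step Functions client is passed in.

```python
from datetime import datetime, timedelta, timezone
from typing import Any, List, Optional


def find_long_running(sfn_client: Any, state_machine_arn: str,
                      max_age: timedelta,
                      now: Optional[datetime] = None) -> List[str]:
    """Return ARNs of executions that have been RUNNING longer than `max_age`.

    `sfn_client` is expected to behave like boto3.client("stepfunctions").
    """
    now = now or datetime.now(timezone.utc)
    overdue: List[str] = []
    kwargs = {"stateMachineArn": state_machine_arn, "statusFilter": "RUNNING"}
    while True:
        page = sfn_client.list_executions(**kwargs)
        for execution in page["executions"]:
            if now - execution["startDate"] > max_age:
                overdue.append(execution["executionArn"])
        token = page.get("nextToken")
        if not token:
            return overdue
        kwargs["nextToken"] = token  # fetch the next page of results
```

Running this on a schedule and sending the returned ARNs to the oncall gives a notification while the job is still in flight.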
Concurrent Job Submissions
Submitting a large number of jobs to EMR in parallel caused us to hit EMR’s Application Programming Interface (API) rate limits. Adding backoff-and-retry logic reduces the chance of throttling when processing large amounts of data. The same risk applies to concurrently terminating a large number of clusters.
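A generic version of backoff-and-retry is sketched below, using exponential backoff with random jitter. `ThrottledError` is a hypothetical stand-in for whatever throttling error the real client raises, and the delay values are assumptions.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class ThrottledError(Exception):
    """Hypothetical stand-in for an API throttling error."""


def call_with_backoff(operation: Callable[[], T], max_attempts: int = 5,
                      base_delay: float = 1.0, max_delay: float = 60.0,
                      sleep: Callable[[float], None] = time.sleep) -> T:
    """Call `operation`, retrying on throttling with exponential backoff."""
    for attempt in range(max_attempts):
        try:
            return operation()
        except ThrottledError:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the error to the caller
            # The cap doubles on every attempt; jitter spreads out callers
            # that were throttled at the same moment.
            cap = min(max_delay, base_delay * 2 ** attempt)
            sleep(random.uniform(0, cap))
    raise ValueError("max_attempts must be at least 1")
```

The jitter matters when many submissions are throttled together, since retrying them all after the same delay would simply recreate the burst.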
Do Not Automate Without Monitoring
As tempting as it can be to deprecate a manual workflow in favor of automation, we must remember to add appropriate monitoring as well. If you automate a manual workflow that has zero monitoring, you expose yourself to the risk of not knowing when the automation fails. We suggest adding high-level alarms such as health checks to your automation when you start, and incrementally adding more granular alarms as you proceed in order to avoid introducing blind spots.
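As one example of a high-level alarm, the sketch below builds the arguments for a CloudWatch alarm on the `ExecutionsFailed` metric that Step Functions publishes in the `AWS/States` namespace. The alarm name, period, and threshold are assumed values chosen for illustration.

```python
from typing import Any, Dict


def failed_execution_alarm(state_machine_arn: str,
                           alarm_topic_arn: str) -> Dict[str, Any]:
    """Arguments for an alarm that fires when any execution fails."""
    return {
        "AlarmName": "data-pipeline-executions-failed",  # hypothetical name
        "Namespace": "AWS/States",
        "MetricName": "ExecutionsFailed",
        "Dimensions": [{"Name": "StateMachineArn", "Value": state_machine_arn}],
        "Statistic": "Sum",
        "Period": 300,               # evaluate in 5-minute windows
        "EvaluationPeriods": 1,
        "Threshold": 1,
        "ComparisonOperator": "GreaterThanOrEqualToThreshold",
        # No data points means no failures, so do not alarm on missing data.
        "TreatMissingData": "notBreaching",
        "AlarmActions": [alarm_topic_arn],
    }
```

These arguments are shaped for a boto3 CloudWatch client, for example `cloudwatch.put_metric_alarm(**failed_execution_alarm(arn, topic))`. More granular alarms can then be added per step as the automation matures.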
Join us! 👋
We are actively hiring exceptional talent for the Arene Software Platform. If you are interested in working with us in our brand-new office in Nihonbashi, Tokyo, please have a look at our current open positions!