AI Quality & Evaluation Engineer, AI product
Woven City Social Infrastructures & Platforms
Tokyo
hybrid
TEAM
Toyota is redefining what it means to move. We're challenging the current state of mobility by enhancing the movement of people, goods, information and energy. Centered around three core concepts - A Living Laboratory™, Human-Centered, and Ever Evolving City™ - Woven City serves as a test course for mobility to fulfill our purpose of well-being for all.
We do this by bringing together a diverse community of people with a shared passion for the future of mobility to co-create, develop and refine innovative products and services. This cross-section of social infrastructure, mobility, and people provides a unique opportunity for inventors, residents and visitors to interact seamlessly with new technologies throughout daily life in an environment that emulates a real city.
For more information about Woven City, please visit: https://www.woven-city.global/
WHO ARE WE LOOKING FOR?
We are looking for the first dedicated QA Engineer for our AI products: someone who understands and embraces the uncertainty and non-deterministic nature of LLM-based systems, and who is interested not only in testing quality, but also in turning quality assurance into a sustainable system.
In this role, you will go beyond executing individual test cases. A key mission is to standardize and automate evaluation workflows so that user feedback and evaluation results continuously feed into product quality improvements. You will work closely not only with product development teams, but also with MLOps and data engineers, bringing a QA perspective to redesign, implement, and operate the entire feedback loop.
In particular, we expect you to take ownership of building the first practical quality feedback loop led by QA, even in situations where feedback from real users is limited or not yet sufficiently established.
This position reports to the function leader and offers a hybrid work arrangement. It also involves business trips to Woven City several times a month.
RESPONSIBILITIES
- Define quality dimensions for AI products, including accuracy, consistency, safety, fairness, and UX
- Conduct scenario-based testing, exploratory testing, and red teaming
- Analyze LLM outputs to identify behavioral trends and failure patterns
- Design quality evaluation processes using user feedback such as logs, ratings, and inquiries
- Standardize evaluation processes and structure them in a reusable, scalable manner
- Automate quality evaluation workflows
- Collect and aggregate evaluation data
- Establish mechanisms for continuous and recurring quality checks
- Build and operate quality feedback loops in collaboration with development, MLOps, and data engineering teams
- Document quality issues and risks, and communicate them clearly to relevant teams
MINIMUM QUALIFICATIONS
- 3+ years of experience in QA or testing with products that use LLMs
- Experience in test design, including defining test perspectives and creating test cases
- Ability to evaluate systems with ambiguous specifications or non-unique “correct” answers by establishing clear judgment criteria
- Experience automating test or evaluation processes
- Experience collaborating with development teams to drive quality improvements
- Business-level proficiency in both English and Japanese
NICE TO HAVES
- Experience in scenario testing and red teaming
- Experience evaluating and analyzing user feedback and logs
- Proactive mindset with the ability to identify problems independently
- Basic understanding of natural language processing, particularly in the context of the Japanese language