How It's Made
7 steps to get started with large-scale labeling
How Instacart built a crowdsourced data labeling process (and how you can too!)
Organizations that develop technologies rooted in information retrieval, machine learning, recommender systems, and natural language processing depend on labels for modeling and experimentation. Humans provide these labels in the context of a specific task, and the data collected is used to construct training sets and evaluate the performance of different algorithms.
How do we collect human labels? Crowdsourcing has emerged as a practical way to collect labels at scale. Popular services like Amazon Mechanical Turk or Figure Eight are platforms where one can create tasks, upload data sets, and pay for the work. However, you need to do your homework before a data set is ready to be labeled. This is even more important in new domains where there are no existing training sets or other benchmarks… domains like grocery!
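For a sense of the mechanics, here is a minimal sketch of posting a single task programmatically to Mechanical Turk's sandbox via the boto3 SDK. The query, product, question markup, and every parameter value are illustrative assumptions, not recommended settings:

```python
import boto3

# Point at the MTurk sandbox so experiments don't spend real money.
mturk = boto3.client(
    "mturk",
    region_name="us-east-1",
    endpoint_url="https://mturk-requester-sandbox.us-east-1.amazonaws.com",
)

# A HIT's question is an XML payload; this HTML body is a placeholder
# for a real rating form.
question_xml = """\
<HTMLQuestion xmlns="http://mechanicalturk.amazonaws.com/AWSMechanicalTurkDataSchemas/2011-11-11/HTMLQuestion.xsd">
  <HTMLContent><![CDATA[
    <!DOCTYPE html>
    <html><body>
      <p>Query: <b>organic milk</b></p>
      <p>Product: <b>Whole Milk, 64 fl oz</b></p>
      <p>How relevant is this product to the query?</p>
      <!-- ... rating form and submit button go here ... -->
    </body></html>
  ]]></HTMLContent>
  <FrameHeight>450</FrameHeight>
</HTMLQuestion>
"""

hit = mturk.create_hit(
    Title="Rate how relevant a grocery product is to a search query",
    Description="Read a query and a product, then choose a relevance rating.",
    Keywords="search, relevance, rating",
    Reward="0.05",                    # USD paid per completed assignment
    MaxAssignments=3,                 # independent raters per item
    AssignmentDurationInSeconds=300,  # time allotted per assignment
    LifetimeInSeconds=86400,          # how long the HIT stays open
    Question=question_xml,
)
print(hit["HIT"]["HITId"])
```

Posting the task is the easy part; the checklist below is about everything that should happen before you get there.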
At Instacart, we are revolutionizing how people search, discover, and purchase groceries at scale. Every day, our users conduct millions of searches on our platform, and we return hundreds of millions of products for them to choose from. In such a unique domain, collecting human labels at scale has allowed us to augment Instacart search and generate best practices that we hope to share.
Introducing our “Pre-flight Checklist” for implementing large-scale crowdsourcing tasks. The checklist is independent of any specific crowdsourcing platform and can be adapted to any domain.
- Assess the lay of the land
- Identify your use cases
- Understand your product’s data
- Design your Human Intelligence Task (HIT)
- Determine your guidelines
- Communicate your task
- Maintain high quality
Before we jump in, a note on terminology: we use the terms rater, evaluator, and worker interchangeably to refer to a human who completes a task. In a task, humans are asked to answer one or more questions. This process is usually called labeling, evaluation, or annotation, depending on the domain.
1. Assess the lay of the land
The first step to approaching human evaluation is to understand what your organization has already done. Make sure to ask the following questions:
- Have we done any similar human evaluation tasks before?
- Do we have any human-labeled data?
If your organization has already collected human-evaluated data, make sure you understand the existing processes. Do you have vendors with whom you already work? Is there an established way to store human-labeled data? Existing approaches can influence how you design your crowdsourcing task, so it’s important to take stock. Understand what went well in previous projects and what lessons were learned.
If you’re starting from scratch, focus on an area the organization would like to know more about. For example, you may not know how relevant your top-k organic search results are and want a metric that quantifies it.
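To make that concrete, here is a minimal sketch of one such metric, precision@k, computed from binary human labels. The function and the sample judgments are hypothetical, not part of any particular platform:

```python
def precision_at_k(labels: list[int], k: int) -> float:
    """Fraction of the top-k results judged relevant.

    `labels` holds one binary human judgment per ranked result,
    ordered as the results were shown (1 = relevant, 0 = not).
    """
    return sum(labels[:k]) / k

# Hypothetical judgments for one query's top five results.
print(precision_at_k([1, 1, 0, 1, 0], k=5))  # 0.6
```

Averaged over a representative sample of queries, a number like this turns “how good are our results?” into something you can track over time.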
At Instacart, we had previously completed a few ad-hoc projects, but now that we are beginning to run large-scale projects, we are revising our methodology.
2. Identify your use cases
Creating human-evaluated data is often a costly and time-consuming process. Make sure to ask yourself:
- What do we want the human-evaluated data to accomplish? Is there a metric in mind?
- Why is human-evaluated data necessary here? Is this a critical project or a nice-to-have?
- Is this a one-off effort or part of a larger, continuous project?
Your data could be used as general training and evaluation data, as a way to quality test the output of your model, or as a reference collection to benchmark current and future models. Each of these use cases may require different approaches, which you should keep in mind.
Moreover, make sure that your use cases will genuinely benefit from human labeling. Crowdsourced tasks require proper setup and a budget, so reserve them for questions that truly need human input.
At Instacart, we wanted to measure the relevance of our search results. Labeled data helps us understand how relevant the products we show are to the query a user enters in the search bar. This data can be used to train and evaluate models and to measure the quality of our search results.
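As an illustration of how graded relevance labels feed evaluation, here is a minimal sketch of NDCG@k, a standard ranking metric. The 0–3 grade scale and the sample grades are assumptions for the example, not Instacart’s actual rating scheme:

```python
import math

def dcg(grades: list[int]) -> float:
    """Discounted cumulative gain for graded relevance labels (0 = irrelevant)."""
    return sum(
        (2**grade - 1) / math.log2(rank + 2)
        for rank, grade in enumerate(grades)
    )

def ndcg_at_k(grades: list[int], k: int) -> float:
    """DCG of the ranking as shown, normalized by the ideal reordering."""
    ideal = dcg(sorted(grades, reverse=True)[:k])
    return dcg(grades[:k]) / ideal if ideal > 0 else 0.0

# Hypothetical human grades (0-3) for the top 5 products of one query.
print(ndcg_at_k([3, 2, 3, 0, 1], k=5))  # ~0.96
```

Unlike precision@k, NDCG rewards putting the most relevant products highest, which matches how users actually scan a result page.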