How It's Made
Instacart’s Item Availability Architecture: Solving for scale and consistency
By Michael Prescott
This blog post is the final segment in a three-part series outlining our approach to overcoming inventory challenges through product design, machine learning, and engineering. If you want to catch up, here are Part-1 and Part-2 of the series.
In the first part of this series, we explored the dramatic shift in Instacart’s item availability, especially during the pandemic, and how we harnessed real-time availability predictions to keep pace with the rapidly fluctuating in-store inventory. In our second part, we delved into the machine-learning artistry behind real-time predictions for hundreds of millions of items. In this final segment, we share the engineering infrastructure we built to scale real-time predictions and foster faster experimentation.
Building the infrastructure to serve the Real-Time Availability (RTA) predictions shared in Part-2 required us to address several challenges:
- Support fetching scores via Remote Procedure Call (RPC): To use real-time predictions, callers needed to fetch scores through a new RPC created by the Machine Learning (ML) team.
- Support low-latency use cases: Because the scores are used for filtering at the retrieval stage, we needed fast, bulk fetching of scores there. An RPC-based approach would be too slow for these needs, so we needed an alternative on top of the real-time scoring API.
- Support high-consistency use cases: Consistency between the score and the availability shown across all surfaces was vital, especially when fetching scores to drive UI changes. Telling a customer an item is unavailable while showing it as available on another surface would erode customer trust.
To solve these challenges, we implemented two methods for ingesting the scores generated by the ML models into our DB storage for fast, bulk retrieval. For both methods, we chose to store the scores in a DB and keep them updated to address the low-latency use cases. This approach puts the scores closer to retrieval, so they can be fetched via SQL joins.
- Full Sync: The ML Availability Scoring Service updates a table in Snowflake multiple times a day with the refreshed availability scores of items. The DB ingestion workers read the Snowflake table periodically and upsert the availability scores for refreshed items to ensure no scores are stale.
- Lazy Score Refresh: Scores are refreshed on demand when an item appearing in search results has a score that exceeds the allowable age. To reduce the load on online systems, the refresh happens in background jobs, and Kinesis is used to aggregate the updates before they are ingested into the DB.
Lazy score refresh and full sync refresh architecture
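The full-sync path can be sketched in a few lines of Python. This is a minimal illustration, not Instacart's actual implementation; the row fields, key shape, and the dict standing in for the serving DB are all hypothetical:

```python
# Hypothetical sketch of a full-sync ingestion worker: read refreshed scores
# from the Snowflake export and upsert them into the serving store so that
# retrieval-stage SQL can join on them. A plain dict stands in for the DB.

def full_sync(snowflake_rows, serving_db):
    """Upsert refreshed availability scores into the serving store."""
    for row in snowflake_rows:
        key = (row["retailer_id"], row["item_id"])
        serving_db[key] = {
            "score": row["availability_score"],
            "refreshed_at": row["scored_at"],
        }
    return serving_db
```

Because the operation is an upsert keyed on the item, re-running the sync is idempotent: a repeated row simply overwrites the previous score rather than duplicating it.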
With hundreds of millions of items, performing a full sync frequently is neither scalable nor efficient, as most scores don't change enough to meaningfully alter the customer experience.
We observed that items that are searched or purchased more frequently have stronger signals from shoppers and thus more frequent score changes.
Based on that insight, we used search results as the trigger for the lazy refresh. This allowed us to update scores more frequently while reducing the ingestion load by two-thirds. Because the scores have a temporal component, we still needed the full sync multiple times a day to ensure that tail items and out-of-stock items that were originally filtered out had an opportunity to get updated.
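The lazy-refresh trigger can be sketched as a staleness check over the items surfaced in a search result. The age threshold, field names, and the list standing in for the Kinesis stream are illustrative assumptions, not the production values:

```python
from datetime import datetime, timedelta

# Illustrative staleness threshold; the real allowable age is a tuning choice.
MAX_SCORE_AGE = timedelta(hours=6)

def collect_stale_items(search_results, now):
    """Return the IDs of surfaced items whose stored score exceeds the
    allowable age and should be re-scored by a background job."""
    return [
        item["item_id"]
        for item in search_results
        if now - item["score_refreshed_at"] > MAX_SCORE_AGE
    ]

def enqueue_for_refresh(stale_item_ids, stream):
    # In production the updates are aggregated through Kinesis before being
    # ingested into the DB; a plain list stands in for the stream here.
    stream.extend(stale_item_ids)
```

Keeping the check on the search path cheap (a timestamp comparison) and pushing the actual re-scoring to background consumers is what keeps the online systems unaffected.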
Multi-Model Experimentation Framework
To accelerate our ML model testing using the new RTA infrastructure, we developed a new experimentation framework that allows us to evaluate multiple models in a scalable way.
Framework for syncing scores and testing multiple real-time availability ML models
The framework is configuration-driven to reduce the need for intensive engineering support, giving the ML team a faster way to conduct experiments. Scores generated by distinct models are synchronized to the database concurrently, ensuring that all models are assessed with no differences in update frequency.
The framework has three key components:
- DB Column per Model: Each model's score is synchronized to a dedicated column within the database table throughout the experimentation phase. This eliminates the need for engineering to modify SQL to join new tables.
- Model-Column Mapping: A service-level configuration system maps the model version to its corresponding unique column. This config is utilized by both the full-sync and lazy score refresh systems to fetch data for all the versions in the configuration.
- Experiment-Column Mapping: A/B experiments with scores linked to specific columns/ML models are now easily conducted with a unique feature flag associated with each column. Additionally, the framework supports defining a designated column for default/control experience.
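The three components above can be illustrated with a small configuration sketch. The model names, column names, and flag names below are invented for illustration; only the shape of the mapping reflects the framework described:

```python
# Hypothetical service-level config: each model version maps to a dedicated
# score column (Model-Column Mapping) ...
MODEL_COLUMN_CONFIG = {
    "rta_model_v3": "availability_score_a",
    "rta_model_v4": "availability_score_b",
}

# ... and each experiment feature flag maps to a column, with a designated
# column for the default/control experience (Experiment-Column Mapping).
EXPERIMENT_COLUMN_CONFIG = {
    "default": "availability_score_a",
    "rta_v4_experiment_flag": "availability_score_b",
}

def resolve_score_column(enabled_flags):
    """Pick the score column for a request based on its active feature flags."""
    for flag, column in EXPERIMENT_COLUMN_CONFIG.items():
        if flag != "default" and flag in enabled_flags:
            return column
    return EXPERIMENT_COLUMN_CONFIG["default"]
```

Launching a new model then amounts to a config change: add a model-to-column entry, let the sync processes populate the column, and gate the experiment behind its flag.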
This framework has greatly reduced the engineering work for running new ML model experiments in parallel. To launch a new ML model, the service-level config is updated to add a mapping between the new model version and a DB column. The full and lazy sync processes then use this information to synchronize the scores for the new version during subsequent cycles. Once the full sync has run at least once, a feature flag associated with the new model is used to run an experiment and determine the optimal operating point. When an experiment is concluded, the mapping is updated to assign the winning ML model's column to the default/control experience. The previous column then becomes available for syncing and evaluating another ML model.
While ML models consider hundreds of features, there are specific segments where, due to business necessity, a separate operating point needs to be chosen on the selection-found rate curve (covered in Part-1). This necessitates having different thresholds (operating points) for different segments when filtering items in the SQL layer. This complexity grew combinatorially as we experimented with new ML models and optimized more segments. For example, experimenting with three ML models while optimizing for three product categories, four retailers, two regions, and eight user segments leads to 576 combinations, each requiring its own threshold value.
total_combinations (576) =
    num_models (3) * num_product_categories (3) * num_retailers (4) *
    num_inventory_areas (2) * num_user_buckets (8)
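The blow-up is easy to reproduce; the segment values below are placeholders, only the counts match the example:

```python
from itertools import product

# Placeholder segment values; only the cardinalities match the example above.
models = ["m1", "m2", "m3"]
product_categories = ["c1", "c2", "c3"]
retailers = ["r1", "r2", "r3", "r4"]
inventory_areas = ["east", "west"]
user_buckets = [f"bucket_{i}" for i in range(8)]

# Every combination needs its own threshold value.
combinations = list(product(models, product_categories, retailers,
                            inventory_areas, user_buckets))
assert len(combinations) == 3 * 3 * 4 * 2 * 8  # 576
```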
To solve the combinatorial problem originating from having a threshold value for each combination, we introduced the “Threshold Resolver” and “Delta Framework.”
The Threshold Resolver determines the base thresholds for the various ML models under experimentation. These thresholds are optimal points on the Selection-Found Rate curve specific to each ML model, often found through experimentation, and are also regularly adjusted as marketplace conditions change.
The Deltas Framework applies fixed deltas, both positive and negative, to the base thresholds. The final thresholds are computed at run time by applying all applicable deltas on top of the base thresholds; when multiple segments are eligible, their deltas are stacked and applied together.
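A minimal sketch of the two pieces working together, with made-up base thresholds, segments, and delta values:

```python
# Hypothetical base thresholds per model (the Threshold Resolver's output)
# and a table of segment deltas (the Deltas Framework). All values invented.
BASE_THRESHOLDS = {"rta_model_v3": 0.50}

DELTAS = [
    # (segment key, segment value, delta to stack onto the base threshold)
    ("product_category", "baby_formula", -0.10),  # e.g. loosen during a shortage
    ("retailer", "retailer_42", +0.05),
]

def resolve_threshold(model, request_segments):
    """Compute the run-time threshold: start from the model's base threshold
    and stack every delta whose segment applies to this request."""
    threshold = BASE_THRESHOLDS[model]
    for key, value, delta in DELTAS:
        if request_segments.get(key) == value:
            threshold += delta
    return threshold
```

The modularity falls out of the structure: adjusting one delta row touches only the requests matching that segment, and swapping in a new model only requires a new base threshold, not a re-tuned delta for every segment.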
Illustration of how deltas are applied to result in different thresholds for segments
This approach offers modularity, allowing for independent adjustment of the thresholds or deltas without impacting other components. It also allows for easy experimentation with ML models without the need to find optimal points for all segments, but rather just the base thresholds. Backed by DB values, the Deltas Framework allows for quick and precise action on specific segments with bite-sized complexity. For example, during the baby formula shortage, when the found rate dropped significantly for that product category, our team adjusted the deltas for that specific category, shifting the Selection-Found Rate curve's equilibrium without making corresponding changes to other segments or thresholds.
While the Deltas provided the flexibility to optimize individual segments along the Selection-Found Rate curve, it became apparent that as the number of segments increased, these independent adjustments did not always lead to the global optimum. Moreover, as the marketplace environment evolved, a growing number of values needed optimization to attain the optimal operating point. In simple terms, our system transitioned from a basic bicycle with a handful of levers to a Formula One car with numerous levers. Although we gained speed, it became essential to implement automation to manage these levers effectively.
Multi-Segment Optimization through feedback loop
To automate the management of these levers, we developed Multi-Segment Optimization. This approach continuously monitors the selection displayed to customers, the found rate of that selection, and user behavior (such as order rate and retention), and recommends fine adjustments (deltas) for each segment. By analyzing customer-level metrics, the final outcome of the thresholds and deltas, the system can adjust deltas independently for each segment, ultimately reaching the optimal operating point on the global Selection-Found Rate curve.
The optimization function for each segment can vary, ranging from bias adjustment (average difference between found rate and prediction score) to more sophisticated techniques aimed at improving long-term customer retention. Fine-tuning adjustments are made offline, enabling the observation of patterns over an extended period and the implementation of complex approaches like Contextual Bandits.
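The simplest of these optimization functions, bias adjustment, can be sketched as an offline update. The step size, sign convention, and data shapes here are assumptions for illustration, not the production logic:

```python
# Hedged sketch of a bias-adjustment feedback step: nudge a segment's delta
# by a fraction of the average gap between observed outcomes and predictions.

def bias_adjusted_delta(found, predicted_scores, current_delta, step=0.5):
    """found: list of 0/1 found outcomes for a segment;
    predicted_scores: the model's matching availability scores.
    Returns the recommended new delta for the segment (illustrative)."""
    found_rate = sum(found) / len(found)
    mean_score = sum(predicted_scores) / len(predicted_scores)
    bias = found_rate - mean_score
    # Move part of the way toward correcting the bias; running this
    # periodically offline lets the adjustment converge over many cycles.
    return current_delta + step * bias
```

Running offline also leaves room for the richer approaches mentioned above, such as Contextual Bandits, since the feedback loop can observe behavior over long windows before committing a change.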
Predicting item availability is a unique and complex challenge faced by companies like Instacart, involving determining millions of items’ physical availability across 80,000+ retail locations. Each location operates with its unique inventory management systems, making it a challenge to ensure that every item in a customer’s order can be fulfilled. Despite the difficulties, addressing this challenge and achieving a high “good found rate” is crucial for customer retention.
Matching Pre-COVID quality
COVID was a pivotal point in Instacart's journey, when customer demand surged and tipped the balance of the selection-found rate operating point. Over the last couple of years, powered by the new ML techniques and the engineering systems outlined earlier, Instacart brought the quality of orders back to pre-COVID levels at a new scale.
% of Orders with all items found (scaled to illustrate the trends)
Setting Customer Expectations
Setting customer expectations throughout the journey, elaborated in the first part, has helped customers make informed decisions when building their baskets. Because of these efforts, there has been a steady increase in customer orders containing more high-in-stock items, leading to higher found rates.
% of ordered items that are high in stock (scaled to illustrate the trends)
The Thresholds and Deltas framework allowed multiple experiments on new ML models to be conducted in parallel while optimizing the Selection-Found Rate operating point for different segments. We saw a 6X increase in experiments run using the new framework.
Although we have made significant progress, there is still work to be done. As Robert Frost wrote, there are "miles to go before I sleep." At Instacart, we recognize the road ahead in solving the availability problem and exceeding customer expectations. The advancements we have achieved in product experience, ML capabilities, and engineering infrastructure over the past year have accelerated the discovery of new solutions and unlocked the potential for breakthrough innovations. We remain committed to leveraging technology and driving progress in solving this complex challenge.
We would like to thank the following key members who were instrumental in building this engineering system.
- ML: Allan Stewart, Jack He, Shishir Kumar, Yiming Lu
- ML Infra: Guanghua Shu
- Eng: Damon Ding, Michael Prescott, Yanhua Liu, Thomas Cheng, Frank Chung, Jason Shao, James Liew
- Eng Infra: Marco Montagna, Ankit Mittal