How It's Made
Supercharging ML/AI Foundations at Instacart
By Haixun Wang and Jagannath Putrevu
Machine Learning (ML) is one of the most impactful technology levers we have at Instacart to make our services better and help our customers, shoppers, retailers, and brand partners. ML investments at Instacart started as early as 2014, when we deployed our first set of models within Logistics for demand forecasting, picking- and driving-time prediction, and order batching. Use cases continued to grow from 2014 through 2020, with more applications in Search, Recommendations, Personalization, Ads, Fraud, and Growth to help all four sides of our marketplace.
Most of these ML models were trained either on laptops or custom infrastructure developed within each team with no common patterns, and sometimes it took more than a month to put a model into production. In early 2021, we started building an in-house ML platform to enable our teams to construct, deploy, serve, and manage ML models and features at scale, ensuring their efficacy, dependability, and security throughout the ML lifecycle. Soon after, we saw a significant increase in applications developed on our ML infrastructure, which greatly improved the velocity and sophistication of the models we developed and helped create a greater business impact.
However, we still had a lot of room to improve the robustness and maturity of our infrastructure to enable widespread adoption within the company. We also wanted to make it easier for ML models to be personalized at a user level to achieve our vision of building a personalized grocery storefront for every Instacart customer. Recognizing these challenges, we started a new initiative in H1 2023 focused on a number of essential projects, including improving our Feature Store and our ML productivity tool Griffin to strengthen our ML infrastructure foundations. We built a new adaptive experimentation platform called Axon to run bandit experiments at scale and developed bespoke signals through User Modeling to help with personalization.
In this post, we discuss how these investments have further strengthened the ML foundations at Instacart, personalized our customer experience, and prepared us to embrace the age of LLMs and Generative AI.
ML Development Cycle
The typical Machine Learning (ML) development life cycle involves several stages, and ML infrastructure plays a crucial role in each one.
- Data is crucial for developing effective ML models. Through User Modeling, we generate important user-level signals to enable personalized e-commerce.
- Through a Feature Store, we manage and share features for training ML models. In addition, real-time features from an online feature store support models deployed in production.
- Using the ML productivity tool (Griffin), we train, tune, store, and deploy ML models.
- Using the adaptive experimentation platform (Axon), we increase experimentation velocity and facilitate deep personalization via mechanisms such as multi-armed and contextual bandits.
Finally, ML models in production engage with customers to generate new data, which is used as signals to update or train new models, completing the loop. The following is a high-level illustration that emphasizes the most important aspects of our current ML infrastructure stack.
Today, we have hundreds of ML services developed using Griffin and served in production by the Feature Store. At the same time, new challenges abound: serving LLMs and generative AI applications, adapting to application requirements that demand ever more powerful models, and consolidating emerging but isolated ML initiatives. The future holds a great deal of work, from vision to innovation to adaptation.
Unlocking new capabilities
Over the course of H1 2023, we made a tremendous impact on several key fronts, including training and serving ML models, feature management, user modeling, and the adaptive experimentation framework. Thanks to these investments, we now have the following new capabilities:
Feature Management
- Improved Stability and Efficiency — The newly developed version of Feature Store has a redesigned architecture based on the lessons learned from managing our first version. The architecture leverages a flexible protobuf structure for feature storage that enables strong typing while permitting long term evolution at its lowest layer. At the next layer is a keying structure that allows us to customize the data layout across storage nodes and storage clusters. At the top layer, we have a metadata-aware proxy layer that can optimize queries based on the tooling of the lower layers. Additionally, by checkpointing data sets just prior to loading them for serving, we gain operational control and visibility that allow for easy and impactful infrastructure automation, such as cluster replacement and data rollbacks.
- Onboard new features in hours instead of days — We built a feature creation and discovery tool called Feature Marketplace that enables MLEs to define features in minutes, eliminating the bottlenecks of feature staging and indexing. The entire process can now be completed in less than an hour (vs. a few days previously), significantly improving the feature engineering velocity at Instacart.
- Fetch a large number of features simultaneously — Cross-table fetches can now easily be batched into a single request without client-side concurrency management. This allows our query planning system to optimize data retrieval on behalf of MLEs, resulting in up to 85% faster queries.
- Better compliance with integrated privacy controls — We implemented privacy annotations as a first-class feature, which enabled MLEs to avoid dealing with concerns around how their data is stored and instead focus on their core product work. This also enabled the Feature Store to cut its ingestion costs in half and reduce its database footprint by 50% by leveraging a proxy-protected, single-environment strategy that meets compliance and security standards.
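To make the cross-table batching idea concrete, here is a minimal sketch of a query planner that groups per-feature requests by table, so the store issues one query per table instead of one per feature. The function and table/feature names are illustrative, not our actual Feature Store API.

```python
from collections import defaultdict

def plan_batched_fetch(requests):
    """Group (table, entity_id, feature) requests into one batch per table.

    `requests` is a list of (table, entity_id, feature_name) tuples; the
    planner returns, for each table, the entity ids and feature columns it
    needs, so the backend is queried once per table rather than once per
    individual feature.
    """
    plan = defaultdict(lambda: {"entity_ids": set(), "features": set()})
    for table, entity_id, feature in requests:
        plan[table]["entity_ids"].add(entity_id)
        plan[table]["features"].add(feature)
    return dict(plan)

# Four per-feature requests collapse into two table-level queries.
requests = [
    ("user_features", "u1", "avg_basket_size"),
    ("user_features", "u1", "orders_30d"),
    ("user_features", "u2", "orders_30d"),
    ("item_features", "i9", "popularity"),
]
plan = plan_batched_fetch(requests)
```

A real planner would also weigh data layout (which shards hold which entities) when forming batches; the key point is that clients submit a flat list and the system owns the concurrency.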
Model Exploration & Training
- Spin up Notebook GPU Instances for model exploration in only a few minutes — For the past two years, Instacart developers have been able to quickly provision on-demand personal remote instances for development. Previously, this was limited to software engineers, and the only way for MLEs to access high-performance hardware for exploration was through SageMaker, which required multiple rounds of terraform PRs for first-time provisioning — a process that could take days or weeks, even for the most patient MLEs. This solution was not only difficult to set up, but also difficult to maintain, particularly when it came to dependency management. In H1, we extended our remote dev environments to support MLEs, allowing them to get up and running with a notebook in minutes, never having to worry about terraform, drivers, shared libraries, Jupyter kernels, or docker images.
- Train models with 10x more data — With the increased training capability, especially through distributed training via Ray, we are now able to train our machine learning models with 10x or even 100x more data. We are already seeing a significant business impact from this effort by boosting the performance of key ML models across Ads, Personalization, and Fulfillment.
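The core idea behind scaling training this way is data parallelism: split the training data into shards, compute each shard's gradient in parallel, and average. The toy sketch below illustrates that idea in pure Python with a one-dimensional linear model; it is a stand-in for the concept, not our Ray pipeline.

```python
from concurrent.futures import ThreadPoolExecutor

def shard_gradient(shard, w):
    # Gradient of mean squared error for a 1-D linear model y ~ w * x.
    n = len(shard)
    return sum(2 * (w * x - y) * x for x, y in shard) / n

def data_parallel_step(data, w, lr=0.01, workers=4):
    # Split the data into equal shards, compute each shard's gradient in
    # parallel, then average. With equal-sized shards this is mathematically
    # identical to the full-batch gradient.
    size = len(data) // workers
    shards = [data[i * size:(i + 1) * size] for i in range(workers)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        grads = list(pool.map(lambda s: shard_gradient(s, w), shards))
    return w - lr * sum(grads) / len(grads)

data = [(x, 3.0 * x) for x in range(1, 9)]  # true weight is 3
w = 0.0
for _ in range(200):
    w = data_parallel_step(data, w)
```

Frameworks like Ray handle what this sketch glosses over: shipping shards to remote workers, fault tolerance, and GPU placement — which is what makes the jump to 10x or 100x more data practical.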
Model Serving & Inference
- Reduce TensorFlow Inference Service latency by 85% — Previously, we served our TensorFlow models from Python-based RPC services. By moving these models to our new Go-based RPC system leveraging a compiled TensorFlow Serving appliance, we reduced P99 latency by 85%.
- Create end-to-end ML apps with a few clicks — Thanks to renovations on both the Model Training and Serving platforms, MLEs can now create an ML Application (Training, Evaluation, Inference) with our Griffin UI. This accelerates MLE dev productivity when it comes to configuring ML applications, which was previously difficult and often took weeks or even months to complete.
- Have easy access to an expanding set of user modeling signals — Our customers interact with Instacart across several touchpoints — from onboarding and cart building to search, checkout, and pre- and post-delivery — and they exhibit heterogeneity in their preferences across these touchpoints. User modeling signals enable data scientists and MLEs across the company to characterize our customers along a variety of dimensions, including how sensitive they are to fees or category-specific prices, how open they are to product discovery via ads, how they value different aspects of fulfillment quality, and so on. Teams across the company have integrated these signals into their applications and models, enabling a more personalized Instacart experience.
- Make experiments adaptive and ML-driven — The new adaptive experimentation platform Axon supports multi-armed (MAB) and contextual bandits (CB), which improve experimental velocity and enable a wide range of personalization-based applications. Our platform enables implementations of CB/MAB experiments in as little as a few hours, compared to ad hoc processes that can take weeks or months.
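To illustrate what a multi-armed bandit does differently from a fixed A/B test, here is a minimal epsilon-greedy sketch: instead of splitting traffic evenly for the whole experiment, it shifts traffic toward the better-performing arm as evidence accumulates. This is a generic textbook sketch, not Axon's implementation.

```python
import random

def epsilon_greedy_bandit(arm_probs, steps=10000, epsilon=0.1, seed=7):
    """Minimal epsilon-greedy multi-armed bandit.

    `arm_probs` are the (unknown to the agent) Bernoulli reward rates of
    each treatment arm. With probability `epsilon` the agent explores a
    random arm; otherwise it exploits its current best reward estimate.
    """
    rng = random.Random(seed)
    counts = [0] * len(arm_probs)   # pulls per arm
    values = [0.0] * len(arm_probs) # running mean reward per arm
    for _ in range(steps):
        if rng.random() < epsilon:
            arm = rng.randrange(len(arm_probs))                       # explore
        else:
            arm = max(range(len(arm_probs)), key=values.__getitem__)  # exploit
        reward = 1.0 if rng.random() < arm_probs[arm] else 0.0
        counts[arm] += 1
        values[arm] += (reward - values[arm]) / counts[arm]
    return counts, values

# Three arms with hidden conversion rates; the bandit concentrates traffic
# on the best arm rather than splitting it evenly.
counts, values = epsilon_greedy_bandit([0.05, 0.12, 0.30])
```

A contextual bandit extends this by conditioning the arm choice on user features, which is what enables the personalization applications described above.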
In future posts, we will delve deeper into some of the technical work that went into unlocking these new capabilities.
The path ahead
Broad and effective adoption of ML infrastructure is essential for realizing the full potential of ML at Instacart. Successful adoption is contingent on two factors: a) the ease of use and robustness of our systems and tools, and b) effective communication, coordination, and prioritization, along with a culture of sharing. Both will play a crucial role in driving company-wide adoption. We are actively working to increase adoption of our new tools and to onboard new applications across all our ML teams.
With our current ML infrastructure, vital tools and systems are in place to meet our immediate needs. On the other hand, the landscape of ML foundations is constantly evolving, driven by a pull from application needs and a push from underlying technologies. We plan to continue investing in our foundations to drive more productivity and impact. Examples of future investments include:
a) Expanding the current ML infrastructure — our current ML training and serving platform only supports the TensorFlow and LightGBM frameworks. We are planning to expand support to other popular frameworks, such as PyTorch and CausalML.
b) Developing vital infrastructure to meet the requirements of new applications — for example, embeddings are an efficient data representation that is widely used in many applications, but training and updating these embeddings is expensive. We are building a new embedding platform that will reduce infrastructure costs and boost productivity.
c) Harvesting and managing the unique data we have — part of the foundation of our business is our knowledge of food, groceries, and user preferences. We have invested much effort in acquiring such knowledge — for example, through the models we’ve created to generate complementary products, related queries, user personas, and more. The advent of generative AI also affords us the chance to distill knowledge from LLMs. We are exploring how we can share our unique knowledge in the most efficient manner for widespread adoption, to improve the customer experience and also help our shoppers, retailers, and brand partners.
Embracing the Age of LLMs and Generative AI
The advent of LLMs and Generative AI presents our business with opportunities never seen before. It is almost universally acknowledged that while it is easy to create a demo, it is exceedingly difficult to develop robust products using this technology. At the same time, there is a myth that generative AI is nothing more than prompt engineering; in actuality, over-reliance on manual prompt engineering is the greatest barrier to unlocking the potential of LLMs and Generative AI. Therefore, the mission of our LLM foundation pod is to develop techniques that enable the creation of robust applications on top of LLMs.
Prompt engineering is a crucial element in leveraging LLMs and Generative AI. As a result, addressing the inherent stochasticity of LLM outputs is of the utmost importance for building robust systems based on sound software engineering principles. We envision two foundational strategies to help harness this uncertainty. First, we may leverage LLMs to produce data that helps train predictive ML models, whose decision-making process can be managed by well-established quality-control mechanisms. Second, we may develop systems that learn prompts, supporting automatic prompt tuning, reducing the need for manual intervention, and increasing reliability and efficiency.
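The second strategy — learned prompts — can be boiled down to a search problem: score candidate prompt templates against labeled examples and keep the winner. The sketch below shows that loop in its simplest form; the `stub_llm` callable and the templates are purely illustrative stand-ins for a real model call, not anything we have shipped.

```python
def select_prompt(templates, dev_set, llm):
    """Pick the template whose completions best match labeled dev examples.

    `llm` is any callable mapping a prompt string to a completion string;
    in a real system it would invoke a hosted model. This is automatic
    prompt tuning in its simplest form: search candidates, score against
    ground truth, keep the best.
    """
    def accuracy(template):
        hits = sum(
            llm(template.format(input=x)).strip() == y
            for x, y in dev_set
        )
        return hits / len(dev_set)
    return max(templates, key=accuracy)

# Deterministic stub standing in for an LLM: it returns a bare one-word
# label only when the prompt explicitly asks for one word.
def stub_llm(prompt):
    text = prompt.splitlines()[-1]
    label = "positive" if ("love" in text or "great" in text) else "negative"
    return label if "one word" in prompt else f"The sentiment is {label}."

templates = [
    "Classify the sentiment:\n{input}",
    "Answer with one word, positive or negative:\n{input}",
]
dev_set = [("I love this!", "positive"), ("This is awful.", "negative")]
best = select_prompt(templates, dev_set, stub_llm)
```

Real prompt-tuning systems search far larger candidate spaces and use softer scoring than exact match, but the control loop — propose, evaluate, select — is the same, and it replaces manual trial-and-error with a measurable process.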
ML foundations serve as the critical infrastructure that supports all ML initiatives in a company. Our investments have substantially fortified the core elements of the ML foundations, accelerating our transformation into an AI-first company.
As we set our sights on a future shaped by rapid advances in ML and AI, it is clear that the development of ML foundations is not just about understanding current capabilities or perfecting existing systems. It demands strategic foresight and the ability to anticipate potential shifts, grounded in a thorough understanding of how ML systems operate in practice.
We are excited about the prospects of becoming an AI-first company, thriving on the combination of in-depth customer insights, comprehensive knowledge of grocery, food, and health, and the never-ending breakthroughs in ML/AI. We fully plan to continue investing in this area and sharing some of our learnings with our partners and broader ML/AI community.
Building these foundations was a joint effort across our teams, from ML infrastructure, core infrastructure, ML application engineers, and a mix of backend engineers from across the company. In particular, significant technical and leadership contributions have been made to these efforts by Ada Cohen, Adway Dhillon, Angadh Singh, Brian Lin, Cameron Taylor, Changyao Chen, David Vengerov, Jacob Jensen, James Matthews, Guanghua Shu, Han Li, Keith Lazuka, Kevin Lei, Lan Yu, Li Tan, Matt Crane, Pradeep Mani, Rajesh Mosur, Rajpal Paryani, Richard Gong, Sahil Khanna, Sharath Rao, Tilman Drerup, Tristan Fletcher, Vaibhav Agarwal, Walter Tuholski, Wideet Shende, Zain Adil, and Zihan Li.