
Starfish PRFAQ

Starfish: Regenerating perfect ASINs to power best-in-class customer experience


Amazon dramatically improves the shopping experience for its customers by leveraging generative AI to create the world's largest perfect product catalog
SEATTLE, March 30, 2025 (REUTERS) – Today, Amazon announced the launch of a new AI-powered product catalog that delights customers in every facet of their shopping experience, from product discovery all the way to fulfillment. Behind the scenes, this is made possible by a cutting-edge generative AI technology (Starfish) that builds on the latest scientific advances in Large Language Models (LLMs) to intelligently synthesize information across Amazon products, manufacturer websites, customer reviews, and experts to create a perfect catalog with the most comprehensive, consistent, and correct representation of every product.
9 "I almost never used Amazon to research or evaluate products", said Sean Reynolds, an Amazon Prime customer from Boston,
10 Massachusetts. "They don't have full product information, and often the information is confusing and conflicting with each other. Like
11 when I was searching for my Keurig machine - Amazon didn’t tell me whether it was dishwasher safe, what kinds of brew it can make
12 and showed different dimensions across the page so I couldn’t make out whether it would fit on my narrow counter. I really like how
13 Bed Bath Beyond presented the information on its page in a clean format, and I was able to understand all the features required. Being
14 a price conscious shopper, I decided to do a final price check on Amazon before buying, and to my surprise they had overnight changed
15 their entire Detail Page. It was not only as clean as Bed Bath and Beyond but also have more information organized cleanly. They even
16 inserted customer feedback for each feature, which made me realize that this model, even though dishwasher safe, wasn’t easy to
17 de-scale. Thus, ended up buying the newer model instead. With this new experience, next time maybe I can start with Amazon”
Amazon used to build and operate a multitude of ML models, each specific to a particular catalog problem space, such as schema mapping and evolution, enriching product data with external or internal content, and ensuring consistency of values across various attributes. These models were resource-intensive in science, engineering, and operations, and took a long time to deliver. Their independent execution often created incoherent and/or inconsistent product details. It also resulted in a heterogeneous quality bar across different ASINs, as various models would enrich them at different cadences. Inspired by recent developments in AI demonstrating LLMs' capability to learn and adapt across a wide range of tasks, Amazon adopted a radically different and holistic view to solve these catalog problems through a new custom generative LLM-based technology called Starfish: an ASIN regeneration machine. Starfish takes the whole original ASIN and associated external content as input and operates on it holistically to produce the most complete, consistent, correct, and normalized version possible, all in a single application of the model.
"We engineered the ultimate ASIN repair-and-enhance machine," says Robert Tekiela, Vice President in the Amazon Worldwide Stores division behind the program. "Under Starfish, we unified multiple programs that operated on different aspects of the problem, including completeness, correctness, consistency, normalization, attribute relevance, and schema mapping, among others." Robert adds: "In the last five years, we have achieved significant improvements in Catalog quality by steadily consolidating ML models that target closely related problems into a single, more powerful solution. We expect this convergence trend to continue, leading to even larger improvements in Catalog quality. With Starfish, we took our vision one big step further and built an AI-based solution that can perform all aspects of quality improvement as regeneration of a whole ASIN in a single pass and with unprecedented accuracy."
In our evaluations, Starfish exceeds the accuracy of human decisions. The improved accuracy of the entire Amazon Catalog leads to improvements in all dependent downstream features important to customers, ranging from comparison shopping to sharper pricing and more efficient fulfillment operations. It also simplifies solutions that accurately identify relationships between ASINs and group them appropriately.
The impact of Starfish on variation families was noted by Katie, an Amazon customer. "I always had a hard time buying shoes on Amazon because different sizes were split across different results. I had to click through multiple ASINs to find my size. What was worse, some sizes used the US size system while others used the EU one. I could never understand the logic behind it! Whatever it was, I'm so glad that Amazon fixed it, and now I can find all the sizes in a consistent format on a single page."
Amara Johnson, Senior Manager of Program Management in Amazon Fulfillment Technologies, noted: "Since the launch of Starfish, we have observed a sharp increase in the correctness of GTIN-keyed product information. This has significantly reduced the number of cases where a scan of a product resolves to multiple ASINs, forcing us to do manual verification. We have also seen a reduction in customer returns resulting from inaccurate mappings between GTINs and product details."
Amazon has started rolling out Starfish across all of its online stores worldwide. Customers everywhere can already enjoy the new perfect catalog experience by visiting the Amazon website in their country.

Customer FAQs
1. What is the proposed solution?
We propose to define a wide spectrum of Catalog problems holistically within an extreme-regeneration formulation: Starfish will take a whole ASIN as input, retrieve relevant context from internal and external data sources, regenerate the whole ASIN end-to-end as output, and publish this output to the Catalog. Starfish will intelligently figure out how to get the ASIN to a perfect state (content, schema, structure) and will implicitly perform all the necessary tasks in a single generative pass. Starfish will operate on multiple classes of input, including the original ASIN itself, all associated SKU contributions on an ASIN, a crawled listing in the External Product Catalog (EPC), or any combination of these objects. We envision Starfish also discovering the best-in-class schema for the input ASIN and generating the output accordingly. Rather than pretraining a new base Large Language Model (LLM), we will build on existing base LLMs evaluated and approved for internal use (e.g., FLAN T5 and UL2), and we will continually reassess this choice to rebase our model as more alternatives, such as AWS Bedrock or ShopGPT, become available. We will develop methods to perform adaptation, retrieval augmentation, and fine-tuning of any base LLM to make it applicable to, and strictly conformant with, our desired tasks.
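To make the single-pass formulation concrete, here is a minimal sketch of the retrieve-regenerate-publish loop described above. All names (retriever, llm, catalog, and their methods) are hypothetical placeholders for illustration, not actual Starfish or Amazon APIs.

```python
import json

def regenerate_asin(asin_record: dict, llm, retriever, catalog) -> dict:
    """Regenerate one ASIN in a single generative pass (illustrative only)."""
    # 1) Retrieve relevant internal/external context, e.g. SKU contributions
    #    on the same ASIN or crawled EPC listings.
    context_docs = retriever.search(asin_record, top_k=5)

    # 2) Build one prompt containing the whole ASIN plus retrieved context.
    prompt = (
        "Regenerate this product listing so it is complete, consistent, "
        "correct, and normalized. Return JSON with the improved listing.\n"
        f"ASIN: {json.dumps(asin_record)}\n"
        f"Context: {json.dumps(context_docs)}"
    )

    # 3) A single model call implicitly performs all repair/enrichment tasks.
    regenerated = json.loads(llm.generate(prompt))

    # 4) Publish the regenerated ASIN back to the Catalog.
    catalog.publish(regenerated)
    return regenerated
```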
2. What is in scope and out of scope for Starfish?
All Catalog tasks required to provide a best-in-class customer experience for an ASIN are in scope for Starfish, including completeness, correctness, and consistency of structured attributes; generation of engaging, informative, and concise titles; ASIN reconciliation; and schema mapping and schema evolution to build comprehensive product knowledge. Improvements to ASIN identity and relationship inference (e.g., variations) between products through product data quality improvements are in scope as well. Improving the discovery experience outside of ASIN-level product knowledge and building new CX are out of scope.
3. How will you measure success?
The primary success metric will be Detail Page Data Quality (DPDQ), which measures the overall product data quality of a Detail Page (DP) as perceived by customers, from Grade A to D, with Grade A meaning Amazon's DP is at par with or better than best-in-class websites. With Starfish, we will aim to achieve and maintain DPDQ Grade A or B for >90% of GV-weighted ASINs. The secondary measures will be the reduction in manual labelling required for model development and time to market.
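As a reading aid, the >90% GV-weighted target can be computed as the share of gross value (GV) carried by ASINs at Grade A or B. A minimal sketch, assuming illustrative field names (DPDQ grading itself is an internal system):

```python
def gv_weighted_grade_share(asins, passing=("A", "B")) -> float:
    """asins: iterable of dicts like {"gv": 1234.0, "dpdq_grade": "B"}."""
    total_gv = sum(a["gv"] for a in asins)
    passing_gv = sum(a["gv"] for a in asins if a["dpdq_grade"] in passing)
    return passing_gv / total_gv if total_gv else 0.0

# Target: gv_weighted_grade_share(catalog_sample) > 0.90
```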
4. What is the primary benefit for Shoppers?
Presenting shoppers with complete, correct, consistent, and engaging ASIN attributes, accurate grouping of offers onto ASINs, and useful groupings of ASINs into families (such as variations and title sets) will improve their shopping, discovery, and buying experiences.
5. What is the primary benefit for our SPs?
Today, SPs are the primary source (>70%) of structured data. As we improve our customer experience by showing more structured data, we end up asking for more information from our SPs, which creates friction. Starfish will generate ASIN attribute values from external resources (when available) and rich media (e.g., images and videos), thus reducing SPs' effort in listing creation.
6. What is the primary benefit for Amazon?
We believe that Amazon will benefit from Starfish in three major ways: 1) performing schema and data quality improvements simultaneously within a holistic LLM framework will lead to better-than-human accuracy and to the automation of several tasks that are currently done manually; 2) Starfish will be able to generalize to other Catalog tasks beyond ASIN regeneration through in-context learning and prompt-tuning, significantly reducing labeling costs; and 3) consolidating different data quality programs and models will lead to a simplified architecture and a reduction in maintenance and IMR costs over the long term. We expect ASCS Product Data quality enrichers to coexist with Starfish in the short/medium term, until they can be entirely (or partially) consolidated into an LLM with higher performance and efficiency; we envision reusing enrichment metrics to help validate Starfish.
7. How will you mitigate the impact of model errors on customers and SPs?
We will have a three-pronged strategy. For high-GV ASINs (GV Band A, covering ~20% of GV worldwide and <5M ASINs), we will review the model output for sensitive attributes with our operations team before updating the catalog. For Top Brands (~10K brands, 50M ASINs), we will work with the Selling Partner Services (SPS) team to notify SPs of the changes at the ASIN level in MYP (Manage Your Product) and provide a mechanism to appeal before updating the Catalog. For tail ASINs, we will apply changes to the catalog and use available mechanisms for SPs to appeal.
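The three prongs amount to a routing policy keyed on GV band and brand. A minimal sketch, with field names, action labels, and the TOP_BRANDS set as illustrative placeholders derived from the figures above:

```python
TOP_BRANDS: set = set()  # ~10K brands in the plan; populated elsewhere

def route_model_update(asin: dict) -> str:
    """asin: e.g. {"gv_band": "A", "brand": "Acme"} (hypothetical fields)."""
    if asin["gv_band"] == "A":          # High GV: ~20% of GV WW, <5M ASINs
        return "HUMAN_REVIEW_BEFORE_PUBLISH"
    if asin["brand"] in TOP_BRANDS:     # Top Brands: ~10K brands, ~50M ASINs
        return "NOTIFY_SP_IN_MYP_AND_ALLOW_APPEAL_BEFORE_PUBLISH"
    return "PUBLISH_AND_ALLOW_APPEAL"   # Tail ASINs
```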
8. Can we not use currently available LLMs out of the box for ASIN regeneration?
LLMs demonstrate that advanced transformer architectures containing billions of parameters can transfer knowledge across domains and solve many problems simultaneously when trained on massive amounts of textual and visual data from diverse resources. For example, state-of-the-art LLMs, such as ChatGPT and Claude, can generate impressive results in zero- and few-shot settings for tasks from the Amazon Catalog. Unfortunately, the best-performing LLMs are not open source, and using proprietary LLMs behind service endpoints can pose significant risks to Amazon. Even if these models were available to us, our preliminary evaluations showed that they can fall short of the desired performance on certain tasks and generate answers that are plausible-sounding but factually incorrect, misleading, or not supported by the input context.
Internal FAQs
1. What is the difference between Starfish and other LLM initiatives in Amazon?
Given the fast-evolving landscape and the potential disruption brought forth by the popularization of ChatGPT, many parallel initiatives related to the development, application, or integration of LLMs have been launched across Amazon. We identified three such parallel initiatives in Amazon Stores that focus on product facts and product data: Listing LLM/QuickList (SPS), Perfect Detail Page (Shopping), and ShopGPT/Nile (Search). Each of these initiatives has its own specificities and targets a different engineering product, CX, and deployment use cases. While all of these initiatives can rely on a common foundational product-data-aware core LLM (the core science artifact), the specific engineering product, CX requirements, and deployment scenarios will lead to different fine-tuning, adaptation, alignment, and precision evaluation objectives for the base LLM.
2. What is the difference between Perfect Detail Page LLM and Starfish?
Perfect Detail Page (DP) LLM aims to improve catalog data by prioritizing attributes visible on the detail page, and therefore does not attempt to perform schema discovery to identify new attributes and backfill values for them. Starfish, on the other hand, tries to improve all attributes for ASINs, including discovering new attributes using external sources. Starfish will use a Retrieval Model to search for high-quality products (both in the Amazon catalog and in external catalogs) that are relevant to the context of a particular ASIN. It will use the Retrieval Model's results to holistically regenerate the ASIN, including attribute comparisons with the schema discovery feature, style normalization within context, and ASIN groupings.
Perfect DP LLM efforts also aim to generate customer insights by summarizing reviews and Q&A, which is not in scope for Starfish.
3. What is the difference between Listing LLM powering QuickList and Starfish?
With QuickList, SPS aims to create a simplified listing experience where sellers can provide inputs in any format they choose, and Listing LLM automatically generates high-quality structured product attributes and descriptive text. The major differences between QuickList and Starfish are the following. 1) QuickList operates within the scope of a single contribution, whereas Starfish can retrieve all relevant information for a given input ASIN, such as other contributions on the same ASIN, seller scores, and corresponding EPC records, for its decisions. 2) QuickList operates on a fixed schema, while we envision equipping Starfish with schema discovery capabilities. 3) QuickList does not update the Catalog directly, and its suggestions can be rejected by Sellers, whereas Starfish publishes its output to the Catalog and is subject to much higher precision requirements (95%-99% depending on the use case). 4) To provide an interactive listing experience, QuickList must operate under stricter latency constraints, while Starfish has a larger time budget for processing.
We expect QuickList to improve the quality of incoming contributions, which would in turn help Starfish attain its ambitious performance targets.
4. How does Starfish relate to Nile/ShopGPT?
ShopGPT aims to power a new CX and shopping experience on the Amazon shopping website through a conversational chatbot. Starfish is not intended to be a chatbot that can answer open-ended questions, and it does not require human conversation intent alignment. To power the Nile experience, Search is developing ShopGPT, an LLM for conversational shopping ("What is the largest TV?", "Does this product contain flour?") with 11B parameters in Q2 and 20B by Q4, trained on catalog and (potentially) external data. Nile's current focus is to deliver the model in the CX experience for customers (shoppers), not for internal teams' use. However, longer term, ShopGPT is envisioned to be offered for tuning purposes on designated tasks. Given its factoid pattern of questions, we envision forming a collaboration, benchmarking an early Beta on attribute extraction tasks (completeness/correctness) and attribute validation. ShopGPT can potentially be used as the foundational base LLM for Starfish, and we will re-evaluate and rebase our development (fine-tuning, adaptation) on the ShopGPT model as needed whenever it is ready and shared.
5. How does Starfish relate to AWS Bedrock?
AWS Bedrock does not target the Shopping/Stores domain. It is a general-purpose LLM and chatbot offering that will be available to customers worldwide for various use cases. Starfish will rely on a foundational LLM to bootstrap its development and to transfer its natural language skills, general intelligence, and reasoning capabilities to Catalog tasks. We will initially use an open-source option as the foundational LLM, but we expect to replace it with the new AWS Bedrock models once they become available. We have been onboarded with AWS Bedrock since Q4 2022 and have experimented with their older 20B-parameter model, which currently sits behind the test API. This model will be replaced by the new 26B- and 52B-parameter models (May) and the new 200B-parameter model (September). We are actively participating in product requirement gathering with the AWS Bedrock Product Management team.
6. What resources do you plan to invest in 2023?
To move fast, we will create a virtual team consisting of 10 AS under a single-threaded ASM. This team will be supported by 2 L7 AS, 1 scholar, 2 TPMs, 4 SDEs, and 12 GCO auditors. We will fund most of the required HC between director-level organizations within ASCS by reprioritizing our 2023 commitments. We will continually adjust the number and composition of FTEs who participate part-time or full-time in Starfish based on our progress. We are asking for an additional budget of $500K in IMR for GPU resources and $100K for leveraging external LLM services, provided that they are approved by Amazon. The Starfish team will work on the tentative list of tracks below.
1. Data tooling: We will define data formats, serialization, and payloads for the model input and output (full ION, simplified JSON, ad-hoc natural language), develop strategies to optimally inject ASIN payloads into LLM prompts under prompt budget constraints (full ASIN, top-K relevant attributes, or task-specific attributes only; see the sketch after this list), and build debugging and visualization tools.
2. Data evaluation benchmark: We will curate a reference test benchmark suite across a range of catalog tasks from historical labels (attribute prediction, attribute correctness, attribute validity, attribute normalization, attribute relevance and applicability, title quality, PDP quality, policy-related classification tasks, ontology classification tasks, duplicates, variations, etc.). This benchmark will contain full ASIN snapshots, task descriptions, golden labels, and ML model baselines, and it will be used in the continuous benchmarking of state-of-the-art LLMs and in model development.
3. SOTA LLM active survey, continuous benchmarking, and prompt engineering: We will explore, survey, and continuously benchmark against the latest state-of-the-art (SOTA) base LLMs that become available as our program progresses. These models include: 1) fully open-source, Apache 2.0-licensed models that we can host locally and over which we have full customization and training control (FLAN T5, UL2); 2) proprietary models onboarded to AWS SageMaker JumpStart Foundation Models (Cohere Command, AI21 Jurassic); 3) proprietary models hosted and served directly from the model provider's API, pending legal approval (Anthropic Claude, OpenAI GPT-4); and 4) internal models as they are released (ShopGPT, AWS Bedrock). We will actively monitor the releases of these models and quickly iterate on them with our suite of general catalog task benchmarks.
4. Self-supervision training dataset curation: We will curate an ASIN regeneration self-supervision dataset, drawing self-supervision from the Amazon Catalog, the Contributions Store, EPC, UMP (attribute metadata), and historical EPC-to-ASIN mappings.
5. Core modeling and self-supervision training: We will train the Starfish LLM by fine-tuning the base LLM via multiple self-supervision paradigms, including mixture-of-denoisers and extreme denoising (e.g., UL2), in-filling (e.g., InCoder), regular denoising, and causal language modeling. We will explore multiple fine-tuning strategies, including parameter-efficient fine-tuning (PEFT) and regular fine-tuning methods.
6. Retrieval Augmented Generation (RAG): This track involves the development of retrieval augmentation capabilities to enhance Starfish. We will research the best embeddings to use, build the Approximate Nearest Neighbors (ANN) index on the retrieval sources (EPC, SM crawls), augment the self-supervision with dual inputs and additional context (ASIN + ASIN, ASIN + EPC, ASIN + SKU), and leverage the EPC-to-ASIN mappings from the data track.
7. Reinforcement Learning from Human Feedback (RLHF): We will obtain high-quality labeled data from human auditors, set up labeling tooling and a training framework for RLHF, develop the Reward Model, and perform policy optimization. We expect this track to be crucial for enabling schema discovery, as expert humans will indicate to the model what a best-in-class listing for a given ASIN should look like. We will explore RLHF as used in ChatGPT, as well as Reinforcement Learning from AI Feedback (RLAIF) as used in Anthropic's Constitutional AI methodology.
8. Model distillation: We will develop a lightweight model (Starfish Student) that can achieve the same performance as the main Starfish model (Starfish Teacher) to enable large-scale deployment under budget constraints.
9. Multi-modality enhancement: We will adapt text-only LLMs to multi-modality by adding adaptation weights without full pre-training. We will also actively survey multimodal model releases (open-source, AWS JumpStart FM, AWS Bedrock, and internal) and compare our model against them.
10. Instruction-tuning: This track involves building a general-purpose Catalog LLM, which can not only serve as a base LLM for Starfish but can also perform other Catalog tasks that do not fall under the ASIN regeneration formulation, such as policy classification tasks. We will follow the Fine-tuned Language Net (FLAN) approach: we will define a suite of comprehensive Catalog tasks, create a super-dataset that consists of multiple curated natural language instruction templates per task, and fine-tune the base LLM on it.
11. Catalog LLM research: This track focuses on other LLM research initiatives on general catalog tasks, including fully automating prompt engineering directly from a task description, class rationale, or SOP, including multimodality.
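As referenced in track 1 above, here is a minimal sketch of injecting an ASIN payload into a prompt under a token budget, falling back from the full ASIN to the top-K most relevant attributes. The tokenizer, relevance scores, and fallback order are assumptions for illustration, not the actual data tooling design:

```python
import json

def asin_to_prompt(asin: dict, relevance: dict, budget_tokens: int,
                   count_tokens) -> str:
    """relevance: attribute -> score; count_tokens: str -> int (the model's tokenizer)."""
    full = json.dumps(asin)
    if count_tokens(full) <= budget_tokens:
        return full  # the full ASIN fits within the prompt budget

    # Fallback: keep only the most relevant attributes that still fit.
    ranked = sorted(asin, key=lambda k: relevance.get(k, 0.0), reverse=True)
    kept: dict = {}
    for key in ranked:
        candidate = {**kept, key: asin[key]}
        if count_tokens(json.dumps(candidate)) > budget_tokens:
            break
        kept = candidate
    return json.dumps(kept)
```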
7. What are your timelines, deliverables, and milestones?
In 2023, we plan to deliver Starfish for TK (~5-10) PTs to allow for quick iteration and learning before we scale. We will deliver Starfish incrementally in the three main phases described below.
Phase 1: Regenerate ASIN from the current ASIN contents (Q2 2023): In this phase, our primary objective is to evaluate and benchmark SOTA LLMs for our use cases, including proprietary models behind external APIs, proprietary models available through AWS SageMaker JumpStart, and open-source models. As we are exploring proprietary models, we will limit the scope to publicly available data (i.e., existing data on the detail page) and align our experiments with the POCs conducted so far, which include title generation, identifying incorrectness/inconsistency across existing attributes, and identifying schema defects (e.g., irrelevant/inapplicable attributes).
By the end of this phase, we will: 1) finalize data tooling, downstream evaluation benchmark data, and self-supervision data; 2) complete the first round of SOTA LLM benchmarking; 3) complete Starfish core modeling and the first round of Starfish self-supervised fine-tuning; 4) kick off the RLHF tooling and labeling; and 5) kick off the instruction-tuning dataset preparation.
Success Metrics: We measure success through the progress made on individual components of the DPDQ metric for the TK PTs: (1) increase in the % of Grade A titles; (2) reduction in customer-perceived data inconsistency (CPDI); (3) reduction in schema defects.
Phase 2: Regenerate ASIN using the current ASIN contents and external content from EPC (Q3 2023): In this phase we will: (1) expand and augment Starfish with RAG by building the Retrieval Model and ANN index, utilizing RAG to enrich ASINs with data from external sources obtained via the External Product Catalog (EPC); and (2) initiate expansion of the Phase 1 regeneration capabilities of Starfish from TK PTs to TK PTs to understand the scaling challenges. We will also start comparing Starfish with the current custom models for each catalog enrichment task.
We will re-evaluate the base model SOTA LLM benchmarking with the expected release of AWS Bedrock 26B and 52B and with possible access to the trained weights of a ShopGPT model checkpoint (we will load the weights onto our FLAN T5 model). We will rebase the Starfish model on these two models as needed. We will continue Starfish RLHF/RLAIF core modeling and labeling and start training the Reward Model. We will kick off the instruction-tuning of the FLAN T5 model on the Catalog instructions super-dataset.
Success Metrics: For the TK PTs chosen in Phase 1, we expect to improve on the following additional dimensions: (1) improvement in completeness rates; (2) improvement in normalization rates; (3) additional improvements in the CPDI metric and Grade A titles due to additional content from external sources.
Phase 3: Regenerate ASIN through additional schema discovery (Q4 2023): In this phase we will experiment with SOTA LLMs' capabilities to enrich the schema through prompting using data in EPC. We will utilize data in EPC to backfill values for the newly discovered attributes. We will align existing PTs in Voyager scope to support Ontologist review and configuration of the newly discovered attributes.
We will re-evaluate the base model SOTA LLM benchmarking with the expected release of AWS Bedrock 200B and with the completed catalog instruction-tuned model. We will rebase the Starfish model on these two models as needed. We will also complete Starfish RLHF training.
Success Metrics: TK% of ASINs achieving DPDQ Grade A for the TK PTs; improvement in schema completeness metrics for TK PTs.
Phase 4: 2024 and beyond: In 2024, we tentatively plan to focus on multimodal model development and on scaling Starfish to the Amazon Catalog. We will finalize our plans, including additional h/c and budget requests for next year, in Q4 2023 based on our progress and learnings, and incorporate them into our OP2.
Deliverable: Scale Starfish to TK% of GV-weighted ASINs at or above DPDQ Grade A or B.
Through all these phases, we will use the Catalog Experimentation Platform (CEP) to measure the business impact of catalog improvements made by Starfish.
8. What experiments have you done so far that would give us confidence in Starfish?
In the last two years, ASCS teams designed and trained generative language models (0.2B ~ 0.8B parameters), including GPT-2, BART (SAGE), and T5, and deployed them in several of our production systems, including attribute completeness and correctness, processing hundreds of millions of transactions per day. Although these models were much smaller in scale than ChatGPT (~175B), they provided us with our first insights into the capabilities of generative language models and how to adapt them to Catalog tasks.
Since the release of ChatGPT, we have experimented with open-source instruction-tuned LLMs in the zero-shot setting (i.e., without training), including FLAN-T5 XXL (11B parameters), mT0 (13B), and FLAN UL2 (20B). We explored applying state-of-the-art prompt engineering techniques (In-Context Learning, Chain-of-Thought, Self-Consistency, etc.) to select Catalog tasks, such as predicting the unit-of-measure type and the price-per-unit, inferring variation relationships, predicting attribute values and attribute correctness, and automatically synthesizing keyword rules for CPP models. While the initial results are not yet at the level of current production models trained on millions of training labels, they are promising considering that they required no labeling at all. For example, for the price-per-unit use case, zero-shot prompt engineering of FLAN-T5 XXL (11B) led to 91% precision/recall, compared to 95% precision/recall with our current M5-based production system fine-tuned on 2M labels. We obtained evidence that the larger the LLM we use, the better its performance. We experimented with Retrieval-Augmented Generation (RAG) libraries (e.g., LangChain) to retrieve relevant information from our SOPs and include it in the prompt context before making a prediction. We also started using RLHF to generate parent titles for variation families and training a Large Multimodal Model (LMM) with 3.5B parameters. Through these experiments, we have gained familiarity with the various scientific techniques and components that we will use in Project Starfish.
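For context, the zero-shot experiments above can be reproduced in spirit with the public FLAN-T5 checkpoints via Hugging Face transformers; the reported numbers used FLAN-T5 XXL (11B), while the smaller checkpoint and prompt below are illustrative only, not the internal setup:

```python
from transformers import pipeline

# Small public checkpoint for illustration; the text reports results with XXL.
generator = pipeline("text2text-generation", model="google/flan-t5-small")

prompt = (
    "A product is '16 oz bag of roasted almonds, pack of 2' priced at $19.99. "
    "What is the price per ounce in dollars? Answer with a number only."
)
print(generator(prompt, max_new_tokens=10)[0]["generated_text"])
```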
We also ran experiments on select use cases approved by Legal using ChatGPT. These experiments include title enhancement, product classification, ontology definition, and some attribute prediction use cases, where the ChatGPT model obtained better results than humans. To perform additional benchmarking activities, we onboarded with the Amazon API key administrator for Anthropic's Claude v1.2 model, as part of the Anthropic-AWS evaluation agreement. We also started the onboarding process with the gated preview of AWS SageMaker JumpStart Foundation Models, which hosts ChatGPT-grade models from partner providers like Cohere Command and AI21 Jurassic 2 Jumbo in AWS escrow accounts that protect data privacy. All our anecdotes with ChatGPT and other large proprietary models hinted at strong product knowledge and expert-level e-commerce domain knowledge in general, outperforming human judges on anecdote tests.
9. What are the risks for this proposal?
The proposed delivery schedule is aggressive and anticipates exponential improvements in base LLMs, timely delivery of AWS Bedrock, quick approval to onboard it, permission to use external LLMs, support for resource and IMR reallocation from existing OP2 plans, and additional investments towards Starfish as needed. To make quick progress towards our deliverables, we will choose an open-source LLM as our foundational model and continually compare it against other open-source and internal alternatives. While this approach increases our delivery speed, repeated baselining studies will incur extra h/c and IMR costs. Furthermore, we may have to throw away some of the training work we have invested in open-source LLMs if we later identify a more powerful alternative, such as AWS Bedrock.
Our early exploration of ChatGPT and other recent generative model releases has demonstrated potential, but we cannot yet tell whether an LLM-based solution can scale to the entire catalog under latency and throughput constraints and be cost-effective in the next 18 months. We may have to rely on current production models entirely, or operate them side by side with the LLM-based solutions, until cost efficiency is achieved. Furthermore, the h/c budget for model distillation in this proposal, which targets cost efficiency, may be an underestimate, and we may have to allocate more time and h/c resources.
We also face uncertainties about attaining human-level precision in Catalog tasks. While we expect Starfish to beat the production ML models based on our preliminary investigations with LLMs, this may cost more than what we outline in this proposal in terms of time, h/c, and annotation budget. We are aware that product "facts" generated by Starfish may be incorrect, and applying them to the catalog will require guardrails, governance, and human appeal mechanisms at scale. Although we have decades of experience in making automated updates to the Catalog, the new holistic framework, in which a single model simultaneously performs multiple tasks, introduces additional challenges. For example, if we detect that Starfish precision drops below the target level for one of the tasks, we do not yet know whether and how we should take mitigation action for the other tasks. We are in the initial stages of Starfish; we will partner with other teams, learn from their experiences and our own, provide visibility through rhythm-of-business meetings, and refine our plans as we make progress.


Appendix 1: Glossary


A Large Language Model (LLM) is a type of artificial intelligence model that can understand and generate human-like text. These models are trained on vast amounts of data from diverse resources. LLMs can perform a wide range of applications, such as question answering, text summarization, text generation, and sentiment analysis. Examples of LLMs include OpenAI's GPT series, Google's T5, DeepMind's Sparrow, and Anthropic's Claude.
ChatGPT is a specific variation and application of GPT-3.5 and GPT-4 for conversational AI tasks. Although ChatGPT is roughly the same size as other state-of-the-art LLMs, researchers enhanced it with unique abilities by fine-tuning the original GPT-3 model first in a supervised fashion and then within an RLHF framework.
Reinforcement Learning from Human Feedback (RLHF) is an approach to training AI models, particularly for natural language understanding tasks. In RLHF, 1) humans provide language models with prompts and rank the generated text outputs, 2) a Reward Model is trained on this human feedback to guide the AI model's improvement, and 3) the AI model is fine-tuned using Reinforcement Learning techniques.
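Step 2 above is commonly implemented with a pairwise ranking loss on human preferences. A minimal PyTorch sketch of that standard loss (not a description of any specific internal implementation):

```python
import torch
import torch.nn.functional as F

def reward_model_loss(r_chosen: torch.Tensor, r_rejected: torch.Tensor) -> torch.Tensor:
    """r_chosen / r_rejected: reward scores for the human-preferred and
    dispreferred completions of the same prompt."""
    return -F.logsigmoid(r_chosen - r_rejected).mean()
```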
Instruction tuning refers to the process of fine-tuning an AI model on a curated dataset containing input-output pairs related to the target tasks, framed and templated as natural language instructions. Instruction tuning improves the performance of language models and makes them easier to prompt. Because fine-tuning adjusts the model's weights and parameters to better align its representations with the task at hand, it can be computationally expensive and time-consuming.
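For illustration, a single training pair in this instruction-tuned style might look like the following; the catalog task and wording are hypothetical, in the spirit of the FLAN templates referenced elsewhere in this document:

```python
example = {
    "instruction": "Given the product attributes below, decide whether the "
                   "attribute 'battery_capacity' is applicable. Answer yes or no.",
    "input": "product_type: coffee_mug; material: ceramic; capacity: 12 oz",
    "output": "no",
}
```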
Prompt engineering is the process of refining an input prompt to elicit more useful and accurate answers from a language model. Unlike instruction tuning, prompt engineering does not update model parameters, but the improvements in model performance may be limited to specific prompts.
Retrieval augmented generation (RAG) refers to enhancing an LLM with capabilities to query external data sources and retrieve information that can be used to improve model performance. The primary advantage of retrieval augmentation is that it enables LLMs to leverage knowledge that was not present in their training data (e.g., ChatGPT plug-ins).
Appendix 2: Starfish LLM Development

[Figure: Starfish LLM development; image not preserved in this extract.]
Appendix 3: How Starfish operates

[Figure: how Starfish operates; image not preserved in this extract.]
Appendix 4: How Starfish processes ASINs holistically versus traditional separate treatments

Initial state: Left, the original ASIN; right, the new ASIN after multiple tasks have been applied to it, creating an ill-posed problem and dependency conflicts (e.g., the two tasks "Title generation from attributes" and "Attribute generation from titles" conflict with each other).

[Figure: initial state; image not preserved in this extract.]

Target state: Left, the original ASIN; right, the new ASIN after a single regeneration task, resolving all dependency conflicts.

[Figure: target state; image not preserved in this extract.]