Starfish PRFAQ
Customer FAQs
1. What is the proposed solution?
We propose to define a wide spectrum of Catalog problems holistically within an extreme-regeneration formulation: Starfish will take a whole ASIN as input, retrieve relevant context from internal and external data sources, regenerate the whole ASIN end-to-end as output, and publish this output to the Catalog. Starfish will intelligently figure out how to get the ASIN to a perfect state (content, schema, structure), and will implicitly perform all the necessary tasks in a single generative pass. Starfish will operate on multiple classes of input, including the original ASIN itself, all associated SKU contributions on an ASIN, a crawled listing in the External Product Catalog (EPC), or any combination of these objects. We envision Starfish also discovering the best-in-class schema for the input ASIN and generating the output accordingly. Rather than pretraining a new base Large Language Model (LLM), we will build on existing base LLMs evaluated and approved for internal use (e.g., FLAN T5 and UL2), and continually assess our choice, rebasing our model as more alternatives, such as AWS Bedrock or ShopGPT, become available. We will develop methods to perform adaptation, retrieval augmentation, and fine-tuning of any base LLM to make it applicable to our desired tasks and strictly conform to them.
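To make the single-pass formulation concrete, the following is a minimal, illustrative sketch; the helper names and data shapes are hypothetical placeholders, not existing internal APIs. The original ASIN, its SKU contributions, and retrieved external listings are serialized into one prompt, and the LLM emits the entire regenerated ASIN in one generative pass.

```python
# Minimal sketch of the single-pass regeneration flow; all names and data
# shapes are hypothetical placeholders, not existing internal APIs.
from typing import Callable

def build_prompt(asin: dict, contributions: list, external: list) -> str:
    """Serialize all available inputs into one prompt for a single generative pass."""
    return "\n".join([
        "Regenerate this ASIN with complete, correct, and consistent attributes.",
        f"Current ASIN: {asin}",
        f"SKU contributions: {contributions}",
        f"External (EPC) listings: {external}",
        "Output the full regenerated ASIN as JSON.",
    ])

def regenerate_asin(asin: dict, contributions: list, external: list,
                    generate: Callable[[str], str]) -> str:
    """One end-to-end pass: the LLM decides implicitly which fixes to apply."""
    return generate(build_prompt(asin, contributions, external))
```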
2. What is in scope and out of scope for Starfish?
All Catalog tasks required to provide a best-in-class customer experience for an ASIN are in scope for Starfish, including completeness, correctness, and consistency of structured attributes; generation of engaging, informative, and concise titles; ASIN reconciliation; and schema mapping and schema evolution to build comprehensive product knowledge. Improvements to ASIN identity and relationship inference (e.g., variations) between products through product data quality improvements are in scope for Starfish as well. Improving the discovery experience outside of ASIN-level product knowledge and building new CX are out of scope for Starfish.
3. How will you measure success?
The primary success metric will be Detail Page Data Quality (DPDQ), which measures the overall product data quality of the Detail Page (DP) as perceived by customers, from Grade A to D, with Grade A meaning Amazon's DP is at par with or better than best-in-class websites. With Starfish, we will aim to achieve and maintain DPDQ Grade A or B for >90% of GV-weighted ASINs. The secondary measures will be the reduction in manual labelling required for the models we develop and time to market.
4. What is the primary benefit for Shoppers?
Presenting shoppers with complete, correct, consistent, and engaging ASIN attributes, accurate grouping of offers onto ASINs, and useful groupings of ASINs into families (such as variations and title sets) will improve their shopping, discovery, and buying experiences.
5. What is the primary benefit for our SPs?
Today, SPs are the primary source (>70%) of the structured data we acquire. As we improve our customer experiences by showing more structured data, we end up asking for more information from our SPs, which leads to friction. Starfish will generate ASIN attribute values from external resources (when available) and rich media (e.g., images and videos), thus reducing SPs' effort in listing creation.
6. What is the primary benefit for Amazon?
We believe that Amazon will benefit from Starfish in three major ways: 1) performing schema and data quality improvements simultaneously within a holistic LLM framework will lead to better-than-human accuracy and to automation of several tasks that are currently done manually; 2) Starfish will be able to generalize to other Catalog tasks beyond ASIN regeneration through in-context learning and prompt-tuning, significantly reducing labeling costs; and 3) consolidation of different data quality programs and models will lead to a simplified architecture and a reduction in maintenance and IMR costs in the long term. We expect ASCS Product Data quality enrichers to coexist with Starfish in the short/medium term, until the point that they can be entirely (or partially) consolidated into an LLM with higher performance and efficiency; we envision reutilizing enrichment metrics to help validate Starfish.
7. How will you mitigate the impact of model errors on customers and SPs?
We will have a three-pronged strategy. For high-GV ASINs (GV Band-A, which will be ~20% GV WW and <5M ASINs), we will review the model output for sensitive attributes with our operations team before updating the catalog. For Top Brands (~10K brands, 50M ASINs), we will work with the Selling Partner Services (SPS) team to notify SPs of the changes at the ASIN level in MYP (Manage Your Product) and provide a mechanism to appeal before updating the Catalog. For tail ASINs, we will apply changes to the catalog and use available mechanisms for SPs to appeal.
8. Can we not use currently available LLMs out of the box for ASIN regeneration?
LLMs demonstrate that advanced transformer architectures containing billions of parameters can transfer knowledge across domains and solve many problems simultaneously when trained on massive amounts of textual and visual data from diverse resources. For example, state-of-the-art LLMs, such as ChatGPT and Claude, can generate impressive results in zero- and few-shot settings for tasks from the Amazon Catalog. Unfortunately, the best-performing LLMs are not open source, and using proprietary LLMs behind service endpoints can pose significant risks to Amazon. Even if these models were available to us, our preliminary evaluations showed that they can fall short of the desired performance on certain tasks and generate answers that are plausible-sounding but factually incorrect, misleading, or not supported by the input context.
Internal FAQs
1. What is the difference between Starfish and other LLM initiatives in Amazon?
Given the fast-evolving landscape and the potential disruption brought forth by the popularization of ChatGPT, many parallel initiatives related to the development, application, or integration of LLMs have been launched across Amazon. We identified three such parallel initiatives in Amazon Stores that focus on product facts and product data: Listing LLM/Quicklist (SPS), Perfect Detail Page (Shopping), and ShopGPT/Nile (Search). Each of these initiatives has its specificities and targets different engineering products, CX, and deployment use cases. While all of these initiatives can rely on a common foundational product-data-aware core LLM (the core science artifact), the specific engineering product, CX requirements, and deployment scenarios will lead to different fine-tuning/adaptation/alignment/precision evaluation objectives for the base LLM.
2. What is the difference between Perfect Detail Page LLM and Starfish?
Perfect Detail Page (DP) LLM aims to improve catalog data by prioritizing attributes visible on the detail page, and therefore does not attempt to perform schema discovery to identify new attributes and backfill values for them. Starfish, on the other hand, tries to improve all attributes for ASINs, including discovering new attributes using external sources. Starfish will use a Retrieval Model to search for high-quality products (both in the Amazon catalog and in external catalogs) that are relevant to the context of a particular ASIN. It will use the Retrieval Model's results to holistically regenerate the ASIN, including attribute comparisons with the schema discovery feature, style normalization within context, and ASIN groupings.
Perfect DP LLM efforts aim to generate customer insights by summarizing reviews and Q&A, which is not in scope for Starfish.
3. What is the difference between Listing LLM powering QuickList and Starfish?
With QuickList, SPS aims to create a simplified listing experience where sellers can provide inputs in any format they choose and Listing LLM automatically generates high-quality structured product attributes and descriptive text. The following are the major differences between QuickList and Starfish. 1) QuickList operates within the scope of a single contribution, whereas Starfish can retrieve all relevant information, such as other contributions on the same ASIN, seller scores, and corresponding EPC records, for a given input ASIN to inform its decisions. 2) QuickList operates on a fixed schema, while we envision equipping Starfish with schema discovery capabilities. 3) QuickList does not update the Catalog directly, and its suggestions can be rejected by Sellers, whereas Starfish publishes its output to the Catalog and is subject to much higher precision requirements (95%-99% depending on the use case). 4) To provide an interactive listing experience, QuickList must operate under stricter latency constraints, while Starfish has a larger time budget for processing.
We expect QuickList to improve the quality of incoming contributions, which would in turn help Starfish attain its ambitious performance target.
4. How does Starfish relate to Nile/ShopGPT?
ShopGPT aims to power a new CX and shopping experience on the Amazon shopping website through a conversational chatbot. Starfish is not intended to be a chatbot that can answer open-ended questions and does not require human conversation intent alignment. To power the Nile experience, Search is developing ShopGPT, an LLM for conversational shopping ("What is the largest TV?", "Does this product contain flour?") with 11B parameters in Q2 and 20B by Q4, trained on catalog and (potentially) external data. Nile's current focus is to deliver the model in the CX experience for customers (shoppers), not for internal teams' use. However, longer term, ShopGPT is envisioned to be offered for tuning purposes on designated tasks. Given its factoid pattern of questions, we envision forming a collaboration and benchmarking an early Beta for attribute extraction tasks (completeness/correctness) and attribute validation. ShopGPT can potentially be used as the foundational base LLM for Starfish, and we will reevaluate and rebase our development (fine-tuning, adaptation) on the ShopGPT model as needed whenever it is ready and shared.
5. How does Starfish relate to AWS Bedrock?
AWS Bedrock does not target the Shopping/Stores domain. It is a general-purpose LLM and chatbot that will be offered to customers worldwide for various use-cases. Starfish will rely on a foundational LLM to bootstrap its development and to transfer its natural language skills, general intelligence, and reasoning capabilities to Catalog tasks. We will initially use an open-source option as the foundational LLM, but we expect to replace it with the new AWS Bedrock models once they become available. We have been onboarded with AWS Bedrock since Q4 2022 and have experimented with their older 20B-parameter model, which currently sits behind the test API. This model will be replaced by the new 26B- and 52B-parameter models (May) and the new 200B-parameter model (September). We are actively participating in the product requirement gathering with the AWS Bedrock Product Management team.
6. What resources do you plan to invest in 2023?
To move fast, we will create a virtual team consisting of 10 AS under a single-threaded ASM. This team will be supported by 2 L7 AS, 1 scholar, 2 TPM, 4 SDE, and 12 GCO auditors. We will fund most of the required HC from director-level organizations within ASCS by reprioritizing our 2023 commitments. We will continually adjust the number and composition of FTEs who participate part-time or full-time in Starfish based on our progress. We are asking for an additional budget of $500K in IMR for GPU resources and $100K for leveraging external LLM services, provided that they are approved by Amazon. The Starfish team will work on the tentative list of tracks below.
1. Data tooling: We will define data formats, serialization, and payloads for the model input and output (full ION, simplified JSON, ad-hoc natural language), develop strategies to optimally inject ASIN payloads into LLM prompts under prompt budget constraints (full ASIN, top-K relevant attributes, and task-specific attributes only), and build debugging and visualization tools (a minimal prompt-packing sketch appears after this list).
2. Data evaluation benchmark: We will curate a reference test benchmark suite across a range of Catalog tasks from historical labels (attribute prediction, attribute correctness, attribute validity, attribute normalization, attribute relevance and applicability, title quality, PDP quality, policy-related classification tasks, ontology classification tasks, duplicates, variations, etc.). This benchmark will contain full ASIN snapshots, task descriptions, the golden labels, and ML model baselines, and will be used in the continuous benchmarking of state-of-the-art LLMs and in model development (an illustrative record layout appears after this list).
3. SOTA LLM active survey, continuous benchmarking, and prompt engineering: We will explore, survey, and continuously benchmark against the latest state-of-the-art (SOTA) base LLMs that become available as our program progresses. These models include: 1) fully open-source, Apache 2.0-licensed models that we can host locally and over which we have full customization and training control (FLAN T5, UL2); 2) proprietary models onboarded to AWS SageMaker JumpStart Foundation Models (Cohere Command, AI21 Jurassic); 3) proprietary models hosted and served directly from the model provider's API, pending legal approval (Anthropic Claude, OpenAI GPT-4); and 4) internal models as they are released (ShopGPT, AWS Bedrock). We will actively monitor the releases of these models and quickly iterate on them with our suite of general Catalog task benchmarks.
4. Self-supervision training dataset curation: We will curate an ASIN regeneration self-supervision dataset, including self-supervision from the Amazon Catalog, Contributions Store, EPC, UMP (attribute metadata), and historical EPC-to-ASIN mappings.
5. Core modeling and self-supervision training: We will train the Starfish LLM by fine-tuning the base LLM via multiple self-supervision paradigms, including mixture-of-denoisers and extreme denoising (e.g., UL2), in-filling (e.g., InCoder), regular denoising, and causal language modeling. We will explore multiple fine-tuning strategies, including parameter-efficient fine-tuning (PEFT) and regular fine-tuning methods (a PEFT sketch appears after this list).
6. Retrieval Augmented Generation (RAG): This track involves the development of retrieval augmentation capabilities to enhance Starfish. We will research the best embeddings to use, build the Approximate Nearest Neighbors (ANN) index on the retrieval sources (EPC, SM crawls), augment the self-supervision with dual inputs and additional context (ASIN + ASIN, ASIN + EPC, ASIN + SKU), and leverage the EPC-to-ASIN mappings from the data track (a retrieval sketch appears after this list).
7. Reinforcement Learning from Human Feedback (RLHF): We will obtain high-quality labeled data from human auditors, set up labeling tooling and a training framework for RLHF, develop the Reward Model, and perform policy optimization. We expect this track to be crucial for enabling schema discovery, as expert humans will indicate to the model what a best-in-class listing for the given ASIN should look like. We will explore RLHF as used in ChatGPT as well as Reinforcement Learning from AI Feedback (RLAIF) as used in Anthropic's Constitutional AI methodology (a reward-model sketch appears after this list).
8. Model distillation: We will develop a lightweight model (Starfish Student) that can achieve the same performance as the main Starfish model (Starfish Teacher) to enable large-scale deployment under budget constraints (a distillation-loss sketch appears after this list).
9. Multi-modality enhancement: We will adapt text-only LLMs to multi-modality by adding adaptation weights without full pre-training. We will also actively survey multimodal model releases (open-source, AWS JumpStart FM, AWS Bedrock, and internal) and compare our model against them.
10. Instruction-tuning: This track involves building a general-purpose Catalog LLM, which can not only serve as a base LLM for Starfish but can also perform other Catalog tasks that do not fall under the ASIN regeneration formulation, such as policy classification tasks. We will follow the Fine-tuned Language Net (FLAN) approach: we will define a suite of comprehensive Catalog tasks, create a super-dataset that consists of multiple curated natural language instruction templates per task, and fine-tune the base LLM on it (an instruction-template sketch appears after this list).
11. Catalog LLM research: This track focuses on other LLM research initiatives on general Catalog tasks, including fully automating prompt engineering directly from a task description, class rationale, or SOP, including multimodality.
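For the data tooling track (1), the following is a minimal sketch of packing an ASIN payload into an LLM prompt under a token budget: attributes are added in priority order until a crude, character-based token estimate is exhausted. The attribute names, priorities, and token estimate are illustrative assumptions, not internal conventions.

```python
# Hypothetical sketch: greedily pack the highest-priority attributes that fit the budget.
import json

def pack_asin_payload(asin: dict, attribute_priority: list,
                      token_budget: int, tokens_per_char: float = 0.25) -> str:
    """Add attributes in priority order until the estimated token budget is hit."""
    packed, used = {}, 0.0
    for name in attribute_priority:
        if name not in asin:
            continue
        fragment = json.dumps({name: asin[name]})
        cost = len(fragment) * tokens_per_char   # rough token estimate
        if used + cost > token_budget:
            break
        packed[name] = asin[name]
        used += cost
    return json.dumps(packed)

# Example: title and brand are kept; the long bullet list is dropped as over budget.
payload = pack_asin_payload(
    {"item_name": "Stainless Steel Water Bottle 32 oz", "brand": "Acme",
     "bullet_points": ["Keeps drinks cold 24h"] * 50},
    attribute_priority=["item_name", "brand", "bullet_points"],
    token_budget=200,
)
```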
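For the evaluation benchmark track (2), one possible shape of a benchmark record is sketched below: a full ASIN snapshot, a task description, the golden label, and the baseline model's score. The field and task names are illustrative assumptions, not a fixed internal schema.

```python
# Hedged sketch of a single benchmark record; names and values are invented examples.
from dataclasses import dataclass
from typing import Optional

@dataclass
class BenchmarkRecord:
    task: str                      # e.g., "title_quality", "attribute_correctness"
    task_description: str          # natural-language description given to the LLM
    asin_snapshot: dict            # full ASIN payload at evaluation time
    golden_label: str              # human-verified answer
    baseline_model: str = ""       # name of the current production model
    baseline_score: Optional[float] = None

record = BenchmarkRecord(
    task="attribute_correctness",
    task_description="Decide whether the 'material' value is supported by the listing.",
    asin_snapshot={"item_name": "Acme 32 oz Steel Bottle", "material": "plastic"},
    golden_label="incorrect",
    baseline_model="production-correctness-v3",
    baseline_score=0.87,
)
```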
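For the core modeling track (5), here is a hedged sketch of parameter-efficient fine-tuning with LoRA via the Hugging Face peft library on a FLAN-T5 base. The checkpoint and hyperparameters are illustrative, not the project's chosen settings.

```python
# Sketch of PEFT (LoRA) setup; only small adapter matrices are trained and stored.
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

base_id = "google/flan-t5-base"                      # stand-in for the chosen base LLM
tokenizer = AutoTokenizer.from_pretrained(base_id)
model = AutoModelForSeq2SeqLM.from_pretrained(base_id)

lora = LoraConfig(r=16, lora_alpha=32, lora_dropout=0.05,
                  target_modules=["q", "v"],          # T5 attention projections
                  task_type="SEQ_2_SEQ_LM")
model = get_peft_model(model, lora)                   # only adapter weights are trainable
model.print_trainable_parameters()

# Fine-tuning then proceeds with a standard training loop / denoising objective
# on the self-supervision dataset; the base weights stay frozen.
```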
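For the RAG track (6), the sketch below illustrates the retrieval step with stand-in tooling (sentence-transformers embeddings and a FAISS inner-product index); the embedding model, index type, and example records are assumptions, not the project's final choices.

```python
# Hedged sketch: embed listings, build an ANN index, retrieve nearest external records.
import faiss
import numpy as np
from sentence_transformers import SentenceTransformer

encoder = SentenceTransformer("all-MiniLM-L6-v2")     # illustrative embedding model

epc_records = ["Acme 32oz steel bottle, vacuum insulated",
               "Generic plastic tumbler 16oz",
               "Acme bottle replacement lid"]
embeddings = encoder.encode(epc_records, normalize_embeddings=True)

index = faiss.IndexFlatIP(embeddings.shape[1])        # inner product on normalized vectors
index.add(np.asarray(embeddings, dtype="float32"))

query = encoder.encode(["Acme stainless steel water bottle 32 oz"],
                       normalize_embeddings=True)
scores, ids = index.search(np.asarray(query, dtype="float32"), k=2)
retrieved = [epc_records[i] for i in ids[0]]          # context appended to the ASIN prompt
```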
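For the RLHF track (7), the sketch below shows the standard pairwise (Bradley-Terry) reward-model objective on auditor preferences: the model learns to score the preferred regeneration above the rejected one. The toy linear reward head and placeholder embeddings are illustrative only.

```python
# Hedged sketch of a pairwise reward-model loss; inputs are placeholder embeddings.
import torch
import torch.nn as nn
import torch.nn.functional as F

class RewardModel(nn.Module):
    def __init__(self, embed_dim: int = 768):
        super().__init__()
        self.score = nn.Linear(embed_dim, 1)          # scalar "quality" score

    def forward(self, embedding: torch.Tensor) -> torch.Tensor:
        return self.score(embedding).squeeze(-1)

def preference_loss(model: RewardModel, chosen: torch.Tensor,
                    rejected: torch.Tensor) -> torch.Tensor:
    """-log sigmoid(r_chosen - r_rejected), averaged over the batch."""
    return -F.logsigmoid(model(chosen) - model(rejected)).mean()

model = RewardModel()
chosen, rejected = torch.randn(8, 768), torch.randn(8, 768)  # placeholder embeddings
loss = preference_loss(model, chosen, rejected)
loss.backward()   # the trained reward model then guides policy optimization
```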
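For the distillation track (8), a common formulation of the student objective is sketched below: a temperature-softened KL term against the teacher's output distribution blended with the usual task loss. The shapes, temperature, and mixing weight are illustrative.

```python
# Hedged sketch of a teacher-student distillation loss (Starfish Teacher -> Student).
import torch
import torch.nn.functional as F

def distillation_loss(student_logits: torch.Tensor, teacher_logits: torch.Tensor,
                      labels: torch.Tensor, temperature: float = 2.0,
                      alpha: float = 0.5) -> torch.Tensor:
    """Blend KL on temperature-softened logits with cross-entropy on the labels."""
    kl = F.kl_div(F.log_softmax(student_logits / temperature, dim=-1),
                  F.softmax(teacher_logits / temperature, dim=-1),
                  reduction="batchmean") * temperature ** 2
    ce = F.cross_entropy(student_logits, labels)
    return alpha * kl + (1 - alpha) * ce

# Toy example with random logits over a 32K-token vocabulary.
loss = distillation_loss(torch.randn(4, 32000), torch.randn(4, 32000),
                         torch.randint(0, 32000, (4,)))
```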
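For the instruction-tuning track (10), the sketch below shows FLAN-style instruction templates rendered into (input, target) pairs for the Catalog super-dataset; the task names and templates are invented examples, not the actual Catalog instruction set.

```python
# Hedged sketch of rendering instruction templates into training examples.
import random

TEMPLATES = {
    "title_quality": [
        "Is the following product title concise, informative, and engaging? Title: {title}",
        "Rate this product title for quality and explain briefly: {title}",
    ],
    "attribute_prediction": [
        "Given the product description, what is the value of '{attribute}'?\n{description}",
        "Extract the attribute '{attribute}' from this listing:\n{description}",
    ],
}

def make_instruction_example(task: str, fields: dict, target: str) -> dict:
    """Render one (input, target) pair by sampling a template for the task."""
    template = random.choice(TEMPLATES[task])
    return {"input": template.format(**fields), "target": target}

example = make_instruction_example(
    "attribute_prediction",
    {"attribute": "material", "description": "32 oz bottle made of 18/8 stainless steel."},
    target="stainless steel",
)
```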
7. What are your timelines, deliverables, and milestones?
In 2023, we plan to deliver Starfish for TK (~5-10) PTs to allow for quick iteration and learning before we scale. We will deliver Starfish incrementally in three main phases, described below.
Phase 1: Regenerate ASIN from the current ASIN contents (Q2 2023): In this phase, our primary objective will be to evaluate and benchmark SOTA LLMs for our use cases, including proprietary models behind external APIs, proprietary models available through AWS SageMaker JumpStart, and open-source models. As we are exploring proprietary models, we will limit the scope to publicly available data (i.e., existing data on the detail page) and align our experiments with the POCs conducted so far, which include title generation, identifying incorrectness/inconsistency across existing attributes, and identifying schema defects (e.g., irrelevant/inapplicable attributes).
By the end of this phase, we will: 1) finalize data tooling, downstream evaluation benchmark data, and self-supervision data; 2) complete the first round of SOTA LLM benchmarking; 3) complete Starfish core modeling and the first round of Starfish self-supervised fine-tuning; 4) kick off the RLHF tooling and labeling; and 5) kick off the instruction-tuning dataset preparation.
Success Metrics: We measure success through the progress made on individual components of the DPDQ metric for the TK PTs: (1) increase in the % of Grade A titles; (2) reduction in customer-perceived data inconsistency (CPDI); (3) reduction in schema defects.
Phase 2: Regenerate ASIN using the current ASIN contents and external content from EPC (Q3 2023): In this phase we will: (1) expand and augment Starfish with RAG by building the Retrieval Model and ANN index; we will utilize RAG to enrich ASINs with data from external sources obtained via the External Product Catalog (EPC); and (2) initiate expansion of the regeneration capabilities of Starfish in Phase 1 from TK PTs to TK PTs to understand the scaling challenges. We will also start comparing Starfish with the current custom models for each catalog enrichment task.
We will re-evaluate the base model SOTA LLM benchmarking with the expected release of AWS Bedrock 26B and 52B, and the possible access to the trained weights of the ShopGPT model checkpoint (we will load the weights onto our FLAN T5 model). We will rebase the Starfish model on these two models as needed. We will continue Starfish RLHF/RLAIF core modeling and labeling and start training the Reward Model. We will kick off the instruction-tuning of the FLAN T5 model on the Catalog instruction super-dataset.
Success Metrics: For the TK PTs chosen in Phase 1, we expect to improve on the following additional dimensions: (1) improvement in completeness rate; (2) improvement in normalization rates; (3) additional improvements in the CPDI metric and Grade A titles due to additional content from external sources.
Phase 3: Regenerate ASIN through additional schema discovery (Q4 2023): In this phase, we will experiment with the SOTA LLMs' capabilities to enrich the schema through prompting using data in EPC. We will utilize data in EPC to backfill values for the newly discovered attributes. We will align existing PTs in the Voyager scope to support Ontologist review and configuration of the newly discovered attributes.
We will re-evaluate the base model SOTA LLM benchmarking with the expected release of AWS Bedrock 200B, and with the completed Catalog instruction-tuned model. We will rebase the Starfish model on these two models as needed. We will also complete Starfish RLHF training.
Success Metrics: TK% of ASINs achieving DPDQ Grade A for the TK PTs. Improvement in schema completeness metrics for TK PTs.
Phase 4: 2024 and beyond: In 2024, we tentatively plan to focus on multimodal model development and scaling Starfish to the Amazon Catalog. We will finalize our plans, including additional h/c and budget requests for next year, in Q4 2023 based on our progress and learnings, and incorporate them into our OP2.
Deliverable: Scale Starfish to TK% GV ASINs at or above DPDQ Grade A or B.
Through all these phases, we will use the Catalog Experimentation Platform (CEP) to measure the business impact of catalog improvements made by Starfish.
8. What experiments have you done so far that would give us confidence in Starfish?
In the last two years, ASCS teams designed and trained generative language models (0.2B ~ 0.8B parameters), including GPT-2, BART (SAGE), and T5, and deployed them in several of our production systems, including attribute completeness and correctness, processing hundreds of millions of transactions per day. Although these models were much smaller in scale than ChatGPT (~175B), they provided us with our first insights into the capabilities of generative language models and how to adapt them to Catalog tasks.
Since the release of ChatGPT, we have experimented with open-source instruction-tuned LLMs in the zero-shot setting (i.e., without training), including FLAN-T5 XXL (11B parameters), mT0 (13B), and FLAN UL2 (20B). We explored applying state-of-the-art prompt engineering techniques (In-Context Learning, Chain-of-Thought, Self-Consistency, etc.) to select Catalog tasks, such as predicting the unit of measure type and the price-per-unit, inferring variation relationships, predicting attribute values and attribute correctness, and automatically synthesizing keyword rules for CPP models. While the initial results are not yet at the level of current production models trained on millions of training labels, they are promising considering that they required no labeling at all. For example, for the price-per-unit use-case, zero-shot prompt engineering of FLAN-T5 XXL (11B) led to 91% precision/recall, compared to 95% precision/recall with our current M5-based production system fine-tuned on 2M labels. We obtained evidence that the larger the LLM we use, the better its performance. We experimented with Retrieval-Augmented Generation (RAG) libraries (e.g., LangChain) to retrieve relevant information from our SOPs and include it in the prompt context before making a prediction. We also started using RLHF to generate parent titles for variation families, and training a Large Multimodal Model (LMM) of 3.5B parameters. Through these experiments, we have gained familiarity with various scientific techniques and components that we will use in Project Starfish.
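As an illustration of the zero-shot setup described above, the sketch below prompts a FLAN-T5 checkpoint through the Hugging Face transformers API; the prompt wording and example listing are hypothetical, not the exact prompts used in our experiments.

```python
# Hedged sketch of zero-shot prompting with FLAN-T5 for a unit-of-measure style question.
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

model_id = "google/flan-t5-xxl"                       # 11B checkpoint referenced above
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSeq2SeqLM.from_pretrained(model_id)

prompt = ("A product is listed as: 'Sparkling water, pack of 12 cans, 12 fl oz each, $8.99'. "
          "What is the unit of measure for the price per unit? Answer with one unit.")

inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=10)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```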
We also ran experiments on select use-cases approved by Legal using ChatGPT. These experiments include title enhancement, product classification, ontology definition, and some attribute prediction use-cases, where the ChatGPT model obtained better results than humans. To perform additional benchmarking activities, we onboarded with the Amazon API key administrator for Anthropic's Claude v1.2 model, as part of the Anthropic-AWS evaluation agreement. We started the onboarding process with the gated preview of AWS SageMaker JumpStart Foundation Models, which hosts ChatGPT-grade models from partner providers like Cohere Command and AI21 Jurassic 2 Jumbo in AWS escrow accounts, protecting data privacy. All our anecdotes with ChatGPT and other large proprietary models hinted at strong product knowledge and expert-level e-commerce domain knowledge in general, outperforming human judges on anecdote tests.
9. What are the risks for this proposal?
The proposed delivery schedule is aggressive and assumes exponential improvements in base LLMs (such as the timely delivery of AWS Bedrock), quick approval to onboard it, permission to use external LLMs, support for resource and IMR reallocation from existing OP2 plans, and additional investments as needed towards Starfish. To make quick progress towards our deliverables, we will choose an open-source LLM as our foundational model and continually compare it against other open-source and internal alternatives. While this approach increases our delivery speed, repeated baselining studies will result in extra h/c and IMR costs. Furthermore, we may have to throw away some of the training work we have invested in open-source LLMs if we later identify a more powerful alternative such as AWS Bedrock.
Our early exploration of ChatGPT and other recent generative model releases has demonstrated potential, but we cannot yet tell whether an LLM-based solution can scale to the entire catalog under latency and throughput constraints and be cost effective in the next 18 months. We may have to rely on current production models entirely or operate them side by side with the LLM-based solutions until cost efficiency is achieved. Furthermore, the h/c budget we have allotted for model distillation in this proposal towards cost efficiency may be an underestimate, and we may have to allocate more time and h/c resources.
We also face uncertainties about attaining human-level precision in Catalog tasks. While we expect Starfish to beat the production ML models based on our preliminary investigations with LLMs, this may cost more than what we outline in this proposal in terms of time, h/c, and annotation budget. We are aware that product "facts" generated by Starfish may be incorrect, and applying them to the catalog will require guardrails, governance, and human appeal mechanisms at scale. Although we have decades of experience in making automated updates to the Catalog, the new holistic framework, where a single model simultaneously performs multiple tasks, introduces additional challenges. For example, if we detect that Starfish precision drops below the target level for one of the tasks, we do not yet know if and how we should take a mitigation action against the other tasks. We are in the initial stages of Starfish; we will partner with other teams, learn from their and our own experiences, provide visibility through rhythm-of-business meetings, and refine our plans as we make progress.
Appendix 3: How Starfish operates
[Figure: diagram of how Starfish operates]
Appendix 4: How Starfish processes ASINs holistically versus traditional separate treatments
[Figure] Initial state: left, the original ASIN; right, the new ASIN after multiple tasks have been applied to it, creating an ill-posed problem and dependency conflicts (e.g., the two tasks "Title generation from attributes" and "Attributes generation from titles" conflict with each other).
[Figure] Target state: left, the original ASIN; right, the new ASIN after a single regeneration task, resolving all dependency conflicts.