GEO: Generative Engine Optimization
Figure 1: Our proposed Generative Engine Optimization (GEO) method optimizes websites to boost their visibility in Generative Engine responses. GEO's black-box optimization framework enables the owner of the pizza website, which originally lacked visibility, to optimize the site and increase its visibility in Generative Engine responses. Further, GEO's general framework allows content creators to define and optimize custom visibility metrics, giving them greater control in this new, emerging paradigm.
creators to control and understand how their content is ingested and portrayed.

In this work, we propose the first general creator-centric framework to optimize content for generative engines, which we dub Generative Engine Optimization (GEO), to empower content creators to navigate this new search paradigm. GEO is a flexible black-box optimization framework for optimizing web-content visibility under proprietary and closed-source generative engines (Figure 1). GEO ingests a source website and outputs an optimized version by tailoring and calibrating its presentation, text style, and content to increase visibility in generative engines.

Further, GEO introduces a flexible framework for defining visibility metrics tailor-made for generative engines, since the notion of visibility in generative engines is more nuanced and multi-faceted than in traditional search engines (Figure 3). While average ranking on the response page is a good measure of visibility in traditional search engines, which present a linear list of websites, it does not apply to generative engines. Generative Engines provide rich, structured responses and embed websites as inline citations in the response, often with different lengths, at varying positions, and in diverse styles. This necessitates visibility metrics tailor-made for generative engines, which measure the visibility of attributed sources along multiple dimensions, such as the relevance and influence of a citation with respect to the query, measured through both an objective and a subjective lens.

To facilitate faithful and extensive evaluation of GEO methods, we propose GEO-bench, a benchmark consisting of 10,000 queries from diverse domains and sources, adapted for generative engines.

Through systematic evaluation, we demonstrate that our proposed Generative Engine Optimization methods can boost visibility by up to 40% on diverse queries, providing beneficial strategies for content creators. Among other things, we find that including citations, quotations from relevant sources, and statistics can significantly boost source visibility, with an increase of over 40% across various queries. We also demonstrate the efficacy of Generative Engine Optimization on Perplexity.ai, a real-world generative engine, with visibility improvements of up to 37%.

In summary, our contributions are three-fold:
(1) We propose Generative Engine Optimization, the first general optimization framework for website owners to optimize their websites for generative engines. Generative Engine Optimization can improve the visibility of websites by up to 40% across a wide range of queries, domains, and real-world black-box generative engines.
(2) Our framework proposes a comprehensive set of visibility metrics specifically designed for generative engines and enables content creators to flexibly optimize their content through customized visibility metrics.
(3) To foster faithful evaluation of GEO methods in generative engines, we propose the first large-scale benchmark, consisting of diverse search queries from wide-ranging domains and datasets specially tailored for Generative Engines.
Generative Engines comprise two crucial components: (a) a set of generative models $G = \{G_1, G_2, \ldots, G_n\}$, each serving a specific purpose such as query reformulation or summarization, and (b) a search engine $SE$ that returns a set of sources $S = \{s_1, s_2, \ldots, s_m\}$ given a query $q$. We present a representative workflow in Figure 2, which, at the time of writing, closely resembles the design of BingChat. This workflow breaks the input query down into a set of simpler queries that are easier for the search engine to consume. Given a query, a query-reformulating generative model $G_1 = G_{qr}$ generates a set of queries $Q_1 = \{q_1, q_2, \ldots, q_n\}$, which are then passed to the search engine $SE$ to retrieve a set of ranked sources $S = \{s_1, s_2, \ldots, s_m\}$. The set of sources $S$ is passed to a summarizing model $G_2 = G_{sum}$, which generates a summary $Sum_j$ for each source in $S$, resulting in the summary set $Sum = \{Sum_1, Sum_2, \ldots, Sum_m\}$. The summary set is passed to a response-generating model $G_3 = G_{resp}$, which generates a cumulative response $r$ backed by the sources $S$. In this work, we focus on single-turn Generative Engines, but the formulation can be extended to multi-turn Conversational Generative Engines (Appendix A).

The response $r$ is typically structured text with embedded citations. Citations are important given the tendency of LLMs to hallucinate information [10]. Specifically, consider a response $r$ composed of sentences $\{l_1, l_2, \ldots, l_o\}$. Each sentence may be backed by a set of citations $C_i \subset S$ drawn from the retrieved set of documents. An ideal generative engine should ensure that all statements in the response are supported by relevant citations (high citation recall), and that all citations accurately support the statements they are attached to (high citation precision).

2.2.1 Impressions for Generative Engines. In SEO, a website's impression (or visibility) is determined by its average ranking over a range of queries. However, the nature of generative engines' output necessitates different impression metrics. Unlike search engines, Generative Engines combine information from multiple sources in a single response. Factors such as the length, uniqueness, and presentation of the cited website determine the true visibility of a citation. Thus, as illustrated in Figure 3, while a simple ranking on the response page serves as an effective metric for impression and visibility in conventional search engines, such metrics are not applicable to generative engine responses.

In response to this challenge, we propose a suite of impression metrics designed with three key principles in mind: (1) the metrics should hold relevance for creators, (2) they should be explainable, and (3) they should be easily comprehensible by a broad spectrum of content creators. The first of these metrics, the "Word Count" metric, is the normalized word count of sentences related to a citation. Mathematically, this is defined as:

$$\mathrm{Imp}_{wc}(c_i, r) = \frac{\sum_{s \in S_{c_i}} |s|}{\sum_{s \in S_r} |s|} \qquad (2)$$

Here $S_{c_i}$ is the set of sentences citing $c_i$, $S_r$ is the set of sentences in the response, and $|s|$ is the number of words in sentence $s$. In cases where a sentence is cited by multiple sources, we share the word count equally among all the citations. Intuitively, a higher word count correlates with the source playing a more important part in the answer, and thus the user gets higher exposure to that source.
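To make Equation (2) concrete, the sketch below computes the word-count impression of every citation in a response. It is a minimal illustration, assuming citations are marked inline as [k] and that the response has already been split into sentences; the paper does not prescribe a particular implementation.

import re
from collections import defaultdict

def word_count_impression(sentences: list[str]) -> dict[int, float]:
    """Compute Imp_wc (Eq. 2) for every citation index in a response.

    `sentences` is the response split into sentences, with inline
    citations written as [k]. A sentence backed by several citations
    shares its word count equally among them, as described above.
    """
    total_words = 0
    cited_words: dict[int, float] = defaultdict(float)
    for sentence in sentences:
        citations = [int(m) for m in re.findall(r"\[(\d+)\]", sentence)]
        words = len(re.sub(r"\[\d+\]", "", sentence).split())
        total_words += words
        for c in citations:
            cited_words[c] += words / len(citations)  # equal sharing
    return {c: w / total_words for c, w in cited_words.items()}

# Example: citation 1 backs two sentences, citation 2 backs one and a half.
response = [
    "Neapolitan pizza is baked in a wood-fired oven [1].",
    "It cooks in about 90 seconds [1][2].",
    "Toppings are kept deliberately simple [2].",
]
print(word_count_impression(response))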
Figure 3: Ranking and visibility metrics are straightforward in traditional search engines, which list website sources in ranked order with verbatim content. However, Generative Engines generate rich, structured responses, often embedding citations in a single block, interleaved with one another. This makes ranking and visibility nuanced and multi-faceted. Further, unlike search engines, where significant research has been conducted on improving visibility, how to optimize visibility in generative engine responses remains unclear. To address these challenges, our black-box optimization framework proposes a series of well-designed impression metrics that creators can use to gauge and optimize their website's performance, and it also allows creators to define their own impression metrics.
However, since "Word Count" is not impacted by the ranking of the citations (whether a citation appears first, for example), we also propose a position-adjusted count that reduces the weight by an exponentially decaying function of the citation position:

$$\mathrm{Imp}_{pwc}(c_i, r) = \frac{\sum_{s \in S_{c_i}} |s| \cdot e^{-\mathrm{pos}(s)/|S|}}{\sum_{s \in S_r} |s|} \qquad (3)$$

Intuitively, sentences that appear first in the response are more likely to be read, and the exponent term in the definition of $\mathrm{Imp}_{pwc}$ gives higher weight to such citations. Thus, a website cited at the top may have a higher impression despite having a lower word count than a website cited in the middle or at the end of the response. Further, the choice of an exponentially decaying function is motivated by several studies showing that click-through rates follow a power law as a function of ranking in search engines [7, 8].
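A corresponding sketch for the position-adjusted variant of Equation (3). Here $\mathrm{pos}(s)$ is read as the sentence's index within the response and $|S|$ as the total number of response sentences, which is one plausible interpretation of the normalization above.

import math

def position_adjusted_impression(sentence_words: list[int],
                                 cited_by: list[list[int]]) -> dict[int, float]:
    """Compute Imp_pwc (Eq. 3): word counts decayed by sentence position.

    sentence_words[j] is the word count of sentence j; cited_by[j]
    lists the citation indices backing sentence j.
    """
    n = len(sentence_words)
    total = sum(sentence_words)
    scores: dict[int, float] = {}
    for pos, (words, cites) in enumerate(zip(sentence_words, cited_by)):
        decay = math.exp(-pos / n)  # earlier sentences weigh more
        for c in cites:
            scores[c] = scores.get(c, 0.0) + (words / len(cites)) * decay
    return {c: s / total for c, s in scores.items()}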
While the above impression metrics are objective and well-grounded, they ignore the subjective aspects of a citation's impact on the user's attention. To address this, we propose the "Subjective Impression" metric, which incorporates facets such as the relevance of the cited material to the user query, the influence of the citation, the uniqueness of the material presented by a citation, subjective position, subjective count, the probability of clicking the citation, and the diversity of the material presented. We use G-Eval [15], the current state of the art for evaluation with LLMs, to measure each of these sub-metrics.
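As an illustration of how one such sub-metric can be scored, the sketch below assembles a G-Eval-style rubric prompt for the relevance facet. The wording and the llm() call are hypothetical placeholders; the actual evaluation prompts are released in the public repository (Appendix B.3).

def relevance_prompt(query: str, cited_sentences: list[str]) -> str:
    """Build a G-Eval-style rubric prompt for the relevance facet.

    Illustrative wording only; the released prompts differ (see B.3).
    """
    cited = "\n".join(f"- {s}" for s in cited_sentences)
    return (
        "Rate how relevant the cited material is to the user query.\n"
        f"Query: {query}\n"
        f"Cited material:\n{cited}\n"
        "Answer with a score from 1 (irrelevant) to 5 (highly relevant)."
    )

# score = float(llm(relevance_prompt(query, sentences)))  # llm() is a stand-in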
2.2.2 Generative Engine Optimization Methods for Websites. To improve impression metrics, content creators must make changes to their website content. We present several generative-engine-agnostic strategies, referred to as Generative Engine Optimization (GEO) methods. Mathematically, every GEO method is a function $f : W \rightarrow W'$, where $W$ is the initial web content and $W'$ is the modified content after applying the GEO method. The modifications can range from simple stylistic alterations to incorporating new content in a structured format. A well-designed GEO method is equivalent to a black-box optimization method that, without knowing the exact algorithmic design of generative engines, can increase the website's visibility by making textual modifications to $W$ independent of the exact queries.

For our experiments, we apply Generative Engine Optimization methods to website content using a large language model, prompted to perform specific stylistic and content changes to the website. In particular, the source content is modified according to the specific set of desired characteristics that each GEO method defines. We propose and evaluate several such methods:
1. Authoritative: modifies the text style of the source content to be more persuasive and authoritative.
2. Statistics Addition: modifies content to include quantitative statistics instead of qualitative discussion, wherever possible.
3. Keyword Stuffing: modifies content to include more keywords from the query, as expected in classical SEO optimization.
4. Cite Sources and 5. Quotation Addition: add relevant citations and quotations, respectively, from credible sources.
6. Easy-to-Understand: simplifies the language of the website.
7. Fluency Optimization: improves the fluency of the website text.
8. Unique Words and 9. Technical Terms: involve adding unique and technical terms, respectively, wherever possible.

These methods cover diverse general strategies that website owners can implement quickly and use regardless of the website content. Further, except for methods 3, 4, and 5, the remaining methods enhance the presentation of existing content to increase its persuasiveness or appeal to the generative engine, without requiring extra content. On the other hand, methods 3, 4, and 5 may
require some form of additional content. To analyze the performance gain of our methods, for each input user query we randomly select one source website to be optimized and apply each of the GEO methods separately to the same source. We refer readers to Appendix B.4 for more details on the GEO methods.
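Concretely, each GEO method $f : W \rightarrow W'$ can be realized as a single LLM rewrite call. The sketch below shows one plausible implementation against the OpenAI chat API with gpt-3.5-turbo (the model used in our experiments, Appendix B.5); the per-method instructions here are illustrative stand-ins, not the released prompts.

from openai import OpenAI

client = OpenAI()

# Illustrative per-method instructions; the released prompts differ.
METHOD_INSTRUCTIONS = {
    "statistics_addition": "Rewrite the text to include quantitative "
                           "statistics in place of qualitative discussion.",
    "cite_sources": "Add citations to credible sources for key claims.",
    "fluency_optimization": "Improve the fluency of the text without "
                            "changing its meaning.",
}

def apply_geo_method(website_text: str, method: str) -> str:
    """f: W -> W', a query-independent textual rewrite of the source."""
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[
            {"role": "system", "content": METHOD_INSTRUCTIONS[method]},
            {"role": "user", "content": website_text},
        ],
    )
    return response.choices[0].message.content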
3 EXPERIMENTAL SETUP

3.1 Evaluated Generative Engine
In accordance with previous work [14], we use a two-step setup for the Generative Engine design. The first step fetches relevant sources for the input query; in the second step, an LLM generates a response based on the fetched sources. Similar to previous work, we do not use summarization and instead provide the whole text of each source. Due to context-length limitations and the quadratic scaling of transformer models' cost with context size, only the top 5 sources are fetched from the Google search engine for every query. This setup closely mimics the workflow used in previous works and the general design adopted by commercial GEs such as you.com and perplexity.ai. The answer is then generated by the gpt-3.5-turbo model [20] using the same prompt as prior work [14]. We sample 5 different responses at a temperature of 0.7 to reduce statistical deviations.

Further, in Section C.1, we evaluate the same Generative Engine Optimization methods on Perplexity.ai, a commercially deployed generative engine, highlighting the generalizability of our proposed methods.
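The evaluated engine thus reduces to two steps: retrieve the top five sources, then prompt the LLM for an answer with inline citations. The sketch below illustrates this loop; fetch_top5 is a stand-in for a Google search wrapper, and the prompt is abbreviated (the exact prompt, taken from [14], is reproduced in Listing 1).

from openai import OpenAI

client = OpenAI()

def generative_engine(query: str, fetch_top5) -> list[str]:
    """Two-step GE: fetch sources, then sample 5 cited answers.

    `fetch_top5` stands in for a search wrapper returning the cleaned
    text of the top five Google results for the query.
    """
    sources = fetch_top5(query)
    source_text = "\n\n".join(f"[{i + 1}] {s}" for i, s in enumerate(sources))
    prompt = (
        "Write an accurate and concise answer for the given user question, "
        "using only the provided web search results. Cite sources inline "
        "as [index].\n\n"
        f"Question: {query}\n\nSearch Results:\n{source_text}"
    )
    completion = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": prompt}],
        temperature=0.7,  # as in the setup above
        n=5,              # 5 samples to reduce statistical deviations
    )
    return [choice.message.content for choice in completion.choices]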
3.2 Benchmark: GEO-bench
Since there is currently no publicly available dataset containing Generative Engine-related queries, we curate GEO-bench, a benchmark consisting of 10K queries from multiple sources, repurposed for generative engines, along with synthetically generated queries. The benchmark includes queries from nine different sources, each further categorized by target domain, difficulty, query intent, and other dimensions.

Datasets. 1. MS MARCO, 2. ORCAS-I, and 3. Natural Questions [1, 6, 13]: These datasets contain real, anonymized user queries from the Bing and Google search engines, and collectively represent the common set of datasets used in search-engine-related research. However, Generative Engines will be posed far more difficult and specific queries, with the intent of synthesizing answers from multiple sources rather than searching for them. To this end, we repurpose several other publicly available datasets. 4. AllSouls: This dataset contains essay questions from All Souls College, Oxford University. The queries in this dataset require Generative Engines to perform appropriate reasoning to aggregate information from multiple sources. 5. LIMA [25]: contains challenging questions that require Generative Engines not only to aggregate information but also to perform suitable reasoning to answer the question (e.g., writing a short poem or Python code). 6. Davinci-Debate [14]: contains debate questions generated for testing Generative Engines. 7. Perplexity.ai Discover²: These queries are sourced from Perplexity.ai's Discover section, an updated list of trending queries on the platform. 8. ELI-5³: This dataset contains questions from the ELI5 subreddit, where users ask complex questions and expect answers in simple, layman's terms. 9. GPT-4 Generated Queries: To supplement diversity in the query distribution, we prompt GPT-4 [21] to generate queries ranging over various domains (e.g., science, history), query intents (e.g., navigational, transactional), and the difficulty and scope of the expected response (e.g., open-ended, fact-based).

Our benchmark comprises 10K queries divided into 8K, 1K, and 1K for the train, validation, and test splits, respectively. We preserve the real-world query distribution, with our benchmark containing 80% informational queries and 10% each of transactional and navigational queries. Each query is augmented with the cleaned text content of the top 5 search results from the Google search engine.

Tags. Optimizing website content often requires targeted changes based on the task's domain. Additionally, a user of Generative Engine Optimization may need to identify an appropriate method for only a subset of queries, considering multiple factors such as domain, user intent, and query nature. To facilitate this, we tag each query with one of seven different categories. For tagging, we employ the GPT-4 model and manually verify high recall and precision on the test split.

Overall, GEO-bench consists of queries from 25 diverse domains such as Arts, Health, and Games; features a range of query difficulties from simple to multi-faceted; includes 9 different types of queries, such as informational and transactional; and encompasses 7 different categorizations. Owing to its specially designed high diversity, its size, and its real-world nature, GEO-bench is a comprehensive benchmark for evaluating Generative Engines and serves as a standard testbed for assessing them for various purposes in this and future works. We provide more details about GEO-bench in Appendix B.2.

3.3 GEO Methods
We evaluate the 9 proposed GEO methods described in Section 2.2.2. We compare them with a baseline, which measures the impression metrics of unmodified website sources. We evaluate the methods on the complete GEO-bench test split. Further, to reduce variance in the results, we run our experiments on five different random seeds and report the average.

3.4 Evaluation Metrics
We utilize the impression metrics defined in Section 2.2.1. Specifically, we employ two impression metrics: 1. Position-Adjusted Word Count, which combines word count and position count; to analyze the effect of the individual components, we also report scores on the two sub-metrics separately. 2. Subjective Impression, a subjective metric encompassing seven different aspects: (1) relevance of the cited sentence to the user query; (2) influence of the citation, assessing the extent to which the generated response relies on the citation; (3) uniqueness of the material presented by a citation; (4) subjective position, gauging the prominence of the source's positioning from the user's viewpoint; (5) subjective count, measuring the amount of content presented from the cited source; (6) probability of clicking the citation; and (7) diversity of the material presented.

²https://www.perplexity.ai/discover
³https://huggingface.co/datasets/eli5_category
5 ANALYSIS
Table 4: Representative examples of GEO methods optimizing a source website. Additions are marked in green and deletions in red. Without adding any substantial new information, GEO methods significantly increase the visibility of the source content.
Because responses are generated by generative models conditioned on website content, factors such as backlink building should not disadvantage small creators. This is evident from the relative improvements in visibility shown in Table 2. For example, the Cite Sources method led to a substantial 115.1% increase in visibility for websites ranked fifth in the SERP, while, on average, the visibility of the top-ranked website decreased by 30.3%.

This finding highlights GEO's potential as a tool to democratize the digital space. Many lower-ranked websites are created by small content creators or independent businesses, who traditionally struggle to compete with larger corporations for top search-engine results. The advent of Generative Engines might initially seem disadvantageous to these smaller entities. However, the application of GEO methods presents an opportunity for these content creators to significantly improve their visibility in Generative Engine responses. By enhancing their content with GEO, they can reach a wider audience, leveling the playing field and allowing them to compete more effectively with larger corporations.

in conjunction with other methods (Average: 31.4%), despite being relatively less effective when used alone (8% lower than Quotation Addition). These findings underscore the importance of studying GEO methods in combination, as they are likely to be used together by content creators in the real world.

5.4 Qualitative Analysis
We present a qualitative analysis of GEO methods in Table 4, which contains representative examples in which GEO methods boost source visibility with minimal changes. Each method optimizes a source through suitable text additions and deletions. In the first example, simply adding the source of a statement significantly boosts visibility in the final answer, requiring minimal effort from the content creator. The second example demonstrates that adding relevant statistics wherever possible increases source visibility in the final Generative Engine response. Finally, the third row suggests that merely emphasizing parts of the text and using a persuasive text style can also improve visibility.
in Table 5. Similar to our generative engine, Quotation Addition performs best in Position-Adjusted Word Count, with a 22% improvement over the baseline. Methods that performed well in our generative engine, such as Cite Sources and Statistics Addition, show improvements of up to 9% and 37% on the two metrics. Our observations, such as the ineffectiveness of traditional SEO methods like Keyword Stuffing, are further reinforced, as it performs 10% worse than the baseline. These results are significant for three reasons: (1) they underscore the importance of developing distinct Generative Engine Optimization methods to benefit content creators; (2) they highlight the generalizability of our proposed GEO methods to different generative engines; and (3) they demonstrate that content creators can use our easy-to-implement GEO methods directly, thus having a high real-world impact. We refer readers to Appendix C.1 for more details.

7 RELATED WORK

Evidence-based Answer Generation: Previous works have used several techniques for answer generation backed by sources. Nakano et al. [19] trained GPT-3 to navigate web environments to generate source-backed answers. Similarly, other methods [17, 23, 24] fetch sources via search engines for answer generation. Our work unifies these approaches and provides a common benchmark for improving these systems in the future. In a recent working draft, Kumar and Lakkaraju [11] showed that strategic text sequences can manipulate LLM recommendations to enhance product visibility in generative engines. While their approach focuses on increasing product visibility through adversarial text, our method introduces non-adversarial strategies to optimize any website content for improved visibility in generative engine search results.

Retrieval-Augmented Language Models: Several recent works have tackled the limited memory of language models by fetching relevant sources from a knowledge base to complete a task [3, 9, 18]. However, a Generative Engine needs to generate an answer and provide attributions throughout that answer. Further, a Generative Engine is not limited to a single text modality in either its input or its output. Additionally, the Generative Engine framework is not limited to fetching relevant sources; it comprises multiple tasks, such as query reformulation and source selection, and makes decisions about how and when to perform them.

Search Engine Optimization: Over nearly the past 25 years, extensive research has optimized web content for search engines [2, 12, 22]. These methods fall into On-Page SEO, which improves content and user experience, and Off-Page SEO, which boosts website authority through link building. In contrast, GEO deals with a more complex environment involving multi-modality and conversational settings. Since GEO optimizes against a generative model that is not limited to simple keyword matching, traditional SEO strategies will not transfer to Generative Engine settings, highlighting the need for GEO.

8 CONCLUSION
In this work, we formulate search engines augmented with generative models, which we dub generative engines. We propose Generative Engine Optimization (GEO) to empower content creators to optimize their content for generative engines. We define impression metrics for generative engines and propose and release GEO-bench, a benchmark encompassing diverse user queries from multiple domains and settings, along with the relevant sources needed to answer those queries. We propose several ways to optimize content for generative engines and demonstrate that these methods can boost source visibility by up to 40% in generative engine responses. Among other findings, we show that including citations, quotations from relevant sources, and statistics can significantly boost source visibility. Further, we discover a dependence of GEO methods' effectiveness on the query domain, and the potential of combining multiple GEO strategies. We show promising results on a commercially deployed generative engine with millions of active users, showcasing the real-world impact of our work. In summary, our work is the first to formalize the important and timely GEO paradigm, releasing algorithms and infrastructure (benchmarks, datasets, and metrics) to facilitate rapid progress on generative engines by the community. This serves as a first step towards understanding the impact of generative engines on the digital space and the role of GEO in this new paradigm of search engines.

9 LIMITATIONS
While we rigorously test our proposed methods on two generative engines, including a publicly available one, the methods may need to adapt over time as GEs evolve, mirroring the evolution of SEO. Additionally, despite our efforts to ensure that the queries in GEO-bench closely resemble real-world queries, the nature of queries can change over time, necessitating continuous updates. Further, owing to the black-box nature of search engine algorithms, we did not evaluate how GEO methods affect search rankings. However, we note that the changes made by GEO methods are targeted changes to textual content, bearing some resemblance to SEO methods, while not affecting other metadata such as domain name, backlinks, etc.; they are thus less likely to affect search engine rankings. Further, as larger context lengths in language models become economical, future generative models are expected to ingest more sources, reducing the impact of search rankings. Lastly, while every query in our proposed GEO-bench is tagged and manually inspected, there may be discrepancies due to subjective interpretations or errors in labeling.

10 ACKNOWLEDGEMENTS
This material is based upon work supported by the National Science Foundation under Grant No. 2107048. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation.
REFERENCES
[1] Daria Alexander, Wojciech Kusa, and Arjen P. de Vries. 2022. ORCAS-I: Queries Annotated with Intent using Weak Supervision. In Proceedings of the 45th International ACM SIGIR Conference on Research and Development in Information Retrieval. https://api.semanticscholar.org/CorpusID:248495926
[2] Prashant Ankalkoti. 2017. Survey on Search Engine Optimization Tools & Techniques. Imperial Journal of Interdisciplinary Research 3 (2017). https://api.semanticscholar.org/CorpusID:116487363
[3] Akari Asai, Xinyan Velocity Yu, Jungo Kasai, and Hannaneh Hajishirzi. 2021. One Question Answering Model for Many Languages with Cross-lingual Dense Passage Retrieval. In Neural Information Processing Systems. https://api.semanticscholar.org/CorpusID:236428949
[4] Sergey Brin and Lawrence Page. 1998. The Anatomy of a Large-Scale Hypertextual Web Search Engine. Computer Networks 30 (1998), 107–117. https://api.semanticscholar.org/CorpusID:7587743
[5] Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D. Kaplan, et al. 2020. Language Models are Few-Shot Learners. In Advances in Neural Information Processing Systems, Vol. 33. Curran Associates, Inc., 1877–1901. https://proceedings.neurips.cc/paper_files/paper/2020/file/1457c0d6bfcb4967418bfb8ac142f64a-Paper.pdf
[6] Nick Craswell, Bhaskar Mitra, Emine Yilmaz, Daniel Fernando Campos, and Jimmy J. Lin. 2021. MS MARCO: Benchmarking Ranking Models in the Large-Data Regime. In Proceedings of the 44th International ACM SIGIR Conference on Research and Development in Information Retrieval. https://api.semanticscholar.org/CorpusID:234336491
[7] Brian Dean. 2023. We Analyzed 4 Million Google Search Results. Here's What We Learned About Organic Click Through Rate. https://backlinko.com/google-ctr-stats. Accessed: 2024-06-08.
[8] Danny Goodwin. 2011. Top Google Result Gets 36.4% of Clicks [Study]. https://www.searchenginewatch.com/2011/04/21/top-google-result-gets-36-4-of-clicks-study/
[9] Kelvin Guu, Kenton Lee, Zora Tung, Panupong Pasupat, and Ming-Wei Chang. 2020. REALM: Retrieval-Augmented Language Model Pre-Training. arXiv:2002.08909. https://api.semanticscholar.org/CorpusID:211204736
[10] Ziwei Ji, Nayeon Lee, Rita Frieske, Tiezheng Yu, Dan Su, Yan Xu, Etsuko Ishii, Ye Jin Bang, Andrea Madotto, and Pascale Fung. 2023. Survey of Hallucination in Natural Language Generation. Comput. Surveys 55, 12 (2023), 1–38.
[11] Aounon Kumar and Himabindu Lakkaraju. 2024. Manipulating Large Language Models to Increase Product Visibility. arXiv:2404.07981 [cs.IR]
[12] R. Anil Kumar, Zaiduddin Shaik, and Mohammed Furqan. 2019. A Survey on Search Engine Optimization Techniques. International Journal of P2P Network Trends and Technology (2019). https://doi.org/10.14445/22492615/IJPTT-V9I1P402
[13] Tom Kwiatkowski, Jennimaria Palomaki, Olivia Redfield, Michael Collins, et al. 2019. Natural Questions: A Benchmark for Question Answering Research. Transactions of the Association for Computational Linguistics 7 (2019), 453–466. https://api.semanticscholar.org/CorpusID:86611921
[14] Nelson F. Liu, Tianyi Zhang, and Percy Liang. 2023. Evaluating Verifiability in Generative Search Engines. arXiv:2304.09848. https://api.semanticscholar.org/CorpusID:258212854
[15] Yang Liu, Dan Iter, Yichong Xu, Shuo Wang, Ruochen Xu, and Chenguang Zhu. 2023. G-Eval: NLG Evaluation using GPT-4 with Better Human Alignment. arXiv:2303.16634. https://api.semanticscholar.org/CorpusID:257804696
[16] G. D. Maayan. 2023. How Google SGE Will Impact Your Traffic – and 3 SGE Recovery Case Studies. Search Engine Land (5 Sep 2023). https://searchengineland.com/how-google-sge-will-impact-your-traffic-and-3-sge-recovery-case-studies-431430
[17] Jacob Menick, Maja Trebacz, Vladimir Mikulik, John Aslanides, Francis Song, et al. 2022. Teaching Language Models to Support Answers with Verified Quotes. arXiv:2203.11147. https://api.semanticscholar.org/CorpusID:247594830
[18] Grégoire Mialon, Roberto Dessì, Maria Lomeli, Christoforos Nalmpantis, et al. 2023. Augmented Language Models: A Survey. arXiv:2302.07842. https://api.semanticscholar.org/CorpusID:256868474
[19] Reiichiro Nakano, Jacob Hilton, Suchir Balaji, Jeff Wu, Long Ouyang, et al. 2021. WebGPT: Browser-assisted Question-answering with Human Feedback. arXiv:2112.09332. https://api.semanticscholar.org/CorpusID:245329531
[20] OpenAI. 2022. Introducing ChatGPT. https://openai.com/index/chatgpt/
[21] OpenAI. 2024. GPT-4 Technical Report. arXiv:2303.08774 [cs.CL]
[22] A. Shahzad, Deden Witarsyah Jacob, Nazri M. Nawi, Hairulnizam Bin Mahdin, and Marheni Eka Saputri. 2020. The New Trend for Search Engine Optimization, Tools and Techniques. Indonesian Journal of Electrical Engineering and Computer Science 18 (2020), 1568. https://api.semanticscholar.org/CorpusID:213123106
[23] Kurt Shuster, Jing Xu, Mojtaba Komeili, Da Ju, et al. 2022. BlenderBot 3: A Deployed Conversational Agent that Continually Learns to Responsibly Engage. arXiv:2208.03188. https://api.semanticscholar.org/CorpusID:251371589
[24] Romal Thoppilan, Daniel De Freitas, Jamie Hall, Noam Shazeer, et al. 2022. LaMDA: Language Models for Dialog Applications. arXiv:2201.08239 [cs.CL]
[25] Chunting Zhou, Pengfei Liu, Puxin Xu, Srini Iyer, et al. 2023. LIMA: Less Is More for Alignment. arXiv:2305.11206. https://api.semanticscholar.org/CorpusID:258822910
Listing 1: Prompt used for the Generative Engine. The GE takes the query and 5 sources as input and outputs a response to the query grounded in the sources.

Write an accurate and concise answer for the given user question, using _only_ the provided summarized web search results. The answer should be correct, high-quality, and written by an expert using an unbiased and journalistic tone. The user's language of choice such as English, Français, Español, Deutsch, or 日本語 should be used. The answer should be informative, interesting, and engaging. The answer's logic and reasoning should be rigorous and defensible. Every sentence in the answer should be immediately followed by an in-line citation to the supporting search result(s), written as [index]. When citing several search results, use [1][2][3] format rather than [1, 2, 3]. You can use multiple search results to respond comprehensively while avoiding irrelevant search results.

Question: {query}

Search Results:
{source_text}

Listing 2: Representative queries from each of the 9 datasets in GEO-bench

### ORCAS
- what does globalization mean
- wine pairing list

### AllSouls
- Are open-access journals the future of academic publishing?
- Should the study of non-Western philosophy be a requirement for a philosophy degree in the UK?

### ELI5
- Why does my cat kick its toys when playing with them?
- what does caffeine actually do your muscles, especially regarding exercising?

### GPT-4
- What are the benefits of a keto diet?
- What are the most profound impacts of the Renaissance period on modern society?

### LIMA
- What are the primary factors that influence consumer behavior?
- What would be a great twist for a murder mystery? I'm looking for something creative, not to rehash old tropes.

### MS-Macro
- what does monogamous
- what is the normal fbs range for children

### Natural Questions
- where does the phrase bee line come from
- what is the prince of persia in the bible

### Perplexity.ai
- how to gain more followers on LinkedIn
- why is blood sugar higher after a meal
A CONVERSATIONAL GENERATIVE ENGINE
In Section 2.1, we discussed a single-turn Generative Engine that outputs a single response given the user query. However, one of the strengths of upcoming Generative Engines will be their ability to engage in an active back-and-forth conversation with the user. The conversation allows users to provide clarifications to their queries or to the Generative Engine's response, and to ask follow-ups. Specifically, in Equation 1, instead of the input being a single query $q_u$, it is modeled as a conversation history $H$ of $(q_u^t, r^t)$ pairs. The response $r^{t+1}$ is then defined as:

$$GE := f_{LE}(H, P_U) \rightarrow r^{t+1} \qquad (5)$$

where $t$ is the turn number. Further, to engage the user in a conversation, a separate LLM, $L_{follow}$ or $L_{resp}$, may generate suggested follow-up queries based on $H$, $P_U$, and $r^{t+1}$. The suggested follow-up queries are typically designed to maximize the likelihood of user engagement. This not only benefits Generative Engine providers by increasing user interaction, but also benefits website owners by enhancing their visibility. Furthermore, these follow-up queries can help users obtain more detailed information.
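A minimal sketch of this multi-turn formulation, with the history kept as (query, response) pairs; the llm callable and the transcript format are placeholders rather than a prescribed design.

def conversational_ge(history: list[tuple[str, str]],
                      new_query: str, llm) -> str:
    """Eq. (5): generate the response r^{t+1} from the history H.

    `history` holds prior (q^t, r^t) turns; `llm` stands in for the
    underlying generative model.
    """
    transcript = "".join(f"User: {q}\nEngine: {r}\n" for q, r in history)
    return llm(transcript + f"User: {new_query}\nEngine:")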
B EXPERIMENTAL SETUP

B.1 Evaluated Generative Engine
The exact prompt used is shown in Listing 1.

B.2 Benchmark
GEO-bench contains queries from nine datasets. Representative queries from each of the datasets are shown in Listing 2. Further, we tag each of the queries based on a pool of 7 different categories. For tagging, we use the GPT-4 model and manually confirm high recall and precision in tagging. However, owing to this automated system, the tags can be noisy and should be interpreted with care. Details about each of these categories are presented here:
• Difficulty Level: The complexity of the query, ranging from simple to complex.
• Nature of Query: The type of information sought by the query, such as factual, opinion, or comparison.
• Genre: The category or domain of the query, such as arts and entertainment, finance, or science.
• Specific Topics: The specific subject matter of the query, such as physics, economics, or computer science.
• Sensitivity: Whether or not the query involves sensitive topics.
• User Intent: The purpose behind the user's query, such as research, purchase, or entertainment.
• Answer Type: The format of the answer that the query is seeking, such as fact, opinion, or list.

B.3 Evaluation Metrics
We use 7 different subjective impression metrics, whose prompts are presented in our public repository: https://github.com/GEO-optim/GEO.

B.4 GEO Methods
We propose 9 different Generative Engine Optimization methods to optimize website content for generative engines. We evaluate these methods on the complete GEO-bench test split. Further, to reduce variance in results, we run our experiments on five different random seeds and report the average.
B.5 Prompts for GEO Methods
We present all prompts in our public repository: https://github.com/GEO-optim/GEO. GPT-3.5-turbo was used for all experiments.

C RESULTS
We perform experiments on 5 random seeds and present results with statistical deviations in Table 6.

C.1 GEO in the Wild: Experiments with a Deployed Generative Engine
We also evaluate our proposed Generative Engine Optimization methods on a real-world deployed Generative Engine: Perplexity.ai. Since Perplexity.ai does not allow the user to specify source URLs, we instead provide the source text as file uploads to Perplexity.ai, ensuring that all answers are generated using only the file sources provided. We evaluate all our methods on a subset of 200 samples of our test set. Results using Perplexity.ai are shown in Table 7.