Pushing the Boundaries of Digital Marketing with SEO-Modeling

A Practical Application of Classification

Anir Darrazi

Department of Computer
and Systems Sciences
Degree project 30 credits
Computer and Systems Sciences
Degree project at the master level (300 credits)
Spring term 2024
Supervisor: Mateus de Oliveira Oliveira
Swedish title: Pusha gränserna inom digital
marknadsföring med SEO-modellering
DVK Uppsats Program

This thesis was written within the Spring 2024 edition of the DVK Uppsats
Program.

Coordination

Mateus de Oliveira Oliveira

Organization

Henrik Bergström
Peter Idestam-Almquist
Mateus de Oliveira Oliveira
Beatrice Åkerblom
Abstract

This thesis explores the refinement of SEO classification models by integrating
advanced machine learning technologies, notably XGBoost and CatBoost,
and leveraging large-scale, relational data to enhance the predictive accuracy of
search engine rankings. The research examines how feature engineering and re-
lational data comparisons between web pages influence the effectiveness of SEO
strategies, highlighting a clear improvement in model accuracy when contextual
variables are considered.

In addition to technical advancements, the study also explores the ethical im-
plications of web scraping and the transparency required in manipulating SEO
algorithms. By systematically analyzing the performance of enhanced models
on varied datasets, this work reveals critical insights into the underlying mech-
anisms of search engines and the factors influencing web page visibility. The
thesis argues that a deeper understanding of these factors, supported by robust
empirical data, can drive more targeted and effective SEO practices.

Overall, the research contributes to both academic literature and practical ap-
plications in digital marketing, offering a framework for developing more so-
phisticated SEO tools that can adapt to the ever-changing digital landscape.
It opens up new avenues for future research, particularly in the exploration of
off-page SEO factors and the integration of natural language processing to au-
tomate and optimize content creation for better search engine performance.

Keywords: SEO-modeling, Classification, SEO, XGBoost, Machine Learning, Search Engines, Digital Marketing
Synopsis

Background
In the digital age, where search engines are vital for navigating online informa-
tion, this thesis researches the nuances of SEO, examining on-page elements and
their accessibility for optimization, and explores machine learning techniques
like XGBoost and CatBoost for understanding SEO rankings, thereby unraveling
the complex interplay between search engine algorithms, SEO strategies, and
advanced computational methods in the competitive digital landscape.

Problem
While SEO-modeling has progressed with classification techniques achieving up
to 70% accuracy, studies like Matošević et al. (2021) face scalability and robust-
ness challenges due to limited dataset sizes, necessitating a broader approach
that enhances model accuracy and explores the wider implications for business
and academia in the ever-evolving digital landscape.

Research Question
How can the accuracy of SEO classification models be improved through the
use of large-scale, relative feature engineered datasets?

Method
This thesis adopts an empirical, quantitative strategy, using data science meth-
ods to experiment on enhancing SEO classification models through feature en-
gineering and large dataset analysis, integrating machine learning to identify
impactful features.

Result
The results compare different classification models and datasets. The findings
suggest improvements in model accuracy by using a pre-processed relational
dataset, contributing to both academic knowledge and practical SEO strategies.
Discussion
The novel approach of integrating relational data demonstrated a modest but
important improvement in model accuracy, underscoring the potential for
nuanced strategies that consider the competitive dynamics of digital marketing.

Acknowledgement

I want to express my deepest gratitude to Mateus de Oliveira Oliveira for his
exceptional guidance on this project. His expertise and dedication have been
instrumental in its success. I would also like to extend my sincere thanks to
Afva Group AB for the invaluable support. The resources provided have been
crucial in facilitating the work.
Contents

List of Listings x

List of Tables xi

List of Abbreviations xii

1 Introduction 1
1.1 SEO-modeling . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
1.2 Problem . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.3 Research Question . . . . . . . . . . . . . . . . . . . . . . . . . . 4
1.4 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
1.5 Results and Considerations . . . . . . . . . . . . . . . . . . . . . 5

2 Extended Background: Components of SEO Classification 7


2.1 Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
2.2 Search Engines . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
2.3 Search Engine Optimization . . . . . . . . . . . . . . . . . . . . . 8
2.4 On-page Factors . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
2.5 Off-page Factors . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
2.6 Classification . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
2.7 XGBoost . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
2.8 CatBoost . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
2.9 Feature Engineering . . . . . . . . . . . . . . . . . . . . . . . . . 15

3 Methodology 16
3.1 Research Strategy . . . . . . . . . . . . . . . . . . . . . . . . . . 16
3.2 Research Framework Overview . . . . . . . . . . . . . . . . . . . 17
3.2.1 Data Collection . . . . . . . . . . . . . . . . . . . . . . . . 17
3.2.2 Data Preprocessing . . . . . . . . . . . . . . . . . . . . . . 17
3.2.3 Feature Extraction . . . . . . . . . . . . . . . . . . . . . . 17
3.2.4 SEO Classification . . . . . . . . . . . . . . . . . . . . . . 17
3.2.5 Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . 17
3.3 Software Use and Implementation . . . . . . . . . . . . . . . . . . 18
3.3.1 ChatGPT 3.5 API for Query Generation . . . . . . . . . . 18

3.3.2 DataForSEO Google SERP API . . . . . . . . . . . . . . 18
3.3.3 Webshare Rotating Proxy . . . . . . . . . . . . . . . . . . 18
3.3.4 Python for Machine Learning . . . . . . . . . . . . . . . . 19
3.3.5 Multiprocessing for Parallel Processing . . . . . . . . . . . 19
3.4 Dataset Compilation . . . . . . . . . . . . . . . . . . . . . . . . . 19
3.4.1 Sourcing Primary Data . . . . . . . . . . . . . . . . . . . 20
3.4.2 Feature Selection . . . . . . . . . . . . . . . . . . . . . . . 20
3.5 Data analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
3.5.1 Analysis of HTML Structure Data . . . . . . . . . . . . . 22
3.5.2 Using SERP Data as a Classifier . . . . . . . . . . . . . . 23
3.5.3 Relational Expansions . . . . . . . . . . . . . . . . . . . . 23
3.5.4 Frequency Analysis of Query-Associated Terms . . . . . . 24
3.5.5 Feature Engineering Phase . . . . . . . . . . . . . . . . . 24
3.5.6 Training the XGBoost & CatBoost Model . . . . . . . . . 24
3.5.7 Evaluation Metric . . . . . . . . . . . . . . . . . . . . . . 25
3.5.8 Addressing Fitting Issues . . . . . . . . . . . . . . . . . . 25
3.5.9 Addressing Biases . . . . . . . . . . . . . . . . . . . . . . 26
3.6 Alternative Research Strategies . . . . . . . . . . . . . . . . . . . 26
3.7 Validity and Reliability . . . . . . . . . . . . . . . . . . . . . . . 27
3.8 Ethical Considerations . . . . . . . . . . . . . . . . . . . . . . . . 27

4 Main Results 29
4.1 Technical Development . . . . . . . . . . . . . . . . . . . . . . . . 29
4.2 Model Performance Accuracy . . . . . . . . . . . . . . . . . . . . 37
4.3 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39

5 Conclusion 40
5.1 Comparison with Existing Work . . . . . . . . . . . . . . . . . . 41
5.2 Delimitations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42
5.3 Ethical Implications . . . . . . . . . . . . . . . . . . . . . . . . . 43
5.4 The use of AI Tools . . . . . . . . . . . . . . . . . . . . . . . . . 44
5.5 Considerations for Future Work . . . . . . . . . . . . . . . . . . . 44

Bibliography 49

Appendices 50
A Full Feature Set 50

B Query Engineering 53

C The use of XAI for SEO Classification 65

D Code Repository 75

Listings

4.1 Creating the queries . . . . . . . . . . . . . . . . . . . . . . . . . 29


4.2 Gathering the targeted links and classifiers . . . . . . . . . . . . 30
4.3 HTTP-request through rotating proxy . . . . . . . . . . . . . . . 31
4.4 Web scraper function collecting on-page feature values . . . . . . 32
4.5 Parallel processing applying multiprocessing to scraper worker() 33
4.6 Calculating the relational expansion and creating a new dataset . 34
4.7 Cross calculating the correlation between features . . . . . . . . . 35
4.8 Calculating the information gain for each feature . . . . . . . . . 35
4.9 Reducing the amount of classifiers to 6 groups . . . . . . . . . . . 35
4.10 Training the CatBoost model . . . . . . . . . . . . . . . . . . . . 36
4.11 Training the XGBoost model . . . . . . . . . . . . . . . . . . . . 36
4.12 Calculating the accuracy of the respective models . . . . . . . . . 37

List of Tables

3.1 This table shows the URL, keyword, and geography, which to-
gether work like the unique identifier for each record. The rank
group is the classifier ranked within 1 - 100 in its respective query
group. See Appendix A for the complete feature set. . . . . . . . 22

4.1 The table shows the population percentile of each classifier from
the dataset. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37
4.2 The table shows the final accuracy of the trained models based on
which dataset was used. . . . . . . . . . . . . . . . . . . . . 38

A.1 This table shows the URL, keyword, and geography, which to-
gether work like the unique identifier for each record. The rank
group is the classifier ranked within 1 - 100 in its respective query
group. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50
A.2 These features are based on the previous on-page values but rel-
ative to the best-ranking page in its respective query group. . . . 51
A.3 These features are based on the previous on-page values but rel-
ative to the best-ranking page in its respective query group. . . . 52

B.1 The table shows the city, country and language name for a specific
query. This information is a component of the query engineering. 53
B.2 This table shows the respective business terms that were used in
the query engineering in English. The terms were translated based
on the language related to the city they were combined with. . . . 60

List of Abbreviations

SEO - Search Engine Optimization
ML - Machine Learning
SERP - Search Engine Results Page
PR - PageRank
DA - Domain Authority
HTML - HyperText Markup Language
CTR - Click Through Rate
URL - Uniform Resource Locator
AI - Artificial Intelligence
GBM - Gradient Boosting Machine
ANN - Artificial Neural Network
NLP - Natural Language Processing
LLM - Large Language Model
SVM - Support Vector Machine
KNN - K-Nearest Neighbor
ROC - Receiver Operating Characteristic

Chapter 1

Introduction

In the ever-evolving digital landscape, organic digital marketing has become
important for businesses seeking to establish and maintain a robust online pres-
ence. Search Engine Optimization (SEO) plays a pivotal role as a pillar of this
kind of digital marketing. This critical tool not only navigates the complexi-
ties of Search Engine algorithms but also enhances online visibility (Bhandari
& Bansal 2018). This thesis explores the mechanisms of digital marketing, fo-
cusing on SEO, highlighting its transformation into an indispensable element of
modern marketing strategies.

The digital marketing ecosystem is characterized by its dynamic nature, where
continuous technological advancements and consumer behavior changes create
both challenges and opportunities for marketers. In this dynamic landscape,
Search Engines stand as the gatekeepers of information, employing intricate fil-
tering systems to sift through the vast expanse of the internet (Ravi et al. 2012).
These systems, sophisticated in their design, are not merely tools for informa-
tion retrieval but pivotal platforms that shape online traffic flow and influence
consumer discovery.

Understanding the complex inner workings of Search Engine algorithms is cru-
cial for SEO digital marketers, as these algorithms determine how content is
ranked and displayed in response to user queries (Bar-Ilan 2007). In this context,
SEO emerges as a strategic tool that aligns web content with these algorithms
to increase the visibility and accessibility of websites. SEO involves optimizing
various elements of web pages, from content and keywords to site architecture
and user experience, thereby enhancing their appeal to Search Engines (Ravi
et al. 2012).

This thesis attempts to better understand the complex interplay between
digital marketing and SEO, underscoring the inherently technical and
pattern-based nature of SEO. It is this technical foundation that lends itself to
the application of machine learning, particularly through the use of classification
models. By conducting a thorough analysis of classification models for SEO
and their accuracy on web page ranking, this study provides valuable insights
into how businesses can harness the power of machine learning to optimize their
SEO strategies. The focus is on understanding SEO not merely as a tactic for
enhancing search rankings, but as a field ready for the application of machine
learning techniques, which can significantly refine and automate the process of
optimizing web pages.

1.1 SEO-modeling
The term SEO-modeling is introduced in this thesis based on the concept of
simulating Search Engine ranking and the application of classification models
in predicting page rankings. The term is heavily inspired by Matošević et al.
(2021) and their work with SEO classification. At the core of the term, we
have SEO which involves optimizing web pages to rank as high as possible in
the Search Engine results pages (SERPs), a crucial factor in driving web traffic.
The evolution of SEO has seen it transform from simple keyword stuffing to so-
phisticated practices that align with complex Search Engine algorithms. These
algorithms, employed by Search Engines like Google and Bing, are designed to
rank pages based on relevance and authority, among other factors. The precise
workings of these algorithms are closely guarded secrets (Bhandari & Bansal
2018). However, SEO researchers have developed methods to approximate how
these algorithms operate with the help of numerous different models, thus the rel-
evance of this term.

SEO-modeling stands at the intersection of data science and digital market-
ing. It encompasses statistical models and machine learning techniques to un-
derstand and predict the factors influencing Search Engine rankings. SEO-
modeling aims to uncover the complex interactions between various ranking
factors by simulating Search Engine algorithms. These factors range from on-page
elements like content quality and keyword relevance to off-page elements
like backlinks and social signals.

This thesis mainly focuses on the role of classification models in SEO. Classifi-
cation models are supervised machine learning algorithms that can categorize
data into different classes. In the context of SEO, these models are trained on
datasets containing website features and their corresponding rankings to predict
the ranking of new or altered web pages. Features considered in these models
can include technical aspects of a website (like page loading speed and mobile
friendliness), content related attributes (such as keyword density and topical
relevance), and external factors (like backlink profiles) (Rogers 2002).

By applying these models, businesses and SEO professionals can gain power-
ful insights into how web page design and content changes might affect their
Search Engine rankings. This predictive capability is a game changer in formulating
effective SEO strategies. It enables a data-driven approach to SEO,
where decisions are based on empirical evidence rather than expert intuition.

1.2 Problem
While SEO-modeling, primarily through classification models, has made signif-
icant strides in recent years, it is essential to acknowledge its limitations. The
existing literature showcases the potential of classification techniques to mimic
Search Engine algorithms, with predictive accuracy up to 70% (Matošević et al.
2021). However, these models still face challenges in scalability and robustness,
mainly due to the constraints imposed by the size of the datasets used in these
studies.

Previous studies in this area, such as Matošević et al. (2021) with 600 records
and Banaei & Honarvar (2017) with 7 400 records, relied on relatively small
datasets, which may not fully capture the complexity and variability of factors
that influence Search Engine rankings in real world scenarios. Search Engines
like Google consider at least hundreds of factors in their algorithms, and the
interactions between these factors can be intricate and dynamic (Balabantaray
2017). SEO-modeling datasets may need more diversity and depth to model
these interactions accurately. This leads to a potential misalignment between
predicted and actual rankings, or, even worse, dataset bias, mainly noticed when
these models are applied to a broader range of websites and market sectors on
which they were not trained. Furthermore, it is crucial to
recognize the untapped societal and academic benefits that could arise from
enhancing SEO-modeling techniques. The potential for improved accuracy and
reliability in SEO classification models is significant. For businesses and web
content creators, these advancements could translate into more strategic web
optimization, enhancing visibility and competitiveness in the digital market-
place.

As we address this knowledge gap, it is essential to recognize that it is about
more than just refining the existing SEO classification models for greater accu-
racy and applying them to more extensive and diverse datasets. It is also about
exploring the broader implications of these advancements. This requires a multi-
faceted approach that considers the technical challenges of SEO-modeling, the
economic and societal benefits of improved Search Engine rankings, and the
academic value of applying and testing machine learning theories in a dynamic
and complex real world environment.

Problem Formulation: Addressing the limitations of existing SEO classifi-


cation models, particularly in scalability and robustness due to constraints posed
by dataset size.

1.3 Research Question
Given the identified gap in the literature regarding the scalability and robust-
ness of SEO classification models, the primary research question this thesis seeks
to address is:

Research Question: How can the accuracy of SEO classification models be
improved through the use of large-scale, relative feature engineered datasets?

The research question arises from the need to enhance the current understand-
ing and effectiveness of SEO-modeling in predicting web page rankings. The
goal is to overcome the limitations of small datasets and to examine how more
extensive and varied data can refine the predictive capabilities of these models.

To address this research question, the study will employ a methodology rooted
in data science. The core of this approach involves the collection of a large-scale
dataset comprising 1 653 946 records. This dataset will include web page
features extracted through web scraping HTML and SERP (Search Engine Re-
sults Page) rank data, which will serve as the classifier.

The dataset's size and diversity are critical in providing an understanding of
the various factors that Search Engines consider when ranking pages. The large
volume of data is expected to encompass a wide range of industries, website
types, and content strategies, thereby ensuring that the findings are not limited
to a specific niche or market sector. In the analysis phase, machine learning
techniques, particularly classification algorithms, will be employed to develop
models that can predict page rankings based on the collected features. The
performance of these models will be evaluated based on their accuracy and the
insights they provide into Search Engine ranking factors.

Delimitations: This study will have specific delimitations to maintain focus
and feasibility. Firstly, the scope of web scraping will be limited to certain types
of websites and content, primarily dictated by accessibility and relevance to com-
mon search queries. Secondly, while the dataset is large, it will not encompass
all possible types of web content or the full spectrum of SEO factors, especially
those that are difficult to scrape, such as user engagement metrics. Thirdly, the
study will focus on the algorithms of specific major Search Engines, primarily
Google, due to its dominant market share and influence in web search trends.
Lastly, ethical considerations and compliance with web scraping and data us-
age regulations will be strictly adhered to, which may limit access to certain
types of data and websites. This research aims to contribute a more nuanced
understanding of SEO-modeling, providing valuable insights for businesses and
content creators while advancing the academic discourse in this field.

1.4 Related Work
A foundational aspect of our work is the application of machine learning algo-
rithms to SEO challenges. Studies such as Matošević et al. (2021) and Banaei
& Honarvar (2017) have similarly employed machine learning techniques like
decision trees and neural networks to predict website rankings. Our research
builds upon this foundation, utilizing more advanced algorithms like XGBoost
and CatBoost, which are particularly adept at handling large, complex datasets.

Our emphasis on large-scale, diverse datasets reflects an emerging trend in the
field, as demonstrated by the work of researchers like Witten et al. (2016). They
which allows for a comparative analysis of web pages in relation to top perform-
ing sites. The use of relational values is a distinct aspect of our approach. This
methodology aligns with the scientific principles of contextual analysis, where
data is understood not just in isolation but in relation to a broader set of fac-
tors. While less explored in existing SEO studies, this approach is grounded in
scientific practices that consider data within its wider context and environment.

Our study is rooted in established scientific methodologies while also pushing
the boundaries of conventional SEO research. By leveraging advanced machine
learning techniques and adopting a novel approach to data analysis, our research
contributes to a deeper understanding of SEO in the digital age and paves the
way for future scientific explorations in this field.

1.5 Results and Considerations


Our research has yielded significant insights into the field of SEO-modeling.
The key finding is the consistent enhancement of accuracy of SEO classifica-
tion models when incorporating relational values. The XGBoost and CatBoost
models, when trained on datasets with relational values, showed an improve-
ment of 1.3% in accuracy, resulting in XGBoost achieving 67.5% and CatBoost
68.0%, compared to their performance with original values. This suggests that
understanding web pages in relation to top performers within the same search
context offers a more refined approach to predicting Search Engine rankings.
Moreover, the study underscores the importance of large scale, diverse datasets
in capturing the multifaceted nature of SEO.

Looking ahead, the results from this study open exciting avenues for future
research and practical applications. The improved accuracy of SEO models
sets the stage for more advanced digital marketing strategies, enabling busi-
nesses and content creators to enhance their online visibility more effectively.
The integration of relational values in SEO-modeling presents an opportunity
to develop more sophisticated tools that can better navigate the complexities of
Search Engine algorithms.

Chapter 2

Extended Background: Components of SEO Classification

2.1 Overview
The digital age has transformed how information is accessed and consumed,
making search engines crucial tools in navigating the vast array of online data.
These engines perform essential functions such as crawling, indexing, and
ranking, using various algorithms to ensure users receive
relevant results for their queries (Rogers 2002). SEO emerges as a pivotal strat-
egy in this context, aiming to increase a website’s visibility and organic traffic.
This thesis explores the multifaceted aspects of SEO, including on-page SEO,
which focuses on content quality and HTML optimization, and off-page SEO,
which deals with backlinks and external factors. Technical SEO also plays a
role in improving the backend aspects of a website to enhance its search engine
ranking. In particular, the thesis examines the influence of on-page factors like
high-quality content, efficient HTML tags, and user experience in determining
a website’s search engine ranking due to its accessability for webmasters. The
thesis also ventures further into machine learning, particularly the classification
tasks within SEO, employing algorithms such as XGBoost and CatBoost. XG-
Boost, known for its efficiency and scalability in handling large datasets, has
been instrumental in predicting the ranking potential of web pages in the re-
search project of Matošević et al. (2021). Similarly, CatBoost's unique approach
to handling categorical data has made it a valuable tool for handling diverse
features (Hancock & Khoshgoftaar 2020). This comprehensive study aims to
shed light on the dynamic interplay between search engine algorithms, SEO
strategies, and the latest advancements in machine learning, offering insights
into how these elements collectively influence the understanding of SEO ranking
and visibility of web pages in an increasingly competitive digital landscape.

2.2 Search Engines

Search engines are sophisticated online tools that enable users to find informa-
tion online by typing in keywords or phrases as queries. They are fundamental
to how we interact with vast online data. At their core, search engines perform
three primary functions: crawling, creating indexes, and providing search users
with ranked lists of the websites they have determined are the most relevant
(Rogers 2002).

Crawling is the process by which search engines use bots (spiders or crawlers)
to systematically browse the web and collect information from every accessible
webpage. The data collected from these pages is then indexed, stored, and orga-
nized to make it easily retrievable. When the user inputs a query into a search
engine, the search engine processes it by comparing the search terms against
the pages indexed in its database. It looks for relevant matches, considering
keyword density, site speed, links, etc. (Hendry & Efthimiadis 2008). Search en-
gines employ intricate algorithms to determine the ranking of the indexed pages
in order of relevance. These algorithms consider hundreds of ranking factors
from the index and other factors, such as the quality of content, user engage-
ment, page speed, backlinks, and many others (Ravi et al. 2012), which will be
discussed in more detail later. The specific factors and the weight each carries
vary between search engines and evolve as they are updated. The final output
is a list of web pages, typically displayed in order of relevance. This is what the
user sees after submitting a search query. The results are intended to be the
most relevant and valuable with respect to the user's search terms.

2.3 Search Engine Optimization


Search Engine Optimization (SEO) represents a crucial tactic in digital market-
ing, pivotal in enhancing the prominence and positioning of websites on search
engines such as Google, Bing, or Yahoo. The core objective of SEO is to aug-
ment organic (non-paid) traffic to a website, making it more discoverable and
appealing to these search engines.

At the heart of SEO lies identifying and utilizing the right keywords. These
are the specific words and expressions that potential visitors enter into search
engines as queries. Incorporating relevant keywords strategically in
website content, titles, and meta tags is fundamental for improving a page’s
visibility in search results. However, SEO is not just about keyword stuffing;
the caliber of the content is just as vital. Search engines prioritize informative,
relevant, and engaging content for the user, encompassing various forms of me-
dia such as text, images, videos, and other interactive functionalities (Sharma
et al. 2019).

On-page SEO is a critical component involving enhancing individual web pages
to improve their search engine ranking and attract more relevant traffic. This
includes the content and keyword aspects and the optimization of HTML tags
(like title, meta, and header tags) and images. On the other side, we have off-
page SEO, which refers to actions taken outside the website. This primarily
involves building backlinks, meaning links from other reputable sites. These
backlinks are akin to votes of confidence from one site to another based on the
central search engine algorithms used, such as Google’s PageRank, Domain Au-
thority, and other link-based graph networks, which are crucial for building a
site’s overall reputation and authority (Balabantaray 2017).

Technical SEO focuses on the backend aspects of a website, aiming to improve
technical elements to boost a page's ranking in search engines. This encompasses
optimizing site speed, ensuring mobile friendliness, facilitating easy indexing
and crawlability, improving site architecture, integrating structured data, and
enhancing website security. These technical optimizations ensure that a website
is user-friendly and search-engine-friendly (Balabantaray 2017). User experience
(UX) has also grown essential for SEO. Search engines are increasingly factor-
ing in the overall user experience a website offers in their ranking algorithms.
This includes how easily users can navigate the site, the interface design, and
the visitor’s general experience while interacting with the site (Ravi et al. 2012).

Grasping how search engine algorithms work is essential for effective SEO. These
algorithms are continually updated. Thus, staying up to date with these changes
is vital for adapting and refining SEO strategies. Moreover, using analytics and
reporting tools, like Google Analytics, plays a pivotal role. These tools help
experts manually monitor the performance of SEO strategies, providing in-
sights into traffic, user engagement, and conversion metrics, making it possible
to adapt to ever-changing search engines (Google 2024).

2.4 On-page Factors

On-page SEO factors are integral elements of search engine optimization and a
big part of search engine index factors and, thus, algorithm features, focusing
on optimizing various parts of a website that affect its search engine rankings.
Unlike so-called off-page factors, which mostly involve external features like
backlinks, on-page SEO encompasses the aspects of a website that the devel-
oper can directly control. This makes on-page factors the most accessible method
of rank manipulation (Sharma et al. 2019).

The cornerstone of on-page SEO is content. High-quality, relevant, and reg-
ularly updated content is fundamental in attracting and retaining website visi-
tors. Search engines endeavor to present users with the most pertinent results
for their queries, and thus, they favor websites that offer valuable and informa-
tive content. This content must be crafted not only to engage readers but also
to incorporate strategic keyword usage. The keywords are terms and phrases
that potential visitors use in search queries. A website can significantly improve
its search engine visibility by thoughtfully using these keywords in its content
(Sharma et al. 2019).

A big part of on-page factors is about the HTML structure. Title tags are
an example of such an on-page element. These HTML elements define the titles
of web pages and are displayed on SERPs as clickable headlines. Title tags
should be succinct and descriptive and include relevant keywords to enhance
their effectiveness in search rankings and user click-through rates. Though not
directly impacting search rankings, the Meta descriptions are vital in on-page
SEO. These HTML attributes provide a summary of a webpage’s content. A
well crafted meta description can entice users to click on a search result, thus
improving the click-through rate (CTR) and benefiting the site’s SEO perfor-
mance. The headings and subheadings are also essential for on-page SEO. They
make the content more readable and enjoyable for users while aiding search
engines in comprehending the structure and critical points of a page’s content.
Using relevant keywords in these headings can further boost SEO. The URL
structure is also an essential aspect of on-page SEO. Search engines favor URLs
that are easy to read and include keywords relevant to the page's content (Luh
et al. 2016).

There are many factors to consider when working with on-page SEO. It is a
multifaceted aspect of website optimization that requires attention to elements
like content, title tags, meta descriptions, headings, URL structure, image op-
timization, internal linking, page speed, and much more (Luh et al. 2016). By
effectively managing these on-page factors, website owners can notably enhance
their ranking on search engines and better understand search engines' inner
workings (Sharma et al. 2019).
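
Since these on-page elements live directly in a page's HTML, they can be collected programmatically. The following is a minimal, illustrative sketch of extracting the title tag, meta description, and headings with the BeautifulSoup library; the sample page is invented for illustration (cf. Listing 4.4 for the thesis's actual scraper).

from bs4 import BeautifulSoup

# Invented sample page, standing in for scraped HTML.
html = """
<html><head>
  <title>Best Coffee in Stockholm | Example Cafe</title>
  <meta name="description" content="Freshly roasted coffee in central Stockholm.">
</head><body><h1>Best Coffee in Stockholm</h1><h2>Our beans</h2></body></html>
"""

soup = BeautifulSoup(html, "html.parser")
title = soup.title.string if soup.title else None
meta = soup.find("meta", attrs={"name": "description"})
description = meta["content"] if meta else None
headings = [h.get_text(strip=True) for h in soup.find_all(["h1", "h2", "h3"])]

print(title)
print(description)
print(headings)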

2.5 Off-page Factors

Off-page SEO encompasses the tactics and measures implemented beyond the
limitations of the website's on-page factors to impact rankings within search
engines. It focuses on boosting the site's reputation and authority by building
relationships and creating content on other platforms, along with other factors
crucial for improving a website's visibility and ranking in search engines.

The most significant off-page factor is backlinks, sometimes called inbound or
external links. Backlinks are links from other websites that point to a specific
site. They are among the top factors that search engines use to determine
a website’s credibility and authority (Hochstotter & Koch 2009). The logic is
straightforward and relatively easy: if multiple reputable sites link to a page
(or a complete website, depending on which algorithm is used), search engines
take this as an indication that the page is reputable and offers valuable content. However, not all
backlinks are valued equally. Links from high-authority, relevant websites have
a much more substantial impact than those from low-quality, irrelevant sites.
Moreover, the way these links are acquired also matters. Organic or naturally
earned links are usually far more valuable than those obtained through manip-
ulative tactics (Rogers 2002).

There are several ways of calculating the relevance and trust of a page or website.
One such algorithm is PageRank (PR), a system created by Google’s founders,
Larry Page and Sergey Brin, which was one of the first algorithms utilized by
Google for ranking web pages in search result listings. It is based on academic
citations’ logic and treats links as votes. In this system, a link from one website
to another acts as an endorsement and a sign of trust and quality. However,
not all votes are equal. Links from reputable and authoritative sites carry more
weight. This system revolutionized how search results were ranked, moving
away from keyword-centric methods to a system based on the interconnected-
ness and authority of web pages (Page et al. 1998). The formula is PageRank
as described by Rogers (2002) where PR(A) is the PR of page A, PR(Tn) is
the PageRank of a page Tn linking to A, C(Tn) is the number of outbound
links on page Tn, and d is the damping factor, typically set to 0.85, giving each
link a base value. This formula calculates the PageRank of a page based on the
PageRank of pages linking to it, adjusted by their number of outbound links
and the damping factor.

PR(A) = (1 − d) + d(PR(T1)/C(T1) + ... + PR(Tn)/C(Tn))
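
To make the iterative nature of this computation concrete, the following minimal sketch applies the formula above to a toy three-page web. The graph, damping factor, and iteration count are illustrative assumptions; practical implementations iterate until the values converge.

# Iterative PageRank following the (1 - d) + d * sum(PR(Tn)/C(Tn)) formula above.
def pagerank(links, d=0.85, iterations=50):
    """`links` maps each page to the list of pages it links out to."""
    pr = {page: 1.0 for page in links}  # initial PR value per page
    for _ in range(iterations):
        new_pr = {}
        for page in links:
            # Sum PR(Tn)/C(Tn) over every page Tn that links to `page`.
            inbound = sum(pr[src] / len(targets)
                          for src, targets in links.items() if page in targets)
            new_pr[page] = (1 - d) + d * inbound
        pr = new_pr
    return pr

toy_web = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}  # invented link graph
print(pagerank(toy_web))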

Domain Authority (DA) is another important metric, created by Moz and used
by SEO professionals, that forecasts a website's potential ranking on search engines.
This calculation is based on assessing several aspects, such as the total quantity
of links and the domains from which these links originate. Domain Authority is
expressed on a scale from 1 to 100, with higher scores showing higher potential
to rank. It is important to note that Domain Authority is not a metric used
by Google in its ranking algorithms but is a valuable tool for SEO professionals
to gauge a website's potential search engine performance (Reyes-Lillo et al. 2023).

Link-related factors extend beyond just the number of links. The relevance
of linking sites, the quality of the link source, the anchor text used in the link,
and the recency of links all factor into how search engines assess these links. For
instance, a link from a site closely related to your topic is more valuable than
a random link from an unrelated site. Similarly, links from new sources can be
more valuable than repeated links from the same source, as they indicate the
site’s growing popularity and relevance (Rogers 2002).

2.6 Classification
Machine learning, a vital branch of artificial intelligence (AI), has revolution-
ized how data is analyzed and interpreted. At the heart of machine learning
lies the concept of enabling systems to learn from data without being explicitly
programmed. Classification is a key function in machine learning, which has
wide-ranging applications, including Search Engine Optimization (SEO).

The essence of machine learning is to develop algorithms capable of process-
ing and learning from data, subsequently making decisions or predictions based
on this accumulated knowledge. The process begins with training these algo-
rithms on a dataset. This training data comprises examples of inputs, each
paired with the correct output or label. For instance, in a machine learning
model designed for email spam detection, the training data would include nu-
merous email examples labeled ’spam’ or ’not spam’ (King et al. 1995).

A crucial aspect of machine learning is the identification and utilization of fea-
tures. Features are individual measurable properties or characteristics of the
phenomena under observation (Brown & Mues 2012). In the context of SEO,
these features encompass a range of elements, such as the number of backlinks
a website has, the page loading speed, the density and distribution of keywords
in the content, and even the structure of URLs. Once the machine learning
model is trained on these features, it learns to identify patterns and relationships
within the data. This model, essentially an algorithmic representation, defines
the relationship between the given features and their corresponding class labels
(Thakur et al. 2011). The effectiveness of a machine learning model largely de-
pends on the quality and comprehensiveness of the training data, as well as the
sophistication of the algorithm. Various algorithms are used for classification in
machine learning, each with strengths and applications. Standard algorithms
include decision trees, support vector machines, naive Bayes classifiers, and neu-
ral networks. These algorithms differ in their approach to finding patterns in
the data and in their complexity. The choice of algorithm can significantly in-
fluence the performance and accuracy of the machine learning model (Lim et al.
2000).

In SEO-modeling, the classification method is well established in previous
research and can be used for performance predictions. The classification involves
categorizing various elements of a website or its content into different features
that can impact its search engine ranking (Banaei & Honarvar 2017). For ex-
ample, machine learning models can classify web pages based on their likelihood
to rank well for specific search queries, which will be looked at in this thesis.
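
As a concrete illustration of this setup, the following minimal sketch trains a decision tree, one of the standard algorithms named above, on a handful of hypothetical on-page features; the feature values and labels are invented for illustration, and scikit-learn is assumed available.

from sklearn.tree import DecisionTreeClassifier

# Hypothetical feature vectors: [keyword_density, page_speed_score, backlinks]
X = [[0.02, 90, 120], [0.10, 40, 3], [0.03, 85, 80],
     [0.08, 50, 5], [0.01, 95, 200], [0.12, 35, 1]]
y = [1, 0, 1, 0, 1, 0]  # toy labels: 1 = ranks well, 0 = ranks poorly

model = DecisionTreeClassifier(random_state=0).fit(X, y)
print(model.predict([[0.025, 88, 100]]))  # classify an unseen page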

2.7 XGBoost
XGBoost stands for eXtreme Gradient Boosting and implements gradient boost-
ing machines (GBM), a type of machine learning algorithm that falls under the
umbrella of ensemble learning. Ensemble learning combines multiple models to
improve predictions’ overall performance, robustness, and accuracy. XGBoost
is an enhancement of gradient boosting, a method that builds models sequen-
tially, with each new model attempting to correct the previous errors. One of
the critical features of XGBoost is its use of a gradient-boosting framework.
This framework operates by constructing a series of decision trees sequentially,
where each subsequent tree is built to minimize the errors or residuals of the
previous trees. In gradient boosting, the loss function, a measure of how well the
model performs, is minimized using the gradient descent algorithm. XGBoost
enhances this process by introducing a more structured model formulation to
mitigate overfitting, which makes it more robust and efficient than standard
GBM (Chen & Guestrin 2016).

XGBoost has been designed to be highly scalable and efficient. It utilizes ad-
vanced principles such as parallel and distributed computing, making it excep-
tionally fast and capable of handling large datasets (Chen & Guestrin 2016).
This scalability is a critical factor that differentiates XGBoost from traditional
gradient-boosting methods. The algorithm can efficiently run on high-performance
computing environments, making it a practical choice for big data applications.
Another significant advantage of XGBoost is its flexibility. The algorithm can be
used for regression (predicting continuous values) and classification (predicting
categorical values) tasks. It supports various loss functions and customization
options, making it adaptable to various problems (Al Daoud 2019).

Additionally, XGBoost provides features for handling missing values and sup-
ports various data formats, enhancing its usability in real-world scenarios. XG-
Boost also incorporates several mechanisms to avoid overfitting. Besides the
regularized model framework, it offers features like subsampling of the data and
column sampling, further enhancing the robustness of the model. These fea-
tures, combined with the capacity to fine-tune many hyperparameters, allow
practitioners to build highly optimized models tailored to their specific data
and requirements (Zhang et al. 2022).

In practical applications, XGBoost has shown outstanding performance in var-
ious domains, including but not limited to finance, healthcare, marketing, and
natural language processing. Its ability to handle large-scale data and deliver
high-performance models has made it a preferred algorithm in numerous data
science competitions and real-world applications. It has also been shown to be par-
ticularly effective in previous attempts to classify SEO (Matošević et al. 2021).

XGBoost has been employed to classify web pages based on their likelihood
to rank well for specific keywords or queries. By training the algorithm on
features such as keyword density, page structure, internal and external links,
and even user behavior data (like click-through rates and bounce rates), XGBoost
was taught to predict the ranking potential of pages with an accuracy as high
as 70% (Matošević et al. 2021). This predictive capability allows SEO
professionals to prioritize and optimize content more effectively.
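
As an illustration of how such a classifier is set up, the following minimal sketch trains an XGBoost model on synthetic tabular data. The features, labels, and hyperparameter values are illustrative assumptions rather than the thesis's configuration; the actual training code appears as Listing 4.11.

import numpy as np
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from xgboost import XGBClassifier

rng = np.random.default_rng(0)
X = rng.random((1000, 10))         # 10 synthetic on-page features
y = rng.integers(0, 6, size=1000)  # 6 rank groups, labels 0-5

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2,
                                                    random_state=0)
model = XGBClassifier(
    n_estimators=200,      # number of boosted trees
    max_depth=6,           # tree depth controls model complexity
    learning_rate=0.1,     # shrinkage applied to each tree's contribution
    subsample=0.8,         # row subsampling, one of the overfitting guards
    colsample_bytree=0.8,  # column subsampling, another overfitting guard
)
model.fit(X_train, y_train)
print(accuracy_score(y_test, model.predict(X_test)))  # ~chance on random data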

2.8 CatBoost
CatBoost, an acronym for ”Categorical Boosting”, is a state-of-the-art machine
learning algorithm that has garnered attention in academic and applied data
science fields for its exceptional performance, particularly in dealing with cate-
gorical data. Developed by researchers at Yandex, CatBoost is an open-source
gradient boosting library and part of the same ensemble learning family of algorithms
as XGBoost (Prokhorenkova et al. 2018). CatBoost distinguishes itself within
this family through its innovative approach to processing categorical data and
handling various complexities inherent in diverse datasets, which can be lever-
aged within SEO-modeling.

A primary challenge in machine learning, especially in gradient boosting, is
effectively handling categorical data, a common data type that includes non-
numeric categories or labels. Traditional machine learning models require pre-
processing of this data, typically through techniques like one-hot encoding,
which can be computationally expensive and inefficient, especially with high
cardinality categories (categories with many unique values). CatBoost, however,
integrates a novel algorithm for processing categorical data that eliminates ex-
tensive pre-processing, reducing the time and computational resources required
(Prokhorenkova et al. 2018).

CatBoost's approach to categorical data involves transforming categories into
numerical values using a combination of statistics from the dataset in a way
that is resistant to overfitting. This process is known as ”ordered boosting,” a
permutation-driven approach that reduces the likelihood of overfitting, a com-
mon issue in standard gradient boosting methods. Additionally, CatBoost uti-
lizes a unique technique known as ”target statistics,” where categorical features
are converted into numerical values based on the target variable, further en-
hancing the model’s accuracy and efficiency (Hancock & Khoshgoftaar 2020).
CatBoost excels in its training speed and scalability. It is designed for high-
performance computing environments and can efficiently handle large datasets.
This scalability is achieved through various optimization techniques, such as
symmetric trees and oblivious trees, which ensure consistent and fast model
training, even with large and complex datasets (Prokhorenkova et al. 2018).

CatBoost has shown outstanding performance in practical applications, from
standard regression and classification problems to more complex tasks like rank-
ing and recommendation systems. It has not been used in SEO-modeling, but
it has been applied in similar contexts. Its ability to handle large-scale data and
deliver high-accuracy models has made it popular in industries such as finance
(Al Daoud 2019). This makes it a very good model for comparing results with
XGBoost when using large-scale SEO data.
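
The following minimal sketch illustrates this native handling of categorical features: a column of country codes is passed to CatBoost directly, with no one-hot encoding step. The data and parameter values are invented for illustration; the thesis's actual training code appears as Listing 4.10.

import pandas as pd
from catboost import CatBoostClassifier

df = pd.DataFrame({
    "title_length": [55, 12, 60, 8, 70, 15],
    "country":      ["se", "us", "se", "de", "us", "de"],  # categorical feature
    "rank_group":   [1, 0, 1, 0, 1, 0],
})
X, y = df[["title_length", "country"]], df["rank_group"]

model = CatBoostClassifier(iterations=100, depth=4, verbose=False)
model.fit(X, y, cat_features=["country"])  # CatBoost encodes "country" internally
print(model.predict(X))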

2.9 Feature Engineering


Feature engineering is essential in machine learning. This process involves se-
lecting, modifying, and creating features from raw data to enhance the efficacy
of machine learning models (Turner et al. 1999). In SEO classification, feature
engineering enables the extraction of meaningful insights from web page data,
thereby improving the accuracy of predictions regarding page ranking. The
essence of feature engineering in SEO is to transform raw web data into infor-
mative, machine-readable inputs. For instance, aspects like keyword density, the
structure of URLs, page loading speed, and backlink profiles are converted into
quantitative features. This transformation is pivotal because machine learning
models require numerical inputs to learn and make predictions (Zheng & Casari
2018). Therefore, the process of feature engineering directly impacts the per-
formance of models like XGBoost or CatBoost in classifying web pages for SEO
purposes.

Feature engineering in SEO is multifaceted. One aspect is the extraction of tex-
tual features from web page content. This involves analyzing the frequency and
distribution of keywords, assessing content relevance and quality, and even pars-
ing HTML structures to extract information from titles and meta-tags. Another
crucial aspect is link analysis. Here, the focus is on quantifying the number and
quality of backlinks, a vital indicator of a webpage’s authority and relevance.
Sophisticated algorithms can delve into the anchor text of these links, providing
deeper insights into how external sources perceive the page’s content. Beyond
these, user engagement metrics like click-through rates (CTR) and bounce rates
are also significant features. These metrics offer insights into how real users in-
teract with the website, reflecting its usability and relevance to the audience.
Additionally, technical SEO features, including page load speed and mobile-
friendliness, are incorporated to gauge the overall user experience (Portier et al.
2020).

Practical feature engineering is a collaborative effort, requiring a deep under-
standing of both the domain of SEO and the intricacies of machine learning. It
is a balancing act of including relevant features that contribute to model per-
formance while avoiding overfitting, where a model is too closely tailored to the
training data and performs poorly on new data. This approach ensures the best
possible outcomes for SEO classification tasks (Zheng & Casari 2018). In ap-
plying machine learning to SEO, good feature engineering has shown promising
results. By accurately classifying web pages based on their potential ranking in
search engine results, SEO practitioners can better optimize their websites.
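
As a small, concrete example of such a transformation, the following sketch computes keyword density, one of the raw-to-quantitative conversions described above; the sample text is invented.

import re

def keyword_density(text: str, keyword: str) -> float:
    """Share of words in `text` matching `keyword`, case-insensitively."""
    words = re.findall(r"\w+", text.lower())
    return words.count(keyword.lower()) / len(words) if words else 0.0

page_text = "SEO tools help with SEO strategy and SEO reporting."
print(keyword_density(page_text, "seo"))  # 3 of 9 words -> 0.333...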

Chapter 3

Methodology

3.1 Research Strategy


To answer our research question, we have chosen a quantitative empirical re-
search strategy. More specifically, an experiment employing data science method-
ologies to delve into the research question about the accuracy and applicability
of SEO classification models and how they can be improved.

Our empirical research strategy is fundamentally quantitative in nature, uti-
lizing large datasets to construct and rigorously test SEO classification models.
The approach is an experiment, where it’s not merely an observation of pri-
mary data but an active testing of hypotheses regarding the improvement of
SEO models through feature engineering and data scale-up. This methodol-
ogy aligns closely with Creswell & Creswell's (2017) perspective on quantitative
methods, serving as an ideal framework to test the objective of enhancing SEO
classification models. Our research strategy, grounded in empirical experimen-
tation with SEO data, is the optimal approach to explore our research question
due to the inherent nature of search engines, which fundamentally operate on
vast and varied datasets. This method is especially well-suited for experiments
within SEO classification, a field where the data’s complexity and diversity are
reflective of the real-world dynamics that influence search engine rankings. By
engaging in empirical experimentation, we can uncover subtle patterns and cor-
relations that might otherwise remain hidden.

The established methodology of using machine learning for SEO classification
provides a robust framework for our study. This approach allows us to delve
deeper into the dataset, enhancing the precision and effectiveness of the classi-
fication models. Our research aims to extend the existing boundaries of SEO
knowledge, building on a well-founded experimental method to push the en-
velope further. By choosing this strategy, we ensure that our investigation is
not only aligned with the intrinsic characteristics of search engines but also
contributes significantly to advancing the field of SEO.

3.2 Research Framework Overview


The research framework of our study can be divided into several subsections,
each integral to the process of SEO classification using machine learning. The
framework follows a supervised machine learning pipeline tailored to our ap-
plication, involving data collection, preprocessing, feature extraction, classifica-
tion, and evaluation, as seen in Alpaydin (2020).

3.2.1 Data Collection


The study will focus on collecting a diverse array of web pages across different
content types. The objective will be to compile a balanced dataset, essential for
precise SEO modeling. This phase will include detailed web scraping, adhering
to ethical standards and technical limitations.

3.2.2 Data Preprocessing


The collected data will be imported into Python for exploratory analysis and
will undergo a rigorous preprocessing phase. The goal will be to remove data
inconsistencies, such as duplicates, and to focus on SEO-relevant features like
HTML structure and keyword frequency. Additionally, a layer of relational
expansion will be added, effectively doubling the number of features.

3.2.3 Feature Extraction


Subsequent to preprocessing, the dataset will be divided into training and test-
ing sets. Key features influencing SEO rankings, such as keyword density and
HTML tag usage, will be identified using methods like correlation matrix and
Information Gain calculation. Additionally, the dataset will be split into two
subsets: one containing the raw data and the other featuring preprocessed re-
lational values.

3.2.4 SEO Classification


This phase will involve training advanced machine learning models such as XG-
Boost and CatBoost, utilizing Python libraries with the same names, on the
identified features from the respective datasets. Through cross-validation, these
models will be fine-tuned, and their accuracy in predicting SEO rankings will
be evaluated.

3.2.5 Evaluation
After training the XGBoost and CatBoost models, they will be rigorously eval-
uated to determine their effectiveness in accurately predicting SEO rankings.
This evaluation will involve measuring the models' accuracy, which will serve as
a key performance indicator. Through this process, we will assess the practical-
ity of the models in real-world SEO applications and their ability to generalize
from the training data to unseen data. This critical stage will allow us to val-
idate our models’ predictions and ensure that our research contributions are
both reliable and applicable to the field of SEO.

3.3 Software Use and Implementation


For the successful execution of this thesis, specific software, AI tools and tech-
nologies have been carefully selected to align with the research objectives and
to ensure efficient data processing and analysis.

3.3.1 ChatGPT 3.5 API for Query Generation


For this research project, ChatGPT 3.5 API was instrumental in generating
a large number of search queries. Leveraging the capabilities of this language
model, we were able to create a diverse and extensive list of business search
terms and translate them to different languages. These queries were essential
for compiling a comprehensive dataset.

ChatGPT 3.5 API facilitated the creation of these queries by understanding the
context and nuances required for effective SEO analysis. It generated search
terms that were not only relevant to the study but also varied enough to cover
different business sectors and geographic locations.
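
As a minimal sketch, a translation call of the kind described above could look as follows; the model name, prompt wording, and the translate_query helper are illustrative assumptions rather than the exact code used in this project (the complete code is linked in Appendix D).

# A minimal, hypothetical sketch of translating a search query via the
# ChatGPT 3.5 API; the prompt wording and helper are assumptions
from openai import OpenAI

client = OpenAI()  # assumes the OPENAI_API_KEY environment variable is set

def translate_query(query, target_language):
    # Ask the model to translate a single business search query
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[
            {"role": "system",
             "content": "You translate short business search queries."},
            {"role": "user",
             "content": f"Translate '{query}' to {target_language}. "
                        "Reply with the translation only."},
        ],
    )
    return response.choices[0].message.content.strip()

# Example usage: translate_query("plumber near the city center", "Swedish")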

3.3.2 DataForSEO Google SERP API


A pivotal tool in this research is the Google SERP data API provided by
DataForSEO. This API offers access to a rich source of SERP data, essential for
obtaining accurate search ranking information. As outlined in the DataForSEO
documentation DataForSEO (2023), this API provides diverse metrics includ-
ing organic search results, paid results, featured snippets, and local pack list-
ings. The capability to retrieve real-time data is particularly beneficial for this
project, as it ensures the collection of current and relevant data points for our
SEO models. The comprehensive nature of the DataForSEO API allows for a
detailed and nuanced understanding of search engine rankings, crucial for the
accuracy of the classification models developed in this research.

3.3.3 Webshare Rotating Proxy


A rotating proxy server provides you with a pool of IP addresses, automatically
assigning a new IP for each connection or after a fixed interval. This means that
each request you make goes through a different IP address, helping to mask your
actual IP and reducing the likelihood of being detected by web servers. Using
rotating proxies for HTTPS requests can significantly enhance the capabilities of
web scraping, data collection, or any application where maintaining anonymity
and avoiding IP blocks or rate limits is crucial. Webshare offers rotating proxies
to facilitate these needs effectively (Webshare 2024).

3.3.4 Python for Machine Learning


Python, renowned for its simplicity and powerful libraries, is the chosen pro-
gramming language for this project. Its extensive range of libraries like Pan-
das and NumPy is ideal for data manipulation tasks. As McKinney (2022)
notes, these libraries provide robust tools for handling, cleaning, and trans-
forming data, a critical process in preparing the dataset for analysis. More-
over, Python’s ecosystem includes advanced machine learning libraries such as
Scikit-learn, XGBoost, and CatBoost. Pedregosa et al. (2011) highlight Scikit-
learn’s comprehensive set of tools for machine learning and statistical modeling, such as classification, regression, clustering, and dimensionality reduction. XG-
Boost and CatBoost, as further elaborated by Chen & Guestrin (2016) and
Prokhorenkova et al. (2018), offer high-performance models that are especially
effective for large-scale datasets.

3.3.5 Multiprocessing for Parallel Processing


The multiprocessing capabilities in Python play a crucial role in enhancing the
computational efficiency of this project. The multiprocessing library in Python,
as described by Van Rossum & Drake (2011), enables effective parallel pro-
cessing, which is vital for managing the intensive computational demands of
handling large datasets. This library allows for the distribution of data process-
ing tasks across multiple processors, thereby reducing the time required for data
scraping, preprocessing, and model training. Such parallelization is essential in
a data-intensive project like this, where the volume of data could otherwise lead
to significant processing bottlenecks.

3.4 Dataset Compilation


In this critical section of our research, we will go further into the methodological
framework underlying our dataset compilation, which serves as the backbone of
our empirical inquiry into SEO optimization. Recognizing the paramount im-
portance of a robust and refined dataset, we employed a strategic approach to
gather, preprocess, and structure our data, ensuring that it not only supports
but enhances the predictive capabilities of our machine learning models. By
prioritizing meticulous feature selection and comprehensive data sourcing, we
aimed to craft a dataset that is both representative of the complex dynamics
of search engine algorithms and fine-tuned to address the nuances of SEO chal-
lenges. This section outlines the rigorous processes involved in the compilation
and preparation of our dataset, highlighting the methods employed to minimize
biases and maximize the relevance and accuracy of our modeling efforts.

3.4.1 Sourcing Primary Data


Primary data refers to information collected directly by the researcher for the
specific purpose of their study. This data type is obtained first-hand, tailored
to the research question, and not previously published or used in other studies.
Primary data is precious in fields where current and specific information is crit-
ical. Bell et al. (2022) explain this concept, emphasizing that primary data is
gathered directly from the source, thereby providing fresh, unprocessed insights
pertinent to the researcher’s specific query or hypothesis.

In this study, primary data is vital due to the rapid and continual changes in
search engine algorithms. Given the dynamic nature of the SEO landscape, even data that is only a few months old can become outdated. Primary data collection methods,
such as real-time web scraping and accessing up-to-date information via APIs
like Google SERP from DataForSEO, ensure that the dataset reflects the latest
trends and algorithmic updates in search engines. This approach guarantees
that the analysis and resultant SEO classification model are grounded in the
most current and relevant data, a critical factor for the validity and effective-
ness of SEO strategies. By using primary data, this research is not just relevant
to the field of SEO, it is aligned with its core needs. In SEO-related studies,
where staying on top of the latest algorithmic changes is a must for accurate
analysis and modeling, primary data is key. It is what makes it possible to keep pace with the dynamic SEO landscape, ensuring that the research is timely and specific to the current SEO environment.

3.4.2 Feature Selection


The dataset is a compiled collection of web page tags and attributes, search engine ranking data, and other relevant variables for keyword analysis. The data has been sourced through the Google SERP data API from DataForSEO, based on 46 000 random search queries, each containing both a company-related keyword and a geographic term. These search queries were created from a compiled list of 307 city names and 150 business terms, which were combined with each other (307 * 150), resulting in roughly 46 000 queries. See Appendix B for the terms used for the queries. Each search query generated 100 classified results through the Google SERP data API (for a total of 4 600 000 potential records). This was combined with comprehensive web scraping of the results, expanding the feature set. The mentioned approach allows for the collection of a vast range of features, which are essential for creating an accurate and robust SEO classification model. After ethical considerations (read more in the chapter ”3.8 Ethical Considerations”) and data cleaning that removed records with lost or unreadable information, the total number of usable records was 1 653 946.
Each feature in the dataset plays a specific role in modeling the SEO performance of a web page and was inspired by the results of Matošević et al. (2021) and the report Darrazi & Mak (2023), which can be found in Appendix C. During feature selection, we collected as many features as possible based on common HTML tags, as found in Jamsa et al. (2002). After some further data cleaning, we calculated multicollinearity and Information Gain among the selected features to drop those that did not add value to the feature set. This left us with features such as the number of H1 tags (h1_tags_amount), the frequency of keywords in title tags (title_keyword_freq), the average character count of H1 tags (h1_avg_char_count), and 29 other features that all impact on-page SEO. Similarly, the geographical relevance of content (geography, h1_geography_freq, etc.) is included to understand the impact of location-based optimization on search rankings (see the full feature set in Appendix A).

The dataset’s combination of SERP data (such as rank_group) with web page
features offers a comprehensive view of the classification model, allowing it to
learn from a wide array of variables that influence a page’s ranking. This rich
set of features is pivotal in developing an SEO model that accurately reflects
the multifaceted nature of search engine ranking algorithms.

The features are focused on on-page factors rather than a combination of both
on-page and off-page. The choice is rooted in practical considerations regard-
ing the control and immediacy of SEO adjustments. On-page elements such as
content quality, keyword optimization, and HTML tags are directly accessible
and modifiable by web developers and SEO professionals. This direct acces-
sibility makes on-page features not only crucial but also the most actionable
aspect of SEO strategy. While off-page factors like backlinks and social sig-
nals are undeniably significant in search engine algorithms, they often require
long-term efforts and external collaborations, which can be more challenging
to influence directly and swiftly. On-page factors, in contrast, offer immediate
opportunities for optimization. Adjustments to elements like title tags, meta
descriptions, and content relevancy can be implemented and measured quickly,
providing SEO professionals with more agile and responsive tools to influence a
website’s search engine performance.

This emphasis on on-page features aligns with the objective of equipping SEO
professionals with actionable insights. By analyzing the impact of these directly
controllable elements, the developed SEO model can serve as a practical guide
for immediate website optimizations. These changes are not only more feasi-
ble to implement but also offer the potential for rapid improvements in SERP
rankings. Therefore, while the dataset may focus on a subset of the factors
influencing page rankings, it strategically targets those most relevant and ac-
cessible to SEO practitioners. This approach ensures that the model’s insights
are practical and actionable, enabling professionals to optimize web pages in a
dynamic search engine landscape.

Dataset Features    Explanation
url                 The web page URL
rank_group          Grouped ranking position in search engine results
keyword             Term associated with the search query
geography           Geographic location relevant to the web page content

Table 3.1: This table shows the URL, keyword, and geography, which together work like a unique identifier for each record. The rank_group is the classifier, ranked within 1-100 in its respective query group. See Appendix A for the complete feature set.

The ’rel’ prefix in certain features stands for ’relational’, indicating the relation of that feature to the corresponding feature of the best-ranking page for a particular search query. For example, ’rel_h1_tags_amount’ compares the number of H1 tags on a page to the number of H1 tags on the best-ranking page for the same search query. The ’freq’ suffix stands for ’frequency’, denoting the occurrence frequency of certain elements (like keywords or geographic terms) within specific tags or content areas of a web page.

These features collectively offer a rich dataset that provides a multi-dimensional view of SEO factors. The dataset not only includes basic web page elements
like tags and keywords but also relational metrics that allow for a deeper anal-
ysis of how web pages stack up against top-performing pages in search engine
results. This comprehensive dataset is instrumental in developing a robust SEO
classification model capable of predicting search engine rankings with higher
accuracy.

3.5 Data Analysis


This thesis outlines a comprehensive approach undertaken to dissect and un-
derstand the intricate on-page factors influencing search engine rankings. This
analysis is pivotal in feature engineering and model training, primarily focus-
ing on extracting HTML structure data, relational expansions, analyzing term
frequencies, and employing advanced machine learning techniques.

3.5.1 Analysis of HTML Structure Data


The foundation of our data analysis is extracting HTML structural elements
from a vast array of web pages. This process involves parsing HTML to gather
critical on-page elements like title tags, headers (H1, H2, H3, etc.), meta de-
scriptions, and image alt attributes. The objective is to capture the structural
nuances of each page, which are hypothesized to be influential in search engine
rankings.
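
As a minimal sketch, such parsing could be done with BeautifulSoup as below; the tag choices mirror features from Appendix A, but the helper itself is an illustrative assumption rather than the exact extraction code used (which is linked in Appendix D).

# A minimal, hypothetical sketch of extracting on-page structural
# elements with BeautifulSoup; not the exact code used in this project
from bs4 import BeautifulSoup

def extract_structure(html):
    soup = BeautifulSoup(html, "html.parser")
    h1_tags = soup.find_all("h1")
    meta_desc = soup.find("meta", attrs={"name": "description"})
    return {
        "title": soup.title.get_text() if soup.title else "",
        "h1_tags_amount": len(h1_tags),
        "h1_avg_char_count": (sum(len(t.get_text()) for t in h1_tags)
                              / len(h1_tags)) if h1_tags else 0,
        "meta_description": meta_desc.get("content", "") if meta_desc else "",
    }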

3.5.2 Using SERP Data as a Classifier
Search Engine Results Page (SERP) data from Google will be used as a classifier
in the development of the classification models. This strategy is in line with the
works of Matošević et al. (2021), who discuss the role of web search data
in understanding and modeling web dynamics. The SERP data will provide
the necessary labels for the supervised learning models, categorizing web pages
based on their search engine rankings.

3.5.3 Relational Expansions


In our data analysis process, a key aspect involves expanding the dataset to
incorporate relational metrics. This entails a comparative analysis, where we
juxtapose the attributes of a web page against those of the top-performing pages
for the same search queries. The rationale behind this is not just to evaluate
web pages in isolation but to understand them in the context of their digital
ecosystem. By analyzing how each web page measures up against the highest-
ranking pages, we gain insights into what differentiates a top-ranking page from
lower-ranking ones within the same search context.

This approach is rooted in our hypothesis that the structure of a web page,
though significant, does not solely determine its rank in the context of a specific
search query. We propose that the competitive landscape, that is, how a page compares to others in the same space, could have a role in determining its ranking.
This perspective challenges the traditional view of SEO, which often empha-
sizes the optimization of individual page attributes in isolation. Our hypothesis
further suggests that the highest-ranking page within a specific search context
could serve as a benchmark or, as said before, an anchor point. This anchor
provides a reference against which we can evaluate the performance of other
pages. The idea is that in the realm of search engine rankings, it’s not just
about how well-optimized a page is in absolute terms, but how it stacks up
against the current best performer in the same category or search context. In
essence, the relational expansion of data in our study is more than a mere comparison; it is an attempt to understand the dynamics of relative performance in
the SEO landscape. By analyzing pages in relation to the best performers, we
aim to unravel the nuances of competitive SEO and provide a more comprehen-
sive understanding of what drives search engine rankings. This approach aligns
with our broader aim to develop a more contextually aware and competitively
sensitive SEO classification model.

The formula below represents the relational data calculation used in our study for SEO classification modeling. In this formula, R1 denotes the best-classified record within a specific SERP (Search Engine Results Page) group, while Rn represents another record within the same group. The division of Rn by R1 allows for a comparative analysis, indicating how a given web page’s features and attributes perform relative to the top-ranked page in its category. This relational approach aims to provide more contextual insights into search engine rankings, enhancing the predictive power and accuracy of the SEO models. By understanding and analyzing the performance of web pages in relation to the best performers, we can derive more nuanced and effective strategies for SEO optimization.

Rn / R1
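
For example, if the best-ranking page in a SERP group contains 2 H1 tags (R1) while another page in the same group contains 4 (Rn), that page’s rel_h1_tags_amount becomes 4 / 2 = 2.0, indicating it uses twice as many H1 tags as the current best performer.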

3.5.4 Frequency Analysis of Query-Associated Terms


A significant portion of our analysis is dedicated to examining the frequency
of specific terms related to the search queries. This involves determining the
prevalence of targeted keywords within various HTML elements and correlating
these frequencies with the page’s ranking. This frequency analysis aims to
uncover patterns indicating effective keyword usage strategies for search engine
rankings.
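
A minimal sketch of such a frequency count is shown below, reusing the BeautifulSoup object from the sketch in Section 3.5.1; the helper is an illustrative assumption, not the exact code used.

# A minimal, hypothetical sketch of counting keyword occurrences
# within a given HTML tag's text
def keyword_freq_in_tag(soup, tag_name, keyword):
    text = " ".join(t.get_text().lower() for t in soup.find_all(tag_name))
    return text.count(keyword.lower())

# Example usage: keyword_freq_in_tag(soup, "title", "plumber")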

3.5.5 Feature Engineering Phase


This phase is about refining and selecting the most relevant data features for
the machine learning model. It involves assessing the extracted, abstracted, and
frequency-based data to determine which features will likely provide the most
insightful inputs for the predictive model. The choice of relevant features was guided by previous research, namely Matošević et al. (2021) and Darrazi & Mak (2023) (found in Appendix C), as indicated by each feature’s information gain.

We also streamlined the complexity of SERP rank prediction by grouping the absolute rankings into six broader categories, facilitating a more manageable
and practical approach to model training. This method involved classifying
web pages into clusters based on their search engine ranking positions, such as
the top 5, positions 6 to 10, 11 to 20, 21 to 40, 41 to 70, and 71 to 100. This
classification not only simplified the data analysis but also aligned closely with
practical SEO objectives. Grouping the rankings allows the models to focus
on discerning broader trends and patterns in website optimization and visibil-
ity, which is more relevant for real-world SEO strategies and interventions. By
reducing the granularity of the ranking data, the study aimed to enhance the
models’ predictive capabilities while maintaining their relevance and applicabil-
ity in practical SEO contexts.

3.5.6 Training the XGBoost & CatBoost Model


The culminating step in our data analysis is training the classification models.
XGBoost and CatBoost are the model libraries chosen for their efficiency and effectiveness in handling large datasets and complex feature sets. This model
training evaluates which features, including the relational abstractions, have
the most significant impact on classification accuracy. Using XGBoost and CatBoost allows for a robust analysis of feature importance. It provides insights into the effectiveness of relational abstraction in improving the predictive accuracy of the SEO classification model.

We chose to use both models to ensure the reliability of the results on the relational dataset, assuring us that the outcome did not merely depend on the choice of one specific model or on overfitting.

3.5.7 Evaluation Metric


Accuracy is a fundamental metric used to evaluate the performance of classification models. It is calculated as the ratio of correct predictions to the total number of instances in the dataset. In simple terms, accuracy measures how often the model makes the correct prediction, whether it is correctly identifying a true condition (true positives) or correctly recognizing a false condition (true negatives). In the formula below, TN stands for True Negatives, TP for True Positives, FP for False Positives, and FN for False Negatives (Alpaydin 2020).

Accuracy = (TN + TP) / (TN + TP + FP + FN)

Other evaluation metrics, such as the Receiver Operating Characteristic (ROC) curve (Hoo et al. 2017), are better suited for imbalanced datasets. SEO data, however, are perfectly balanced in theory, since each page receives a unique rank placement per SERP. This makes accuracy both relevant and the evaluation metric preferred by both Matošević et al. (2021) and Banaei & Honarvar (2017). That is why we chose accuracy as our evaluation metric; it also makes it easier to compare our results with previous research.

3.5.8 Addressing Fitting Issues


In addressing the challenges of potential overfitting or underfitting in our ma-
chine learning models, this study emphasized reducing the complexity of the
dataset through normalization and rigorous feature engineering. We employed
normalization techniques to standardize data scales, enhancing model training
efficiency and stability. Extensive feature engineering was conducted to identify
and construct the most relevant features, which helped streamline the model ar-
chitecture and improve interpretability. We also used cross-validation to ensure
robust testing across different data subsets and applied regularization techniques
to prevent the models from becoming overly complex. Hyperparameter tuning
was crucial for balancing the models’ complexity against their predictive accuracy. The parameters are found in the Jupyter Notebook ’classifier_models.ipynb’
linked in Appendix D. Furthermore, by comparing different models (XGBoost
and CatBoost) and rerunning them seven times each, we gained deeper insights
into the consistency of model performance and the effectiveness of our feature
engineering strategies.
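
As a minimal sketch, the cross-validation and regularization described above could look as follows; the parameter values are illustrative assumptions rather than the tuned values (those are in the linked notebook), and X and y are assumed to hold the feature matrix and grouped labels prepared in this study.

# A minimal sketch of cross-validation with regularization; depth and
# l2_leaf_reg below are illustrative, not the tuned parameters
import catboost as cb
from sklearn.model_selection import cross_val_score

# Depth limits and L2 leaf regularization curb model complexity
model = cb.CatBoostClassifier(depth=6, l2_leaf_reg=3.0, verbose=False)

# Five-fold cross-validation yields one accuracy estimate per fold
scores = cross_val_score(model, X, y, cv=5, scoring="accuracy")
print("Mean accuracy: %.3f (+/- %.3f)" % (scores.mean(), scores.std()))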

3.5.9 Addressing Biases
A comprehensive approach was adopted to address potential biases in data
collection, preprocessing, and modeling. This involved rigorous data validation
and cleaning to ensure data quality, regular monitoring of the web scraping and
updating the data collection to minimize errors, and the application of diverse
preprocessing techniques to address issues like missing values or outliers. Careful
selection of features and testing various modeling configurations were essential to
reduce errors and biases in the final model. The methodology was continuously
refined, considering new insights, to enhance the reliability and accuracy of the
study’s outcomes.

3.6 Alternative Research Strategies


In contrast to the current empirical approach utilized in this thesis, an alterna-
tive research strategy that could be explored is analytical research as described
in Kothari (2004). This approach would involve a detailed examination and
analysis of top-performing web pages to identify common features that con-
tribute to their high rankings in search engine results. The analytical research
strategy, focusing more on theoretical and conceptual analysis, offers a differ-
ent perspective compared to the data-driven empirical approach. One method is
in-depth analysis of top-performing pages. Instead of primarily relying on large-
scale data collection and statistical analysis, the analytical approach would focus
on a comprehensive examination of high-ranking web pages. This method would
involve scrutinizing various aspects of these pages, such as their content struc-
ture, design elements, keyword usage, and overall user experience. An objective
of such a strategy could be to discern patterns and commonalities among the top-
ranking pages. This involves a systematic breakdown of each element of these
pages to understand what makes them successful in SEO terms. The analysis
would be geared towards constructing a theoretical model that encapsulates the
key features and strategies that are most effective in search engine optimization.

This approach would also include a comparative analysis, where features of top-
performing pages are contrasted with those of lower-ranking pages to identify
what distinguishes them. This strategy allows for a more conceptual under-
standing of SEO success factors, moving beyond empirical data to delve into
the qualitative aspects that contribute to a page’s high ranking. Based on the
findings from the in-depth analysis, this approach would aim to formulate the-
ories or models that encapsulate the essence of effective SEO strategies. These
models would provide a conceptual framework that can guide SEO practices,
offering a more theoretical perspective compared to the data-centric models
derived from empirical research.

3.7 Validity and Reliability
In any research endeavor, the concepts of validity and reliability are fundamen-
tal to ensuring the accuracy and trustworthiness of the findings. Validity refers
to how accurately a study reflects the specific concept it intends to measure.
At the same time, reliability pertains to the consistency of a measurement over
time. For validity, it is crucial to consider both internal and external aspects.
Internal validity relates to how the research findings accurately depict the re-
ality of the subjects being studied, devoid of external influences or biases. In
the context of this SEO research, the study has been designed and executed based entirely on the empirical information gain results from previous works. The ex-
ternal validity concerns how the findings can be generalized beyond the specific
settings of the study (Kothari 2004). In our case, the diversity and size of the
dataset play a critical role in enhancing external validity, as they allow for the
generalization of findings across different web domains and search contexts. Co-
hen et al. (2002) emphasize the importance of these aspects, outlining strategies
to enhance both internal and external validity in research studies.

Reliability in this research is addressed through the consistency and replicability of the data collection and analysis methods. By employing systematic
procedures for web scraping and data analysis, and most importantly, using
established and proven tools like the Google SERP API from DataForSEO, the
study aims to ensure that the data collected is consistent and the analysis re-
producible. This approach, backed by the reliability of these tools, aligns with
the recommendations of Bell et al. (2022), who highlight the significance of
consistency in data collection techniques and analytical methods to ensure the
reliability of research findings. Moreover, the use of machine learning models
such as XGBoost further contributes to the reliability of the study. Due to their
proven effectiveness in handling large datasets, these robust models provide a
reliable means of analyzing the complex relationships between the various SEO
factors and web page rankings. The iterative nature of machine learning algo-
rithms, where models are continually refined through training and testing, also
enhances the reliability of the predictive insights generated by the study (Chen
& Guestrin 2016).

3.8 Ethical Considerations


In conducting research, particularly in fields involving data collection and anal-
ysis, ethical considerations are paramount to ensure the integrity and social
responsibility of the study. This thesis, involving web scraping and analyzing
search engine data, presents unique ethical challenges that require careful con-
sideration. This research project upholds high ethical standards by respecting
the autonomy and intentions of website owners, ensuring confidentiality and
anonymity in data handling, and being mindful of the potential implications
of the research findings. By adhering to these principles, the study aims to
contribute valuable knowledge to the field of SEO while maintaining a strong
ethical foundation.

While the concept of voluntary participation is more typically associated with human subjects, it can extend to the use of data from websites. It’s essential
to respect the intentions of website owners, acknowledging that some may not
consent to their data being scraped. Therefore, a cautious approach is adopted
when web scraping, always reading the communicated robots.txt to respect their
policies, prioritizing public data, and respecting the wishes of website adminis-
trators. Informed consent is complex in the context of web scraping, as obtaining
direct permission from each website owner is impractical. However, adherence
to the terms of service of websites and APIs (like Google SERP) is crucial.
This compliance involves avoiding scraping websites that explicitly prohibit it
in their terms of service, thus aligning with the standard of informed consent by
proxy. Ensuring anonymity and confidentiality is critical, especially since web-
site data might indirectly reveal information about individuals or organizations.
The study addresses this by not publishing the collected data, ensuring that no
personal or sensitive information is available together with the published results.

Besides aligning with ethical data scraping, there is potential for harm if the
research findings are used to manipulate or hazardously exploit search engine
algorithms, contradicting the ethical principle of beneficence. To mitigate this,
the communication of results is handled responsibly, focusing on academic and
practical insights rather than exploiting loopholes in search algorithms. The
study acknowledges that search engine algorithms, like Google’s, are propri-
etary and respect the intellectual property rights of these entities. Google’s
search algorithm is a trade secret and is not intended to be reverse-engineered
or understood in detail. The research respects this by not attempting to pre-
cisely replicate or decode the algorithm. Instead, the focus is on analyzing
patterns and correlations that are observable and ethical to study.

Chapter 4

Main Results

4.1 Technical Development


During our experiment, we began by creating a robust query list to simulate
real-world search scenarios, combining 150 business terms with 307 city names,
which resulted in 46,000 unique queries. We ensured the relevance of these
queries across different linguistic contexts by translating them using OpenAI’s
ChatGPT-3.5 Turbo Large Language Model (LLM), particularly for cities where
the predominant language was not English. The detailed breakdown of business
terms and city names used can be found in Appendix B.

For the actual implementation, we utilized Python to automate the generation and translation of queries. We first read our prepared lists of queries and
city names from their respective CSV files. As we processed each query and city
combination, we checked the city’s primary language. For cities where English
was the primary language, we directly replaced a placeholder in the query string
with the city name. This code is linked in Appendix D. Below is a snippet of how we created the query list, first creating five lists of different types of values. This process not only generated a new query but also provided us with a simplified version of the query for later keyword analysis. Progress was logged in the console, showing the status of the main query generation, which helped in monitoring the workflow effectively. The resulting lists, comprising modified queries, keywords, cities, languages, and countries, were then compiled into one CSV. This methodical approach ensured that our
dataset was comprehensive and tailored to reflect varied search intents across
different geographical and linguistic landscapes.

# Loading the query components
import pandas as pd

queries = pd.read_csv("/queries.csv")
cities = pd.read_csv("/cities.csv")

# Initializing empty lists for the query components
query_list, keyword_list = [], []
location_list, language_list, country_list = [], [], []

# Looping through the queries
for queries_index, queries_row in queries.iterrows():

    # Printing progression
    print("Progress: " + str(int(((queries_index + 1) / (len(queries) + 1)) * 100)) + "%")

    # Looping through the cities
    for cities_index, cities_row in cities.iterrows():

        # Identify language
        if cities_row[2].strip() == "English":

            # Here is the definition of new_string and
            # new_string_for_keyword, which can be found
            # in the complete code linked in Appendix D.

            # Add the values to lists
            query_list.append(new_string)
            keyword_list.append(new_string_for_keyword)
            location_list.append(cities_row[0])
            language_list.append(cities_row[2])
            country_list.append(cities_row[1])

# Creating a dataframe from the lists
df = pd.DataFrame({
    'Queries': query_list,
    'Keyword': keyword_list,
    'Location': location_list,
    'Language': language_list,
    'Country': country_list,
})

# Creating a CSV out of the dataframe of values
df.to_csv("all_queries.csv", index=True)
Listing 4.1: Creating the queries

After developing our extensive list of queries, we utilized these queries to extract data from the DataForSEO SERP API. This API provided us with 4.6 million SERP results, which included page links and their corresponding SERP ranks for each query. This rich dataset, containing a total of 4.6 million potential records, formed the foundation for our analysis.

In our Python script, we defined a function seodata_serp to handle the API requests. For each query in our list, we initialized the RestClient with our credentials. We then crafted a post_data dictionary for each query, specifying the language, country, and desired depth of search results (i.e., the number of results per request), set to 100. We made a POST request to the DataForSEO
API endpoint to retrieve live, organic SERP data from Google. Successful re-
sponses were logged, detailing the progress and key parameters of each query,
and the responses were collected in a list. This systematic approach allowed us
to efficiently gather detailed SERP data.

# SERP API call function taking a list of queries as a parameter
def seodata_serp(queries_list):

    # Initializing an empty list for the responses
    response_list = []

    # Looping through the query list
    for queries_index, queries in enumerate(queries_list["Queries"]):

        # Adding the credentials to the API call
        client = RestClient("...", "...")
        post_data = dict()

        # Adding the parameters to the API call
        post_data[len(post_data)] = dict(
            language_name=queries_list["Language"][queries_index],
            location_name=queries_list["Country"][queries_index],
            keyword=queries,
            depth=100,
        )

        # Saving the response
        response = client.post(
            "/v3/serp/google/organic/live/advanced",
            post_data
        )

        # If the response is successful
        if response["status_code"] == 20000:

            # Add the response to the list of responses
            response_list.append([
                queries_list["Keyword"][queries_index],
                queries_list["Location"][queries_index],
                queries_list["Country"][queries_index],
                response])
Listing 4.2: Gathering the targeted links and classifiers

To expand our dataset with on-page feature values, we set up a Proxmox cluster
with four workstation computers, collectively harnessing 24 cores. This power-
ful setup allowed us to scrape web pages from 4.6 million links efficiently. To
manage the high volume of requests from a single network IP without triggering
security protocols on the target websites, we utilized rotating proxy IPs provided
by Webshare.io. Each HTTP request was routed through these proxies, with
each proxy rotating with every request to mask our scraping activities. This
setup was crucial for completing the data collection without interruption, which
took about 34 days of continuous operation using multiprocessing techniques.
We divided the dataset among the 24 cores, with each core processing a subset
of the data, thereby maximizing efficiency and minimizing time.

# Saving the proxy http request; url and headers are defined in the
# complete code linked in Appendix D
import requests

response = requests.get(
    url,
    headers=headers,

    # Adding the credentials to the request
    proxies={
        "http": "http://....webshare.io:80/",
        "https": "http://....webshare.io:80/"
    },
    timeout=10
)
Listing 4.3: HTTP request through a rotating proxy

# Web scraper function taking a list of links and two values
# telling it where to continue from in the case of a break;
# fetch_html and save_html_to_json are defined in the complete
# code linked in Appendix D
def scraper_worker(urls_data_part, part, procentile):

    # Initializing an empty dictionary and list
    html_data_temp = {}
    failed_urls_temp = []

    # Calculating the starting index from the given percentile
    preset_procentile = int((len(urls_data_part) / 100) * procentile)
    temp = -1
    print("Process initiated for part: " + str(part))

    # Looping through the urls
    for i in range(preset_procentile, len(urls_data_part)):
        url = urls_data_part[i][0]

        # Http request through proxy
        html = fetch_html(url)
        if html != "Error":
            html_data_temp[url] = html
        else:
            failed_urls_temp.append(url)

        # Calculating progression
        current_percentage = int(((i + 1) / (len(urls_data_part))) * 100)

        if current_percentage > temp:
            temp = current_percentage

            # Printing progression
            print(f"Part: {part}, Errors: {len(failed_urls_temp)}, "
                  f"Progress: {temp}%")

            if temp > 0:
                # Saving the subset
                save_html_to_json(html_data_temp,
                                  f"html_data_{part}_{temp}.json")
Listing 4.4: Web scraper function collecting on-page feature values

# Multiprocessing function taking sets of links and starting points
import multiprocessing as mp

def parallel_scraper(urls_data_parts, start_percentiles):

    # Mapping the available cores for parallel processing
    with mp.Pool() as pool:
        pool.starmap(
            # Running the scraper_worker() function on each part
            scraper_worker,
            [(urls_data_parts[part], part, start_percentiles[part])
             for part in urls_data_parts])
Listing 4.5: Parallel processing applying multiprocessing to scraper_worker()
The corresponding Python code illustrates how we implemented the scraping process. Functions like fetch_html managed the fetching of web pages using dynamically rotating proxies to maintain anonymity and avoid blocking. The scraper loop and the scraper_worker function organized the multiprocessing tasks, dividing the workload across multiple cores to streamline the scraping process.
These functions logged the progress and handled errors, ensuring any failed re-
quests were retried and data integrity was maintained. The final step involved
saving the scraped HTML data and any failed URLs for further analysis, en-
suring comprehensive documentation of the scraping process. This meticulous
approach not only optimized our data collection but also provided a robust
foundation for subsequent SEO analysis.

The web scraping process was geared towards collecting specific data points
listed in Appendix A, Table A.2 of our documentation. However, we encoun-
tered several challenges that reduced the volume of usable data. A significant
number of web pages blocked our scraping attempts due to our IP addresses
or because we accessed the sites via headless browsers, which some sites can
detect and restrict. Additionally, compliance with web standards meant re-
specting robots.txt files that explicitly forbade scraping, leading to further data
exclusion. These restrictions left us with 3,251,724 records. Further data cleaning was necessary to remove records with unreadable symbols or missing values, which culminated in a refined dataset of 1,653,946 records ready for use in our analysis; in total, approximately 64% of the potential data thus proved inaccessible or unusable. This stage was critical to ensure the quality and reliability of our data for subsequent machine learning processes.

We further developed the dataset comprising 1,653,946 records by applying a relational expansion technique. This method allowed us to enhance the raw data
by establishing a relative scale within each predefined group of data, creating a
more refined dataset ready for subsequent machine learning training. The pro-
cess began by identifying the best-performing records within each group based
on a specified classifier value. This was achieved using a segment of code that
grouped the dataset by ’group id’ and then pinpointed the index of the record
with the lowest classifier value within each group. Once identified, we expanded
on each group by adjusting the data values relative to these best-performing
records. Specifically, we took certain features (columns 5 to 35) and normalized
their values by dividing each by the corresponding value of the best-performing
record in the group, ensuring to avoid any division by zero. This normalization
was critical for aligning the data across different groups, making it consistent
and comparable, thus preparing it for effective model training. This approach
not only improved the accuracy of our models by standardizing data points rel-
ative to peak performance benchmarks within their groups but also facilitated a
deeper analysis of the underlying patterns across varied segments of the dataset.

# Finding the row indices with the lowest classifier value for
# each group
best_classified_indices = (
    rel_dataset.groupby('group_id')[rel_dataset.columns[1]].idxmin())

# For each group, divide values in columns 5-35 by the
# best classified row values
for group_id, best_idx in best_classified_indices.items():
    # Select columns 5-35 for the current group
    group_rows = rel_dataset[rel_dataset['group_id'] ==
        group_id][rel_dataset.columns[4:35]]

    # Getting the best classified row values, replacing 0 with a
    # small number to avoid division by zero
    best_values = rel_dataset.loc[best_idx,
        rel_dataset.columns[4:35]].replace(0, 1e-10)

    # Dividing each row by the best classified row values and
    # updating the DataFrame
    rel_dataset.loc[rel_dataset['group_id'] == group_id,
        rel_dataset.columns[4:35]] = group_rows.div(best_values)
Listing 4.6: Calculating the relational expansion and creating a new dataset

We then conducted an in-depth analysis of the datasets to optimize the features


for our machine learning models. We started by calculating correlation matrices
for both the relational and original datasets to identify highly correlated fea-
tures. We then utilized the Python library Scikit-learn to compute information
gain, helping us discern the predictive power of each feature. Features that
showed more than 70% correlation with others and had lower information gains
compared to their correlated counterparts were removed. This step helped re-
duce redundancy. We continued by eliminating the least informative features
until only the top 15 most impactful features remained for both datasets. We did this by computing the correlation matrices for the relational (‘rel_correlation_matrix‘) and original (‘org_correlation_matrix‘) datasets to identify relationships between features. Then, we created a Decision Tree Classifier, using entropy as the criterion for measuring the quality of splits, to determine the importance of each feature. The classifier was trained, and the feature importances were extracted (‘feature_importances_‘). These importances were then placed into a DataFrame for better readability and sorted by their importance. This
allowed us to visually confirm and analyze the most significant features, helping
in decision-making for feature selection. The correlation matrices and tables of
the information gain for the respective datasets can be found in Appendix D
(repository file ‘classifier_models.ipynb‘).
# Creating two correlation matrices
rel_correlation_matrix = rel_dataset.corr()
org_correlation_matrix = org_dataset.corr()
Listing 4.7: Cross-calculating the correlation between features

# Creating a decision tree classifier with entropy criterion
from sklearn.tree import DecisionTreeClassifier

clf = DecisionTreeClassifier(criterion='entropy')
clf.fit(X, y)

# Getting feature importances
feature_importances = clf.feature_importances_

# Creating a DataFrame for easier interpretation
feature_importance_df = pd.DataFrame({'Feature': X.columns,
    'Importance': feature_importances})

# Sorting by importance
feature_importance_df.sort_values(by='Importance',
    ascending=False, inplace=True)
Listing 4.8: Calculating the information gain for each feature
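
The pruning rule described in the text above can be sketched as follows; this is a minimal illustration assuming the correlation matrix from Listing 4.7 and a hypothetical importance_lookup dictionary built from the DataFrame in Listing 4.8, not the exact code used (see Appendix D).

# A minimal sketch of the pruning rule: for feature pairs correlated
# above 0.7, drop the one with the lower information gain.
# importance_lookup is a hypothetical helper, e.g.:
# importance_lookup = dict(zip(feature_importance_df['Feature'],
#                              feature_importance_df['Importance']))
import numpy as np
import pandas as pd

corr = rel_correlation_matrix.abs()

# Keep only the upper triangle so each pair is checked once
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))

to_drop = set()
for f1 in upper.columns:
    for f2 in upper.index:
        if pd.notna(upper.loc[f2, f1]) and upper.loc[f2, f1] > 0.7:
            # Drop whichever feature of the pair is less informative
            weaker = f1 if importance_lookup[f1] < importance_lookup[f2] else f2
            to_drop.add(weaker)

rel_dataset = rel_dataset.drop(columns=list(to_drop))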

Once we had two preprocessed datasets ready, we recognized the need to sim-
plify the classification system to streamline the training process. Originally
dealing with 100 classifiers based on rank positions, we condensed this to just
six classifiers to reduce complexity and improve model manageability. We de-
fined a function, map to smaller range, to categorize the SERP positions into
six distinct groups based on their rank. This function assigns a numeric label
from 1 to 6, grouping ranks into clusters like Top 5, 6-10, and so on, up to
71-100. We then applied this mapping function to our ’classifiers’ column using
the .apply() method, effectively transforming the original detailed classifications
into broader categories, making the dataset less cumbersome for our machine
learning algorithms to process. This reduction not only simplified the training
but also aligned the classifiers with more strategic groupings relevant to SEO
analysis.
# Defining a function to map values to the specified ranges
def map_to_smaller_range(value):
    if value <= 5:
        return 1  # Top 5
    elif value <= 10:
        return 2  # 6-10
    elif value <= 20:
        return 3  # 11-20
    elif value <= 40:
        return 4  # 21-40
    elif value <= 70:
        return 5  # 41-70
    else:
        return 6  # 71-100

# Applying the function to the classifiers
y = y.apply(map_to_smaller_range)
Listing 4.9: Reducing the number of classifiers to 6 groups

Once our datasets were prepped and optimized, we moved on to the model
training phase. We developed two Python scripts utilizing the XGBoost and CatBoost libraries, chosen for their efficiency and performance in classification
tasks. To ensure the reliability of our results, each model was trained seven
times on both datasets. We split our dataset into training and testing sets
using Scikit-learn’s train test split function, setting aside 20% of the data for
testing to evaluate the model’s performance. We then instantiated a CatBoost
classifier, training it with both datasets separately. The model was trained on
the training set, while the testing set was used as the evaluation set to monitor
performance and avoid overfitting.
# Splitting the dataset into training and testing sets
from sklearn.model_selection import train_test_split
import catboost as cb

X_train, X_test, y_train, y_test = train_test_split(X, y,
    test_size=0.2, random_state=42)

# Creating a CatBoost classifier
model = cb.CatBoostClassifier()

# Training the model
model.fit(X_train, y_train, eval_set=(X_test, y_test),
    verbose=False)
Listing 4.10: Training the CatBoost model

Similarly, the XGBoost model was then trained using the training dataset, allowing us to measure its effectiveness on the test set.
# Splitting the data into training and testing sets
import xgboost as xgb

X_train, X_test, y_train, y_test = train_test_split(X, y,
    test_size=0.2, random_state=42)

# Defining the XGBoost classifier
xgb_clf = xgb.XGBClassifier()

# Training the model
xgb_clf.fit(X_train, y_train)
Listing 4.11: Training the XGBoost model

In the final stage of our experiment, we calculated the accuracy for each model
iteration to gauge their performance. As previously mentioned, we used an 80% training and 20% testing data split for this purpose. The models demon-
strated consistency, with a standard deviation of only 0.3% in their accuracy
scores, indicating reliable performance across seven training cycles per model
and dataset used.
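
As a minimal sketch, the consistency across runs can be summarized as below; the accuracy values shown are illustrative placeholders, not the thesis results.

# A minimal sketch of summarizing the repeated training cycles; the
# values below are illustrative placeholders, not the actual results
import numpy as np

run_accuracies = np.array([0.678, 0.681, 0.679, 0.682, 0.677, 0.680, 0.683])
print("Mean: %.3f, Std: %.3f" % (run_accuracies.mean(), run_accuracies.std()))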

After predictions were made using the models, we used the ‘accuracy_score‘
function from the ‘sklearn.metrics‘ package to determine how well the models
performed against the actual outcomes. For each model, such as CatBoost and
XGBoost, we printed out the accuracy in a formatted output to easily compare
their effectiveness. This method provided a clear, quantitative measure of how
well each model could predict SEO rankings based on the input data processed
through the training and testing phases.
# Calculating the accuracy, where y_pred holds the model's
# predictions on X_test
from sklearn.metrics import accuracy_score

accuracy = accuracy_score(y_test, y_pred)
Listing 4.12: Calculating the accuracy of the respective models

4.2 Model Performance Accuracy


A key aspect of our study’s results involves establishing a baseline accuracy
due to the dataset not being perfectly balanced in practice because of the clas-
sification groupings and the data loss. Baseline accuracy makes it easier to
understand performance and is crucial for the comparison between the machine
learning models used. Baseline accuracy serves as a critical reference point,
providing context to the performance enhancements achieved by the models.

In our research, the baseline was determined based on the classification of web pages into six distinct ranking groups: top 5, 6-10, 11-20, 21-40, 41-70, and 71-100. This categorization reduced the complexity inherent in predicting the exact SERP rank. The baseline accuracy was calculated from the most populous category within our dataset, reflecting a naïve model that would always predict the web page to be in this most common category.

Classifier    Dataset Portion
6             34.970012%
5             32.008301%
3             11.905854%
4              9.630541%
1              5.932906%
2              5.562386%

Table 4.1: The table shows the population percentage of each classifier in the dataset.
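
As a minimal sketch, this baseline can be computed directly from the label distribution; it assumes y is the pandas Series of grouped classifiers prepared earlier, and is an illustration rather than the exact code used.

# A minimal sketch: baseline accuracy is the share of the most
# common class in the label vector y
baseline_accuracy = y.value_counts(normalize=True).max()
print(f"Baseline accuracy: {baseline_accuracy:.2%}")  # approx. 34.97% here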

Our data indicated that the largest group was class 6, which encompassed ap-
proximately 34.97% of the dataset. Thus, the baseline accuracy for our model,
under the assumption of always predicting this most frequent group, was set
at 34.97%. This baseline provides a fundamental benchmark for evaluating the
effectiveness of our machine learning models. In contrast, the XGBoost and
CatBoost models, trained on both original and relational values, demonstrated
a significant improvement over this baseline. Especially the CatBoost which
showed an accuracy of as high as 68.0% with relational values. The compari-
son of these results against the baseline accuracy underscores the value of using
sophisticated machine learning models and relational data in enhancing the
prediction accuracy of SEO rankings. The improvements are significant in the
context of SEO, where such increases in accuracy can have meaningful implica-
tions for web page visibility and traffic.

The results present the findings from the empirical analysis conducted using
XGBoost and CatBoost models on two distinct datasets, one consisting of orig-
inal web-scraped values and the other of relational values derived from com-
paring web page attributes to top-ranking pages for a specific group of pages
based on a search query (which is called SERP). In ensuring the validity of the
results, both the XGBoost and CatBoost models were trained seven times on
each dataset type. This rigorous approach aimed to confirm the consistency
of the findings. The datasets used comprised original web-scraped values and
relational values. The results revealed a noticeable accuracy improvement when using relational values, averaging approximately 1.3% across the models and test cases. The XGBoost model’s accuracy on the dataset with relational values was 67.5%. When trained on the dataset with original web-scraped values, the model demonstrated a slightly lower accuracy of 66.3%. The CatBoost model performed marginally better on the relational dataset, attaining an accuracy of 68.0%. In comparison, the model trained with the original dataset showed an accuracy of 66.6% (see Table 4.2).

Model       Dataset              Accuracy
XGBoost     Relational values    67.5%
XGBoost     Original values      66.3%
CatBoost    Relational values    68.0%
CatBoost    Original values      66.6%

Table 4.2: The table shows the final accuracy of the trained models depending on which dataset was used.

These results indicate that incorporating relational values into the dataset of-
fers a modest but notable improvement in the accuracy of SEO classification
models. The increase in accuracy, while seemingly incremental, is significant
in the context of SEO where even small edges can translate into substantial
improvements in web page rankings. The improved performance of models us-
ing relational values underscores the importance of contextual and comparative
SEO factors. It suggests that understanding a web page’s attributes in relation
to top performers within the same search context provides a more nuanced and
effective approach to predicting search engine rankings.

The results validate the hypothesis that a broader, more contextually aware
dataset can enhance the accuracy of SEO classification models. The findings
from both XGBoost and CatBoost models reinforce the value of relational data
in understanding and predicting the complexities of search engine rankings,
offering meaningful insights for advancing SEO strategies in the competitive
digital marketing landscape.

4.3 Discussion
In evaluating the results of the study, the accuracy measurements from the ma-
chine learning models stand out as a significant achievement, highlighting the
efficacy of the advanced classification techniques used. The choice of XGBoost
and CatBoost was pivotal due to their capability to handle large datasets and
complex feature interactions efficiently, which was essential given the scale and
scope of the data involved. These models are particularly adept at ranking and
classification tasks, which made them suitable for the predictive tasks of SEO
classification. The repeated training cycles further reinforced the reliability of
the results, showing a consistent improvement in accuracy, which confirms the
models’ robustness.

The use of established Python libraries like Scikit-learn, XGBoost, and Cat-
Boost instead of developing models from scratch was a strategic choice that
allowed for more focus on data manipulation and model tuning rather than on
the foundational aspects of algorithm development. These libraries offer well-optimized, tested algorithms that are both scalable and efficient, facilitating a more streamlined research process and reducing the potential for error that comes with custom-coded solutions.

However, the study’s delimitations also play a critical role in shaping the inter-
pretation of its outcomes. The focus on on-page SEO factors, while significant,
does not encapsulate the full spectrum of variables that search engines con-
sider for ranking web pages. Off-page factors like backlinks and social signals,
which were not included in this study, are also known to significantly impact
SEO performance. Additionally, the reliance on web scraping for data collection
introduces potential biases, as not all relevant data may be accessible or accu-
rately representable through this method, and ethical considerations regarding
data privacy and compliance with web usage policies must be addressed.

In summary, the results of this thesis demonstrate a clear advancement in the application of machine learning to SEO, supported by a rigorous methodologi-
cal approach and the strategic use of technological tools. Yet, the delimitations
identified suggest areas for further exploration, particularly in incorporating a
broader range of SEO factors and refining data collection techniques to enhance
the comprehensiveness and ethical standing of the research.

Chapter 5

Conclusion

In this work, we studied the intricacies and potential improvements in SEO clas-
sification models. More specifically, we considered the research question: ”How
can the accuracy of SEO classification models be improved through the use of
large scale, relative feature engineered datasets?”

To address this research question, we implemented advanced machine learning techniques, specifically the XGBoost and CatBoost algorithms. We collected
and synthesized a large dataset, integrating both original web-scraped values
and relational values to provide a comprehensive view of web page features and
their impact on search engine rankings. The analysis was conducted using these
sophisticated models to discern patterns and correlations.

Our results were encouraging, providing evidence that incorporating relational values into the datasets yielded a slight but significant improvement in model accuracy. Specifically, we observed that models trained on relational values consistently outperformed those using only original values, with an average accuracy improvement of approximately 1.3%. The most remarkable conclusion from our study is that the use of relational values, which contextualize a webpage’s features against top performers in similar search contexts, enhanced the predictive capability of SEO classification models, even in the diverse and complex environment of web page ranking. This finding underscores the importance of a more nuanced, context-aware approach in the realm of SEO optimization.

In conclusion, this study addressed the acknowledged limitations in the field of SEO modeling, particularly in the use of classification models. While previous literature, including works by Matošević et al. (2021), Luh et al. (2016), and Banaei & Honarvar (2017), demonstrated the potential of these techniques, our work furthered this exploration by tackling the issues of scalability and robustness often constrained by the size of datasets in those studies. We did so while acknowledging that search engines consider a plethora of factors in their algorithms, and that these factors are intricate and dynamic in nature.

Through our empirical research, we provided evidence that incorporating relational values into SEO classification models consistently enhances their predictive accuracy. This approach contextualizes a webpage's features against those of the top performers in similar search contexts. The improvement, though modest, is substantial in the domain of SEO, where even small increases in accuracy can yield considerable benefits in web page visibility and traffic.

The findings of this study go beyond merely refining existing models; they open
avenues for deeper exploration into the societal and academic implications of
enhanced SEO modeling techniques. For businesses and web content creators,
the advancements in model accuracy and reliability translate into more effective
and strategic web optimization, bolstering visibility and competitiveness in the
digital marketplace. Furthermore, the academic value of this study lies in its
application and testing of machine learning theories within the intricate real-
world environment of SEO, contributing valuable insights to the field. Thus, this
study not only bridges the knowledge gap identified in earlier research but also
lays the groundwork for future investigations that consider the multifaceted ap-
proach necessary for addressing the technical challenges, economic benefits, and
academic exploration in the evolving landscape of SEO and digital marketing.

5.1 Comparison with Existing Work


In comparing the results of this thesis with existing work, we find significant
parallels and advancements in our approach to SEO classification modeling, par-
ticularly in our empirical use of large-scale, feature-engineered datasets. The
research by Banaei & Honarvar (2017), as outlined in the IJCSNS International
Journal of Computer Science and Network Security, concentrated on using ma-
chine learning techniques, specifically ANN, for web page rank estimation based
on SEO parameters. Our work aligns with their objectives but differs in methodology, opting for more efficient models, a much larger dataset, and a novel data pre-processing approach. Our results reflect a similar ambition to improve the predictability of page rankings, although they do not surpass the previously showcased performance (Matošević et al. 2021). Our research also addresses the limitations noted in previous studies. For instance, the work of Banaei & Honarvar (2017) and others cited in their literature review, such as the study by Vazirgiannis et al. (2008), faced constraints due to smaller dataset sizes and a limited scope of SEO factors. Our study expands on this by not only using more extensive datasets, increasing the robustness of the results, but also by incorporating relational values, a novel approach in this field. The average performance increase of 1.3% in our models when using relational data demonstrates a tangible improvement over these earlier methods.

The research presented in ”Using Machine Learning for Web Page Classification in Search Engine Optimization” by Matošević et al. (2021) offers valuable
insights for comparing with our study. Their research employs machine learn-
ing techniques for classifying web pages into predefined categories based on the
extent of content adjustment to SEO guidelines, using data labeled by domain
experts. While Matošević et al. (2021) explored a range of machine learning classifiers including decision trees, SVM, Naïve Bayes, KNN, and logistic regression, our study concentrated on the performance of the XGBoost and CatBoost models, which performed most accurately in our experiments. Matošević et al. (2021) reported classifier accuracies ranging from 54.59% to 69.67%, exceeding their baseline classification accuracy of 48.83%. In our research, we compared the performance of these top-performing models using datasets with original web-scraped values and with relational values. Our findings demonstrated that models using relational values achieved higher accuracies, with XGBoost achieving 67.5% and CatBoost 68.0% accuracy, an improvement of as much as 33.03 percentage points over our baseline accuracy of 34.97%.
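
The construction of this baseline is not restated here; assuming, purely for illustration, that the 34.97% figure corresponds to a majority-class baseline (always predicting the most frequent class), such a baseline can be computed as follows:

import numpy as np
from sklearn.metrics import accuracy_score

def majority_class_baseline(y_train, y_test):
    # Always predict the most frequent class observed in training.
    values, counts = np.unique(y_train, return_counts=True)
    majority = values[np.argmax(counts)]
    return accuracy_score(y_test, np.full(len(y_test), majority))

A baseline of roughly 0.35 set against a model accuracy of roughly 0.68 then yields the improvement of about 33 percentage points reported above.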

SEO modeling has seen various approaches to understanding and predicting web page rankings. From using neural networks and decision trees to exploring new machine-learning techniques, these studies have contributed to the collective understanding of SEO dynamics. However, they all, including this report, struggle with limitations in dataset comprehensiveness and in the ability to generalize findings across different market sectors and website types such as dynamic content (Ke et al. 2006). This is because all the studies are based on a fractional sample of all the online ranking factors and pages. The internet contains billions of websites, and indexing such a large pool of pages requires resources on an industrial level (Akamine et al. 2009), making it hard for researchers to grasp the complete scope of search engine dynamics.

5.2 Delimitations
One of the significant limitations was the available computing power and time.
Advanced machine learning models and web scraping, particularly when deal-
ing with large datasets, require substantial computational resources and time.
Limited compute power constrained the amount of data that could be collected,
the complexity of the models we could run and the depth of analysis we could
perform within a reasonable time frame. This factor limited the refinement of
the models and the granularity of the results.

A notable delimitation was the unavailability of certain features, especially off-page SEO factors like backlink amount and quality, social signals, page speed, and page spam metrics. These factors play a crucial role in SEO but were beyond the scope of this study due to the difficulty of accurately and consistently scraping such data. This limitation means that the study's conclusions are only representative of on-page SEO factors, leaving the impact of off-page factors practically unexplored.

The process of web scraping faced several challenges. Many websites deploy
technical measures to prevent scraping, such as IP blocking, CAPTCHAs, and
dynamic content rendering, making data collection more difficult. Additionally,
the legal and ethical considerations, particularly adherence to websites’ policies
against scraping, further constrained the dataset’s breadth. These limitations
impacted the diversity and representativeness of the dataset, which, in turn,
may affect the generalizability of the study’s findings.

The research design, while comprehensive, inherently possesses certain limitations. The decision to categorize SERP rankings into broader groups, for instance, was necessary to reduce complexity but potentially oversimplified the nuanced nature of search engine rankings, reducing the model's applicability in certain use cases.

Reliance on web-scraped data also raises concerns about data reliability and
potential biases. Web content is dynamic and constantly changing, and there’s
a risk that the scraped data may not accurately reflect the current state of web
pages or the internet at large.

The findings are most applicable to the specific context and dataset studied.
Given the rapid evolution of search engine algorithms and the internet land-
scape, the applicability of the findings may be limited over time or in different
contexts.

While the research provides valuable insights into the application of machine
learning in SEO, these delimitations highlight the need for caution in general-
izing the results. They underscore the importance of considering the study’s
specific context, the constraints under which it was conducted, and the rapidly
changing nature of the digital landscape when interpreting the findings.

5.3 Ethical Implications


As with any advancement in data science and machine learning, there is a po-
tential risk of misuse. The improved models could be used unethically to game
search engine algorithms, potentially leading to unfair competitive advantages
or the proliferation of low-quality, spammy content. It is imperative that the
use of these models is guided by ethical standards that prioritize the integrity
of information and fairness in digital marketing practices.

The enhanced accuracy of SEO classification models has the potential to influence how websites are optimized for search engines. While this can lead to improved visibility for businesses and content creators, there is also a responsibility to ensure these practices do not devolve into manipulative techniques that degrade the quality of search results, such as the aforementioned low-quality content, or violate search engine guidelines. Ethical SEO should focus on improving user experience and providing valuable content, rather than solely on exploiting algorithmic vulnerabilities. The increased sophistication of SEO tools also impacts how information is accessed and consumed by society. There is a risk that the dominant narrative or content could be shaped by those with the most advanced SEO tactics, potentially leading to a homogenization of content or marginalization of less optimized but valuable information sources.

Finally, there is an academic responsibility to ensure that the development and application of advanced machine learning models in SEO are conducted with
transparency and a commitment to advancing knowledge ethically. This involves
continuous reflection on the implications of our work and fostering a research
culture that values ethical considerations as much as technical advancements.

5.4 The use of AI Tools


For this research project, we employed several AI tools to enhance the quality
and efficiency of our work. Grammarly was used extensively for language cor-
rection, ensuring that our written content was clear, grammatically accurate,
and well-structured. This tool helped maintain a high standard of written com-
munication throughout the thesis.

We utilized ChatGPT 3.5 to assist in generating search queries. This AI-powered language model made it possible to create a large number of unique
and relevant search terms, which were essential for compiling a comprehensive
and representative dataset for SEO analysis.

Additionally, GitHub Copilot was employed for code auto-correction. It significantly improved our coding workflow by providing intelligent code suggestions
and corrections in real-time, helping to reduce errors and enhance the overall
efficiency of our programming tasks.

Together, these AI tools played a crucial role in refining our research methodol-
ogy, improving the accuracy and clarity of our documentation, and streamlining
the technical aspects of our project.

5.5 Considerations for Future Work


Future research should continue to explore ways to refine the accuracy of on-
page SEO prediction models. This could involve more detailed analysis of con-
tent structure, user interaction metrics, or advanced HTML feature engineering.
Incorporating emerging web technologies and evolving SEO practices into the
models will be vital in keeping the research relevant and applicable to current digital marketing trends. Integrating off-page factors such as backlinks, social
media presence, and domain authority could significantly enhance the perfor-
mance of SEO classification models. These factors play a crucial role in how
search engines rank pages but present challenges in data collection and interpre-
tation. Future studies could focus on innovative methods to accurately collect
and integrate off-page data, balancing the complexity of these factors with the
practicalities of model implementation.

One of the most intriguing prospects is combining sophisticated SEO modeling with the advancements in natural language processing and generation models
such as large language models. The integration of these fields could revolution-
ize content creation, making it possible to automatically generate high-quality,
SEO-optimized web content at scale. Research in this area could explore how
LLMs can be tailored to adhere not only to SEO guidelines but also to maintain
the originality, readability, and user engagement of the content. The conver-
gence of advanced SEO techniques and LLMs opens up vast opportunities in
the digital marketing space. There is a significant market gap for tools that can
effectively balance search engine visibility with content quality. Future work
should investigate how these technologies can be leveraged to create more equi-
table and efficient digital visibility.

In conclusion, the future of SEO research is poised at an exciting juncture, with opportunities to push the boundaries of already high-performing SEO models, to incorporate off-page factors more comprehensively, and especially to synergize
with cutting-edge LLM technologies. This progression holds the promise of
transforming the digital marketing landscape, offering new tools for visibility
and engagement in the ever-evolving world of online content. However, this
journey must be navigated with a keen sense of responsibility, ensuring that
advancements benefit a wide range of stakeholders while upholding ethical stan-
dards.

Bibliography

Akamine, S., Kato, Y., Kawahara, D., Shinzato, K., Inui, K., Kurohashi, S.
& Kidawara, Y. (2009), Development of a large-scale web crawler and search
engine infrastructure, in ‘Proceedings of the 3rd International Universal Com-
munication Symposium’, pp. 126–131.
Al Daoud, E. (2019), ‘Comparison between xgboost, lightgbm and catboost
using a home credit dataset’, International Journal of Computer and Infor-
mation Engineering 13(1), 6–10.
Alpaydin, E. (2020), Introduction to machine learning, MIT press.
Balabantaray, R. C. (2017), ‘Evaluation of web search engine based on ranking
of results and its features’, International Journal of Information and Com-
munication Technology 10(4), 392–405.
Banaei, H. & Honarvar, A. R. (2017), ‘Web page rank estimation in search
engine based on seo parameters using machine learning techniques’, Int J
Comput Sci Netw Sec 17, 95–100.
Bar-Ilan, J. (2007), ‘Manipulating search engine algorithms: the case of google’,
Journal of Information, Communication and Ethics in Society 5(2/3), 155–
166.
Bell, E., Bryman, A. & Harley, B. (2022), Business research methods, Oxford
university press.
Bhandari, R. S. & Bansal, A. (2018), ‘Impact of search engine optimization as
a marketing tool’, Jindal Journal of Business Research 7(1), 23–36.
Brown, I. & Mues, C. (2012), ‘An experimental comparison of classification
algorithms for imbalanced credit scoring data sets’, Expert systems with ap-
plications 39(3), 3446–3453.
Chen, T. & Guestrin, C. (2016), Xgboost: A scalable tree boosting system, in
‘Proceedings of the 22nd acm sigkdd international conference on knowledge
discovery and data mining’, pp. 785–794.
Cohen, L., Manion, L. & Morrison, K. (2002), Research methods in education,
routledge.

Creswell, J. W. & Creswell, J. D. (2017), Research design: Qualitative, quanti-
tative, and mixed methods approaches, Sage publications.
Darrazi, A. & Mak, L. K. (2023), ‘The use of xai for seo classification’, Research Topics in Data Science, HT2023, Stockholm University.

DataForSEO (2023), ‘serp api documentation’. Accessed: 2024-03-11. URL: https://docs.dataforseo.com/v3/serp/google/overview/?bash
Google (2024), ‘How search works’, https://www.google.com/search/howsearchworks/. Accessed: 2024-03-19.

Hancock, J. & Khoshgoftaar, T. (2020), ‘Catboost for big data: an interdisciplinary review’, Journal of Big Data 7(1), 94.
Hendry, D. G. & Efthimiadis, E. N. (2008), Conceptual models for search en-
gines, in ‘Web search: Multidisciplinary perspectives’, Springer, pp. 277–307.

Hochstotter, N. & Koch, M. (2009), ‘Standard parameters for searching behaviour in search engines and their empirical evaluation’, Journal of information science 35(1), 45–65.
Hoo, Z. H., Candlish, J. & Teare, D. (2017), ‘What is an roc curve?’.
Jamsa, K., King, K. & Anderson, A. (2002), HTML & Web Design, McGraw-
Hill.
Ke, Y., Deng, L., Ng, W. & Lee, D.-L. (2006), ‘Web dynamics and their ram-
ifications for the development of web search engines’, Computer Networks
50(10), 1430–1447.

King, R. D., Feng, C. & Sutherland, A. (1995), ‘Statlog: comparison of classification algorithms on large real-world problems’, Applied Artificial Intelligence an International Journal 9(3), 289–333.
Kothari, C. R. (2004), Research methodology: Methods and techniques, New Age
International.

Lim, T.-S., Loh, W.-Y. & Shih, Y.-S. (2000), ‘A comparison of prediction accu-
racy, complexity, and training time of thirty-three old and new classification
algorithms’, Machine learning 40, 203–228.
Luh, C.-J., Yang, S.-A. & Huang, T.-L. D. (2016), ‘Estimating google’s search
engine ranking function from a search engine optimization perspective’, On-
line Information Review 40(2), 239–255.
Matošević, G., Dobša, J. & Mladenić, D. (2021), ‘Using machine learning for web
page classification in search engine optimization’, Future Internet 13(1), 9.
McKinney, W. (2022), Python for data analysis, O’Reilly Media, Inc.

Page, L., Brin, S., Motwani, R. & Winograd, T. (1998), The pagerank citation ranking: Bringing order to the web, in ‘Proc. of the 7th International World Wide Web Conf’.
Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O.,
Blondel, M., Prettenhofer, P., Weiss, R., Dubourg, V. et al. (2011), ‘Scikit-
learn: Machine learning in python’, the Journal of machine Learning research
12, 2825–2830.
Portier, W. K., Li, Y. & Kouassi, B. A. (2020), Improving search engine ranking
prediction based on a new feature engineering tool, in ‘Proceedings of the
2020 4th International Conference on Vision, Image and Signal Processing’,
pp. 1–6.
Prokhorenkova, L., Gusev, G., Vorobev, A., Dorogush, A. V. & Gulin, A. (2018),
‘Catboost: unbiased boosting with categorical features’, Advances in neural
information processing systems 31.
Ravi, S., Ganesan, N. & Raju, V. (2012), ‘Search engines using evolution-
ary algorithms’, International Journal of Communication Network Security
1(4), 39–44.
Reyes-Lillo, D., Morales-Vargas, A. & Rovira, C. (2023), ‘Reliability of domain
authority scores calculated by moz, semrush, and ahrefs’, Profesional de la
información/Information Professional 32(4).

Rogers, I. (2002), ‘The google pagerank algorithm and how it works’.


Sharma, D., Shukla, R., Giri, A. K. & Kumar, S. (2019), A brief review on
search engine optimization, in ‘2019 9th international conference on cloud
computing, data science & engineering (confluence)’, IEEE, pp. 687–692.

Thakur, A., Sangal, A. & Bindra, H. (2011), ‘Quantitative measurement and comparison of effects of various search engine optimization parameters on alexa traffic rank’, International Journal of Computer Applications 26(5), 15–23.
Turner, C. R., Fuggetta, A., Lavazza, L. & Wolf, A. L. (1999), ‘A conceptual
basis for feature engineering’, Journal of Systems and Software 49(1), 3–15.
Van Rossum, G. & Drake, F. L. (2011), An introduction to Python, Network
Theory Ltd.
Vazirgiannis, M., Drosos, D., Senellart, P. & Vlachou, A. (2008), Web page
rank prediction with markov models, in ‘Proceedings of the 17th international
conference on World Wide Web’, pp. 1075–1076.
Webshare (2024), ‘Proxy server - webshare’, https://www.webshare.io/proxy-server. Accessed: 2023-04-19.

Witten, I., Frank, E., Hall, M. & Pal, C. (2016), Data Mining: Practical Machine Learning Tools and Techniques, Morgan Kaufmann.
Zhang, P., Jia, Y. & Shang, Y. (2022), ‘Research and application of xgboost
in imbalanced data’, International Journal of Distributed Sensor Networks
18(6), 15501329221106935.

Zheng, A. & Casari, A. (2018), Feature engineering for machine learning: principles and techniques for data scientists, O’Reilly Media, Inc.

Appendix A

Full Feature Set

Dataset Features Explanation


url The web page URL
rank group Grouped ranking position in search engine results
keyword Term associated with the search query
geography Geographic location relevant to the web page content

Table A.1: This table shows the URL, keyword, and geography, which together work as the unique identifier for each record. The rank group is the classification target: the ranking position, grouped, within 1 - 100 of its respective query group.

Dataset Features Explanation
h1 tags amount Number of H1 on the web page
h1 avg char count Average character count of H1
h1 keyword freq Frequency of keywords in H1
h1 geography freq Frequency of location terms in H1
h2 tags amount Number of H2 on the web page
h2 keyword freq Frequency of keywords in H2
h2 geography freq Frequency of location terms in H2
h3 tags amount Number of H3 on the web page
h3 keyword freq Frequency of keywords in H3
h3 geography freq Frequency of location terms in H3
h4 tags amount Number of H4 on the web page
p tags amount Number of P on the web page
a internal tags Number of internal links (anchor) on the web page
a external tags Number of external links (anchor) on the web page
img tags amount Number of img on the web page
meta tags amount Number of meta on the web page
title word count Word count of the web page title
description word count Word count of the web page description
title char count Character count of the web page title
title keyword freq Frequency of keywords in the web page title
title geography freq Frequency of location terms in the web page title
description char count Character count of the web page description
alt keyword freq Frequency of keywords in img alt attributes
alt geography freq Frequency of location terms in img alt attributes
a keyword freq Frequency of keywords in anchor
a geography freq Frequency of location terms in anchor
url keyword freq Frequency of keywords in the URL
url geography freq Frequency of location terms in the URL
p total char count Total character count in P
p total keyword freq Frequency of keywords in P
p total geography freq Frequency of location terms in P

Table A.2: These features are the original on-page values scraped from each web page.

Dataset Features Explanation
rel h1 tags amount Relative H1 amount
rel h1 avg char count Relative average character count of H1
rel h1 keyword freq Relative frequency of keywords in H1
rel h1 geography freq Relative frequency of location terms in H1
rel h2 tags amount Relative H2 amount
rel h2 keyword freq Relative frequency of keywords in H2
rel h2 geography freq Relative frequency of location terms in H2
rel h3 tags amount Relative H3 amount
rel h3 keyword freq Relative frequency of keywords in H3
rel h3 geography freq Relative frequency of location terms in H3
rel h4 tags amount Relative H4 amount
rel p tags amount Relative P amount
rel a internal tags Relative internal links amount
rel a external tags Relative external links amount
rel img tags amount Relative img amount
rel meta tags amount Relative meta amount
rel title word count Relative title word count
rel description word count Relative description word count
rel title char count Relative title character count
rel title keyword freq Relative frequency of keywords in title
rel title geography freq Relative frequency of location terms in title
rel description char count Relative description character count
rel alt keyword freq Relative frequency of keywords in img alt
rel alt geography freq Relative frequency of location terms in img alt
rel a keyword freq Relative frequency of keywords in anchor
rel a geography freq Relative frequency of location terms in anchor
rel url keyword freq Relative frequency of keywords in the URL
rel url geography freq Relative frequency of location terms in the URL
rel p total char count Relative total character count in P
rel p total keyword freq Relative frequency of keywords in P
rel p total geography freq Relative frequency of location terms in P

Table A.3: These features are based on the previous on-page values but relative
to the best-ranking page in its respective query group.
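
To make the construction of these relative features concrete, the following is a minimal sketch rather than the project's exact code; the ratio-based comparison and the helper name are assumptions (a difference-based variant would simply replace the division):

import pandas as pd

def add_relational_features(df, feature_cols):
    # Order pages so the best-ranking page comes first within each query group.
    df = df.sort_values("rank_group")
    # Look up each feature's value on the top-ranked page of the same
    # (keyword, geography) query group.
    top = df.groupby(["keyword", "geography"])[feature_cols].transform("first")
    for col in feature_cols:
        # Express each page's value relative to the group's top performer;
        # zeros are masked to avoid division by zero.
        df["rel_" + col] = df[col] / top[col].replace(0, pd.NA)
    return df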

Appendix B

Query Engineering

Table B.1: The table shows the city, country and language name for a specific
query. This information is a component of the query engineering.

City Country Language


Stockholm Sweden Swedish
Helsinki Finland Finnish
Copenhagen Denmark Danish
Reykjavik Iceland Icelandic
Tallinn Estonia Estonian
Riga Latvia Latvian
Vilnius Lithuania Lithuanian
London United Kingdom English
Paris France French
Brussels Belgium French
Amsterdam Netherlands Dutch
Dublin Ireland English
Berlin Germany German
Frankfurt Germany German
Rome Italy Italian
Milan Italy Italian
Madrid Spain Spanish
Barcelona Spain Spanish
Lisbon Portugal Portuguese
Athens Greece Greek
Istanbul Turkey Turkish
Valencia Spain Spanish
Moscow Russia Russian
Saint Petersburg Russia Russian
Kiev Ukraine Ukrainian
Bucharest Romania Romanian
Budapest Hungary Hungarian
Prague Czech Republic Czech
Warsaw Poland Polish
Sofia Bulgaria Bulgarian
Vienna Austria German
Zurich Switzerland German
Geneva Switzerland German
Munich Germany German
Hamburg Germany German
Krakow Poland Polish
Bratislava Slovakia Slovak
Ljubljana Slovenia Slovenian
Bergen Norway Norwegian
Gothenburg Sweden Swedish
Turku Finland Finnish
Aarhus Denmark Danish
Akureyri Iceland Icelandic
Tromsø Norway Norwegian
Uppsala Sweden Swedish
Oulu Finland Finnish
Birmingham United Kingdom English
Marseille France French
Antwerp Belgium French
Rotterdam Netherlands Dutch
Strasbourg France French
Edinburgh United Kingdom English
Düsseldorf Germany German
Cologne Germany German
Naples Italy Italian
Seville Spain Spanish
Porto Portugal Portuguese
Thessaloniki Greece Greek
Palermo Italy Italian
Valletta Malta Maltese
Zagreb Croatia Croatian
Ljubljana Slovenia Slovenian
Minsk Belarus Russian
Belgrade Serbia Serbian
Sarajevo Bosnia and Herzegovina Bosnian
Salzburg Austria German
Basel Switzerland German
Stuttgart Germany German
Nuremberg Germany German
Graz Austria German
Innsbruck Austria German
Poznan Poland Polish
Brno Czech Republic Czech
Trondheim Norway Norwegian
Malmö Sweden Swedish
Espoo Finland Finnish
Odense Denmark Danish
Keflavik Iceland Icelandic
Umeå Sweden Swedish
Linköping Sweden Swedish
Västerås Sweden Swedish
Glasgow United Kingdom English
Lyon France French
Ghent Belgium French
Utrecht Netherlands Dutch
Bordeaux France French
Leeds United Kingdom English
Hannover Germany German
Leipzig Germany German
Genoa Italy Italian
Bilbao Spain Spanish
Coimbra Portugal Portuguese
Patras Greece Greek
Bologna Italy Italian
Split Croatia Croatian
Maribor Slovenia Slovenian
Varna Bulgaria Bulgarian
Lviv Ukraine Ukrainian
Cluj-Napoca Romania Romanian
Novi Sad Serbia Serbian
Tirana Albania Albanian
Vilnius Lithuania Lithuanian
Rostov-on-Don Russia Russian
Odessa Ukraine Ukrainian
Linz Austria German
Lausanne Switzerland German
Bonn Germany German
Dresden Germany German
Szczecin Poland Polish
Ostrava Czech Republic Czech
Debrecen Hungary Hungarian
Timișoara Romania Romanian
Karlstad Sweden Swedish
Jönköping Sweden Swedish
Kuopio Finland Finnish
Aalborg Denmark Danish
Helsingborg Sweden Swedish
Bodo Norway Norwegian
Lahti Finland Finnish
Stavanger Norway Norwegian
Nottingham United Kingdom English
Toulouse France French
Liege Belgium French
The Hague Netherlands Dutch
Nice France French
Cardiff United Kingdom English
Bremen Germany German
Essen Germany German
Turin Italy Italian
Zaragoza Spain Spanish
Faro Portugal Portuguese
Heraklion Greece Greek
Venice Italy Italian
Rijeka Croatia Croatian
Piraeus Greece Greek
Constanța Romania Romanian
Kaunas Lithuania Lithuanian
Brno Czech Republic Czech
Plovdiv Bulgaria Bulgarian
Yekaterinburg Russia Russian
Kharkiv Ukraine Ukrainian
Nizhny Novgorod Russia Russian
Salzburg Austria German
Bern Switzerland German
Dortmund Germany German
Mainz Germany German
Lublin Poland Polish
Graz Austria German
Klagenfurt Austria German
Sibiu Romania Romanian
New York City New York English
Boston Massachusetts English
Philadelphia Pennsylvania English
Pittsburgh Pennsylvania English
Baltimore Maryland English
Washington D.C. District of Columbia English
Providence Rhode Island English
Buffalo New York English
Chicago Illinois English
Detroit Michigan English
Minneapolis Minnesota English
St. Louis Missouri English
Cleveland Ohio English
Indianapolis Indiana English
Milwaukee Wisconsin English
Columbus Ohio English
Atlanta Georgia English
Miami Florida English
New Orleans Louisiana English
Nashville Tennessee English
Charlotte North Carolina English
Austin Texas English
Houston Texas English
Dallas Texas English
Los Angeles California English
San Francisco California English
Seattle Washington English
Denver Colorado English
Las Vegas Nevada English
Phoenix Arizona English
Portland Oregon English
San Diego California English
Albuquerque New Mexico English
Tucson Arizona English
Oklahoma City Oklahoma English
El Paso Texas English
Santa Fe New Mexico English
Austin Texas English
San Antonio Texas English
Fort Worth Texas English
Albany New York English
Hartford Connecticut English
New Haven Connecticut English
Portland Maine English
Newark New Jersey English
Harrisburg Pennsylvania English
Syracuse New York English
Rochester New York English
Kansas City Missouri English
Omaha Nebraska English
Cincinnati Ohio English
Toledo Ohio English
Madison Wisconsin English
Des Moines Iowa English
Fargo North Dakota English
Sioux Falls South Dakota English
Orlando Florida English
Tampa Florida English
Memphis Tennessee English
Raleigh North Carolina English
Louisville Kentucky English
Birmingham Alabama English
Charleston South Carolina English
Richmond Virginia English
Sacramento California English
Honolulu Hawaii English
Anchorage Alaska English
Boise Idaho English
Reno Nevada English
Spokane Washington English
Salt Lake City Utah English
Santa Fe New Mexico English
Tulsa Oklahoma English
Corpus Christi Texas English
Lubbock Texas English
Amarillo Texas English
Little Rock Arkansas English
Baton Rouge Louisiana English
Shreveport Louisiana English
Fayetteville Arkansas English
Bridgeport Connecticut English
Worcester Massachusetts English
Cambridge Massachusetts English
Trenton New Jersey English
Scranton Pennsylvania English
Bangor Maine English
Burlington Vermont English
Stamford Connecticut English
Grand Rapids Michigan English
Springfield Illinois English
Wichita Kansas English
Lincoln Nebraska English
Fort Wayne Indiana English
Akron Ohio English
Duluth Minnesota English
Davenport Iowa English
Virginia Beach Virginia English
Savannah Georgia English
Lexington Kentucky English
Asheville North Carolina English
Jacksonville Florida English
Knoxville Tennessee English
Mobile Alabama English
Columbia South Carolina English
Fresno California English
Boulder Colorado English
Tucson Arizona English
Albuquerque New Mexico English
Bozeman Montana English
Cheyenne Wyoming English
Missoula Montana English
Eugene Oregon English
Midland Texas English
Galveston Texas English
Norman Oklahoma English
Wichita Falls Texas English
Flagstaff Arizona English
Roswell New Mexico English
Fort Smith Arkansas English
Lafayette Louisiana English
Allentown Pennsylvania English
Providence Rhode Island English
Manchester New Hampshire English
Portland Maine English
Newport Rhode Island English
Saratoga Springs New York English
Atlantic City New Jersey English
Erie Pennsylvania English
Rockford Illinois English
Cedar Rapids Iowa English
Springfield Missouri English
Rapid City South Dakota English
Green Bay Wisconsin English
Gary Indiana English
Lansing Michigan English
Bismarck North Dakota English
Jackson Mississippi English
Chattanooga Tennessee English
Wilmington North Carolina English
Greensboro North Carolina English
Tallahassee Florida English
Montgomery Alabama English
Greenville South Carolina English
Roanoke Virginia English
Tacoma Washington English
Bakersfield California English
Santa Cruz California English
Olympia Washington English
Santa Barbara California English
Medford Oregon English
Flagstaff Arizona English
Billings Montana English
Santa Rosa New Mexico English
El Reno Oklahoma English
McAllen Texas English
Odessa Texas English
Lubbock Texas English
Broken Arrow Oklahoma English
Lake Charles Louisiana English
Beaumont Texas English

Table B.2: This table shows the respective business terms that were used in the query engineering, in English. The terms were translated based on the language associated with the city each was combined with.

Business Terms
Hotel
Dentist
Car dealership
Fitness center
Fashion boutique
Coffee shop
Law firm
Museum
Construction company
Tech startup
Luxury spa
Bookstore
Seafood restaurant
University
Art gallery
Bakery
Real estate agency
Veterinary clinic
Organic grocery store
Advertising agency
Yoga studio
Brewery
Music school
Pet grooming service
Antique shop
Bicycle shop
Nightclub
Graphic design studio
Plant nursery
Children’s clothing store
Independent film theater
Specialty cheese shop
Golf course
Photography studio
Language learning center
Sushi bar
Architectural firm
Organic farm
Interior design studio
Vegan restaurant
Computer repair shop
Jazz club
Toy store
Kayak rental service
Custom tailoring service
Rock climbing gym
Used bookstore
Microbrewery
Wedding planning service
Escape room
Aerial yoga studio
Historic hotel
Gourmet chocolate shop
Independent record store
Scuba diving center
Botanical garden
Organic coffee roaster
Adventure travel agency
Renewable energy company
Handcrafted furniture store
Vegan bakery
Independent cinema
Craft pottery studio
Ethical fashion boutique
Specialty tea shop
Outdoor gear retailer
Local farmers market
Digital marketing agency
Contemporary art museum
Jazz bar
Board game cafe
Organic butcher
Co-working space
Specialty fish market
Boutique wine shop
Historic theatre
Luxury bed and breakfast
Custom jewelry designer
Independent video game store
Experimental dining restaurant
Vintage record shop
Specialty cocktail bar
Artisan bread bakery
Independent publishing house
Eco-friendly home goods store
Urban rooftop garden
Custom suit tailor
Rare book dealer
High-end audio equipment store
Theme cafe
Vintage clothing store
Independent animation studio
Exotic pet store
Specialty spice shop
Art restoration service
Sustainable fashion store
Craft beer pub
Herbal apothecary
Professional photography service
Ice cream parlor
Indie video game developer
Luxury yacht charter
Handmade ceramics studio
Specialty pizza restaurant
Independent film production company
Organic beauty salon
Artisan cheese maker
Vintage toy store
Boutique travel agency
Craft cocktail lounge
Independent comic book store
Local micro-distillery
Sustainable architecture firm
Artisanal chocolate maker
Boutique fitness studio
Specialty vegan grocer
Custom motorcycle workshop
Ethical clothing boutique
Independent music label
Gourmet burger joint
Independent perfumery
Local craft fair
Eco-friendly cleaning service
Artisanal pastry shop
Vintage furniture gallery
Specialty seafood market
High-end barber shop
Contemporary dance studio
Custom bicycle builder
Organic winery
Virtual reality arcade
Handmade soap shop
Boutique pet store
Independent animation film theater
Custom leather goods store
Artisan coffee roaster
Luxury bed and breakfast establishment
Ethnic cuisine cooking class
Boutique stationery store
Specialty sushi restaurant
Independent book publisher
Artisanal ice cream maker
Vintage camera shop
Handcrafted pottery class
Boutique bridal shop
Gourmet deli
Custom surfboard shaper
Organic farm-to-table restaurant
High-end audiovisual equipment store
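
Taken together, Tables B.1 and B.2 supply the raw material of the query engineering: each business term is translated into the language associated with a city and combined with that city into a search query. A rough sketch of this cross product follows; the translation lookup and the exact query format are illustrative assumptions:

locations = [
    ("Stockholm", "Sweden", "Swedish"),
    ("Berlin", "Germany", "German"),
]
business_terms = ["Hotel", "Dentist", "Coffee shop"]
# Illustrative entries only; a real lookup would cover every (term, language) pair.
translations = {("Coffee shop", "Swedish"): "Kafé",
                ("Coffee shop", "German"): "Café"}

queries = []
for city, country, language in locations:
    for term in business_terms:
        localized = translations.get((term, language), term)  # fall back to English
        queries.append(f"{localized} {city}")

print(queries[:3])  # e.g. ['Hotel Stockholm', 'Dentist Stockholm', 'Kafé Stockholm']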

Appendix C

The use of XAI for SEO


Classification

Research topics in Data Science — HT2023
Project Group 18 - The use of XAI for SEO classification

Name:
Anir Darrazi
Lai Ki Mak
Table of Contents
Abstract
Introduction
Purpose
Method
    A. Data
    B. Pre-Processing
    C. Learning Algorithms & Evaluation Measures
Results
Conclusions
References
Attachments
Abstract
Explainable AI (XAI) can be used to study search engine optimization (SEO) by investigating features of a web page that can affect the absolute rank of the Google search result. 24 features were studied with the Classification and Regression Tree algorithm, supported by the Python Scikit-learn library. The top five features (number of paragraphs, number of external tags, word count of the title, word count of the description, and page rank score) with the highest Information Gain were found to be the most influential features. Threshold ranges were also studied with a brute force algorithm to manually construct a decision tree over these features; the resulting model reached an accuracy of 23%. It is suggested that features of the web page structure and network can play a role in SEO results, while other factors may be involved as well.

Introduction
Explainable AI (XAI) can allow human beings to comprehend how machine learning algorithms create their results (Turri V., 2023). This can be used to explain the algorithms and the reasons behind the computed result. On the other hand, search engine optimization (SEO) is the process or method to increase the chance for a web page to be prioritized and appear on the first page of search results (Sharma D. et al., 2019). In order to achieve favorable marketing results, it is considered that the web page should be shown on the first page of the search results as much as possible. Search engines can retrieve thousands of related results; page rank is required to rank the results and display them in a ranked order (Bar-Ilan J., 2007). It is beneficial for users or marketing companies to have control over SEO (Sharma D. et al., 2019). The more control of SEO, the higher the chance that the web page can have a higher ranking. However, the conditions and the algorithms behind SEO are under the direct control of the developers of the SEO companies (Sharma D. et al., 2019). With multiple contents in a web page, such as page title, text, images, and links, there is always a challenge to understand the criteria and conditions for SEO. The aim of this project is to find out the conditions and the part(s) of content on a web page which can affect the SEO result with XAI. Decision trees and Information Gain (IG) techniques are used to find the correlation between the different conditions (such as number of tags, word counts) and the page rank result.

According to statista.com, Google was the most popular search engine in Sweden in April 2023 with a market share of 93.94 percent (Bianchi T., 2023). In order to achieve a better representation of the results in Sweden, the SEO algorithm behind Google is investigated in this project. In other words, the content of web pages from Google search results is analysed and investigated to understand the SEO algorithm behind the Google search engine.

Purpose
The aim of this project was to study the features of web pages that can affect the search result of the Google search engine. Comparisons of the different features of web pages were conducted to understand the correlation between the features and the ranking of web pages in the search result. Values of all essential features were investigated with machine learning methods to find the top five features which can affect the search result the most. Information Gain (IG) was calculated and a decision tree was built to understand which feature(s) can affect the search results.
Method

A. Data
The data was collected specifically for this project. Web document data was collected using a web scraper script and an API from DataForSEO, merging data from the two data sources into one dataset. Initially, the specific data required was identified, focusing on the web document structure (such as HTML structures) but also networking data such as IP addresses, spam score, and backlinks. See Table 1 for all the features used in this project.

The script was made in the programming language Python, integrating the web scraping library BeautifulSoup and the API from DataForSEO, creating the dataset used for this project. This script was designed to send requests to target websites in the top 45 search results at Google for different search queries. The data collected used each web document (or so-called web page) as a record, with pages about a random topic, which in this case was construction/renovation in Sweden. The choice of subject for the queries is a delimitation of the research because of time and resource constraints; the query subject was not considered important enough for the research project to warrant more diverse subjects and queries.

The data collected amounted to around 4500 records due to funding limitations: the project could not use more than 100 dollars worth of credits at the DataForSEO API, which stunted the final dataset size. Also, due to missing values caused by the web scraper, the final dataset ended up being 2246 records of web documents.

Finally, the dataset was formatted and stored in a comma separated structure (CSV file) to later be used in the pre-processing.

B. Pre-Processing
Firstly, cleaning the value types involves transforming data from various formats (like strings and booleans) into a uniform numeric format. This step is crucial because numeric data is more suitable for algorithms in machine learning models. The process involved converting categorical strings to numerical codes and mapping boolean values to binary numbers.

Then the data was standardized, meaning it was scaled and centered around the mean with a unit standard deviation. The mean of the attribute becomes zero, and the resultant distribution has a unit standard deviation. This technique is vital in scenarios where the data attributes have different units or vary in scale, as it brings all variables to a similar scale, thus avoiding any bias during the modeling process.

Later on, the finding and dropping of correlating columns was done to remove unnecessary redundancy and reduce the dimensions of the dataset. In the datasets, some columns were highly correlated, with over 80% correlation. See the included Model Training Jupyter Notebook for which exact columns were dropped. These correlations can lead to multicollinearity, which might skew the results of certain models (Analytics Vidhya, 2023). The typical approach is to identify these correlating columns and drop the redundant ones, keeping only one representative column. This reduces the complexity of the model and improves its interpretability, which is important for an XAI approach (R. Confalonieri et al., 2021).

Lastly, the classifier was binned into different groupings. This involves categorizing the outputs of the classifier "Absolute Rank" into distinct buckets or bins; in this case, dividing the classifier results into six numbered categories (1-6), with category 1 being the top 5 ranked web documents and category 6 being the top 41 - 45. This technique was deployed to simplify the results, making them easier to interpret and analyze (this is also part of the XAI approach, compressing the information to make it more easily understood).

These steps are integral to ensuring that the data fed into the machine learning models is clean, standardized, and devoid of redundant information, which in turn can significantly improve the performance and interpretability of the models.
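The pre-processing steps above can be condensed into a short sketch. This is an illustrative reconstruction rather than the notebook's exact code; the column name absolute_rank and the bin edges are assumptions (the edges shown merely agree with category 1 covering ranks 1-5, category 6 covering ranks 41-45, and the three largest groups each holding 22.2% of the ranks, as reported in the Results):

import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

def preprocess(df):
    # Bin the 1-45 "Absolute Rank" into six categories (1 = ranks 1-5, 6 = ranks 41-45).
    y = pd.cut(df["absolute_rank"], bins=[0, 5, 10, 20, 30, 40, 45],
               labels=[1, 2, 3, 4, 5, 6])
    # Coerce every column to numeric; zero-fill is a simple illustrative imputation.
    X = (df.drop(columns=["absolute_rank"])
           .apply(pd.to_numeric, errors="coerce").fillna(0))
    # Standardize to zero mean and unit standard deviation.
    X = pd.DataFrame(StandardScaler().fit_transform(X), columns=X.columns)
    # Drop one column from every pair correlated above 0.8.
    corr = X.corr().abs()
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    to_drop = [c for c in upper.columns if (upper[c] > 0.8).any()]
    return X.drop(columns=to_drop), y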

C. Learning Algorithms & Evaluation Measures
The Python package Scikit-learn was used for the sake of a good comparison and easy access to the features' Information Gain (see Table 1 for the Information Gain). More specifically, the decision tree model was used and trained. See the included Model Training Jupyter Notebook for how it was done in detail. Scikit-learn uses the Classification and Regression Trees (CART) algorithm for its decision tree models (Scikit-learn developers, 2024). The algorithm is greedy in nature, meaning it searches for the best split at the current node without considering the future impact of this decision. This can lead to locally optimal but globally suboptimal trees. Due to it being greedy, the computing process is efficient enough to handle all 24 features in sufficient time.

Lastly, the accuracy was calculated for the particular trained model using the function accuracy_score from the sklearn.metrics module.

Variable / Information Gain
h1_tags_amount 0.011904063073398193
h2_tags_amount 0.05705711647293863
h3_tags_amount 0.04369466952812787
h4_tags_amount 0.02685934717888355
h5_tags_amount 0.011445050797226264
h6_tags_amount 0.005819072737295278
p_tags_amount 0.07221529805998796 *
a_internal_tags 0.07103395270632457
a_external_tags 0.09305458981770696 *
img_tags_amount 0.05933279026172075
video_tags_amount 0.0
meta_tags_amount 0.042998843439199984
meta_charset 0.003498375601807558
meta_description 0.0
title_word_count 0.07164254119122739 *
is_image 0.00919502513852775
website_name_word_count 0.013022322140216135
description_word_count 0.1396427263025681 *
pagerank 0.07925636942651178 *
first_seen 0.022453049478635422
crawled_pages 0.05403913635065908
referring_domains 0.03807518506378987
backlinks_spam_score 0.04751872578871184
target_spam_score 0.026241749444535004

Table 1: Showing the features used, the respective features' Information Gain, and the five most important features (marked in green in the original; marked here with an asterisk).

Regarding the manually designed decision tree with an XAI approach, a completely different method was used. The model was designed with if-else logic, making it easy to hardcode a reasoning output for each splitting node. Being an XAI approach places a criterion on the model design: it must be human understandable. Thus, the features used were reduced to the top five most important features based on the Information Gain; see the marked rows in Table 1 for the features used. The reduction of features makes the decision tree much more readable and understandable but risks oversimplifying and underfitting.

The training variables were thresholds between the min and max value of each feature used; see Table 3 for the min-max values of the top 5 features. The thresholds were trained using a brute force algorithm to compensate for the reduced number of features used in the manually designed decision tree model. The training used the dataset to find the most optimal threshold variables amongst those tested. See the code snippet from the Model Training Jupyter Notebook below and Table 2 for the exact iterations per threshold variable used:

threshold1_range = np.arange(-0.6, 20, 0.5)
threshold2_range = np.arange(-4.2, 2.8, 0.4)
threshold3_range = np.arange(-2, 2.8, 0.4)
threshold4_range = np.arange(-0.7, 12, 0.3)
threshold5_range = np.arange(-1.2, 3.4, 0.2)

Threshold1 41.2
Threshold2 17.5
Threshold3 12.0
Threshold4 42.3
Threshold5 23.0

Table 2: Showing the respective variable iterations, i.e. the number of candidate values tested per threshold.

Variable / Min-value / Max-value
p_tags_amount -0.758726 12.772177
a_external_tags -0.636928 20.480658
title_word_count -2.125608 2.927189
description_word_count -4.328574 2.929587
pagerank -1.246474 3.599432

Table 3: Showing the top 5 most important features' min-max values.

Brute force training involves exhaustively searching through a problem space to find a solution. Because brute force algorithms are so computationally expensive, there was not enough time within the scope of this project to further develop the manually designed decision tree with this training approach. The computation time grows exponentially with each added threshold variable, given a mean number of test iterations per variable:

M^V = I

where M is the mean number of iterations per variable, V is the number of variables, and I is the total number of iterations. For instance, five thresholds with roughly 25 candidate values each already require about 25^5, approximately 9.8 million, combinations. See Table 4 for the variable/iterations (time) development for 5 - 7 variables.

Similar to the scikit-learn model, the manually designed decision tree's performance is evaluated using accuracy as a metric.

Table 4: Showing the exponential increase in computation iterations for the brute force training for 5 - 7 variables.
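
For context, the exhaustive search that the snippet above feeds into could be wired up as below. This is a sketch under explicit assumptions: manual_tree is a stand-in for the hardcoded if-else tree (whose actual branching logic lives in the notebook and is not reproduced here), and the DataFrame is assumed to hold the standardized top-5 features plus a rank_bin label:

import itertools
import numpy as np
from sklearn.metrics import accuracy_score

def manual_tree(row, t1, t2, t3, t4, t5):
    # Placeholder branching over the top-5 features; the real tree differs.
    if row["description_word_count"] > t1:
        return 1 if row["pagerank"] > t2 else 2
    if row["p_tags_amount"] > t3:
        return 3 if row["a_external_tags"] > t4 else 4
    return 5 if row["title_word_count"] > t5 else 6

ranges = [np.arange(-0.6, 20, 0.5), np.arange(-4.2, 2.8, 0.4),
          np.arange(-2, 2.8, 0.4), np.arange(-0.7, 12, 0.3),
          np.arange(-1.2, 3.4, 0.2)]

def brute_force_search(df):
    # Roughly 8.4 million combinations in total (41.2 x 17.5 x 12.0 x 42.3 x 23.0),
    # which is why the text above calls this approach computationally expensive.
    best_acc, best_thresholds = 0.0, None
    for thresholds in itertools.product(*ranges):
        preds = [manual_tree(row, *thresholds) for _, row in df.iterrows()]
        acc = accuracy_score(df["rank_bin"], preds)
        if acc > best_acc:
            best_acc, best_thresholds = acc, thresholds
    return best_acc, best_thresholds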

Results
The study's evaluation of the two decision tree models, the Scikit-learn decision tree and a manually created XAI decision tree, yielded noteworthy findings. The Scikit-learn model, employing a robust 24-feature set, attained an accuracy of 29% (0.2851851851851852). This contrasts with the manually developed XAI model, which, despite being trained on a limited 5-feature dataset, achieved a close accuracy of 23% (0.2319679430097952). This surprising similarity in performance, despite the significant difference in the number of features analyzed, underscores a critical insight: the selection and optimization of features play a crucial role in model efficacy. The results suggest both models have considerable potential for improvement. Enhancing their ability to analyze web page features effectively for SEO optimization remains a key area for future development.

It is also worth expanding upon the performance differences between the XAI model and the scikit-learn model. Despite the XAI model's lower accuracy compared to the scikit-learn model, it offers greater explainability, especially in how it delineates each decision at the decision tree nodes. This feature of the XAI model is crucial, particularly in contexts where understanding the model's decision-making process is more important than the accuracy of its predictions.

Moreover, it is important to highlight that both models perform above a random chance level. This is evidenced by the fact that the largest classification groups in the dataset (comprising 22.2% each) are 3, 4, and 5. If one were to predict the same class (out of those three) each time, the accuracy would be 22.2%, and random prediction would likely be lower since the classification groups are not evenly sized. However, both models surpass this threshold, indicating their effectiveness beyond mere chance.

Finally, it is worth noting that all features demonstrated low Information Gain. This suggests that while the collected data was relevant, no single feature was significantly impactful in enhancing the classifier's accuracy beyond what might be expected by chance. This finding should be explored further to understand how feature engineering and the general quality of the dataset (both feature relevance and dataset size) align with the overall performance and characteristics of the models.

Conclusions
XAI was used to investigate how features of a web page can affect SEO. In total, 24 features of a web page were studied and their respective Information Gain values were calculated using the machine learning algorithms specified in the method section above. Assuming that "Absolute Rank" is the SEO result which specifies the sequence of the web page appearing in the Google search result, the top five features of the web pages that can affect the SEO result were "p_tags_amount" (the amount of p tags in the web page), "a_external_tags" (the amount of external tags), "title_word_count" (the word count of the title), "description_word_count" (the word count of the description), and "pagerank" (the score calculated by Google based on the quality and quantity of external links to the web page). This suggested that these were the most important five features to optimize for getting a higher "Absolute Rank" in SEO. Among the top five features, "description_word_count" is the most important based on the Classification and Regression Trees model. A brute force algorithm was also used to manually classify the features based on Information Gain. The ranking between these five features can vary based on the initial random seeds, and the accuracy is around 25% - 30%. With all extracted features being studied in this project, this result suggested that there can be other factors outside of these features, or that the correlation of the features and "Absolute Rank" is not always direct. In addition, with the delimitation set in the project, the scope of data studied can affect the result of the machine learning algorithm. In future study, more data and investigation besides Information Gain can be studied to further analyse the factors and the algorithm affecting the SEO.
References
Turri, V. (2022, January 17). What is Explainable AI?. Retrieved December 19, 2023, from
https://insights.sei.cmu.edu/blog/what-is-explainable-ai/.

Bianchi, T. (2023, June 28). Market share of leading search engines in Sweden in April 2023.
Retrieved December 19, 2023 from
https://www.statista.com/statistics/621418/most-popular-search-engines-in-sweden/

Sharma, D., Shukla, R., Giri, K. A., Kumar, S. (2019). A Brief Review on Search Engine
Optimization. 9th International Conference on Cloud Computing, Data Science &
Engineering (Confluence)

Scikit-learn developers. (2024). Decision Trees. Retrieved from


https://scikit-learn.org/stable/modules/tree.html

Analytics Vidhya. (2023). What is Multicollinearity? Retrieved from


https://www.analyticsvidhya.com/blog/2020/03/what-is-multicollinearity/

R. Confalonieri et al., A historical perspective of explainable artificial intelligence, 2021,


WIREs Data Min. Knowl. Discov. 11, e1391.

Attachments
Model Training Jupyter Notebook. Find it at the following link:
https://drive.google.com/file/d/1AAX2yOJwnBsZadxllGZxWGPZlGAsfWoa/view?usp=sharing

Web Scraper & DataForSEO API Jupyter Notebook. Find it at the following link:
https://drive.google.com/file/d/1we6qotBn3TW5x8glFWVHjgcA7haUx_3r/view?usp=sharing
Appendix D

Code Repository

The GitHub repository ”Pushing-the-Boundaries-of-Digital-Marketing-with-SEO-Modeling” hosts a collection of scripts and Jupyter notebooks. This repository is crucial for researchers and practitioners looking to reproduce or build upon the original research project. Key components include classifier_models.ipynb, which contains the machine learning models for classifying data; query_generator.ipynb for creating search queries; serp_api.ipynb, which interfaces with search engine result pages through the DataForSEO SERP API; url_mining.ipynb for extracting URLs and classifiers from SERP results; and web_scraper.py, a Python script for scraping on-page data from the websites. Sensitive information and private credentials have been removed to protect privacy while still making the research reproducible. This approach supports open science by allowing others to verify results and contribute to the field.

Access the repository: https://github.com/anirdarrazi/Pushing-the-Boundaries-of-Digital-Marketing-with-SEO-Modeling

Software Heritage Archive repository backup: https://archive.softwareheritage.org/browse/origin/directory/?origin_url=https://github.com/anirdarrazi/Pushing-the-Boundaries-of-Digital-Marketing-with-SEO-Modeling
