Anir Darrazi
Department of Computer
and Systems Sciences
Degree project 30 credits
Computer and Systems Sciences
Degree project at the master level (300 credits)
Spring term 2024
Supervisor: Mateus de Oliveira Oliveira
Swedish title: Pusha gränserna inom digital
marknadsföring med SEO-modellering
DVK Uppsats Program
This thesis was written within the Spring 2024 edition of the DVK Uppsats
Program.
Coordination
Organization
Henrik Bergström
Peter Idestam-Almquist
Mateus de Oliveira Oliveira
Beatrice Åkerblom
Abstract
In addition to technical advancements, the study also explores the ethical im-
plications of web scraping and the transparency required in manipulating SEO
algorithms. By systematically analyzing the performance of enhanced models
on varied datasets, this work reveals critical insights into the underlying mechanisms of search engines and the factors influencing web page visibility. The
thesis argues that a deeper understanding of these factors, supported by robust
empirical data, can drive more targeted and effective SEO practices.
Overall, the research contributes to both academic literature and practical ap-
plications in digital marketing, offering a framework for developing more so-
phisticated SEO tools that can adapt to the ever-changing digital landscape.
It opens up new avenues for future research, particularly in the exploration of
off-page SEO factors and the integration of natural language processing to au-
tomate and optimize content creation for better search engine performance.
Background
In the digital age, where search engines are vital for navigating online informa-
tion, this thesis researches the nuances of SEO, examining on-page elements and
their accessibility for optimization, alongside exploring machine learning tech-
niques like XGBoost and CatBoost for understanding SEO rankings, thereby
unraveling the complex interplay between search engine algorithms, SEO strate-
gies, and advanced computational methods in the competitive digital landscape.
Problem
While SEO-modeling has progressed with classification techniques achieving up
to 70% accuracy, studies like Matošević et al. (2021) face scalability and robust-
ness challenges due to limited dataset sizes, necessitating a broader approach
that enhances model accuracy and explores the wider implications for business
and academia in the ever-evolving digital landscape.
Research Question
How can the accuracy of SEO classification models be improved through the
use of large-scale, relationally feature-engineered datasets?
Method
This thesis adopts an empirical, quantitative strategy, using data science methods to experiment with enhancing SEO classification models through feature engineering and large-scale dataset analysis, integrating machine learning to identify impactful features.
Result
The results compare different classification models and datasets. The findings
suggest improvements in model accuracy by using a pre-processed relational
dataset, contributing to both academic knowledge and practical SEO strategies.
Discussion
The novel approach of integrating relational data demonstrated a modest but important improvement in model accuracy, underscoring the potential for nuanced strategies that consider the competitive dynamics of digital marketing.
Acknowledgement
List of Listings x
List of Tables xi
1 Introduction 1
1.1 SEO-modeling . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
1.2 Problem . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.3 Research Question . . . . . . . . . . . . . . . . . . . . . . . . . . 4
1.4 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
1.5 Results and Considerations . . . . . . . . . . . . . . . . . . . . . 5
3 Methodology 16
3.1 Research Strategy . . . . . . . . . . . . . . . . . . . . . . . . . . 16
3.2 Research Framework Overview . . . . . . . . . . . . . . . . . . . 17
3.2.1 Data Collection . . . . . . . . . . . . . . . . . . . . . . . . 17
3.2.2 Data Preprocessing . . . . . . . . . . . . . . . . . . . . . . 17
3.2.3 Feature Extraction . . . . . . . . . . . . . . . . . . . . . . 17
3.2.4 SEO Classification . . . . . . . . . . . . . . . . . . . . . . 17
3.2.5 Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . 17
3.3 Software Use and Implementation . . . . . . . . . . . . . . . . . . 18
3.3.1 ChatGPT 3.5 API for Query Generation . . . . . . . . . . 18
3.3.2 DataForSEO Google SERP API . . . . . . . . . . . . . . 18
3.3.3 Webshare Rotating Proxy . . . . . . . . . . . . . . . . . . 18
3.3.4 Python for Machine Learning . . . . . . . . . . . . . . . . 19
3.3.5 Multiprocessing for Parallel Processing . . . . . . . . . . . 19
3.4 Dataset Compilation . . . . . . . . . . . . . . . . . . . . . . . . . 19
3.4.1 Sourcing Primary Data . . . . . . . . . . . . . . . . . . . 20
3.4.2 Feature Selection . . . . . . . . . . . . . . . . . . . . . . . 20
3.5 Data analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
3.5.1 Analysis of HTML Structure Data . . . . . . . . . . . . . 22
3.5.2 Using SERP Data as a Classifier . . . . . . . . . . . . . . 23
3.5.3 Relational Expansions . . . . . . . . . . . . . . . . . . . . 23
3.5.4 Frequency Analysis of Query-Associated Terms . . . . . . 24
3.5.5 Feature Engineering Phase . . . . . . . . . . . . . . . . . 24
3.5.6 Training the XGBoost & CatBoost Model . . . . . . . . . 24
3.5.7 Evaluation Metric . . . . . . . . . . . . . . . . . . . . . . 25
3.5.8 Addressing Fitting Issues . . . . . . . . . . . . . . . . . . 25
3.5.9 Addressing Biases . . . . . . . . . . . . . . . . . . . . . . 26
3.6 Alternative Research Strategies . . . . . . . . . . . . . . . . . . . 26
3.7 Validity and Reliability . . . . . . . . . . . . . . . . . . . . . . . 27
3.8 Ethical Considerations . . . . . . . . . . . . . . . . . . . . . . . . 27
4 Main Results 29
4.1 Technical Development . . . . . . . . . . . . . . . . . . . . . . . . 29
4.2 Model Performance Accuracy . . . . . . . . . . . . . . . . . . . . 37
4.3 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39
5 Conclusion 40
5.1 Comparison with Existing Work . . . . . . . . . . . . . . . . . . 41
5.2 Delimitations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42
5.3 Ethical Implications . . . . . . . . . . . . . . . . . . . . . . . . . 43
5.4 The use of AI Tools . . . . . . . . . . . . . . . . . . . . . . . . . 44
5.5 Considerations for Future Work . . . . . . . . . . . . . . . . . . . 44
Bibliography 49
Appendices 50
A Full Feature Set 50
B Query Engineering 53
D Code Repository 75
Listings
List of Tables
3.1 This table shows the URL, keyword, and geography, which to-
gether act as the unique identifier for each record. The rank
group is the classifier, ranked within 1 - 100 in its respective query
group. See Appendix A for the complete feature set. . . . . . . . 22
4.1 The table shows the population percentile of each classifier from
the dataset. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37
4.2 The table shows the final accuracy of the trained models based on
which dataset was used. . . . . . . . . . . . . . . . . . . . . . . . 38
A.1 This table shows the URL, keyword, and geography, which to-
gether act as the unique identifier for each record. The rank
group is the classifier, ranked within 1 - 100 in its respective query
group. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50
A.2 These features are based on the previous on-page values but rel-
ative to the best-ranking page in its respective query group. . . . 51
A.3 These features are based on the previous on-page values but rel-
ative to the best-ranking page in its respective query group. . . . 52
B.1 The table shows the city, country and language name for a specific
query. This information is a component of the query engineering. 53
B.2 This table shows the respective business terms that were used in
the query engineering in English. The terms were translated based
on the language related to the city they were combined with. . . 60
List of Abbreviations
Chapter 1
Introduction
In the ever-evolving digital landscape, organic digital marketing has become
important for businesses seeking to establish and maintain a robust online pres-
ence. Search Engine Optimization (SEO) plays a pivotal role as a pillar of this kind of digital marketing. This critical tool not only navigates the complexities of Search Engine algorithms but also enhances online visibility (Bhandari
& Bansal 2018). This thesis explores the mechanisms of digital marketing, fo-
cusing on SEO, highlighting its transformation into an indispensable element of
modern marketing strategies.
This thesis attempts to better understand the complex interplay be-
tween digital marketing and SEO, underscoring the inherently technical and
pattern based nature of SEO. It is this technical foundation that lends itself to
the application of machine learning, particularly through the use of classification models. By conducting a thorough analysis of classification models for SEO
and their accuracy on web page ranking, this study provides valuable insights
into how businesses can harness the power of machine learning to optimize their
SEO strategies. The focus is on understanding SEO not merely as a tactic for
enhancing search rankings, but as a field ready for the application of machine
learning techniques, which can significantly refine and automate the process of
optimizing web pages.
1.1 SEO-modeling
The term SEO-modeling is introduced in this thesis based on the concept of
simulating Search Engine ranking and the application of classification models
in predicting page rankings. The term is heavily inspired by Matošević et al.
(2021) and their work with SEO classification. At the core of the term, we
have SEO which involves optimizing web pages to rank as high as possible in
the Search Engine results pages (SERPs), a crucial factor in driving web traffic.
The evolution of SEO has seen it transform from simple keyword stuffing to so-
phisticated practices that align with complex Search Engine algorithms. These
algorithms, employed by Search Engines like Google and Bing, are designed to
rank pages based on relevance and authority, among other factors. The precise
workings of these algorithms are closely guarded secrets (Bhandari & Bansal
2018). However, SEO researchers have developed methods to approximate how these algorithms operate with the help of numerous different models, hence the relevance of this term.
This thesis mainly focuses on the role of classification models in SEO. Classifi-
cation models are supervised machine learning algorithms that can categorize data into different classes. In the context of SEO, these models are trained on
datasets containing website features and their corresponding rankings to predict
the ranking of new or altered web pages. Features considered in these models
can include technical aspects of a website (like page loading speed and mobile
friendliness), content related attributes (such as keyword density and topical
relevance), and external factors (like backlink profiles) (Rogers 2002).
By applying these models, businesses and SEO professionals can gain power-
ful insights into how web page design and content changes might affect their
Search Engine rankings. This predictive capability is a game changer in formulating effective SEO strategies. It enables a data-driven approach to SEO,
where decisions are based on empirical evidence rather than expert intuition.
1.2 Problem
While SEO-modeling, primarily through classification models, has made signif-
icant strides in recent years, it is essential to acknowledge its limitations. The
existing literature showcases the potential of classification techniques to mimic
Search Engine algorithms, with predictive accuracy up to 70% (Matošević et al.
2021). However, these models still face challenges in scalability and robustness, mainly due to the constraints imposed by the size of the datasets used in these
studies.
Previous studies in this area, like Matošević et al. (2021) with 600 records and Banaei & Honarvar (2017) with 7 400 records, relied on relatively small datasets, which may not fully capture the complexity and variability of factors that influence Search Engine rankings in real-world scenarios. Search Engines like Google consider at least hundreds of factors in their algorithms, and the interactions between these factors can be intricate and dynamic (Balabantaray 2017). SEO-modeling datasets may need more diversity and depth to model these interactions accurately. This leads to a potential misalignment between predicted and actual rankings, or, even worse, dataset bias, which is mainly noticed when these models are applied to a broader range of websites and market sectors on which they were not trained. Furthermore, it is crucial to
recognize the untapped societal and academic benefits that could arise from
enhancing SEO-modeling techniques. The potential for improved accuracy and
reliability in SEO classification models is significant. For businesses and web
content creators, these advancements could translate into more strategic web
optimization, enhancing visibility and competitiveness in the digital market-
place.
1.3 Research Question
Given the identified gap in the literature regarding the scalability and robust-
ness of SEO classification models, the primary research question this thesis seeks
to address is:

How can the accuracy of SEO classification models be improved through the use of large-scale, relationally feature-engineered datasets?
The research question arises from the need to enhance the current understand-
ing and effectiveness of SEO-modeling in predicting web page rankings. The
goal is to overcome the limitations of small datasets and to examine how more
extensive and varied data can refine the predictive capabilities of these models.
To address this research question, the study will employ a methodology rooted
in data science. The core of this approach involves the collection of a large-scale dataset comprising 1 653 946 records. This dataset will include web page
features extracted through web scraping HTML and SERP (Search Engine Re-
sults Page) rank data, which will serve as the classifier.
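As a rough illustration, a single record in such a dataset might be represented as the following sketch; the concrete values and the helper function are hypothetical, while the identifier fields and the rank group classifier follow the description above and the table captions in Appendix A.

```python
# Hypothetical sketch of one dataset record; field values are invented.
record = {
    # (url, keyword, geography) together act as the unique identifier
    "url": "https://example.com/plumber-stockholm",
    "keyword": "plumber",
    "geography": "Stockholm",
    # On-page features scraped from the page's HTML
    "h1_tags_amount": 1,
    "title_keyword_freq": 2,
    "h1_avg_char_count": 34.0,
    # SERP rank within the query group (1-100), used as the classifier
    "rank_group": 12,
}

def split_record(record):
    """Separate the identifier, the feature vector, and the label."""
    identifier = (record["url"], record["keyword"], record["geography"])
    label = record["rank_group"]
    features = {k: v for k, v in record.items()
                if k not in ("url", "keyword", "geography", "rank_group")}
    return identifier, features, label
```

The separation mirrors how the models later consume the data: the identifier is kept aside, the on-page features form the input, and the rank group is the target class.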
1.4 Related Work
A foundational aspect of our work is the application of machine learning algo-
rithms to SEO challenges. Studies such as Matošević et al. (2021) and Banaei
& Honarvar (2017) have similarly employed machine learning techniques like
decision trees and neural networks to predict website rankings. Our research
builds upon this foundation, utilizing more advanced algorithms like XGBoost
and CatBoost, which are particularly adept at handling large, complex datasets.
Our emphasis on large scale, diverse datasets reflects an emerging trend in the
field, as demonstrated by the work of researchers like Witten et al. (2016). They
highlight the importance of data mining and large datasets in uncovering hidden
patterns and insights. We extend this concept by incorporating relational data,
which allows for a comparative analysis of web pages in relation to top perform-
ing sites. The use of relational values is a distinct aspect of our approach. This
methodology aligns with the scientific principles of contextual analysis, where
data is understood not just in isolation but in relation to a broader set of fac-
tors. While less explored in existing SEO studies, this approach is grounded in
scientific practices that consider data within its wider context and environment.
Looking ahead, the results from this study open exciting avenues for future
research and practical applications. The improved accuracy of SEO models
sets the stage for more advanced digital marketing strategies, enabling busi-
nesses and content creators to enhance their online visibility more effectively.
The integration of relational values in SEO-modeling presents an opportunity
to develop more sophisticated tools that can better navigate the complexities of
Search Engine algorithms.
Chapter 2
Extended Background:
Components of SEO
Classification
2.1 Overview
The digital age has transformed how information is accessed and consumed,
making search engines crucial tools in navigating the vast array of online data.
These engines perform essential functions such as crawling, indexing, and ranking, using various algorithms to ensure users receive
relevant results for their queries (Rogers 2002). SEO emerges as a pivotal strat-
egy in this context, aiming to increase a website’s visibility and organic traffic.
This thesis explores the multifaceted aspects of SEO, including on-page SEO,
which focuses on content quality and HTML optimization, and off-page SEO,
which deals with backlinks and external factors. Technical SEO also plays a
role in improving the backend aspects of a website to enhance its search engine
ranking. In particular, the thesis examines the influence of on-page factors like
high-quality content, efficient HTML tags, and user experience in determining
a website’s search engine ranking, due to their accessibility for webmasters. The
thesis also ventures further into machine learning, particularly the classification
tasks within SEO, employing algorithms such as XGBoost and CatBoost. XG-
Boost, known for its efficiency and scalability in handling large datasets, has
been instrumental in predicting the ranking potential of web pages in the research project of Matošević et al. (2021). Similarly, CatBoost’s unique approach
to handling categorical data has made it a valuable tool for handling diverse features (Hancock & Khoshgoftaar 2020). This comprehensive study aims to
shed light on the dynamic interplay between search engine algorithms, SEO
strategies, and the latest advancements in machine learning, offering insights
into how these elements collectively influence the understanding of SEO ranking and visibility of web pages in an increasingly competitive digital landscape.
Crawling is the process by which search engines use bots (spiders or crawlers)
to systematically browse the web and collect information from every accessible
webpage. The data collected from these pages is then indexed, stored, and orga-
nized to make it easily retrievable. When the user inputs a query into a search
engine, the search engine processes it by comparing the search terms against
the pages indexed in its database. It looks for relevant matches, considering
keyword density, site speed, links, etc. (Hendry & Efthimiadis 2008). Search en-
gines employ intricate algorithms to determine the ranking of the indexed pages
in order of relevance. These algorithms consider hundreds of ranking factors
from the index and other factors, such as the quality of content, user engage-
ment, page speed, backlinks, and many others (Ravi et al. 2012), which will be
discussed in more detail later. The specific factors and the weight each carries
vary between search engines and evolve as they are updated. The final output
is a list of web pages, typically displayed in order of relevance. This is what the
user sees after submitting a search query. The results are intended to be the most relevant and valuable with respect to the user’s search terms.
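The crawl, index, and query-matching pipeline described above can be sketched with a toy inverted index. The pages and their contents are invented examples, and a real search engine applies hundreds of ranking factors on top of this simple term-matching step.

```python
from collections import defaultdict

# Page contents as a crawler might collect them (assumed examples).
pages = {
    "a.com": "fast seo tips for small business",
    "b.com": "seo ranking factors explained",
    "c.com": "cooking recipes for busy evenings",
}

# Indexing: build an inverted index mapping each term to its pages.
index = defaultdict(set)
for url, text in pages.items():
    for term in text.split():
        index[term].add(url)

def search(query):
    """Return pages containing every query term (no ranking applied)."""
    results = None
    for term in query.lower().split():
        matches = index.get(term, set())
        results = matches if results is None else results & matches
    return sorted(results or set())
```

A query such as "seo ranking" is resolved by intersecting the index entries for each term, which is the retrieval step that the ranking algorithms discussed above then order by relevance.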
At the heart of SEO lies identifying and utilizing the right keywords. These
specific words and expressions are what potential visitors enter into search engines as queries. Incorporating relevant keywords strategically in
website content, titles, and meta tags is fundamental for improving a page’s
visibility in search results. However, SEO is not just about keyword stuffing;
the caliber of the content is just as vital. Search engines prioritize informative,
relevant, and engaging content for the user, encompassing various forms of me-
dia such as text, images, videos, and other interactive functionalities (Sharma
et al. 2019).
Grasping how search engine algorithms work is essential for effective SEO. These
algorithms are continually updated. Thus, staying up to date with these changes
is vital for adapting and refining SEO strategies. Moreover, using analytics and
reporting tools, like Google Analytics, plays a pivotal role. These tools help experts manually monitor the performance of SEO strategies, providing in-
sights into traffic, user engagement, and conversion metrics, making it possible
to adapt to ever-changing search engines (Google 2024).
for their queries, and thus, they favor websites that offer valuable and informa-
tive content. This content must be crafted not only to engage readers but also
to incorporate strategic keyword usage. The keywords are terms and phrases
that potential visitors use in search queries. A website can significantly improve
its search engine visibility by thoughtfully using these keywords in its content
(Sharma et al. 2019).
A big part of on-page SEO concerns the HTML structure. Title tags are an example of such an on-page element. These HTML elements define the titles
of web pages and are displayed on SERPs as clickable headlines. Title tags
should be succinct and descriptive and include relevant keywords to enhance
their effectiveness in search rankings and user click-through rates. Though they do not directly impact search rankings, meta descriptions are vital in on-page SEO. These HTML attributes provide a summary of a webpage’s content. A
well-crafted meta description can entice users to click on a search result, thus
improving the click-through rate (CTR) and benefiting the site’s SEO perfor-
mance. The headings and subheadings are also essential for on-page SEO. They
make the content more readable and enjoyable for users while aiding search
engines in comprehending the structure and critical points of a page’s content.
Using relevant keywords in these headings can further boost SEO. The URL
structure is also an essential aspect of on-page SEO. Search engines favor URLs
that are easy to read and include keywords relevant to the page’s content (Luh et al. 2016).
There are many factors to consider when working with on-page SEO. It is a
multifaceted aspect of website optimization that requires attention to elements
like content, title tags, meta descriptions, headings, URL structure, image op-
timization, internal linking, page speed, and much more (Luh et al. 2016). By
effectively managing these on-page factors, website owners can notably enhance
their ranking on search engines and better understand search engines’ inner workings (Sharma et al. 2019).
(or a complete website, depending on which algorithm is used), search engines
indicate that the page is reputable and offers valuable content. However, not all
backlinks are valued equally. Links from high-authority, relevant websites have
a much more substantial impact than those from low-quality, irrelevant sites.
Moreover, the way these links are acquired also matters. Organic or naturally
earned links are usually far more valuable than those obtained through manip-
ulative tactics (Rogers 2002).
There are several ways of calculating the relevance and trust of a page or website.
One such algorithm is PageRank (PR), a system created by Google’s founders,
Larry Page and Sergey Brin, which was one of the first algorithms utilized by
Google for ranking web pages in search result listings. It is based on academic
citations’ logic and treats links as votes. In this system, a link from one website
to another acts as an endorsement and a sign of trust and quality. However,
not all votes are equal. Links from reputable and authoritative sites carry more
weight. This system revolutionized how search results were ranked, moving
away from keyword-centric methods to a system based on the interconnected-
ness and authority of web pages (Page et al. 1998). The formula is PageRank
as described by Rogers (2002) where PR(A) is the PR of page A, PR(Tn) is
the PageRank of a page Tn linking to A, C(Tn) is the number of outbound
links on page Tn, and d is the damping factor, typically set to 0.85, giving each
link a base value. This formula calculates the PageRank of a page based on the
PageRank of pages linking to it, adjusted by their number of outbound links
and the damping factor.
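Given these definitions, the iterative computation can be sketched in a few lines of Python; the three-page link graph is a hypothetical example.

```python
def pagerank(links, d=0.85, iterations=50):
    """Iteratively apply PR(A) = (1 - d) + d * sum(PR(T)/C(T)) over
    pages T linking to A, following the formula described by Rogers (2002)."""
    pages = list(links)
    pr = {p: 1.0 for p in pages}                    # initial PR values
    out_count = {p: len(links[p]) for p in pages}   # C(T): outbound links
    for _ in range(iterations):
        new_pr = {}
        for page in pages:
            incoming = sum(pr[t] / out_count[t]
                           for t in pages if page in links[t])
            new_pr[page] = (1 - d) + d * incoming
        pr = new_pr
    return pr

# Hypothetical three-page web: A and B both link to C, C links back to A.
links = {"A": {"C"}, "B": {"C"}, "C": {"A"}}
ranks = pagerank(links)
```

In this toy graph, C accumulates the most PageRank because two pages endorse it, A benefits from C's endorsement, and B, with no inbound links, keeps only the base value (1 − d).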
Link-related factors extend beyond just the number of links. The relevance
of linking sites, the quality of the link source, the anchor text used in the link,
and the recency of links all factor into how search engines assess these links. For
instance, a link from a site closely related to your topic is more valuable than
a random link from an unrelated site. Similarly, links from new sources can be
more valuable than repeated links from the same source, as they indicate the
site’s growing popularity and relevance (Rogers 2002).
2.6 Classification
Machine learning, a vital branch of artificial intelligence (AI), has revolution-
ized how data is analyzed and interpreted. At the heart of machine learning
lies the concept of enabling systems to learn from data without being explicitly
programmed. Classification is a key function in machine learning, which has
wide-ranging applications, including Search Engine Optimization (SEO).
2.7 XGBoost
XGBoost stands for eXtreme Gradient Boosting and implements gradient boost-
ing machines (GBM), a type of machine learning algorithm that falls under the
umbrella of ensemble learning. Ensemble learning combines multiple models to
improve predictions’ overall performance, robustness, and accuracy. XGBoost
is an enhancement of gradient boosting, a method that builds models sequen-
tially, with each new model attempting to correct the previous errors. One of
the critical features of XGBoost is its use of a gradient-boosting framework.
This framework operates by constructing a series of decision trees sequentially,
where each subsequent tree is built to minimize the errors or residuals of the
previous trees. In gradient boosting, the loss function, a measure of how well the
model performs, is minimized using the gradient descent algorithm. XGBoost
enhances this process by introducing a more structured model formulation to
mitigate overfitting, which makes it more robust and efficient than standard
GBM (Chen & Guestrin 2016).
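The sequential residual-fitting idea can be sketched in pure Python using decision stumps and squared loss. This is a simplified illustration of gradient boosting, not XGBoost itself, which adds the regularized formulation, second-order gradients, and many systems optimizations described by Chen & Guestrin (2016).

```python
def fit_stump(x, residuals):
    """Find the threshold split on x that minimizes squared error."""
    best = None
    for threshold in x:
        left = [r for xi, r in zip(x, residuals) if xi <= threshold]
        right = [r for xi, r in zip(x, residuals) if xi > threshold]
        if not left or not right:
            continue
        lmean, rmean = sum(left) / len(left), sum(right) / len(right)
        err = (sum((r - lmean) ** 2 for r in left)
               + sum((r - rmean) ** 2 for r in right))
        if best is None or err < best[0]:
            best = (err, threshold, lmean, rmean)
    _, t, lmean, rmean = best
    return lambda xi: lmean if xi <= t else rmean

def boost(x, y, rounds=20, lr=0.3):
    """Sequentially add stumps, each fitted to the current residuals."""
    pred = [0.0] * len(y)
    stumps = []
    for _ in range(rounds):
        residuals = [yi - pi for yi, pi in zip(y, pred)]
        stump = fit_stump(x, residuals)
        stumps.append(stump)
        pred = [pi + lr * stump(xi) for pi, xi in zip(pred, x)]
    return lambda xi: sum(lr * s(xi) for s in stumps)

# Toy one-feature regression data (invented for illustration).
model = boost([1, 2, 3, 4], [1.0, 1.0, 3.0, 3.0])
```

Each round fits a new stump to what the ensemble still gets wrong, so the combined prediction converges toward the targets; the learning rate damps each step, which is one of the mechanisms that helps boosted ensembles avoid overfitting.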
XGBoost has been designed to be highly scalable and efficient. It utilizes ad-
vanced principles such as parallel and distributed computing, making it excep-
tionally fast and capable of handling large datasets (Chen & Guestrin 2016).
This scalability is a critical factor that differentiates XGBoost from traditional
gradient-boosting methods. The algorithm can efficiently run on high-performance
computing environments, making it a practical choice for big data applications.
Another significant advantage of XGBoost is its flexibility. The algorithm can be
used for regression (predicting continuous values) and classification (predicting
categorical values) tasks. It supports various loss functions and customization
options, making it adaptable to a wide range of problems (Al Daoud 2019).
Additionally, XGBoost provides features for handling missing values and sup-
ports various data formats, enhancing its usability in real-world scenarios. XG-
Boost also incorporates several mechanisms to avoid overfitting. Besides the
regularized model framework, it offers features like subsampling of the data and
column sampling, further enhancing the robustness of the model. These fea-
tures, combined with the capacity to fine-tune many hyperparameters, allow
practitioners to build highly optimized models tailored to their specific data
and requirements (Zhang et al. 2022).
XGBoost has been employed to classify web pages based on their likelihood
to rank well for specific keywords or queries. By training the algorithm on
features such as keyword density, page structure, internal and external links,
and even user behavior data (like click-through rates and bounce rates), XGBoost was taught to predict the ranking potential of pages with accuracy as high as 70% (Matošević et al. 2021). This predictive capability allows SEO
professionals to prioritize and optimize content more effectively.
2.8 CatBoost
CatBoost, an acronym for ”Categorical Boosting”, is a state-of-the-art machine
learning algorithm that has garnered attention in academic and applied data
science fields for its exceptional performance, particularly in dealing with cate-
gorical data. Developed by researchers at Yandex, CatBoost is an open-source
gradient boosting library, part of the same ensemble learning family of algorithms as XGBoost (Prokhorenkova et al. 2018). CatBoost distinguishes itself within
this family through its innovative approach to processing categorical data and
handling various complexities inherent in diverse datasets, which can be leveraged within SEO-modeling.
similar contexts. Its ability to handle large-scale data and deliver high-accuracy
models has made it popular in industries such as finance (Al Daoud 2019). This also makes it a very good model to compare against XGBoost on large-scale SEO data.
Chapter 3
Methodology
contributes significantly to advancing the field of SEO.
3.2.5 Evaluation
After training the XGBoost and CatBoost models, they will be rigorously eval-
uated to determine their effectiveness in accurately predicting SEO rankings.
This evaluation will involve measuring the models’ accuracy, which will serve as
a key performance indicator. Through this process, we will assess the practical-
ity of the models in real-world SEO applications and their ability to generalize
from the training data to unseen data. This critical stage will allow us to val-
idate our models’ predictions and ensure that our research contributions are
both reliable and applicable to the field of SEO.
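Accuracy, the share of correctly predicted rank classes, can be sketched as follows; the label values are invented examples.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true class labels."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)

# Three of four hypothetical rank-class predictions are correct.
acc = accuracy([1, 2, 3, 1], [1, 2, 1, 1])
```

The same metric is computed on a held-out test split in practice, so that it reflects generalization to unseen pages rather than fit to the training data.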
ChatGPT 3.5 API facilitated the creation of these queries by understanding the
context and nuances required for effective SEO analysis. It generated search
terms that were not only relevant to the study but also varied enough to cover
different business sectors and geographic locations.
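The combination step of the query engineering can be sketched as follows. The cities and business terms are placeholders (the real lists appear in Appendix B), and the per-locale translation of terms is omitted here.

```python
from itertools import product

# Placeholder inputs; Appendix B holds the actual city/language and
# business-term lists used in the study.
cities = [("Stockholm", "sv"), ("Berlin", "de")]
business_terms = ["plumber", "dentist"]

# Every business term is paired with every city to form a search query.
queries = [f"{term} {city}" for term, (city, _lang) in
           product(business_terms, cities)]
```

Crossing terms with locations in this way is what lets the dataset cover many business sectors and geographies with a manageable number of seed lists.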
rotating proxies for HTTPS requests can significantly enhance the capabilities of
web scraping, data collection, or any application where maintaining anonymity
and avoiding IP blocks or rate limits is crucial. Webshare offers rotating proxies
to facilitate these needs effectively (Webshare 2024).
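Proxy rotation itself can be sketched as a simple cycle over a pool; the proxy addresses below are placeholders, and an HTTP client such as the requests library would consume the returned mapping via its proxies argument.

```python
from itertools import cycle

# Placeholder proxy pool; a service like Webshare supplies real endpoints.
proxy_pool = cycle([
    "http://proxy1.example:8080",
    "http://proxy2.example:8080",
    "http://proxy3.example:8080",
])

def next_proxy():
    """Return the next proxy mapping, looping over the pool indefinitely."""
    return {"https": next(proxy_pool)}
```

Rotating the outgoing IP address between requests spreads the scraping load, which is what helps avoid IP blocks and rate limits during large-scale collection.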
and preparation of our dataset, highlighting the methods employed to minimize
biases and maximize the relevance and accuracy of our modeling efforts.
In this study, primary data is vital due to the rapid and continual changes in
search engine algorithms. Given the dynamic nature of the SEO landscape, even
a few months old data can become outdated. Primary data collection methods,
such as real-time web scraping and accessing up-to-date information via APIs
like Google SERP from DataForSEO, ensure that the dataset reflects the latest
trends and algorithmic updates in search engines. This approach guarantees
that the analysis and resultant SEO classification model are grounded in the
most current and relevant data, a critical factor for the validity and effective-
ness of SEO strategies. By using primary data, this research is not just relevant
to the field of SEO, it is aligned with its core needs. In SEO-related studies,
where staying on top of the latest algorithmic changes is a must for accurate
analysis and modeling, primary data is the key. It allows the research to keep pace with the dynamic SEO landscape, ensuring that it is timely and specific to the current SEO environment.
mance of a web page and was inspired by the result of Matošević et al. (2021)
and the report Darrazi & Mak (2023) which can be found in Appendix C. During
the feature selection we collected as many features as we could based on common
HTML tags as found in Jamsa et al. (2002). After some further data cleaning, we calculated multicollinearity and Information Gain among the selected features to drop those which did not add value to the feature set. This left us with features such as the number of H1 tags (h1 tags amount), the frequency of keywords in title tags (title keyword freq), the average character count of H1 tags (h1 avg char count), and 29 other features which all impact the on-page SEO.
Similarly, geographical relevance of content (geography, h1 geography freq, etc.)
is included to understand the impact of location-based optimization on search
rankings (see the full feature set in Appendix A).
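The Information Gain screening described above can be sketched as follows; the feature values, labels, and single-threshold split are hypothetical toy stand-ins for the per-feature evaluation used in the study.

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy of a sequence of class labels."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def information_gain(feature_values, labels, threshold):
    """Entropy reduction from splitting a numeric feature at a threshold."""
    left = [y for x, y in zip(feature_values, labels) if x <= threshold]
    right = [y for x, y in zip(feature_values, labels) if x > threshold]
    n = len(labels)
    weighted = (len(left) / n) * entropy(left) + (len(right) / n) * entropy(right)
    return entropy(labels) - weighted

# Hypothetical toy data: H1-tag counts against a binary "top ranked" label
h1_counts = [0, 1, 1, 2, 3, 5]
top_ranked = [0, 1, 1, 1, 0, 0]
gain = information_gain(h1_counts, top_ranked, threshold=2)
```

Features whose best achievable gain stays near zero contribute little label information and are candidates for removal from the feature set.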
The dataset’s combination of SERP data (such as rank group) with web page
features offers a comprehensive view of the classification model, allowing it to
learn from a wide array of variables that influence a page’s ranking. This rich
set of features is pivotal in developing an SEO model that accurately reflects
the multifaceted nature of search engine ranking algorithms.
The features are focused on on-page factors rather than a combination of both
on-page and off-page. The choice is rooted in practical considerations regard-
ing the control and immediacy of SEO adjustments. On-page elements such as
content quality, keyword optimization, and HTML tags are directly accessible
and modifiable by web developers and SEO professionals. This direct acces-
sibility makes on-page features not only crucial but also the most actionable
aspect of SEO strategy. While off-page factors like backlinks and social sig-
nals are undeniably significant in search engine algorithms, they often require
long-term efforts and external collaborations, which can be more challenging
to influence directly and swiftly. On-page factors, in contrast, offer immediate
opportunities for optimization. Adjustments to elements like title tags, meta
descriptions, and content relevancy can be implemented and measured quickly,
providing SEO professionals with more agile and responsive tools to influence a
website’s search engine performance.
This emphasis on on-page features aligns with the objective of equipping SEO
professionals with actionable insights. By analyzing the impact of these directly
controllable elements, the developed SEO model can serve as a practical guide
for immediate website optimizations. These changes are not only more feasi-
ble to implement but also offer the potential for rapid improvements in SERP
rankings. Therefore, while the dataset may focus on a subset of the factors
influencing page rankings, it strategically targets those most relevant and ac-
cessible to SEO practitioners. This approach ensures that the model’s insights
are practical and actionable, enabling professionals to optimize web pages in a
dynamic search engine landscape.
The ’rel’ prefix in certain features stands for ’relational’, indicating the rela-
Dataset Feature   Explanation
url               The web page URL
rank group        Grouped ranking position in search engine results
keyword           Term associated with the search query
geography         Geographic location relevant to the web page content
Table 3.1: This table shows the URL, keyword, and geography, which together
serve as the unique identifier for each record. The rank group is the classifier,
holding a rank between 1 and 100 within its respective query group. See
Appendix A for the complete feature set.
tion of that feature to the corresponding feature of the best-ranking page for a
particular search query. For example, ’rel h1 tags amount’ compares the num-
ber of H1 tags on a page to the number of H1 tags on the best-ranking page
for the same search query. The ’freq’ suffix stands for ’frequency’, denoting the
occurrence frequency of certain elements (like keywords or geographic terms)
within specific tags or content areas of a web page.
3.5.2 Using SERP Data as a Classifier
Search Engine Results Page (SERP) data from Google will be used as a classifier
in the development of the classification models. This strategy is in line with the
work of Matošević et al. (2021), who discuss the role of web search data
in understanding and modeling web dynamics. The SERP data will provide
the necessary labels for the supervised learning models, categorizing web pages
based on their search engine rankings.
This approach is rooted in our hypothesis that the structure of a web page,
though significant, does not solely determine its rank in the context of a specific
search query. We propose that the competitive landscape, that is, how a page
compares to others in the same space, plays a role in determining its ranking.
This perspective challenges the traditional view of SEO, which often empha-
sizes the optimization of individual page attributes in isolation. Our hypothesis
further suggests that the highest-ranking page within a specific search context
could serve as a benchmark or, as said before, an anchor point. This anchor
provides a reference against which we can evaluate the performance of other
pages. The idea is that in the realm of search engine rankings, it’s not just
about how well-optimized a page is in absolute terms, but how it stacks up
against the current best performer in the same category or search context. In
essence, the relational expansion of data in our study is more than a mere
comparison; it is an attempt to understand the dynamics of relative performance in
the SEO landscape. By analyzing pages in relation to the best performers, we
aim to unravel the nuances of competitive SEO and provide a more comprehen-
sive understanding of what drives search engine rankings. This approach aligns
with our broader aim to develop a more contextually aware and competitively
sensitive SEO classification model.
The formula represents the relational data calculation used in our study for
SEO classification modeling. In this formula, R1 denotes the best-classified
record within a specific SERP (Search Engine Results Page) group, while Rn
represents another record within the same group. The division of Rn by R1
allows for a comparative analysis, indicating how a given web page’s features
and attributes perform relative to the top-ranked page in its category. This
relational approach aims to provide more contextual insights into search engine
rankings, enhancing the predictive power and accuracy of the SEO models. By
understanding and analyzing the performance of web pages in relation to the
best performers, we can derive more nuanced and effective strategies for SEO
optimization.
Rn / R1
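A minimal sketch of this transformation, using toy records and hypothetical feature names, might look as follows; the division-by-zero guard anticipates the normalization step described later in the data preparation.

```python
# Toy SERP group: each dict is one record; the feature names are
# hypothetical stand-ins for the on-page features in Appendix A.
group = [
    {"rank": 1, "h1_tags_amount": 2, "title_keyword_freq": 3},
    {"rank": 4, "h1_tags_amount": 1, "title_keyword_freq": 3},
    {"rank": 9, "h1_tags_amount": 4, "title_keyword_freq": 0},
]

# R1: the best-classified record (lowest rank value) in the group
best = min(group, key=lambda r: r["rank"])

def to_relational(record, anchor):
    """Express each feature as Rn / R1, guarding against division by zero."""
    rel = {"rank": record["rank"]}
    for key, value in record.items():
        if key == "rank":
            continue
        denom = anchor[key]
        rel["rel_" + key] = value / denom if denom else 0.0
    return rel

rel_group = [to_relational(r, best) for r in group]
```

Each `rel_` value expresses a page's feature as a ratio to the anchor page, so the best performer itself maps to 1.0 on every feature.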
insights into the effectiveness of relational abstraction in improving the predic-
tive accuracy of the SEO classification model.
We chose to use both models to ensure the reliability of the relational dataset
results, confirming that the outcome did not depend on the choice of one
specific model or on overfitting.
Accuracy = (TP + TN) / (TP + TN + FP + FN)
There are other evaluation metrics better suited for imbalanced datasets, such
as the Receiver Operating Characteristic (ROC) curve (Hoo et al. 2017). SEO
data, however, are perfectly balanced in theory, since each page receives a
unique rank placing per SERP. This makes accuracy both relevant and the
evaluation metric preferred by Matošević et al. (2021) and Banaei & Honarvar
(2017). We therefore chose accuracy as our evaluation metric, which also makes
it easier to compare our results with previous research.
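For the multi-class setting used here, the formula reduces to the share of predictions that match the true rank group. A toy illustration with hypothetical labels:

```python
def accuracy(y_true, y_pred):
    """Share of predictions matching the true label; equivalent to
    (TP + TN) / (TP + TN + FP + FN) in the binary case."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)

# Hypothetical predicted vs. true rank groups (1-6) for eight pages
y_true = [1, 2, 6, 6, 3, 4, 6, 5]
y_pred = [1, 2, 6, 5, 3, 4, 6, 6]
acc = accuracy(y_true, y_pred)  # 6 of 8 predictions are correct
```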
3.5.9 Addressing Biases
A comprehensive approach was adopted to address potential biases in data
collection, preprocessing, and modeling. This involved rigorous data validation
and cleaning to ensure data quality, regular monitoring of the web scraping and
updating the data collection to minimize errors, and the application of diverse
preprocessing techniques to address issues like missing values or outliers. Careful
selection of features and testing various modeling configurations were essential to
reduce errors and biases in the final model. The methodology was continuously
refined, considering new insights, to enhance the reliability and accuracy of the
study’s outcomes.
This approach would also include a comparative analysis, where features of top-
performing pages are contrasted with those of lower-ranking pages to identify
what distinguishes them. This strategy allows for a more conceptual under-
standing of SEO success factors, moving beyond empirical data to delve into
the qualitative aspects that contribute to a page’s high ranking. Based on the
findings from the in-depth analysis, this approach would aim to formulate the-
ories or models that encapsulate the essence of effective SEO strategies. These
models would provide a conceptual framework that can guide SEO practices,
offering a more theoretical perspective compared to the data-centric models
derived from empirical research.
3.7 Validity and Reliability
In any research endeavor, the concepts of validity and reliability are fundamen-
tal to ensuring the accuracy and trustworthiness of the findings. Validity refers
to how accurately a study reflects the specific concept it intends to measure.
At the same time, reliability pertains to the consistency of a measurement over
time. For validity, it is crucial to consider both internal and external aspects.
Internal validity relates to how the research findings accurately depict the re-
ality of the subjects being studied, devoid of external influences or biases. In
the context of this SEO research, the study has been designed and executed
entirely on the basis of the empirical information-gain results from previous
works. External validity concerns how the findings can be generalized beyond the specific
settings of the study (Kothari 2004). In our case, the diversity and size of the
dataset play a critical role in enhancing external validity, as they allow for the
generalization of findings across different web domains and search contexts. Co-
hen et al. (2002) emphasize the importance of these aspects, outlining strategies
to enhance both internal and external validity in research studies.
contribute valuable knowledge to the field of SEO while maintaining a strong
ethical foundation.
Besides the alignment with ethical data scraping, there is potential harm if the
research findings are used to manipulate or recklessly exploit search engine
algorithms, contradicting the ethical principle of beneficence. To mitigate this,
the communication of results is handled responsibly, focusing on academic and
practical insights rather than exploiting loopholes in search algorithms. The
study acknowledges that search engine algorithms, like Google’s, are propri-
etary and respect the intellectual property rights of these entities. Google’s
search algorithm is a trade secret and is not intended to be reverse-engineered
or understood in detail. The research respects this by not attempting to pre-
cisely replicate or decode the algorithm. Instead, the focus is on analyzing
patterns and correlations that are observable and ethical to study.
Chapter 4
Main Results
for queries_index, queries_row in queries.iterrows():
    # Printing progression
    print("Progress: " + str(int(((queries_index + 1) / (len(queries) + 1)) * 100)) + "%")
    # Identify language
    if queries_row[2].strip() == "English":
After developing our extensive list of queries, we utilized these queries to extract
data from the DataForSEO SERP API. This API provided us with 4.6 million
SERP results, which included page links and their corresponding SERP ranks
based on each query. This rich dataset, containing a total of 4.6 million records,
formed the foundation for our subsequent analysis.
In our Python script, we defined a function seodata_serp to handle the API
requests. For each query in our list, we initialized the RestClient with our cre-
dentials. We then crafted a post data dictionary for each query, specifying the
language, country, and desired depth of search results (i.e., the number of
results per request), which was set to 100. We made a POST request to the DataForSEO
API endpoint to retrieve live, organic SERP data from Google. Successful re-
sponses were logged, detailing the progress and key parameters of each query,
and the responses were collected in a list. This systematic approach allowed us
to efficiently gather detailed SERP data.
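The payload construction described above can be sketched roughly as follows; the field names mirror common DataForSEO SERP API conventions but are assumptions here, not a verified schema, and no request is actually sent in this sketch.

```python
def build_post_data(keyword, language_code="en",
                    location_name="United States", depth=100):
    """Assemble one query payload; field names are assumed, not verified."""
    return {
        "keyword": keyword,
        "language_code": language_code,
        "location_name": location_name,
        "depth": depth,  # number of organic results returned per request
    }

# Hypothetical query from the generated list
payload = build_post_data("best running shoes")
```

In the actual pipeline, one such dictionary per query would be posted to the live organic SERP endpoint and the response appended to a result list.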
# SERP API call function taking a list of queries as a parameter
def seodata_serp(queries_list):
To expand our dataset with on-page feature values, we set up a Proxmox cluster
with four workstation computers, collectively harnessing 24 cores. This power-
ful setup allowed us to scrape web pages from 4.6 million links efficiently. To
manage the high volume of requests from a single network IP without triggering
security protocols on the target websites, we utilized rotating proxy IPs provided
by Webshare.io. Each HTTP request was routed through these proxies, with
each proxy rotating with every request to mask our scraping activities. This
setup was crucial for completing the data collection without interruption, which
took about 34 days of continuous operation using multiprocessing techniques.
We divided the dataset among the 24 cores, with each core processing a subset
of the data, thereby maximizing efficiency and minimizing time.
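The division of work across cores can be sketched as a simple chunking step; the URLs below are placeholders, and in the actual pipeline each chunk would be handed to a worker process (e.g., via multiprocessing.Pool).

```python
def split_into_parts(items, n_parts):
    """Divide a list into n_parts nearly equal chunks, one per worker core."""
    k, rem = divmod(len(items), n_parts)
    parts, start = [], 0
    for i in range(n_parts):
        size = k + (1 if i < rem else 0)  # spread the remainder evenly
        parts.append(items[start:start + size])
        start += size
    return parts

# Placeholder URLs standing in for the 4.6 million scraped links
urls = ["https://example.com/page%d" % i for i in range(100)]
parts = split_into_parts(urls, 24)
```

Chunk sizes differ by at most one item, so no core sits idle while another processes a disproportionately large share.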
# Printing progression
preset_procentile = int((len(urls_data_part) / 100) * procentile)
temp = -1
print("Process initiated for part: " + str(part))

# Calculating progression
current_percentage = int(((i + 1) / len(urls_data_part)) * 100)
if current_percentage > temp:
    temp = current_percentage
    # Printing progression
    print(f"Part: {part}, Errors: {len(failed_urls_temp)}, Progress: {temp}%")
# Multiprocessing function taking a set of links and starting points
def parallel_scraper(urls_data_parts, start_percentiles):
The web scraping process was geared towards collecting specific data points
listed in Appendix A, Table A.2 of our documentation. However, we encoun-
tered several challenges that reduced the volume of usable data. A significant
number of web pages blocked our scraping attempts due to our IP addresses
or because we accessed the sites via headless browsers, which some sites can
detect and restrict. Additionally, compliance with web standards meant re-
specting robots.txt files that explicitly forbade scraping, leading to further data
exclusion. This resulted in approximately 64% of our potential data being in-
accessible or unusable, leaving us with 3,251,724 records. Further data cleaning
was necessary to remove records with unreadable symbols or missing values,
which culminated in a refined dataset of 1,653,946 records ready for use in our
analysis. This stage was critical to ensure the quality and reliability of our data
for subsequent machine learning processes.
their values by dividing each by the corresponding value of the best-performing
record in the group, ensuring to avoid any division by zero. This normalization
was critical for aligning the data across different groups, making it consistent
and comparable, thus preparing it for effective model training. This approach
not only improved the accuracy of our models by standardizing data points rel-
ative to peak performance benchmarks within their groups but also facilitated a
deeper analysis of the underlying patterns across varied segments of the dataset.
# Finding the row indices with the lowest classifier value for each group
best_classified_indices = (
    rel_dataset.groupby('group_id')[rel_dataset.columns[1]].idxmin()
)
(repository file ‘classifier_models.ipynb‘).
# Creating two correlation matrices
rel_correlation_matrix = rel_dataset.corr()
org_correlation_matrix = org_dataset.corr()
Listing 4.7: Cross calculating the correlation between features
# Sorting by importance
feature_importance_df.sort_values(by='Importance', ascending=False, inplace=True)
Listing 4.8: Calculating the information gain for each feature
Once we had two preprocessed datasets ready, we recognized the need to sim-
plify the classification system to streamline the training process. Originally
dealing with 100 classifiers based on rank positions, we condensed this to just
six classifiers to reduce complexity and improve model manageability. We de-
fined a function, map_to_smaller_range, to categorize the SERP positions into
six distinct groups based on their rank. This function assigns a numeric label
from 1 to 6, grouping ranks into clusters like Top 5, 6-10, and so on, up to
71-100. We then applied this mapping function to our ’classifiers’ column using
the .apply() method, effectively transforming the original detailed classifications
into broader categories, making the dataset less cumbersome for our machine
learning algorithms to process. This reduction not only simplified the training
but also aligned the classifiers with more strategic groupings relevant to SEO
analysis.
# Defining a function to map values to the specified ranges
def map_to_smaller_range(value):
    if value <= 5:
        return 1  # Top 5
    elif value <= 10:
        return 2  # 6-10
    elif value <= 20:
        return 3  # 11-20
    elif value <= 40:
        return 4  # 21-40
    elif value <= 70:
        return 5  # 41-70
    else:
        return 6  # 71-100
y = y.apply(map_to_smaller_range)
Listing 4.9: Reducing the amount of classifiers to 6 groups
Once our datasets were prepped and optimized, we moved on to the model
training phase. We developed two Python scripts utilizing the XGBoost and
CatBoost libraries, chosen for their efficiency and performance in classification
tasks. To ensure the reliability of our results, each model was trained seven
times on both datasets. We split our dataset into training and testing sets
using Scikit-learn’s train test split function, setting aside 20% of the data for
testing to evaluate the model’s performance. We then instantiated a CatBoost
classifier, training it with both datasets separately. The model was trained on
the training set, while the testing set was used as the evaluation set to monitor
performance and avoid overfitting.
# Splitting the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)
Similarly, the XGBoost model was then trained using the training dataset,
allowing us to measure its effectiveness on the test set.
# Splitting the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)
In the final stage of our experiment, we calculated the accuracy for each model
iteration to gauge their performance. As previously mentioned, we used an 80%
training data and 20% testing data split for this purpose. The models demon-
strated consistency, with a standard deviation of only 0.3% in their accuracy
scores, indicating reliable performance across seven training cycles per model
and dataset used.
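The consistency check can be reproduced with the standard library; the seven accuracy scores below are hypothetical, chosen only to illustrate a spread on the order of 0.3%.

```python
from statistics import mean, stdev

# Hypothetical accuracy scores from seven training cycles of one model
scores = [0.678, 0.682, 0.680, 0.677, 0.683, 0.679, 0.681]

avg = mean(scores)
spread = stdev(scores)  # sample standard deviation across the runs
```

A spread well below one percentage point across repeated runs indicates that the reported accuracies are stable rather than artifacts of a single split or seed.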
After predictions were made using the models, we used the accuracy_score
function from the sklearn.metrics package to determine how well the models
performed against the actual outcomes. For each model, such as CatBoost and
XGBoost, we printed out the accuracy in a formatted output to easily compare
their effectiveness. This method provided a clear, quantitative measure of how
well each model could predict SEO rankings based on the input data processed
through the training and testing phases.
# Calculating the accuracy
accuracy = accuracy_score(y_test, y_pred)
Listing 4.12: Calculating the accuracy of the respective models
In our research, the baseline was determined based on the classification of web
pages into six distinct ranking groups, top 5, 6-10, 11-20, 21-40, 41-70, and
71-100. This categorization reduced the complexity inherent in predicting the
exact SERP rank. The baseline accuracy was calculated from the most pop-
ulous category within our dataset, which reflects a naïve model’s performance
that would always predict the web page to be in this most common category.
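As a sketch, the baseline computation reduces to the share of the most common class; the label counts below are toy values that only roughly mirror the 34.97% figure reported here.

```python
from collections import Counter

# Toy rank-group labels (1-6); class 6 is deliberately the most common
labels = [6] * 35 + [5] * 25 + [4] * 15 + [3] * 10 + [2] * 8 + [1] * 7

counts = Counter(labels)
majority_class, majority_count = counts.most_common(1)[0]
baseline_accuracy = majority_count / len(labels)  # naive always-predict-6 model
```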
Table 4.1: The table shows the population percentile of each classifier from the
dataset.
Our data indicated that the largest group was class 6, which encompassed
approximately 34.97% of the dataset. Thus, the baseline accuracy for our model,
under the assumption of always predicting this most frequent group, was set
at 34.97%. This baseline provides a fundamental benchmark for evaluating the
effectiveness of our machine learning models. In contrast, the XGBoost and
CatBoost models, trained on both original and relational values, demonstrated
a significant improvement over this baseline, especially CatBoost, which
achieved an accuracy as high as 68.0% with relational values. The comparison
of these results against the baseline accuracy underscores the value of using
sophisticated machine learning models and relational data in enhancing the
prediction accuracy of SEO rankings. The improvements are significant in the
context of SEO, where such increases in accuracy can have meaningful implica-
tions for web page visibility and traffic.
The results present the findings from the empirical analysis conducted using
XGBoost and CatBoost models on two distinct datasets, one consisting of orig-
inal web-scraped values and the other of relational values derived from com-
paring web page attributes to top-ranking pages for a specific group of pages
based on a search query (such a group is called a SERP). To ensure the validity of the
results, both the XGBoost and CatBoost models were trained seven times on
each dataset type. This rigorous approach aimed to confirm the consistency
of the findings. The datasets used comprised original web-scraped values and
relational values. The results revealed a noticeable accuracy improvement when
using relational values, averaging at approximately 1.3% across models and the
the test cases. The XGBoost model results on the dataset with relational values
was 67.5%. When trained on the dataset with original web-scraped values, the
model demonstrated a slightly lower accuracy of 66.3%. The CatBoost model
performed marginally better on the relational dataset, attaining an accuracy of
68.0%. In comparison, the model trained with the original dataset showed an
accuracy of 66.8%.
Table 4.2: The table shows the final accuracy of the trained models depending
on which dataset was used.
These results indicate that incorporating relational values into the dataset of-
fers a modest but notable improvement in the accuracy of SEO classification
models. The increase in accuracy, while seemingly incremental, is significant
in the context of SEO where even small edges can translate into substantial
improvements in web page rankings. The improved performance of models us-
ing relational values underscores the importance of contextual and comparative
SEO factors. It suggests that understanding a web page’s attributes in relation
to top performers within the same search context provides a more nuanced and
effective approach to predicting search engine rankings.
The results validate the hypothesis that a broader, more contextually aware
dataset can enhance the accuracy of SEO classification models. The findings
from both XGBoost and CatBoost models reinforce the value of relational data
in understanding and predicting the complexities of search engine rankings,
offering meaningful insights for advancing SEO strategies in the competitive
digital marketing landscape.
4.3 Discussion
In evaluating the results of the study, the accuracy measurements from the ma-
chine learning models stand out as a significant achievement, highlighting the
efficacy of the advanced classification techniques used. The choice of XGBoost
and CatBoost was pivotal due to their capability to handle large datasets and
complex feature interactions efficiently, which was essential given the scale and
scope of the data involved. These models are particularly adept at ranking and
classification tasks, which made them suitable for the predictive tasks of SEO
classification. The repeated training cycles further reinforced the reliability of
the results, showing a consistent improvement in accuracy, which confirms the
models’ robustness.
The use of established Python libraries like Scikit-learn, XGBoost, and Cat-
Boost instead of developing models from scratch was a strategic choice that
allowed for more focus on data manipulation and model tuning rather than on
the foundational aspects of algorithm development. These libraries offer
well-optimized, tested algorithms that are both scalable and efficient, facilitating
a more streamlined research process and reducing the potential for error that
comes with custom-coded solutions.
However, the study’s delimitations also play a critical role in shaping the inter-
pretation of its outcomes. The focus on on-page SEO factors, while significant,
does not encapsulate the full spectrum of variables that search engines con-
sider for ranking web pages. Off-page factors like backlinks and social signals,
which were not included in this study, are also known to significantly impact
SEO performance. Additionally, the reliance on web scraping for data collection
introduces potential biases, as not all relevant data may be accessible or accu-
rately representable through this method, and ethical considerations regarding
data privacy and compliance with web usage policies must be addressed.
Chapter 5
Conclusion
In this work, we studied the intricacies and potential improvements in SEO clas-
sification models. More specifically, we considered the research question: ”How
can the accuracy of SEO classification models be improved through the use of
large-scale, relative feature-engineered datasets?”
Through our empirical research, we provided evidence that incorporating
relational values into SEO classification models consistently enhances their
predictive accuracy. This approach contextualizes a webpage’s features against
those of the top performers in similar search contexts. The improvement, though
modest, is substantial in the domain of SEO, where even small increases in
accuracy can yield considerable benefits in web page visibility and traffic.
The findings of this study go beyond merely refining existing models; they open
avenues for deeper exploration into the societal and academic implications of
enhanced SEO modeling techniques. For businesses and web content creators,
the advancements in model accuracy and reliability translate into more effective
and strategic web optimization, bolstering visibility and competitiveness in the
digital marketplace. Furthermore, the academic value of this study lies in its
application and testing of machine learning theories within the intricate real-
world environment of SEO, contributing valuable insights to the field. Thus, this
study not only bridges the knowledge gap identified in earlier research but also
lays the groundwork for future investigations that consider the multifaceted ap-
proach necessary for addressing the technical challenges, economic benefits, and
academic exploration in the evolving landscape of SEO and digital marketing.
The research presented in ”Using Machine Learning for Web Page Classification
in Search Engine Optimization” by Matošević et al. (2021) offers valuable
insights for comparing with our study. Their research employs machine learn-
ing techniques for classifying web pages into predefined categories based on the
extent of content adjustment to SEO guidelines, using data labeled by domain
experts. While Matošević et al. (2021) explored a range of machine learning
classifiers including decision trees, SVM, Naïve Bayes, KNN, and logistic
regression, our study concentrated on XGBoost and CatBoost, the models that
performed most accurately. They reported classifier accuracies ranging from
54.59% to 69.67%, exceeding their baseline classification accuracy
of 48.83%. In our research, we compared the performance of these models using
datasets with
original web-scraped values and relational values. Our findings demonstrated
that models using relational values achieved higher accuracies, with XGBoost
reaching 67.5% and CatBoost 68.0%, an improvement of as much as 33.03
percentage points over our baseline accuracy of 34.97%.
5.2 Delimitations
One of the significant limitations was the available computing power and time.
Advanced machine learning models and web scraping, particularly when deal-
ing with large datasets, require substantial computational resources and time.
Limited compute power constrained the amount of data that could be collected,
the complexity of the models we could run and the depth of analysis we could
perform within a reasonable time frame. This factor limited the refinement of
the models and the granularity of the results.
factors practically unexplored.
The process of web scraping faced several challenges. Many websites deploy
technical measures to prevent scraping, such as IP blocking, CAPTCHAs, and
dynamic content rendering, making data collection more difficult. Additionally,
the legal and ethical considerations, particularly adherence to websites’ policies
against scraping, further constrained the dataset’s breadth. These limitations
impacted the diversity and representativeness of the dataset, which, in turn,
may affect the generalizability of the study’s findings.
Reliance on web-scraped data also raises concerns about data reliability and
potential biases. Web content is dynamic and constantly changing, and there’s
a risk that the scraped data may not accurately reflect the current state of web
pages or the internet at large.
The findings are most applicable to the specific context and dataset studied.
Given the rapid evolution of search engine algorithms and the internet land-
scape, the applicability of the findings may be limited over time or in different
contexts.
While the research provides valuable insights into the application of machine
learning in SEO, these delimitations highlight the need for caution in general-
izing the results. They underscore the importance of considering the study’s
specific context, the constraints under which it was conducted, and the rapidly
changing nature of the digital landscape when interpreting the findings.
The enhanced accuracy of SEO classification models has the potential to in-
fluence how websites are optimized for search engines. While this can lead to
improved visibility for businesses and content creators, there’s also a respon-
sibility to ensure these practices do not devolve into manipulative techniques
that degrade the quality of search results, as noted earlier (e.g., low-quality
content), or violate search engine guidelines. Ethical SEO should focus on improving
user experience and providing valuable content, rather than solely on exploiting
algorithmic vulnerabilities. The increased sophistication of SEO tools impacts
how information is accessed and consumed by society. There’s a risk that the
dominant narrative or content could be shaped by those with the most advanced
SEO tactics, potentially leading to a homogenization of content or marginaliza-
tion of less optimized but valuable information sources.
Together, these AI tools played a crucial role in refining our research methodol-
ogy, improving the accuracy and clarity of our documentation, and streamlining
the technical aspects of our project.
digital marketing trends. Integrating off-page factors such as backlinks, social
media presence, and domain authority could significantly enhance the perfor-
mance of SEO classification models. These factors play a crucial role in how
search engines rank pages but present challenges in data collection and interpre-
tation. Future studies could focus on innovative methods to accurately collect
and integrate off-page data, balancing the complexity of these factors with the
practicalities of model implementation.
Bibliography
Akamine, S., Kato, Y., Kawahara, D., Shinzato, K., Inui, K., Kurohashi, S.
& Kidawara, Y. (2009), Development of a large-scale web crawler and search
engine infrastructure, in ‘Proceedings of the 3rd International Universal Com-
munication Symposium’, pp. 126–131.
Al Daoud, E. (2019), ‘Comparison between XGBoost, LightGBM and CatBoost
using a home credit dataset’, International Journal of Computer and Infor-
mation Engineering 13(1), 6–10.
Alpaydin, E. (2020), Introduction to machine learning, MIT Press.
Balabantaray, R. C. (2017), ‘Evaluation of web search engine based on ranking
of results and its features’, International Journal of Information and Com-
munication Technology 10(4), 392–405.
Banaei, H. & Honarvar, A. R. (2017), ‘Web page rank estimation in search
engine based on seo parameters using machine learning techniques’, Int J
Comput Sci Netw Sec 17, 95–100.
Bar-Ilan, J. (2007), ‘Manipulating search engine algorithms: the case of Google’,
Journal of Information, Communication and Ethics in Society 5(2/3), 155–
166.
Bell, E., Bryman, A. & Harley, B. (2022), Business research methods, Oxford
University Press.
Bhandari, R. S. & Bansal, A. (2018), ‘Impact of search engine optimization as
a marketing tool’, Jindal Journal of Business Research 7(1), 23–36.
Brown, I. & Mues, C. (2012), ‘An experimental comparison of classification
algorithms for imbalanced credit scoring data sets’, Expert systems with ap-
plications 39(3), 3446–3453.
Chen, T. & Guestrin, C. (2016), XGBoost: A scalable tree boosting system, in
‘Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge
Discovery and Data Mining’, pp. 785–794.
Cohen, L., Manion, L. & Morrison, K. (2002), Research methods in education,
Routledge.
Creswell, J. W. & Creswell, J. D. (2017), Research design: Qualitative, quanti-
tative, and mixed methods approaches, Sage Publications.
Darrazi, A. & Mak, L. K. (2023), ‘The use of XAI for SEO classification’,
Research Topics in Data Science, HT2023, Stockholm University.
Lim, T.-S., Loh, W.-Y. & Shih, Y.-S. (2000), ‘A comparison of prediction accu-
racy, complexity, and training time of thirty-three old and new classification
algorithms’, Machine learning 40, 203–228.
Luh, C.-J., Yang, S.-A. & Huang, T.-L. D. (2016), ‘Estimating Google’s search
engine ranking function from a search engine optimization perspective’, On-
line Information Review 40(2), 239–255.
Matošević, G., Dobša, J. & Mladenić, D. (2021), ‘Using machine learning for web
page classification in search engine optimization’, Future Internet 13(1), 9.
McKinney, W. (2022), Python for data analysis, O’Reilly Media, Inc.
Page, L., Brin, S., Motwani, R. & Winograd, T. (1998), The PageRank citation
ranking: Bringing order to the web, in ‘Proc. of the 7th International World
Wide Web Conf’.
Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O.,
Blondel, M., Prettenhofer, P., Weiss, R., Dubourg, V. et al. (2011), ‘Scikit-
learn: Machine learning in Python’, Journal of Machine Learning Research
12, 2825–2830.
Portier, W. K., Li, Y. & Kouassi, B. A. (2020), Improving search engine ranking
prediction based on a new feature engineering tool, in ‘Proceedings of the
2020 4th International Conference on Vision, Image and Signal Processing’,
pp. 1–6.
Prokhorenkova, L., Gusev, G., Vorobev, A., Dorogush, A. V. & Gulin, A. (2018),
‘Catboost: unbiased boosting with categorical features’, Advances in neural
information processing systems 31.
Ravi, S., Ganesan, N. & Raju, V. (2012), ‘Search engines using evolution-
ary algorithms’, International Journal of Communication Network Security
1(4), 39–44.
Reyes-Lillo, D., Morales-Vargas, A. & Rovira, C. (2023), ‘Reliability of domain
authority scores calculated by moz, semrush, and ahrefs’, Profesional de la
información/Information Professional 32(4).
Witten, I., Frank, E., Hall, M. & Pal, C. (2016), Data Mining: Practical Ma-
chine Learning Tools and Techniques.
Zhang, P., Jia, Y. & Shang, Y. (2022), ‘Research and application of xgboost
in imbalanced data’, International Journal of Distributed Sensor Networks
18(6), 15501329221106935.
Zheng, A. & Casari, A. (2018), Feature engineering for machine learning: prin-
ciples and techniques for data scientists, O’Reilly Media, Inc.
Appendix A
Table A.1: This table shows the URL, keyword, and geography, which together
act as the unique identifier for each record. The rank group is the classification
target: the page's rank (1–100) within its respective query group.
Dataset Features Explanation
h1 tags amount Number of H1 on the web page
h1 avg char count Average character count of H1
h1 keyword freq Frequency of keywords in H1
h1 geography freq Frequency of location terms in H1
h2 tags amount Number of H2 on the web page
h2 keyword freq Frequency of keywords in H2
h2 geography freq Frequency of location terms in H2
h3 tags amount Number of H3 on the web page
h3 keyword freq Frequency of keywords in H3
h3 geography freq Frequency of location terms in H3
h4 tags amount Number of H4 on the web page
p tags amount Number of P on the web page
a internal tags Number of internal links (anchor) on the web page
a external tags Number of external links (anchor) on the web page
img tags amount Number of img on the web page
meta tags amount Number of meta on the web page
title word count Word count of the web page title
description word count Word count of the web page description
title char count Character count of the web page title
title keyword freq Frequency of keywords in the web page title
title geography freq Frequency of location terms in the web page title
description char count Character count of the web page description
alt keyword freq Frequency of keywords in img alt attributes
alt geography freq Frequency of location terms in img alt attributes
a keyword freq Frequency of keywords in anchor
a geography freq Frequency of location terms in anchor
url keyword freq Frequency of keywords in the URL
url geography freq Frequency of location terms in the URL
p total char count Total character count in P
p total keyword freq Frequency of keywords in P
p total geography freq Frequency of location terms in P
Table A.2: These features are based on the previous on-page values but relative
to the best-ranking page in its respective query group.
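A minimal sketch, using only the Python standard library, of how raw on-page counts such as those listed above could be collected from a page's HTML. The thesis's actual scraper is not reproduced in this appendix, so the tag selection and naming here are illustrative:

```python
# Illustrative on-page tag counter (assumed implementation, not the
# thesis's scraper): counts occurrences of selected HTML tags.
from html.parser import HTMLParser

class TagCounter(HTMLParser):
    def __init__(self, tags=("h1", "h2", "h3", "h4", "p", "a", "img", "meta")):
        super().__init__()
        self.counts = {t: 0 for t in tags}

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag names before this callback.
        if tag in self.counts:
            self.counts[tag] += 1

html = "<html><h1>Hotel Stockholm</h1><p>Welcome</p><p>Book now</p></html>"
counter = TagCounter()
counter.feed(html)
# counter.counts["h1"] == 1, counter.counts["p"] == 2
```

The keyword- and geography-frequency features would additionally require matching the query's term and location against the text inside each tag.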
Dataset Features Explanation
rel h1 tags amount Relative H1 amount
rel h1 avg char count Relative average character count of H1
rel h1 keyword freq Relative frequency of keywords in H1
rel h1 geography freq Relative frequency of location terms in H1
rel h2 tags amount Relative H2 amount
rel h2 keyword freq Relative frequency of keywords in H2
rel h2 geography freq Relative frequency of location terms in H2
rel h3 tags amount Relative H3 amount
rel h3 keyword freq Relative frequency of keywords in H3
rel h3 geography freq Relative frequency of location terms in H3
rel h4 tags amount Relative H4 amount
rel p tags amount Relative P amount
rel a internal tags Relative internal links amount
rel a external tags Relative external links amount
rel img tags amount Relative img amount
rel meta tags amount Relative meta amount
rel title word count Relative title word count
rel description word count Relative description word count
rel title char count Relative title character count
rel title keyword freq Relative frequency of keywords in title
rel title geography freq Relative frequency of location terms in title
rel description char count Relative description character count
rel alt keyword freq Relative frequency of keywords in img alt
rel alt geography freq Relative frequency of location terms in img alt
rel a keyword freq Relative frequency of keywords in anchor
rel a geography freq Relative frequency of location terms in anchor
rel url keyword freq Relative frequency of keywords in the URL
rel url geography freq Relative frequency of location terms in the URL
rel p total char count Relative total character count in P
rel p total keyword freq Relative frequency of keywords in P
rel p total geography freq Relative frequency of location terms in P
Table A.3: These features are based on the previous on-page values but relative
to the best-ranking page in its respective query group.
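The "rel" features above can be derived by normalising each page's on-page values against those of the best-ranking page in the same query group. A minimal sketch, assuming rank 1 identifies the best page and a simple ratio normalisation (the thesis's exact formula is not quoted in this appendix):

```python
# Sketch of deriving "rel_" features: each value divided by the
# corresponding value of the best-ranking (rank 1) page in the group.
# The ratio normalisation is an assumption for illustration.
def add_relative_features(group_rows):
    """group_rows: dicts for one query group, each with 'rank' and numeric features."""
    best = min(group_rows, key=lambda r: r["rank"])
    out = []
    for row in group_rows:
        rel = dict(row)
        for key, value in row.items():
            if key == "rank":
                continue
            denom = best[key]
            rel["rel_" + key] = value / denom if denom else 0.0
        out.append(rel)
    return out

group = [
    {"rank": 1, "h1_tags_amount": 2, "title_word_count": 8},
    {"rank": 7, "h1_tags_amount": 1, "title_word_count": 4},
]
rows = add_relative_features(group)
# The rank-7 page gets rel_h1_tags_amount == 0.5, rel_title_word_count == 0.5.
```

Normalising within the query group makes pages comparable across queries of very different competitiveness, which is the motivation for this second feature set.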
Appendix B
Query Engineering
Table B.1: The table shows the city, country and language name for a specific
query. This information is a component of the query engineering.
City Country Language
Bucharest Romania Romanian
Budapest Hungary Hungarian
Prague Czech Republic Czech
Warsaw Poland Polish
Sofia Bulgaria Bulgarian
Vienna Austria German
Zurich Switzerland German
Geneva Switzerland German
Munich Germany German
Hamburg Germany German
Krakow Poland Polish
Bratislava Slovakia Slovak
Ljubljana Slovenia Slovenian
Bergen Norway Norwegian
Gothenburg Sweden Swedish
Turku Finland Finnish
Aarhus Denmark Danish
Akureyri Iceland Icelandic
Tromsø Norway Norwegian
Uppsala Sweden Swedish
Oulu Finland Finnish
Birmingham United Kingdom English
Marseille France French
Antwerp Belgium French
Rotterdam Netherlands Dutch
Strasbourg France French
Edinburgh United Kingdom English
Düsseldorf Germany German
Cologne Germany German
Naples Italy Italian
Seville Spain Spanish
Porto Portugal Portuguese
Thessaloniki Greece Greek
Palermo Italy Italian
Valletta Malta Maltese
Zagreb Croatia Croatian
Ljubljana Slovenia Slovenian
Minsk Belarus Russian
Belgrade Serbia Serbian
Sarajevo Bosnia and Herzegovina Bosnian
Salzburg Austria German
Basel Switzerland German
Stuttgart Germany German
Nuremberg Germany German
Graz Austria German
Innsbruck Austria German
Poznan Poland Polish
Brno Czech Republic Czech
Trondheim Norway Norwegian
Malmö Sweden Swedish
Espoo Finland Finnish
Odense Denmark Danish
Keflavik Iceland Icelandic
Umeå Sweden Swedish
Linköping Sweden Swedish
Västerås Sweden Swedish
Glasgow United Kingdom English
Lyon France French
Ghent Belgium French
Utrecht Netherlands Dutch
Bordeaux France French
Leeds United Kingdom English
Hannover Germany German
Leipzig Germany German
Genoa Italy Italian
Bilbao Spain Spanish
Coimbra Portugal Portuguese
Patras Greece Greek
Bologna Italy Italian
Split Croatia Croatian
Maribor Slovenia Slovenian
Varna Bulgaria Bulgarian
Lviv Ukraine Ukrainian
Cluj-Napoca Romania Romanian
Novi Sad Serbia Serbian
Tirana Albania Albanian
Vilnius Lithuania Lithuanian
Rostov-on-Don Russia Russian
Odessa Ukraine Ukrainian
Linz Austria German
Lausanne Switzerland German
Bonn Germany German
Dresden Germany German
Szczecin Poland Polish
Ostrava Czech Republic Czech
Debrecen Hungary Hungarian
Timișoara Romania Romanian
Karlstad Sweden Swedish
Jönköping Sweden Swedish
Kuopio Finland Finnish
Aalborg Denmark Danish
Helsingborg Sweden Swedish
Bodø Norway Norwegian
Lahti Finland Finnish
Stavanger Norway Norwegian
Nottingham United Kingdom English
Toulouse France French
Liege Belgium French
The Hague Netherlands Dutch
Nice France French
Cardiff United Kingdom English
Bremen Germany German
Essen Germany German
Turin Italy Italian
Zaragoza Spain Spanish
Faro Portugal Portuguese
Heraklion Greece Greek
Venice Italy Italian
Rijeka Croatia Croatian
Piraeus Greece Greek
Constanța Romania Romanian
Kaunas Lithuania Lithuanian
Brno Czech Republic Czech
Plovdiv Bulgaria Bulgarian
Yekaterinburg Russia Russian
Kharkiv Ukraine Ukrainian
Nizhny Novgorod Russia Russian
Salzburg Austria German
Bern Switzerland German
Dortmund Germany German
Mainz Germany German
Lublin Poland Polish
Graz Austria German
Klagenfurt Austria German
Sibiu Romania Romanian
New York City New York English
Boston Massachusetts English
Philadelphia Pennsylvania English
Pittsburgh Pennsylvania English
Baltimore Maryland English
Washington D.C. District of Columbia English
Providence Rhode Island English
Buffalo New York English
Chicago Illinois English
Detroit Michigan English
Minneapolis Minnesota English
St. Louis Missouri English
Cleveland Ohio English
Indianapolis Indiana English
Milwaukee Wisconsin English
Columbus Ohio English
Atlanta Georgia English
Miami Florida English
New Orleans Louisiana English
Nashville Tennessee English
Charlotte North Carolina English
Austin Texas English
Houston Texas English
Dallas Texas English
Los Angeles California English
San Francisco California English
Seattle Washington English
Denver Colorado English
Las Vegas Nevada English
Phoenix Arizona English
Portland Oregon English
San Diego California English
Albuquerque New Mexico English
Tucson Arizona English
Oklahoma City Oklahoma English
El Paso Texas English
Santa Fe New Mexico English
Austin Texas English
San Antonio Texas English
Fort Worth Texas English
Albany New York English
Hartford Connecticut English
New Haven Connecticut English
Portland Maine English
Newark New Jersey English
Harrisburg Pennsylvania English
Syracuse New York English
Rochester New York English
Kansas City Missouri English
Omaha Nebraska English
Cincinnati Ohio English
Toledo Ohio English
Madison Wisconsin English
Des Moines Iowa English
Fargo North Dakota English
Sioux Falls South Dakota English
Orlando Florida English
Tampa Florida English
Memphis Tennessee English
Raleigh North Carolina English
Louisville Kentucky English
Birmingham Alabama English
Charleston South Carolina English
Richmond Virginia English
Sacramento California English
Honolulu Hawaii English
Anchorage Alaska English
Boise Idaho English
Reno Nevada English
Spokane Washington English
Salt Lake City Utah English
Santa Fe New Mexico English
Tulsa Oklahoma English
Corpus Christi Texas English
Lubbock Texas English
Amarillo Texas English
Little Rock Arkansas English
Baton Rouge Louisiana English
Shreveport Louisiana English
Fayetteville Arkansas English
Bridgeport Connecticut English
Worcester Massachusetts English
Cambridge Massachusetts English
Trenton New Jersey English
Scranton Pennsylvania English
Bangor Maine English
Burlington Vermont English
Stamford Connecticut English
Grand Rapids Michigan English
Springfield Illinois English
Wichita Kansas English
Lincoln Nebraska English
Fort Wayne Indiana English
Akron Ohio English
Duluth Minnesota English
Davenport Iowa English
Virginia Beach Virginia English
Savannah Georgia English
Lexington Kentucky English
Asheville North Carolina English
Jacksonville Florida English
Knoxville Tennessee English
Mobile Alabama English
Columbia South Carolina English
Fresno California English
Boulder Colorado English
Tucson Arizona English
Albuquerque New Mexico English
Bozeman Montana English
Cheyenne Wyoming English
Missoula Montana English
Eugene Oregon English
Midland Texas English
Galveston Texas English
Norman Oklahoma English
Wichita Falls Texas English
Flagstaff Arizona English
Roswell New Mexico English
Fort Smith Arkansas English
Lafayette Louisiana English
Allentown Pennsylvania English
Providence Rhode Island English
Manchester New Hampshire English
Portland Maine English
Newport Rhode Island English
Saratoga Springs New York English
Atlantic City New Jersey English
Erie Pennsylvania English
Rockford Illinois English
Cedar Rapids Iowa English
Springfield Missouri English
Rapid City South Dakota English
Green Bay Wisconsin English
Gary Indiana English
Lansing Michigan English
Bismarck North Dakota English
Jackson Mississippi English
Chattanooga Tennessee English
Wilmington North Carolina English
Greensboro North Carolina English
Tallahassee Florida English
Montgomery Alabama English
Greenville South Carolina English
Roanoke Virginia English
Tacoma Washington English
Bakersfield California English
Santa Cruz California English
Olympia Washington English
Santa Barbara California English
Medford Oregon English
Flagstaff Arizona English
Billings Montana English
Santa Rosa New Mexico English
El Reno Oklahoma English
McAllen Texas English
Odessa Texas English
Lubbock Texas English
Broken Arrow Oklahoma English
Lake Charles Louisiana English
Beaumont Texas English
Table B.2: This table shows the business terms, in English, that were used in
the query engineering. Each term was translated into the language associated
with the city it was combined with.
Business Terms
Hotel
Dentist
Car dealership
Fitness center
Fashion boutique
Coffee shop
Law firm
Museum
Construction company
Tech startup
Luxury spa
Bookstore
Seafood restaurant
University
Art gallery
Bakery
Real estate agency
Veterinary clinic
Organic grocery store
Advertising agency
Yoga studio
Brewery
Music school
Pet grooming service
Antique shop
Bicycle shop
Nightclub
Graphic design studio
Plant nursery
Children’s clothing store
Independent film theater
Specialty cheese shop
Golf course
Photography studio
Language learning center
Sushi bar
Architectural firm
Organic farm
Interior design studio
Vegan restaurant
Computer repair shop
Jazz club
Toy store
Kayak rental service
Custom tailoring service
Rock climbing gym
Used bookstore
Microbrewery
Wedding planning service
Escape room
Aerial yoga studio
Historic hotel
Gourmet chocolate shop
Independent record store
Scuba diving center
Botanical garden
Organic coffee roaster
Adventure travel agency
Renewable energy company
Handcrafted furniture store
Vegan bakery
Independent cinema
Craft pottery studio
Ethical fashion boutique
Specialty tea shop
Outdoor gear retailer
Local farmers market
Digital marketing agency
Contemporary art museum
Jazz bar
Board game cafe
Organic butcher
Co-working space
Specialty fish market
Boutique wine shop
Historic theatre
Luxury bed and breakfast
Custom jewelry designer
Independent video game store
Experimental dining restaurant
Vintage record shop
Specialty cocktail bar
Artisan bread bakery
Independent publishing house
Eco-friendly home goods store
Urban rooftop garden
Custom suit tailor
Rare book dealer
High-end audio equipment store
Theme cafe
Vintage clothing store
Independent animation studio
Exotic pet store
Specialty spice shop
Art restoration service
Sustainable fashion store
Craft beer pub
Herbal apothecary
Professional photography service
Ice cream parlor
Indie video game developer
Luxury yacht charter
Handmade ceramics studio
Specialty pizza restaurant
Independent film production company
Organic beauty salon
Artisan cheese maker
Vintage toy store
Boutique travel agency
Craft cocktail lounge
Independent comic book store
Local micro-distillery
Sustainable architecture firm
Artisanal chocolate maker
Boutique fitness studio
Specialty vegan grocer
Custom motorcycle workshop
Ethical clothing boutique
Independent music label
Gourmet burger joint
Independent perfumery
Local craft fair
Eco-friendly cleaning service
Artisanal pastry shop
Vintage furniture gallery
Specialty seafood market
High-end barber shop
Contemporary dance studio
Custom bicycle builder
Organic winery
Virtual reality arcade
Handmade soap shop
Boutique pet store
Independent animation film theater
Custom leather goods store
Artisan coffee roaster
Luxury bed and breakfast establishment
Ethnic cuisine cooking class
Boutique stationery store
Specialty sushi restaurant
Independent book publisher
Artisanal ice cream maker
Vintage camera shop
Handcrafted pottery class
Boutique bridal shop
Gourmet deli
Custom surfboard shaper
Organic farm-to-table restaurant
High-end audiovisual equipment store
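The query engineering described in Tables B.1 and B.2 pairs each business term with each city and translates the term into the city's language. A minimal sketch of that combination step follows; the translation lookup is a stand-in with assumed values, not the thesis's actual translation data:

```python
# Sketch of query engineering: pair each business term with each city and
# translate the term into the city's language. Translations here are
# illustrative assumptions only.
from itertools import product

cities = [("Gothenburg", "Sweden", "Swedish"), ("Munich", "Germany", "German")]
terms = ["Hotel", "Bakery"]

translate = {("Hotel", "Swedish"): "Hotell", ("Hotel", "German"): "Hotel",
             ("Bakery", "Swedish"): "Bageri", ("Bakery", "German"): "Bäckerei"}

queries = [
    {"query": f"{translate[(term, lang)]} {city}", "city": city,
     "country": country, "language": lang}
    for term, (city, country, lang) in product(terms, cities)
]
# 2 terms x 2 cities -> 4 localized queries, e.g. "Hotell Gothenburg".
```

In the full study this cross product (roughly 300 cities by 150 terms) yields the query groups within which each result page is ranked.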
Appendix C
Research topics in Data Science — HT2023
Project Group 18 - The use of XAI for SEO classification
Name:
Anir Darrazi
Lai Ki Mak
Table of Contents
Abstract 3
Introduction 3
Purpose 3
Method 4
A. Data 4
B. Pre-Processing 4
C. Learning Algorithms & Evaluation Measures 5
Results 7
Conclusions 8
References 9
Attachments 9
Abstract
Explainable AI (XAI) can be used to study search engine optimization (SEO) by
investigating the features of a web page that can affect the absolute rank of a
Google search result. 24 features were studied with the Classification and
Regression Tree (CART) algorithm, supported by the Python Scikit-learn library.
The five features with the highest Information Gain (number of paragraphs,
number of external tags, word count of the title, word count of the description,
and page rank score) were found to be the most influential. Ranges of
Information Gain were also studied with a brute-force algorithm to manually
build the decision tree for these features, which achieved an accuracy of 23%.
It is suggested that features of the web page's structure and network can play
a role in SEO results, although other factors may also be involved.
Introduction
Explainable AI (XAI) can allow human beings to comprehend how machine
learning algorithms create their results (Turri, V., 2023). This can be used to
explain the algorithms and the reasons behind the computed result. On the
other hand, search engine optimization (SEO) is the process of increasing the
chance for a web page to be prioritized and appear on the first page of search
results (Sharma, D. et al., 2019). In order to achieve favorable marketing
results, it is considered that the web page should be shown on the first page of
the search results as much as possible. Search engines can retrieve thousands
of related results, so page rank is required to rank the results and display them
in a ranked order (Bar-Ilan, J., 2007). It is beneficial for users or marketing
companies to have control over SEO (Sharma, D. et al., 2019): the more control
over SEO, the higher the chance that the web page can achieve a higher ranking.
However, the conditions and algorithms behind SEO are under the direct control
of the developers of the search engine companies (Sharma, D. et al., 2019). With
multiple kinds of content on a web page, such as the page title, text, images,
and links, it is a challenge to understand the criteria and conditions for SEO.
The aim of this project is to find out, with XAI, the conditions and the part(s)
of content on a web page which can affect the SEO result. Decision tree and
Information Gain (IG) techniques are used to find the correlation between the
different conditions (such as number of tags and word counts) and the page
rank result.

According to statista.com, Google was the most popular search engine in
Sweden in April 2023, with a market share of 93.94 percent (Bianchi, T., 2023).
In order to achieve a better representation of the results in Sweden, the SEO
algorithm behind Google is investigated in this project. In other words, the
content of web pages from Google search results is analysed and investigated to
understand the SEO algorithm behind the Google search engine.

Purpose
The aim of this project was to study the features of web pages that can affect
the search result of the Google search engine. Comparisons of the different
features of web pages were conducted to understand the correlation between the
features and the ranking of web pages in the search result. The values of all
essential features were investigated with machine learning methods to find the
top five features which can affect the search result the most. Information Gain
(IG) was calculated and a decision tree was built to understand which feature(s)
can affect the search results.
Method
B. Pre-Processing
[Section body lost in extraction; recoverable fragments: img_tags_amount
0.05933279026172075; Threshold3 12.0; Threshold4 42.3; Threshold5 23.0.
Table 3: Showing the top 5 most important features' min-max values.]
Results
The study's evaluation of the two decision tree models, the Scikit-learn decision
tree and a manually created XAI decision tree, yielded noteworthy findings. The
Scikit-learn model, employing the full 24-feature set, attained an accuracy of
29% (0.2851851851851852). This contrasts with the manually developed XAI
model, which, despite being trained on a limited 5-feature dataset, achieved a
close accuracy of 23% (0.2319679430097952). This similarity in performance,
despite the significant difference in the number of features analyzed,
underscores a critical insight: the selection and optimization of features play a
crucial role in model efficacy. The results suggest both models have
considerable potential for improvement; enhancing their ability to analyze web
page features effectively for SEO optimization remains a key area for future
development.

It is also worth expanding on the performance differences between the XAI
model and the Scikit-learn model. Despite the XAI model's lower accuracy
compared to the Scikit-learn model, it offers greater explainability, especially
in how it delineates each decision at the decision tree nodes. This property of
the XAI model is crucial, particularly in contexts where understanding the
model's decision-making process is more important than the accuracy of its
predictions.

Moreover, it is important to highlight that both models perform above the
random-chance level. This is evidenced by the fact that the largest
classification groups in the dataset (comprising 22.2% each) are 3, 4, and 5. If
one were to predict the same class (out of those three) each time, the accuracy
would be 22.2%, and random prediction would likely be lower since the
classification groups are not evenly sized. Both models surpass this threshold,
indicating their effectiveness beyond mere chance.

Finally, it is worth noting that all features demonstrated low Information Gain.
This suggests that while the collected data was relevant, no single feature was
significantly impactful in enhancing the classifier's accuracy beyond what
might be expected by chance. This finding should be explored further to
understand how feature engineering and the general quality of the dataset
(both feature relevance and dataset size) align with the overall performance
and characteristics of the models.

Conclusions
XAI was used to investigate how features of a web page can affect SEO. In
total, 24 features of a web page were studied and their respective Information
Gain values were calculated using the machine learning algorithms specified in
the Method section above. Assuming that "Absolute Rank" is the SEO result
which specifies the sequence of the web page appearing in the search results,
the [...] of the features and "Absolute Rank" is not always directly correlated.
In addition, with the delimitations set in the project, the scope of data studied
can affect the result of the machine learning algorithm. In future studies, more
data and investigation beyond Information Gain can be used to further analyse
the factors and the algorithm affecting SEO.
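The Information Gain computation the report relies on can be illustrated with a small standard-library sketch (an illustrative reimplementation, not the project's own code): the entropy of the class labels minus the weighted entropy after a binary split on a feature threshold.

```python
# Illustrative Information Gain for a binary threshold split:
# IG = H(labels) - weighted H(labels | split). Standard-library only.
from collections import Counter
from math import log2

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def information_gain(values, labels, threshold):
    """IG of splitting into values <= threshold vs. values > threshold."""
    left = [y for x, y in zip(values, labels) if x <= threshold]
    right = [y for x, y in zip(values, labels) if x > threshold]
    n = len(labels)
    weighted = (len(left) / n) * entropy(left) + (len(right) / n) * entropy(right)
    return entropy(labels) - weighted

# A perfectly separating threshold recovers the full parent entropy (1 bit).
ig = information_gain([1, 2, 10, 12], ["low", "low", "high", "high"], 5)
# ig == 1.0
```

Ranking each candidate feature by its best-threshold IG is, in essence, how the manually built XAI decision tree above selected its five features.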
References
Bianchi, T. (2023, June 28). Market share of leading search engines in Sweden in April 2023.
Retrieved December 19, 2023, from
https://www.statista.com/statistics/621418/most-popular-search-engines-in-sweden/
Sharma, D., Shukla, R., Giri, K. A., Kumar, S. (2019). A Brief Review on Search Engine
Optimization. 9th International Conference on Cloud Computing, Data Science &
Engineering (Confluence).
Attachments
Model Training Jupyter Notebook. Find it at the following link:
https://drive.google.com/file/d/1AAX2yOJwnBsZadxllGZxWGPZlGAsfWoa/view?usp=sharing
Web Scraper & DataFromSEO API Jupyter Notebook. Find it at the following link:
https://drive.google.com/file/d/1we6qotBn3TW5x8glFWVHjgcA7haUx_3r/view?usp=sharing
Appendix D
Code Repository