{"id":85180,"date":"2022-01-05T12:00:15","date_gmt":"2022-01-05T17:00:15","guid":{"rendered":"https:\/\/blog.logrocket.com\/?p=85180"},"modified":"2024-06-04T17:13:01","modified_gmt":"2024-06-04T21:13:01","slug":"build-python-web-crawler","status":"publish","type":"post","link":"https:\/\/blog.logrocket.com\/build-python-web-crawler\/","title":{"rendered":"Build a Python web crawler from scratch"},"content":{"rendered":"<!DOCTYPE html>\n<html><p>Why would anyone want to collect more data when there is so much already? Even though the amount of information available is alarmingly large, you often find yourself looking for data that is unique to your needs.<\/p><img loading=\"lazy\" decoding=\"async\" width=\"730\" height=\"487\" src=\"https:\/\/blog.logrocket.com\/wp-content\/uploads\/2022\/01\/build-python-web-crawler-scratch.png\" class=\"attachment-full size-full wp-post-image\" alt=\"Python Logo Over a Blue Background\" srcset=\"https:\/\/blog.logrocket.com\/wp-content\/uploads\/2022\/01\/build-python-web-crawler-scratch.png 730w, https:\/\/blog.logrocket.com\/wp-content\/uploads\/2022\/01\/build-python-web-crawler-scratch-300x200.png 300w\" sizes=\"auto, (max-width: 730px) 100vw, 730px\">\n<p>For example, what would you do if you wanted to collect info on the history of your favorite basketball team or your favorite ice cream flavor?<\/p>\n<p>Data collection is an essential part of a data scientist\u2019s day-to-day work because the ability to collect actionable data on current trends can translate into real business opportunities.<\/p>\n<p>In this tutorial, you\u2019ll learn about web crawling via a simple online store.<\/p>\n<h2>HTML anatomy refresher<\/h2>\n<p>Let\u2019s review basic HTML anatomy. 
<a href=\"https:\/\/blog.logrocket.com\/how-browser-rendering-works-behind-scenes\/\" target=\"_blank\" rel=\"noopener\">Nearly all websites on the Internet<\/a> are built with a combination of HTML and CSS (plus JavaScript, which we won\u2019t cover here).<\/p>\n<p>Below is sample HTML code with some critical parts annotated.<\/p>\n<p><img loading=\"lazy\" decoding=\"async\" class=\"aligncenter size-full wp-image-85190\" src=\"https:\/\/blog.logrocket.com\/wp-content\/uploads\/2022\/01\/sample-html-code.png\" alt=\"Sample HTML Code\" width=\"730\" height=\"443\" srcset=\"https:\/\/blog.logrocket.com\/wp-content\/uploads\/2022\/01\/sample-html-code.png 730w, https:\/\/blog.logrocket.com\/wp-content\/uploads\/2022\/01\/sample-html-code-300x182.png 300w\" sizes=\"auto, (max-width: 730px) 100vw, 730px\" \/><\/p>\n<p>Real-world HTML will be a lot more complicated than this, however, and it is nearly impossible to make sense of it just by reading through the code. For this reason, we will learn about more sophisticated tools for making sense of massive HTML pages, starting with XPath syntax.<\/p>\n<h2>XPath with lxml<\/h2>\n<p>The whole idea behind web scraping is to use automation to extract information from the massive sea of HTML tags and their attributes. One of the many tools for this job is XPath.<\/p>\n<p>XPath stands for XML Path Language. XPath syntax contains intuitive rules for locating HTML tags and extracting information from their attributes and text. 
For this section, we will practice using XPath on the HTML code you saw in the above picture:<\/p>\n<pre class=\"language-python hljs\">sample_html = \"\"\"\n&lt;bookstore id='main'&gt;\n\n   &lt;book&gt;\n       &lt;img src='https:\/\/books.toscrape.com\/index.html'&gt;\n       &lt;title lang=\"en\" class='name'&gt;Harry Potter&lt;\/title&gt;\n       &lt;price&gt;29.99&lt;\/price&gt;\n   &lt;\/book&gt;\n\n   &lt;book&gt;\n       &lt;a href='https:\/\/www.w3schools.com\/xml\/xpath_syntax.asp'&gt;\n           &lt;title lang=\"en\"&gt;Learning XML&lt;\/title&gt;\n       &lt;\/a&gt;\n       &lt;price&gt;39.95&lt;\/price&gt;\n   &lt;\/book&gt;\n\n&lt;\/bookstore&gt;\n\"\"\"<\/pre>\n<p>To start using XPath to query this HTML code, we will need a small library:<\/p>\n<pre class=\"language-python hljs\">pip install lxml<\/pre>\n<p>lxml allows you to read HTML code as a string and query it using XPath. First, we will convert the above string to an HTML element using the <code>fromstring<\/code> function:<\/p>\n<pre class=\"language-python hljs\">from lxml import html\n\nsource = html.fromstring(sample_html)\n\n&gt;&gt;&gt; source\n&lt;Element bookstore at 0x1e612a769a0&gt;\n&gt;&gt;&gt; type(source)\nlxml.html.HtmlElement<\/pre>\n<p>Now, let\u2019s write our first XPath expression. We will select the bookstore tag first:<\/p>\n<pre class=\"language-python hljs\">&gt;&gt;&gt; source.xpath(\"\/\/bookstore\")\n[&lt;Element bookstore at 0x1e612a769a0&gt;]<\/pre>\n<p>Simple! Just write a double forward slash followed by a tag name to select the tag from anywhere in the HTML tree. We can do the same for the book tag:<\/p>\n<pre class=\"language-python hljs\">&gt;&gt;&gt; source.xpath(\"\/\/book\")\n[&lt;Element book at 0x1e612afcb80&gt;, &lt;Element book at 0x1e612afcbd0&gt;]<\/pre>\n<p>As you can see, we get a list of two book tags. Now, let\u2019s see how to choose an immediate child of a tag. 
For example, let\u2019s select the title tag that comes right inside the book tag:<\/p>\n<pre class=\"language-python hljs\">&gt;&gt;&gt; source.xpath(\"\/\/book\/title\")\n[&lt;Element title at 0x1e6129dfa90&gt;]<\/pre>\n<p>We only get a single element, which is the first title tag. We didn\u2019t choose the second title tag because it is not an immediate child of the second book tag. But we can replace the single forward slash with a double one to choose both title tags:<\/p>\n<pre class=\"language-python hljs\">&gt;&gt;&gt; source.xpath(\"\/\/book\/\/title\")\n[&lt;Element title at 0x1e6129dfa90&gt;, &lt;Element title at 0x1e612b0edb0&gt;]<\/pre>\n<p>Now, let\u2019s see how to choose the text inside a tag:<\/p>\n<pre class=\"language-python hljs\">&gt;&gt;&gt; source.xpath(\"\/\/book\/title[1]\/text()\")\n['Harry Potter']<\/pre>\n<p>Here, we are selecting the text inside the first title tag. As you can see, we can also specify which of the title tags we want using bracket notation. To choose the text inside a tag, just follow it with a forward slash and the <code>text()<\/code> function.<\/p>\n<p>Finally, let\u2019s look at how to locate tags based on their attributes, like <code>id<\/code>, <code>class<\/code>, <code>href<\/code>, or any other attribute inside <code>&lt;&gt;<\/code>. Below, we will choose the title tag with the name class:<\/p>\n<pre class=\"language-python hljs\">&gt;&gt;&gt; source.xpath(\"\/\/title[@class='name']\")\n[&lt;Element title at 0x1e6129dfa90&gt;]<\/pre>\n<p>As expected, we get a single element. 
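<\/p>\n<p>Attribute predicates combine with <code>text()<\/code> in a single expression as well. Here is a quick, self-contained sketch; it uses <code>lxml.etree<\/code> on a trimmed-down copy of the sample, so the snippet itself is ours, not part of the original tutorial:<\/p>\n<pre class=\"language-python hljs\">from lxml import etree\n\nsnippet = \"&lt;bookstore&gt;&lt;title class='name'&gt;Harry Potter&lt;\/title&gt;&lt;price&gt;29.99&lt;\/price&gt;&lt;\/bookstore&gt;\"\ndoc = etree.fromstring(snippet)\n\n# Select the text of the title tag that has class='name'\nnames = doc.xpath(\"\/\/title[@class='name']\/text()\")  # ['Harry Potter']<\/pre>\n<p>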
Here are a few examples of choosing other tags using attributes:<\/p>\n<pre class=\"language-python hljs\">&gt;&gt;&gt; source.xpath(\"\/\/*[@id='main']\")  # choose any element with id 'main'\n[&lt;Element bookstore at 0x1e612a769a0&gt;]\n&gt;&gt;&gt; source.xpath(\"\/\/title[@lang='en']\")  # choose a title tag with 'lang' attribute of 'en'.\n[&lt;Element title at 0x1e6129dfa90&gt;, &lt;Element title at 0x1e612b0edb0&gt;]<\/pre>\n<p>I suggest you look at <a href=\"https:\/\/www.w3schools.com\/xml\/xpath_syntax.asp\" target=\"_blank\" rel=\"noopener\">this page<\/a> to learn more about XPath.<\/p>\n<h2>Creating a class to store the data<\/h2>\n<p>For this tutorial, we will be scraping this <a href=\"https:\/\/slickdeals.net\/computer-deals\/?page=1\" target=\"_blank\" rel=\"noopener\">online store\u2019s computers section<\/a>:<\/p>\n<p><img loading=\"lazy\" decoding=\"async\" class=\"aligncenter size-full wp-image-85193\" src=\"https:\/\/blog.logrocket.com\/wp-content\/uploads\/2022\/01\/online-store-screenshot.png\" alt=\"Online Store Screenshot\" width=\"730\" height=\"368\" srcset=\"https:\/\/blog.logrocket.com\/wp-content\/uploads\/2022\/01\/online-store-screenshot.png 730w, https:\/\/blog.logrocket.com\/wp-content\/uploads\/2022\/01\/online-store-screenshot-300x151.png 300w\" sizes=\"auto, (max-width: 730px) 100vw, 730px\" \/><\/p>\n<p>We will be extracting every item\u2019s name, manufacturer, and price. 
To make things easier, we will create a class with these attributes:<\/p>\n<pre class=\"language-python hljs\">class StoreItem:\n   \"\"\"\n   A general class to store item data concisely.\n   \"\"\"\n\n   def __init__(self, name, price, manufacturer):\n       self.name = name\n       self.price = price\n       self.manufacturer = manufacturer<\/pre>\n<p>Let\u2019s initialize the first item manually:<\/p>\n<pre class=\"language-python hljs\">item1 = StoreItem(\"Lenovo IdeaPad\", 749, \"Walmart\")<\/pre>\n<h2>Getting the page source<\/h2>\n<p>Now, let\u2019s get down to the serious business. To scrape the website, we will need its HTML source. Achieving this requires using another library:<\/p>\n<pre class=\"language-python hljs\">pip install requests<\/pre>\n<p>Requests allows you to send HTTP requests to websites and, of course, get back responses containing their HTML code. It is as easy as calling its <code>get<\/code> method and passing the webpage address:<\/p>\n<pre class=\"language-python hljs\">import requests\n\nHOME_PAGE = \"https:\/\/slickdeals.net\/computer-deals\/?page=1\"\n&gt;&gt;&gt; requests.get(HOME_PAGE)\n&lt;Response [200]&gt;<\/pre>\n<p>If the response comes with a <code>200<\/code> status code, the request was successful. To get the HTML code, we use the <code>content<\/code> attribute:<\/p>\n<pre class=\"language-python hljs\">r = requests.get(HOME_PAGE)\n\nsource = html.fromstring(r.content)\n\n&gt;&gt;&gt; source\n&lt;Element html at 0x1e612ba63b0&gt;<\/pre>\n<p>Above, we are converting the result to an lxml-compatible object. 
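<\/p>\n<p>One caveat worth knowing: <code>requests<\/code> does not raise an exception on its own for a 404 or 500 response, so it is easy to parse an error page by accident. Calling <code>raise_for_status()<\/code> on the response guards against this. Below is a minimal sketch of a stricter download helper; the name <code>get_source_checked<\/code> and the timeout value are our own choices, not part of the original tutorial:<\/p>\n<pre class=\"language-python hljs\">import requests\nfrom lxml import html\n\ndef get_source_checked(page_url):\n   \"\"\"Download a page, raising requests.HTTPError if the request failed.\"\"\"\n   r = requests.get(page_url, timeout=10)\n   r.raise_for_status()  # no-op on success, raises on 4xx\/5xx codes\n   return html.fromstring(r.content)<\/pre>\n<p>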
Since we will probably repeat this process a few times, let\u2019s convert it into a function:<\/p>\n<pre class=\"language-python hljs\">def get_source(page_url):\n   \"\"\"\n   A function to download the page source of the given URL.\n   \"\"\"\n   r = requests.get(page_url)\n   source = html.fromstring(r.content)\n\n   return source\nsource = get_source(HOME_PAGE)\n\n&gt;&gt;&gt; source\n&lt;Element html at 0x1e612d11770&gt;<\/pre>\n<p>But here is a problem: a single page can contain tens of thousands of lines of HTML code, which makes visual exploration of the code impossible. For this reason, we will turn to our browser to figure out which tags and attributes contain the information we want.<\/p>\n<p>After loading the page, right-click anywhere on the page and choose <strong>Inspect<\/strong> to open developer tools:<\/p>\n<p><img loading=\"lazy\" decoding=\"async\" class=\"aligncenter size-full wp-image-85196\" src=\"https:\/\/blog.logrocket.com\/wp-content\/uploads\/2022\/01\/choosing-inspect.gif\" alt=\"Choosing Inspect\" width=\"730\" height=\"338\"><\/p>\n<p>Using the <strong>selector arrow<\/strong>, you can hover over and click on parts of the page to identify the element below the cursor and figure out its associated attributes and info. Doing so will also move the bottom window to the location of the selected element.<\/p>\n<p>As we can see, all store items are within <code>li<\/code> elements, with a class attribute containing the words <code>fpGridBox grid<\/code>. Let\u2019s choose them using XPath:<\/p>\n<pre class=\"language-python hljs\">source = get_source(HOME_PAGE)\n\nli_list = source.xpath(\"\/\/li[contains(@class, 'fpGridBox grid')]\")\n&gt;&gt;&gt; len(li_list)\n28<\/pre>\n<p>Because the exact class names change, we are using the part of the class name that is common to all <code>li<\/code> elements. 
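<\/p>\n<p>In case the <code>contains()<\/code> function is new to you, it simply performs substring matching on an attribute value. Here is a tiny self-contained sketch; the snippet below is made up for illustration:<\/p>\n<pre class=\"language-python hljs\">from lxml import etree\n\nsnippet = \"&lt;ul&gt;&lt;li class='fpGridBox grid 12345'&gt;A&lt;\/li&gt;&lt;li class='other'&gt;B&lt;\/li&gt;&lt;\/ul&gt;\"\ndoc = etree.fromstring(snippet)\n\n# Matches any li whose class attribute contains the given substring\nmatches = doc.xpath(\"\/\/li[contains(@class, 'fpGridBox grid')]\")  # one match<\/pre>\n<p>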
As a result, we have selected 28 <code>li<\/code> elements, which you can double-check by counting them on the web page itself.<\/p>\n<h2>Extracting the data<\/h2>\n<p>Now, let\u2019s start extracting the item details from the <code>li<\/code> elements. Let\u2019s first look at how to find the item\u2019s name using the selector arrow:<\/p>\n<p><img loading=\"lazy\" decoding=\"async\" class=\"aligncenter size-full wp-image-85198\" src=\"https:\/\/blog.logrocket.com\/wp-content\/uploads\/2022\/01\/find-item-name.gif\" alt=\"Find Item Name\" width=\"730\" height=\"338\"><\/p>\n<p>The item names are located inside tags with class names that contain the <code>itemTitle<\/code> keyword. Let\u2019s select them with XPath to make sure:<\/p>\n<pre class=\"language-python hljs\">item_names = [\n   li.xpath(\".\/\/a[@class='itemTitle bp-p-dealLink bp-c-link']\") for li in li_list\n]\n\n&gt;&gt;&gt; len(item_names)\n28<\/pre>\n<p>As expected, we got 28 item names. This time, we are using chained XPath on the <code>li<\/code> elements, which requires starting the expression with a dot. Below, I will write the XPath for the other item details using the browser tools:<\/p>\n<pre class=\"language-python hljs\">li_xpath = \"\/\/li[contains(@class, 'fpGridBox grid')]\"  # Choose the `li` items\n\nnames_xpath = \".\/\/a[@class='itemTitle bp-p-dealLink bp-c-link']\/text()\"\nmanufacturer_xpath = \".\/\/*[contains(@class, 'itemStore bp-p-storeLink')]\/text()\"\nprice_xpath = \".\/\/*[contains(@class, 'itemPrice')]\/text()\"<\/pre>\n<p>We have everything we need to scrape all the items on the page. 
Let\u2019s do it in a loop:<\/p>\n<pre class=\"language-python hljs\">li_list = source.xpath(li_xpath)\n\nitems = list()\nfor li in li_list:\n   name = li.xpath(names_xpath)\n   manufacturer = li.xpath(manufacturer_xpath)\n   price = li.xpath(price_xpath)\n\n   # Store inside a class\n   item = StoreItem(name, price, manufacturer)\n   items.append(item)\n&gt;&gt;&gt; len(items)\n28<\/pre>\n<p>Note that each <code>xpath<\/code> call returns a list of strings, often with stray whitespace; we will tidy these up in a moment.<\/p>\n<h2>Handling the pagination<\/h2>\n<p>We now have all the items on this page. However, if you scroll down, you\u2019ll see the <strong>Next<\/strong> button, indicating that there are more items to scrape. We don\u2019t want to visit all the pages manually one by one because there can be hundreds.<\/p>\n<p>But pay attention to the URL each time we click the <strong>Next<\/strong> button:<\/p>\n<p><img loading=\"lazy\" decoding=\"async\" class=\"aligncenter size-full wp-image-85201\" src=\"https:\/\/blog.logrocket.com\/wp-content\/uploads\/2022\/01\/next-button-url-1.gif\" alt=\"Next Button URL\" width=\"683\" height=\"51\"><\/p>\n<p>The page number changes at the end. Now, I\u2019ve checked that there are 22 pages of items on the website. 
So, we will create a simple loop to iterate through the pagination and repeat the scraping process. Since each XPath query returns a list, we also define a small <code>clean_text<\/code> helper that joins the returned strings and strips the surrounding whitespace:<\/p>\n<pre class=\"language-python hljs\">from tqdm.notebook import tqdm  # pip install tqdm\n\ndef clean_text(text_list):\n   \"\"\"Join the strings returned by XPath and strip whitespace.\"\"\"\n   return \"\".join(text_list).strip()\n\n# Create a list to store all items\nitems = list()\nfor num in tqdm(range(1, 23)):\n   url = f\"https:\/\/slickdeals.net\/computer-deals\/?page={num}\"\n   source = get_source(url)  # Get HTML code\n\n   li_list = source.xpath(li_xpath)\n\n   for li in li_list:\n       name = clean_text(li.xpath(names_xpath))\n       manufacturer = clean_text(li.xpath(manufacturer_xpath))\n       price = clean_text(li.xpath(price_xpath))\n\n       # Store inside a class\n       item = StoreItem(name, price, manufacturer)\n       items.append(item)<\/pre>\n<p>I am also using the <a href=\"https:\/\/tqdm.github.io\/\" target=\"_blank\" rel=\"noopener\">tqdm<\/a> library, which displays a progress bar when wrapped around an iterable:<\/p>\n<p><img loading=\"lazy\" decoding=\"async\" class=\"aligncenter size-full wp-image-85203\" src=\"https:\/\/blog.logrocket.com\/wp-content\/uploads\/2022\/01\/progress-bar-loading.gif\" alt=\"Progress Bar Loading\" width=\"730\" height=\"372\"><\/p>\n<p>Let\u2019s check how many items we have:<\/p>\n<pre class=\"language-python hljs\">&gt;&gt;&gt; len(items)\n588<\/pre>\n<p>588 computers! 
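<\/p>\n<p>One caveat before saving: the loop above fires 22 requests back to back, and pausing briefly between pages is kinder to the server. Below is a minimal sketch; the <code>page_urls<\/code> helper and the one-second delay are our own choices, not part of the original tutorial:<\/p>\n<pre class=\"language-python hljs\">import time\n\nBASE_URL = \"https:\/\/slickdeals.net\/computer-deals\/?page={}\"\n\ndef page_urls(last_page):\n   \"\"\"Yield the URL of every results page.\"\"\"\n   for num in range(1, last_page + 1):\n       yield BASE_URL.format(num)\n\nurls = list(page_urls(22))  # 22 page URLs\n\n# In the scraping loop itself, pause between requests:\n# for url in urls:\n#     source = get_source(url)\n#     ...\n#     time.sleep(1)<\/pre>\n<p>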
Now, let\u2019s store the items we have in a CSV file.<\/p>\n<h3>Storing the data<\/h3>\n<p>To store the data, we will use the <a href=\"https:\/\/pandas.pydata.org\/docs\/\" target=\"_blank\" rel=\"noopener\">pandas<\/a> library to create a <code>DataFrame<\/code> and save it to a CSV:<\/p>\n<pre class=\"language-python hljs\">import pandas as pd\n\ndf = pd.DataFrame(\n   {\n       \"name\": [item.name for item in items],\n       \"price\": [item.price for item in items],\n       \"manufacturer\": [item.manufacturer for item in items],\n   }\n)\n\ndf.head()<\/pre>\n<p><img loading=\"lazy\" decoding=\"async\" class=\"aligncenter size-full wp-image-85205\" src=\"https:\/\/blog.logrocket.com\/wp-content\/uploads\/2022\/01\/name-price-manufacturer-table.png\" alt=\"Name, Price, and Manufacturer Table\" width=\"672\" height=\"173\" srcset=\"https:\/\/blog.logrocket.com\/wp-content\/uploads\/2022\/01\/name-price-manufacturer-table.png 672w, https:\/\/blog.logrocket.com\/wp-content\/uploads\/2022\/01\/name-price-manufacturer-table-300x77.png 300w\" sizes=\"auto, (max-width: 672px) 100vw, 672px\" \/><\/p>\n<p>There you go! Let\u2019s finally save it to a file:<\/p>\n<pre class=\"language-python hljs\">df.to_csv(\"data\/scraped.csv\", index=False)  # assumes the data\/ folder already exists<\/pre>\n<h2>Conclusion<\/h2>\n<p>This tutorial was a straightforward example of how to use a web crawler in Python. 
While the tools you learned today will cover most of your scraping needs, you may need a few additional ones for particularly nasty websites.<\/p>\n<p>Specifically, I suggest you learn about <a href=\"https:\/\/www.crummy.com\/software\/BeautifulSoup\/bs4\/doc\/\" target=\"_blank\" rel=\"noopener\">BeautifulSoup<\/a> if you don\u2019t feel like learning XPath syntax, as <a href=\"https:\/\/blog.logrocket.com\/build-python-web-scraper-beautiful-soup\/\" target=\"_blank\" rel=\"noopener\">BeautifulSoup offers an OOP approach to querying the HTML code<\/a>.<\/p>\n<p>For websites that require logging in or change dynamically using JavaScript, you should learn one of the best libraries in Python: <a href=\"https:\/\/blog.logrocket.com\/web-automation-selenium-python\/\" target=\"_blank\" rel=\"noopener\">Selenium<\/a>. Finally, for enterprise-level web scraping, there is <a href=\"https:\/\/docs.scrapy.org\/en\/latest\/\" target=\"_blank\" rel=\"noopener\">Scrapy<\/a>, which covers pretty much every aspect of web scraping. 
Thanks for reading!<\/p>\n<\/html>\n","protected":false},"excerpt":{"rendered":"<p>In this tutorial, you\u2019ll learn about web crawling via a simple online store and how to build a Python web crawler from scratch.<\/p>\n","protected":false},"author":156415844,"featured_media":85183,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[2147999,1],"tags":[2109833],"class_list":["post-85180","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-dev","category-uncategorized","tag-python"],"yoast_head":"<!-- This site is optimized with the Yoast SEO plugin v26.1.1 - https:\/\/yoast.com\/wordpress\/plugins\/seo\/ -->\n<title>Build a Python web crawler from scratch - LogRocket Blog<\/title>\n<meta name=\"description\" content=\"In this tutorial, you\u2019ll learn about web crawling via a simple online store and how to build a Python web crawler from scratch.\" \/>\n<meta name=\"robots\" content=\"index, follow, max-snippet:-1, max-image-preview:large, max-video-preview:-1\" \/>\n<link rel=\"canonical\" href=\"https:\/\/blog.logrocket.com\/build-python-web-crawler\/\" \/>\n<meta property=\"og:locale\" content=\"en_US\" \/>\n<meta property=\"og:type\" content=\"article\" \/>\n<meta property=\"og:title\" content=\"Build a Python web crawler from scratch - LogRocket Blog\" \/>\n<meta property=\"og:description\" content=\"In this tutorial, you\u2019ll learn about web crawling via a simple online store and how to build a Python web crawler from scratch.\" \/>\n<meta property=\"og:url\" content=\"https:\/\/blog.logrocket.com\/build-python-web-crawler\/\" \/>\n<meta property=\"og:site_name\" content=\"LogRocket Blog\" \/>\n<meta property=\"article:published_time\" content=\"2022-01-05T17:00:15+00:00\" \/>\n<meta property=\"article:modified_time\" content=\"2024-06-04T21:13:01+00:00\" \/>\n<meta property=\"og:image\" 
content=\"https:\/\/blog.logrocket.com\/wp-content\/uploads\/2022\/01\/build-python-web-crawler-scratch.png\" \/>\n\t<meta property=\"og:image:width\" content=\"730\" \/>\n\t<meta property=\"og:image:height\" content=\"487\" \/>\n\t<meta property=\"og:image:type\" content=\"image\/png\" \/>\n<meta name=\"author\" content=\"Bekhruz Tuychiev\" \/>\n<meta name=\"twitter:card\" content=\"summary_large_image\" \/>\n<meta name=\"twitter:label1\" content=\"Written by\" \/>\n\t<meta name=\"twitter:data1\" content=\"Bekhruz Tuychiev\" \/>\n\t<meta name=\"twitter:label2\" content=\"Est. reading time\" \/>\n\t<meta name=\"twitter:data2\" content=\"9 minutes\" \/>\n<script type=\"application\/ld+json\" class=\"yoast-schema-graph\">{\"@context\":\"https:\/\/schema.org\",\"@graph\":[{\"@type\":\"WebPage\",\"@id\":\"https:\/\/blog.logrocket.com\/build-python-web-crawler\/\",\"url\":\"https:\/\/blog.logrocket.com\/build-python-web-crawler\/\",\"name\":\"Build a Python web crawler from scratch - LogRocket Blog\",\"isPartOf\":{\"@id\":\"https:\/\/blog.logrocket.com\/#website\"},\"primaryImageOfPage\":{\"@id\":\"https:\/\/blog.logrocket.com\/build-python-web-crawler\/#primaryimage\"},\"image\":{\"@id\":\"https:\/\/blog.logrocket.com\/build-python-web-crawler\/#primaryimage\"},\"thumbnailUrl\":\"https:\/\/blog.logrocket.com\/wp-content\/uploads\/2022\/01\/build-python-web-crawler-scratch.png\",\"datePublished\":\"2022-01-05T17:00:15+00:00\",\"dateModified\":\"2024-06-04T21:13:01+00:00\",\"author\":{\"@id\":\"https:\/\/blog.logrocket.com\/#\/schema\/person\/d690f547ef3e754bf576f9bf55febd52\"},\"description\":\"In this tutorial, you\u2019ll learn about web crawling via a simple online store and how to build a Python web crawler from 
scratch.\",\"breadcrumb\":{\"@id\":\"https:\/\/blog.logrocket.com\/build-python-web-crawler\/#breadcrumb\"},\"inLanguage\":\"en-US\",\"potentialAction\":[{\"@type\":\"ReadAction\",\"target\":[\"https:\/\/blog.logrocket.com\/build-python-web-crawler\/\"]}]},{\"@type\":\"ImageObject\",\"inLanguage\":\"en-US\",\"@id\":\"https:\/\/blog.logrocket.com\/build-python-web-crawler\/#primaryimage\",\"url\":\"https:\/\/blog.logrocket.com\/wp-content\/uploads\/2022\/01\/build-python-web-crawler-scratch.png\",\"contentUrl\":\"https:\/\/blog.logrocket.com\/wp-content\/uploads\/2022\/01\/build-python-web-crawler-scratch.png\",\"width\":730,\"height\":487,\"caption\":\"Python Logo Over a Blue Background\"},{\"@type\":\"BreadcrumbList\",\"@id\":\"https:\/\/blog.logrocket.com\/build-python-web-crawler\/#breadcrumb\",\"itemListElement\":[{\"@type\":\"ListItem\",\"position\":1,\"name\":\"Home\",\"item\":\"https:\/\/blog.logrocket.com\/\"},{\"@type\":\"ListItem\",\"position\":2,\"name\":\"Build a Python web crawler from scratch\"}]},{\"@type\":\"WebSite\",\"@id\":\"https:\/\/blog.logrocket.com\/#website\",\"url\":\"https:\/\/blog.logrocket.com\/\",\"name\":\"LogRocket Blog\",\"description\":\"Resources to Help Product Teams Ship Amazing Digital Experiences\",\"potentialAction\":[{\"@type\":\"SearchAction\",\"target\":{\"@type\":\"EntryPoint\",\"urlTemplate\":\"https:\/\/blog.logrocket.com\/?s={search_term_string}\"},\"query-input\":{\"@type\":\"PropertyValueSpecification\",\"valueRequired\":true,\"valueName\":\"search_term_string\"}}],\"inLanguage\":\"en-US\"},{\"@type\":\"Person\",\"@id\":\"https:\/\/blog.logrocket.com\/#\/schema\/person\/d690f547ef3e754bf576f9bf55febd52\",\"name\":\"Bekhruz 
Tuychiev\",\"image\":{\"@type\":\"ImageObject\",\"inLanguage\":\"en-US\",\"@id\":\"https:\/\/blog.logrocket.com\/#\/schema\/person\/image\/\",\"url\":\"https:\/\/secure.gravatar.com\/avatar\/2dc2723df3106b7bf646c5498ecf9054bbdc079ce78837a86e6a5545fbca5253?s=96&d=mm&r=g\",\"contentUrl\":\"https:\/\/secure.gravatar.com\/avatar\/2dc2723df3106b7bf646c5498ecf9054bbdc079ce78837a86e6a5545fbca5253?s=96&d=mm&r=g\",\"caption\":\"Bekhruz Tuychiev\"},\"description\":\"I am a data science content writer, spilling every bit of knowledge I have through a series of blog posts, articles, and tutorials. Trying to fulfill my never-satisfied desire of teaching AI and data science to as many people as possible.\",\"url\":\"https:\/\/blog.logrocket.com\/author\/bekhruztuychiev\/\"}]}<\/script>\n<!-- \/ Yoast SEO plugin. -->"}