Many web-based startups and businesses are adopting machine learning and natural language processing techniques to make their work partially or fully automated. A data-collection technique that often feeds these systems is web scraping: a method that allows anyone to extract data or information from websites. The purpose may vary, but the process remains the same.
To implement this method we have several APIs and tools, but which tool is beginner-friendly, easy to implement, and well structured? To clear up this confusion, here is a tutorial on Scrapy, a powerful framework that is loved by developers all around the world.
(Must read: How does Squarify Help in Building Treemaps Using Python?)
Scrapy is a free, open-source web-crawling framework written in Python. Zyte is the services company that maintains it. Among its many uses, the framework is mainly applied to data mining, where we try to find patterns in huge datasets, and to automating web testing.
Let us dive into its architecture, which will help us grasp how it works.
Scrapy Engine Architecture (Source)
This architecture involves seven components, which are explained below:
Spiders: spiders are user-written classes that parse the scraped responses; this component is user-defined, and every response is parsed differently depending on the URL.
Scrapy Engine: the engine controls the flow of data between all the other components, which makes it the central component of the system.
Scheduler: the scheduler accepts requests from the engine, queues them, and returns them to the engine whenever it asks for the next one.
Downloader: this component fetches web pages and delivers them to the engine.
Item pipeline: the item pipeline takes over after the spiders have parsed the items; it processes these items and stores them in databases (a minimal pipeline sketch follows this list).
Downloader middlewares: these components sit between the engine and the downloader; they process requests passed from the engine to the downloader and the responses that come back.
Spider middlewares: these components sit between the engine and the spiders; they process the responses passed from the engine to the spiders and the output the spiders produce.
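To make the item pipeline component concrete, here is a minimal sketch of a pipeline class. It follows Scrapy's standard process_item interface, but the class name and the 'title' field are our own illustrative assumptions:

class CleanTitlePipeline:
    # Called once for every item a spider yields
    def process_item(self, item, spider):
        # Strip stray whitespace from the scraped title before it is stored
        item['title'] = item['title'].strip()
        return item

To activate a pipeline like this one, it must be registered under ITEM_PIPELINES in the project's settings.py.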
(Also read: How Do We Implement Beautiful Soup For Web Scraping?)
Scrapy can process whatever data you feed it, regardless of the format, which makes it extremely versatile. It is also capable of handling patchy or broken markup and still producing usable results.
Use the pip command to download Scrapy:
pip install scrapy
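Once the installation finishes, a quick way to confirm that everything is in place is to print the installed version from the same command prompt:

scrapy version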
Before moving into the working example of Scrapy, let us discuss a few important tools that will help us understand the model.
Selectors are Scrapy's mechanism for picking out specific parts of an HTML document, such as 'div' elements, anchor tags, paragraphs, and more. These parts are selected using either CSS or XPath expressions.
XPath is the XML Path Language, used to select nodes inside XML (and HTML) documents, whereas CSS selectors reuse the syntax that is used for styling HTML documents.
Let's see how these selectors work.
Consider a small HTML document:
<div class="one">
    <a href="www.aipoint.tech"> aipoint</a>
</div>
Now, if we want to select this div, the anchor link, the text inside it, or more using selectors, we need to follow the instructions below:
//div/a -> <a href="www.aipoint.tech"> aipoint</a>
As we can see, the above command gives the whole anchor tag. To be more specific, if we need only the text inside the anchor tag, we use the command below:
//div/a/text() -> aipoint
The above command gives the text inside the anchor tag. Similarly, the '@' syntax extracts the value of an attribute, here the link's href:
//div/a/@href -> www.aipoint.tech
Now consider that you want to extract a whole paragraph from a paragraph tag. Below is the way to extract paragraphs using selectors:
<p class="one-class">Analytics steps is one-stop destination for information services</p>
Above is the paragraph inside the paragraph tag, and we want to extract it using selectors:
//p[@class="one-class"]/text()
The above selector command would give the text inside the paragraph tag.
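These expressions can also be tried out directly in Python using Scrapy's Selector class. Below is a small runnable sketch that wraps the two HTML snippets from above; the variable names are our own, and the CSS lines at the end are the equivalent selectors for comparison:

from scrapy.selector import Selector

html = """
<div class="one">
    <a href="www.aipoint.tech"> aipoint</a>
</div>
<p class="one-class">Analytics steps is one-stop destination for information services</p>
"""

sel = Selector(text=html)

print(sel.xpath('//div/a').get())                         # the whole anchor tag
print(sel.xpath('//div/a/text()').get())                  # aipoint
print(sel.xpath('//div/a/@href').get())                   # www.aipoint.tech
print(sel.xpath('//p[@class="one-class"]/text()').get())  # the paragraph text

# Equivalent CSS selectors for the anchor tag:
print(sel.css('div.one a::text').get())
print(sel.css('div.one a::attr(href)').get())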
(Suggested blog: Python Essentials for Voice Control: Part 1)
Let's write the code and extract some information from zyte.com. After you have installed Scrapy, go to the command prompt and start a Scrapy project using the following command:
scrapy startproject <projectname>
We are taking ‘qs’ as a project name so our command will be:
scrapy startproject qs
Now, if you open this folder in the VS Code editor, you will see the files shown below:
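A freshly generated Scrapy project follows a standard layout, so the folder should look roughly like this (taken from Scrapy's standard project template; minor differences between versions are possible):

qs/
    scrapy.cfg            # deploy configuration file
    qs/
        __init__.py
        items.py          # item definitions
        middlewares.py    # spider and downloader middlewares
        pipelines.py      # item pipelines
        settings.py       # project settings
        spiders/
            __init__.py   # your spiders go in this folder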
Inside the spiders folder, create a new file and name it ‘quotes.py’.
Now, inspect the page whose content you want to scrape: press Ctrl + Shift + I to open the inspector, and keep searching for the element that contains the blog post titles.
You will know you have found the right element because, as you hover over it in the inspector, the browser highlights its boundary on the page. In our case, the element containing each post title is an anchor tag with the class 'oxy-post-title'.
Now that we have found the desired anchor tag, we can move forward and create a spider that will crawl the website and extract the content inside this anchor tag.
Now, inside the quotes.py file that we created in the Scrapy project, add the code below:
import scrapy

class BlogSpider(scrapy.Spider):
    name = 'blogspider'
    start_urls = ['https://www.zyte.com/blog/']

    def parse(self, response):
        # Extract the text of every post title on the current page
        for title in response.css('.oxy-post-title'):
            yield {'title': title.css('::text').get()}

        # Follow the pagination link and parse the next page the same way
        for next_page in response.css('a.next'):
            yield response.follow(next_page, self.parse)
Above is the code by which we implement the scraper: a spider class is created in which we set the name of the spider and the URL of the page from which we need to extract the desired information.
(Suggested blog: DATA TYPES in Python)
Secondly, we have created a method for parsing; the parse method takes two arguments, self and response.
You must be wondering what 'yield' does: 'yield' is a Python keyword that turns a function into a generator, whose execution can pause and resume, so parse can hand items and new requests to Scrapy one at a time.
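As a tiny illustration of this pause-and-resume behaviour, here is a generic Python sketch (not Scrapy-specific):

def numbers():
    yield 1  # execution pauses here until the caller asks for the next value
    yield 2

for n in numbers():
    print(n)  # prints 1, then 2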
The anchor tag's class is 'oxy-post-title', which is why we pass '.oxy-post-title' to response.css, and yield is used to emit the text found inside each matching anchor tag.
response.follow is used to request the next page and parse its information in the same way.
Now the most important thing is to run the spider and save the information. To run this code and crawl the website with Scrapy, write the following line in the VS Code terminal:
scrapy crawl <spider name> -o data.json
Now, in our case, the spider name is 'blogspider', so we write the following line in the editor's terminal:
scrapy crawl blogspider -o data.json
Once the crawl finishes, you will see a data.json file in the project folder.
Now, if we open this JSON file, we can see all the data that has been scraped:
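The file holds one JSON object per scraped title, so its contents will look something like the following (the titles here are placeholders, not real scraped values):

[
    {"title": "Example blog post title one"},
    {"title": "Example blog post title two"}
]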
Scrapy is a powerful open-source tool and one of the best for web scraping. In this tutorial, we learned about the different components of Scrapy and worked through an example in which we saved the scraped data in JSON format.