Understand object-oriented programming (OOP) by building a minimal web scraping framework
What you are going to learn
requests is a very popular Python package because it provides many convenient methods for sending requests, parsing and exception handling. One could also use the official urllib package, but for the same tasks requests is overall much easier to use thanks to its code design. You can clearly see the creator's philosophy in the website's motto.
The objective of this tutorial is to introduce you to object-oriented programming (OOP) by imitating a working example in requests. Once you have a grasp of the very basic principles of OOP, we will move on to a more elaborate example that shows its benefits more clearly.
At the end you will be able to build a minimal web scraping framework that scrapes articles from the famous French newspaper Le Monde and from The New York Times.
Before starting this tutorial, be sure to have installed requests and Beautiful Soup with:
pip install requests
pip install beautifulsoup4
A working example in requests
Here is a quick starting example using requests. As you can see from the output:
- The get method returns an object, an instance of the Response class.
- The status_code property tells us that the request was successful.
- The text property returns the source code of Google's homepage.
- The cookies property returns the cookies sent back by the server.
import requests

r = requests.get('https://www.google.com/')

print("The get method returns a class object.")
print("-------------")
print(type(r))

print("\nThe status code tells us that the request is successful.")
print("-------------")
print(r.status_code)

print("\nThe text property returns the source code of Google homepage.")
print("-------------")
print(r.text)

print("\nThe cookies property returns your cookies.")
print("-------------")
print(r.cookies)
--------------------------------------------------------------------
The get method returns a class object.
-------------
<class 'requests.models.Response'>

The status code tells us that the request is successful.
-------------
200

The text property returns the source code of Google homepage.
-------------
<!doctype html><html itemscope="" itemtype="http://schema.org/WebPage" lang="fr"><head><meta content="text/htd{line-height:.8em}.gac_m td{line-height:17px}form{margin-&&document.f.q.focus();document.gbqf&&document.gbqf.q.focus();}
l\x22:16,\x22sbpr\x22:16,\x22scd\x22:10,\x22stok\x22:\x22-kXP3xnMOMg3nj0CIEI2W6nZXg4\x22,\x22uhde\x22:false}}';google.pmc=JSON.parse(pmc);})();</script> <body> </body></html>

The cookies property returns your cookies.
-------------
<RequestsCookieJar[<Cookie CONSENT=PENDING+097 for .google.com/>]>
Class, constructor, method and property
When we talk about object-oriented programming (OOP), no matter which language you are using (some languages, like Java and C++, are more natively class-based), the four most fundamental concepts are class, constructor, method and property.
Let’s build our first class in Python. A class is a concept, a perception. Let’s say we want to define what a human being (Person) is.
Typically you would start building your Person class (capitalized by convention) by defining its properties (name, nationality, job) using a constructor (the __init__ method here). You would also define the class’s methods (what a Person can do).
One thing that might seem odd to beginners is the self parameter. Roughly speaking, it stands for the object the method is called on, which is what allows your methods to access that object's properties.
The code should be quite self-explanatory.
class Person:
    def __init__(self, name, nationality, job):
        self.name = name
        self.nationality = nationality
        self.job = job

    def greeting(self):
        print(f"Hello my name is {self.name}.\nI'm {self.nationality}.\nI work in {self.job}.")
--------------------------------------------------------------------
# instantiate the class
me = Person("Xiaoou", "Chinese", "NLP")

# use the greeting method
me.greeting()

# access the name property
print(me.name)

Hello my name is Xiaoou.
I'm Chinese.
I work in NLP.
Xiaoou
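To make the role of self more concrete, here is a small sketch. It assumes a simplified Person whose greeting returns a string instead of printing it (a change made here only so that the result is easy to compare). Calling a method on an object is equivalent to calling the function on the class and passing the object explicitly.

```python
class Person:
    def __init__(self, name, nationality, job):
        self.name = name
        self.nationality = nationality
        self.job = job

    def greeting(self):
        # self is the object the method was called on
        return f"Hello my name is {self.name}."


me = Person("Xiaoou", "Chinese", "NLP")

# The usual call: Python passes `me` as `self` automatically
bound_call = me.greeting()

# The same call written out explicitly
explicit_call = Person.greeting(me)

print(bound_call == explicit_call)
```

Both calls produce the same string, which is why self must be the first parameter of every method.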
So far so good. We have built our first class. Now let’s imitate the schema used in the working example of requests we saw at the beginning. As a reminder, the working code block is:
r = requests.get('https://www.google.com/')
print(type(r))
# <class 'requests.models.Response'>
print(r.status_code)
print(r.text)
When you run requests.get("https://www.google.com/"), you get an object of class requests.models.Response. Among other things, this object has two properties: status_code and text. Let's imitate this schema with an example of article scraping.
So basically we are trying to have an object returned after calling a function/method.
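Before touching a real website, the pattern itself can be sketched without any network access. FakeResponse and fake_get below are hypothetical names invented for this illustration; they are not part of requests and only mimic its shape (a function that returns an object carrying status_code and text).

```python
class FakeResponse:
    """Hypothetical stand-in for requests.models.Response."""

    def __init__(self, status_code, text):
        self.status_code = status_code
        self.text = text


def fake_get(url):
    # No network access: we pretend every URL answers with a tiny page
    return FakeResponse(200, f"<html><body>Page at {url}</body></html>")


r = fake_get("https://www.example.com/")
print(type(r).__name__)
print(r.status_code)
print(r.text)
```

The caller only deals with r.status_code and r.text, exactly as with the real library.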
We first create the class Content and then the function scrape_lemonde, which packs the scraped url, the title and the first 100 characters of the article into an object of the Content class.
import requests
from bs4 import BeautifulSoup


class Content:
    def __init__(self, url, title, first100):
        self.url = url
        self.title = title
        self.first100 = first100


def get_parsed_text(url):
    req = requests.get(url)
    soup = BeautifulSoup(req.text, 'html.parser')
    return soup


def scrape_lemonde(url):
    soup = get_parsed_text(url)
    title = soup.find('h1').text
    body_tags = soup.article.find_all(["p", "h2"], recursive=False)
    body = ""
    for tag in body_tags:
        body += tag.get_text()
    first100 = body[:100]
    return Content(url, title, first100)


url = 'https://www.lemonde.fr/mondephilatelique/article/2021/04/03/le-mouvement-wiener-werkstatte-dans-cartes-postales-magazine_6075495_5470897.html'
content = scrape_lemonde(url)

print("Url")
print("---------")
print(content.url)

print("\nTitle\n----------")
print(content.title)

print("\nFirst 100 characters\n-----------")
print(content.first100)
--------------------------------------------------------------------
Url
---------
https://www.lemonde.fr/mondephilatelique/article/2021/04/03/le-mouvement-wiener-werkstatte-dans-cartes-postales-magazine_6075495_5470897.html

Title
----------
Le mouvement Wiener Werkstätte, dans « Cartes postales magazine »

First 100 characters
-----------
« Le 22 janvier 2012 décède Paul Armand, ce qui entraîne la mort de Cartes postales et collections (
What are the benefits of such an approach?
Let’s imagine that your product had no Content class and offered only plain functions. Besides scrape_lemonde, the end user would have to know three more functions, get_url, get_content and get_title, to be able to use your scraper.
It's quite complex and counter-intuitive.
However, an object-oriented approach is much more comprehensible, because a webpage naturally has a title, a url and text. Your end users don’t need to know how you manage to get this information; they just need to know how to access it.
That’s the principle of encapsulation.
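As a minimal sketch of this idea, consider a hypothetical variant of Content (not the version used in the rest of this tutorial) that stores the full body internally and computes first100 on demand. The user still reads content.first100 like a plain attribute and never needs to know how it is obtained.

```python
class Content:
    def __init__(self, url, title, body):
        self.url = url
        self.title = title
        self._body = body  # leading underscore: internal detail by convention

    @property
    def first100(self):
        # Computed on access; callers read it like a plain attribute
        return self._body[:100]


content = Content("https://www.example.com/article", "A title", "x" * 250)
print(content.title)
print(len(content.first100))
```

The internal representation can change later without breaking code that only reads url, title and first100.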
Note that I’ve deliberately simplified some concepts relevant to software design as I don't want to get too deep into it. However, I hope you can already see how OOP works in practice and the benefits of such an approach.
Another benefit (besides encapsulation and better code structure) is its underlying normalizing ability. Let’s say you now want to scrape The New York Times. It suffices to add a scrape_nytimes function that always returns the same kind of object (Content with 3 properties).
In this way you normalize your framework or your product by defining a unified output regardless of the scraper/function that the user employs.
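The value of a unified output can be sketched without any real scraping. In the illustration below, the two "pages" are plain dictionaries with deliberately different layouts, and scrape_site_a, scrape_site_b and describe are hypothetical names made up for this example. Because both functions return a Content object, the downstream code does not care where the data came from.

```python
class Content:
    def __init__(self, url, title, first100):
        self.url = url
        self.title = title
        self.first100 = first100


def scrape_site_a(page):
    # Site A (hypothetical) stores a headline and a list of paragraphs
    body = " ".join(page["paragraphs"])
    return Content(page["url"], page["headline"], body[:100])


def scrape_site_b(page):
    # Site B (hypothetical) stores a title and one block of text
    return Content(page["link"], page["title"], page["text"][:100])


def describe(content):
    # Works for any Content, whichever function produced it
    return f"{content.title} ({content.url}): {content.first100}"


page_a = {"url": "https://a.example/1", "headline": "Headline A",
          "paragraphs": ["First.", "Second."]}
page_b = {"link": "https://b.example/2", "title": "Title B",
          "text": "Some text."}

for content in (scrape_site_a(page_a), scrape_site_b(page_b)):
    print(describe(content))
```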
Let’s implement it.
# The new function
def scrape_nytimes(url):
    bs = get_parsed_text(url)
    title = bs.find('h1').text
    lines = bs.select('div.StoryBodyCompanionColumn div p')
    body = '\n'.join([line.text for line in lines])
    first100 = body[:100]
    return Content(url, title, first100)


# Exactly the same schema
url = 'https://www.nytimes.com/2018/01/25/opinion/sunday/silicon-valley-immortality.html'
content = scrape_nytimes(url)

print("Url")
print("---------")
print(content.url)

print("\nTitle\n----------")
print(content.title)

print("\nFirst 100 characters\n-----------")
print(content.first100)
--------------------------------------------------------------------
Url
---------
https://www.nytimes.com/2018/01/25/opinion/sunday/silicon-valley-immortality.html

Title
----------
The Men Who Want to Live Forever

First 100 characters
-----------
Would you like to live forever? Some billionaires, already invincible in every other way, have decid
Isn’t that beautiful?
However, the structure of our code (or product) is still one step away from being perfectly intuitive. As a mere mortal, I feel that a scraper class would be a perfect place to start, instead of having to know the two functions scrape_lemonde and scrape_nytimes.
Let’s do it! Note that we reuse the Content class created earlier. Then we integrate the two scraper functions in a new Crawler class.
The new Crawler class takes journal as a parameter, which allows us to adjust the scraper’s behavior according to the newspaper (each one uses different tags around the text body).
It’s important to add self as the first parameter when you create functions in a class. Actually, when “functions” are created inside a class they are called methods, and the self parameter refers to the object on which the method is called.
The last thing worth mentioning is the print in the constructor, which displays a message when the user wishes to crawl a newspaper not recognized by the Crawler.
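Note that a print only informs the user: the Crawler object is still created, and its get method will simply return None for an unknown journal. One possible alternative, sketched below as an assumption rather than as the design used in this tutorial, is to raise a ValueError so that the problem cannot go unnoticed. The SUPPORTED attribute is a name introduced for this illustration.

```python
class Crawler:
    # Hypothetical class attribute listing the journals we can handle
    SUPPORTED = ("lemonde", "nyt")

    def __init__(self, journal):
        if journal not in self.SUPPORTED:
            # Stop right away instead of creating an unusable crawler
            raise ValueError(f"Unsupported journal: {journal!r}")
        self.journal = journal


try:
    Crawler("bbc")
except ValueError as error:
    message = str(error)
    print(message)
```

Whether to print or to raise is a design choice: printing is friendlier in a notebook, raising is safer inside a larger program.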
class Content:
    def __init__(self, url, title, first100):
        self.url = url
        self.title = title
        self.first100 = first100


class Crawler:
    def __init__(self, journal):
        self.journal = journal
        # if newspaper not recognized
        if journal not in ["lemonde", "nyt"]:
            print("Our tech group doesn't scrape this website for the moment.\nPlease contact us to build a scraper for this journal :D")

    def get_parsed_text(self, url):
        req = requests.get(url)
        soup = BeautifulSoup(req.text, 'html.parser')
        return soup

    def scrape_lemonde(self, url):
        soup = self.get_parsed_text(url)
        title = soup.find('h1').text
        body_tags = soup.article.find_all(["p", "h2"], recursive=False)
        body = ""
        for tag in body_tags:
            body += tag.get_text()
        first100 = body[:100]
        return Content(url, title, first100)

    def scrape_nytimes(self, url):
        bs = self.get_parsed_text(url)
        title = bs.find('h1').text
        lines = bs.select('div.StoryBodyCompanionColumn div p')
        body = '\n'.join([line.text for line in lines])
        first100 = body[:100]
        return Content(url, title, first100)

    def get(self, url):
        if self.journal == "lemonde":
            return self.scrape_lemonde(url)
        elif self.journal == "nyt":
            return self.scrape_nytimes(url)
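As the number of supported newspapers grows, the if/elif chain in get grows with it. A common alternative is to look up the right method in a dictionary. The sketch below illustrates that idea with stub methods that return strings instead of scraping; the _scrapers attribute is a name assumed for this example.

```python
class Crawler:
    def __init__(self, journal):
        self.journal = journal
        # Map each journal name to the method that handles it
        self._scrapers = {
            "lemonde": self.scrape_lemonde,
            "nyt": self.scrape_nytimes,
        }

    def scrape_lemonde(self, url):
        # Stub standing in for the real scraping logic
        return f"lemonde:{url}"

    def scrape_nytimes(self, url):
        # Stub standing in for the real scraping logic
        return f"nyt:{url}"

    def get(self, url):
        scraper = self._scrapers.get(self.journal)
        if scraper is None:
            # Same result as the if/elif version for an unknown journal
            return None
        return scraper(url)


print(Crawler("nyt").get("https://www.example.com/"))
print(Crawler("bbc").get("https://www.example.com/"))
```

Adding a newspaper then means writing one method and adding one dictionary entry, while get stays unchanged.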
Le Monde scraper
Now let’s instantiate a scraper for articles of the newspaper Le Monde.
# Create the lemonde crawler
lemonde_crawler = Crawler("lemonde")
lemonde_content = lemonde_crawler.get("https://www.lemonde.fr/mondephilatelique/article/2021/04/03/le-mouvement-wiener-werkstatte-dans-cartes-postales-magazine_6075495_5470897.html")

print("Url")
print("---------")
print(lemonde_content.url)

print("\nTitle\n----------")
print(lemonde_content.title)

print("\nFirst 100 characters\n-----------")
print(lemonde_content.first100)
--------------------------------------------------------------------
Url
---------
https://www.lemonde.fr/mondephilatelique/article/2021/04/03/le-mouvement-wiener-werkstatte-dans-cartes-postales-magazine_6075495_5470897.html

Title
----------
Le mouvement Wiener Werkstätte, dans « Cartes postales magazine »

First 100 characters
-----------
« Le 22 janvier 2012 décède Paul Armand, ce qui entraîne la mort de Cartes postales et collections (
New York Times scraper
Now let's create a scraper for articles of The New York Times.
# Create the New York Times Crawler
nyt_crawler = Crawler("nyt")
nyt_content = nyt_crawler.get("https://www.nytimes.com/2021/04/02/us/politics/capitol-attack.html")

print("Url")
print("---------")
print(nyt_content.url)

print("\nTitle\n----------")
print(nyt_content.title)

print("\nFirst 100 characters\n-----------")
print(nyt_content.first100)
--------------------------------------------------------------------
Url
---------
https://www.nytimes.com/2021/04/02/us/politics/capitol-attack.html

Title
----------
Driver Rams Into Officers at Capitol, Killing One and Injuring Another

First 100 characters
-----------
WASHINGTON — The band of razor wire-topped fencing around the Capitol had recently come down. The he
Unknown newspaper
When the end user enters an unknown journal, the crawler prints the predefined message.
bbc_crawler = Crawler("bbc")
--------------------------------------------------------------------
Our tech group doesn't scrape this website for the moment.
Please contact us to build a scraper for this journal :D
Wrap-up and further readings
Bravo for having read this far. Hopefully you have:
- Learned how to manage a small web-scraping project using classes in Python
- Understood the benefits of object-oriented programming, both for code structuring and for software design
- Gotten to know the basic underlying principles of frameworks like requests
Now it’s time to build your own framework :D
If you want to know more about web scraping, read the wonderful book by Ryan Mitchell, which partly inspired this tutorial.
Web Scraping with Python: Collecting More Data from the Modern Web
If you want to dive into object-oriented programming, be sure to check out the book by Mark Lutz:
Programming Python: Powerful Object-Oriented Programming