Understand object-oriented programming (OOP) by building a minimal Web Scraping framework

Why you want to be a more organized programmer


What you are going to learn

requests is a very popular Python package because it provides many convenient methods for making requests, parsing responses and handling exceptions. One could also use the standard-library urllib package, but for the same tasks requests is overall much easier to use thanks to its code design. You can clearly see the philosophy of the creator through the project’s motto: “HTTP for Humans”.
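To see the difference in practice, here is a minimal sketch of my own (not from the original article) fetching the same page with both libraries:

import urllib.request
import requests

# urllib: open the URL, read raw bytes, decode them yourself
with urllib.request.urlopen('https://www.google.com/') as response:
    html_via_urllib = response.read().decode('utf-8', errors='replace')

# requests: one call, and .text takes care of the decoding
html_via_requests = requests.get('https://www.google.com/').text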

The objective of this tutorial is to introduce you to object-oriented programming (OOP) by imitating a working example from requests. After you have a grasp of the very basic principles of OOP, we will move on to a more elaborate example so that you can better appreciate the benefits of OOP.

At the end you will be able to build a minimal web scraping framework that can scrape articles from the famous French newspaper Le Monde and from the New York Times.

Before starting this tutorial, make sure you have installed requests and Beautiful Soup with:

pip install requests
pip install beautifulsoup4

A working example in requests

Here is a quick starting example using requests. As you can see from the output:

  1. The get method returns a class object.
  2. The status code tells us that the request is successful.
  3. The text property returns the source code of Google’s homepage.
  4. The cookies property returns your cookies.
import requests

r = requests.get('https://www.google.com/')

print("The get method returns a class object.")
print("-------------")
print(type(r))
print("\nThe status code tells us that the request is successful.")
print("-------------")
print(r.status_code)
print("\nThe text property returns the source code of Google homepage.")
print("-------------")
print(r.text)
print("\nThe cookies property returns your cookies.")
print("-------------")
print(r.cookies)
--------------------------------------------------------------------
The get method returns a class object.
-------------
<class 'requests.models.Response'>
The status code tells us that the request is successful.
-------------
200
The text property returns the source code of Google homepage.
-------------
<!doctype html><html itemscope="" itemtype="http://schema.org/WebPage" lang="fr"><head>... (HTML source truncated) ...</body></html>
The cookies property returns your cookies.
-------------
<RequestsCookieJar[<Cookie CONSENT=PENDING+097 for .google.com/>]>
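The same convenience applies to exception handling. A quick sketch (my own addition, using helpers that ship with requests):

import requests

r = requests.get('https://www.google.com/')

print(r.ok)                        # True for any status code below 400
r.raise_for_status()               # raises requests.exceptions.HTTPError on 4xx/5xx
print(r.headers['Content-Type'])   # headers act like a case-insensitive dict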

Class, constructor, method and property

When we talk about object-oriented programming (OOP), no matter which language you are using (some languages, like Java and C++, are more natively class-based), the four most fundamental concepts are class, constructor, method and property.

Let’s build our first class in Python. A class is a concept, a blueprint. Let’s say we want to define what a human being (a Person) is.

Typically you would start building your Person class (note the capital letter) by defining its properties (name, nationality, job) in a constructor (the __init__ method here). You would also define the class’s methods (what a Person can do).

One thing that might seem odd to beginners is the self parameter. Roughly speaking, it is a placeholder that allows your methods to access the object’s properties.

The code should be quite self-explanatory.

class Person:
    def __init__(self, name, nationality, job):
        self.name = name
        self.nationality = nationality
        self.job = job

    def greeting(self):
        print(f"Hello my name is {self.name}.\nI'm {self.nationality}.\nI work in {self.job}.")
--------------------------------------------------------------------
# instantiate the class
me = Person("Xiaoou", "Chinese", "NLP")
# using the greeting method
me.greeting()
# access the name property
print(me.name)
Hello my name is Xiaoou.
I'm Chinese.
I work in NLP.
Xiaoou
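To demystify self a little: calling a method on an instance is just shorthand for calling it on the class with the instance passed as the first argument. Using the me object created above:

# These two calls do exactly the same thing: in the first one,
# Python passes `me` as `self` behind the scenes.
me.greeting()
Person.greeting(me)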

So far so good. We have built our first class. Now let’s imitate the schema used in the working example of requests we saw at the beginning. As a reminder, the working code block is:

r = requests.get('https://www.google.com/')
print(type(r))        # <class 'requests.models.Response'>
print(r.status_code)  # 200
print(r.text)         # the HTML source of the page

When you run requests.get("https://www.google.com/"), you get back an object of class <class 'requests.models.Response'>. Among other things, this object has two properties: status_code and text. Let's imitate this schema with an article-scraping example.

So basically we are trying to get an object back after calling a function/method.

We first create the Content class, then the scrape_lemonde function, which packs the scraped url, title and the first 100 characters of the body into a Content object.

import requests
from bs4 import BeautifulSoup

class Content:
    def __init__(self, url, title, first100):
        self.url = url
        self.title = title
        self.first100 = first100

def get_parsed_text(url):
    req = requests.get(url)
    soup = BeautifulSoup(req.text, 'html.parser')
    return soup

def scrape_lemonde(url):
    soup = get_parsed_text(url)
    title = soup.find('h1').text
    body_tags = soup.article.find_all(["p", "h2"], recursive=False)
    body = ""
    for tag in body_tags:
        body += tag.get_text()
    first100 = body[:100]
    return Content(url, title, first100)
url = 'https://www.lemonde.fr/mondephilatelique/article/2021/04/03/le-mouvement-wiener-werkstatte-dans-cartes-postales-magazine_6075495_5470897.html'

content = scrape_lemonde(url)

print("Url")
print("---------")
print(content.url)
print("\nTitle\n----------")
print(content.title)
print("\nFirst 100 characters\n-----------")
print(content.first100)
--------------------------------------------------------------------
Url
---------
https://www.lemonde.fr/mondephilatelique/article/2021/04/03/le-mouvement-wiener-werkstatte-dans-cartes-postales-magazine_6075495_5470897.html
Title
----------
Le mouvement Wiener Werkstätte, dans « Cartes postales magazine »
First 100 characters
-----------
« Le 22 janvier 2012 décède Paul Armand, ce qui entraîne la mort de Cartes postales et collections (

What are the benefits of such an approach?

Let’s imagine that the scrape_lemonde function did not return an object. The end user of your product would then have to know three more functions, get_url, get_content and get_title, to be able to use your scraper.

It's quite complex and counter-intuitive.

However, an object-oriented approach is much more comprehensible, because a webpage naturally has a title, a url and text. Your end user doesn’t need to know how you manage to get this information; he just needs to know how to access it.

That’s the principle of encapsulation.
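Python can even push encapsulation one step further with the @property decorator, which exposes a derived value as if it were a plain attribute. A small sketch of my own (summary is a hypothetical addition, not part of the framework we are building):

class Content:
    def __init__(self, url, title, first100):
        self.url = url
        self.title = title
        self.first100 = first100

    @property
    def summary(self):
        # Accessed as content.summary, with no parentheses:
        # the user never sees how the value is assembled.
        return f"{self.title} ({self.url})"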

Note that I’ve deliberately simplified some concepts relevant to software design, as I don’t want to go too deep into it. However, I hope you already see how OOP works in vivo and the benefits of such an approach.

Another benefit (besides encapsulation and better code structure) is its underlying normalizing power. Let’s say you now want to scrape the New York Times. It suffices to add a scrape_nytimes function that always returns the same kind of object (a Content with three properties).

In this way you normalize your framework or your product by defining a unified output regardless of the scraper/function that the user employs.

Let’s implement it.

# The new function
def scrape_nytimes(url):
    bs = get_parsed_text(url)
    title = bs.find('h1').text
    lines = bs.select('div.StoryBodyCompanionColumn div p')
    body = '\n'.join([line.text for line in lines])
    first100 = body[:100]
    return Content(url, title, first100)
# Exactly the same schema
url = 'https://www.nytimes.com/2018/01/25/opinion/sunday/silicon-valley-immortality.html'

content = scrape_nytimes(url)

print("Url")
print("---------")
print(content.url)
print("\nTitle\n----------")
print(content.title)
print("\nFirst 100 characters\n-----------")
print(content.first100)
--------------------------------------------------------------------
Url
---------
https://www.nytimes.com/2018/01/25/opinion/sunday/silicon-valley-immortality.html
Title
----------
The Men Who Want to Live Forever
First 100 characters
-----------
Would you like to live forever? Some billionaires, already invincible in every other way, have decid

Isn’t that beautiful?
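And because every scraper returns the same kind of Content object, downstream code never has to care which function produced it. A quick sketch reusing the two articles above:

articles = [
    ('https://www.lemonde.fr/mondephilatelique/article/2021/04/03/le-mouvement-wiener-werkstatte-dans-cartes-postales-magazine_6075495_5470897.html', scrape_lemonde),
    ('https://www.nytimes.com/2018/01/25/opinion/sunday/silicon-valley-immortality.html', scrape_nytimes),
]

for url, scraper in articles:
    content = scraper(url)   # always a Content object, whatever the source
    print(content.title)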

However, the structure of our code (or product) is still one step away from being perfectly intuitive. As a common mortal, I feel that a scraper class would be a better entry point than having to know the two functions scrape_lemonde and scrape_nytimes.

Let’s do it! Note that we reuse the Content class created earlier, and we integrate the two scraper functions into a new Crawler class.

The new Crawler class takes journal as a parameter, which lets us adjust the scraper’s behavior to the newspaper (each one wraps the article body in different tags).

It’s important to add self as a parameter when you create functions in a class. In fact, when “functions” are created inside a class they are called methods, and the self parameter indicates that these methods act on objects of that class.

The last thing worth mentioning is the print in the constructor, which displays a message when the user wishes to crawl a newspaper not recognized by the Crawler.

class Content:
    def __init__(self, url, title, first100):
        self.url = url
        self.title = title
        self.first100 = first100

class Crawler:
    def __init__(self, journal):
        self.journal = journal
        # if the newspaper is not recognized
        if journal not in ["lemonde", "nyt"]:
            print("Our tech group doesn't scrape this website for the moment.\nPlease contact us to build a scraper for this journal :D")

    def get_parsed_text(self, url):
        req = requests.get(url)
        soup = BeautifulSoup(req.text, 'html.parser')
        return soup

    def scrape_lemonde(self, url):
        soup = self.get_parsed_text(url)
        title = soup.find('h1').text
        body_tags = soup.article.find_all(["p", "h2"], recursive=False)
        body = ""
        for tag in body_tags:
            body += tag.get_text()
        first100 = body[:100]
        return Content(url, title, first100)

    def scrape_nytimes(self, url):
        bs = self.get_parsed_text(url)
        title = bs.find('h1').text
        lines = bs.select('div.StoryBodyCompanionColumn div p')
        body = '\n'.join([line.text for line in lines])
        first100 = body[:100]
        return Content(url, title, first100)

    def get(self, url):
        if self.journal == "lemonde":
            return self.scrape_lemonde(url)
        elif self.journal == "nyt":
            return self.scrape_nytimes(url)
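As a side note, a possible refinement (my own suggestion, not the article’s design) is to register the scraping methods in a dictionary, so that get needs no if/elif chain and the list of supported journals lives in one place. A sketch, with the other methods omitted for brevity:

class Crawler:
    # ... get_parsed_text, scrape_lemonde and scrape_nytimes as above ...

    def __init__(self, journal):
        self.journal = journal
        # The dict doubles as the single source of truth for supported journals
        self.scrapers = {
            "lemonde": self.scrape_lemonde,
            "nyt": self.scrape_nytimes,
        }
        if journal not in self.scrapers:
            print("Our tech group doesn't scrape this website for the moment.")

    def get(self, url):
        return self.scrapers[self.journal](url)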

Le Monde scraper

Now let’s instantiate a scraper for articles from the newspaper Le Monde.

# Create the lemonde crawler
lemonde_crawler = Crawler("lemonde")
lemonde_content = lemonde_crawler.get("https://www.lemonde.fr/mondephilatelique/article/2021/04/03/le-mouvement-wiener-werkstatte-dans-cartes-postales-magazine_6075495_5470897.html")
print("Url")
print("---------")
print(lemonde_content.url)
print("\nTitle\n----------")
print(lemonde_content.title)
print("\nFirst 100 characters\n-----------")
print(lemonde_content.first100)
--------------------------------------------------------------------
Url
---------
https://www.lemonde.fr/mondephilatelique/article/2021/04/03/le-mouvement-wiener-werkstatte-dans-cartes-postales-magazine_6075495_5470897.html
Title
----------
Le mouvement Wiener Werkstätte, dans « Cartes postales magazine »
First 100 characters
-----------
« Le 22 janvier 2012 décède Paul Armand, ce qui entraîne la mort de Cartes postales et collections (

New York Times scraper

Now a scraper for articles from the New York Times.

# Create the New York Times Crawler
nyt_crawler = Crawler("nyt")
nyt_content = nyt_crawler.get("https://www.nytimes.com/2021/04/02/us/politics/capitol-attack.html")
print("Url")
print("---------")
print(nyt_content.url)
print("\nTitle\n----------")
print(nyt_content.title)
print("\nFirst 100 characters\n-----------")
print(nyt_content.first100)
--------------------------------------------------------------------
Url
---------
https://www.nytimes.com/2021/04/02/us/politics/capitol-attack.html
Title
----------
Driver Rams Into Officers at Capitol, Killing One and Injuring Another
First 100 characters
-----------
WASHINGTON — The band of razor wire-topped fencing around the Capitol had recently come down. The he

Unknown newspaper

When the end user enters an unknown journal, the crawler spits out the predefined message.

bbc_crawler = Crawler("bbc")
--------------------------------------------------------------------
Our tech group doesn't scrape this website for the moment.
Please contact us to build a scraper for this journal :D
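Printing a message is fine for a tutorial, but in a real framework you might prefer raising an exception so that calling code can react to the problem. A sketch of that variant (StrictCrawler is my own hypothetical name):

class StrictCrawler(Crawler):
    def __init__(self, journal):
        if journal not in ("lemonde", "nyt"):
            raise ValueError(f"No scraper available for {journal!r}")
        super().__init__(journal)

try:
    StrictCrawler("bbc")
except ValueError as error:
    print(error)   # No scraper available for 'bbc'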

Wrap-up and further readings

Bravo for having read this far! Hopefully you have:

  1. Learned how to manage a small web-scraping project using classes in Python
  2. Understood the benefits of object-oriented programming, both at the level of code structure and at the level of software design
  3. Discovered the basic principles underlying frameworks like requests

Now it’s time to build your own framework :D
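If you do, extending this framework follows the same recipe every time: one new scraping method that returns a Content object, plus one new branch in get (and in the recognized-journal check). A hypothetical sketch, with placeholder selectors that you would adapt to the site you actually target:

class MyCrawler(Crawler):
    def scrape_example(self, url):
        # Placeholder selectors: h1 and p are guesses, not real selectors
        soup = self.get_parsed_text(url)
        title = soup.find('h1').text
        body = ' '.join(p.get_text() for p in soup.find_all('p'))
        return Content(url, title, body[:100])

    def get(self, url):
        if self.journal == "example":
            return self.scrape_example(url)
        return super().get(url)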

If you want to know more about web scraping, read the wonderful book by Ryan Mitchell, which partly inspired this tutorial.

Web Scraping with Python: Collecting More Data from the Modern Web

If you want to dive into Object-Oriented Programming, be sure to check out Mark Lutz’s book:

Programming Python: Powerful Object-Oriented Programming
