Web Scraping French News
A complete tutorial on scraping French news from Le Monde ❤️
While refreshing my web scraping skills, I found that the best way to maintain a personal knowledge base is to write complete, well-documented tutorials. They produce more robust and contextualized documentation (like BERT :D) than simply commenting code.
So here is a complete tutorial on how to scrape articles from the famous French newspaper Le Monde.
Web scraping can raise legal issues. This is purely a coding tutorial; use it at your own risk, and read https://soshace.com/responsible-web-scraping-gathering-data-ethically-and-legally/.
Generate Archive Links
The first step is to generate archive links. An archive link is a page from which we can scrape article links; for instance, https://www.lemonde.fr/archives-du-monde/01-01-2020/ lists all the articles published on 2020-01-01. The function create_archive_links below takes starting and ending year/month/day as input and returns a dictionary in the form of year:links.
def create_archive_links(year_start, year_end, month_start, month_end, day_start, day_end):
    archive_links = {}
    for y in range(year_start, year_end + 1):
        dates = [str(d).zfill(2) + "-" + str(m).zfill(2) + "-" + str(y)
                 for m in range(month_start, month_end + 1)
                 for d in range(day_start, day_end + 1)]
        archive_links[y] = ["https://www.lemonde.fr/archives-du-monde/" + date + "/"
                            for date in dates]
    return archive_links
Example output of create_archive_links(2006,2020,1, 12, 1, 31)
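To see the shape of the output on a small input, here is a self-contained sketch (the function is repeated so the snippet runs standalone):

```python
def create_archive_links(year_start, year_end, month_start, month_end, day_start, day_end):
    # Same function as above, repeated so the snippet runs standalone
    archive_links = {}
    for y in range(year_start, year_end + 1):
        dates = [str(d).zfill(2) + "-" + str(m).zfill(2) + "-" + str(y)
                 for m in range(month_start, month_end + 1)
                 for d in range(day_start, day_end + 1)]
        archive_links[y] = ["https://www.lemonde.fr/archives-du-monde/" + date + "/"
                            for date in dates]
    return archive_links

# Three days of January 2020 -> one key (2020) mapping to three archive URLs
links = create_archive_links(2020, 2020, 1, 1, 1, 3)
```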
Generate Article Links
The next step is to collect all the article links on each archive page. For this you need three imports: HTTPError to handle exceptions, urlopen to open web pages, and BeautifulSoup to parse them.
The exception handling is necessary because the generated list includes pages for impossible dates like 02-31; it is much easier to catch the resulting errors than to generate only legitimate dates.
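As an aside, the alternative approach would be to validate dates up front so no invalid URLs are generated. A minimal sketch using the standard library's datetime, which rejects impossible dates for you:

```python
from datetime import date

def is_valid_date(day, month, year):
    # datetime.date raises ValueError for impossible dates such as 31-02
    try:
        date(year, month, day)
        return True
    except ValueError:
        return False
```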
Each article sits in a <section> element with the class teaser. Here I also filter out paywalled articles, which carry a span with the class icon__premium, as well as all links containing the word en-direct, which are videos. This is to say that web scraping requires not only programming skills but also some elementary analysis of the target pages.
from urllib.error import HTTPError
from urllib.request import urlopen
from bs4 import BeautifulSoup

def get_articles_links(archive_links):
    links_non_abonne = []
    for link in archive_links:
        try:
            html = urlopen(link)
        except HTTPError:
            print("url not valid", link)
        else:
            soup = BeautifulSoup(html, "html.parser")
            news = soup.find_all(class_="teaser")
            # keep only free articles: no span with class icon__premium (abonnés)
            for item in news:
                if not item.find('span', {'class': 'icon__premium'}):
                    l_article = item.find('a')['href']
                    # 'en-direct' links are videos
                    if 'en-direct' not in l_article:
                        links_non_abonne.append(l_article)
    return links_non_abonne
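To see the filtering logic in isolation, here is a self-contained sketch that runs the same selection against a small, made-up HTML fragment mimicking the archive markup (no network access needed):

```python
from bs4 import BeautifulSoup

# Made-up fragment: one free article, one premium article, one video link
html = """
<section class="teaser">
  <a href="https://www.lemonde.fr/politique/article/2020/01/01/exemple.html">Libre</a>
</section>
<section class="teaser">
  <span class="icon__premium"></span>
  <a href="https://www.lemonde.fr/economie/article/2020/01/01/abonnes.html">Abonnés</a>
</section>
<section class="teaser">
  <a href="https://www.lemonde.fr/en-direct/exemple.html">Vidéo</a>
</section>
"""

soup = BeautifulSoup(html, "html.parser")
links = []
for item in soup.find_all(class_="teaser"):
    # Skip paywalled teasers and video links, keep the rest
    if not item.find('span', {'class': 'icon__premium'}):
        href = item.find('a')['href']
        if 'en-direct' not in href:
            links.append(href)
# links now holds only the free, non-video article
```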
Save links to txt
Since nobody wants to scrape the same links again and again (they are unlikely to change), a handy function can save them to files. Here the publication year is used to name the files.
import os

def write_links(path, links, year_fn):
    with open(os.path.join(path, "lemonde_" + str(year_fn) + "_links.txt"), 'w') as f:
        for link in links:
            f.write(link + "\n")

article_links = {}
for year, links in archive_links.items():
    print("processing: ", year)
    article_links_list = get_articles_links(links)
    article_links[year] = article_links_list
    write_links(corpus_path, article_links_list, year)
This produces one text file per year, named like lemonde_2020_links.txt, with one link per line.
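To check the round trip, here is a self-contained sketch that writes links into a temporary folder and reads them back (write_links repeated so it runs standalone):

```python
import os
import tempfile

def write_links(path, links, year_fn):
    # Same function as above, repeated so the snippet runs standalone
    with open(os.path.join(path, "lemonde_" + str(year_fn) + "_links.txt"), 'w') as f:
        for link in links:
            f.write(link + "\n")

tmp = tempfile.mkdtemp()
write_links(tmp, ["https://www.lemonde.fr/a", "https://www.lemonde.fr/b"], 2020)

# Read the file back: one link per line
with open(os.path.join(tmp, "lemonde_2020_links.txt")) as f:
    saved = f.read().splitlines()
```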
Scraping a Single Page
Now you can scrape article contents, it’s surprisingly straightforward. In fact you just need to get all the h1,h2 and p
elements. The recursive=False
is important here because you don’t want to dig any deeper into children elements once you find the text.
For concerns of modularity you should first write a function to scrape a single page.
def get_single_page(url):
    try:
        html = urlopen(url)
    except HTTPError:
        print("url not valid", url)
    else:
        soup = BeautifulSoup(html, "html.parser")
        text_title = soup.find('h1')
        text_body = soup.article.find_all(["p", "h2"], recursive=False)
        return (text_title, text_body)
Classify the scraped texts for NLP testing
Let's say you need to classify news by theme (a very common text classification task). The following function extracts the theme from a link. For example, the link https://www.lemonde.fr/politique/article/2020/01/01/reforme-des-retraites-macron-reste-inflexible-aucune-issue-ne-se-profile_6024550_823448.html contains the keyword politique, meaning politics in French.
import re

def extract_theme(link):
    try:
        theme_text = re.findall(r'\.fr/.*?/', link)[0]
    except IndexError:
        # no theme segment in the link
        return None
    else:
        return theme_text[4:-1]
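A quick usage check on one real link and one non-matching string (the function is repeated so the snippet runs standalone):

```python
import re

def extract_theme(link):
    # Same logic as above, repeated so the snippet runs standalone
    try:
        theme_text = re.findall(r'\.fr/.*?/', link)[0]
    except IndexError:
        return None
    return theme_text[4:-1]

theme = extract_theme("https://www.lemonde.fr/politique/article/2020/01/01/exemple.html")
no_theme = extract_theme("not a lemonde link")
```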
You can also keep only the top n themes. Here is an example retrieving the texts of the top 5 themes; note the lambda used as the sort key to rank themes by descending frequency.
from collections import Counter

def list_themes(links):
    themes = []
    for link in links:
        theme = extract_theme(link)
        if theme is not None:
            themes.append(theme)
    return themes

# gather the themes of every collected link (article_links from the previous step)
themes = []
for links in article_links.values():
    themes.extend(list_themes(links))

theme_stat = Counter(themes)
theme_top = []
for k, v in sorted(theme_stat.items(), key=lambda x: x[1], reverse=True):
    if v > 700:
        theme_top.append((k, v))
themes_top_five = [x[0] for x in theme_top[:5]]
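The frequency threshold of 700 is specific to this corpus. Counter.most_common gives the same top-n selection without a hand-tuned cutoff; a minimal sketch on made-up data:

```python
from collections import Counter

# Made-up theme list standing in for the real scraped corpus
themes = ["politique"] * 3 + ["economie"] * 2 + ["sport"]

# most_common(n) returns the n (theme, count) pairs with the highest counts
themes_top_two = [theme for theme, count in Counter(themes).most_common(2)]
```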
Now you need the links corresponding to the top 5 themes, so:
from collections import defaultdict

def classify_links(theme_list, link_list):
    dict_links = defaultdict(list)
    for theme in theme_list:
        theme_link = 'https://www.lemonde.fr/' + theme + '/article/'
        for link in link_list:
            if theme_link in link:
                dict_links[theme].append(link)
    return dict_links

all_links = []
for link_list in article_links.values():
    all_links.extend(link_list)

themes_top_five_links = classify_links(themes_top_five, all_links)
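To see the grouping in isolation, here is a self-contained sketch on three made-up links (the function is repeated so the snippet runs standalone):

```python
from collections import defaultdict

def classify_links(theme_list, link_list):
    # Same function as above, repeated so the snippet runs standalone
    dict_links = defaultdict(list)
    for theme in theme_list:
        theme_link = 'https://www.lemonde.fr/' + theme + '/article/'
        for link in link_list:
            if theme_link in link:
                dict_links[theme].append(link)
    return dict_links

links = [
    "https://www.lemonde.fr/politique/article/2020/01/01/a.html",
    "https://www.lemonde.fr/sport/article/2020/01/01/b.html",
    "https://www.lemonde.fr/en-direct/c.html",  # matches no theme, dropped
]
result = classify_links(["politique", "sport"], links)
```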
Scrape All the Articles
Now it's time to scrape all the articles you need. Typically you would save all the scraped text into a folder. Note that the scraping function takes a dictionary with themes as keys and the corresponding links as values; for example, here is such a dict for the top 5 themes.
links_dict = {key: value
              for key, value in themes_top_five_links.items()}
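If the full corpus is too large, the same comprehension can cap the number of links per theme before scraping. A minimal sketch on made-up data (the 500 cutoff is an arbitrary choice, not from the original tutorial):

```python
# Made-up stand-in for themes_top_five_links
themes_top_five_links = {
    "politique": ["link" + str(i) for i in range(1000)],
    "sport": ["link" + str(i) for i in range(300)],
}

# Keep at most 500 links per theme to bound scraping time
links_dict = {theme: links[:500] for theme, links in themes_top_five_links.items()}
```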
Finally, you can scrape! Note the tqdm wrapped around range: since the whole scraping process can take quite a long time, you need a progress bar to confirm it is still running. See how to use tqdm at https://tqdm.github.io/.
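As a quick illustration of the progress bar on its own, here is a minimal sketch that wraps a loop in tqdm, with a no-op fallback in case tqdm is not installed:

```python
try:
    from tqdm import tqdm  # progress bar; pip install tqdm
except ImportError:
    # Fallback: behave like a plain iterable if tqdm is unavailable
    def tqdm(iterable, **kwargs):
        return iterable

# The bar updates as the loop advances, showing rate and ETA
total = 0
for i in tqdm(range(1000), desc="demo"):
    total += i
```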
def scrape_articles(dict_links):
    themes = dict_links.keys()
    for theme in themes:
        # create_folder: helper from the full source that makes the directory
        create_folder(os.path.join('corpus', theme))
        print("processing:", theme)
        # note the use of tqdm around range
        for i in tqdm(range(len(dict_links[theme]))):
            link = dict_links[theme][i]
            fn = extract_fn(link)  # extract_fn: helper that derives a file name from the link
            single_page = get_single_page(link)
            if single_page is not None:
                with open(os.path.join('corpus', theme, fn + '.txt'), 'w') as f:
                    f.write(single_page[0].get_text() + "\n" * 2)
                    for line in single_page[1]:
                        f.write(line.get_text() + "\n" * 2)
Helper Function to extract scraped articles from folders
Now that you have all the articles as txt files, here is a function that extracts any number of texts from each theme folder. The default is 1000 files per theme.
def cr_corpus_dict(path_corpus, n_files=1000):
    dict_corpus = defaultdict(list)
    themes = os.listdir(path_corpus)
    for theme in themes:
        counter = 0
        if not theme.startswith('.'):  # skip hidden entries such as .DS_Store
            theme_directory = os.path.join(path_corpus, theme)
            for file in os.listdir(theme_directory):
                if counter < n_files:
                    path_file = os.path.join(theme_directory, file)
                    text = read_file(path_file)  # read_file: helper from the full source
                    dict_corpus["label"].append(theme)
                    dict_corpus["text"].append(text)
                    counter += 1
    return dict_corpus
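A quick end-to-end check on a throwaway corpus: the sketch below builds one theme folder with one file in a temporary directory, then loads it back. It inlines the read_file helper so the snippet runs standalone:

```python
import os
import tempfile
from collections import defaultdict

def cr_corpus_dict(path_corpus, n_files=1000):
    # Same logic as above, with read_file inlined so the snippet runs standalone
    dict_corpus = defaultdict(list)
    for theme in os.listdir(path_corpus):
        if theme.startswith('.'):
            continue
        theme_directory = os.path.join(path_corpus, theme)
        counter = 0
        for fname in sorted(os.listdir(theme_directory)):
            if counter < n_files:
                with open(os.path.join(theme_directory, fname)) as f:
                    dict_corpus["label"].append(theme)
                    dict_corpus["text"].append(f.read())
                counter += 1
    return dict_corpus

# Build a one-article corpus in a temporary folder
root = tempfile.mkdtemp()
os.makedirs(os.path.join(root, "politique"))
with open(os.path.join(root, "politique", "a.txt"), "w") as f:
    f.write("texte")

corpus = cr_corpus_dict(root, n_files=1)
```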
Wrap-up
Here you are. Happy scraping! I hope that you have:
1. Realized how valuable a tool web scraping can be for building your own corpus.
2. Understood the typical workflow of a web scraping project.
And mind the potential legal issues!
You can find all the source code here. I've also written tutorials about scraping forums and integrating object-oriented programming into web scraping. Be sure to check them out if interested.