On your way to scraping French forums

Xiaoou&AI

2 min read · Mar 27, 2021

Be sure to have read the first tutorial here.

Originally published at AIPrototypes.com.

Get pages

Building the scraper for the French forum Doctissimo is actually simpler than building the one for the Le Monde website.

Let's look at the link to the second page of the "douleur dos" (back pain) section as an example:

https://forum.doctissimo.fr/sante/douleur-dos/liste_sujet-2.htm

First I build the list of page links by inserting the page number into the URL:

links_page = ["https://forum.doctissimo.fr/sante/douleur-dos/liste_sujet-" +
              str(i) + ".htm" for i in range(10)]
links_page[:2]

['https://forum.doctissimo.fr/sante/douleur-dos/liste_sujet-0.htm', 'https://forum.doctissimo.fr/sante/douleur-dos/liste_sujet-1.htm']
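The same list can be produced by a small function, which makes the page range explicit. This is only a sketch: the name build_page_links and its parameters are hypothetical, and which page numbers the forum actually serves is not checked here.

```python
def build_page_links(first_page, last_page,
                     prefix="https://forum.doctissimo.fr/sante/douleur-dos/liste_sujet-"):
    """Return the listing-page URLs from first_page to last_page, inclusive."""
    return [f"{prefix}{i}.htm" for i in range(first_page, last_page + 1)]

links_page = build_page_links(1, 3)
print(links_page[1])
# https://forum.doctissimo.fr/sante/douleur-dos/liste_sujet-2.htm
```

Passing the prefix as a parameter lets the same helper cover another section of the forum by swapping the section path.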

Next, a second function retrieves the thread links found on a given page.

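from urllib.request import urlopen  # standard Python module
from urllib.error import HTTPError
from bs4 import BeautifulSoup

def get_post_links(link):
    """Return the set of thread links found on one listing page."""
    try:
        html = urlopen(link)
    except HTTPError:
        print("text url not valid")
        return set()  # nothing to parse if the page could not be fetched
    soup = BeautifulSoup(html, "html.parser")
    # select the elements marked scope="row" and keep the first link of each
    rows = soup.find_all(scope="row")
    thread_links = set()
    for row in rows:
        anchor = row.find("a")
        if anchor:
            thread_links.add(anchor.get("href"))
    return thread_links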

url = "https://forum.doctissimo.fr/sante/douleur-dos/liste_sujet-2.htm"

links_threads = get_post_links(url)
list(links_threads)[0]
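To gather the threads of several pages at once, the page links and the function above can be combined. The sketch below is one possible way to do it: collect_thread_links and delay_seconds are hypothetical names, and the pause between requests is my own assumption about polite scraping, not something the forum documents.

```python
import time

def collect_thread_links(page_links, fetch_links, delay_seconds=1.0):
    """Apply fetch_links to every listing page and merge the results into one set."""
    all_threads = set()
    for position, page in enumerate(page_links):
        if position > 0:
            time.sleep(delay_seconds)  # pause between requests to spare the server
        all_threads.update(fetch_links(page))
    return all_threads
```

With the names used in this tutorial, the call would be collect_thread_links(links_page, get_post_links). Since the result is a set, a thread that shows up on two pages is kept only once.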

Scrape interactions/posts

Now it's just a matter of scraping the posts from the first page of each thread. This is quite straightforward because the text of each post sits in a span tag whose itemprop attribute equals text. The code says a little more about the tag structure.

For each post published on https://forum.doctissimo.fr/sante/douleur-dos/osteopathie-sujet_165606_1.htm, the code block below prints:

the author
the publication date and time
the text body

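def read_html(link):
    """Fetch a page and return it parsed; return an empty document if the request fails."""
    try:
        html = urlopen(link)
    except HTTPError:
        print("text url not valid")
        # empty document, so later searches simply find nothing
        return BeautifulSoup("", "html.parser")
    return BeautifulSoup(html, "html.parser")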

content = read_html("https://forum.doctissimo.fr/sante/douleur-dos/osteopathie-sujet_165606_1.htm")

for post in content.find_all(class_="md-post__content"):
    for meta in post.find_previous("header").find_all("span"):
        print(meta.get_text().replace("\n",""))
    text = post.find(itemprop="text").get_text()
    body = text.replace("\n","")
    # replace &#034; with quotes
    s = "&#034;"
    post_body = body.replace(s, '"')
    print(post_body)

Mélène4822
		Mélène4822			
	06/09/2020 à 18h28

        Bonjour à tous,Avez-vous déjà eu recours à un ostéopathe pour votre mal de dos ?Et, si oui, en êtes-vous satisfait ?Ou, au contraire, y a-t-il contre-indication ?(J'ai une hernie discale en L5S1, à laquelle s'ajoute un problème de hanche)J'ai rendez-vous avec une ostéopathe demain et je ne suis pas très rassurée même si, paraît-il, elle emploie les méthodes douces actuelles.Merci de vos réponses et bonne soirée à vous
		LordMaxence			
		LordMaxence			
	20/11/2020 à 16h00

        Bonjour,Comme j'ai répondu dans un autre post j,,,,truncated for display
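The cleanup inside the loop above replaces one specific entity by hand. As a possible alternative, the standard library's html.unescape decodes any leftover HTML entity in one call. The helper name clean_post_text is hypothetical, and note one deliberate difference: it joins lines with a single space, whereas the loop above removes line breaks without inserting anything.

```python
import html

def clean_post_text(raw_text):
    """Decode HTML entities and collapse runs of whitespace into single spaces."""
    decoded = html.unescape(raw_text)
    return " ".join(decoded.split())

print(clean_post_text("Bonjour,\n\tj'ai dit &#034;merci&#034;  hier"))
# Bonjour, j'ai dit "merci" hier
```

Joining with a space keeps sentences from running together the way "Bonjour à tous,Avez-vous" does in the output above.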

Perspectives

Now it’s up to you to do some corpus analysis :D

As always, the gist

