
When scraping a site that has many sub-pages, for example https://docs.flock.io/, you can make use of the sitemap to help you scrape all of its content.

Fetch "/sitemap.xml" to get the XML sitemap for the website. Each <loc> element gives you one of the sub-pages belonging to the domain.
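For reference, a sitemap typically looks something like this (the entries below are illustrative, not the actual docs.flock.io sitemap):

<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://docs.flock.io/</loc></url>
  <url><loc>https://docs.flock.io/getting-started</loc></url>
</urlset>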

You can use BeautifulSoup in Python to collect those URLs into a list:

import requests
from bs4 import BeautifulSoup

def scrape_flock_docs():
    url = 'https://docs.flock.io/sitemap.xml'
    response = requests.get(url)
    if response.status_code == 200:
        soup = BeautifulSoup(response.content, 'html.parser')
        # Every <loc> element in the sitemap holds one page URL
        urls = [loc.text.strip() for loc in soup.find_all('loc')]
        return urls
    return []
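To see what the <loc> extraction does without hitting the network, you can run the same BeautifulSoup logic over an inline sitemap string (the URLs here are made up for the sketch):

```python
from bs4 import BeautifulSoup

# A tiny inline sitemap, standing in for response.content
sitemap_xml = """
<urlset>
  <url><loc>https://docs.flock.io/</loc></url>
  <url><loc>https://docs.flock.io/getting-started</loc></url>
</urlset>
"""

soup = BeautifulSoup(sitemap_xml, 'html.parser')
# Same list comprehension as above: one entry per <loc> element
urls = [loc.text.strip() for loc in soup.find_all('loc')]
print(urls)
```

From there, scraping the whole site is just a loop: call requests.get on each entry in urls and parse each page's HTML the same way.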
