
When scraping a site that has many sub-pages, for example https://docs.flock.io/, you can make use of the sitemap to help you scrape all of its content.

Fetch "/sitemap.xml" to get the XML sitemap for the website. Each <loc> element gives you one of the sub-pages belonging to the domain.
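For reference, a sitemap typically looks something like this (the entries below are illustrative, not the actual docs.flock.io sitemap):

<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://docs.flock.io/</loc></url>
  <url><loc>https://docs.flock.io/getting-started</loc></url>
</urlset>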

You can use BeautifulSoup in Python to collect those URLs into a list:

import requests
from bs4 import BeautifulSoup

def scrape_flock_docs():
    url = 'https://docs.flock.io/sitemap.xml'
    response = requests.get(url)
    if response.status_code == 200:
        soup = BeautifulSoup(response.content, 'html.parser')
        # Every <loc> element in the sitemap holds one page URL
        urls = [loc.text.strip() for loc in soup.find_all('loc')]
        return urls
    return []
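To see what the <loc> extraction does without hitting the network, you can run the same BeautifulSoup logic over an inline sitemap string (the URLs here are made up for the sketch):

```python
from bs4 import BeautifulSoup

# A tiny inline sitemap, standing in for response.content
sitemap_xml = """
<urlset>
  <url><loc>https://docs.flock.io/</loc></url>
  <url><loc>https://docs.flock.io/getting-started</loc></url>
</urlset>
"""

soup = BeautifulSoup(sitemap_xml, 'html.parser')
# Same list comprehension as above: one entry per <loc> element
urls = [loc.text.strip() for loc in soup.find_all('loc')]
print(urls)
```

From there, scraping the whole site is just a loop: call requests.get on each entry in urls and parse each page's HTML the same way.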
