# Web Scraping

*Tips*

By [hangytong](https://paragraph.com/@hangytonggmail.com) · 2024-07-18

---

When scraping a site that has many sub-pages, for example [https://docs.flock.io/](https://docs.flock.io/), you can make use of its sitemap to help you scrape all the content.

![](https://storage.googleapis.com/papyrus_images/617833c50e695f3e48a4018450fc868f.png)

Sitemap of Flock.io

Request "/sitemap.xml" to get the XML sitemap for the particular website. Each **`<loc>`** element gives you the URL of one sub-page on that domain.
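For context, a sitemap is a small XML file in a standard format; the URLs below are illustrative, not the real contents of Flock's sitemap:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <!-- one <url> entry per page; <loc> holds the page address -->
  <url>
    <loc>https://docs.flock.io/</loc>
  </url>
  <url>
    <loc>https://docs.flock.io/getting-started</loc>
  </url>
</urlset>
```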

![](https://storage.googleapis.com/papyrus_images/93de508ea843db8f7c49a7a4a2212248.png)

You can use BeautifulSoup in Python to collect these URLs into a list.

    import requests
    from bs4 import BeautifulSoup
    
    def scrape_flock_docs():
        """Return the list of page URLs from the docs.flock.io sitemap."""
        sitemap_url = 'https://docs.flock.io/sitemap.xml'
        response = requests.get(sitemap_url)
        response.raise_for_status()  # fail loudly on HTTP errors
        soup = BeautifulSoup(response.content, 'html.parser')
        # Each <loc> element holds the URL of one page
        return [loc.text.strip() for loc in soup.find_all('loc')]

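With the list of URLs in hand, you can fetch each page and extract its visible text. A minimal sketch of that step; the helper names `extract_page_text` and `scrape_all_pages` are my own, not part of any library:

```python
import requests
from bs4 import BeautifulSoup

def extract_page_text(html):
    """Strip tags, scripts, and styles from raw HTML, returning visible text."""
    soup = BeautifulSoup(html, 'html.parser')
    for tag in soup(['script', 'style']):
        tag.decompose()  # drop non-visible content
    return soup.get_text(separator=' ', strip=True)

def scrape_all_pages(urls):
    """Fetch every page from the sitemap list and return a {url: text} dict."""
    pages = {}
    for page_url in urls:
        response = requests.get(page_url, timeout=10)
        if response.status_code == 200:
            pages[page_url] = extract_page_text(response.text)
    return pages
```

Be polite when looping over many pages: add a short delay between requests and check the site's robots.txt before scraping.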
---

*Originally published on [hangytong](https://paragraph.com/@hangytonggmail.com/web-scraping)*
