# Web Scraping > Tips

**Published by:** [hangytong](https://paragraph.com/@hangytonggmail.com/)
**Published on:** 2024-07-18
**URL:** https://paragraph.com/@hangytonggmail.com/web-scraping

## Content

When scraping a site that has many sub-pages, for example https://docs.flock.io/, you can use its sitemap to discover all of the content. Request the path "/sitemap.xml" on the domain; the response lists every sub-page URL belonging to that domain.

*Figure: sitemap of Flock.io*

You can use BeautifulSoup in Python to collect those URLs into a list:

```python
import requests
from bs4 import BeautifulSoup

def scrape_flock_docs():
    """Return every page URL listed in the Flock docs sitemap."""
    url = 'https://docs.flock.io/sitemap.xml'
    response = requests.get(url)
    if response.status_code == 200:
        soup = BeautifulSoup(response.content, 'html.parser')
        # Each <loc> element in a sitemap holds one page URL.
        return [loc.text.strip() for loc in soup.find_all('loc')]
    return []
```

## Publication Information

- [hangytong](https://paragraph.com/@hangytonggmail.com/): Publication homepage
- [All Posts](https://paragraph.com/@hangytonggmail.com/): More posts from this publication
- [RSS Feed](https://api.paragraph.com/blogs/rss/@hangytonggmail.com): Subscribe to updates
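To see what `find_all('loc')` is actually matching, here is a minimal sketch that parses an inline sample sitemap instead of making a live request; the sample XML and its URLs are made up for illustration, but the structure follows the standard sitemap format:

```python
from bs4 import BeautifulSoup

# A hypothetical, minimal sitemap document standing in for a live
# /sitemap.xml response (these URLs are invented for the example).
sample_sitemap = """<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://docs.flock.io/</loc></url>
  <url><loc>https://docs.flock.io/getting-started</loc></url>
</urlset>"""

# 'html.parser' lowercases tag names, which is fine here since
# sitemap tags like <loc> are already lowercase.
soup = BeautifulSoup(sample_sitemap, 'html.parser')
urls = [loc.text.strip() for loc in soup.find_all('loc')]
print(urls)
```

Once you have this list, the same `requests` + BeautifulSoup pattern can be looped over each URL to fetch and parse the individual pages.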