Overview
Web scraping is the process of extracting information that appears on a website so that it can be saved or reused in another application or task. This tutorial shows how to use Python and two modules, Requests and Beautiful Soup, to scrape a website. The site scraped in this tutorial is https://weather.com/ (1), "The Weather Channel", and the goal is to retrieve the air quality value.
Materials
Computer
Python
- Python Modules: Requests, BeautifulSoup
Terminal, PowerShell, Command Prompt, or an IDE
Procedure
Writing the Script
Begin by making a new Python script and importing the requests and BeautifulSoup modules:
#!/usr/bin/env python3
import requests
from bs4 import BeautifulSoup
To use these modules, install the packages with pip before running any Requests (2) or Beautiful Soup (3) commands. Refer to the pip documentation (4) to install pip if it is not already on your system.
Download Requests:
pip install requests
Note that Requests is a third-party package and is not part of Python's standard library, so it must be installed even on recent Python versions.
Download BeautifulSoup:
pip install beautifulsoup4
The URL for "The Weather Channel" to show any state's weather is "https://weather.com/weather/today/l/". After the last "/", the zip code comes after it. In Python, the script can be instructed to take in a five-digit zip code that can be added to the URL, to show the weather for any state.
#!/usr/bin/env python3
import requests
from bs4 import BeautifulSoup
zip_code = int(input("Enter Zip Code:"))
str_zip_code = str(zip_code)
if len(str_zip_code) <= 4:
    while len(str_zip_code) <= 4:
        str_zip_code = '0' + str_zip_code
    url = 'https://weather.com/weather/today/l/' + str_zip_code
elif len(str_zip_code) == 5:
    url = 'https://weather.com/weather/today/l/' + str_zip_code
else:
    print("Enter a five digit zip code next time.")
    quit()
Only a whole number is accepted as the zip code, and it is saved in the "zip_code" variable. This value is then cast to a string and stored in another variable called "str_zip_code", because it is concatenated onto the "url" variable, which is also a string. Casting prevents the error that occurs when a string and an integer are concatenated.
Some zip codes start with one or more zeros, and when such an input is converted to an integer, the leading zeros are lost; for example, 01234 and 00100 are stored as 1234 and 100 respectively. The if statement checks whether "str_zip_code" is four characters or fewer, and the while loop inside it prepends zeros until the zip code is five characters long again before it is appended to the "url" variable. The elif statement accepts zip codes that are already five digits, and the else statement rejects any other length.
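As a side note, the same padding can also be done with the built-in str.zfill method. The sketch below is an alternative to the approach above, not part of the tutorial's script; it reads the zip code directly as a string so the leading zeros are never lost in the first place:
# Alternative sketch: read the zip code as a string and left-pad it with zeros.
zip_input = input("Enter Zip Code:")
if not zip_input.isdigit() or len(zip_input) > 5:
    print("Enter a five digit zip code next time.")
    quit()
str_zip_code = zip_input.zfill(5)   # '1234' -> '01234', '00100' stays '00100'
url = 'https://weather.com/weather/today/l/' + str_zip_code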
The lines below print the response returned for the URL, to make sure the webpage can be reached before scraping it:
using_request = requests.get(url)
print(using_request)
As stated before, this shows the connection, and it is successful since a status code of 200 was returned. To print the HTML content of the page in the terminal, perform the following command:
print(using_request.content)
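Rather than reading the status code off the printed response, the script itself can check it before continuing. Below is a minimal sketch that reuses the "using_request" variable from above:
# A status code of 200 means the request succeeded; anything else stops the script.
if using_request.status_code == 200:
    print("Connected to:", url)
else:
    print("Request failed with status code:", using_request.status_code)
    quit()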
Below is the content from "The Weather Channel" at a particular time and what was scraped from the webpage afterward:
The requests module returns everything on the webpage as raw HTML text, which is what appears in the terminal. Since this text is difficult to read through, it can be passed to the Beautiful Soup module to make the content easier to work with.
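For example, one quick way to make the raw HTML easier to read is to parse the response with Beautiful Soup and print the prettified version. This short sketch reuses the "using_request" variable from above:
# Parse the raw HTML and print an indented, easier-to-read version of it.
soup = BeautifulSoup(using_request.content, 'html.parser')
print(soup.prettify())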
Beautiful Soup allows the user to search for and print selected HTML elements on the page. To find the id of the element that needs to be scraped, a recommended approach is to inspect the page with the browser's developer tools.
From this inspection, the HTML id of the element to print can be identified; in this case it is: "WxuAirQuality-sidebar-aa4a4fb6-4a9b-43be-9004-b14790f57d73"
#!/usr/bin/env python3
import requests
from bs4 import BeautifulSoup
zip_code = int(input("Enter Zip Code:"))
str_zip_code = str(zip_code)
if len(str_zip_code) <= 4:
    while len(str_zip_code) <= 4:
        str_zip_code = '0' + str_zip_code
    url = 'https://weather.com/weather/today/l/' + str_zip_code
elif len(str_zip_code) == 5:
    url = 'https://weather.com/weather/today/l/' + str_zip_code
else:
    print("Enter a five digit zip code next time.")
    quit()
using_requests = requests.get(url)
soup = BeautifulSoup(using_requests.content, 'html.parser')
html_div = soup.find(id="WxuAirQuality-sidebar-aa4a4fb6-4a9b-43be-9004-b14790f57d73")
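The long id above was copied from the page at one particular time, and ids like this can change. If soup.find() ever returns None, one possible fallback (a sketch, not part of the original script) is to match any element whose id starts with the "WxuAirQuality" prefix:
import re
# Fall back to matching any id that begins with "WxuAirQuality" in case the
# exact id on the page has changed since this tutorial was written.
if html_div is None:
    html_div = soup.find(id=re.compile(r"^WxuAirQuality"))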
To observe only the contents of "html_div", print the variable:
print(html_div)
To find the length of "html_div", first cast the variable to a string and then print the length:
str_html_div = str(html_div)
length_of_html_div = len(str_html_div)
print(length_of_html_div)
"html_div" was also cast into a string before finding the length of the content to give an accurate number back. Without casting the information to a string, the output of the length would be "1" since it reads the entire HTML content as one statement.
In the middle of the output, where it says "text-anchor="middle" x="50%" y="55%">70<...", the number 70 is the air quality value.
A for loop can be used to print every character of the string together with its index position, which reveals where the air quality digits are located. Below is the for loop that was used to find those indexes:
index = 0
# Print each character of the string next to its index position.
for i in str_html_div:
    print(index, str_html_div[index])
    index += 1
After running the for loop, index positions 1327, 1328, and potentially 1329 display the air quality digits. Another if statement is implemented to check whether index 1329 contains a digit, since the air quality value can be either two or three digits long.
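The exact index positions depend on the HTML returned at that moment, so they can shift whenever the page changes. As an alternative sketch, a regular expression can pull the digits that appear right after the text-anchor="middle" attribute shown earlier (the pattern below is an assumption based on that snippet):
import re
# Capture the first run of digits that appears between '>' and '<' after the
# text-anchor="middle" attribute, e.g. ...y="55%">70<...
match = re.search(r'text-anchor="middle"[^>]*>(\d+)<', str_html_div)
if match:
    print("The air quality is:", match.group(1))
else:
    print("Could not find the air quality value in the HTML.")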
#!/usr/bin/env python3
# Final Script
import requests
from bs4 import BeautifulSoup
zip_code = int(input("Enter Zip Code:"))
str_zip_code = str(zip_code)
if len(str_zip_code) <= 4:
    while len(str_zip_code) <= 4:
        str_zip_code = '0' + str_zip_code
    url = 'https://weather.com/weather/today/l/' + str_zip_code
elif len(str_zip_code) == 5:
    url = 'https://weather.com/weather/today/l/' + str_zip_code
else:
    print("Enter a five digit zip code next time.")
    quit()
using_requests = requests.get(url)
soup = BeautifulSoup(using_requests.content, 'html.parser')
html_div = soup.find(id="WxuAirQuality-sidebar-aa4a4fb6-4a9b-43be-9004-b14790f57d73")
str_html_div = str(html_div)
list_of_string_num = ["0", "1", "2", "3", "4", "5", "6", "7", "8", "9"]
if str_html_div[1329] in list_of_string_num:
    air_quality = str_html_div[1327] + str_html_div[1328] + str_html_div[1329]
    print("The air quality is:", air_quality)
else:
    air_quality = str_html_div[1327] + str_html_div[1328]
    print("The air quality is:", air_quality)
Demonstration: Running the Script
Sources
https://weather.com/ (1)
https://pypi.org/project/requests/ (2)
https://pypi.org/project/beautifulsoup4/ (3)
https://pip.pypa.io/en/stable/installation/ (4)
Source Code
https://github.com/AndrewDass1/SCRIPTS/tree/main/Python/Web%20Scrape%20Air%20Quality