How to Extract Data From a Website Using BeautifulSoup
Extracting data from web pages is the core task in web scraping. BeautifulSoup is an HTML parsing library that makes it easy to pull data out of a page by working with its underlying markup. In this tutorial, we’ll learn how to extract data from
There are two main ways to extract data from a website:
- Use an API (if one is available) to retrieve the data.
- Access the HTML of the webpage and extract useful information/data from it.
In this article, we will take the second approach and extract Billboard magazine’s Year-End Hot 100 songs of 1970 from the Wikipedia page "Billboard Year-End Hot 100 singles of 1970".
Task:
- Scrape the page and extract all 100 songs with their artists.
- Create a Python dictionary in which each key is the title of a single and each value is a list of its artists.
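To make the target concrete, here is a minimal sketch of the shape the dictionary should have. The titles and artist names are placeholders chosen for illustration, not real chart entries, and `example` is a hypothetical variable name.

```python
# Placeholder entries only; these are not real chart data.
example = {
    "Example Title A": ["Example Artist 1"],
    "Example Title B": ["Example Artist 2", "Example Artist 3"],
}

# Each key is a song title and each value is a list of artist names.
for title, artists in example.items():
    print(title, "->", ", ".join(artists))
```

Using a list as the value keeps the structure uniform whether a song has one artist or several.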
Installation
We need to install requests and bs4. The requests module lets you send HTTP requests from Python. Beautiful Soup (bs4) is a Python library for pulling data out of HTML and XML files.
pip install requests
pip install bs4
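As a quick sanity check after installing, a short script along these lines should confirm that both packages can be imported. It only prints whatever versions happen to be installed, so the output will differ from machine to machine.

```python
import bs4
import requests

# Both packages expose their version as a string attribute.
print("requests", requests.__version__)
print("beautifulsoup4", bs4.__version__)
```

If either import raises ModuleNotFoundError, the corresponding pip command most likely ran against a different Python environment.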
Import the libraries
import requests
from bs4 import BeautifulSoup
Sending request
url = "https://en.wikipedia.org/wiki/Billboard_Year-End_Hot_100_singles_of_1970"
response = requests.get(url)
print(response.url)  # the final URL after any redirects
print(response.status_code)  # HTTP status code; 200 means the request succeeded
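The request above assumes the download succeeds. A slightly more defensive version might look like the sketch below. The function name `fetch_html`, the 10-second timeout, and the User-Agent string are all assumptions made for illustration; some sites reject requests that lack a descriptive User-Agent, so sending one is a common precaution rather than a requirement stated by this tutorial.

```python
import requests


def fetch_html(url, timeout=10):
    """Return the page's HTML, raising an exception on network or HTTP errors."""
    # Hypothetical User-Agent string; replace it with one that identifies your script.
    headers = {"User-Agent": "billboard-tutorial-example/0.1"}
    response = requests.get(url, headers=headers, timeout=timeout)
    response.raise_for_status()  # raises requests.HTTPError for 4xx/5xx responses
    return response.text
```

Calling `fetch_html(url)` with the `url` defined above returns the same text as `response.text`, but fails loudly instead of passing an error page on to the parser.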
Parsing the HTML
songSoup = BeautifulSoup(response.text, "html.parser")  # parse the page with Python's built-in HTML parser
data_dictionary = {}
# Skip the first <tr> (the table header) and take the next 100 rows, one per song
for song in songSoup.find_all('tr')[1:101]:
    # print(song)  # uncomment to inspect the raw table row
    links = song.find_all('a')
    title = links[0].string   # first link in the row: the song title
    artist = links[1].string  # second link in the row: normally the artist
    # Print the title and artist
    print(title, ',', artist)
    # Store the artist in a list under the song title
    data_dictionary[title] = [artist]
# Print the finished dictionary
print(data_dictionary)
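The loop above stores one artist per song, while the task asks for a list of artists. One possible variation is sketched below. It runs on a small hand-written table rather than the live page, and it assumes a three-column layout (position, title, artists) with each artist wrapped in its own link; the real page's markup may differ. `SAMPLE_HTML` and `rows_to_dict` are hypothetical names, and the titles and artists are placeholders.

```python
from bs4 import BeautifulSoup

# Hand-written stand-in for the chart table; the real page's markup may differ.
SAMPLE_HTML = """
<table>
  <tr><th>No.</th><th>Title</th><th>Artist(s)</th></tr>
  <tr><td>1</td><td>"<a href="#">Example Title A</a>"</td>
      <td><a href="#">Example Artist 1</a></td></tr>
  <tr><td>2</td><td>"<a href="#">Example Title B</a>"</td>
      <td><a href="#">Example Artist 2</a> and <a href="#">Example Artist 3</a></td></tr>
</table>
"""


def rows_to_dict(html):
    """Map each song title to a list of every linked artist in its row."""
    soup = BeautifulSoup(html, "html.parser")
    songs = {}
    for row in soup.find_all("tr")[1:]:  # skip the header row
        cells = row.find_all("td")
        if len(cells) < 3:
            continue  # not a data row in the assumed three-column layout
        # Read the title from the cell text and drop the surrounding quotes
        title = cells[1].get_text(strip=True).strip('"')
        # Collect the text of every link in the artist cell
        artists = [a.get_text(strip=True) for a in cells[2].find_all("a")]
        songs[title] = artists
    return songs


print(rows_to_dict(SAMPLE_HTML))
```

Selecting by table cell instead of by link position means a title without a link, or a song with several artists, no longer shifts the indexes.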