Python in Plain English

New Python content every day. Follow to join our 3.5M+ monthly readers.

Web Scraping with Python: A Comprehensive Guide to Extracting Data from the Web

Web scraping is the automated extraction of data from websites: code retrieves and parses HTML, then stores the data you need in a structured form for later use. It typically involves gathering specific information, such as news articles, blog posts, product details, or social media posts, from one or more websites. Common uses include data analysis and competitive intelligence.

One of the most popular Python libraries for web scraping is Beautiful Soup, which is installed as the beautifulsoup4 package and imported as bs4. Its built-in methods make it easy to extract content from HTML and XML documents.

Using BeautifulSoup in Python

1. Install BeautifulSoup

Before getting started, make sure you have BeautifulSoup installed, along with the requests library that is used below to fetch pages. You can install both with pip by running the following command in your terminal:

pip install beautifulsoup4 requests

2. Import necessary modules

Import the requests module in your Python file to make HTTP requests, and the BeautifulSoup class from the bs4 module to parse HTML content.

import requests
from bs4 import BeautifulSoup

3. Make an HTTP request

Use the requests library to send an HTTP GET request to the website and save the response in a variable.

url = "https://example.com"
response = requests.get(url)
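
Real requests can hang or fail, so it may help to set a timeout and check the status code before parsing. The sketch below wraps this in a hypothetical helper named fetch_html. The helper name and the 10-second default are assumptions made for illustration, not part of the requests library.

```python
import requests


def fetch_html(url, timeout=10):
    """Return the body of `url` as text, or None if the request fails.

    `fetch_html` and the 10-second default are illustrative choices.
    """
    try:
        response = requests.get(url, timeout=timeout)
        # Raise requests.HTTPError for 4xx/5xx status codes.
        response.raise_for_status()
    except requests.RequestException as exc:
        # Covers connection errors, timeouts, bad URLs, and HTTP errors.
        print(f"Request failed: {exc}")
        return None
    return response.text
```

You could then call html = fetch_html("https://example.com") and parse the result only when it is not None.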

4. Parse HTML content

Create a soup object by passing the response content and the parser type to the BeautifulSoup class.

soup = BeautifulSoup(response.content, "html.parser")
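
To try the parsing step without a network connection, the same constructor can be given a plain HTML string. In this minimal sketch the inline document is made up and stands in for response.content. "html.parser" is the parser bundled with Python, so nothing extra needs to be installed.

```python
from bs4 import BeautifulSoup

# A small inline document stands in for response.content here.
html = "<html><head><title>Demo page</title></head><body><p>Hello</p></body></html>"

soup = BeautifulSoup(html, "html.parser")

page_title = soup.title.string  # text inside the <title> tag
print(page_title)
```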

5. Extract data from HTML

Use the methods of the soup object to find the data you need, then use it however you intend. You can also save the data to a file for future use.

Suppose the program should scrape every link on a web page and print each one to the console.

links = soup.find_all('a')  # find all anchor tags
for link in links:
    href = link.get('href')  # get value of href from each link/anchor tag
    print(href)
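
Pages often contain relative links and anchors without an href, so a common refinement is to skip the empty ones and resolve the rest against the page address. The following is a sketch under assumptions: base_url and the inline HTML are invented for illustration. urljoin comes from the standard library.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

base_url = "https://example.com/blog/"  # assumed address of the scraped page
html = """
<a href="/about">About</a>
<a href="post-1.html">First post</a>
<a name="top">No href here</a>
<a href="https://www.python.org/">Python</a>
"""

soup = BeautifulSoup(html, "html.parser")

absolute_links = []
# href=True keeps only anchors that actually have an href attribute.
for link in soup.find_all("a", href=True):
    absolute_links.append(urljoin(base_url, link["href"]))

print(absolute_links)
```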

6. Storing extracted data

Once you have the data, you can clean and manipulate it however you like inside your code. You can then store it for future processing.

with open('links.txt', 'w') as file:
    for link in links:
        href = link.get('href')  # value of href, or None if the attribute is missing
        if href:  # skip anchors without an href to avoid a TypeError
            file.write(href + '\n')  # one link per line
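
When you want to keep more than one field per link, a CSV file may be easier to process later than plain text. This sketch uses the standard csv module. The inline HTML, the column names, and the temporary output path are all assumptions made for illustration.

```python
import csv
import os
import tempfile

from bs4 import BeautifulSoup

html = '<a href="/a">First</a><a href="/b">Second</a><a>No link</a>'
soup = BeautifulSoup(html, "html.parser")

# Collect (text, href) pairs, skipping anchors without an href.
rows = [(a.get_text(strip=True), a["href"]) for a in soup.find_all("a", href=True)]

# A temporary directory keeps the sketch from touching real files.
output_path = os.path.join(tempfile.mkdtemp(), "links.csv")
with open(output_path, "w", newline="", encoding="utf-8") as file:
    writer = csv.writer(file)
    writer.writerow(["text", "href"])  # header row
    writer.writerows(rows)
```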

Methods of BeautifulSoup

1. find (tag)

This method searches the HTML content by tag name and returns only the first matching tag. If no tag matches, it returns None.

# Example: Find the first <h1> tag
h1_tag = soup.find("h1")
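
Because find returns None when nothing matches, it is worth checking the result before reading attributes from it. A minimal sketch, using a made-up document that has no <h1> tag:

```python
from bs4 import BeautifulSoup

html = "<div><h2>Subtitle only</h2></div>"
soup = BeautifulSoup(html, "html.parser")

h1_tag = soup.find("h1")  # no <h1> in this document, so this is None

# Guard before touching attributes, otherwise .text raises AttributeError.
heading = h1_tag.text if h1_tag is not None else "no heading found"
print(heading)
```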

2. find_all (tag)

Like find, this method searches the HTML content by tag name, but it returns a list of all matching tags instead of only the first one. If nothing matches, the list is empty.

# Example: Find all <h1> tags
h1_tags = soup.find_all("h1")
# Example: Find all <h1> tags with the title class
title_h1_tags = soup.find_all("h1", class_='title')
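
find_all accepts a few other filters as well. The sketch below, run against a made-up document, shows a list of tag names, the limit argument, and the class_ keyword.

```python
from bs4 import BeautifulSoup

html = """
<h1 class="title">Main</h1>
<h2>Section A</h2>
<h2>Section B</h2>
<h1>Other</h1>
"""
soup = BeautifulSoup(html, "html.parser")

# A list matches any of the given tag names, in document order.
headings = [tag.text for tag in soup.find_all(["h1", "h2"])]

# limit stops the search after the given number of matches.
first_h2 = [tag.text for tag in soup.find_all("h2", limit=1)]

# class_ has a trailing underscore because class is a Python keyword.
titles = [tag.text for tag in soup.find_all("h1", class_="title")]
```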

3. select (selector)

This method finds tags using CSS selectors, which can combine tag names, ids, and classes. It returns a list of all HTML tags that match the given selector.

# Example: Find all <h1> tags with the title class
title_h1_tags = soup.select("h1.title")
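
CSS selectors can also express relationships between elements, and select_one is a companion method that returns only the first match. A short sketch against a made-up document:

```python
from bs4 import BeautifulSoup

html = """
<div id="content">
  <h1 class="title">Welcome</h1>
  <ul>
    <li><a href="/one">One</a></li>
    <li><a href="/two">Two</a></li>
  </ul>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

# Descendant selector: every <a> anywhere inside the element with id="content".
hrefs = [a["href"] for a in soup.select("#content a")]

# select_one returns the first match, or None when nothing matches.
title = soup.select_one("h1.title")
missing = soup.select_one("h1.subtitle")
```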

4. get (attribute)

This method returns the value of an attribute of a retrieved tag, such as class, id, href, or name. It returns None if the attribute doesn't exist, unless you pass a default value as the second argument.

# returns the link, or None if the tag has no href
href = link.get("href")
# returns the link if it exists, otherwise returns "not found"
href = link.get("href", "not found")
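
Attributes can also be read with dictionary-style indexing, which behaves differently when the attribute is missing. The sketch below compares the two on a made-up anchor tag. It also shows that class, which can hold several values, comes back as a list.

```python
from bs4 import BeautifulSoup

html = '<a class="btn primary" id="home">Home</a>'
soup = BeautifulSoup(html, "html.parser")
link = soup.find("a")

href = link.get("href")           # None: the attribute is absent
fallback = link.get("href", "#")  # "#": the explicit default is returned
classes = link.get("class")       # multi-valued attribute, returned as a list

try:
    link["href"]                  # indexing raises KeyError instead of returning None
except KeyError:
    missing_raises = True
else:
    missing_raises = False
```

get is usually the safer choice when an attribute is optional. Indexing suits cases where a missing attribute should be treated as an error.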

5. text

This property gives the text content of an HTML tag and everything nested inside it, with all tags and attributes removed.

paragraph = soup.find('p')
text = paragraph.text  # e.g. "This is a paragraph."

6. parent

This property gives the parent element of a tag.

paragraph = soup.find('p')
parent = paragraph.parent  # e.g. the enclosing <div> tag
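
parent returns a tag object rather than a string, so its name and attributes can be inspected like those of any other tag. The sketch below, on a made-up fragment, also uses find_parent and parents to walk further up the tree.

```python
from bs4 import BeautifulSoup

html = '<section><div class="card"><p>This is a paragraph.</p></div></section>'
soup = BeautifulSoup(html, "html.parser")
paragraph = soup.find("p")

parent = paragraph.parent                   # the enclosing <div> tag object
parent_name = parent.name                   # tag name as a string
section = paragraph.find_parent("section")  # nearest <section> ancestor

# parents yields every ancestor, starting with the closest one.
ancestor_names = [tag.name for tag in paragraph.parents]
```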

These are some of the basic methods the library offers for extracting data from websites. You can read the official documentation for more information: https://www.crummy.com/software/BeautifulSoup/bs4/doc/


Published in Python in Plain English


Written by Gopal Khadka

Aspiring coder and writer, passionate about crafting compelling stories through code and words. Merging creativity and technology to connect with others.
