Internal linking is crucial for SEO, helping search engines crawl your site and improving content visibility. However, manually finding internal linking opportunities can be tedious, especially for larger websites.
This article introduces a Python script that automates the process, making it easier to identify potential links across your content. By leveraging Python's data processing and web scraping capabilities, the script quickly analyzes your website, saving you time and improving your internal link structure. In the following sections, we'll walk through setting up the script and using it to enhance your SEO strategy. Credit is due to the article "Finding Inlink Opportunities with Python for SEO", which introduced the method that inspired this script.
Before diving into the script, let's briefly outline the overall process: load your keyword data from a SEMrush CSV export, extract and deduplicate the URLs that already rank, crawl each page to find keyword mentions that are not yet linked to the relevant page, and export the resulting opportunities to an Excel file.
To begin, you need to gather and load your keyword data: keyword positions, search volume, and keyword difficulty, all of which can be exported from SEMrush.
Download Keyword Data from SEMrush: Export your keyword report as a CSV file. This data typically includes important metrics such as keyword positions, search volume, keyword difficulty, and the URLs that rank for these keywords.
Loading the Data with Pandas: After downloading the CSV file, the next step is to load it into Python using the pandas library, which allows us to process and analyze the data efficiently. Here's how to load the data:
```python
import pandas as pd
import requests
from bs4 import BeautifulSoup

# 1. Load Keywords Data
list_keywords = pd.read_csv('your_file.csv')
list_keywords = list_keywords.values.tolist()
```
pandas is a powerful Python library for data manipulation and analysis. It provides data structures like DataFrames, which are ideal for handling and processing structured data, such as the keyword data in a CSV file. In this script, pandas is used to load and manipulate the keyword data from the CSV file. It allows for easy reading of the data, conversion to other formats (like lists), and further processing.
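As a quick illustration, here is a minimal sketch of inspecting the loaded data before converting it to a list; the file name and column order are assumptions based on a typical SEMrush export and may differ in your file:

```python
import pandas as pd

# Assumed file name; adjust to match your actual SEMrush export
df = pd.read_csv('your_file.csv')

print(df.columns.tolist())  # verify the column order before relying on indexes
print(df.head())            # preview the first few rows

# values.tolist() turns each row into a plain Python list,
# which the rest of the script iterates over
list_keywords = df.values.tolist()
```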
requests is a popular Python library for making HTTP requests, such as fetching the content of a web page; it simplifies working with web resources. The script uses requests to send HTTP requests to the URLs extracted from the keyword data and retrieve the HTML content of those pages, which is then parsed to find potential internal linking opportunities.
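Here is a minimal sketch of fetching a page with requests; the URL is a placeholder, and the timeout and status check are optional safeguards rather than part of the original script:

```python
import requests

url = "https://example.com/some-page/"  # placeholder URL

# The timeout and status check are optional safeguards; the script itself
# simply calls requests.get(url)
response = requests.get(url, timeout=10)
response.raise_for_status()

html = response.text  # raw HTML, ready to be parsed with BeautifulSoup
```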
BeautifulSoup is the main parsing class from the bs4 package, used for parsing HTML and XML documents. It builds a parse tree from which data can be extracted tag by tag. After the requests library retrieves the HTML content of a page, BeautifulSoup parses that content and helps find and extract specific elements, such as paragraphs of text and existing links on the page, which are essential for identifying where internal links can be added.
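For instance, here is a minimal sketch of parsing a page and pulling out its paragraphs and link targets; the tiny HTML string stands in for a real page fetched with requests:

```python
from bs4 import BeautifulSoup

# A tiny HTML sample standing in for a real page fetched with requests
html = "<p>First paragraph about SEO.</p><a href='/blog/internal-links/'>post</a>"

soup = BeautifulSoup(html, 'html.parser')

# All paragraph texts on the page
paragraphs = [p.get_text() for p in soup.find_all('p')]

# All link targets (href attributes) on the page
links = [a.get('href') for a in soup.find_all('a')]

print(paragraphs)  # ['First paragraph about SEO.']
print(links)       # ['/blog/internal-links/']
```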
The next step in the script involves extracting the URLs from the keyword data. These URLs represent the pages on your website that are already ranking for the keywords you’re analyzing. This is crucial because these are the pages where you’ll potentially want to add internal links.
```python
# 2. Get the URL list
list_urls = []
for x in list_keywords:
    list_urls.append(x[6])  # Assuming the URL is in the 7th column (index 6)
```
In this step, the script removes duplicate URLs from the list and prepares a structured list that associates each keyword with its corresponding URL and related metrics. This is an important step to ensure that the script runs efficiently and accurately.
```python
# Remove duplicate URLs
list_urls = list(dict.fromkeys(list_urls))
```
After extracting the URLs from the keyword data, the next task is to ensure there are no duplicates in the list. Duplicate URLs can lead to redundant processing, which is inefficient. The script removes these duplicates by converting the list of URLs into a dictionary and back into a list, leveraging the fact that dictionary keys must be unique.
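As a small illustration of why this works, dictionary keys are unique and, in modern Python, keep their insertion order, so converting to a dict and back drops only the repeated entries:

```python
urls = [
    "https://example.com/a/",
    "https://example.com/b/",
    "https://example.com/a/",  # duplicate
]

# dict keys must be unique, so the duplicate collapses while order is preserved
deduped = list(dict.fromkeys(urls))
print(deduped)  # ['https://example.com/a/', 'https://example.com/b/']
```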
```python
# Pair each URL with its keyword and metrics.
# Assumed CSV layout (typical SEMrush export): Keyword (index 0), Position (index 1),
# Search Volume (index 3), Keyword Difficulty (index 4), URL (index 6)
list_keyword_url = []
for x in list_keywords:
    list_keyword_url.append([x[6], x[0], x[1], x[3], x[4]])
```
This second snippet builds the structured list mentioned above: each entry pairs a ranking URL with its keyword, position, search volume, and keyword difficulty, which is exactly what the crawler iterates over in the next step.
In this step, the script crawls each URL extracted earlier, analyzes the content of the pages, and identifies opportunities to add internal links based on the keywords in your dataset.
```python
# 3. Crawl the pages and find the matches
internal_linking_opportunities = []
absolute_route = str(input("Insert your absolute route (site root URL): "))

for iteration in list_urls:
    page = requests.get(iteration)
    print(iteration)
    soup = BeautifulSoup(page.text, 'html.parser')

    # Extract the text of every paragraph on the page
    paragraphs = [x.text for x in soup.find_all('p')]

    # Collect every link already present on the page
    links = []
    for link in soup.find_all('a'):
        links.append(link.get('href'))

    for x in list_keyword_url:
        for y in paragraphs:
            # Strip basic punctuation and check whether the keyword appears as a
            # whole phrase in the paragraph, skipping the page's own URL
            cleaned = " " + y.lower().replace(",", "").replace(".", "").replace(";", "").replace("?", "").replace("!", "") + " "
            if " " + x[1].lower() + " " in cleaned and iteration != x[0]:
                # Check whether the page already links to the target URL
                links_presence = False
                for z in links:
                    try:
                        if x[0].replace(absolute_route, "") == z.replace(absolute_route, ""):
                            links_presence = True
                    except AttributeError:
                        pass
                if not links_presence:
                    internal_linking_opportunities.append([x[1], y, iteration, x[0], "False", x[2], x[3], x[4]])
                else:
                    internal_linking_opportunities.append([x[1], y, iteration, x[0], "True", x[2], x[3], x[4]])
```
For each URL in your list, the script retrieves the page content and parses it to find paragraphs of text. It then checks each paragraph for the presence of keywords from your dataset.
If a keyword is found and an internal link to the relevant page does not already exist, the script logs this as an opportunity. It records the keyword, the paragraph where it was found, the current page (source URL), the target URL, and whether or not the link already exists.
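To make that logic easier to follow, here is a sketch of the same two checks pulled out into small helper functions; the function names are illustrative and not part of the original script:

```python
def keyword_in_paragraph(keyword, paragraph):
    """Check whether the keyword appears as a whole phrase in the paragraph,
    ignoring common punctuation (the same idea as the inline check above)."""
    cleaned = paragraph.lower()
    for char in ",.;?!":
        cleaned = cleaned.replace(char, "")
    return " " + keyword.lower() + " " in " " + cleaned + " "


def link_already_exists(target_url, links, absolute_route):
    """Check whether any existing link on the page already points at the target
    URL, comparing the two with the site root stripped off."""
    target_path = target_url.replace(absolute_route, "")
    for href in links:
        if href and href.replace(absolute_route, "") == target_path:
            return True
    return False
```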
```python
# 4. Export the results to an Excel file
pd.DataFrame(
    internal_linking_opportunities,
    columns=["Keyword", "Text", "Source URL", "Target URL", "Link Presence",
             "Keyword Position", "Search Volume", "Keyword Difficulty"]
).to_excel('internal_linking_opportunities.xlsx', header=True, index=False)
```
The final step of the script involves taking the internal linking opportunities identified in the previous step and exporting them into an Excel file. This allows you to review the results and implement the suggested internal links on your website.
After identifying potential internal linking opportunities, the script's final task is to compile these results into an Excel file. This step is crucial because it transforms the raw data into a format that is easy to review and implement. The script uses pandas to create a DataFrame from the list of internal linking opportunities, assigning meaningful column names to each piece of data.
The DataFrame is then exported to an Excel file (internal_linking_opportunities.xlsx), which includes all the necessary details: the keyword found, the specific text where it appears, the source page, the target page, and additional SEO metrics. This file serves as a practical guide for improving your website's internal link structure, providing clear and actionable insights for your SEO efforts.
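One practical note: pandas writes .xlsx files through the openpyxl engine, so you may need to install it first (pip install openpyxl). As an optional refinement, not part of the original script, you could also sort the opportunities by search volume before exporting so the highest-impact suggestions appear first:

```python
import pandas as pd

columns = ["Keyword", "Text", "Source URL", "Target URL", "Link Presence",
           "Keyword Position", "Search Volume", "Keyword Difficulty"]

df = pd.DataFrame(internal_linking_opportunities, columns=columns)

# Optional: surface the highest-volume keywords first for easier review
df = df.sort_values("Search Volume", ascending=False)

df.to_excel('internal_linking_opportunities.xlsx', index=False)
```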
Automating the process of finding internal linking opportunities using Python not only saves time but also ensures a more thorough and systematic approach to enhancing your website’s SEO. By following the steps outlined in this guide, you can efficiently identify pages that can be interconnected through relevant keywords, improving your site’s overall link structure and search engine visibility. If you have any questions or need further assistance, please don’t hesitate to reach out.