Websites thrive on visibility and engagement. However, search engine algorithms change constantly, and a once well-crafted SEO strategy can become stale or even harmful to a website over time. Many website managers struggle to keep up with the competition because manually auditing blog articles is time-consuming and slows adaptation.
Recognizing these issues, I built an Automated SEO Audit Tool to address them. First, though, I needed to scrape the data to be audited. For this, I used Bright Data’s No-Code AI Scraper. The tool then scans the scraped data for SEO weaknesses, analyzes meta titles, descriptions, headers, links, images, and content length, and generates downloadable reports with AI recommendations.
What We’ll Cover
In this article, I will walk you through:
- Accessing AI Scraper from Bright Data’s website.
- Building an SEO audit tool with the scraped data.
- Creating a friendly UI to interact with the tool.
- Deploying the tool on Streamlit’s free cloud hosting platform.
By the end, you will have built a fully functional SEO auditing tool that automatically scans scraped web pages, detects critical SEO issues, and generates insightful audit reports.
Let’s get started!
Accessing AI Scraper from Bright Data
The first step in building this tool was scraping data from the target website. To achieve this, I used Bright Data’s Data for AI Web Scrapers, which return structured and unstructured data that can easily be passed to an AI model. Within a few minutes, I had my scraped data in hand without writing a single line of code.
To access AI Scraper:
- Sign up on Bright Data.
- On the sidebar menu, click on “Web Scrapers” displayed as a “bug” icon.
- Select “Scrapers marketplace” and type “AI Scraper” in the search box.
- Click on “AI Scraper — discover by domain URL”
- Select the “No-Code Scraper” option and click “Next”. You can also use the Scraper API option to directly scrape websites via an API call.
- Scroll down to the “Add Inputs” section and paste the URL of the website you want to scrape. The other parameters are optional.
- Scroll to the “Output field” column and change the format from “Markdown” to “page_html”.
- Click “Start Collecting” and the AI Scraper will begin fetching your data instantly.
- When your data is ready, you can download it in various formats like JSON, CSV, NDJSON, and JSONL. I downloaded mine as a JSON file.
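Once downloaded, the JSON export is straightforward to inspect with a few lines of Python. The snippet below is a minimal sketch using a made-up sample record; the `url` and `page_html` field names match what this project expects, but a real export may carry additional metadata columns.

```python
import json

# Hypothetical sample mimicking an export with the output field set to "page_html";
# a real Bright Data download may include more fields per record.
sample_export = '[{"url": "https://example.com", "page_html": "<html><title>Demo</title></html>"}]'

records = json.loads(sample_export)
for record in records:
    print(record["url"], "->", len(record["page_html"]), "characters of HTML")
```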
With the data in hand, it was time to build the SEO audit tool.
Building the SEO Audit Tool
Step 1: Setting Up the Project
Create a new project folder in a virtual environment and install the necessary dependencies.
1. Create the project folder:
mkdir seo_audit_project
cd seo_audit_project
2. Set up a virtual environment:
python -m venv venv
Then activate it:
- Windows:
venv\Scripts\activate
- macOS/Linux:
source venv/bin/activate
3. Install the dependencies:
pip install streamlit requests beautifulsoup4 huggingface_hub textstat
The key dependencies are:
- streamlit: For building the web app’s UI.
- requests: Allows the application to make HTTP requests, used here to check link status.
- beautifulsoup4 (bs4): Beautiful Soup, used in this project to parse HTML and extract SEO metadata from web pages.
- huggingface_hub: Allows the app to communicate with the Hugging Face Inference API.
- textstat: This is an optional dependency that enables readability analysis of text.
4. Define the Project Structure:
seo_audit_project/
│── ai_seo_auditor.py
Step 2: Enabling a Large Language Model (LLM) from HuggingFace
I used Mistral-7B-Instruct-v0.3 for the AI analysis and recommendations. Here are the steps to set up the same model:
1. Go to the Hugging Face website.
2. Click on “Sign Up” and create an account.
3. Navigate to your profile.
4. Go to “Settings”.
5. Scroll down and click on “Access Tokens”.
6. Select “Create new token” to generate an API token.
7. Change “Token type” to READ, name the token, and click “Create token”.
8. On the Home page, select “Models” and search for Mistral AI.
9. On Mistral AI’s page, request access to the model you want (here, Mistral-7B-Instruct-v0.3). That’s it. You now have access to Mistral AI; you can test the model by opening a playground.
Step 3: Building the Code Base
Now it was time to start building the code base and writing a suitable prompt to guide the LLM in its job.
1. Preparing for SEO Analysis
To begin, the application imports the necessary libraries from the installed dependencies and initializes logging to track code activity.
import streamlit as st
import requests
from bs4 import BeautifulSoup
import json
import re
import logging
import urllib.parse
from huggingface_hub import InferenceClient
import base64
# Optional: for readability analysis; install via pip install textstat
try:
    import textstat
    READABILITY_ENABLED = True
except ImportError:
    READABILITY_ENABLED = False
# Configure logging
logging.basicConfig(level=logging.INFO, format='%(asctime)s - %(levelname)s - %(message)s')
The code above achieves the following:
- Imports: It starts by bringing in the necessary tools. streamlit builds the web app, requests checks links, BeautifulSoup parses web pages, json handles data files, re cleans text, logging tracks activity, urllib.parse manages URLs, huggingface_hub connects to the AI model, and base64 encodes data for downloads.
- Optional Feature: It then tries to enable readability analysis with textstat. If the library exists, the feature is turned on.
- Logging Setup: Further down, the code configures a system to record its actions, which helps with debugging and tracking what’s happening.
2. AI SEO Analysis
Now, it configures the connection to the Mistral model through a HuggingFace API token and inference URL.
# Hugging Face API configuration
HF_API_TOKEN = "hf_afjhturhdjhrufhfudhufhduj"  # Replace with your own API token; this is a dummy
HF_URL = "https://api-inference.huggingface.co/models/mistralai/Mistral-7B-Instruct-v0.3"
# Timeout for HTTP requests (in seconds)
REQUEST_TIMEOUT = 10
Here is a summary of what the code above achieves:
- Hugging Face API Setup: It configures the connection to the Hugging Face AI model. HF_API_TOKEN stores the API key for accessing the LLM, and HF_URL specifies the exact address of the Mistral-7B-Instruct-v0.3 model.
- Network Request Timeout: REQUEST_TIMEOUT sets a 10-second limit on how long the application waits for responses from external web requests, so it isn’t stuck if a website is slow or unresponsive.
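To illustrate how such a timeout is typically applied, here is a small sketch of a link checker. The helper name `check_link_status` is mine, not from the project; in the actual app, the equivalent logic lives inside the HTML parsing function.

```python
import requests

REQUEST_TIMEOUT = 10  # seconds

def check_link_status(url, timeout=REQUEST_TIMEOUT):
    """Return the HTTP status code for a URL, or 'Error' if the request fails."""
    try:
        resp = requests.head(url, timeout=timeout, allow_redirects=True)
        if resp.status_code >= 400:
            # Some servers reject HEAD requests, so fall back to GET
            resp = requests.get(url, timeout=timeout)
        return resp.status_code
    except requests.RequestException:
        return "Error"
```

Any network failure, including a timeout, surfaces as a `requests.RequestException`, so the caller gets a single `"Error"` sentinel instead of a crash.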
3. Identifying the Page’s Origin
Web pages usually contain relative links, which are URLs that don’t specify the full address of a resource. For example, instead of ‘https://www.example.com/images/logo.png’, a page might just have /images/logo.png. To make sense of these relative links, the code resolves them against the page’s base URL to form absolute URLs.
def get_base_url_from_html(page_html):
    """Extracts the base URL from the provided HTML."""
    soup = BeautifulSoup(page_html, "html.parser")
    base_tag = soup.find("base")
    if base_tag and base_tag.get("href"):
        return base_tag["href"]
    # If no <base> tag, try to extract from the first link's absolute URL
    first_link = soup.find("a", href=True)
    if first_link and urllib.parse.urlparse(first_link["href"]).netloc:
        return urllib.parse.urljoin(first_link["href"], "/")
    return ""  # Return empty string if the base URL cannot be found
Below are more details on how the function works:
- HTML Parsing: soup = BeautifulSoup(page_html, "html.parser") takes the page_html string (the web page’s HTML) and parses it using BeautifulSoup.
- Checking for a <base> Tag: base_tag = soup.find("base") searches the parsed HTML for the <base> tag, which specifies the base URL for all relative URLs in the document.
- No Base URL Found: if neither a <base> tag nor an absolute URL in the first link is found, the function returns an empty string.
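To see how the resolved base URL gets used downstream, here is how `urllib.parse.urljoin` from Python’s standard library turns relative links into absolute ones:

```python
import urllib.parse

base = "https://www.example.com/blog/post"

# A root-relative path replaces everything after the domain
print(urllib.parse.urljoin(base, "/images/logo.png"))  # https://www.example.com/images/logo.png

# Joining with "/" yields the site root, as in the fallback branch of the function
print(urllib.parse.urljoin(base, "/"))                 # https://www.example.com/
```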
4. Extracting SEO Elements
Next, the code analyzes the HTML content of the web page and extracts key SEO data for auditing: meta title, description, headers, links, and images.
def parse_html(page_html, base_url):
    soup = BeautifulSoup(page_html, "html.parser")
    meta_title = soup.title.string.strip() if soup.title and soup.title.string else ""
    meta_description = ""
    meta_desc_tag = soup.find("meta", attrs={"name": "description"})
    if meta_desc_tag:
        meta_description = meta_desc_tag.get("content", "").strip()
    headers = {f"h{level}": len(soup.find_all(f"h{level}")) for level in range(1, 7)}
    images = []
    for img in soup.find_all("img"):
        src = img.get("src", "")
        full_src = urllib.parse.urljoin(base_url, src)
        images.append({"src": full_src, "alt": img.get("alt", "No alt text provided").strip()})
    links = []
    for a_tag in soup.find_all("a", href=True):
        full_href = urllib.parse.urljoin(base_url, a_tag["href"])
        try:
            status = requests.head(full_href, timeout=REQUEST_TIMEOUT).status_code
            if status >= 400:
                # Some servers reject HEAD requests, so retry with GET
                status = requests.get(full_href, timeout=REQUEST_TIMEOUT).status_code
        except requests.RequestException:
            status = "Error"
        link_type = "internal" if urllib.parse.urlparse(full_href).netloc == urllib.parse.urlparse(base_url).netloc else "external"
        links.append({"url": full_href, "status": status, "type": link_type})
    # Remove scripts and styles, then collapse whitespace in the remaining text
    for tag in soup(["script", "style"]):
        tag.decompose()
    main_text = re.sub(r"\s+", " ", soup.get_text()).strip()
    return {"meta_title": meta_title, "meta_description": meta_description, "headers": headers, "images": images, "links": links, "main_text": main_text}
The function above achieves the following:
- HTML Parsing: It takes raw HTML (page_html) and a base URL (base_url) as input, then uses BeautifulSoup to create a parsable object from the HTML.
- Metadata Extraction: It extracts the page’s <title> tag content as meta_title, searches for a <meta name="description"> tag, and extracts its content as meta_description.
- Header Analysis: It counts the occurrences of <h1> to <h6> tags, storing the counts in a dictionary named headers.
- Image Processing: It finds all <img> tags, extracts each image’s src and alt attributes, resolves relative image URLs to absolute URLs using urllib.parse.urljoin, then stores the image source and alt text in a list.
- Link Analysis: It finds all <a> tags with href attributes, resolves relative link URLs to absolute URLs, and checks the status of each link using requests.head (falling back to requests.get if HEAD returns a 4xx or 5xx status code). It then determines whether each link is internal or external based on the base URL and stores each link’s URL, status, and type.
- Content Cleanup: It removes <script> and <style> tags from the parsed HTML, extracts the remaining text content, and collapses extra whitespace using a regular expression.
- Result Packaging: It returns a dictionary containing all the extracted information: meta_title, meta_description, headers, images, links, and main_text.
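The whitespace-cleanup step described above boils down to a single standard-library regex. A quick standalone illustration:

```python
import re

raw_text = "  Some   text\n\nwith   messy\t whitespace  "
# Collapse every run of whitespace to a single space, then trim the ends
main_text = re.sub(r"\s+", " ", raw_text).strip()
print(main_text)  # Some text with messy whitespace
```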
5. Compiling the SEO Audit Report
Further down, the code gathers the extracted data and structures it into an audit report detailing key metrics for analysis.
def generate_audit_report(url, parsed_data):
    links = parsed_data["links"]
    report = {
        "url": url,
        "meta_title": parsed_data["meta_title"],
        "meta_description": parsed_data["meta_description"],
        "headers": parsed_data["headers"],
        "image_count": len(parsed_data["images"]),
        "link_count": len(links),
        "internal_link_count": len([link for link in links if link["type"] == "internal"]),
        "external_link_count": len([link for link in links if link["type"] == "external"]),
        "broken_link_count": len([link for link in links if link["status"] != 200 and link["status"] != "Error"]),
        "error_link_count": len([link for link in links if link["status"] == "Error"]),
        "main_text_length": len(parsed_data["main_text"]),
    }
    return report
The function above achieves the following:
- Report Creation: It takes the URL of the page and the parsed data (extracted from the HTML) as input and creates a Python dictionary named report to store the SEO audit results.
- Data Population: It populates the report dictionary with the page’s URL, extracted meta title and description, header counts, image and link counts (total, internal, external, broken, and error), and, finally, the length of the main text content.
- Report Return: The function returns the report dictionary, which contains all the compiled SEO audit data.
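The link counts in the report are plain list comprehensions over the parsed links. A small self-contained example with dummy data (the URLs and statuses here are placeholders):

```python
# Hypothetical parsed links, mirroring the {"url", "status", "type"} shape used in the report
links = [
    {"url": "https://example.com/a", "status": 200, "type": "internal"},
    {"url": "https://other.com/b", "status": 404, "type": "external"},
    {"url": "https://other.com/c", "status": "Error", "type": "external"},
]

internal_count = len([l for l in links if l["type"] == "internal"])
broken_count = len([l for l in links if l["status"] != 200 and l["status"] != "Error"])
error_count = len([l for l in links if l["status"] == "Error"])
print(internal_count, broken_count, error_count)  # 1 1 1
```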
6. Sending the Report to Mistral (LLM) for Analysis:
Next, the audit report is sent to Mistral for further analysis and recommendations.
def send_to_mistral(report, hf_url, hf_api_token):
    client = InferenceClient(model=hf_url, token=hf_api_token)
    prompt = (
        "You are an SEO expert. Analyze the following SEO audit report "
        "and provide recommendations for improvement:\n\n"
        f"{json.dumps(report, indent=4)}"
    )
    try:
        logging.info("Sending audit report to Mistral...")
        response = client.text_generation(prompt)
        if isinstance(response, str):
            return response.strip()
        elif isinstance(response, list) and response and "generated_text" in response[0]:
            return response[0]["generated_text"].strip()
        else:
            logging.error(f"Unexpected API response: {response}")
            return "Failed to parse Mistral response."
    except Exception as e:
        logging.error(f"Error communicating with Mistral: {e}")
        return "Error connecting to Mistral."
The function does the following:
- AI Client Setup: Initializes an InferenceClient to connect to the Mistral AI model using the provided hf_url and hf_api_token.
- Prompt Construction: Creates a prompt for the AI, instructing it to act as an SEO expert and analyze the report (which is converted to a formatted JSON string).
- API Request: Sends the prompt to the Mistral AI model using client.text_generation and logs the sending action.
- Response Handling: If the response is a string, it returns the stripped text. If it’s a list with a "generated_text" field, it returns that field stripped. Otherwise, it logs an error and returns a "Failed to parse..." message.
- Error Handling: Uses a try...except block to catch potential exceptions during the API communication. If an error occurs, it logs the error and returns an "Error connecting..." message.
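The response-handling branches can be factored into a tiny helper so they can be exercised without hitting the API. This is my own sketch (the helper name is not from the project), but it mirrors the same branching:

```python
def normalize_llm_response(response):
    """Accept a plain string or a list of dicts, as the Inference API may return either."""
    if isinstance(response, str):
        return response.strip()
    if isinstance(response, list) and response and "generated_text" in response[0]:
        return response[0]["generated_text"].strip()
    return None  # unexpected shape; the caller can log and fall back

print(normalize_llm_response("  Add alt text to images.  "))
print(normalize_llm_response([{"generated_text": " Shorten the meta title. "}]))
```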
7. Generating a Downloadable Report:
Next, create a download function so anyone can save an audit report to their device as a .txt file.
def create_download_link(val, filename):
    b64 = base64.b64encode(val.encode()).decode()
    return f'<a href="data:file/txt;base64,{b64}" download="{filename}">Download Report</a>'
This code block achieves the following:
- Encoding Data: It takes the report data (val) and a filename (filename) as input. val.encode() converts the report data to bytes, base64.b64encode(...) encodes those bytes as base64, and decode() turns the base64 bytes back into a regular string.
- Creating the Download Link: It constructs an HTML <a> (anchor) tag whose href is a data URL, sets the download attribute to the filename the browser should use when saving, and sets "Download Report" as the link text.
- Returning the Link: The function then returns the generated HTML <a> tag string.
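You can verify that the encoding round-trips cleanly: decoding the base64 payload recovers the exact report text. (The report text and filename below are placeholders.)

```python
import base64

report_text = "URL: https://example.com\n\nAudit Report:\n{...}"
b64 = base64.b64encode(report_text.encode()).decode()
link = f'<a href="data:file/txt;base64,{b64}" download="audit_report_1.txt">Download Report</a>'

# Decoding the payload restores the original report exactly
assert base64.b64decode(b64).decode() == report_text
print(link[:40])
```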
8. Building the UI using Streamlit:
The final step is to build the UI for the tool.
st.title("SEO Audit Tool")
uploaded_file = st.file_uploader("Upload scraped JSON file", type=["json"])

if uploaded_file is not None:
    try:
        data = json.load(uploaded_file)
    except Exception as e:
        st.error(f"Error reading JSON file: {e}")
        st.stop()
    if not isinstance(data, (list, dict)):
        st.error("Uploaded JSON file must be a list or a dictionary.")
        st.stop()
    if isinstance(data, list):
        valid_articles = [article for article in data if isinstance(article, dict) and "page_html" in article]
        if not valid_articles:
            st.error("Uploaded JSON file does not contain any valid articles with a 'page_html' field.")
            st.stop()
    elif isinstance(data, dict) and "page_html" not in data:
        st.error("Uploaded JSON file does not contain a 'page_html' field.")
        st.stop()
    else:
        valid_articles = [data]
    base_url = st.text_input("Enter base URL (for resolving relative links):")
    if not base_url:
        st.warning("Please enter the base URL.")
        st.stop()
    st.subheader("Audit Reports")
    for i, article in enumerate(valid_articles):
        st.write(f"### Article {i+1}")
        html = article["page_html"]
        parsed_data = parse_html(html, base_url)
        report = generate_audit_report(base_url, parsed_data)
        st.json(report)
        enhanced_report = send_to_mistral(report, HF_URL, HF_API_TOKEN)
        if enhanced_report:
            with st.expander("Enhanced Analysis"):
                st.write(enhanced_report)
            report_text = f"URL: {base_url}\n\nAudit Report:\n{json.dumps(report, indent=4)}\n\nEnhanced Analysis:\n{enhanced_report}"
            download_link = create_download_link(report_text, f"audit_report_{i+1}.txt")
            st.markdown(download_link, unsafe_allow_html=True)
        else:
            st.warning("Failed to retrieve enhanced analysis.")
            report_text = f"URL: {base_url}\n\nAudit Report:\n{json.dumps(report, indent=4)}"
            download_link = create_download_link(report_text, f"audit_report_{i+1}.txt")
            st.markdown(download_link, unsafe_allow_html=True)
else:
    st.warning("Please upload a JSON file to continue.")
The Streamlit code achieves the following:
- App Setup: st.title("SEO Audit Tool") sets the title of the Streamlit web app, while uploaded_file = st.file_uploader(...) creates a file uploader component, allowing users to upload JSON files.
- File Upload Handling: if uploaded_file is not None checks whether a file has been uploaded. The code then confirms the JSON data is a list or dictionary and validates its structure to ensure it contains "page_html".
- Base URL Input: base_url = st.text_input(...) prompts the user for a base URL used to resolve relative links.
- Audit Loop: The loop iterates through the uploaded articles, retrieves the HTML content, parses it, then generates and displays the audit report in JSON format. enhanced_report = send_to_mistral(...) then sends the report to the AI for analysis.
- No File Uploaded: The final else branch executes when no file has been uploaded, and st.warning(...) prompts the user to upload a JSON file.
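The upload-validation rules can be condensed into one helper for clarity. This is a sketch of my own (the function name is not part of the app), but it applies the same acceptance criteria as the Streamlit code:

```python
def select_valid_articles(data):
    """Return the list of article dicts that carry a 'page_html' field."""
    if isinstance(data, list):
        return [a for a in data if isinstance(a, dict) and "page_html" in a]
    if isinstance(data, dict) and "page_html" in data:
        return [data]
    return []  # unsupported shape -> nothing to audit

print(len(select_valid_articles([{"page_html": "<html></html>"}, {"no_html": 1}])))  # 1
print(len(select_valid_articles({"page_html": "<html></html>"})))                    # 1
print(len(select_valid_articles("not json-shaped")))                                 # 0
```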
Here is what the tool’s UI looks like:
You will find the complete code for this project on my GitHub.
Deploying the SEO Audit Tool using Streamlit Free Hosting
Here’s how I deployed the SEO Audit Tool on Streamlit free cloud hosting in just a few steps:
Step 1: Set Up a GitHub Repository
Streamlit requires your project to be hosted on GitHub.
1. Create a New Repository On GitHub
Create a new repository on GitHub and set it as public.
2. Push Your Code to GitHub
If you haven’t already set up Git and linked your repository, use the following commands in your terminal:
git init
git add .
git commit -m "Initial commit"
git branch -M main
git remote add origin https://github.com/YOUR_USERNAME/seo-audit-tool.git
git push -u origin main
Step 2: Store Your HuggingFace Token As An Environment Variable
Before deploying your app, securely store your Hugging Face token as an environment variable to protect it from misuse by others.
1. Set Your Token As an Environment Variable (Locally):
- macOS/Linux:
export HUGGINGFACE_TOKEN="your_token"
- Windows (PowerShell):
$env:HUGGINGFACE_TOKEN="your_token"
Use os.environ to retrieve the token within your script:
import os

HF_API_TOKEN = os.environ.get("HUGGINGFACE_TOKEN")
if HF_API_TOKEN is None:
    print("Error: Hugging Face token not found in environment variables.")
    # Handle the error
else:
    # Use HF_API_TOKEN in your Hugging Face API calls
    print("Hugging Face token loaded successfully")
- Restart your terminal or code editor so the new environment variable is picked up.
Step 3: Create a requirements.txt file
Streamlit needs to know what dependencies your app requires.
1. In your project folder, create a file named requirements.txt.
2. Add the following dependencies:
streamlit
requests
beautifulsoup4
huggingface_hub
textstat
3. Save the file and commit it to GitHub:
git add requirements.txt
git commit -m "Added dependencies"
git push origin main
4. Do the same for the app.py file containing all your code:
git add app.py
git commit -m "Added app script"
git push origin main
Step 4: Deploy on Streamlit Cloud
1. Go to Streamlit Community Cloud.
2. Click “Sign in with GitHub” and authorize Streamlit.
3. Click “Create App”
4. Select “Deploy a public app from GitHub repo.”
5. In the repository settings, enter:
- Repository: YOUR_USERNAME/seo-audit-tool
- Branch: main
- Main file path: app.py (or whatever your Streamlit script is named)
6. Click “Deploy” and wait for Streamlit to build the app.
Step 5: Get Your Streamlit App URL
After deployment, Streamlit will generate a public URL. You can now share this link to allow others access to your tool!
By extracting website data with Bright Data’s AI Scraper, I built an automated tool that detects SEO issues and highlights fixes in a structured report, a task that would otherwise consume a huge amount of time and resources.
This tool automates the bulk of the work for SEO experts and content strategists, helping ensure a website’s content is high quality and performs well enough to earn higher search engine rankings, visibility, and engagement.