Websites thrive on visibility and engagement. However, search engine algorithms change constantly, and a once well-crafted SEO strategy can become stale or even harmful to a website over time. Many website managers struggle to keep up with the competition because manually auditing blog articles is time-consuming and slows adaptation.
Recognizing these issues, I built an Automated SEO Audit Tool to address them. First, though, I needed to scrape the data to be audited. For this, I used Bright Data’s No-Code AI Scraper. The tool then scans the scraped data for SEO weaknesses, analyzes meta titles, descriptions, headers, links, images, and content length, and generates downloadable reports with AI recommendations.
What We’ll Cover
In this article, I will walk you through:
- Accessing AI Scraper from Bright Data’s website.
- Building an SEO audit tool with the scraped data.
- Creating a friendly UI to interact with the tool.
- Deploying the tool on Streamlit’s free cloud hosting platform.
By the end, you will have built a fully functional SEO auditing tool that automatically scans scraped web pages, detects critical SEO issues, and generates insightful audit reports.
Let’s get started!
Accessing AI Scraper from Bright Data
The first step in building this tool was scraping data from the target website. To achieve this, I used Bright Data’s Data for AI Web Scrapers, which return structured and unstructured data that can easily be passed to an AI model. Within a few minutes, I had my scraped data in hand without writing a single line of code.
To access AI Scraper:
- Sign up on Bright Data.
- On the sidebar menu, click on “Web Scrapers” displayed as a “bug” icon.
- Select “Scrapers marketplace” and type “AI Scraper” in the search box.
- Click on “AI Scraper — discover by domain URL”
- Select the “No-Code Scraper” option and click “Next”. You can also use the Scraper API option to directly scrape websites via an API call.
- Scroll down to the “Add Inputs” section and paste the URL of the website you want to scrape. The other parameters are optional.
- Scroll to the “Output field” column and change the format from “Markdown” to “page_html”.
- Click “Start Collecting” and the AI Scraper will begin fetching your data instantly.
- When your data is ready, you can download it in various formats like JSON, CSV, NDJSON, and JSONL. I downloaded mine as a JSON file.
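Once downloaded, the JSON export is straightforward to inspect with a few lines of Python. The snippet below is a minimal sketch using a made-up sample record; the `url` and `page_html` field names match what this project expects, but a real export may carry additional metadata columns.

```python
import json

# Hypothetical sample mimicking an export with the output field set to "page_html";
# a real Bright Data download may include more fields per record.
sample_export = '[{"url": "https://example.com", "page_html": "<html><title>Demo</title></html>"}]'

records = json.loads(sample_export)
for record in records:
    print(record["url"], "->", len(record["page_html"]), "characters of HTML")
```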
With the data in hand, it was time to build the SEO audit tool.
Building the SEO Audit Tool
Step 1: Setting Up the Project
Create a new project folder in a virtual environment and install the necessary dependencies.
1. Create the project folder:
mkdir seo_audit_project
cd seo_audit_project
2. Set up a virtual environment:
python -m venv venv
Then activate it:
- Windows:
venv\Scripts\activate
- macOS/Linux:
source venv/bin/activate
3. Install the dependencies:
pip install streamlit requests beautifulsoup4 huggingface_hub textstat
The key dependencies are:
- streamlit: For building the web app’s UI.
- requests: Allows the application to make HTTP requests, used here to check link status.
- beautifulsoup4 (bs4): Beautiful Soup, used in this project to parse HTML and extract SEO metadata from web pages.
- huggingface_hub: Allows the app to communicate with the Hugging Face Inference API.
- textstat: This is an optional dependency that enables readability analysis of text.
4. Define the Project Structure:
seo_audit_project/
│── ai_seo_auditor.py
Step 2: Enabling a Large Language Model (LLM) from HuggingFace
I used Mistral-7B-Instruct-v0.3 for the AI analysis and recommendations. Here are the steps to set up the same model:
1. Go to the Hugging Face website.
2. Click on “Sign Up” and create an account.
3. Navigate to your profile.
4. Go to “Settings”.
5. Scroll down and click on “Access Tokens”.
6. Select “Create new token” to generate an API token.
7. Change “Token type” to READ, name the token, and click “Create token”.
8. On the Home page, select “Models” and search for Mistral AI.
9. On Mistral AI’s page, request access to the model you want (here, Mistral-7B-Instruct-v0.3). That’s it. You now have access to Mistral AI; you can test the model by opening a playground.
Step 3: Building the Code Base
Now it was time to start building the code base and writing a suitable prompt to guide the LLM in its job.
1. Preparing for SEO Analysis
To begin, the application imports the necessary libraries from the installed dependencies and initializes logging to track code activity.
import streamlit as st
import requests
from bs4 import BeautifulSoup
import json
import re
import logging
import urllib.parse
from huggingface_hub import InferenceClient
import base64
# Optional: for readability analysis; install via pip install textstat
try:
    import textstat
    READABILITY_ENABLED = True
except ImportError:
    READABILITY_ENABLED = False
# Configure logging
logging.basicConfig(level=logging.INFO, format='%(asctime)s - %(levelname)s - %(message)s')
The code above achieves the following:
- Imports: It starts by bringing in the necessary tools. streamlit builds the web app, requests checks links, BeautifulSoup parses web pages, json handles data files, re cleans text, logging tracks activity, urllib.parse manages URLs, huggingface_hub connects to the AI model, and base64 encodes data for downloads.
- Optional Feature: It then tries to enable readability analysis with textstat. If the library exists, the feature is turned on.
- Logging Setup: Further down, the code configures a system to record its actions, which helps with debugging and tracking what’s happening.
2. AI SEO Analysis
Now, it configures the connection to the Mistral model through a HuggingFace API token and inference URL.
# Hugging Face API configuration
HF_API_TOKEN = "hf_afjhturhdjhrufhfudhufhduj"  # Replace with your own API token; this is a dummy
HF_URL = "https://api-inference.huggingface.co/models/mistralai/Mistral-7B-Instruct-v0.3"
# Timeout for HTTP requests (in seconds)
REQUEST_TIMEOUT = 10
Here is a summary of what the code above achieves:
- Hugging Face API Setup: It configures the connection to the Hugging Face AI model. HF_API_TOKEN stores the API key for accessing the LLM, and HF_URL specifies the exact address of the Mistral-7B-Instruct-v0.3 model.
- Network Request Timeout: REQUEST_TIMEOUT sets a 10-second limit on how long the application waits for responses from external web requests, so it isn’t stuck if a website is slow or unresponsive.
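To illustrate how such a timeout is typically applied, here is a small sketch of a link checker. The helper name `check_link_status` is mine, not from the project; in the actual app, the equivalent logic lives inside the HTML parsing function.

```python
import requests

REQUEST_TIMEOUT = 10  # seconds

def check_link_status(url, timeout=REQUEST_TIMEOUT):
    """Return the HTTP status code for a URL, or 'Error' if the request fails."""
    try:
        resp = requests.head(url, timeout=timeout, allow_redirects=True)
        if resp.status_code >= 400:
            # Some servers reject HEAD requests, so fall back to GET
            resp = requests.get(url, timeout=timeout)
        return resp.status_code
    except requests.RequestException:
        return "Error"
```

Any network failure, including a timeout, surfaces as a `requests.RequestException`, so the caller gets a single `"Error"` sentinel instead of a crash.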
3. Identifying the Page’s Origin
Web pages usually contain relative links, which are URLs that don’t specify the full address of a resource. For example, instead of ‘https://www.example.com/images/logo.png’, a page might just have /images/logo.png. To make sense of these relative links, the code resolves them against the page’s base URL to form absolute URLs.
def get_base_url_from_html(page_html):
    """Extracts the base URL from the provided HTML."""
    soup = BeautifulSoup(page_html, "html.parser")
    base_tag = soup.find("base")
    if base_tag and base_tag.get("href"):
        return base_tag["href"]
    # If no <base> tag, try to extract from the first link's absolute URL
    first_link = soup.find("a", href=True)
    if first_link and urllib.parse.urlparse(first_link["href"]).netloc:
        return urllib.parse.urljoin(first_link["href"], "/")
    return ""  # Return empty string if the base URL cannot be found
Below are more details on how the function works:
- HTML Parsing: soup = BeautifulSoup(page_html, "html.parser") takes the page_html string (the web page’s HTML) and parses it using BeautifulSoup.
- Checking for a <base> Tag: base_tag = soup.find("base") searches the parsed HTML for the <base> tag, which specifies the base URL for all relative URLs in the document.
- No Base URL Found: if neither a <base> tag nor an absolute URL in the first link is found, the function returns an empty string.
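To see how the resolved base URL gets used downstream, here is how `urllib.parse.urljoin` from Python’s standard library turns relative links into absolute ones:

```python
import urllib.parse

base = "https://www.example.com/blog/post"

# A root-relative path replaces everything after the domain
print(urllib.parse.urljoin(base, "/images/logo.png"))  # https://www.example.com/images/logo.png

# Joining with "/" yields the site root, as in the fallback branch of the function
print(urllib.parse.urljoin(base, "/"))                 # https://www.example.com/
```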
4. Extracting SEO Elements
Next, the code analyzes the HTML content of the web page and extracts key SEO data for auditing: meta title, description, headers, links, and images.
def parse_html(page_html, base_url):
    soup = BeautifulSoup(page_html, "html.parser")
    meta_title = soup.title.string.strip() if soup.title and soup.title.string else ""
    meta_description = ""
    meta_desc_tag = soup.find("meta", attrs={"name": "description"})
    if meta_desc_tag:
        meta_description = meta_desc_tag.get("content", "").strip()
    headers = {f"h{level}": len(soup.find_all(f"h{level}")) for level in range(1, 7)}
    images = []
    for img in soup.find_all("img"):
        src = img.get("src", "")
        full_src = urllib.parse.urljoin(base_url, src)
        images.append({"src": full_src, "alt": img.get("alt", "No alt text provided").strip()})
    links = []
    for a_tag in soup.find_all("a", href=True):
        full_href = urllib.parse.urljoin(base_url, a_tag["href"])
        try:
            status = requests.head(full_href, timeout=REQUEST_TIMEOUT).status_code
            if status >= 400:
                # Some servers reject HEAD requests, so retry with GET
                status = requests.get(full_href, timeout=REQUEST_TIMEOUT).status_code
        except requests.RequestException:
            status = "Error"
        link_type = "internal" if urllib.parse.urlparse(full_href).netloc == urllib.parse.urlparse(base_url).netloc else "external"
        links.append({"url": full_href, "status": status, "type": link_type})
    # Remove scripts and styles, then collapse whitespace in the remaining text
    for tag in soup(["script", "style"]):
        tag.decompose()
    main_text = re.sub(r"\s+", " ", soup.get_text()).strip()
    return {"meta_title": meta_title, "meta_description": meta_description, "headers": headers, "images": images, "links": links, "main_text": main_text}
The function above achieves the following:
- HTML Parsing: It takes raw HTML (page_html) and a base URL (base_url) as input, then uses BeautifulSoup to create a parsable object from the HTML.
- Metadata Extraction: It extracts the page’s <title> tag content as meta_title, searches for a <meta name="description"> tag, and extracts its content as meta_description.
- Header Analysis: It counts the occurrences of <h1> to <h6> tags, storing the counts in a dictionary named headers.
- Image Processing: It finds all <img> tags, extracts each image’s src and alt attributes, resolves relative image URLs to absolute URLs using urllib.parse.urljoin, then stores the image source and alt text in a list.
- Link Analysis: It finds all <a> tags with href attributes, resolves relative link URLs to absolute URLs, and checks the status of each link using requests.head (falling back to requests.get if HEAD returns a 4xx or 5xx status code). It then determines whether each link is internal or external based on the base URL and stores each link’s URL, status, and type.
- Content Cleanup: It removes <script> and <style> tags from the parsed HTML, extracts the remaining text content, and collapses extra whitespace using a regular expression.
- Result Packaging: It returns a dictionary containing all the extracted information: meta_title, meta_description, headers, images, links, and main_text.
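The whitespace-cleanup step described above boils down to a single standard-library regex. A quick standalone illustration:

```python
import re

raw_text = "  Some   text\n\nwith   messy\t whitespace  "
# Collapse every run of whitespace to a single space, then trim the ends
main_text = re.sub(r"\s+", " ", raw_text).strip()
print(main_text)  # Some text with messy whitespace
```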
5. Compiling the SEO Audit Report
Further down, the code gathers the extracted data and structures it into an audit report detailing key metrics for analysis.
def generate_audit_report(url, parsed_data):
    links = parsed_data["links"]
    report = {
        "url": url,
        "meta_title": parsed_data["meta_title"],
        "meta_description": parsed_data["meta_description"],
        "headers": parsed_data["headers"],
        "image_count": len(parsed_data["images"]),
        "link_count": len(links),
        "internal_link_count": len([link for link in links if link["type"] == "internal"]),
        "external_link_count": len([link for link in links if link["type"] == "external"]),
        "broken_link_count": len([link for link in links if link["status"] != 200 and link["status"] != "Error"]),
        "error_link_count": len([link for link in links if link["status"] == "Error"]),
        "main_text_length": len(parsed_data["main_text"]),
    }
    return report
The function above achieves the following:
- Report Creation: It takes the URL of the page and the parsed data (extracted from the HTML) as input and creates a Python dictionary named report to store the SEO audit results.
- Data Population: It populates the report dictionary with the page’s URL, extracted meta title and description, header counts, image and link counts (total, internal, external, broken, and error), and, finally, the length of the main text content.
- Report Return: The function returns the report dictionary, which contains all the compiled SEO audit data.
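The link counts in the report are plain list comprehensions over the parsed links. A small self-contained example with dummy data (the URLs and statuses here are placeholders):

```python
# Hypothetical parsed links, mirroring the {"url", "status", "type"} shape used in the report
links = [
    {"url": "https://example.com/a", "status": 200, "type": "internal"},
    {"url": "https://other.com/b", "status": 404, "type": "external"},
    {"url": "https://other.com/c", "status": "Error", "type": "external"},
]

internal_count = len([l for l in links if l["type"] == "internal"])
broken_count = len([l for l in links if l["status"] != 200 and l["status"] != "Error"])
error_count = len([l for l in links if l["status"] == "Error"])
print(internal_count, broken_count, error_count)  # 1 1 1
```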
6. Sending the Report to Mistral (LLM) for Analysis:
Next, the audit report is sent to Mistral for further analysis and recommendations.
def send_to_mistral(report, hf_url, hf_api_token):
    client = InferenceClient(model=hf_url, token=hf_api_token)
    prompt = (
        "You are an SEO expert. Analyze the following SEO audit report "
        "and provide recommendations for improvement:\n\n"
        f"{json.dumps(report, indent=4)}"
    )
    try:
        logging.info("Sending audit report to Mistral...")
        response = client.text_generation(prompt)
        if isinstance(response, str):
            return response.strip()
        elif isinstance(response, list) and response and "generated_text" in response[0]:
            return response[0]["generated_text"].strip()
        else:
            logging.error(f"Unexpected API response: {response}")
            return "Failed to parse Mistral response."
    except Exception as e:
        logging.error(f"Error communicating with Mistral: {e}")
        return "Error connecting to Mistral."
The function does the following:
- AI Client Setup: Initializes an InferenceClient to connect to the Mistral AI model using the provided hf_url and hf_api_token.
- Prompt Construction: Creates a prompt for the AI, instructing it to act as an SEO expert and analyze the report (which is converted to a formatted JSON string).
- API Request: Sends the prompt to the Mistral AI model using client.text_generation and logs the sending action.
- Response Handling: If the response is a string, it returns the stripped text. If it’s a list with a "generated_text" field, it returns that field stripped. Otherwise, it logs an error and returns a "Failed to parse..." message.
- Error Handling: Uses a try...except block to catch potential exceptions during the API communication. If an error occurs, it logs the error and returns an "Error connecting..." message.
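The response-handling branches can be factored into a tiny helper so they can be exercised without hitting the API. This is my own sketch (the helper name is not from the project), but it mirrors the same branching:

```python
def normalize_llm_response(response):
    """Accept a plain string or a list of dicts, as the Inference API may return either."""
    if isinstance(response, str):
        return response.strip()
    if isinstance(response, list) and response and "generated_text" in response[0]:
        return response[0]["generated_text"].strip()
    return None  # unexpected shape; the caller can log and fall back

print(normalize_llm_response("  Add alt text to images.  "))
print(normalize_llm_response([{"generated_text": " Shorten the meta title. "}]))
```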
7. Generating a Downloadable Report:
Next, create a download function so anyone can save an audit report to their device as a .txt file.
def create_download_link(val, filename):
    b64 = base64.b64encode(val.encode()).decode()
    return f'<a href="data:file/txt;base64,{b64}" download="{filename}">Download Report</a>'
This code block achieves the following:
- Encoding Data: It takes the report data (val) and a filename (filename) as input. val.encode() converts the report data to bytes, base64.b64encode(...) encodes those bytes as base64, and decode() turns the base64 bytes back into a regular string.
- Creating the Download Link: It constructs an HTML <a> (anchor) tag whose href is a data URL, sets the download attribute to the filename the browser should use when saving, and sets "Download Report" as the link text.
- Returning the Link: The function then returns the generated HTML <a> tag string.
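You can verify that the encoding round-trips cleanly: decoding the base64 payload recovers the exact report text. (The report text and filename below are placeholders.)

```python
import base64

report_text = "URL: https://example.com\n\nAudit Report:\n{...}"
b64 = base64.b64encode(report_text.encode()).decode()
link = f'<a href="data:file/txt;base64,{b64}" download="audit_report_1.txt">Download Report</a>'

# Decoding the payload restores the original report exactly
assert base64.b64decode(b64).decode() == report_text
print(link[:40])
```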
8. Building the UI using Streamlit:
The final step is to build the UI for the tool.
st.title("SEO Audit Tool")
uploaded_file = st.file_uploader("Upload scraped JSON file", type=["json"])

if uploaded_file is not None:
    try:
        data = json.load(uploaded_file)
    except Exception as e:
        st.error(f"Error reading JSON file: {e}")
        st.stop()
    if not isinstance(data, (list, dict)):
        st.error("Uploaded JSON file must be a list or a dictionary.")
        st.stop()
    if isinstance(data, list):
        valid_articles = [article for article in data if isinstance(article, dict) and "page_html" in article]
        if not valid_articles:
            st.error("Uploaded JSON file does not contain any valid articles with a 'page_html' field.")
            st.stop()
    elif isinstance(data, dict) and "page_html" not in data:
        st.error("Uploaded JSON file does not contain a 'page_html' field.")
        st.stop()
    else:
        valid_articles = [data]
    base_url = st.text_input("Enter base URL (for resolving relative links):")
    if not base_url:
        st.warning("Please enter the base URL.")
        st.stop()
    st.subheader("Audit Reports")
    for i, article in enumerate(valid_articles):
        st.write(f"### Article {i+1}")
        html = article["page_html"]
        parsed_data = parse_html(html, base_url)
        report = generate_audit_report(base_url, parsed_data)
        st.json(report)
        enhanced_report = send_to_mistral(report, HF_URL, HF_API_TOKEN)
        if enhanced_report:
            with st.expander("Enhanced Analysis"):
                st.write(enhanced_report)
            report_text = f"URL: {base_url}\n\nAudit Report:\n{json.dumps(report, indent=4)}\n\nEnhanced Analysis:\n{enhanced_report}"
            download_link = create_download_link(report_text, f"audit_report_{i+1}.txt")
            st.markdown(download_link, unsafe_allow_html=True)
        else:
            st.warning("Failed to retrieve enhanced analysis.")
            report_text = f"URL: {base_url}\n\nAudit Report:\n{json.dumps(report, indent=4)}"
            download_link = create_download_link(report_text, f"audit_report_{i+1}.txt")
            st.markdown(download_link, unsafe_allow_html=True)
else:
    st.warning("Please upload a JSON file to continue.")
The Streamlit code achieves the following:
- App Setup: st.title("SEO Audit Tool") sets the title of the Streamlit web app, while uploaded_file = st.file_uploader(...) creates a file uploader component, allowing users to upload JSON files.
- File Upload Handling: if uploaded_file is not None checks whether a file has been uploaded. The code then confirms the JSON data is a list or dictionary and validates its structure to ensure it contains "page_html".
- Base URL Input: base_url = st.text_input(...) prompts the user for a base URL used to resolve relative links.
- Audit Loop: The loop iterates through the uploaded articles, retrieves the HTML content, parses it, then generates and displays the audit report in JSON format. enhanced_report = send_to_mistral(...) then sends the report to the AI for analysis.
- No File Uploaded: The final else branch executes when no file has been uploaded, and st.warning(...) prompts the user to upload a JSON file.
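The upload-validation rules can be condensed into one helper for clarity. This is a sketch of my own (the function name is not part of the app), but it applies the same acceptance criteria as the Streamlit code:

```python
def select_valid_articles(data):
    """Return the list of article dicts that carry a 'page_html' field."""
    if isinstance(data, list):
        return [a for a in data if isinstance(a, dict) and "page_html" in a]
    if isinstance(data, dict) and "page_html" in data:
        return [data]
    return []  # unsupported shape -> nothing to audit

print(len(select_valid_articles([{"page_html": "<html></html>"}, {"no_html": 1}])))  # 1
print(len(select_valid_articles({"page_html": "<html></html>"})))                    # 1
print(len(select_valid_articles("not json-shaped")))                                 # 0
```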
Here is what the tool’s UI looks like:
You will find the complete code for this project on my GitHub.
Deploying the SEO Audit Tool using Streamlit Free Hosting
Here’s how I deployed the SEO Audit Tool on Streamlit free cloud hosting in just a few steps:
Step 1: Set Up a GitHub Repository
Streamlit requires your project to be hosted on GitHub.
1. Create a New Repository On GitHub
Create a new repository on GitHub and set it as public.
2. Push Your Code to GitHub
If you haven’t already set up Git and linked your repository, use the following commands in your terminal:
git init
git add .
git commit -m "Initial commit"
git branch -M main
git remote add origin https://github.com/YOUR_USERNAME/seo-audit-tool.git
git push -u origin main
Step 2: Store Your HuggingFace Token As An Environment Variable
Before deploying your app, securely store your Hugging Face token as an environment variable to protect it from misuse by others.
1. Set Your Token As an Environment Variable (Locally):
- macOS/Linux:
export HUGGINGFACE_TOKEN="your_token"
- Windows (PowerShell):
$env:HUGGINGFACE_TOKEN="your_token"
Use os.environ to retrieve the token within your script:
import os

HF_API_TOKEN = os.environ.get("HUGGINGFACE_TOKEN")
if HF_API_TOKEN is None:
    print("Error: Hugging Face token not found in environment variables.")
    # Handle the error
else:
    # Use HF_API_TOKEN in your Hugging Face API calls
    print("Hugging Face token loaded successfully")
- Restart your terminal or code editor so the new environment variable is picked up.
Step 3: Create a requirements.txt file
Streamlit needs to know what dependencies your app requires.
1. In your project folder, create a file named requirements.txt.
2. Add the following dependencies:
streamlit
requests
beautifulsoup4
huggingface_hub
textstat
3. Save the file and commit it to GitHub:
git add requirements.txt
git commit -m "Added dependencies"
git push origin main
4. Do the same for the app.py file containing all your code:
git add app.py
git commit -m "Added app script"
git push origin main
Step 4: Deploy on Streamlit Cloud
1. Go to Streamlit Community Cloud.
2. Click “Sign in with GitHub” and authorize Streamlit.
3. Click “Create App”
4. Select “Deploy a public app from GitHub repo.”
5. In the repository settings, enter:
- Repository: YOUR_USERNAME/seo-audit-tool
- Branch: main
- Main file path: app.py (or whatever your Streamlit script is named)
6. Click “Deploy” and wait for Streamlit to build the app.
Step 5: Get Your Streamlit App URL
After deployment, Streamlit will generate a public URL. You can now share this link to allow others access to your tool!
By extracting website data with Bright Data’s AI Scraper, I built an automated tool that detects SEO issues and highlights fixes in a structured report, a task that would otherwise consume a huge amount of time and resources.
This tool automates the bulk of the work for SEO experts and content strategists, helping ensure a website’s content is high quality and performs well enough to earn higher search engine rankings, visibility, and engagement.