So I wanted to experiment with LLMs and train a domain-specific AI assistant, and I figured a weekend project would be the perfect way to dive in. My initial idea was an AI assistant based on Marvel superheroes, using data pulled from Fandom wikis. But I quickly ran into a problem: Fandom doesn’t have a public API 🙃. At least not one that’s convenient or useful for this kind of project.
What started as a weekend experiment turned into an interesting technical problem — information might exist on the internet, but it’s not always conveniently available. That’s exactly the kind of problem web scraping solves: by extracting and storing the data myself, I could build a dataset for training the assistant.
In this post, I’ll walk you through how I built an automated webpage snapshot archive for Fandom wikis using web scraping. The real challenge wasn’t just scraping the data — it was doing it at scale (which is why I turned to Bright Data’s Scraping Browser).
This system efficiently collects, converts, and stores web pages as Markdown and HTML snapshots — perfect for AI training, historical analysis, and digital preservation.
Feel free to fork this repo and work on it yourself.
Whether you’re an AI researcher assembling a training corpus, or just someone who doesn’t trust the internet to keep things around, I hope this helps you out!
The Tech Stack
Here’s what I used:
- Bright Data Scraping Browser — A remote browser solution that handles proxies, CAPTCHA solving, and JavaScript rendering, making large-scale web scraping more reliable.
- Puppeteer — To control the above and extract fully rendered HTML. We’ll just need `puppeteer-core`, since I’m assuming most would already have Chrome on their systems. Get the full-fat `puppeteer` if you don’t.
- Turndown — To convert HTML to Markdown for cleaner text storage.
- SQLite — A lightweight, local, fully relational database to store Markdown. For storing the raw HTML backups, I’m just using the filesystem.
- csv-parser — Stream-based CSV parser. Battle tested and reliable.
Use your package manager of choice to get these.
npm install puppeteer-core turndown sqlite3 csv-parser
Why SQLite and not a vector database? Since this part of the project is just about fetching and archiving web data before AI processing, a full-fledged vector database like Pinecone/Weaviate wasn’t necessary at this stage. If I were doing retrieval-augmented generation (RAG) or similarity searches after this, sure, then I’d consider it.
Why Bright Data? Anyone can scrape a single page with Puppeteer and Cheerio or similar. It’s scaling web scraping that’s really tough — handling CAPTCHAs, IP bans, and JavaScript-heavy pages is a headache I didn’t want to deal with. Bright Data’s Scraping Browser streamlines all of this by providing a remote, proxy-rotated browser that loads pages just like a real user, making it much easier to extract data at scale without getting blocked.
Step 1: Setting Up Bright Data’s Scraping Browser
If you don’t have a Bright Data account yet, you can sign up for free. Adding a payment method will grant you a $5 credit to get started — no charges upfront.
1. Sign in to Bright Data
Log in to your Bright Data account.
2. Creating a Proxy Zone
- On the My Zones page, find the Scraping Browser section and click Get Started.
- If you already have an active proxy, you can simply click Add in the top-right corner.
3. Assign a Name to Your Remote Browser Instance
- Choose a meaningful name, as it cannot be changed once created.
4. Click “Add” and Verify Your Account
- If you haven’t verified your account yet, you’ll be prompted to add a payment method at this stage.
- First-time users receive a $5 bonus credit, so you can test the service without any upfront costs.
Once that’s done, copy your Username and Password from the “Overview” tab of the Scraping Browser zone you just created. Then it’s back to the code.
The key thing to remember is that unlike API-based scraping, you get a WebSocket endpoint for direct Puppeteer control.
Here’s what I mean:
// just separate the two values you copied with a “:”
// should look like this: brd-customer-<ACCOUNT ID>-zone-<ZONE NAME>:<PASSWORD>
const AUTH = "username:password";
// the websocket endpoint you’ll be connecting your local puppeteer instance to
const SBR_WS_ENDPOINT = `wss://${AUTH}@brd.superproxy.io:9222`;
// uses Puppeteer’s .connect() method to establish a connection with Bright Data’s remote browser.
const browser = await puppeteer.connect({ browserWSEndpoint: SBR_WS_ENDPOINT });
Unlike puppeteer.launch(), which starts a local instance, puppeteer.connect() connects to an existing browser session running on Bright Data’s infrastructure.
This lets us launch a remote-controlled, auto-proxy rotated browser instance in the cloud that behaves like a real user, and allows for full-page interactions, all with zero infra required on my part.
Step 2: The Basics
Now that we’re connected to Bright Data’s Scraping Browser, here’s a quick preview of how this will work: we navigate to a webpage (as usual), extract its HTML, and convert it to Markdown for structured storage.
const page = await browser.newPage();
await page.goto(url, { waitUntil: 'domcontentloaded', timeout: 120000 });
try {
// this CSS selector is Fandom page specific; figure out your own if scraping something else
await page.waitForSelector('.marvel_database_section', { timeout: 30000 });
} catch (err) {
console.warn("⚠️ Required section not found. Archiving the page anyway.");
}
// extract page HTML
const html = await page.content();
// convert this extracted HTML to Markdown
const markdown = turndownService.turndown(html);
- The `.waitForSelector()` call ensures the page has loaded properly (we wait up to 30 seconds for an element with the class `.marvel_database_section` to appear) before extracting content. This is crucial for dynamic sites with lazy-loaded elements.
- If it’s not found, we log a warning but archive the page anyway (you can change this behavior if you want).
- `page.content()` grabs the fully rendered HTML of the page, including dynamically loaded content.
- `turndownService.turndown(html)` converts the HTML into Markdown for structured text storage.
Together, this ensures my extracted data includes any JavaScript-rendered elements.
💡 **Why `domcontentloaded`?** This waits until the HTML structure is fully available but does not wait for images, styles, or AJAX requests to complete. Much faster than `networkidle2` while still capturing the main content.
This heuristic isn’t exactly perfect — some sites defy simple content detection — but it worked for Fandom (and 90% of my other targets). For edge cases, Bright Data’s API offers data discovery tools to pinpoint specific elements, though I stuck with this simpler approach. Don’t fix what ain’t broke, right?
Got all that? Good. That was just a taste. Now, let’s dive right into our actual code.
async function scrapeAndArchive(urls) {
for (const url of urls) {
console.log(`Scraping: ${url}`);
try {
const page = await browser.newPage();
await page.goto(url, { waitUntil: 'domcontentloaded', timeout: 120000 });
const html = await page.content();
const markdown = turndownService.turndown(html);
// save extracted data
await storeSnapshot(url, markdown, html);
console.log(`✅ Successfully scraped: ${url}`);
await page.close();
} catch (err) {
console.error(`❌ Error scraping ${url}:`, err.message);
}
}
}
This is pretty much what we saw in the preview, except with extra logging and error handling, and we now call `storeSnapshot()` to store the extracted data.
Let’s cover that next.
Step 3: Storing Snapshots for Long-Term Accessibility
To make my stored snapshots easy to query and resilient over time, I used a dual-storage system:
- SQLite Database: Stores metadata and Markdown for quick text-based queries.
- Filesystem: Stores full HTML files for high-fidelity preservation.
Here’s my schema:
CREATE TABLE IF NOT EXISTS snapshots (
id INTEGER PRIMARY KEY AUTOINCREMENT,
url TEXT UNIQUE,
timestamp TEXT,
markdown TEXT,
html_path TEXT
);
After scraping, we need to store the extracted data in both the database and local storage. Here’s the storeSnapshot function, in its entirety:
async function storeSnapshot(url, markdown, html) {
const timestamp = new Date().toISOString();
const snapshotsDir = path.join(__dirname, 'snapshots');
// ensure a snapshots directory exists
if (!fs.existsSync(snapshotsDir)) {
fs.mkdirSync(snapshotsDir, { recursive: true });
}
// store HTML in filesystem, generate a unique filename based on the URL
const htmlFilename = path.join(snapshotsDir, sanitizeFilename(url));
fs.writeFileSync(htmlFilename, html, 'utf8');
// store snapshot metadata + Markdown in DB (SQLite)
db.run(
`INSERT OR REPLACE INTO snapshots (url, timestamp, markdown, html_path) VALUES (?, ?, ?, ?)`,
[url, timestamp, markdown, htmlFilename],
(err) => {
if (err) console.error('Error inserting snapshot:', err.message);
else console.log(`✅ Snapshot stored: ${url}`);
}
);
}
Here, { recursive: true } ensures parent directories are also created if missing.
This function relies on a small helper that sanitizes URLs into safe filenames.
function sanitizeFilename(url) {
return url.replace(/https?:\/\//, '').replace(/\W/g, '_') + `_${Date.now()}.html`;
}
Let’s break this regex down.
- First of all, `url.replace(/https?:\/\//, '')` removes “http://” or “https://” from the URL.
- Next, `.replace(/\W/g, '_')` replaces any non-word character (anything that’s NOT a letter, number, or underscore) with an underscore.
- Finally, `+ _${Date.now()}.html` just appends a timestamp to ensure unique filenames (and the .html extension).
- Putting it all together, if we started with `https://marvel.fandom.com/wiki/Peter_Parker_(Earth-616)`, we’d end up with a file named something like `marvel_fandom_com_wiki_Peter_Parker__Earth_616__1714159123456.html` (and the `html_path` field in our database entry would point to this file).
The dual filesystem + database approach ensures my Markdown versions are queryable while still retaining the full HTML structure for future use. Our INSERT OR REPLACE statement guarantees that duplicate URLs get updated rather than duplicated.
This structure balances accessibility (SQLite queries) with durability (filesystem backups). For AI training, I can query Markdown directly; for historical fidelity, I’ve got the HTML.
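To make the “queryable” part concrete, here’s a minimal sketch of pulling archived Markdown back out of SQLite later on. It assumes the same `archive.db` and `snapshots` schema shown above, and the `LIKE` filter is just an illustration.
// Minimal sketch: read archived Markdown back out of the snapshots table.
const sqlite3 = require('sqlite3').verbose();
const db = new sqlite3.Database('archive.db');

db.all(
  `SELECT url, timestamp, markdown, html_path FROM snapshots WHERE url LIKE ?`,
  ['%marvel.fandom.com%'],
  (err, rows) => {
    if (err) return console.error('Query failed:', err.message);
    for (const row of rows) {
      console.log(`${row.url} (archived ${row.timestamp}): ${row.markdown.length} chars of Markdown`);
    }
    db.close();
  }
);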
Step 4: Automating the Pipeline
What we have so far works already — but for one page at a time. If we want to automate scraping at scale as part of a pipeline, we need to read target URLs from a CSV file and process them sequentially (the safest option, since it avoids rate limits and detection).
const csvPath = path.join(__dirname, 'target_urls.csv');
if (fs.existsSync(csvPath)) { // make sure the file exists
console.log('Found target_urls.csv - processing URLs');
// read CSV line-by-line, extract url field from each row and add to an array
const urls = [];
fs.createReadStream(csvPath)
.pipe(csv())
.on('data', (row) => {
if (row.url) urls.push(row.url);
})
// call scrapeAndArchive() with all found URLs in CSV
.on('end', () => {
if (urls.length > 0) {
console.log(`Found ${urls.length} URLs in CSV file`);
scrapeAndArchive(urls);
} else {
console.error('No valid URLs found in CSV');
}
})
.on('error', (error) => {
console.error('❌ Error reading CSV:', error.message);
});
} else {
console.error('❌ Error: target_urls.csv not found');
}
- This method reads a list of URLs from `target_urls.csv` (the filename is hardcoded, but you can provide it as a command-line argument if you want) and sequentially scrapes each page. A sample CSV is shown below.
- Why use `.pipe(csv())`? Streaming means we don’t load the whole file into memory, which is ideal for large CSVs with thousands of URLs. Also, note that we skip rows without a `url` field to avoid unnecessary processing.
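For reference, here’s the shape of `target_urls.csv` this code expects: a header row containing a `url` column (that’s what `row.url` reads), then one URL per line. The specific URLs below are just examples.
url
https://marvel.fandom.com/wiki/Peter_Parker_(Earth-616)
https://marvel.fandom.com/wiki/Anthony_Stark_(Earth-616)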
While Bright Data’s API supports batch requests, I kept it simple with sequential calls for my small-scale test (50 URLs, one for each Marvel hero). For larger projects, their bulk endpoint could probably reduce latency significantly.
Bringing It All Together
const puppeteer = require('puppeteer-core');
const TurndownService = require('turndown');
const fs = require('fs');
const path = require('path');
const sqlite3 = require('sqlite3').verbose();
const csv = require('csv-parser');
// Bright Data Scraping Browser auth
// should look like 'brd-customer-<ACCOUNT ID>-zone-<ZONE NAME>:<PASSWORD>'
const AUTH = "your-auth-string-here";
const SBR_WS_ENDPOINT = `wss://${AUTH}@brd.superproxy.io:9222`;
// initialize database
const db = new sqlite3.Database('archive.db');
// ensure snapshots table exists
db.run(`CREATE TABLE IF NOT EXISTS snapshots (
id INTEGER PRIMARY KEY AUTOINCREMENT,
url TEXT UNIQUE,
timestamp TEXT,
markdown TEXT,
html_path TEXT
)`);
const turndownService = new TurndownService(); // turndown service - converts HTML to Markdown
function sanitizeFilename(url) {
return url.replace(/https?:\/\//, '').replace(/\W/g, '_') + `_${Date.now()}.html`;
}
// stores page snapshot in HTML (filesystem) and Markdown (database)
async function storeSnapshot(url, markdown, html) {
const timestamp = new Date().toISOString();
const snapshotsDir = path.join(__dirname, 'snapshots');
if (!fs.existsSync(snapshotsDir)) {
fs.mkdirSync(snapshotsDir, { recursive: true });
}
const htmlFilename = path.join(snapshotsDir, sanitizeFilename(url));
fs.writeFileSync(htmlFilename, html, 'utf8');
db.run(
`INSERT OR REPLACE INTO snapshots (url, timestamp, markdown, html_path) VALUES (?, ?, ?, ?)`,
[url, timestamp, markdown, htmlFilename],
(err) => {
if (err) console.error('Error inserting snapshot:', err.message);
else console.log(`✅ Snapshot stored: ${url}`);
}
);
}
// scrapes a single given URL
async function scrapePage(url) {
console.log(`🚀 Connecting to Bright Data Scraping Browser...`);
let browser = null;
let page = null;
try {
browser = await puppeteer.connect({ browserWSEndpoint: SBR_WS_ENDPOINT });
console.log(`Navigating to ${url}...`);
page = await browser.newPage();
// ensure page is fully loaded
await page.goto(url, { waitUntil: 'domcontentloaded', timeout: 120000 });
try {
await page.waitForSelector('.marvel_database_section', { timeout: 30000 });
} catch (selectorErr) {
console.warn(`⚠️ Marvel database section not found on page. Will try to scrape anyway.`);
}
console.log('Scraping page content...');
const html = await page.content();
const markdown = turndownService.turndown(html);
await storeSnapshot(url, markdown, html);
} catch (err) {
console.error(`❌ Error scraping ${url}: ${err.message}`);
if (err.stack) {
console.debug(err.stack.split('\n').slice(0, 3).join('\n'));
}
} finally {
if (page) {
try {
await page.close();
console.log(`Page closed successfully`);
} catch (closeErr) {
console.error(`Error closing page: ${closeErr.message}`);
}
}
if (browser) {
try {
await browser.close();
console.log(`Browser closed successfully`);
} catch (closeErr) {
console.error(`Error closing browser: ${closeErr.message}`);
}
}
}
}
// automated scraping function
async function scrapeAndArchive(urls) {
console.log(`📋 Starting batch scraping of ${urls.length} URLs...`);
// connect to browser once for the entire batch
console.log(`🚀 Connecting to Bright Data Scraping Browser...`);
let browser = null;
try {
browser = await puppeteer.connect({ browserWSEndpoint: SBR_WS_ENDPOINT });
for (const url of urls) {
console.log(`\n⏳ Processing: ${url}`);
let page = null;
try {
console.log(`Navigating to ${url}...`);
page = await browser.newPage();
// ensure page is fully loaded
await page.goto(url, { waitUntil: 'domcontentloaded', timeout: 120000 });
try {
await page.waitForSelector('.marvel_database_section', { timeout: 30000 });
} catch (selectorErr) {
console.warn(`⚠️ Required section not found. Archiving the page anyway.`);
}
console.log('Scraping page content...');
const html = await page.content();
const markdown = turndownService.turndown(html);
await storeSnapshot(url, markdown, html);
} catch (err) {
console.error(`❌ Error scraping ${url}: ${err.message}`);
if (err.stack) {
console.debug(err.stack.split('\n').slice(0, 3).join('\n'));
}
} finally {
if (page) {
try {
await page.close();
console.log(`Page closed successfully`);
} catch (closeErr) {
console.error(`Error closing page: ${closeErr.message}`);
}
}
}
// add a small delay between requests to avoid overloading the service
console.log(`Waiting 2 seconds before next URL...`);
await new Promise(resolve => setTimeout(resolve, 2000));
}
} catch (browserErr) {
console.error(`❌ Browser error: ${browserErr.message}`);
} finally {
if (browser) {
try {
await browser.close();
console.log(`Browser closed successfully`);
} catch (closeErr) {
console.error(`Error closing browser: ${closeErr.message}`);
}
}
}
console.log('✅ Batch scraping completed!');
}
// check if target_urls.csv exists, and use it by default if it does
const csvPath = path.join(__dirname, 'target_urls.csv');
if (fs.existsSync(csvPath)) {
console.log('📄 Found target_urls.csv - processing URLs from CSV file');
// Read URLs from CSV file
const urls = [];
fs.createReadStream(csvPath)
.pipe(csv())
.on('data', (row) => {
if (row.url) urls.push(row.url);
})
.on('end', () => {
if (urls.length > 0) {
console.log(`Found ${urls.length} URLs in CSV file`);
scrapeAndArchive(urls);
} else {
console.error('No URLs found in the CSV file or invalid format');
}
})
.on('error', (error) => {
console.error('Error reading CSV file:', error.message);
});
} else {
// error out if no csv file is found
console.error('Error: target_urls.csv not found');
}
Legal Considerations
Here’s the not-so-fun part: web scraping sits in a gray area. This is how I stayed compliant:
- Terms of Service: Bright Data ensures ethical use and GDPR/CCPA compliance, but the onus is on you to check TOS for the sites you scrape.
- Public Data Only: I targeted public-facing pages. Fandom wiki pages for the Marvel universe are not behind a paywall, do not require logins, and contain no user-generated private content.
- Attribution: My archive retains original URLs and timestamps for provenance. Each snapshot in my database includes the original URL (so the source can be referenced), a timestamp (so users know when the data was collected), and, of course, I keep unaltered HTML versions for easy review.
- Rate Limiting: Even with the built-in proxies in the Scraping Browser, aggressively scraping a site can put unnecessary load on its servers, and that’s how you get IPs banned. I add a 2-second delay between requests to avoid spamming the server, and I avoid scraping in parallel or during peak hours to reduce server strain.
Where to Go From Here?
Using Bright Data’s Scraping Browser, I built a pipeline to archive webpage snapshots as Markdown and HTML. It’s a developer-friendly tool that abstracts away scraping headaches — proxies, CAPTCHAs, rendering — leaving me to focus on data processing and storage. Whether for AI, history, or analytics, this is a practical way to preserve the web’s fleeting data.
Again, feel free to fork this repo and work on it. Here are some cool things you might want to tackle if you want to take this further:
- Make `sanitizeFilename()` more robust. Right now, it replaces all non-word characters (`\W`) with underscores. For some pages, this will create very long filenames and could exceed filesystem limits. Maybe you could limit the filename length and add a hash instead of using `Date.now()` (see the first sketch after this list).
- To use this dataset to train domain-specific AI assistants, you’ll have to clean the Markdown. Also, split long Markdown docs into manageable pieces (e.g., one Markdown file per character or story arc, here); LLMs perform much better with smaller, well-scoped training examples. Store the results as JSON/JSONL if you’re fine-tuning Llama 3 or Mistral using Hugging Face’s `transformers` library. This is what I ended up doing (a rough sketch of that export also follows this list).
- If you don’t want to fine-tune a model from scratch, store your scraped data in a vector database (like Pinecone or Weaviate) and use RAG to dynamically fetch relevant context when the AI answers a question.
- With a Chromium built for serverless like @sparticuz/chromium, you could deploy this scraper as an AWS Lambda, Google Cloud Function, or Vercel Serverless Function. You’d have to move Markdown storage to a managed database (Supabase etc.) instead of local SQLite, and offload the HTML snapshots to S3, IPFS, or another decentralized storage solution.
- If you don’t want to write all the Puppeteer scripts yourself, you could use Bright Data’s Web Scraper APIs, a low-code solution for building and automating web scrapers at scale, with results delivered to your storage of choice asynchronously.
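On the first point, here’s one possible direction. This is just a sketch, not a drop-in replacement from the project, and it assumes you’re fine with deterministic filenames: it hashes the full URL with Node’s built-in `crypto` module and keeps only a truncated readable prefix, so names stay short and unique.
// Sketch of a hardened sanitizeFilename(): short readable prefix + a hash of the full URL.
// Deterministic (same URL -> same filename), so a re-scrape overwrites the old HTML file
// instead of piling up timestamped copies.
const crypto = require('crypto');

function sanitizeFilenameHashed(url) {
  const hash = crypto.createHash('sha256').update(url).digest('hex').slice(0, 12);
  const readable = url
    .replace(/https?:\/\//, '')
    .replace(/\W/g, '_')
    .slice(0, 60); // keep well under common filesystem filename limits
  return `${readable}_${hash}.html`;
}
That deterministic naming also pairs nicely with the `INSERT OR REPLACE` in `storeSnapshot()`, since both the database row and the HTML file get updated in place.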
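On the JSON/JSONL point, here’s a rough sketch of what that export step could look like, reading straight from `archive.db`. The output filename and the record fields (`source`, `text`) are my own placeholders, so adjust them to whatever your fine-tuning pipeline expects.
// Rough sketch: dump archived Markdown from SQLite into a JSONL file for fine-tuning.
const fs = require('fs');
const sqlite3 = require('sqlite3').verbose();
const db = new sqlite3.Database('archive.db');

db.all(`SELECT url, markdown FROM snapshots`, [], (err, rows) => {
  if (err) return console.error('Query failed:', err.message);
  // one JSON object per line; clean and split the Markdown further before real training
  const lines = rows.map((row) => JSON.stringify({ source: row.url, text: row.markdown }));
  fs.writeFileSync('training_data.jsonl', lines.join('\n') + '\n', 'utf8');
  console.log(`Wrote ${lines.length} records to training_data.jsonl`);
  db.close();
});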
Regardless, try it out — Bright Data’s free trial is a low-risk starting point. Let me know how you’d tweak this for your own projects!