Many fine-tuned large language models (LLMs) hallucinate not because their training datasets are incorrect, but because those datasets are generic. With the cost of AI training datasets growing at a compound annual rate of 20.5%, it is frustrating to discover that your model underperforms simply because everyone is drawing data from the same sources.
Generic datasets present several issues. To build effective models, you must go beyond standard datasets and ensure your training data is diverse, relevant, and representative of your users’ needs.
In this article, I will walk you through how to use a no-code Scraper API solution to build a custom dataset that you can use for training your AI models.
The problem with off-the-shelf datasets
Off-the-shelf datasets often fail to capture the full context of your users’ language. While they may loosely align with user intent, they are typically outdated and overused. As a result, they miss the subtle nuances in user queries.
There is also a risk of bias. When datasets are not evenly distributed across different perspectives, models may learn and reinforce that bias. Thus, personalizing your dataset helps reduce this risk.
The Solution: Creating custom datasets from niche forums
How can we address this problem? By sourcing data from niche platforms such as Reddit, Stack Overflow, and GitHub. These platforms are rich with real-world, context-specific conversations. They offer targeted, user-generated content that is difficult to replicate in generic datasets.
This approach is especially relevant to anyone who works on classification models, fine-tuning LLMs, or customizing open models. However, for this article, we will be focusing on Reddit.
Why Reddit is a goldmine for AI training data
To optimize your dataset, you do not necessarily need more data. You need better data that is relatable and rich in context. Reddit provides:
Richer context
Niche forums focus on specific topics, so users tend to communicate with more detail and clarity. Reddit, for example, does not limit post or comment length, making it ideal for collecting in-depth, structured dialogue.
Higher credibility
Many niche forums attract professionals and experts, which increases the reliability of the information shared. Subreddits like r/askscience are known for high-quality, expert-moderated answers on specialized subjects.
Authentic language
Niche forums are full of raw, unfiltered, user-generated content. This reflects how people naturally communicate, making these platforms valuable for training models on natural language, including slang, sentiment, and context.
But before you begin sourcing this data, you need to first know what you want to use it for, which brings us to the next section.
Define your use case: Chatbot, Classifier, or RAG?
Most training datasets for AI fall into three common categories, each supporting a core capability in modern applications:
- Chatbots: trained on multi-turn dialogue or question-and-answer (Q&A) datasets.
- Classifiers: trained to infer user goals from short queries, which is common in NLU systems.
- RAG: built on combined datasets of questions, documents, and context to improve factual accuracy in generated responses.
Others include sentiment analysis, summarization, etc. Creating a custom dataset does not end with collecting data from forums. It also involves fine-tuning and evaluation (testing) to validate the dataset for your specific use case.
Your use case determines the required input format for your dataset. Below are the typical formats, each illustrated with a sample record after the list:
- Chatbot: Question-and-answer (Q&A) pairs, formatted as prompt and completion
- Classifier: Labeled text examples
- RAG: High-quality text passages, split into clean, contextual chunks
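To make these formats concrete, here is roughly what a single record could look like in each case. These are hypothetical examples for illustration only, and the # lines are annotations rather than part of the files:
# Chatbot: one prompt-completion pair per JSONL line
{"prompt": "How do I reset a forgotten password?", "completion": "Go to the login page, click 'Forgot password', and follow the emailed link."}
# Classifier: a labeled text example
{"text": "The checkout page keeps timing out.", "label": "bug_report"}
# RAG: a clean, contextual text chunk with its source
{"source": "r/genetics", "text": "CRISPR-Cas9 uses a guide RNA to direct the Cas9 enzyme to a specific DNA sequence..."}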
For this article, we will focus on creating a custom training dataset for a chatbot.
How to collect data from Reddit (or any niche forum)
First, you need to get the data. While web scraping is a common way to collect data, writing scraping scripts from scratch is often inefficient and time-consuming; it’s simply not practical if you’re scraping a large amount of data.
Tools like Bright Data’s AI Scraper can help you retrieve relevant, structured datasets that are ready for fine-tuning, in seconds rather than hours.
If you’re using Bright Data, you can choose between these methods:
- The Scraper API — If you want to integrate the scraper into your code
- The No-Code Scraper — If you want to collect data through Bright Data without writing any code.
If your use case isn’t covered, you can build your own web scraper in their JavaScript IDE or request a custom scraper built specifically for you.
Steps to Collect Data from Reddit
1. Create a Bright Data account and access the “Web Scrapers Library”.
2. Search for your target domain, such as Reddit, and select it. Bright Data offers more than 120 popular domains.
3. A list of Reddit scrapers will appear. Select “Reddit — Posts — discover by keyword” for this use case.
4. Choose the “No-Code Scraper”.
5. Click “Add Input” to enter keywords related to the data you want to scrape. Then click “Start Collecting.” Without a Reddit login, you can customize parameters in Bright Data such as date range, number of posts, and sort order.
For this project, I used keywords related to “genetics.”
6. Once the scraper status shows “Ready”, click “Download”, choose “CSV” as the file format, and rename the file to reddit_dataset.csv for clarity.
What’s Inside the Dataset?
The dataset includes essential data fields that enable detailed analysis of the genetics-related posts.
Post Details
- post_id, url, user_posted: the post ID, URL, and Reddit username of the author
- title, description, date_posted: the title, description, and publication date of the post
- num_comments, num_upvotes: number of comments and upvotes per post
- photos, videos, tags: media elements and associated tags
- related_posts, comments: similar posts and associated comments
Community Details
- community_name, community_url, community_description: details identifying the subreddit and its purpose
- community_members_number, community_rank, post_karma: indicators of the community’s activity and influence
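Before processing, you can load the file with pandas to confirm these fields are present. A quick sanity check (exact column names may vary with the scraper version):
import pandas as pd

df = pd.read_csv("reddit_dataset.csv")
print(df.shape)                       # rows x columns
print(df.columns.tolist())            # available fields
print(df[["title", "num_upvotes", "num_comments"]].head())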
Build your dataset (step-by-step)
This section shows how to process and clean your raw dataset retrieved from Bright Data using Python.
For chatbot training, you’ll need question-and-answer (Q&A) pairs in a prompt-completion format. In this case, we’ll extract user queries (post titles and descriptions) and pair them with top-voted comments as answers.
Step 1: Set up the environment
1.1 Create the project directory
mkdir custom-training-dataset && cd custom-training-dataset
1.2 Set up a virtual environment
python -m venv venv
Activate the environment:
- On Windows:
venv\Scripts\activate
- On macOS/Linux:
source venv/bin/activate
1.3 Install dependencies
pip install pandas
1.4 Define the project structure
custom-training-dataset/
├── reddit_dataset.csv
└── clean.py
Step 2: Clean and annotate the Reddit dataset
2.1 Clean raw text and filter out posts
Import the required libraries, load the CSV, and then use a clean_text() function to remove markdown links, raw URLs, special characters, and line breaks.
import ast
import json
import re

import pandas as pd

# Load the Reddit CSV file
input_file = "reddit_dataset.csv"
df = pd.read_csv(input_file)

# Function to clean text (removes links, markdown, emojis, etc.)
def clean_text(text):
    if not isinstance(text, str):
        return ""
    text = re.sub(r'\[.*?\]\(.*?\)', '', text)  # Markdown links
    text = re.sub(r'http\S+', '', text)         # Raw URLs
    text = re.sub(r'[\*\_>`]', '', text)        # Markdown syntax
    text = re.sub(r'\n+', ' ', text)            # Line breaks
    text = re.sub(r'\s{2,}', ' ', text)         # Extra spaces
    return text.strip()

# Filter out posts with 0 upvotes or missing comments
filtered_df = df[(df['num_upvotes'] > 0) & (df['comments'].notnull())]
2.2 Convert to prompt-completion format
With low-quality posts already filtered out, format the cleaned dataset into prompt-completion pairs, using the top-upvoted comment as the “completion” for each post’s “prompt.”
Export the data as a JSON Lines (JSONL) file. This format is ideal for training data, as each line contains a valid JSON object.
# Prepare the prompt-completion dataset
dataset = []

for _, row in filtered_df.iterrows():
    title = clean_text(row.get('title', ''))
    description = clean_text(row.get('description', '')) if pd.notna(row.get('description')) else ''

    # Combine title and description for the prompt
    prompt = f"{title}. {description}" if description else title

    try:
        # Parse the comments field (expected to be a list of dicts with 'comment' and 'upvotes')
        comments = ast.literal_eval(row['comments'])
        if isinstance(comments, list) and comments:
            # Sort comments by upvotes (descending)
            sorted_comments = sorted(
                comments,
                key=lambda x: x.get('upvotes', 0),
                reverse=True
            )
            best_comment = clean_text(sorted_comments[0].get('comment', ''))
            if best_comment:
                dataset.append({
                    "prompt": prompt,
                    "completion": best_comment
                })
    except (ValueError, SyntaxError):
        # Skip rows with improperly formatted comment data
        continue

# Output the cleaned dataset as JSONL
output_file = "genetics_prompt_completion_dataset.jsonl"
with open(output_file, "w", encoding="utf-8") as f:
    for item in dataset:
        json.dump(item, f, ensure_ascii=False)
        f.write("\n")

print(f"✅ Done! Extracted {len(dataset)} prompt-completion pairs to '{output_file}'.")
Sample JSONL format:
{"prompt": "What is CRISPR?", "completion": "CRISPR is a gene-editing technology..."}
{"prompt": "Can gene mutations be reversed?", "completion": "In some cases, yes. Scientists..."}
Step 3: Evaluate the dataset
Use evaluation tools like TruLens to test the dataset for quality. These tools return metrics such as:
- Accuracy score
- Sentiment score
- Precision score
These scores help assess how well your dataset will perform in training a chatbot.
Tip: Google Colab is great for fine-tuning and evaluating datasets
3.1 Install and import the dependencies in a new Python script
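Before the imports below will work, install the required packages. A minimal install command (the bertscore, rouge, and meteor metrics additionally rely on the bert-score, rouge-score, and nltk packages, and BLEURT may require a separate install from its GitHub repository):
pip install trulens-eval langchain sentence-transformers scikit-learn transformers evaluate torch numpy pandas bert-score rouge-score nltk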
from trulens_eval import Feedback, Tru, TruLlama
from langchain.embeddings import HuggingFaceEmbeddings
from sentence_transformers import SentenceTransformer, util
from sklearn.metrics import accuracy_score, precision_recall_fscore_support, confusion_matrix
from transformers import pipeline, AutoTokenizer, AutoModelForSequenceClassification
import evaluate
import json
import logging
import torch
import numpy as np
from datetime import datetime
from typing import List, Dict, Any, Tuple, Optional
import pandas as pd
3.2 Configure logging and initialize the evaluation metrics and models
# Configure logging
logging.basicConfig(level=logging.INFO, format='%(asctime)s - %(levelname)s - %(message)s')
logger = logging.getLogger(__name__)
# Suppress HF warnings
logging.getLogger("transformers").setLevel(logging.ERROR)
# Initialize evaluation metrics
bertscore = evaluate.load("bertscore")
rouge = evaluate.load("rouge")
meteor = evaluate.load("meteor")
bleurt = evaluate.load("bleurt", module_type="metric") # More sensitive to small improvements
# Initialize models and tokenizers
tokenizer = AutoTokenizer.from_pretrained("nlptown/bert-base-multilingual-uncased-sentiment")
sentiment_model = pipeline(
    "text-classification",
    model="nlptown/bert-base-multilingual-uncased-sentiment",
    tokenizer=tokenizer,
    return_all_scores=True
)
# Initialize embedding model for retrieval evaluation
embedding_model = SentenceTransformer('all-MiniLM-L6-v2') # Lightweight, good performance
# Constants
MAX_LENGTH = 512 # BERT models typically have 512 token limit
3.3 Create the ComprehensiveEvaluator class
Define a ComprehensiveEvaluator class that initializes TruLens and includes helper functions to compute the individual metrics. Only the core helpers are shown here; the full script also implements evaluate_dataset(), get_summary_statistics(), save_results(), and visualize_results(), which step 3.5 calls (a sketch of those methods follows the class). Save the evaluation results for further analysis.
class ComprehensiveEvaluator:
    """Evaluates model performance on retrieval, conversation, and classification tasks."""

    def __init__(self, dataset_path: Optional[str] = None):
        """Initialize the evaluator.

        Args:
            dataset_path: Path to the dataset JSONL file
        """
        self.dataset = self.load_dataset(dataset_path) if dataset_path else []
        self.results = []
        # Initialize TruLens
        self.tru = Tru()
        # Initialize OpenAI feedback provider if using OpenAI
        # self.openai_provider = OpenAIProvider()

    def load_dataset(self, path: str) -> List[Dict[str, Any]]:
        """Load dataset from a JSONL file."""
        try:
            with open(path, "r") as f:
                return [json.loads(line) for line in f]
        except Exception as e:
            logger.error(f"Error loading dataset: {e}")
            return []

    def load_dataset_from_dict(self, data: List[Dict[str, Any]]) -> None:
        """Load dataset from an in-memory list of dictionaries."""
        self.dataset = data

    def truncate_text(self, text: str) -> str:
        """Truncate text to fit within the model's maximum token length."""
        encoded_text = tokenizer(text, truncation=True, max_length=MAX_LENGTH, return_tensors="pt")
        return tokenizer.decode(encoded_text["input_ids"][0])

    def calculate_sentiment_score(self, text: str) -> float:
        """Calculate sentiment score (0-1) where 1 is most positive."""
        try:
            scores = sentiment_model(text)[0]
            # Convert from the model's 1-5 star scale to 0-1
            total = sum(item['score'] * (int(item['label'][0]) - 1) for item in scores)
            return total / 4
        except Exception as e:
            logger.error(f"Sentiment error: {e}")
            return 0.5  # Neutral default

    def calculate_text_similarity(self, text1: str, text2: str) -> float:
        """Calculate semantic similarity between two texts using embeddings."""
        try:
            embedding1 = embedding_model.encode(text1, convert_to_tensor=True)
            embedding2 = embedding_model.encode(text2, convert_to_tensor=True)
            return float(util.pytorch_cos_sim(embedding1, embedding2).item())
        except Exception as e:
            logger.error(f"Similarity error: {e}")
            return 0.0

    def calculate_bertscore(self, prediction: str, reference: str) -> float:
        """Calculate BERTScore (F1) between prediction and reference."""
        try:
            results = bertscore.compute(
                predictions=[prediction],
                references=[reference],
                lang="en"
            )
            return results["f1"][0]
        except Exception as e:
            logger.error(f"BERTScore error: {e}")
            return 0.0
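The main block in step 3.5 calls evaluate_dataset(), get_summary_statistics(), save_results(), and visualize_results(), which are omitted from the class above for brevity. Here is a minimal sketch of how they could be implemented inside ComprehensiveEvaluator, assuming conversation-type records with an optional reference text; a complete version would also score the classification and retrieval samples and generate charts:
    # Hypothetical sketches of the remaining ComprehensiveEvaluator methods
    def evaluate_dataset(self) -> List[Dict[str, Any]]:
        """Score each record with similarity, BERTScore, and sentiment metrics."""
        self.results = []
        for item in self.dataset:
            task_type = item.get("task_type", "conversation")
            if task_type != "conversation":
                continue  # this sketch only handles conversation records
            prediction = item.get("completion", "")
            reference = item.get("reference", prediction)
            self.results.append({
                "task_type": task_type,
                "similarity": self.calculate_text_similarity(prediction, reference),
                "bertscore_f1": self.calculate_bertscore(prediction, reference),
                "sentiment": self.calculate_sentiment_score(prediction),
            })
        return self.results

    def get_summary_statistics(self) -> Dict[str, Dict[str, float]]:
        """Average each numeric metric per task type."""
        if not self.results:
            return {}
        df = pd.DataFrame(self.results)
        return {
            task_type: group.select_dtypes(include="number").mean().to_dict()
            for task_type, group in df.groupby("task_type")
        }

    def save_results(self) -> str:
        """Write the detailed results to a timestamped JSON file and return its path."""
        path = f"evaluation_results_{datetime.now():%Y%m%d_%H%M%S}.json"
        with open(path, "w", encoding="utf-8") as f:
            json.dump(self.results, f, ensure_ascii=False, indent=2)
        return path

    def visualize_results(self) -> List[str]:
        """Placeholder for chart generation; returns paths of saved figures."""
        return []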
3.4 Create sample datasets for the different task types
# Example usage for the different task types
def create_sample_datasets():
    """Create sample datasets for each task type."""
    # Sample classification dataset
    classification_data = [
        {
            "task_type": "classification",
            "prompt": "Classify this sentiment: I love this product!",
            "completion": "Positive",
            "predictions": ["Positive", "Negative", "Positive", "Positive", "Neutral"],
            "ground_truth": ["Positive", "Negative", "Positive", "Neutral", "Neutral"]
        }
    ]

    # Sample retrieval dataset
    retrieval_data = [
        {
            "task_type": "retrieval",
            "query": "What causes climate change?",
            "retrieved_docs": [
                "Climate change is caused by greenhouse gas emissions.",
                "The primary factors in climate change are human activities.",
                "Deforestation contributes to climate change by reducing carbon sinks.",
                "Industrial processes release carbon dioxide that warms the planet."
            ],
            "relevant_docs": [
                "Climate change is caused by greenhouse gas emissions.",
                "Industrial processes release carbon dioxide that warms the planet.",
                "The primary factors in climate change are human activities."
            ]
        }
    ]

    # Sample conversation dataset
    conversation_data = [
        {
            "task_type": "conversation",
            "prompt": "Can you explain how DNA replication works?",
            "completion": "DNA replication is the process by which DNA makes a copy of itself before cell division. The double helix structure unwinds, and each strand serves as a template for the creation of a new complementary strand. This process is catalyzed by enzymes like DNA polymerase.",
            "reference": "DNA replication is the biological process of producing two identical replicas of DNA from one original DNA molecule. It occurs in all living organisms and is the basis for biological inheritance. The process starts when proteins recognize the origin of replication, where the DNA double helix is unwound and unzipped by helicase. Each strand then serves as a template for the new DNA molecule."
        }
    ]

    return classification_data + retrieval_data + conversation_data
3.5 Instantiate the evaluator and print summary statistics
if __name__ == "__main__":
    # Create an instance of the evaluator
    evaluator = ComprehensiveEvaluator()

    # Option 1: Load dataset from file
    evaluator.dataset = evaluator.load_dataset("genetics_prompt_completion_dataset.jsonl")

    # Option 2: Use the built-in sample dataset instead
    # (this overwrites Option 1; keep whichever you need)
    sample_data = create_sample_datasets()
    evaluator.load_dataset_from_dict(sample_data)

    # Run evaluation
    results = evaluator.evaluate_dataset()

    # Print summary statistics
    summary = evaluator.get_summary_statistics()
    print("\n=== Evaluation Summary ===")
    for task_type, metrics in summary.items():
        print(f"\n{task_type.upper()} METRICS:")
        for metric, value in metrics.items():
            print(f"  {metric}: {value:.4f}")

    # Save results
    output_path = evaluator.save_results()
    print(f"\nDetailed results saved to: {output_path}")

    # Generate visualizations
    viz_files = evaluator.visualize_results()
    print(f"Generated {len(viz_files)} visualization files.")
Sample evaluation results
The dataset achieved the following scores:
- BERTScore: 97% — measures semantic similarity between predicted and reference texts
- Precision score: 80% — measures the quality of positive predictions
- Sentiment score: 57% — close to the neutral midpoint of 50%, indicating minimal sentiment bias
These results suggest the dataset is clean, relevant, and ready for use in training.
Conclusion
AI is here to stay, and the most effective AI systems will be driven not only by the best models but also by the best data.
And oftentimes, the best data comes from real communities where real users communicate. Off-the-shelf datasets often lack context, detail, and diversity. In contrast, niche forums like Reddit offer richer, more relevant data reflecting how users talk and think.
You can take advantage of tools like Bright Data’s AI Scraper to build domain-specific LLMs with high-quality data without the stress of writing a scraping script.