AI chatbots have become an integral part of customer engagement, information retrieval, and task automation. However, their usefulness depends largely on the quality and relevance of the data they can access. While many chatbots rely solely on pre-defined knowledge bases, integrating real-time structured data can significantly elevate their capabilities.
This article walks you through the process of enhancing an AI chatbot using Bright Data’s Data for AI datasets, which are built to power AI and LLM projects, together with OpenAI’s language models and LangChain’s data processing tools. You will use these tools to create a chatbot that delivers accurate, context-rich, real-time responses to user queries.
Why Real-Time Data Matters for AI Chatbots
The value of AI chatbots lies in their ability to provide accurate, contextually relevant, and timely information. Static datasets quickly become outdated, however, limiting a chatbot’s usefulness, so keeping your chatbot accurate means feeding it real-time data. Real-time data integration offers the following benefits:
1. Enhanced Accuracy
By incorporating up-to-date datasets like Bright Data’s Wikipedia Articles, chatbots can provide precise answers that reflect the latest available information. This ensures users receive accurate responses, especially for topics where facts are constantly changing, such as scientific discoveries, current events, or product updates.
2. Improved User Engagement
When users feel that their chatbot interactions are dynamic and contextually relevant, engagement increases. Real-time data enables chatbots to handle nuanced queries, making conversations more personalised and meaningful. This not only enhances user satisfaction but also fosters long-term trust in the chatbot’s capabilities.
3. Broader Use Cases
With real-time access to data, AI chatbots can address a wide range of applications, including:
- Education: Answering queries with up-to-date information for students and researchers.
- Customer Support: Offering real-time solutions based on the latest knowledge about products or services.
- Market Analysis: Analyzing and responding to trends based on current data.
4. Competitive Advantage
Businesses that deploy chatbots with real-time data gain an advantage over competitors using static systems. They can respond faster to evolving customer needs and more effectively adapt to market changes.
Overview of Bright Data’s Data for AI Dataset
Bright Data’s Data for AI dataset is a resource designed to fuel AI and LLM (Large Language Model) projects at every stage — from pre-training to fine-tuning and beyond. It offers a vast collection of over 5 billion LLM-friendly records sourced from 100+ trusted providers, all structured, cleaned, validated, and refreshed on a monthly basis. This ensures the dataset remains accurate and up-to-date for high-quality AI applications.
If the specific data you require isn’t available, Bright Data also enables you to build a custom web data pipeline tailored to your needs, using either a code or a no-code option. This flexibility makes the platform a go-to choice for developers and non-developers seeking robust, scalable, and relevant data for their AI systems.
For this article, you’ll use Wikipedia articles about engineering from Bright Data’s Data for AI dataset to build a chatbot. Follow these steps to access and prepare the dataset:
Steps to Access the Dataset
1. Sign in to the Bright Data Dashboard: Log in to your Bright Data account to access its suite of tools and datasets.
2. Navigate to the Web Datasets Section:
- On the sidebar menu, click on “Web Datasets.”
- Next, click on “Dataset Marketplace” to explore available datasets.
3. Select the Data for AI Category:
- Under the “Categories” filter, choose “Data for AI.”
- This redirects you to a curated list of datasets designed specifically for AI applications.
4. Choose the Wikipedia Articles Dataset:
- Scroll down and locate “Wikipedia Articles.”
- Within this category, select “Wikipedia articles about engineering.” This dataset contains structured information on engineering topics, perfect for training or enhancing AI chatbots.
5. Purchase and Download the Dataset:
- Click on “Proceed to Purchase” to acquire the dataset.
- You can choose between JSON and CSV formats (this article uses the CSV format).
- Once purchased, the dataset will be downloaded to your local machine. Save it as Wikipedia_articles.csv for easy access; a quick inspection sketch follows these steps.
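Before wiring the file into any code, it is worth confirming what you downloaded. The snippet below is a minimal inspection sketch; the column names (title, raw_text, categories) match those used by the chatbot code later in this article:
import pandas as pd

# Peek at the downloaded dataset before building anything on top of it
df = pd.read_csv("Wikipedia_articles.csv")
print(df.shape)  # number of rows and columns
print(df[["title", "raw_text", "categories"]].head())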
What Are the Benefits of Using Bright Data’s Data for AI Dataset?
This dataset is ideal for AI-driven projects because of its:
- Access to a Wide Range of Datasets: Bright Data offers an expansive library of datasets, including over 5 billion LLM-friendly records sourced from 100+ reliable providers. These datasets cover various categories, such as engineering, healthcare, finance, and more, enabling you to find relevant data for virtually any AI application.
- Easy Integration with AI Workflows: Bright Data’s datasets are available in formats like JSON and CSV, making them easy to integrate with tools like OpenAI’s GPT models and frameworks like LangChain. This compatibility accelerates the development and deployment of AI solutions.
- User-Friendly Interface: The platform’s intuitive dashboard and streamlined navigation make data accessible to users of all skill levels.
- Dedicated Support and Documentation: Bright Data provides customer support, data experts, and detailed documentation to help you get the most out of your data.
- Ethically Sourced: Bright Data emphasises ethical data sourcing, ensuring compliance with intellectual property laws and fair usage policies. This guarantees that AI applications using the dataset operate transparently and responsibly, a critical aspect for businesses and developers.
- Scalability: You can fetch large datasets without manual collection or worrying about changes in website structure or anti-scraping mechanisms.
How to Integrate Bright Data’s Wikipedia Dataset into Your AI Chatbot
In this section, you’ll integrate Bright Data’s Wikipedia dataset into your AI chatbot using OpenAI and LangChain, enabling the chatbot to provide accurate, context-rich responses by drawing on Wikipedia’s wealth of structured data.
Step 1: Prerequisites
Before proceeding, ensure you have the following tools and data set up:
1. Get Bright Data’s Wikipedia Dataset:
- Obtain and download the dataset in CSV format.
- Save the file as Wikipedia_articles.csv in your project directory.
2. OpenAI API Key: Register with OpenAI and generate your API key.
3. Python Environment: Install the necessary Python libraries using the following command:
pip install flask pandas langchain-openai langchain-community faiss-cpu langchain-text-splitters
4. Project Structure: Organize your files as outlined in the folder structure below:
your_project_folder/
├── app.py
├── wiki_chatbot.py
├── Wikipedia_articles.csv
└── templates/
    └── index.html
Step 2: Load and Process the Dataset
The wiki_chatbot.py script handles dataset loading, processing, and integration with the AI model.
1. Read and Combine Data:
- The load_and_process_data method in WikipediaChatbot reads the dataset using pandas and combines relevant columns like title, raw_text, references, etc., into a single combined_text column.
df['combined_text'] = df.apply(lambda row: f"""
Title: {row['title']}
Content: {row['raw_text']}
See Also: {row['see_also']}
References: {row['references']}
External Links: {row['external_links']}
Categories: {row['categories']}
""", axis=1)
2. Text Splitting for Efficiency:
- The RecursiveCharacterTextSplitter divides large texts into smaller chunks for better indexing and faster retrieval.
text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200, length_function=len)
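To see what the splitter produces, you can try it on a single combined record. This is an illustrative check, not part of the final script; each chunk stays under 1,000 characters, with 200 characters shared between consecutive chunks:
# Split one combined article and inspect the resulting chunks
chunks = text_splitter.split_text(df['combined_text'].iloc[0])
print(f"{len(chunks)} chunks; first chunk is {len(chunks[0])} characters")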
3. Create a Vector Store:
- Use the FAISS library to build a vector database from the processed text, allowing the chatbot to retrieve relevant information efficiently.
self.vectorstore = FAISS.from_texts(texts, self.embeddings)
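Once the store is built, you can sanity-check retrieval with a similarity search before involving the language model at all. The query below is only an example:
# Retrieve the three chunks most similar to a sample query
docs = self.vectorstore.similarity_search("history of civil engineering", k=3)
for doc in docs:
    print(doc.page_content[:120])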
4. Initialise a Conversational Chain:
- The ConversationalRetrievalChain is set up using OpenAI’s GPT model, enabling intelligent responses and retrieval of source documents.
self.qa_chain = ConversationalRetrievalChain.from_llm(
    llm=self.llm,
    retriever=self.vectorstore.as_retriever(),
    return_source_documents=True
)
Here is the complete code for the wiki_chatbot.py script:
import pandas as pd
from langchain_openai.embeddings import OpenAIEmbeddings
from langchain_community.vectorstores import FAISS
from langchain_openai.chat_models import ChatOpenAI
from langchain.chains import ConversationalRetrievalChain
from langchain_text_splitters import RecursiveCharacterTextSplitter
import os
from typing import List, Dict
class WikipediaChatbot:
    def __init__(self, openai_api_key: str):
        """
        Initialize the Wikipedia chatbot with OpenAI API key

        Args:
            openai_api_key (str): Your OpenAI API key
        """
        self.openai_api_key = openai_api_key
        os.environ["OPENAI_API_KEY"] = openai_api_key

        # Initialize OpenAI and embedding models
        self.llm = ChatOpenAI(temperature=0.7, model_name="gpt-4")
        self.embeddings = OpenAIEmbeddings()

        # Initialize conversation history
        self.chat_history = []

    def load_and_process_data(self, csv_path: str) -> None:
        """
        Load and process Wikipedia data from CSV

        Args:
            csv_path (str): Path to the CSV file containing Wikipedia data
        """
        # Read CSV file
        df = pd.read_csv(csv_path)

        # Combine relevant text columns
        df['combined_text'] = df.apply(lambda row: f"""
        Title: {row['title']}
        Content: {row['raw_text']}
        See Also: {row['see_also']}
        References: {row['references']}
        External Links: {row['external_links']}
        Categories: {row['categories']}
        """, axis=1)

        # Split text into chunks
        text_splitter = RecursiveCharacterTextSplitter(
            chunk_size=1000,
            chunk_overlap=200,
            length_function=len
        )
        texts = []
        for text in df['combined_text'].tolist():
            texts.extend(text_splitter.split_text(text))

        # Create vector store
        self.vectorstore = FAISS.from_texts(texts, self.embeddings)

        # Initialize conversation chain
        self.qa_chain = ConversationalRetrievalChain.from_llm(
            llm=self.llm,
            retriever=self.vectorstore.as_retriever(),
            return_source_documents=True
        )

    def get_response(self, query: str) -> Dict:
        """
        Get response from the chatbot

        Args:
            query (str): User's question

        Returns:
            Dict: Response containing answer and source documents
        """
        # Using invoke() instead of a direct call
        result = self.qa_chain.invoke({
            "question": query,
            "chat_history": self.chat_history
        })

        # Update chat history
        self.chat_history.append((query, result["answer"]))

        return {
            "answer": result["answer"],
            "sources": [doc.page_content for doc in result["source_documents"]]
        }

    def clear_history(self) -> None:
        """Clear conversation history"""
        self.chat_history = []
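If you want to try the class on its own before adding the web layer, a minimal smoke test might look like this (it assumes OPENAI_API_KEY is set in your environment and the CSV sits in the same directory):
import os
from wiki_chatbot import WikipediaChatbot

# Build the chatbot, index the dataset, and ask a single question
bot = WikipediaChatbot(os.environ["OPENAI_API_KEY"])
bot.load_and_process_data("Wikipedia_articles.csv")
print(bot.get_response("What is mechanical engineering?")["answer"])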
Step 3: Create the Flask Backend
The app.py script connects the chatbot with a frontend interface and manages API endpoints.
1. Load the Dataset:
- Instantiate the chatbot and load the processed dataset during app initialisation.
chatbot.load_and_process_data(csv_path)
2. Set Up Chat API:
- The /api/chat endpoint receives user queries, fetches responses from the chatbot, and returns answers with sources.
@app.route('/api/chat', methods=['POST'])
def chat():
    data = request.json
    user_message = data.get('message', '')
    response = chatbot.get_response(user_message)
    return jsonify({
        'answer': response['answer'],
        'sources': response['sources'][:3]
    })
Here is the complete code for the app.py script:
from flask import Flask, render_template, request, jsonify
from wiki_chatbot import WikipediaChatbot
import os
app = Flask(__name__)
# Initialize the chatbot
OPENAI_API_KEY = "<replace with your openai key" # Replace with your actual API key
chatbot = WikipediaChatbot(OPENAI_API_KEY)
# Load your Wikipedia CSV data
csv_path = "Wikipedia_articles.csv" # Replace with your CSV file path
chatbot.load_and_process_data(csv_path)
@app.route('/')
def home():
    return render_template('index.html')

@app.route('/api/chat', methods=['POST'])
def chat():
    try:
        data = request.json
        user_message = data.get('message', '')
        if not user_message:
            return jsonify({'error': 'No message provided'}), 400
        response = chatbot.get_response(user_message)
        return jsonify({
            'answer': response['answer'],
            'sources': response['sources'][:3]  # Limiting to top 3 sources for brevity
        })
    except Exception as e:
        return jsonify({'error': str(e)}), 500

if __name__ == '__main__':
    app.run(debug=True)
Step 4: Build the Frontend Interface
The index.html file provides a simple and interactive user interface for the chatbot.
1. Chatbox UI:
- The interface allows users to type questions, view bot responses, and check sources.
- CSS is used to style the chat container, user and bot messages, and loading indicators.
2. Send User Input to the Backend:
- JavaScript captures user input and sends it to the /api/chat endpoint using Axios.
async function sendMessage() {
    const message = document.getElementById('user-input').value;  // assumes an input with id "user-input"
    const response = await axios.post('/api/chat', { message });
    appendMessage('bot', response.data.answer, response.data.sources);
}
3. Display Responses:
- Bot responses, along with truncated source information, are displayed in the chat window.
function appendMessage(type, message, sources = null) {
    const messageDiv = document.createElement('div');
    messageDiv.textContent = message;
    if (sources) {
        messageDiv.innerHTML += '<br><strong>Sources:</strong><br>' + sources.map(s => s.substring(0, 150)).join('<br>');
    }
    document.getElementById('chat-window').appendChild(messageDiv);  // assumes a container with id "chat-window"
}
You can find the code for the index.html file here.
Step 5: Run the Chatbot
1. Start the Flask Server: Run the Flask app to host the chatbot.
python app.py
2. Access the Chatbot: Open your browser and navigate to http://127.0.0.1:5000/ to interact with the chatbot. If you’d rather test the API directly, see the sketch below.
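Here is a small sketch for testing the /api/chat endpoint from Python; it assumes the Flask server is running locally and that the requests library is installed:
import requests

# Send one question to the running chatbot and print the answer
resp = requests.post(
    "http://127.0.0.1:5000/api/chat",
    json={"message": "What is civil engineering?"},
)
print(resp.json()["answer"])  # the top three sources are in resp.json()["sources"]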
Key Features
- Enhanced Responses: The chatbot provides detailed answers supported by references from the Wikipedia dataset.
- Source Attribution: Users can verify information with links to original sources.
- Scalable Design: The setup allows integration with other Bright Data datasets for diverse applications.
By following these steps, you’ve successfully integrated Bright Data’s Wikipedia dataset into your AI chatbot. This setup allows your chatbot to deliver accurate and context-rich responses to user queries. You can find the complete code for this tutorial on GitHub.
Final Thoughts
This article covered the process of building a chatbot with real-time data obtained through Bright Data’s Data for AI platform. It demonstrated how straightforward it is to access structured public data, specifically tailored for AI and LLMs, without the need for complex scripting or manual extraction.
By integrating the Wikipedia dataset with OpenAI models and LangChain, you built a chatbot capable of providing accurate and up-to-date responses, even in specialised fields of engineering.