With the recent buzz about AI agents and apps, I decided to work on a fun project — a real-time content summarizer that pulls trending posts from Reddit and X (formerly Twitter), runs them through a local large language model (LLM), and generates concise digests.
The hardest part of this project would have been gathering the data: scraping felt like too much hassle for too little payoff, especially since everything would need to be cleaned afterwards. Luckily, I found and made use of structured datasets for both Reddit and X, which gave me clean, high-quality, and up-to-date content without the usual pain points like rate limits or fragile scraping logic.
So in this post, I’ll walk you through how I built this tool using ready-made datasets, LangChain, Ollama, and Streamlit.
What we’re building
The idea was simple: build a tool that fetches trending content and turns it into a quick, digestible summary.
Think of it like a personal digest bot. It grabs the latest posts from Reddit and X, runs them through a local LLM via LangChain, and then spits out clean, summarized takes you can actually keep up with. The goal wasn’t to reinvent the wheel — it was just to see how far I could go by connecting the right tools and, more importantly, feeding them the right data.
Everything runs inside a Streamlit app, making it easy to interact with and iterate on. You load it up, and boom — fresh summaries, ready to go.
Nothing crazy. Just a small, focused project that shows how powerful good data and a few well-picked tools can be.
Tools and tech stack
I kept the stack minimal, just what I needed to get from raw content to digestible summaries without overengineering anything.
- Bright Data Datasets: This was the backbone of the project. Rather than dealing with the complexities of web scraping, I made use of Bright Data’s structured Reddit and X datasets. They offered clean, up-to-date, and reliable data — eliminating concerns around rate limits, broken HTML, and time-consuming maintenance.
- LangChain: This handled the entire LLM pipeline. It made it super easy to structure the prompt, pass in the content, and get clean summaries back from the model.
- Ollama: I wanted to keep everything local, so I used Ollama to run the LLM. It was fast, lightweight, and didn’t need a cloud API key to get started. Perfect for real-time summaries.
- Streamlit: For the UI, Streamlit made it almost too easy. A few lines of code, and I had a functional, clean dashboard to display digests.
Each of these tools played a role, but again, none of it would matter without the clean, structured data from Bright Data. That’s really what made the whole thing click.
Getting the data: Using Bright Data datasets
Here’s how I got the data for Reddit and X (formerly Twitter):
Step 1: Sign in to Bright Data
Head over to Bright Data and log in to your dashboard. If you don’t have an account, you can create one for free.
Step 2: Open the dataset marketplace
On the dashboard sidebar, click on Web Datasets, then select Dataset Marketplace.
Step 3: Search for the Reddit dataset
In the search bar, type “Reddit — Posts”. Click on the result that matches, preview the dataset, and proceed to purchase it.
Once purchased, download the dataset in CSV format.
Step 4: Do the same for X (formerly Twitter)
Repeat the same process — this time searching for “X (formerly Twitter) — Posts”.
Also, purchase and download that dataset. You can try out their sample data, see how it looks, and play with it before purchasing.
Alternatively, you can also obtain clean data from Reddit, X, and other platforms using their web scrapers, available in both code and no-code options.
Step 5: Explore the CSV files
Each CSV gives you a clean structure with fields like post title, text, timestamp, engagement metrics, and more — perfect for feeding into the LLM later on.
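If you want a quick look before building anything, a few lines of pandas will show you what you downloaded. The paths below match the data/ folder we set up later in the project; adjust them to wherever you saved the CSVs:

import pandas as pd

# Peek at the downloaded datasets before wiring them into the app.
reddit_df = pd.read_csv("data/reddit_posts.csv")
x_df = pd.read_csv("data/x_posts.csv")

print(reddit_df.shape, x_df.shape)      # how many posts and columns we have
print(reddit_df.columns.tolist())       # e.g. title, description, num_upvotes, date_posted, ...
print(reddit_df.head(3))                # eyeball a few rows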
Step-by-Step implementation
Now that we’ve got the data and our tools in place, let’s break down how everything comes together. This part covers setting up the project from scratch, installing dependencies, and spinning up the local LLM with Ollama.
Setting Up the Project
Let’s kick things off by structuring the project. I like to keep things simple and modular, especially when experimenting.
1. Create a Project Folder
mkdir real-time-content-digest
cd real-time-content-digest
2. Set Up a Virtual Environment
If you’re using Python:
python -m venv venv
source venv/bin/activate # On Windows: venv\Scripts\activate
3. Install Required Dependencies
Here’s what we need:
- streamlit — for the UI
- pandas — for working with the datasets
- langchain and langchain_community — to build the LLM chains and talk to Ollama
- ollama — to run the local model (installed separately, not via pip)
- openai — optional, if you want to switch between local and hosted models
Install everything at once
pip install streamlit pandas langchain langchain_community openai
You’ll also need ollama installed on your machine. More on that below.
4. Define the Project Structure
Here’s a simple folder structure to work with:
real-time-content-digest/
├── data/
│   ├── reddit_posts.csv
│   └── x_posts.csv
└── main.py
Installing & Running the Ollama LLaVA-Llama3 Model
For local inference, I used the llava-llama3 model with Ollama. It’s lightweight and works great for real-time summarization tasks.
1. Install Ollama
You can install Ollama by following the instructions at: https://ollama.com/download (Available for macOS, Windows, and Linux)
2. Pull the LLaVA-Llama3 model
Once Ollama is installed, run the following command:
ollama pull llava-llama3:latest
This fetches the latest version of the llava-llama3 model.
3. Run the model locally
Start the model by running:
ollama run llava-llama3
Once Ollama is running, it also exposes a local API at http://localhost:11434, which is the endpoint LangChain will talk to.
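Before writing any app code, you can confirm the model is reachable with a quick request against that API. This uses Ollama's standard generate endpoint; the prompt is just a throwaway test:

import requests

# Quick smoke test against the local Ollama API.
resp = requests.post(
    "http://localhost:11434/api/generate",
    json={
        "model": "llava-llama3",
        "prompt": "Summarize in one sentence: local LLMs are useful for fast, private inference.",
        "stream": False,
    },
    timeout=120,
)
print(resp.json()["response"])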
That’s it. With the model running, we’re ready to start loading data, generating summaries, and connecting everything with Streamlit.
Building the Application Components
Now that our environment is set up, let’s implement each part of our content summarizer bot in the main.py file.
1. Data Processing Layer
The first component we need is the data processing layer, which handles loading and preparing the CSV data. We start main.py with the imports used throughout the app, then add the loaders and a small text-cleaning helper:
import re

import pandas as pd
import streamlit as st
from langchain.chains import LLMChain
from langchain.prompts import PromptTemplate
from langchain_community.llms import Ollama


def process_reddit_data(file):
    try:
        df = pd.read_csv(file)
        required_columns = ["post_id", "title", "description", "num_comments",
                            "date_posted", "community_name", "num_upvotes", "comments"]
        # Check if the required columns exist
        for col in required_columns:
            if col not in df.columns:
                missing = [c for c in required_columns if c not in df.columns]
                st.warning(f"Warning: Missing columns in Reddit data: {missing}")
                break
        # Return the processed dataframe
        return df
    except Exception as e:
        st.error(f"Error processing Reddit data: {e}")
        return None


def process_twitter_data(file):
    try:
        df = pd.read_csv(file)
        required_columns = ["id", "user_posted", "name", "description",
                            "date_posted", "replies", "reposts", "likes"]
        # Check if the required columns exist
        for col in required_columns:
            if col not in df.columns:
                missing = [c for c in required_columns if c not in df.columns]
                st.warning(f"Warning: Missing columns in Twitter/X data: {missing}")
                break
        # Return the processed dataframe
        return df
    except Exception as e:
        st.error(f"Error processing Twitter/X data: {e}")
        return None


def clean_text(text):
    if pd.isna(text):
        return ""
    # Remove URLs
    text = re.sub(r'http\S+', '', text)
    # Remove special characters and extra whitespace
    text = re.sub(r'\s+', ' ', text).strip()
    return text
These functions handle loading the CSV files, validating that they contain the necessary columns, and cleaning text content. The clean_text function removes URLs and normalizes the text to improve processing quality.
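In the app itself, these loaders are fed by Streamlit file uploaders. Here's a minimal sketch of that wiring; the widget labels and session_state keys are my own naming, not from the original upload code:

# Sketch: wiring the loaders to Streamlit file uploads
# (labels and session_state keys are illustrative).
reddit_file = st.file_uploader("Upload the Reddit CSV", type="csv")
twitter_file = st.file_uploader("Upload the X/Twitter CSV", type="csv")

if reddit_file is not None:
    st.session_state["reddit_df"] = process_reddit_data(reddit_file)
if twitter_file is not None:
    st.session_state["twitter_df"] = process_twitter_data(twitter_file)

if st.session_state.get("reddit_df") is not None:
    st.dataframe(st.session_state["reddit_df"].head())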
2. LLM Integration
Next, we’ll implement the integration with the Ollama model for content generation:
@st.cache_resource
def get_llm():
    try:
        # Use the same model we pulled earlier with Ollama
        return Ollama(model="llava-llama3")
    except Exception as e:
        st.error(f"Error initializing Ollama model: {e}")
        return None


llm = get_llm()
We use Streamlit’s cache_resource decorator to ensure that we only initialize the Ollama model once, which improves performance.
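A quick way to confirm the wrapper works before building any chains is a one-off call; the prompt here is just a smoke test:

# One-off smoke test of the LangChain Ollama wrapper.
if llm is not None:
    st.write(llm.invoke("Reply with one short sentence confirming you are running."))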
3. Content Summarization Functions
Now we implement the core functionality for summarizing content from both platforms:
def summarize_reddit_post(post_data):
    prompt_template = PromptTemplate(
        input_variables=["title", "description", "community", "comments", "upvotes"],
        template="""
        Summarize the following Reddit post:
        Title: {title}
        Community: {community}
        Upvotes: {upvotes}
        Content: {description}
        Key comments (if available): {comments}
        Please provide:
        1. A concise TL;DR (2-3 sentences)
        2. 3-5 key takeaways
        3. The most insightful quote from the post
        4. Sentiment analysis (positive, negative, neutral)
        """
    )

    # Extract relevant info from post data
    title = post_data.get('title', 'No title available')
    description = post_data.get('description', 'No description available')
    community = post_data.get('community_name', 'Unknown community')

    # Process comments
    comments_list = post_data.get('comments', '')
    if isinstance(comments_list, str) and comments_list:
        comments = comments_list[:500] + "..." if len(comments_list) > 500 else comments_list
    else:
        comments = "No comments available"

    upvotes = post_data.get('num_upvotes', '0')

    chain = LLMChain(llm=llm, prompt=prompt_template)
    return chain.run(title=title, description=description, community=community, comments=comments, upvotes=upvotes)
The function for Twitter (X) content follows a similar pattern but is adapted for the different data structure.
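For reference, here is what that X/Twitter counterpart can look like. It's a sketch along the same lines, using the column names checked in process_twitter_data; the function name and prompt wording are my own:

def summarize_twitter_post(post_data):
    # Sketch of the X/Twitter summarizer, mirroring summarize_reddit_post.
    # Field names follow the columns validated in process_twitter_data.
    prompt_template = PromptTemplate(
        input_variables=["user", "description", "likes", "reposts", "replies"],
        template="""
        Summarize the following X (Twitter) post:
        Posted by: {user}
        Content: {description}
        Engagement: {likes} likes, {reposts} reposts, {replies} replies
        Please provide:
        1. A concise TL;DR (1-2 sentences)
        2. 2-3 key takeaways
        3. Sentiment analysis (positive, negative, neutral)
        """
    )

    user = post_data.get('user_posted', 'Unknown user')
    description = clean_text(post_data.get('description', 'No content available'))
    likes = post_data.get('likes', '0')
    reposts = post_data.get('reposts', '0')
    replies = post_data.get('replies', '0')

    chain = LLMChain(llm=llm, prompt=prompt_template)
    return chain.run(user=user, description=description, likes=likes, reposts=reposts, replies=replies)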
4. Format Conversion Functions
Next, we implement functions to convert summaries into different output formats:
def convert_to_newsletter(summary, source_type, source_data):
    if source_type == "Reddit":
        title = source_data.get('title', 'Untitled Post')
        community = source_data.get('community_name', 'Unknown Community')
        url = source_data.get('url', '#')

        prompt_template = PromptTemplate(
            input_variables=["summary", "title", "community", "url"],
            template="""
            Convert this Reddit post summary into a newsletter segment:
            Original Post Title: {title}
            From r/{community}
            URL: {url}
            Summary: {summary}
            Write an engaging newsletter segment that includes:
            1. An attention-grabbing headline
            2. A brief introduction (1-2 sentences)
            3. The main insights formatted with bullet points
            4. A closing thought that encourages readers to check out the original post
            Format the output in Markdown.
            """
        )

        chain = LLMChain(llm=llm, prompt=prompt_template)
        return chain.run(summary=summary, title=title, community=community, url=url)
A similar function is implemented for social media post creation, with adjustments for character limits and platform-specific conventions.
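As a rough sketch of that, here is one way to write the social post converter. The function name, prompt wording, and the 280-character constraint are my own choices, not taken from the original code:

def convert_to_social_post(summary, source_type, source_data):
    # Sketch of a short-form converter (hypothetical helper, mirroring
    # convert_to_newsletter but aimed at social posts).
    title = source_data.get('title', source_data.get('description', 'Untitled'))
    url = source_data.get('url', '#')

    prompt_template = PromptTemplate(
        input_variables=["summary", "title", "url", "source_type"],
        template="""
        Turn this {source_type} post summary into a short social media post:
        Title: {title}
        URL: {url}
        Summary: {summary}
        Requirements:
        1. Keep it under 280 characters
        2. Make it punchy and engaging
        3. Add 1-2 relevant hashtags
        """
    )

    chain = LLMChain(llm=llm, prompt=prompt_template)
    return chain.run(summary=summary, title=title, url=url, source_type=source_type)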
5. User Interface Implementation
Finally, we implement the Streamlit UI for our application:
# Set page configuration
st.set_page_config(
    page_title="Content Summarizer Bot",
    page_icon="📝",
    layout="wide"
)

# Title and description
st.title("Real-Time Content Summarizer / Digest Bot")
st.markdown("""
This app takes trending content from Reddit and Twitter/X, summarizes key points,
and can convert them into newsletter or social post formats.
""")

# Main interface
tab1, tab2 = st.tabs(["Data Processing", "Content Generation"])

with tab1:
    st.header("Upload Data")
    # Data upload UI code...

with tab2:
    st.header("Generate Content Summaries")
    # Content generation UI code...
The complete UI implementation includes tabs for data processing and content generation, along with controls for selecting posts, generating summaries, and converting to different formats.
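To make that concrete, here is a minimal sketch of what the Content Generation tab can look like, assuming the upload tab stored the Reddit dataframe in st.session_state as sketched earlier; the widget labels and flow are illustrative, not the full UI:

with tab2:
    st.header("Generate Content Summaries")
    # Minimal sketch of the generation flow (illustrative).
    reddit_df = st.session_state.get("reddit_df")
    if reddit_df is None or reddit_df.empty:
        st.info("Upload a Reddit CSV in the Data Processing tab first.")
    else:
        post_title = st.selectbox("Pick a Reddit post", reddit_df["title"].dropna().tolist())
        post_data = reddit_df[reddit_df["title"] == post_title].iloc[0].to_dict()
        as_newsletter = st.checkbox("Also format as a newsletter segment")

        if st.button("Summarize"):
            with st.spinner("Summarizing..."):
                summary = summarize_reddit_post(post_data)
            st.markdown(summary)
            if as_newsletter:
                st.markdown(convert_to_newsletter(summary, "Reddit", post_data))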
Run the application
Run the Streamlit app:
streamlit run main.py
Your real-time content summarizer and digest bot is now live! 🎉
Conclusion
Starting with clean, relevant, and timely datasets from Reddit and X (formerly Twitter) made all the difference.
By layering structured datasets with open-source tools like LangChain, Ollama, and Streamlit, I was able to build a bot that cuts through the noise, and that’s only possible because the data it starts with is high-quality, well-scoped, and ready for transformation.
For anyone working with LLMs, don’t obsess over the model alone; it’s only part of the equation. Choosing the right dataset often solves half the problem.