With the recent buzz about AI agents and apps, I decided to work on a fun project — a real-time content summarizer that pulls trending posts from Reddit and X (formerly Twitter), runs them through a local large language model (LLM), and generates concise digests.
The hardest part of this project would have been gathering the data: scraping felt like too much hassle for too little payoff, especially since everything would need to be cleaned afterwards. Luckily, I found and made use of structured datasets for both Reddit and X, which gave me clean, high-quality, and up-to-date content without the usual pain points like rate limits or fragile scraping logic.
So in this post, I’ll walk you through how I built this tool using ready-made datasets, LangChain, Ollama, and Streamlit.
What we’re building
The idea was simple: build a tool that fetches trending content and turns it into a quick, digestible summary.
Think of it like a personal digest bot. It grabs the latest posts from Reddit and X, runs them through a local LLM via LangChain, and then spits out clean, summarized takes you can actually keep up with. The goal wasn’t to reinvent the wheel — it was just to see how far I could go by connecting the right tools and, more importantly, feeding them the right data.
Everything runs inside a Streamlit app, making it easy to interact with and iterate on. You load it up, and boom — fresh summaries, ready to go.
Nothing crazy. Just a small, focused project that shows how powerful good data and a few well-picked tools can be.
Tools and tech stack
I kept the stack minimal, just what I needed to get from raw content to digestible summaries without overengineering anything.
- Bright Data Datasets: This was the backbone of the project. Rather than dealing with the complexities of web scraping, I made use of Bright Data’s structured Reddit and X datasets. They offered clean, up-to-date, and reliable data — eliminating concerns around rate limits, broken HTML, and time-consuming maintenance.
- LangChain: This handled the entire LLM pipeline. It made it super easy to structure the prompt, pass in the content, and get clean summaries back from the model.
- Ollama: I wanted to keep everything local, so I used Ollama to run the LLM. It was fast, lightweight, and didn’t need a cloud API key to get started. Perfect for real-time summaries.
- Streamlit: For the UI, Streamlit made it almost too easy. A few lines of code, and I had a functional, clean dashboard to display digests.
Each of these tools played a role, but again, none of it would matter without the clean, structured data from Bright Data. That’s really what made the whole thing click.
Getting the data: Using Bright Data datasets
Here’s how I got the data for Reddit and X (formerly Twitter):
Step 1: Sign in to Bright Data
Head over to Bright Data and log in to your dashboard. If you don’t have an account, you can create one for free.
Step 2: Open the dataset marketplace
On the dashboard sidebar, click on Web Datasets, then select Dataset Marketplace.
Step 3: Search for the Reddit dataset
In the search bar, type “Reddit — Posts”. Click on the result that matches, preview the dataset, and proceed to purchase it.
Once purchased, download the dataset in CSV format.
Step 4: Do the same for X (formerly Twitter)
Repeat the same process — this time searching for “X (formerly Twitter) — Posts”.
Also, purchase and download that dataset. You can try out their sample data, see how it looks, and play with it before purchasing.
Alternatively, you can also obtain clean data from Reddit, X, and other platforms using their web scrapers, available in both code and no-code options.
Step 5: Explore the CSV files
Each CSV gives you a clean structure with fields like post title, text, timestamp, engagement metrics, and more — perfect for feeding into the LLM later on.
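If you want a quick look before building anything, a few lines of pandas will show you what you downloaded. The paths below match the data/ folder we set up later in the project; adjust them to wherever you saved the CSVs:

import pandas as pd

# Peek at the downloaded datasets before wiring them into the app.
reddit_df = pd.read_csv("data/reddit_posts.csv")
x_df = pd.read_csv("data/x_posts.csv")

print(reddit_df.shape, x_df.shape)      # how many posts and columns we have
print(reddit_df.columns.tolist())       # e.g. title, description, num_upvotes, date_posted, ...
print(reddit_df.head(3))                # eyeball a few rows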
Step-by-Step implementation
Now that we’ve got the data and our tools in place, let’s break down how everything comes together. This part covers setting up the project from scratch, installing dependencies, and spinning up the local LLM with Ollama.
Setting Up the Project
Let’s kick things off by structuring the project. I like to keep things simple and modular, especially when experimenting.
1. Create a Project Folder
mkdir real-time-content-digest
cd real-time-content-digest
2. Set Up a Virtual Environment
If you’re using Python:
python -m venv venv
source venv/bin/activate # On Windows: venv\Scripts\activate
3. Install Required Dependencies
Here’s what we need:
- streamlit — for the UI
- pandas — for working with the datasets
- langchain and langchain_community — to build the LLM chains and talk to Ollama
- ollama — to run the local model (installed separately, not via pip)
- openai — optional, if you want to switch between local and hosted models
Install everything at once
pip install streamlit pandas langchain langchain_community openai
You’ll also need ollama installed on your machine. More on that below.
4. Define the Project Structure
Here’s a simple folder structure to work with:
real-time-content-digest/
├── data/
│   ├── reddit_posts.csv
│   └── x_posts.csv
└── main.py
Installing & Running the Ollama LLaVA-Llama3 Model
For local inference, I used the llava-llama3 model with Ollama. It’s lightweight and works great for real-time summarization tasks.
1. Install Ollama
You can install Ollama by following the instructions at: https://ollama.com/download (Available for macOS, Windows, and Linux)
2. Pull the LLaVA-Llama3 model
Once Ollama is installed, run the following command:
ollama pull llava-llama3:latest
This fetches the latest version of the llava-llama3 model.
3. Run the model locally
Start the model by running:
ollama run llava-llama3
Once Ollama is running, it also exposes a local API at http://localhost:11434, which is the endpoint LangChain will talk to.
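Before writing any app code, you can confirm the model is reachable with a quick request against that API. This uses Ollama's standard generate endpoint; the prompt is just a throwaway test:

import requests

# Quick smoke test against the local Ollama API.
resp = requests.post(
    "http://localhost:11434/api/generate",
    json={
        "model": "llava-llama3",
        "prompt": "Summarize in one sentence: local LLMs are useful for fast, private inference.",
        "stream": False,
    },
    timeout=120,
)
print(resp.json()["response"])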
That’s it. With the model running, we’re ready to start loading data, generating summaries, and connecting everything with Streamlit.
Building the Application Components
Now that our environment is set up, let’s implement each part of our content summarizer bot in the main.py file.
1. Data Processing Layer
The first component we need is the data processing layer, which handles loading and preparing the CSV data. We start main.py with the imports used throughout the app, then add the loaders and a small text-cleaning helper:
import re

import pandas as pd
import streamlit as st
from langchain.chains import LLMChain
from langchain.prompts import PromptTemplate
from langchain_community.llms import Ollama


def process_reddit_data(file):
    try:
        df = pd.read_csv(file)
        required_columns = ["post_id", "title", "description", "num_comments",
                            "date_posted", "community_name", "num_upvotes", "comments"]
        # Check if the required columns exist
        for col in required_columns:
            if col not in df.columns:
                missing = [c for c in required_columns if c not in df.columns]
                st.warning(f"Warning: Missing columns in Reddit data: {missing}")
                break
        # Return the processed dataframe
        return df
    except Exception as e:
        st.error(f"Error processing Reddit data: {e}")
        return None


def process_twitter_data(file):
    try:
        df = pd.read_csv(file)
        required_columns = ["id", "user_posted", "name", "description",
                            "date_posted", "replies", "reposts", "likes"]
        # Check if the required columns exist
        for col in required_columns:
            if col not in df.columns:
                missing = [c for c in required_columns if c not in df.columns]
                st.warning(f"Warning: Missing columns in Twitter/X data: {missing}")
                break
        # Return the processed dataframe
        return df
    except Exception as e:
        st.error(f"Error processing Twitter/X data: {e}")
        return None


def clean_text(text):
    if pd.isna(text):
        return ""
    # Remove URLs
    text = re.sub(r'http\S+', '', text)
    # Remove special characters and extra whitespace
    text = re.sub(r'\s+', ' ', text).strip()
    return text
These functions handle loading the CSV files, validating that they contain the necessary columns, and cleaning text content. The clean_text function removes URLs and normalizes the text to improve processing quality.
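In the app itself, these loaders are fed by Streamlit file uploaders. Here's a minimal sketch of that wiring; the widget labels and session_state keys are my own naming, not from the original upload code:

# Sketch: wiring the loaders to Streamlit file uploads
# (labels and session_state keys are illustrative).
reddit_file = st.file_uploader("Upload the Reddit CSV", type="csv")
twitter_file = st.file_uploader("Upload the X/Twitter CSV", type="csv")

if reddit_file is not None:
    st.session_state["reddit_df"] = process_reddit_data(reddit_file)
if twitter_file is not None:
    st.session_state["twitter_df"] = process_twitter_data(twitter_file)

if st.session_state.get("reddit_df") is not None:
    st.dataframe(st.session_state["reddit_df"].head())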
2. LLM Integration
Next, we’ll implement the integration with the Ollama model for content generation:
@st.cache_resource
def get_llm():
    try:
        # Use the same model we pulled earlier with Ollama
        return Ollama(model="llava-llama3")
    except Exception as e:
        st.error(f"Error initializing Ollama model: {e}")
        return None


llm = get_llm()
We use Streamlit’s cache_resource decorator to ensure that we only initialize the Ollama model once, which improves performance.
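A quick way to confirm the wrapper works before building any chains is a one-off call; the prompt here is just a smoke test:

# One-off smoke test of the LangChain Ollama wrapper.
if llm is not None:
    st.write(llm.invoke("Reply with one short sentence confirming you are running."))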
3. Content Summarization Functions
Now we implement the core functionality for summarizing content from both platforms:
def summarize_reddit_post(post_data):
    prompt_template = PromptTemplate(
        input_variables=["title", "description", "community", "comments", "upvotes"],
        template="""
        Summarize the following Reddit post:
        Title: {title}
        Community: {community}
        Upvotes: {upvotes}
        Content: {description}
        Key comments (if available): {comments}
        Please provide:
        1. A concise TL;DR (2-3 sentences)
        2. 3-5 key takeaways
        3. The most insightful quote from the post
        4. Sentiment analysis (positive, negative, neutral)
        """
    )

    # Extract relevant info from post data
    title = post_data.get('title', 'No title available')
    description = post_data.get('description', 'No description available')
    community = post_data.get('community_name', 'Unknown community')

    # Process comments
    comments_list = post_data.get('comments', '')
    if isinstance(comments_list, str) and comments_list:
        comments = comments_list[:500] + "..." if len(comments_list) > 500 else comments_list
    else:
        comments = "No comments available"

    upvotes = post_data.get('num_upvotes', '0')

    chain = LLMChain(llm=llm, prompt=prompt_template)
    return chain.run(title=title, description=description, community=community, comments=comments, upvotes=upvotes)
The function for Twitter (X) content follows a similar pattern but is adapted for the different data structure.
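For reference, here is what that X/Twitter counterpart can look like. It's a sketch along the same lines, using the column names checked in process_twitter_data; the function name and prompt wording are my own:

def summarize_twitter_post(post_data):
    # Sketch of the X/Twitter summarizer, mirroring summarize_reddit_post.
    # Field names follow the columns validated in process_twitter_data.
    prompt_template = PromptTemplate(
        input_variables=["user", "description", "likes", "reposts", "replies"],
        template="""
        Summarize the following X (Twitter) post:
        Posted by: {user}
        Content: {description}
        Engagement: {likes} likes, {reposts} reposts, {replies} replies
        Please provide:
        1. A concise TL;DR (1-2 sentences)
        2. 2-3 key takeaways
        3. Sentiment analysis (positive, negative, neutral)
        """
    )

    user = post_data.get('user_posted', 'Unknown user')
    description = clean_text(post_data.get('description', 'No content available'))
    likes = post_data.get('likes', '0')
    reposts = post_data.get('reposts', '0')
    replies = post_data.get('replies', '0')

    chain = LLMChain(llm=llm, prompt=prompt_template)
    return chain.run(user=user, description=description, likes=likes, reposts=reposts, replies=replies)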
4. Format Conversion Functions
Next, we implement functions to convert summaries into different output formats:
def convert_to_newsletter(summary, source_type, source_data):
    if source_type == "Reddit":
        title = source_data.get('title', 'Untitled Post')
        community = source_data.get('community_name', 'Unknown Community')
        url = source_data.get('url', '#')

        prompt_template = PromptTemplate(
            input_variables=["summary", "title", "community", "url"],
            template="""
            Convert this Reddit post summary into a newsletter segment:
            Original Post Title: {title}
            From r/{community}
            URL: {url}
            Summary: {summary}
            Write an engaging newsletter segment that includes:
            1. An attention-grabbing headline
            2. A brief introduction (1-2 sentences)
            3. The main insights formatted with bullet points
            4. A closing thought that encourages readers to check out the original post
            Format the output in Markdown.
            """
        )

        chain = LLMChain(llm=llm, prompt=prompt_template)
        return chain.run(summary=summary, title=title, community=community, url=url)
A similar function is implemented for social media post creation, with adjustments for character limits and platform-specific conventions.
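As a rough sketch of that, here is one way to write the social post converter. The function name, prompt wording, and the 280-character constraint are my own choices, not taken from the original code:

def convert_to_social_post(summary, source_type, source_data):
    # Sketch of a short-form converter (hypothetical helper, mirroring
    # convert_to_newsletter but aimed at social posts).
    title = source_data.get('title', source_data.get('description', 'Untitled'))
    url = source_data.get('url', '#')

    prompt_template = PromptTemplate(
        input_variables=["summary", "title", "url", "source_type"],
        template="""
        Turn this {source_type} post summary into a short social media post:
        Title: {title}
        URL: {url}
        Summary: {summary}
        Requirements:
        1. Keep it under 280 characters
        2. Make it punchy and engaging
        3. Add 1-2 relevant hashtags
        """
    )

    chain = LLMChain(llm=llm, prompt=prompt_template)
    return chain.run(summary=summary, title=title, url=url, source_type=source_type)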
5. User Interface Implementation
Finally, we implement the Streamlit UI for our application:
# Set page configuration
st.set_page_config(
    page_title="Content Summarizer Bot",
    page_icon="📝",
    layout="wide"
)

# Title and description
st.title("Real-Time Content Summarizer / Digest Bot")
st.markdown("""
This app takes trending content from Reddit and Twitter/X, summarizes key points,
and can convert them into newsletter or social post formats.
""")

# Main interface
tab1, tab2 = st.tabs(["Data Processing", "Content Generation"])

with tab1:
    st.header("Upload Data")
    # Data upload UI code...

with tab2:
    st.header("Generate Content Summaries")
    # Content generation UI code...
The complete UI implementation includes tabs for data processing and content generation, along with controls for selecting posts, generating summaries, and converting to different formats.
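To make that concrete, here is a minimal sketch of what the Content Generation tab can look like, assuming the upload tab stored the Reddit dataframe in st.session_state as sketched earlier; the widget labels and flow are illustrative, not the full UI:

with tab2:
    st.header("Generate Content Summaries")
    # Minimal sketch of the generation flow (illustrative).
    reddit_df = st.session_state.get("reddit_df")
    if reddit_df is None or reddit_df.empty:
        st.info("Upload a Reddit CSV in the Data Processing tab first.")
    else:
        post_title = st.selectbox("Pick a Reddit post", reddit_df["title"].dropna().tolist())
        post_data = reddit_df[reddit_df["title"] == post_title].iloc[0].to_dict()
        as_newsletter = st.checkbox("Also format as a newsletter segment")

        if st.button("Summarize"):
            with st.spinner("Summarizing..."):
                summary = summarize_reddit_post(post_data)
            st.markdown(summary)
            if as_newsletter:
                st.markdown(convert_to_newsletter(summary, "Reddit", post_data))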
Run the application
Run the Streamlit app:
streamlit run main.py
Your real-time content summarizer and digest bot is now live! 🎉
Conclusion
Starting with clean, relevant, and timely datasets from Reddit and X (formerly Twitter) made all the difference.
By layering structured datasets with open-source tools like LangChain, Ollama, and Streamlit, I was able to build a bot that cuts through the noise, and that’s only possible because the data it starts with is high-quality, well-scoped, and ready for transformation.
For anyone working with LLMs, don’t obsess over the model alone; it’s only part of the equation. Choosing the right dataset often solves half the problem.