I built an AI-powered chatbot that answers questions about GitHub repositories by extracting key insights from repository data. I used Bright Data’s Data for AI Web Scraper to collect structured repository data and built the chatbot on Ollama’s Phi3 model to analyze and interact with that data.
In this article, I’ll walk you through:
✅ How I obtained GitHub repository data using Bright Data’s GitHub Data for AI Web Scraper.
✅ Building a chatbot with Ollama’s Phi3 model.
✅ Implementing a Streamlit-based GitHub Insights Tool for real-time interactions.
✅ Lessons learned and the impact of using AI for repository analysis.
How I Obtained GitHub Datasets Using Bright Data
To train the chatbot, I needed a high-quality dataset containing key repository details. Instead of scraping GitHub manually, I used Bright Data’s AI Scraper, which provided a structured and automated way to collect repository data.
Bright Data offers two ways to scrape data from any website with its Web Scrapers: the Scraper API and the No-Code Scraper, which anyone can use without writing code. I used the No-Code Scraper for this project; the API route is sketched below for reference.
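If you prefer the Scraper API route, the snippet below is only a rough sketch based on my reading of Bright Data’s dataset-trigger endpoint: the endpoint path, the query parameters, and the dataset_id value are assumptions you should verify against their current documentation, and the API key is a placeholder.

import requests

# Hypothetical sketch: trigger a "GitHub Repository - Collect by URL" run through the Scraper API.
# The endpoint, parameters, and dataset id below are assumptions; check Bright Data's docs for the real values.
API_KEY = "YOUR_BRIGHT_DATA_API_KEY"   # placeholder token
DATASET_ID = "gd_xxxxxxxxxxxxxx"       # placeholder id of the GitHub repository scraper

response = requests.post(
    "https://api.brightdata.com/datasets/v3/trigger",
    params={"dataset_id": DATASET_ID},
    headers={"Authorization": f"Bearer {API_KEY}"},
    json=[{"url": "https://github.com/streamlit/streamlit"}],  # repositories to collect
)
print(response.status_code, response.text)  # the response should include a snapshot id to poll and download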
Steps to extract GitHub data using Bright Data Web Scraper
1. Sign up on Bright Data and click Web Scrapers in the left pane.
If you are a new user signing up for the first time, you get a free $5 to try their services for 7 days.
2. Search for “GitHub” in the search bar and click on the first result.
3. A list of GitHub scrapers will appear. Select “GitHub Repository — Collect by URL” for this use case.
4. Select the No-Code Scraper.
5. Click “Add Input” to add your required GitHub repository links, then click “Start Collecting”.
6. Once the status field shows “Ready”, click “Download” and choose CSV as the format.
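Before building anything, it helps to open the downloaded file with pandas and confirm which columns it contains, since the app below relies on fields such as url, code_language, and num_stared. I saved my download as githubdata.csv; adjust the filename to whatever you chose.

import pandas as pd

# Sanity-check the Bright Data export before using it in the app
df = pd.read_csv("githubdata.csv")
print(df.columns.tolist())   # confirm the column names the app expects (url, code_language, num_stared, ...)
print(df.head())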
Building a GitHub Insights Tool
This project uses Python for data processing and Streamlit for a simple UI.
Prerequisites
- Any code editor of your choice.
- Python installed (version 3.8+ recommended).
Step 1: Setting Up the Project
1. Create the project folder:
mkdir github-insights-tool
cd github-insights-tool
2. Set up a virtual environment:
python -m venv venv
3. Activate the environment:
- Windows:
venv\Scripts\activate
- macOS/Linux:
source venv/bin/activate
4. Install dependencies:
pip install pandas streamlit langchain-ollama
- Streamlit — For building the UI
- Pandas — For handling dataset operations
- langchain-ollama — LangChain’s Ollama integration, used for AI-driven repository analysis
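Optionally, capture the dependencies in a requirements.txt so the environment is easy to recreate (package names only here; pin the versions pip actually installed for you):

# requirements.txt
pandas
streamlit
langchain-ollama

Then install everything in one go with pip install -r requirements.txt.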
Project structure:
github-insights-tool/
│── githubdata.csv # your dataset from Bright Data
│── app.py # the Streamlit application (the code below)
Step 2: Installing and Running the Chatbot (Ollama Phi3 Model) Locally
This AI-powered tool generates insights about GitHub repositories by analyzing their strengths, weaknesses, and usability. It also provides key repository details without requiring you to navigate multiple sections on GitHub.
Why Ollama?
- Free and easy to set up
- Runs locally without internet dependency
- Provides fast and customizable responses
Installing Ollama
Ollama provides a simple CLI tool to run large language models (LLMs) locally. Install it based on your operating system:
- Windows: download and run the installer from https://ollama.com/download/windows
- Linux (Curl):
curl -fsSL https://ollama.com/install.sh | sh
- macOS (Homebrew):
brew install ollama
Download the Phi3 Model:
ollama pull phi3
Run the Ollama Model:
ollama run phi3
💡 Note: Always make sure the Ollama server is running locally (keep the ollama run phi3 session open, or run ollama serve) before executing your code. Otherwise, the model won’t be reachable.
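To confirm the Python side can reach the model, a minimal smoke test with the same langchain-ollama package works well; the prompt text is arbitrary:

from langchain_ollama import OllamaLLM

# Fails fast if the Ollama server or the phi3 model is not available locally
llm = OllamaLLM(model="phi3")
print(llm.invoke("Reply with the single word: ready"))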
Step 3: Implementing the GitHub Insights Tool
The tool consists of the following functionalities:
Initializing Ollama
import streamlit as st
import pandas as pd
from langchain_ollama import OllamaLLM
# Initialize Ollama with the chosen model
llm = OllamaLLM(model="phi3")
Loading the GitHub Database
@st.cache_data
def load_github_data():
df = pd.read_csv("githubdata.csv")
df.columns = df.columns.str.strip().str.lower() # Normalize column names to lowercase
return df
Analyzing the Desired Repository Using AI
def analyze_repository(repo_data, llm):
prompt = f"""
Analyze the following GitHub repository data and provide insights:
{repo_data.to_dict()}
Focus on:
1. Code quality and maintainability
2. Popularity and engagement
3. Potential use cases
4. Key strengths and weaknesses
"""
try:
return llm.invoke(prompt)
except Exception as e:
return f"Error generating analysis: {e}"
This function generates insights based on code quality, engagement, and potential use cases.
Interacting with the AI-Generated Analysis
def interact_with_analysis(analysis, query, llm):
prompt = f"""
Based on the following analysis:
{analysis}
Answer the user's query: {query}
"""
try:
return llm.invoke(prompt)
except Exception as e:
return f"Error processing query: {e}"
This lets users ask follow-up questions about the AI-generated analysis of the repository.
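Before wiring these into Streamlit, you can exercise both helpers from a plain Python script (with the two functions above defined, or imported from app.py). The row selection below is just an example and assumes the url column from the Bright Data CSV:

# Standalone test of the two helpers; run with the Ollama server up
import pandas as pd
from langchain_ollama import OllamaLLM

llm = OllamaLLM(model="phi3")
df = pd.read_csv("githubdata.csv")
df.columns = df.columns.str.strip().str.lower()

repo = df.iloc[0]  # or filter: df[df["url"] == "<a URL from your CSV>"].iloc[0]
analysis = analyze_repository(repo, llm)
print(analysis)
print(interact_with_analysis(analysis, "What are the main weaknesses?", llm))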
Step 4: Defining the Streamlit Application
Core Features
- Lets users enter a GitHub repository URL (any of the URLs present in the CSV file), so answers are tailored to that specific repository.
- Starts an AI chatbot conversation based on the generated analysis.
def main():
# Add GitHub logo next to the title
st.markdown("""<h1 style='display: flex; align-items: center;'>
<img src='https://github.githubassets.com/images/modules/logos_page/GitHub-Mark.png' width='40' style='margin-right:10px;'>
GitHub Repository Insights Tool
</h1>""", unsafe_allow_html=True)
github_df = load_github_data()
# User input field for entering a GitHub repository URL
repo_url = st.text_input("Enter GitHub Repository URL")
analysis_result = ""
if repo_url:
# Filter the dataset based on the entered URL
repo_data = github_df[github_df["url"] == repo_url]
if not repo_data.empty:
repo_data = repo_data.iloc[0]
# Display repository details
st.subheader("Repository Details")
st.write(f"Language: {repo_data['code_language']}")
st.write(f"Stars: {repo_data['num_stared']}")
st.write(f"Forks: {repo_data['num_fork']}")
st.write(f"Pull Requests: {repo_data['num_pull_requests']}")
st.write(f"Last Feature: {repo_data['last_feature']}")
st.write(f"Latest Update: {repo_data['latest_update']}")
# Display repository owner details
st.subheader("Owner Details")
st.write(f"Owner: {repo_data['user_name']}")
st.write(f"URL: {repo_data['url']}")
# AI-powered analysis of the repository
st.subheader("AI Analysis")
if st.button("Generate Analysis"):
with st.spinner("Analyzing repository..."):
analysis_result = analyze_repository(repo_data, llm)
st.session_state["analysis"] = analysis_result # Store analysis in session state
st.write(analysis_result)
else:
st.warning("Repository not found in the dataset. Please enter a valid URL.")
# AI Chatbot interaction based on the generated analysis
if "analysis" in st.session_state:
st.subheader("Chat with AI about this Repository")
user_query = st.text_input("Ask a question about the repository analysis")
if user_query:
with st.spinner("Processing query..."):
response = interact_with_analysis(st.session_state["analysis"], user_query, llm)
st.write(response)
# Run the Streamlit application
if __name__ == "__main__":
main()
Running the application:
On your terminal, run this command:
streamlit run app.py
Step 5: Using the GitHub Insights Tool Application
1. Paste a repository URL from your CSV file and view the repository details.
2. Click “Generate Analysis” to produce an AI report on the repository.
3. Interact with the chatbot to gain further insights.
Conclusion
Building a chatbot around GitHub repository data with Bright Data’s Data for AI Web Scraper and Ollama’s Phi3 proved highly effective for automating repository insights. This approach saves time, improves accuracy, and provides AI-powered responses grounded in real repository data.
For developers looking for clean, structured GitHub datasets, Bright Data offers reliable, ready-made datasets and API integration to streamline data extraction and analysis.
🚀 Try it out and let me know your thoughts!