How I Trained a Chatbot on GitHub Repositories Using an AI Scraper and LLM

Building an AI-Powered Chatbot to Analyze GitHub Repositories Using Scraped Data and LLMs

By Victor Yakubu

Frequently Asked Questions

Common questions about this topic

What does the GitHub Insights Tool do?
The GitHub Insights Tool analyzes GitHub repositories by loading structured repository data, generating AI-powered analysis of code quality, popularity, potential use cases, strengths and weaknesses, and enabling interactive Q&A about that analysis via a chatbot interface.
How is GitHub repository data obtained for the tool?
Repository data is collected using Bright Data’s Data for AI Web Scraper (the GitHub Repository — Collect by URL scraper), using the No-Code Scraper to add repository URLs, start collection, and download results as CSV.
Which Bright Data scraping interfaces are mentioned for collecting data?
Bright Data’s Scraper API and No-Code Scraper are mentioned as the two interfaces its web scrapers provide for collecting data from websites.
What file format is used to import repository data into the project?
The repository data is downloaded from Bright Data as a CSV file and loaded into the project (expected as githubdata.csv or github.csv in the project folder).
What are the project prerequisites and dependencies?
Prerequisites are a code editor and Python (version 3.8+ recommended). Dependencies installed via pip are pandas, streamlit, and langchain_community.
What is the required project structure to run the tool?
The expected project structure is a folder named github-insights-tool containing the dataset file (github.csv) and an ai.py (or app.py) script that implements the Streamlit application and AI integration.
Why is Ollama used for the AI model in this project?
Ollama is used because it is free, easy to set up, runs locally without internet dependency, and provides fast, customizable responses suitable for local LLM inference.
How is the Ollama Phi3 model installed and run locally?
Ollama is installed using the platform-specific commands provided (PowerShell curl for Windows, install script for Linux, or Homebrew for macOS), then the Phi3 model is pulled with 'ollama pull phi3' and started with 'ollama run phi3'.
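The setup above boils down to a few terminal commands. The Linux and macOS lines below follow Ollama’s standard install paths (on Windows, the article uses a PowerShell-based download instead); check the Ollama documentation for your platform:

```shell
# Linux: official install script
curl -fsSL https://ollama.com/install.sh | sh

# macOS: Homebrew
brew install ollama

# Pull the Phi3 model, then start it locally
ollama pull phi3
ollama run phi3
```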
How does the Streamlit app connect to the local Ollama model?
The Streamlit app initializes an OllamaLLM instance with model='phi3' (from langchain_ollama), and the app invokes the model via that client when generating analysis and answering follow-up queries.
How does a user generate an AI analysis for a specific repository in the app?
A user pastes a repository URL (one present in the CSV dataset) into the app’s input field. The app filters the dataset for that URL and displays the repository and owner details; when the user clicks 'Generate Analysis', the app calls analyze_repository, which invokes the LLM to produce the AI analysis.
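The lookup step can be sketched as a small pandas filter. This is illustrative, not the article’s exact code: it assumes column names have already been lowercased and that the scraped CSV has a `url` column.

```python
import pandas as pd

def find_repository(df: pd.DataFrame, repo_url: str) -> pd.DataFrame:
    """Return the dataset rows matching the pasted repository URL.

    Assumes normalized (lowercase) column names and a 'url' column,
    both assumptions about the scraped CSV schema.
    """
    return df[df["url"].str.strip() == repo_url.strip()]

# Toy dataset standing in for the Bright Data CSV:
df = pd.DataFrame({
    "url": ["https://github.com/pandas-dev/pandas"],
    "repo_name": ["pandas"],
})
match = find_repository(df, "https://github.com/pandas-dev/pandas")
missing = find_repository(df, "https://github.com/unknown/repo")
# An empty result is what triggers the app's "not found" warning path.
```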
What interactive capability does the app provide after generating analysis?
After generating analysis, the app stores the analysis in session state and provides a chatbot interface where users can type queries about the analysis; those queries are answered by calling the LLM with the analysis plus the user query.
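The follow-up flow can be sketched with a plain dict standing in for Streamlit’s `st.session_state`; the function name and prompt wording here are illustrative, not the article’s exact code.

```python
def build_followup_prompt(analysis: str, user_query: str) -> str:
    # Prepend the stored analysis so the LLM answers in context.
    return (
        "You previously produced this repository analysis:\n"
        f"{analysis}\n\n"
        f"Answer the user's question about it:\n{user_query}"
    )

# A dict stands in for st.session_state in this sketch.
session_state = {}
session_state["analysis"] = "Popular data-analysis library; strong test suite."
prompt = build_followup_prompt(session_state["analysis"], "What are its weaknesses?")
```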
How is the repository dataset loaded and normalized in the application code?
The dataset is loaded with pandas.read_csv inside a cached function, and column names are normalized by stripping whitespace and converting to lowercase via df.columns = df.columns.str.strip().str.lower().
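A minimal sketch of that loader, without the Streamlit caching decorator (in the app it would be wrapped in something like `@st.cache_data` so the CSV is read only once):

```python
import io
import pandas as pd

def load_data(csv_source) -> pd.DataFrame:
    """Load the scraped repository CSV and normalize column names."""
    df = pd.read_csv(csv_source)
    # Strip stray whitespace and lowercase headers so lookups are uniform.
    df.columns = df.columns.str.strip().str.lower()
    return df

# Toy CSV with messy headers standing in for github.csv:
raw = io.StringIO(" URL ,Repo_Name\nhttps://github.com/x/y,y\n")
df = load_data(raw)
# df.columns is now ['url', 'repo_name']
```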
What command starts the Streamlit application?
The Streamlit application is started from the terminal with the command 'streamlit run app.py'.
What happens if the entered repository URL is not found in the dataset?
If the entered repository URL is not found in the dataset, the application displays a warning stating 'Repository not found in the dataset. Please enter a valid URL.'
What error handling is implemented for LLM invocations in the code snippets?
LLM invocations in analyze_repository and interact_with_analysis are wrapped in try/except blocks that return an error message string formatted as 'Error generating analysis: {e}' or 'Error processing query: {e}' if an exception occurs.
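That error-handling pattern can be sketched as follows; the prompt text and the stub client are illustrative (in the real app, the Ollama-backed LLM client is passed in), but the try/except shape and the error-string format match the description above.

```python
def analyze_repository(llm, repo_details: str) -> str:
    """Invoke the LLM, returning an error string instead of raising."""
    try:
        return llm.invoke(f"Analyze this GitHub repository:\n{repo_details}")
    except Exception as e:
        return f"Error generating analysis: {e}"

# A stub client that always fails, standing in for the Ollama-backed LLM:
class FailingLLM:
    def invoke(self, prompt: str) -> str:
        raise RuntimeError("model unavailable")

result = analyze_repository(FailingLLM(), "pandas: data analysis library")
# result == "Error generating analysis: model unavailable"
```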
