How to Build a Machine Learning Model from Scratch with Python

ML Model from Scratch (image source: DALL-E)

Building a machine learning model from scratch may seem like a daunting task, but with Python and its powerful libraries, the process becomes manageable, even for beginners. In this guide, we'll walk through the step-by-step process of creating a simple machine learning model using Python's Scikit-learn library.

We'll cover the following steps:

  1. Understanding the Problem
  2. Choosing a Dataset
  3. Data Preprocessing
  4. Splitting the Data
  5. Choosing a Model
  6. Training the Model
  7. Evaluating the Model
  8. Making Predictions

Let's get started!

1. Understanding the Problem

Before building a model, it's important to understand the problem you're trying to solve. For this example, we'll solve a classification problem: predicting whether a person has diabetes based on several health-related variables like age, glucose level, and BMI.

We'll use the Pima Indians Diabetes Dataset, which is a popular dataset in the machine learning community. It contains information about 768 women, including whether or not they have diabetes.

2. Choosing a Dataset

You can download the dataset from Kaggle using the link below. After downloading the CSV file, store it in the same directory as your Python script.

https://www.kaggle.com/datasets/mathchi/diabetes-data-set

import pandas as pd

# Load the dataset
data = pd.read_csv('diabetes.csv')

# View the first five rows (in a script, use print(data.head()) to display them)
data.head()
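
Before preprocessing, it can help to check the shape of the data, the class balance, and any suspicious values. In this dataset, columns such as Glucose, BloodPressure, and BMI are commonly reported to contain zeros that stand in for missing measurements. The sketch below uses a handful of made-up rows (not taken from the real CSV) with the dataset's column names so that it runs without the file; the variable names `sample` and `suspect_cols` are illustrative. With the real file, the same checks would be run on `data`.

```python
import pandas as pd

# Toy rows for illustration only (not taken from the real CSV)
sample = pd.DataFrame({
    "Glucose": [148, 85, 0, 89],
    "BloodPressure": [72, 66, 64, 0],
    "BMI": [33.6, 26.6, 23.3, 0.0],
    "Outcome": [1, 0, 1, 0],
})

# Basic shape and class balance
print(sample.shape)
print(sample["Outcome"].value_counts())

# Count zeros in columns where a zero is not physiologically plausible
suspect_cols = ["Glucose", "BloodPressure", "BMI"]
zero_counts = (sample[suspect_cols] == 0).sum()
print(zero_counts)
```

If the real data shows many such zeros, one option is to treat them as missing and impute them, though this guide proceeds without that step.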

3. Data Preprocessing

Once the dataset is loaded, we need to clean and preprocess it. Preprocessing may include handling missing data, scaling features, and encoding categorical variables. In our case, there are no categorical variables, but we will standardize the features so that they are on a comparable scale, which generally helps logistic regression.

# Separate the features (X) and the target variable (y)
X = data.drop('Outcome', axis=1)  # Features (independent variables)
y = data['Outcome']  # Target (dependent variable)

# Standardize the feature data (each column gets zero mean and unit variance)
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
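
To make the effect of scaling concrete, here is a minimal sketch on a toy array (the array `X_toy` and its values are made up for illustration). It compares `StandardScaler`, which centers each column at zero with unit variance, to `MinMaxScaler`, which rescales each column to the range 0 to 1.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Toy feature matrix: two columns on very different scales
X_toy = np.array([[1.0, 100.0], [2.0, 200.0], [3.0, 300.0]])

standardized = StandardScaler().fit_transform(X_toy)
min_maxed = MinMaxScaler().fit_transform(X_toy)

# Per-column mean and standard deviation after standardization
print(standardized.mean(axis=0))
print(standardized.std(axis=0))

# Per-column minimum and maximum after min-max scaling
print(min_maxed.min(axis=0), min_maxed.max(axis=0))
```

Either scaler can work here; which one suits a given problem depends on the data and the model.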

4. Splitting the Data

We need to split the data into two sets: one for training and one for testing. The model will learn from the training set and then be tested on the test set to evaluate its performance.

from sklearn.model_selection import train_test_split

# Split the data into training and test sets (80% train, 20% test)
X_train, X_test, y_train, y_test = train_test_split(X_scaled, y, test_size=0.2, random_state=42)
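
Two refinements are worth knowing about. First, passing `stratify` keeps the class ratio the same in both splits, which matters when one class is rarer. Second, the code in this guide fits the scaler on the full dataset before splitting, so test-set statistics influence the preprocessing; a stricter approach fits the scaler on the training split only. The sketch below shows both ideas on synthetic stand-in data (the names `X_demo`, `y_demo`, and `demo_scaler` are illustrative and not part of the tutorial code).

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data: 100 rows, 3 features, 30% positive class
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(100, 3))
y_demo = np.array([1] * 30 + [0] * 70)

# stratify=y_demo keeps the class ratio the same in both splits
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

# Fit the scaler on the training split only, then reuse it on the test split
demo_scaler = StandardScaler().fit(X_tr)
X_tr_scaled = demo_scaler.transform(X_tr)
X_te_scaled = demo_scaler.transform(X_te)

print(X_tr.shape, X_te.shape)
print(y_tr.mean(), y_te.mean())
```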

5. Choosing a Model

For this example, we'll use a Logistic Regression model, which is a common algorithm for binary classification problems like this.

from sklearn.linear_model import LogisticRegression

# Initialize the model
model = LogisticRegression()

6. Training the Model

Once the model is chosen, we can train it using the training data. This step is where the model learns patterns from the data.

# Train the model
model.fit(X_train, y_train)
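
To see what "learning patterns" means for logistic regression, the sketch below fits a model on synthetic data and then reproduces its predicted probabilities by hand: a weighted sum of the features plus an intercept, passed through the sigmoid function. The data and the labeling rule are made up for illustration, and `clf` is a separate model from the one trained above.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 2))
# Hypothetical rule: the label is 1 when the first feature exceeds the second
y_demo = (X_demo[:, 0] - X_demo[:, 1] > 0).astype(int)

clf = LogisticRegression().fit(X_demo, y_demo)

# Linear score w.x + b, then the sigmoid maps it to a probability
scores = X_demo @ clf.coef_.ravel() + clf.intercept_[0]
manual_proba = 1.0 / (1.0 + np.exp(-scores))

print(clf.coef_, clf.intercept_)
print(np.allclose(manual_proba, clf.predict_proba(X_demo)[:, 1]))
```

The sign of each coefficient indicates whether a larger feature value pushes the prediction toward class 1 or class 0.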

7. Evaluating the Model

Now that the model is trained, we need to evaluate how well it performs on unseen data, i.e., the test set. We'll use accuracy, the fraction of test samples the model classifies correctly, to measure performance.

from sklearn.metrics import accuracy_score

# Make predictions on the test set
y_pred = model.predict(X_test)

# Calculate the accuracy
accuracy = accuracy_score(y_test, y_pred)
print(f'Model Accuracy: {accuracy * 100:.2f}%')

# Output
# Model Accuracy: 75.32%
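
Accuracy alone can hide how a model treats each class, which matters in a medical setting where missing a positive case is costly. The sketch below computes a confusion matrix, precision, and recall on a small set of hypothetical labels and predictions (the lists `y_true_demo` and `y_pred_demo` are made up and are not results from the diabetes model).

```python
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Hypothetical true labels and predictions for ten patients
y_true_demo = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
y_pred_demo = [0, 0, 0, 0, 0, 1, 1, 1, 0, 0]

# For binary labels, ravel() yields: true negatives, false positives,
# false negatives, true positives
tn, fp, fn, tp = confusion_matrix(y_true_demo, y_pred_demo).ravel()
print(tn, fp, fn, tp)  # 5 1 2 2

# Precision: of the predicted positives, how many were correct
precision = precision_score(y_true_demo, y_pred_demo)
# Recall: of the actual positives, how many were found
recall = recall_score(y_true_demo, y_pred_demo)
print(precision, recall)
```

In this toy case the accuracy is 70%, yet the recall is only 50%: half of the positive cases are missed.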

8. Making Predictions

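Once the model is trained and evaluated, you can use it to make predictions on new data. Because the model was trained on standardized features, new data must be transformed with the same fitted scaler before prediction. Here's how you would use the model to predict whether a new patient has diabetes:

# Example new data (raw, unscaled values, in the same column order as X)
new_patient = pd.DataFrame([[1.5, 85, 66, 29, 0, 26.6, 0.351, 31]], columns=X.columns)

# Apply the scaler that was fitted during preprocessing
new_patient_scaled = scaler.transform(new_patient)

# Predict the outcome (0 = no diabetes, 1 = diabetes)
prediction = model.predict(new_patient_scaled)
print(f'Diabetes Prediction: {prediction[0]}')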
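
One way to avoid forgetting the scaling step is to bundle the scaler and the model into a single scikit-learn pipeline, so that raw values can be passed straight in. The sketch below shows the idea on synthetic data with a single made-up feature; the names `feature`, `labels`, and `pipe` are illustrative. It also uses `predict_proba`, which returns a probability rather than only a 0/1 label.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Synthetic stand-in: one feature where higher values make class 1 more likely
feature = rng.uniform(60, 200, size=(300, 1))
labels = (feature[:, 0] + rng.normal(0, 15, size=300) > 130).astype(int)

# The pipeline scales the input and then applies the classifier
pipe = make_pipeline(StandardScaler(), LogisticRegression())
pipe.fit(feature, labels)

# Raw, unscaled inputs: the pipeline applies the fitted scaler internally
raw_new = np.array([[90.0], [180.0]])
proba = pipe.predict_proba(raw_new)[:, 1]
print(proba)
print(pipe.predict(raw_new))
```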

Conclusion:

You've now built a simple machine learning model from scratch using Python! Here's a quick recap of the steps we followed:

  1. Loaded and preprocessed the data.
  2. Split the data into training and test sets.
  3. Chose a logistic regression model.
  4. Trained the model on the training data.
  5. Evaluated its performance using the test data.
  6. Made predictions on new data.

By following these steps, you can solve many classification problems. As you progress, you'll explore more advanced techniques like cross-validation, hyperparameter tuning, and different algorithms. Python libraries like Scikit-learn make this entire process intuitive and accessible for all levels of experience.
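
As a starting point for those next steps, here is a minimal sketch of cross-validation and a small hyperparameter search. It uses synthetic data from `make_classification` rather than the diabetes dataset, and the grid of `C` values is an arbitrary choice for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic binary classification data (stand-in for a real dataset)
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=42)

pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

# 5-fold cross-validation: five accuracy scores, one per held-out fold
scores = cross_val_score(pipe, X_demo, y_demo, cv=5, scoring="accuracy")
print(scores.mean(), scores.std())

# Grid search over the regularization strength C
# ("logisticregression" is the step name that make_pipeline generates)
param_grid = {"logisticregression__C": [0.01, 0.1, 1, 10]}
grid = GridSearchCV(pipe, param_grid, cv=5)
grid.fit(X_demo, y_demo)
print(grid.best_params_, grid.best_score_)
```

Averaging over several folds gives a steadier performance estimate than a single train/test split.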
