Python is famous for its wide ecosystem of libraries tailored for machine learning and data science. In the previous article, you saw the best practices for setting up a Python project and essential machine learning development tools.
In this article, let's explore the Python ecosystem for machine learning and data science in more detail. We will discuss some core libraries that make it easy to develop models and algorithms. You will learn about Python libraries for data manipulation (NumPy and Pandas), data visualization (Matplotlib and Seaborn), and machine learning (Scikit-Learn, TensorFlow, and PyTorch).
You can find the source code in this notebook.
Data in Machine Learning
The data you load to train your machine learning model is usually in a matrix form (multi-dimensional array). It contains rows and columns with their corresponding labels. The image below shows the data we will be using for practical examples in this article. We will generate and work on this dataset using libraries as we move along.
You can even represent images as matrices of pixel values (or stacks of matrices for multi-channel images). Working with multi-dimensional arrays using only plain Python can be very difficult, so many libraries have been designed to make it easier. The most famous of these libraries are NumPy and Pandas.
NumPy
NumPy, short for "Numerical Python", is a popular library for scientific computing in Python. It supports arrays, matrices, and provides various mathematical functions to operate on them. NumPy is the foundation of many other Python libraries used in data analysis and machine learning.
NumPy simplifies tasks like matrix multiplication, statistical analysis, and array manipulation. For example, you can easily use NumPy to perform element-wise operations on arrays and calculate statistics like mean and standard deviation.
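For instance, here is a minimal sketch of element-wise operations and basic statistics on NumPy arrays:

import numpy as np

a = np.array([1.0, 2.0, 3.0, 4.0])
b = np.array([10.0, 20.0, 30.0, 40.0])

print(a + b)     # element-wise addition: [11. 22. 33. 44.]
print(a * b)     # element-wise multiplication: [ 10.  40.  90. 160.]
print(a.mean())  # mean: 2.5
print(a.std())   # standard deviation: ~1.118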
NumPy provides many methods to create, manipulate, and perform array or matrix data computations. Let's generate a small dataset to use in this tutorial. In this dataset, we will generate random data on space objects, including their sizes, speeds, compositions, and whether they pose a risk of impact on Earth.
import numpy as np
After importing NumPy, you can use its random module to generate various types of random data.
# setting a random seed for reproducibility
np.random.seed(42)
num_samples = 1000
# generating random data
object_sizes = np.random.uniform(1, 1000, num_samples)  # assuming sizes in the range 1 to 1000 meters
object_speeds = np.random.uniform(1, 30, num_samples)  # assuming speeds in the range 1 to 30 km/s
object_composition = np.random.choice([0, 1], num_samples)  # randomly assigning metallic (1) or rocky (0)
# generating the binary target based on a condition
target_variable = np.logical_and(object_sizes > 500, object_speeds > 15).astype(int)
We used the following methods in this code snippet:
- random.uniform: Generates random numbers uniformly distributed within the given range.
- random.choice: Randomly selects values from the given list.
- logical_and: Returns boolean values based on the provided conditions and data. In NumPy, you can perform "vectorized operations" on arrays, applying operations to entire arrays rather than iterating over elements individually. Here, this function generates an array with the same shape as the object_sizes array (see the short example after this list).
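As a quick sketch of vectorization using the arrays we just generated:

# vectorized operations: each expression applies to every element at once, no explicit loop
object_sizes_km = object_sizes / 1000  # convert all sizes from meters to kilometers
fast_objects = object_speeds > 15      # boolean array with one entry per object
print(fast_objects.sum())              # count of objects faster than 15 km/s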
# Create a NumPy structured array to store the data
space_object_data = np.array(
list(zip(object_sizes, object_speeds, object_composition, target_variable)),
dtype=[('Object_Size_Meters', 'f4'), ('Object_Speed_km_s', 'f4'), ('Object_Composition', 'i1'), ('Risk_of_Impact', 'i1')]
)
The numpy.array() method converts a list or tuple to a NumPy array. Now you will have the dataset stored in the space_object_data variable.
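You can also access individual fields of the structured array by name, as this quick sketch shows:

# access one field of the structured array by its name
print(space_object_data['Object_Size_Meters'][:5])  # first five object sizes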
If you print the dataset, it will look something like this:
The generated values are random, so if you use a different seed (or none at all), don't expect to see the same values as in the picture above. For more information about NumPy, refer to this article.
Pandas
Pandas is a powerful Python library built on top of NumPy for working with tabular data. It introduces two primary data structures: "Series" (one-dimensional arrays) and "DataFrames" (two-dimensional arrays, or columns of Series objects). We make predictions based on multiple features in machine learning, so we primarily work with DataFrames.
import pandas as pd
space_objects_df = pd.DataFrame(space_object_data)
This creates a DataFrame from the NumPy array. You can also use the read_csv(), read_excel(), and read_sql() functions to load data into a DataFrame from various file formats.
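As a quick sketch, here is how loading from a CSV would look (the file name below is hypothetical), along with selecting a single column, which returns a Series:

# hypothetical file name, for illustration only
df_from_csv = pd.read_csv('space_objects.csv')

# selecting one column of a DataFrame returns a one-dimensional Series
sizes = space_objects_df['Object_Size_Meters']
print(type(sizes))  # <class 'pandas.core.series.Series'>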
Pandas provides multiple functions for exploring the data. You can get information about your data using functions like info() and describe().
# Display the first few rows
print(space_objects_df.head())
# Get the number of rows and columns
print(f"Shape of the DataFrame: {space_objects_df.shape}")
# Column types and non-null counts (info() prints its output directly)
space_objects_df.info()
# Summary statistics for numeric columns
print(space_objects_df.describe())
Pandas also provides a range of functions to handle missing data, remove duplicates, and fill invalid data. You will commonly use methods like isnull(), fillna(), dropna(), and drop_duplicates() to clean the data.
# Check for missing values
print(space_objects_df.isnull().sum())
# Fill missing values with a specific value (e.g., 0)
space_objects_df.fillna(0, inplace=True)
# Drop rows with missing values
space_objects_df.dropna(inplace=True)
# Remove duplicate rows
df_no_duplicates = space_objects_df.drop_duplicates()
You can also select and filter specific records in a DataFrame by column name. This is a common task in data analysis, allowing you to focus on the particular data relevant to your analysis.
# Select objects with a risk of impact and size greater than 500 meters
selected_objects = space_objects_df[(space_objects_df['Risk_of_Impact'] == 1) & (space_objects_df['Object_Size_Meters'] > 500)]
For more information about Pandas, refer to the documentation.
Matplotlib and Seaborn
Matplotlib and Seaborn are two Python libraries for data visualization. Matplotlib is a powerful library that gives you more control over your plots, while Seaborn is a high-level library that makes it easy to create beautiful and informative visualizations. You can use these libraries to create a wide range of plots, from basic histograms and scatter plots to advanced time series representations and heat maps.
In machine learning, you will often use Matplotlib and Seaborn together for better data visualizations.
import matplotlib.pyplot as plt
import seaborn as sns
# Create a figure with two subplots side by side
plt.figure(figsize=(12, 5))
# First subplot using Matplotlib
plt.subplot(1, 2, 1)
plt.scatter(space_objects_df['Object_Size_Meters'], space_objects_df['Object_Speed_km_s'], c=space_objects_df['Risk_of_Impact'], cmap='viridis')
plt.xlabel('Object Size (meters)')
plt.ylabel('Object Speed (km/s)')
plt.title('Space Object Size vs. Speed (Matplotlib)')
plt.colorbar(label='Risk of Impact')
# Second subplot using Seaborn
plt.subplot(1, 2, 2)
sns.scatterplot(x='Object_Size_Meters', y='Object_Speed_km_s', data=space_objects_df, hue='Risk_of_Impact', palette='viridis')
plt.xlabel('Object Size (meters)')
plt.ylabel('Object Speed (km/s)')
plt.title('Space Object Size vs. Speed (Seaborn)')
plt.legend(title='Risk of Impact')
# Adjust layout to prevent subplot overlap
plt.tight_layout()
# Display the plots
plt.show()
This example places a Matplotlib scatter plot and a Seaborn scatter plot side by side to visualize the relationship between object size and speed for space objects. I recommend using Seaborn for simple, aesthetic visualizations and Matplotlib for flexible, advanced ones. Read more on Matplotlib and Seaborn.
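As one more sketch of what Seaborn makes easy, here is a correlation heat map of our numeric columns:

# correlation heat map of the numeric columns
plt.figure(figsize=(6, 4))
sns.heatmap(space_objects_df.corr(), annot=True, cmap='viridis')
plt.title('Feature Correlations')
plt.show()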
Scikit-Learn
Scikit-Learn (commonly abbreviated as sklearn) is a library that provides a wide range of tools, algorithms, and utilities for tasks such as classification, regression, clustering, and more. It also includes modules for data preprocessing, model selection, and model evaluation.
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report
The train_test_split() function splits the dataset into separate subsets for training and testing your model, and StandardScaler() standardizes features by removing the mean and scaling them to unit variance. Scaling is optional but can improve some models. You can train your model using the RandomForestClassifier algorithm from scikit-learn's ensemble module. Once the model is trained, you can evaluate its performance on the test set using the accuracy_score and classification_report functions.
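Here is a minimal sketch of how these pieces fit together on our space-object DataFrame (the hyperparameters are illustrative defaults):

# split features and target from the DataFrame created earlier
X = space_objects_df[['Object_Size_Meters', 'Object_Speed_km_s', 'Object_Composition']]
y = space_objects_df['Risk_of_Impact']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# standardize features (fit on training data only to avoid leakage)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# train a random forest and evaluate it on the held-out test set
model = RandomForestClassifier(random_state=42)
model.fit(X_train_scaled, y_train)
y_pred = model.predict(X_test_scaled)
print(f"Accuracy: {accuracy_score(y_test, y_pred):.3f}")
print(classification_report(y_test, y_pred))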
You will see these methods in action again throughout the upcoming machine learning journey.
Scikit-learn is not limited to these methods but also includes various other tools and algorithms. Read more about sklearn here.
TensorFlow and Keras
TensorFlow is a popular framework for building and training deep learning models. Google backs it, and you can use it from popular languages like Python and JavaScript. It also provides a collection of APIs for data loading and processing, along with tools to help you build your deep learning models efficiently.
TensorFlow runs on tensors, which are multi-dimensional arrays that can hold various data types, such as scalars, vectors, matrices, or even higher-dimensional data. Tensors are similar to NumPy arrays but with GPU acceleration. Tensors flow through the computational graph, carrying data from one operation to another. The nodes in this graph represent the operations, and the edges represent the data flowing between them. For more information on computation graphs, refer to this article.
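For instance, here is a minimal sketch of creating tensors and running an operation on them (assuming TensorFlow is installed):

import tensorflow as tf

scalar = tf.constant(3.0)                       # rank-0 tensor
vector = tf.constant([1.0, 2.0, 3.0])           # rank-1 tensor
matrix = tf.constant([[1.0, 2.0], [3.0, 4.0]])  # rank-2 tensor

# operations on tensors form the computational graph
product = tf.matmul(matrix, matrix)
print(product)
print(product.numpy())  # tensors convert easily to NumPy arrays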
Google also has custom hardware accelerators called Tensor Processing Units (TPUs) specifically designed to help you train complex models faster.
Keras is a high-level neural networks API that runs on top of TensorFlow, making it easier to build and experiment with deep learning models. The most commonly used API is Sequential, which lets you create neural networks by stacking layers one after another. Keras is included within TensorFlow and does not need a separate installation. Learn more about TensorFlow here.
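As a brief illustration of the Sequential API, a small network for our three-feature dataset might be sketched like this (the layer sizes are arbitrary choices):

from tensorflow import keras
from tensorflow.keras import layers

model = keras.Sequential([
    keras.Input(shape=(3,)),               # three input features
    layers.Dense(16, activation='relu'),
    layers.Dense(8, activation='relu'),
    layers.Dense(1, activation='sigmoid')  # binary output: risk of impact
])
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
model.summary()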
PyTorch
PyTorch is another popular deep learning library, developed by Meta's AI research lab. Like TensorFlow, it is based on tensors and computation graphs for executing your models. PyTorch helped popularize "dynamic computation graphs", which are built at runtime: it constructs a directed acyclic graph (DAG) of functions that keeps track of the operations executed on tensors, allowing the graph to change from one run to the next. Here's the documentation of PyTorch.
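Here is a small sketch of the dynamic graph in action (assuming PyTorch is installed):

import torch

# tensors with requires_grad=True are tracked in the dynamic graph
x = torch.tensor([2.0, 3.0], requires_grad=True)
y = (x ** 2).sum()

# backward() walks the recorded DAG to compute gradients
y.backward()
print(x.grad)  # tensor([4., 6.]), since d(x^2)/dx = 2x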
Both TensorFlow and PyTorch make it easier to create neural networks. They also support popular datasets, so you don't have to download and process them separately.
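For example, Keras bundles several classic datasets, such as MNIST, that you can load with a single call:

from tensorflow.keras.datasets import mnist

# downloads on first use, then loads from a local cache
(X_train, y_train), (X_test, y_test) = mnist.load_data()
print(X_train.shape)  # (60000, 28, 28)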
Conclusion
Overall, you have seen the uses of NumPy, Pandas, Matplotlib, Seaborn, Scikit-learn, TensorFlow, Keras, and PyTorch in data science. NumPy and Pandas help handle data, while Matplotlib and Seaborn are great for visualizing it. Scikit-learn is a handy library for machine learning model development, and TensorFlow and PyTorch are the go-to choices for deep learning. As you progress on your machine learning journey, these libraries will become your best friends.