EDA for the Iris Flower dataset

EDA on the Iris dataset with colorful visualizations using Python’s seaborn and matplotlib libraries.

Data source: The Iris Flower Dataset — https://archive.ics.uci.edu/ml/datasets/iris

Tools: Python (pandas, Matplotlib, seaborn, scikit-learn)

Exploratory Data Analysis (EDA) is an essential step in understanding and visualizing your dataset. The Iris dataset is a popular dataset in data science, and it can be visualized using various libraries in Python.

Here, I’ll show you how to perform EDA on the Iris dataset with colorful visualizations using Python’s seaborn and matplotlib libraries.

Now, let’s perform EDA on the Iris dataset using Jupyter Notebook:

Step 1

Load the necessary libraries

# Import the necessary libraries
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# The Iris dataset ships with scikit-learn
from sklearn.datasets import load_iris

Step 2

Load the Iris dataset.

Get the summary statistics of the Iris dataset for Exploratory Data Analysis.

# Load the Iris dataset into a DataFrame
iris = load_iris()
data = pd.DataFrame(data=iris.data, columns=iris.feature_names)
# Numeric class code (0, 1, 2) and the matching species name
data['target'] = iris.target
data['species'] = iris.target_names[iris.target]

# Summary statistics
print(data.describe())

Output:

       sepal length (cm)  sepal width (cm)  petal length (cm)  petal width (cm)      target
count         150.000000        150.000000         150.000000        150.000000  150.000000
mean            5.843333          3.057333           3.758000          1.199333    1.000000
std             0.828066          0.435866           1.765298          0.762238    0.819232
min             4.300000          2.000000           1.000000          0.100000    0.000000
25%             5.100000          2.800000           1.600000          0.300000    0.000000
50%             5.800000          3.000000           4.350000          1.300000    1.000000
75%             6.400000          3.300000           5.100000          1.800000    2.000000
max             7.900000          4.400000           6.900000          2.500000    2.000000

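Inference Summary Statistics (describe()):

  • This table provides basic statistical summaries (count, mean, standard deviation, min, quartiles, max) for each numeric column across all 150 samples. It is not grouped by species, and the statistics for the target column are not meaningful because target is only a class code (0, 1, 2).
  • Petal length and petal width vary far more relative to their means (std 1.77 on a mean of 3.76, and 0.76 on a mean of 1.20) than sepal length and sepal width do (0.83 on 5.84, and 0.44 on 3.06). This wide spread hints that the petal measurements differ strongly between species and may have discriminative power for species classification.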
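Because describe() pools all samples, differences between species only become visible once the data is grouped. The following is a minimal sketch of that per-species breakdown; the names df and species_means are illustrative and not part of the original walkthrough.

```python
import pandas as pd
from sklearn.datasets import load_iris

# Rebuild the DataFrame so this snippet runs on its own
iris = load_iris()
df = pd.DataFrame(iris.data, columns=iris.feature_names)
df["species"] = iris.target_names[iris.target]

# Mean of each of the four measurements within each species
species_means = df.groupby("species").mean()
print(species_means.round(2))
```

For the full set of statistics per species, df.groupby("species").describe() returns the same columns as describe(), repeated for each group.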

Step 3

Now it is time to create colorful visuals. Each plot below is shown with its code, the resulting figure, and what we can infer from it.

1. Pairplot

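# Pairplot for pairwise relationships
sns.set(style="ticks")
# Drop the numeric 'target' code so only the four measurements are plotted
sns.pairplot(data.drop(columns="target"), hue="species", markers=["o", "s", "D"])
plt.show()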

Pairplot

Inference Pairplot:

  • The pairplot shows pairwise relationships between numeric features, with different colors for each species.
  • It helps us visualize how features relate to each other and how they separate the species.
  • For example, in scatterplots involving petal length and petal width, Iris setosa is clearly separated from the other two species, which makes the petal measurements good features for classification. Versicolor and virginica overlap somewhat more.
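The visual separation of setosa can also be checked numerically. The sketch below compares the petal length range of setosa with that of the other two species; the variable names are illustrative, and it looks at a single feature only, so it is a quick sanity check rather than a classifier.

```python
import pandas as pd
from sklearn.datasets import load_iris

# Rebuild the DataFrame so this snippet runs on its own
iris = load_iris()
df = pd.DataFrame(iris.data, columns=iris.feature_names)
df["species"] = iris.target_names[iris.target]

# Smallest and largest petal length observed for each species
ranges = df.groupby("species")["petal length (cm)"].agg(["min", "max"])
print(ranges)

# If the largest setosa value is below the smallest value of the other
# species, a single threshold on petal length isolates setosa completely
setosa_max = ranges.loc["setosa", "max"]
others_min = ranges.drop(index="setosa")["min"].min()
print("setosa separable on petal length:", setosa_max < others_min)
```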

2. Boxplots

# Boxplots to visualize distributions
# Figure 1: sepal measurements
plt.figure(figsize=(12, 6))
plt.subplot(1, 2, 1)
sns.boxplot(x="species", y="sepal length (cm)", data=data)
plt.subplot(1, 2, 2)
sns.boxplot(x="species", y="sepal width (cm)", data=data)
plt.tight_layout()

# Figure 2: petal measurements
plt.figure(figsize=(12, 6))
plt.subplot(1, 2, 1)
sns.boxplot(x="species", y="petal length (cm)", data=data)
plt.subplot(1, 2, 2)
sns.boxplot(x="species", y="petal width (cm)", data=data)
plt.tight_layout()
plt.show()

Boxplots

Inference Boxplots:

  • Boxplots provide a summary of the distribution of each numeric feature for each species.
  • They show the median, quartiles, and potential outliers.
  • From the boxplots, we can see that some features (e.g., petal length and petal width) have clearly different medians and ranges across the three species, with little overlap between the boxes, making them valuable for classification.
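The points a boxplot draws as outliers can be counted directly. This sketch assumes the common 1.5 × IQR whisker rule (seaborn's default) and uses a hypothetical helper, count_outliers, to count the values falling outside the whiskers for each species and feature.

```python
import pandas as pd
from sklearn.datasets import load_iris

# Rebuild the DataFrame so this snippet runs on its own
iris = load_iris()
feature_cols = list(iris.feature_names)
df = pd.DataFrame(iris.data, columns=feature_cols)
df["species"] = iris.target_names[iris.target]


def count_outliers(values):
    """Count values outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR] (assumed whisker rule)."""
    q1 = values.quantile(0.25)
    q3 = values.quantile(0.75)
    iqr = q3 - q1
    outside = (values < q1 - 1.5 * iqr) | (values > q3 + 1.5 * iqr)
    return int(outside.sum())


# One row per species, one column per feature
outliers = df.groupby("species")[feature_cols].agg(count_outliers)
print(outliers)
```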

3. Violin Plots

# Violin plots to visualize distributions
# Figure 1: sepal measurements
plt.figure(figsize=(12, 6))
plt.subplot(1, 2, 1)
sns.violinplot(x="species", y="sepal length (cm)", data=data)
plt.subplot(1, 2, 2)
sns.violinplot(x="species", y="sepal width (cm)", data=data)
plt.tight_layout()

# Figure 2: petal measurements
plt.figure(figsize=(12, 6))
plt.subplot(1, 2, 1)
sns.violinplot(x="species", y="petal length (cm)", data=data)
plt.subplot(1, 2, 2)
sns.violinplot(x="species", y="petal width (cm)", data=data)
plt.tight_layout()
plt.show()

Violin Plots

Inference Violin Plots:

  • Violin plots combine the benefits of boxplots and kernel density plots to visualize the distribution of each numeric feature for each species.
  • They show the data distribution more comprehensively than boxplots.
  • Violin plots confirm the findings from the boxplots and provide a more detailed view of the data distributions.

4. Countplot

# Countplot for species distribution
plt.figure(figsize=(8, 4))
sns.countplot(x="species", data=data, palette="viridis")
plt.show()

Count plot

Inference Countplot:

  • The countplot displays the distribution of species in the dataset.
  • It shows that the dataset is balanced, with an equal number of samples for each species.
  • This balance is helpful for machine learning tasks: with equal class sizes, a model is less likely to be biased toward one class, and plain accuracy is a fair metric.
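The same balance check can be done without a plot. A minimal sketch using value_counts(), with illustrative variable names:

```python
import pandas as pd
from sklearn.datasets import load_iris

# Rebuild the DataFrame so this snippet runs on its own
iris = load_iris()
df = pd.DataFrame(iris.data, columns=iris.feature_names)
df["species"] = iris.target_names[iris.target]

# Number of samples per species
counts = df["species"].value_counts()
print(counts)

# The dataset is balanced when every class has the same count
is_balanced = counts.nunique() == 1
print("balanced:", is_balanced)
```

Passing normalize=True to value_counts() returns class proportions instead of raw counts, which is often easier to read on larger datasets.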

5. Correlation Heatmap

# Correlation heatmap of the four numeric features
correlation_matrix = data.iloc[:, :4].corr()
plt.figure(figsize=(8, 6))
sns.heatmap(correlation_matrix, annot=True, cmap="coolwarm", linewidths=0.5)
plt.show()

Heatmap

Inference Correlation Heatmap:

  • The correlation heatmap shows the pairwise correlation coefficients between numeric features.
  • It helps us understand how features are related to each other.
  • In this case, we can see strong positive correlations between petal length and petal width, as well as between sepal length and petal length.
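To read the strongest relationships off the matrix without scanning the heatmap, the feature pairs can be ranked by their correlation coefficient. This is a small sketch with illustrative names (corr, pairs); it uses pandas' default Pearson correlation, the same as the heatmap code.

```python
from itertools import combinations

import pandas as pd
from sklearn.datasets import load_iris

# Rebuild the feature DataFrame so this snippet runs on its own
iris = load_iris()
df = pd.DataFrame(iris.data, columns=iris.feature_names)

# Pearson correlation between the four measurements
corr = df.corr()

# Keep each unordered pair of distinct features once, then sort
pairs = pd.Series(
    {(a, b): corr.loc[a, b] for a, b in combinations(corr.columns, 2)}
).sort_values(ascending=False)
print(pairs.round(2))
```

Keep in mind that these coefficients are computed over all species pooled together, so part of the correlation reflects differences between species rather than a relationship within each one.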

Overall, the EDA of the Iris dataset reveals that features like petal length and petal width are highly discriminative for distinguishing between the three Iris species. This information is crucial if you plan to build a machine learning model for Iris species classification.

With the code above, we have performed Exploratory Data Analysis on the Iris dataset and created various colorful visualizations, including pairplots, boxplots, violin plots, a countplot for species distribution, and a correlation heatmap.

You can adjust the visualizations and styles according to your preferences.
