What’s Hash Encoding?
Hash encoding, also known as hashing, is a technique used to convert data of arbitrary length into a fixed-size string of characters. The resulting string, known as a hash value or hash code, acts as a compact fingerprint of the input data (distinct inputs can occasionally collide on the same hash). Hash encoding is commonly used in computer science, cryptography, and data storage for various purposes, such as data integrity checks, password storage, and data indexing.
Today, I will be talking about how hash encoding can be applied in ML for categorical variables.
In data science, hash encoding can be used as a technique for encoding categorical data when dealing with large and high-cardinality categorical variables. Hash encoding helps to reduce the dimensionality of the data by mapping each category to a fixed number of features (columns) in a dataset.
In the context of categorical data, high cardinality refers to a categorical variable that has a large number of distinct categories or levels. It means that the variable has a wide range of unique values, with each value occurring relatively infrequently in the dataset (think of a zip code or user ID column with thousands of distinct values).
Here’s how hash encoding is typically applied to encode categorical data:
1. Identify categorical variables: First, you need to identify the categorical variables in your dataset. These variables represent non-numerical data, such as categorical labels or groups.
2. Choose the number of hash features: Determine the number of features (columns) you want to create for the hash-encoded representation. This number should be smaller than the total number of distinct categories in the variable.
3. Apply hash encoding: For each categorical variable, apply the hash encoding process as follows (a minimal sketch appears after this list):
- Initialize the desired number of hash features/columns.
- Apply a hash function to each category in the variable, which generates a hash value.
- Use the hash value to determine the corresponding feature/column for that category.
- Assign a value (e.g., 1) to the determined feature/column and 0 to the others.
Note that the same hash function and the same number of features should be used consistently across the training and testing datasets.
4. Incorporate hash-encoded features: Once the hash encoding is applied, the resulting hash-encoded features can be incorporated into your data analysis pipeline. These encoded features can be treated as numerical features, allowing you to use them as input for machine learning algorithms or other data analysis tasks.
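To make these steps concrete, here is a minimal sketch in plain Python. The hash_encode helper and the choice of hashlib.md5 are illustrative assumptions, not a fixed standard; any stable hash function works.

import hashlib

def hash_encode(category, n_features=3):
    # A stable hash (md5) keeps the category -> column mapping
    # identical across the training and testing datasets
    index = int(hashlib.md5(category.encode()).hexdigest(), 16) % n_features
    # The chosen column gets 1, all others get 0
    return [1 if i == index else 0 for i in range(n_features)]

print(hash_encode('red'))   # e.g. [0, 0, 1]
print(hash_encode('blue'))  # e.g. [1, 0, 0]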
Why Hash Encoding?
Hash encoding can be particularly useful when dealing with categorical variables with high cardinality, where one-hot encoding may result in an excessive number of features. By limiting the number of hash features, hash encoding provides a more compact representation while preserving some level of information about the original categorical variable.
Below is a Python code example to illustrate:

import pandas as pd
from sklearn.feature_extraction import FeatureHasher

data = {
    'color': ['red', 'blue', 'green', 'red', 'yellow'],
    'size': ['small', 'medium', 'large', 'small', 'medium'],
    'shape': ['circle', 'square', 'triangle', 'circle', 'square']
}
df = pd.DataFrame(data)

# Hash each row's three categories into n_features columns
hasher = FeatureHasher(n_features=3, input_type='string')
hashed = hasher.transform(df.values)

# Attach the hash-encoded features to the original DataFrame
cols = ['hash_feature_1', 'hash_feature_2', 'hash_feature_3']
final_df = pd.concat([df, pd.DataFrame(hashed.toarray(), columns=cols)], axis=1)
Let’s break down the steps involved:

1. Initialising the FeatureHasher:
- The FeatureHasher is initialised with a specified number of hash features (n_features) and the input type (input_type).
- The n_features parameter determines the number of hash-encoded features to create.
- The input_type parameter specifies the data type of the input, which is set to 'string' in this case.
2. Applying hash encoding:
- For each category in the categorical variable, the hash encoding process is as follows:
- The category is converted to a string representation.
- A hash function is applied to the string representation, generating a hash value.
- The hash value is used to determine the corresponding hash feature/column for that category.
- The value of the determined hash feature/column is set to 1, indicating the presence of that category.
- The values of other hash features/columns remain 0, indicating the absence of those categories.
3. Creating a new DataFrame:
- The resulting hash-encoded features are obtained as a sparse matrix.
- The sparse matrix is converted to a dense array using the toarray() method.
- A new DataFrame is created using the dense array, where each column represents a hash-encoded feature.
- Column names can be assigned to the hash-encoded features for better interpretability.

It’s important to note that the specific hash function used and the exact mapping from categories to hash features depend on the implementation details of the FeatureHasher and the specific hashing algorithm employed.
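To see where a single category lands, you can pass one value through the same hasher from the snippet above. The exact column and sign depend on the hashing algorithm, so treat the output shown here as illustrative:

# Hash one category; exactly one of the 3 columns is set to +1 or -1
print(hasher.transform([['red']]).toarray())  # e.g. [[-1.  0.  0.]]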
Now let’s see what the output looks like:
color size shape hash_feature_1 hash_feature_2 hash_feature_3
0 red small circle -1.0 -1.0 -1.0
1 blue medium square 1.0 -1.0 -1.0
2 green large triangle -1.0 -1.0 1.0
3 red small circle -1.0 -1.0 -1.0
4 yellow medium square -1.0 -1.0 -1.0
The output is the final DataFrame final_df after applying hash encoding to the categorical columns. The DataFrame contains the original columns ('color', 'size', 'shape') and additional columns representing the hash-encoded features ('hash_feature_1', 'hash_feature_2', 'hash_feature_3').
Each row in the DataFrame corresponds to a sample, and the values in the hash-encoded feature columns represent the hash-encoded representation of that sample’s categorical values. The hash-encoded features are real-valued numbers: each category contributes +1 or -1 (the sign comes from the signed hash) to the column it maps to, contributions from categories that collide in the same column are summed, and a value of 0 means that no category hashed to that column.
To create a combined representation, we can hash each categorical column separately into the same number of features and then sum the corresponding hash-encoded features across the categorical features. Each element in the combined representation corresponds to the sum of the respective elements from the per-column encodings. This combined representation can be considered a compressed representation of the original categorical values, capturing the relationships between different categorical features, and it can be used as input for various machine learning algorithms or further analysis.
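Here is a minimal sketch of that idea, assuming the df from the earlier snippet; each column is hashed separately into the same 3 features and the resulting matrices are summed element-wise:

# Hash each categorical column on its own, then sum the matrices
hasher = FeatureHasher(n_features=3, input_type='string')
combined = sum(
    hasher.transform(df[[col]].values).toarray()
    for col in ['color', 'size', 'shape']
)
print(combined.shape)  # (5, 3)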
Please note that the exact numeric values in the hash-encoded features may differ in your output, as they are determined by the specific hash function used and the order of categorical values.
Benefits of Hash Encoding
When dealing with high cardinality categorical data, hash encoding can be advantageous compared to traditional methods that directly encode categories as numbers, such as label encoding or integer encoding. Here are some reasons why hash encoding can be a better choice:
- Dimensionality Reduction: High cardinality variables can result in a large number of distinct categories, leading to an explosion in the number of features when using traditional encoding methods like one-hot encoding. In contrast, hash encoding allows for dimensionality reduction by mapping categories to a fixed number of hash features. This reduces the number of features required to represent the categorical variable and avoids the curse of dimensionality.
- Compact Representation: Hash encoding provides a more compact representation of high cardinality variables. Instead of allocating separate columns for each category, hash encoding maps categories to a smaller set of hash features. This helps to reduce memory usage and storage requirements, particularly when working with large datasets.
- Flexibility in Feature Space: Hash encoding allocates the same number of hash features regardless of the number of categories. This means that even if new categories are encountered during testing or deployment that were not present in the training data, they can still be encoded using the same fixed number of hash features (see the short demo after this list). In contrast, with methods like one-hot encoding, accommodating new categories may require retraining the model or expanding the feature space.
- Collision Handling: Hash encoding can handle collisions, which occur when different categories produce the same hash value. While collisions are possible in hash encoding, a well-designed hash function minimizes their occurrence, and their impact can be mitigated by selecting an appropriate number of hash features. In contrast, when directly encoding categories as numbers, every category must be assigned its own unique numerical value, which becomes unwieldy at high cardinality.
- Information Preservation: Hash encoding retains some level of information about the original categories, even though it’s not a reversible transformation. The relationship between the original categories and the hash-encoded features is not explicitly interpretable, but the resulting hash values can still capture some patterns and associations present in the data.
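Continuing with df, hasher, and hashed from the earlier snippet, here is a quick demo of the dimensionality and flexibility points; the unseen categories ('purple', 'xl', 'hexagon') are hypothetical values invented for illustration:

# One-hot encoding grows with the number of distinct categories...
print(pd.get_dummies(df).shape)  # (5, 10): one column per distinct category
# ...while the hashed representation keeps a fixed width
print(hashed.toarray().shape)    # (5, 3)

# Unseen categories still map into the same 3 columns, because
# FeatureHasher is stateless and needs no fitted vocabulary
new_df = pd.DataFrame({'color': ['purple'], 'size': ['xl'], 'shape': ['hexagon']})
print(hasher.transform(new_df.values).toarray().shape)  # (1, 3)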
Overall, hash encoding provides a balance between dimensionality reduction, compact representation, and flexibility in handling high cardinality categorical data. It can be a suitable alternative to traditional encoding methods when dealing with large datasets and variables with numerous distinct categories.
If you liked this blog post, please feel free to shower some claps; and to stay updated with more such relevant content follow me and subscribe to my feed :)