Data Preprocessing Overview#

The following discussion focuses on data preprocessing techniques. To illustrate them, a synthetic example dataset representing transaction events between customers and merchants is used. The goal is to build robust features from the raw data that help detect fraudulent activity effectively.

Feature Types#

Transforming raw data into meaningful features is essential for building effective machine-learning models. The following discusses the types of features typically preprocessed for financial fraud detection:

1. Categorical Data#

Categorical data is information that can be grouped into distinct categories. For example, car colors like “red,” “blue,” and “green” represent categorical data. Although numerical codes can sometimes be assigned to categorical labels for ease of processing, these codes do not carry arithmetic meaning.

Examples include:

  • City (for example, “New York,” “Chicago,” “Denver”)

  • Gender (for example, “male,” “female,” “non-binary”)

  • Types of fruit (for example, “apple,” “banana,” “cherry”)

  • Customer feedback rating (for example, “poor,” “fair,” “good,” “excellent”) - even when these categories imply a ranking, they are still considered categorical because they represent ordered labels rather than measurable quantities.

Categorical features can be further divided into ordinal and nominal types.

Ordinal Data#

  • Definition: Ordinal data has a natural order or ranking among its categories.

  • Example in Dataset:

    • Merchant Rating: If the merchant rating is provided on a scale (for example, 1-5 stars), the values are ordinal.

  • Data Preprocessing Techniques:

    • Mapping to Numerical Values: If ratings or scores are in text, convert each category to a corresponding numerical value while preserving the order. For example, mapping “Poor,” “Average,” “Good,” “Very Good,” and “Excellent” to 1, 2, 3, 4, and 5, respectively.

    • Binning: If ratings are continuous, they can be grouped into bins (for example, low, medium, high).
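As a minimal sketch of the binning idea, the hypothetical helper below groups a continuous rating on a 1-5 scale into three coarse ordinal bands. The thresholds are illustrative assumptions, not values taken from the example dataset.

```python
def bin_rating(score: float) -> str:
    """Group a continuous rating (1-5 scale) into coarse ordinal bins."""
    # Thresholds below are illustrative assumptions
    if score < 2.5:
        return "low"
    elif score < 4.0:
        return "medium"
    else:
        return "high"

print([bin_rating(s) for s in [1.2, 3.0, 4.8]])  # → ['low', 'medium', 'high']
```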

Nominal Data#

  • Definition: Nominal data represents categories with no intrinsic ordering.

  • Examples in Dataset:

    • Gender: Categories such as Male, Female, or Other.

    • Merchant Category: Different types of merchants with no inherent order.

  • Common Data Preprocessing Techniques:

    • One-Hot Encoding, Binary Encoding: These techniques convert a nominal categorical feature into multiple numerical features. One-hot encoding uses indicator variables to represent each category, while binary encoding assigns an integer to each category and then represents that integer in binary form, resulting in a compact numerical representation.

    • Hashing: Hashing encoding uses a hash function to map a nominal categorical feature into a fixed number of numerical features.
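The core of the hashing trick can be sketched with the standard library alone (production code would typically use a library implementation such as scikit-learn's FeatureHasher). The bucket count of 8 is an arbitrary assumption for illustration.

```python
import hashlib

def hash_bucket(value: str, n_buckets: int = 8) -> int:
    """Map a category label to one of n_buckets indices via a hash function."""
    # md5 gives a deterministic digest, unlike Python's salted built-in hash()
    digest = hashlib.md5(value.encode("utf-8")).hexdigest()
    return int(digest, 16) % n_buckets

# Each category lands deterministically in one bucket;
# distinct categories may collide, which is the accepted trade-off of hashing
buckets = {cat: hash_bucket(cat) for cat in ["Grocery", "Electronics", "Apparel"]}
```

Because the mapping is computed on the fly, hashing needs no fitted vocabulary and handles unseen categories gracefully, at the cost of occasional collisions.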

Some attributes, such as ZIP codes or credit card numbers (for example, “90210” or “4111111111111111”), may appear numeric, but they are actually labels that identify entities rather than quantities. As a result, these values should be treated as nominal categorical data rather than numerical.

2. Numerical Data#

Numerical features include continuous or discrete values, such as transaction amounts or customer ages.

  • Common Data Preprocessing Techniques:

    • Scaling/Normalization: Apply techniques like min-max scaling or standardization to ensure that numerical features contribute equally to the model.

    • Log Transformation: For skewed distributions (for example, transaction amounts), applying a log transform can help stabilize variance.

    • Outlier Handling: Identify and treat outliers that could skew the analysis, either through capping or transformation.
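As a minimal sketch of min-max scaling (in plain Python rather than a library, using amounts matching the example dataset), the snippet below rescales values into the [0, 1] range:

```python
amounts = [250.0, 175.5, 320.75, 85.99, 1000.0]

lo, hi = min(amounts), max(amounts)
# Min-max scaling: (x - min) / (max - min) maps every value into [0, 1]
scaled = [(a - lo) / (hi - lo) for a in amounts]
```

The smallest amount maps to 0 and the largest to 1; in practice the min and max learned on training data are reused to scale new data.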

3. Timestamp Data#

Timestamps are crucial in fraud detection as they can help reveal unusual transaction patterns over time.

  • Common Data Preprocessing Techniques:

    • Extracting Time-Based Features: Break down the timestamp into useful components such as:

      • Hour of Day: Identify unusual times for transactions.

      • Day of Week: Detect patterns on weekends versus weekdays.

      • Month/Quarter: Seasonal trends in spending or fraud.

    • Time Difference Features: Calculate the time difference between consecutive transactions to understand frequency.

    • Cyclical Features: Use sine and cosine transformations for cyclic time features (for example, hour of the day) to better represent periodicity.
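The cyclical encoding above can be sketched as follows: mapping each hour onto the unit circle makes 23:00 and 00:00 nearly identical features, whereas as raw integers they are 23 apart.

```python
import math

def encode_hour(hour: int) -> tuple[float, float]:
    """Encode hour of day (0-23) as a point on the unit circle."""
    angle = 2 * math.pi * hour / 24
    return math.sin(angle), math.cos(angle)

# 23:00 and 00:00 are adjacent on the circle, unlike raw values 23 and 0
late_night_gap = math.dist(encode_hour(23), encode_hour(0))
# Noon and midnight sit on opposite sides of the circle
noon_midnight_gap = math.dist(encode_hour(12), encode_hour(0))
```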

Preprocessing Sample Notebook#

A complete working notebook with preprocessing examples is available in the Financial Fraud Detection repository.

Example Dataset Overview#

Customer Attributes:#

  • customer_id (Categorical, Nominal): Unique identifier for each customer.

  • age (Numerical): Age of the customer.

  • gender (Categorical, Nominal): Gender of the customer (for example, Male, Female, Other).

  • loyalty_points (Numerical): An integer representing the customer’s loyalty points.

Merchant Attributes:#

  • merchant_id (Categorical, Nominal): Unique identifier for each merchant.

  • merchant_rating (Categorical, Ordinal): An ordinal rating (for example, “Poor,” “Average,” “Good,” “Very Good,” “Excellent”).

  • merchant_category (Categorical, Nominal): Type of merchant (for example, Grocery, Electronics, Apparel).

  • merchant_location (Categorical, Nominal): Location of the merchant (can be a city name or ZIP code).

Transaction Attributes:#

  • transaction_id (Categorical, Nominal): Unique identifier for each transaction.

  • customer_id (Categorical, Nominal): Links transaction to a customer.

  • merchant_id (Categorical, Nominal): Links transaction to a merchant.

  • amount (Numerical): Monetary value of the transaction.

  • transaction_timestamp (Timestamp): Date and time when the transaction occurred.

  • transaction_type (Categorical, Nominal): Type of transaction (for example, online, in-store).

  • is_fraud (binary variable, 0 or 1): Indicates whether a transaction is fraudulent.

Preprocessing Example#

The example below walks through the following preprocessing steps:

  • Categorical data (ordinal and nominal)

  • Numerical data preprocessing

  • Timestamp feature extraction

  • Encoding IDs and merchant categories using binary encoding

  • Preprocessing a purely integer feature

1. Categorical Data#

A. Ordinal Data#

Ordinal features, like merchant_rating, carry an inherent order. To convert such an attribute into a numerical one, define a mapping that assigns each category a numeric value respecting the original order, then apply the mapping to the data. The resulting numeric values preserve the natural sequence of the categories, making the feature ready for effective learning.

Defining a Mapping:
An ordinal mapping dictionary (rating_mapping) is defined to associate each textual rating with a corresponding numeric value. For example, “Poor” maps to 1, and “Excellent” maps to 5. This preserves the order inherent in the ratings.

Applying the Mapping:
The map function is applied to the merchant_rating column, creating a new column, merchant_rating_numeric, that contains the numeric values corresponding to each textual rating.

By converting the categorical ratings into a numerical scale, this transformation makes the ordinal data suitable for effective modeling.

The following code snippet illustrates the transformation.

import pandas as pd

# Sample DataFrame for merchants
df_merchants = pd.DataFrame(
    {
        "merchant_id": ['M001', 'M002', 'M003'],
        "merchant_rating": ["Poor", "Average", "Excellent"],
        "merchant_category": ["Grocery", "Electronics", "Apparel"],
        "merchant_location": ["New York", "90001", "Chicago"],  # City or ZIP code
    }
)

# Define mapping for ordinal ratings
rating_mapping = {
    'Poor': 1,
    'Average': 2,
    'Good': 3,
    'Very Good': 4,
    'Excellent': 5
}

# Map ordinal ratings to numeric values
df_merchants['merchant_rating_numeric'] = df_merchants['merchant_rating'].map(rating_mapping)

B. Nominal Data#

Nominal categorical features, like merchant_category, do not have an inherent order. For such features, traditional one-hot encoding can produce a very sparse, high-dimensional representation, especially when the number of categories is large. Binary encoding or hashing are effective alternatives that produce a denser numeric representation, reducing dimensionality and often improving the performance of machine learning algorithms.

The code snippet shown below demonstrates how to apply binary encoding to the merchant_category column. It performs the following steps:

Fitting the Encoder: The category_encoders library provides various methods for encoding categorical features, including binary encoding. The BinaryEncoder is instantiated for the merchant_category column. The fit method learns the binary representation for each unique category in the column. This step establishes the mapping from each category to its corresponding binary code.

Transforming the Data:
The transform method is applied to the original merchant_category data. This converts the categorical values into multiple numerical columns, effectively transforming the text data into numerical data.

Adding the New Columns:
The column names of the encoded output are used to attach the new columns to the DataFrame.


import category_encoders as ce

# Binary encode merchant_category
encoder_merchant_cat = ce.BinaryEncoder(cols=["merchant_category"]).fit(
    df_merchants["merchant_category"]
)
encoded_merchant_cat = encoder_merchant_cat.transform(df_merchants["merchant_category"])
df_merchants[encoded_merchant_cat.columns] = encoded_merchant_cat

2. Numerical Data#

Numerical features such as amount or age often require scaling, for example standardization: rescaling the values so that they have a mean of 0 and a standard deviation of 1. This prevents features with larger scales from dominating the training process, ensuring each feature contributes comparably to the model.

The following code snippet demonstrates how to standardize numerical data using scikit-learn’s StandardScaler.

Fitting and Transforming:
The fit_transform method is applied to the amount column. This method calculates the mean and standard deviation for the amount values and scales them accordingly so that the resulting data has a mean of 0 and a standard deviation of 1.

Storing the Transformed Data:
The scaled values are stored in a new column, amount_scaled, in the DataFrame. This new column can now be used in modeling, ensuring that the amount feature is on a comparable scale with other features.


from sklearn.preprocessing import StandardScaler

# Sample DataFrame for transactions
df_transactions = pd.DataFrame(
    {
        "transaction_id": ['T001', 'T002', 'T003', 'T004', 'T005'],
        "customer_id": ['C002', 'C003', 'C001', 'C001', 'C003'],
        "merchant_id": ['M002', 'M003', 'M001', 'M003', 'M003'],
        "amount": [250.0, 175.5, 320.75, 85.99, 1000.0],
        "transaction_timestamp": [
            "2025-03-10 14:23:00",
            "2025-03-10 09:45:00",
            "2025-03-08 16:30:00",
            "2025-03-01 06:30:00",
            "2025-03-11 02:30:00",
        ],
        "transaction_type": ["online", "in-store", "online", "online", "in-store"],
        "is_fraud": [1, 0, 1, 0, 1],
    }
)

# Standard scaling for the amount column
scaler = StandardScaler()
df_transactions['amount_scaled'] = scaler.fit_transform(df_transactions[['amount']])

If the majority of transaction amounts are concentrated at lower values but a few transactions have significantly high amounts, applying a log transformation can help reduce the impact of these outliers.
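A minimal sketch of that log transform, using only the standard library and the amounts from the example dataset (log1p computes log(1 + x), which also handles zero amounts safely):

```python
import math

amounts = [250.0, 175.5, 320.75, 85.99, 1000.0]
# log1p compresses large values while keeping small ones distinct
log_amounts = [math.log1p(a) for a in amounts]

# The largest amount is ~11.6x the smallest before the transform,
# but only ~1.5x after it, so the 1000.0 outlier no longer dominates
```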

3. Timestamp Data#

Timestamps can reveal temporal patterns. Extract features such as hour, day of week, and month.

Example Code:


# Convert timestamp strings to datetime objects before using the .dt accessor
df_transactions['transaction_timestamp'] = pd.to_datetime(df_transactions['transaction_timestamp'])

# Extract hour, day of week, and month
df_transactions['hour'] = df_transactions['transaction_timestamp'].dt.hour
df_transactions['day_of_week'] = df_transactions['transaction_timestamp'].dt.dayofweek
df_transactions['month'] = df_transactions['transaction_timestamp'].dt.month
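The time-difference features mentioned earlier can be sketched without pandas. Assuming ISO-format timestamp strings (the values below are the example dataset's timestamps in chronological order), the gaps between consecutive transactions are:

```python
from datetime import datetime

timestamps = [
    "2025-03-01 06:30:00",
    "2025-03-08 16:30:00",
    "2025-03-10 09:45:00",
    "2025-03-10 14:23:00",
    "2025-03-11 02:30:00",
]

parsed = [datetime.fromisoformat(t) for t in timestamps]
# Seconds elapsed between each pair of consecutive transactions
gaps = [(b - a).total_seconds() for a, b in zip(parsed, parsed[1:])]
```

In pandas, the equivalent per-customer gaps could be computed by sorting on the timestamp and applying groupby('customer_id')['transaction_timestamp'].diff(). A sudden burst of small gaps can flag unusually rapid transaction activity.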

4. Encoding IDs as Categorical Features Using Binary Encoding#

Similar to the merchant_category attribute, ID fields such as customer_id, merchant_id, and transaction_id can be encoded using binary encoding to avoid sparse matrices.

The following code snippet converts the categorical customer_id into numerical features using binary encoding, making it suitable for machine learning models.

Fitting the Encoder:
The BinaryEncoder is initialized with the customer_id column, and the fit method learns the binary representation for each unique ID.

Transforming the Data:
The transform method encodes the customer_id column into numerical columns, and the column names of the encoded output are used to add them to the original DataFrame (df_customers).

This approach converts customer_id from a categorical variable to a numerical format. While binary encoding is one effective method, alternative techniques such as hashing encoding could also be used.

# Sample DataFrame for customers with 'loyalty_points'
df_customers = pd.DataFrame(
    {
        "customer_id": ["C001", "C002", "C003"],
        "age": [25, 40, 35],
        "gender": ["Male", "Female", "Other"],
        "loyalty_points": [150, 300, 225],
    }
)

# Binary encode customer_id
encoder_customer_id = ce.BinaryEncoder(cols=["customer_id"]).fit(
    df_customers["customer_id"]
)
encoded_customer_id = encoder_customer_id.transform(df_customers["customer_id"])
df_customers[encoded_customer_id.columns] = encoded_customer_id

5. Preprocessing Purely Integer Customer Features#

Similar to the amount attribute, the customer attribute loyalty_points can be scaled through standardization, as shown in the following code snippet.


# Scale 'loyalty_points'
scaler_loyalty = StandardScaler()
df_customers['loyalty_points_scaled'] = scaler_loyalty.fit_transform(df_customers[['loyalty_points']])
print("Customer Data with Scaled Loyalty Points:\n", df_customers)

After all preprocessing steps, the data is converted to float32 format for model compatibility.

The supplementary section provides the Python code that preprocesses the example dataset and generates node.csv, node_to_node.csv, and node_label.csv.