AI, Blog, Machine Learning

A Complete Guide to Understanding and Handling Data Leakage!


Data leakage in artificial intelligence happens when information that would not be available at prediction time finds its way into the data used to train or select a model. This issue typically arises when the training data contains elements that only become known after the event being predicted, or when information from the test set influences the training process. Such leakage inflates the model’s performance metrics, making it appear more effective than it actually is and reducing its real-world applicability.

Let’s look at a few examples of data leakage in AI:

  1. In a predictive maintenance scenario, a model is trained on sensor data to predict when a machine will fail. The training dataset includes data from the machine’s sensors, as well as the dates when the machine was serviced. However, the dates of the service were only available after the machine failed, and should not have been used in the training dataset. This causes the model to learn that certain sensor readings are indicative of a machine failure because they occurred just before the machine was serviced, rather than because they indicate an actual problem.
  2. In a healthcare scenario, a model is trained to predict patient outcomes based on their medical history and lab results. The training dataset includes information about whether the patient was hospitalized, which is only known after the outcome has occurred. This causes the model to learn that certain medical conditions or lab results are indicative of a poor outcome because they occurred in patients who were hospitalized, rather than because they are actually related to the outcome.
  3. In a financial scenario, a model is trained to predict credit risk based on a customer’s financial history. The training dataset includes information about whether the customer defaulted on a loan, which is only known after the loan has been granted. This causes the model to learn that certain financial conditions are indicative of a high credit risk because they occurred in customers who defaulted, rather than because they are actually related to credit risk.
  4. In a natural language processing (NLP) scenario, a model is trained to classify text documents into different categories. The training dataset includes the text of the documents, as well as their category labels. However, the category labels are also included in the text of the documents, such as in the form of a prefix or suffix. This causes the model to learn that certain words or phrases are indicative of a certain category because they occurred with that category label, rather than because they are actually related to the category.
  5. In a computer vision scenario, a model is trained to identify objects in images. The training dataset includes the images, as well as the bounding boxes around the objects. However, the bounding boxes are also included in the images, such as in the form of a watermark. This causes the model to learn that certain image features are indicative of an object because they occurred within a bounding box, rather than because they are actually related to the object.
  6. In a time-series forecasting scenario, a model is trained to predict future values of a time series. The training dataset is supposed to contain only historical values, but future values (or features computed from them) slip in as well. Since those future values are exactly what the model is asked to predict, it effectively gets to peek at the answer, and its apparent accuracy says little about how it will perform on genuinely unseen future data.

To prevent data leakage, it’s important to carefully construct the training and testing datasets so that they do not include information that would not be available at the time predictions are made. This may involve removing certain features, or using techniques like cross-validation or time-based splitting. It’s also important to continuously monitor the model’s performance on unseen data and look out for any abnormal performance improvement.

Let’s see an example of data leakage in actual code, using Python’s scikit-learn library:

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression

# Generate some synthetic data
import numpy as np
np.random.seed(0)
X = np.random.randn(100, 1)
y = 2 + 3 * X + np.random.randn(100, 1)

# Split the data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

# Fit a linear regression model to the training data
reg = LinearRegression().fit(X_train, y_train)

# Evaluate the model on the test data
print("Test score:", reg.score(X_test, y_test))

In this example, the model is trained on a synthetic dataset with a single feature (X) and a single target variable (y) and then tested on a holdout set. The goal is to use the model to predict y given X.

Here there is no leakage: the model is trained on X_train, y_train and tested on X_test, y_test, and the test set is completely unseen by the model when its performance is evaluated.

However, if we were to do something like this:

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, shuffle=False)

In this case the data is not shuffled before splitting: the model is trained on the first 80% of the rows and tested on the last 20%. The two sets are still disjoint, so for purely i.i.d. synthetic data like this there is no leakage either way; what matters is whether the split matches how the data is ordered and how the model will be used. If i.i.d. data happens to be stored in a sorted order (for example by label or by target value), an unshuffled split can produce an unrepresentative test set, while for time-ordered data the opposite mistake, shuffling before splitting, mixes future observations into the training set and is a classic source of leakage.

In short, shuffle i.i.d. data so that the split is representative, but split time-ordered data chronologically (see the time-based splitting method later in this post).
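
To make this concrete, here is a small hypothetical sketch (not part of the original example) using time-ordered data, where the time index is the only feature: with a shuffled split, the model only has to interpolate between time steps it has already seen, while a chronological split forces it to predict genuinely unseen future behaviour.

import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor

# Hypothetical time-ordered data: a smooth signal observed at 500 time steps
rng = np.random.RandomState(0)
t = np.arange(500).reshape(-1, 1)                    # the time index is the only feature
y = np.sin(t.ravel() / 20) + 0.1 * rng.randn(500)

model = KNeighborsRegressor(n_neighbors=5)

# Shuffled split: every test point has close temporal neighbours in the training set
X_train, X_test, y_train, y_test = train_test_split(t, y, test_size=0.2, shuffle=True, random_state=0)
print("Shuffled split R^2:", model.fit(X_train, y_train).score(X_test, y_test))

# Chronological split: the test period lies entirely in the future
X_train, X_test, y_train, y_test = train_test_split(t, y, test_size=0.2, shuffle=False)
print("Chronological split R^2:", model.fit(X_train, y_train).score(X_test, y_test))

On data like this, the shuffled score is typically far higher than the chronological one, even though only the chronological number reflects real forecasting ability.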

Let’s look at one more example, a text classification problem using Python’s scikit-learn library, and at how to prevent data leakage in it:

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB

# Create some synthetic data
data = ["positive review", "negative review", "positive review", "neutral review"]
target = ["positive", "negative", "positive", "neutral"]

# Create a CountVectorizer to preprocess the data
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(data)

# Split the data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, target, test_size=0.2)

# Fit a Naive Bayes classifier to the training data
clf = MultinomialNB().fit(X_train, y_train)

# Evaluate the model on the test data
print("Test score:", clf.score(X_test, y_test))

In this example, the model is trained on a synthetic dataset with text data and a target variable, and then tested on a holdout set.

However, if we were to use the target variable in the text data, it would lead to leakage:

data = ["positive review", "negative review", "positive review", "neutral review"]
target = ["positive", "negative", "positive", "neutral"]
# Leakage: the label is appended to each document's text
data = [data[i] + " " + target[i] for i in range(len(data))]
# Create a CountVectorizer to preprocess the data
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(data)

# Split the data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, target, test_size=0.2)

# Fit a Naive Bayes classifier to the training data
clf = MultinomialNB().fit(X_train, y_train)

# Evaluate the model on the test data
print("Test score:", clf.score(X_test, y_test))

Here the label itself is embedded in the text, so the model simply learns to read the label back out of each document. The accuracy looks very high, but it tells us nothing about how the model would perform on real documents that do not contain their own labels.

To prevent data leakage in this example, make sure the target labels never appear in the text itself and are used only as the training and evaluation labels. It is also good practice to fit the vectorizer on the training documents only; fitting it on the full dataset before splitting, as above, is itself a mild form of preprocessing leakage. A leak-free version of the example might look like the sketch below.
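
This is a minimal sketch (using the same toy data and a scikit-learn Pipeline) of the leak-free pattern: the raw text is split first, and the vectorizer and classifier are then fitted on the training documents only.

from sklearn.pipeline import make_pipeline
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB

data = ["positive review", "negative review", "positive review", "neutral review"]
target = ["positive", "negative", "positive", "neutral"]

# Split the raw documents first, so the vectorizer never sees the test texts
X_train, X_test, y_train, y_test = train_test_split(data, target, test_size=0.25, random_state=0)

# The pipeline fits CountVectorizer and MultinomialNB on the training documents only
model = make_pipeline(CountVectorizer(), MultinomialNB())
model.fit(X_train, y_train)
print("Test score:", model.score(X_test, y_test))

Because the vectorizer lives inside the pipeline, the same pattern stays leak-free when you later switch to cross-validation or grid search.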

Another way to prevent data leakage is to use cross-validation to train and evaluate the model. Instead of splitting the data into a training and testing set, we can use k-fold cross-validation to train and evaluate the model multiple times on different subsets of the data.

from sklearn.model_selection import cross_val_score, KFold

# The toy dataset above has only four documents, so a plain 2-fold split is used here;
# with a realistically sized dataset you would typically use cv=5 or cv=10
scores = cross_val_score(clf, X, target, cv=KFold(n_splits=2))
print("Cross-validation scores:", scores)

This trains and evaluates the model several times on different subsets of the data and gives a more robust estimate of its performance; a score that is consistently and suspiciously high across folds is a useful warning sign of leakage.
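
One caveat worth adding (a hedged sketch, using the Iris data rather than the tiny text example above): cross-validation only protects against leakage if every preprocessing step is fitted inside each fold. Wrapping the preprocessing and the model in a Pipeline ensures that, within each fold, the scaling statistics (or vectorizer vocabulary) are learned from that fold's training portion only.

from sklearn.datasets import load_iris
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)

# The scaler is re-fitted on the training portion of every fold, not on all the data
pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
print("Cross-validation scores:", cross_val_score(pipe, X, y, cv=5))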

There are several other methods to prevent data leakage:

  1. Time-based splitting: This method is used when working with time-series data, where the data is split into training and testing sets based on the time of observation. The training dataset includes observations from a certain period in the past, while the testing dataset includes observations from a different, more recent period. This method ensures that the model is not exposed to future information during the training phase.
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split, TimeSeriesSplit
from sklearn.linear_model import LinearRegression

# Load the dataset
data = fetch_california_housing()

# Define the features and target
X = data.data
y = data.target

# Create the time-based splitter (this dataset is not truly time-ordered;
# it is used here only to illustrate the mechanics of TimeSeriesSplit)
tscv = TimeSeriesSplit(n_splits=5)

# Fit the model and evaluate using time-based splitting (preventing data leakage)
scores = []
for train_index, test_index in tscv.split(X):
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]
    model = LinearRegression().fit(X_train, y_train)
    scores.append(model.score(X_test, y_test))
print("Cross-validation scores (with time-based splitting):", scores)

# Fit the model and evaluate with a plain random split (which would allow leakage if the data were truly time-ordered)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
model = LinearRegression().fit(X_train, y_train)
score = model.score(X_test, y_test)
print("Test score (without time-based splitting):", score)
Cross-validation scores (with time-based splitting): [-0.11102198252247408, 0.5393788961369235, 0.6013637646829824, 0.5075579844648623, 0.6722502848295411]
Test score (without time-based splitting): 0.6104539306045191

2. Data Masking: This method involves masking or removing columns or features that would not be available at prediction time. Candidates include columns that are irrelevant to the problem or that only become known after the event of interest has occurred.

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

# Load the dataset
iris = load_iris()

# Define the features and target
X = iris.data
y = iris.target

# Mask (drop) the features that should not be used by the model,
# keeping only the first two columns
X_masked = X[:, :2]

# Split the masked data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X_masked, y, test_size=0.2)

# Fit a logistic regression model to the masked training data
clf = LogisticRegression().fit(X_train, y_train)
print("Test score (with data masking):", clf.score(X_test, y_test))

# For comparison, fit the same model on all features, including the masked ones
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
clf = LogisticRegression().fit(X_train, y_train)
print("Test score (without data masking):", clf.score(X_test, y_test))
Test score (with data masking): 0.7666666666666667
Test score (without data masking): 1.0

3. Creating a separate validation dataset: Instead of using the traditional train-test split method, creating a separate validation dataset is another way to prevent data leakage. This validation dataset will be used to tune the model and select the best model, while the test dataset will be used to evaluate the performance of the final selected model.

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
import numpy as np

iris = load_iris()
X = iris.data
y = iris.target

# First hold out a test set that is never used during model selection
X_trainval, X_test, y_trainval, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Then split the remaining data into a training set and a validation set
X_train, X_val, y_train, y_val = train_test_split(X_trainval, y_trainval, test_size=0.25, random_state=42)

# Use the validation set to choose the best regularization strength C
best_score = -1
best_C = 0
for C in [0.001, 0.01, 0.1, 1, 10]:
    model = LogisticRegression(C=C, random_state=42, max_iter=1000).fit(X_train, y_train)
    score = model.score(X_val, y_val)
    if score > best_score:
        best_score = score
        best_C = C
print("Best validation score:", best_score)
print("Best C:", best_C)

# Re-fit the selected model on the combined training and validation data
X_train = np.concatenate((X_train, X_val), axis=0)
y_train = np.concatenate((y_train, y_val), axis=0)
best_model = LogisticRegression(C=best_C, random_state=42, max_iter=1000).fit(X_train, y_train)

# Evaluate the final model once on the untouched test set
print("Test score (with separate validation set):", best_model.score(X_test, y_test))

The code loads the Iris dataset with the load_iris function from scikit-learn, storing the features in X and the targets in y. It first uses train_test_split to hold out a test set that plays no part in model selection, and then splits the remaining data into training and validation sets. The validation set is used to select the best model by looping over different values of the regularization parameter C and keeping the value with the highest validation accuracy. The selected model is then re-fitted on the full training data (the concatenation of the training and validation sets), evaluated once on the held-out test set, and the final accuracy is printed. Because the test set was never used for tuning, its score is an honest estimate of how the model will generalize.
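
The same selection can be done more compactly with scikit-learn's GridSearchCV; this is a hedged sketch of an equivalent workflow, where the cross-validated search runs on the training data only and the held-out test set is still touched just once at the end.

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)
X_trainval, X_test, y_trainval, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Cross-validated search over C on the training data only
search = GridSearchCV(LogisticRegression(max_iter=1000),
                      param_grid={"C": [0.001, 0.01, 0.1, 1, 10]}, cv=5)
search.fit(X_trainval, y_trainval)

print("Best C:", search.best_params_["C"])
print("Test score:", search.score(X_test, y_test))

GridSearchCV re-fits the best estimator on all of the search data by default, so search.score(X_test, y_test) evaluates the final refitted model.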

4. Anomaly Detection: Continuously monitoring the model’s performance on unseen data and investigating any abnormal jump in performance can also help in identifying data leakage. One popular method for anomaly detection is the Isolation Forest algorithm. It works by repeatedly selecting a random feature and a random split value and recursively partitioning the data into smaller subsets; the data points that end up isolated from the rest after the fewest partitions are treated as outliers.

import numpy as np
from sklearn.ensemble import IsolationForest
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.datasets import fetch_california_housing

# Load the California housing dataset
data = fetch_california_housing()
X = data.data
y = data.target

# Initialize and fit the Isolation Forest model
clf = IsolationForest(contamination=0.1, random_state=42)
clf.fit(X)

# Use the model to predict the labels of the data set
y_pred = clf.predict(X)

# Select the inliers
inliers = y_pred == 1
X_inliers = X[inliers]
y_inliers = y[inliers]

# Print the number of inliers and outliers
print("Number of inliers:", inliers.sum())
print("Number of outliers:", (~inliers).sum())

# Split the inliers into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X_inliers, y_inliers, test_size=0.2, random_state=42)

# Fit a linear regression model to the inliers
lr = LinearRegression().fit(X_train, y_train)

# Evaluate the linear regression model on the test set
score = lr.score(X_test, y_test)
print("R^2 on test set: {:.2f}".format(score))
Number of inliers: 18576
Number of outliers: 2064
R^2 on test set: 0.67

Isolation Forest is one algorithm that can be used for anomaly detection. It builds an ensemble of random trees, each of which recursively partitions the data by picking a random feature and a random split value until individual points are isolated. Points that become isolated from the rest of the data after only a few partitions are considered outliers, because they sit in sparse regions of the feature space.

In the code example above, the California housing dataset is loaded first; an instance of IsolationForest is then created, fitted to the data, and used to label every sample. The inliers are selected, the counts of inliers and outliers are printed, and finally a linear regression model is trained and evaluated on the inliers only.

You can use this code as a starting point: adjust the parameters, try different datasets, and see how it performs on your specific problem. Other algorithms, such as Local Outlier Factor or One-Class SVM, can be used in the same way, and visualization techniques such as box plots and scatter plots can help you understand the data and spot anomalies.

If this still feels abstract, let’s look at another example to build more intuition.

Here’s an example of how you can use Local Outlier Factor (LOF) for anomaly detection:

from sklearn.datasets import fetch_california_housing
from sklearn.neighbors import LocalOutlierFactor

# Load the California housing dataset
X, _ = fetch_california_housing(return_X_y=True)

# Create an instance of LOF
clf = LocalOutlierFactor(n_neighbors=20, contamination=0.1)

# Fit the model and predict the labels (1 for inliers, -1 for outliers);
# without novelty=True, LocalOutlierFactor only supports fit_predict on the data it is fitted on
y_pred = clf.fit_predict(X)

# Count the number of inliers and outliers
n_inliers = (y_pred == 1).sum()
n_outliers = (y_pred == -1).sum()

print("Number of inliers:", n_inliers)
print("Number of outliers:", n_outliers)
Number of inliers: 18576
Number of outliers: 2064

This code will use the LOF algorithm to identify outliers in the California housing dataset. LOF uses a nearest neighbors approach to identify outliers by computing the local density of each data point. The n_neighbors parameter controls the number of nearest neighbors used to compute the local density, and the contamination parameter controls the proportion of data that is considered to be outliers.

In this example, the n_neighbors parameter is set to 20, meaning that each data point’s local density is computed using the 20 nearest neighbors. The contamination parameter is set to 0.1, meaning that the algorithm will consider 10% of the data to be outliers.

You can adjust the parameters and see how they affect the results; you can also try different datasets to see how well LOF performs on your specific problem.

Another algorithm for anomaly detection is One-class SVM, here’s an example of how to use it:

from sklearn.datasets import fetch_california_housing
from sklearn import svm

# Load the California housing dataset
X, _ = fetch_california_housing(return_X_y=True)

# Create an instance of OneClassSVM
clf = svm.OneClassSVM(nu=0.1, kernel="rbf", gamma='scale')

# Fit the model to the data
clf.fit(X)

# Predict the labels (1 for inliers, -1 for outliers)
y_pred = clf.predict(X)

# Count the number of inliers and outliers
n_inliers = (y_pred == 1).sum()
n_outliers = (y_pred == -1).sum()

print("Number of inliers:", n_inliers)
Number of inliers: 18574

5. Using different training and test data: This is a simple method of preventing data leakage, but it requires a lot of data. The idea is to use different data for training and testing, and make sure that the model has not seen the test data during the training phase.

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

# Load the Iris dataset
iris = load_iris()
X, y = iris.data, iris.target

# Split the data into a training set and a test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train a logistic regression model on the training set
clf = LogisticRegression().fit(X_train, y_train)

# Evaluate the model on the test set
test_score = clf.score(X_test, y_test)
print("Test score: {:.2f}".format(test_score))

In this example, we first load the Iris dataset using scikit-learn’s load_iris function. Then we use the train_test_split function to split the data into a training set and a test set. We set the test size to 20% meaning that 20% of the data will be used for testing and 80% will be used for training. The random_state parameter is used to set the seed for the random number generator, so that the same split is obtained every time the code is run. We then train a logistic regression model on the training data and evaluate its performance on the test set. This way the model is not tested on the data it has already seen.

Data leakage can happen in various ways, and it is important to be aware of all the different ways that it can happen in order to prevent it.
Here are a few additional things to keep in mind when working with machine learning models:
Data snooping: This happens when you try many models, features, and parameter settings and repeatedly check the results on the same data, so the final choice is effectively overfitted to that data and performs well there but not on truly unseen data. To prevent this, use techniques like cross-validation (or a dedicated validation set) for model selection and keep the test set for a single final evaluation.
Leakage through features: This happens when a feature in the model is derived from, or only becomes known after, the target variable. For example, if you are predicting credit risk and one of your features records whether the loan was eventually written off, that feature is effectively a restatement of the outcome. To prevent this, carefully examine every feature and make sure none of them is computed from the target or only becomes available after the event you are predicting.
Leakage through data preprocessing: This happens when the data is preprocessed in a way that inadvertently leaks information from the test set into the training process. For example, if you normalize the data using the mean and standard deviation computed over the entire dataset, the test set has influenced the statistics the model is trained with. To prevent this, compute preprocessing statistics on the training set only and apply the same transformation to the test set, as in the short sketch after this list.
Leakage through the model: This happens when the model or the training procedure itself uses information from the test set, for example when hyperparameters are tuned, features are selected, or early stopping is triggered based on test-set performance. To prevent this, keep the test set completely separate and base every such decision on the training and validation data only.
Leakage through the evaluation metric: A poorly chosen metric can hide leakage or give a false sense of performance. For example, if the test set is highly imbalanced, accuracy can be very high even when the model makes no useful predictions; metrics such as precision, recall, or ROC AUC give a more honest picture.
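
To make the preprocessing point concrete, here is a minimal, hypothetical sketch (not taken from the original examples) of the safe pattern: the scaler's statistics are computed on the training split only and then reused, unchanged, to transform the test split.

import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Hypothetical data: 200 samples with 3 features
rng = np.random.RandomState(0)
X = rng.randn(200, 3)
y = rng.randint(0, 2, size=200)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# Leaky pattern (avoid): StandardScaler().fit_transform(X) before splitting
# would compute the mean and standard deviation using the test rows as well.

# Safe pattern: fit the scaler on the training data only...
scaler = StandardScaler().fit(X_train)
X_train_scaled = scaler.transform(X_train)
# ...and reuse the same training-set statistics on the test data
X_test_scaled = scaler.transform(X_test)

scikit-learn's Pipeline, used earlier in this post, automates exactly this pattern.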
Another way to prevent data leakage is to use techniques such as data sanitization, which involves removing sensitive information or perturbing the data in a controlled way so that it is still useful for training models but cannot be used to identify individuals.

In addition, it is important to be aware of the potential for leakage when working with third-party data, such as data from external sources or data that has been collected by other teams. In such cases, it is important to carefully review and understand the data, as well as to establish clear agreements and protocols for how the data will be used and protected.

Lastly, it is important to have a clear data governance framework in place that outlines the roles and responsibilities of different teams, establishes clear protocols for data access, and provides guidance on how to handle sensitive data. This can help to ensure that data is used in an ethical and responsible way and that any potential risks are identified and addressed in a timely manner.

#Artificial Intelligence
#Machine Learning
#Beginner
#Data Science
#AI
