Test Time Training Under Evolutionary Concept Drift Using Explanation Distributions

Master's Thesis


Description

Example problem

Suppose you are given a dataset from 2014 about smartphones, 𝒟𝑋={𝑥1,…,𝑥𝑛}⊆ℝ𝑑, with their attributes (screen size, camera resolution, battery lifetime, price, etc.) and labels 𝒟𝑌={𝑦1,…,𝑦𝑛}, 𝑦𝑖∈{0,1}, indicating whether a smartphone is a best buy. You can train a classifier 𝑓̂ with any preferred method.
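As a concrete illustration, the following sketch trains such a classifier 𝑓̂ on synthetic 2014-style data; all attribute names, value ranges, and the labeling rule are illustrative assumptions, not real data, and a simple logistic-regression model stands in for "any preferred method":

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
n = 1000
# Synthetic 2014-style smartphone attributes (names and ranges illustrative):
# screen size (inches), camera resolution (MP), battery lifetime (h), price (EUR)
X = np.column_stack([
    rng.uniform(3.5, 5.5, n),
    rng.uniform(5, 13, n),
    rng.uniform(6, 12, n),
    rng.uniform(100, 700, n),
])
# Toy "best buy" labeling rule: good specs relative to price
score = 0.5 * X[:, 1] + 1.0 * X[:, 2] - 0.02 * X[:, 3]
y = (score > np.median(score)).astype(float)

# Standardize the attributes and fit logistic regression by gradient descent
Z = (X - X.mean(0)) / X.std(0)
w, b = np.zeros(4), 0.0
for _ in range(2000):
    p = sigmoid(Z @ w + b)
    w -= 0.5 * Z.T @ (p - y) / n
    b -= 0.5 * (p - y).mean()
acc = ((sigmoid(Z @ w + b) > 0.5) == (y == 1)).mean()
print(f"training accuracy: {acc:.2f}")
```

Since the toy labels are a linear function of the attributes, the linear classifier fits them almost perfectly; any other supervised learner could be substituted here.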

Now you encounter a new dataset from 2024, 𝒟𝑋∗={𝑥1∗,…,𝑥𝑛∗}⊆ℝ𝑑, but you do not have any labels 𝑦𝑖∗. How can you predict the best buys, given that the concept of what a best buy is has shifted – different camera resolutions, different battery lifetimes, different prices, etc.? What is a good predictor 𝑓∗̂?

Generic problem

Given a labeled training dataset 𝒟=(𝒟𝑋,𝒟𝑌) and an unlabeled test dataset 𝒟𝑋∗ drawn from a different distribution: which function optimizes the prediction? In an evaluation setup, we may know 𝒟𝑌∗; in an operational environment, we do not.

Assumptions

Our assumption is that the concept drift was evolutionary and did not change the set of attributes. For instance, if a new smartphone attribute appears (e.g. the capability to teleport the smartphone owner to a physically different place), this might radically change the utility of the smartphone, and without an overall assessment the benefit of this new attribute cannot be extrapolated from past data. This would be a revolutionary change and beyond our intended framework.

Methods

We have developed methods to recognize distribution shifts without concept drift by comparing the explanation distributions 𝑆(𝑓̂,𝒟𝑋) and 𝑆(𝑓̂,𝒟𝑋∗), where 𝑆(𝑓̂,𝒟𝑋) is defined as the distribution of importance values assigned to the different attributes, computed using Shapley values.
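A minimal sketch of how such attribute importances could be estimated, assuming mean imputation for absent attributes (a common simplification; dedicated libraries provide more refined Shapley estimators). The function samples feature orderings and averages the resulting marginal contributions:

```python
import numpy as np

def shapley_importances(predict, X, background, n_perm=50, seed=0):
    """Monte Carlo estimate of the mean |Shapley value| per attribute.

    Absent attributes are replaced by the background mean (a simplifying
    assumption; exact Shapley values marginalize over the background
    distribution)."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    base = background.mean(axis=0)
    phi = np.zeros((n, d))
    for _ in range(n_perm):
        order = rng.permutation(d)          # random coalition ordering
        cur = np.tile(base, (n, 1))         # start from the background point
        prev = predict(cur)
        for j in order:
            cur[:, j] = X[:, j]             # reveal attribute j
            new = predict(cur)
            phi[:, j] += new - prev         # its marginal contribution
            prev = new
    return np.abs(phi / n_perm).mean(axis=0)

# Toy usage: a linear "model" with hypothetical attribute weights
w = np.array([0.5, 1.0, -0.02, 0.0])
predict = lambda Z: Z @ w
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 4))
S = shapley_importances(predict, X, X)
```

For a linear model with mean imputation, the estimate reduces to |𝑤𝑗| times the mean absolute deviation of attribute 𝑗, which makes the sketch easy to sanity-check.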

Key idea: We now develop a model of concept drift that assumes that the relative importance of attribute values stays constant.

First, we want to investigate how to define a function 𝑔 that maps the data using optimal transport such that 𝑔(𝒟𝑋)≈𝒟𝑋∗ and 𝑆(𝑓̂,𝒟𝑋)≈𝑘∙𝑆(𝑓∗̂,𝒟𝑋∗).
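One candidate for such a 𝑔 is the closed-form optimal transport map between Gaussian approximations of the two samples (the Monge map for the 2-Wasserstein distance). The sketch below, with synthetic data and an illustrative 2014→2024 shift, is an assumption about the approach, not the thesis method:

```python
import numpy as np

def msqrt(M):
    # Symmetric PSD matrix square root via eigendecomposition.
    w, V = np.linalg.eigh(M)
    return (V * np.sqrt(np.clip(w, 0, None))) @ V.T

def gaussian_ot_map(X_src, X_tgt):
    """Optimal transport map between Gaussian approximations of the two
    samples: g(x) = mu_t + A (x - mu_s), with A the Bures matrix."""
    mu_s, mu_t = X_src.mean(0), X_tgt.mean(0)
    C_s = np.cov(X_src, rowvar=False)
    C_t = np.cov(X_tgt, rowvar=False)
    Cs_half = msqrt(C_s)
    Cs_half_inv = np.linalg.inv(Cs_half)
    A = Cs_half_inv @ msqrt(Cs_half @ C_t @ Cs_half) @ Cs_half_inv
    return lambda X: mu_t + (X - mu_s) @ A.T

# Toy check: 2024 attributes as a scaled and shifted version of 2014
# (screen size and price columns; the shift is illustrative)
rng = np.random.default_rng(0)
X_2014 = rng.normal([4.5, 300.0], [0.5, 100.0], size=(500, 2))
X_2024 = X_2014 * np.array([1.3, 2.0]) + np.array([1.0, 100.0])
g = gaussian_ot_map(X_2014, X_2024)
```

When the true shift is an affine rescaling, as in this toy example, the Gaussian map recovers it exactly; for more general shifts, empirical optimal transport solvers would be needed.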

A second approach might consider a mapping of the predictor ℎ(𝑓̂)=𝑓∗̂ such that 𝑆(𝑓̂,𝒟𝑋)≈𝑘∙𝑆(𝑓∗̂,𝒟𝑋∗), by adding a corresponding term to the loss function. It is likely necessary to combine both approaches.
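A sketch of this second approach for a linear predictor, assuming pseudo-labels from 𝑓̂ on the target data and fixing 𝑘=1 (both are simplifying assumptions); the added loss term penalizes deviation of the target model's importance profile from the source model's:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def importance(w, X):
    # Per-attribute importance of a linear model: |weight| times mean
    # absolute deviation (the mean |Shapley value| under mean imputation).
    return np.abs(w) * np.abs(X - X.mean(0)).mean(0)

def adapt_predictor(w_src, X_src, X_tgt, lam=1.0, lr=0.1, steps=500):
    """Fit target weights on pseudo-labels from the source model while
    penalizing the gap between explanation profiles (k fixed to 1)."""
    y_pseudo = (sigmoid(X_tgt @ w_src) > 0.5).astype(float)
    target_imp = importance(w_src, X_src)
    s_tgt = np.abs(X_tgt - X_tgt.mean(0)).mean(0)
    w = w_src.copy()
    n = len(X_tgt)
    for _ in range(steps):
        p = sigmoid(X_tgt @ w)
        grad_ce = X_tgt.T @ (p - y_pseudo) / n          # cross-entropy part
        gap = np.abs(w) * s_tgt - target_imp            # explanation gap
        grad_pen = 2 * lam * gap * np.sign(w) * s_tgt   # penalty gradient
        w -= lr * (grad_ce + grad_pen)
    return w

# Toy drift: attribute 0 doubles its scale between source and target
rng = np.random.default_rng(0)
X_src = rng.normal(size=(300, 3))
w_src = np.array([1.0, -2.0, 0.5])
X_tgt = X_src * np.array([2.0, 1.0, 1.0])
w_new = adapt_predictor(w_src, X_src, X_tgt)
```

In this toy setup the penalty shrinks the weight on the rescaled attribute so that the target model's importance profile moves back toward the source profile, which is exactly the constancy assumption in the key idea above.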

