Dataset Similarity Evaluation Framework

Wireless labs often collect data in isolation, making it hard to predict whether a pre-trained model will generalize to a new environment. I worked with collaborators at ASU and Bell Labs to build a dataset similarity evaluation framework that:

Represents datasets via task-aligned fingerprints spanning spectrogram statistics, spatial correlations, and semantic metadata.
Uses Large Wireless Model (LWM) embeddings to score cross-dataset affinity and recommend fine-tuning targets.
Ships with automated reports so practitioners can quickly understand gaps before launching expensive data collection campaigns.
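
To make the affinity-scoring step concrete, here is a minimal sketch of one way such a score could be computed. It is not the framework's actual implementation: it assumes each dataset is summarized by mean-pooling its per-sample embeddings and that affinity is the cosine similarity between those summaries. The function names (`mean_embedding`, `cosine_similarity`, `rank_by_affinity`), the dataset names, and the toy 2-D vectors are all hypothetical stand-ins for real LWM embeddings.

```python
import math
from typing import Dict, List, Sequence, Tuple

Vector = Sequence[float]


def mean_embedding(embeddings: Sequence[Vector]) -> List[float]:
    """Average per-sample embeddings into one dataset-level vector (assumed pooling)."""
    if not embeddings:
        raise ValueError("dataset has no embeddings")
    dim = len(embeddings[0])
    return [sum(e[i] for e in embeddings) / len(embeddings) for i in range(dim)]


def cosine_similarity(a: Vector, b: Vector) -> float:
    """Cosine similarity between two vectors; returns 0.0 if either has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_by_affinity(
    target: Sequence[Vector], candidates: Dict[str, Sequence[Vector]]
) -> List[Tuple[str, float]]:
    """Rank candidate datasets by similarity to the target, most similar first."""
    target_vec = mean_embedding(target)
    scores = [
        (name, cosine_similarity(target_vec, mean_embedding(embs)))
        for name, embs in candidates.items()
    ]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)


# Toy 2-D vectors standing in for real embeddings (hypothetical data).
target = [[1.0, 0.0], [0.8, 0.2]]
candidates = {
    "indoor_lab": [[0.9, 0.1], [1.0, 0.0]],
    "outdoor_urban": [[0.0, 1.0], [0.1, 0.9]],
}

for name, score in rank_by_affinity(target, candidates):
    print(f"{name}: {score:.3f}")
```

Under these assumptions, the top-ranked candidate is the dataset whose pre-trained model would be the first one to try fine-tuning on the target environment. Mean pooling discards the spread of each dataset, so a distribution-level distance may be a better fit when datasets are multimodal.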

The framework is now part of our open benchmark, appears in the Asilomar 2024 paper, and is actively used when planning new digital twin deployments or foundation model experiments.

References