An expedition across the magical and treacherous realm of data

1: Data exploration 2: Data splits 3: Data shifts

It was on a warm and breezy Friday afternoon that Luca, having indulged in a heavy lunch, found himself gazing at a determined squirrel perched on his window sill. The squirrel, clutching a stubborn nutshell in her tiny paws, executed an array of maneuvers with intense focus. She began by turning the nut around, testing its angles and seams, while steadily biting and gnawing at it. Luca couldn't help but draw a parallel to the time he had spent poring over lines of code, navigating the maze of his dataset with little success. Memories came rushing back of the hours spent babysitting his model in pursuit of a reasonable loss curve. Despite his best efforts, the model that had performed so well on the reported data inexplicably and repeatedly failed on his dataset. The model doesn't generalize, it just doesn't generalize. At that very moment, a gentle breeze swept in and comforted him. Gradually, he sensed his thoughts slowing down, his focus dwindling and his eyelids growing heavier... when an unexpected touch on his shoulder jolted him.

"Come with me. Now. There's no time to explain." "Wh... What?"

Chapter 1: Explorers


And so it was that Luca now found himself in a strange and peculiar land. What he saw seemed both familiar and foreign to him, as he stared in disbelief at the boundless expanse that lay before him. Suddenly, Luca's thoughts turned to the unfinished work, his model that was probably still training, the cluttered desk and the creaky chair that he had left behind, all of which seemed a world far, far away from the one he was now standing in. He was overcome with confusion regarding the nature of his journey and the identity of the woman who had led him here. As if sensing his doubt, she proceeded to announce, "Welcome to the land of DataRealm, where all data lives. Here, information flows like rivers and mountains of code rise into the sky. This is a realm of creation and discovery, where the human spirit and the digital domain converge to form a connected whole."

Image generated by Sandy Engelhardt using Midjourney

As his eyes swept across the landscape, he faintly noticed gigantic crystalline structures standing tall in the distance, intricate with complex algorithms and databases. Interlinked nodes and circuits sprawled out like arteries, each a nexus of complex information, thoughts, and ideas. It was as if the whole world around him was drowning in the very essence of information and understanding. She continued, "Deep within the heart of this land lies the Archive of Wisdom, a treasure of all that humanity has thus far learnt from data. I notice that you seem troubled by your journey. Worry not, I am going to guide you through it."

His companion exuded a calming presence; one got the feeling she had seen countless travelers like him before and knew exactly how to comfort them. With a gentle grasp of his hand, she guided him towards a towering structure that loomed before them, pulsating with energy and emanating a soft, harmonious hum that permeated the city behind it. Luca, initially puzzled, began to realize the significance of their journey. "To embark on this journey, you must first pass through these gates. These gates stand in memoriam to all the data that has been misused, neglected, tortured, and sacrificed to fit a narrative." Taking his hands and placing them gently upon the gates, she added, "Remember what they stand for, so that you shall not repeat the same mistakes."

As Luca's hands met the gates, a vivid spectacle enveloped his senses. Swirling histograms and pair plots surrounded him, and he caught faint echoes of a voice, which might have been his companion's or something altogether different; he could not discern. "Your journey is one of exploration," the voice resonated. "Before surrendering your data to a model, you shall embark on the quest to comprehend it. Delve into its distribution, and keep an eye out for glaring anomalies. Visualizing your data serves as a powerful tool for unearthing hidden insights. The seaborn Python library provides a wealth of such visualization resources, including seaborn.histplot and seaborn.pairplot. And when the need arises, you shall visualize the features of your data with principal component analysis or t-SNE visualizations. These methods not only aid in detecting outliers, but also unveil patterns in the data, helping you identify and select the most informative features."

"You shall strive to see if your data aligns with the fundamental assumptions that you made about your world. Be vigilant about potential spurious correlations in your dataset. Should such correlations arise, conduct a deeper examination to explore potential causal factors or hidden confounders that could underlie these observed effects." The voice then gradually faded, leaving Luca immersed in silence as he grappled with the unfolding of events. Luca found himself engulfed by anticipation and wonder, pondering whether this was a fleeting glimpse of the Archive of Wisdom.

At that moment, Luca's companion broke the silence, inquiring, "Did you grasp the essence of the message?" Luca, still somewhat befuddled as he processed the information, struggled to formulate a response. Observing his confusion, she offered further clarity, "Allow me to provide you with a few examples." She began, "Consider, for instance, the research study that reported a noteworthy linear correlation (r=0.791, P<0.0001) between a country's per capita chocolate consumption and its number of Nobel laureates per capita, with the laureate count serving as a stand-in for cognitive function. However, a deeper look into the underlying factors that influence both observed variables is required before concluding that increased chocolate consumption explains improved cognitive function. Similarly, take the example of a rule-based model that learned that a history of asthma was associated with a lower risk of death from pneumonia. Clearly, this correlation is counterintuitive. In this case, a deeper look at the possible causal factors reveals that the asthmatic patients were admitted directly to the intensive care unit and received more intensive care, which lowered their risk compared to the rest of the population." Their discourse was abruptly interrupted by an intense cacophony that sounded like the turning of metal cogs, drawing their attention to the gates that were gradually swinging open.
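A hidden confounder of this kind can be imitated with a small simulation. The sketch below is an illustration under invented assumptions, not a reanalysis of either study: a single unobserved variable drives two otherwise unrelated measurements, and the coefficients and the residualize helper are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(42)
n_samples = 5000

# Hypothetical hidden confounder (think of something like a country's wealth).
confounder = rng.normal(size=n_samples)

# Two observed variables that both depend on the confounder but not on each other.
x = 2.0 * confounder + rng.normal(size=n_samples)
y = 1.5 * confounder + rng.normal(size=n_samples)


def residualize(values, covariate):
    """Remove the linear effect of `covariate` from `values` via least squares."""
    design = np.column_stack([np.ones_like(covariate), covariate])
    coef, *_ = np.linalg.lstsq(design, values, rcond=None)
    return values - design @ coef


# The raw correlation looks strong although x has no effect on y.
raw_corr = np.corrcoef(x, y)[0, 1]

# After accounting for the confounder, the association largely vanishes.
partial_corr = np.corrcoef(
    residualize(x, confounder), residualize(y, confounder)
)[0, 1]

print(f"raw correlation:     {raw_corr:.2f}")
print(f"partial correlation: {partial_corr:.2f}")
```

In practice the confounder is often not measured at all, which is why domain knowledge about plausible causal factors matters as much as the statistics.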

Correlation between Countries' Annual Per Capita Chocolate Consumption and the Number of Nobel Laureates per 10 Million Population.

Chapter 2: A balancing act


The two travellers embarked upon their journey into the heart of the realm, where the Archive of Wisdom was said to reside. Luca's head buzzed with the newfound knowledge he had acquired. Meanwhile, his companion, her focus now shifted to their immediate surroundings, turned towards him with a sense of urgency. In an anxious tone, she explained, "Luca, we're about to enter the enchanted forests of the data realm, a crucial part of our journey. These forests are teeming with diverse creatures, some elusive, others freely roaming. While we're explorers here, we're not alone. Our task is to observe, learn, and understand this unique world, but navigating these dense woods requires assistance. The guides of this forest will help us learn and imbibe what these forests have to offer. Our goal is to finally meet the guardians, who sit at its center, guarding its ancient secrets. They remain unseen to our eye. Only when we've learned enough from the guides can we hope to meet the guardians and gain access to the coveted Archive of Wisdom."

She paused, then continued with an analogy for Luca's better understanding, "Let me draw a parallel, Luca. Think of the training dataset as our exploration of the learning space, where we extract knowledge from our data. The performance on the validation set guides us in this process. Ultimately, we evaluate our model's performance on an entirely unseen test dataset. However, it's critical to assign roles wisely, akin to appointing explorers, guides, and guardians. Neglecting this could disrupt the entire ecosystem. You must ensure that these groups—equivalent to your train, validation, and test sets—are completely separate, with no knowledge of each other. In your context, this means avoiding data leakage; for instance, data from the same patient or frames from the same surgery should not be split across different sets."
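One way to keep explorers, guides, and guardians apart is to split by patient rather than by individual sample. The following is a minimal sketch assuming scikit-learn's GroupShuffleSplit; the patient IDs, feature matrix, and split fractions are made up for illustration.

```python
import numpy as np
from sklearn.model_selection import GroupShuffleSplit

rng = np.random.default_rng(0)

# Hypothetical data: 20 patients with 10 frames each.
patient_ids = np.repeat(np.arange(20), 10)
X = rng.normal(size=(len(patient_ids), 4))
y = rng.integers(0, 2, size=len(patient_ids))

# Step 1: hold out 20% of the patients as the test set.
outer = GroupShuffleSplit(n_splits=1, test_size=0.2, random_state=0)
trainval_idx, test_idx = next(outer.split(X, y, groups=patient_ids))

# Step 2: split the remaining patients into training and validation sets.
inner = GroupShuffleSplit(n_splits=1, test_size=0.25, random_state=0)
train_rel, val_rel = next(
    inner.split(X[trainval_idx], y[trainval_idx], groups=patient_ids[trainval_idx])
)
train_idx, val_idx = trainval_idx[train_rel], trainval_idx[val_rel]

# No patient may appear in more than one set.
train_patients = set(patient_ids[train_idx].tolist())
val_patients = set(patient_ids[val_idx].tolist())
test_patients = set(patient_ids[test_idx].tolist())
print(len(train_patients), len(val_patients), len(test_patients))
```

The same idea applies to frames from one surgery: the surgery identifier would simply take the place of the patient ID as the group label.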

"But that's not all," she continued, emphasizing the importance of balance, "these creatures within our forest may come from various heights, some from the high canopies, others from the forest floor, and even some from beneath the earth. To maintain the harmony of this ecosystem, explorers, guides, and guardians must be proportionally represented, just as in the concept of stratification. You must have encountered the sklearn library, particularly the function sklearn.model_selection.StratifiedShuffleSplit, right? Essentially, the distribution or the composition of the different classes or species within your folds must closely resemble that of your entire dataset. Moreover, if your data contains hierarchies, these too must be accurately mirrored within your distinct splits."
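The stratification she mentions might be sketched as follows, using the StratifiedShuffleSplit class named above. The imbalanced labels and the 25% test fraction are assumptions chosen only to make the effect visible.

```python
import numpy as np
from sklearn.model_selection import StratifiedShuffleSplit

rng = np.random.default_rng(0)

# Hypothetical imbalanced dataset: 90% of samples in class 0, 10% in class 1.
y = np.array([0] * 180 + [1] * 20)
X = rng.normal(size=(len(y), 3))

splitter = StratifiedShuffleSplit(n_splits=1, test_size=0.25, random_state=0)
train_idx, test_idx = next(splitter.split(X, y))

# The share of the minority class should be close to identical in every set.
full_ratio = y.mean()
train_ratio = y[train_idx].mean()
test_ratio = y[test_idx].mean()
print(f"full: {full_ratio:.2f}, train: {train_ratio:.2f}, test: {test_ratio:.2f}")
```

When samples are also grouped (for example by patient), scikit-learn additionally offers StratifiedGroupKFold, which tries to balance class proportions while keeping groups intact; whether it fits a given hierarchy is something to verify case by case.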

Interactive visualization of different dataset splitting techniques, with controls for the dataset ordering and the split options. Data is generated randomly on page reload.
Description: The straightforward way to split a dataset is to take the first N samples for training and the remaining samples for testing. However, this approach may introduce bias into the resulting dataset split, especially if the order of the samples entails relationships within the data. Additionally, it may create sets that do not reflect the distribution of the source dataset or do not cover all classes.
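The pitfall of taking the first N samples can be reproduced in a few lines. This sketch assumes a hypothetical dataset that happens to be stored sorted by class (the classes A, B, C and their counts are invented) and compares the naive split with a shuffled one.

```python
import numpy as np

# Hypothetical dataset stored in class order: 60 x "A", 30 x "B", 10 x "C".
labels = np.repeat(np.array(["A", "B", "C"]), [60, 30, 10])
n_train = 80


def class_counts(split):
    """Count how often each class occurs in a split."""
    values, counts = np.unique(split, return_counts=True)
    return dict(zip(values.tolist(), counts.tolist()))


# Naive split: first N samples for training, the rest for testing.
naive_train, naive_test = labels[:n_train], labels[n_train:]

# Shuffled split: permute the indices before cutting.
rng = np.random.default_rng(0)
permutation = rng.permutation(len(labels))
shuffled_train = labels[permutation[:n_train]]
shuffled_test = labels[permutation[n_train:]]

print("naive train:   ", class_counts(naive_train))
print("naive test:    ", class_counts(naive_test))
print("shuffled train:", class_counts(shuffled_train))
print("shuffled test: ", class_counts(shuffled_test))
```

With the sorted data, the naive training set never sees class C and the naive test set contains no class A at all. Shuffling alone already helps, although only stratification keeps the class proportions close to those of the full dataset.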

"Let me give you a more concrete, analogous use case from surgical workflow analysis, something you may be familiar with. Here, multiple aspects of a surgery come into the picture, such as the surgical phases and the transitions between them, the occurrence and co-occurrence of different surgical instruments, and the duration or difficulty of a surgery. Once the hierarchical relations between these aspects are also considered, creating representative splits of surgical workflow data can be quite a challenge. Here, visualisation tools such as Endovis-ML help in interactively exploring the splits of the data and ensuring that no classes are unrepresented."

"There is something else you should know about these enchanted forests. They are home to various creatures, but some species are rare and elusive, while others are abundant. Sometimes, the people seated in the Archive of Wisdom intentionally multiply the representation of the rarer species to maintain the harmony of the forests and allow the minority to thrive. This is akin to the oversampling that you perform, where you artificially generate instances of the minority class of a dataset to make it more proportionate to the majority class. Sometimes we also banish some creatures that are rampant, to ensure balance. This would be similar to undersampling, where the number of instances from the majority class is reduced, at times randomly, to create a balanced representation and minimize the bias. Take, for example, the case of detecting a rare disease from MRI images. The number of images that contain a tumor may be very limited, which may lead your model to be biased towards the majority class, in this case no tumor. This would eventually lead to a misleading evaluation, in particular a lower sensitivity, and missed detections as a result. There are multiple ways to oversample, such as data augmentation or techniques like SMOTE (Synthetic Minority Over-sampling Technique), which creates new synthetic instances by interpolating between existing minority class instances."
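The interpolation idea behind SMOTE, together with random undersampling, can be sketched in plain NumPy. This is a simplified illustration and not the reference SMOTE implementation: the function smote_like_oversample, its parameters, and the toy data are all hypothetical.

```python
import numpy as np


def smote_like_oversample(minority, n_new, k=3, seed=0):
    """Create `n_new` synthetic points by interpolating between a minority
    sample and one of its `k` nearest minority neighbours (simplified SMOTE)."""
    rng = np.random.default_rng(seed)
    # Pairwise Euclidean distances within the minority class.
    dists = np.linalg.norm(minority[:, None, :] - minority[None, :, :], axis=-1)
    np.fill_diagonal(dists, np.inf)  # a point is not its own neighbour
    neighbours = np.argsort(dists, axis=1)[:, :k]

    base = rng.integers(0, len(minority), size=n_new)
    chosen = neighbours[base, rng.integers(0, k, size=n_new)]
    gap = rng.random(size=(n_new, 1))  # interpolation factor in [0, 1)
    return minority[base] + gap * (minority[chosen] - minority[base])


rng = np.random.default_rng(1)
majority = rng.normal(loc=0.0, size=(95, 2))  # e.g. "no tumor"
minority = rng.normal(loc=3.0, size=(5, 2))   # e.g. "tumor"

# Oversampling: add synthetic minority points until both classes are equal in size.
synthetic = smote_like_oversample(minority, n_new=len(majority) - len(minority))
balanced_minority = np.vstack([minority, synthetic])

# Random undersampling: keep only as many majority samples as there are minority ones.
keep = rng.choice(len(majority), size=len(minority), replace=False)
undersampled_majority = majority[keep]

print(balanced_minority.shape, undersampled_majority.shape)
```

Resampling of this kind is usually applied to the training set only, after the data has been split, so that synthetic or duplicated samples cannot leak into validation or test data.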

Chapter 3: Shifting realities


Luca's journey had been long and winding, and fatigue was beginning to wear him down. Yet, he remained resolute, determined to see it through to its conclusion. His confusion had given way to gratitude and deep appreciation for the wealth of knowledge he was gaining. Life often introduces unexpected benefactors who profoundly change one's life for the better, and Luca had come to realize that his companion was one such benefactor. Together, they had left the enchanted forests behind and now ventured into the valleys, drawing closer to their ultimate destination. As Luca took his first step into the valley, he felt the ground beneath him shifting. He turned to his companion in astonishment, and she promptly offered an explanation, "We are now in the final leg of our journey, but be warned, this is no ordinary valley. Here, everything is in a constant state of flux, with these terrains continually shifting. I have a map that will guide you to the Archive of Wisdom, but you'll notice that everything is constantly moving on it. Creating and maintaining such a map is no ordinary feat." To illustrate her point, she continued with an analogy, "Let me provide you with an example. No matter how diligently you train your model on a dataset, its performance may decline over time due to data shifts. Just as these terrains keep moving, data can change in unpredictable ways. It's a challenge we must be prepared for on our journey."

"Take, for instance, this example, where you have the task of predicting tumor volume (Y) using features learned from CT scans (X). The appearance of the scan is also influenced by, say, the contrast agent (M). Here, sampling the data independently of these parameters would avoid a shift in the data, whereas sampling based on the images (rejecting low-resolution images) or on the contrast agent (rejecting scans with a low dose of contrast agent) may introduce a bias or a data shift, thereby impacting your model's performance on test data. Analogously, you can think of the task of maintaining a moving map that will guide you along the right path to the archive at all times. To do this, you can constantly take snapshots of the valley and keep track of the changes. But if the snapshots are all taken at a particular time of day, or from one place, this may introduce bias into your map, because it doesn't accurately reflect the true nature of the ever-changing environment. One way to tackle this is to importance-weight your data, for example according to how often you expect to observe each feature in the target distribution. For instance, the attributes that are rarely observed in the source dataset but are more likely to be seen in the target dataset are assigned more importance. Another approach is to use uncertainty estimation to make the model robust to the shifts that are expected to be observed. I've taught you enough, Luca; it's time for me to go."
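Importance weighting of this kind can be illustrated with a single discrete attribute. In the sketch below, the dose levels, their source counts, the assumed target frequencies, and the per-sample losses are all hypothetical; each sample is weighted by the ratio of the target frequency to the source frequency of its attribute value.

```python
import numpy as np

# Hypothetical source dataset: contrast dose level recorded for each scan.
# Low-dose scans are rare here because many of them were rejected during selection.
source_attr = np.array(["high"] * 80 + ["low"] * 20)

# Assumed frequencies in the target population (an assumption, not measured data).
target_freq = {"high": 0.5, "low": 0.5}

# Empirical frequencies in the source dataset.
values, counts = np.unique(source_attr, return_counts=True)
source_freq = dict(zip(values.tolist(), (counts / counts.sum()).tolist()))

# Importance weight per sample: w = p_target(attribute) / p_source(attribute).
weights = np.array([target_freq[a] / source_freq[a] for a in source_attr])

# Hypothetical per-sample losses: the model does worse on low-dose scans.
per_sample_loss = np.where(source_attr == "low", 2.0, 1.0)

unweighted_loss = per_sample_loss.mean()
weighted_loss = np.average(per_sample_loss, weights=weights)
print(f"unweighted: {unweighted_loss:.2f}, importance-weighted: {weighted_loss:.2f}")
```

The weighted average matches what would be expected on the assumed target population, whereas the unweighted average is overly optimistic. Many training routines accept such weights directly, for instance through a sample_weight argument in a number of scikit-learn estimators.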

Sample selection bias: three examples. On the right are graphs giving conditional independence relations. Y is the lesion volume to be predicted (i.e., the output). M are the imaging parameters, e.g., contrast agent dosage. X is the image, and depends on both Y and M. S indicates whether data are selected to enter the source dataset (orange points) or not (blue points). The symbol ⊥⊥ means independence between variables. Preferentially selecting samples results in a dataset shift (middle and bottom row). Depending on whether Y ⊥⊥ S | X holds, the conditional distribution of Y|X (here, lesion volume given the image) estimated on the selected data may be biased or not.
Image generated by Sandy Engelhardt using Midjourney

With that simple farewell, she departed, leaving Luca to navigate the ever-changing paths of the valley. Luca, now tantalizingly close to the archive, awaited its appearance with bated breath, filled with excitement. However, when he glanced backward, his steadfast companion, who had guided him throughout, had mysteriously disappeared. She left behind a trail of cherished memories and knowledge. It was as if her purpose had been fulfilled, and her absence now served as a testament to Luca's progress on this transformative journey. As Luca's foot made contact with the ground, he stumbled backward with a resounding thud, in a sudden wave of disorientation. The world around him flickered and dimmed. When he finally opened his eyes, he found himself sprawled on the floor of his office, seemingly having fallen from his chair. The ethereal visions of the DataRealm had faded, replaced by the familiar sights of his cluttered desk. His model had now completed its training. With a thoughtful smile, Luca looked outside to find his old friend, the squirrel, still perched on the window sill, having cracked open the shell. For the briefest of moments, it almost seemed as if she had winked at him, before scampering away to savor the fruits of her labor with a glint in her eye.

Note

This article is a submission to the MICCAI Educational Challenge 2023

Acknowledgements

The authors would like to acknowledge the contributions of Lorenz Quack, Halvar Kelm, and Dale Chrislene for their valuable inputs and suggestions.