Population-Level Medical AI Using 3D Self-Supervised Learning

This is deeply personal for me. In 2020, my father, Alan, developed a stomachache. A CT scan revealed a subtle pattern—what’s called a double-duct sign. It often means something serious, but there was no visible tumor. CT, CT with contrast, MRI, even MRI with contrast—all showed the same sign, but no clear cause. After three months of scan after scan, an endoscopic ultrasound finally revealed a hidden pancreatic cancer. By the time my dad received treatment, it had spread. He fought for two years. He didn’t survive.

So, we built this AI so that his tumor would have been diagnosed the moment that first CT scan was taken. See how our AI instantly finds double-duct signs among 300,000 medical images.

TL;DR

Medical AI – Self-Supervised Discoveries

91% k‑Nearest Neighbor Accuracy — No Labels Required

Left: Before Training. Right: After Training. t-SNE clustering of 300,000 organ-image embeddings from CT, MRI (T1, T2, T1GD, FLAIR, DTI, MRA, PD), and mammogram across 130 anatomical categories, learned without labels.
See details & per‑organ accuracy

Our self‑supervised model learned from 300,000 CT, MRI, and mammogram organ images without labels, achieving 91% k‑Nearest Neighbor accuracy across 130 categories.

Organ (130 categories) ResNet18 Accuracy ViT-10 Accuracy Test Samples
Overall 91.44% 87.71% --
Adrenal Gland Left (CT) 99.54% 96.94% 434
Adrenal Gland Right (CT) 99.78% 99.12% 454
Aorta (CT) 98.14% 98.63% 483
Autochthon Left (CT) 95.20% 93.55% 500
Autochthon Right (CT) 96.27% 94.46% 510
Brain (CT) 92.86% 87.88% 28
Clavicula Left (CT) 97.92% 94.25% 96
Clavicula Right (CT) 100.00% 96.74% 85
Colon (CT) 96.23% 93.43% 504
Duodenum (CT) 95.60% 94.09% 432
Esophagus (CT) 97.89% 97.38% 379
Face (CT) 90.62% 91.84% 32
Femur Left (CT) 95.81% 90.89% 454
Femur Right (CT) 97.09% 91.46% 412
Full Torso (CT) 99.06% 98.22% 640
Gallbladder (CT) 98.45% 97.93% 386
Gluteus Maximus Left (CT) 92.34% 88.81% 444
Gluteus Maximus Right (CT) 95.73% 86.89% 422
Gluteus Medius Left (CT) 96.65% 93.28% 418
Gluteus Medius Right (CT) 97.28% 86.90% 404
Gluteus Minimus Left (CT) 98.17% 94.36% 436
Gluteus Minimus Right (CT) 99.76% 97.03% 414
Heart Atrium Left (CT) 96.35% 90.16% 137
Heart Atrium Right (CT) 95.73% 93.33% 164
Heart Myocardium (CT) 92.55% 92.13% 255
Heart Ventricle Left (CT) 87.56% 91.74% 209
Heart Ventricle Right (CT) 94.91% 94.06% 275
Hip Left (CT) 92.60% 87.95% 446
Hip Right (CT) 89.12% 86.21% 432
Humerus Left (CT) 89.71% 75.76% 68
Humerus Right (CT) 94.44% 83.33% 90
Iliac Artery Left (CT) 86.51% 81.25% 415
Iliac Artery Right (CT) 90.76% 75.06% 433
Iliac Vena Left (CT) 92.89% 93.93% 408
Iliac Vena Right (CT) 94.44% 81.31% 414
Iliopsoas Left (CT) 97.23% 94.26% 433
Iliopsoas Right (CT) 94.88% 93.07% 430
Inferior Vena Cava (CT) 99.18% 97.03% 487
Kidney Left (CT) 97.82% 95.02% 412
Kidney Right (CT) 98.84% 95.68% 430
Liver (CT) 98.46% 99.11% 456
Lung Lower Lobe Left (CT) 97.77% 95.95% 449
Lung Lower Lobe Right (CT) 97.88% 98.28% 472
Lung Middle Lobe Right (CT) 97.43% 96.67% 350
Lung Upper Lobe Left (CT) 97.24% 93.52% 398
Lung Upper Lobe Right (CT) 91.07% 94.69% 112
Pancreas (CT) 94.57% 94.20% 516
Portal Vein and Splenic Vein (CT) 97.47% 97.91% 435
Pulmonary Artery (CT) 98.00% 93.68% 100
Sacrum (CT) 99.02% 99.75% 409
Scapula Left (CT) 95.19% 87.37% 104
Scapula Right (CT) 95.70% 84.78% 93
Small Bowel (CT) 90.19% 91.22% 428
Spine Segment (CT) 95.67% 86.93% 531
Spleen (CT) 99.17% 97.68% 480
Stomach (CT) 98.29% 92.54% 469
Trachea (CT) 98.99% 94.85% 99
Tumour Colon (CT) 86.67% 15.38% 15
Tumour Lung (CT) 75.00% 30.00% 8
Tumour Pancreas (CT) 94.12% 75.86% 34
Urinary Bladder (CT) 99.74% 98.87% 390
Vertebrae C1 (CT) 93.10% 89.74% 29
Vertebrae C2 (CT) 82.50% 82.86% 40
Vertebrae C3 (CT) 66.67% 61.11% 45
Vertebrae C4 (CT) 51.52% 46.34% 33
Vertebrae C5 (CT) 52.73% 47.50% 55
Vertebrae C6 (CT) 79.25% 57.14% 53
Vertebrae C7 (CT) 78.90% 80.00% 109
Vertebrae L1 (CT) 94.49% 82.81% 490
Vertebrae L2 (CT) 92.39% 81.41% 486
Vertebrae L3 (CT) 89.10% 78.08% 477
Vertebrae L4 (CT) 91.43% 84.33% 490
Vertebrae L5 (CT) 96.58% 96.21% 439
Vertebrae L6 (CT) 0.00% 0.00% 4
Vertebrae T1 (CT) 84.75% 83.59% 118
Vertebrae T10 (CT) 78.42% 62.22% 329
Vertebrae T11 (CT) 88.11% 66.60% 454
Vertebrae T12 (CT) 93.14% 77.62% 510
Vertebrae T13 (CT) 0.00% 0.00% 1
Vertebrae T2 (CT) 80.00% 83.15% 115
Vertebrae T3 (CT) 77.88% 71.43% 113
Vertebrae T4 (CT) 66.33% 63.89% 98
Vertebrae T5 (CT) 53.39% 37.84% 118
Vertebrae T6 (CT) 47.06% 45.79% 119
Vertebrae T7 (CT) 42.50% 32.43% 120
Vertebrae T8 (CT) 41.22% 26.76% 148
Vertebrae T9 (CT) 66.67% 40.26% 204
Full Brain (DTI) 100.00% 100.00% 818
Edema Brain (FLAIR) 32.69% 67.65% 52
Enhancing Tumour Brain (FLAIR) 25.42% 23.73% 59
Full Brain (FLAIR) 69.70% 64.84% 99
Non-Enhancing Tumor Brain (FLAIR) 8.97% 12.73% 78
Full Brain (MRA) 100.00% 100.00% 73
Full Brain (PD) 95.24% 100.00% 63
Edema Brain (T1GD) 37.93% 42.19% 58
Enhancing Tumour Brain (T1GD) 9.09% 22.22% 44
Full Brain (T1GD) 51.28% 82.08% 78
Non-Enhancing Tumor Brain (T1GD) 9.09% 7.50% 66
Edema Brain (T1) 22.22% 59.70% 54
Enhancing Tumour Brain (T1) 6.90% 27.27% 58
Full Brain (T1) 55.56% 68.09% 90
Full Head (T1) 100.00% 100.00% 77
Non-Enhancing Tumor Brain (T1) 6.67% 20.37% 60
Edema Brain (T2) 16.33% 58.67% 49
Enhancing Tumour Brain (T2) 11.86% 20.00% 59
Full Brain (T2) 88.59% 91.43% 149
Non-Enhancing Tumor Brain (T2) 4.55% 15.25% 66
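A k-NN evaluation like the one reported above can be sketched in a few lines: fit a k-nearest-neighbor classifier on frozen embeddings and their category labels, then predict held-out samples. Everything below is illustrative stand-in data (random vectors, hypothetical dimensions and neighbor count), not the real embeddings:

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Stand-ins for learned organ embeddings and their 130 category labels.
rng = np.random.default_rng(0)
train_emb = rng.normal(size=(1000, 384))      # 384 is a hypothetical token dim
train_labels = rng.integers(0, 130, size=1000)
test_emb = rng.normal(size=(200, 384))

# Cosine distance is a common choice for self-supervised embeddings.
knn = KNeighborsClassifier(n_neighbors=5, metric="cosine")
knn.fit(train_emb, train_labels)
pred = knn.predict(test_emb)                  # accuracy = (pred == test_labels).mean()
```

With real embeddings and labels, the per-organ accuracies in the table come from grouping `pred == test_labels` by category.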

Anatomy Behaves Like a Network

Scale‑free graph of T1 MRI brains

Above: Each dot in this graph represents a 3D T1 MRI scan of a human brain as learned by our self-supervised model. Blue dots represent biological males, red dots represent biological females. Six degrees of separation is in effect here.

Learn more about these networks

Our self‑supervised model learned, without any labels, how these ~1,000 brains relate to one another and organized them into a small‑world, scale‑free network: about 5% of brains act as “hub‑brains” (prototypes) at the center. The rest branch outward like relations in a social network, only a few “steps” away from any other brain (think six degrees of separation).

Notice the split: the large cluster on the left is entirely male, while the large cluster on the right mixes male and female brains. We don’t yet know why — but this is the power of discovery: the model finds patterns we didn’t expect, giving us new questions to explore.

Self‑supervised learning organizes anatomy into scale‑free, small‑world graphs — the same kind of networks we see in biology and even social systems. In simple terms, the model is finding the most efficient way to connect similar organs, like neighborhoods linked by a few well‑connected “hubs.”

This isn’t something we programmed in — it happens on its own, like a phase change when water freezes. And that’s where the discoveries begin. For example, some organs, like the bladder, cluster by biological sex, while many others, like brains (see above), do not. This is tremendously interesting!

Another surprise: these learned representations don’t follow the familiar “bell curve” (Gaussian distribution) that so much of medical science assumes. Instead, they have a long tail — meaning rare, unusual cases carry far more weight than we typically expect.
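The network structure described above comes from connecting each scan to its most similar neighbors in embedding space. A minimal sketch, using random stand-in embeddings and a hypothetical k of 5, builds such a k-NN graph and inspects its degree distribution (hubs show up as a small number of unusually high-degree nodes):

```python
import numpy as np

# Stand-in embeddings: 100 scans, 64 dimensions, L2-normalized.
rng = np.random.default_rng(0)
emb = rng.normal(size=(100, 64))
emb /= np.linalg.norm(emb, axis=1, keepdims=True)

# Cosine similarity between every pair of scans; exclude self-matches.
sim = emb @ emb.T
np.fill_diagonal(sim, -np.inf)
k = 5
neighbors = np.argsort(-sim, axis=1)[:, :k]   # k nearest scans per scan

# Symmetric adjacency matrix from the k-NN lists.
adj = np.zeros((100, 100), dtype=bool)
rows = np.repeat(np.arange(100), k)
adj[rows, neighbors.ravel()] = True
adj |= adj.T

# Degree distribution: in a scale-free graph this has a long tail.
degree = adj.sum(axis=1)
```

On real embeddings, a heavy-tailed `degree` histogram (most nodes near k, a few far above it) is the signature of the hub structure described above.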

Scale‑free graph of CT vertebrae
Above: Another example of organ representations organizing into a small-world, scale-free graph. In this case, each dot represents an L1 vertebra CT image. Blue dots represent biological males, red dots represent biological females.

Medical Search: Finding a Needle in a Haystack. Searching for Patient Priors and Disease Across 300,000 Scans

Our model learns organs like fingerprints — unique to each person — opening the door to a new kind of personalized medicine. It doesn’t just group organs by type; it recognizes them at the level of the individual. This makes for a search engine so accurate that our model can find individuals in a population using any one of their organs as a search query.

L1 Vertebra with bone island fingerprint

Above: An organ (outlined in bold) is used as a search query. Our model finds the most similar results (indicated by the arrows). The left query image is an L1 vertebra with a bone island (purple outline); other bone-island L1 vertebrae are returned. The right query image shows a pancreas with a double-duct sign (yellow outline); other double-duct-sign instances are returned. Searching through 300,000 organs takes only about 0.0005 seconds, which enables population-scale search.

See additional cases & results

To expand on how remarkable this is: one patient in our dataset has 4 torso CT scans. From one scan we select 45 different organs, including L-vertebrae, pancreas, gallbladder, duodenum, stomach, left and right adrenal glands, heart ventricles, kidneys, psoas, glutes, and more — a good variety of organs. Since this patient has 4 scans and the query scan is held out, there are 3 instances of each of his organs in our 300k-organ dataset. Our search returns all 3 other instances of each organ as the top 3 results 65% of the time, and this is with a tiny research-sized model. To reiterate just how sensitive the search is, our model finds this patient's L1–L5 vertebrae as the top 3 results 100% of the time. So, inputting his L3 vertebra as a search query, the top 3 results are his other three L3 vertebrae — and our dataset contains ~6,000 L3 vertebrae and 52,000 vertebrae of all types. The model doesn't just organize by organ type; it clusters anatomy at the level of the individual.
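Sub-millisecond search over a population of embeddings is possible because cosine-similarity lookup over normalized vectors reduces to a single matrix–vector product. A scaled-down sketch with random stand-in data (the database size and dimension here are illustrative, not ours):

```python
import numpy as np

# Stand-in database of L2-normalized organ embeddings.
rng = np.random.default_rng(0)
db = rng.normal(size=(50_000, 128)).astype(np.float32)
db /= np.linalg.norm(db, axis=1, keepdims=True)

def search(query, top_k=3):
    """Return indices of the top_k most similar organs by cosine similarity."""
    q = query / np.linalg.norm(query)
    scores = db @ q                               # one matrix-vector product
    return np.argpartition(-scores, top_k)[:top_k]

# Querying with an organ already in the database returns it as a top hit.
hits = search(db[42])
```

At larger scales the same idea is usually served by an approximate-nearest-neighbor index, but even brute force stays fast because the whole search is one BLAS call.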

How the Model Sees: Patch‑Level Visualizations

Above: These videos show how the Vision Transformer “sees” medical images. Each pane uses patch‑level Principal Component Analysis (PCA) to project the model’s internal representations into visible colors, revealing how it groups and interprets different anatomical structures. The videos scroll frame‑by‑frame through the 3D scans along the depth axis, giving a window into the model’s learned organization of the data.
Learn more about these visualizations

In each video, multiple panes are shown from left to right. The first pane displays the original scan image, while each successive pane shows three principal components: the first component mapped to the red channel, the second to green, and the third to blue. Moving across the panes from left to right, the principal components increase in groups of three (e.g., components 1–3, then 4–6, then 7–9, and so on).
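The per-pane projection described above can be sketched as follows, assuming a hypothetical 14×14 grid of 384-dimensional patch tokens from one slice (random stand-ins here): center the tokens, take the leading principal components via SVD, normalize each component to [0, 1], and reshape three of them into an RGB image:

```python
import numpy as np

# Stand-in ViT patch tokens for one slice: a 14x14 grid, 384-dim features.
rng = np.random.default_rng(0)
tokens = rng.normal(size=(14 * 14, 384))

# PCA via SVD on the centered tokens; keep the first three components.
centered = tokens - tokens.mean(axis=0)
_, _, vt = np.linalg.svd(centered, full_matrices=False)
proj = centered @ vt[:3].T                        # (196, 3)

# Map each component to [0, 1] and interpret the triple as RGB.
lo, hi = proj.min(axis=0), proj.max(axis=0)
proj = (proj - lo) / (hi - lo + 1e-8)
rgb = proj.reshape(14, 14, 3)
```

Successive panes simply use `vt[3:6]`, `vt[6:9]`, and so on in place of `vt[:3]`.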

The videos include a BIRADS 5 tomosynthesis (3D) mammogram, a T1‑weighted contrast‑enhanced (T1Gd) MRI of the brain, a CT of a left kidney, and a CT of a liver.

It's especially obvious in the liver video that the model learns to focus on the correct anatomy despite having no labels or annotations during training — this focus emerges automatically.

Segmentation Without Skip Connections — 94 % DICE

ViT‑based Segmentation Output
Above: Vertebrae, pelvis, femurs, sacrum, and psoas segmented from a full torso CT scan. To build our 300,000-organ image dataset, full torso CT scans were segmented into 104 organs using our custom segmentation model, which connects a Vision Transformer encoder to a shallow convolutional decoder.
Learn more about the segmentation model

The U-Net architecture was a game‑changer for medical image segmentation. It uses a clever design with an encoder–decoder setup connected by “skip connections,” which pass information directly from the early layers of the network to the later ones. This gives the model strong “hints” about what the correct segmentation should look like. But these shortcuts come with downsides: they can cause the model to rely too heavily on those direct connections, making it less flexible when dealing with unfamiliar data. They can also introduce small visual errors—often called “floaters”—in the final results.

To overcome these issues, we built a new segmentation model that removes skip connections entirely. This forces the network to build a deeper understanding of the images it sees, improving its ability to handle new, unseen data and reducing visual artifacts in its outputs.
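A minimal sketch of this idea — not the authors' model — is a transformer encoder over image patches feeding a shallow convolutional decoder directly, with no skip connections from encoder to decoder. All sizes here (64×64 inputs, 8×8 patches, 128-dim tokens, 104 classes) are illustrative:

```python
import torch
import torch.nn as nn

class PatchSegmenter(nn.Module):
    """Transformer encoder over patches + shallow conv decoder, no skips."""
    def __init__(self, img=64, patch=8, dim=128, classes=104):
        super().__init__()
        # Patchify: one conv with stride = patch size embeds each patch.
        self.embed = nn.Conv2d(1, dim, kernel_size=patch, stride=patch)
        layer = nn.TransformerEncoderLayer(dim, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        # Shallow decoder: upsample patch tokens straight to pixel resolution.
        self.decoder = nn.Sequential(
            nn.Upsample(scale_factor=patch, mode="bilinear"),
            nn.Conv2d(dim, classes, kernel_size=3, padding=1),
        )

    def forward(self, x):
        t = self.embed(x)                         # (B, dim, g, g)
        b, d, g, _ = t.shape
        t = self.encoder(t.flatten(2).transpose(1, 2))
        t = t.transpose(1, 2).reshape(b, d, g, g)
        return self.decoder(t)                    # (B, classes, H, W)

out = PatchSegmenter()(torch.randn(2, 1, 64, 64))
```

Because the decoder sees only the encoder's output tokens, the network must carry all spatial detail through the transformer itself rather than through shortcut connections.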

Our model also uses a Vision Transformer—a type of neural network that can “look” at an image as a whole, not just piece by piece. This gives it a global perspective and opens the door for future features like prompt‑based segmentation. Even without prompting, our model achieves a 94% DICE score on the TotalSegmentator dataset, performing on par with leading models like nnU-Net but with greater consistency.
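For reference, the DICE score is the overlap metric 2|A∩B| / (|A| + |B|) between a predicted mask and the ground truth; a minimal implementation on binary masks:

```python
import numpy as np

def dice(pred, target, eps=1e-8):
    """DICE overlap between two binary masks: 2|A∩B| / (|A| + |B|)."""
    inter = np.logical_and(pred, target).sum()
    return (2.0 * inter + eps) / (pred.sum() + target.sum() + eps)

# Two overlapping 4x4 squares: 16 px each, 9 px shared -> 18/32 = 0.5625.
a = np.zeros((8, 8), dtype=bool); a[2:6, 2:6] = True
b = np.zeros((8, 8), dtype=bool); b[3:7, 3:7] = True
score = dice(a, b)
```

Multi-organ scores like the 94% above are typically the average of this quantity over all organ classes.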

The system is built for scale, supporting distributed training and inference using PyTorch. We’ll be open‑sourcing the model soon so others can explore and build on it.