As a doctor with plenty of clinical images, can I make use of them?
Many clinical specialties accumulate large numbers of images: endoscopic images, dermatology photographs, fundus photographs, ultrasound screenshots, wound photographs, pathology images, intraoperative photos, oral cavity photographs, hysteroscopic images, cystoscopic images, nasopharyngoscopic images, and more.
So it is natural to ask:
“I have so many images — can I build an AI model?”
“Can it predict disease?”
“Can it predict treatment response?”
“Can I publish a machine learning paper?”
But clinical images are not raw material that can simply be thrown into an “AI pot” and cooked into a paper.
An image is not just an image. Behind it are questions such as who took it, when it was taken, why it was taken, how it was taken, what happened before the image was taken, and what happened afterward.
If these questions are not clarified, a model may appear accurate, but in reality it may only be learning “which device took the image,” “which clinician took the image,” or “which ward has a more standardized workflow,” rather than learning the disease itself.
This article uses clinician-friendly language to walk through the key issues that should be clarified before starting a clinical image-AI study.
1. Do Not Start with AI. Start by Asking: Where Did This Image Come From?
Clinical images do not appear randomly in nature. They are generated by a workflow.
For example:
- Dermatology photographs may be taken by clinicians or by patients using smartphones.
- Endoscopic images may be captured when the operator sees a representative lesion.
- Ultrasound images may be selected by the sonographer as the “best” or “most representative” view.
- Wound photographs may be taken repeatedly only in severe, complex, or non-healing cases.
- Hysteroscopic images may be taken according to a fixed protocol, or more images may be saved only when abnormalities are found.
This leads to an important point:
An image is not simply a “record of a lesion.” It is also a record of clinical behavior.
In other words, images reflect not only disease, but also the habits of the image-taker, the device, the workflow, patient selection, and the culture of the clinical department.
A Simple Example
Suppose that in a gastroenterology unit, senior endoscopists tend to take very clear images of suspicious lesions, while junior endoscopists only capture a few casual screenshots.
If an AI model is trained to identify early cancer, it may learn:
“Images that are clearer, more centered, and more like teaching slides are more likely to be early cancer.”
This is not a true disease pattern. It is bias introduced by image-taking behavior.
So the First Step Is Not Modeling, but Auditing the Image Source
Ask first:
- Which devices were used to capture the images?
- Do the images span different years, clinicians, wards, or clinical units?
- How many images were taken per patient?
- Were images taken according to a fixed protocol, or only when the clinician considered something important?
- Were normal cases imaged in the same way?
- Are the saved images complete, or only “representative images”?
- Was the preliminary diagnosis already known at the time of image capture?
- Why are some images of poor quality? Is poor quality random, or are complex cases more likely to have poor images?
These questions may seem trivial, but they determine whether the model learns “disease” or “workflow.”
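A first audit can be as simple as counting. The sketch below is only an illustration: the records and every field name (`patient_id`, `device`, `operator`, `label`) are hypothetical, and in practice they would come from your own image log. It tabulates how many images each device or operator contributed and what share of them carry a positive label, plus how many images were saved per patient.

```python
from collections import Counter

# Hypothetical image-level records; every field name here is an assumption.
records = [
    {"patient_id": "P01", "device": "scope_A", "operator": "senior", "label": 1},
    {"patient_id": "P01", "device": "scope_A", "operator": "senior", "label": 1},
    {"patient_id": "P01", "device": "scope_A", "operator": "senior", "label": 1},
    {"patient_id": "P02", "device": "scope_B", "operator": "junior", "label": 0},
    {"patient_id": "P03", "device": "scope_B", "operator": "junior", "label": 0},
    {"patient_id": "P03", "device": "scope_B", "operator": "junior", "label": 1},
]


def audit_by(records, field):
    """Per value of `field`: number of images and share labeled positive."""
    totals = Counter()
    positives = Counter()
    for record in records:
        totals[record[field]] += 1
        positives[record[field]] += record["label"]
    return {
        value: {
            "n_images": totals[value],
            "positive_rate": positives[value] / totals[value],
        }
        for value in totals
    }


def images_per_patient(records):
    """Number of saved images for each patient."""
    return Counter(record["patient_id"] for record in records)


for field in ("device", "operator"):
    print(field, audit_by(records, field))
print(images_per_patient(records))
```

If the positive rate differs sharply between devices or operators, as it does in this toy data, a model can score well by recognizing the device or the operator's style. That is a warning sign to investigate, not proof of bias.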
2. Image Quality Is Not a Minor Issue: Blur, Reflection, and Angle Can All Introduce Bias
Many clinicians say:
“We can just exclude low-quality images.”
That sounds reasonable, but it requires caution.
If poor image quality occurs randomly, exclusion may not be a major issue.
But if poor-quality images are concentrated in specific types of patients, excluding them can introduce bias.
For example:
- Ultrasound images may be poorer in patients with obesity.
- Endoscopic images may be blurrier in patients with more bleeding.
- Severe inflammation may make the field of view cloudy.
- Pediatric patients may be less cooperative, making standardized photos harder to obtain.
- Hidden lesion locations may lead to poorer image angles.
In these cases, “excluding low-quality images” may be equivalent to “excluding more difficult and more clinically realistic cases.”
Suggested Approach
Do not simply write “poor-quality images were excluded.” Instead, design an image-quality grading scheme:
| Quality Dimension | Example |
|---|---|
| Sharpness | Clear / mildly blurred / severely blurred |
| Obstruction | No obstruction / partial obstruction / severe obstruction |
| Reflection | No obvious reflection / mild reflection / affects interpretation |
| Completeness of field of view | Target region fully visible / partially visible / not assessable |
| Standard view or angle | Yes / no / uncertain |
The goal is not to pursue perfect images, but to understand:
Can the model handle the imperfect images that occur in real clinical practice?
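Before excluding anything, it helps to check who would be excluded. The following is a minimal sketch under assumed names: `graded_images`, the `bleeding` flag, and the `sharpness` values are all made up for illustration, with sharpness following the grading table above. It compares the share of images that an exclusion rule would drop in each patient group.

```python
from collections import Counter

# Hypothetical graded images. "bleeding" stands in for any patient
# characteristic of interest; "sharpness" uses the grades from the table.
graded_images = [
    {"image_id": 1, "bleeding": True, "sharpness": "severely blurred"},
    {"image_id": 2, "bleeding": True, "sharpness": "severely blurred"},
    {"image_id": 3, "bleeding": True, "sharpness": "severely blurred"},
    {"image_id": 4, "bleeding": True, "sharpness": "clear"},
    {"image_id": 5, "bleeding": False, "sharpness": "severely blurred"},
    {"image_id": 6, "bleeding": False, "sharpness": "clear"},
    {"image_id": 7, "bleeding": False, "sharpness": "clear"},
    {"image_id": 8, "bleeding": False, "sharpness": "mildly blurred"},
]


def is_low_quality(image):
    """Example exclusion rule: drop only severely blurred images."""
    return image["sharpness"] == "severely blurred"


def exclusion_rate_by_group(images, group_field, is_excluded):
    """Share of images in each group that the exclusion rule would remove."""
    totals = Counter()
    excluded = Counter()
    for image in images:
        group = image[group_field]
        totals[group] += 1
        if is_excluded(image):
            excluded[group] += 1
    return {group: excluded[group] / totals[group] for group in totals}


rates = exclusion_rate_by_group(graded_images, "bleeding", is_low_quality)
print(rates)
```

In this toy data the rule removes three quarters of the images from bleeding patients and one quarter of the others, so exclusion would not be random. Reporting such a table alongside the exclusion criteria makes the selection visible to readers.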
3. Human Annotation, AI Annotation, and Deep Features Are Not the Same Thing
Clinical images usually enter a study in one of three ways.
Type 1: Structured Annotation by Clinicians
A clinician reviews the image and converts visual findings into structured variables.
For example:
- Number of polyps: single / multiple
- Lesion border: clear / unclear
- Ulcer base: clean / exudative / necrotic
- Endoscopic inflammation: none / mild / moderate / severe
- Mass morphology: regular / irregular
- Vascular appearance: sparse / rich / abnormal
This is like translating a clinician’s visual impression into tabular language.
The advantage is clinical interpretability.
The disadvantage is that it is time-consuming, and different clinicians may not agree.
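Agreement between two clinicians on a categorical variable is commonly summarized with Cohen's kappa, which corrects raw agreement for the agreement expected by chance: kappa = (observed - expected) / (1 - expected). The sketch below computes it with the standard library only. The two rating lists are invented for illustration.

```python
from collections import Counter


def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters labeling the same items."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("both raters must label the same non-empty item list")
    n = len(rater_a)
    # Observed agreement: share of items given the same label by both raters.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement: from each rater's own label frequencies.
    count_a = Counter(rater_a)
    count_b = Counter(rater_b)
    expected = sum(count_a[label] * count_b[label] for label in count_a) / (n * n)
    if expected == 1.0:
        # Both raters used one identical label throughout; kappa is undefined.
        return float("nan")
    return (observed - expected) / (1.0 - expected)


# Hypothetical "lesion border" ratings from two clinicians on ten images.
clinician_1 = ["clear"] * 5 + ["unclear"] * 5
clinician_2 = ["clear"] * 4 + ["unclear"] * 5 + ["clear"]

kappa = cohens_kappa(clinician_1, clinician_2)
print(round(kappa, 2))
```

Here the clinicians agree on 8 of 10 images, chance agreement is 0.5, and kappa is about 0.6. For ordered grades such as none / mild / moderate / severe, a weighted kappa is usually more appropriate than this unweighted version.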
Type 2: AI-Based Structured Phenotype Extraction
Here, AI does not directly predict the clinical outcome. Instead, it first learns to describe the image.
For example, it may output:
- Lesion size
- Lesion area proportion
- Whether lesions are multiple
- Whether the border is unclear
- Whether hyperemia is present
- Whether edema is present
- Whether necrosis is present
- Vascular density
- Color distribution
- Texture heterogeneity
This is like giving AI a “resident physician task”:
Do not make the final diagnosis yet. First describe the objective findings clearly.
This approach is particularly suitable for clinical research because it connects clinician language with machine-readable language.
Type 3: AI-Extracted Deep Features
The model does not tell you what specific structures it sees. Instead, it compresses the image into a numerical vector.
You can think of this as:
AI generates a “visual fingerprint” for each image.
This visual fingerprint may contain information about color, texture, edges, shape, spatial relationships, and other visual patterns, but it may not translate neatly into familiar human terms such as “hyperemia,” “edema,” or “necrosis.”
The advantage is that it may capture complex patterns clinicians do not notice.
The disadvantage is that it is difficult to interpret, and clinicians may be less willing to trust it.
4. When Is It Worth Letting AI Read Images? First Ask Whether Clinicians Can Describe the Findings Reliably
Not every image task needs AI.
A practical way to decide is to consider four scenarios.
| Scenario | Example | Is AI Worth It? |
|---|---|---|
| Clinicians can describe it easily and reliably | Single/multiple, left/right, obvious lesion present or absent | Limited AI value; human annotation may be sufficient |
| Clinicians can describe it, but it is subjective | Mild/moderate/severe inflammation, hyperemia, edema, irregular border | AI may help standardize the description |
| Clinicians struggle to quantify it reliably | Area proportion, color distribution, texture complexity, vascular density | AI has high value |
| Clinicians can do it, but the workload is too high | 4,000 cases, 10 images per case, all requiring structured annotation | AI can serve as a batch pre-annotation tool |
An Analogy
If the task is “Is there a cat in this image?”, humans can do it easily, and AI mainly saves time.
If the task is “Do the cat’s fur color, body shape, posture, gaze, and background lighting jointly predict its health status?”, then more systematic image quantification is needed.
Clinical images are similar.
If clinicians can already describe a feature reliably, quickly, and at low cost, AI is not essential.
But if clinician descriptions are inconsistent, unquantified, absent from the medical record, or difficult to scale, AI becomes meaningful.
5. Image Annotation Has Multiple Levels: Do Not Simply Say “Train an AI Model”
“Annotating images” has different levels. Different annotation types correspond to different model types.
| Annotation Type | What the Clinician Does | Suitable Question | Common Models |
|---|---|---|---|
| Image-level binary label | Label the whole image: disease / no disease | Does this image contain the target lesion? | ResNet, EfficientNet, DenseNet, ViT |
| Image-level multiclass label | Label which class the whole image belongs to | Polyp / myoma / inflammation / normal? | ResNet, EfficientNet, ConvNeXt, ViT |
| Bounding-box annotation | Draw a rectangle around the lesion | Where is the lesion? | YOLO, Faster R-CNN, RetinaNet |
| Pixel-level annotation | Carefully trace the lesion boundary | What are the lesion area, border, and shape? | U-Net, DeepLabV3+, Mask R-CNN |
| Severity grading | Grade inflammation, hyperemia, or edema | How severe is the visual finding? | CNN classification models, ordinal models |
| Multi-instance annotation | Multiple images from one patient correspond to one patient-level outcome | What is the patient-level risk? | Multiple Instance Learning, attention pooling |
The Most Common Mistake
A patient may have many images — for example, 10 images.
If these images are randomly split into training and testing sets, a serious problem can occur:
Image 1 from the same patient is in the training set, while Image 2 is in the test set.
In that case, the model may not be learning the disease. It may simply be recognizing the same patient, the same device, or the visual style of the same examination session.
The correct approach is:
Split the dataset at the patient level.
All images from the same patient must belong to only one of the following: training set, validation set, or test set.
This is extremely important.
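The rule above can be sketched in a few lines. This is a minimal illustration, not a full pipeline: the `patient_id` field name, the 70/15/15 fractions, and the fixed seed are assumptions, and it does not stratify by outcome or by center. The key idea is to shuffle patients, not images, and then send every image to its patient's set.

```python
import random


def split_by_patient(image_records, fractions=(0.7, 0.15, 0.15), seed=0):
    """Assign whole patients to train/val/test, then route their images."""
    patients = sorted({record["patient_id"] for record in image_records})
    rng = random.Random(seed)  # fixed seed so the split is reproducible
    rng.shuffle(patients)

    n_train = int(round(fractions[0] * len(patients)))
    n_val = int(round(fractions[1] * len(patients)))

    assignment = {}
    for index, patient in enumerate(patients):
        if index < n_train:
            assignment[patient] = "train"
        elif index < n_train + n_val:
            assignment[patient] = "val"
        else:
            assignment[patient] = "test"

    splits = {"train": [], "val": [], "test": []}
    for record in image_records:
        splits[assignment[record["patient_id"]]].append(record)
    return splits


# Hypothetical dataset: 20 patients with 3 images each.
images = [
    {"patient_id": f"P{p:02d}", "image_index": i}
    for p in range(20)
    for i in range(3)
]
splits = split_by_patient(images)

patient_sets = {
    name: {record["patient_id"] for record in subset}
    for name, subset in splits.items()
}
print({name: len(ids) for name, ids in patient_sets.items()})
```

A useful habit is to assert that the three patient sets do not overlap before any training starts, and to save the assignment so that later experiments reuse exactly the same split.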
6. The First Sentence of an Image Study Should Be: What Is My Clinical Question?
Clinical image-AI studies often fail when they start from technology:
“I have images, so I will train a CNN.”
A better starting point is:
“What clinical question am I trying to answer? What type of information in the image might help answer it?”
Examples:
Example 1: Dermatology
Clinical question: Does this skin lesion require biopsy?
Image purpose: Identify border irregularity, color heterogeneity, shape irregularity, number of colors, and change in diameter.
Suitable task: Image classification + structured phenotype extraction.
Example 2: Wound Care
Clinical question: Will this diabetic foot ulcer heal within the next 4 weeks?
Image purpose: Quantify wound area, granulation tissue proportion, necrotic tissue proportion, exudate, and epithelialization at the edge.
Suitable task: Segmentation + area measurement + prediction model.
Example 3: Endoscopy
Clinical question: Does a specific type of lesion suggest high-risk pathology?
Image purpose: Identify lesion morphology, vascular pattern, surface structure, and border.
Suitable task: Detection/classification + structured clinician scoring.
Example 4: Hysteroscopy
Clinical question: After polypectomy, is the first embryo transfer more likely to result in clinical pregnancy?

Image purpose: Describe polyp number, location, base, surface morphology, endometrial hyperemia, edema, and inflammatory appearance.
Suitable task: First extract structured image phenotypes, then incorporate them into a clinical prediction model.
The key question is not “Can AI read the image?” but:
Which information in the image is theoretically related to the clinical outcome?
7. AI Does Not Have to Predict the Outcome Directly: It Can First Serve as a “Translator” of Clinical Findings
Many clinical image studies immediately aim to build:
Image → AI → prediction of treatment success
This is possible, but it is not always the most appropriate approach.
A more robust route is:
Image → AI-extracted structured phenotype → clinical variables → prediction model → outcome prediction
In other words, AI is not a “fortune-teller.” It is a “translator.”
It first translates the image into variables that clinicians can understand and models can read, such as:
- Lesion area
- Lesion location
- Color distribution
- Vascular richness
- Inflammation grade
- Edema grade
- Necrosis proportion
- Border complexity
These variables can then be combined with age, laboratory tests, treatment strategy, pathology results, and other clinical features in a prediction model.
This has several advantages:
- It is more interpretable.
- It aligns better with clinical reasoning.
- It allows testing whether image phenotypes add incremental value.
- It is less likely to be criticized as a black box.
- It facilitates future standardization and implementation.
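To make the "translator" route concrete, here is a small sketch of the hand-off between the image step and the prediction step. All variable names and values are hypothetical, and the choice of summary (maximum or mean across a patient's images) is an assumption that should follow clinical reasoning for each variable. The point is only that the prediction model receives one interpretable row per patient.

```python
from statistics import mean


def aggregate_phenotypes(image_phenotypes):
    """Collapse per-image phenotypes into one patient-level summary."""
    return {
        "lesion_area_mm2_max": max(p["lesion_area_mm2"] for p in image_phenotypes),
        "inflammation_grade_max": max(p["inflammation_grade"] for p in image_phenotypes),
        "necrosis_proportion_mean": mean(p["necrosis_proportion"] for p in image_phenotypes),
    }


def build_patient_row(clinical_variables, image_phenotypes):
    """One table row: clinical variables plus summarized image phenotypes."""
    row = dict(clinical_variables)
    row.update(aggregate_phenotypes(image_phenotypes))
    return row


# Hypothetical output of an image model for two images of one patient.
phenotypes = [
    {"lesion_area_mm2": 12.0, "inflammation_grade": 1, "necrosis_proportion": 0.1},
    {"lesion_area_mm2": 20.0, "inflammation_grade": 2, "necrosis_proportion": 0.3},
]
clinical = {"research_id": "R001", "age": 34}

row = build_patient_row(clinical, phenotypes)
print(row)
```

Because each column has a clinical name, the same table can be filled by clinician scoring or by AI, which is what later allows a direct comparison of the two.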
8. Retrospective and Prospective Images Have Different Value
Retrospective Images
Advantages:
- Already available
- Potentially large sample size
- Low cost
- Suitable for rapid exploratory research
Limitations:
- Imaging workflow may not be standardized
- Normal controls may be missing
- Image quality may vary widely
- Key clinical variables may be missing
- Selection bias may be substantial
Prospective Images
Advantages:
- Imaging workflow can be standardized
- The number and angles of images can be predefined
- Clinical variables can be collected at the same time
- Outcomes can be defined in advance
- Better suited for validation
Limitations:
- Takes longer
- Costs more
- Requires ethics approval and workflow management
- Sample size may be limited in the short term
The Ideal Path
Step 1: Use retrospective images for exploration.
Step 2: Identify which image features appear valuable.
Step 3: Design a prospective standardized imaging protocol.
Step 4: Validate whether the model is truly usable.
9. Standardized Imaging Is Not About Making Images Look Nice. It Is About Preventing the Model from Being Fooled
Standardized imaging is like giving AI the same exam paper.
If every clinician takes images differently, it is as if some students receive Exam Paper A, others receive Paper B, and others receive a blurry photocopy. The model may learn the differences between exam papers instead of learning the correct answers.
Standardization may include:
- Fixed imaging angle
- Fixed distance
- Fixed light source or device parameters
- Fixed anatomical sites per patient
- A predefined minimum number of images per lesion
- Both overview and close-up images
- Recording the imaging device and operator
- Recording the imaging time point
- Recording whether the image was taken before or after treatment
Prospective studies should pay special attention to this.
10. Privacy Is Not Just About Covering the Name: Images Themselves May Identify Patients
Privacy issues in clinical images are complex.
Some images can obviously identify patients, such as:
- Facial photographs
- External oral photographs
- Tattoos
- Scars
- Sensitive body regions
- Screenshots containing patient names or IDs
- DICOM images containing embedded identifying information
Other images may seem non-identifiable, such as endoscopic images, hysteroscopic images, and pathology images, but they still require caution:
- Does the file name contain the patient ID?
- Does the corner of the image contain a name or examination number?
- Do metadata contain device, time, or location information?
- Could a rare disease or rare anatomical site indirectly identify a patient?
- Could the examination date and ward records be used to infer the patient’s identity?
Practical Solutions
- Remove names, IDs, dates, and other text embedded in the image.
- Remove identifying EXIF or DICOM metadata.
- Use research IDs instead of patient IDs.
- Store the linkage file separately and securely.
- Restrict access permissions.
- Clearly define image-use scope in the ethics application.
- Use stricter de-identification and data-use agreements when sharing images externally.
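The "research ID plus separate linkage file" idea can be sketched as follows. This is an illustration under assumptions: the ID format, the `R` prefix, and the file-naming pattern are invented here, and the example medical record numbers are fake. It covers only renaming. It does not remove text burned into the image or identifying EXIF/DICOM metadata, which need dedicated tools.

```python
import secrets


def assign_research_ids(patient_ids, prefix="R"):
    """Map each patient ID to a random research ID (the linkage table)."""
    linkage = {}
    used = set()
    for patient_id in patient_ids:
        if patient_id in linkage:
            continue  # one research ID per patient, however many images
        research_id = prefix + secrets.token_hex(4)
        while research_id in used:  # extremely unlikely, but keep IDs unique
            research_id = prefix + secrets.token_hex(4)
        used.add(research_id)
        linkage[patient_id] = research_id
    return linkage


def deidentified_filename(research_id, image_index, extension=".png"):
    """File name that carries the research ID and no patient identifier."""
    return f"{research_id}_img{image_index:03d}{extension}"


# Fake medical record numbers, with one patient appearing twice.
linkage = assign_research_ids(["MRN-0012", "MRN-0345", "MRN-0012"])
print(deidentified_filename(linkage["MRN-0012"], 1))
```

Random IDs are used here on purpose: an ID derived from the patient number by a simple formula can sometimes be reversed. The `linkage` table is the sensitive part and belongs in a separate, access-restricted location, not next to the images.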
An Important Balance
De-identification should not destroy clinical information.
For example, in dermatology images, excessive cropping may remove surrounding skin, lesion location, and scale, making the image less useful for the model.
Privacy protection must balance two goals:
Prevent patient identification, while preserving the clinical meaning of the image.
11. Image Storage: Being Able to Open the File Today Does Not Mean It Will Be Usable Five Years Later
Clinical image-AI research depends heavily on long-term, standardized data management.
Consider:
- Are original images preserved?
- Are preprocessed images stored separately?
- Do annotation files correctly correspond to images?
- Are file names stable?
- Is the imaging device recorded?
- Is the imaging time recorded?
- Is there a patient-level research ID?
- Is version control available?
- Who has access?
- Is there backup?
- Is future secondary research permitted?
At minimum, keep three data layers:
| Data Layer | Content |
|---|---|
| Raw layer | De-identified original images |
| Annotation layer | Clinician annotations, bounding boxes, segmentation masks, quality scores |
| Analysis layer | Preprocessed images, model inputs, extracted features, training results |
Do not keep only the compressed images used by the model. Otherwise, future verification and reuse will be difficult.
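One lightweight way to keep the three layers linked over the years is a manifest that records, for every file, its research ID, its layer, its path, and a checksum. The sketch below is a suggestion, not a standard: the field names and example path are assumptions, and the "image" is just placeholder bytes.

```python
import hashlib

LAYERS = ("raw", "annotation", "analysis")


def manifest_entry(research_id, layer, relative_path, content):
    """Describe one stored file; `content` is the file's bytes."""
    if layer not in LAYERS:
        raise ValueError(f"unknown layer: {layer}")
    return {
        "research_id": research_id,
        "layer": layer,
        "path": relative_path,
        "sha256": hashlib.sha256(content).hexdigest(),
    }


def is_unchanged(entry, content):
    """True if the file's bytes still match the recorded checksum."""
    return entry["sha256"] == hashlib.sha256(content).hexdigest()


# Placeholder bytes standing in for a de-identified original image.
original_bytes = b"placeholder image bytes"
entry = manifest_entry("R001", "raw", "raw/R001_img001.png", original_bytes)

print(is_unchanged(entry, original_bytes))
print(is_unchanged(entry, original_bytes + b" recompressed"))
```

With such a manifest, an annotation can refer to the checksum of the exact image it was drawn on, so a renamed or recompressed file is detected instead of silently mismatched.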
12. Contributing Images to Other Researchers Is Not the Same as Sending a Cloud Drive Link
If you hope to share your images with other researchers, plan for this from the beginning.
Consider:
- Does patient consent or ethics waiver allow data sharing?
- Can the data only be used within the institution, or can it be shared across centers?
- Is commercial use allowed?
- Is AI model training allowed?
- Is public release allowed?
- Is a data-use agreement required?
- Is there a standardized data dictionary?
- Is there an annotation manual?
- Are image source and device information recorded?
- Are recommended training/validation/test splits provided?
A truly valuable image dataset is not defined only by the number of images. It also requires:
Clear labels, clear provenance, clear usage permissions, and a clear data structure.
13. What If I Do Not Have Enough Images?
Many clinicians encounter this problem:
“I only have a few hundred images. Is that enough to train AI?”
The answer is: it depends on the task.
If the task is simple classification, a few hundred cases may be enough for exploration.
If the goal is to train a complex deep learning model, it is usually not enough.
If each patient has multiple images, remember that the number of patients is often more important than the number of images.
Solution 1: Transfer Learning
Transfer learning can be understood as:
Let the model first learn how to “see” from large datasets, then adapt it to your specialized task.
Common approaches use ImageNet-pretrained models such as ResNet, EfficientNet, and ConvNeXt.
But remember: ImageNet contains natural images — cats, dogs, cars, houses — not endoscopy, dermoscopy, or pathology images.
Therefore, it provides general visual ability, not specialized clinical understanding.
Solution 2: Domain Pretraining
If you have a large number of images from the same domain, even without outcome labels, you can first let the model learn the “visual language” of that domain.
For example, hysteroscopic images may not all have IVF outcomes, but they can still help the model become familiar with:
- The hysteroscopic field of view
- Lighting and reflection
- Endometrial texture
- Fluid environment
- Lesion borders
- Endoscopic angle
- Image noise
This is like a medical student first reviewing many normal and abnormal images to understand “what this domain looks like,” before learning a specific diagnosis or prediction task.
Solution 3: Self-Supervised Learning
Self-supervised learning can be understood as:
The model practices by making its own exercises, without needing a teacher to provide answers.
For example, part of an image can be masked and the model learns to reconstruct the missing part; or the same image can be transformed in different ways and the model learns that these versions still come from the same object.
This is especially useful for medical images, where images are often plentiful but high-quality labels are scarce.
Suggested reading:
Self-supervised learning for medical image classification, npj Digital Medicine, 2023
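The masking exercise described above can be illustrated without any deep learning library. The sketch below only builds the "exercise": it hides a share of square patches in a tiny image and keeps the original as the answer. The function name, patch size, and mask ratio are assumptions for illustration, and no model is trained here. A real pipeline would feed the masked image to a network and score its reconstruction against the original.

```python
import random


def mask_patches(image, patch_size, mask_ratio, seed=0, fill=0):
    """Hide a random share of square patches; return the masked copy
    and the top-left corners of the hidden patches."""
    height, width = len(image), len(image[0])
    if height % patch_size or width % patch_size:
        raise ValueError("image size must be a multiple of patch_size")
    corners = [
        (row, col)
        for row in range(0, height, patch_size)
        for col in range(0, width, patch_size)
    ]
    rng = random.Random(seed)
    n_masked = int(round(mask_ratio * len(corners)))
    hidden = set(rng.sample(corners, n_masked))

    masked_image = [row[:] for row in image]  # copy, original stays intact
    for top, left in hidden:
        for row in range(top, top + patch_size):
            for col in range(left, left + patch_size):
                masked_image[row][col] = fill
    return masked_image, sorted(hidden)


# A toy 4x4 "image" with pixel values 1..16.
image = [[row * 4 + col + 1 for col in range(4)] for row in range(4)]
masked_image, hidden = mask_patches(image, patch_size=2, mask_ratio=0.5)

for row in masked_image:
    print(row)
```

The pair (`masked_image`, `image`) is one self-made exercise: the input has gaps and the untouched original is the answer key, so no clinician label is needed.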
Solution 4: Find Public Data or Contact Authors
Useful search terms include:
- disease name image dataset
- organ name endoscopy dataset
- clinical image dataset
- medical image segmentation dataset
- AI medical imaging GitHub
- hysteroscopy image dataset
- dermoscopy image dataset
- wound image dataset
- fundus image dataset
Common sources include:
- Supplementary materials of papers
- GitHub
- Kaggle
- PhysioNet
- The Cancer Imaging Archive
- Grand Challenge
- Medical imaging AI challenge platforms
- Direct contact with paper authors
But remember:
Whether public data can be used depends not only on whether it can be downloaded, but also on licensing, ethics, task relevance, and whether the image source matches your clinical setting.
14. Supervised, Unsupervised, or Self-Supervised Learning?
Supervised Learning
Supervised learning is like having a teacher grade homework.
You give the model images and corresponding answers:
- This image is a polyp.
- This image is normal.
- This region is the lesion.
- This patient later achieved clinical pregnancy.
- This patient did not achieve clinical pregnancy.
The model learns:
The relationship between images and answers.
Suitable tasks include:
- Classification
- Detection
- Segmentation
- Outcome prediction
The disadvantage is that high-quality labels are required.
Unsupervised Learning
Unsupervised learning is like letting the model sort images into piles by itself.
It does not know the answer. It groups images based on similarity.
For example, it may divide images into:
- A redder group
- A group with irregular surfaces
- A group with strong reflection
- A group with cloudy views
The problem is: these groups may not have clinical meaning.
Therefore, unsupervised clustering is often useful for exploration, but it is not appropriate to immediately claim that a “new classification system” has been discovered.
A more cautious phrasing is:
We explored image phenotype clusters and evaluated their relationship with clinical variables and outcomes.
Self-Supervised Learning
Self-supervised learning lies between the two.
It does not require human-provided answers. Instead, “pretraining tasks” are constructed from the images themselves, which lets the model learn image structure.
Afterward, the model is fine-tuned using a smaller labeled dataset.
It is especially suitable when:
There are many images but few labels.
15. How Should Image Sample Size Be Counted?
Clinicians often say:
“I have 10,000 images.”
But what the model often really needs to know is:
“How many independent patients are there?”
If 1,000 patients each have 10 images, that gives 10,000 images, but not 10,000 independent samples.
Patient-Level Outcome Prediction
If the outcome is patient-level, such as pregnancy, recurrence, death, or treatment response, then sample size mainly depends on the number of patients and the number of events.
For example:
- 800 patients
- 300 clinical pregnancies
- 500 non-pregnancies
Here, the event count is 300. That number, not the image count (8,000 if each patient contributed 10 images), is what limits the prediction model.
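The counting can be made explicit. The sketch below rebuilds the example with 800 patients and 300 events, and assumes, purely for illustration, that each patient contributed 10 images. The field names are hypothetical. It reports images, patients, events, and the smaller of events and non-events, which is the count that usually limits a binary prediction model.

```python
def summarize_sample(image_records):
    """Count images, independent patients, and patient-level events."""
    outcome_by_patient = {}
    for record in image_records:
        patient = record["patient_id"]
        outcome = record["outcome"]
        if outcome_by_patient.get(patient, outcome) != outcome:
            raise ValueError(f"conflicting outcomes for patient {patient}")
        outcome_by_patient[patient] = outcome

    n_patients = len(outcome_by_patient)
    n_events = sum(outcome_by_patient.values())
    return {
        "n_images": len(image_records),
        "n_patients": n_patients,
        "n_events": n_events,
        "n_non_events": n_patients - n_events,
        "limiting_count": min(n_events, n_patients - n_events),
    }


# 800 patients, the first 300 with the event; 10 images each (assumed).
image_records = [
    {"patient_id": patient, "outcome": 1 if patient < 300 else 0}
    for patient in range(800)
    for _ in range(10)
]

summary = summarize_sample(image_records)
print(summary)
```

The summary shows 8,000 images but only 800 patients and a limiting count of 300, which is the number to bring to a sample size discussion.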
Image Phenotype Training
If the task is to identify structures in images, such as polyps, ulcers, masses, or inflammatory regions, then the number of images and annotations also matters.
For example:
- 4,000 patients
- 8 images per patient
- 32,000 total images
- 5,000 images with bounding boxes
- 1,000 images with pixel-level segmentation
This may support image-recognition model training, but patient-level splitting is still required.
Sample Size Depends on Task Complexity
| Task | Sample Size Requirement |
|---|---|
| Clinician scoring + traditional prediction model | Relatively lower; mainly depends on patient count and event count |
| Image classification | Moderate; sufficient positive and negative images are needed |
| Lesion detection | Higher; bounding-box annotations are needed |
| Pixel-level segmentation | Higher; fine annotations are needed |
| End-to-end prediction of distant outcomes from images | Usually highest and most prone to overfitting |
| Self-supervised domain pretraining | More images are better; unlabeled images can be used |
So do not ask only “How many images do I have?” Ask instead:
What is my task? What is the label? What is the outcome? How many independent patients are there? How many positive events are there?
16. A More Clinician-Friendly Route for Image-AI Research
If you are a clinician with a collection of images, I suggest thinking in this sequence:
Step 1: Define the Clinical Question
Not “I want to do AI,” but:
What do I want to predict, diagnose, stratify, or support?
Step 2: Define the Clinical Hypothesis Inside the Image
Ask:
Which visual findings in the image may be related to this question?
Examples include size, location, color, border, vascular pattern, necrosis, inflammation, area, number, and texture.
Step 3: First Perform Human Structured Annotation
Let clinicians convert visual findings into tabular variables.
This step reveals many problems:
- Which variables have high inter-rater agreement?
- Which variables are too subjective?
- Which variables lack clinical meaning?
- Which variables are difficult to define?
Step 4: Then Decide What AI Should Learn
AI can learn to:
- Identify lesions
- Draw bounding boxes around lesions
- Segment lesions
- Grade severity
- Extract continuous quantitative features
- Generate structured image phenotypes
Do not start by asking AI to predict the final outcome directly.
Step 5: Build a Clinical Prediction Model
Combine image phenotypes with clinical variables and compare:
- Clinical variables only
- Clinical variables + clinician-assessed image phenotypes
- Clinical variables + AI-extracted image phenotypes
- Clinical variables + deep image features
This is how you answer:
Does the image actually provide additional value?
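One simple way to start that comparison is to compute the same discrimination measure for each model on the same held-out patients. The sketch below implements the area under the ROC curve (AUC) as the share of event/non-event patient pairs that the model ranks correctly, with ties counted as half. The labels and predicted risks are invented toy numbers, not results, and six patients are far too few for any real conclusion.

```python
def auc(labels, scores):
    """AUC as the probability that a random event patient receives a
    higher score than a random non-event patient (ties count 0.5)."""
    event_scores = [s for y, s in zip(labels, scores) if y == 1]
    non_event_scores = [s for y, s in zip(labels, scores) if y == 0]
    if not event_scores or not non_event_scores:
        raise ValueError("need at least one event and one non-event")
    correct = 0.0
    for event_score in event_scores:
        for non_event_score in non_event_scores:
            if event_score > non_event_score:
                correct += 1.0
            elif event_score == non_event_score:
                correct += 0.5
    return correct / (len(event_scores) * len(non_event_scores))


# Toy held-out patients: 1 = event, 0 = no event (made-up values).
labels = [1, 1, 1, 0, 0, 0]
clinical_only = [0.60, 0.40, 0.70, 0.50, 0.30, 0.65]
clinical_plus_image = [0.80, 0.45, 0.70, 0.50, 0.30, 0.40]

print(round(auc(labels, clinical_only), 3))
print(round(auc(labels, clinical_plus_image), 3))
```

The predicted risks must come from patients the models never saw during training. In a real study the difference would also need a confidence interval, and calibration and decision curves should be compared as well, because AUC alone does not show clinical usefulness.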
Step 6: Validate the Model
At minimum, the validation should cover:
- Patient-level splitting
- Internal validation
- Calibration
- Decision curve analysis
- Subgroup analysis
- External validation, if possible
For clinical prediction models, see:
TRIPOD+AI statement, BMJ 2024
For medical imaging AI reporting, see:
CLAIM checklist for AI in Medical Imaging
For assessing prediction model bias, see:
PROBAST+AI, BMJ 2025
For early real-world clinical evaluation of AI, see:
DECIDE-AI guideline, BMJ 2022
17. Final Point: In Clinical Image-AI, the Question Matters More Than the Model
The most common reason clinical image-AI studies fail is not that the model is not advanced enough. It is that the question was not clearly defined:
- Why was the image taken?
- Who took it?
- Which cases were imaged?
- Which cases were not imaged?
- What happened to patients with poor-quality images?
- Who created the labels?
- Do clinicians agree with each other?
- Were multiple images from the same patient leaked into different datasets?
- What is the plausible clinical hypothesis linking image features to outcomes?
- Which clinical decision is the model intended to support?
- Does the image truly add information beyond existing clinical variables?
If these questions are not answered, even the most sophisticated AI model may be beautiful but fragile.
For clinicians doing image-AI research, the best starting point is not:
“Can I train a deep learning model?”
But rather:
“What exactly do these images record? Can those visual findings be described reliably, quantified, and structured? Do they truly help answer an important clinical question?”
Once this is clear, AI becomes genuinely useful.