As a doctor with plenty of clinical images, can I make use of them?

Many clinical specialties accumulate large numbers of images: endoscopic images, dermatology photographs, fundus photographs, ultrasound screenshots, wound photographs, pathology images, intraoperative photos, oral cavity photographs, hysteroscopic images, cystoscopic images, nasopharyngoscopic images, and more.

So it is natural to ask:

“I have so many images — can I build an AI model?”
“Can it predict disease?”
“Can it predict treatment response?”
“Can I publish a machine learning paper?”

But clinical images are not raw material that can simply be thrown into an “AI pot” and cooked into a paper.
An image is not just an image. Behind it are questions such as who took it, when it was taken, why it was taken, how it was taken, what happened before the image was taken, and what happened afterward.

If these questions are not clarified, a model may appear accurate, but in reality it may only be learning “which device took the image,” “which clinician took the image,” or “which ward has a more standardized workflow,” rather than learning the disease itself.

This article uses clinician-friendly language to walk through the key issues that should be clarified before starting a clinical image-AI study.

1. Do Not Start with AI. Start by Asking: Where Did This Image Come From?

Clinical images do not appear randomly in nature. They are generated by a workflow.

For example:

Dermatology photographs may be taken by clinicians or by patients using smartphones.
Endoscopic images may be captured when the operator sees a representative lesion.
Ultrasound images may be selected by the sonographer as the “best” or “most representative” view.
Wound photographs may be taken repeatedly only in severe, complex, or non-healing cases.
Hysteroscopic images may be taken according to a fixed protocol, or more images may be saved only when abnormalities are found.

This leads to an important point:

An image is not simply a “record of a lesion.” It is also a record of clinical behavior.

In other words, images reflect not only disease, but also the habits of the image-taker, the device, the workflow, patient selection, and the culture of the clinical department.

A Simple Example

Suppose that in a gastroenterology unit, senior endoscopists tend to take very clear images of suspicious lesions, while junior endoscopists only capture a few casual screenshots.
If an AI model is trained to identify early cancer, it may learn:

“Images that are clearer, more centered, and more like teaching slides are more likely to be early cancer.”

This is not a true disease pattern. It is bias introduced by image-taking behavior.

So the First Step Is Not Modeling, but Auditing the Image Source

Ask first:

Which devices were used to capture the images?
Do the images span different years, clinicians, wards, or clinical units?
How many images were taken per patient?
Were images taken according to a fixed protocol, or only when the clinician considered something important?
Were normal cases imaged in the same way?
Are the saved images complete, or only “representative images”?
Was the preliminary diagnosis already known at the time of image capture?
Why are some images of poor quality? Is poor quality random, or are complex cases more likely to have poor images?

These questions may seem trivial, but they determine whether the model learns “disease” or “workflow.”
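
A first pass at this audit can be a plain cross-tabulation. The sketch below is only an illustration in Python: the record fields (`patient_id`, `device`, `operator`, `label`) and the example values are hypothetical, not a required schema.

```python
from collections import Counter

# Hypothetical image-level records; field names and values are illustrative only.
records = [
    {"patient_id": "P01", "device": "scope_A", "operator": "senior", "label": "cancer"},
    {"patient_id": "P01", "device": "scope_A", "operator": "senior", "label": "cancer"},
    {"patient_id": "P02", "device": "scope_A", "operator": "senior", "label": "cancer"},
    {"patient_id": "P03", "device": "scope_B", "operator": "junior", "label": "benign"},
    {"patient_id": "P04", "device": "scope_B", "operator": "junior", "label": "benign"},
    {"patient_id": "P05", "device": "scope_B", "operator": "junior", "label": "cancer"},
]


def images_per_patient(rows):
    """Count how many images each patient contributes."""
    return Counter(row["patient_id"] for row in rows)


def label_share_by(rows, field, positive_label):
    """For each value of `field`, return the share of images with the positive label."""
    totals = Counter()
    positives = Counter()
    for row in rows:
        totals[row[field]] += 1
        if row["label"] == positive_label:
            positives[row[field]] += 1
    return {value: positives[value] / totals[value] for value in totals}


per_patient = images_per_patient(records)
share_by_device = label_share_by(records, "device", "cancer")
share_by_operator = label_share_by(records, "operator", "cancer")
```

If the positive share differs sharply between devices or operators, as it does in this toy data, the model has a shortcut available that has nothing to do with disease.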

2. Image Quality Is Not a Minor Issue: Blur, Reflection, and Angle Can All Introduce Bias

Many clinicians say:

“We can just exclude low-quality images.”

That sounds reasonable, but it requires caution.

If poor image quality occurs randomly, exclusion may not be a major issue.
But if poor-quality images are concentrated in specific types of patients, excluding them can introduce bias.

For example:

Ultrasound images may be poorer in patients with obesity.
Endoscopic images may be blurrier in patients with more bleeding.
Severe inflammation may make the field of view cloudy.
Pediatric patients may be less cooperative, making standardized photos harder to obtain.
Hidden lesion locations may lead to poorer image angles.

In these cases, “excluding low-quality images” may be equivalent to “excluding more difficult and more clinically realistic cases.”

Suggested Approach

Do not simply write “poor-quality images were excluded.” Instead, design an image-quality grading scheme:

Quality Dimension	Example
Sharpness	Clear / mildly blurred / severely blurred
Obstruction	No obstruction / partial obstruction / severe obstruction
Reflection	No obvious reflection / mild reflection / affects interpretation
Completeness of field of view	Target region fully visible / partially visible / not assessable
Standard view or angle	Yes / no / uncertain

The goal is not to pursue perfect images, but to understand:

Can the model handle the imperfect images that occur in real clinical practice?
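
Before excluding anything, it may help to check whether exclusion would fall evenly across patient groups. The following is a minimal sketch, assuming each image already carries a sharpness grade from the table above and a patient subgroup; the subgroup names and counts are invented for illustration.

```python
from collections import Counter

# Hypothetical per-image records: a patient subgroup and a sharpness grade.
images = (
    [{"subgroup": "active_bleeding", "sharpness": "severely blurred"}] * 3
    + [{"subgroup": "active_bleeding", "sharpness": "clear"}] * 1
    + [{"subgroup": "no_bleeding", "sharpness": "severely blurred"}] * 1
    + [{"subgroup": "no_bleeding", "sharpness": "clear"}] * 3
)


def exclusion_rate_by_subgroup(rows, excluded_grades):
    """Share of images in each subgroup that a quality rule would remove."""
    totals = Counter()
    excluded = Counter()
    for row in rows:
        totals[row["subgroup"]] += 1
        if row["sharpness"] in excluded_grades:
            excluded[row["subgroup"]] += 1
    return {group: excluded[group] / totals[group] for group in totals}


rates = exclusion_rate_by_subgroup(images, {"severely blurred"})
```

In this toy data the rule would remove 75% of images from bleeding patients but only 25% from the others, which is exactly the situation where "excluding low-quality images" changes the study population.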

3. Human Annotation, AI Annotation, and Deep Features Are Not the Same Thing

Clinical images usually enter a study in one of three ways.

Type 1: Structured Annotation by Clinicians

A clinician reviews the image and converts visual findings into structured variables.

For example:

Number of polyps: single / multiple
Lesion border: clear / unclear
Ulcer base: clean / exudative / necrotic
Endoscopic inflammation: none / mild / moderate / severe
Mass morphology: regular / irregular
Vascular appearance: sparse / rich / abnormal

This is like translating a clinician’s visual impression into tabular language.

The advantage is clinical interpretability.
The disadvantage is that it is time-consuming, and different clinicians may not agree.
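
Disagreement between clinicians can be measured rather than guessed. A common choice for two raters and a categorical variable is Cohen's kappa, which corrects raw agreement for agreement expected by chance. The sketch below implements the unweighted version; the two rating lists are invented for illustration.

```python
from collections import Counter


def cohen_kappa(rater_a, rater_b):
    """Unweighted Cohen's kappa for two raters labelling the same images."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("Both raters must label the same non-empty set of images.")
    n = len(rater_a)
    # Observed agreement: share of images where the two labels match.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement: product of each rater's marginal label frequencies.
    count_a, count_b = Counter(rater_a), Counter(rater_b)
    expected = sum(count_a[c] * count_b[c] for c in count_a) / (n * n)
    if expected == 1.0:
        return float("nan")  # kappa is undefined when only one category is ever used
    return (observed - expected) / (1 - expected)


# Hypothetical "lesion border" ratings from two clinicians on the same 10 images.
clinician_1 = ["clear"] * 5 + ["unclear"] * 5
clinician_2 = ["clear"] * 4 + ["unclear"] + ["clear"] + ["unclear"] * 4

kappa = cohen_kappa(clinician_1, clinician_2)  # raw agreement 0.8, chance agreement 0.5
```

For ordered grades such as none / mild / moderate / severe, a weighted kappa that penalizes large disagreements more than small ones is usually the better fit; that variant is not shown here.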

Type 2: AI-Based Structured Phenotype Extraction

Here, AI does not directly predict the clinical outcome. Instead, it first learns to describe the image.

For example, it may output:

Lesion size
Lesion area proportion
Whether lesions are multiple
Whether the border is unclear
Whether hyperemia is present
Whether edema is present
Whether necrosis is present
Vascular density
Color distribution
Texture heterogeneity

This is like giving AI a “resident physician task”:

Do not make the final diagnosis yet. First describe the objective findings clearly.

This approach is particularly suitable for clinical research because it connects clinician language with machine-readable language.

Type 3: AI-Extracted Deep Features

The model does not tell you what specific structures it sees. Instead, it compresses the image into a numerical vector.

You can think of this as:

AI generates a “visual fingerprint” for each image.

This visual fingerprint may contain information about color, texture, edges, shape, spatial relationships, and other visual patterns, but it may not translate neatly into familiar human terms such as “hyperemia,” “edema,” or “necrosis.”

The advantage is that it may capture complex patterns clinicians do not notice.
The disadvantage is that it is difficult to interpret, and clinicians may be less willing to trust it.

4. When Is It Worth Letting AI Read Images? First Ask Whether Clinicians Can Describe the Findings Reliably

Not every image task needs AI.

A practical way to decide is to consider four scenarios.

Scenario	Example	Is AI Worth It?
Clinicians can describe it easily and reliably	Single/multiple, left/right, obvious lesion present or absent	Limited AI value; human annotation may be sufficient
Clinicians can describe it, but it is subjective	Mild/moderate/severe inflammation, hyperemia, edema, irregular border	AI may help standardize the description
Clinicians struggle to quantify it reliably	Area proportion, color distribution, texture complexity, vascular density	AI has high value
Clinicians can do it, but the workload is too high	4,000 cases, 10 images per case, all requiring structured annotation	AI can serve as a batch pre-annotation tool

An Analogy

If the task is “Is there a cat in this image?”, humans can do it easily, and AI mainly saves time.
If the task is “Do the cat’s fur color, body shape, posture, gaze, and background lighting jointly predict its health status?”, then more systematic image quantification is needed.

Clinical images are similar.

If clinicians can already describe a feature reliably, quickly, and at low cost, AI is not essential.
But if clinician descriptions are inconsistent, unquantified, absent from the medical record, or difficult to scale, AI becomes meaningful.

5. Image Annotation Has Multiple Levels: Do Not Simply Say “Train an AI Model”

“Annotating images” has different levels. Different annotation types correspond to different model types.

Annotation Type	What the Clinician Does	Suitable Question	Common Models
Image-level binary label	Label the whole image: disease / no disease	Does this image contain the target lesion?	ResNet, EfficientNet, DenseNet, ViT
Image-level multiclass label	Label which class the whole image belongs to	Polyp / myoma / inflammation / normal?	ResNet, EfficientNet, ConvNeXt, ViT
Bounding-box annotation	Draw a rectangle around the lesion	Where is the lesion?	YOLO, Faster R-CNN, RetinaNet
Pixel-level annotation	Carefully trace the lesion boundary	What are the lesion area, border, and shape?	U-Net, DeepLabV3+, Mask R-CNN
Severity grading	Grade inflammation, hyperemia, or edema	How severe is the visual finding?	CNN classification models, ordinal models
Multi-instance annotation	Multiple images from one patient correspond to one patient-level outcome	What is the patient-level risk?	Multiple Instance Learning, attention pooling

The Most Common Mistake

A patient may have many images — for example, 10 images.
If these images are randomly split into training and testing sets, a serious problem can occur:

Image 1 from the same patient is in the training set, while Image 2 is in the test set.

In that case, the model may not be learning the disease. It may simply be recognizing the same patient, the same device, or the visual style of the same examination session.

The correct approach is:

Split the dataset at the patient level.
All images from the same patient must belong to only one of the following: training set, validation set, or test set.

This is extremely important.
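
A patient-level split can be sketched in a few lines. The version below is a minimal illustration, assuming each image record has a `patient_id` field (a hypothetical name) and that a simple 70/15/15 split by patient count is acceptable; it does not stratify by outcome.

```python
import random
from collections import defaultdict


def split_by_patient(image_records, fractions=(0.7, 0.15, 0.15), seed=0):
    """Assign whole patients, never single images, to train / val / test."""
    by_patient = defaultdict(list)
    for record in image_records:
        by_patient[record["patient_id"]].append(record)

    # Shuffle patients (not images) reproducibly.
    patient_ids = sorted(by_patient)
    random.Random(seed).shuffle(patient_ids)

    n_train = round(len(patient_ids) * fractions[0])
    n_val = round(len(patient_ids) * fractions[1])
    groups = {
        "train": patient_ids[:n_train],
        "val": patient_ids[n_train:n_train + n_val],
        "test": patient_ids[n_train + n_val:],
    }
    # Every image follows its patient into exactly one split.
    return {
        name: [rec for pid in ids for rec in by_patient[pid]]
        for name, ids in groups.items()
    }


# Hypothetical dataset: 20 patients with 10 images each.
dataset = [
    {"patient_id": f"P{p:02d}", "image": f"P{p:02d}_{i:02d}.png"}
    for p in range(20)
    for i in range(10)
]
splits = split_by_patient(dataset)
patients_in = {name: {rec["patient_id"] for rec in recs} for name, recs in splits.items()}
```

Libraries such as scikit-learn offer grouped splitters (for example `GroupShuffleSplit` and `GroupKFold`) that serve the same purpose; the point is that the grouping key is the patient, not the image.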

6. The First Sentence of an Image Study Should Be: What Is My Clinical Question?

Clinical image-AI studies often fail when they start from technology:

“I have images, so I will train a CNN.”

A better starting point is:

“What clinical question am I trying to answer? What type of information in the image might help answer it?”

Examples:

Example 1: Dermatology

Clinical question: Does this skin lesion require biopsy?
Image purpose: Identify border irregularity, color heterogeneity, shape irregularity, number of colors, and diameter (assessing change in diameter requires serial photographs taken at a known scale).
Suitable task: Image classification + structured phenotype extraction.

Example 2: Wound Care

Clinical question: Will this diabetic foot ulcer heal within the next 4 weeks?
Image purpose: Quantify wound area, granulation tissue proportion, necrotic tissue proportion, exudate, and epithelialization at the edge.
Suitable task: Segmentation + area measurement + prediction model.
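
To make "segmentation + area measurement" concrete: once a segmentation mask exists, area and tissue proportions are simple pixel counts. The sketch below assumes a labelled mask (0 = background, 1 = granulation, 2 = necrosis) and a known scale in millimetres per pixel, for example from a ruler in the photograph. The label coding, the scale, and the tiny mask are all illustrative assumptions.

```python
from collections import Counter

GRANULATION, NECROSIS = 1, 2  # assumed label coding; 0 is background


def tissue_summary(mask, mm_per_pixel):
    """Turn a labelled wound mask into area and tissue-proportion variables."""
    counts = Counter(value for row in mask for value in row)
    wound_pixels = counts[GRANULATION] + counts[NECROSIS]
    pixel_area_mm2 = mm_per_pixel ** 2
    return {
        "wound_area_mm2": wound_pixels * pixel_area_mm2,
        "granulation_fraction": counts[GRANULATION] / wound_pixels if wound_pixels else 0.0,
        "necrosis_fraction": counts[NECROSIS] / wound_pixels if wound_pixels else 0.0,
    }


# A toy 4 x 5 mask standing in for the output of a segmentation model.
toy_mask = [
    [0, 0, 0, 0, 0],
    [0, 1, 1, 2, 0],
    [0, 1, 1, 2, 0],
    [0, 0, 1, 0, 0],
]
summary = tissue_summary(toy_mask, mm_per_pixel=0.5)
```

The area is only meaningful if the photographs are taken at a known scale and roughly perpendicular to the wound, which is one reason a standardized imaging protocol matters for this kind of task.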

Example 3: Endoscopy

Clinical question: Does a specific type of lesion suggest high-risk pathology?
Image purpose: Identify lesion morphology, vascular pattern, surface structure, and border.
Suitable task: Detection/classification + structured clinician scoring.

Example 4: Hysteroscopy

Clinical question: After hysteroscopic polypectomy, how likely is the first embryo transfer to result in clinical pregnancy?
Image purpose: Describe polyp number, location, base, surface morphology, endometrial hyperemia, edema, and inflammatory appearance.
Suitable task: First extract structured image phenotypes, then incorporate them into a clinical prediction model.

The key question is not “Can AI read the image?” but:

Which information in the image is theoretically related to the clinical outcome?

7. AI Does Not Have to Predict the Outcome Directly: It Can First Serve as a “Translator” of Clinical Findings

Many clinical image studies immediately aim to build:

Image → AI → prediction of treatment success

This is possible, but it is not always the most appropriate approach.

A more robust route is:

Image → AI-extracted structured phenotype → combined with clinical variables → prediction model → outcome prediction

In other words, AI is not a “fortune-teller.” It is a “translator.”

It first translates the image into variables that clinicians can understand and models can read, such as:

Lesion area
Lesion location
Color distribution
Vascular richness
Inflammation grade
Edema grade
Necrosis proportion
Border complexity

These variables can then be combined with age, laboratory tests, treatment strategy, pathology results, and other clinical features in a prediction model.

This has several advantages:

It is more interpretable.
It aligns better with clinical reasoning.
It allows testing whether image phenotypes add incremental value.
It is less likely to be criticized as a black box.
It facilitates future standardization and implementation.
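
In data terms, the "translator" route means collapsing per-image phenotypes to one row per patient and joining that row to the clinical table. The sketch below is one possible way to do this; the field names (`research_id`, `lesion_area_mm2`, `hyperemia`, `age`, `bmi`) and the aggregation rules (maximum area, any hyperemia) are assumptions chosen for illustration.

```python
from collections import defaultdict


def aggregate_phenotypes(image_rows):
    """Collapse per-image phenotypes into one summary row per patient."""
    grouped = defaultdict(list)
    for row in image_rows:
        grouped[row["research_id"]].append(row)
    return {
        rid: {
            "n_images": len(rows),
            "max_lesion_area_mm2": max(r["lesion_area_mm2"] for r in rows),
            "any_hyperemia": any(r["hyperemia"] for r in rows),
        }
        for rid, rows in grouped.items()
    }


def join_with_clinical(clinical_rows, patient_phenotypes):
    """Merge clinical variables with image phenotypes; report patients without images."""
    merged, missing = [], []
    for clinical in clinical_rows:
        phenotype = patient_phenotypes.get(clinical["research_id"])
        if phenotype is None:
            missing.append(clinical["research_id"])
        else:
            merged.append({**clinical, **phenotype})
    return merged, missing


# Hypothetical AI-extracted phenotypes (one row per image) and clinical variables.
image_rows = [
    {"research_id": "R001", "lesion_area_mm2": 12.0, "hyperemia": False},
    {"research_id": "R001", "lesion_area_mm2": 18.5, "hyperemia": True},
    {"research_id": "R002", "lesion_area_mm2": 6.0, "hyperemia": False},
]
clinical_rows = [
    {"research_id": "R001", "age": 34, "bmi": 22.5},
    {"research_id": "R002", "age": 38, "bmi": 27.0},
    {"research_id": "R003", "age": 31, "bmi": 24.1},
]
model_table, without_images = join_with_clinical(clinical_rows, aggregate_phenotypes(image_rows))
```

Patients without images are returned separately rather than silently dropped, because who lacks images is itself a selection question.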

8. Retrospective and Prospective Images Have Different Value

Retrospective Images

Advantages:

Already available
Potentially large sample size
Low cost
Suitable for rapid exploratory research

Limitations:

Imaging workflow may not be standardized
Normal controls may be missing
Image quality may vary widely
Key clinical variables may be missing
Selection bias may be substantial

Prospective Images

Advantages:

Imaging workflow can be standardized
The number and angles of images can be predefined
Clinical variables can be collected at the same time
Outcomes can be defined in advance
Better suited for validation

Limitations:

Takes longer
Costs more
Requires ethics approval and workflow management
Sample size may be limited in the short term

The Ideal Path

Step 1: Use retrospective images for exploration.
Step 2: Identify which image features appear valuable.
Step 3: Design a prospective standardized imaging protocol.
Step 4: Validate whether the model is truly usable.

9. Standardized Imaging Is Not About Making Images Look Nice. It Is About Preventing the Model from Being Fooled

Standardized imaging is like giving AI the same exam paper.

If every clinician takes images differently, it is as if some students received Exam Paper A, others received Paper B, and others received a blurry photocopy. The model may learn the differences between the exam papers instead of the correct answers.

Standardization may include:

Fixed imaging angle
Fixed distance
Fixed light source or device parameters
Fixed anatomical sites per patient
A predefined minimum number of images per lesion
Both overview and close-up images
Recording the imaging device and operator
Recording the imaging time point
Recording whether the image was taken before or after treatment

Prospective studies should pay special attention to this.
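
A protocol is easier to follow when violations are detected automatically at the time of collection. The sketch below shows one way to check capture records against a few of the items listed above. The `CaptureRecord` fields, the required views, and the minimum image count are hypothetical and would need to match the actual study protocol.

```python
from collections import defaultdict
from dataclasses import dataclass, fields
from typing import Optional

REQUIRED_VIEWS = {"overview", "close-up"}  # assumed protocol requirement


@dataclass
class CaptureRecord:
    research_id: str
    device: Optional[str]
    operator: Optional[str]
    captured_at: Optional[str]       # e.g. an ISO 8601 timestamp
    treatment_phase: Optional[str]   # "pre" or "post"
    view: Optional[str]              # "overview" or "close-up"


def protocol_violations(records, min_images=2):
    """List missing metadata, too few images, and missing required views."""
    problems = []
    by_patient = defaultdict(list)
    for rec in records:
        by_patient[rec.research_id].append(rec)
        for f in fields(rec):
            if getattr(rec, f.name) is None:
                problems.append(f"{rec.research_id}: missing {f.name}")
    for rid, recs in by_patient.items():
        if len(recs) < min_images:
            problems.append(f"{rid}: only {len(recs)} image(s)")
        for view in sorted(REQUIRED_VIEWS - {r.view for r in recs}):
            problems.append(f"{rid}: no {view} image")
    return problems


records = [
    CaptureRecord("R001", "scope_A", "op_01", "2024-05-02T09:10:00", "pre", "overview"),
    CaptureRecord("R001", "scope_A", "op_01", "2024-05-02T09:11:00", "pre", "close-up"),
    CaptureRecord("R002", "scope_B", None, "2024-05-02T10:30:00", "pre", "close-up"),
]
problems = protocol_violations(records)
```

Running such a check weekly during a prospective study tends to be far cheaper than discovering the gaps at analysis time.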

10. Privacy Is Not Just About Covering the Name: Images Themselves May Identify Patients

Privacy issues in clinical images are complex.

Some images can obviously identify patients, such as:

Facial photographs
External oral photographs
Tattoos
Scars
Sensitive body regions
Screenshots containing patient names or IDs
DICOM images containing embedded identifying information

Other images may seem non-identifiable, such as endoscopic images, hysteroscopic images, and pathology images, but they still require caution:

Does the file name contain the patient ID?
Does the corner of the image contain a name or examination number?
Do metadata contain device, time, or location information?
Could a rare disease or rare anatomical site indirectly identify a patient?
Could the examination date and ward records be used to infer the patient’s identity?

Practical Solutions

Remove names, IDs, dates, and other text embedded in the image.
Remove identifying EXIF or DICOM metadata.
Use research IDs instead of patient IDs.
Store the linkage file separately and securely.
Restrict access permissions.
Clearly define image-use scope in the ethics application.
Use stricter de-identification and data-use agreements when sharing images externally.
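
The research-ID step can be sketched as follows. This is a minimal illustration using random identifiers from Python's `secrets` module; the ID format and file-naming pattern are assumptions. It addresses only IDs and file names: text burned into the image and EXIF or DICOM metadata need separate handling.

```python
import secrets


def assign_research_ids(patient_ids, prefix="R"):
    """Map each patient ID to a random research ID and return the linkage table."""
    linkage = {}
    used = set()
    for pid in patient_ids:
        if pid in linkage:
            continue  # the same patient always keeps the same research ID
        rid = prefix + secrets.token_hex(4).upper()
        while rid in used:  # regenerate in the unlikely case of a collision
            rid = prefix + secrets.token_hex(4).upper()
        used.add(rid)
        linkage[pid] = rid
    return linkage


def deidentified_filename(original_name, patient_id, linkage, index):
    """Build a file name that carries only the research ID and an image index."""
    extension = original_name.rsplit(".", 1)[-1].lower() if "." in original_name else "bin"
    return f"{linkage[patient_id]}_{index:03d}.{extension}"


# Hypothetical hospital identifiers; the linkage table must be stored separately and securely.
linkage = assign_research_ids(["MRN12345", "MRN67890", "MRN12345"])
new_name = deidentified_filename("MRN12345_2021-03-04.JPG", "MRN12345", linkage, 1)
```

Random IDs are used here instead of a hash of the patient ID because an unkeyed hash of a short identifier can often be reversed by trying all possible values.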

An Important Balance

De-identification should not destroy clinical information.

For example, in dermatology images, excessive cropping may remove surrounding skin, lesion location, and scale, making the image less useful for the model.
Privacy protection must balance two goals:

Prevent patient identification, while preserving the clinical meaning of the image.

11. Image Storage: Being Able to Open the File Today Does Not Mean It Will Be Usable Five Years Later

Clinical image-AI research depends heavily on long-term, standardized data management.

Consider:

Are original images preserved?
Are preprocessed images stored separately?
Do annotation files correctly correspond to images?
Are file names stable?
Is the imaging device recorded?
Is the imaging time recorded?
Is there a patient-level research ID?
Is version control available?
Who has access?
Is there backup?
Is future secondary research permitted?

At minimum, keep three data layers:

Data Layer	Content
Raw layer	De-identified original images
Annotation layer	Clinician annotations, bounding boxes, segmentation masks, quality scores
Analysis layer	Preprocessed images, model inputs, extracted features, training results

Do not keep only the compressed images used by the model. Otherwise, future verification and reuse will be difficult.
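
One inexpensive safeguard for the raw layer is a checksum manifest: a list of file names with a hash of each file's bytes, so that any later overwrite, recompression, or loss can be detected. The sketch below uses SHA-256 from the standard library; the directory layout and file names are hypothetical.

```python
import hashlib
import tempfile
from pathlib import Path


def sha256_of(path):
    """Hash a file's bytes in chunks so large images do not need to fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(raw_dir):
    """Record a checksum for every file under the raw-layer directory."""
    raw_dir = Path(raw_dir)
    return {
        path.relative_to(raw_dir).as_posix(): sha256_of(path)
        for path in sorted(raw_dir.rglob("*"))
        if path.is_file()
    }


def changed_files(raw_dir, manifest):
    """Names of files that were added, removed, or altered since the manifest was built."""
    current = build_manifest(raw_dir)
    return sorted(
        name for name in set(manifest) | set(current)
        if manifest.get(name) != current.get(name)
    )


# Demonstration in a temporary directory with stand-in "image" files.
with tempfile.TemporaryDirectory() as tmp:
    raw = Path(tmp)
    (raw / "R001_001.png").write_bytes(b"original pixels 1")
    (raw / "R001_002.png").write_bytes(b"original pixels 2")
    manifest = build_manifest(raw)
    unchanged = changed_files(raw, manifest)
    (raw / "R001_002.png").write_bytes(b"recompressed pixels")
    after_overwrite = changed_files(raw, manifest)
```

Storing the manifest alongside the annotation layer also gives annotations a stable target: an annotation can reference a checksum, not just a file name that might later be reused.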

12. Contributing Images to Other Researchers Is Not the Same as Sending a Cloud Drive Link

If you hope to share images with other researchers, plan for it from the beginning.

Consider:

Does patient consent or an ethics waiver allow data sharing?
Can the data only be used within the institution, or can it be shared across centers?
Is commercial use allowed?
Is AI model training allowed?
Is public release allowed?
Is a data-use agreement required?
Is there a standardized data dictionary?
Is there an annotation manual?
Are image source and device information recorded?
Are recommended training/validation/test splits provided?

A truly valuable image dataset is not defined only by the number of images. It also requires:

Clear labels, clear provenance, clear usage permissions, and a clear data structure.

13. What If I Do Not Have Enough Images?

Many clinicians encounter this problem:

“I only have a few hundred images. Is that enough to train AI?”

The answer is: it depends on the task.

If the task is simple classification, a few hundred images may be enough for exploration.
If the goal is to train a complex deep learning model from scratch, a few hundred images are usually not enough.
If each patient has multiple images, remember that the number of patients is often more important than the number of images.

Solution 1: Transfer Learning

Transfer learning can be understood as:

Let the model first learn how to “see” from large datasets, then adapt it to your specialized task.

Common approaches use ImageNet-pretrained models such as ResNet, EfficientNet, and ConvNeXt.
But remember: ImageNet contains natural images — cats, dogs, cars, houses — not endoscopy, dermoscopy, or pathology images.

Therefore, it provides general visual ability, not specialized clinical understanding.

Solution 2: Domain Pretraining

If you have a large number of images from the same domain, even without outcome labels, you can first let the model learn the “visual language” of that domain.

For example, hysteroscopic images may not all have IVF outcomes, but they can still help the model become familiar with:

The hysteroscopic field of view
Lighting and reflection
Endometrial texture
Fluid environment
Lesion borders
Endoscopic angle
Image noise

This is like a medical student first reviewing many normal and abnormal images to understand “what this domain looks like,” before learning a specific diagnosis or prediction task.

Solution 3: Self-Supervised Learning

Self-supervised learning can be understood as:

The model practices by making its own exercises, without needing a teacher to provide answers.

For example, part of an image can be masked and the model learns to reconstruct the missing part; or the same image can be transformed in different ways and the model learns that these versions still come from the same object.

This is especially useful for medical images, where images are often plentiful but high-quality labels are scarce.
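
The masking idea can be illustrated without any neural network. The sketch below only prepares the "exercise": it hides random patches of an image and scores a reconstruction on the hidden pixels alone. The patch size, mask ratio, and toy image are arbitrary assumptions, and the model that would produce the reconstruction is deliberately left out.

```python
import random


def mask_patches(image, patch=2, mask_ratio=0.5, seed=0):
    """Hide a random subset of square patches; return the masked image and the mask."""
    height, width = len(image), len(image[0])
    patches = [(r, c) for r in range(0, height, patch) for c in range(0, width, patch)]
    rng = random.Random(seed)
    hidden = rng.sample(patches, k=int(len(patches) * mask_ratio))
    masked = [row[:] for row in image]
    mask = [[False] * width for _ in range(height)]
    for r0, c0 in hidden:
        for r in range(r0, min(r0 + patch, height)):
            for c in range(c0, min(c0 + patch, width)):
                masked[r][c] = 0.0   # hidden pixels are blanked in the model input
                mask[r][c] = True    # and remembered as the part to reconstruct
    return masked, mask


def masked_mse(prediction, target, mask):
    """Mean squared error computed over the hidden pixels only."""
    errors = [
        (prediction[r][c] - target[r][c]) ** 2
        for r in range(len(target))
        for c in range(len(target[0]))
        if mask[r][c]
    ]
    return sum(errors) / len(errors) if errors else 0.0


# A toy 4 x 4 grayscale "image" with values in (0, 1].
image = [[(r * 4 + c + 1) / 16 for c in range(4)] for r in range(4)]
masked_image, mask = mask_patches(image)
hidden_pixel_count = sum(flag for row in mask for flag in row)
```

During pretraining, a model would receive `masked_image`, output a reconstruction, and be updated to reduce `masked_mse` against the original. No clinician label is involved at any point, which is why unlabeled images are enough.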

Solution 4: Find Public Data or Contact Authors

Useful search terms include:

disease name image dataset
organ name endoscopy dataset
clinical image dataset
medical image segmentation dataset
AI medical imaging GitHub
hysteroscopy image dataset
dermoscopy image dataset
wound image dataset
fundus image dataset

Common sources include:

Supplementary materials of papers
GitHub
Kaggle
PhysioNet
The Cancer Imaging Archive
Grand Challenge
Medical imaging AI challenge platforms
Direct contact with paper authors

But remember:

Whether public data can be used depends not only on whether it can be downloaded, but also on licensing, ethics, task relevance, and whether the image source matches your clinical setting.

14. Supervised, Unsupervised, or Self-Supervised Learning?

Supervised Learning

Supervised learning is like having a teacher grade homework.

You give the model images and corresponding answers:

This image is a polyp.
This image is normal.
This region is the lesion.
This patient later achieved clinical pregnancy.
This patient did not achieve clinical pregnancy.

The model learns:

The relationship between images and answers.

Suitable tasks include:

Classification
Detection
Segmentation
Outcome prediction

The disadvantage is that high-quality labels are required.

Unsupervised Learning

Unsupervised learning is like letting the model sort images into piles by itself.

It does not know the answer. It groups images based on similarity.

For example, it may divide images into:

A redder group
A group with irregular surfaces
A group with strong reflection
A group with cloudy views

The problem is: these groups may not have clinical meaning.

Therefore, unsupervised clustering is often useful for exploration, but it is not appropriate to immediately claim that a “new classification system” has been discovered.

A more cautious phrasing is:

We explored image phenotype clusters and evaluated their relationship with clinical variables and outcomes.
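
Part of that evaluation is checking whether a cluster simply tracks an acquisition factor. The sketch below runs a plain k-means on two hypothetical per-image features (mean redness and glare fraction) and cross-tabulates the clusters against the device. The feature names, values, and devices are invented, and the initial centroids are just the first k points, which is a simplification.

```python
from collections import Counter


def kmeans(points, k, iterations=20):
    """Plain k-means; the first k points serve as the initial centroids."""
    centroids = [list(p) for p in points[:k]]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid.
        labels = [
            min(range(k), key=lambda j: sum((a - b) ** 2 for a, b in zip(p, centroids[j])))
            for p in points
        ]
        # Update step: each centroid moves to the mean of its members.
        for j in range(k):
            members = [p for p, label in zip(points, labels) if label == j]
            if members:
                centroids[j] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids


# Hypothetical features per image: (mean redness, fraction of pixels with glare).
features = [(0.60, 0.05), (0.61, 0.40), (0.62, 0.04), (0.59, 0.42), (0.58, 0.06), (0.63, 0.38)]
devices = ["scope_A", "scope_B", "scope_A", "scope_B", "scope_A", "scope_B"]

labels, centroids = kmeans(features, k=2)
cluster_by_device = Counter(zip(labels, devices))
```

In this toy example the two clusters coincide exactly with the two devices: the "phenotype" is glare from one scope, not a disease pattern. The same cross-tabulation against outcomes and clinical variables shows whether a cluster carries any clinical meaning.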

Self-Supervised Learning

Self-supervised learning lies between the two.

It does not require human-provided answers. Instead, the researcher designs “pretraining tasks” (such as reconstructing a masked region) whose answers come from the images themselves, so the model can learn image structure.
Afterward, the model is fine-tuned using a smaller labeled dataset.

It is especially suitable when:

There are many images but few labels.

15. How Should Image Sample Size Be Counted?

Clinicians often say:

“I have 10,000 images.”

But the question that really matters for the study is usually:

“How many independent patients are there?”

If 1,000 patients each have 10 images, that gives 10,000 images, but not 10,000 independent samples.

Patient-Level Outcome Prediction

If the outcome is patient-level, such as pregnancy, recurrence, death, or treatment response, then sample size mainly depends on the number of patients and the number of events.

For example:

800 patients
300 clinical pregnancies
500 non-pregnancies

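Here, the number of events is 300, not the number of images. Even if every patient contributed 10 images (8,000 in total), the model would still be learning from 300 pregnancies among 800 patients.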
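
The counting can be made explicit with a small helper. The sketch below reproduces the example above, assuming 10 images per patient and 15 candidate predictors; both numbers are illustrative. Events per candidate predictor is only a rough heuristic for traditional prediction models, not a formal sample-size calculation.

```python
def sample_size_summary(patients, n_candidate_predictors):
    """Summarise what actually limits a patient-level prediction model."""
    n_patients = len(patients)
    n_images = sum(p["n_images"] for p in patients)
    n_events = sum(1 for p in patients if p["outcome"] == 1)
    n_non_events = n_patients - n_events
    # The smaller outcome group is what limits model complexity.
    limiting_count = min(n_events, n_non_events)
    return {
        "n_patients": n_patients,
        "n_images": n_images,
        "n_events": n_events,
        "n_non_events": n_non_events,
        "events_per_predictor": limiting_count / n_candidate_predictors,
    }


# 300 clinical pregnancies and 500 non-pregnancies, 10 images each (assumed).
patients = [{"outcome": 1, "n_images": 10}] * 300 + [{"outcome": 0, "n_images": 10}] * 500
summary = sample_size_summary(patients, n_candidate_predictors=15)
```

Adding more images per patient raises `n_images` but leaves `events_per_predictor` unchanged, which is the practical meaning of "10,000 images are not 10,000 independent samples."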

Image Phenotype Training

If the task is to identify structures in images, such as polyps, ulcers, masses, or inflammatory regions, then the number of images and annotations also matters.

For example:

4,000 patients
8 images per patient
32,000 total images
5,000 images with bounding boxes
1,000 images with pixel-level segmentation

This may support image-recognition model training, but patient-level splitting is still required.

Sample Size Depends on Task Complexity

Task	Sample Size Requirement
Clinician scoring + traditional prediction model	Relatively lower; mainly depends on patient count and event count
Image classification	Moderate; sufficient positive and negative images are needed
Lesion detection	Higher; bounding-box annotations are needed
Pixel-level segmentation	Higher; fine annotations are needed
End-to-end prediction of distant outcomes from images	Usually highest and most prone to overfitting
Self-supervised domain pretraining	More images are better; unlabeled images can be used

So do not ask only “How many images do I have?” Ask instead:

What is my task? What is the label? What is the outcome? How many independent patients are there? How many positive events are there?

16. A More Clinician-Friendly Route for Image-AI Research

If you are a clinician with a collection of images, I suggest thinking in this sequence:

Step 1: Define the Clinical Question

Not “I want to do AI,” but:

What do I want to predict, diagnose, stratify, or support?

Step 2: Define the Clinical Hypothesis Inside the Image

Ask:

Which visual findings in the image may be related to this question?

Examples include size, location, color, border, vascular pattern, necrosis, inflammation, area, number, and texture.

Step 3: First Perform Human Structured Annotation

Let clinicians convert visual findings into tabular variables.
This step reveals many problems:

Which variables have high inter-rater agreement?
Which variables are too subjective?
Which variables lack clinical meaning?
Which variables are difficult to define?

Step 4: Then Decide What AI Should Learn

AI can learn to:

Identify lesions
Draw bounding boxes around lesions
Segment lesions
Grade severity
Extract continuous quantitative features
Generate structured image phenotypes

Do not start by asking AI to predict the final outcome directly.

Step 5: Build a Clinical Prediction Model

Combine image phenotypes with clinical variables and compare:

Clinical variables only
Clinical variables + clinician-assessed image phenotypes
Clinical variables + AI-extracted image phenotypes
Clinical variables + deep image features

This is how you answer:

Does the image actually provide additional value?
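
One simple way to express this comparison is the difference in discrimination (AUC) between two models evaluated on the same held-out patients. The sketch below computes AUC by direct pairwise comparison; the predicted risks and outcomes are invented for illustration, and a real analysis would also need confidence intervals.

```python
def auc(scores, outcomes):
    """Area under the ROC curve by pairwise comparison (ties count as 0.5)."""
    positives = [s for s, y in zip(scores, outcomes) if y == 1]
    negatives = [s for s, y in zip(scores, outcomes) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one event and one non-event.")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in positives
        for n in negatives
    )
    return wins / (len(positives) * len(negatives))


# Hypothetical predicted risks for 8 held-out patients from two models.
outcomes = [1, 1, 1, 1, 0, 0, 0, 0]
clinical_only = [0.60, 0.40, 0.70, 0.30, 0.50, 0.20, 0.35, 0.45]
clinical_plus_image = [0.70, 0.55, 0.80, 0.40, 0.50, 0.20, 0.30, 0.45]

auc_clinical = auc(clinical_only, outcomes)
auc_combined = auc(clinical_plus_image, outcomes)
incremental_auc = auc_combined - auc_clinical
```

If `incremental_auc` is close to zero on held-out patients, the image phenotypes are not adding information beyond what the clinical variables already provide, however impressive the image model looks in isolation.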

Step 6: Validate the Model

At minimum, the validation should cover:

Patient-level splitting
Internal validation
Calibration
Decision curve analysis
Subgroup analysis
External validation, if possible
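
Two of these checks reduce to short formulas. Decision curve analysis rests on net benefit at a chosen risk threshold, defined as TP/n minus FP/n multiplied by threshold/(1 - threshold). A crude first look at calibration is the gap between mean predicted risk and observed event rate. The sketch below implements both; the predictions and outcomes are invented for illustration.

```python
def net_benefit(probabilities, outcomes, threshold):
    """Net benefit of acting on patients whose predicted risk is at or above the threshold."""
    n = len(outcomes)
    true_pos = sum(1 for p, y in zip(probabilities, outcomes) if p >= threshold and y == 1)
    false_pos = sum(1 for p, y in zip(probabilities, outcomes) if p >= threshold and y == 0)
    return true_pos / n - (false_pos / n) * threshold / (1 - threshold)


def net_benefit_treat_all(outcomes, threshold):
    """Reference strategy: act on every patient regardless of the model."""
    prevalence = sum(outcomes) / len(outcomes)
    return prevalence - (1 - prevalence) * threshold / (1 - threshold)


def mean_calibration_gap(probabilities, outcomes):
    """Mean predicted risk minus observed event rate (positive means overestimation)."""
    return sum(probabilities) / len(probabilities) - sum(outcomes) / len(outcomes)


# Hypothetical predicted risks and observed outcomes for 10 held-out patients.
probabilities = [0.90, 0.80, 0.70, 0.60, 0.40, 0.30, 0.20, 0.10, 0.65, 0.15]
outcomes = [1, 1, 1, 0, 0, 0, 0, 0, 1, 0]

nb_model = net_benefit(probabilities, outcomes, threshold=0.5)
nb_all = net_benefit_treat_all(outcomes, threshold=0.5)
gap = mean_calibration_gap(probabilities, outcomes)
```

A decision curve repeats the net-benefit calculation across a range of clinically reasonable thresholds and compares the model with "treat all" and "treat none" (net benefit 0). A full calibration assessment would also examine a calibration plot or slope, not just the mean gap.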

For clinical prediction models, see:
TRIPOD+AI statement, BMJ 2024

For medical imaging AI reporting, see:
CLAIM checklist for AI in Medical Imaging

For assessing prediction model bias, see:
PROBAST+AI, BMJ 2025

For early real-world clinical evaluation of AI, see:
DECIDE-AI guideline, BMJ 2022

17. Final Point: In Clinical Image-AI, the Question Matters More Than the Model

The most common reason clinical image-AI studies fail is not that the model is not advanced enough. It is that the question was not clearly defined:

Why was the image taken?
Who took it?
Which cases were imaged?
Which cases were not imaged?
What happened to patients with poor-quality images?
Who created the labels?
Do clinicians agree with each other?
Were images from the same patient split across the training and test sets?
What is the plausible clinical hypothesis linking image features to outcomes?
Which clinical decision is the model intended to support?
Does the image truly add information beyond existing clinical variables?

If these questions are not answered, even the most sophisticated AI model may be beautiful but fragile.

For clinicians doing image-AI research, the best starting point is not:

“Can I train a deep learning model?”

But rather:

“What exactly do these images record? Can those visual findings be described reliably, quantified, and structured? Do they truly help answer an important clinical question?”

Once this is clear, AI becomes genuinely useful.