SWIN Transformer for Image Captioning

This project implements an end-to-end image captioning architecture, built from scratch in TensorFlow/Keras and trained on the Flickr8k dataset. It pairs a SWIN Transformer visual encoder with a transformer decoder to generate natural language descriptions of images.


Model Architecture


Architecture Components

SWIN Transformer Encoder

Input images are resized to 384×384 and processed by a Shifted Window Transformer — a hierarchical vision transformer that partitions the image into non-overlapping windows and applies self-attention within each window. Shifted windows in alternating layers allow cross-window information flow, building a multi-scale feature pyramid.
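The window partitioning at the heart of SWIN can be sketched as follows. This is a minimal illustration, not code from the project: the toy grid and window sizes are chosen so the shapes work out cleanly (the real model pads the token grid when it is not divisible by the window size).

```python
import tensorflow as tf

def window_partition(x, window_size):
    """Split a batch of feature maps into non-overlapping windows.

    x: (batch, height, width, channels); height and width are assumed
    divisible by window_size. Returns (batch * num_windows, ws, ws, C).
    """
    b, h, w, c = x.shape
    x = tf.reshape(x, (b, h // window_size, window_size,
                       w // window_size, window_size, c))
    # Bring the two window-index axes together, then flatten them
    # into the batch dimension.
    x = tf.transpose(x, (0, 1, 3, 2, 4, 5))
    return tf.reshape(x, (-1, window_size, window_size, c))

# Toy example: 2 feature maps of 8x8 tokens split into 4x4 windows.
feats = tf.random.normal((2, 8, 8, 32))
windows = window_partition(feats, 4)
print(windows.shape)  # (8, 4, 4, 32): 2 batches x 4 windows each
```

Self-attention is then computed independently inside each window; shifting the window grid in the next layer lets tokens near window borders attend across the previous partition.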

Refining Encoder Layers

Three additional encoder layers follow the SWIN backbone. Each combines cross-attention over the visual tokens with a position-wise feed-forward network, refining the visual features before they are passed to the decoder.
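One refining encoder layer might look like the sketch below. The layer sizes, the pre/post-norm placement, and the use of the visual tokens as both queries and keys are assumptions for illustration, not the project's exact configuration.

```python
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers

class RefiningEncoderLayer(layers.Layer):
    """Sketch of a refining encoder layer: multi-head attention over
    the visual tokens plus a feed-forward block, each wrapped in a
    residual connection and layer normalization."""

    def __init__(self, embed_dim=512, num_heads=8, ff_dim=2048, **kwargs):
        super().__init__(**kwargs)
        self.attn = layers.MultiHeadAttention(
            num_heads=num_heads, key_dim=embed_dim // num_heads)
        self.ffn = keras.Sequential([
            layers.Dense(ff_dim, activation="relu"),
            layers.Dense(embed_dim),
        ])
        self.norm1 = layers.LayerNormalization()
        self.norm2 = layers.LayerNormalization()

    def call(self, visual_tokens):
        # Attention where queries, keys, and values are visual tokens.
        attended = self.attn(visual_tokens, visual_tokens)
        x = self.norm1(visual_tokens + attended)
        return self.norm2(x + self.ffn(x))

# Toy usage: 2 images, 144 visual tokens, 512-dim features (assumed sizes).
tokens = tf.random.normal((2, 144, 512))
refined = RefiningEncoderLayer()(tokens)
print(refined.shape)  # (2, 144, 512)
```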

Prefusion Layer

Before decoding, word embeddings are fused with the global visual context via a Prefusion Layer, ensuring the decoder is conditioned on the image from the very first token.
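A minimal way to realize such a prefusion step is to pool the visual tokens into one global context vector, project it, and add it to every word embedding. The mean-pooling and additive fusion below are assumptions used to make the idea concrete; the project's actual fusion may differ.

```python
import tensorflow as tf
from tensorflow.keras import layers

class PrefusionLayer(layers.Layer):
    """Sketch of a prefusion layer: inject a global visual context
    vector into the word embeddings so the decoder is conditioned on
    the image from the very first token."""

    def __init__(self, embed_dim=512, **kwargs):
        super().__init__(**kwargs)
        self.project = layers.Dense(embed_dim)

    def call(self, word_embeddings, visual_tokens):
        # (batch, num_tokens, dim) -> (batch, dim) global context.
        global_ctx = tf.reduce_mean(visual_tokens, axis=1)
        # Broadcast-add the projected context across the caption length.
        return word_embeddings + self.project(global_ctx)[:, None, :]

words = tf.random.normal((2, 20, 512))   # embedded caption tokens
vis = tf.random.normal((2, 144, 512))    # refined visual tokens
fused = PrefusionLayer()(words, vis)
print(fused.shape)  # (2, 20, 512)
```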

Transformer Decoder

Six decoder layers, each applying masked self-attention over the caption tokens and cross-attention to the refined visual features before a feed-forward block.
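One decoder layer can be sketched as below. The causal mask keeps each position from attending to later tokens; the dimensions and post-norm arrangement are illustrative assumptions, not the project's exact settings.

```python
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers

class CaptionDecoderLayer(layers.Layer):
    """Sketch of one decoder layer: masked (causal) self-attention over
    the caption, cross-attention to the visual features, then a
    feed-forward block, each with a residual connection and norm."""

    def __init__(self, embed_dim=512, num_heads=8, ff_dim=2048, **kwargs):
        super().__init__(**kwargs)
        self.self_attn = layers.MultiHeadAttention(num_heads,
                                                   embed_dim // num_heads)
        self.cross_attn = layers.MultiHeadAttention(num_heads,
                                                    embed_dim // num_heads)
        self.ffn = keras.Sequential([
            layers.Dense(ff_dim, activation="relu"),
            layers.Dense(embed_dim),
        ])
        self.norm1 = layers.LayerNormalization()
        self.norm2 = layers.LayerNormalization()
        self.norm3 = layers.LayerNormalization()

    def call(self, captions, visual_tokens):
        seq = tf.shape(captions)[1]
        # Lower-triangular mask: token i attends only to tokens <= i.
        causal = tf.linalg.band_part(tf.ones((1, seq, seq)), -1, 0)
        x = self.norm1(captions + self.self_attn(
            captions, captions, attention_mask=causal))
        x = self.norm2(x + self.cross_attn(x, visual_tokens))
        return self.norm3(x + self.ffn(x))

cap = tf.random.normal((2, 20, 512))   # caption embeddings (toy sizes)
vis = tf.random.normal((2, 144, 512))  # visual features
out = CaptionDecoderLayer()(cap, vis)
print(out.shape)  # (2, 20, 512)
```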

Training

The model is trained with the Adam optimizer using a warmup + decay learning rate schedule, minimizing cross-entropy loss over the token predictions on Flickr8k caption-image pairs.
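A common choice for such a schedule is the one from the original Transformer paper: the learning rate rises linearly for a warmup period and then decays as the inverse square root of the step. The sketch below uses that form; the constants (`d_model`, `warmup_steps`, the Adam betas) are assumptions, not values taken from this project.

```python
import tensorflow as tf
from tensorflow import keras

class WarmupDecaySchedule(keras.optimizers.schedules.LearningRateSchedule):
    """Transformer-style schedule: linear warmup for `warmup_steps`,
    then step^-0.5 decay, scaled by d_model^-0.5."""

    def __init__(self, d_model=512, warmup_steps=4000):
        super().__init__()
        self.d_model = tf.cast(d_model, tf.float32)
        self.warmup_steps = warmup_steps

    def __call__(self, step):
        step = tf.cast(step, tf.float32)
        decay = tf.math.rsqrt(step)                    # after warmup
        warmup = step * (self.warmup_steps ** -1.5)    # during warmup
        return tf.math.rsqrt(self.d_model) * tf.minimum(decay, warmup)

schedule = WarmupDecaySchedule()
optimizer = keras.optimizers.Adam(schedule, beta_1=0.9, beta_2=0.98,
                                  epsilon=1e-9)
# The learning rate peaks at step == warmup_steps, then decays.
print(float(schedule(4000)) > float(schedule(8000)))  # True
```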


Summary

| Component | Details |
| --- | --- |
| Visual encoder | SWIN Transformer (384×384 input) |
| Feature refinement | 3 cross-attention refining encoder layers |
| Decoder | 6-layer transformer with masked self-attention |
| Dataset | Flickr8k (8,000 images, 5 captions each) |
| Framework | TensorFlow ≥2.8.0, Keras |
| Optimizer | Adam with warmup + decay schedule |

View Code on GitHub