Dries Buytaert

Generating image descriptions and alt-text with AI

I tested 12 LLMs — 10 running locally and 2 cloud-based — to assess their accuracy in generating alt-text for images.

I have 10,000 photos on my website. About 9,000 have no alt-text. I'm not proud of that, and it has bothered me for a long time.

When I started my blog nearly 20 years ago, I didn't think much about alt-text. Over time, I realized its importance for visually impaired users who rely on screen readers.

For the past five-plus years, I have diligently added alt-text to every new image I uploaded. But that only covers about 1,000 images, leaving most older photos without descriptions.

Writing 9,000 alt-texts manually would take ages. Of course, AI could do this much faster, but is it good enough?

To see what AI can do, I tested 12 Large Language Models (LLMs): 10 running locally and 2 in the cloud. My goal was to determine whether they can generate accurate alt-text.

The TL;DR is that, not surprisingly, the cloud models (GPT-4o and Claude 3.5 Sonnet) set the benchmark with A-grade performance, though they're not 100% perfect. I prefer local models for privacy, cost, and offline use. Among local options, the Llama 3.2 Vision variants and MiniCPM-V perform best. All earned a B grade: they work reliably but sometimes miss important details.

I know I'm not the only one. Plenty of people — entire organizations even — have massive backlogs of images without alt-text. I'm determined to fix that for my blog and share what I learn along the way. This blog post is just step one — subscribe by email or RSS to get future posts.

Models evaluated

I tested alt-text generation using 12 AI models: 9 on my MacBook Pro with 32GB RAM, 1 on a higher-RAM machine (thanks to Jeremy Andrews, a friend and long-time Drupal contributor), and 2 cloud-based services.

The table below lists the models I tested, with details like release dates, parameter counts (in billions), memory requirements, and some architectural details:

| # | Model | Launch date | Type | Vision encoder | Language model | Parameters | RAM | Deployment |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 1 | VIT-GPT2 | 2021 | Image-to-text | ViT (Vision Transformer) | GPT-2 | 0.4B | ~8GB | Local, Dries |
| 2 | Microsoft GIT | 2022 | Image-to-text | Swin Transformer | Transformer Decoder | 1.2B | ~8GB | Local, Dries |
| 3 | BLIP Large | 2022 | Image-to-text | ViT | BERT | 0.5B | ~8GB | Local, Dries |
| 4 | BLIP-2 OPT | 2023 | Image-to-text | CLIP ViT | OPT | 2.7B | ~8GB | Local, Dries |
| 5 | BLIP-2 FLAN-T5 | 2023 | Image-to-text | CLIP ViT | FLAN-T5 XL | 3B | ~8GB | Local, Dries |
| 6 | MiniCPM-V | 2024 | Multi-modal | SigLip-400M | Qwen2-7B | 8B | ~16GB | Local, Dries |
| 7 | LLaVA 13B | 2024 | Multi-modal | CLIP ViT | Vicuna 13B | 13B | ~16GB | Local, Dries |
| 8 | LLaVA 34B | 2024 | Multi-modal | CLIP ViT | Vicuna 34B | 34B | ~32GB | Local, Dries |
| 9 | Llama 3.2 Vision 11B | 2024 | Multi-modal | Custom Vision Encoder | Llama 3.2 | 11B | ~20GB | Local, Dries |
| 10 | Llama 3.2 Vision 90B | 2024 | Multi-modal | Custom Vision Encoder | Llama 3.2 | 90B | ~128GB | Local, Jeremy |
| 11 | OpenAI GPT-4o | 2024 | Multi-modal | Custom Vision Encoder | GPT-4 | >150B | | Cloud |
| 12 | Anthropic Claude 3.5 Sonnet | 2024 | Multi-modal | Custom Vision Encoder | Claude 3.5 | >150B | | Cloud |

How image-to-text models work (in less than 30 seconds)

LLMs come in many forms, but for this project, I focused on image-to-text and multi-modal models. Both types of models can analyze images and generate text, either by describing images or answering questions about them.

Image-to-text models follow a two-step process of vision encoding and language decoding:

  1. Vision encoding: First, the model breaks an image down into patches. You can think of these as "puzzle pieces". The patches are converted into mathematical representations called embeddings, which summarize their visual details. Next, an attention mechanism highlights the most important patches (e.g. the puzzle pieces with the cat's outline or fur texture) and filters out less relevant details (e.g. puzzle pieces with plain blue sky).
  2. Language decoding: Once the model has summarized the most important visual features, it uses a language model to translate those features into words. This step is where the actual text (image captions or Q&A answers) is generated.

In short, the vision encoder sees the image, while the language decoder describes it.

If you look at the table above, you'll see that each row pairs a vision encoder (e.g., ViT, CLIP, Swin) with a language model (e.g., GPT-2, BERT, T5, Llama).
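
To make this concrete, here is a minimal sketch of running one such encoder-decoder pair through the Hugging Face transformers library, using BLIP and an example image path:

from transformers import pipeline

# The "image-to-text" pipeline bundles a vision encoder and a language decoder.
captioner = pipeline("image-to-text", model="Salesforce/blip-image-captioning-large")

# The encoder turns the image into patch embeddings; the decoder turns those into a caption.
result = captioner("test-images/image-1.jpg")
print(result[0]["generated_text"])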

For a more in-depth explanation, I recommend Sebastian Raschka's article Understanding Multi-modal LLMs, which also covers how image encoders work. It's fantastic!

Comparing different AI models

I wrote a Python script that generates alt-texts for images using nine different local models. You can find it in my GitHub repository. It takes care of installing models, running them, and generating alt-texts. It supports both Hugging Face and Ollama and is built to be easily extended as new models come out.

You can run the script as follows:

$ ./alt-text ./test-images/image-1.jpg

The first time you run the script, it will download all models, which requires significant disk space and bandwidth — expect to download over 50GB of model data.

The script outputs a JSON response, making it easy to integrate or analyze programmatically. Here is an example output:

{
  "image": "test-images/image-1.jpg",
  "alt-texts": {
    "vit-gpt2": "A city at night with skyscrapers and a traffic light on the side of the street in front of a tall building.",
    "git": "A busy city street is lit up at night, with the word qroi on the right side of the sign.",
    "blip": "This is an aerial view of a busy city street at night with lots of people walking and cars on the side of the road.",
    "blip2-opt": "An aerial view of a busy city street at night.",
    "blip2-flan": "An aerial view of a busy street in tokyo, japanese city at night with large billboards.",
    "minicpm-v": "A bustling cityscape at night with illuminated billboards and advertisements, including one for Michael Kors.",
    "llava-13b": "A bustling nighttime scene from Tokyo's famous Shibuya Crossing, characterized by its bright lights and dense crowds of people moving through the intersection.",
    "llava-34b": "A bustling city street at night, filled with illuminated buildings and numerous pedestrians.",
    "llama32-vision-11b": "A bustling city street at night, with towering skyscrapers and neon lights illuminating the scene."
  }
}
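
For the models served through Ollama, a call from Python looks roughly like the sketch below (the prompt and model tag are illustrative, not necessarily what my script uses):

import ollama

# Ask a local multi-modal model to describe the image in a single sentence.
response = ollama.chat(
    model="llama3.2-vision:11b",
    messages=[{
        "role": "user",
        "content": "Describe this image in one sentence, suitable as alt-text.",
        "images": ["test-images/image-1.jpg"],
    }],
)
print(response["message"]["content"])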

Test images

With the script ready, I tested it on a handful of my 10,000 photos rather than all of them at once. I picked five that I consider non-standard: instead of simple portraits or landscapes, they contain elements that might confuse or challenge the models.

One photo is from the Isabella Stewart Gardner Museum in Boston and features an empty gold frame. The frame once held a masterpiece stolen in the infamous 1990 heist, one of the biggest art thefts in history. I wanted to see if the models would recognize it as empty or mistake it for a framed painting.

Another photo, taken last summer in Vermont, shows a wakeboarder. Though he is the main subject, he is relatively small in the frame. I was curious to see if the models could still recognize him as the focal point.

In another photo, a backgammon game is set in a dark but cozy atmosphere. I was curious to see if the models could recognize partially visible objects and capture the mood of the scene.

To ensure a fair test, I stripped all EXIF metadata from the images. This includes any embedded captions, GPS coordinates, or other details that could inadvertently help the models.
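
One way to strip metadata is with Pillow in Python; a minimal sketch (with illustrative file paths) looks like this:

from PIL import Image

# Copy only the pixel data into a fresh image, leaving EXIF, GPS, and caption metadata behind.
original = Image.open("test-images/image-1.jpg")
clean = Image.new(original.mode, original.size)
clean.putdata(list(original.getdata()))
clean.save("test-images/image-1-stripped.jpg")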

Yes, I know that a test set of five images is small. But with 12 models generating alt-text for each photo, I already had to evaluate 60 descriptions by hand, which was tedious to do fairly and accurately. For now, five images are enough to filter out the weaker models and pinpoint the best 2-3 LLMs for more detailed testing.

Next, I'll share the detailed results for each of the five photos. It's a lot to go through, so feel free to skip ahead to the summary.

Test image 1: Shibuya Crossing, Tokyo

Aerial view of an intersection, capturing the evening commute with pedestrians, traffic and electronic billboards.

Benchmark LLMs:

| Model | Description | Grade |
| --- | --- | --- |
| GPT-4o | A vibrant city intersection with crowds crossing, surrounded by illuminated billboards and towering buildings, likely Shibuya, Tokyo. | A |
| Claude | Hundreds of people cross Tokyo's famous Shibuya intersection as neon signs light up the surrounding skyscrapers. | A |

Local LLMs:

| Model | Description | Grade |
| --- | --- | --- |
| VIT-GPT2 | A city at night with skyscrapers and a traffic light on the side of the street in front of a tall building. | F |
| GIT | A busy city street is lit up at night, with the word qroi on the right side of the sign. | E |
| BLIP | This is an aerial view of a busy city street at night with lots of people walking and cars on the side of the road. | D |
| BLIP-2 with OPT | An aerial view of a busy city street at night. | E |
| BLIP-2 with FLAN | An aerial view of a busy street in Tokyo, Japanese city at night with large billboards. | E |
| LLaVA (13B) | A bustling nighttime scene of Tokyo's Shibuya Crossing, with neon lights and people walking across the intersection. | A |
| LLaVA (34B) | A bustling city street at night, with illuminated buildings and numerous pedestrians. | C |
| MiniCPM-V | A bustling cityscape at night with various advertisements, including one for Michael Kors and another for Yakult. | C |
| Llama 3.2 Vision (11B) | A bustling city street at night, with numerous billboards and advertisements lining the buildings. | C |
| Llama 3.2 Vision (90B) | A bustling city street with many people and buildings. | D |

Test image 2: Isabella Stewart Gardner Museum, Boston

An empty picture frame in a room with ornate wallpaper, candles, old paintings, and antique furniture.

Benchmark LLMs:

| Model | Description | Grade |
| --- | --- | --- |
| GPT-4o | Elegant wallpaper, an empty gold frame, candelabra, and framed painting in a classic room. | B |
| Claude | Ornate Victorian room with damask wallpaper, candelabra, and gold frames on wall. | B |

Local LLMs:

| Model | Description | Grade |
| --- | --- | --- |
| VIT-GPT2 | A living room with a mirror, candles, and a vase of flowers on a table in front of a mirror. | F |
| GIT | A picture frame is hanging on a wall next to a vase and a vase with the word tulips on it. | E |
| BLIP | A room with a painting on the wall and two vases on the table in front of it. | E |
| BLIP-2 with OPT | A room with a painting on the wall, a picture frame, and a chandelier. | C |
| BLIP-2 with FLAN | A room with a painting on the wall and a couple of framed pictures hanging on the wall next to it. | D |
| LLaVA (13B) | An empty wall with ornate decorations, including candlesticks and framed artwork, suggesting a formal or historical setting. | D |
| LLaVA (34B) | An ornate room with framed pictures on the wall, a chandelier hanging from the ceiling, and a fireplace mantel adorned with decorative items. | E |
| MiniCPM-V | A room with ornate wallpaper, candlesticks and framed portraits of historical figures is displayed. | |
| Llama 3.2 Vision (11B) | An empty gold frame on a wall with ornate wallpaper, surrounded by other decorative items. | A |
| Llama 3.2 Vision (90B) | An ornate room with a gold frame, a painting of a man, and a vase on a table. | B |

Test image 3: wakeboarding in Vermont, USA

Two men in swim shorts on the back of a boat watching another person wakeboarding behind the boat.

Benchmark LLMs:

| Model | Description | Grade |
| --- | --- | --- |
| GPT-4o | Two shirtless boys on a boat watch a person wakeboarding in the distance on a cloudy day. | A |
| Claude | People watching from a boat as a person is wakeboarding on a lake with a wooded shoreline. | B |

Local LLMs:

| Model | Description | Grade |
| --- | --- | --- |
| VIT-GPT2 | A man standing on top of a boat next to another man holding a surfboard in one hand and a surfboard in the other. | E |
| GIT | Two men are in a boat, one of them is wearing an orange hat and the other is wearing an orange hat. | D |
| BLIP | There are two men riding on the back of a boat in the water, one of them is on a surfboard and the other is on a board. | D |
| BLIP-2 with OPT | Three young men sitting on the back of a boat. | D |
| BLIP-2 with FLAN | A group of people sitting on a boat watching a man ride a surfboard in the middle of the water. | C |
| LLaVA (13B) | Two shirtless men on a boat, watching another man surfing in the water. | B |
| LLaVA (34B) | Two shirtless men on a boat, watching another person water skiing. | B |
| MiniCPM-V | Two shirtless men on a boat watch as another person water skis away from them, with trees and sky visible across the lake behind him. | B |
| Llama 3.2 Vision (11B) | Two shirtless men on a boat watching people wakeboarding or surfing behind it. | B |
| Llama 3.2 Vision (90B) | Two men on a boat watching another man wakeboarding. | A |

Test image 4: hiking in the Dolomites, Italy

Santa Maddalena church in Funes.

Benchmark LLMs:

| Model | Description | Grade |
| --- | --- | --- |
| GPT-4o | A wooden shrine with a statue of Mary and directional signs, set against mountains and green hills. | A |
| Claude | Wooden shrine of St. Magdalena with trail signs in the Dolomites mountains of Italy. | A |

Local LLMs:

| Model | Description | Grade |
| --- | --- | --- |
| VIT-GPT2 | A bird perched on top of a wooden post in front of a mountain range with a view of a mountain range and mountains. | F |
| GIT | A wooden sign in front of a mountain with a statue on it that says "Meise museum" on the top. | D |
| BLIP | There is a birdhouse with a statue on top of it in the middle of a field with trees and mountains in the background. | D |
| BLIP-2 with OPT | A statue of Mary on a wooden post in the middle of a field with mountains in the background. | B |
| BLIP-2 with FLAN | A statue of the Virgin Mary sits on top of a wooden post in the middle of a mountain. | C |
| LLaVA (13B) | A picturesque scene of a small wooden chapel with a statue of Jesus, set against a backdrop of majestic mountains and lush greenery. | C |
| LLaVA (34B) | A wooden shrine with religious iconography, surrounded by alpine scenery and signposts. | B |
| MiniCPM-V | A wooden shrine with a statue of Mary and several directional signs pointing to various locations such as Kirchsteig, Magdalena, St. | B |
| Llama 3.2 Vision (11B) | A wooden shrine with a statue of Mary and a signpost in front of mountains. | A |
| Llama 3.2 Vision (90B) | A statue of Mary in a wooden shrine with a signpost pointing to various locations, including Rundweg St. | B |

Test image 5: backgammon by candlelight

A backgammon board on a wooden table, accompanied by candles that cast a warm glow.

Benchmark LLMs:

| Model | Description | Grade |
| --- | --- | --- |
| GPT-4o | A cozy, dimly lit room with a candle on a wooden table, next to a backgammon board, creating a warm, rustic ambiance. | B |
| Claude | Two candles light up a game board for backgammon on a wooden table at night. | A |

Local LLMs:

| Model | Description | Grade |
| --- | --- | --- |
| VIT-GPT2 | A candle is lit on a wooden table in front of a fire place with candles and other items on top of it. | E |
| GIT | Two candles are lit next to each other on a table, one of them is lit up and the other is lit up. | D |
| BLIP | There is a lit candle sitting on top of a wooden table next to a game board and a glass of wine on the table. | D |
| BLIP-2 with OPT | A candle sits on top of a wooden table. | C |
| BLIP-2 with FLAN | A candle sits on a wooden table next to a backgammon board and a glass of wine. | D |
| LLaVA (13B) | A cozy indoor setting with candles, a guitar, and a warm ambiance. | D |
| LLaVA (34B) | A candlelit wooden table with musical instruments and a cozy ambiance. | C |
| MiniCPM-V | A dimly lit room with candles and backgammon pieces on a wooden table, creating an atmosphere of relaxation or leisure activity. | A |
| Llama 3.2 Vision (11B) | A dimly lit room with a wooden table, featuring a backgammon board and two candles. | A |
| Llama 3.2 Vision (90B) | A candle and backgammon board on a wooden table. | B |

Model accuracy

I evaluated each description using a structured but subjective scoring system. For each image, I identified the two or three most important objects the AI should recognize and include in its description. I also assessed whether the model captured the photo's mood, which can be important for visually impaired users. Finally, I deducted points for repetition, grammar errors, or hallucinations (invented details). Each alt-text received a score from 0 to 5, which I then converted to a letter grade from A to F.
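
For reference, that conversion is just a set of score thresholds; a sketch of such a mapping (the exact cut-offs shown here are illustrative, not necessarily the ones I used) looks like this:

def score_to_grade(score: float) -> str:
    """Map a 0-5 score to a letter grade (illustrative thresholds)."""
    thresholds = [(4.5, "A"), (3.5, "B"), (2.5, "C"), (1.5, "D"), (0.5, "E")]
    for cutoff, grade in thresholds:
        if score >= cutoff:
            return grade
    return "F"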

| Model | Repetitions | Hallucinations | Mood | Average score | Grade |
| --- | --- | --- | --- | --- | --- |
| VIT-GPT2 | Often | Often | Poor | 0.4/5 | F |
| GIT | Often | Often | Poor | 1.6/5 | D |
| BLIP | Often | Often | Poor | 1.8/5 | D |
| BLIP2 w/OPT | Rarely | Sometimes | Fair | 2.6/5 | C |
| BLIP2 w/FLAN | Rarely | Sometimes | Fair | 2.2/5 | D |
| LLaVA 13B | Never | Sometimes | Good | 3.2/5 | C |
| LLaVA 34B | Never | Sometimes | Good | 3.2/5 | C |
| MiniCPM-V | Never | Never | Good | 3.8/5 | B |
| Llama 11B | Never | Rarely | Good | 4.4/5 | B |
| Llama 90B | Never | Rarely | Good | 3.8/5 | B |
| GPT-4o | Never | Never | Good | 4.8/5 | A |
| Claude 3.5 Sonnet | Never | Never | Good | 5/5 | A |

The cloud-based models, GPT-4o and Claude 3.5 Sonnet, performed nearly perfectly on my small test of five images, with no major errors, hallucinations, or repetitions, and excellent mood detection.

Among local models, both Llama variants and MiniCPM-V show the strongest performance.

Repetition in descriptions frustrates users of screen readers. Early models like VIT-GPT2, GIT, BLIP, and BLIP2 frequently repeat content, making them unsuitable.

In my opinion, hallucinations are a serious issue: describing nonexistent objects or actions misleads visually impaired users and erodes trust. Among the best-performing local models, MiniCPM-V did not hallucinate, while Llama 11B and Llama 90B each made one mistake. Llama 90B misidentified a cabinet at the museum as a table, and Llama 11B described multiple people wakeboarding instead of just one. While these errors aren't dramatic, they are still frustrating.

Capturing mood is essential for giving visually impaired users a richer understanding of images. While early models struggled in this area, all recent models performed well, including both LLaVA variants and MiniCPM-V.

From a practical standpoint, Llama 11B and MiniCPM-V ran smoothly on my 32GB RAM laptop, while Llama 90B needed more memory. In short, Llama 11B and MiniCPM-V are my best candidates for additional testing.

Possible next steps

The results raise a tough question: is a "B"-grade alt-text better than none at all? Many human-written alt-texts probably aren't perfect either. Should I wait for local models to reach an "A" grade, or is an imperfect description good enough in the meantime?

Here are four possible next steps:

  1. Combine AI outputs – Run the same image through different models and merge their results to try to create more accurate descriptions.
  2. Wait and upgrade – Use the best local model for now, tag AI-generated alt-texts in the database, and refresh them in 6–12 months when new and better local models are available.
  3. Go cloud-based – Get the best quality with a cloud model, even if it means uploading 65GB of photos. I can't explain why, or if the feeling is even justified, but it feels like giving in.
  4. Hybrid approach – Use AI to generate alt-texts but review them manually. With 9,000 images, that is not practical, so I'd need a way to flag the alt-texts most likely to be wrong. Can LLMs give me a reliable confidence score? One possible approach is sketched below.
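
As a rough illustration of that last idea, one could ask the model to return a confidence value alongside its description and flag low-confidence results for manual review. Whether self-reported confidence is actually reliable is exactly what I'd need to verify; the sketch below, using the Ollama Python client with an illustrative prompt, model tag, and threshold, just shows the shape of it:

import json
import ollama

prompt = (
    "Describe this image in one sentence, suitable as alt-text. "
    'Reply as JSON: {"alt_text": "...", "confidence": <0.0-1.0>}'
)

response = ollama.chat(
    model="llama3.2-vision:11b",
    messages=[{"role": "user", "content": prompt, "images": ["test-images/image-1.jpg"]}],
    format="json",  # ask Ollama to constrain the output to valid JSON
)

result = json.loads(response["message"]["content"])
if result["confidence"] < 0.7:  # arbitrary review threshold
    print("Flag for manual review:", result["alt_text"])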

Each option comes with trade-offs: some are quick but imperfect, others take more work but might be worth it. Going cloud-based is the easiest, but it feels like giving in. Waiting for better models is effortless, but it means delaying progress. Merging AI outputs or assigning confidence scores takes more effort, but it might strike the best balance of speed and accuracy.

Maybe the solution is a combination of these options? I could go cloud-based now, tag the AI-generated alt-texts in my database, and regenerate them in 6–12 months when LLMs have gotten even better.

It also comes down to pragmatism versus principle. Should I stick to local models because I believe in data privacy and Open Source, or should I prioritize accessibility by providing the best possible alt-text for users? The local-first approach better aligns with my values, but it might come at the cost of a worse experience for visually impaired users.

I'll be weighing these options over the next few weeks. What would you do? I'd love to hear your thoughts!

— Dries Buytaert