Generate a summary of an image with an LLM in Python

Create text summaries of images with Python and the LLaVA AI model

Original photo by Anthony DELANOIX on Unsplash. Text added by author.

With the emergence of large language models, it has become easier than ever to generate a dataset for your AI application.

Everyone has access to the same technologies, but only you potentially own the data that you can turn into a moat for your AI application.

What do I mean by a moat? A moat is a deep ditch that surrounds a castle. In a business context, the metaphor describes a competitive advantage that makes it difficult for others to replicate your product or service.

One way of creating a moat is to turn your images into a dataset for your AI application.

1. What’s this article about?

We’ll go through a series of steps on generating text summaries from images in Python.

You can run this procedure locally. In other words, it’s private and free.

2. Toolbox

To generate text summaries from images, we’re going to use:

  • Python
  • llama.cpp
  • the LLaVA large multimodal model

Let’s break this down.

2.1 Python

I assume you’re already familiar with Python — the most popular programming language for AI development due to its simplicity.

2.2 LLaMa.cpp

Playing with the Llama 2 large language model on an M2 Pro MacBook with llama.cpp (video by author).

LLaMa.cpp is an inference engine written in C++ that enables us to run a large language model on a laptop.

How can LLaMa.cpp be so efficient?

TL;DR: it achieves this with quantization, which stores the model weights at lower numerical precision and so reduces the amount of memory needed to hold the model.
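To make the memory saving concrete, here is a back-of-the-envelope sketch. It assumes a model with 7 billion weights and nominal precisions of 16, 5, and 4 bits per weight. Real quantized files also store extra data such as scaling factors, so actual sizes are somewhat larger. The function name is made up for illustration.

```python
def estimate_weight_memory_gb(n_params: float, bits_per_weight: float) -> float:
    """Rough memory needed to hold the weights alone, in gigabytes (10^9 bytes)."""
    return n_params * bits_per_weight / 8 / 1e9


N_PARAMS = 7e9  # assumption: a 7B-parameter model

for bits in (16, 5, 4):
    size_gb = estimate_weight_memory_gb(N_PARAMS, bits)
    print(f"{bits:>2} bits per weight: ~{size_gb:.1f} GB")
```

Going from 16 to 4 bits per weight cuts the estimate to a quarter, which is what makes a laptop a realistic target.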

The library was written by Georgi Gerganov. As he states, the initial version was hacked in an evening, and it has since been significantly improved by him and the community.

See the “How is LLaMa.cpp possible?” article for a detailed explanation.

2.3 LLaVA large multimodal model

LLaVa paper: Visual Instruction Tuning (Screenshot by author).

LLaVa is short for Large Language and Vision Assistant.

It is a large multimodal model that combines a vision encoder and Vicuna for visual and language understanding.

2.4 What’s groundbreaking about LLaVa?

Visual Instruction Tuning: a technique that involves fine-tuning a large language model with visual cues.

This makes the model learn a connection between images and text.
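A toy sketch of that connection, assuming (as the Visual Instruction Tuning paper describes) that image features from the vision encoder are mapped by a learned projection into the language model’s embedding space. All numbers, dimensions, and names below are made up. Real models use thousands of dimensions and learn the projection during training.

```python
def project(features, weight):
    """Map each feature vector (a row) into the language embedding space."""
    return [
        [sum(f * w for f, w in zip(row, column)) for column in zip(*weight)]
        for row in features
    ]


# made-up numbers: 2 image patches with 3 features each
image_features = [[1.0, 0.0, 2.0], [0.5, 1.0, 0.0]]
# made-up projection matrix: 3 vision dimensions -> 2 language dimensions
projection = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
# made-up embeddings of 2 text tokens
text_embeddings = [[0.1, 0.2], [0.3, 0.4]]

image_tokens = project(image_features, projection)
# the language model then sees image tokens and text tokens as one sequence
sequence = image_tokens + text_embeddings
print(len(sequence), "tokens of dimension", len(sequence[0]))
```

After the projection, image patches look like ordinary token embeddings to the language model, which is why text about the image can be generated in the usual way.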

2.5 What’s the easiest way to try the latest LLaVa model?

You can try the latest LLaVA model online.

Just upload an image and write the prompt.

Demo of LLaVA: Large Language and Vision Assistant (screenshot by author).

See the Visual Instruction Tuning paper on arXiv to learn more about the LLaVa model.

3. Setup

3.1 Downloading LLaVA model

mys/ggml_llava-v1.5-7b repository on Huggingface (screenshot by author).

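Let’s start by downloading the LLaVA model, as it takes quite some time to download.

You need to download two files:

  • a ggml-model file (for example, ggml-model-q5_k.gguf)
  • the projector file mmproj-model-f16.gguf

You can download the files from the mys/ggml_llava-v1.5-7b repository.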

Which ggml-model should you download?

It depends on how much main memory you have.

q5_k works well on an M2 Pro MacBook with 32 GB of memory, while q4_k is less memory intensive.
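If you prefer to script the download, here is a minimal sketch using the huggingface_hub package. This is an assumption on my side: the package is not part of the original setup and has to be installed separately (for example with pip). The target folder and the choice of the q5_k file mirror the paths used later in this article.

```python
from pathlib import Path

REPO_ID = "mys/ggml_llava-v1.5-7b"
# swap in the q4_k file if you have less memory
FILENAMES = ["ggml-model-q5_k.gguf", "mmproj-model-f16.gguf"]
# assumption: the same folder that MODEL_PATH points to later in this article
LOCAL_DIR = Path("~/Models/ggml_llava-v1.5-7b").expanduser()


def download_llava_files(filenames=FILENAMES, local_dir=LOCAL_DIR):
    """Download the model and projector files; returns the local file paths."""
    # imported lazily so the rest of the script works without the package
    from huggingface_hub import hf_hub_download

    local_dir.mkdir(parents=True, exist_ok=True)
    return [
        hf_hub_download(repo_id=REPO_ID, filename=name, local_dir=local_dir)
        for name in filenames
    ]
```

Calling download_llava_files() starts the download. Nothing is fetched when the snippet is merely imported.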

3.2 Building llama.cpp

The next step is to install the inference engine, llama.cpp.

First, you need to clone the repository:

git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp

Then you need to build it.

On macOS, you can build it with CMake:

mkdir build
cd build
cmake ..
cmake --build . --config Release

For other operating systems, please refer to the llama.cpp documentation.

You should see the llava executable in the build/bin folder if your build was successful.

llama.cpp executables after successful build (image by author).

3.3 Downloading image data

Images of Paris on Unsplash (screenshot by author).

Here are five random images of Paris that I’ve downloaded from Unsplash. Click on the “Photo” below to get the actual image.

To repeat my experiment, download these images and save them to the data/image folder.

4. Summarizing images

Now that we have the required tools installed, we’re ready for action!

We just need to tie everything together and run it.

You can run the code below in a Python script or a Jupyter Notebook. Pick your favorite one.

4.1 Specifying paths

For this project, I created the following folder hierarchy, where image_summary is the project’s root folder:

image_summary/
    image_summary.ipynb
    data/
        image/ # folder with images
            anthony-delanoix-Q0-fOL2nqZc-unsplash.jpg
            ...
        txt/ # empty folder

In the image_summary.ipynb script, we specify

  • LLaVa executable path
  • model path
  • projector file path

Set these paths accordingly.

LLAVA_EXEC_PATH = "~/Code/llama.cpp/build/bin/llava"
MODEL_PATH = "~/Models/ggml_llava-v1.5-7b/ggml-model-q5_k.gguf"
MMPROJ_PATH = "~/Models/ggml_llava-v1.5-7b/mmproj-model-f16.gguf"
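One thing to watch out for: the ~ in these paths is expanded by the shell, not by Python, so a plain existence check on the raw strings fails. A small sketch of a sanity check that expands ~ first. The helper name missing_files is hypothetical.

```python
from pathlib import Path


def missing_files(*paths):
    """Return the given paths that do not point to an existing file."""
    # expanduser() turns a leading ~ into the home directory
    return [p for p in paths if not Path(p).expanduser().is_file()]
```

Calling missing_files(LLAVA_EXEC_PATH, MODEL_PATH, MMPROJ_PATH) should return an empty list if everything is in place.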

Then we specify the IMAGE and TXT folder paths:

from pathlib import Path

DATA_DIR = "data"
IMAGE_DIR = Path(DATA_DIR, "image")
TXT_DIR = Path(DATA_DIR, "txt")

Use glob to read image paths. You should see a list of image paths if everything is in its place.

import glob

image_paths = sorted(glob.glob(str(IMAGE_DIR.joinpath("*.jpg"))))

# image_paths output
# ['data/image/anthony-delanoix-Q0-fOL2nqZc-unsplash.jpg',
# 'data/image/arthur-humeau-3xwdarHxHqI-unsplash.jpg',
# 'data/image/bastien-nvs-SprV1eqNrqM-unsplash.jpg',
# 'data/image/marloes-hilckmann-EUzxLX8p8IA-unsplash.jpg',
# 'data/image/michael-fousert-Ql9PCaOhyyE-unsplash.jpg']
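If your folder mixes file types, a pathlib-based variant can collect several extensions at once. This is only a sketch: the function name and the set of extensions are assumptions, not part of the code above.

```python
from pathlib import Path

IMAGE_EXTENSIONS = {".jpg", ".jpeg", ".png"}  # assumption: adjust to your data


def list_images(image_dir, extensions=IMAGE_EXTENSIONS):
    """Return sorted image paths (as strings) with one of the given extensions."""
    return sorted(
        str(path)
        for path in Path(image_dir).iterdir()
        # compare in lowercase so .JPG and .jpg are treated alike
        if path.is_file() and path.suffix.lower() in extensions
    )
```

list_images(IMAGE_DIR) would then replace the glob call.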

4.2 Specifying LLaVa command

The llava executable was created when we built llama.cpp.

It takes multiple arguments, which may seem familiar if you have ever run a large language model locally:

  • Prompt — we tell the model to act like a tourist guide as we are working with images of Paris.
  • Temperature — for LLaVa, a lower temperature like 0.1 is recommended for better quality.
  • model and projector file paths

The bash_command variable ties everything together.

TEMP = 0.1
PROMPT = "The image shows a site in Paris. Describe the image like a tourist guide would."

bash_command = f'{LLAVA_EXEC_PATH} -m {MODEL_PATH} --mmproj {MMPROJ_PATH} --temp {TEMP} -p "{PROMPT}"'

# Bash command output
# ~/Code/llama.cpp/build/bin/llava -m ~/Models/ggml_llava-v1.5-7b/ggml-model-q5_k.gguf --mmproj ~/Models/ggml_llava-v1.5-7b/mmproj-model-f16.gguf --temp 0.1 -p "The image shows a site in Paris. Describe the image like a tourist guide would."
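A command built as one string breaks when the prompt or a file name contains quotes. A possible alternative, sketched here with the same flags, is to build a list of arguments that subprocess can run without a shell. The function name is hypothetical. Note that ~ has to be expanded in Python in this case, because no shell is involved.

```python
import os


def build_llava_args(exec_path, model_path, mmproj_path, prompt, image_path, temp=0.1):
    """Build the llava call as a list of arguments (no shell quoting needed)."""
    return [
        os.path.expanduser(exec_path),
        "-m", os.path.expanduser(model_path),
        "--mmproj", os.path.expanduser(mmproj_path),
        "--temp", str(temp),
        "-p", prompt,  # passed as-is, quotes included
        "--image", str(image_path),
    ]
```

The resulting list can be passed directly to subprocess.run or subprocess.Popen without shell=True.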

4.3 Generating summaries of images

Generating summaries of images in JupyterLab (Screenshot by author)

We’re going to call the LLaVA executable with Python’s subprocess module.

subprocess is a standard-library module that (among other things) provides a way to run a command in the shell.

With the code below, we loop over images and save the text summary for each image in the TXT folder.

import subprocess

for image_path in image_paths:
    print(f"Processing {image_path}")
    image_name = Path(image_path).stem
    image_summary_path = TXT_DIR.joinpath(image_name + ".txt")

    # add input image and output txt filenames to bash command;
    # the shell redirect (>) writes llava's stdout to the txt file
    bash_command_cur = f'{bash_command} --image "{image_path}" > "{image_summary_path}"'

    # run the bash command
    process = subprocess.Popen(
        bash_command_cur, shell=True, stdout=subprocess.PIPE, stderr=subprocess.PIPE
    )

    # get the output and error from the command
    output, error = process.communicate()

    # comment out the output and error prints for less verbose output
    print("Output:")
    print(output.decode("utf-8"))

    print("Error:")
    print(error.decode("utf-8"))

    # get the return code of the command
    return_code = process.returncode
    print(f"Return code: {return_code}")
    print()

print("Done")
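A variant of the loop body that avoids the shell redirect, as a sketch: subprocess.run captures stdout and Python writes it to the txt file. The helper name run_and_save is made up, and args is assumed to be a list of arguments, such as the llava executable followed by its flags.

```python
import subprocess
from pathlib import Path


def run_and_save(args, output_path):
    """Run a command, write its stdout to output_path, return the exit code."""
    result = subprocess.run(args, capture_output=True, text=True)
    Path(output_path).write_text(result.stdout, encoding="utf-8")
    if result.returncode != 0:
        # stderr usually explains what went wrong (e.g. a wrong model path)
        print(result.stderr)
    return result.returncode
```

Checking the return code per image makes it easier to spot which files need a second run.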

Processing one image takes about 30 seconds on an M2 MacBook Pro.

“Done” is printed once the code has finished executing.

4.4 Inspecting the generated text

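Let’s read the generated txt files into a Python list.

# collect the generated txt files (one per image)
filepaths = sorted(glob.glob(str(TXT_DIR.joinpath("*.txt"))))

image_texts = []

for filepath in filepaths:
    with open(filepath, "r") as f:
        image_text = f.read()
    image_texts.append(image_text)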

With print(image_texts[0]), we print the generated text for the first image.

clip_model_load: model name:   openai/clip-vit-large-patch14-336
clip_model_load: description:  image encoder for LLaVA
clip_model_load: GGUF version: 2
clip_model_load: alignment:    32
clip_model_load: n_tensors:    377
clip_model_load: n_kv:         18
clip_model_load: ftype:        f16

clip_model_load: text_encoder:   0
clip_model_load: vision_encoder: 1
clip_model_load: llava_projector:  1
clip_model_load: model size:     595.62 MB
clip_model_load: metadata size:  0.14 MB
clip_model_load: total allocated memory: 201.77 MB

prompt: 'The image shows a site in Paris. Describe the image like a tourist guide would.'

The image showcases a beautiful view of the iconic Eiffel Tower in Paris, France. The tower stands tall and proudly in the sky, towering over the city. The scene is bustling with activity, as numerous cars and trucks are scattered throughout the area, likely indicating a busy city street. In addition to the vehicles, there are several people visible in the scene, likely enjoying the view or going about their daily activities. The combination of the Eiffel Tower, the busy city street, and the people creates a lively and vibrant atmosphere in this Parisian setting.

main: image encoded in   809.28 ms by CLIP (    1.40 ms per image patch)

I have three observations:

  • The model diagnostic output, prompt and generated output are all saved to the txt file. We need to extract the generated image summary.
  • The generated output is quite good for an image that shows the Eiffel Tower.
  • Your generated text might differ for the same image, because the model samples its output.

4.5 Cleaning up the generated text

We don’t need the model diagnostic output or the prompt, so we can filter them out.

With the code below, we split the text by new lines and keep the lines between the prompt line and the closing line that starts with main:, which is also model diagnostics.

image_text_cleaned = []

for text_index, image_text in enumerate(image_texts):
    # split the text by new lines
    image_text_split = image_text.split("\n")

    # find index of the line that starts with prompt:
    start_index_list = [
        i for i, line in enumerate(image_text_split) if line.startswith("prompt:")
    ]

    # find index of the line that starts with main:
    end_index_list = [
        i for i, line in enumerate(image_text_split) if line.startswith("main:")
    ]

    if (
        len(start_index_list) != 1
        or len(end_index_list) != 1
        or end_index_list[0] < start_index_list[0]
    ):
        # there was a problem with image text
        print(f"Warning: start/end indices couldn't be found for document {text_index}")
        continue

    start_index = start_index_list[0]
    end_index = end_index_list[0]

    # extract the text based on indices above
    # join with newlines so multi-line output keeps its line breaks
    image_text_cleaned.append(
        "\n".join(image_text_split[start_index + 1 : end_index]).strip()
    )
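The same filtering can be wrapped in a function, which makes it easy to check against a small sample. This is a sketch: the function name is mine, the sample is a shortened, made-up stand-in for a real txt file, and the code assumes the output layout shown above (one line starting with prompt:, then the generated text, then a line starting with main:).

```python
def extract_summary(raw_text):
    """Return the text between the 'prompt:' line and the 'main:' line, or None."""
    lines = raw_text.split("\n")
    starts = [i for i, line in enumerate(lines) if line.startswith("prompt:")]
    ends = [i for i, line in enumerate(lines) if line.startswith("main:")]
    # expect exactly one marker of each kind, in the right order
    if len(starts) != 1 or len(ends) != 1 or ends[0] < starts[0]:
        return None
    return "\n".join(lines[starts[0] + 1 : ends[0]]).strip()


sample = (
    "clip_model_load: ftype:        f16\n"
    "\n"
    "prompt: 'Describe the image.'\n"
    "\n"
    " A view of the Eiffel Tower.\n"
    "\n"
    "main: image encoded in   809.28 ms by CLIP\n"
)
print(extract_summary(sample))
```

Returning None instead of skipping the document keeps the results aligned with the list of input files.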

Let’s look at the cleaned text for the first image, image_text_cleaned[0]:

The image showcases a beautiful view of the iconic Eiffel Tower in Paris, France.
The tower stands tall and proudly in the sky, towering over the city.
The scene is bustling with activity, as numerous cars and trucks are scattered throughout the area, likely indicating a busy city street.
In addition to the vehicles, there are several people visible in the scene, likely enjoying the view or going about their daily activities.
The combination of the Eiffel Tower, the busy city street, and the people creates a lively and vibrant atmosphere in this Parisian setting.

It works!

Now that we have extracted the generated text, we could calculate an embedding and save it to a vector database, but that’s for another article.

Let me know in the comments if you would be interested in such a tutorial.

4.6 Evaluating the quality of generated text

Beautiful images of Paris were downloaded from Unsplash. The authors are credited above (Screenshot by author).

Generated text:

Source: data/image/anthony-delanoix-Q0-fOL2nqZc-unsplash.jpg
The image showcases a beautiful view of the iconic Eiffel Tower in Paris, France.
The tower stands tall and proudly in the sky, towering over the city.
The scene is bustling with activity, as numerous cars and trucks are scattered throughout the area, likely indicating a busy city street.
In addition to the vehicles, there are several people visible in the scene, likely enjoying the view or going about their daily activities.
The combination of the Eiffel Tower, the busy city street, and the people creates a lively and vibrant atmosphere in this Parisian setting.

Source: data/image/arthur-humeau-3xwdarHxHqI-unsplash.jpg
The image showcases a beautiful city scene in Paris, featuring a large stone archway with statues on it.
The archway is part of a famous landmark, the Arc de Triomphe, which stands tall in the city.
The sun is shining on the arch, creating a warm and inviting atmosphere.
In the foreground, there is a busy street with several cars parked or driving by.
Some cars are parked closer to the arch, while others are further away.
The street is bustling with activity, making it a lively and vibrant part of the city.

Source: data/image/bastien-nvs-SprV1eqNrqM-unsplash.jpg
The image showcases a beautiful and historic site in Paris, France.
The iconic Eiffel Tower stands tall in the background, towering over the city.
The tower is adorned with statues, adding to its architectural charm.
The scene is bustling with activity, as numerous people can be seen walking around the area, likely admiring the tower and its surroundings.
The presence of benches provides a place for visitors to sit and enjoy the view.
The overall atmosphere of the scene is lively and captivating, reflecting the essence of Parisian culture and history.

Source: data/image/marloes-hilckmann-EUzxLX8p8IA-unsplash.jpg
The image depicts a lively street in Paris, France, with a bustling atmosphere.
The street is lined with numerous potted plants, adding a touch of greenery to the urban setting.
There are several chairs and dining tables placed along the sidewalk, indicating that the area is a popular spot for outdoor dining.
People are walking down the street, with some carrying handbags and backpacks.
There are also several cars parked or driving along the road, and a truck can be seen further down the street.
The scene captures the essence of a busy city street with a mix of pedestrians, vehicles, and outdoor seating.

Source: data/image/michael-fousert-Ql9PCaOhyyE-unsplash.jpg
The image captures a beautiful nighttime scene in Paris, featuring the iconic Eiffel Tower and the Louvre Museum.
The Eiffel Tower, a famous landmark, stands tall in the background, illuminated by lights, while the Louvre Museum, a large and impressive building, is also lit up.
The scene is further enhanced by the reflection of the lights on the water, creating a serene and picturesque atmosphere.
The combination of the illuminated Eiffel Tower, the Louvre Museum, and the water creates a captivating view that showcases the beauty of Paris at night.
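To keep each summary tied to its source image, as in the listing above, the pairs can be written to a JSON Lines file. A minimal sketch, assuming a list of (source, summary) pairs. The function name and the record keys are made up for illustration.

```python
import json


def save_summaries(pairs, output_path):
    """Write (source, summary) pairs to a JSON Lines file, one record per line."""
    with open(output_path, "w", encoding="utf-8") as f:
        for source, summary in pairs:
            record = {"source": source, "summary": summary}
            f.write(json.dumps(record, ensure_ascii=False) + "\n")
```

Note that zip(image_paths, image_text_cleaned) only lines up if no document was skipped during cleaning.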

Let’s evaluate the generated text:

  • The generated text is impressive for the anthony-delanoix-Q0-fOL2nqZc-unsplash and arthur-humeau-3xwdarHxHqI-unsplash images.
  • There is no Eiffel Tower in the bastien-nvs-SprV1eqNrqM-unsplash image. The people around the site are barely visible, yet the model recognizes them. It also recognizes the statues.
  • The marloes-hilckmann-EUzxLX8p8IA-unsplash image shows Montmartre, a famous Paris neighborhood. The name is also written on the canopy, but the generated text does not mention it. This could be improved by first running OCR on the image and adding the extracted text to the prompt. The model could then generate text with that context.
  • There is no Eiffel Tower in the michael-fousert-Ql9PCaOhyyE-unsplash image. The model correctly recognizes the Louvre Museum.

5. Conclusion

The LLaVa model can be used to efficiently generate summaries of images on consumer hardware. Another plus is that it is free and private.

The generated text is pretty impressive, but the quality might decline with lesser-known sites. Images of Paris were most probably included in the model training set.

The model makes a few mistakes, seeing the Eiffel Tower where there is none. This is probably related to the prompt.

I hypothesize that because the prompt mentions Paris, the model generates Eiffel Tower text for almost any site. This could be a pattern in the data on which the model was trained.

For the latter problem, prompt engineering or a larger LLaVA model could help.

For text-heavy images, the text generation procedure could be improved by first doing OCR and then providing the extracted text as context in the prompt. The model would then have access to the relevant context.
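A sketch of the prompt side of that idea. The OCR step itself is left out: ocr_text stands for whatever your OCR tool returns, and the function name and the prompt wording are assumptions for illustration.

```python
def build_prompt_with_ocr(base_prompt, ocr_text):
    """Append OCR text to the prompt as context; return base_prompt if none."""
    # collapse whitespace so line breaks from OCR do not break the prompt
    cleaned = " ".join(ocr_text.split())
    if not cleaned:
        return base_prompt
    return f"{base_prompt} Text visible in the image: {cleaned}."
```

If the prompt is later placed inside a quoted shell command, quotes in the OCR text would need to be escaped or removed first.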

Please let me know in the comments if you spot any other interesting observations that I’ve missed.

Thanks for reading!
