Introduction
The era of text-only LLMs is ending. Modern vision-language models like GPT-4V, Claude 3, and Gemini can see images, understand diagrams, read documents, and reason about visual content alongside text. This opens entirely new application categories: document understanding, visual Q&A, image-based search, accessibility tools, and creative applications. This guide covers building multi-modal AI applications using the latest APIs, from basic image understanding to complex document processing pipelines, with practical code examples for each major provider.
Multi-modal AI enables applications that understand images, audio, and video alongside text. This two-part series takes you from API fundamentals to production-ready multi-modal systems.
- Part 1 (this article): Vision API fundamentals – OpenAI GPT-4V, Claude Vision, Gemini
- Part 2: Advanced applications – Image analysis, Audio processing, Multi-modal RAG

Multi-Modal AI Architecture
Multi-modal AI systems process multiple input types—images, audio, video, and text—through specialized encoders before combining them for unified understanding. This diagram shows the architecture of a modern vision-language system.
flowchart TB
subgraph Input["Input Modalities"]
IMG[Images]
TXT[Text/Prompts]
AUD[Audio]
VID[Video Frames]
end
subgraph Encoders["Encoders"]
VE["Vision Encoder<br/>ViT/CLIP"]
TE["Text Encoder<br/>Transformer"]
AE["Audio Encoder<br/>Whisper"]
end
subgraph Fusion["Multi-Modal Fusion"]
CA[Cross-Attention]
EMB[Shared Embedding Space]
end
subgraph LLM["Language Model"]
DEC["Decoder<br/>GPT-4V/Gemini/Claude"]
end
subgraph Output["Output"]
RSP[Text Response]
ACT[Actions/Tools]
end
IMG --> VE
TXT --> TE
AUD --> AE
VID --> VE
VE --> CA
TE --> CA
AE --> CA
CA --> EMB
EMB --> DEC
DEC --> RSP
DEC --> ACT
style IMG fill:#E3F2FD,stroke:#90CAF9,stroke-width:2px,color:#1565C0
style TXT fill:#E3F2FD,stroke:#90CAF9,stroke-width:2px,color:#1565C0
style AUD fill:#E3F2FD,stroke:#90CAF9,stroke-width:2px,color:#1565C0
style VID fill:#E3F2FD,stroke:#90CAF9,stroke-width:2px,color:#1565C0
style VE fill:#E8F5E9,stroke:#A5D6A7,stroke-width:2px,color:#2E7D32
style TE fill:#E8F5E9,stroke:#A5D6A7,stroke-width:2px,color:#2E7D32
style AE fill:#E8F5E9,stroke:#A5D6A7,stroke-width:2px,color:#2E7D32
style CA fill:#F3E5F5,stroke:#CE93D8,stroke-width:2px,color:#6A1B9A
style EMB fill:#F3E5F5,stroke:#CE93D8,stroke-width:2px,color:#6A1B9A
style DEC fill:#FFF3E0,stroke:#FFCC80,stroke-width:2px,color:#E65100
style RSP fill:#E0F2F1,stroke:#80CBC4,stroke-width:2px,color:#00695C
style ACT fill:#E0F2F1,stroke:#80CBC4,stroke-width:2px,color:#00695C
Figure 1: Multi-Modal AI Architecture
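To make the shared embedding space in Figure 1 concrete, here is a small illustrative sketch using the open-source CLIP model via Hugging Face transformers (an assumption for illustration only; the hosted APIs covered below do not expose their internal encoders). It embeds one image and a few candidate captions into the same vector space and scores their similarity.
# Minimal sketch of a shared image-text embedding space using CLIP.
# Assumes `pip install transformers pillow torch`; the model choice is illustrative.
from PIL import Image
import torch
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("diagram.png")
captions = ["an architecture diagram", "a photo of a cat", "a bar chart"]

inputs = processor(text=captions, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    outputs = model(**inputs)

# logits_per_image holds image-text similarity scores in the shared space
probs = outputs.logits_per_image.softmax(dim=-1)
for caption, prob in zip(captions, probs[0].tolist()):
    print(f"{caption}: {prob:.2f}")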
OpenAI Vision API
OpenAI’s GPT-4 Vision capability (GPT-4V), now carried forward in GPT-4o, can analyze images alongside text, enabling use cases from document understanding to visual question answering. The API accepts base64-encoded images or URLs and returns natural language descriptions and analysis.
from openai import OpenAI
import base64
from pathlib import Path
client = OpenAI()
def encode_image(image_path: str) -> str:
    """Encode image to base64."""
    with open(image_path, "rb") as f:
        return base64.standard_b64encode(f.read()).decode("utf-8")

def analyze_image(image_path: str, prompt: str) -> str:
    """Analyze an image with GPT-4V."""
    base64_image = encode_image(image_path)

    # Determine media type
    suffix = Path(image_path).suffix.lower()
    media_types = {".jpg": "jpeg", ".jpeg": "jpeg", ".png": "png", ".gif": "gif", ".webp": "webp"}
    media_type = media_types.get(suffix, "jpeg")

    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": prompt},
                    {
                        "type": "image_url",
                        "image_url": {
                            "url": f"data:image/{media_type};base64,{base64_image}",
                            "detail": "high"  # "low", "high", or "auto"
                        }
                    }
                ]
            }
        ],
        max_tokens=1000
    )
    return response.choices[0].message.content
# Analyze from URL
def analyze_image_url(url: str, prompt: str) -> str:
    """Analyze an image from URL."""
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": prompt},
                    {"type": "image_url", "image_url": {"url": url}}
                ]
            }
        ],
        max_tokens=1000
    )
    return response.choices[0].message.content
# Multiple images
def compare_images(image_paths: list[str], prompt: str) -> str:
    """Compare multiple images."""
    content = [{"type": "text", "text": prompt}]
    for path in image_paths:
        base64_image = encode_image(path)
        content.append({
            "type": "image_url",
            "image_url": {"url": f"data:image/jpeg;base64,{base64_image}"}
        })

    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": content}],
        max_tokens=1500
    )
    return response.choices[0].message.content

# Usage examples
result = analyze_image("diagram.png", "Explain this architecture diagram in detail.")
print(result)

comparison = compare_images(
    ["before.png", "after.png"],
    "What are the differences between these two UI designs?"
)
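Image resolution largely determines vision token cost, and the detail parameter above only partially controls it. Below is a minimal preprocessing sketch, assuming Pillow is installed; the 2048-pixel cap and JPEG quality are illustrative choices, not provider requirements. It can stand in for encode_image() when working with large screenshots or scans.
from PIL import Image
import base64
import io

def encode_image_resized(image_path: str, max_side: int = 2048) -> str:
    """Downscale large images before base64 encoding to limit vision token cost."""
    img = Image.open(image_path)
    if max(img.size) > max_side:
        # thumbnail() resizes in place and preserves the aspect ratio
        img.thumbnail((max_side, max_side))
    buffer = io.BytesIO()
    img.convert("RGB").save(buffer, format="JPEG", quality=90)
    return base64.standard_b64encode(buffer.getvalue()).decode("utf-8")

# Drop-in replacement for encode_image() when inputs are large
base64_image = encode_image_resized("large_screenshot.png")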
Claude Vision API
Anthropic’s Claude models offer competitive vision capabilities with strong performance on document analysis and reasoning tasks. Claude’s constitutional AI approach can make it particularly suitable for sensitive document processing.
import anthropic
import base64
import json

client = anthropic.Anthropic()
def analyze_with_claude(image_path: str, prompt: str) -> str:
    """Analyze image with Claude 3."""
    with open(image_path, "rb") as f:
        image_data = base64.standard_b64encode(f.read()).decode("utf-8")

    # Determine media type
    if image_path.endswith(".png"):
        media_type = "image/png"
    elif image_path.endswith(".gif"):
        media_type = "image/gif"
    elif image_path.endswith(".webp"):
        media_type = "image/webp"
    else:
        media_type = "image/jpeg"

    message = client.messages.create(
        model="claude-3-5-sonnet-20241022",
        max_tokens=1024,
        messages=[
            {
                "role": "user",
                "content": [
                    {
                        "type": "image",
                        "source": {
                            "type": "base64",
                            "media_type": media_type,
                            "data": image_data,
                        },
                    },
                    {
                        "type": "text",
                        "text": prompt
                    }
                ],
            }
        ],
    )
    return message.content[0].text
# Document understanding with Claude
def extract_from_document(image_path: str) -> dict:
    """Extract structured data from document image."""
    prompt = """Analyze this document and extract:
    1. Document type (invoice, receipt, form, etc.)
    2. Key fields and values
    3. Any tables or structured data
    4. Important dates and amounts
    Return as JSON."""
    result = analyze_with_claude(image_path, prompt)
    try:
        return json.loads(result)
    except json.JSONDecodeError:
        # Fall back to raw text if the model did not return valid JSON
        return {"raw_text": result}
# Chart/graph analysis
def analyze_chart(image_path: str) -> str:
    """Analyze a chart or graph."""
    prompt = """Analyze this chart/graph:
    1. What type of visualization is this?
    2. What data is being presented?
    3. What are the key trends or insights?
    4. Are there any anomalies or notable patterns?
    Provide a detailed analysis."""
    return analyze_with_claude(image_path, prompt)
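A quick usage sketch for the Claude helpers above; the file names are placeholders.
invoice = extract_from_document("scanned_invoice.png")
print(invoice)

chart_summary = analyze_chart("quarterly_revenue.png")
print(chart_summary)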
Google Gemini Vision
Google’s Gemini models are natively multi-modal, trained from the ground up on mixed text and image data. This native multi-modality often yields strong performance on tasks requiring deep visual understanding.
import google.generativeai as genai
from PIL import Image
genai.configure(api_key="your-api-key")
def analyze_with_gemini(image_path: str, prompt: str) -> str:
    """Analyze image with Gemini."""
    model = genai.GenerativeModel("gemini-1.5-pro")
    image = Image.open(image_path)
    response = model.generate_content([prompt, image])
    return response.text
# Video analysis with Gemini
def analyze_video(video_path: str, prompt: str) -> str:
    """Analyze video with Gemini (supports up to 1 hour)."""
    model = genai.GenerativeModel("gemini-1.5-pro")

    # Upload video
    video_file = genai.upload_file(video_path)

    # Wait for processing
    import time
    while video_file.state.name == "PROCESSING":
        time.sleep(5)
        video_file = genai.get_file(video_file.name)

    response = model.generate_content([prompt, video_file])
    return response.text
# Multi-turn visual conversation
def visual_conversation():
    """Have a multi-turn conversation about an image."""
    model = genai.GenerativeModel("gemini-1.5-pro")
    chat = model.start_chat()
    image = Image.open("architecture.png")

    # First turn with image
    response = chat.send_message([
        "Here's an architecture diagram. What components do you see?",
        image
    ])
    print(response.text)

    # Follow-up questions (image context maintained)
    response = chat.send_message("What are the potential bottlenecks?")
    print(response.text)

    response = chat.send_message("How would you improve the scalability?")
    print(response.text)
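And a brief usage sketch for the Gemini helpers; the paths are placeholders, and uploads of long videos can take a while to finish processing.
summary = analyze_with_gemini("dashboard.png", "Summarize the key metrics in this screenshot.")
print(summary)

video_notes = analyze_video("product_demo.mp4", "List the main steps shown in this screen recording.")
print(video_notes)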
Document Processing Pipeline
Putting the pieces together, the pipeline below classifies a document image, extracts type-specific fields, and pulls out any tables using GPT-4o. It reuses the encode_image() helper defined in the OpenAI section above.
from dataclasses import dataclass
from typing import Optional
import json
from openai import OpenAI
client = OpenAI()
@dataclass
class DocumentResult:
    doc_type: str
    extracted_fields: dict
    tables: list[dict]
    confidence: float
    raw_text: str
class DocumentProcessor:
    """Process documents using vision models."""

    def __init__(self, model: str = "gpt-4o"):
        self.model = model

    def process(self, image_path: str) -> DocumentResult:
        """Process a document image."""
        # Step 1: Classify document type
        doc_type = self._classify_document(image_path)

        # Step 2: Extract fields based on type
        fields = self._extract_fields(image_path, doc_type)

        # Step 3: Extract tables if present
        tables = self._extract_tables(image_path)

        return DocumentResult(
            doc_type=doc_type,
            extracted_fields=fields,
            tables=tables,
            confidence=0.9,  # placeholder; the API does not return a calibrated confidence
            raw_text=""
        )
    def _classify_document(self, image_path: str) -> str:
        """Classify document type."""
        base64_image = encode_image(image_path)
        response = client.chat.completions.create(
            model=self.model,
            messages=[
                {
                    "role": "user",
                    "content": [
                        {
                            "type": "text",
                            "text": "Classify this document. Return only the type: invoice, receipt, form, contract, letter, report, or other."
                        },
                        {
                            "type": "image_url",
                            "image_url": {"url": f"data:image/jpeg;base64,{base64_image}"}
                        }
                    ]
                }
            ],
            max_tokens=50
        )
        return response.choices[0].message.content.strip().lower()
    def _extract_fields(self, image_path: str, doc_type: str) -> dict:
        """Extract fields based on document type."""
        field_prompts = {
            "invoice": "Extract: invoice_number, date, due_date, vendor_name, total_amount, line_items",
            "receipt": "Extract: store_name, date, items, subtotal, tax, total",
            "form": "Extract all filled fields and their values",
            "contract": "Extract: parties, effective_date, term, key_terms"
        }
        prompt = field_prompts.get(doc_type, "Extract all key information")

        base64_image = encode_image(image_path)
        response = client.chat.completions.create(
            model=self.model,
            messages=[
                {
                    "role": "user",
                    "content": [
                        {
                            "type": "text",
                            "text": f"{prompt}. Return as JSON."
                        },
                        {
                            "type": "image_url",
                            "image_url": {"url": f"data:image/jpeg;base64,{base64_image}"}
                        }
                    ]
                }
            ],
            response_format={"type": "json_object"},
            max_tokens=1000
        )
        return json.loads(response.choices[0].message.content)
    def _extract_tables(self, image_path: str) -> list[dict]:
        """Extract tables from document."""
        base64_image = encode_image(image_path)
        response = client.chat.completions.create(
            model=self.model,
            messages=[
                {
                    "role": "user",
                    "content": [
                        {
                            "type": "text",
                            "text": "Extract any tables from this document. Return a JSON object with a 'tables' key containing an array of tables, each with headers and rows."
                        },
                        {
                            "type": "image_url",
                            "image_url": {"url": f"data:image/jpeg;base64,{base64_image}"}
                        }
                    ]
                }
            ],
            response_format={"type": "json_object"},
            max_tokens=2000
        )
        result = json.loads(response.choices[0].message.content)
        return result.get("tables", [])
# Usage
processor = DocumentProcessor()
result = processor.process("invoice.png")
print(f"Document type: {result.doc_type}")
print(f"Fields: {result.extracted_fields}")
References
- OpenAI Vision: https://platform.openai.com/docs/guides/vision
- Claude Vision: https://docs.anthropic.com/claude/docs/vision
- Gemini Vision: https://ai.google.dev/gemini-api/docs/vision
- LlamaIndex Multi-Modal: https://docs.llamaindex.ai/en/stable/module_guides/models/multi_modal/
Key Takeaways
- ✅ Choose models by use case – GPT-4V for general vision, Claude for documents, Gemini for native multi-modality
- ✅ Optimize image handling – Resize images appropriately, use base64 for reliability, consider cost per image
- ✅ Structure prompts carefully – Be specific about desired output format and analysis criteria
- ✅ Combine modalities thoughtfully – Multi-modal RAG enables powerful search over visual documents
- ✅ Handle errors gracefully – Vision APIs can fail on low-quality images or unsupported content; a minimal retry sketch follows this list
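As a concrete illustration of that last takeaway, here is a minimal, hedged retry wrapper around the OpenAI call used earlier. It reuses the encode_image() helper and client from the OpenAI section, and the two-retry exponential backoff is an arbitrary illustrative policy, not a provider recommendation.
import time
from openai import APIError

def analyze_image_safely(image_path: str, prompt: str, retries: int = 2) -> str | None:
    """Analyze an image with basic retries; return None if every attempt fails."""
    base64_image = encode_image(image_path)
    for attempt in range(retries + 1):
        try:
            response = client.chat.completions.create(
                model="gpt-4o",
                messages=[{
                    "role": "user",
                    "content": [
                        {"type": "text", "text": prompt},
                        {"type": "image_url", "image_url": {"url": f"data:image/jpeg;base64,{base64_image}"}}
                    ]
                }],
                max_tokens=1000,
            )
            return response.choices[0].message.content
        except APIError as exc:
            if attempt == retries:
                # Out of retries: degrade gracefully instead of crashing the pipeline
                print(f"Vision call failed after {retries + 1} attempts: {exc}")
                return None
            time.sleep(2 ** attempt)  # simple exponential backoff between attempts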
Conclusion
Multi-modal AI transforms what’s possible with LLM applications. Document processing that once required complex OCR pipelines now works with a single API call. Visual Q&A enables natural interaction with images and diagrams. Video understanding opens new possibilities for content analysis and accessibility. The key is choosing the right model for your use case: GPT-4o excels at general vision tasks, Claude 3 is strong at document understanding and reasoning, and Gemini handles long videos uniquely well. Start with simple image analysis, then build toward complex pipelines that combine vision with text processing. Remember that vision tokens are more expensive than text—optimize by using appropriate detail levels and preprocessing images to reasonable sizes. The multi-modal future is here, and the applications are limited only by imagination.