A Windows desktop application for systematically characterizing how image scale and JPEG quality affect the inference performance and response quality of local Vision Language Models (VLMs) running via LLamaSharp/llama.cpp.
The core question this tool answers: what is the lowest fidelity image you can send to your model before its responses meaningfully degrade? Finding that threshold lets you maximize tokens/second without sacrificing output quality.
- Why This Exists
- Prerequisites
- Setup
- Interface Overview
- Chat Mode
- Image Controls
- Auto-Test Mode
- Run Log
- Saving Reports
- Settings Persistence
- Performance Monitor
- Tuning Strategy
- Understanding the Metrics
Most VLM benchmarking tools focus on prompt quality or model selection. This tool focuses on a different bottleneck: image preprocessing. A 4K image sent at full resolution costs far more in tokens and memory than a 25% scaled JPEG at quality 60 — but the model's response may be identical for many tasks.
No public tool existed for sweeping JPEG quality and resolution percentage against local GGUF multimodal models to find that fidelity threshold. This fills that gap.
- Windows
- .NET 8.0 or later
- A GGUF model file with vision support (e.g., LFM2.5-VL, LLaVA variants)
- A multimodal projection file (`.gguf`) matching the model
- LLamaSharp with a compatible llama.cpp native backend
If the bundled llama.cpp version in LLamaSharp doesn't support your model, see llama.cpp-Builder-for-LLamaSharp for building a compatible native binary.
- Launch the application.
- Go to File > Select Models > Model… and select your `.gguf` model file.
- Go to File > Select Models > Projection… and select the matching projection `.gguf`.
- Click File > Load Model to load both into memory.
- Select an image using the Image… button.
Model and image paths are saved automatically and restored on next launch.
```
┌─────────────────────────────────────────────────────────┐
│ File menu (Select Models, Load Model)                   │
├─────────────────────────────────────────────────────────┤
│ System Prompt                                           │
├─────────────────────────────────────────────────────────┤
│ Chat output (RichTextBox)                               │
├─────────────────────────────────────────────────────────┤
│ Message input                              │ Send │ Stop│
├────────────────────┬────────────────────────────────────┤
│ Original image     │ Preview (what the model receives)  │
├─────────────────────────────────────────────────────────┤
│ Scale slider       Scale: 100% (1920×1080)              │
│ Quality slider     Quality: 95                          │
├─────────────────────────────────────────────────────────┤
│ Image path                             │ Image… │ Reset │
├─────────────────────────────────────────────────────────┤
│ Test Prompt                                             │
│ Q Step │ Q Passes │ Scale Step │ Scale Passes           │
│ Scale: [start] to [end]   Quality: [start] to [end]     │
│                              │ Run Test │ Stop Test     │
├─────────────────────────────────────────────────────────┤
│ Run log (sortable by any column)                        │
├─────────────────────────────────────────────────────────┤
│ Status                              │ Save Report… │ CPU│
└─────────────────────────────────────────────────────────┘
```
Chat mode lets you manually explore model behavior before committing to an automated sweep.
- Type a message in the input box and click Send.
- The model responds using the current scale and quality settings applied to the loaded image.
- The Assistant: label only appears when the first token arrives — it is not pre-printed.
- The chat window scrolls automatically as tokens stream in.
- Stop cancels generation without unloading the model.
- Each send logs a row to the run log with full metrics.
The system prompt field at the top is sent with every message. Default: You are a helpful assistant.
Every inference uses a fresh context — there is no conversation memory between sends. This is intentional for benchmarking isolation.
The left panel shows the original image as loaded from disk. The right panel shows exactly what gets encoded and sent to the model — the JPEG-compressed, scaled version. Both panels update in real time as you move the sliders.
Controls the resize percentage applied to the original image before encoding. The label shows the resulting pixel dimensions, e.g. Scale: 75% (1440×810).
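The label arithmetic is straightforward: the preview dimensions are the original dimensions multiplied by the scale percentage and rounded. A minimal sketch (the app itself is C#; this Python function and its name are illustrative only):

```python
def scaled_label(width: int, height: int, scale_pct: int) -> str:
    """Format a scale-slider label like 'Scale: 75% (1440×810)'."""
    w = round(width * scale_pct / 100)
    h = round(height * scale_pct / 100)
    return f"Scale: {scale_pct}% ({w}\u00d7{h})"
```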
Controls the JPEG encoder quality. Higher values produce sharper images at larger file sizes. Lower values introduce compression artifacts. The label shows the current value, e.g. Quality: 60.
Returns scale to 100% and quality to 95, and updates the preview immediately.
- JPEG quality is non-linear. The difference between 95 and 85 is usually imperceptible. The difference between 50 and 40 introduces visible blocking artifacts that can confuse the model.
- Scale and quality compound — lowering both simultaneously degrades the image faster than lowering either one alone.
- The preview panel shows the actual temp file that will be sent — what you see is what the model gets.
Auto-test runs the full cartesian product of scale and quality combinations automatically, logging every result.
| Control | Description |
|---|---|
| Test Prompt | The prompt sent for every test run. Default: Describe what is in this image. |
| Q Step | How much to decrement quality between passes (e.g., 10) |
| Q Passes | How many quality levels to test at each scale level |
| Scale Step | How much to decrement scale between passes (e.g., 25) |
| Scale Passes | How many scale levels to test |
| Scale: [start] to [end] | Explicit start and end scale percentages for the sweep |
| Quality: [start] to [end] | Explicit start and end quality values for the sweep |
Scale and quality lists are built from the start/end values stepping by the step size. The sweep always runs high to low regardless of which spinner has the larger value. The endpoint value is always included even if the step doesn't land on it exactly.
Example:
- Scale start=100, end=25, step=25 → tests at 100%, 75%, 50%, 25%
- Quality start=95, end=45, step=10 → tests at 95, 85, 75, 65, 55, 45
- Total runs = 4 × 6 = 24
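The list-building rule above can be sketched as follows (a sketch of the behavior described, not the app's actual C# code; the function name is illustrative):

```python
def sweep_values(start: int, end: int, step: int) -> list[int]:
    """Build a high-to-low sweep list from start/end/step.

    The sweep always runs high to low regardless of which value is larger,
    and the endpoint is included even if the step doesn't land on it exactly.
    """
    hi, lo = max(start, end), min(start, end)
    values = list(range(hi, lo - 1, -step))
    if values[-1] != lo:
        values.append(lo)  # force-include the endpoint
    return values
```

With the example values above, `sweep_values(100, 25, 25)` gives four scale levels and `sweep_values(95, 45, 10)` gives six quality levels, so the full cartesian product is 24 runs.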
Each run:
- Builds a temp JPEG at the target scale and quality using high-quality bicubic resampling
- Runs inference with the test prompt and streams the response to the chat window
- Records all metrics to the run log
- Deletes the temp file
Stop Test cancels after the current inference completes.
The run log records every inference — both manual sends and auto-test runs.
| Column | Description |
|---|---|
| Time | Wall clock time of the run (HH:mm:ss) |
| File | Image filename |
| Tokens | Number of tokens generated |
| Tok/s | Generation throughput in tokens per second |
| TTFT (ms) | Time to first token in milliseconds |
| Size | Pixel dimensions of the image sent to the model |
| Scale % | Scale percentage applied |
| Quality | JPEG quality value used |
| Stopped | Whether the run was cancelled mid-generation |
Click any column header to sort. Click again to reverse. Newest entries appear at the top by default.
Click Save Report… to export the full run log. A save dialog lets you choose the format:
CSV — comma-separated, quoted fields, UTF-8. Opens directly in Excel or any spreadsheet tool.
JSON — pretty-printed array of objects with named fields. Useful for programmatic analysis or importing into other tools.
Both formats export all rows sorted by Tok/s descending regardless of the current sort order in the UI, so the fastest configurations appear first.
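That same ordering is easy to reproduce when post-processing an exported CSV with the standard library — a post-hoc analysis sketch (assuming the header shown in the sample below), not part of the app:

```python
import csv
import io

def rows_by_throughput(csv_text: str) -> list[dict]:
    """Parse an exported report and return rows sorted by Tok/s, fastest first."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    return sorted(rows, key=lambda r: float(r["Tok/s"]), reverse=True)
```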
"Time","File","Tokens","Tok/s","TTFT (ms)","Size","Scale %","Quality","Stopped"
"14:23:01","photo.jpg","312","18.4","847","960×540","50","75","No"
"14:19:44","photo.jpg","298","16.1","912","1920×1080","100","95","No"[
{
"Time": "14:23:01",
"File": "photo.jpg",
"Tokens": "312",
"Tok/s": "18.4",
"TTFT (ms)": "847",
"Size": "960×540",
"Scale %": "50",
"Quality": "75",
"Stopped": "No"
}
]The following are saved automatically to FrmApiTests.settings.json next to the executable and restored on the next launch:
- Model path
- Projection path
- Last image path (image is reloaded and preview is regenerated automatically if the file still exists)
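A minimal sketch of the save/restore round-trip, assuming a flat JSON object; the field names here are illustrative, not the app's actual schema:

```python
import json
from pathlib import Path

def save_settings(path: Path, settings: dict) -> None:
    """Persist settings as pretty-printed JSON (e.g. next to the executable)."""
    path.write_text(json.dumps(settings, indent=2))

def load_settings(path: Path) -> dict:
    """Restore settings, or return an empty dict on first launch."""
    return json.loads(path.read_text()) if path.exists() else {}
```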
The status bar shows live CPU and RAM usage updated every 2 seconds:
`CPU 34.2%  Proc RAM: 4821 MB  System RAM: 16,384 MB`
CPU color changes to orange above 40% and red above 70%. The monitor pauses during inference to avoid interference with timing measurements.
Fidelity refers to how faithfully the preprocessed image represents the original. High scale + high quality = high fidelity. Low scale + low quality = low fidelity.
Drift is when the model's response changes relative to a high-fidelity baseline:
- None — response is effectively identical
- Minor — small wording differences, same meaning
- Moderate — missing detail or changed emphasis
- Severe — hallucination or factually wrong answer
The goal is finding the lowest fidelity that keeps drift at None or Minor for your specific task.
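The tool leaves drift assessment to you — it logs responses, not a drift score. For a rough automated first pass, a string-similarity ratio against the baseline response can bucket runs for manual review. The thresholds below are arbitrary assumptions for illustration, not calibrated values, and text similarity is only a proxy for semantic drift:

```python
from difflib import SequenceMatcher

def drift_bucket(baseline: str, response: str) -> str:
    """Map text similarity to a coarse drift label (thresholds are guesses)."""
    ratio = SequenceMatcher(None, baseline, response).ratio()
    if ratio > 0.95:
        return "None"
    if ratio > 0.85:
        return "Minor"
    if ratio > 0.60:
        return "Moderate"
    return "Severe"
```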
Phase 1 — Baseline. Run at Scale 100%, Quality 95. This is your reference. Note what the model says.
Phase 2 — Broad sweep. Use large steps (Scale step 25, Quality step 20) across the full range to quickly identify which zone causes drift.
Phase 3 — Narrow sweep. Zoom in on the transition zone with small steps (Scale step 5, Quality step 5) to find the precise cliff edge.
Phase 4 — Validate. Test your candidate settings on 5–10 different images. A single image is not representative.
| Task | Scale range | Quality range |
|---|---|---|
| General description | 50–70% | 60–75 |
| OCR / text in image | 80–100% | 70–85 |
| Object detection / counting | 60–80% | 55–70 |
| Classification / yes-no | 30–50% | 40–60 |
Text-in-image tasks require the most resolution — the model needs to read individual characters. Classification tasks are the most tolerant of degradation.
Low fidelity can sometimes produce responses that are faster, longer, and delivered with apparent confidence — but are factually wrong. A high Tok/s with a wrong answer is worse than a slow correct one. Always compare response content, not just throughput numbers.
Tok/s is the primary throughput metric. This is calculated from first-token time to end of generation, reflecting pure generation speed and excluding image encoding overhead.
TTFT (Time to First Token) captures the image encoding and prompt processing cost. A very high TTFT with fast Tok/s means the bottleneck is image processing, not generation — scaling down the image will have the largest impact on total wall time.
Tokens is the output length. Be cautious comparing Tok/s across runs with very different token counts — a 50-token response will often show higher Tok/s than a 500-token response due to context growth effects during generation.
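Concretely, the two timing metrics decompose from three wall-clock timestamps: prompt submitted, first token received, and generation finished. A sketch of that arithmetic (names and signature are illustrative, not the app's code):

```python
def compute_metrics(t_submit: float, t_first_token: float, t_done: float,
                    n_tokens: int) -> dict:
    """Derive TTFT and Tok/s from three wall-clock timestamps (in seconds).

    Tok/s deliberately excludes the time before the first token, so it
    measures pure generation speed; image encoding cost shows up in TTFT.
    """
    ttft_ms = (t_first_token - t_submit) * 1000
    gen_seconds = t_done - t_first_token
    tok_s = n_tokens / gen_seconds if gen_seconds > 0 else 0.0
    return {"TTFT (ms)": ttft_ms, "Tok/s": tok_s}
```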