johnbrodowski/VLMImageFidelityBenchmarkTool


VLM Image Fidelity Benchmark Tool

A Windows desktop application for systematically characterizing how image scale and JPEG quality affect the inference performance and response quality of local Vision Language Models (VLMs) running via LLamaSharp/llama.cpp.

The core question this tool answers: what is the lowest fidelity image you can send to your model before its responses meaningfully degrade? Finding that threshold lets you maximize tokens/second without sacrificing output quality.




Why This Exists

Most VLM benchmarking tools focus on prompt quality or model selection. This tool focuses on a different bottleneck: image preprocessing. A 4K image sent at full resolution costs far more in tokens and memory than a 25% scaled JPEG at quality 60 — but the model's response may be identical for many tasks.

No public tool existed for sweeping JPEG quality and resolution percentage against local GGUF multimodal models to find that fidelity threshold. This fills that gap.


Prerequisites

  • Windows
  • .NET 8.0 or later
  • A GGUF model file with vision support (e.g., LFM2.5-VL, LLaVA variants)
  • A multimodal projection file (.gguf) matching the model
  • LLamaSharp with a compatible llama.cpp native backend

If the bundled llama.cpp version in LLamaSharp doesn't support your model, see llama.cpp-Builder-for-LLamaSharp for building a compatible native binary.


Setup

  1. Launch the application.
  2. Go to File > Select Models > Model… and select your .gguf model file.
  3. Go to File > Select Models > Projection… and select the matching projection .gguf.
  4. Click File > Load Model to load both into memory.
  5. Select an image using the Image… button.

Model and image paths are saved automatically and restored on next launch.


Interface Overview

┌─────────────────────────────────────────────────────────┐
│ File menu (Select Models, Load Model)                   │
├─────────────────────────────────────────────────────────┤
│ System Prompt                                           │
├─────────────────────────────────────────────────────────┤
│ Chat output (RichTextBox)                               │
├─────────────────────────────────────────────────────────┤
│ Message input                           │ Send │ Stop   │
├────────────────────┬────────────────────────────────────┤
│ Original image     │ Preview (what the model receives)  │
├─────────────────────────────────────────────────────────┤
│ Scale slider                  Scale: 100%  (1920×1080)  │
│ Quality slider                Quality: 95               │
├─────────────────────────────────────────────────────────┤
│ Image path                        │ Image… │ Reset      │
├─────────────────────────────────────────────────────────┤
│ Test Prompt                                             │
│ Q Step │ Q Passes │ Scale Step │ Scale Passes           │
│ Scale: [start] to [end]  Quality: [start] to [end]      │
│                               │ Run Test │ Stop Test    │
├─────────────────────────────────────────────────────────┤
│ Run log (sortable by any column)                        │
├─────────────────────────────────────────────────────────┤
│ Status                              │ Save Report… │ CPU│
└─────────────────────────────────────────────────────────┘

Chat Mode

Chat mode lets you manually explore model behavior before committing to an automated sweep.

  • Type a message in the input box and click Send.
  • The model responds using the current scale and quality settings applied to the loaded image.
  • The Assistant: label only appears when the first token arrives — it is not pre-printed.
  • The chat window scrolls automatically as tokens stream in.
  • Stop cancels generation without unloading the model.
  • Each send logs a row to the run log with full metrics.

The system prompt field at the top is sent with every message. Default: You are a helpful assistant.

Every inference uses a fresh context — there is no conversation memory between sends. This is intentional for benchmarking isolation.


Image Controls

Original / Preview panels

The left panel shows the original image as loaded from disk. The right panel shows exactly what gets encoded and sent to the model — the JPEG-compressed, scaled version. Both panels update in real time as you move the sliders.

Scale slider (10% – 200%)

Controls the resize percentage applied to the original image before encoding. The label shows the resulting pixel dimensions, e.g. Scale: 75% (1440×810).

Quality slider (1 – 100)

Controls the JPEG encoder quality. Higher values produce sharper images at larger file sizes. Lower values introduce compression artifacts. The label shows the current value, e.g. Quality: 60.

Reset button

Returns scale to 100% and quality to 95, and updates the preview immediately.

Notes

  • JPEG quality is non-linear. The difference between 95 and 85 is usually imperceptible. The difference between 50 and 40 introduces visible blocking artifacts that can confuse the model.
  • Scale and quality compound: lowering both at once degrades the image faster than lowering either alone.
  • The preview panel shows the actual temp file that will be sent — what you see is what the model gets.

Auto-Test Mode

Auto-test runs the full Cartesian product of scale and quality values automatically, logging every result.

Controls

| Control | Description |
| --- | --- |
| Test Prompt | The prompt sent for every test run. Default: Describe what is in this image. |
| Q Step | How much to decrement quality between passes (e.g., 10) |
| Q Passes | How many quality levels to test at each scale level |
| Scale Step | How much to decrement scale between passes (e.g., 25) |
| Scale Passes | How many scale levels to test |
| Scale: [start] to [end] | Explicit start and end scale percentages for the sweep |
| Quality: [start] to [end] | Explicit start and end quality values for the sweep |

How the sweep works

Scale and quality lists are built from the start/end values stepping by the step size. The sweep always runs high to low regardless of which spinner has the larger value. The endpoint value is always included even if the step doesn't land on it exactly.

Example:

  • Scale start=100, end=25, step=25 → tests at 100%, 75%, 50%, 25%
  • Quality start=95, end=45, step=10 → tests at 95, 85, 75, 65, 55, 45
  • Total runs = 4 × 6 = 24
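The list-building rule described above (always high to low, endpoint always included) can be sketched in Python. This is illustrative only; the tool itself is a C# application, and `sweep_values` is a hypothetical name:

```python
def sweep_values(start: int, end: int, step: int) -> list[int]:
    """Build a high-to-low sweep list. The endpoint is always
    included even if the step doesn't land on it exactly."""
    hi, lo = max(start, end), min(start, end)
    values = list(range(hi, lo - 1, -step))
    if values[-1] != lo:      # step overshot the endpoint; append it
        values.append(lo)
    return values

scales = sweep_values(100, 25, 25)    # [100, 75, 50, 25]
qualities = sweep_values(95, 45, 10)  # [95, 85, 75, 65, 55, 45]
total_runs = len(scales) * len(qualities)  # 24
```

Note the endpoint rule: `sweep_values(100, 30, 25)` yields `[100, 75, 50, 30]`, appending 30 because stepping by 25 never reaches it.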

Each run:

  1. Builds a temp JPEG at the target scale and quality using high-quality bicubic resampling
  2. Runs inference with the test prompt and streams the response to the chat window
  3. Records all metrics to the run log
  4. Deletes the temp file

Stop Test cancels after the current inference completes.


Run Log

The run log records every inference — both manual sends and auto-test runs.

| Column | Description |
| --- | --- |
| Time | Wall-clock time of the run (HH:mm:ss) |
| File | Image filename |
| Tokens | Number of tokens generated |
| Tok/s | Generation throughput in tokens per second |
| TTFT (ms) | Time to first token in milliseconds |
| Size | Pixel dimensions of the image sent to the model |
| Scale % | Scale percentage applied |
| Quality | JPEG quality value used |
| Stopped | Whether the run was cancelled mid-generation |

Click any column header to sort. Click again to reverse. Newest entries appear at the top by default.


Saving Reports

Click Save Report… to export the full run log. A save dialog lets you choose the format:

CSV — comma-separated, quoted fields, UTF-8. Opens directly in Excel or any spreadsheet tool.

JSON — pretty-printed array of objects with named fields. Useful for programmatic analysis or importing into other tools.

Both formats export all rows sorted by Tok/s descending regardless of the current sort order in the UI, so the fastest configurations appear first.

CSV example

"Time","File","Tokens","Tok/s","TTFT (ms)","Size","Scale %","Quality","Stopped"
"14:23:01","photo.jpg","312","18.4","847","960×540","50","75","No"
"14:19:44","photo.jpg","298","16.1","912","1920×1080","100","95","No"

JSON example

[
  {
    "Time": "14:23:01",
    "File": "photo.jpg",
    "Tokens": "312",
    "Tok/s": "18.4",
    "TTFT (ms)": "847",
    "Size": "960×540",
    "Scale %": "50",
    "Quality": "75",
    "Stopped": "No"
  }
]
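The export logic can be sketched in Python using the field names from the examples above. `export_rows` is a hypothetical name, not the tool's API; it mirrors the documented behavior of quoting all CSV fields and sorting by Tok/s descending:

```python
import csv
import io
import json

FIELDS = ["Time", "File", "Tokens", "Tok/s", "TTFT (ms)",
          "Size", "Scale %", "Quality", "Stopped"]

def export_rows(rows: list[dict], fmt: str) -> str:
    """Serialize run-log rows (dicts keyed by FIELDS) to 'csv' or
    'json', sorted by Tok/s descending so the fastest runs come first."""
    ordered = sorted(rows, key=lambda r: float(r["Tok/s"]), reverse=True)
    if fmt == "csv":
        buf = io.StringIO()
        writer = csv.DictWriter(buf, fieldnames=FIELDS, quoting=csv.QUOTE_ALL)
        writer.writeheader()
        writer.writerows(ordered)
        return buf.getvalue()
    return json.dumps(ordered, indent=2, ensure_ascii=False)
```

Sorting on `float(r["Tok/s"])` rather than the raw string matters: lexicographic order would put "9.8" after "18.4".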

Settings Persistence

The following are saved automatically to FrmApiTests.settings.json next to the executable and restored on the next launch:

  • Model path
  • Projection path
  • Last image path (image is reloaded and preview is regenerated automatically if the file still exists)

Performance Monitor

The status bar shows live CPU and RAM usage updated every 2 seconds:

CPU 34.2%   Proc RAM: 4821 MB   System RAM: 16,384 MB

CPU color changes to orange above 40% and red above 70%. The monitor pauses during inference to avoid interference with timing measurements.


Tuning Strategy

Concepts

Fidelity refers to how faithfully the preprocessed image represents the original. High scale + high quality = high fidelity. Low scale + low quality = low fidelity.

Drift is when the model's response changes relative to a high-fidelity baseline:

  • None — response is effectively identical
  • Minor — small wording differences, same meaning
  • Moderate — missing detail or changed emphasis
  • Severe — hallucination or factually wrong answer

The goal is finding the lowest fidelity that keeps drift at None or Minor for your specific task.
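The tool does not classify drift for you; judging response content by hand is the reliable method. As a rough offline aid, a text-similarity ratio from Python's `difflib` can flag which runs deserve a closer look. The thresholds below are arbitrary assumptions, not calibrated values:

```python
from difflib import SequenceMatcher

def drift_bucket(baseline: str, response: str) -> str:
    """Map text similarity against the baseline response to a drift
    bucket. Rough heuristic only: similar wording is no guarantee of
    factual agreement, so always spot-check by hand."""
    ratio = SequenceMatcher(None, baseline.lower(), response.lower()).ratio()
    if ratio > 0.95:
        return "None"
    if ratio > 0.80:
        return "Minor"
    if ratio > 0.50:
        return "Moderate"
    return "Severe"
```

Note the limitation: a hallucinated response phrased like the baseline can still score as "Minor", which is exactly the trap described under "Watch for the hallucination trap" below.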

Recommended workflow

Phase 1 — Baseline. Run at Scale 100%, Quality 95. This is your reference. Note what the model says.

Phase 2 — Broad sweep. Use large steps (Scale step 25, Quality step 20) across the full range to quickly identify which zone causes drift.

Phase 3 — Narrow sweep. Zoom in on the transition zone with small steps (Scale step 5, Quality step 5) to find the precise cliff edge.

Phase 4 — Validate. Test your candidate settings on 5–10 different images. A single image is not representative.

Practical targets by task type

| Task | Scale range | Quality range |
| --- | --- | --- |
| General description | 50–70% | 60–75 |
| OCR / text in image | 80–100% | 70–85 |
| Object detection / counting | 60–80% | 55–70 |
| Classification / yes-no | 30–50% | 40–60 |

Text-in-image tasks require the most resolution — the model needs to read individual characters. Classification tasks are the most tolerant of degradation.

Watch for the hallucination trap

Low fidelity can sometimes produce responses that are faster, longer, and delivered with apparent confidence — but are factually wrong. A high Tok/s with a wrong answer is worse than a slow correct one. Always compare response content, not just throughput numbers.


Understanding the Metrics

Tok/s is the primary throughput metric. This is calculated from first-token time to end of generation, reflecting pure generation speed and excluding image encoding overhead.

TTFT (Time to First Token) captures the image encoding and prompt processing cost. A very high TTFT with fast Tok/s means the bottleneck is image processing, not generation — scaling down the image will have the largest impact on total wall time.

Tokens is the output length. Be cautious comparing Tok/s across runs with very different token counts — a 50-token response will often show higher Tok/s than a 500-token response due to context growth effects during generation.
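How the two timing metrics relate can be shown with a small sketch. The timestamps and `compute_metrics` name are illustrative, but the arithmetic matches the definitions above: TTFT spans request start to first token, and Tok/s covers only first token to end of generation:

```python
def compute_metrics(t_start: float, t_first_token: float,
                    t_end: float, n_tokens: int) -> tuple[float, float]:
    """Derive TTFT (ms) and Tok/s from run timestamps in seconds.
    Tok/s excludes image encoding and prompt processing by starting
    the clock at the first token."""
    ttft_ms = (t_first_token - t_start) * 1000.0
    gen_seconds = t_end - t_first_token
    tok_per_s = n_tokens / gen_seconds if gen_seconds > 0 else 0.0
    return ttft_ms, tok_per_s

# Numbers echoing the CSV example row above: first token at 0.847 s,
# generation done at 17.8 s, 312 tokens produced.
ttft, tps = compute_metrics(0.0, 0.847, 17.8, 312)  # ≈ 847 ms, ≈ 18.4 tok/s
```

Total wall time is TTFT plus generation time, which is why shrinking the image (attacking TTFT) can matter more than raw Tok/s for short responses.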
