A Windows desktop application for systematically characterizing how image scale and JPEG quality affect the inference performance and response quality of local Vision Language Models (VLMs) running via LLamaSharp/llama.cpp.
The core question this tool answers: what is the lowest fidelity image you can send to your model before its responses meaningfully degrade? Finding that threshold lets you maximize tokens/second without sacrificing output quality.
- Why This Exists
- Prerequisites
- Setup
- Interface Overview
- Chat Mode
- Image Controls
- Auto-Test Mode
- Run Log
- Saving Reports
- Settings Persistence
- Performance Monitor
- Tuning Strategy
- Understanding the Metrics
Most VLM benchmarking tools focus on prompt quality or model selection. This tool focuses on a different bottleneck: image preprocessing. A 4K image sent at full resolution costs far more in tokens and memory than a 25% scaled JPEG at quality 60 — but the model's response may be identical for many tasks.
No public tool existed for sweeping JPEG quality and resolution percentage against local GGUF multimodal models to find that fidelity threshold. This fills that gap.
- Windows
- .NET 8.0 or later
- A GGUF model file with vision support (e.g., LFM2.5-VL, LLaVA variants)
- A multimodal projection file (`.gguf`) matching the model
- LLamaSharp with a compatible llama.cpp native backend
If the bundled llama.cpp version in LLamaSharp doesn't support your model, see llama.cpp-Builder-for-LLamaSharp for building a compatible native binary.
- Launch the application.
- Go to File > Select Models > Model… and select your `.gguf` model file.
- Go to File > Select Models > Projection… and select the matching projection `.gguf`.
- Click File > Load Model to load both into memory.
- Select an image using the Image… button.
Model and image paths are saved automatically and restored on next launch.
```
┌─────────────────────────────────────────────────────────┐
│ File menu (Select Models, Load Model)                   │
├─────────────────────────────────────────────────────────┤
│ System Prompt                                           │
├─────────────────────────────────────────────────────────┤
│ Chat output (RichTextBox)                               │
├─────────────────────────────────────────────────────────┤
│ Message input                              │ Send │ Stop│
├────────────────────┬────────────────────────────────────┤
│ Original image     │ Preview (what the model receives)  │
├─────────────────────────────────────────────────────────┤
│ Scale slider       Scale: 100% (1920×1080)              │
│ Quality slider     Quality: 95                          │
├─────────────────────────────────────────────────────────┤
│ Image path                             │ Image… │ Reset │
├─────────────────────────────────────────────────────────┤
│ Test Prompt                                             │
│ Q Step │ Q Passes │ Scale Step │ Scale Passes           │
│ Scale: [start] to [end]   Quality: [start] to [end]     │
│                              │ Run Test │ Stop Test     │
├─────────────────────────────────────────────────────────┤
│ Run log (sortable by any column)                        │
├─────────────────────────────────────────────────────────┤
│ Status                              │ Save Report… │ CPU│
└─────────────────────────────────────────────────────────┘
```
Chat mode lets you manually explore model behavior before committing to an automated sweep.
- Type a message in the input box and click Send.
- The model responds using the current scale and quality settings applied to the loaded image.
- The Assistant: label only appears when the first token arrives — it is not pre-printed.
- The chat window scrolls automatically as tokens stream in.
- Stop cancels generation without unloading the model.
- Each send logs a row to the run log with full metrics.
The system prompt field at the top is sent with every message. Default: You are a helpful assistant.
Every inference uses a fresh context — there is no conversation memory between sends. This is intentional for benchmarking isolation.
The left panel shows the original image as loaded from disk. The right panel shows exactly what gets encoded and sent to the model — the JPEG-compressed, scaled version. Both panels update in real time as you move the sliders.
Controls the resize percentage applied to the original image before encoding. The label shows the resulting pixel dimensions, e.g. Scale: 75% (1440×810).
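The label arithmetic is straightforward: the preview dimensions are the original dimensions multiplied by the scale percentage and rounded. A minimal sketch (the app itself is C#; this Python function and its name are illustrative only):

```python
def scaled_label(width: int, height: int, scale_pct: int) -> str:
    """Format a scale-slider label like 'Scale: 75% (1440×810)'."""
    w = round(width * scale_pct / 100)
    h = round(height * scale_pct / 100)
    return f"Scale: {scale_pct}% ({w}\u00d7{h})"
```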
Controls the JPEG encoder quality. Higher values produce sharper images at larger file sizes. Lower values introduce compression artifacts. The label shows the current value, e.g. Quality: 60.
Returns scale to 100% and quality to 95, and updates the preview immediately.
- JPEG quality is non-linear. The difference between 95 and 85 is usually imperceptible. The difference between 50 and 40 introduces visible blocking artifacts that can confuse the model.
- Scale and quality compound — lowering both simultaneously degrades the image faster than lowering either one alone.
- The preview panel shows the actual temp file that will be sent — what you see is what the model gets.
Auto-test runs the full cartesian product of scale and quality combinations automatically, logging every result.
| Control | Description |
|---|---|
| Test Prompt | The prompt sent for every test run. Default: Describe what is in this image. |
| Q Step | How much to decrement quality between passes (e.g., 10) |
| Q Passes | How many quality levels to test at each scale level |
| Scale Step | How much to decrement scale between passes (e.g., 25) |
| Scale Passes | How many scale levels to test |
| Scale: [start] to [end] | Explicit start and end scale percentages for the sweep |
| Quality: [start] to [end] | Explicit start and end quality values for the sweep |
Scale and quality lists are built from the start/end values stepping by the step size. The sweep always runs high to low regardless of which spinner has the larger value. The endpoint value is always included even if the step doesn't land on it exactly.
Example:
- Scale start=100, end=25, step=25 → tests at 100%, 75%, 50%, 25%
- Quality start=95, end=45, step=10 → tests at 95, 85, 75, 65, 55, 45
- Total runs = 4 × 6 = 24
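The list-building rule above can be sketched as follows (a sketch of the behavior described, not the app's actual C# code; the function name is illustrative):

```python
def sweep_values(start: int, end: int, step: int) -> list[int]:
    """Build a high-to-low sweep list from start/end/step.

    The sweep always runs high to low regardless of which value is larger,
    and the endpoint is included even if the step doesn't land on it exactly.
    """
    hi, lo = max(start, end), min(start, end)
    values = list(range(hi, lo - 1, -step))
    if values[-1] != lo:
        values.append(lo)  # force-include the endpoint
    return values
```

With the example values above, `sweep_values(100, 25, 25)` gives four scale levels and `sweep_values(95, 45, 10)` gives six quality levels, so the full cartesian product is 24 runs.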
Each run:
- Builds a temp JPEG at the target scale and quality using high-quality bicubic resampling
- Runs inference with the test prompt and streams the response to the chat window
- Records all metrics to the run log
- Deletes the temp file
Stop Test cancels after the current inference completes.
The run log records every inference — both manual sends and auto-test runs.
| Column | Description |
|---|---|
| Time | Wall clock time of the run (HH:mm:ss) |
| File | Image filename |
| Tokens | Number of tokens generated |
| Tok/s | Generation throughput in tokens per second |
| TTFT (ms) | Time to first token in milliseconds |
| Size | Pixel dimensions of the image sent to the model |
| Scale % | Scale percentage applied |
| Quality | JPEG quality value used |
| Stopped | Whether the run was cancelled mid-generation |
Click any column header to sort. Click again to reverse. Newest entries appear at the top by default.
Click Save Report… to export the full run log. A save dialog lets you choose the format:
CSV — comma-separated, quoted fields, UTF-8. Opens directly in Excel or any spreadsheet tool.
JSON — pretty-printed array of objects with named fields. Useful for programmatic analysis or importing into other tools.
Both formats export all rows sorted by Tok/s descending regardless of the current sort order in the UI, so the fastest configurations appear first.
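That same ordering is easy to reproduce when post-processing an exported CSV with the standard library — a post-hoc analysis sketch (assuming the header shown in the sample below), not part of the app:

```python
import csv
import io

def rows_by_throughput(csv_text: str) -> list[dict]:
    """Parse an exported report and return rows sorted by Tok/s, fastest first."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    return sorted(rows, key=lambda r: float(r["Tok/s"]), reverse=True)
```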
"Time","File","Tokens","Tok/s","TTFT (ms)","Size","Scale %","Quality","Stopped"
"14:23:01","photo.jpg","312","18.4","847","960×540","50","75","No"
"14:19:44","photo.jpg","298","16.1","912","1920×1080","100","95","No"[
{
"Time": "14:23:01",
"File": "photo.jpg",
"Tokens": "312",
"Tok/s": "18.4",
"TTFT (ms)": "847",
"Size": "960×540",
"Scale %": "50",
"Quality": "75",
"Stopped": "No"
}
]The following are saved automatically to FrmApiTests.settings.json next to the executable and restored on the next launch:
- Model path
- Projection path
- Last image path (image is reloaded and preview is regenerated automatically if the file still exists)
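A minimal sketch of the save/restore round-trip, assuming a flat JSON object; the field names here are illustrative, not the app's actual schema:

```python
import json
from pathlib import Path

def save_settings(path: Path, settings: dict) -> None:
    """Persist settings as pretty-printed JSON (e.g. next to the executable)."""
    path.write_text(json.dumps(settings, indent=2))

def load_settings(path: Path) -> dict:
    """Restore settings, or return an empty dict on first launch."""
    return json.loads(path.read_text()) if path.exists() else {}
```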
The status bar shows live CPU and RAM usage updated every 2 seconds:
`CPU 34.2%  Proc RAM: 4821 MB  System RAM: 16,384 MB`
CPU color changes to orange above 40% and red above 70%. The monitor pauses during inference to avoid interference with timing measurements.
Fidelity refers to how faithfully the preprocessed image represents the original. High scale + high quality = high fidelity. Low scale + low quality = low fidelity.
Drift is when the model's response changes relative to a high-fidelity baseline:
- None — response is effectively identical
- Minor — small wording differences, same meaning
- Moderate — missing detail or changed emphasis
- Severe — hallucination or factually wrong answer
The goal is finding the lowest fidelity that keeps drift at None or Minor for your specific task.
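The tool leaves drift assessment to you — it logs responses, not a drift score. For a rough automated first pass, a string-similarity ratio against the baseline response can bucket runs for manual review. The thresholds below are arbitrary assumptions for illustration, not calibrated values, and text similarity is only a proxy for semantic drift:

```python
from difflib import SequenceMatcher

def drift_bucket(baseline: str, response: str) -> str:
    """Map text similarity to a coarse drift label (thresholds are guesses)."""
    ratio = SequenceMatcher(None, baseline, response).ratio()
    if ratio > 0.95:
        return "None"
    if ratio > 0.85:
        return "Minor"
    if ratio > 0.60:
        return "Moderate"
    return "Severe"
```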
Phase 1 — Baseline. Run at Scale 100%, Quality 95. This is your reference. Note what the model says.
Phase 2 — Broad sweep. Use large steps (Scale step 25, Quality step 20) across the full range to quickly identify which zone causes drift.
Phase 3 — Narrow sweep. Zoom in on the transition zone with small steps (Scale step 5, Quality step 5) to find the precise cliff edge.
Phase 4 — Validate. Test your candidate settings on 5–10 different images. A single image is not representative.
| Task | Scale range | Quality range |
|---|---|---|
| General description | 50–70% | 60–75 |
| OCR / text in image | 80–100% | 70–85 |
| Object detection / counting | 60–80% | 55–70 |
| Classification / yes-no | 30–50% | 40–60 |
Text-in-image tasks require the most resolution — the model needs to read individual characters. Classification tasks are the most tolerant of degradation.
Low fidelity can sometimes produce responses that are faster, longer, and delivered with apparent confidence — but are factually wrong. A high Tok/s with a wrong answer is worse than a slow correct one. Always compare response content, not just throughput numbers.
Tok/s is the primary throughput metric. This is calculated from first-token time to end of generation, reflecting pure generation speed and excluding image encoding overhead.
TTFT (Time to First Token) captures the image encoding and prompt processing cost. A very high TTFT with fast Tok/s means the bottleneck is image processing, not generation — scaling down the image will have the largest impact on total wall time.
Tokens is the output length. Be cautious comparing Tok/s across runs with very different token counts — a 50-token response will often show higher Tok/s than a 500-token response due to context growth effects during generation.
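Concretely, the two timing metrics decompose from three wall-clock timestamps: prompt submitted, first token received, and generation finished. A sketch of that arithmetic (names and signature are illustrative, not the app's code):

```python
def compute_metrics(t_submit: float, t_first_token: float, t_done: float,
                    n_tokens: int) -> dict:
    """Derive TTFT and Tok/s from three wall-clock timestamps (in seconds).

    Tok/s deliberately excludes the time before the first token, so it
    measures pure generation speed; image encoding cost shows up in TTFT.
    """
    ttft_ms = (t_first_token - t_submit) * 1000
    gen_seconds = t_done - t_first_token
    tok_s = n_tokens / gen_seconds if gen_seconds > 0 else 0.0
    return {"TTFT (ms)": ttft_ms, "Tok/s": tok_s}
```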