Shared: Provenance-based filtering of flow summaries #21051

hvitved · 2025-12-16T13:40:36Z

This PR aligns the logic across languages for how flow summaries are prioritized based on provenance and exactness (that is, whether a model is defined directly for a function or for a function that is implemented/overridden).

A flow summary is considered relevant if:

1. It is a manual exact model, or
2. It is a manual inexact model and there is no exact manual (neutral) model, or
3. It is a generated model and (a) there is no source code available for the modeled callable, (b) there is no manual (neutral) model, and (c) the model is inexact and there is no generated exact (neutral) model.

Note that for dynamic languages we currently pretend that no source code is available for functions with flow summaries, so 3.(a) holds vacuously.
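The relevance rules can be sketched outside QL as a small Python filter. This is an illustration only, not the shared library's implementation: the `Summary` record, its field names, and the `has_source` flag are hypothetical. It also assumes that condition 3.(c) only restricts inexact generated models, so an exact generated model is subject to (a) and (b) alone.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class Summary:
    callable_name: str     # hypothetical identifier for the modeled callable
    manual: bool           # True for manual provenance, False for generated
    exact: bool            # True if the model is defined directly for this callable
    neutral: bool = False  # neutral models can block others but carry no flow


def relevant(s: Summary, all_models: Iterable[Summary], has_source: bool) -> bool:
    """Return True if flow summary `s` survives provenance-based filtering."""
    if s.neutral:
        return False  # a neutral model is not itself a flow summary
    same = [m for m in all_models if m.callable_name == s.callable_name]
    any_manual = any(m.manual for m in same)
    exact_manual = any(m.manual and m.exact for m in same)
    exact_generated = any(not m.manual and m.exact for m in same)
    if s.manual:
        # Points 1 and 2: exact manual models always win; inexact manual
        # models apply only when no exact manual (neutral) model exists.
        return s.exact or not exact_manual
    # Point 3: generated models need (a) no source code and (b) no manual
    # (neutral) model; (c) is read here as applying to inexact models only.
    if has_source or any_manual:
        return False
    return s.exact or not exact_generated
```

For a dynamic language, callers of this sketch would always pass `has_source=False`, which mirrors the note that 3.(a) holds vacuously there.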

Points 2 and 3.c represent a change for e.g. Java, where we would previously union exact and inexact models, which meant that it was not possible to overrule inexact models. As a consequence, some inexact manual models have been replicated. DCA for Java reports some lost java/sensitive-log results on apache_solr, but looking at those results, they all have flow paths of length > 150, so they are almost certainly false positives, and most likely a consequence of 3.c.

In order for the logic to be defined in the shared flow summary library, I had to move provenance and exactness information into the propagatesFlow predicate, which is a breaking change.

Lastly, I have applied the ::Range pattern to the SummarizedCallable class for all languages except C++, which currently does not expose this class. This means that SummarizedCallable::Range will contain all flow summaries, whereas SummarizedCallable will only contain relevant summaries.

rust/ql/lib/codeql/rust/dataflow/FlowSummary.qll

shared/dataflow/codeql/dataflow/internal/FlowSummaryImpl.qll

java/ql/lib/semmle/code/java/dataflow/internal/DataFlowDispatch.qll

rust/ql/test/library-tests/dataflow/models/models.qlref

hvitved · 2026-01-21T11:57:35Z

@hvitved : This appears to break the model generator idempotency (at least for C#). I tried generating C# Runtime models from scratch (by first deleting the existing generated models) and then re-generating the models after this (which further changed the models).

@michaelnebel : I have pushed a revert change that appears to fix this: When I run `python3 generate_mad.py --language csharp --with-summaries <path to dotnet_runtime_db>` twice, I get no changes with the second invocation.
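The "run twice, expect no changes" check can be automated with a small helper that hashes the output directory after each run. This is a generic sketch, not part of `generate_mad.py`: `snapshot`, `is_idempotent`, and the `generate` callback are hypothetical names, and the callback is assumed to wrap whatever command regenerates the models.

```python
import hashlib
from pathlib import Path
from typing import Callable, Dict


def snapshot(root: Path) -> Dict[str, str]:
    """Map each file path (relative to root) to the SHA-256 of its contents."""
    return {
        str(p.relative_to(root)): hashlib.sha256(p.read_bytes()).hexdigest()
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }


def is_idempotent(generate: Callable[[], None], root: Path) -> bool:
    """Run `generate` twice and report whether the second run changed anything."""
    generate()
    first = snapshot(root)
    generate()
    return snapshot(root) == first
```

Comparing content hashes rather than timestamps means a second run that rewrites identical files still counts as "no changes".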

michaelnebel · 2026-01-22T09:59:12Z

csharp/ql/lib/semmle/code/csharp/dataflow/internal/FlowSummaryImpl.qll

+    c.fromSource() and
+    not c.getFile().isStub() and
+    not (
+      c.getFile().extractedQlTest() and


Maybe this deserves a comment (that QL test files where the body is just a throw are considered stub-like and thus not part of the source code).

michaelnebel

Really nice work @hvitved !
Only a couple of minor questions/remarks.

yoff

Python 👍

aschackmull · 2026-01-23T08:59:53Z

Offline feedback recap: Some of the added Java models look wrong. Tom and I identified several issues: the code snippet to identify and generate the missing models lacked the signature, and notably the signature can be different from that of the overridden method. Also, some existing manual exact models were missing signatures, which caused them to wrongly apply to inherited overloads.
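To illustrate why a missing signature matters, here is a minimal Python sketch of name-plus-signature matching, in which an empty signature matches every overload of a method. The `Method` and `ModelRow` records and the `Foo.put` overloads are hypothetical, and the sketch ignores subtyping and overriding entirely.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Method:
    type_name: str
    name: str
    signature: str  # e.g. "(String)"


@dataclass(frozen=True)
class ModelRow:
    type_name: str
    name: str
    signature: str  # "" means that no signature was given


def matches(row: ModelRow, m: Method) -> bool:
    """A row applies to a method if names agree and the signature is empty or equal."""
    return (
        row.type_name == m.type_name
        and row.name == m.name
        and row.signature in ("", m.signature)
    )


# Two hypothetical overloads of the same method.
overloads = [
    Method("Foo", "put", "(String)"),
    Method("Foo", "put", "(String,Object)"),
]
unsigned = ModelRow("Foo", "put", "")
signed = ModelRow("Foo", "put", "(String)")
unsigned_hits = [m for m in overloads if matches(unsigned, m)]
signed_hits = [m for m in overloads if matches(signed, m)]
```

In this sketch the unsigned row applies to both overloads, while the signed row applies only to `put(String)`, which is the kind of over-application described above.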

Missing manual models were added using the following code added to `FlowSummaryImpl.qll`:

```ql
private predicate testsummaryElement(
  Input::SummarizedCallableBase c, string namespace, string type, boolean subtypes, string name,
  string signature, string ext, string originalInput, string originalOutput, string kind,
  string provenance, string model, boolean isExact
) {
  exists(string input, string output, Callable baseCallable |
    summaryModel(namespace, type, subtypes, name, signature, ext, originalInput, originalOutput,
      kind, provenance, model) and
    baseCallable = interpretElement(namespace, type, subtypes, name, signature, ext, isExact) and
    (
      c.asCallable() = baseCallable and input = originalInput and output = originalOutput
      or
      correspondingKotlinParameterDefaultsArgSpec(baseCallable, c.asCallable(), originalInput,
        input) and
      correspondingKotlinParameterDefaultsArgSpec(baseCallable, c.asCallable(), originalOutput,
        output)
    )
  )
}

private predicate testsummaryElement2(
  string namespace, string type, boolean subtypes, string name, string signature, string ext,
  string originalInput, string originalOutput, string kind, string provenance, string model,
  string namespace2, string type2
) {
  exists(Input::SummarizedCallableBase c |
    testsummaryElement(c, namespace2, type2, _, _, _, ext, originalInput, originalOutput, kind,
      provenance, model, false) and
    testsummaryElement(c, namespace, type, subtypes, name, _, _, _, _, _, provenance, _, true) and
    signature = paramsString(c.asCallable()) and
    not testsummaryElement(c, _, _, _, _, _, _, originalInput, originalOutput, kind, provenance, _,
      true)
  )
}

private string getAMissingManualModel(string namespace2, string type2) {
  exists(
    string namespace, string type, boolean subtypes, string name, string signature, string ext,
    string originalInput, string originalOutput, string kind, string provenance, string model
  |
    testsummaryElement2(namespace, type, subtypes, name, signature, ext, originalInput,
      originalOutput, kind, provenance, model, namespace2, type2) and
    result =
      "- [\"" + namespace + "\", \"" + type + "\", True, \"" + name + "\", \"" + signature +
        "\", \"\", \"" + originalInput + "\", \"" + originalOutput + "\", \"" + kind + "\", \"" +
        provenance + "\"]"
  )
}
```