Skip to content

Feature: file cache: max size#945

Open
autoantwort wants to merge 56 commits intomicrosoft:mainfrom
autoantwort:feature/file-cache-max-size
Open

Feature: file cache: max size#945
autoantwort wants to merge 56 commits intomicrosoft:mainfrom
autoantwort:feature/file-cache-max-size

Conversation

@autoantwort
Copy link
Copy Markdown
Contributor

@autoantwort autoantwort commented Mar 8, 2023

Fixes microsoft/vcpkg#19452

If no settings are found a new settings file is created. You can set the following properties:

"properties": {
"delete-policy": {
"description": "After which policy objects should be deleted.",
"enum": [
"None",
"OldestAccessDate",
"OldestModificationDate"
]
},
"max-size-in-gb": {
"description": "The maximum size of the cache in gigabytes.",
"type": "number",
"minimum": 0
},
"max-age-in-days": {
"description": "The maximum age of the cache in days.",
"type": "number",
"minimum": 0
},
"keep-available-in-percentage": {
"description": "How much space should be kept available on the disk in percentage.",
"type": "number",
"minimum": 0
}
},

To ensure that the limits are respected when multiple instances are running the following is done:
0. Every instance has a unique "sync" file in "/sync/<random_number>".

  1. Append a line to sync file of the format "<name_of_object>;<file_size>\n"
  2. Read the new entries from other sync files
  3. Check if files must be deleted to respect the limits

# Conflicts:
#	include/vcpkg/base/messages.h
#	src/vcpkg/base/messages.cpp
# Conflicts:
#	src/vcpkg/binarycaching.cpp
# Conflicts:
#	src/vcpkg-test/metrics.cpp
#	src/vcpkg/binarycaching.cpp
# Conflicts:
#	include/vcpkg/base/files.h
#	src/vcpkg-test/files.cpp
#	src/vcpkg/base/files.cpp
#	src/vcpkg/binarycaching.cpp
@autoantwort autoantwort marked this pull request as ready for review August 19, 2023 16:19
@autoantwort
Copy link
Copy Markdown
Contributor Author

This is not 100% ready for merge, but ready for review. When I get a design approval I will handle all edge case errors and do localization.

# Conflicts:
#	include/vcpkg/base/message-data.inc.h
Copy link
Copy Markdown
Member

@BillyONeal BillyONeal left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

0. Every instance has a unique "sync" file in "<root>/sync/<random_number>". 
1. Append a line to sync file of the format "<name_of_object>;<file_size>\n"
2. Read the new entries from other sync files
3. Check if files must be deleted to respect the limits

I do not believe this is correct. There is nothing to stop 2 instances from concurrently reading all the sync files, deciding that something needs to be deleted, writing their intent to delete to their own sync files, and now there is no way to resolve the ambiguity on who 'won' that race.

In particular, this is assuming that writes or appends within a file will be atomic, which is a feature most file systems do not provide and several network file systems extremely do not provide. The only operation which we can assume is atomic is the creation or removal of a file system entry; as in rename.

I think you need something like a read/write oplock here, where instances trying to remove entries from the cache are writers, and instances trying to do anything else are readers. Only one instance should be trying to delete out of the cache at a time, and if any instance is doing so, we need to make sure we don't delete a cache entry that other instances are potentially touching.

Comment thread src/vcpkg/binarycaching.cpp Outdated
Comment thread src/vcpkg/binarycaching.cpp Outdated
Comment thread src/vcpkg/binarycaching.cpp Outdated
cache.own_sync_file = get_own_sync_file(cache.sync_root_dir);
if (cache.folder_settings.delete_policy != FolderSettings::DeletePolicy::None)
{
std::unordered_map<std::string, uint64_t> file_sizes;
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't believe that in general this works correctly, due to concurrent insertions into the cache. Cache entries are not meaningfully part of the cache until they have been renamed into place.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is also a cache so that I don't have to call fs.file_size() for every cache entry. If a file is not in the cache anymore, this cache size is not used.

}
}

size_t push_success(const BinaryPackageWriteInfo& request, MessageSink& msg_sink) override
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think 'during push_success' is the right time to be doing this. It probably should be a one time pass that looks at the cache(s) after all cache operations this particular vcpkg instance will do on this run.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But then the cache could be larger then its max size in the meantime?

@autoantwort
Copy link
Copy Markdown
Contributor Author

autoantwort commented Oct 25, 2023

I do not believe this is correct. There is nothing to stop 2 instances from concurrently reading all the sync files, deciding that something needs to be deleted, writing their intent to delete to their own sync files, and now there is no way to resolve the ambiguity on who 'won' that race.

The instances only write to the sync files what they want to add, not what they want to delete.
So if you want to add a file with hash xxx and size 123 to the binary cache you append the following to your own sync file:
xxx;123\n

In particular, this is assuming that writes or appends within a file will be atomic, which is a feature most file systems do not provide and several network file systems extremely do not provide. The only operation which we can assume is atomic is the creation or removal of a file system entry; as in rename.

No this code does not assumes this. I explicitly handle this case ... I just realize that I wanted to implement this but haven't done this yet ... 🤦 🙈 Edit: Now implemented
It this is implemented, the implementation is save:

Start:
Current binary cache size: 9.5 GB
Two instances want to add 0.6 GB

Instance 1 Instance 2
1 Add half a line to own sync file
2 Add line to own sync file
3 Read other sync file
4 Delete files until there is space for the 0.6 GB file
5 Write rest of line
6 Read other sync file
7 Delete files until there is space for the 0.6 GB file

@julianxhokaxhiu
Copy link
Copy Markdown

Has there any progress been made on this direction? it's becoming quite an effort to cleanup manually the vcpkg cache directory from time to time. Thank you in advance 🙏🏻

@autoantwort
Copy link
Copy Markdown
Contributor Author

@julianxhokaxhiu You could compile your own version of vcpkg-tool 🙈 I am using this on mac/linux/windows on a daily basis since years and never had any problems.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants