Artifact for TOSEM paper: Beyond Fidelity: Explaining Vulnerability Localization of Learning-based Detectors.
The SARD dataset has been uploaded to Zenodo. For the Fan dataset, related information is available at MSR_20_Code_vulnerability_CSV_Dataset, and the dataset CSV can be downloaded from Google Drive. We extract func_before and func_after from it.
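The extraction of func_before and func_after can be sketched roughly as follows. This is a minimal illustration rather than the artifact's own script: it assumes only that the CSV header contains the func_before and func_after columns named above, the helper name iter_function_pairs is hypothetical, and the sample row is made up.

```python
import csv
import io

# Function bodies can be long, so raise the csv module's default per-field limit.
csv.field_size_limit(2**31 - 1)


def iter_function_pairs(csv_file):
    """Yield (func_before, func_after) tuples from an open CSV file.

    Assumes the header row contains 'func_before' and 'func_after' columns.
    """
    reader = csv.DictReader(csv_file)
    for row in reader:
        yield row["func_before"], row["func_after"]


# Tiny in-memory stand-in for the real CSV (made-up content for illustration).
sample = io.StringIO(
    "func_before,func_after\n"
    '"int f(char *s) { return strlen(s); }",'
    '"int f(char *s) { return s ? strlen(s) : 0; }"\n'
)
pairs = list(iter_function_pairs(sample))
print(len(pairs))
```

For the real file, open it with `open(path, newline="", encoding="utf-8")` and pass the file object to the helper; the encoding is an assumption and may need adjusting.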
To preprocess code into graphs, please refer to preprocess/ReadMe.md.
Run python pretrain.py detector_name path2train_datas embedding_model_path, where:
- detector_name: the name of the detector, one of reveal, devign, ivdetect, deepwukong. The remaining three sequence-based detectors will be added to this pipeline soon.
- path2train_datas: the directory that stores train_vul.json, train_normal.json, eval_vul.json, eval_normal.json, test_vul.json, and test_normal.json. The script reads training data from the train JSON files.
- embedding_model_path: the path to the saved embedding model.
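Before running the scripts, it may help to confirm that the data directory holds all six JSON files listed above. The following is a small sketch of such a check; the helper name missing_dataset_files is hypothetical and is not part of the artifact, and the demo writes a placeholder file only to exercise the check.

```python
import tempfile
from pathlib import Path

# The six file names listed above, in the same order.
EXPECTED_FILES = [
    f"{split}_{label}.json"
    for split in ("train", "eval", "test")
    for label in ("vul", "normal")
]


def missing_dataset_files(dataset_dir):
    """Return the expected JSON file names that are absent from dataset_dir."""
    root = Path(dataset_dir)
    return [name for name in EXPECTED_FILES if not (root / name).is_file()]


# Demo on a throwaway directory that contains only one of the six files.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "train_vul.json").write_text("[]")  # placeholder content
    demo_missing = missing_dataset_files(tmp)
print(demo_missing)
```

An empty return value means the directory is complete; otherwise the list names the files that still need to be generated.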
Run python detection.py <args> to train detectors. <args> includes:
- --detector <detector_name>: <detector_name> can be one of ["deepwukong", "reveal", "ivdetect", "devign", "tokenlstm", "vuldeepecker", "sysevr"].
- --w2v_model_path <model_path>: <model_path> can be a relative or absolute path to the pretrained word2vec model.
- --dataset_dir <dataset_dir>: <dataset_dir> is the path to the directory storing the JSON data. It should include train_vul.json, train_normal.json, eval_vul.json, eval_normal.json, test_vul.json, and test_normal.json.
- --model_dir <model_dir>: <model_dir> is the directory where the model .pth file is placed. The script automatically loads the best model in this directory.
- --train: train the model. If a model already exists in <model_dir>, the script first loads that model and then continues training.
- --test: test the model. A model must already exist in <model_dir>.
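A train-then-test run can be scripted by assembling the arguments listed above. The sketch below only builds and prints the command lines; the helper detection_command and all paths are hypothetical placeholders, and it passes --train and --test in separate invocations because whether the two flags can be combined is not stated here.

```python
import shlex

# Detector names accepted by --detector, as listed above.
DETECTORS = [
    "deepwukong", "reveal", "ivdetect", "devign",
    "tokenlstm", "vuldeepecker", "sysevr",
]


def detection_command(detector, w2v_model_path, dataset_dir, model_dir, mode):
    """Build the argument list for one detection.py run.

    mode is "train" or "test" and selects the --train or --test flag.
    """
    if detector not in DETECTORS:
        raise ValueError(f"unknown detector: {detector}")
    if mode not in ("train", "test"):
        raise ValueError(f"mode must be 'train' or 'test', got {mode!r}")
    return [
        "python", "detection.py",
        "--detector", detector,
        "--w2v_model_path", str(w2v_model_path),
        "--dataset_dir", str(dataset_dir),
        "--model_dir", str(model_dir),
        f"--{mode}",
    ]


# Placeholder paths; substitute your own locations.
train_cmd = detection_command("reveal", "w2v/reveal.model", "data/sard", "models/reveal", "train")
test_cmd = detection_command("reveal", "w2v/reveal.model", "data/sard", "models/reveal", "test")
print(shlex.join(train_cmd))
print(shlex.join(test_cmd))
```

Each list can be handed to `subprocess.run(cmd, check=True)` from the repository root to actually launch the script.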
Run python explain.py <args>. <args> includes:
- --detector <detector_name>: <detector_name> can be one of ["deepwukong", "reveal", "ivdetect", "devign", "tokenlstm", "vuldeepecker", "sysevr"].
- --w2v_model_path <model_path>: <model_path> can be a relative or absolute path to the pretrained word2vec model.
- --dataset_dir <dataset_dir>: <dataset_dir> is the path to the directory storing the JSON data. It should include test_vul.json.
- --model_dir <model_dir>: <model_dir> is the directory where the model .pth file is placed. The script automatically loads the best model in this directory.
- --explainer <explainer_name>: <explainer_name> can currently be one of ["gnnexplainer", "pgexplainer", "gnnlrp", "gradcam", "deeplift"]. The code for sequence-based explainers is still being integrated into this pipeline.
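To run several explainers in a batch, the arguments above can be enumerated programmatically. The following sketch makes two assumptions that are not stated in this README: that each listed explainer can be paired with each of the four graph-based detectors, and that models are stored one directory per detector under a common root. The helper explain_commands and all paths are hypothetical.

```python
import shlex
from itertools import product

# Graph-based detectors and currently available explainers, as listed in this README.
GRAPH_DETECTORS = ["reveal", "devign", "ivdetect", "deepwukong"]
EXPLAINERS = ["gnnexplainer", "pgexplainer", "gnnlrp", "gradcam", "deeplift"]


def explain_commands(w2v_model_path, dataset_dir, model_root):
    """Build one explain.py argument list per (detector, explainer) pair.

    Assumes models live in <model_root>/<detector>, which is a layout
    chosen for this sketch, not one prescribed by the artifact.
    """
    commands = []
    for detector, explainer in product(GRAPH_DETECTORS, EXPLAINERS):
        commands.append([
            "python", "explain.py",
            "--detector", detector,
            "--w2v_model_path", str(w2v_model_path),
            "--dataset_dir", str(dataset_dir),
            "--model_dir", f"{model_root}/{detector}",
            "--explainer", explainer,
        ])
    return commands


# Placeholder paths; substitute your own locations.
all_cmds = explain_commands("w2v/model.bin", "data/sard", "models")
for cmd in all_cmds[:2]:
    print(shlex.join(cmd))
```

If a single word2vec model is not shared across detectors in your setup, pass a per-detector path instead of one common w2v_model_path.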
@misc{cheng2024fidelity,
title={Beyond Fidelity: Explaining Vulnerability Localization of Learning-based Detectors},
author={Baijun Cheng and Shengming Zhao and Kailong Wang and Meizhen Wang and Guangdong Bai and Ruitao Feng and Yao Guo and Lei Ma and Haoyu Wang},
year={2024},
eprint={2401.02686},
archivePrefix={arXiv},
primaryClass={cs.CR}
}