Currently, dataset metadata such as `df_summary` and `dataset_names` is static. This means that each time a dataset is changed or added, we need to re-release the package for everything to be updated. I propose we turn these into functions that fetch the related metadata in real time (perhaps with a session cache). I already made this change in pmlbr in EpistasisLab/pmlbr#5 (new release coming on CRAN in a day or two), but I'll leave the Python implementation for someone else with more expertise. 🙏🏽 @lacava @weixuanfu

    df_summary = pandas.read_csv(StringIO(data.decode("utf-8")) , sep='\t')
    regression_dataset_names = df_summary.query('task=="regression"')['dataset'].tolist()
    classification_dataset_names = df_summary.query('task=="classification"')['dataset'].tolist()
    dataset_names = regression_dataset_names + classification_dataset_names

pmlb/pmlb/dataset_lists.py, lines 29 to 32 in 7c1f4bd
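
To make the proposal a bit more concrete, here is a rough sketch of what function-based metadata with a session cache could look like on the Python side. It is only an illustration, not the package's actual API: `SUMMARY_URL`, `_download_summary`, `parse_summary`, and `fetch_summary` are hypothetical names, and the URL is my assumption about where the summary table lives. To stay dependency-free the sketch parses the TSV with the standard library; in the package itself `pandas.read_csv` would presumably stay in place.

```python
import csv
import urllib.request
from functools import lru_cache
from io import StringIO

# Assumption: raw URL of the summary table. Point this at the real file.
SUMMARY_URL = (
    "https://github.com/EpistasisLab/pmlb/raw/master/pmlb/all_summary_stats.tsv"
)


def _download_summary():
    """Download the raw TSV bytes (hypothetical helper)."""
    with urllib.request.urlopen(SUMMARY_URL) as response:
        return response.read()


def parse_summary(data):
    """Parse TSV bytes into a list of dicts, one per dataset row."""
    return list(csv.DictReader(StringIO(data.decode("utf-8")), delimiter="\t"))


@lru_cache(maxsize=1)
def fetch_summary():
    """Fetch the table on first use; later calls in the session hit the cache."""
    return parse_summary(_download_summary())


def _names_for_task(task):
    """Names of all datasets whose 'task' column equals the given task."""
    return [row["dataset"] for row in fetch_summary() if row["task"] == task]


def regression_dataset_names():
    return _names_for_task("regression")


def classification_dataset_names():
    return _names_for_task("classification")


def dataset_names():
    # Same ordering as the current static list: regression, then classification.
    return regression_dataset_names() + classification_dataset_names()
```

With something like this, `dataset_names()` reflects whatever is in the repository at first call, and `fetch_summary.cache_clear()` forces a refresh within a session. The obvious trade-off is that the first call needs network access, so a fallback to a bundled copy may be worth considering.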