While trying to identify which data sets from the modeldata R package are already present in pmlb, I found that quite a few datasets are duplicates or simple subsets of other datasets.
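For illustration, here is a rough sketch of how such relationships could be checked automatically. It is not the procedure I actually used. It assumes each dataset is loaded as a list of dicts, that matching features carry the same column names in both datasets, and that the target column is called `target`. The function names are hypothetical.

```python
from collections import Counter


def row_multiset(rows, columns):
    """Multiset of rows restricted to `columns`; independent of row order."""
    return Counter(tuple(row[c] for c in columns) for row in rows)


def compare_datasets(a, b, target="target"):
    """Classify how two tables (non-empty lists of dicts) relate to each other.

    The target column is ignored on purpose, so datasets that differ only in
    how the target is specified are still reported as duplicates.
    """
    feats_a = set(a[0]) - {target}
    feats_b = set(b[0]) - {target}
    shared = sorted(feats_a & feats_b)
    if not shared or len(a) != len(b):
        return "different"
    # Rows must agree on every shared feature, in any order.
    if row_multiset(a, shared) != row_multiset(b, shared):
        return "different"
    if feats_a == feats_b:
        return "duplicate"
    if feats_a < feats_b or feats_b < feats_a:
        return "feature subset"
    return "overlapping"
```

This would miss duplicates whose columns were renamed or whose values were recoded (for example, switched class levels), so a manual look at the original source is still needed.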
Parse data from the original into the expected format.
Deprecate cmc and contraceptive datasets.
195_auto_price and 207_autoPrice. The symboling feature underwent a shift between both datasets. Note: the underlying dataset seems to be the same as the one used for auto. The difference between the datasets is the target, which is price for 195_auto_price and 207_autoPrice, and symboling for auto, as well as how missing values were removed. The original dataset may be found on the UCI ML repository.
Parse data from the original into the expected format with price as target.
Parse data from the original into the expected format with symboling as target.
Ensure that Description of each new dataset references the other.
Deprecate 195_auto_price, 207_autoPrice and auto datasets.
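The two parsing steps above could share one helper, so that both new datasets come from the same source with the same missing-value rule. A minimal sketch, assuming the raw rows are dicts, that missing values are marked with `?`, and that the expected format stores the label in a column named `target`. The helper name and the toy rows are made up for illustration and are not taken from the actual data.

```python
def to_expected_format(rows, target, missing="?"):
    """Return complete rows with the column `target` renamed to 'target'."""
    out = []
    for row in rows:
        # Drop incomplete rows with one rule shared by every derived dataset.
        if any(value == missing for value in row.values()):
            continue
        record = {k: v for k, v in row.items() if k != target}
        record["target"] = row[target]
        out.append(record)
    return out


# Toy rows (hypothetical values), one of them with a missing price.
raw = [
    {"symboling": 3, "make": "a", "price": 100},
    {"symboling": 1, "make": "b", "price": "?"},
    {"symboling": 2, "make": "c", "price": 200},
]
by_price = to_expected_format(raw, target="price")
by_symboling = to_expected_format(raw, target="symboling")
```

Both results keep the same rows and differ only in which column became the target, which is what the two replacement datasets should look like.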
glass and prnn_fglass. The target class levels seem to be switched between datasets. The original can be found on the UCI ML repository.
Parse data from the original into the expected format.
Deprecate glass and prnn_fglass datasets.
heart_c, cleve, cleveland_nominal and cleveland. The cleve and heart_c data sets have a binarized target (vs. ordinal in the other two datasets); the cleveland_nominal data set contains only a feature subset. The original can be found on the UCI ML repository.
Parse data from the original into the expected format.
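If a binarized variant is still wanted after consolidation, it can be derived from the ordinal target rather than stored as a separate dataset. A minimal sketch, assuming the ordinal target is an integer where 0 is one class and every positive value collapses into the other. I have not verified that this threshold is how cleve and heart_c were actually produced.

```python
def binarize_target(rows, target="target"):
    """Return copies of `rows` with the ordinal target collapsed to 0/1."""
    # Assumption: 0 stays 0, any value above 0 becomes 1.
    return [{**row, target: int(row[target] > 0)} for row in rows]
```

The input rows are left unmodified, so the ordinal version remains available alongside the derived one.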
Deprecate australian, buggyCrx, credit_a and crx datasets.
breast_w and breast are based on the same data. The breast dataset has a Sample code number feature that is not present in breast_w. The original can be found on the UCI ML repository.
Parse data from the original into the expected format.
Identify original source. This dataset appears to have been hosted at the UCI ML repository. However, the original owner seems to have withdrawn permission to use this dataset.
Parse data from the original into the expected format.
Identify original source. The original can be found on the UCI ML repository.
Parse data from the original into the expected format.
Deprecate credit_g and german datasets.
solar_flare_2 and flare derive from the same data, but differ in the way the target is formulated. solar_flare_2 also contains two additional features.
Identify original source. The original can be found on the UCI ML repository. There are three targets, of which one is useful for ML prediction. The additional features in solar_flare_2 are in fact the other two targets.
Parse data from the original into the expected format.
Parse data from the original into the expected format.
Deprecate chess and kr_vs_kp datasets.
satimage and 294_satellite_image are the same, with the exception that 294_satellite_image incorrectly specifies a regression problem. The original can be found on the UCI ML repository, and has multiple (6) classes as target.
Parse data from the original into the expected format.
Deprecate satimage and 294_satellite_image datasets.
Parse data from the original into the expected format.
Deprecate 197_cpu_act, 227_cpu_small, 562_cpu_small and 573_cpu_act datasets.
poker and 1595_poker are identical except for the target specification. The original can be found on the UCI ML repository, and suggests the target is ordinal.
Parse data from the original into the expected format.
Deprecate poker and 1595_poker datasets.
[…] cleve data set.
Deprecate heart_c, cleve, cleveland_nominal, cleveland, heart_statlog, heart_h and hungarian datasets.
Deprecate colic and horse_colic datasets.
Deprecate vote and house_votes_84 datasets.
Deprecate breast_cancer_wisconsin and wdbc datasets.
Deprecate breast_w and breast datasets.
Deprecate diabetes and pima datasets.
Deprecate solar_flare_2 and flare datasets.
In the car_evaluation dataset, several categorical (ordinal) features from car are one-hot-encoded. The original can be found on the UCI ML repository. This issue was also mentioned in "car and car_evaluation seem to be identical" (#84).
Deprecate car and car_evaluation datasets.
227_cpu_small and 562_cpu_small have fewer features.
My proposal is to remove duplicates, using an original dataset where this can be found. This might also address the following issues: