Duplicate datasets.

While trying to identify which datasets from the *modeldata* R package are already present in pmlb, I found that quite a few datasets are duplicates or simple subsets of other datasets.

- [cmc](https://epistasislab.github.io/pmlb/profile/cmc.html) and [contraceptive](https://epistasislab.github.io/pmlb/profile/contraceptive.html) are the same. The original can be found on the [UCI ML repository](https://archive-beta.ics.uci.edu/ml/datasets/contraceptive+method+choice).
  - [x] Parse data from the original into the expected format.
  - [x] Deprecate `cmc` and `contraceptive` datasets.
- [195_auto_price](https://epistasislab.github.io/pmlb/profile/195_auto_price.html) and [207_autoPrice](https://epistasislab.github.io/pmlb/profile/207_autoPrice.html) are near-duplicates: the `symboling` feature is shifted between the two datasets. Note: the underlying dataset seems to be the same as the one used for [auto](https://epistasislab.github.io/pmlb/profile/auto.html). The datasets differ in the target, which is price for `195_auto_price` and `207_autoPrice` and symboling for `auto`, and in how missing values were removed. The original dataset can be found on the [UCI ML repository](https://archive-beta.ics.uci.edu/ml/datasets/automobile).
  - [x] Parse data from the original into the expected format with price as target.
  - [x] Parse data from the original into the expected format with symboling as target.
  - [x] Ensure that `Description` of each new dataset references the other.
  - [x] Deprecate `195_auto_price`, `207_autoPrice` and `auto` datasets.
- [glass](https://epistasislab.github.io/pmlb/profile/glass.html) and [prnn_fglass](https://epistasislab.github.io/pmlb/profile/prnn_fglass.html). The target class levels seem to be switched between datasets. The original can be found on the [UCI ML repository](https://archive-beta.ics.uci.edu/ml/datasets/glass+identification).
  - [x] Parse data from the original into the expected format.
  - [x] Deprecate `glass` and `prnn_fglass` datasets.
- [heart_c](https://epistasislab.github.io/pmlb/profile/heart_c.html), [cleve](https://epistasislab.github.io/pmlb/profile/cleve.html), [cleveland_nominal](https://epistasislab.github.io/pmlb/profile/cleveland_nominal.html) and [cleveland](https://epistasislab.github.io/pmlb/profile/cleveland.html). The `cleve` and `heart_c` datasets have a binarized target (vs. ordinal in the other two datasets); the `cleveland_nominal` dataset contains only a subset of the features. The original can be found on the [UCI ML repository](https://archive-beta.ics.uci.edu/ml/datasets/heart+disease).
- [heart_statlog](https://epistasislab.github.io/pmlb/profile/heart_statlog.html) is a subset of the `cleve` dataset.
- [heart_h](https://epistasislab.github.io/pmlb/profile/heart_h.html) and [hungarian](https://epistasislab.github.io/pmlb/profile/hungarian.html) appear to be the same.
  - [x] Parse Cleveland data from the original into the expected format.
  - [x] Parse Hungarian data from the original into the expected format.
  - [x] Parse Switzerland data (currently missing) from the original into the expected format.
  - [x] Parse VA Long beach data (currently missing) from the original into the expected format.
  - [x] Deprecate `heart_c`, `cleve`, `cleveland_nominal`, `cleveland`, `heart_statlog`, `heart_h` and `hungarian` datasets.
- [colic](https://epistasislab.github.io/pmlb/profile/colic.html) and [horse_colic](https://epistasislab.github.io/pmlb/profile/horse_colic.html) appear to be the same. The original can be found on the [UCI ML repository](https://archive-beta.ics.uci.edu/ml/datasets/horse+colic). This issue was also mentioned in #75.
  - [x] Parse data from the original into the expected format.
  - [x] Deprecate `colic` and `horse_colic` datasets.
- [vote](https://epistasislab.github.io/pmlb/profile/vote.html) and [house_votes_84](https://epistasislab.github.io/pmlb/profile/house_votes_84.html) are identical.
  - [x] Identify original source.
  - [x] Parse data from the original into the expected format.
  - [x] Deprecate `vote` and `house_votes_84` datasets.
- [breast_cancer_wisconsin](https://epistasislab.github.io/pmlb/profile/breast_cancer_wisconsin.html) and [wdbc](https://epistasislab.github.io/pmlb/profile/wdbc.html) are the same. The original can be found on the [UCI ML repository](https://archive-beta.ics.uci.edu/ml/datasets/breast+cancer+wisconsin+diagnostic).
  - [x] Parse data from the original into the expected format.
  - [x] Deprecate `breast_cancer_wisconsin` and `wdbc` datasets.
- [australian](https://epistasislab.github.io/pmlb/profile/australian.html), [buggyCrx](https://epistasislab.github.io/pmlb/profile/buggyCrx.html), [credit_a](https://epistasislab.github.io/pmlb/profile/credit_a.html) and [crx](https://epistasislab.github.io/pmlb/profile/crx.html) are identical or based on the same data.
  - [x] Identify original source.
  - [x] Parse data from the original into the expected format.
  - [x] Deprecate `australian`, `buggyCrx`, `credit_a` and `crx` datasets.
- [breast_w](https://epistasislab.github.io/pmlb/profile/breast_w.html) and [breast](https://epistasislab.github.io/pmlb/profile/breast.html) are based on the same data. The `breast` dataset has a `Sample code number` feature that is not present in `breast_w`. The original can be found on the [UCI ML repository](https://archive-beta.ics.uci.edu/ml/datasets/breast+cancer+wisconsin+original).
  - [x] Parse data from the original into the expected format.
  - [x] Deprecate `breast_w` and `breast` datasets.
- [diabetes](https://epistasislab.github.io/pmlb/profile/diabetes.html) and [pima](https://epistasislab.github.io/pmlb/profile/pima.html) appear to be identical.
  - [x] Identify original source. This dataset appears to have been hosted at the UCI ML repository. However, the original owner seems to have withdrawn permission to use this dataset.
  - [x] ~~Parse data from the original into the expected format.~~
  - [x] Deprecate `diabetes` and `pima` datasets.
- [credit_g](https://epistasislab.github.io/pmlb/profile/credit_g.html) and [german](https://epistasislab.github.io/pmlb/profile/german.html) appear to be identical.
  - [x] Identify original source. The original can be found on the [UCI ML repository](https://archive.ics.uci.edu/ml/datasets/statlog+(german+credit+data)).
  - [x] Parse data from the original into the expected format.
  - [x] Deprecate `credit_g` and `german` datasets.
- [solar_flare_2](https://epistasislab.github.io/pmlb/profile/solar_flare_2.html) and [flare](https://epistasislab.github.io/pmlb/profile/flare.html) derive from the same data, but differ in the way the target is formulated. `solar_flare_2` also contains two additional features.
  - [x] Identify original source. The original can be found on the [UCI ML repository](https://archive.ics.uci.edu/ml/datasets/solar+flare). There are three targets, of which one is useful for ML prediction. The additional features in `solar_flare_2` are in fact the other two targets.
  - [x] Parse data from the original into the expected format.
  - [x] Deprecate `solar_flare_2` and `flare` datasets.
- [car](https://epistasislab.github.io/pmlb/profile/car.html) and [car_evaluation](https://epistasislab.github.io/pmlb/profile/car_evaluation.html) are based on the same dataset. In the `car_evaluation` dataset, several categorical (ordinal) features from `car` are one-hot-encoded. The original can be found on the [UCI ML repository](https://archive-beta.ics.uci.edu/ml/datasets/car+evaluation). This issue was also mentioned in #84.
  - [x] Parse data from the original into the expected format.
  - [x] Deprecate `car` and `car_evaluation` datasets.
- [chess](https://epistasislab.github.io/pmlb/profile/chess.html) and [kr_vs_kp](https://epistasislab.github.io/pmlb/profile/kr_vs_kp.html) are identical. The original can be found on the [UCI ML repository](https://archive-beta.ics.uci.edu/ml/datasets/chess+king+rook+vs+king+pawn).
  - [ ] Parse data from the original into the expected format.
  - [ ] Deprecate `chess` and `kr_vs_kp` datasets.
- [satimage](https://epistasislab.github.io/pmlb/profile/satimage.html) and [294_satellite_image](https://epistasislab.github.io/pmlb/profile/294_satellite_image.html) are the same, except that `294_satellite_image` incorrectly specifies a regression problem. The original can be found on the [UCI ML repository](https://archive-beta.ics.uci.edu/ml/datasets/statlog+landsat+satellite) and has six classes as the target.
  - [ ] Parse data from the original into the expected format.
  - [ ] Deprecate `satimage` and `294_satellite_image` datasets.
- [197_cpu_act](https://epistasislab.github.io/pmlb/profile/197_cpu_act.html), [227_cpu_small](https://epistasislab.github.io/pmlb/profile/227_cpu_small.html), [562_cpu_small](https://epistasislab.github.io/pmlb/profile/562_cpu_small.html) and [573_cpu_act](https://epistasislab.github.io/pmlb/profile/573_cpu_act.html) are based on the same dataset, with the difference being that `227_cpu_small` and `562_cpu_small` have fewer features.
  - [ ] Identify original source.
  - [ ] Parse data from the original into the expected format.
  - [ ] Deprecate `197_cpu_act`, `227_cpu_small`, `562_cpu_small` and `573_cpu_act` datasets.
- [poker](https://epistasislab.github.io/pmlb/profile/poker.html) and [1595_poker](https://epistasislab.github.io/pmlb/profile/1595_poker.html) are identical except for the target specification. The original can be found on the [UCI ML repository](https://archive-beta.ics.uci.edu/ml/datasets/poker+hand), which suggests that the target is ordinal.
  - [ ] Parse data from the original into the expected format.
  - [ ] Deprecate `poker` and `1595_poker` datasets.
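
Pairs described above as "identical" or "the same" could be found automatically by hashing each table so that row order does not matter. The following is a minimal sketch, not the procedure actually used for this issue. The helper names `fingerprint` and `group_duplicates` and the toy tables are hypothetical, and the sketch assumes the columns are already in the same order and encoded the same way.

```python
import hashlib
from collections import defaultdict


def fingerprint(rows):
    """Hash a table so that the order of its rows does not affect the result."""
    # Serialise every row, then sort so that shuffled copies hash identically.
    encoded = sorted(repr(tuple(row)) for row in rows)
    digest = hashlib.sha256()
    for line in encoded:
        digest.update(line.encode("utf-8"))
        digest.update(b"\n")
    return digest.hexdigest()


def group_duplicates(datasets):
    """Return sorted lists of dataset names whose tables share a fingerprint."""
    groups = defaultdict(list)
    for name, rows in datasets.items():
        groups[fingerprint(rows)].append(name)
    return sorted(sorted(names) for names in groups.values() if len(names) > 1)


# Toy stand-ins for real tables; names and values are illustrative only.
toy = {
    "a": [(1, 0, "x"), (2, 1, "y")],
    "a_copy": [(2, 1, "y"), (1, 0, "x")],  # same rows, different order
    "b": [(1, 0, "x"), (3, 1, "y")],
}

duplicates = group_duplicates(toy)
print(duplicates)
```

Because the rows are compared through `repr`, the values `1` and `1.0` count as different, so real tables would need a consistent dtype before hashing. This check only catches exact copies; shifted features, relabeled targets and one-hot-encoded variants need separate comparisons.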

My proposal is to remove the duplicates and replace them with the original dataset wherever it can be found. This might also address the following issues:

- #20 
- #75
- #84
- #159
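
When deciding which entries to consolidate, it may help to separate exact copies from variants such as a target whose class levels are switched (as with `glass` and `prnn_fglass`) or a dataset that keeps only some of the features (as with `cleveland_nominal`). The sketch below illustrates both checks on toy data. The function names are hypothetical, and it assumes the rows of the two tables are aligned in the same order, which would have to be established first for real data.

```python
def is_relabeling(labels_a, labels_b):
    """True if labels_b equals labels_a under a one-to-one renaming of classes."""
    if len(labels_a) != len(labels_b):
        return False
    forward, backward = {}, {}
    for a, b in zip(labels_a, labels_b):
        # setdefault records the first pairing and returns it on later visits,
        # so any inconsistent pairing in either direction is detected.
        if forward.setdefault(a, b) != b or backward.setdefault(b, a) != a:
            return False
    return True


def is_column_subset(small, large):
    """True if every column of `small` occurs with identical values in `large`."""
    return all(
        name in large and large[name] == values
        for name, values in small.items()
    )


# Toy data; the values are illustrative and not taken from any real dataset.
target_a = [1, 1, 2, 3]
target_b = [2, 2, 3, 1]  # same partition of rows, class names permuted
target_c = [2, 3, 3, 1]  # a different partition of rows

full = {"age": [63, 67], "sex": [1, 1], "chol": [233, 286]}
reduced = {"age": [63, 67], "chol": [233, 286]}

print(is_relabeling(target_a, target_b))
print(is_relabeling(target_a, target_c))
print(is_column_subset(reduced, full))
```

The two-way mapping in `is_relabeling` matters: a one-way mapping alone would also accept a target in which two classes had been merged, which is a different kind of change from switched levels.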
