Demo datasets

Ready-to-use datasets

If you don’t have a dataset of your own ready, you can still get started right away with one of the datasets provided below:

US Census Income dataset

This dataset is derived from the Adult dataset in UC Irvine’s Machine Learning Repository.

It was extracted from the 1994 US Census database and contains 48,842 records and 13 columns with a mix of data types.

Click here to download the .csv.gz file.

us-census-income.csv.gz
       age         workclass  fnlwgt  education      marital-status         occupation  ...                race     sex hours-per-week  native-country capital  income
0       39         State-gov   77516  Bachelors       Never-married       Adm-clerical  ...               White    Male             40   United-States    2174   <=50K
1       50  Self-emp-not-inc   83311  Bachelors  Married-civ-spouse    Exec-managerial  ...               White    Male             13   United-States       0   <=50K
2       38           Private  215646    HS-grad            Divorced  Handlers-cleaners  ...               White    Male             40   United-States       0   <=50K
3       53           Private  234721       11th  Married-civ-spouse  Handlers-cleaners  ...               Black    Male             40   United-States       0   <=50K
4       28           Private  338409  Bachelors  Married-civ-spouse     Prof-specialty  ...               Black  Female             40            Cuba       0   <=50K
...    ...               ...     ...        ...                 ...                ...  ...                 ...     ...            ...             ...     ...     ...
48837   39           Private  215419  Bachelors            Divorced     Prof-specialty  ...               White  Female             36   United-States       0   <=50K
48838   64                 ?  321403    HS-grad             Widowed                  ?  ...               Black    Male             40   United-States       0   <=50K
48839   38           Private  374983  Bachelors  Married-civ-spouse     Prof-specialty  ...               White    Male             50   United-States       0   <=50K
48840   44           Private   83891  Bachelors            Divorced       Adm-clerical  ...  Asian-Pac-Islander    Male             40   United-States    5455   <=50K
48841   35      Self-emp-inc  182148  Bachelors  Married-civ-spouse    Exec-managerial  ...               White    Male             60   United-States       0    >50K
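
For readers who want to inspect the download before using it, here is a minimal sketch that reads a gzip-compressed CSV with Python's standard library only. It assumes the file was saved locally under the name shown above. The helper names read_gzipped_csv and count_missing are hypothetical, and treating "?" as a missing-value marker is an assumption inferred from row 48838 of the preview.

```python
import csv
import gzip


def read_gzipped_csv(path):
    """Return the rows of a gzip-compressed CSV file as a list of dicts."""
    # "rt" opens the compressed stream in text mode; newline="" is what the
    # csv module expects so that it can handle line endings itself.
    with gzip.open(path, "rt", newline="", encoding="utf-8") as handle:
        return list(csv.DictReader(handle))


def count_missing(rows, marker="?"):
    """Count, per column, how many cells equal the missing-value marker.

    Assumption: "?" marks a missing value, as the preview suggests.
    """
    counts = {}
    for row in rows:
        for column, value in row.items():
            if isinstance(value, str) and value.strip() == marker:
                counts[column] = counts.get(column, 0) + 1
    return counts


# Example usage (assumes the file sits in the current directory):
# rows = read_gzipped_csv("us-census-income.csv.gz")
# print(len(rows), count_missing(rows))
```

All values come back as strings, so numeric columns such as age need an explicit conversion. pandas users can get a typed table in one call, since pandas.read_csv reads .gz files directly.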

Baseball dataset

This dataset is taken from the Sean Lahman Baseball Database.

Click here to download the .zip file. It includes the players.csv and seasons.csv files.

players.csv
                 id country   birthDate   deathDate nameFirst   nameLast  weight  height bats throws
0      00020a493f3b    P.R.  1993-02-10         NaN     Jorge      Lopez   195.0    75.0    R      R
1      000492168bd5     USA  1945-10-12  1970-12-14    Herman       Hill   190.0    74.0    L      R
2      0007b3925736     USA  1890-12-24  1956-09-12       Tod      Sloan   175.0    72.0    L      R
3      000f9b5832e6     USA  1886-03-06  1948-05-26      Bill    Sweeney   175.0    71.0    R      R
4      00148e917757     USA  1959-09-10         NaN     Bruce    Robbins   190.0    73.0    L      L
...             ...     ...         ...         ...       ...        ...     ...     ...  ...    ...
18995  fff2e8e0ccff    P.R.  1953-04-02         NaN    Hector       Cruz   170.0    71.0    R      R
18996  fff3d8297c46     USA  1917-05-19  1993-06-07    Skippy    Roberge   185.0    71.0    R      R
18997  fff913eb4437  Panama  1976-06-20         NaN    Carlos        Lee   270.0    74.0    R      R
18998  fffa11996763    Cuba  1965-10-11         NaN   Orlando  Hernandez   210.0    74.0    R      R
18999  fffa80049d40    P.R.  1990-02-18         NaN       Joe      Colon   180.0    72.0    R      R

seasons.csv
          players_id  year team league   G  AB  R  H  HR  RBI   SB   CS  BB    SO
0       00020a493f3b  2015  MIL     NL   2   2  0  0   0  0.0  0.0  0.0   0   2.0
1       00020a493f3b  2017  MIL     NL   1   0  0  0   0  0.0  0.0  0.0   0   0.0
2       00020a493f3b  2018  MIL     NL  10   2  1  1   0  2.0  0.0  0.0   0   1.0
3       00020a493f3b  2018  KCA     AL   7   0  0  0   0  0.0  0.0  0.0   0   0.0
4       000492168bd5  1969  MIN     AL  16   2  4  0   0  0.0  1.0  2.0   0   1.0
...              ...   ...  ...    ...  ..  .. .. ..  ..  ...  ...  ...  ..   ...
103573  fffa11996763  2005  CHA     AL  24   3  0  1   0  0.0  0.0  0.0   0   1.0
103574  fffa11996763  2006  ARI     NL   9  11  0  3   0  0.0  0.0  0.0   1   0.0
103575  fffa11996763  2006  NYN     NL  20  35  4  5   0  2.0  1.0  0.0   0  10.0
103576  fffa11996763  2007  NYN     NL  28  48  1  8   0  3.0  2.0  0.0   0  18.0
103577  fffa80049d40  2016  CLE     AL  11   0  0  0   0  0.0  0.0  0.0   0   0.0
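
The two files appear to be linked: the players_id values in seasons.csv match id values in players.csv in the previews above. Assuming that relationship holds, the sketch below sums one season statistic per player and attaches the player's name. The function name total_by_player is hypothetical, and the sample rows are copied from the previews rather than read from the files.

```python
from collections import defaultdict


def total_by_player(players, seasons, stat="G"):
    """Sum one seasons.csv column per player and attach the player's name.

    Assumption: seasons["players_id"] refers to players["id"].
    Returns {player_id: (full_name, total)} for players with season rows.
    """
    totals = defaultdict(float)
    for season in seasons:
        value = season[stat]
        if value not in ("", None):  # skip empty cells
            totals[season["players_id"]] += float(value)

    result = {}
    for player in players:
        pid = player["id"]
        if pid in totals:
            full_name = f'{player["nameFirst"]} {player["nameLast"]}'
            result[pid] = (full_name, totals[pid])
    return result


# Sample rows copied from the previews (values as strings, as a CSV reader yields them).
players = [
    {"id": "00020a493f3b", "nameFirst": "Jorge", "nameLast": "Lopez"},
    {"id": "000492168bd5", "nameFirst": "Herman", "nameLast": "Hill"},
]
seasons = [
    {"players_id": "00020a493f3b", "year": "2015", "G": "2"},
    {"players_id": "00020a493f3b", "year": "2017", "G": "1"},
    {"players_id": "00020a493f3b", "year": "2018", "G": "10"},
    {"players_id": "00020a493f3b", "year": "2018", "G": "7"},
    {"players_id": "000492168bd5", "year": "1969", "G": "16"},
]

games = total_by_player(players, seasons, stat="G")
print(games["00020a493f3b"])  # ('Jorge Lopez', 20.0)
```

Results are keyed by id rather than by name because two players can share a name.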

CDNOW dataset

This dataset contains a CRM table and the entire purchase history up to the end of June 1998 of 23,570 customers who made their first-ever purchase at CDNOW in the first quarter of 1997.

Click here to download the .csv.gz file.

CDNOW_CRM_table.csv.gz
      first_name last_name       state gender   birthdate
0          Bobby  Thompson      Oregon      M  1972-07-19
1           John      Wood  New Jersey      M  1962-02-08
2        Michael  Griffith   Minnesota      M  1981-03-22
3           Eric    Walker    Michigan      M  1942-10-07
4         Austin    Levine  New Jersey      M  1952-05-23
...          ...       ...         ...    ...         ...
23565       Luis     Braun     Florida      M  1954-05-09
23566   Nicholas   Aguilar     Indiana      M  1950-10-01
23567     Alison    Larson  New Jersey      F  1954-06-08
23568     Joseph      Cook        Utah      M  1935-06-11
23569     Debbie    Zamora    Illinois      F  1977-06-02

Click here to download the .zip file. It includes the customers.csv and purchases.csv tables.

customers.csv
          id      zone       state gender age_category  age
0          1   Pacific      Oregon      M        young   26
1          2   Eastern  New Jersey      M       medium   36
2          3   Central   Minnesota      M        young   17
3          4   Eastern    Michigan      M       medium   56
4          5   Eastern  New Jersey      M       medium   46
...      ...       ...         ...    ...          ...  ...
23565  23566   Eastern     Florida      M       medium   44
23566  23567   Eastern     Indiana      M       medium   48
23567  23568   Eastern  New Jersey      F       medium   44
23568  23569  Mountain        Utah      M          old   63
23569  23570   Central    Illinois      F        young   21

purchases.csv
       users_id        date  cds    amt
0             1  1997-01-01    1  11.77
1             2  1997-01-12    1  12.00
2             2  1997-01-12    5  77.00
3             3  1997-01-02    2  20.76
4             3  1997-03-30    2  20.76
...         ...         ...  ...    ...
69654     23568  1997-04-05    4  83.74
69655     23568  1997-04-22    1  14.99
69656     23569  1997-03-25    2  25.74
69657     23570  1997-03-25    3  51.12
69658     23570  1997-03-26    2  42.96
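
Each customer can have several rows in purchases.csv, so a common first step is to aggregate them per customer. The sketch below is one way to do that, assuming users_id in purchases.csv refers to id in customers.csv (inferred from the previews, not stated on this page). The function name summarize_purchases is hypothetical, and the sample rows are the first five rows of the preview.

```python
from decimal import Decimal


def summarize_purchases(purchases):
    """Return {users_id: {"orders": n, "cds": total_cds, "amt": total_amt}}."""
    summary = {}
    for row in purchases:
        entry = summary.setdefault(
            row["users_id"], {"orders": 0, "cds": 0, "amt": Decimal("0")}
        )
        entry["orders"] += 1
        entry["cds"] += int(row["cds"])
        # Decimal avoids binary floating-point rounding on currency amounts.
        entry["amt"] += Decimal(row["amt"])
    return summary


# First five rows of the purchases.csv preview, as strings.
purchases = [
    {"users_id": "1", "date": "1997-01-01", "cds": "1", "amt": "11.77"},
    {"users_id": "2", "date": "1997-01-12", "cds": "1", "amt": "12.00"},
    {"users_id": "2", "date": "1997-01-12", "cds": "5", "amt": "77.00"},
    {"users_id": "3", "date": "1997-01-02", "cds": "2", "amt": "20.76"},
    {"users_id": "3", "date": "1997-03-30", "cds": "2", "amt": "20.76"},
]

summary = summarize_purchases(purchases)
print(summary["2"])  # {'orders': 2, 'cds': 6, 'amt': Decimal('89.00')}
```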

Netflix Prize dataset

This sequence dataset is an excerpt from the original Netflix Prize dataset. It contains 500,000+ ratings from 10,000 users.

Click here to download the .zip file.

users.csv
           id
0         495
1         840
2        1374
3        1522
4        1619
...       ...
9995  2648416
9996  2648568
9997  2648678
9998  2648907
9999  2649207

ratings.csv
        users_id        date                                 movie  rating
0            495  2003-10-08                         A Mighty Wind       4
1            495  2003-10-24                          On the Beach       4
2            495  2003-11-17                         Seven Samurai       5
3            495  2003-11-26                       Midnight Cowboy       4
4            495  2003-12-04                               Yojimbo       5
...          ...         ...                                   ...     ...
501283   2649207  2005-02-08                     Napoleon Dynamite       5
501284   2649207  2005-02-08       The Importance of Being Earnest       4
501285   2649207  2005-06-08                   Friday Night Lights       2
501286   2649207  2005-06-16  The Hitchhiker's Guide to the Galaxy       1
501287   2649207  2005-08-14                                   Ray       3
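
Because this is a sequence dataset, each user's ratings are naturally read as an ordered list of events. The sketch below groups rating rows by users_id and sorts each group by date. It assumes the dates are ISO formatted (YYYY-MM-DD) as in the preview, so plain string sorting is chronological. The function name build_rating_sequences is hypothetical, and the sample rows are copied from the preview in shuffled order.

```python
from collections import defaultdict


def build_rating_sequences(ratings):
    """Group ratings by user and order each group by date.

    Returns {users_id: [(date, movie, rating), ...]}.
    """
    sequences = defaultdict(list)
    for row in ratings:
        event = (row["date"], row["movie"], int(row["rating"]))
        sequences[row["users_id"]].append(event)
    for events in sequences.values():
        # ISO dates sort chronologically as strings; the sort is stable, so
        # ratings made on the same day keep their original file order.
        events.sort(key=lambda event: event[0])
    return dict(sequences)


# Rows copied from the ratings.csv preview, deliberately out of order.
ratings = [
    {"users_id": "495", "date": "2003-11-17", "movie": "Seven Samurai", "rating": "5"},
    {"users_id": "2649207", "date": "2005-08-14", "movie": "Ray", "rating": "3"},
    {"users_id": "495", "date": "2003-10-08", "movie": "A Mighty Wind", "rating": "4"},
    {"users_id": "495", "date": "2003-10-24", "movie": "On the Beach", "rating": "4"},
    {"users_id": "2649207", "date": "2005-06-08", "movie": "Friday Night Lights", "rating": "2"},
]

sequences = build_rating_sequences(ratings)
print([movie for _, movie, _ in sequences["495"]])
# ['A Mighty Wind', 'On the Beach', 'Seven Samurai']
```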