If you don’t have a CSV file of your own ready, you can get started right away with one of the sample datasets below:

US Census Income dataset

This dataset is taken from the Adult Data Set from UC Irvine’s Machine Learning Repository.

It is an extract from the 1994 US Census database and contains 48,842 records and 13 columns of data, with a mix of numeric and categorical data types.

Click here to download the .CSV file.

us-census-income.csv
       age        workclass fnlwgt education     marital-status        occupation   relationship               race    sex hours-per-week native-country capital income
    1:  39        State-gov  77516 Bachelors      Never-married      Adm-clerical  Not-in-family              White   Male             40  United-States    2174  <=50K
    2:  50 Self-emp-not-inc  83311 Bachelors Married-civ-spouse   Exec-managerial        Husband              White   Male             13  United-States       0  <=50K
    3:  38          Private 215646   HS-grad           Divorced Handlers-cleaners  Not-in-family              White   Male             40  United-States       0  <=50K
    4:  53          Private 234721      11th Married-civ-spouse Handlers-cleaners        Husband              Black   Male             40  United-States       0  <=50K
    5:  28          Private 338409 Bachelors Married-civ-spouse    Prof-specialty           Wife              Black Female             40           Cuba       0  <=50K
   ...
48838:  39          Private 215419 Bachelors           Divorced    Prof-specialty  Not-in-family              White Female             36  United-States       0  <=50K
48839:  64                ? 321403   HS-grad            Widowed                 ? Other-relative              Black   Male             40  United-States       0  <=50K
48840:  38          Private 374983 Bachelors Married-civ-spouse    Prof-specialty        Husband              White   Male             50  United-States       0  <=50K
48841:  44          Private  83891 Bachelors           Divorced      Adm-clerical      Own-child Asian-Pac-Islander   Male             40  United-States    5455  <=50K
48842:  35     Self-emp-inc 182148 Bachelors Married-civ-spouse   Exec-managerial        Husband              White   Male             60  United-States       0   >50K
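
Some rows in the preview above (for example row 48839) carry a `?` in place of a value. The following is a minimal sketch of how such placeholders could be turned into proper missing values when loading the file. It uses only Python's standard library and a tiny in-memory sample copied from the preview; the comma delimiter, the reduced column set, and the helper name `read_rows` are assumptions made for illustration, not a description of the actual file or of any tool's behavior.

```python
import csv
import io

# Two rows copied from the preview above, reduced to four columns.
# ASSUMPTION: the real file is comma-separated with a header row.
SAMPLE = """age,workclass,occupation,income
39,Private,Prof-specialty,<=50K
64,?,?,<=50K
"""


def read_rows(handle, missing_marker="?"):
    """Read CSV rows as dicts, mapping the missing marker to None."""
    reader = csv.DictReader(handle)
    return [
        {
            key: (None if value.strip() == missing_marker else value.strip())
            for key, value in row.items()
        }
        for row in reader
    ]


rows = read_rows(io.StringIO(SAMPLE))

# Count how many records have no workclass recorded.
missing_workclass = sum(1 for row in rows if row["workclass"] is None)
print(missing_workclass)
```

To try this on the downloaded file, replace `io.StringIO(SAMPLE)` with an open file handle, for example `open("us-census-income.csv", newline="")`.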

Baseball dataset

This dataset is taken from the Sean Lahman Baseball Database.

It consists of two data tables: a players table with 17,000 MLB players, and a seasons table with up to 15 seasons of batting statistics per player.

Click here to download the .ZIP file.

players.csv
    1: 00020a493f3b    P.R. 1993-02-10       <NA>     Jorge    Lopez    195     75    R      R
    2: 000492168bd5     USA 1945-10-12 1970-12-14    Herman     Hill    190     74    L      R
    3: 0007b3925736     USA 1890-12-24 1956-09-12       Tod    Sloan    175     72    L      R
    4: 000b415221f6     USA 1979-04-23       <NA>     Henry    Owens    230     75    R      R
    5: 000f9b5832e6     USA 1886-03-06 1948-05-26      Bill  Sweeney    175     71    R      R
   ...
16996: ffe6f538955f     USA 1867-10-07 1915-09-23 Brickyard  Kennedy    160     71    R      R
16997: ffefc03893ec     USA 1992-02-01       <NA>      Sean   Manaea    245     77    R      L
16998: fff23e39b183     USA 1869-10-11 1906-02-14      Yale   Murphy    125     63    L      R
16999: fff3d8297c46     USA 1917-05-19 1993-06-07    Skippy  Roberge    185     71    R      R
17000: fffa80049d40    P.R. 1990-02-18       <NA>       Joe    Colon    180     72    R      R

seasons.csv
          players_id year team league  G AB R H HR RBI SB CS BB SO
     1: 00020a493f3b 2015  MIL     NL  2  2 0 0  0   0  0  0  0  2
     2: 00020a493f3b 2017  MIL     NL  1  0 0 0  0   0  0  0  0  0
     3: 00020a493f3b 2018  MIL     NL 10  2 1 1  0   2  0  0  0  1
     4: 00020a493f3b 2018  KCA     AL  7  0 0 0  0   0  0  0  0  0
     5: 000492168bd5 1969  MIN     AL 16  2 4 0  0   0  1  2  0  1
    ---
105857: fffa11996763 2005  CHA     AL 24  3 0 1  0   0  0  0  0  1
105858: fffa11996763 2006  ARI     NL  9 11 0 3  0   0  0  0  1  0
105859: fffa11996763 2006  NYN     NL 20 35 4 5  0   2  1  0  0 10
105860: fffa11996763 2007  NYN     NL 28 48 1 8  0   3  2  0  0 18
105861: fffa80049d40 2016  CLE     AL 11  0 0 0  0   0  0  0  0  0
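
Each row in `seasons.csv` points back to a player through `players_id`, and a player can appear more than once in the same year (row 3 and row 4 above show two teams in 2018). The sketch below counts distinct seasons per player from a few rows copied from the preview. Treating a "season" as a distinct `year` value, the comma delimiter, and the helper name `seasons_per_player` are assumptions made for illustration.

```python
import csv
import io
from collections import defaultdict

# Rows copied from the seasons.csv preview, reduced to five columns.
# ASSUMPTION: the real file is comma-separated with a header row.
SEASONS = """players_id,year,team,league,G
00020a493f3b,2015,MIL,NL,2
00020a493f3b,2017,MIL,NL,1
00020a493f3b,2018,MIL,NL,10
00020a493f3b,2018,KCA,AL,7
000492168bd5,1969,MIN,AL,16
"""


def seasons_per_player(handle):
    """Return {players_id: number of distinct years} for a seasons table."""
    years = defaultdict(set)
    for row in csv.DictReader(handle):
        # A set keeps each year once, even if the player changed teams mid-year.
        years[row["players_id"]].add(row["year"])
    return {player_id: len(seen) for player_id, seen in years.items()}


counts = seasons_per_player(io.StringIO(SEASONS))
print(counts)
```

Counting rows instead of distinct years would give a different number for players who switched teams during a season, so it is worth deciding which definition fits your use case.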

Netflix Prize dataset

This sequence dataset is an excerpt from the original Netflix Prize dataset. It contains about 500,000 ratings from 10,000 users, instead of the original's roughly 100 million ratings from about 500,000 users.

Click here to download the .ZIP file.

users.csv
        id
     1: 495
     2: 840
     3: 1374
     4: 1522
     5: 1619
    ---
  9997: 2648568
  9998: 2648678
  9999: 2648907
 10000: 2649207

ratings.csv
        users_id date       movie                                rating
     1: 495      2003-10-08 A Mighty Wind                        4
     2: 495      2003-10-24 On the Beach                         4
     3: 495      2003-11-17 Seven Samurai                        5
     4: 495      2003-11-26 Midnight Cowboy                      4
    ---
501286: 2649207  2005-02-08 The Importance of Being Earnest      4
501287: 2649207  2005-06-08 Friday Night Lights                  2
501288: 2649207  2005-06-16 The Hitchhiker's Guide to the Galaxy 1
501289: 2649207  2005-08-14 Ray                                  3
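
The two files form a parent-child pair: `id` in `users.csv` matches `users_id` in `ratings.csv`, and each user's ratings form a sequence over time. As a rough sketch of that relationship, the following groups ratings by user and sorts each group by date. The sample rows come from the preview above (deliberately listed out of order); the comma delimiter and the helper name `ratings_by_user` are assumptions made for illustration.

```python
import csv
import io
from collections import defaultdict

# Sample rows copied from the previews above.
# ASSUMPTION: both real files are comma-separated with a header row.
USERS = "id\n495\n2649207\n"
RATINGS = """users_id,date,movie,rating
495,2003-10-24,On the Beach,4
495,2003-10-08,A Mighty Wind,4
2649207,2005-08-14,Ray,3
"""


def ratings_by_user(users_handle, ratings_handle):
    """Return {user id: list of that user's ratings, ordered by date}."""
    user_ids = [row["id"] for row in csv.DictReader(users_handle)]

    grouped = defaultdict(list)
    for row in csv.DictReader(ratings_handle):
        grouped[row["users_id"]].append(row)

    # ISO dates (YYYY-MM-DD) sort correctly as plain strings.
    return {
        user_id: sorted(grouped.get(user_id, []), key=lambda r: r["date"])
        for user_id in user_ids
    }


sequences = ratings_by_user(io.StringIO(USERS), io.StringIO(RATINGS))
for user_id, ratings in sequences.items():
    print(user_id, [r["movie"] for r in ratings])
```

Users without any ratings would still appear in the result with an empty list, which makes it easy to spot orphaned parent records.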