Acquiring Data Using the baseballr Package

Introduction

Over the years, I have found the baseballr package very helpful in obtaining Statcast data from Baseball Savant. Recently, this package has been updated (version 1.5.0) to download baseball data from a variety of sources including some that I have not used. So I thought it would be helpful to illustrate some of the many useful functions in this package. Since the baseballr package has many capabilities beyond just retrieving data, this might be the first of a number of posts describing the features of this package.

The baseballr package allows convenient scraping of data from five sources: Baseball Reference, FanGraphs, MLB, Retrosheet and Statcast.

Getting Started

The home page of baseballr describes the process of installing the package from CRAN or installing the development version of the package from the authors’ GitHub site.
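A minimal sketch of the two installation routes follows. It assumes that the development version lives in the BillPetti/baseballr repository on GitHub and that the remotes package is used to install it; the package home page has the current instructions.

```r
# Install the released version from CRAN, but only if it is missing.
if (!requireNamespace("baseballr", quietly = TRUE)) {
  install.packages("baseballr")
}

# Alternative: the development version from GitHub.
# Assumption: the repository is BillPetti/baseballr; uncomment to use.
# install.packages("remotes")
# remotes::install_github("BillPetti/baseballr")

library(baseballr)
```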

Baseball Reference Data

The package baseballr has a number of functions for downloading various data from the popular Baseball Reference site. For convenience, all of these BR functions are prefaced by “bref”. In particular, the bref_daily_batter() function will download batting stats for all players between two dates in a season. For example, the following code will download all stats for games played between May 10 and June 20 of 2021.

brdata <- bref_daily_batter(t1="2021-05-10", t2="2021-06-20")

A similar function bref_daily_pitcher() will download pitching stats for all pitchers who pitched between two dates.

brdata2 <- bref_daily_pitcher(t1="2021-05-10", t2="2021-06-20")

These functions retrieve the traditional batting and pitching stats for all players.
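Once the data frame is in hand, ordinary data frame operations apply. Here is a rough sketch of ranking hitters by OPS among those with a minimum number of plate appearances. The column names Name, PA and OPS are assumptions about the returned data, and top_ops() is a hypothetical helper, not part of baseballr.

```r
# Rank hitters by OPS among those with at least `min_pa` plate appearances.
# Assumption: the data frame has columns Name, PA and OPS.
top_ops <- function(batting, min_pa = 100, n = 10) {
  qualified <- batting[!is.na(batting$PA) & batting$PA >= min_pa, ]
  ranked <- qualified[order(qualified$OPS, decreasing = TRUE), ]
  head(ranked[, c("Name", "PA", "OPS")], n)
}

# Tiny made-up data frame, only to show the shape of the input and output.
toy_batting <- data.frame(
  Name = c("Player A", "Player B", "Player C"),
  PA   = c(150, 40, 120),
  OPS  = c(0.810, 1.200, 0.905)
)
top_ops(toy_batting, min_pa = 100)

# With real data (requires an internet connection):
# top_ops(brdata, min_pa = 100)
```

Check names(brdata) first and adjust the column names if they differ.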

FanGraphs Data

The package also has a number of functions, prefaced by “fg”, that download various data from the popular FanGraphs site.

Suppose, for example, that we’re interested in obtaining the game-by-game batting stats for Mike Trout for each game of the 2022 season. First we have to figure out the player id used by FanGraphs. The function playerid_lookup() will obtain the matching player ids from the Chadwick Bureau’s public register. I extract the FanGraphs id and then use this id as an input to the function fg_batter_game_logs(), which downloads the game-by-game stats for Trout.

id <- playerid_lookup("Trout", "Mike")
fg_id <- id$fangraphs_id
fg_data_trout <-
  fg_batter_game_logs(playerid = fg_id, year = 2022)

By the way, this returns a remarkable 239 variables collected about Trout for each of the 119 games he played in the 2022 season.
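A name lookup can match more than one person, or return rows with a missing FanGraphs id, so it may be worth checking that the id is unique before passing it on. The helper single_fg_id() below is a hypothetical sketch, not a baseballr function; it only assumes the fangraphs_id column used above.

```r
# Return the single FanGraphs id from a playerid_lookup() result,
# stopping with an informative error if the match is not unique.
# Assumption: the lookup result has a `fangraphs_id` column.
single_fg_id <- function(lookup) {
  ids <- unique(lookup$fangraphs_id[!is.na(lookup$fangraphs_id)])
  if (length(ids) != 1) {
    stop("Expected exactly one FanGraphs id, found ", length(ids))
  }
  ids
}

# Made-up lookup result with one usable row, for illustration only.
toy_lookup <- data.frame(
  first_name   = c("Sam", "Sam"),
  last_name    = c("Example", "Example"),
  fangraphs_id = c(NA, 12345)
)
single_fg_id(toy_lookup)

# With real data (requires an internet connection):
# fg_id <- single_fg_id(playerid_lookup("Trout", "Mike"))
```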

Suppose I’m interested in stats for each of Aaron Nola’s starts during the 2022 season. I use similar functions to retrieve Nola’s FanGraphs id and then the function fg_pitcher_game_logs() will retrieve the game-to-game pitching stats.

id <- playerid_lookup("Nola", "Aaron")
fg_id <- id$fangraphs_id
fg_data_nola <-
  fg_pitcher_game_logs(playerid = fg_id, year = 2022)

We have all of the interesting FanGraphs measures collected for each of Nola’s games. As an example, here is a histogram of the zone percentages for the 32 Nola starts. In most of Nola’s starts, he placed 40-50% of his pitches within the zone.

hist(fg_data_nola$`Zone%`)
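To put a number on what the histogram shows, one could compute the share of starts whose zone rate falls in a given range. This is a sketch with a hypothetical helper share_between() and made-up values. Whether Zone% is stored on a 0-1 or a 0-100 scale is an assumption to check before choosing the bounds.

```r
# Share of values falling in [lo, hi], ignoring missing values.
share_between <- function(x, lo, hi) {
  x <- x[!is.na(x)]
  mean(x >= lo & x <= hi)
}

# Made-up zone rates on a 0-1 scale, for illustration only.
toy_zone <- c(0.38, 0.42, 0.45, 0.47, 0.52)
share_between(toy_zone, 0.40, 0.50)

# With the real data (check the scale of the column first):
# range(fg_data_nola$`Zone%`, na.rm = TRUE)
# share_between(fg_data_nola$`Zone%`, 0.40, 0.50)
```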

Another useful function is fg_batter_leaders(), which will collect stats for all hitters across several seasons. Here I am collecting the FanGraphs batting measures for all “leaders” for the 2021 and 2022 seasons. The output data frame has 262 player-seasons and 289 batting measures.

b_leaders <- fg_batter_leaders(2021, 2022)

MLB Data

One noteworthy feature of the baseballr package is the ability to download data from the MLB feed. A variety of functions are available, all starting with “mlb”. One function that was of interest to me was mlb_pbp(), which will retrieve pitch-by-pitch data for a minor or major league game of interest. To use this, one needs to know the game id, game_pk. The function mlb_game_pks() will give the game_pk values for all MLB games played on a particular day.
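As a sketch of that lookup step, the hypothetical helper find_game_pk() below searches a day's schedule for a matchup. It assumes the schedule data frame has columns game_pk, teams.away.team.name and teams.home.team.name, which should be confirmed with names() on the real output.

```r
# Find the game_pk for a matchup in a schedule data frame.
# Assumption: the schedule has columns game_pk, teams.away.team.name
# and teams.home.team.name; inspect names(games) to confirm.
find_game_pk <- function(games, team_a, team_b) {
  away <- games$teams.away.team.name
  home <- games$teams.home.team.name
  hit <- (grepl(team_a, away, fixed = TRUE) & grepl(team_b, home, fixed = TRUE)) |
         (grepl(team_b, away, fixed = TRUE) & grepl(team_a, home, fixed = TRUE))
  games$game_pk[hit]
}

# Made-up schedule rows, only to show the shape of the input.
toy_games <- data.frame(
  game_pk = c(1001, 1002),
  teams.away.team.name = c("Team A", "Team C"),
  teams.home.team.name = c("Team B", "Team D")
)
find_game_pk(toy_games, "Team B", "Team A")

# With real data (requires an internet connection):
# games <- baseballr::mlb_game_pks("2023-03-05")
# find_game_pk(games, "Mets", "Cardinals")
```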

I figured out that the game_pk value for the recent spring training game on March 5 between the Mets and the Cardinals was 719263, and the mlb_pbp() function will collect pitch-by-pitch data for this game.

d2 <- mlb_pbp(game_pk = 719263)

I am not completely familiar with all of the 148 variables in this data frame, but the startTime and endTime variables give the starting and ending time for each pitch, so it would be straightforward to explore the effect of the new MLB pitch clock rule. Here are two rows of these two variables corresponding to two pitches.

d2[1:2, c("startTime", "endTime")]
── MLB Play-by-Play data from MLB.com ────── baseballr 1.5.0 ──
ℹ Data updated: 2023-03-06 20:04:51 EST
# A tibble: 2 × 2
  startTime                endTime                 
  <chr>                    <chr>                   
1 2023-03-05T20:31:34.676Z 2023-03-05T20:31:39.667Z
2 2023-03-05T20:31:15.772Z 2023-03-05T20:31:19.462Z
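Since both variables are stored as character strings, they need to be parsed before any timing can be computed. Here is a minimal sketch in base R using the two rows printed above; parse_utc() is a hypothetical helper, and the format string assumes every timestamp follows the same UTC pattern with fractional seconds.

```r
# The two rows printed above, copied by hand.
pitches <- data.frame(
  startTime = c("2023-03-05T20:31:34.676Z", "2023-03-05T20:31:15.772Z"),
  endTime   = c("2023-03-05T20:31:39.667Z", "2023-03-05T20:31:19.462Z")
)

# Parse ISO 8601 UTC timestamps with fractional seconds.
parse_utc <- function(x) {
  as.POSIXct(x, format = "%Y-%m-%dT%H:%M:%OSZ", tz = "UTC")
}

# Elapsed seconds between startTime and endTime for each row.
pitches$seconds <- as.numeric(
  difftime(parse_utc(pitches$endTime), parse_utc(pitches$startTime),
           units = "secs")
)
pitches$seconds
```

In these two rows the first pitch listed starts later than the second, so the data appear to be in reverse chronological order. For a pitch clock study, it would make sense to sort by startTime before differencing consecutive pitches.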

Retrosheet Play-by-Play Data

I’ve written about downloading Retrosheet play-by-play data by using the Chadwick files and a special R function. The process is even easier using the function retrosheet_data() in the baseballr package. Suppose I’m interested in obtaining all the Retrosheet data for the seasons 2020 through 2022. Just type

d <- retrosheet_data(years_to_acquire = 2020:2022)

The output d is a list with three elements corresponding to the three seasons of interest. d[[1]] (the first element of the list) is also a list with two components: events and rosters. The events component is our familiar Retrosheet data with an extra variable year indicating the season.

It is important to note that this function assumes that the user has already installed the Chadwick tools. I have a Mac Intel laptop and the Chadwick tools were installed, so this function worked.
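Given the list structure described above, the seasons can be stacked into a single data frame. This is a sketch that assumes every events component has the same columns. combine_events() is a hypothetical helper, and the toy list only mimics the shape of d.

```r
# Stack the `events` component of every season into one data frame.
# Assumption: each list element has an `events` data frame with the
# same columns.
combine_events <- function(seasons) {
  do.call(rbind, lapply(seasons, function(s) s$events))
}

# Made-up stand-in for the structure of `d`, for illustration only.
toy_seasons <- list(
  list(events = data.frame(game_id = "G1", year = 2020), rosters = NULL),
  list(events = data.frame(game_id = c("G2", "G3"), year = 2021), rosters = NULL)
)
all_events <- combine_events(toy_seasons)
table(all_events$year)

# With the real data:
# all_events <- combine_events(d)
```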

Statcast Data

As in earlier versions of baseballr, the package allows easy access to Statcast data, but the function syntax has changed. Suppose you wish to acquire the data for the spring training games from March 5 to March 6 this season. Type

sc_data <- statcast_search(start_date = "2023-03-05",
                           end_date = "2023-03-06",
                           player_type = 'batter')

Also, the functions statcast_search_batters() and statcast_search_pitchers() allow access to data for an individual batter or pitcher, respectively. The function statcast_leaderboards() gives access to leaderboards published on Baseball Savant.
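For longer stretches of a season, it may be safer to request the data in smaller date windows and combine the pieces. That is an assumption on my part rather than a documented requirement, so treat it as a sketch. The hypothetical helper date_windows() builds the windows in base R; the download loop is left commented out since it needs an internet connection.

```r
# Split a date range into consecutive windows of at most `days` days.
date_windows <- function(start, end, days = 7) {
  starts <- seq(as.Date(start), as.Date(end), by = days)
  ends <- pmin(starts + days - 1, as.Date(end))
  data.frame(start = starts, end = ends)
}

windows <- date_windows("2023-03-05", "2023-03-20", days = 7)
windows

# With real data, one request per window:
# pieces <- lapply(seq_len(nrow(windows)), function(i) {
#   baseballr::statcast_search(
#     start_date = as.character(windows$start[i]),
#     end_date = as.character(windows$end[i]),
#     player_type = "batter"
#   )
# })
# sc_all <- do.call(rbind, pieces)
```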

Try it Out!

  • The baseballr package has been of interest to me for a number of years, and the recent additions to the package look very exciting. The easy availability of baseball data from different sources makes it convenient for the reader to address many baseball questions of interest. For example, now that I know that I can collect the time at the beginning and end of each pitch, I am interested in doing a study exploring the changes in lengths of baseball games due to the pitch clock rule.
  • This package contains much more than these data acquisition functions, and I likely will discuss other uses of the package functions in future posts.
  • On my GitHub Gist site, you can find an R script including all of the examples in this post.
  • For more information, I encourage the interested reader to visit the baseballr package home page.

2 responses

  1. I agree Jim! The baseballr package is very good right now.

    I’ve used it in the past trying to create functions to visualize the stats that we can scrape with it.
    Nevertheless, I confess I need to improve my R skills in order to polish my pull requests I’ve submitted to the package.
    I’ve created some isolated functions to plot some baseball stats (you can see some examples at http://www.twitter.com/gmbeisbol).

    In the case you find my vizes interesting and if you think a supporting hand could help you to include this kind of vizes in this blog, just drop me a message.
    That would be great for me to improve my R skills and contribute to this important blog for baseball fans and data scientists in general.

    Regards.

    Daniel.

    1. Daniel: Sure, I’d be happy to have you contribute to the blog. The idea of showing a graphic to illustrate or answer some baseball question together with advice how to do the construction would be great. You can send me materials at albert@bgsu.edu. Jim
