Predicting Baseball Pitching Outcome
EECS 349 Machine Learning, Northwestern University
1 Introduction
2 Data Preparation
3 Training and Feature Selection
4 Evaluation
5 Analysis, Discussion and Future Work
6 Work Allocation
7 Feature List
8 Decision Tree Model for small

Daniel Feltey,
Spencer Florence,
and Shu-Hung You

A PDF version without the appendices is also available.

1 Introduction

In baseball, when a batter does not swing at a pitch, the home plate umpire must determine whether the pitch is a ball or a called strike. Although a fairly well-defined strike zone is intended to differentiate balls from strikes, umpires are only human and must make the determination unaided. Whether umpires judge pitches based on the strike zone alone, unaffected by other factors, is an open question.

This work analyzes pitch data with decision tree, random forest and logistic regression models to study which factors affect umpires' decisions on pitch results. The trained models achieve roughly 88%-91% accuracy and precision. Unsurprisingly, the home plate entrance zone, or equivalently the (x,z)-coordinate of the point where the ball crosses the plate, has the largest impact, while other features such as pitch type, release position and release velocity have a smaller impact.

From the perspective of a baseball team, this project gives insight into the factors that affect an umpire's decision to call a ball or a strike. Furthermore, we hope this work characterizes how borderline pitches are usually decided.



2 Data Preparation

(Data Source)  We acquired pitch and umpire data for the years 2014-2017 from Retrosheet (http://www.retrosheet.org/) and Baseball Savant (https://baseballsavant.mlb.com/), which provides an interface to MLB Statcast (http://m.mlb.com/glossary/statcast). The Retrosheet data contain the home plate umpire for each game, while the Baseball Savant data include detailed pitch data for each game in 2014-2017.

(Preprocessing)  We joined the Retrosheet and Baseball Savant data by game ID to extend each pitch record with the home plate umpire ID. We also removed the records of consecutive games, since such games have different home plate umpires while sharing the same game ID. This resulted in a dataset with roughly 89 attributes and 842644 instances.
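The join can be sketched in plain Python as follows. This is a minimal illustration, not the script used for the project: the field names game_id and umpire_id and the toy records are hypothetical, and a game ID is treated as ambiguous whenever it maps to more than one game record.

```python
from collections import Counter

def attach_umpires(pitches, games):
    """Add the home plate umpire to each pitch, joining on game ID.

    Game IDs that map to more than one game record are ambiguous,
    so pitches from those games are dropped.
    """
    counts = Counter(g["game_id"] for g in games)
    umpire_by_game = {
        g["game_id"]: g["umpire_id"] for g in games if counts[g["game_id"]] == 1
    }
    return [
        {**p, "umpire_id": umpire_by_game[p["game_id"]]}
        for p in pitches
        if p["game_id"] in umpire_by_game
    ]

# Hypothetical toy records; the real rows have roughly 89 attributes.
games = [
    {"game_id": "G1", "umpire_id": "U10"},
    {"game_id": "G2", "umpire_id": "U20"},  # G2 appears twice: ambiguous
    {"game_id": "G2", "umpire_id": "U30"},
]
pitches = [
    {"game_id": "G1", "plate_x": 0.1, "plate_z": 2.5},
    {"game_id": "G2", "plate_x": -0.9, "plate_z": 1.4},
]
joined = attach_umpires(pitches, games)
```

Only the pitch from G1 survives here, now carrying its umpire ID; the pitch from G2 is dropped because its umpire cannot be determined.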

(Splitting Data)  We held out the pitch data from 2017 as Test Data I and further sampled 10% of the pitch data from 2014-2016 as Test Data II. The remaining 90% of the 2014-2016 data were used for training.
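A minimal sketch of such a split, assuming each pitch record carries a year field (a hypothetical name) and that the 10% sample is drawn uniformly at random; the seed is arbitrary and only makes the example repeatable.

```python
import random

def split_pitches(pitches, test_fraction=0.1, seed=0):
    """Return (train, test_1, test_2).

    test_1 holds every 2017 pitch; test_2 is a random sample of the
    remaining pitches; train is whatever is left.
    """
    held_out = [p for p in pitches if p["year"] == 2017]
    rest = [p for p in pitches if p["year"] != 2017]
    rng = random.Random(seed)
    rng.shuffle(rest)
    n_test = round(len(rest) * test_fraction)
    return rest[n_test:], held_out, rest[:n_test]

# Hypothetical toy data: 100 pitches for each year from 2014 to 2017.
pitches = [{"year": 2014 + i % 4, "pitch_no": i} for i in range(400)]
train, test_1, test_2 = split_pitches(pitches)
```

Holding out a whole season tests how well a model carries over to a later year, while the random 10% sample tests it on pitches from the same seasons as the training data.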

3 Training and Feature Selection

To study the impact of different factors in the pitch data, we selected 3 sets of features and, for each feature set, trained one decision tree model, one random forest model and one logistic regression model using the Orange framework (https://orange.biolab.si/), as shown in Figure 2. The hyperparameters were tuned using 10-fold cross-validation on the training data.

Due to limited time and computational resources, we were not able to explore the importance of each factor separately. Hence we only studied the three feature sets summarized in Figure 1. The list of all features is given in Section 7, Feature List.
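The mechanics of 10-fold cross-validation can be sketched without any framework. The code below is a stand-alone illustration and does not reproduce the Orange workflow: the folds are contiguous rather than shuffled, and width_model is a toy stand-in for a real learner whose single hypothetical hyperparameter is half_width.

```python
def k_fold_indices(n, k=10):
    """Yield (train_idx, val_idx) pairs for k contiguous folds over range(n)."""
    fold_sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        val = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n))
        yield train, val
        start += size

def cv_accuracy(make_model, xs, ys, k=10):
    """Mean validation accuracy of make_model(train_xs, train_ys) over k folds."""
    scores = []
    for train, val in k_fold_indices(len(xs), k):
        predict = make_model([xs[i] for i in train], [ys[i] for i in train])
        correct = sum(predict(xs[i]) == ys[i] for i in val)
        scores.append(correct / len(val))
    return sum(scores) / len(scores)

def width_model(half_width):
    """Toy learner: ignore the training data, call a strike when |x| <= half_width."""
    return lambda train_xs, train_ys: (lambda x: abs(x) <= half_width)

# Hypothetical one-dimensional data: horizontal position and a clean label.
xs = [(i - 20) / 10 for i in range(41)]
ys = [abs(x) <= 0.8 for x in xs]

# Pick the candidate value with the best mean validation accuracy.
candidates = [0.5, 0.8, 1.2]
best = max(candidates, key=lambda w: cv_accuracy(width_model(w), xs, ys))
```

Each candidate value is scored only on folds that were held out from fitting, so the chosen value is not selected on the test data.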

 

Feature                                   Type         tiny    small   medium
Pitch Type                                Categorical  V       V       V
IDs                                       Categorical  Umpire  Umpire  Umpire, Pitcher and Batter
Ball-Plate Intersection Coordinate        Numerical    V       V       V
Ball-Plate Intersection Zone              Categorical  -       V       V
Strike Zone Range                         Numerical    V       V       V
Effective Speed                           Numerical    V       V       V
Release Position, Velocity and Spin Rate  Numerical    -       V       V
Pitcher and Batter Handedness             Categorical  -       V       V
Initial Acceleration                      Numerical    -       -       V

(V: included in the feature set; -: not included)

Figure 1: Selected Feature Sets

Figure 2: Training setup

4 Evaluation

Figure 3: (a) The learning curve for model accuracy and (b) the learning curve for model precision. The models were trained using the tiny, small and medium feature sets from left to right and were tested on Test Data II.

 

CA            tiny                               small                              medium
              zeror   logistic  tree    forest   zeror   logistic  tree    forest   zeror   logistic  tree    forest
Test Data I   0.669   0.669     0.91    0.906    0.669   0.885     0.911   0.908    0.669   0.893     0.895   0.906
Test Data II  0.661   0.661     0.908   0.911    0.661   0.874     0.908   0.911    0.661   0.879     0.893   0.903

Precision     tiny                               small                              medium
              zeror   logistic  tree    forest   zeror   logistic  tree    forest   zeror   logistic  tree    forest
Test Data I   0.448   0.58      0.911   0.906    0.448   0.884     0.911   0.907    0.448   0.892     0.894   0.905
Test Data II  0.436   0.436     0.908   0.911    0.437   0.873     0.908   0.911    0.437   0.878     0.892   0.902

Figure 4: Prediction accuracy and precision on Test Data I and Test Data II.

For each of the tiny, small and medium feature sets and each kind of model, Figure 3 depicts the learning curves and Figure 4 presents the accuracy and precision results. The precision is calculated as the weighted average of the per-class precision over both outcomes.
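A minimal sketch of one common definition of weighted-average precision, assuming the weights are each class's share of the true labels (the text does not state the weights) and that a class that is never predicted counts as precision 0. The class names majority and minority are placeholders.

```python
def weighted_precision(y_true, y_pred):
    """Per-class precision, averaged with weights equal to each class's
    share of y_true. A class that is never predicted contributes 0."""
    n = len(y_true)
    total = 0.0
    for cls in set(y_true):
        predicted = sum(1 for p in y_pred if p == cls)
        correct = sum(1 for t, p in zip(y_true, y_pred) if t == p == cls)
        precision = correct / predicted if predicted else 0.0
        support = sum(1 for t in y_true if t == cls)
        total += precision * support / n
    return total

# A zeror-style baseline always predicts the majority class.
# Hypothetical test set in which the majority class makes up 66.9%.
y_true = ["majority"] * 669 + ["minority"] * 331
y_pred = ["majority"] * 1000
baseline = weighted_precision(y_true, y_pred)  # 0.669 * 0.669 + 0 * 0.331
```

Under these assumptions the baseline comes out at about 0.4476, which agrees with the 0.448 reported for zeror on Test Data I, where its accuracy is 0.669.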

From the evaluation results, we see that the learning curves for both accuracy and precision are mostly flat, except for the logistic model on the tiny feature set. This suggests that most models predict the outcome fairly well and do not overfit the training data.

Overall, the models trained with the small feature set perform the best. Including batter ID and pitcher ID in the medium feature set decreases performance. Excluding the zone attribute to form the tiny feature set reduced the logistic regression model to the accuracy of the zeror baseline, indicating that the logistic model is unable to determine the outcome using plate_x and plate_z alone.
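One plausible reading of this failure, offered as an assumption rather than something the experiments establish: a logistic model is linear in its inputs, so over the raw coordinates its decision boundary is a single straight line, whereas the strike zone is a bounded rectangle. Writing x for plate_x and z for plate_z, and treating the weights and the zone bounds below as illustrative symbols (other features are ignored for simplicity):

```latex
\begin{align*}
P(\text{strike} \mid x, z) &= \sigma(w_x x + w_z z + b),
  \qquad \sigma(t) = \frac{1}{1 + e^{-t}} \\
\text{decision boundary:} \quad & w_x x + w_z z + b = 0 \\
\text{strike zone:} \quad & x_{\min} \le x \le x_{\max}, \quad z_{\min} \le z \le z_{\max}
\end{align*}
```

A single line splits the plane into two half-planes and cannot enclose a rectangle. Once zone is available as a categorical feature, the model can assign a separate weight to each zone, which sidesteps this limitation.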

5 Analysis, Discussion and Future Work

In this section, we investigate two models trained with the small feature set and study the importance of each feature. In the following models, the categorical attribute zone (as defined in Baseball Savant Statcast Search: https://baseballsavant.mlb.com/statcast_search) identifies, by number, the cell of the strike zone grid or the surrounding area in which the pitch crosses the plate.

Figure 6 shows the nomogram of the logistic regression model. From the nomogram, we can see that zone has the largest impact on the resulting probability. The next most impactful features are plate_z and plate_x, from which zone is essentially derived. After that, pitch_type and release_speed have some impact on the resulting probability.

Figure 5 shows the first 3 layers of the decision tree model. In (a), the decision tree first splits on zone. Then, in (b), (c) and (d), the model further splits on plate_z or plate_x to decide the borderline cases. In (b) and (c), since regions 1, 2 and regions 8, 9 lie at the top and the bottom of the strike zone respectively, the decision tree tests plate_z to determine whether the z coordinate falls outside the boundary. In (d), the decision tree splits on both plate_z and plate_x, since regions 11 and 12 each contain a corner.

To sum up, our models show that the features zone, plate_z and plate_x largely characterize the pitch results. The logistic model further suggests that pitch_type and release_speed also affect the outcome.

Unfortunately, we did not have enough time to experiment with more feature sets and obtain a finer understanding of how each feature affects the call. Moreover, we were unable to prune the decision tree model enough to understand what role the remaining features play in its decisions. We would like to close this gap in the future.

Figure 5: Decision tree model, panels (a) to (d).

Figure 6: Nomogram for the logistic regression model.

6 Work Allocation

For the work allocation, please see the end of the PDF.

7 Feature List

Here are the full lists of features we used in our models. The feature names follow the convention of MLB Statcast. A detailed explanation can be found at https://fastballs.wordpress.com/2007/08/02/glossary-of-the-gameday-pitch-fields/. The D# prefix in some feature names indicates that the corresponding feature is categorical. In all feature sets, description is our target variable.

8 Decision Tree Model for small

The decision tree trained on the small feature set, expanded up to depth 4, is:

and more.

The decision tree visualized up to depth 5 is also available here. A more comprehensible decision tree trained without using the Umpire ID attribute is:

and more.