Project Overview: Movie Revenue Prediction

17 July 2025

During my Data Analytics master’s program, I had the opportunity to study and practice a variety of statistical techniques. While I plan to eventually complete a project for each of them, I decided to start with a straightforward one: linear regression. For this project, I explored whether I could predict a movie’s box-office revenue using data from the IMDb database.

Challenge #1: Cleaning the Data

The first challenge I faced was ensuring the data was clean and usable. There were several inconsistencies in the data that I had to address:

  1. Inconsistent Currency. Because the dataset included movies from many countries, revenue was not reported in a single currency. To address this issue, I limited my analysis to films released in the United States.
  2. Missing Score Information. Some movies were missing score data. Because there were relatively few movies with missing score information, I chose to exclude them from the analysis.
  3. Missing Revenue Information. For movies with missing revenue values, I searched online to fill the gaps. If I couldn’t find the missing data, I removed the movie from the analysis.
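
The three cleaning steps above can be sketched in pandas roughly as follows. This is a minimal illustration, not the notebook's actual code: the column names (title, country, score, revenue), the toy rows, and the manual_revenue lookup are all assumptions made for the example.

```python
import pandas as pd

# Toy stand-in for the raw data; column names and values are hypothetical.
raw = pd.DataFrame({
    "title":   ["A", "B", "C", "D", "E"],
    "country": ["US", "US", "FR", "US", "US"],
    "score":   [71.0, None, 64.0, 80.0, 55.0],
    "revenue": [1.2e8, 3.0e7, 5.0e6, None, None],
})

# Revenue figures found by searching online (hypothetical values).
manual_revenue = {"D": 4.5e7}

# 1. Keep only US releases so all revenue is in one currency.
us = raw[raw["country"] == "US"].copy()

# 2. Drop the few movies with no score.
us = us.dropna(subset=["score"])

# 3. Fill missing revenue from the manual lookup, then drop what is still missing.
us["revenue"] = us["revenue"].fillna(us["title"].map(manual_revenue))
clean = us.dropna(subset=["revenue"]).reset_index(drop=True)

print(clean)
```

Calling clean.describe() on the result would then give summary statistics for the numeric columns.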

After cleaning the data, I reviewed the summary statistics for the numeric columns. The results are shown below.

Challenge #2: Linear Regression Assumptions

Linear regression relies on several key assumptions. To ensure my model could accurately predict future revenue, I tested the data against these assumptions. One key assumption is the absence of multicollinearity among the independent variables, which I checked using the variance inflation factor (VIF).
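
For reference, the VIF of a variable is 1 / (1 - R²), where R² comes from regressing that variable on all the other independent variables. The sketch below computes it directly with NumPy on synthetic data; the vif helper and the random columns are assumptions for illustration, and a library routine such as the one in statsmodels could be used instead.

```python
import numpy as np

def vif(X):
    """Return the variance inflation factor of each column of X (n_samples x n_features)."""
    X = np.asarray(X, dtype=float)
    n, k = X.shape
    factors = []
    for j in range(k):
        y = X[:, j]
        # Regress column j on the remaining columns plus an intercept.
        others = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
        coef, *_ = np.linalg.lstsq(others, y, rcond=None)
        resid = y - others @ coef
        r2 = 1.0 - (resid @ resid) / ((y - y.mean()) ** 2).sum()
        factors.append(1.0 / (1.0 - r2))
    return np.array(factors)

# Synthetic example: b is almost a copy of a, c is independent of both.
rng = np.random.default_rng(0)
a = rng.normal(size=200)
b = a + rng.normal(scale=0.1, size=200)
c = rng.normal(size=200)

vifs = vif(np.column_stack([a, b, c]))
print(vifs)  # a and b should be far above the >5 threshold, c close to 1
```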

Two variables had a high VIF score (>5): Score and Year. Their correlation is illustrated in the regression plot below.

To mitigate this issue, I removed the “Score” variable from the analysis.

Challenge #3: Choosing the Explanatory Variables

Next, I needed to select which explanatory variables to include in the initial analysis. To choose the variables, I calculated the correlation coefficient between each explanatory variable and the revenue. Since the dataset included both numerical and dichotomous variables, I used two different methods:

  1. Pearson Correlation – Used for numeric variables such as budget.
  2. Point-Biserial Correlation – Used for dichotomous variables such as the binary genre columns.

Any variable whose correlation with revenue had a p-value below 0.1 was included in my initial model (see the chart below).
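
The screening rule above can be sketched with SciPy as follows. The data is synthetic and the variable names (budget, is_action, noise_flag) are assumptions for illustration only. Note that the point-biserial coefficient is numerically the same as Pearson's r computed on a 0/1 variable, so the two methods are consistent with each other.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
n = 200

# Synthetic predictors (hypothetical): one numeric, two dichotomous.
budget = rng.normal(loc=50.0, scale=15.0, size=n)
is_action = rng.integers(0, 2, size=n)
noise_flag = rng.integers(0, 2, size=n)  # unrelated to revenue by construction
revenue = 2.0 * budget + 40.0 * is_action + rng.normal(scale=10.0, size=n)

# Pearson for numeric variables, point-biserial for 0/1 variables.
results = {
    "budget": stats.pearsonr(budget, revenue),
    "is_action": stats.pointbiserialr(is_action, revenue),
    "noise_flag": stats.pointbiserialr(noise_flag, revenue),
}

# Keep every variable whose correlation with revenue has p < 0.1.
selected = [name for name, (r, p) in results.items() if p < 0.1]

for name, (r, p) in results.items():
    print(f"{name}: r={r:.3f}, p={p:.4f}")
print("selected:", selected)
```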

Fitting the model with these initial variables produced the results shown below.

Challenge #4: Reducing the Model

The initial model included six variables. I wanted to see if I could simplify it without significantly reducing its accuracy. Using an iterative approach, I evaluated how the adjusted R-squared changed as variables were added. As shown in the plot below, model performance plateaued after the first two variables, suggesting that Budget and Year were sufficient for maintaining predictive accuracy.
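
One way to sketch this kind of check: fit nested models that add one variable at a time and record the adjusted R-squared, 1 - (1 - R²)(n - 1)/(n - p - 1), where n is the number of observations and p the number of predictors. The example below uses NumPy on synthetic data. The adjusted_r2 helper, the column order, and the data are assumptions for illustration, not the procedure from the notebook.

```python
import numpy as np

def adjusted_r2(X, y):
    """Fit ordinary least squares with an intercept and return the adjusted R-squared."""
    n, p = X.shape
    A = np.column_stack([np.ones(n), X])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ coef
    r2 = 1.0 - (resid @ resid) / ((y - y.mean()) ** 2).sum()
    return 1.0 - (1.0 - r2) * (n - 1) / (n - p - 1)

# Synthetic data: only the first two columns actually drive y.
rng = np.random.default_rng(1)
n = 300
X = rng.normal(size=(n, 4))
y = 3.0 * X[:, 0] + 1.5 * X[:, 1] + rng.normal(scale=1.0, size=n)

# Add variables one at a time (in column order) and track adjusted R-squared.
scores = [adjusted_r2(X[:, :k], y) for k in range(1, X.shape[1] + 1)]
for k, s in enumerate(scores, start=1):
    print(f"{k} variable(s): adjusted R^2 = {s:.3f}")
```

In this toy setup the score jumps when the second variable is added and then stays essentially flat, which is the plateau pattern described above.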

Using these two variables, I created a final regression model with the results shown below.

Analysis Results

The final R-squared value of 0.797 indicates that the model explained 79.7% of the variability in revenue. Revenue generally increases over time, and higher budgets tend to generate more revenue, but the remaining 20.3% of the variability points to other important factors influencing box-office performance that were not captured in this analysis.

Learnings

Real-world data cleaning is tough. I faced multiple challenges, particularly with inconsistent currency formats and differing representations of monetary values. Some entries listed full revenue figures, while others abbreviated them in millions or billions, even within the same dataset.
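
As an illustration of the mixed-representation problem, a small helper can normalize values such as "$1.5M", "2B", and "1,234,567" to plain numbers. This is a hypothetical sketch: the parse_money function and the formats it accepts are assumptions, and real data may need more cases (for example, spelled-out "million").

```python
import re

# Multipliers for the abbreviations this sketch understands.
_SUFFIX = {"": 1.0, "K": 1e3, "M": 1e6, "B": 1e9}

def parse_money(text):
    """Convert strings like '$1.5M', '2B', or '1,234,567' to a float."""
    cleaned = text.strip().upper().replace(",", "").lstrip("$").strip()
    match = re.fullmatch(r"(\d+(?:\.\d+)?)\s*([KMB]?)", cleaned)
    if match is None:
        raise ValueError(f"unrecognized monetary value: {text!r}")
    number, suffix = match.groups()
    return float(number) * _SUFFIX[suffix]

for raw_value in ["$1.5M", "2B", "1,234,567"]:
    print(raw_value, "->", parse_money(raw_value))
```

Raising an error on anything unrecognized, rather than guessing, makes it easier to spot formats the parser has not been taught yet.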

Relevant Links

Kaggle Dataset: https://www.kaggle.com/datasets/ashpalsingh1525/imdb-movies-dataset

Kaggle Notebook: https://www.kaggle.com/code/chelsieminor/movie-revenue-prediction