About the Project (ID 13)

Data Journey

We chose a dataset covering over 1,500 movies spanning almost 100 years, with multiple pieces of information on each movie (lead actor/actress, director, genre, etc.). We found this dataset valuable for two reasons. Its size gives us a large sample with which to conduct our research, and its many variables allow us to analyze the popularity of a movie based on many different factors. We believe this will lead to a more detailed and specific conclusion about which characteristics actually lead to more popular movies.

Our data was originally formatted as a CSV file, which we first loaded into one large, baseline DataFrame. We then manipulated the data in this Movie Data DataFrame to derive additional variables from the original dataset, such as a given actor’s average popularity or the number of movies in the dataset directed by a given director. We put the derived data first into dictionaries, then into lists, and finally into new DataFrames, giving us multiple DataFrames and an expanded set of variables from which to create our visualizations.
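
The pipeline described above can be sketched roughly as follows. This is a minimal illustration, not our exact code: the inline CSV text stands in for the real file, and the column names (Title, Year, Actor, Director, Popularity) and variable names are assumptions made for the example.

```python
import io
from collections import defaultdict

import pandas as pd

# Hypothetical stand-in for the CSV file; rows and column names are made up.
csv_text = """Title,Year,Actor,Director,Popularity
Movie A,1990,Actor X,Director P,80
Movie B,1992,Actor X,Director Q,60
Movie C,1995,Actor Y,Director P,40
"""

# Step 1: load the CSV into one baseline DataFrame.
movies = pd.read_csv(io.StringIO(csv_text))

# Step 2: collect derived values in dictionaries
# (popularity scores per actor, movie counts per director).
actor_scores = defaultdict(list)
director_counts = defaultdict(int)
for row in movies.itertuples(index=False):
    actor_scores[row.Actor].append(row.Popularity)
    director_counts[row.Director] += 1

# Step 3: flatten the dictionaries into lists of records.
actor_rows = [
    {"Actor": actor, "AvgPopularity": sum(scores) / len(scores)}
    for actor, scores in actor_scores.items()
]
director_rows = [
    {"Director": director, "MovieCount": count}
    for director, count in director_counts.items()
]

# Step 4: build new DataFrames from the lists for use in visualizations.
actor_df = pd.DataFrame(actor_rows)
director_df = pd.DataFrame(director_rows)
```

The same per-actor average could also be computed more directly with `movies.groupby("Actor")["Popularity"].mean()`; the dictionary-and-list route is shown here only because it mirrors the steps described in the text.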

Caveats

While we believe our dataset provides a lot of valuable information, it also has some limitations. For example, it is unclear how the popularity of a given movie is determined in the dataset. It appears to be measured on a scale from 0 to 100 (likely a percentage), but the method of data collection is unknown, so we can only guess where it came from. Additionally, the year in which the dataset was collected is unknown and, as the most recent movie data is from 1997 (26 years ago), it is likely that this popularity data is somewhat outdated, with preferences perhaps having shifted in the last couple of decades. However, because of the large size of the dataset, we are confident that our insights and conclusions will generally hold up today.

Data Sources

Click here to view our original data source.