A while ago I made the decision to log every film I watched so that I could one day analyse this data in some way. With a year’s worth of films I thought it could be interesting to also enrich this with more information about the films using all of the data available from IMDB. Was it worth it? Let’s find out…

Choosing What to Analyse

It’s always good to start any data analysis task with some initial questions you want to answer. Once you start working with the data and begin to answer these questions you’ll naturally ask more questions, leading to more answers and more insights, hopefully until you reach a point of satisfaction. My list of questions ranges from the basic to the complex:

  1. How much time have I spent watching films?
  2. Which day of the week do I prefer to watch films?
  3. What genre of film do I prefer to watch?
  4. Which streaming provider do I watch most of my films on?
  5. What is the average IMDB rating of the films I watch?

I also wanted to rank all of the films I watch in a year, so that I could then answer:

  1. What were my top ten favourite films of the year?
  2. Do my opinions on films align to IMDB ratings?

With each film I also logged the date and the platform I watched it on, which alone could answer some of those questions, but I knew the IMDB API was going to be a big part of this analysis, which I go into in more detail below.

How to Rank Every Film

The first new data point I wanted to add was a score of how much I personally enjoyed each film, which was going to be a challenge as I chose not to rate the films as part of the journal. I could retrospectively add a score to each film now, but scoring a film I might have half forgotten didn’t seem fair. What I thought was fairer was choosing between a much shorter list of films and ranking those, like some kind of tournament perhaps, which is the underlying concept of a sorting algorithm!

I’m no stranger to sorting algorithms having studied Computer Science at university, but I did have to familiarise myself with many of them again in order to know which one was most suitable for sorting unlabelled records. Choosing the right algorithm isn’t simply a matter of which is most efficient in terms of Big O time complexity, but also of how easily it can be carried out by a human (me specifically).
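
To put rough numbers on that trade-off, here is a minimal sketch that counts textbook worst-case comparisons for two of the algorithms discussed in this post, bubble sort and binary insertion sort. The function names are my own for illustration, and the counts are standard upper bounds rather than measurements of the code in this post, so treat them as a ballpark only.

```python
def bubble_sort_comparisons(n: int) -> int:
    """Worst-case comparisons for bubble sort on n items: n(n-1)/2."""
    return n * (n - 1) // 2


def binary_insertion_comparisons(n: int) -> int:
    """Textbook worst-case comparisons for binary insertion sort on n items.

    Inserting into an already sorted list of k items needs at most
    ceil(log2(k + 1)) comparisons, which equals k.bit_length() for k >= 1.
    """
    return sum(k.bit_length() for k in range(1, n))


films = 209  # size of the film journal mentioned in the post
print("bubble sort:", bubble_sort_comparisons(films))
print("binary insertion sort:", binary_insertion_comparisons(films))
```

Since every comparison is a question a human has to answer, the gap between these two bounds matters far more here than it would for a computer.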

Bubble Sort

This is the first sorting algorithm we were taught at uni, and were promptly told how slow and inefficient it is, so I’ll admit it was an odd choice to start with for this task. Nevertheless I wanted to see just how long it took to sort even a short list (I forgot to mention but I have 209 films to rank in total).

Once I started coding I soon realised I needed a fast way to input my selections, so I decided to use the left and right arrow keys to pick which film was better. This “comparison” step is going to be the slowest part of the process, so the number of comparisons an algorithm needs will also help me decide which sorting algorithm to use. Here’s how I implemented the bubble sort algorithm:

from pynput import keyboard

starting_list = [
    'Indiana Jones The Last Crusade',
    'Prisoners',
    'Semi Pro',
    'Indiana Jones Temple of Doom',
    'L.A. Confidential',
    'Kick-Ass',
    'Bad Boys',
    'Princess Mononoke',
    'Once Upon a Time in Hollywood',
    'You Don\'t Mess With The Zohan',
    'Uncut Gems',
    'Synecdoche New York',
]

latest_key = None
question_number = 1

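def get_arrow_input():
    def on_release(key):
        global latest_key
        # Compare against the Key enum itself: character keys arrive as
        # KeyCode objects, which have no .left or .right attribute
        if key in (keyboard.Key.left, keyboard.Key.right):
            latest_key = f"{key}"
            # Stop listener
            return False

    # Block until an arrow key is released
    with keyboard.Listener(on_release=on_release) as listener:
        listener.join()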

def bubble_sort(array):
    question_number = 1
        
    # loop through each element of array
    for i in range(len(array)):
                
        # keep track of swapping
        swapped = False
        
        # loop to compare array elements
        for j in range(0, len(array) - i - 1):

            print(f"Q{question_number}: [{array[j]}] or [{array[j + 1]}]?")
            get_arrow_input()
            choice = "left" if latest_key == "Key.left" else "right"

            # "right" means the second film was preferred, so swap the
            # pair to move the better film towards the front of the list
            if choice == "right":
                array[j], array[j + 1] = array[j + 1], array[j]
                swapped = True

            question_number += 1
                    
        # no swapping means the array is already sorted
        # so no need for further comparison
        if not swapped:
            break

    return array

sorted_list = bubble_sort(starting_list)
print(sorted_list)

It should come as no surprise that this took quite a long time, and it was only ten films. Choosing between two films with arrow keys worked very well though, so at least that part was staying.

Quick Sort

The next natural choice would be the quick sort, which as its name should tell you is rather quick. In my code below I’ve reused the same keyboard input logic as above:

def partition(arr, low, high):
    i = (low-1)
    pivot = arr[high]

    global question_number
 
    for j in range(low, high):
 
        # Ask whether the current film or the pivot is better
        print(f"Q{question_number}: [{arr[j]}] or [{pivot}]?")
        question_number += 1
        get_arrow_input()

        # Left arrow means the current film beats the pivot,
        # so move it into the "better than pivot" section
        if latest_key == "Key.left":
            i = i+1
            arr[i], arr[j] = arr[j], arr[i]
 
    arr[i+1], arr[high] = arr[high], arr[i+1]
    return (i+1)
 
# The main function that implements quick sort
# arr[] --> Array to be sorted
# low  --> Starting index
# high  --> Ending index
def quick_sort(arr, low, high):
    if len(arr) == 1:
        return arr
    if low < high:
 
        # pi is partitioning index, arr[pi] is now
        # at right place
        pi = partition(arr, low, high)
 
        # Separately sort elements before
        # partition and after partition
        quick_sort(arr, low, pi-1)
        quick_sort(arr, pi+1, high)

quick_sort(starting_list, 0, len(starting_list)-1)
print(starting_list)

While this was quicker than the bubble sort and allowed me to test a larger list of films, I soon realised there was a problem with carrying out this sorting algorithm myself. With this algorithm you compare a lot of different films against a single “pivot” film, one at a time, which became very dull after a while. There was also an issue where an error happened and I lost all of the sorting progress I had made. As far as my limited knowledge goes, picking up where I left off isn’t something this algorithm does well.

What I needed to have was a more engaging way to compare different films to keep things interesting, and more importantly find a way to save my progress in case of errors or needed breaks due to boredom!
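
One way to make progress survive errors and breaks is to write the ranked list to disk after every decision. The sketch below is not the code used for this project: load_progress, save_progress and the file name are hypothetical, and it simply assumes the ranking is a list of film titles that can be stored as JSON.

```python
import json
from pathlib import Path


def load_progress(path):
    """Return the previously saved ranked list, or an empty list if none exists."""
    file = Path(path)
    if not file.exists():
        return []
    with file.open(encoding="utf-8") as handle:
        return json.load(handle)


def save_progress(path, ranked_films):
    """Write the ranked list to disk so a crash or a break loses nothing."""
    with Path(path).open("w", encoding="utf-8") as handle:
        json.dump(ranked_films, handle, indent=4)
```

Calling save_progress each time a film has been placed means the next session can start from load_progress instead of from an empty list.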

Binary Insertion Sort

While this isn’t the most common sorting algorithm, it is capable of doing exactly what I wanted. It builds up a sorted list of films over time, allowing me to save this list and add to it by going through the unsorted films in chunks. It was also more engaging: each new film is first compared with the film currently ranked in the middle, and after a few choices you’re comparing it with films of similar quality.

def binary_search(arr, val, start, end):
    # arr is ranked best-first. When only one candidate position is left,
    # one final comparison decides whether the new film goes
    # before or after the film at that position.

    global question_number

    if start == end:

        print(f"Q{question_number}: [{arr[start]}] or [{val}]?")
        question_number += 1
        get_arrow_input()

        if latest_key == "Key.right":
            return start
        else:
            return start+1
  
    # this occurs if the search range is empty, meaning
    # start is already the position where the new film
    # should be inserted
    if start > end:
        return start
  
    mid = (start + end) // 2

    print(f"Q{question_number}: [{arr[mid]}] or [{val}]?")
    question_number += 1
    get_arrow_input()

    if latest_key == "Key.left":
        return binary_search(arr, val, mid+1, end)
    elif latest_key == "Key.right":
        return binary_search(arr, val, start, mid-1)
    else:
        return mid


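# Uncomment to start from scratch, or paste in a previously sorted list
#sorted_list = []
for index, value in enumerate(starting_list):

    try:
        print(f'Film [{index+1}/{len(starting_list)}]')
        insert_at = binary_search(sorted_list, value, 0, len(sorted_list)-1)
        sorted_list = sorted_list[:insert_at] + [value] + sorted_list[insert_at:]
    except Exception:
        # Print the progress so far so it can be pasted back into sorted_list
        print("Encountered an error!! Whoopsie")
        print(sorted_list)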

print(sorted_list)

The commented out sorted_list array would contain an already sorted list of films which was extremely useful! This was the algorithm I used to sort all 209 films which did take a while but was interesting enough that I didn’t mind. All of the films in my journal were now fairly ranked, so onto the IMDB API task…
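
The same idea can be tried without a keyboard by swapping the arrow-key prompt for a function. This is a simplified iterative sketch rather than the code above: insert_ranked and prefers are hypothetical names, and the scores are made up purely to stand in for a human choosing between two films.

```python
def insert_ranked(ranked, film, prefers):
    """Insert film into ranked (best first) using binary search.

    prefers(a, b) should return True when film a is better than film b.
    """
    low, high = 0, len(ranked)
    while low < high:
        mid = (low + high) // 2
        if prefers(film, ranked[mid]):
            high = mid      # new film belongs somewhere before mid
        else:
            low = mid + 1   # new film belongs somewhere after mid
    ranked.insert(low, film)
    return ranked


# Made-up scores standing in for a person's answers
scores = {"Film A": 3, "Film B": 9, "Film C": 6, "Film D": 1}

ranked = []
for title in scores:
    insert_ranked(ranked, title, lambda a, b: scores[a] > scores[b])

print(ranked)
```

Because ranked is only ever extended, it can start out as an already sorted list from an earlier session, which is what makes working through the films in chunks possible.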

Pulling Film Stats from IMDB

Using the IMDB API was actually quite straightforward thanks to the IMDbPY package, which appears to have been renamed cinemagoer, so big thanks to the developers for that. I learned how to use this package relatively quickly due to this helpful article which included some examples relevant to the information I needed. Here is all of the code I had to write to extract all of the IMDB information I wanted initially:

import imdb
import json
from datetime import datetime
 
# creating instance of IMDb
ia = imdb.IMDb()

# Information to fetch
info_to_fetch = [
    'runtimes',
    'year',
    'genres',
    'rating'
]

# Some info we want to keep as a list, otherwise take first value.
info_as_lists = [
    'genres'
]

# Get film list
with open('2021_sorted.json') as json_file:
    film_list = json.load(json_file)

# Shorten list while testing
film_list = [
    'The Pianist'
]
film_list_len = len(film_list)

# Maintain list of films
film_records = []
for index, film_name in enumerate(film_list):

    print(f'[{index+1}/{film_list_len}]: Fetching IMDB data for [{film_name}]...')

    film_record = {
        'title': film_name
    }

    try:
        # Search for the movie
        search_results = ia.search_movie(film_name)

        if len(search_results) > 0:
            top_result = search_results[0]
            film_id = top_result.movieID

            film = ia.get_movie(film_id)

            # Check that info is available
            available_info = film.infoset2keys['main']
            
            # Fetch available info
            for info in info_to_fetch:
                if info in available_info:
                    fetched_info = film.get(info)

                    if isinstance(fetched_info, list) and info not in info_as_lists:
                        fetched_info = fetched_info[0]

                    print(f'    Found [{info}]: {fetched_info}')
                    film_record[info] = fetched_info
        else:
            print('    Film not found!')

    except Exception as error:
        print(error)
        print('    Ran into error trying to fetch data, skipping film...')

    film_records.append(film_record)

output_filename = f'imdb_data_{datetime.now().strftime("%Y%m%d_%H%M%S")}.json'

print(f'Saving film data to [{output_filename}]')
with open(output_filename, 'w') as output_file:
    json.dump(film_records, output_file, indent=4)

There is a lot of information available once the get_movie function returns the film, so all I needed to do was pick out the information I was interested in, which I kept in an info_to_fetch list. Some of the film titles in my list weren’t found, however, mostly due to typos or slight differences in the title such as a missing subtitle. I went through these missing records manually and corrected them, and before long had all of the information I was after.
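
Finding the records that need manual attention can be automated. The sketch below assumes records shaped like the film_record dictionaries built above, where a film that wasn’t found only has a title; find_incomplete_records is a hypothetical helper and the sample records are placeholders, not real IMDB data.

```python
def find_incomplete_records(film_records, required_fields):
    """Return (title, missing_fields) pairs for records lacking any required field."""
    incomplete = []
    for record in film_records:
        missing = [field for field in required_fields if field not in record]
        if missing:
            incomplete.append((record["title"], missing))
    return incomplete


# Placeholder records for illustration only
sample_records = [
    {"title": "Example Film A", "runtimes": "100", "year": 2000,
     "genres": ["Drama"], "rating": 7.0},
    {"title": "Exmaple Film B"},  # typo in the title, so nothing was fetched
]

required = ["runtimes", "year", "genres", "rating"]
for title, missing in find_incomplete_records(sample_records, required):
    print(f"{title}: missing {missing}")
```

Running something like this over the saved JSON output gives a short list of titles to correct and fetch again.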

This was my first time using the IMDB API so I didn’t want to pull in too much data and not know what to do with it. The ease of pulling in data was impressive though, so I’ll definitely use it again and pull in more information.

Answering the Initial Questions

I now have my entire film journal ranked and enriched with IMDB data, which means it’s finally time to start answering some questions. I used the popular package pandas to put all of the information I had gathered in various CSVs in a single dataframe films_df, which let me answer those questions in the following way…

Total films watched?

print(len(films_df))
209

Total time watched

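from datetime import timedelta

total_runtime_mins = films_df['runtimes'].sum()

# Cast to a plain int, since timedelta can reject NumPy integer types
total_runtime_formatted = timedelta(minutes=int(total_runtime_mins))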
print(total_runtime_formatted)
17 days, 7:02:00

Preferred day to watch films

total_days_of_week = films_df['Date'].dt.day_name().value_counts()

print(total_days_of_week)
Friday       40
Saturday     40
Sunday       39
Wednesday    27
Thursday     24
Tuesday      21
Monday       18

Most films watched in a single day

film_date_frequency = films_df['Date'].value_counts()
max_count = film_date_frequency.max()

print(film_date_frequency[film_date_frequency==max_count].sort_index())
2021-01-01    3
2021-02-13    3
2021-04-06    3

Films watched most frequently

film_title_frequency = films_df['Title'].value_counts()
max_count = film_title_frequency.max()

print(film_title_frequency[film_title_frequency==max_count].sort_index())
Druk                     2
Life of Brian (1979)     2
Mean Girls               2
Under the Silver Lake    2

Longest film watched

longest_film = films_df[films_df['runtimes'] == films_df['runtimes'].max()][['Title','runtimes']]

longest_film_title = longest_film['Title'].values[0]
longest_film_runtime = timedelta(minutes=int(longest_film['runtimes'].values[0]))

print(f'Longest film was "{longest_film_title}" with runtime of {longest_film_runtime}')
Longest film was "Once Upon a Time in America" with runtime of 3:49:00

Shortest film watched

shortest_film = films_df[films_df['runtimes'] == films_df['runtimes'].min()][['Title','runtimes']]

shortest_film_title = shortest_film['Title'].values[0]
shortest_film_runtime = timedelta(minutes=int(shortest_film['runtimes'].values[0]))

print(f'Shortest film was "{shortest_film_title}" with runtime of {shortest_film_runtime}')
Shortest film was "Death to 2020" with runtime of 1:10:00

Best rated film watched

best_imdb_film = films_df[films_df['rating'] == films_df['rating'].max()][['Title','rating']]

best_imdb_film_title = best_imdb_film['Title'].values[0]
best_imdb_film_rating = best_imdb_film['rating'].values[0]

print(f'The highest rated film was "{best_imdb_film_title}" with rating of {best_imdb_film_rating}')
The highest rated film was "Schindlers List" with rating of 8.9

Worst rated film watched

worst_imdb_film = films_df[films_df['rating'] == films_df['rating'].min()][['Title','rating']]

worst_imdb_film_title = worst_imdb_film['Title'].values[0]
worst_imdb_film_rating = worst_imdb_film['rating'].values[0]

print(f'The worst rated film was "{worst_imdb_film_title}" with rating of {worst_imdb_film_rating}')
The worst rated film was "Sentinelle" with rating of 4.7

Total films watched grouped by medium

print(films_df['Where'].value_counts())
Netflix            122
Prime               28
Disney+             27
Blu-ray             11
Torrent             11
Cinema               3
Prime Purchase       2
DVD                  2
All 4                1
TV                   1
BBC iPlayer          1

Average film IMDB rating by medium

print(films_df.groupby('Where')['rating'].mean().sort_values(ascending=False))
All 4              8.600000
BBC iPlayer        8.100000
Blu-ray            7.590909
Prime              7.571429
Cinema             7.566667
Torrent            7.445455
Netflix            7.274590
DVD                7.200000
Disney+            7.166667
Prime Purchase     7.100000
TV                 7.000000

My normalised ranking plotted against normalised IMDB rating

import plotly.express as px

rankings = films_df['2021 Rank'].copy()
normalized_rankings = (rankings-rankings.min())/(rankings.max()-rankings.min())

ratings = films_df['rating'].copy()
normalized_ratings = (ratings-ratings.min())/(ratings.max()-ratings.min())

films_df_with_normalised = films_df.copy()
films_df_with_normalised['normalized_rankings'] = normalized_rankings
films_df_with_normalised['normalized_ratings'] = normalized_ratings

fig = px.scatter(
    films_df_with_normalised,
    x='normalized_ratings',
    y='normalized_rankings',
    color='Where',
    hover_data=['Title'],
    trendline='ols',
    trendline_scope='overall',
    width=800,
    height=400)

fig.update_yaxes(autorange="reversed")
fig.show()

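[Figure: scatter plot of normalised personal ranking (y-axis, reversed) against normalised IMDB rating (x-axis), coloured by where each film was watched, with an overall OLS trend line]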

Conclusion

Quite a few interesting insights! The above scatter plot is a bit of a mess, but that was sort of the point: it demonstrates that my film tastes don’t strictly follow the tastes of IMDB reviewers. The trend line suggests there is some correlation, but really it’s all over the place, which is good!
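
“Some correlation” could be given a number with a correlation coefficient. The sketch below computes Pearson’s r by hand on made-up values; min_max and pearson are hypothetical helpers and none of the numbers come from my journal. Because rank 1 is my favourite while a higher IMDB rating is better, agreement between the two shows up as a negative r.

```python
from math import sqrt


def min_max(values):
    """Scale values to the 0-1 range, as done for the scatter plot."""
    lowest, highest = min(values), max(values)
    return [(v - lowest) / (highest - lowest) for v in values]


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return covariance / (spread_x * spread_y)


# Made-up values for illustration only, not taken from the film journal
my_ranks = [1, 2, 3, 4, 5]               # 1 = favourite
imdb_ratings = [8.1, 7.4, 7.9, 6.5, 6.8]

r = pearson(min_max(my_ranks), min_max(imdb_ratings))
print(f"Pearson r = {r:.2f}")
```

A value near -1 would mean my ranking closely follows IMDB ratings, while a value near 0 would match the “all over the place” impression from the plot.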

When it comes to choosing a favourite streaming provider there also isn’t a clear winner. Prime appeared to have provided me with slightly higher quality films, but I used Netflix much more so I must prefer that in general, which I’d say is accurate.

While I enjoyed this project, which let me revisit some areas of programming that I learned at university, realising I spent over 17 days watching films does make me wonder what else I could have done in that time. It’s easy to think that time could have been spent doing something more productive, but if we leave no time to relax and enjoy ourselves we’d all go crazy. I’m sure I’ll continue to watch many films for years to come, as that’s what I love doing, and there’s nothing wrong with doing what you love.