The Power of Embedding Clustering: Leveraging OpenAI to Analyze Common Themes in Recent High-Grossing Films

What can movie text data tell us about the different film industry market clusters, and how can we use it to develop a cluster-based recommendation system?

Ana Preciado
6 min read · Aug 7, 2023
Image by Freepik: https://www.freepik.com/free-photo/arrangement-cinema-objects-close-up_7089735.htm#query=movie%20industry&position=2&from_view=search&track=country_rows_v2

The objective of this article is to learn more about the OpenAI API machine learning use cases by extracting embedding information from movie industry text data. We'll walk through a fun and relatable analysis in which we identify clusters of existing film industry consumer markets and build a simple proximity-based recommendation system. Our data consists of a small sample of 165 movies (the top 30 films with the highest reported revenue for each of the years 2018 through 2022, plus the top 15 for 2023) extracted from the following Kaggle open dataset, whose license states it is open for commercial use: https://www.kaggle.com/datasets/akshaypawar7/millions-of-movies.

The advantage of working only with 165 movies is that we can take a look at the movies individually in order to assess the health of the underlying computation, and make business sense of the results. It is important to mention that this is not a representative sample.

In the following notebook, you can find the code I used to extract the dataset embeddings.
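As a rough illustration of that step, here is a minimal sketch of extracting embeddings for the text columns. The model name, file names, and helper functions are assumptions for illustration; the exact code lives in the notebook.

```python
# Hypothetical sketch of the embedding-extraction step; model name,
# column names, and helpers are assumptions, not the notebook's exact code.
import pandas as pd

def embedding_text(row: pd.Series, columns: list[str]) -> str:
    """Join the text columns that will be sent to the embeddings endpoint."""
    return " ".join(str(row[c]) for c in columns if pd.notna(row[c]))

def embed_texts(texts: list[str], model: str = "text-embedding-ada-002") -> list[list[float]]:
    """Request one embedding per text from the OpenAI API (network call)."""
    from openai import OpenAI
    client = OpenAI()  # reads OPENAI_API_KEY from the environment
    response = client.embeddings.create(input=texts, model=model)
    return [item.embedding for item in response.data]

# Toy stand-in for the Kaggle sample of 165 movies.
movies = pd.DataFrame({
    "title": ["Example Movie"],
    "overview": ["A hero saves the day."],
    "genres": ["Action"],
})
movies["embedding_input"] = movies.apply(
    lambda row: embedding_text(row, ["overview", "genres"]), axis=1
)
# movies["overview_embeddings"] = embed_texts(movies["overview"].tolist())
```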

The datasets generated in the prior notebook are then loaded into the following notebook, where the analysis and algorithm training takes place.

Column Content:

This is a list of the original columns available in the dataset (along with the transformed columns generated in the previous step). I'll mark with a parenthetical note the columns containing text data that I think would be valuable to analyze with the OpenAI API.

  • Original columns: ‘title’, ‘genres’ (text), ‘overview’(text), ‘popularity’, ‘production_companies’, ‘release_date’, ‘budget’, ‘revenue’, ‘runtime’, ‘tagline’ (text), ‘vote_average’, ‘vote_count’, ‘keywords’ (text).
  • Transformed columns: ‘overview_embeddings’, ‘genres_embeddings’, ‘cluster’, ‘release_date_year_month’, ‘release_date_year’, ‘pca_comp_0’, ‘pca_comp_1’, ‘is_Horror’, ‘is_Comedy’, ‘is_Action’, ‘is_Drama’, ‘is_Animation’, ‘is_Fantasy’, ‘is_Thriller’, ‘is_War’, ‘is_Science Fiction’, ‘is_Family’, ‘is_History’, ‘is_Adventure’, ‘is_Crime’, ‘is_Romance’, ‘is_Music’, ‘is_Mystery’, ‘is_Western’, ‘is_Documentary’
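The `is_<Genre>` indicator columns above can be derived from the `genres` text column with a one-hot expansion. A minimal sketch, assuming genres are stored as a delimited string (the separator here is an assumption):

```python
# Sketch of deriving the is_<Genre> indicator columns from the 'genres'
# column; the "-" separator and sample data are assumptions.
import pandas as pd

movies = pd.DataFrame({
    "title": ["Movie A", "Movie B"],
    "genres": ["Horror-Thriller", "Comedy"],
})

# get_dummies splits each string on the separator and one-hot encodes it.
genre_dummies = movies["genres"].str.get_dummies(sep="-").add_prefix("is_")
movies = movies.join(genre_dummies)
```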

Sample overview throughout the years:

Median metrics per year:

  • For each year I'm using the top 30 movies with the highest reported revenue (except for 2023, where I'm using the top 15 because of the article's publishing date).

As we can see, the figures reflect the expected trend: a sharp decline in budget and revenue in 2020. What is interesting, however, is that since 2020 the median film popularity more than doubled for 2021, 2022, and 2023, and the median budget seems to have recovered. Revenue, however, didn't return to pre-pandemic figures.

Yearly percent with genre keyword

It certainly is the case that some genres are more profitable than others in terms of revenue. Nevertheless, we can spot interesting patterns: 40% of the top rated movies in 2021 had a comedy genre, and the drama category saw a significant decline after the pandemic.

With this context in mind, we can go ahead and start our analysis.

What are the existing markets?

To see what the existing markets are, I extracted the embedding data for 'overview', 'genres', 'keywords', and 'tagline'. I then applied t-SNE dimensionality reduction (2 components) and defined the clusters with DBSCAN, using hyperparameter tuning. I opted to use only the embeddings for 'overview' and 'genres' because the embeddings for 'keywords' and 'tagline' did not contribute to the silhouette score.

To define the parameters of the t-SNE and DBSCAN, I created a for loop that iterates over perplexity (t-SNE), and epsilon and min_samples (DBSCAN). The best silhouette score I obtained was 0.446, with a perplexity of 10, an epsilon of 6, and min_samples of 4.
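That search loop can be sketched as follows. The grid values and the random matrix standing in for the stacked overview+genres embeddings are assumptions; noise points (label -1) are excluded before scoring, since silhouette is undefined for them.

```python
# Sketch of the perplexity/epsilon/min_samples grid search; grid values
# and the placeholder data are assumptions, not the notebook's exact setup.
import numpy as np
from sklearn.manifold import TSNE
from sklearn.cluster import DBSCAN
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(42)
X = rng.normal(size=(165, 32))  # placeholder for the real embedding matrix

best = {"score": -1.0}
for perplexity in [5, 10, 30]:
    reduced = TSNE(n_components=2, perplexity=perplexity,
                   random_state=42).fit_transform(X)
    for eps in [4, 6, 8]:
        for min_samples in [3, 4, 5]:
            labels = DBSCAN(eps=eps, min_samples=min_samples).fit_predict(reduced)
            mask = labels != -1  # drop noise points before scoring
            # Silhouette needs at least 2 clusters among the kept points.
            if mask.sum() > 1 and len(set(labels[mask])) > 1:
                score = silhouette_score(reduced[mask], labels[mask])
                if score > best["score"]:
                    best = {"score": score, "perplexity": perplexity,
                            "eps": eps, "min_samples": min_samples}
```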

Before using t-SNE, I tested with PCA and found that the explained variance ratio for the first two components was only 0.19; dimensionality reduction on embedding data caused large information loss. I proceeded with only two components for t-SNE due to ease of use, but I invite you to consider different alternatives to avoid information loss when applying the use case.
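The PCA sanity check amounts to summing the explained variance ratio of the first two components. A sketch with placeholder data (on the real embeddings, the article reports this sum was only 0.19):

```python
# How much variance do the first two principal components retain?
# The random matrix is a placeholder for the real embedding matrix.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(165, 32))

pca = PCA(n_components=2).fit(X)
retained = pca.explained_variance_ratio_.sum()
# A value far below 1.0 signals large information loss in two dimensions.
```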

I used the following code to ask the algorithm to identify the most common genre, most common production company, and top 3 overarching themes for each of the movies in each of the clusters.
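A hedged sketch of that step: the prompt wording, model name, and helper names below are assumptions for illustration, not the notebook's exact code.

```python
# Hypothetical sketch of the cluster-summarization step.
def build_cluster_prompt(movies: list[dict]) -> str:
    """List each movie's title and overview, then ask for the summary fields."""
    listing = "\n".join(f"- {m['title']}: {m['overview']}" for m in movies)
    return (
        "For the following movies, identify the most common genre, the most "
        "common production company, and the top 3 overarching themes:\n" + listing
    )

def summarize_cluster(movies: list[dict], model: str = "gpt-3.5-turbo") -> str:
    """Send the prompt to the chat completions endpoint (network call)."""
    from openai import OpenAI
    client = OpenAI()  # reads OPENAI_API_KEY from the environment
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": build_cluster_prompt(movies)}],
    )
    return response.choices[0].message.content

cluster_movies = [
    {"title": "Example Movie", "overview": "A family rediscovers itself."},
]
prompt = build_cluster_prompt(cluster_movies)
# summary = summarize_cluster(cluster_movies)
```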

These are the results:

Main characteristics per cluster (hand-formatted into a table image); please refer to the notebook annex for more information.

Here you can find the detail of movie examples for each of the clusters:

7 movies were not allocated to a cluster, but rather appear as outliers or anomalies:

After looking at this, we have a better idea of the content of each of the clusters. Did you notice that some themes repeat in essence? Examples like self-discovery and family relationships appear more than once as important elements in the narratives of these selected high-grossing movies. This is likely not a coincidence. Let's do a small exploratory analysis of the impact these themes have on a movie's popularity, using cosine similarity to themes and genres as features. We calculate the cosine similarity that each movie has with each of the themes with the following code:
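A minimal sketch of that computation, with toy vectors standing in for the real OpenAI embeddings of each movie's overview and of a theme phrase:

```python
# Cosine similarity between each movie's embedding and a theme embedding.
# The vectors here are toy placeholders for the real OpenAI embeddings.
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine of the angle between two vectors: dot product over norms."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

theme_embedding = np.array([1.0, 0.0, 0.0])  # e.g. "family relationships"
movie_embeddings = {
    "Movie A": np.array([0.9, 0.1, 0.0]),  # close to the theme
    "Movie B": np.array([0.0, 1.0, 0.0]),  # orthogonal to the theme
}
similarities = {
    title: cosine_similarity(vec, theme_embedding)
    for title, vec in movie_embeddings.items()
}
```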

Because we know that the relationships will very likely not be linear, let's start analyzing the different partitions with a small tree, using a small depth of 2:
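A sketch of fitting that shallow tree on the cosine-similarity features; the feature names and the synthetic data below are assumptions for illustration.

```python
# Depth-2 regression tree on cosine-similarity features predicting popularity.
# Feature columns and data are toy placeholders for the real features.
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(1)
# Columns: sim_drama, sim_family, sim_superheroes (cosine similarities).
X = rng.random(size=(100, 3))
# Toy popularity: low drama similarity and high family similarity help.
y = 100 * (1 - X[:, 0]) + 50 * X[:, 1] + rng.normal(scale=5, size=100)

tree = DecisionTreeRegressor(max_depth=2, random_state=0).fit(X, y)
# plot_tree(tree) would show which similarity thresholds split popularity.
```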

As we can see, movies whose plots have low cosine similarity to "drama" and high cosine similarity to "family" (node 1) have the highest popularity. This is followed by movies with low cosine similarity to "drama" and low cosine similarity to "family". Next, among the movies with high cosine similarity to "drama", we can see that those with high cosine similarity to "Superheroes and the battle against powerful villains" have better popularity.

This is just one example of how the cosine similarity of one embedding to another can be used as a feature inside a machine learning algorithm to improve predictions.

Lastly, let's look at an example of a recommendation system based on the previously built clusters, as advertised.

Euclidean distance recommendation system:

By building a Euclidean distance matrix, we'll create a recommendation system that leverages Euclidean distance over the reduced-dimensionality coordinates used for the clusters. This representation had considerable information loss, but it combines the embeddings for genre and plot overview.

Here is the code for the creation and here are the results of 5 iterations:
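A minimal sketch of that recommender, assuming each movie is represented by its 2-D reduced coordinates (toy values below stand in for the real ones): build the full pairwise distance matrix, mask the diagonal, and return the nearest neighbour.

```python
# Nearest-neighbour recommender over 2-D reduced coordinates.
# Titles and coordinates are toy placeholders for the real reduced embeddings.
import numpy as np
from scipy.spatial.distance import cdist

titles = np.array(["Jurassic World", "Jurassic World: Dominion", "Venom"])
coords = np.array([
    [0.0, 0.0],
    [0.5, 0.2],
    [5.0, 5.0],
])

dist = cdist(coords, coords)        # full Euclidean distance matrix
np.fill_diagonal(dist, np.inf)      # never recommend the movie itself

def recommend(title: str) -> str:
    """Return the title whose coordinates are closest to the given movie's."""
    i = int(np.where(titles == title)[0][0])
    return str(titles[int(np.argmin(dist[i]))])
```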

As we can see, multiple times the sequel (or prequel) of a movie is outputted as the recommended next movie to watch if you liked the first one (or second one). Examples of this include Jurassic World, Venom, and Fantastic Beasts. I am personally amused that Free Guy was recommended after Ready Player One, as they share some plot elements related to the video game industry.

Things to consider for further analysis:

For further analysis, I would consider changing the sampling to random (as opposed to top-grossing films) and increasing its size so that it is representative. I would also suggest exploring the Google Trends API, IMDb API, and YouTube API to find more NLP information that can be combined with the insights generated in this post.

More about me:

https://www.linkedin.com/in/anapreciado/
