Unlocking IMDB Data With Metis for Awesome Database Optimization Insights

Metis can analyze queries, suggest improvements, find performance issues, and automatically troubleshoot problems. We are going to use IMDb datasets. We are going to play with some queries, and see how Metis can identify performance characteristics.

‍

You can find the datasets at IMDb datasets.

Information courtesy of

IMDb

(https://www.imdb.com).

Used with permission.

Dataset preparation

Let’s start with preparing the database. We are going to use the PostgreSQL (Postgres) database.

Let’s start with downloading files:

curl --output name.basics.tsv.gz https://datasets.imdbws.com/name.basics.tsv.gz \
	&& curl --output title.akas.tsv.gz https://datasets.imdbws.com/title.akas.tsv.gz \
	&& curl --output title.basics.tsv.gz https://datasets.imdbws.com/title.basics.tsv.gz \
	&& curl --output title.crew.tsv.gz https://datasets.imdbws.com/title.crew.tsv.gz \
	&& curl --output title.episode.tsv.gz https://datasets.imdbws.com/title.episode.tsv.gz \
	&& curl --output title.principals.tsv.gz https://datasets.imdbws.com/title.principals.tsv.gz \
	&& curl --output title.ratings.tsv.gz https://datasets.imdbws.com/title.ratings.tsv.gz

Next, we need to decompress them:

gzip -d name.basics.tsv.gz \
	&& gzip -d title.akas.tsv.gz \
	&& gzip -d title.basics.tsv.gz \
	&& gzip -d title.crew.tsv.gz \
	&& gzip -d title.episode.tsv.gz \
	&& gzip -d title.principals.tsv.gz \
	&& gzip -d title.ratings.tsv.gz

Next, we need to replace \N characters with empty values in order to import them as NULLs:

sed 's/\\N//g' name.basics.tsv > name.basics.tsv2 \
	&& sed 's/\\N//g' title.akas.tsv > title.akas.tsv2 \
	&& sed 's/\\N//g' title.basics.tsv > title.basics.tsv2 \
	&& sed 's/\\N//g' title.crew.tsv > title.crew.tsv2 \
	&& sed 's/\\N//g' title.episode.tsv > title.episode.tsv2 \
	&& sed 's/\\N//g' title.principals.tsv > title.principals.tsv2 \
	&& sed 's/\\N//g' title.ratings.tsv > title.ratings.tsv2 \
	&& rm *.tsv

Now, we need to create a database in our PostgreSQL instance. You can install the server locally or use the Docker image. Next, create the schema:

CREATE SCHEMA imdb;

And now create tables. We are going to use TEXT columns to keep things simple:

CREATE TABLE imdb.name_basics(
	nconst              TEXT    NOT NULL PRIMARY KEY,
	primaryName        TEXT,
	birthYear          INT,
	deathYear          INT,
	primaryProfession  TEXT,
	knownForTitles    TEXT
);
   
CREATE TABLE imdb.title_akas(
	titleId TEXT NOT NULL,
	ordering INT NOT NULL,
	title TEXT,
	region TEXT,
	language TEXT,
	types TEXT,
	attributes TEXT,
	isOriginalTitle BIT,
	constraint pk_title_akas PRIMARY KEY (titleId, ordering)
);


CREATE TABLE imdb.title_basics(
	tconst TEXT NOT NULL PRIMARY KEY,
	titleType TEXT,
	primaryTitle TEXT,
	originalTitle TEXT,
	isAdult BIT,
	startYear INT,
	endYear INT,
	runtimeMinutes INT,
	genres TEXT
);


CREATE TABLE imdb.title_crew(
	tconst TEXT NOT NULL PRIMARY KEY,
	directors TEXT,
	writers TEXT
);


CREATE TABLE imdb.title_episode(
	tconst TEXT NOT NULL PRIMARY KEY,
	partentTconst TEXT,
	seasonNumber INT,
	episodeNumber INT
);


CREATE TABLE imdb.title_principals (
	tconst TEXT NOT NULL,
	ordering INT NOT NULL,
	nconst TEXT,
	category TEXT,
	job TEXT,
	characters TEXT,
	constraint pk_title_principals PRIMARY KEY (tconst, ordering)
);


CREATE TABLE imdb.title_ratings (
	tconst TEXT NOT NULL PRIMARY KEY,
	averageRating NUMERIC(15, 8),
	numVotes BIGINT
);
CODE

Finally, we need to load the IMDb database:

COPY imdb.name_basics FROM '/database/name.basics.tsv2' DELIMITER E'\t' QUOTE E'\b' CSV HEADER;
COPY imdb.title_akas FROM '/database/title.akas.tsv2' DELIMITER E'\t' QUOTE E'\b' CSV HEADER;
COPY imdb.title_basics FROM '/database/title.basics.tsv2' DELIMITER E'\t' QUOTE E'\b' CSV HEADER;
COPY imdb.title_crew FROM '/database/title.crew.tsv2' DELIMITER E'\t' QUOTE E'\b' CSV HEADER;
COPY imdb.title_episode FROM '/database/title.episode.tsv2' DELIMITER E'\t' QUOTE E'\b' CSV HEADER;
COPY imdb.title_principals FROM '/database/title.principals.tsv2' DELIMITER E'\t' QUOTE E'\b' CSV HEADER;
COPY imdb.title_ratings FROM '/database/title.ratings.tsv2' DELIMITER E'\t' QUOTE E'\b' CSV HEADER;

The database is now ready and we can remove the input files:

rm *.tsv2
rm *.tsv
rm *.tsv.gz

Registering in Metis

You can now go to Metis and register. You are now ready to query the database. Start with a very simple query:

SELECT *
FROM imdb.name_basics
LIMIT 1

This should give the following output:

nconst	primaryname	birthyear	deathyear	primaryprofession	knownfortitles
nm5590744	Sophie Kenny	(null)	(null)	assistant_director,miscellaneous,director	tt1758810,tt2085059,tt10846464,tt14764788

Now, we can get the execution plan for the query above. Let’s change the query to:

EXPLAIN (VERBOSE, BUFFERS, COSTS, FORMAT JSON)
SELECT *
FROM imdb.name_basics
LIMIT 1

We should get something along the lines:

[
{
	"Plan": {
	"Node Type": "Limit",
	"Parallel Aware": false,
	"Async Capable": false,
	"Startup Cost": 0.00,
	"Total Cost": 0.02,
	"Plan Rows": 1,
	"Plan Width": 66,
	"Output": ["nconst", "primaryname", "birthyear", "deathyear", "primaryprofession", "knownfortitles"],
	"Shared Hit Blocks": 0,
	"Shared Read Blocks": 0,
	"Shared Dirtied Blocks": 0,
	"Shared Written Blocks": 0,
	"Local Hit Blocks": 0,
	"Local Read Blocks": 0,
	"Local Dirtied Blocks": 0,
	"Local Written Blocks": 0,
	"Temp Read Blocks": 0,
	"Temp Written Blocks": 0,
	"Plans": [
	{
		"Node Type": "Seq Scan",
		"Parent Relationship": "Outer",
		"Parallel Aware": false,
		"Async Capable": false,
		"Relation Name": "name_basics",
		"Schema": "imdb",
		"Alias": "name_basics",
		"Startup Cost": 0.00,
		"Total Cost": 253832.48,
		"Plan Rows": 12416948,
		"Plan Width": 66,
		"Output": ["nconst", "primaryname", "birthyear", "deathyear", "primaryprofession", "knownfortitles"],
		"Shared Hit Blocks": 0,
		"Shared Read Blocks": 0,
		"Shared Dirtied Blocks": 0,
		"Shared Written Blocks": 0,
		"Local Hit Blocks": 0,
		"Local Read Blocks": 0,
		"Local Dirtied Blocks": 0,
		"Local Written Blocks": 0,
		"Temp Read Blocks": 0,
		"Temp Written Blocks": 0
	}
	]
	},
	"Query Identifier": -599869187636405422,
	"Planning": {
	"Shared Hit Blocks": 0,
	"Shared Read Blocks": 0,
	"Shared Dirtied Blocks": 0,
	"Shared Written Blocks": 0,
	"Local Hit Blocks": 0,
	"Local Read Blocks": 0,
	"Local Dirtied Blocks": 0,
	"Local Written Blocks": 0,
	"Temp Read Blocks": 0,
	"Temp Written Blocks": 0
	}
}
]

We can see how the SQL engine is going to execute the query. This is only the estimated plan, and the actual execution may differ. To get the actual plan, we can run this:

EXPLAIN (ANALYZE, TIMING, VERBOSE, BUFFERS, COSTS, FORMAT JSON)
SELECT *
FROM imdb.name_basics
LIMIT 1

You can now copy the query and the execution plan to Metis Query Analyzer and get the following:

‍

Let’s now see what environment I used for the experiments.

Platform

I used the following environment to run the tests below.

‍

RDS instance db.m6g.large with PostgreSQL 13.7. It has 2 vCPUs and 8 GB of RAM.

EC2 instance t3.xlarge with AMI amzn2-ami-kernel-5.10-hvm-2.0.20221210.1-x86_64-gp2 with Amazon Linux 2 in version 5.10.157-139.675.amzn2.x86_64. It has 4 vCPUs and 16 GB of RAM memory.

Mac Mini with macOS 13 Ventura running on M1 with Darwin Kernel Version 22.4.0 It has 3.2GHz 8-Core M1 and 16GB of RAM.

Both EC2 and Mac used PostgreSQL 15.2 (Debian 15.2-1.pgdg110+1) on x86_64-pc-linux-gnu, compiled by gcc (Debian 10.2.1-6) 10.2.1 20210110, 64-bit running in a Docker container that you can get at public.ecr.aws/o2c0x5x8/metis-demo-db:latest.

For a given actor, find their latest movies

First query I used shows some latest movies a given actor has been in. This is the query:

SELECT TB.*
FROM name_basics AS NB
LEFT JOIN title_principals AS TP ON TP.nconst = NB.nconst
LEFT JOIN title_basics AS TB ON TB.tconst = TP.tconst
WHERE NB.nconst = 'nm1588970'
ORDER BY TB.startyear DESC
LIMIT 10

And here is the output:

tconst	titletype	primarytitle	originaltitle	isadult	startyear	endyear	runtimeminutes	genres
tt7513040	short	Den gamla goda tiden	Den gamla goda tiden	false	1946	(null)	6	Documentary,Short
tt0000001	short	Carmencita	Carmencita	false	1894	(null)	1	Documentary,Short

Execution time:

EC2	RDS	Mac
37	31	24

Let’s start with the estimated plan:

‍

‍

We can see that the SQL engine predicts using 3 tables, reading ~52M rows, and returning 10 of them. Let’s compare that with the actual:

Metis Data SQL Estimated plan comparison

We can see that the estimations were pretty accurate. The actual number of rows was closer to 56M, and the query returned 2 rows instead of 10.

However, based on the insights above, we can easily tell how to improve the performance. imdb.title_principals doesn’t have an index. If we go to the Query Tale tab, we can see that it scans the table:

How can we improve the query? Metis clearly indicates that a table is scanned and this decreases the performance! Let’s add the index:

CREATE INDEX IF NOT EXISTS title_principals_nconst_idx ON imdb.title_principals(nconst) INCLUDE (tconst);

The index is on the nconst column (which is used in the join operator), and it also includes the tconst column so that we can join it later on with another table. Now, the query finishes in less than a second:

‍

The index clearly improved the performance.

For a given actor, find their ten most highly rated films

Let’s now find ten best movies a given actor has been in. The query is:

SELECT TB.*, TR.*
FROM name_basics AS NB
LEFT JOIN title_principals AS TP ON TP.nconst = NB.nconst
LEFT JOIN title_basics AS TB ON TB.tconst = TP.tconst
LEFT JOIN title_ratings AS TR on TR.tconst = TP.tconst
WHERE NB.nconst = 'nm1588970'
ORDER BY TR.averagerating DESC, TR.numvotes DESC
LIMIT 10

Here is the output:

tconst	titletype	primarytitle	originaltitle	isadult	startyear	endyear	runtimeminutes	genres	tconst	averagerating	numvotes
tt7513040	short	Den gamla goda tiden	Den gamla goda tiden	false	1946	(null)	6	Documentary,Short	(null)	(null)	(null)
tt0000001	short	Carmencita	Carmencita	false	1894	(null)	1	Documentary,Short	tt0000001	5.70000000	1965

This query runs for the following number of seconds:

EC2	RDS	Mac
3	12	30

Here is the estimated execution:

We can see a very similar story here. title_principals table doesn’t have an index, so the engine needs to scan it fully. Let’s compare with the actual execution:

Metis Data SQL comparison with execution

We can see that the performance is low because of this missing index. Once we add it, we get the following performance:

‍

Metis Data SQL performance with missing index

‍

The query executes in 1 millisecond now.

Find the ten top rated films with some number of votes

Let’s now find the best movies that received some number of votes. This is the query:

SELECT TB.*, TR.*
FROM title_basics AS TB
LEFT JOIN title_ratings AS TR on TR.tconst = TB.tconst
WHERE TR.numvotes > 10000
ORDER BY TR.averagerating DESC
LIMIT 10

Result:

tconst	titletype	primarytitle	originaltitle	isadult	startyear	endyear	runtimeminutes	genres	tconst	averagerating	numvotes
tt2301451	tvEpisode	Ozymandias	Ozymandias	false	2013	(null)	48	Crime,Drama,Thriller	tt2301451	10.00000000	193154
tt12187040	tvEpisode	Plan and Execution	Plan and Execution	false	2022	(null)	50	Crime,Drama	tt12187040	9.90000000	48920
tt9906260	tvEpisode	Hero	Hero	false	2019	(null)	24	Action,Adventure,Animation	tt9906260	9.90000000	100806
tt4283094	tvEpisode	The Winds of Winter	The Winds of Winter	false	2016	(null)	68	Action,Adventure,Drama	tt4283094	9.90000000	152869
tt2301455	tvEpisode	Felina	Felina	false	2013	(null)	55	Crime,Drama,Thriller	tt2301455	9.90000000	127058
tt13857684	tvEpisode	Assault	Assault	false	2021	(null)	24	Action,Adventure,Animation	tt13857684	9.90000000	79715
tt4283088	tvEpisode	Battle of the Bastards	Battle of the Bastards	false	2016	(null)	60	Action,Adventure,Drama	tt4283088	9.90000000	215387
tt2178784	tvEpisode	The Rains of Castamere	The Rains of Castamere	false	2013	(null)	51	Action,Adventure,Drama	tt2178784	9.90000000	110274
tt9313966	tvEpisode	The Phantom Apprentice	The Phantom Apprentice	false	2020	(null)	27	Action,Adventure,Animation	tt9313966	9.90000000	14657
tt10023374	tvEpisode	Midnight Sun	Midnight Sun	false	2019	(null)	24	Action,Adventure,Animation	tt10023374	9.90000000	46063

Timings:

EC2	RDS	Mac
2	1	1

Let us now check the plan:

Once again, we lack an index. Metis clearly shows what index we should add:

CREATE INDEX IDX_title_ratings_637d5836 ON title_ratings (numvotes)

This creates the index on the title_ratings table on the numvotes column. Let’s see if the actual execution agrees:

‍

Metis Database Query Select - showing user the optimal query

We can see that the number of rows read is actually higher than the estimate (1.31M vs 1.06M). The index would definitely help. Let’s add it and see the performance:

‍

We can see that it greatly improved the runtime.

Given two people, list what movies they appeared in together

Let’s now write a query that finds movies with two specified actors. Query is:

SELECT TB.*
FROM title_basics AS TB
JOIN title_principals AS TP1 ON TP1.tconst = TB.tconst
JOIN title_principals AS TP2 ON TP2.tconst = TB.tconst
WHERE TP1.nconst = 'nm0302368' AND TP2.nconst = 'nm0001908'

Output:

tconst	titletype	primarytitle	originaltitle	isadult	startyear	endyear	runtimeminutes	genres
tt0000439	short	The Great Train Robbery	The Great Train Robbery	false	1903	(null)	11	Action,Adventure,Crime

Let’s now see the estimated execution plan:

Metis Database Query Select Estimated plan

Once again we see that the title_principals table is slowing the query down. Let’s compare that with the actual execution:

Metis Database Query Select Estimated plan actual execution

‍

Estimations were pretty correct, and the query is slow. Metis shows that the table scan is the root cause of the low performance. Let’s add the index with:

CREATE INDEX IF NOT EXISTS title_principals_nconst_idx ON imdb.title_principals(nconst) INCLUDE (tconst);

Let’s see the performance now:

Metis Database Query Select with index 2

‍

And we can see the performance is much better now.

List all of the cast and crew in a given movie

Let’s now find all the people involved in the movie. Here comes the query:

SELECT DISTINCT NB.*
FROM title_basics AS TB
LEFT JOIN title_principals AS TP ON TP.tconst = TB.tconst
LEFT JOIN title_crew AS TC ON TC.tconst = TB.tconst
LEFT JOIN name_basics AS NB ON
	NB.nconst = TP.nconst
	OR TC.directors = NB.nconst
	OR TC.directors LIKE NB.nconst || ',%'::text
	OR TC.directors LIKE '%,'::text || NB.nconst || ',%'::text
	OR TC.directors LIKE '%,'::text || NB.nconst
	OR TC.writers = NB.nconst
	OR TC.writers LIKE NB.nconst || ',%'::text
	OR TC.writers LIKE '%,'::text || NB.nconst || ',%'::text
	OR TC.writers LIKE '%,'::text || NB.nconst
WHERE TB.tconst = 'tt0000439'

Here is the output:

nconst	primaryname	birthyear	deathyear	primaryprofession	knownfortitles
nm0302368	Donald Gallaher	1895	1961	actor,director,miscellaneous	tt0021457,tt0020221,tt0029541,tt0020275
nm1145809	Scott Marble	1847	1919	writer	tt0438068
nm0001908	Gilbert M. 'Broncho Billy' Anderson	1880	1971	director,actor,producer	tt0183803,tt0001706,tt0003719,tt0176832
nm2313241	John Manus Dougherty Sr.	1885	(null)	actor	tt0000439
nm0807466	Blair Smith	1859	(null)	cinematographer,camera_department	tt0343568,tt0000439,tt0368073,tt0344376
nm0055607	George Barnes	1880	1951	actor	tt0434558,tt0408105,tt0003756,tt0322643
nm0131750	Walter Cameron	1872	1942	actor,cinematographer	tt0010346,tt0004760
nm0055661	Justus D. Barnes	1862	1946	actor	tt0233527,tt0415631,tt0415738,tt0002504
nm0007625	A.C. Abadie	1878	1950	cinematographer,director,actor	tt0485015,tt0477387,tt0167051,tt0918623
nm0692105	Edwin S. Porter	1870	1941	director,cinematographer,writer	tt0000757,tt0004654,tt0006279,tt1932747

And the timings:

EC2	RDS	Mac
94	98	685

Here are the estimates:

Metis Database Query Select Estimated plan 2

We can see three insights. One regarding number of rows (and missing index), one about the total cost of the plan, and one about the sorting of rows. Let’s compare that with the actuals:

‍

Metis Database Query Select Estimated plan actual execution 2

‍

We can see one more insight. That is regarding the I/O operations for the sorting part. When we go to the query tale, Metis shows that indeed scanning the name_basics table took a lot:

‍

The optimizer decided to materialize the result of the scan because it couldn’t optimize the join with multiple filters.

Why do we need that join? This is because in directors and writers we store the value as CSV:

nm0001,nm002,nm003

Not to mention, that this query extracted the title_basics table which we don’t need at all. How can we improve the query? We need to split the CSV into a regular column. First, let’s see how many identifiers we store at most:

SELECT MAX(CHAR_LENGTH(writers) - CHAR_LENGTH(REPLACE(writers, ',', '')))
FROM title_crew

We take the length of the writers column and subtract the length of the writers column with commas removed. The maximum value we get is 1390. This means that we have at most 1391 identifiers serialized in that field. For directors we get 490. Based on that we can safely assume that there are no more than 1500 identifiers. We can use that to split them and build a table with nconst identifiers:

WITH RECURSIVE numbers AS (
	SELECT 1 AS number
	UNION ALL
	SELECT number + 1 AS number FROM numbers WHERE number < 1500
),
split_associations AS (
	SELECT SPLIT_PART(TC.directors, ',', N.number) AS nconst
	FROM title_crew AS TC
	CROSS JOIN numbers AS N
	WHERE tconst = 'tt0000439' AND directors IS NOT NULL AND CHAR_LENGTH(directors) - CHAR_LENGTH(REPLACE(directors, ',', '')) + 1 >= N.number
	UNION
	SELECT SPLIT_PART(TC.writers, ',', N.number) AS nconst
	FROM title_crew AS TC
	CROSS JOIN numbers AS N
	WHERE tconst = 'tt0000439' AND writers IS NOT NULL AND CHAR_LENGTH(writers) - CHAR_LENGTH(REPLACE(writers, ',', '')) + 1 >= N.number
),
all_associations AS (
	SELECT SA.nconst
	FROM split_associations AS SA
	UNION
	SELECT TP.nconst
	FROM title_principals AS TP
	WHERE TP.tconst = 'tt0000439'
)
SELECT *
FROM all_associations

Output:

nconst
nm0807466
nm1145809
nm0007625
nm0302368
nm0692105
nm0055661
nm0055607
nm0131750
nm0001908
nm2313241

We can now join this with name_basics table and get the final query:

WITH RECURSIVE numbers AS (
	SELECT 1 AS number
	UNION ALL
	SELECT number + 1 AS number FROM numbers WHERE number < 1500
),
split_associations AS (
	SELECT SPLIT_PART(TC.directors, ',', N.number) AS nconst
	FROM title_crew AS TC
	CROSS JOIN numbers AS N
	WHERE tconst = 'tt0000439' AND directors IS NOT NULL AND CHAR_LENGTH(directors) - CHAR_LENGTH(REPLACE(directors, ',', '')) + 1 >= N.number
	UNION
	SELECT SPLIT_PART(TC.writers, ',', N.number) AS nconst
	FROM title_crew AS TC
	CROSS JOIN numbers AS N
	WHERE tconst = 'tt0000439' AND writers IS NOT NULL AND CHAR_LENGTH(writers) - CHAR_LENGTH(REPLACE(writers, ',', '')) + 1 >= N.number
),
all_associations AS (
	SELECT SA.nconst
	FROM split_associations AS SA
	UNION
	SELECT TP.nconst
	FROM title_principals AS TP
	WHERE TP.tconst = 'tt0000439'
)
SELECT NB.*
FROM name_basics AS NB
JOIN all_associations AS AA ON AA.nconst = NB.nconst

Execution plan and analysis:

‍

Metis Database Query Select Estimated plan and analysis

We can see that now the query uses indexes and is much faster. We could optimize it even further by calculating data that would let us avoid splitting by comma with each request.

Find the most prolific actor in a given period

Let’s now find the actor that did the most movies in a given period of time. Query:

SELECT NB.nconst, MAX(NB.primaryname), MAX(nb.birthyear), MAX(NB.deathyear), MAX(nb.primaryprofession), COUNT(*) AS number_of_titles
FROM title_basics AS TB
RIGHT JOIN title_principals AS TP ON TP.tconst = TB.tconst
RIGHT JOIN name_basics AS NB ON NB.nconst = TP.nconst
WHERE TB.startyear >= 1900 AND TB.startyear <= 1915
GROUP BY NB.nconst
ORDER BY number_of_titles DESC
LIMIT 1

Output:

nconst	max	max	max	max	number_of_titles
nm0002615	Siegmund Lubin	1851	1923	producer,director,actor	2993

Execution times:

EC2	RDS	Mac
180	51	45

Let’s see the estimated analysis:

Metis Database Query Select Estimated Analysis

Metis shows table scans, a lot. That’s because we filter based on the startyear column which isn’t indexed. Let’s fix that with the following indexes:

CREATE INDEX IF NOT EXISTS title_basics_startyear_idx ON imdb.title_basics(startyear) INCLUDE (tconst);
CREATE INDEX IF NOT EXISTS title_principals_tconst_idx ON imdb.title_principals(tconst) INCLUDE (nconst);

Let’s also notice that we don’t need to join name_basics first. We can start from the titles, and then get the details of the actor once we know who we’re looking for:

WITH best_actor AS (
	SELECT TP.nconst, COUNT(*) AS number_of_titles
	FROM title_basics AS TB
	LEFT JOIN title_principals AS TP ON TP.tconst = TB.tconst
	WHERE TB.startyear >= 1900 AND TB.startyear <= 1915 AND TP.nconst IS NOT NULL
	GROUP BY TP.nconst
	ORDER BY number_of_titles DESC
	LIMIT 1
)
SELECT BA.nconst, BA.number_of_titles, NB.primaryname, nb.birthyear, NB.deathyear, nb.primaryprofession
FROM best_actor AS BA
JOIN name_basics AS NB ON NB.nconst = BA.nconst

This gives the following analysis:

‍

We can see how these changes improved the query significantly.

Find most prolific actors in a given genre

Let’s now find ten actors doing most movies in a given genre. The query goes like this:

SELECT NB.nconst, NB.primaryname, NB.birthyear, COUNT(*) AS movies_count
FROM name_basics AS NB
LEFT JOIN title_principals AS TP ON TP.nconst = NB.nconst
LEFT JOIN title_basics AS TB ON TB.tconst = TP.tconst
WHERE TB.genres = 'Action' OR TB.genres LIKE 'Action,%' OR TB.genres LIKE '%,Action,%' OR TB.genres LIKE '%,Action'
GROUP BY NB.nconst, NB.primaryname, NB.birthyear
ORDER BY movies_count DESC
LIMIT 10

Output:

nconst	primaryname	birthyear	movies_count
nm0411127	Shotaro Ishinomori	1938	3428
nm0256607	Hiroko Emori	1961	2211
nm0496556	John Ledford	(null)	2205
nm0881576	Yoshio Urasawa	(null)	2197
nm0840642	Teiyû Ichiryûsai	1958	2175
nm2029519	Coco Martin	1981	2106
nm0782841	Toshihiko Seki	1962	2090
nm1167622	Tsutomu Shibayama	1941	2082
nm1113319	Soubee Amako	(null)	2065
nm1114802	Akiko Muta	(null)	2064

Timings:

EC2	RDS	Mac
17	78	93

Metis tells the following:

‍

Metis Database Query Select Estimated plan 3

Metis immediately identifies a lack of indexes and operations that can be optimized. There are two things that we can do here. First, let’s add an index to find titles for a given genre faster:

CREATE INDEX title_basics_genres_gin_idx ON title_basics USING gin (genres gin_trgm_ops);

Next, let’s rewrite the query to get the titles first and then join actors:

WITH best_actors AS (
	SELECT TP.nconst, COUNT(*) AS movies_count
	FROM title_basics AS TB
	LEFT JOIN title_principals AS TP ON TP.tconst = TB.tconst
	WHERE TB.genres = 'Action' OR TB.genres LIKE 'Action,%' OR TB.genres LIKE '%,Action,%' OR TB.genres LIKE '%,Action'
	GROUP BY TP.nconst
	ORDER BY movies_count DESC
	LIMIT 10
)
SELECT BA.nconst, NB.primaryname, NB.birthyear, BA.movies_count
FROM best_actors AS BA
JOIN name_basics AS NB ON NB.nconst = BA.nconst
ORDER BY movies_count DESC

This gives the following analysis:

‍

‍

Specifically, we can see that we use indexes for each table and reach much fewer rows:

‍

Finding most common coworkers

Let’s now find five people a given person worked the most with. We start with the following query:

WITH RECURSIVE numbers AS (
	SELECT 1 AS number
	UNION ALL
	SELECT number + 1 AS number FROM numbers WHERE number < 1500
),
titles_for_person AS (
	SELECT TC.tconst
	FROM title_crew AS TC
	WHERE directors = 'nm0000428' OR directors LIKE 'nm0000428,%' OR directors LIKE '%,nm0000428,%' OR directors LIKE '%,nm0000428'
	UNION
	SELECT TC.tconst
	FROM title_crew AS TC
	WHERE writers = 'nm0000428' OR writers LIKE 'nm0000428,%' OR writers LIKE '%,nm0000428,%' OR writers LIKE '%,nm0000428'
	UNION
	SELECT tconst
	FROM title_principals
	WHERE nconst = 'nm0000428'
),
titles_corresponding AS (
	SELECT TC.tconst, TC.directors, TC.writers
	FROM title_crew AS TC
	JOIN titles_for_person AS TFP ON TFP.tconst = TC.tconst
),
split_associations AS (
	SELECT TC.tconst, SPLIT_PART(TC.directors, ',', N.number) AS nconst
	FROM titles_corresponding AS TC
	CROSS JOIN numbers AS N
	WHERE directors IS NOT NULL AND CHAR_LENGTH(directors) - CHAR_LENGTH(REPLACE(directors, ',', '')) + 1 >= N.number
	UNION
	SELECT TC.tconst, SPLIT_PART(TC.writers, ',', N.number) AS nconst
	FROM titles_corresponding AS TC
	CROSS JOIN numbers AS N
	WHERE writers IS NOT NULL AND CHAR_LENGTH(writers) - CHAR_LENGTH(REPLACE(writers, ',', '')) + 1 >= N.number
),
all_associations AS (
	SELECT SA.tconst, SA.nconst
	FROM split_associations AS SA
	UNION
	SELECT TP.tconst, TP.nconst
	FROM title_principals AS TP
	JOIN titles_for_person AS TFP ON TFP.tconst = TP.tconst
),
other_people AS (
	SELECT nconst
	FROM all_associations
	WHERE nconst != 'nm0000428'
),
top_peers AS (
	SELECT OP.nconst, COUNT(*) as common_titles
	FROM other_people AS OP
	GROUP BY nconst
	ORDER BY common_titles DESC
	LIMIT 5
)
SELECT TP.nconst, TP.common_titles, NB.*
FROM top_peers AS TP
JOIN name_basics AS NB ON NB.nconst = TP.nconst
ORDER BY TP.common_titles DESC

Output:

nconst	common_titles	nconst	primaryname	birthyear	deathyear	primaryprofession	knownfortitles
nm0005658	479	nm0005658	G.W. Bitzer	1872	1944	cinematographer,director,camera_department	tt0431889,tt0006864,tt0315105,tt0009968
nm0115524	156	nm0115524	Kate Bruce	1860	1946	actress	tt0014604,tt0000816,tt0000909,tt0006745
nm0555522	134	nm0555522	Arthur Marvin	1859	1911	cinematographer,director,camera_department	tt0300052,tt0291476,tt0233612,tt0000412
nm0038106	130	nm0038106	Linda Arvidson	1884	1949	actress,writer	tt1487900,tt0160818,tt0000770,tt0000628
nm0424530	121	nm0424530	Arthur V. Johnson	1876	1916	actor,director,writer	tt0000628,tt0003675,tt0337827,tt0000697

Timings:

EC2	RDS	Mac
6	29	866

And analysis:

Metis Database Query Select Estimated plan 4

‍

Metis shows table scans and lack of indexes. See the numbers that the query reads millions of rows. Let’s build indexes for title_crew table:

CREATE INDEX title_crew_directors_gist_idx ON title_crew USING gist (directors gist_trgm_ops);
CREATE INDEX title_crew_writers_gist_idx ON title_crew USING gist (writers gist_trgm_ops);

Let’s also add indexes for title_principals:

CREATE INDEX IF NOT EXISTS title_principals_nconst_idx ON imdb.title_principals(nconst) INCLUDE (tconst);
CREATE INDEX IF NOT EXISTS title_principals_tconst_idx ON imdb.title_principals(tconst) INCLUDE (nconst);

This should give us a significant speedup:

‍

‍

We can see that we read thousands of rows now. That’s a great improvement.

Summary

We have seen a couple of good examples where Metis can clearly show database performance improvements. Thanks to that, we don’t go blind anymore. We can see our improvements, verify if indexes are beneficial, and see how the database performs. The crucial part is that we can finally get some clarity around the database internals. That’s the very first step towards building proper database guardrails.

‍

Unlocking IMDB Data With Metis for Awesome Database Optimization Insights

Dataset preparation

Registering in Metis

Platform

For a given actor, find their latest movies

For a given actor, find their ten most highly rated films

Find the ten top rated films with some number of votes

Given two people, list what movies they appeared in together

List all of the cast and crew in a given movie

Find the most prolific actor in a given period

Find most prolific actors in a given genre

Finding most common coworkers

Summary

Keep Your Databases Fast

Harness Your Database Complexity

Prevent Your Data Workflows From Breaking

Never worry about your
database again!

Keep Your Databases Fast

Harness Your Database Complexity

Prevent Your Data Workflows From Breaking

Never worry about your database again!

Contact Us

Never worry about your
database again!