TMDB API: a simple tutorial

Goal

Download posters from The Movie Database (TMDB) for the movies listed in the MovieLens dataset.

Tools

Python 3.5, MovieLens dataset, The Movie Database (TMDB), tmdbsimple

Prepare

1. MovieLens dataset

MovieLens

This dataset contains movie titles, movie info, user ratings and more. It comes in two sizes, 100k and 20M. The latest versions were updated in 2016 and 2017.

This article uses only the links.csv file.
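For orientation, here is a minimal sketch of the links.csv layout and of how pandas parses it. The rows are typed inline as illustrative stand-ins rather than read from the real file, and the last row is a made-up example of a movie without a TMDB id. The name df_sample is only used for this illustration.

```python
import io

import pandas as pd

# Illustrative rows in the links.csv layout; the last row is a made-up
# example of a movie that has no TMDB id.
sample = io.StringIO(
    "movieId,imdbId,tmdbId\n"
    "1,0114709,862\n"
    "2,0113497,8844\n"
    "3,0000001,\n"
)
df_sample = pd.read_csv(sample)

# imdbId is parsed as an integer, so its leading zero is dropped.
# tmdbId becomes a float column because the missing value is stored as NaN.
print(df_sample.dtypes)
print(df_sample)
```

Both effects matter for the filtering in step 1 of the procedure: IMDB ids are compared by digit count after the leading zero is gone, and missing TMDB ids show up as NaN.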

2. The Movie Database API

To use the TMDB API, register a TMDB account and request an API key.

The API is served over plain HTTP requests, so I picked a simple Python wrapper listed on the TMDB website: tmdbsimple - A wrapper for The Movie Database API v3

Add the tmdbsimple module (imported as tmdb) to your PyCharm project.
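To make "in the form of HTTP requests" concrete, the sketch below builds the URL of a v3 movie-details request by hand. It only constructs the string and sends nothing. The names API_ROOT and movie_details_url are hypothetical helpers for this illustration, not part of tmdbsimple, and the key is a placeholder.

```python
from urllib.parse import urlencode

# Root of the v3 API; the wrapper normally hides this from you.
API_ROOT = "https://api.themoviedb.org/3"


def movie_details_url(tmdb_id, api_key):
    """Build the URL of a movie-details request (no request is sent)."""
    query = urlencode({"api_key": api_key})
    return "{}/movie/{}?{}".format(API_ROOT, tmdb_id, query)


print(movie_details_url(862, "YOUR_API_KEY"))
```

The wrapper saves you from assembling such URLs and decoding the JSON replies yourself.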

Procedure

1. Read data from movielens

Use pandas to read the movie id, IMDB id and TMDB id from links.csv. The TMDB id is used later to look up the poster path.

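import numpy as np
import pandas as pd

# the trailing backslash must be escaped inside a Python string
datapath = 'D:movielens\\'
df_id = pd.read_csv(datapath + 'links.csv', sep=',')

# zero-based movie index -> IMDB id
idx_to_movie = {}
# zero-based movie index -> TMDB id
idx_to_tmid = {}

for row in df_id.itertuples():
    # row[0] is the DataFrame index; row[1], row[2], row[3] are movieId, imdbId, tmdbId
    idx_to_movie[row[1] - 1] = row[2]
    idx_to_tmid[row[1] - 1] = row[3]

total_movies = 9000
movies = [0] * total_movies
for i in range(len(movies)):
    # keep movies with a six-digit IMDB id and a TMDB id that is not missing
    if i in idx_to_tmid and len(str(idx_to_movie[i])) == 6 and not np.isnan(idx_to_tmid[i]):
        # pandas stores tmdbId as float because of the missing values, so cast it back
        movies[i] = int(idx_to_tmid[i])

# drop the slots that were never filled
movies = list(filter(lambda tmdb_id: tmdb_id != 0, movies))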
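As an alternative to the index dictionaries, the missing-id part of the filtering can be sketched with pandas alone. This is a simplified variant, not the code above: it keeps every row that has a TMDB id and does not apply the six-digit IMDB check or the 9000-movie cap. The function name tmdb_ids and the inline demo data are assumptions made for the example.

```python
import io

import pandas as pd


def tmdb_ids(df_links):
    """Return the TMDB ids of the rows that have one, as plain ints."""
    return df_links["tmdbId"].dropna().astype(int).tolist()


# Inline stand-in for links.csv so the sketch runs on its own;
# the second row deliberately has no TMDB id.
demo = pd.read_csv(io.StringIO("movieId,imdbId,tmdbId\n1,0114709,862\n2,0113497,\n"))
print(tmdb_ids(demo))
```

With the real file you would pass the DataFrame returned by pd.read_csv for links.csv instead of demo.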

2. Get poster url and imdb id

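# get_poster() is defined in step 3; PosterCrawler is assumed to be the module it lives in
total_movies = len(movies)
URL = [0] * total_movies
IMDB = [0] * total_movies
URL_IMDB = {"url": [], "imdb": []}
i = 0
# build the poster URL -> IMDB id mapping
for movie in movies:
    print(movie)
    (URL[i], IMDB[i]) = PosterCrawler.get_poster(movie)
    # a bare base URL means TMDB has no poster for this movie
    if URL[i] != 'http://image.tmdb.org/t/p/':
        URL_IMDB["url"].append(URL[i])
        URL_IMDB["imdb"].append(IMDB[i])
    i += 1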
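The same loop can be sketched without the pre-sized lists and the manual counter. Here collect_posters is a hypothetical name, and fake_get_poster is a stand-in for the real helper so the example runs without an API key; its return values are invented for the demonstration.

```python
def collect_posters(tmdb_ids, get_poster, base_url="http://image.tmdb.org/t/p/"):
    """Call get_poster for each id and keep only the movies that have a poster."""
    url_imdb = {"url": [], "imdb": []}
    for tmdb_id in tmdb_ids:
        url, imdb_id = get_poster(tmdb_id)
        if url != base_url:  # the bare base URL means "no poster"
            url_imdb["url"].append(url)
            url_imdb["imdb"].append(imdb_id)
    return url_imdb


def fake_get_poster(tmdb_id):
    """Stand-in for the real helper: only id 862 'has' a poster."""
    if tmdb_id == 862:
        return "http://image.tmdb.org/t/p/w92/example.jpg", "tt0114709"
    return "http://image.tmdb.org/t/p/", 0


result = collect_posters([862, 999], fake_get_poster)
print(result)
```

Passing the helper in as an argument also makes the loop easy to try out before spending API calls on thousands of movies.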

3. TMDB API

I wrote a helper function, get_poster(), that calls the TMDB API and returns the poster URL and the IMDB id. It takes the movie's TMDB id as its parameter.

import tmdbsimple as tmdb

tmdb.API_KEY = 'Your API Key'

def get_poster(movieid):
    """Return (poster URL, IMDB id) for a TMDB movie id."""
    base_url = 'http://image.tmdb.org/t/p/'
    psize = 'w92'
    posturl = base_url
    imdbid = 0

    movie = tmdb.Movies(movieid)
    # info() fetches the movie details and stores them as attributes on movie
    movie.info()

    # fall back to (base_url, 0) when the poster or the IMDB id is missing
    if not hasattr(movie, 'poster_path') or not hasattr(movie, 'imdb_id') or movie.poster_path is None:
        return posturl, imdbid

    posturl = base_url + psize + movie.poster_path
    return posturl, movie.imdb_id

Posters are available in several sizes: w92, w154, w185, w342, w500 and original. Change psize to pick another one. If TMDB has no poster for a movie, the function returns the bare base URL and 0.
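The URL scheme described above (base URL, then size, then poster path) can be sketched as a small standalone helper. The names BASE_URL, POSTER_SIZES and poster_url are hypothetical, the size list is the one given in this article, and "/example.jpg" is a made-up poster path.

```python
BASE_URL = "http://image.tmdb.org/t/p/"
# Sizes as listed in this article.
POSTER_SIZES = ("w92", "w154", "w185", "w342", "w500", "original")


def poster_url(poster_path, size="w92"):
    """Join base URL, size and poster path; reject sizes not in the list."""
    if size not in POSTER_SIZES:
        raise ValueError("unknown poster size: {}".format(size))
    return BASE_URL + size + poster_path


print(poster_url("/example.jpg", "w185"))
```

Rejecting unknown sizes early avoids collecting thousands of URLs that would fail only at download time.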

4. Download the Posters

Now that we have the poster URLs, let's download them:

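import urllib.request

df = pd.DataFrame(data=URL_IMDB)
total_movies = len(df)

# download posters; the target folder must already exist
# (the trailing backslash must be escaped inside a Python string)
poster_path = "D:posters\\"
for i in range(total_movies):
    urllib.request.urlretrieve(df.url[i], poster_path + str(i) + ".jpg")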

It may take a few hours since there are 10k posters…
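Since a run of this length is likely to be interrupted or to hit a few failing URLs, a more forgiving download loop might look like the sketch below. It skips files that already exist and records failures instead of stopping. The names download_posters and fake_fetch are hypothetical, and the demo uses a stand-in fetch function with invented URLs so that it runs without network access.

```python
import os
import tempfile
import urllib.request


def download_posters(urls, dest_dir, fetch=urllib.request.urlretrieve):
    """Save each URL as <index>.jpg in dest_dir; return the indexes that failed."""
    failed = []
    for i, url in enumerate(urls):
        target = os.path.join(dest_dir, str(i) + ".jpg")
        if os.path.exists(target):
            continue  # already downloaded on an earlier run
        try:
            fetch(url, target)
        except OSError:  # urllib.error.URLError is a subclass of OSError
            failed.append(i)
    return failed


def fake_fetch(url, target):
    """Stand-in for urlretrieve: fails for URLs containing 'missing'."""
    if "missing" in url:
        raise OSError("simulated download failure")
    with open(target, "wb") as handle:
        handle.write(b"not a real image")


demo_dir = tempfile.mkdtemp()
demo_urls = ["http://example.invalid/a.jpg", "http://example.invalid/missing.jpg"]
failed = download_posters(demo_urls, demo_dir, fetch=fake_fetch)
print(failed)
```

For the real run, call download_posters(list(df.url), poster_path) and leave fetch at its default.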

Conclusion

Mind the boundary conditions: some movies in links.csv have no TMDB id, and some movies have no poster on TMDB.

There is also another fine wrapper, PyTMDB3, but it uses Python 2.x. Pity 🙁