Goal
Download posters from The Movie Database (TMDB) for the movies listed in the MovieLens dataset.
Tools
Python 3.5, MovieLens dataset, The Movie Database (TMDB), tmdbsimple
Prepare
1. MovieLens dataset
This dataset includes movie titles, metadata, user ratings, and more. It comes in two sizes, 100k and 20M; the latest versions were updated in 2016 and 2017.
In this article, we focus on the file links.csv.
2. The Movie Database API
To use the TMDB API, we have to register a TMDB account and request an API key.
The API is exposed over HTTP requests, so I picked a simple Python wrapper package listed on the TMDB website: tmdbsimple - A wrapper for The Movie Database API v3
Install the tmdbsimple package (imported as tmdb) in your PyCharm project.
Progress
1. Read data from MovieLens
Use pandas to read the movie id, IMDB id, and TMDB id from links.csv. Later we will use the TMDB id to look up each poster’s path.
import csv
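As a minimal sketch of this step, the snippet below reads the three id columns with the standard csv module, as the listing’s import suggests (pandas.read_csv would work equally well). The function name read_links and the in-memory sample rows are assumptions made for illustration; the real links.csv has the same movieId, imdbId, tmdbId header.

```python
import csv
import io

# Illustrative stand-in for links.csv; the last row has no TMDB id on purpose.
SAMPLE_LINKS = (
    "movieId,imdbId,tmdbId\n"
    "1,0114709,862\n"
    "2,0113497,8844\n"
    "3,0113228,\n"
)


def read_links(handle):
    """Return one dict per movie, skipping rows that lack a TMDB id.

    Ids are kept as strings so the leading zeros of IMDB ids survive.
    """
    movies = []
    for row in csv.DictReader(handle):
        if row["tmdbId"]:
            movies.append(
                {
                    "movieId": row["movieId"],
                    "imdbId": row["imdbId"],
                    "tmdbId": row["tmdbId"],
                }
            )
    return movies


movies = read_links(io.StringIO(SAMPLE_LINKS))
print(len(movies), "movies with a TMDB id")
```

With the real file, the call would be `with open("links.csv", newline="") as f: movies = read_links(f)`. Dropping rows without a TMDB id here keeps the later lookup loop simple.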
2. Get poster URL and IMDB id
total_movies = len(movies)
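The loop for this step might look roughly like the sketch below. It assumes that movies is a list of dicts with imdbId and tmdbId keys, and that get_poster (the helper described in the next section) returns a URL string, or None when there is no poster. Both are simplifications for illustration, not necessarily how the original script is organised; the name collect_posters and the stub lookup are hypothetical.

```python
def collect_posters(movies, get_poster):
    """Look up every movie's poster and collect URL / IMDB id column-wise."""
    url_imdb = {"url": [], "imdb": []}
    total_movies = len(movies)
    for i, movie in enumerate(movies, start=1):
        url = get_poster(movie["tmdbId"])
        if url:  # skip movies for which no poster was found
            url_imdb["url"].append(url)
            url_imdb["imdb"].append(movie["imdbId"])
        print("{}/{} processed".format(i, total_movies))
    return url_imdb


# Offline demo with a stub lookup (made-up URLs, no TMDB call).
sample_movies = [
    {"imdbId": "0114709", "tmdbId": "862"},
    {"imdbId": "0113497", "tmdbId": "8844"},
]


def stub_get_poster(tmdb_id):
    # Pretend the second movie has no poster.
    if tmdb_id == "8844":
        return None
    return "https://example.invalid/{}.jpg".format(tmdb_id)


URL_IMDB = collect_posters(sample_movies, stub_get_poster)
```

Keeping the two lists aligned by position means the result can be passed straight to pd.DataFrame later on.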
3. TMDB API
I wrote a helper function, get_poster(), that calls the TMDB API and returns the poster’s path. It takes the movie’s TMDB id as its parameter.
import tmdbsimple as tmdb
Posters are available in several sizes: w92, w154, w185, w342, w500, and original. You can switch between them by editing psize. If TMDB does not have a poster for a movie, this function returns the base URL and 0.
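A helper along these lines could look like the sketch below. It is not the original get_poster(): the api_key parameter and the poster_url function are hypothetical additions, and for simplicity it returns None rather than the base URL and 0 when a movie has no poster. It relies on tmdbsimple’s Movies(id).info() call, which returns a dict containing a poster_path entry, and on TMDB’s image base URL https://image.tmdb.org/t/p/.

```python
BASE_URL = "https://image.tmdb.org/t/p/"
POSTER_SIZES = ("w92", "w154", "w185", "w342", "w500", "original")


def poster_url(info, psize="w185"):
    """Build the full poster URL from a TMDB movie-info dict.

    Returns None when the movie has no poster_path.
    """
    if psize not in POSTER_SIZES:
        raise ValueError("unknown poster size: {}".format(psize))
    poster_path = info.get("poster_path")  # e.g. "/abc.jpg", or None
    if not poster_path:
        return None
    return BASE_URL + psize + poster_path


def get_poster(tmdb_id, api_key, psize="w185"):
    """Fetch movie info via tmdbsimple and return the poster URL.

    Needs network access and a valid API key; an unknown id makes
    tmdbsimple raise an HTTP error, which the caller should handle.
    """
    import tmdbsimple as tmdb  # lazy import: poster_url works without it

    tmdb.API_KEY = api_key
    info = tmdb.Movies(int(tmdb_id)).info()
    return poster_url(info, psize)


# The URL-building part can be checked offline with a hand-made dict.
example = poster_url({"poster_path": "/abc.jpg"}, psize="w500")
print(example)
```

Separating the URL construction from the network call makes the size handling easy to test without an API key.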
4. Download the Posters
Now that we have the poster paths, let’s download them:
df = pd.DataFrame(data=URL_IMDB)
It may take a few hours since there are 10k posters…
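Since the run is long, it helps if the download step can be resumed and does not abort on a single failure. The sketch below shows one way to do that with the standard library only. It works directly on a dict with url and imdb lists rather than on the DataFrame, and the names download_posters, fetch_bytes, and the fake fetcher in the demo are all assumptions made for illustration.

```python
import os
import tempfile
import urllib.request


def fetch_bytes(url):
    """Download a URL and return its raw bytes."""
    with urllib.request.urlopen(url, timeout=30) as response:
        return response.read()


def download_posters(url_imdb, out_dir, fetch=fetch_bytes):
    """Save each poster as <imdb id>.jpg in out_dir; return the ids that failed."""
    os.makedirs(out_dir, exist_ok=True)
    failed = []
    for url, imdb_id in zip(url_imdb["url"], url_imdb["imdb"]):
        target = os.path.join(out_dir, imdb_id + ".jpg")
        if os.path.exists(target):
            continue  # already on disk, so an interrupted run can resume
        try:
            data = fetch(url)
        except OSError:  # urllib's URLError and HTTPError are OSError subclasses
            failed.append(imdb_id)
            continue
        with open(target, "wb") as out:
            out.write(data)
    return failed


# Offline demo with a fake fetcher, so nothing is actually downloaded.
def fake_fetch(url):
    if "missing" in url:
        raise OSError("simulated network failure")
    return b"fake image bytes"


demo = {
    "url": ["https://example.invalid/a.jpg", "https://example.invalid/missing.jpg"],
    "imdb": ["0114709", "0113497"],
}
with tempfile.TemporaryDirectory() as tmp:
    failed = download_posters(demo, tmp, fetch=fake_fetch)
    saved = sorted(os.listdir(tmp))
print("failed:", failed, "saved:", saved)
```

Naming files by IMDB id keeps the posters easy to join back to links.csv, and the returned list of failures can simply be retried later.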
Conclusion
Remember to handle the boundary conditions, such as movies for which TMDB has no poster.
There is also another fine wrapper, PyTMDB3; unfortunately, it uses Python 2.x 🙁