April 2005 (perspective of a college student)
I have developed an easy-to-use Python script that automatically harvests JPEG images from a website and from a selected number of pages linked from that starting site. It uses the free GNU Wget program to download the images, along with a few heuristics that try to grab images only from the most relevant pages. You can think of it as a more specialized and 'intelligent' Wget.
To use my Image Harvester script, simply download image-harvester.py and run it, on a computer with Wget installed, from the directory where you want the images to be downloaded.
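For example, an invocation would look something like this (assuming the script takes the starting URL as its only argument):

    python image-harvester.py <url-to-harvest>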
This script downloads all .jpg images on and linked from <url-to-harvest>. It then follows all webpage links on that page and downloads the images on those pages. Finally, it follows one more level of webpage links from those pages to grab their images, except that at this last level it only follows URLs in the SAME domain, to prevent jumping to outside sites. For each page that it crawls, it creates one sub-directory to hold the images downloaded from that page.
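To make the strategy concrete, here is a minimal sketch of that two-level crawl. This is not the actual image-harvester.py: link extraction uses a crude regex instead of a real HTML parser, the Wget flags and sub-directory naming are invented for the sketch, and it assumes wget is on your PATH.

    import os
    import re
    import subprocess
    import sys
    from urllib.parse import urlparse
    from urllib.request import urlopen

    # Crude link extraction: match only absolute http links.
    LINK_RE = re.compile(r'href="(http[^"]+)"', re.IGNORECASE)

    def page_links(url):
        # Return the absolute http links found in a page's HTML.
        try:
            html = urlopen(url).read().decode('utf-8', 'replace')
        except Exception:
            return []
        return LINK_RE.findall(html)

    def harvest_page(url, dest_dir):
        # Use Wget to grab the .jpg images on (and directly linked from) url.
        os.makedirs(dest_dir, exist_ok=True)
        subprocess.call(['wget', '--quiet', '--recursive', '--level=1',
                         '--no-directories', '--accept=jpg,JPG,jpeg,JPEG',
                         '--directory-prefix=' + dest_dir, url])

    def crawl(start_url):
        start_domain = urlparse(start_url).netloc
        harvest_page(start_url, 'page-0')                  # depth 0: start page
        for i, link in enumerate(page_links(start_url)):   # depth 1: any domain
            harvest_page(link, 'page-0-%d' % i)
            for j, sub in enumerate(page_links(link)):     # depth 2: same domain only
                if urlparse(sub).netloc == start_domain:
                    harvest_page(sub, 'page-0-%d-%d' % (i, j))

    if __name__ == '__main__':
        crawl(sys.argv[1])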
Your choice of <url-to-harvest> is important in determining how many images this script can harvest. For optimal results, choose a page that contains lots of images that you want and also lots of links to other pages with lots of images. The maximum depth of webpage links that this script follows is 2, which should be enough for most image-harvesting purposes. Additional levels of recursion usually result in undesired crawling to irrelevant sites.
The Image Harvester script cannot distinguish the images that you want to keep from the ones that you don't (e.g., thumbnails, ads, and banners). I have written an Image Filterer BASH shell script that tries to filter out undesired images based on a simple dimension heuristic: if either an image's width or height is below some minimum threshold (350x350 is what I use), then it's probably a thumbnail, ad, or banner that you don't want to keep. This script uses the ImageMagick identify program to inspect the dimensions of all .jpg images, throw away the ones that don't meet the minimum threshold, and then throw away sub-directories that no longer contain any images.
To filter your images, download keep-images-larger-than.sh and run it in the same directory where you ran image-harvester.py.
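An invocation would look something like this (assuming the two thresholds are passed on the command line, width first and then height):

    ./keep-images-larger-than.sh <min-width> <min-height>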
This will first create sub-directories named small-images-trash and small-images-trash/no-jpgs-dirs to store the filtered-out files and directories, respectively. Then it will find all .jpg images within all sub-directories and move any file whose width or height is less than <min-width> or <min-height>, respectively, into small-images-trash. As a last step, it will move any sub-directories that no longer contain .jpg images into small-images-trash/no-jpgs-dirs. These trash directories provide a safety net against accidental deletions. After running the Image Harvester and Image Filterer, your sub-directories should contain only the full-sized images that you want to keep.
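For concreteness, here is a rough sketch of that filtering pass in Python. The real script is written in BASH, but both shell out to ImageMagick's identify to read image dimensions. The sketch assumes the single-level sub-directory layout that image-harvester.py produces:

    import os
    import shutil
    import subprocess
    import sys

    TRASH = 'small-images-trash'
    DIR_TRASH = os.path.join(TRASH, 'no-jpgs-dirs')

    def dimensions(path):
        # Ask ImageMagick's identify for the image's width and height.
        out = subprocess.check_output(
            ['identify', '-format', '%w %h', path]).decode().split()
        return int(out[0]), int(out[1])

    def filter_images(min_w, min_h):
        os.makedirs(DIR_TRASH, exist_ok=True)  # also creates TRASH itself
        for d in sorted(os.listdir('.')):
            if not os.path.isdir(d) or d == TRASH:
                continue
            # Move every too-small .jpg in this sub-directory into the trash.
            for f in os.listdir(d):
                if f.lower().endswith('.jpg'):
                    w, h = dimensions(os.path.join(d, f))
                    if w < min_w or h < min_h:
                        shutil.move(os.path.join(d, f), TRASH)
            # If no .jpg images remain, trash the whole sub-directory too.
            if not any(f.lower().endswith('.jpg') for f in os.listdir(d)):
                shutil.move(d, DIR_TRASH)

    if __name__ == '__main__':
        filter_images(int(sys.argv[1]), int(sys.argv[2]))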
Here is the code. Please give it a shot and email me with feedback if you have trouble getting it to work or want me to add additional features.
Here are the two main problems that I've experienced with automated web crawling and downloading tools, and how this project tries to solve them: