Download in Geek Style: Use Wget (Part 2)


Hello there!
After our last article, Introduction to wget for Linux newbies, it is time to advance a little further. In this article we'll discuss advanced usage of Wget.
Let’s start with Wget’s most wanted command:


Downloading Recursively (-r switch)

Wget can download recursively, following all the links it meets along the way. For example, suppose you are reading an online book (an ebook, of course) which has links to further chapters. Using this feature you can download all the pages of the ebook with a single command, making your own copy to read offline. Even better, with a bit of Googling we can download as many mp3s or other files as we want, all in a single command.
Excited? (I know you are.)
All right, enough talking.





How to download recursively

wget -r -l 7 --no-parent -A pdf,djvu -nH --cut-dirs=4 -P "My download directory" "Link to download page"
 Time for some explanations:
-r or --recursive   This switch tells wget to start downloading recursively from the given link.

-l or --level='depth'    When downloading recursively, wget follows a system of levels. This switch sets the depth to which wget will follow links.
In the above example we start a download with the level set to 7. Wget downloads the main page first, then follows all the links given on that page; this is level 1. After downloading those, wget follows the links found in the pages it just downloaded; this is level 2. In the same way, wget keeps following every link it meets until it reaches the maximum depth.
By default, wget sets the depth level to 5. It can also be set to infinite:
wget -r -l inf "download link" or wget -r -l 0 "link"

--no-parent or -np   Wget's recursive download is bidirectional, meaning wget follows links in both directions of the link hierarchy (err… what is that?).
Let's see an example. Assume we are downloading free ebooks from a website, say example.com/ebooks/english/list.html, and what we want is to download English books only. By default, wget will follow all the links on list.html, BUT it may also move upwards in the hierarchy and follow the links it finds there. This is not what we want.
So here comes --no-parent. It is a very useful option which ensures that we only move downwards in the hierarchy and never upwards.
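For instance, a command roughly like this (the URL is just our running made-up example) should stay inside the english/ directory and never climb up to example.com/ebooks/ or the site root:
wget -r --no-parent "example.com/ebooks/english/list.html"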

Download specific file types (-A 'filetype or list' or --accept="list of filetypes")    When downloading ebooks from our kind website, we don't want any HTML, CSS or JavaScript files. By default, wget downloads everything, including images and scripts. The -A or --accept switch allows us to download only the desired files. In our example, we want only .pdf and .djvu files to be downloaded, and wget will do exactly that, strictly following our orders. Multiple filetypes can be given, separated by commas.
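As a quick sketch with the same imaginary site, this should keep only the .pdf and .djvu files and discard everything else fetched along the way:
wget -r --no-parent -A pdf,djvu "example.com/ebooks/english/list.html"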

-R "list of filetypes" or --reject="filetypes"   Similar to -A is -R. While -A accepts the given filetypes and rejects the rest, -R rejects the given filetypes and downloads everything else.
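A small sketch (same imaginary site), this time telling wget to keep everything except the HTML pages themselves:
wget -r --no-parent -R html,htm "example.com/ebooks/english/list.html"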

Handling Directories

-P "path" or --directory-prefix="path"   As stated in the previous article, -P can be used to save downloaded files under a specific path.
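A minimal example, assuming we want a single file dropped into a folder called "My Ebooks" (both the folder name and the URL are made up):
wget -P "My Ebooks" "example.com/ebooks/english/file.pdf"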

But when downloading recursively, there is one problem: wget saves all the files in the same directory hierarchy they had on the server. In our example, by default all files will be saved like this:
Home folder > example.com > ebooks > english > file.pdf
This behavior can be very irritating for normal users like us. But no worries, wget provides many options to handle this our own way. Here are the most commonly used ones.

--cut-dirs=x    This is a useful option for controlling the directory structure into which recursively downloaded files are saved. It cuts "x" directory components from the hierarchy.

-nH or --no-host-directories    This option cuts the name of the host from the directory structure. In other words, it disables the generation of host-prefixed directories.

-nd or --no-directories   This tells wget not to create any directory structure at all and to save all the files directly in the current directory (by default) or in the folder specified with -P.

Example:
Assume that we are recursively downloading PDF files from example.com/ebooks/english. This is how they will be saved on our PC with different options:

No options          ->   example.com/ebooks/english/file.pdf
-nH                 ->   ebooks/english/file.pdf
-nH --cut-dirs=1    ->   english/file.pdf
--cut-dirs=1        ->   example.com/english/file.pdf
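Putting the pieces together, a command roughly like this one (folder name and URL are still illustrative) should drop just the PDFs straight into "My Ebooks" with no extra directories at all:
wget -r --no-parent -A pdf -nd -P "My Ebooks" "example.com/ebooks/english/list.html"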
Got it? Good… :)

Making readable Offline Copies of Websites

It seems really easy to make offline copies of websites: just start a recursive download and it is done. Well, no, it is not quite that simple.
Think of the links on the pages. For example, if we download an ebook (HTML files) which has links to the next chapters, all of those links point to pages on the server: an ebook at "example.com/onlinebook/contents" will link to chapter 1 as "example.com/onlinebook/chapter1". Even in the offline copy that we made using wget, clicking on this link will take us online to the server. This is not what we want.
Again, no worries, wget has a solution for this.

-k or --convert-links    This is an extremely useful option which converts all the links in downloaded pages to point to their local copies (if those were downloaded). If an HTML file links to content which has not been downloaded, wget converts that link to its absolute location on the Internet. This ensures that there are no broken links and makes local viewing smooth.
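For example, with our imaginary ebook from above, something like this should rewrite the chapter links so they point to the local copies:
wget -r --no-parent -k "example.com/onlinebook/contents"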

-p or --page-requisites     Like the other wget options, this one is also very useful: it downloads all the files necessary for the proper display of a page (images, sounds, referenced style sheets and so on), even if they are located on a different website.
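Combining the two, a sketch of a typical "readable offline copy" command might look like this (the URL is still our made-up example):
wget -r --no-parent -k -p "example.com/onlinebook/contents"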

When She said NO !!

Sometimes web servers don't allow tools like wget to access their data, and hence we can't download from such servers. But as I have been saying from the very beginning, wget has a way for (almost) everything. Here are some useful options which can get us access when the server says no and tries to kick us out.

-U "agent" or --user-agent="agent"    When wget accesses a file on an HTTP server, it identifies itself by sending a user-agent string (a header field). It is as if it says to the server, "Hey baby, this is wget. Wassup?". But sometimes HTTP servers deny connections to certain agents (web browsers, wget etc. are all agents which let us access the Internet through protocols) or only allow some specific agents to access their data. We can fool the server by changing the user-agent string. The command looks like this:
wget -U "Mozilla/5.0" "download link" or wget --user-agent="Mozilla/5.0" "download link"
Here Mozilla is the name of the agent and 5.0 is the version number. What we are doing is changing the user-agent string so the request looks as if it was sent by your browser, or at least hiding the fact that it was sent by wget.
The actual user-agent string is pretty long and carries more information, but this much is fine for fooling most web servers. We can also tell wget not to send any user-agent at all with this command:
wget --user-agent=""   
--referer=url    This option includes "Referer: url" in the HTTP request. Some servers expect their data to be accessed only by web browsers that were sent there by some page which links to them. This option is not used often, but may be useful in particular cases.
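A small sketch, pretending the server only hands out files to visitors who arrive from its own listing page (both URLs are made up):
wget --referer="example.com/ebooks/english/list.html" "example.com/ebooks/english/file.pdf"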

--http-user=user and --http-password=password   If you have an account on the server and the server needs a username and password to authenticate the request, these options are used. The corresponding options for FTP servers are --ftp-user=user and --ftp-password=password, while --user=user and --password=password work for both HTTP and FTP. The latter two have lower precedence than the first two sets.
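For example, with a hypothetical account named "geek" on our imaginary server:
wget --http-user=geek --http-password=secret "example.com/private/file.pdf"
(Newer versions of wget also have an --ask-password option, which prompts for the password instead of leaving it in your shell history.)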

-w seconds or --wait=seconds     This makes wget wait the given number of seconds between two consecutive downloads, decreasing the load on the server. Instead of seconds, the time can be given in minutes with the "m" suffix, in hours with "h", or even in days with "d". Large values can be useful when the destination server is down, giving wget enough time to retry and wait until it is up again.
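For instance, to pause 30 seconds between files, or 2 minutes if the server is really touchy (URLs are illustrative):
wget -r -w 30 "example.com/ebooks/english/" or wget -r --wait=2m "download link"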

--random-wait   Sometimes web servers analyse the traffic coming to them to find out whether automatic tools like wget are accessing them. They usually measure the time between the requests they receive and deny further requests. The --random-wait switch makes wget wait a random amount of time between consecutive downloads, fooling the server.
This option causes the time between requests to vary between 0.5 and 1.5 times the wait value, where wait was specified using the --wait option, in order to mask wget's presence from such analysis.
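A gentle-crawl sketch: with --wait=10 the pause between requests will vary randomly between roughly 5 and 15 seconds:
wget -r --wait=10 --random-wait "download link"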

Unleash the power of Google

Google too has a syntax, much like *nix commands, which can be used to find what we want among the billions of pages on the Internet. We can get just what we are after if we use it smartly. Here we want a plain list of downloadable files which we can then download with wget. Just enter this string in the Google search bar and hit enter:
intitle:"index of/" mp3 "your favorite band" parent directory
This will return links to pages which contain nothing but links to mp3 files of your favorite band, ready to download. Such a link can be passed to wget for a recursive download with the required recursion depth to get exactly what we want. Tinker with the above search string to find other kinds of stuff, maybe videos, ebooks or whatever.
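As a sketch of the whole trick, an open-directory link found this way (the URL below is purely hypothetical) could be fed to wget like this:
wget -r -l 2 --no-parent -A mp3 -nd -P "My Music" "example.com/music/favorite-band/"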
(This is meant for educational purposes only. Downloading this way is not legal. Use at your own risk… :P)

Wget has much more than this. Refer to the wget manual pages for more advanced, in-depth information.
HAPPY HACKING… :D



