Creating a static archive of a Drupal website:

1. Disable interactive elements and dynamic content blocks/pages of the website:

  1. Disable the login block.
  2. Disable commenting and comment controls: in the comment module, comment out the function calls that return interactive sections such as 'login for comments' or 'sort comments'.
  3. Disable the search block by disabling the search module.
  4. Make existing comments read-only.

2. Edit the robots.txt file to allow wget to crawl the website. Refer to ApacheNotes for more information about the robots.txt file.
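
As a minimal sketch of such an edit, the snippet below writes a stanza that names wget and places no restrictions on it. The output file name robots.txt.example is hypothetical; on a real site the stanza would be merged by hand into the existing robots.txt in the document root.

```shell
#!/bin/sh
# Hypothetical output file; the real file is robots.txt in the site's document root.
ROBOTS_FILE="${ROBOTS_FILE:-./robots.txt.example}"

# A stanza addressed to wget with an empty Disallow line allows it to fetch everything.
cat > "$ROBOTS_FILE" <<'EOF'
User-agent: wget
Disallow:
EOF

cat "$ROBOTS_FILE"
```

If editing robots.txt is not possible, wget can also be told to ignore it with `-e robots=off`.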

3. Because this is a full archive of the website, wget will make thousands of requests to the Drupal site. Logging all of these requests can take up a huge amount of disk space, so one may want to:

  • disable general query logging in MySQL
  • disable access logging in Apache
  • change the watchdog and access log settings so that the corresponding MySQL tables do not grow out of proportion.

In this case logging is directed to /dev/null. This may not improve performance, as the logging action is still performed; it only saves disk space. That matters because used disk space can grow by 100% within a few hours.
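
To see whether disk usage is growing during the crawl, a small helper like the one below can be run before and while wget is working. This is only a sketch: the function name used_percent is made up for illustration, and the directories to watch (for example the Apache log directory and the MySQL data directory) depend on the installation.

```shell
#!/bin/sh
# Print the percentage of used space on the filesystem that holds directory $1.
# Relies on the POSIX output format of "df -P", where the fifth column is the
# capacity (for example "42%").
used_percent() {
    df -P "$1" | awk 'NR == 2 { sub(/%/, "", $5); print $5 }'
}

# Example: report usage for the current directory.
echo "Used space: $(used_percent .)%"
```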

4. Use wget to create the archive with the following switches:

  • -q : turn off wget's output.
  • -m : turn on options suitable for mirroring. This option turns on recursion and time-stamping, sets infinite recursion depth and keeps FTP directory listings. It is currently equivalent to `-r -N -l inf --no-remove-listing'.
  • -k : after the download is complete, convert the links in the documents to make them suitable for local viewing.
  • -K : back up the original HTML files before the conversion.
  • -P : specify the download directory.
  • --html-extension : append .html to the names of downloaded HTML files that lack it.
  • -o : write wget's messages to the given log file.

wget -m --html-extension -k -K --base=./ -P ./ http://my.drupal.site -o wget.log &
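
Once the background job has finished, the log written by -o can be checked for failed requests. The sketch below assumes wget's usual non-quiet log format, in which a failed HTTP request is reported on a line containing "ERROR" followed by the three-digit status code and a colon. The function name count_http_errors is made up for illustration.

```shell
#!/bin/sh
# Count the lines in a wget log that report an HTTP error.
# Assumes log lines of the form "... ERROR 404: Not Found."
count_http_errors() {
    # grep -c exits with status 1 when the count is 0, so mask that case.
    grep -c 'ERROR [0-9][0-9][0-9]:' "$1" || true
}

# Example: report on wget.log if it exists in the current directory.
if [ -f wget.log ]; then
    echo "HTTP errors in wget.log: $(count_http_errors wget.log)"
fi
```

A non-zero count is a hint to look at the log itself for the affected URLs.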

5. Verify that the offline version of the site works. Check for any skipped nodes using the following Perl script:

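   #!/usr/bin/perl
   use strict;
   use warnings;

   my $maxnode = 1576;                 # change this number to your Drupal site's highest node number
                                       # better still, get it with the MySQL query: select max(nid) from node;

   my $file = './missingnodes.txt';    # missing nodes are appended to this file
   open(my $info, '>>', $file) or die "Cannot open $file: $!";
   for (my $i = 1; $i <= $maxnode; $i++) {   # <= so that the highest node is checked too
     my $op = `find . -name $i.html`;  # look for the saved page of node $i in the mirror
     if (!$op) {
       print $info "$i.html\n";        # nothing found: record the node as missing
     }
   }
   close($info);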
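
The highest node number mentioned in the script's comment can be fetched from the command line instead of being typed in by hand. This is a sketch under assumptions: the database name drupal is hypothetical, the node table is assumed to have no table prefix, and the MySQL credentials are assumed to come from the user's client configuration.

```shell
#!/bin/sh
# Print the highest node id stored in the Drupal database named in $1.
# -N suppresses the column header and -B gives plain, tab-separated output.
max_nid() {
    mysql -N -B -e 'SELECT MAX(nid) FROM node' "$1"
}
```

For example, `maxnode=$(max_nid drupal)` stores the result in a shell variable, which can then be copied into the script.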

6. Make sure that the hostname specified in the wget request matches the one in the Drupal configuration file (found in the /includes directory for Drupal 4.4 and the /sites directory for higher versions).
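
This comparison can be sketched in shell. The helper below assumes that the configuration file sets the site address on a single uncommented line of the form `$base_url = 'http://my.drupal.site';`. The function names and the example path in the usage note are made up for illustration.

```shell
#!/bin/sh
# Print the value assigned to $base_url in the Drupal configuration file $1.
# Only the first uncommented assignment is used.
config_base_url() {
    sed -n 's/^[[:space:]]*\$base_url[[:space:]]*=[[:space:]]*.\([^;]*\).;.*/\1/p' "$1" | head -n 1
}

# Succeed if the URL given to wget ($2) equals the configured base URL,
# ignoring one trailing slash on the wget URL.
same_base_url() {
    [ "$(config_base_url "$1")" = "${2%/}" ]
}
```

For example, `same_base_url sites/default/settings.php http://my.drupal.site && echo match` prints "match" only when the two agree.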

References:

  1. http://drupal.org/node/27882
  2. http://www.gnu.org/software/wget/manual/wget.html#HTTP-Time_002dStamping-Internals
  3. http://www.gnu.org/software/wget/manual/wget.html#Very-Advanced-Usage