[PHP] Sitemap generator for directory listing

Sitemaps are lovely, aren’t they? They are not so easy to generate, though (or, gosh, to write by hand). Being a programmer, I decided to write a little PHP script that generates one on the fly for a large downloads site (almost 3,000 files) that I host for gamers. There’s a download link attached to this post, but here’s a look at the code.

It has one dependency, phpURI, which I use to build the links for the urlset/url/loc elements in the XML. It also uses PHP’s XMLWriter to generate the actual XML for the sitemap.
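One detail of XMLWriter is worth seeing in isolation: it escapes text content on its own, so values passed to writeElement() do not need to be pre-escaped. The snippet below is a minimal sketch of that behaviour. The render_loc() helper and the example.com URL are made up for illustration and are not part of the script.

```php
<?php
// Hypothetical helper (not in the original script): renders a single
// <loc> element to a string so the escaping is easy to inspect.
function render_loc(string $url): string
{
    $writer = new XMLWriter();
    $writer->openMemory();              // buffer in memory instead of php://output
    $writer->writeElement('loc', $url); // XMLWriter escapes &, < and > itself
    return $writer->outputMemory();     // return the buffered XML
}

// Made-up URL containing a character that must be escaped in XML.
echo render_loc('http://example.com/get.php?a=1&b=2'), "\n";
// prints: <loc>http://example.com/get.php?a=1&amp;b=2</loc>
```

Escaping the string again before handing it to XMLWriter would turn the ampersand into &amp;amp; in the output.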

// BEGIN CONFIG OPTIONS
$default_change_freq = 'monthly';
$url_prefix = 'http://downloads.cncfps.com/';
$blacklist = array('cgi-bin');
// END CONFIG OPTIONS

require_once(realpath('../phpuri.php'));

$filter = function($info, $key, $iter)
{
    global $blacklist;
    // Skip hidden files and directories (names starting with a dot).
    if (preg_match('/^\./', $info->getFilename()))
    {
        return false;
    }

    if($iter->hasChildren() && !in_array($info->getFilename(), $blacklist))
    {
        return true;
    }

    return $info->isFile();
};

$dirall = new RecursiveDirectoryIterator('./', RecursiveDirectoryIterator::SKIP_DOTS | RecursiveDirectoryIterator::KEY_AS_PATHNAME);
$dir = new RecursiveCallbackFilterIterator($dirall, $filter);
$files = new RecursiveIteratorIterator($dir);

header('Content-Type: application/xml');

$writer = new XMLWriter();
$writer->openURI('php://output');
$writer->setIndent(true);
$writer->startDocument('1.0', 'UTF-8');
$writer->startElementNS(NULL, 'urlset', 'http://www.sitemaps.org/schemas/sitemap/0.9');

$writer->startElement('url');
$writer->writeElement('loc', phpURI::parse($url_prefix)->join('sitemap.xml'));
$writer->writeElement('lastmod', date(DateTime::W3C));
$writer->writeElement('changefreq', 'always');
$writer->endElement(); // <url></url>

foreach($files as $file => $object)
{   
    $writer->startElement('url');

    $furi = phpURI::parse($url_prefix)->join($file);
    // No htmlentities() here: XMLWriter::writeElement() already escapes &, < and > in text content.

    $writer->writeElement('loc', $furi); // <loc></loc>
    $writer->writeElement('lastmod', date(DateTime::W3C, $object->getMTime())); // <lastmod></lastmod>

    // TODO: read filename.ext.txt for metadata
    $writer->writeElement('changefreq', $default_change_freq); // <changefreq></changefreq> 

    $writer->endElement(); // </url>
}

$writer->endElement(); // </urlset>
$writer->endDocument(); // EOF
$writer->flush();

exit;

Right now, it generates the sitemap on the fly, and only lists files found in the current directory (recursively). I do have plans to add support for per-file configuration (perhaps even per-directory configurations) via “filename.extension.txt” to change things like changefreq and priority per-file.
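The per-file configuration idea could be sketched roughly as follows. Everything here is an assumption, since the post does not define a sidecar format: the read_sidecar_meta() helper is hypothetical, and the sidecar is taken to be an INI-style key = value file.

```php
<?php
// Hypothetical helper: reads "<path>.txt" as an INI-style sidecar file
// (the format is an assumption) and overlays its values on the defaults.
function read_sidecar_meta(string $path, array $defaults): array
{
    $sidecar = $path . '.txt';
    if (!is_file($sidecar) || !is_readable($sidecar)) {
        return $defaults;               // no sidecar: keep the defaults
    }

    $meta = parse_ini_file($sidecar);
    if ($meta === false) {
        return $defaults;               // unparsable sidecar: keep the defaults
    }

    // Only accept keys that the defaults already define (e.g. changefreq, priority).
    return array_merge($defaults, array_intersect_key($meta, $defaults));
}

// Usage: falls back to the defaults when no sidecar exists next to the file.
$meta = read_sidecar_meta(__FILE__, array('changefreq' => 'monthly', 'priority' => '0.5'));
echo $meta['changefreq'], ' ', $meta['priority'], "\n";
```

Inside the foreach loop, the returned values would replace $default_change_freq. The sidecar files themselves would probably need to be excluded by the filter so they do not show up in the sitemap.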

If you want to run this from cron instead, change $writer->openURI('php://output'); to $writer->openURI('sitemap.xml'); and it will write the sitemap out to a file (note that the user running the script needs write permission for that location). If you do this, be sure to remove the hard-coded sitemap.xml entry from the script, since the file on disk will be listed by the directory scan anyway. The header() call is not needed in that case either.
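For a cron job, one optional variation is to write to a temporary file and rename it into place, so a crawler never fetches a half-written sitemap. This is only a sketch under assumptions: write_sitemap_atomically() and its $emit callback are hypothetical names, and the example URL is made up.

```php
<?php
// Hypothetical helper: writes the sitemap to "<target>.tmp" and then renames
// it over the target. $emit receives the XMLWriter and adds the <url> entries.
function write_sitemap_atomically(string $target, callable $emit): bool
{
    $tmp = $target . '.tmp';

    $writer = new XMLWriter();
    if (!$writer->openURI($tmp)) {
        return false;                   // could not open the temporary file
    }

    $writer->setIndent(true);
    $writer->startDocument('1.0', 'UTF-8');
    $writer->startElementNS(null, 'urlset', 'http://www.sitemaps.org/schemas/sitemap/0.9');
    $emit($writer);                     // caller writes the <url> elements
    $writer->endElement();              // </urlset>
    $writer->endDocument();
    $writer->flush();
    unset($writer);                     // release the file before renaming

    return rename($tmp, $target);
}

// Usage with a single made-up entry, written to the system temp directory.
$ok = write_sitemap_atomically(sys_get_temp_dir() . '/sitemap.xml', function (XMLWriter $w) {
    $w->startElement('url');
    $w->writeElement('loc', 'http://example.com/file.zip');
    $w->endElement();
});
echo $ok ? "written\n" : "failed\n";
```

If the temporary file lives inside the scanned directory, the filter would also need to skip it.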

Of course, if you use on-the-fly generation, you’ll probably want to use URL rewriting to map /sitemap.xml to /sitemap.php. The following Apache .htaccess config does that.

<IfModule mod_rewrite.c>
    RewriteEngine On

    RewriteBase /
    RewriteRule ^sitemap\.xml$ sitemap.php [L]
</IfModule>

Download: sitemapgen.zip
