[PHP] Sitemap generator for directory listing

Sitemaps are lovely, aren’t they? They’re not so easy to generate, though (or, *gosh*, to write by hand). Being a programmer, I decided to write a little PHP script to generate one on the fly for a massive (almost 3,000 files) downloads site that I host for gamers. There’s a download link attached to this post, but here’s a look at the code.

It has one dependency, phpURI, which I use to build the link that goes into the loc element of each url entry in the urlset. It also uses PHP’s XMLWriter to generate the actual XML for the sitemap.
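To give a feel for XMLWriter on its own, here is a minimal sketch that builds a one-entry sitemap in memory. The URL is a made-up example, and openMemory() is used only so the result can be inspected as a string. Note that writeElement() escapes characters such as & for you, so the text does not need to be pre-escaped.

```php
<?php
// Minimal sketch: a one-entry sitemap built in memory with XMLWriter.
// The URL below is a hypothetical example, not a real file on the site.
$writer = new XMLWriter();
$writer->openMemory();                  // buffer the XML instead of sending it
$writer->startDocument('1.0', 'UTF-8');
$writer->startElementNS(null, 'urlset', 'http://www.sitemaps.org/schemas/sitemap/0.9');

$writer->startElement('url');           // <url>
$writer->writeElement('loc', 'http://downloads.example.com/maps/a&b.zip');
$writer->writeElement('changefreq', 'monthly');
$writer->endElement();                  // </url>

$writer->endElement();                  // </urlset>
$writer->endDocument();

$xml = $writer->outputMemory();         // fetch the buffered document as a string
echo $xml;
```

In the resulting document the ampersand in the file name comes out as `&amp;`, which is what an XML parser expects.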

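<?php
// phpURI has to be loaded first; the file name here is an assumption, adjust it to your copy.
require_once 'phpuri.php';

$default_change_freq = 'monthly';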
$url_prefix = 'http://downloads.cncfps.com/';
$blacklist = array('cgi-bin');


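// Decide which entries the recursive iterator should keep.
$filter = function ($info, $key, $iter) use ($blacklist) {
    // Skip hidden files and directories (names starting with a dot).
    if (preg_match('/^\./', $info->getFilename())) {
        return false;
    }

    // Descend into directories unless they are blacklisted.
    if ($iter->hasChildren() && !in_array($info->getFilename(), $blacklist)) {
        return true;
    }

    // Otherwise keep regular files only.
    return $info->isFile();
};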

$dirall = new RecursiveDirectoryIterator('./', RecursiveDirectoryIterator::SKIP_DOTS | RecursiveDirectoryIterator::KEY_AS_PATHNAME);
$dir = new RecursiveCallbackFilterIterator($dirall, $filter);
$files = new RecursiveIteratorIterator($dir);

header('Content-Type: application/xml');

$writer = new XMLWriter();
$writer->openURI('php://output');
$writer->startDocument('1.0', 'UTF-8');
$writer->startElementNS(NULL, 'urlset', 'http://www.sitemaps.org/schemas/sitemap/0.9');

// Entry for the sitemap itself
$writer->startElement('url'); // <url>
$writer->writeElement('loc', phpURI::parse($url_prefix)->join('sitemap.xml'));
$writer->writeElement('lastmod', date(DateTime::W3C));
$writer->writeElement('changefreq', 'always');
$writer->endElement(); // </url>

foreach ($files as $file => $object) {
    // Percent-encode each path segment; XMLWriter escapes XML characters itself.
    $path = implode('/', array_map('rawurlencode', explode('/', $file)));
    $furi = phpURI::parse($url_prefix)->join($path);

    $writer->startElement('url'); // <url>
    $writer->writeElement('loc', $furi); // <loc></loc>
    $writer->writeElement('lastmod', date(DateTime::W3C, $object->getMTime())); // <lastmod></lastmod>

    // TODO: read filename.ext.txt for metadata
    $writer->writeElement('changefreq', $default_change_freq); // <changefreq></changefreq>

    $writer->endElement(); // </url>
}

$writer->endElement(); // </urlset>
$writer->endDocument(); // EOF


Right now, it generates the sitemap on the fly, and only lists files found in the current directory (recursively). I do have plans to add per-file configuration (perhaps even per-directory configuration) via “filename.extension.txt”, to change things like changefreq and priority for individual files.
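For what it’s worth, here is one way the per-file idea might look. This is only a sketch: the sidecar format (one "key = value" pair per line, # for comments) and the helper name read_sitemap_meta() are assumptions of mine, since nothing about the format has been decided yet.

```php
<?php
// Hypothetical helper: reads overrides from "<file>.txt" sitting next to a file.
// The "key = value" line format is an assumption, not a settled design.
function read_sitemap_meta(string $path, array $defaults): array
{
    $sidecar = $path . '.txt';
    if (!is_file($sidecar)) {
        return $defaults;               // no sidecar, keep the defaults
    }

    $meta = $defaults;
    foreach (file($sidecar, FILE_IGNORE_NEW_LINES) as $line) {
        $line = trim($line);
        // Skip blank lines, comments, and anything without a "key = value" shape.
        if ($line === '' || $line[0] === '#' || strpos($line, '=') === false) {
            continue;
        }
        list($key, $value) = array_map('trim', explode('=', $line, 2));
        // Only accept keys that have a default, so typos cannot add stray elements.
        if (array_key_exists($key, $defaults)) {
            $meta[$key] = $value;
        }
    }
    return $meta;
}

// Usage: write a throwaway sidecar next to a temp file and read it back.
$file = tempnam(sys_get_temp_dir(), 'map');
file_put_contents($file . '.txt', "# overrides\nchangefreq = weekly\npriority = 0.8\n");
$defaults = array('changefreq' => 'monthly', 'priority' => '0.5');
$meta = read_sitemap_meta($file, $defaults);
unlink($file . '.txt');
$plain = read_sitemap_meta($file, $defaults); // sidecar is gone, defaults come back
unlink($file);
```

Inside the main loop the returned values would replace $default_change_freq. The sidecar files themselves would also need to be filtered out of the listing, otherwise they end up in the sitemap too.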

If you want to run this from cron instead, simply change $writer->openURI('php://output'); to $writer->openURI('sitemap.xml'); and it will write the sitemap out to a file (note that the script needs write permission for that location). If you do this, be sure to remove the sitemap.xml entry from the script, since the file on disk will then be picked up by the directory scan itself. The header() call can go too, as it has no effect on the command line.
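If you would rather keep a single script for both modes, the output target can be picked from how PHP was started. A small sketch, assuming cron runs the script through the command-line PHP binary (which reports PHP_SAPI as 'cli'); the helper name sitemap_target() is made up for illustration.

```php
<?php
// Hypothetical helper: choose where XMLWriter should write, based on the SAPI name.
function sitemap_target(string $sapi): string
{
    // Command line (e.g. cron): write a file. Web request: stream to the client.
    return $sapi === 'cli' ? 'sitemap.xml' : 'php://output';
}

$target = sitemap_target(PHP_SAPI);

if ($target === 'php://output') {
    // The content type only matters when answering a web request.
    header('Content-Type: application/xml');
}
```

$target can then be passed to $writer->openURI(). Keep in mind that a cron job may start in a different working directory, so the relative './' scan path might need a chdir(__DIR__) first.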

Of course, if you use on-the-fly generation, you’ll probably want to use URL rewriting to map /sitemap.xml to /sitemap.php. The following Apache .htaccess config does that.

<IfModule mod_rewrite.c>
    RewriteEngine On

    RewriteBase /
    RewriteRule ^sitemap\.xml$ sitemap.php [L]
</IfModule>

Download: sitemapgen.zip