PHP Labs :: Link Grazer

Lately, I've had more and more of an interest in data mining. Perhaps that's due to some work that I recently got to do with Google through one of our clients at Active Media Architects, or perhaps it's me floundering around for an area of Computer Science to focus on as I head off to pursue my Master's degree.

Either way, when I was recently approached by an old friend and employer to redo their sitemap, I thought it would be fun to try to build a PHP application that would crawl their site, store the necessary information, and dynamically build out the sitemap. That may be above and beyond what I end up doing, but I thought it would be fun to build out the crawler anyway.

Enter Link Grazer. You feed Link Grazer a URL (either http or https) and it goes out, fetches the page, examines the HTML, and brings back an array of the URLs found in that page's anchor tags.
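If you're curious how that boils down to code, here's a rough sketch of the idea (simplified, not Link Grazer's actual source; the function name is just for illustration):

```php
<?php
// Fetch a page and collect the href of every anchor tag.
function grazeLinks($url)
{
    // file_get_contents handles both http and https
    // (https requires PHP's OpenSSL support).
    $html = @file_get_contents($url);
    if ($html === false) {
        return array();
    }

    $doc = new DOMDocument();
    @$doc->loadHTML($html); // quiet the warnings real-world markup throws

    $links = array();
    foreach ($doc->getElementsByTagName('a') as $anchor) {
        $href = $anchor->getAttribute('href');
        if ($href !== '') {
            $links[] = $href;
        }
    }
    return array_unique($links);
}

print_r(grazeLinks('http://example.com/'));
```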

Where is the value in that, you may ask? It's an engine that can be used for gathering links and spidering sites to build things like sitemaps, web-based link checkers, a search engine (for those times when Google just isn't an option), and even a web-based, site-wide spell checker (why not, right?).

Still TODO: Instead of returning an array of strings for URLs, I'd like to have it return an array of arrays or objects carrying more information about each link. Especially useful would be the text content of the <a> tag, not just what was in the href.
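Something along these lines is what I have in mind (a hypothetical sketch; the function name and array keys are mine, not the current API):

```php
<?php
// Return one associative array per link instead of a bare URL string,
// pairing each href with the anchor's text content.
function grazeLinksDetailed($html)
{
    $doc = new DOMDocument();
    @$doc->loadHTML($html); // tolerate messy real-world markup

    $links = array();
    foreach ($doc->getElementsByTagName('a') as $anchor) {
        $links[] = array(
            'href' => $anchor->getAttribute('href'),
            'text' => trim($anchor->textContent), // the subject of the <a> tag
        );
    }
    return $links;
}
```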

Also, I'd like the request to deliver a User Agent, and I originally coded Link Grazer to work that way: it opened a socket connection, made the request, and then parsed through the server response. This, however, was WAY too slow (about 30 seconds per page), so currently it does not deliver a User Agent to the site that it is parsing.
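One route I may try: a stream context lets file_get_contents send a User Agent without hand-rolling the HTTP request over a socket, so it should avoid that slowdown entirely (an untested sketch; the User Agent string is just a placeholder):

```php
<?php
// Attach a User-Agent header to the request via a stream context.
$context = stream_context_create(array(
    'http' => array(
        'user_agent' => 'LinkGrazer/0.1 (+http://example.com)',
    ),
));

$html = file_get_contents('http://example.com/', false, $context);
```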

You can test it out by entering any http or https URL and pressing "Submit".
