Web “scraping” the easy way

Quick Intro: Say you go to the ABC website and want to get the contents of the alert at the top of the page, which is enclosed in a <div> with the class "alertContent".

Solution: phpQuery: http://code.google.com/p/phpquery/

3 lines of code:

require_once 'phpQuery.php';
$doc = phpQuery::newDocumentFileHTML('http://abc.com');
echo $doc['.alertContent'];

This will include all of the HTML from that div, so you may want to clean this up a little bit.
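To make "all of the HTML from that div" concrete, here is a minimal sketch of the same kind of selection done with PHP's built-in DOM extension instead of phpQuery. It is not how phpQuery works internally, just a comparable lookup. The sample markup and the variable names ($html, $alertHtml, $alertText) are made up for illustration, so nothing is fetched from the network.

```php
<?php
// Hypothetical sample markup standing in for the fetched page.
$html = '<html><body><div class="alertContent">'
      . '<p>Breaking: <a href="/news/">read more</a></p>'
      . '</div></body></html>';

// Collect parse warnings quietly; real-world HTML is rarely valid.
libxml_use_internal_errors(true);
$dom = new DOMDocument();
$dom->loadHTML($html);
libxml_clear_errors();

// XPath has no class selector, so match the class name as a whole word
// inside the class attribute.
$xpath = new DOMXPath($dom);
$nodes = $xpath->query(
    '//div[contains(concat(" ", normalize-space(@class), " "), " alertContent ")]'
);

$alertHtml = '';
$alertText = '';
if ($nodes->length > 0) {
    $div = $nodes->item(0);
    $alertHtml = $dom->saveHTML($div);    // the div with all of its markup
    $alertText = trim($div->textContent); // the text only, tags dropped
}

echo $alertHtml, "\n";
echo $alertText, "\n";
```

The first echo shows the div with every tag intact, which is roughly what the phpQuery echo prints. The second shows one blunt way of cleaning up, at the cost of losing the link.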

The echo command becomes:

// ereg() was removed in PHP 7; preg_match() is the replacement.
if (preg_match('#<p>(.*)</p>#s', (string) $doc['.alertContent'], $regs)) {
    echo $regs[1];
}

Based on their current content, this gives me: The Season 9 cast has been revealed for <i style="font-size:15px;">Dancing With The Stars!</i><br><a href="/shows/dancing-with-the-stars/cast-announcement/">See all 16 new stars</a>.

Still not 100%, since the URL in that link is relative to the ABC website.

I smell another regex coming on:

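if (preg_match('#<p>(.*)</p>#s', (string) $doc['.alertContent'], $regs)) {
    // Split around the first href value so the site root can be prepended.
    if (preg_match('#^(.*?<a href=")([^"]*)(".*)$#s', $regs[1], $regs2)) {
        echo $regs2[1] . 'http://beta.abc.go.com' . $regs2[2] . $regs2[3];
    } else {
        echo $regs[1];
    }
}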
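If the alert ever contains more than one link, the same fix can be applied to every root-relative href in one pass. The following is only a sketch under that assumption. absolutizeLinks() is a hypothetical helper name, not part of phpQuery, and the sample string is the alert text quoted above.

```php
<?php
// Hypothetical helper: prefix every root-relative href with a base URL.
function absolutizeLinks(string $html, string $base): string
{
    $base = rtrim($base, '/');
    // Match href="/..." but leave protocol-relative href="//..." alone.
    return preg_replace_callback(
        '#(href=")(/(?!/)[^"]*)#i',
        function (array $m) use ($base) {
            return $m[1] . $base . $m[2];
        },
        $html
    );
}

$sample = 'The Season 9 cast has been revealed for '
        . '<i style="font-size:15px;">Dancing With The Stars!</i><br>'
        . '<a href="/shows/dancing-with-the-stars/cast-announcement/">See all 16 new stars</a>.';

echo absolutizeLinks($sample, 'http://beta.abc.go.com'), "\n";
```

Links that are already absolute are left untouched, so the helper should be safe to run on mixed content.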

 

