PHP curl screen scraping program needs an if fork


#1

I have some PHP programs that go to a Bureau of Reclamation site to screen scrape some flow and temperature data. The URLs that I use look like http://www.usbr.gov/pn-bin/yak/arc3.pl?station=YRPW&year=1980&month=10&day=1&year=2010&month=3&day=20&pcode=QD where the beginning and ending dates are in the URL. I run these scripts with lynx -dump. The PHP script looks for a couple of strings in the output that mark the start and end of the useful data:

$cStartStr = "BEGIN DATA";
$cEndStr = "END DATA";
$cPageTail = stristr($contents, $cStartStr);
$nUsefulDataEndPos = strpos($cPageTail, $cEndStr);
$cUsefulData = substr($cPageTail, 0, $nUsefulDataEndPos);

A difficulty has just appeared. If I put the end date too far out in the future, “END DATA” doesn’t appear at the end of the useful data. Instead, “Error: file access opening fab” appears at the end of the data. I suppose I could set the end date far enough in the future that I would consistently see “Error: file access opening fab” and use that for my end string, but it would be better to use either, depending on which string is encountered. How would you handle that? Do I have to retrieve the URL once to see what the end string is going to be, and again to get the data?

This signature line intentionally blank.


#2

Retrieve the page once into a $var.

Then test which end string it contains and set $cEndStr from the result, rather than hardcoding “END DATA”.
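A minimal sketch of that idea, assuming the page is already in $contents and that the two end strings quoted in this thread are the only ones the site produces. The function name extractUsefulData and the sample text are made up for illustration; nothing here comes from the site itself.

```php
<?php
// Hypothetical helper: returns the text from the start marker up to
// whichever of the candidate end markers appears first.
function extractUsefulData(string $contents, string $startStr, array $endStrs): string
{
    $tail = stristr($contents, $startStr);
    if ($tail === false) {
        return '';               // start marker missing: nothing useful
    }
    $endPos = strlen($tail);     // default: keep everything after the start marker
    foreach ($endStrs as $endStr) {
        $pos = stripos($tail, $endStr);
        if ($pos !== false && $pos < $endPos) {
            $endPos = $pos;      // remember the earliest end marker found
        }
    }
    return substr($tail, 0, $endPos);
}

// Made-up sample standing in for the real page.
$sample = "header\nBEGIN DATA\n10/01/1980 123.4\nError: file access opening fab\n";
echo extractUsefulData($sample, 'BEGIN DATA', ['END DATA', 'Error: file access opening fab']);
```

The page is fetched only once. If neither end string is present, the sketch keeps everything after the start marker instead of returning an empty string.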



#3

So there’s a way that the script knows it’s at the end of the data?



#4

You’d have to read the last line of $contents into your var.

Assuming they retain that “Error: blah blah” result, an IF might be okay…

$cStartStr = "BEGIN DATA";
$cEndStr = (stristr($contents, 'Error')) ? 'Error: file access opening fab' : 'END DATA';
$cPageTail = stristr($contents, $cStartStr);
$nUsefulDataEndPos = strpos($cPageTail, $cEndStr);
$cUsefulData = substr($cPageTail, 0, $nUsefulDataEndPos);

But it’s a cop-out 'cause we’re testing against hardcoded strings that may change in the future. If an admin with a sense of humor changed the error message to return “Oh Noes! No Moah Dataz!” then the test will be up a certain creek without a paddle. Setting $cEndStr with the last line of $contents would really be the way to go.

Something like this should work, assuming the end marker really is the last non-empty line of $contents…

$cStartStr = "BEGIN DATA";
$aLines = preg_split('/\R/', trim($contents)); // split the page into lines
$cEndStr = trim(end($aLines)); // set $cEndStr with the last line
$cPageTail = stristr($contents, $cStartStr);
$nUsefulDataEndPos = strpos($cPageTail, $cEndStr);
$cUsefulData = substr($cPageTail, 0, $nUsefulDataEndPos);



#5

Instead of trying to determine where the end of the data is by an arbitrary string, why not match on the pattern of the data you want? The date and float have a particular format, so why not capture that with a bit of regex like this:

$content = file_get_contents('http://www.usbr.gov/pn-bin/yak/arc3.pl?station=YRPW&year=1980&month=10&day=1&year=2090&month=3&day=20&pcode=QD');
$matches = array();
// Capture the date and the (optionally signed) number that follows it
preg_match_all('!(\d+/\d+/\d+)\s+([-+]?(?:[0-9]*\.[0-9]+|[0-9]+))!', $content, $matches, PREG_SET_ORDER);
if (!empty($matches)) {
    var_dump($matches);
}

You can then iterate through $matches, where $matches[$i][1] will contain the date and $matches[$i][2] will contain the float. For example:

foreach ($matches as $match) {
    echo "On {$match[1]} the QD was {$match[2]}\n";
}

Andy
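As a follow-up to the loop above, here is a rough sketch of turning the matches into an array keyed by date, which may be handier than echoing them. It assumes the dates on the page are month/day/year and that each date is followed by a single number; the function name parseReadings and the sample text are made up for illustration.

```php
<?php
// Hypothetical helper: parses "date  value" rows into [ISO date => float].
function parseReadings(string $content): array
{
    $readings = [];
    preg_match_all(
        '!(\d+)/(\d+)/(\d+)\s+([-+]?(?:[0-9]*\.[0-9]+|[0-9]+))!',
        $content,
        $matches,
        PREG_SET_ORDER
    );
    foreach ($matches as $m) {
        // Assumption: $m[1] is the month, $m[2] the day, $m[3] the year.
        $date = sprintf('%04d-%02d-%02d', $m[3], $m[1], $m[2]);
        $readings[$date] = (float) $m[4];
    }
    return $readings;
}

// Made-up sample standing in for the real page.
$sample = "BEGIN DATA\n10/01/1980   1234.50\n10/02/1980   1190.00\nEND DATA\n";
print_r(parseReadings($sample));
```

Because only lines that look like data are picked up, it does not matter whether the page ends with “END DATA” or with the error message.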