Need more debugging ideas for PHP and MySQL script

software development

#1

I have this PHP script that I’ve had in production since 2007, running from a cron several times a day, that just started causing problems today. I’ve tried various ways of echoing results back to me and adding more execution time, but it looks like I need more ideas, because each time I run the script I get a different number of records.

The script deletes all the records in a MySQL table, fetches data from a BOR URL with cURL, chops up the lines, and inserts the records into the MySQL table. It then repeats the fetch, parse, and insert steps for nine more BOR sites.

The cron job runs the script with lynx, like lynx -dump http://sitehome/php/BOR/getFlow6.php. If I run the lynx line in an SSH shell, I get internal error messages, and if I run the script in a browser I get 500 Internal Server Error messages. Either way, I get a different number of records and sites inserted each time I run it.

When I uncomment the echo $contents; line in the first site’s block, a browser test shows the file contents dumped until it stops with “Content Encoding Error: The page you are trying to view cannot be shown because it uses an invalid or unsupported form of compression.”

[code]
require_once("…/ChartDirector/lib/phpchartdir.php");

set_time_limit(1090);

//retrieve Prosser flow data from BOR database
$theurl = "http://www.usbr.gov/pn-bin/yak/arc3.pl"
."?station=YRPW&year=2000&month=10&day=1&year=2015&month=5&day=20&pcode=QD";

$ch = curl_init();
$timeout = 5; // set to zero for no timeout
curl_setopt ($ch, CURLOPT_URL, $theurl);
curl_setopt ($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt ($ch, CURLOPT_CONNECTTIMEOUT, $timeout);
$contents = curl_exec($ch);
curl_close($ch);

// display file
//echo $contents;

// The feed appears to have useful parsing point identifiers.
// I just used them to get everything between them (including flags)
$cStartStr = "BEGIN DATA";
//$cEndStr = "END DATA";
$cEndStr = "Error: file access opening fab";
$cPageTail = stristr($contents, $cStartStr);
$nUsefulDataEndPos = strpos($cPageTail, $cEndStr);
$cUsefulData = substr($cPageTail, 0, $nUsefulDataEndPos);

// explode the content using newlines as delimiters
$aContents = explode(chr(10), $cUsefulData);
//print_r($aContents);
//echo 'line 34';
// Two new arrays initialized to empty
//$myDates = array();
//$myTemps = array();
//echo 'line 38';
// Set up and test database connection
@ $db = new mysqli('bor****', 'username', 'pw', 'table');
//echo 'Line 41';
if (mysqli_connect_errno())
{
echo 'Error: Could not connect to database. Please try again later.';
exit;
}
//echo 'line 47';
// Empty the flow table in preparation for filling it up with revised data
$query = "DELETE from Flow";
$result = $db->query($query);
// skip the leading and trailing junk
// Prolly don't want to do all these assignments in the loop. They're just used for readability
for ($i=3; $i<count($aContents)-1; $i++) {

// Dates are formatted as 10 characters
$cDateStr = substr($aContents[$i],0,10);
#convert date string (a date in human language) to a date in programming language
$myDates = strtotime($cDateStr);
//echo $cDateStr, $myDates;

// QD is everything in the trimmed value after the last space
$nQDVal = substr($aContents[$i], strrpos(trim($aContents[$i]), chr(32))+1);
//$query = "insert into Flow values ('".$myDates."', '".$nQDVal."', 'YRPW')";
$query = ($nQDVal == "MISSING ")
	? "insert into Flow values ('".$myDates."', null, 'YRPW')"
	: "insert into Flow values ('".$myDates."', '".$nQDVal."', 'YRPW')";
$result = $db->query($query);
}
//peep scene

//echo('<pre>');
//print_r($nQDVal);
//print_r($myDates);
//echo('</pre>');

//--------------------------------------------------------------------------
//retrieve Roza flow data from BOR database
$theurl = "http://www.usbr.gov/pn-bin/yak/arc3.pl"
."?station=RBDW&year=2000&month=9&day=25&year=2015&month=4&day=15&pcode=QD";

$ch = curl_init();
$timeout = 5; // set to zero for no timeout
curl_setopt ($ch, CURLOPT_URL, $theurl);
curl_setopt ($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt ($ch, CURLOPT_CONNECTTIMEOUT, $timeout);
$contents = curl_exec($ch);
curl_close($ch);

// display file
//echo $contents;

// The feed appears to have useful parsing point identifiers.
// I just used them to get everything between them (including flags)
$cStartStr = "BEGIN DATA";
//$cEndStr = "END DATA";
$cEndStr = "Error: file access opening fab";
$cPageTail = stristr($contents, $cStartStr);
$nUsefulDataEndPos = strpos($cPageTail, $cEndStr);
$cUsefulData = substr($cPageTail, 0, $nUsefulDataEndPos);

// explode the content using newlines as delimiters
$aContents = explode(chr(10), $cUsefulData);

// skip the leading and trailing junk
// Prolly don't want to do all these assignments in the loop. They're just used for readability
for ($i=3; $i<count($aContents)-1; $i++) {

// Dates are formatted as 10 characters
$cDateStr = substr($aContents[$i],0,10);
#convert date string (a date in human language) to a date in programming language
$myDates = strtotime($cDateStr);

// QD is everything in the trimmed value after the last space
$nQDVal = substr($aContents[$i], strrpos(trim($aContents[$i]), chr(32))+1);

// if ($nQDVal == "MISSING ")
// {
//	$query = "insert into Flow values ('".$myDates."', null, 'RBDW')";
// }
// else
// {
//	$query = "insert into Flow values ('".$myDates."', '".$nQDVal."', 'RBDW')";
// }
$query = ($nQDVal == "MISSING ")
	? "insert into Flow values ('".$myDates."', null, 'RBDW')"
	: "insert into Flow values ('".$myDates."', '".$nQDVal."', 'RBDW')";
$result = $db->query($query);
}

//peep scene
//echo('<pre>');
//print_r($nQDVal);
//print_r($myDates);
//echo('</pre>');
[/code]
and similar for 8 more sites.
[hr]
By uncommenting echo $contents; farther into the file, it appeared the script wasn’t getting far enough to reach the third site, CHCW. I added more to the time limit, then more, then more. Now I have set_time_limit(1090000), which just seems ridiculous, and I still only seem to get to the fourth site.

#2

Ok, so first of all, what changed? Did your script change? Did you change the version of PHP you are using? Did the data provider change its format? Things don’t just stop working unless something changed. Finding out what changed is key to solving the problem. For example, looking at your data provider, at the end of the file (which loads slowly), it says:

01/17/2013 1173.01 Error: file access opening fab

So clearly something has happened on the data provider’s end on 17 January.

[quote="MajorGeek, post:1, topic:59092"]The cron job runs the script with lynx, like lynx -dump http://sitehome/php/BOR/getFlow6.php. If I run the lynx line in an SSH shell, I get internal error messages, and if I run the script in a browser I get 500 Internal Server Error messages. Either way, I get a different number of records and sites inserted each time I run it.
[/quote]

So what are the error messages? That’s kind of important to know. You can get the 500 errors from /home/USER/logs/DOMAIN/http/error.log.

So this sounds like the data provider has changed something. Can you verify that it’s outputting exactly the same as it was before things broke?

Also, just glancing at your code, there’s probably an easier way of doing what you are trying to do. Could you give a bit more detail of what you are doing, a sample of the raw data from the data provider, what you want the data to look like before you insert it into the database, and how you are using the data in the end result? I suspect that, for example, you may not even need to use a database at all.

And finally, to help in troubleshooting, it helps to make your code readable. I’ve added syntax highlighting and cut out bug-finding comments.

[php]
require_once("…/ChartDirector/lib/phpchartdir.php");

set_time_limit(1090);

//retrieve Prosser flow data from BOR database
$theurl = "http://www.usbr.gov/pn-bin/yak/arc3.pl"
."?station=YRPW&year=2000&month=10&day=1&year=2015&month=5&day=20&pcode=QD";

$ch = curl_init();
$timeout = 5; // set to zero for no timeout
curl_setopt ($ch, CURLOPT_URL, $theurl);
curl_setopt ($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt ($ch, CURLOPT_CONNECTTIMEOUT, $timeout);
$contents = curl_exec($ch);
curl_close($ch);

// The feed appears to have useful parsing point identifiers.
// I just used them to get everything between them (including flags)
$cStartStr = "BEGIN DATA";
$cEndStr = "Error: file access opening fab";
$cPageTail = stristr($contents, $cStartStr);
$nUsefulDataEndPos = strpos($cPageTail, $cEndStr);
$cUsefulData = substr($cPageTail, 0, $nUsefulDataEndPos);

// explode the content using newlines as delimiters
$aContents = explode(chr(10), $cUsefulData);
@ $db = new mysqli('bor****', 'username', 'pw', 'table');
if (mysqli_connect_errno())
{
echo 'Error: Could not connect to database. Please try again later.';
exit;
}
// Empty the flow table in preparation for filling it up with revised data
$query = "DELETE from Flow";
$result = $db->query($query);
// skip the leading and trailing junk
// Prolly don't want to do all these assignments in the loop. They're just used for readability
for ($i=3; $i<count($aContents)-1; $i++) {

// Dates are formatted as 10 characters
$cDateStr = substr($aContents[$i],0,10);
#convert date string (a date in human language) to a date in programming language
$myDates = strtotime($cDateStr);

// QD is everything in the trimmed value after the last space
$nQDVal = substr($aContents[$i], strrpos(trim($aContents[$i]), chr(32))+1);
$query = ($nQDVal == "MISSING ")
	? "insert into Flow values ('".$myDates."', null, 'YRPW')"
	: "insert into Flow values ('".$myDates."', '".$nQDVal."', 'YRPW')";
$result = $db->query($query);
}

//--------------------------------------------------------------------------
//retrieve Roza flow data from BOR database
$theurl = "http://www.usbr.gov/pn-bin/yak/arc3.pl"
."?station=RBDW&year=2000&month=9&day=25&year=2015&month=4&day=15&pcode=QD";

$ch = curl_init();
$timeout = 5; // set to zero for no timeout
curl_setopt ($ch, CURLOPT_URL, $theurl);
curl_setopt ($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt ($ch, CURLOPT_CONNECTTIMEOUT, $timeout);
$contents = curl_exec($ch);
curl_close($ch);

// The feed appears to have useful parsing point identifiers.
// I just used them to get everything between them (including flags)
$cStartStr = "BEGIN DATA";
$cEndStr = "Error: file access opening fab";
$cPageTail = stristr($contents, $cStartStr);
$nUsefulDataEndPos = strpos($cPageTail, $cEndStr);
$cUsefulData = substr($cPageTail, 0, $nUsefulDataEndPos);

// explode the content using newlines as delimiters
$aContents = explode(chr(10), $cUsefulData);

// skip the leading and trailing junk
// Prolly don't want to do all these assignments in the loop. They're just used for readability
for ($i=3; $i<count($aContents)-1; $i++) {

// Dates are formatted as 10 characters
$cDateStr = substr($aContents[$i],0,10);
#convert date string (a date in human language) to a date in programming language
$myDates = strtotime($cDateStr);

// QD is everything in the trimmed value after the last space
$nQDVal = substr($aContents[$i], strrpos(trim($aContents[$i]), chr(32))+1);
$query = ($nQDVal == "MISSING ")
	? "insert into Flow values ('".$myDates."', null, 'RBDW')"
	: "insert into Flow values ('".$myDates."', '".$nQDVal."', 'RBDW')";
$result = $db->query($query);
}

[/php]
[hr]
When you use today’s date as the ending date, the load time is about 3.5 seconds and ends with:

01/17/2013      1173.01 
END DATA

But setting an end date far in the future gives a load time of almost 10 seconds and ends with:

01/17/2013      1173.01 
Error: file access opening fab

If I understand what you are trying to do, you are adding the latest daily average data to your database for graphing purposes with ChartDirector. It would be much more efficient to loop through the URLs you want to fetch (rather than repeating all your code) and adjust your query parameters to get just the last day’s data, then append that to your database rather than dumping the whole table and reloading everything just to gain one extra day’s data point (see the sketch after the sample output below).

Getting a single day’s data takes about 800 ms:
[font=courier]http://www.usbr.gov/pn-bin/yak/arc3.pl?station=YRPW&year=2013&month=1&day=17&year=2013&month=1&day=18&pcode=QD[/font]

USBR Pacific Northwest Region
Hydromet System Data Access

Although the Bureau of Reclamation makes efforts to maintain the accuracy
of data found in the Hydromet system databases, the data is largely unverified
and should be considered preliminary and subject to change.  Data and services
are provided with the express understanding that the United States Government
makes no warranties, expressed or implied, concerning the accuracy, complete-
ness, usability or suitability for any particular purpose of the information
or data obtained by access to this computer system, and the United States
shall be under no liability whatsoever to any individual or group entity by
reason of any use made thereof. 
BEGIN DATA
YRPW
      DATE           QD 
01/17/2013      1173.01 
END DATA
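To sketch the idea (the station list, pcode values, and parsing step here are assumptions based on your code, not something I’ve tested against the live site):

[php]
// Stations to fetch: station code => pcode.
$stations = array(
    'YRPW' => 'QD',
    'RBDW' => 'QD',
    // ... the other 8 sites ...
);

// Fetch just yesterday-to-today for each station and append it,
// instead of deleting and reloading the whole Flow table.
foreach ($stations as $station => $pcode) {
    $theurl = "http://www.usbr.gov/pn-bin/yak/arc3.pl"
        . "?station=" . $station
        . "&year="  . date('Y', strtotime('-1 day'))
        . "&month=" . date('n', strtotime('-1 day'))
        . "&day="   . date('j', strtotime('-1 day'))
        . "&year="  . date('Y')
        . "&month=" . date('n')
        . "&day="   . date('j')
        . "&pcode=" . $pcode;

    $ch = curl_init();
    curl_setopt($ch, CURLOPT_URL, $theurl);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
    curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, 5);
    $contents = curl_exec($ch);
    curl_close($ch);

    echo $station . ': ' . strlen((string) $contents) . " bytes\n";
    // ... parse $contents and insert the new rows, as in your per-site blocks ...
}
[/php]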

#3

Just an idea, you could put something like this in a loop of URLs and adjust the dates so that you only download the most recent few days and update the database rather than dump and reload:

[code]$ wget -q -O - "http://www.usbr.gov/pn-bin/yak/arc3.pl?station=YRPW&year=2013&month=1&day=15&year=2013&month=1&day=18&pcode=QD" \
| php -B '
    $data = array();
    $func = function ( $x ) { return preg_split( "/\s+/", trim( $x ) ); };
' -R '
    if ( preg_match( "/^[0-9]/", $argn ) ) { 
        $data[] = current( 
            array_map( 
                $func, 
                preg_grep( "/^[0-9]/", array ( $argn )  ) 
            ) 
        ); 
    }
' -E '
    print_r($data);
'
Array
(
    [0] => Array
        (
            [0] => 01/15/2013
            [1] => 1130.06
        )

    [1] => Array
        (
            [0] => 01/16/2013
            [1] => 1250.90
        )

    [2] => Array
        (
            [0] => 01/17/2013
            [1] => 1173.01
        )

)
[/code]

Personally, I wouldn’t use PHP for the task, but to each his own. Your profile notes that you are learning Perl. This little project would be a perfect task to do in Perl for learning purposes. It’s quite small and simple, so a good learning experience.


#4

I haven’t changed PHP versions, and I didn’t change the PHP code, so something must have changed at BOR. Suppose it takes longer to retrieve a page for each site? Is there a way to get PHP to wait longer for each URL?

“Error: file access opening fab” has always been a feature of the URL retrieval. Early on I figured out that if I asked for records out to 2015, I’d hit that string, so I added it to the PHP code as the end-of-file marker.

Perl was many years ago; I’ve forgotten most of what I ever knew about it. Now I develop in PHP and MySQL, though recently I’ve had to learn more Python to interact with ArcGIS.

I like the wget approach.

You’re correct, it would be more efficient to loop through the same code with a different site each time. But I first got it working with one BOR site, then people asked for another, then a couple more, then a couple more. Each time I just copied and pasted code from the one working section to the new ones, until I ended up with this ridiculously long script.

I have been dumping the whole table and repopulating the whole past five years because BOR sometimes revises old records after inspection. However, I haven’t seen a case where they have corrected data more than a year old.


#5

I see.

Personally, I would start by making the following changes:

[list]
[*]Optimise your code so that you have an array of URLs to fetch and just one block of code to fetch and parse. If some sites require unique handling, that can be done as exceptions to your core fetching/parsing code.
[*]Change the dates of the URLs so that you only fetch daily values regularly, then run a weekly or monthly version (same script, just receiving an extra parameter or running an extra block of code depending on the day of the week/month) to update the entire database in case of revisions.
[*]Don’t request dates that cause errors on the BOR site. In my limited testing, the response times were slower when I did.
[*]Stick with PHP if you like it, but consider using some PREG functions (Perl-compatible regular expressions), which will make your parsing simpler and easier to maintain (see the sketch just after this list).
[/list]
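For example, something like this pulls out the data lines in one step (a sketch; it assumes the raw response looks like the sample I posted above, and that $contents holds the body returned by curl_exec()):

[php]
// Match lines like "01/17/2013      1173.01" anywhere in the response,
// instead of locating "BEGIN DATA" / "Error: ..." and counting header lines.
preg_match_all('/^(\d{2}\/\d{2}\/\d{4})\s+(\S+)\s*$/m', $contents, $matches, PREG_SET_ORDER);

foreach ($matches as $m) {
    $myDates = strtotime($m[1]); // the DATE column
    $nQDVal  = $m[2];            // the QD column (may be "MISSING")
    // ... build and run the insert query here ...
}
[/php]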

After doing those things I’d look into increasing wait times if you are still having problems. Personally, I don’t think the relatively slow (3-5s) response time is an issue.
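One specific thing to check on the wait-time front: your script only sets CURLOPT_CONNECTTIMEOUT, which limits the connection phase. The total request time is capped by CURLOPT_TIMEOUT, which you never set (cURL’s default is no limit at all, so a stalled transfer just hangs until PHP kills the script):

[php]
curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, 5); // seconds allowed to establish the connection
curl_setopt($ch, CURLOPT_TIMEOUT, 60);       // seconds allowed for the entire request
[/php]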

FWIW, I have a similar type of cronjob set up using a PHP script that fetches the results with [font=courier]file_get_contents()[/font], and I have never had a problem with it.


#6

I wonder if I’m running into a runtime limit set by the system php.ini, so that my increases to set_time_limit aren’t having any effect.
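I suppose I could check what limit is actually in effect with something like:

[php]
// What does php.ini give me, and does my set_time_limit() call stick?
echo 'max_execution_time: ' . ini_get('max_execution_time') . "\n";
set_time_limit(1090);
echo 'after set_time_limit: ' . ini_get('max_execution_time') . "\n";
// Note: run from the command line, PHP's default is 0 (no limit).
[/php]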

I seem to get five sites completely retrieved today, but still lack three more of lesser importance.

Originally the script used file_get_contents, as I learned from PHP and MySQL Web Development (Welling and Thompson), but it stopped working. I learned from the fine folks on this forum that file_get_contents introduced security issues, so it was withdrawn by DH, and the forum showed me how to use cURL instead. https://discussion.dreamhost.com/post-62841.html
https://discussion.dreamhost.com/thread-62849.html

I guess I started relying on the string “Error: file access opening fab” to indicate the end of the data because I could reliably hit it just by asking for data out to 2015, so I wouldn’t need to revise the end date for many years. If I tried to cut it closer to today’s date, the URL call would sometimes end with “END DATA” and sometimes with “Error: file access opening fab”, so I just went with asking for the data out to 2015.


#7

Add some statements to help you figure out what’s going on. Since you are running this as a cronjob, make sure your email is set to get the results and add some print statements. Also make sure you add [font=courier]2>&1[/font] to the end of your crontab entry so that you can get any error messages as well. You can add timing by storing the time ([font=courier]$time = microtime(true);[/font]) before a request, then subtracting that from the current time to see how long it’s taking.
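For example, around each request (a sketch wrapping your existing curl_exec() calls):

[php]
$time = microtime(true); // start the clock
$contents = curl_exec($ch);
printf("%s took %.2f s, returned %d bytes\n",
    $theurl, microtime(true) - $time, strlen((string) $contents));
[/php]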

That’s why I mentioned [font=courier]file_get_contents()[/font]. I don’t think it’s blocked by DH, because I’m using it in a cronjob written in PHP.

[font=courier]/^[0-9]/[/font] will get you only the lines that begin with numbers. You don’t need to find anything before and after the text of interest.


#8

I like many of these ideas and will try them. However, I need to get the flow data reliably available soon. One thing I have tested and know will work is a short version of my script that dumps the table for just one site and then inserts new rows, run once for each of the 10 sites from separate lines in my cron, with the start times separated by a few minutes. But I’m stuck on the syntax of just one string insertion in the URL, pcode. I guess I’m getting confused by its position at the end of the URL, so I can’t get the right combination of quotes:
$site = $_GET['site'];
$pcode = $_GET['pcode'];

//retrieve Prosser flow data from BOR database
$theurl = "http://www.usbr.gov/pn-bin/yak/arc3.pl"
."?station=".$site."&year=2000&month=10&day=1&year=2017&month=5&day=20&pcode="$pcode;
echo $theurl;

Looping through the code 10 times, using site and pcode from an array, is more efficient programming, but I’m worried that I’ll hit the same kind of processing-time limit I seem to be hitting now. With my original script that runs through 10 sites in succession, I get farther through the script if I run it in the early morning, and a different number of records and sites back each time I run it.

Now I see that I’m also getting 404 errors from similar PHP scripts that retrieve water temperature and air temperature for a bunch of sites from this same BOR system.


#9

Personally I prefer something like this:

[code]//retrieve Prosser flow data from BOR database
$theurl = "http://www.usbr.gov/pn-bin/yak/arc3.pl"
. "?station=" . $site
. "&year=2000"
. "&month=10"
. "&day=1"
. "&year=2017"
. "&month=5"
. "&day=20"
. "&pcode=" . $pcode;

echo $theurl;[/code]

But this is just as easy:

[code]$theurl = "http://www.usbr.gov/pn-bin/yak/arc3.pl?station=$site&year=2000&month=10&day=1&year=2017&month=5&day=20&pcode=$pcode";
echo $theurl;[/code]

As I said, add some echo statements and timing or else you are just shooting in the dark. Cronjobs can run for a long time.

Add echo statements to find out what your variables actually are. You are probably having the same problem as above.

Also, if something changed with the data provider, it’s possible that they have introduced anti-scraping countermeasures to prevent people from doing exactly what you are doing. Scraping is usually prohibited in most websites’ ToS.


#10

Oh, crap, why didn’t I see this before I rewrote everything to run separately for each site based on URL parameters? After every curl_init() in the original script there is a $timeout = 5 statement. That’s probably where the script crashes.

Update - Well, I increased the timeout value to 15 and the original script still crashes. I’m better off running each site separately.


#11

Which is exactly why you should be doing a loop. You have repeated code which makes it difficult to maintain (and read).

Take the time to write the best code you can and it will save you headaches later. It takes time in the beginning and may even seem wasteful for short scripts, but unless they are single-use, writing DRY code and separating it into logical, self-contained chunks, each responsible for one task, will help you in the long run.
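For example, the fetch step in your script is an obvious candidate for its own function (a sketch; the function name is mine):

[php]
// One self-contained task: fetch a URL and return its body, or null on failure.
function fetch_url($url, $timeout = 60) {
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
    curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, 5);
    curl_setopt($ch, CURLOPT_TIMEOUT, $timeout);
    $contents = curl_exec($ch);
    curl_close($ch);
    return ($contents === false) ? null : $contents;
}
[/php]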


#12

Now that I have just one piece of PHP code executing once for each of the 10 sites, I notice that something has changed recently at BOR’s site. Sometimes, when my script sends this URL to BOR, “http://www.usbr.gov/pn-bin/yak/arc3.pl?station=YRPW&year=2000&month=10&day=1&year=2017&month=5&day=20&pcode=QD”, the site responds with records all the way back to 1981, ignoring the specified start date of 2000-10-1. Same with site “RBDW”. Then the next day, the BOR site responds correctly. Maybe at the times when the BOR site gives me all those years before 2000, my old WETT script hits a PHP time limit or a cURL timeout.

I wonder if the BOR routine runs into problems because the same parameter names year, month, and day are used twice. It looks like the BOR script is Perl.


#13

That would be the case with PHP (unless you want to parse [font=courier]$_SERVER['QUERY_STRING'][/font] yourself):

[php]<?php
print '<pre>';
print_r($_GET);
print '</pre>';
?>[/php]

results in:

[code]Array
(
    [station] => YRPW
    [year] => 2017
    [month] => 5
    [day] => 20
    [pcode] => QD
)[/code]
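If you actually wanted both values of each repeated parameter in PHP, you would have to split the query string yourself, along these lines:

[php]
// Collect repeated parameters into arrays instead of letting PHP
// keep only the last occurrence of each name.
$params = array();
foreach (explode('&', $_SERVER['QUERY_STRING']) as $pair) {
    list($key, $value) = array_pad(explode('=', $pair, 2), 2, '');
    $params[urldecode($key)][] = urldecode($value);
}
print_r($params); // e.g. [year] => Array ( [0] => 2000 [1] => 2017 )
[/php]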

But this Perl:

[code]#!/usr/bin/env perl
use strict;
use CGI;
use Data::Dumper;

my $q = CGI->new;
print $q->header();
print '<pre>';
for my $param ( $q->param() ) {
    my @data = $q->param( $param );
    print Dumper \@data;
}
print '</pre>';[/code]

results in:

[code]$VAR1 = [
          'YRPW'
        ];
$VAR1 = [
          '2000',
          '2017'
        ];
$VAR1 = [
          '10',
          '5'
        ];
$VAR1 = [
          '1',
          '20'
        ];
$VAR1 = [
          'QD'
        ];[/code]

It may not be the best way to do it, but there’s always more than one way to do it, and it works.

Once again, as I’ve said multiple times before, adding timings, printing debugging information, and requesting only the dates you need is probably the best approach to take, rather than blaming the data provider. Said provider’s script was probably written years ago by someone who doesn’t even work there anymore and likely goes by the motto “don’t fix it if it ain’t 110% broke.”