Urlencode and special characters

#1

I’m making a small web app using AJAX to retrieve file names from the server and display them on the web page.
The problem I have is it seems urlencode and rawurlencode do not work correctly.

Here is an example:

  • I use urlencode on the filename “02 - Bagatelle in A Minor, WoO 59, Für Elise.mp3”.
  • I use Javascript to retrieve the data through AJAX which then uses decodeURIComponent to get back to the original string.
  • I get an error stating the string is a malformed URI.

I did some digging and found this website: http://www.the-art-of-web.com/javascript/escape/ which I used to test the strings coming from my webpage.

Putting the string “02 - Bagatelle in A Minor, WoO 59, Für Elise.mp3” into the aforementioned webpage and using its urlencode and rawurlencode functions, I get these strings:

urlencode:
02 - Bagatelle in A Minor, WoO 59, Für Elise.mp3
where ü is encoded as % C3 % BC (without the spaces)

rawurlencode:
02 - Bagatelle in A Minor, WoO 59, Für Elise.mp3
where ü is encoded as % C3 % BC (without the spaces)

Using these functions in my own php script gives me these strings:

urlencode:
02 - Bagatelle in A Minor, WoO 59, Für Elise.mp3
where ü is encoded as % FC (without the space)

rawurlencode:
02 - Bagatelle in A Minor, WoO 59, Für Elise.mp3
where ü is encoded as % FC (without the space)

Does anyone have any ideas if I might be doing something incorrectly or if there is something that I need to add to have these strings encoded in a way that javascript can decode them?
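
The decoding error can be reproduced in isolation. This is a minimal sketch, assuming only a standard JavaScript engine; `tryDecode` is a hypothetical helper, and the two inputs are shortened versions of the strings above with the percent signs written without spaces.

```javascript
// Try to decode a percent-encoded string and report the outcome instead of throwing.
function tryDecode(encoded) {
  try {
    return { ok: true, value: decodeURIComponent(encoded) };
  } catch (err) {
    // decodeURIComponent throws a URIError when the decoded bytes are not valid UTF-8.
    return { ok: false, value: err.name };
  }
}

const fromTestPage = tryDecode("F%C3%BCr%20Elise.mp3"); // two-byte form of ü
const fromPhpScript = tryDecode("F%FCr%20Elise.mp3");   // one-byte form of ü

console.log(fromTestPage);  // decodes to "Für Elise.mp3"
console.log(fromPhpScript); // fails with "URIError"
```

Only the two-byte form decodes, which suggests the problem lies in the bytes being encoded rather than in the encoding functions themselves.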


#2

I have been doing a decent amount of AJAX recently so your post is relevant to my interests. As an aside, I would like to mention that I often second guess myself before concluding that an established programming language’s functions don’t work correctly. :wink:

I did a little research and found this: http://www.captain.at/howto-php-urlencode-javascript-decodeURIComponent.php

[quote]The PHP functions “urlencode”/“urldecode” are not compatible with the Javascript functions “escape”/“unescape”, “encodeURI”/“decodeURI”, “encodeURIComponent”/“decodeURIComponent”.

Here we have some PHP functions which behave like the corresponding Javascript functions. Those come in handy when dealing with Web 2.0 Ajax applications. [/quote]
This specific line from the code example tells me that this will probably help resolve your problem


#3

I second guessed myself all the way up until I found the site I mentioned in my first post: http://www.the-art-of-web.com/javascript/escape/

When I came to that website, placed my string in the input box, hit rawurlencode, copied the result string into the input box, then hit decodeURIComponent, and the string came out right… with no errors… I had to pause for a moment. What the heck was going on!?

I dug a little bit into charsets and the like, but everything I attempted still resulted in the javascript error.

I agree that my conclusion might have been rash, but what a pita, especially when I found something that shows these functions do work together.

Anyways, I have found it easier to leave special language characters alone when passing data back and forth (who knows if this will bite me later).
This is just something simple I use so that I am only encoding the special characters that, as far as I know, NEED to be encoded.

function encodeurl($url)
{
    // Percent-encode only these reserved characters; every other byte,
    // including non-ASCII ones such as ü, is passed through unchanged.
    static $map = array(
        '/' => '%2F', "'" => '%27', '(' => '%28', ')' => '%29',
        '*' => '%2A', '~' => '%7E', '!' => '%21', ' ' => '%20',
        '$' => '%24', '&' => '%26', '+' => '%2B', ',' => '%2C',
        ':' => '%3A', ';' => '%3B', '=' => '%3D', '?' => '%3F',
        '@' => '%40',
    );
    // strtr() replaces each character once and never rescans its own output.
    return strtr($url, $map);
}

Anyways thanks for your help, I’ll try to keep an eye out here in case you ask for help in the future.


#4

you have different character sets here. ü in the ISO 8859-1 character set (see table at http://en.wikipedia.org/wiki/ISO/IEC_8859-1#Codepage_layout) gets coded as %FC, but in UTF-8 this character is U+00FC in Unicode, so it gets encoded to two bytes (see table at http://en.wikipedia.org/wiki/UTF-8#Description): %C3%BC. the ISO-8859-1 encoding is illegal in UTF-8, and the UTF-8 encoding gives you Ã¼ if treated as ISO-8859-1. you shouldn’t have any problems if you use the same character set everywhere.
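
The byte values described above can be checked directly. This is a minimal sketch, assuming Node.js (for its Buffer class); the variable names are only illustrative.

```javascript
// The same character, serialized in the two character sets.
const latin1Bytes = Buffer.from("ü", "latin1");
const utf8Bytes = Buffer.from("ü", "utf8");

console.log(latin1Bytes.toString("hex")); // fc
console.log(utf8Bytes.toString("hex"));   // c3bc

// Reading the UTF-8 bytes as ISO-8859-1 yields two unrelated characters.
const misread = utf8Bytes.toString("latin1");
console.log(misread);

// Reading the lone ISO-8859-1 byte as UTF-8 is invalid, so the decoder
// substitutes the replacement character U+FFFD.
const invalid = latin1Bytes.toString("utf8");
console.log(invalid === "\ufffd"); // true
```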

the default character set is normally ISO 8859-1, but that limits you to the characters in the first table i linked to, so it’s better to use UTF-8 since you can use any character at all. there are a number of ways to specify the character set a page should use – on my site i use the php header() function to send it with the content-type header:

header('Content-Type: text/html; charset=utf-8');

track7 - my dream-hosted site


#5

I did some digging and thought this was the suspect, but I swear I added the header charset=utf-8 on all my pages (including the PHP file that the AJAX request is calling) and I still got the error.

I’m not sure if it’s possible that dreamhost’s php configuration will use ISO-8859-1 no matter what I have the header set to. I’m also not sure if there is a way to change that setting.

I will try again later and post back if I still have problems. If you see my latest post: http://discussion.dreamhost.com/showthreaded.pl?Cat=&Board=forum_programming&Number=116830&page=0&view=collapsed&sb=5&o=14&vc=1#Post116830 , I have found a work around for now but I would really like to solve the charset problem so that I am not using a ‘hack’ to get this to work.


#6

In php.ini :

default_charset = "utf-8"

Maximum Cash Discount on any plan with MAXCASH

How To Install PHP.INI / ionCube on DreamHost


#7

Check both your internal encoding and header definition of the encoding. I struggled with a similar issue making a playlist. Turns out it had nothing at all to do with the functions. PHP encodes the output according to the output charset defined in the header() function. For echo or print it’s the default internal encoding. If you don’t write a mime header the other end has no way of knowing exactly what you’re sending.

Perhaps something like:
mb_internal_encoding('UTF-8');
header("Content-Type: text/html; charset=utf-8");

It’s then interpreted on the receiving end according to the mime type and charset both IN the document and ON the document header.

Could this be your issue?


#8

As far as I can tell, I have installed the script that lets me edit the PHP.INI file (and I have edited it, changing default_charset to utf-8), but rawurlencode is still giving me iso-8859-1 encoding.

I have tried pointing my browser to the PHP file directly and viewing the XML output, and I still see encoding in iso-8859-1. So obviously… it is something I am doing.

Please take a look at this script and let me know if I am doing something wrong,
(btw are there any code formatting tags on this forum?)

<?php
header("Content-type: text/xml; charset=utf-8");
header("Cache-Control: no-cache");
$directory_name = urldecode($_POST["directory"]);

// Start the XML file
echo "<?xml version=\"1.0\" encoding=\"utf-8\"?>\n";

echo "\n";

// List directories and files in the given directory
$file_list = array();
// Open the directory handle
if ($dir_handle = opendir($directory_name))
{
    // Read all the files in the given directory
    while (($file_name = readdir($dir_handle)) !== false)
    {
        // Ignore . and ..
        if ($file_name != "." && $file_name != "..")
        {
            array_push($file_list, $file_name);
        }
    }
    // Sort the list alphabetically
    sort($file_list);
    // Loop through the file list and create XML
    foreach ($file_list as $file_name)
    {
        // Full name of the item
        $full_name = $directory_name."/".$file_name;
        // Is the item a directory
        if (is_dir($full_name))
        {
            // Create the XML node
            echo "\t\n";
            echo "\t\t".rawurlencode($full_name)."\n";
            echo "\t\ttrue\n";
            echo "\t\n";
        }
        else // The item is a file
        {
            // Is the item a mp3 file (name ends in ".mp3")
            if (preg_match('/\.mp3$/', $file_name))
            {
                // Create the XML node
                echo "\t\n";
                echo "\t\t".rawurlencode($full_name)."\n";
                echo "\t\tfalse\n";
                echo "\t\n";
            }
        }
    }
    closedir($dir_handle);
}

echo "";

?>


#9

I stuck all of these lines at the top of my PHP script:

mb_internal_encoding("utf-8");
mb_http_output("utf-8");
header("Content-type: text/xml; charset=utf-8");
header("Cache-Control: no-cache");

Then I pointed my browser to that PHP script.

The resulting xml still had iso-8859-1 encoding in it.

Then I also added the encoding type to the xml tag,

echo "<?xml version=\"1.0\" encoding=\"utf-8\"?>\n";

and I STILL have iso-8859-1 encoding in the output.


#10

rawurlencode(utf8_encode($str));


#11

did you try setting the default_charset like sXi suggested? when i use functions like htmlspecialchars i pass in the charset parameter, but it doesn’t look like urlencode has that parameter. or you can try switching the character set information you’re sending up with the xml to match the contents it currently has.


#12

Well that fixes it.

So I assume the file names were being given to me in iso-8859-1 and rawurlencode encodes the string in whatever charset the input is, regardless of any settings I may have set. I suppose it would have been different if I were encoding literal strings, then the headers would have made a difference.
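
That reading can be illustrated outside PHP. The following is a rough sketch, assuming Node.js; `percentEncodeBytes` is a hypothetical stand-in for a byte-oriented encoder such as rawurlencode, not its actual implementation, and the file name is a shortened example.

```javascript
// Rough imitation of a byte-oriented percent-encoder:
// unreserved ASCII characters pass through, every other byte becomes %XX.
function percentEncodeBytes(bytes) {
  let out = "";
  for (const byte of bytes) {
    const ch = String.fromCharCode(byte);
    out += /[A-Za-z0-9\-_.~]/.test(ch)
      ? ch
      : "%" + byte.toString(16).toUpperCase().padStart(2, "0");
  }
  return out;
}

// Hypothetical file name as the file system might hand it over: ISO-8859-1 bytes.
const latin1Name = Buffer.from("Für Elise.mp3", "latin1");

// Encoding the bytes untouched keeps the single ISO-8859-1 byte for ü.
const encodedAsIs = percentEncodeBytes(latin1Name);

// Re-encode the same text as UTF-8 first (the role utf8_encode plays in PHP).
const utf8Name = Buffer.from(latin1Name.toString("latin1"), "utf8");
const encodedUtf8 = percentEncodeBytes(utf8Name);

console.log(encodedAsIs);                     // F%FCr%20Elise.mp3
console.log(encodedUtf8);                     // F%C3%BCr%20Elise.mp3
console.log(decodeURIComponent(encodedUtf8)); // Für Elise.mp3
```

The encoder never looks at headers or charset settings; it only sees bytes, so the conversion has to happen before encoding.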

Thanks for your help.


#13

Yes I ran the script from the website that sXi has in his sig and I have a custom PHP.INI file set up, but that didn’t change anything.

I suppose that with the way sXi showed me to fix the problem I don’t really need the custom INI file set up anymore, but I think I’ll keep it for now.

Thanks misterhaan, siggma, and pangea33 for your input.


#14

I doubt that will work.

Example time:

http://www.trbailey.net/music/xspf.php

See the last entry in the playlist?
That’s not iso-8859
View Source in your browser and note the heading lines.

Normally you would code your output line as the very last thing:

$string = this value;
$string .= more values;

print header("Content-type: text/xml") . $string->asXML();

If you output the header first you’ll likely get the nasty error about modifying buffer after header and your output will be incorrect.

Now see the playlist after it’s decoded correctly:
http://www.trbailey.net/music/

Last entry again…


#15

Hmm… well it seems like it’s working just fine.

I don’t claim to be a guru or anything, but I would think the error you are talking about has to do with how flash handles its get vars code or just how the flash player you are using is programmed. AFAIK with AJAX the AjaxRequest will wait until the page is complete before returning data (in the request state changed callback there is code to check if the page is complete).

I’ve been poking around in the dark for a month or so with Ajax, Javascript and PHP and have never seen or heard of that error, but you’re probably right and I’ll keep a mental note of writing the xml all out at once for when this error arises.

I’m still trying to figure out what your example shows. I just see that all of the file names are urlencoded and then the last file name doesn’t have any characters which would need encoding. I’m not sure how that discerns whether or not the file names are utf-8 or iso-8859-1. I wonder though, did you remove the apostrophes from all your file names on purpose? Sorry if I am not seeing it; it looks like a great player nonetheless.

Everything aside, I’m glad you could contribute and help. I find it funny that we are both coincidentally aiming at creating the same thing.