MySQL default collations


#1

Ok, I’ve got a real problem here. I’ve set up a local LAMP and wrote a PHP application which uses MySQL to store a large wordlist. This wordlist includes accented characters, such as ñ and á. I’ve also started storing other languages, and there are no problems. The cool thing is that I can go from input, through PHP, and into the database, and everything is still human readable. I mean, I can look and edit the same text through phpMyAdmin or even the CLI and still see ñ and á.

When I put it on Dreamhost, however, it stops working. PHP reports that the serialised strings are not of the right length and such. So if I store the wordlist from Dreamhost, it still works, sort of. The text I see when I pull words out of the database and put through PHP still end up as ñ and á, but when I look at them or try to edit them through phpMyAdmin or the CLI, I just get mojibake (like ñ and à a).

Now I’ve never really bothered with it before, other than it being a pain that I can’t directly fix any mistakes through phpMyAdmin, but today I realised where the problem might be. You see, I downloaded a backup copy of my database from Dreamhost and loaded it on my local machine (actually, a WAMP this time). I found that I couldn’t read anything out of it. But when I stored some accented characters, I could see them perfectly well in phpMyAdmin, and they came back out ok through PHP. So that means my database, tables, and columns must have the correct collation settings, utf_general_ci, because none of that changed when importing to my local machine. I went through every setting I could find comparing my setup with DH and I found a difference. It seems that only some of the default collations and character sets are set as utf-8 whereas everything on mine is:

DH:

character set client	utf8
(Global value)	latin1
character set connection	utf8
(Global value)	latin1
character set database	latin1
character set filesystem	binary
character set results	utf8
(Global value)	latin1
character set server	latin1
character set system	utf8
character sets dir	/data/mysql/koble/share/mysql/charsets/
collation connection	utf8_unicode_ci
(Global value)	latin1_swedish_ci
collation database	latin1_swedish_ci
collation server	latin1_swedish_ci

and mine:

character set client	utf8
character set connection	utf8
character set database	utf8
character set filesystem	binary
character set results	utf8
character set server	utf8
character set system	utf8
character sets dir	C:\www\UniServer\usr\local\mysql\share\charsets\
collation connection	utf8_general_ci
collation database	utf8_general_ci
collation server	utf8_general_ci

Is it possible that this is causing non-Latin characters to be encoded as mojibake when using phpMyAdmin / CLI on DH but not on my local machine? Is there any way to change the settings to all be utf8? It would make my life SO much easier if I could read my data directly in phpMyAdmin instead of guessing at what the mojibake might be…

Cheers!


#2

Hmmm, not many people using non-latin languages? Or aren’t bothered by it? Or am I the only one having this problem? Maybe I’ve set something incorrectly.


#3

I’m guessing that collation has nothing to do with the mojibake, although there’s no reason to use the standard [font=Courier]latin1_swedish_ci[/font] as I doubt that is relevant for the vast majority of users.

However, it seems that the inconsistency between character sets might be causing the problem. According to the last paragraph of the MySQL online documentation:

[quote]
You can force client programs to use specific character set as follows:

[font=Courier][client]
default-character-set=charset_name[/font]

This is normally unnecessary. However, when [font=Courier]character_set_system[/font] differs from [font=Courier]character_set_server[/font] or [font=Courier]character_set_client[/font], and you input characters manually (as database object identifiers, column values, or both), these may be displayed incorrectly in output from the client or the output itself may be formatted incorrectly. In such cases, starting the mysql client with [font=Courier]–default-character-set=system_character_set[/font]—that is, setting the client character set to match the system character set—should fix the problem.[/quote]

Does anyone know how to do this?

Cheers!


#4

So the answer was PDOs, in case you were wondering. PDOs in PHP don’t explicitly set the connection character set, so you have to. Once i forced it to use UTF-8, then the mojibake went away…