another encoding headhache (Real Studio network user group Mailinglist archive)
|another encoding headhache - Giulio|
|Re: another encoding headhache - Joseph J. Strout|
|another encoding headhache|
|Date: 18.08.05 19:09 (Thu, 18 Aug 2005 20:09:56 +0200)|
I have an application that uses the FTPSuite classes, and I see
strange behaviour when receiving a directory listing that contains
file names with accented characters.
I must compare the file names I receive from the server with names
held in variables, and here comes the strange thing:
I parse the name list, and when a name contains accented characters
and I compare it with the same value held in a variable, they
don't match! Both are UTF-8, as tested with
Encoding(variablename).internetName.
If I put the two values (the parsed name and the name in the
variable) in two different EditFields, they look OK and identical.
But if I test the length of the variable containing the parsed
name, it is longer than the other (one additional character for every
accented character), and if I loop over the variable showing each
character in a MsgBox, the accented characters appear as their
non-accented equivalent followed by a space (or a non-displayable
character).
I don't know if I should post this on the FTPSuite list, but the
question is: how can a UTF-8 variable in REALbasic be corrupted this
way yet still display correctly, and is there a way to fix it with
some kind of conversion?
|Re: another encoding headhache|
|Date: 18.08.05 20:09 (Thu, 18 Aug 2005 13:09:47 -0600)|
|From: Joseph J. Strout|
At 8:09 PM +0200 8/18/05, Giulio wrote:
>I parse the name list, and when a name contains accented
>characters and I compare it with the same value held in a
>variable, they don't match!
Well, they're not the same value then. But they may display the same
way. Accented characters can be represented in two ways: composed or
decomposed. When composed, an accented letter is a single code point.
When decomposed, it's two code points (the base letter followed by a
combining accent mark). This is the primary wart on the whole Unicode
system: there are two ways to represent the same text.
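
To make the composed/decomposed distinction concrete, here is a minimal
sketch. It is written in Python rather than REALbasic, and the sample
word and the helper name code_points are illustrative assumptions, not
anything taken from the FTPSuite listing in question.

```python
import unicodedata

# Two spellings of the same visible text (illustrative sample word).
composed = "caf\u00e9"      # 'é' as a single code point, U+00E9
decomposed = "cafe\u0301"   # 'e' followed by U+0301, a combining accent


def code_points(text):
    """Return (hex code point, Unicode name) for each code point in text."""
    return [(f"U+{ord(ch):04X}", unicodedata.name(ch)) for ch in text]


# The strings render identically but do not compare equal.
print(composed == decomposed)

# The decomposed form has one extra code point per accented letter.
print(len(composed), len(decomposed))

# Listing the code points shows the base letter and the separate accent.
for entry in code_points(decomposed):
    print(entry)
```

The last entry printed for the decomposed string is the combining
accent on its own, which matches the "non-accented letter followed by
a non-displayable character" seen when stepping through the string one
character at a time.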
>how can a UTF-8 variable in REALbasic be corrupted this way yet
>still display correctly, and is there a way to fix it with some
>kind of conversion?
Nothing is corrupted; this is (regrettably) perfectly valid UTF-8
text, in either form. If you know that the text you're dealing with
can be represented in some other encoding, you can convert to that,
and you should find that both strings convert to the same thing. At
a guess, ISO-Latin-1 would be a good assumption for most FTP servers.
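
As a rough illustration of bringing both strings to a common form
before comparing, the sketch below uses Python, not REALbasic, and it
swaps in Unicode normalization (NFC, the composed form) in place of the
encoding conversion suggested above. The function name same_name and
the sample file names are hypothetical.

```python
import unicodedata


def same_name(a, b):
    """Compare two strings after normalizing both to composed form (NFC)."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)


# Hypothetical file names: one decomposed (as a server might send it),
# one composed (as it might be typed into source code).
from_server = "re\u0301sume\u0301.txt"
local = "r\u00e9sum\u00e9.txt"

print(from_server == local)            # plain comparison
print(same_name(from_server, local))   # comparison after normalization

# Once composed, the name fits in ISO-Latin-1 and both strings convert
# to the same bytes.
latin1_server = unicodedata.normalize("NFC", from_server).encode("latin-1")
latin1_local = local.encode("latin-1")
print(latin1_server == latin1_local)
```

In Python, at least, encoding the decomposed string to Latin-1 directly
raises an error, because the combining accent has no Latin-1
equivalent; that is why the sketch normalizes first. Whether a
REALbasic encoding conversion composes the characters for you may depend
on the platform's converter, so it is worth checking the string lengths
after converting.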