Character Encoding

Supported character encodings

There are two character encodings supported by the HTTP-RPC and XML-RPC interfaces of

  • utf-8, or 8-bit Unicode
  • iso-8859-1, or Latin1

Both of these encodings are in common use, but UTF-8 is gaining popularity.  We recommend using UTF-8, because it can represent every Unicode character, whereas Latin1 uses a single byte per character and is therefore limited to at most 256 characters.  UTF-8 is also supported in many languages and is even the standard character encoding in some of them.  One drawback of UTF-8 compared to Latin1 is that it is a variable-length encoding.  That is, not all characters take up exactly one byte.
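
As a rough illustration of what "variable-length" means, the following Python sketch counts the bytes a single character occupies in each encoding.  The helper name byte_lengths is hypothetical and is not part of any interface described here; it only uses Python's built-in codecs.

```python
def byte_lengths(ch):
    """Return (UTF-8 length, Latin1 length) in bytes for one character.

    The Latin1 length is None when the character cannot be
    represented in iso-8859-1 at all.
    """
    utf8_len = len(ch.encode("utf-8"))
    try:
        latin1_len = len(ch.encode("iso-8859-1"))
    except UnicodeEncodeError:
        latin1_len = None  # character does not exist in Latin1
    return utf8_len, latin1_len


for ch in ["a", "ß", "€"]:
    print(repr(ch), byte_lengths(ch))
```

A plain ASCII letter takes one byte in both encodings, ß takes two bytes in UTF-8 but one in Latin1, and the euro sign takes three bytes in UTF-8 and cannot be encoded in Latin1.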

The SOAP interface, by definition, only supports UTF-8.

Character encoding introduction

Suppose you use the international address function to search for the Bahnhofstraße.  To send this string to the web service, the characters first have to be converted to bytes.  A character encoding is a mapping table from characters to bytes: it dictates which characters are converted to which bytes.

If you are using Latin1 encoding, the ß in Bahnhofstraße will be converted to one byte with numerical value 223.  If you are using UTF-8, the ß will be converted to two bytes, 195 and 159.
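
The byte values above can be checked with a short Python sketch.  It only uses the built-in str.encode method; the variable names are chosen for illustration.

```python
street = "Bahnhofstraße"

# Encode the same string with both supported encodings.
latin1_bytes = street.encode("iso-8859-1")
utf8_bytes = street.encode("utf-8")

# The ß alone: one byte in Latin1, two bytes in UTF-8.
eszett_latin1 = list("ß".encode("iso-8859-1"))
eszett_utf8 = list("ß".encode("utf-8"))

print(eszett_latin1)                       # [223]
print(eszett_utf8)                         # [195, 159]
print(len(latin1_bytes), len(utf8_bytes))  # 13 14
```

The whole string is 13 characters long, so it is 13 bytes in Latin1 and 14 bytes in UTF-8.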

Now, if the server and the client do not agree on the character encoding, the server cannot convert the bytes from the client back into the intended characters.  When the two sides use different mapping tables, character strings become malformed.
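
What such a mismatch looks like can be sketched in Python by encoding with one table and decoding with the other.  This simulates the disagreement locally and is not a description of how the server itself reacts; the variable names are illustrative.

```python
street = "Bahnhofstraße"

# Case 1: client sends UTF-8, receiver assumes Latin1.
# Every byte is valid Latin1, so decoding succeeds but the ß
# turns into two wrong characters (U+00C3 and U+009F).
garbled = street.encode("utf-8").decode("iso-8859-1")
print(repr(garbled))

# Case 2: client sends Latin1, receiver assumes UTF-8.
# Byte 223 starts a two-byte UTF-8 sequence, but the next byte
# is not a valid continuation, so decoding fails.
try:
    street.encode("iso-8859-1").decode("utf-8")
    decoded_ok = True
except UnicodeDecodeError:
    decoded_ok = False
print(decoded_ok)
```

In the first case the string is silently corrupted, in the second the bytes are not valid UTF-8 at all.  Either way the original Bahnhofstraße is lost unless both sides use the same encoding.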