
Answer by thomasrutter for What is the difference between UTF-8 and Unicode?

UTF-8 is an encoding scheme for Unicode text. It has become the best-supported and best-known encoding for Unicode text in many contexts, especially the web, and is the default text encoding in JSON and XML.

Unicode is a broad-scoped standard which defines over 149,000 characters and assigns each a numeric code (a code point). It also defines rules for how to sort text, normalise it, change its case, and more. A character in Unicode is represented by a code point from zero up to 0x10FFFF inclusive, though some code points are reserved and cannot be used for characters (for example the surrogate range 0xD800 to 0xDFFF).
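
As a rough illustration of what "code point" means in practice, here is a minimal Python sketch. The helper names `is_valid_code_point` and `is_surrogate` are made up for this example and are not part of any library; `ord()` is the Python built-in that returns the code point of a one-character string.

```python
# Code points are plain integers in the range 0 to 0x10FFFF inclusive.
MAX_CODE_POINT = 0x10FFFF


def is_valid_code_point(n: int) -> bool:
    """True if n lies inside the Unicode code space."""
    return 0 <= n <= MAX_CODE_POINT


def is_surrogate(n: int) -> bool:
    """True for 0xD800..0xDFFF, reserved code points that never stand for characters."""
    return 0xD800 <= n <= 0xDFFF


print(hex(ord("A")))                   # code point of LATIN CAPITAL LETTER A
print(hex(ord("\u20ac")))              # code point of EURO SIGN
print(is_valid_code_point(0x110000))   # one past the end of the code space
print(is_surrogate(0xD800))
```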

There is more than one way to encode a string of Unicode code points into a binary stream. These are called "encodings". The most straightforward encoding is UTF-32, which simply stores each code point as a 32-bit (4-byte) integer. Since code points only go up to 0x10FFFF (requiring 21 bits), this encoding is somewhat wasteful.
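
To make the UTF-32 description concrete, the following sketch packs each code point into a big-endian 32-bit integer by hand and compares the result with Python's built-in `utf-32-be` codec. The function name `encode_utf32_be` and the sample string are just for illustration.

```python
import struct


def encode_utf32_be(text: str) -> bytes:
    """Store each code point as one big-endian 32-bit unsigned integer."""
    return b"".join(struct.pack(">I", ord(ch)) for ch in text)


# Four code points: U+0061, U+00E9, U+20AC, U+1F600
sample = "a\u00e9\u20ac\U0001F600"
encoded = encode_utf32_be(sample)

print(len(sample), len(encoded))              # 4 code points, 16 bytes
print(encoded == sample.encode("utf-32-be"))  # matches the built-in codec
print((0x10FFFF).bit_length())                # bits needed for the largest code point
```

Every code point costs 4 bytes here regardless of its value, which is the waste the paragraph above refers to.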

UTF-8 is another encoding, and has become more popular than other encodings thanks to a number of advantages. UTF-8 encodes each code point as a sequence of 1, 2, 3 or 4 bytes. Code points in the ASCII range are encoded as a single byte, leaving them fully compatible with ASCII. Code points outside this range use 2, 3 or 4 bytes each, depending on what range they are in.
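
The ranges mentioned above can be sketched as follows. `utf8_length` is a hypothetical helper written for this example; the boundaries it uses (0x80, 0x800, 0x10000) are the standard UTF-8 range limits, and the loop cross-checks them against Python's own encoder.

```python
def utf8_length(code_point: int) -> int:
    """Number of bytes UTF-8 uses for a given code point."""
    if code_point < 0x80:       # ASCII range
        return 1
    if code_point < 0x800:
        return 2
    if code_point < 0x10000:
        return 3
    return 4                    # up to 0x10FFFF


for ch in ["A", "\u00e9", "\u20ac", "\U0001F600"]:
    encoded = ch.encode("utf-8")
    # The built-in encoder should agree with the range table above.
    assert len(encoded) == utf8_length(ord(ch))
    hex_bytes = " ".join(f"{b:02x}" for b in encoded)
    print(f"U+{ord(ch):04X} -> {hex_bytes}")
```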

UTF-8 has been designed with these properties in mind:

  • Characters also present in the ASCII encoding are encoded in exactly the same way as they are in ASCII, such that any ASCII string is naturally also a valid UTF-8 string representing the same characters.

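  • More efficient: Text in UTF-8 never occupies more space than the same text in UTF-32, and usually occupies less than in UTF-16. The main exception is text dominated by code points from 0x0800 to 0xFFFF (most CJK text, for example), which take 3 bytes each in UTF-8 but only 2 in UTF-16.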

  • Binary sorting: Sorting UTF-8 strings byte by byte gives the same order as sorting them by code point value.

  • When a code point uses multiple bytes, none of those bytes (not even the first one) contains a value in the ASCII range, ensuring that no part of the sequence can be mistaken for an ASCII character. This is important for security, especially when using UTF-8 encoded text in systems originally designed for 8-bit encodings.

  • UTF-8 is easy to validate. Because its structure is so specific, text in other 8-bit or multi-byte encodings will very rarely also pass as valid UTF-8 by chance.

  • Random access: At any point in a UTF-8 string it is possible to tell whether the byte at that position is the first byte of a character, and to find the start of the current or next character, without scanning more than 3 bytes forwards or backwards and without knowing how far into the string we started reading.
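
Several of the properties listed above can be checked with a few lines of Python. This is only a sketch: `is_continuation`, `char_start` and `is_valid_utf8` are helper names invented for this example, and validation is simply delegated to Python's built-in strict UTF-8 decoder.

```python
def is_continuation(byte: int) -> bool:
    """Continuation bytes have the bit pattern 10xxxxxx."""
    return (byte & 0xC0) == 0x80


def char_start(data: bytes, index: int) -> int:
    """Step back from index to the first byte of the character containing it."""
    while index > 0 and is_continuation(data[index]):
        index -= 1
    return index


def is_valid_utf8(data: bytes) -> bool:
    """Validate by attempting a strict decode."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


# ASCII compatibility: the same bytes either way.
print("plain".encode("ascii") == "plain".encode("utf-8"))

# Binary sorting: byte order matches code point order.
words = ["z", "\u00e9", "a", "\U0001F600", "\u20ac"]
print(sorted(words, key=lambda w: w.encode("utf-8")) == sorted(words))

# Multi-byte sequences contain no bytes in the ASCII range.
print(all(b >= 0x80 for b in "\u20ac".encode("utf-8")))

# Validation: a lone Latin-1 byte for U+00E9 is not valid UTF-8.
print(is_valid_utf8("\u00e9".encode("latin-1")))

# Random access: index 3 is the last byte of the 3-byte euro sign,
# which starts at index 1.
data = "a\u20acb".encode("utf-8")
print(char_start(data, 3))
```

Stepping backwards in `char_start` never takes more than 3 steps on valid input, because a character has at most 3 continuation bytes after its first byte.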

