My explanation, after reading numerous posts and articles about this topic:
1 - The Unicode Character Table
"Unicode" is a giant table, that is 21bits wide, these 21bits provide room for 1,114,112 code points / values / fields / places to store characters in.
Out of those 1,114,112 code points, 1,111,998 are able to store Unicode characters, because 2,048 code points are reserved as surrogates and 66 code points are reserved as noncharacters. So there are 1,111,998 code points that can each store a unique character, symbol, emoji, etc.
However, as of now (Unicode 14.0), only 144,697 out of those 1,114,112 code points have been assigned. These 144,697 code points contain the characters that cover all of the languages, as well as symbols, emojis, etc.
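If you want to double-check these numbers yourself, here is a tiny Python sketch that just does the arithmetic (the surrogate range U+D800–U+DFFF and the 66 noncharacters are defined by the Unicode standard):

```python
total_code_points = 0x10FFFF + 1          # U+0000 .. U+10FFFF -> 1,114,112
surrogates        = 0xDFFF - 0xD800 + 1   # U+D800 .. U+DFFF   -> 2,048
noncharacters     = 32 + 17 * 2           # U+FDD0..U+FDEF plus the last two
                                          # code points of each of the 17 planes -> 66

print(total_code_points)                                  # 1114112
print(total_code_points - surrogates - noncharacters)     # 1111998
```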
Each character in the "Unicode" is assigned to a specific code point aka has a specific value / Unicode number. For Example the character "❤", uses exactly one code point out of the 1,114,112 code points. It has the value (aka Unicode number) of "U+2764". This is a hexadecimal code point consisting of two bytes, which in binary is represented as 00100111 01100100. But to represent this code point, UTF-8 encoding uses 3 bytes (24 bits), which is represented in binary as 11100010 10011101 10100100 (without the two empty space characters, each of which is using 1 bit, and I have added them for visual purposes only, in order to make the 24bits more readable, so please ignore them).
Now, how is our computer supposed to know whether those 3 bytes "11100010 10011101 10100100" should be read separately or together? If each of those 3 bytes is read as a separate single-byte character (for example with the legacy DOS code page 850) and then converted to characters, the result would be "Ô, Ø, ñ", which is quite the difference compared to our heart emoji "❤".
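Just to make this concrete, here is the same "❤" example in Python (purely illustrative; any language with Unicode strings would work the same way):

```python
heart = "\u2764"                                # ❤, code point U+2764

print(hex(ord(heart)))                          # 0x2764 -> its Unicode number
print(heart.encode("utf-8"))                    # b'\xe2\x9d\xa4' -> 11100010 10011101 10100100

# Reading those 3 bytes one-by-one with a legacy single-byte
# code page (here: DOS code page 850) gives "ÔØñ" instead of "❤":
print(heart.encode("utf-8").decode("cp850"))    # ÔØñ
```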
2 - Encoding Standards (UTF-8, ISO-8859, Windows-1251, etc.)
In order to solve this problem, people invented encoding standards, the most popular one being UTF-8 (since 2008). UTF-8 accounts for an average of 97.6% of all web pages, which is why we will use UTF-8 for the example below.
2.1 - What is Encoding?
Encoding, simply said, means to convert something from one thing to another. In our case we are converting data, more specifically bytes, to the UTF-8 format. I would also like to rephrase that sentence as "converting bytes to UTF-8 bytes", although it might not be technically correct.
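In code, that conversion is literally a pair of function calls. Here is a minimal Python illustration (any language works the same way conceptually):

```python
text = "hello ❤"

data = text.encode("utf-8")   # text  -> UTF-8 bytes (encoding)
back = data.decode("utf-8")   # bytes -> text again  (decoding)

print(data)                   # b'hello \xe2\x9d\xa4'
print(back == text)           # True
```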
2.2 Some information about the UTF-8 format, and why it's so important
UTF-8 uses a minimum of 1 byte to store a character and a maximum of 4 bytes. Thanks to the UTF-8 format, we can have characters that take more than 1 byte of information.
This is very important, because if it were not for the UTF-8 format, we would not be able to have such a vast diversity of alphabets, since the letters of some alphabets can't fit into 1 byte. We also wouldn't have emojis at all, since each one requires at least 3 bytes (most actually need 4). I am pretty sure you got the point by now, so let's continue forward.
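To see this variable width in action, here is a quick Python sketch that measures how many UTF-8 bytes a few characters need:

```python
# An ASCII letter, an accented letter, a Chinese character and an emoji:
for ch in ["A", "é", "汉", "😀"]:
    print(ch, len(ch.encode("utf-8")), "byte(s)")

# A 1 byte(s)
# é 2 byte(s)
# 汉 3 byte(s)
# 😀 4 byte(s)
```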
2.3 Example of Encoding a Chinese character to UTF-8
Now, let's say we have the Chinese character "汉".
This character takes exactly 16 binary bits, "01101100 01001001". Thus, as we discussed above, the computer can not read this character correctly unless we encode it to UTF-8, because it has no way of knowing whether these 2 bytes should be read separately or together.
Converting this "汉" character's 2 bytes into, as I like to call it UTF-8 bytes, will result in the following:
(Normal Bytes) "01101100 01001001" -> (UTF-8 Encoded Bytes) "11100110 10110001 10001001"
Now, how did we end up with 3 bytes instead of 2? How is that supposed to be UTF-8 Encoding, turning 2 bytes into 3?
In order to explain how the UTF-8 encoding works, I am going to literally copy the reply of @MatthiasBraun, a big shoutout to him for his terrific explanation.
2.4 How does the UTF-8 encoding actually work?
What we have here is the template for Encoding bytes to UTF-8. This is how Encoding happens, pretty exciting if you ask me!
Now, take a good look at the table below and then we are going to go through it together.
Binary format of bytes in sequence:

    1st Byte    2nd Byte    3rd Byte    4th Byte    Number of Free Bits    Maximum Expressible Unicode Value
    0xxxxxxx                                        7                      007F hex (127)
    110xxxxx    10xxxxxx                            (5+6)=11               07FF hex (2047)
    1110xxxx    10xxxxxx    10xxxxxx                (4+6+6)=16             FFFF hex (65535)
    11110xxx    10xxxxxx    10xxxxxx    10xxxxxx    (3+6+6+6)=21           10FFFF hex (1,114,111)
The "x" characters in the table above represent the number of "FreeBits", those bits are empty and we can write to them.
The other bits are reserved for the UTF-8 format, they are used asheaders / markers. Thanks to these headers, when the bytes are beingread using the UTF-8 encoding, the computer knows, which bytes to readtogether and which separately.
The byte size of your character, after being encoded using the UTF-8 format,depends on how many bits you need to write.
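Put differently, you pick the first row of the table whose free bits can hold your code point. Here is a small, purely illustrative Python sketch of that decision (the helper name utf8_byte_count is just made up for this example; the thresholds 7F / 7FF / FFFF come straight from the table):

```python
def utf8_byte_count(code_point: int) -> int:
    """How many UTF-8 bytes a code point needs, per the table above."""
    if code_point <= 0x7F:      # fits into 7 free bits
        return 1
    if code_point <= 0x7FF:     # fits into 11 free bits
        return 2
    if code_point <= 0xFFFF:    # fits into 16 free bits
        return 3
    return 4                    # needs up to 21 free bits (max U+10FFFF)

print(utf8_byte_count(ord("汉")))      # 3
print(len("汉".encode("utf-8")))       # 3 -> Python's built-in encoder agrees
```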
In our case the "汉" character is exactly 2 bytes or 16bits:
"01101100 01001001"
thus the size of our character, after being encoded to UTF-8, will be 3 bytes or 24 bits
"11100110 10110001 10001001"
because "3 UTF-8 bytes" have 16 Free Bits, which we can write to
- Solution, step by step below:
2.5 Solution:
    Header    Placeholder    Fill in our binary    Result
    1110      xxxx           0110                  11100110
    10        xxxxxx         110001                10110001
    10        xxxxxx         001001                10001001
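Those three rows can be written as three lines of bit fiddling. Here is a Python sketch of exactly that step (it only covers the 3-byte shape; a real encoder would also handle the 1-, 2- and 4-byte shapes):

```python
cp = ord("汉")                                  # 0x6C49 = 0110110001001001

byte1 = 0b11100000 | (cp >> 12)                 # header 1110 + top 4 bits  (0110)
byte2 = 0b10000000 | ((cp >> 6) & 0b111111)     # header 10   + next 6 bits (110001)
byte3 = 0b10000000 | (cp & 0b111111)            # header 10   + last 6 bits (001001)

print(format(byte1, "08b"), format(byte2, "08b"), format(byte3, "08b"))
# 11100110 10110001 10001001
print(bytes([byte1, byte2, byte3]) == "汉".encode("utf-8"))   # True
```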
2.6 Summary:
    A Chinese character:        汉
    its Unicode value:          U+6C49
    convert 6C49 to binary:     01101100 01001001
    encode 6C49 as UTF-8:       11100110 10110001 10001001
3 - The difference between UTF-8, UTF-16 and UTF-32
Original explanation of the difference between the UTF-8, UTF-16 and UTF-32 encodings: https://javarevisited.blogspot.com/2015/02/difference-between-utf-8-utf-16-and-utf.html
The main difference between UTF-8, UTF-16, and UTF-32 character encodings is how many bytes they require to represent a character in memory:
UTF-8 uses a minimum of 1 byte, but if the character needs more room, it can use 2, 3 or 4 bytes. UTF-8 is also backward compatible with the ASCII table.
UTF-16 uses a minimum of 2 bytes. UTF-16 can not take 3 bytes; it takes either 2 or 4 bytes. UTF-16 is not compatible with the ASCII table.
UTF-32 always uses 4 bytes.
Remember: UTF-8 and UTF-16 are variable-length encodings, where UTF-8 can take 1 to 4 bytes, while UTF-16 takes either 2 or 4 bytes. UTF-32 is a fixed-width encoding; it always takes 32 bits (4 bytes).
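If you want to see those size differences with your own eyes, here is a short Python comparison (the little-endian codecs are used so that Python does not prepend a BOM and skew the byte counts):

```python
for ch in ["A", "汉", "😀"]:
    print(ch,
          len(ch.encode("utf-8")),      # 1 / 3 / 4 bytes
          len(ch.encode("utf-16-le")),  # 2 / 2 / 4 bytes
          len(ch.encode("utf-32-le")))  # 4 / 4 / 4 bytes
```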