Gnutella Forums  

Go Back   Gnutella Forums > Gnutella News and Gnutelliums Forums > General Gnutella Development Discussion
Register FAQ The Twelve Commandments Members List Calendar Arcade Find the Best VPN Today's Posts

General Gnutella Development Discussion For general discussion about Gnutella development.


Reply
 
LinkBack Thread Tools Display Modes
  #1 (permalink)  
Old March 10th, 2002
bmk bmk is offline
Disciple
 
Join Date: March 4th, 2002
Location: Germany
Posts: 13
bmk is flying high
Thumbs up UTF, UNICODE technical (long)

As to why the new protocol should extend to UNICODE, and why implement this using UTF-8:

UNICODE aspires to define all characters of all languages. Right now, an address space of 2byte (about 64,000 characters) has been defined to cover most languages. This is being extended to 4bytes, but let's keep it at 2 bytes for now.

UTF (more correctly UTF-8) as well as UCS are ways to express the 2byte-number (I skip the 4byte UNICODE) for a character. UCS simply is the number in 2bytes, thus it may contain null-bytes. Normally when talking about UNICODE, the UCS-2 (= 2 bytes) method of expressing UNICODE is being refered to.

UTF or more correctly UTF-8 uses 1, 2 or 3 bytes to express the 2byte number for a UNICODE character. Null bytes do not occur. This works as follows:
<table border=1 cols=4><tr><td>UNICODE character number range (in hex)</td><td>UTF byte 1 (in binary)</td><td>UTF byte 2</td><td>UTF byte 3</td><td></tr><tr><td>0000 - 007f</td><td>0xxxxxxx</td><td>(none)</td><td>(none)</td><td></tr>
<tr><td>0080 - 07ff</td><td>110xxxxx</td><td>10xxxxxx</td><td>(none)</td></tr><tr><td>07ff - ffff</td><td>1110xxxx</td><td>10xxxxxx</td><td>10xxxxxx</td><td></tr></table>
<i><font color=red>UTF can also have 4 bytes, and using the same scheme express a character number up to U+10ffff. That won't be relevant right now, but may be in future. Provisions should be taken for upward compatibility with possible 4-byte UTF code sequences.</font></i>

The first byte of a UTF sequence gives its length in the highest value bits up to the first 0-bit, the following 1 or 2 bytes are easily recognizable as belonging to an UTF sequence by their 2 highest value bits, having a value between 80 and BF. The bits here marked as 'x' give the number of the character in the UNICODE table.

Thus, a UTF character of 1 byte length is exactly the same number as the corresponding ASCII character. However, a Latin-1 character will have a number beyond 7f. So its not possible to say if a single byte is a Latin-1 character or the start of an UTF sequence.

In conclusion, extending the encoding sheme of the protocoll from ASCII to UTF would leave current clients still working, as nullbytes do not occur. Old clients of course would treat each byte of an UTF sequence as a separate character, leading to funny names in the search results. But you get that even now, and searches containing e.g. German special characters do not really work right now: These characters will normally just be ignored. Moving to UCS might make some old clients fail, as one character might contain a nullbyte. Compared to UCS, UTF for a single character either takes less space (for ASCII text), exactly the same space (for the special European characters and any characters up to 07ff UNICODE, for example Russian), or 1 byte more (most notable for Asian languages)

As the bulk of the traffic very probably will remain ASCII for a long time from now, the increase in load by using UTF should be tolerable. You gain a worldwide audience, and you stay compatible with the current standard. Keep in mind that Latin-1 right now neither is standard nor does it work well. Lastly, if at some point in future you desire an extension to cover UNICODE characters up to U+10ffff then UTF-8 can still be used.

Please have a look at the <a href=http://www.unicode.org/>UNICODE Consortium</a>. Demonstration pages for UNICODE (always UTF encoded) can be found anywhere on the web. One such is <a href=http://www.geocities.com/Tokyo/Pagoda/1675/unicode-page.html>here</a>.

If you go for Latin-1, then you need a mechanism to identify the message as Latin-1 or as UNCIODE. If you use UCS, then you probably cannot maintain downward compatibility. You will also get new problems when at some point in the future characters up to U+10ffff should be supported.
Reply With Quote
Reply


Posting Rules
You may not post new threads
You may not post replies
You may not post attachments
You may not edit your posts

BB code is On
Smilies are On
[IMG] code is On
HTML code is Off
Trackbacks are On
Pingbacks are On
Refbacks are On


Similar Threads
Thread Thread Starter Forum Replies Last Post
Proposal for development of Gnutella (hashs) Unregistered General Gnutella Development Discussion 61 April 17th, 2002 08:35 AM
My Proposal for XoloX!!! Unregistered User Experience 1 February 6th, 2002 08:11 AM
What does 'Gnutella v0.6 protocoll' mean? Moak LimeWire Beta Archives 0 December 12th, 2001 10:03 PM
---a Radical Proposal--- Unregistered General Gnutella / Gnutella Network Discussion 0 September 21st, 2001 12:08 PM
protocol extension proposal Unregistered General Gnutella Development Discussion 3 September 16th, 2001 02:00 PM


All times are GMT -7. The time now is 03:30 PM.


Powered by vBulletin® Version 3.8.7
Copyright ©2000 - 2025, vBulletin Solutions, Inc.
SEO by vBSEO 3.6.0 ©2011, Crawlability, Inc.

Copyright Đ 2020 Gnutella Forums.
All Rights Reserved.