Chapter 6. XML and Advanced Text Processing


Encoding

A computer stores text as a series of numbers. For each letter, punctuation mark, or space there is a corresponding number, called a code point, which represents that character. An encoding is simply a system that is used to identify characters using numbers. Binary data can be encoded as well, and later on in this chapter you will learn how to encode binary data using REALbasic's Base64 classes.

Encoding has been encountered previously in this book, but only superficially; however, it is a crucial part of developing programs, especially ones that run on multiple platforms, because you will no doubt be confronted with the full range of encoding possibilities. As it turns out, understanding character encoding is an important part of being able to make effective use of XML. In this section, I will discuss REALbasic's encoding classes and related tools and share a sample utility application I have used to explore encoding on REALbasic.

There are a lot of different encodings, and if you do an appreciable amount of work with text, you will inevitably run into encoding headaches. One of the earliest encodings is ASCII (American Standard Code for Information Interchange, first standardized in 1963), which was limited to a 7-bit character set, or 128 characters (code points 0 through 127). The first 32 characters, 0 to 31, are control characters, as is 127 (delete); 32 to 126 are the printable characters that make up the basic American-English alphabet.

Following is a list of all the printable ASCII characters, in numeric order:

!"#$%&'()*+,-./0123456789:;<=>?@

ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`

abcdefghijklmnopqrstuvwxyz{|}~
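The same list can be regenerated from the code points themselves. What follows is a minimal sketch in Python rather than REALbasic, chosen only because it is easy to run anywhere; the name `printable_ascii` is an illustrative assumption, not something from the book.

```python
# Build the printable ASCII characters from code points 32 (space) to 126 (~).
printable_ascii = "".join(chr(code_point) for code_point in range(32, 127))

print(len(printable_ascii))   # 95 printable characters, including the space
print(printable_ascii)
```

Note that the first character produced is the space (code point 32), which is easy to overlook in a printed list.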



ASCII is fine if you speak English and need only 128 different characters, but not everyone in the world is English and not everyone in the world speaks English, so it quickly became evident that either additional encodings were required, or a new approach to encoding altogether would be needed. At first, a range of encodings emerged. Both Macintosh and Windows computers used 1-byte character encodings, which provided space for 256 characters. Although both systems shared a common ASCII heritage, the code points above 127 represented different characters on each platform. Macintosh used MacRoman (plus many variants for different languages) and Windows used a variant of Latin-1 (ISO 8859-1) known as Windows-1252.

Since that time, a new standard has emerged (or perhaps more accurately, is emerging) that rationalizes character encoding and that allocates a large enough pool of code points to represent all languages. The standard is called Unicode, and it is managed by the Unicode Consortium. The first version of the Unicode standard was published in 1991.

Unicode uses up to four bytes to represent a character, and there is room in the standard for 1,114,112 code points. These code points are organized into 17 planes, each one representing 65,536 code points (2^16). The first plane, plane 0, is called the Basic Multilingual Plane (BMP); this is the most commonly used plane, where code points have been assigned to a large portion of modern languages.
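The arithmetic behind the planes is simple enough to check directly. The following Python sketch (not REALbasic) is only an illustration; `PLANE_SIZE`, `PLANE_COUNT`, and `plane_of` are hypothetical names introduced here.

```python
PLANE_SIZE = 2 ** 16          # 65,536 code points per plane
PLANE_COUNT = 17

def plane_of(code_point):
    """Return the plane number (0 to 16) that holds a Unicode code point."""
    if not 0 <= code_point < PLANE_SIZE * PLANE_COUNT:
        raise ValueError("code point outside the Unicode range")
    return code_point >> 16   # same as code_point // 65536

print(PLANE_SIZE * PLANE_COUNT)   # 1114112
print(plane_of(ord("A")))         # 0, the BMP
print(plane_of(0x1F600))          # 1
```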

There is a standard approach to representing Unicode code points, which consists of "U+" followed by the code point as a hexadecimal number. The range of code points in the BMP is U+0000 to U+FFFF (Unicode assigns code points beyond the BMP up to U+10FFFF, and the original specification allowed for ranges up to U+7FFFFFFF). The first 256 code points in the BMP are identical to Latin-1, which also means that the first 128 code points are identical to ASCII.
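As a quick illustration of the notation, here is a minimal Python sketch (not REALbasic) that formats a character as "U+" plus hexadecimal digits and checks the Latin-1 overlap. The helper name `u_plus` is an assumption made for this example, and padding to four digits follows the common convention.

```python
def u_plus(character):
    """Format a single character in U+ notation, padded to at least four hex digits."""
    return "U+{:04X}".format(ord(character))

print(u_plus("A"))            # U+0041
print(u_plus("\u00e9"))       # U+00E9 (e with acute accent)
print(u_plus("\U0001F600"))   # U+1F600 (beyond the BMP, so five digits)

# The first 256 code points match Latin-1, so the Latin-1 byte equals the code point.
print("\u00e9".encode("latin-1")[0] == ord("\u00e9"))   # True
```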

The Unicode standard uses several formats to encode code points. They come in two camps: UCS, the Universal Character Set, and UTF, the Unicode Transformation Format.



UCS: Universal Character Set (UCS-2, UCS-4)

The Universal Character Set uses either 2 bytes or 4 bytes to represent characters. In UCS-2, every character is represented by a 2-byte code point (which, of necessity, limits the number of characters available and does not represent the complete range of Unicode). UCS-4, on the other hand, uses 4 bytes.

There are two problems with UCS ("problems" is probably not the right word, but these are the reasons that UCS is not used in practice as much as UTF). First off, early non-Unicode character sets (ASCII and Latin-1) used 8 bits, or 1 byte, per character. Using UCS means that legacy documents would have to be converted to either 2-byte or 4-byte formats to be viewed. Second, using either 2 bytes or 4 bytes for all characters means that your text will take up a lot more space than it would if you were able to use 1 byte for some characters, 2 bytes for others, and so on, which is exactly what UTF does.
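The space cost is easy to see with plain English text. The following Python sketch (not REALbasic) compares a fixed 4-byte encoding with UTF-8; it uses Python's "utf-32-be" codec as a stand-in for UCS-4, which is an assumption made for illustration.

```python
text = "Hello, world"   # 12 characters, all in the ASCII range

# "utf-32-be" is fixed width, 4 bytes per character, and adds no byte order mark.
fixed_width = text.encode("utf-32-be")
variable_width = text.encode("utf-8")

print(len(text), len(variable_width), len(fixed_width))   # 12 12 48
```

For ASCII-only text, the UTF-8 bytes are also identical to the ASCII bytes, so legacy ASCII documents need no conversion at all.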



UTF: Unicode Transformation Format (UTF-8, UTF-16, UTF-32)

REALbasic supports UTF-8 and UTF-16 as native formats. There are also two additional UTF formats, UTF-7 and UTF-32. In UTF-8 and UTF-16, characters are represented by byte sequences of varying sizes, and this is what differentiates UTF from UCS. UTF-8, for example, identifies some characters with 1 byte, others with 2, all the way up to 4 bytes. UTF-16 starts with 2 bytes, but represents some characters with 4 bytes, and UTF-32 uses 4 bytes for every character.
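To make the varying sizes concrete, this Python sketch (not REALbasic) reports how many bytes each format needs for a handful of characters. The function name `encoded_lengths` and the choice of sample characters are assumptions made for illustration.

```python
def encoded_lengths(character):
    """Return the byte counts of one character in UTF-8, UTF-16, and UTF-32."""
    # The "-be" codec variants are used so that no byte order mark is added.
    return (
        len(character.encode("utf-8")),
        len(character.encode("utf-16-be")),
        len(character.encode("utf-32-be")),
    )

print(encoded_lengths("A"))            # (1, 2, 4)
print(encoded_lengths("\u00e9"))       # (2, 2, 4)  e with acute accent
print(encoded_lengths("\u20ac"))       # (3, 2, 4)  euro sign
print(encoded_lengths("\U0001F600"))   # (4, 4, 4)  a character beyond the BMP
```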

The following table shows the range of code points and the number of bytes used by UTF-16 and UTF-8 to represent it. The first three rows represent the BMP.

Table 6.1. UTF-8: Native Strings

In the binary columns, x marks a bit taken from the code point.

Code Range (hex)   UTF-16 (Binary)                       UTF-8 (Binary)
000000-00007F      00000000 0xxxxxxx                     0xxxxxxx
000080-0007FF      00000xxx xxxxxxxx                     110xxxxx 10xxxxxx
000800-00FFFF      xxxxxxxx xxxxxxxx                     1110xxxx 10xxxxxx 10xxxxxx
010000-10FFFF      110110xx xxxxxxxx 110111xx xxxxxxxx   11110xxx 10xxxxxx 10xxxxxx 10xxxxxx

Comments

000000-00007F: The UTF-8 values in this range are the same as ASCII. The first bit in UTF-8 is always 0.

000080-0007FF: UTF-8 uses 2 bytes to represent this range. The first byte begins with 110 and the second byte begins with 10. This range includes the Latin characters.

000800-00FFFF: UTF-8 uses 3 bytes to represent this range, which is the upper limit to the BMP. The first byte begins with 1110, and the second and third begin with 10.

010000-10FFFF: For code points beyond the BMP, both encodings use 4 bytes. However, note the difference in the prefixed values. UTF-16 uses a "surrogate pair" to represent values over U+FFFF. &h10000 is subtracted from the code point, and the remaining 20 bits are split between the two halves of the pair, which carry the prefixes 110110 and 110111.
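The bit patterns in Table 6.1 can be applied by hand. The following Python sketch (not REALbasic) builds UTF-8 bytes and a UTF-16 surrogate pair from a code point; `utf8_bytes` and `surrogate_pair` are hypothetical helper names, and the sketch does no validation (for example, it does not reject the surrogate range itself).

```python
def utf8_bytes(code_point):
    """Encode one code point as UTF-8 using the prefixes from Table 6.1."""
    if code_point < 0x80:                       # 0xxxxxxx
        return bytes([code_point])
    if code_point < 0x800:                      # 110xxxxx 10xxxxxx
        return bytes([0b11000000 | (code_point >> 6),
                      0b10000000 | (code_point & 0x3F)])
    if code_point < 0x10000:                    # 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0b11100000 | (code_point >> 12),
                      0b10000000 | ((code_point >> 6) & 0x3F),
                      0b10000000 | (code_point & 0x3F)])
    return bytes([0b11110000 | (code_point >> 18),   # 11110xxx plus three 10xxxxxx bytes
                  0b10000000 | ((code_point >> 12) & 0x3F),
                  0b10000000 | ((code_point >> 6) & 0x3F),
                  0b10000000 | (code_point & 0x3F)])

def surrogate_pair(code_point):
    """Split a code point above U+FFFF into its UTF-16 high and low surrogates."""
    offset = code_point - 0x10000               # leaves a 20-bit value
    high = 0xD800 | (offset >> 10)              # 110110 prefix plus the top 10 bits
    low = 0xDC00 | (offset & 0x3FF)             # 110111 prefix plus the bottom 10 bits
    return high, low

print(utf8_bytes(0x20AC) == "\u20ac".encode("utf-8"))     # True
print([hex(unit) for unit in surrogate_pair(0x1F600)])    # ['0xd83d', '0xde00']
```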



Byte Order Mark

The byte order mark (BOM) is a character that is placed at the beginning of a file and that can be used to identify the byte order of the document; byte order is simply a reference to endianness, which is one of those topics that keeps popping up in any kind of cross-platform development. The character in question is supposed to be a zero-width non-breaking space. UCS-2 and UTF-16 are the two Unicode formats that use the BOM for the determination of endianness. When used with UTF-8 and others, it's used to identify the encoding of the file itself. In other words, based on the value of the first four bytes, you can determine the encoding of the string (that is, if the BOM has been set). REALbasic usually makes this determination for you, but in case you want to check directly, here are the values and what they mean:

0000FEFF    UCS-4, big-endian machine (1234 order)
FFFE0000    UCS-4, little-endian machine (4321 order)
0000FFFE    UCS-4, unusual octet order (2143)
FEFF0000    UCS-4, unusual octet order (3412)
FEFF----    UTF-16, big-endian
FFFE----    UTF-16, little-endian
EFBBBF      UTF-8
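If you do want to check the first bytes yourself, the values listed above translate directly into a lookup. This is a minimal Python sketch (not REALbasic), with `BOM_SIGNATURES` and `sniff_bom` as hypothetical names. It assumes the 4-byte signatures should be tested first, which means UTF-16 text whose first character after the BOM is U+0000 would be misreported as UCS-4.

```python
# Signatures from the list above, 4-byte values first so that FFFE0000 (UCS-4)
# is not mistaken for FFFE---- (UTF-16).
BOM_SIGNATURES = [
    (bytes.fromhex("0000FEFF"), "UCS-4, big-endian (1234 order)"),
    (bytes.fromhex("FFFE0000"), "UCS-4, little-endian (4321 order)"),
    (bytes.fromhex("0000FFFE"), "UCS-4, unusual octet order (2143)"),
    (bytes.fromhex("FEFF0000"), "UCS-4, unusual octet order (3412)"),
    (bytes.fromhex("EFBBBF"), "UTF-8"),
    (bytes.fromhex("FEFF"), "UTF-16, big-endian"),
    (bytes.fromhex("FFFE"), "UTF-16, little-endian"),
]

def sniff_bom(data):
    """Return a description of the BOM at the start of data, or None if there is none."""
    for signature, description in BOM_SIGNATURES:
        if data.startswith(signature):
            return description
    return None

print(sniff_bom(bytes.fromhex("EFBBBF") + b"hello"))                   # UTF-8
print(sniff_bom(bytes.fromhex("FFFE") + "hello".encode("utf-16-le")))  # UTF-16, little-endian
print(sniff_bom(b"hello"))                                             # None
```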



Converting Encodings

Now that the basics of Unicode have been reviewed, it's time to turn to REALbasic's encoding classes and learn how to use the tools provided to effectively manage character encoding in your application.



TextEncoding Class

The TextEncoding class represents a particular encoding, whether Unicode or some native encoding like MacRoman.

TextEncoding.Base as Integer
TextEncoding.Code as Integer
TextEncoding.Variant as Integer
TextEncoding.Format as Integer
TextEncoding.InternetName as String



The TextEncoding class offers an alternative to the global Chr function discussed earlier in the book. When using the global function, it is assumed that you are using UTF-8, but that may not be what you want. If, for whatever reason, you want to get a character using a code point for another encoding, you can use this method on a TextEncoding instance that represents the encoding you want to use:

TextEncoding.Chr(codepoint as Integer) as String
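The reason this matters is that the same number can mean different characters in different encodings. The following Python sketch (not REALbasic, and not the TextEncoding API) shows the idea for single-byte encodings; `chr_in_encoding` is a hypothetical helper, and "mac_roman" and "latin-1" are Python's codec names for MacRoman and Latin-1.

```python
def chr_in_encoding(code_point, encoding_name):
    """Return the character that a single-byte encoding assigns to a code point (0 to 255)."""
    return bytes([code_point]).decode(encoding_name)

# Below 128 the two encodings agree, because both share the ASCII range.
print(chr_in_encoding(0x41, "mac_roman") == chr_in_encoding(0x41, "latin-1"))   # True

# Above 127 they diverge: 0xA5 is a bullet in MacRoman and a yen sign in Latin-1.
print(chr_in_encoding(0xA5, "mac_roman") == "\u2022")   # True
print(chr_in_encoding(0xA5, "latin-1") == "\u00a5")     # True
```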



You can test to see if the encoding of one string is equal to that of another this way:

TextEncoding.Equals(otherEncoding as TextEncoding) as Boolean



Encodings Object

The Encodings object is always available, and it is used to get a reference to a particular TextEncoding object. You can get a reference to the encoding using the encoding's name:

Encodings.EncodingName as TextEncoding
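For readers who know other environments, looking up an encoding by name is a common pattern. As a loose analogy only (this is Python's standard `codecs` module, not REALbasic's Encodings object), the sketch below resolves names to codecs and compares them; `same_encoding` is a hypothetical helper.

```python
import codecs

def same_encoding(name_a, name_b):
    """Report whether two encoding names resolve to the same Python codec."""
    return codecs.lookup(name_a).name == codecs.lookup(name_b).name

print(same_encoding("windows-1252", "cp1252"))      # True, two names for one codec
print(same_encoding("macintosh", "mac_roman"))      # True
print(same_encoding("macintosh", "windows-1252"))   # False
```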



You can also get a reference to a particular encoding using the code for that encoding. The code is an integer that represents a particular encoding. This method lets you get a reference to a TextEncoding object by passing the code as an argument.

Encodings.GetFromCode(aCode as Integer) as TextEncoding



You can get a character in a particular encoding by calling the Chr method on the corresponding TextEncoding object:

Encodings.UTF8.Chr(aCodePoint as Integer) as String



To use the Encodings object, you need to know the names of the available encodings and, optionally, their codes. The following table provides a list of all the encodings recognized by REALbasic's Encodings object and the values associated with each encoding.

Table 6.2. Encodings Available from the Encodings Object

Encodings Object                    Internet Name           Base  Variant  Format  Code
Encodings.SystemDefault (Mac)       macintosh               0     2        0       131072
Encodings.SystemDefault (Windows)   windows-1252            1280  0        0       1280
Encodings.UTF8                      UTF-8                   256   0        2       134217984
Encodings.UTF16                     UTF-16                  256   0        0       256
Encodings.UCS4                      UTF-32                  256   0        3       201326848
Encodings.ASCII                     US-ASCII                1536  0        0       1536
Encodings.WindowsLatin1             windows-1252            1280  0        0       1280
Encodings.WindowsLatin2             windows-1250            1281  0        0       1281
Encodings.WindowsLatin5             windows-1254            1284  0        0       1284
Encodings.WindowsKoreanJohab        Johab                   1296  0        0       1296
Encodings.WindowsHebrew             windows-1255            1285  0        0       1285
Encodings.WindowsGreek              windows-1253            1283  0        0       1283
Encodings.WindowsCyrillic           windows-1251            1282  0        0       1282
Encodings.WindowsBalticRim          windows-1257            1287  0        0       1287
Encodings.WindowsArabic             windows-1256            1286  0        0       1286
Encodings.WindowsANSI               windows-1252            1280  0        0       1280
Encodings.WindowsVietnamese         windows-1258            1288  0        0       1288
Encodings.MacRoman                  macintosh               0     0        0       0
Encodings.MacVietnamese             X-MAC-VIETNAMESE        30    0        0       30
Encodings.MacTurkish                X-MAC-TURKISH           35    0        0       35
Encodings.MacTibetan                X-MAC-TIBETAN           26    0        0       26
Encodings.MacThai                   TIS-620                 21    0        0       21
Encodings.MacTelugu                 X-MAC-TELUGU            15    0        0       15
Encodings.MacTamil                  X-MAC-TAMIL             14    0        0       14
Encodings.MacSymbol                 Adobe-Symbol-Encoding   33    0        0       33
Encodings.MacSinhalese              X-MAC-SINHALESE         18    0        0       18
Encodings.MacRomanLatin1            ISO-8859-1              2564  0        0       2564
Encodings.MacRomanian               X-MAC-ROMANIAN          38    0        0       38
Encodings.MacOriya                  X-MAC-ORIYA             12    0        0       12
Encodings.MacMongolian              X-MAC-MONGOLIAN         27    0        0       27
Encodings.MacMalayalam              X-MAC-MALAYALAM         17    0        0       17
Encodings.MacLaotian                X-MAC-LAOTIAN           22    0        0       22
Encodings.MacKorean                 EUC-KR                  3     0        0       3
Encodings.MacKhmer                  X-MAC-KHMER             20    0        0       20
Encodings.MacKannada                X-MAC-KANNADA           16    0        0       16
Encodings.MacJapanese               Shift_JIS               1     0        0       1
Encodings.MacIcelandic              X-MAC-ICELANDIC         37    0        0       37
Encodings.MacHebrew                 X-MAC-HEBREW            5     0        0       5
Encodings.MacGurmukhi               X-MAC-GURMUKHI          10    0        0       10
Encodings.MacGujarati               X-MAC-GUJARATI          11    0        0       11
Encodings.MacGreek                  X-MAC-GREEK             6     0        0       6
Encodings.MacGeorgian               X-MAC-GEORGIAN          23    0        0       23
Encodings.MacGaelic                 (none given)            40    0        0       40
Encodings.MacExtArabic              X-MAC-EXTARABIC         31    0        0       31
Encodings.MacEthiopic               X-MAC-ETHIOPIC          28    0        0       28
Encodings.MacDingbats               X-MAC-DINGBATS          34    0        0       34
Encodings.MacDevanagari             X-MAC-DEVANAGARI        9     0        0       9
Encodings.MacCyrillic               X-MAC-CYRILLIC          7     0        0       7
Encodings.MacCroatian               X-MAC-CROATIAN          36    0        0       36
Encodings.MacChineseTrad            Big5                    2     0        0       2
Encodings.MacChineseSimp            GB2312                  25    0        0       25
Encodings.MacCentralEurRoman        X-MAC-CE                29    0        0       29
Encodings.MacCeltic                 (none given)            39    0        0       39
Encodings.MacBurmese                X-MAC-BURMESE           19    0        0       19
Encodings.MacBengali                X-MAC-BENGALI           13    0        0       13
Encodings.MacArmenian               X-MAC-ARMENIAN          24    0        0       24
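The Code column in Table 6.2 appears to be a packed combination of the other three values: the rows are consistent with Code = Base + Variant x 2^16 + Format x 2^26. This is an inference from the numbers in the table, not something the text states, so treat the following Python sketch (not REALbasic) as an assumption; `pack_code` is a hypothetical name.

```python
def pack_code(base, variant, text_format):
    """Combine Base, Variant, and Format the way the table's Code column suggests."""
    return base | (variant << 16) | (text_format << 26)

print(pack_code(0, 2, 0))      # 131072, Encodings.SystemDefault (Mac)
print(pack_code(256, 0, 2))    # 134217984, Encodings.UTF8
print(pack_code(256, 0, 3))    # 201326848, Encodings.UCS4
print(pack_code(1280, 0, 0))   # 1280, Encodings.WindowsLatin1
```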


