This is an interesting area; alphabets are important. All the same, this is the one part of this chapter that you can read superficially first time round without missing too much. Read it to make sure that you've seen the contents once, and make a mental note to come back to it later on.
Few computer languages bother to define their alphabet rigorously. There's usually an assumption that the English alphabet augmented by a sprinkling of more or less arbitrary punctuation symbols will be available in every environment that is trying to support the language. The assumption is not always borne out by experience. Older languages suffer less from this sort of problem, but try sending C programs by Telex or restrictive e-mail links and you'll understand the difficulty.
The Standard talks about two different character sets: the one that programs are written in and the one that programs execute with. This is basically to allow for different systems for compiling and execution, which might use different ways of encoding their characters. It doesn't actually matter a lot except when you are using character constants in the preprocessor, where they may not have the same value as they do at execution time. This behaviour is implementation-defined, so it must be documented. Don't worry about it yet.
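As a small illustration (ours, not taken from the Standard), a preprocessor test like the one below is evaluated using the compile-time character set, whose values need not match those of the execution character set:

    #if 'A' == 65
    /* 'A' is 65 to the preprocessor here; on an ASCII execution
       character set it is 65 at run time too, but the Standard
       permits the two values to differ. */
    #endif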
The Standard requires that an alphabet of 96 symbols is available for C as follows:
    a b c d e f g h i j k l m n o p q r s t u v w x y z
    A B C D E F G H I J K L M N O P Q R S T U V W X Y Z
    0 1 2 3 4 5 6 7 8 9
    ! " # % & ' ( ) * + , - . /
    : ; < = > ? [ \ ] ^ _ { | } ~
    space, horizontal and vertical tab
    form feed, newline

Table 2.1. The Alphabet of C
It turns out that most of the commonly used computer alphabets contain all the symbols that are needed for C, with a few notorious exceptions. The C alphabetic characters shown below are missing from the International Standards Organization ISO 646 standard 7-bit character set, which is a subset of all the widely used computer alphabets:

    # [ ] { } \ ^ | ~

To cater for systems that can't provide the full 96 characters needed by C, the Standard specifies a method of using the ISO 646 characters to represent the missing few; the technique is the use of trigraphs.
Trigraphs are sequences of three ISO 646 characters that get treated as if they were one character in the C alphabet; all of the trigraphs start with two question marks, ??, which helps to indicate that 'something funny' is going on. Table 2.2 below shows the trigraphs defined in the Standard.
    C character    Trigraph
    #              ??=
    [              ??(
    ]              ??)
    {              ??<
    }              ??>
    \              ??/
    ^              ??'
    |              ??!
    ~              ??-

Table 2.2. Trigraphs
As an example, let's assume that your terminal doesn't have the # symbol. To write the preprocessor line

    #define MAX 32767

isn't possible; you must use trigraph notation instead:

    ??=define MAX 32767

Of course trigraphs will work even if you do have a # symbol; they are there to help in difficult circumstances more than to be used for routine programming.
The ? character 'binds to the right', so in any sequence of repeated ?s, only the two at the right could possibly be part of a trigraph, depending on what comes next; this disposes of any ambiguity. For example, ???= converts to ?#: the rightmost two question marks combine with the = to form the trigraph, and the first one is left alone.
It would be a mistake to assume that programs written to be highly portable would use trigraphs 'in case they had to be moved to systems that only support ISO 646'. If your system can handle all 96 characters in the C alphabet, then that is what you should be using. Trigraphs will only be seen in restricted environments, and it is extremely simple to write a character-by-character translator between the two representations. However, all compilers that conform to the Standard will recognize trigraphs when they are seen.
Trigraph substitution is the very first operation that a compiler performs on its input text.
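As noted above, it is extremely simple to write such a character-by-character translator. Here is a minimal sketch of a filter that expands trigraphs on its standard input; the function name and structure are our own, not anything prescribed by the Standard. Note how it implements the 'binds to the right' rule by holding on to question marks until it knows what follows them:

    #include <stdio.h>

    /* Return the character that the trigraph ??x stands for,
       or 0 if x does not complete a trigraph. */
    static int trigraph(int c)
    {
        switch (c) {
        case '=':  return '#';
        case '(':  return '[';
        case ')':  return ']';
        case '<':  return '{';
        case '>':  return '}';
        case '/':  return '\\';
        case '\'': return '^';
        case '!':  return '|';
        case '-':  return '~';
        default:   return 0;
        }
    }

    int main(void)
    {
        int c, pending = 0;  /* ?s read but not yet written */

        while ((c = getchar()) != EOF) {
            if (c == '?') {
                pending++;
                continue;
            }
            if (pending >= 2) {
                int t = trigraph(c);
                if (t != 0) {
                    /* the rightmost two ?s plus c form a trigraph;
                       any earlier ?s pass through unchanged */
                    for ( ; pending > 2; pending--)
                        putchar('?');
                    pending = 0;
                    putchar(t);
                    continue;
                }
            }
            for ( ; pending > 0; pending--)
                putchar('?');
            putchar(c);
        }
        for ( ; pending > 0; pending--)
            putchar('?');
        return 0;
    }

Fed the text ??=define MAX 32767, it prints #define MAX 32767; fed ???=, it prints ?#, just as the binding rule requires.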
Support for multibyte characters is new in the Standard. Why?
A very large proportion of day-to-day computing involves data that represents text of one form or another. Until recently, the rather chauvinist computing industry has assumed that it is adequate to provide support for about a hundred or so printable characters (hence the 96 character alphabet of C), based on the requirements of the English language; not surprising, since the bulk of the development of commercial computing has been in the United States.
C also has a byte-oriented approach to data storage. The smallest individual item of storage that can be directly used in C is the byte, which is defined to be at least 8 bits in size. Older systems or architectures that are not designed explicitly to support this may incur a performance penalty when running C as a result, although there are not many that find this a big problem.
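You can check what your own implementation provides by looking at CHAR_BIT in <limits.h>; a minimal sketch:

    #include <stdio.h>
    #include <limits.h>

    int main(void)
    {
        /* CHAR_BIT is the number of bits in a byte; the Standard
           guarantees at least 8. sizeof(char) is 1 by definition. */
        printf("bits per byte: %d\n", CHAR_BIT);
        return 0;
    }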
Perhaps there was a time when the English alphabet was acceptable for data processing applications worldwide, when computers were used in environments where the users could be expected to adapt, but those days are gone. Nowadays it is absolutely essential to provide for the storage and processing of textual material in the native alphabet of whoever wants to use the system. Most of the European alphabets can be squeezed into an eight-bit character set, but the large ideographic alphabets used in Asia cannot, and it is those that the extended character support is designed to accommodate.
There are two general ways of extending character sets. One is to use a fixed number of bytes (often two) for every character. This is what the wide character support in C is designed to do. The other method is to use a shift-in shift-out coding scheme; this is popular over 8-bit communication links. Imagine a stream of characters that looks like:
    a b c <SI> a b g <SO> x y

where <SI> and <SO> mean 'switch to Greek' and 'switch back to English' respectively. A display device that agreed to use that method might well then display a, b, c, alpha, beta, gamma, x and y. This is roughly the scheme used by the shift-JIS Japanese standard, except that once the shift-in has been seen, pairs of characters together are used as the code for a single Japanese character. Alternative schemes exist which use more than one shift-in character, but they are less common.
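To make the mechanism concrete, here is a minimal sketch of a routine that tracks the shift state of such a stream. The SHIFT_GREEK and SHIFT_LATIN codes and the function name are purely hypothetical, chosen for illustration:

    #include <stdio.h>

    #define SHIFT_GREEK 0x0e   /* hypothetical 'switch to Greek' code */
    #define SHIFT_LATIN 0x0f   /* hypothetical 'switch back' code */

    /* Print each data byte together with the alphabet in effect
       when it is seen. */
    void show(const unsigned char *s, size_t n)
    {
        int greek = 0;
        size_t i;

        for (i = 0; i < n; i++) {
            if (s[i] == SHIFT_GREEK)
                greek = 1;
            else if (s[i] == SHIFT_LATIN)
                greek = 0;
            else
                printf("%c read as %s\n", s[i],
                       greek ? "Greek" : "English");
        }
    }

    int main(void)
    {
        unsigned char stream[] = {
            'a', 'b', 'c', SHIFT_GREEK,
            'a', 'b', 'g', SHIFT_LATIN, 'x', 'y'
        };
        show(stream, sizeof stream);
        return 0;
    }

The byte value for 'a' appears twice in the stream, once meaning a and once meaning alpha; only the current shift state tells them apart.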
The Standard now allows explicitly for the use of extended character sets. Only the 96 characters defined earlier are used for the C part of a program, but in comments, strings, character constants and header names (these are really data, not part of the program as such) extended characters are permitted if your environment supports them. The Standard lays down a number of pretty obvious rules about how you are allowed to use them which we will not repeat here. The most significant one is that a byte whose value is zero is interpreted as a null character irrespective of any shift state. That is important, because C uses a null character to indicate the end of strings and many library functions rely on it. An additional requirement is that multibyte sequences must start and end in the initial shift state.
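The standard library's mblen function (declared in <stdlib.h>) respects both of these rules. Here is a minimal sketch, using a function name of our own choosing, that counts the multibyte characters in a string by relying on them:

    #include <stdlib.h>

    /* Count the multibyte characters in a string, stopping at the
       null byte, which is a null character whatever the shift state. */
    size_t mb_count(const char *s)
    {
        size_t count = 0;
        int len;

        mblen(NULL, 0);    /* return mblen to the initial shift state */
        while ((len = mblen(s, MB_CUR_MAX)) > 0) {
            count++;
            s += len;
        }
        return count;      /* len is 0 at the null, -1 on a bad sequence */
    }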
The char type is specified by the Standard as suitable to hold the value of all of the characters in the 'execution character set', which will be defined in your system's documentation. This means that (in the example above) it could hold the value of 'a' or 'b' or even the 'switch to Greek' character itself. Because of the shift-in shift-out mechanism, there would be no difference between the value stored in a char that was intended to represent 'a' and one intended to represent the Greek 'alpha' character. To distinguish them would mean using a different representation, probably needing more than 8 bits, which on many systems would be too big for a char. That is why the Standard introduces the wchar_t type.
To use this, you must include the <stddef.h> header, because wchar_t is simply defined as an alternative name for one of C's other types; we discuss it further in a later section. In summary, then, extended characters can appear in two forms: as multibyte characters, sequences of one or more bytes stored in ordinary arrays of char type, and as wide characters, each of which may use more storage than a regular character. These usually have a different type from char.
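As a minimal sketch of the difference, the following uses the standard mbstowcs function to convert a multibyte string into an array of wide characters; after conversion each element holds one complete character's value and no shift states are involved:

    #include <stddef.h>   /* wchar_t */
    #include <stdlib.h>   /* mbstowcs */
    #include <stdio.h>

    int main(void)
    {
        wchar_t wide[32];
        size_t n;

        /* Each multibyte character in the source string becomes
           exactly one element of wide[], however many bytes it took. */
        n = mbstowcs(wide, "hello", 32);
        if (n != (size_t)-1)
            printf("%lu wide characters\n", (unsigned long)n);
        return 0;
    }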