A character set defines the valid characters that can be used in source programs or interpreted when a program is running. The source character set is the set of characters available for the source text. The execution character set is the set of characters available when executing a program. The source character set does not necessarily match the execution character set; for example, when the execution character set is not available on the devices used to produce the source code.
Different character sets exist; for example, one character set is
based on the American Standard Code for Information Interchange
(ASCII) definition of characters, while another set includes
the Japanese kanji characters. The character set in use makes no
difference to the compiler; each character simply has a unique
value. C treats each character as a different integer value.
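The idea that each character is simply an integer can be sketched in a few lines. This is a minimal illustration, not part of the compiler documentation; the function names `digit_value` and `show_code` are hypothetical. It relies only on the C guarantee that the digits 0 through 9 have consecutive values, so it does not assume any particular character set.

```c
#include <stdio.h>

/* Return the numeric value of a decimal digit character.
   C guarantees that '0' through '9' have consecutive integer
   values, whatever character set is in use. */
int digit_value(char c)
{
    return c - '0';
}

/* Print a character next to the integer value the
   implementation assigns to it (the value depends on the
   execution character set). */
void show_code(char c)
{
    printf("'%c' has the value %d\n", c, (int)c);
}
```

Calling `show_code('A')` on a system whose execution character set is ASCII would be expected to report the ASCII code for that letter; on another character set the number may differ.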
The ASCII character set has 128 characters, each of which can be
represented in 8 bits or fewer. However, in
some extended character sets, so many characters exist that
some characters' representation requires more than 8 bits. A
special type was created to accommodate these larger characters,
called the wchar_t
(or wide character) type. Section 1.8.3.1 discusses wide
characters further.
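As a rough sketch of how the wide character type is used, the following fragment measures a wide string. The helper names `wide_length` and `wide_storage` are hypothetical; `wcslen` is the standard library routine declared in `<wchar.h>`. The size of `wchar_t` is implementation-defined, so the code does not assume a particular width.

```c
#include <stddef.h>
#include <wchar.h>

/* Number of wide characters in s, not counting the terminating
   null wide character. */
size_t wide_length(const wchar_t *s)
{
    return wcslen(s);
}

/* Number of bytes needed to store s, including its terminator.
   Each element occupies sizeof(wchar_t) bytes. */
size_t wide_storage(const wchar_t *s)
{
    return (wcslen(s) + 1) * sizeof(wchar_t);
}
```

A wide string literal is written with an `L` prefix, for example `wide_length(L"abc")`.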
Most ANSI-compatible C compilers accept the following ASCII characters for both the source and execution character sets. Each ASCII character corresponds to a numeric value. Appendix C lists the ASCII characters and their numeric values.
a b c d e f g h i j k l m n o p q r s t u v w x y z
A B C D E F G H I J K L M N O P Q R S T U V W X Y Z
0 1 2 3 4 5 6 7 8 9
! # % ^ & * ( ) - _ = + ~ ' " : ; ? / | \ { } [ ] , . < > $
A warning is issued if the $
character is used when the compiler's strict ANSI mode option is
specified.
The following white-space characters are also accepted:
Space | ( ) |
Horizontal tab | (\t) |
Form feed | (\f) |
Vertical tab | (\v) |
New-line character | (\n) |
In character constants and string literals, characters from the execution character set can also be represented by character or numeric escape sequences. Section 1.8.3.3 and Section 1.8.3.4 describe these escape sequences.
The ASCII execution character set also includes the following control characters:
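New-line character | (represented by \n in the source file) |
Alert (bell) | (\a) |
Backspace | (\b) |
Carriage return | (\r) |
Null character | (\0) |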
The null character is a byte or wide character with all bits set to 0. It is used to mark the end of a character string. Section 1.7 discusses character strings in more detail.
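The role of the null character as an end marker can be shown with a small hand-written length routine. This is an illustrative sketch only; `count_chars` is a hypothetical name, and in practice the standard `strlen` function does the same job.

```c
#include <stddef.h>

/* Count the characters in s by scanning until the null
   character that marks the end of the string. */
size_t count_chars(const char *s)
{
    size_t n = 0;
    while (s[n] != '\0')
        n++;
    return n;
}
```

Because the scan stops at the first null character, an embedded `\0` cuts the string short: `count_chars("ab\0cd")` sees only the first two characters.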
The new-line character splits the source character stream into separate lines for greater legibility and for proper operation of the preprocessor.
Sometimes a line longer than the terminal or window width must
be interpreted by the compiler as one logical line. One logical
line can be typed as two or more lines by appending the backslash
character (\) to the end of the continued lines. The
backslash must be immediately followed by a new-line character. The
backslash signifies that the current logical line continues on the
next line. For example:
#define ERROR_TEXT "Your entry was outside the range of \
0 to 100."
The compiler deletes the backslash character and the adjacent new-line character during processing, so that this line becomes one logical line, as follows:
#define ERROR_TEXT "Your entry was outside the range of 0 to 100."
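The equivalence of the two forms can be checked in a short program. The sketch below reuses the `ERROR_TEXT` macro from the text; the function `continuation_matches` is a hypothetical helper added for illustration.

```c
#include <string.h>

/* One logical line typed as two physical lines. The second line
   starts in the first column so that no extra spaces enter the
   string. */
#define ERROR_TEXT "Your entry was outside the range of \
0 to 100."

/* Return 1 if the continued definition produces the same string
   as the single-line form, 0 otherwise. */
int continuation_matches(void)
{
    return strcmp(ERROR_TEXT,
                  "Your entry was outside the range of 0 to 100.") == 0;
}
```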
A long string can be continued across multiple lines by using the backslash-newline line continuation feature, but the continuation of the string must start in the first position of the next line. In some cases, this destroys the indentation scheme of the program. The ANSI C standard introduces another string continuation mechanism to avoid this problem. Two string literals, with only white space separating them, are combined to form one logical string literal. For example:
printf ("Your entry was outside the range of " "0 to 100.\n");
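A minimal sketch of the same mechanism with the two literals placed on separate, indented lines follows. The function names `range_message` and `range_message_length` are hypothetical and exist only for this illustration.

```c
#include <string.h>

/* Adjacent string literals separated only by white space are
   joined into one literal, so the second piece can be indented
   to match the surrounding code. */
const char *range_message(void)
{
    return "Your entry was outside the range of "
           "0 to 100.\n";
}

/* Length of the joined literal. The white space between the two
   pieces contributes no characters. */
size_t range_message_length(void)
{
    return strlen(range_message());
}
```

Unlike backslash-newline continuation, this form leaves the indentation of the second line free, because the white space lies outside both literals.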
The maximum logical line length is 32,767 characters.
To write C programs using character sets that do not contain all of C's punctuation characters, ANSI C allows the use of nine trigraph sequences in the source file. These three-character sequences are replaced by a single character in the first phase of compilation. (See Section 2.15 for an explanation of compilation phases.) Table 1-1 lists the valid trigraph sequences and their character equivalents.
Trigraph Sequence | Character Equivalent |
---|---|
??= | # |
??( | [ |
??/ | \ |
??) | ] |
??' | ^ |
??< | { |
??! | \| |
??> | } |
??- | ~ |
No other trigraph sequences are recognized. A question mark (?) that does not begin a trigraph sequence remains unchanged during compilation. For example, consider the following source line:
printf ("Any questions???/n");
After the ??/ sequence is replaced, this line is translated as follows:
printf ("Any questions?\n");
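The replacement rule can be modeled with a small routine that scans text the way the first compilation phase is described above. This is only a sketch of the rule, not the compiler's implementation; the names `trigraph_equivalent` and `replace_trigraphs` are hypothetical. The comments deliberately avoid spelling out any trigraph, since a compiler with trigraph support would translate it there as well.

```c
#include <stddef.h>

/* Map the third character of a candidate trigraph to its
   replacement, or return 0 if it does not complete one. */
static char trigraph_equivalent(char third)
{
    switch (third) {
    case '=':  return '#';
    case '(':  return '[';
    case '/':  return '\\';
    case ')':  return ']';
    case '\'': return '^';
    case '<':  return '{';
    case '!':  return '|';
    case '>':  return '}';
    case '-':  return '~';
    default:   return 0;
    }
}

/* Copy in to out, replacing every trigraph with its single
   character equivalent. out must have room for the length of
   in plus one. Returns the length of the result. */
size_t replace_trigraphs(const char *in, char *out)
{
    size_t n = 0;
    while (*in != '\0') {
        char repl = 0;
        /* Two question marks start a trigraph only if the next
           character is one of the nine listed above. */
        if (in[0] == '?' && in[1] == '?')
            repl = trigraph_equivalent(in[2]);
        if (repl != 0) {
            out[n++] = repl;
            in += 3;
        } else {
            out[n++] = *in++;
        }
    }
    out[n] = '\0';
    return n;
}
```

When a test string is passed to this routine from C source, writing each question mark as the escape `\?` keeps the compiler itself from translating the sequence first.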
Digraph processing is supported when compiling in ISO C 94 mode (/STANDARD=ISOC94 on OpenVMS systems).
Digraphs are pairs of characters that translate into a single character, much like trigraphs, except that trigraphs get replaced inside string literals, but digraphs do not. Table 1-2 lists the valid digraph sequences and their character equivalents.
Digraph Sequence | Character Represented |
---|---|
<: | [ |
:> | ] |
<% | { |
%> | } |
%: | # |
%:%: | ## |
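A short sketch of digraphs in use follows. It assumes a compiler mode that recognizes digraphs; the function names `sum_three` and `digraph_literal_size` are hypothetical. The second function illustrates that a digraph inside a string literal is left as two ordinary characters.

```c
/* Brackets and braces written as digraphs. Assumes the compiler
   is running in a mode that recognizes them. */
int sum_three(void)
<%
    int values<:3:> = <% 1, 2, 3 %>;
    return values<:0:> + values<:1:> + values<:2:>;
%>

/* Inside a string literal the two characters are not translated,
   so this literal holds '<', ':' and the terminating null. */
unsigned long digraph_literal_size(void)
<%
    return (unsigned long)sizeof("<:");
%>
```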