CWB
#define popc(s, p) s[p++]
Referenced by cl_string_latex2iso().
#define pushc(s, c, p, m) s[p++] = c; if (p>=m) goto endloop;
Referenced by cl_string_latex2iso().
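A minimal sketch of how macros of this shape can drive a bounded copy loop. Only the two macro bodies are taken from the definitions above; the function copy_bounded, its return value and the placement of the endloop label are assumptions made for illustration, not the code of cl_string_latex2iso().

```c
#define popc(s, p) s[p++]
#define pushc(s, c, p, m) s[p++] = c; if (p >= m) goto endloop;

/* Hypothetical helper: copies src into dst, writing at most max - 1
 * characters plus a terminating NUL. Returns the number of characters
 * copied (excluding the NUL). */
static int copy_bounded(const char *src, char *dst, int max)
{
    int src_pos = 0, dst_pos = 0;
    char c;

    if (max < 2) {              /* no room for anything but the NUL */
        if (max == 1)
            dst[0] = '\0';
        return 0;
    }
    while ((c = popc(src, src_pos)) != '\0') {
        /* pushc expands to two statements, so it must sit inside braces */
        pushc(dst, c, dst_pos, max - 1);
    }
endloop:
    dst[dst_pos] = '\0';
    return dst_pos;
}
```

Because pushc is a multi-statement macro that jumps to a label, any function using it has to provide an endloop label and must not use the macro as the unbraced body of an if or while.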
void cl_id_tolower(char *s)
Converts an uppercase corpus name to an equivalent lowercase form.
String is modified in situ. Only the ASCII characters are changed.
Note, this function doesn't check for what is and is not an allowed CWB-corpus-name character.
Referenced by cl_new_corpus(), encode_generate_registry_file(), and main().
void cl_id_toupper(char *s)
Converts a lowercase corpus name to an equivalent uppercase form.
String is modified in situ. Only the ASCII characters are changed.
Note, this function doesn't check for what is and is not an allowed CWB-corpus-name character.
The old version of this code was a line in cwb-encode that used the library toupper to cope with Latin1 characters. But these are no longer allowed in identifiers, which must be ASCII only.
Referenced by encode_generate_registry_file().
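The ASCII-only behaviour described for cl_id_tolower() and cl_id_toupper() can be sketched as follows. The function names are hypothetical and this is not the CL source; it only illustrates why bytes outside A-Z / a-z (including Latin1 letters) are left untouched.

```c
/* Hypothetical sketch: lowercases ASCII 'A'..'Z' in place, nothing else. */
static void id_tolower_sketch(char *s)
{
    for (; *s; s++)
        if (*s >= 'A' && *s <= 'Z')
            *s = (char)(*s + ('a' - 'A'));
}

/* Hypothetical sketch: uppercases ASCII 'a'..'z' in place, nothing else. */
static void id_toupper_sketch(char *s)
{
    for (; *s; s++)
        if (*s >= 'a' && *s <= 'z')
            *s = (char)(*s - ('a' - 'A'));
}
```

Avoiding the library tolower()/toupper() keeps the result independent of the current locale.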
int cl_id_validate(char *s)
Checks a string to see if it is a valid CWB identifier.
The rules for these are as follows (see also the CQP lexer):
* all characters must be ASCII, i.e. less than 0x80;
* must be at least 1 character long (of course);
* first character must be an uppercase or lowercase letter or underscore;
* second and subsequent characters may also be digits, hyphen or full stop;
* mixed case is allowed (just-upper and just-lower is imposed elsewhere, where necessary).
TODO: should the CL registry lexer be amended to reflect these restrictions? (ID there is rather laxer than this)
s | The string to check. |
Referenced by cl_new_corpus(), and encode_generate_registry_file().
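The rules listed for cl_id_validate() can be sketched as a small checker. The names below are hypothetical, and the return convention (1 for valid, 0 for invalid) is an assumption, since the return value is not documented here.

```c
#include <stddef.h>

/* True for ASCII letters only; bytes >= 0x80 never match. */
static int is_ascii_letter(unsigned char c)
{
    return (c >= 'A' && c <= 'Z') || (c >= 'a' && c <= 'z');
}

/* Hypothetical sketch: returns 1 if s follows the identifier rules, else 0. */
static int id_validate_sketch(const char *s)
{
    const unsigned char *p = (const unsigned char *)s;

    if (p == NULL || *p == '\0')
        return 0;                          /* at least one character */
    if (!(is_ascii_letter(*p) || *p == '_'))
        return 0;                          /* first: letter or underscore */
    for (p++; *p; p++) {
        if (!(is_ascii_letter(*p) || *p == '_' ||
              (*p >= '0' && *p <= '9') || *p == '-' || *p == '.'))
            return 0;                      /* rest: also digit, '-', '.' */
    }
    return 1;
}
```

The ASCII requirement needs no separate test in this sketch: any byte of 0x80 or above fails the character-class checks.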
void cl_path_adjust_independent(char *path)
Standardises subdirectory-dividers in a string that represents a path into Unix-like form (ie with forward-slash), regardless of what OS we are in.
Or, to put it another way, changes backslashes into forward slashes under Windows.
This may be useful because of the need to move corpora between systems.
Note that the path is modified in place.
path | The path to modify (must be Ascii-compatible) |
References SUBDIR_SEPARATOR.
void cl_path_adjust_os(char *path)
Standardises subdirectory-dividers in a string that represents a path, in an OS-sensitive way.
If the CL was compiled for Unix, backslash is changed to forwardslash. If the CL was compiled for Windows, forwardslash is changed to backslash.
Note that the path is modified in place.
path | The path to modify (must be Ascii-compatible) |
References SUBDIR_SEPARATOR.
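A minimal sketch of the two path adjustments. All names are hypothetical. In the CL the target separator comes from the compile-time constant SUBDIR_SEPARATOR; here it is passed as a parameter so that both behaviours can be tried on any platform, which is an assumption of this sketch.

```c
/* Replaces every occurrence of `from` in path with `to`, in place. */
static void path_replace_char(char *path, char from, char to)
{
    for (; *path; path++)
        if (*path == from)
            *path = to;
}

/* Sketch of the OS-independent variant: always ends up with '/'. */
static void path_adjust_independent_sketch(char *path)
{
    path_replace_char(path, '\\', '/');
}

/* Sketch of the OS-sensitive variant: subdir_separator stands in for
 * SUBDIR_SEPARATOR ('/' on Unix, '\\' on Windows). */
static void path_adjust_os_sketch(char *path, char subdir_separator)
{
    if (subdir_separator == '\\')
        path_replace_char(path, '/', '\\');
    else
        path_replace_char(path, '\\', '/');
}
```

Both functions only swap single bytes, which is why the path must be ASCII-compatible: in such encodings the bytes for '/' and '\' never occur inside another character.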
char* cl_path_get_component(char *s)
Tokenises a string into components split by ':' (or ';' under Win32).
s | The string to tokenise; or, NULL if tokenisation has already been initialised. |
References last, and PATH_SEPARATOR.
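The calling convention (pass the string once, then NULL) resembles strtok(). A minimal sketch under that assumption follows; the names are hypothetical, the separator is fixed to ':' and the sketch writes NUL bytes into the input string, which the CL function may or may not do.

```c
#include <stddef.h>
#include <string.h>

#define PATH_SEPARATOR_SKETCH ':'   /* stands in for PATH_SEPARATOR */

/* Hypothetical sketch: returns the next component, or NULL when done. */
static char *path_get_component_sketch(char *s)
{
    static char *last = NULL;       /* position to resume from */
    char *start, *end;

    if (s != NULL)
        last = s;                   /* (re)initialise tokenisation */
    if (last == NULL || *last == '\0')
        return NULL;

    start = last;
    end = strchr(start, PATH_SEPARATOR_SKETCH);
    if (end != NULL) {
        *end = '\0';                /* terminate this component */
        last = end + 1;
    } else {
        last = NULL;                /* that was the final component */
    }
    return start;
}
```

As with strtok(), the static state means that only one string can be tokenised at a time and the function is not thread-safe.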
char* cl_path_registry_quote(char *path)
Add quotes and escape slashes to a file path if necessary.
This is for the HOME and INFO fields of the registry file.
If either field contains any characters that can't be treated as an "ID" token by the registry parser, then we make sure it is treated as a (quoted) string instead, and make all appropriate substitutions.
For consistency, this function always returns a newly allocated string, regardless of whether changes have been made.
Note that the way the registry parser works, it is quite happy with either "C:\dir\subdir" or "C:\\dir\\subdir" as a path for HOME or INFO.
path | String containing the path to quotify. |
References cl_malloc(), and cl_strdup().
Referenced by encode_generate_registry_file().
char* cl_strcpy(char *buf, const char *src)
Replacement for strcpy that won't copy more than CL_MAX_LINE_LENGTH characters.
This is intended to make it easier to avoid buffer overflows. But it doesn't protect against the opposite danger of losing important data from the end of a truncated string.
Note, buffer overflow is still possible if buf is a pointer to the middle of a buffer.
So this function is not a panacea, it's just a bit of a help.
It's also implemented in a way that is safe for down-strcpying, that is, for erasing a section from the start or middle of a string (cl_strcpy(string, string+3); for instance). The POSIX standard states that the normal strcpy has undefined behaviour if the objects overlap. That's not the case here.
buf | A string buffer to copy to. |
src | The string pointer to copy from. |
References buf, and CL_MAX_LINE_LENGTH.
Referenced by encode_get_input_line(), ParsePrintOptions(), and range_declare().
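One way to get both properties (a length cap and tolerance of overlapping buffers) is to measure the source up to the cap and then use memmove(). This is a sketch of the idea, not the CL implementation; the function name is hypothetical and the limit of 16 is a small placeholder for CL_MAX_LINE_LENGTH, whose real value is not given here.

```c
#include <stddef.h>
#include <string.h>

#define MAX_LINE_LENGTH_SKETCH 16   /* placeholder for CL_MAX_LINE_LENGTH */

/* Hypothetical sketch: copies at most MAX_LINE_LENGTH_SKETCH - 1 characters
 * plus a terminating NUL; buf and src are allowed to overlap. */
static char *strcpy_sketch(char *buf, const char *src)
{
    size_t n = 0;

    /* measure src, but never look further than the cap */
    while (n < MAX_LINE_LENGTH_SKETCH - 1 && src[n] != '\0')
        n++;
    memmove(buf, src, n);           /* memmove is defined for overlap */
    buf[n] = '\0';
    return buf;
}
```

The length is measured before anything is written, so the terminator of src cannot be overwritten before it has been seen.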
void cl_string_canonical(char *s, CorpusCharset charset, int flags)
Converts a string to canonical form.
The "canonical form" of a string is for use in comparisons where case-insensitivity and/or diacritic insensitivity is desired.
Note that the string s is modified in place. This means it must have enough memory to cope with any expansions made in Unicode case folding. Ideally, allocate double the length of the string (since case-folding doesn't include any one -> more-than-two mappings so far as I know).
Note also that the arguments of this function were changed in v3.2.1. Now, a CorpusCharset is needed. This is because string canonicalising works differently in UTF8. In UTF8, the "composed" status of ALL strings is standardised (this is not dependent on flags; so this function should always be called on all strings that are going to be inserted into, or searched for within, an indexed corpus; then we know we are always dealing with maximally-precomposed strings). Then case folding / accent folding is done by calling Unicode-aware functions. This is in contrast to the process for Latin1, which just uses a straightforward mapping table for both sorts of folding.
s | The string (currently: must be Ascii, Latin-1, or UTF8, but this is not checked for you!) |
charset | The character set to use in standardising. If this is utf8, complex accent and/or case folding will be done, as per the unicode standard. If it is anything else, the Latin1 mapping tables will be used (currently no other ISO mapping tables are built in and activated in the CL). |
flags | The flags that specify which conversions are required. Can be IGNORE_CASE and/or IGNORE_DIAC. |
References cl_free, cl_string_maptable(), IGNORE_CASE, IGNORE_DIAC, and utf8.
Referenced by cl_new_regex(), cl_regex_match(), cl_string_qsort_compare(), encode_get_input_line(), print_tabulation(), SortExternally(), and SortSubcorpus().
char* cl_string_latex2iso(char *str, char *result, int target_len)
Converts ASCII strings with latex-style backslash escapes for accented characters to ISO-8859-1 (Latin-1).
Syntax:
\"[AaOoUus..] --> corresponding ISO 8859-1 character
\{octal} --> ISO 8859-1 character
Note that if cl_allow_latex2iso is FALSE, this function will simply copy the input to the output. So it is always safe to call this function.
str | The string to convert. |
result | The location to put the altered string (which should be shorter, or at least no longer than, the input string). If this parameter is NULL, space is automatically allocated for the output. result is allowed to be the same as str. |
target_len | The maximum length of the target string. If result is NULL, then this is deduced automatically. |
References cl_allow_latex2iso, cl_malloc(), cl_strdup(), popc, and pushc.
Referenced by cl_new_regex(), do_flagged_string(), do_SetVariableValue(), and do_XMLTag().
unsigned char* cl_string_maptable(CorpusCharset charset, int flags)
Gets a specified character mapping table for use in regular expressions.
Returns pointer to static mapping table for given flags (IGNORE_CASE and IGNORE_DIAC) and character set.
Removed from the public API for 3.2.0 because there's no way for it to work if the CorpusCharset is UTF8. Prototype moved to special-chars.h
Tables exist for all character sets, but for all except Latin1 and ASCII, they are currently identical to the ASCII tables (i.e. the awareness of case/accent relationships in the upper half of each character set have not yet been inserted).
charset | The character set of this corpus. Currently ignored. |
flags | The flags that specify which table is required. Can be IGNORE_CASE and/or IGNORE_DIAC. |
References ascii, charset, identity_tab, identity_tab_init, IGNORE_CASE, IGNORE_DIAC, maptable_init_both(), maptable_init_identity(), nocase_nodiac_tab, nocase_nodiac_tab_init, nocase_tab, nodiac_tab, and utf8.
Referenced by cl_string_canonical().
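For single-byte character sets, a mapping table of this kind is just a 256-entry array indexed by the byte value. The sketch below builds an ASCII-only case-folding table and applies it to a string in place; both function names are hypothetical and the table contents are illustrative, not the tables built into the CL.

```c
/* Hypothetical sketch: identity mapping except that 'A'..'Z' map to 'a'..'z'. */
static void build_ascii_nocase_table(unsigned char table[256])
{
    int i;

    for (i = 0; i < 256; i++)
        table[i] = (unsigned char)i;
    for (i = 'A'; i <= 'Z'; i++)
        table[i] = (unsigned char)(i + ('a' - 'A'));
}

/* Applies a 256-entry mapping table to every byte of s, in place. */
static void apply_maptable(char *s, const unsigned char table[256])
{
    unsigned char *p = (unsigned char *)s;

    for (; *p; p++)
        *p = table[*p];
}
```

A byte-indexed table cannot represent multi-byte sequences, which is consistent with the note above that this approach cannot work for UTF8.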
int cl_string_qsort_compare(const char *s1, const char *s2, CorpusCharset charset, int flags, int reverse)
Compares two strings in a qsort-stylie!
This function is designed to be suitable for use as a callback with qsort(). As such, its return values are negative if s1 is "less than" s2; zero if the two strings are the same; and positive if s1 is "greater than" s2. But of course you can also use it on its own.
You cannot use it directly with qsort as its parameters are wrong. It needs to be wrapped in another function that (at least) provides the charset, flags and reverse arguments (e.g. from global variables or by calling other functions).
The two strings must be in the same character set. Both will be made canonical in accordance with the flags argument if it is set. Also, the comparison can be done on reverse-order strings.
Note that if either flags or reverse is non-zero, then memory allocation will be necessary. If you are calling this function in a loop, that could quickly get costly. To avoid this, a pair of one-time-allocated buffers are used - but this doesn't dispense with all need for allocation. [Another option would be to allow a buffer to be optionally supplied....]
s1 | First string to compare. |
s2 | Second string to compare. |
charset | Character set of the two strings. |
flags | IGNORE_CASE, IGNORE_DIAC, both, or neither. |
reverse | Boolean: if true, strings are compared from end to beginning, rather than beginning to end. |
References cl_free, cl_malloc(), CL_MAX_LINE_LENGTH, cl_string_canonical(), cl_string_reverse(), MIN, s1, s2, and utf8.
Referenced by i2compare().
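The wrapping described above can be sketched as follows. Since the CL is not linked here, compare_stand_in() is a hypothetical stand-in with a reduced parameter list (a single ignore_case flag instead of charset, flags and reverse); the global variable and the other function names are likewise assumptions made for illustration.

```c
#include <ctype.h>
#include <stdlib.h>

/* Hypothetical stand-in for the real comparison function: byte-wise
 * comparison, optionally ignoring ASCII case. */
static int compare_stand_in(const char *s1, const char *s2, int ignore_case)
{
    for (;; s1++, s2++) {
        int c1 = (unsigned char)*s1, c2 = (unsigned char)*s2;
        if (ignore_case) {
            c1 = tolower(c1);
            c2 = tolower(c2);
        }
        if (c1 != c2)
            return c1 - c2;
        if (c1 == '\0')
            return 0;
    }
}

/* Extra argument supplied to the wrapper through a file-scope variable. */
static int sort_ignore_case = 0;

/* Has the signature qsort() expects; elements are `const char *`. */
static int qsort_wrapper(const void *a, const void *b)
{
    const char *s1 = *(const char * const *)a;
    const char *s2 = *(const char * const *)b;
    return compare_stand_in(s1, s2, sort_ignore_case);
}

static void sort_words(const char **words, size_t n, int ignore_case)
{
    sort_ignore_case = ignore_case;
    qsort(words, n, sizeof words[0], qsort_wrapper);
}
```

Passing the extra arguments through a global keeps the wrapper compatible with standard qsort(), at the cost of making the sort non-reentrant.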
char* cl_string_reverse(const char *s, CorpusCharset charset)
Creates a "backwards" version of the specified string.
The memory for the reversed string is newly allocated. (This is potentially wasteful, but it occurs in the depths of GLib, so short of reinventing the wheel we have to live with it.)
s | String to reverse. |
charset | The character set of the string. |
References cl_strdup(), and utf8.
Referenced by cl_string_qsort_compare(), SortExternally(), and SortSubcorpus().
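Why the character set matters for reversal: in a single-byte encoding the bytes can simply be reversed, but in UTF8 the bytes of each multi-byte character must stay in their original order. The sketch below illustrates this without GLib; the function name is hypothetical, the is_utf8 flag stands in for the CorpusCharset argument, the input is assumed to be valid UTF8, and combining characters are not treated specially.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical sketch: returns a newly allocated reversed copy of s,
 * or NULL if allocation fails. The caller frees the result. */
static char *string_reverse_sketch(const char *s, int is_utf8)
{
    size_t len = strlen(s);
    size_t in = 0, out_pos = len;
    char *out = malloc(len + 1);

    if (out == NULL)
        return NULL;
    out[len] = '\0';
    while (in < len) {
        size_t n = 1;               /* bytes in the current character */
        if (is_utf8) {
            /* continuation bytes have the bit pattern 10xxxxxx */
            while (in + n < len && ((unsigned char)s[in + n] & 0xC0) == 0x80)
                n++;
        }
        out_pos -= n;
        memcpy(out + out_pos, s + in, n);   /* keep byte order within a char */
        in += n;
    }
    return out;
}
```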
int cl_string_validate_encoding(char *s, CorpusCharset charset, int repair)
Checks the encoding of a string.
This function looks for bad bytes (or byte sequences in the case of UTF8); if any are present, it judges the string invalid. For ISO8859-* encodings, the string can optionally be "repaired" in-place by replacing bad bytes with '?' characters. If the "repair" is successful, the function returns True.
What counts as "bad" is of course relative to the character set that the string is encoded in - so this must be specified.
s | Null-terminated string to check. |
charset | CorpusCharset of the string's encoding. |
repair | if True, replace invalid 8-bit characters by '?' |
References arabic, ascii, cyrillic, greek, hebrew, latin1, latin2, latin3, latin4, latin5, latin6, latin7, latin8, latin9, and utf8.
Referenced by encode_get_input_line(), and prepare_Query().
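For the UTF8 case, "bad byte sequences" can be detected with a check along the following lines. This is a generic UTF-8 well-formedness test (lead byte, continuation bytes, overlong forms, surrogates, upper limit), written as a sketch; the function name is hypothetical and the CL may apply different or additional criteria.

```c
/* Hypothetical sketch: returns 1 if s is well-formed UTF-8, else 0. */
static int utf8_is_valid_sketch(const char *s)
{
    const unsigned char *p = (const unsigned char *)s;

    while (*p) {
        unsigned long cp;
        int extra, i;

        if (*p < 0x80) { p++; continue; }                 /* plain ASCII */
        else if ((*p & 0xE0) == 0xC0) { cp = *p & 0x1F; extra = 1; }
        else if ((*p & 0xF0) == 0xE0) { cp = *p & 0x0F; extra = 2; }
        else if ((*p & 0xF8) == 0xF0) { cp = *p & 0x07; extra = 3; }
        else return 0;           /* stray continuation or invalid lead byte */

        p++;
        for (i = 0; i < extra; i++, p++) {
            if ((*p & 0xC0) != 0x80)
                return 0;        /* also catches a premature end of string */
            cp = (cp << 6) | (*p & 0x3F);
        }
        /* reject overlong encodings, surrogates and values past U+10FFFF */
        if ((extra == 1 && cp < 0x80) || (extra == 2 && cp < 0x800) ||
            (extra == 3 && cp < 0x10000))
            return 0;
        if (cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
            return 0;
    }
    return 1;
}
```

A Latin1 string containing an accented letter typically fails this check, because a byte such as 0xE9 announces a multi-byte sequence that never arrives.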
int cl_string_zap_controls(char *s, CorpusCharset charset, char replace, int zap_tabs, int zap_newlines)
Replaces any invalid control characters in a string.
"Invalid" control characters are any below 0x20.
The string is modified in situ. A typical "replace" to use would be '?' to match the action of cl_string_validate_encoding.
s | The string to modify. |
charset | The character set of the string. |
replace | The replacement character to use. If this is 0, the character is deleted rather than replaced. |
zap_tabs | Whether or not tabs should be zapped (boolean). |
zap_newlines | Whether or not \n and \r should be zapped (boolean). |
Referenced by encode_get_input_line().
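The replace-or-delete behaviour can be sketched with a read pointer and a write pointer over the same buffer. The function name is hypothetical, the charset parameter is omitted, and the return value (the number of characters zapped) is an assumption, since the meaning of the int result is not documented here.

```c
/* Hypothetical sketch: replaces (or, if replace is 0, deletes) control
 * characters below 0x20, in place. Returns how many were zapped. */
static int zap_controls_sketch(char *s, char replace,
                               int zap_tabs, int zap_newlines)
{
    unsigned char *in = (unsigned char *)s;
    unsigned char *out = in;
    int zapped = 0;

    for (; *in; in++) {
        int zap = (*in < 0x20);
        if (*in == '\t' && !zap_tabs)
            zap = 0;                        /* tabs are spared on request */
        if ((*in == '\n' || *in == '\r') && !zap_newlines)
            zap = 0;                        /* so are line breaks */
        if (!zap) {
            *out++ = *in;
        } else {
            zapped++;
            if (replace != '\0')
                *out++ = (unsigned char)replace;
        }
    }
    *out = '\0';
    return zapped;
}
```

Because the write pointer never overtakes the read pointer, deletion needs no extra buffer.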
char* cl_xml_entity_decode(char *s)
Decode XML entities in a string.
This function decodes pre-defined XML entities in string s. It overwrites the input string s and also returns s for convenience.
(The entities are &lt; &gt; &amp; &quot; &apos;).
TODO -- numeric entities?
If passed NULL, it will not fall over - it will just pass NULL back!
This function is safe for strings in any encoding. The returned string will be at the same memory location and will always be the same length or shorter after the decoding of entities.
s | A string to decode. |
Referenced by encode_add_wattr_line(), and range_open().
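In-place decoding works because every predefined entity is longer than the character it stands for, so the output never overtakes the input. A minimal sketch of that idea follows; the function name is hypothetical and this is not the CL source. Unknown entities and numeric entities are copied through unchanged.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical sketch: decodes the five predefined XML entities in place
 * and returns s; returns NULL if s is NULL. */
static char *xml_entity_decode_sketch(char *s)
{
    static const struct { const char *name; char ch; } entities[] = {
        {"&lt;", '<'}, {"&gt;", '>'}, {"&amp;", '&'},
        {"&quot;", '"'}, {"&apos;", '\''}
    };
    char *in, *out;
    size_t i, n;

    if (s == NULL)
        return NULL;
    for (in = out = s; *in; ) {
        int matched = 0;
        if (*in == '&') {
            for (i = 0; i < sizeof entities / sizeof entities[0]; i++) {
                n = strlen(entities[i].name);
                if (strncmp(in, entities[i].name, n) == 0) {
                    *out++ = entities[i].ch;   /* write the decoded char */
                    in += n;                   /* skip the whole entity */
                    matched = 1;
                    break;
                }
            }
        }
        if (!matched)
            *out++ = *in++;                    /* ordinary byte: copy it */
    }
    *out = '\0';
    return s;
}
```

Decoding is a single pass, so the result of decoding an entity is not scanned again for further entities.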
void maptable_init_both(unsigned char *maptable, const unsigned char *nocasetable, const unsigned char *nodiactable)
Initialise a "fold both case and diacritics" mapping table.
Referenced by cl_string_maptable().
void maptable_init_identity(unsigned char *maptable)
Initialise an "identity" mapping table.
Referenced by cl_string_maptable().
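A sketch of what the two initialisers could look like for 256-entry tables. The function names are hypothetical, and the order of composition in the "both" table (case folding first, then diacritic folding) is an assumption, not something stated above.

```c
/* Hypothetical sketch: every byte maps to itself. */
static void maptable_init_identity_sketch(unsigned char *maptable)
{
    int i;

    for (i = 0; i < 256; i++)
        maptable[i] = (unsigned char)i;
}

/* Hypothetical sketch: composes the two tables, folding case first and
 * then stripping diacritics from the result. */
static void maptable_init_both_sketch(unsigned char *maptable,
                                      const unsigned char *nocasetable,
                                      const unsigned char *nodiactable)
{
    int i;

    for (i = 0; i < 256; i++)
        maptable[i] = nodiactable[nocasetable[i]];
}
```

With Latin1 tables, for example, 0xC9 (É) would be lowercased to 0xE9 (é) and then mapped to plain 'e'.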
int cl_allow_latex2iso = 0
Boolean switch enabling/disabling latex-style escapes.
By default, it is false; if programs wish to allow these escapes they need to offer some means of changing this variable.
Note that enabling this variable may cause scrambling of the string for LatinX strings where X is not 1; and may cause undefined errors for UTF8 strings. In short, you should only activate it when you are working with a corpus whose charset is Latin1.
Referenced by cl_string_latex2iso().
const unsigned char identity_tab[unknown_charset][256]
Array of mapping tables used when NEITHER case NOR diacritics are to be stripped.
These are composite tables: they are only generated when needed (the corresponding identity_tab_init value is a boolean indicating whether this has been done yet).
Use a CorpusCharset value as the index into this array.
Referenced by cl_string_maptable().
int identity_tab_init[unknown_charset] = {0,0,0,0,0,0,0,0,0,0,0,0,0,0,0}
Referenced by cl_string_maptable().
unsigned char nocase_nodiac_tab[unknown_charset][256]
Array of mapping tables used when BOTH case AND diacritics are to be stripped.
These are composite tables: they are only generated when needed (the corresponding nocase_nodiac_tab_init value is a boolean indicating whether this has been done yet).
Use a CorpusCharset value as the index into this array.
Referenced by cl_string_maptable().
int nocase_nodiac_tab_init[unknown_charset] = {0,0,0,0,0,0,0,0,0,0,0,0,0,0,0}
Referenced by cl_string_maptable().
unsigned char nocase_tab[unknown_charset][256]
Array of tables mapping a character (the index) to the equivalent character in lowercase (the value).
There are as many tables as there are possible values of CorpusCharset. Moreover, tables must always be in the same order as the values of CorpusCharset are declared in.
This means starting at ascii == 0 and working up through the canonical order that is observable in cl.h
Use a CorpusCharset value as the index into this array.
Referenced by cl_string_maptable().
unsigned char nodiac_tab[unknown_charset][256]
Array of tables mapping a character (the index) to the equivalent character without any accents (the value).
There are as many tables as there are possible values of CorpusCharset. Moreover, tables must always be in the same order as the values of CorpusCharset are declared in.
This means starting at ascii == 0 and working up through the canonical order that is observable in cl.h
Use a CorpusCharset value as the index into this array.
Referenced by cl_string_maptable().