CWB
The CL_Regex object, and the CL Regular Expression Optimiser.
This is the CL front-end to POSIX regular expressions with CL semantics (most notably: CL regexes always match the entire string and NOT substrings.)
Note that the optimiser is handled automatically by the CL_Regex object.
All variables / functions containing "regopt" are internal to this module and are not exported in the CL API.
Optimisation is done by means of "grains". The grain array in a CL_Regex object is a list of short strings. Any string which will match the regex must contain at least one of these. Thus, the grains provide a quick way of filtering out strings that definitely WON'T match, and avoiding a time-wasting call to the POSIX regex matching function.
While a regex is being optimised, the grains are stored in non-exported global variables in this module. Subsequently they are transferred to members of the CL_Regex object with which they are associated. The use of global variables and a fixed-size buffer for grains is partly due to historical reasons, but it also serves to reduce memory allocation overhead.
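The filtering idea behind grains can be illustrated with a minimal sketch. The helper name contains_any_grain is hypothetical and not part of the CL API, and the plain strstr() scan is a stand-in for illustration only (the module itself uses a Boyer-Moore jump table). The sketch assumes at least one grain is available; with no grains, a regex is not optimised and the full match must always be run.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper, not part of the CL API: returns 1 if `str` contains
 * at least one of the `n_grains` grain strings, 0 otherwise.
 * Assumes n_grains >= 1. */
static int contains_any_grain(const char *str, const char *const *grains, int n_grains)
{
    int i;
    for (i = 0; i < n_grains; i++) {
        if (strstr(str, grains[i]) != NULL)
            return 1;   /* a grain is present: the full regex must still be run */
    }
    return 0;           /* no grain present: the string cannot match */
}
```

A caller would run the expensive regex match only when this check returns 1; a return value of 0 is enough to reject the string.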
void cl_delete_regex(CL_Regex rx)
Deletes a CL_Regex object.
Note that we use cl_free to deallocate the internal PCRE buffers, not pcre_free, for the simple reason that pcre_free is just a function pointer that will normally contain free, and thus we miss out on the checking that cl_free provides.
rx | The CL_Regex to delete. |
References cl_free, _CL_Regex::extra, _CL_Regex::grain, _CL_Regex::grains, _CL_Regex::haystack_buf, and _CL_Regex::needle.
Referenced by cl_regex2id(), free_booltree(), and free_environment().
CL_Regex cl_new_regex(char *regex, int flags, CorpusCharset charset)
Creates a new CL_Regex object (i.e. a regular expression buffer).
The regular expression is preprocessed according to the flags, and anchored to the start and end of the string. (That is, ^ is added to the start, $ to the end.)
Then the resulting regex is compiled (using PCRE) and optimised.
regex | String containing the regular expression |
flags | IGNORE_CASE, or IGNORE_DIAC, or both, or 0. |
charset | The character set of the regex. |
References CDA_EBADREGEX, CDA_OK, charset, _CL_Regex::charset, cl_debug, cl_errno, cl_free, cl_malloc(), CL_MAX_LINE_LENGTH, cl_regex_error, cl_regopt_analyse(), cl_string_canonical(), cl_string_latex2iso(), _CL_Regex::extra, _CL_Regex::flags, _CL_Regex::grains, _CL_Regex::haystack_buf, IGNORE_CASE, IGNORE_DIAC, _CL_Regex::needle, regopt_data_copy_to_regex_object(), and utf8.
Referenced by cl_regex2id(), do_flagged_string(), do_XMLTag(), main(), and scancorpus_add_key().
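The anchoring step described for cl_new_regex() could look roughly like the sketch below. The helper name anchor_regex is hypothetical, and wrapping the pattern in a non-capturing group (PCRE syntax) is an assumption of this sketch, made so that a top-level alternation is anchored as a whole; the description above only states that ^ and $ are added.

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: returns a newly allocated copy of `regex` wrapped as
 * ^(?:regex)$ , or NULL if allocation fails. The caller frees the result. */
static char *anchor_regex(const char *regex)
{
    size_t len = strlen(regex) + 6;    /* "^(?:" is 4 characters, ")$" is 2 */
    char *anchored = malloc(len + 1);  /* +1 for the terminating NUL */
    if (anchored == NULL)
        return NULL;
    snprintf(anchored, len + 1, "^(?:%s)$", regex);
    return anchored;
}
```

With anchoring in place, a pattern such as walk(s|ed) can only match the whole string, which is the CL semantics noted at the top of this page.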
int cl_regex_match(CL_Regex rx, char *str)
Matches a regular expression against a string.
The regular expression contained in the CL_Regex is compared to the string. No settings or flags are passed to this function; rather, the settings that rx was created with are used.
rx | The regular expression to match. |
str | The string to compare the regex to. |
References _CL_Regex::anchor_end, _CL_Regex::anchor_start, _CL_Regex::charset, cl_debug, cl_regopt_successes, cl_string_canonical(), _CL_Regex::extra, _CL_Regex::flags, _CL_Regex::grain, _CL_Regex::grain_len, _CL_Regex::grains, _CL_Regex::haystack_buf, _CL_Regex::jumptable, and _CL_Regex::needle.
Referenced by cl_regex2id(), eval_bool(), eval_constraint(), is_regular(), main(), and matchfirstpattern().
int cl_regex_optimised(CL_Regex rx)
Finds the level of optimisation of a CL_Regex.
This function returns the approximate level of optimisation, computed from the ratio of grain length to number of grains (0 = no grains, ergo not optimised at all).
rx | The CL_Regex to check. |
References _CL_Regex::grain_len, and _CL_Regex::grains.
Referenced by cl_regex2id().
int cl_regopt_analyse(char *regex)
Analyses a regular expression and tries to find the best set of grains.
Part of the regex optimiser. For a given regular expression, this function will try to extract a set of grains from regular expression {regex_string}. These grains are then used by the CL regex matcher and cl_regex2id() for faster regular expression search.
If successful, this function returns True and stores the grains in the optimiser's global variables (from which they should be copied to a CL_Regex object's corresponding members).
Usage: optimised = cl_regopt_analyse(regex_string);
This is a non-exported function.
regex | String containing the regex to optimise. |
References buf, cl_debug, cl_regopt_anchor_end, cl_regopt_anchor_start, cl_regopt_grain, cl_regopt_grain_len, cl_regopt_grains, grain_buffer, grain_buffer_grains, local_grain_data, make_jump_table(), read_disjunction(), read_grain(), read_kleene(), read_wildcard(), and update_grain_buffer().
Referenced by cl_new_regex().
int cl_regopt_count_get(void)
Get a reading from the "success counter" for optimised regexes.
The counter is incremented by 1 every time the "grain" system is used successfully to avoid calling PCRE. That is, it is incremented every time a string is scrutinised and found to contain none of the grains.
Usage:
for (i = 0, hits = 0; i < n; i++) if (cl_regex_match(rx, haystacks[i])) hits++;
fprintf(stderr, "Found %d matches; avoided regex matching %d times out of %d trials", hits, cl_regopt_count_get(), n );
References cl_regopt_successes.
Referenced by cl_regex2id().
void cl_regopt_count_reset(void)
Reset the "success counter" for optimised regexes.
References cl_regopt_successes.
Referenced by cl_regex2id().
int is_safe_char(unsigned char c)
Is the given character a 'safe' character which will only match itself in a regex?
What counts as safe: A to Z, a to z, 0 to 9, minus, quote marks, percent, ampersand, slashes, exclamation mark, colon, semicolon, character, underscore, any value over 0x7f.
What counts as not safe therefore includes: brackets, braces, square brackets; question mark, plus, and star; circumflex and dollar sign; dot; hash; etc.
(But, in UTF8, Unicode PUNC area equivalents of these characters will be safe.)
c | The character (cast to unsigned for the comparison). |
Referenced by read_grain(), and read_matchall().
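The safety test listed for is_safe_char() can be sketched as below. This is an illustration under assumptions, not the module's code: the name is_safe_char_sketch is hypothetical, "slashes" is read as the forward slash only (a backslash is the regex escape character), and the one list entry that is not legible in the description is left out.

```c
/* Sketch of the safety test: returns 1 if `c` is assumed to match only
 * itself in a regex, 0 otherwise. */
static int is_safe_char_sketch(unsigned char c)
{
    if ((c >= 'A' && c <= 'Z') || (c >= 'a' && c <= 'z') || (c >= '0' && c <= '9'))
        return 1;
    switch (c) {
    case '-': case '"': case '\'': case '%': case '&':
    case '/': case '!': case ':': case ';': case '_':
        return 1;               /* punctuation with no special regex meaning */
    default:
        return c > 0x7f;        /* any value over 0x7f counts as safe */
    }
}
```

Taking an unsigned char means the comparison against 0x7f behaves the same whether or not plain char is signed on the platform.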
void make_jump_table(void)
Computes a jump table for Boyer-Moore searches.
Unlike the textbook version, this jumptable includes the last character of each grain (in order to avoid running the string comparing loops every time).
A non-exported function.
References cl_debug, cl_regopt_grain, cl_regopt_grain_len, cl_regopt_grains, and cl_regopt_jumptable.
Referenced by cl_regopt_analyse().
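A jump table of the kind described for make_jump_table() might be built and used as in the following sketch. The function names, the explicit parameters (the module works on global variables instead), and the search loop are assumptions for illustration. All grains are assumed to share one length grain_len >= 1. An entry of 0 means the character is the last character of some grain, so a full comparison is only attempted in that case.

```c
#include <stddef.h>
#include <string.h>

#define N_CHARS 256

/* Fill `jump` so that jump[c] is the smallest distance from the end of a
 * grain at which c occurs in any grain (0 if c is the last character of
 * some grain), or grain_len if c occurs in no grain at all. */
static void build_jump_table(int jump[N_CHARS], const char *const *grains,
                             int n_grains, int grain_len)
{
    int c, k, j;
    for (c = 0; c < N_CHARS; c++) {
        jump[c] = grain_len;
        for (k = 0; k < n_grains; k++) {
            /* scan grain k from its last character backwards */
            for (j = 0; j < grain_len; j++) {
                if ((unsigned char) grains[k][grain_len - 1 - j] == c) {
                    if (j < jump[c])
                        jump[c] = j;
                    break;
                }
            }
        }
    }
}

/* Returns 1 if any grain occurs in `str`, using the table to skip ahead. */
static int find_grain(const char *str, const int jump[N_CHARS],
                      const char *const *grains, int n_grains, int grain_len)
{
    int len = (int) strlen(str);
    int i = 0, k, skip;
    while (i + grain_len <= len) {
        /* look at the character under the last position of the window */
        skip = jump[(unsigned char) str[i + grain_len - 1]];
        if (skip > 0) {
            i += skip;          /* no grain can end here: shift the window */
        } else {
            for (k = 0; k < n_grains; k++)
                if (strncmp(str + i, grains[k], (size_t) grain_len) == 0)
                    return 1;
            i++;
        }
    }
    return 0;
}
```

Indexing the table with an unsigned char keeps the index in the range 0 to 255 even for bytes above 0x7f.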
char *read_disjunction(char *mark, int *align_start, int *align_end)
Finds grains in a disjunction group - part of the CL Regex Optimiser.
This function finds grains in a disjunction group within a regular expression; the grains are then stored in the grain_buffer.
The first argument, mark, must point to the '(' at beginning of the disjunction group.
The booleans align_start and align_end are set to true if the grains from *all* alternatives are anchored at the start or end of the disjunction group, respectively.
This is a non-exported function.
mark | Pointer to the disjunction group (see also function description). |
align_start | See function description. |
align_end | See function description. |
References buf, grain_buffer, grain_buffer_grains, local_grain_data, MAX_GRAINS, read_grain(), and read_wildcard().
Referenced by cl_regopt_analyse().
char *read_grain(char *mark)
Reads in a grain from a regex - part of the CL Regex Optimiser.
A grain is a string of safe symbols not followed by ?, *, or {..}. This function finds the longest grain it can starting at the point in the regex indicated by mark; backslash-escaped characters are allowed but the backslashes must be stripped by the caller.
mark | Pointer to location in the regex string from which to read. |
References is_safe_char().
Referenced by cl_regopt_analyse(), and read_disjunction().
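The grain-reading rule given for read_grain() can be sketched as follows. Everything here is a simplified illustration: the names are hypothetical, the stand-in safety test accepts only ASCII letters and digits, a backslash is accepted only before a punctuation character, and returning a pointer to the first character after the grain (or mark itself if no grain is found) is an assumption about the return convention.

```c
#include <ctype.h>

/* Stand-in for is_safe_char(): ASCII letters and digits only. */
static int safe_sketch(unsigned char c)
{
    return (c >= 'A' && c <= 'Z') || (c >= 'a' && c <= 'z') || (c >= '0' && c <= '9');
}

/* Returns a pointer just past the longest grain starting at `mark`,
 * or `mark` itself if there is none. Backslashes are not stripped. */
static const char *read_grain_sketch(const char *mark)
{
    const char *point = mark;   /* current read position */
    const char *last = mark;    /* start of the most recently read symbol */
    while (*point != '\0') {
        if (safe_sketch((unsigned char) *point)) {
            last = point;
            point++;
        } else if (point[0] == '\\' && ispunct((unsigned char) point[1])) {
            last = point;       /* escaped literal such as \. counts as one symbol */
            point += 2;
        } else {
            break;
        }
    }
    /* a symbol followed by ?, * or {..} may be absent from a matching string,
     * so it cannot be part of the grain */
    if (point > mark && (*point == '?' || *point == '*' || *point == '{'))
        point = last;
    return point;
}
```

For the input abc*d the sketch stops after ab, because the c is subject to the star and need not occur in a matching string.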
char *read_kleene(char *mark)
Reads in a repetition marker - part of the CL Regex Optimiser.
This function reads in a Kleene star (asterisk), ?, +, or the general repetition modifier {n,n}; it returns a pointer to the first character after the repetition modifier it has found.
mark | Pointer to location in the regex string from which to read. |
Referenced by cl_regopt_analyse(), and read_wildcard().
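Reading a repetition marker, as described for read_kleene(), could be sketched like this. The name read_kleene_sketch is hypothetical, and returning mark unchanged when no modifier is present is an assumption of the sketch; the brace form is checked only loosely (digits and commas between the braces).

```c
/* Returns a pointer to the first character after a repetition modifier
 * (*, ?, + or {n,n}) found at `mark`, or `mark` itself if there is none. */
static const char *read_kleene_sketch(const char *mark)
{
    const char *point = mark;
    if (*point == '?' || *point == '*' || *point == '+')
        return point + 1;
    if (*point == '{') {
        point++;
        while ((*point >= '0' && *point <= '9') || *point == ',')
            point++;
        /* require a closing brace and at least one character inside */
        if (*point == '}' && point > mark + 1)
            return point + 1;
    }
    return mark;    /* no repetition modifier at this position */
}
```

Comparing the returned pointer with mark tells the caller whether a modifier was consumed.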
char *read_matchall(char *mark)
Reads in a matchall (dot wildcard) or safe character - part of the CL Regex Optimiser.
This function reads in matchall, any safe character, or a reasonably safe-looking character class.
mark | Pointer to location in the regex string from which to read. |
References is_safe_char().
Referenced by read_wildcard().
char *read_wildcard(char *mark)
Reads in a wildcard - part of the CL Regex Optimiser.
This function reads in a wildcard segment matching an arbitrary substring (but without a '|' symbol); it returns a pointer to the first character after the wildcard segment.
Note that effectively, wildcard equals matchall plus kleene.
mark | Pointer to location in the regex string from which to read. |
References read_kleene(), and read_matchall().
Referenced by cl_regopt_analyse(), and read_disjunction().
void regopt_data_copy_to_regex_object(CL_Regex rx)
Internal regopt function: copies optimiser data from internal global variables to the member variables of argument CL_Regex object.
References _CL_Regex::anchor_end, _CL_Regex::anchor_start, cl_debug, cl_regopt_anchor_end, cl_regopt_anchor_start, cl_regopt_grain, cl_regopt_grain_len, cl_regopt_grains, cl_regopt_jumptable, cl_strdup(), _CL_Regex::grain, _CL_Regex::grain_len, _CL_Regex::grains, and _CL_Regex::jumptable.
Referenced by cl_new_regex().
void update_grain_buffer(int front_aligned, int anchored)
Updates the public grain buffer -- part of the CL Regex Optimiser.
This function copies the local grains to the public buffer, if they are better than the set of grains currently there.
A non-exported function.
front_aligned | Boolean: if true, grain strings are aligned on the left when they are reduced to equal lengths. |
anchored | Boolean: if true, the grains are anchored at beginning or end of string, depending on front_aligned. |
References buf, CL_MAX_LINE_LENGTH, cl_regopt_anchor_end, cl_regopt_anchor_start, cl_regopt_grain, cl_regopt_grain_len, cl_regopt_grains, grain_buffer, grain_buffer_grains, and public_grain_data.
Referenced by cl_regopt_analyse().
char cl_regex_error[CL_MAX_LINE_LENGTH]
Error messages from (PCRE) regex compilation are placed in this buffer if cl_new_regex() fails.
This global variable is part of the CL_Regex object's API.
Referenced by cl_new_regex(), and cl_regex2id().
Boolean (cl_regopt_anchor_end): whether grains are anchored at end of string.
Referenced by cl_regopt_analyse(), regopt_data_copy_to_regex_object(), and update_grain_buffer().
Boolean (cl_regopt_anchor_start): whether grains are anchored at beginning of string.
Referenced by cl_regopt_analyse(), regopt_data_copy_to_regex_object(), and update_grain_buffer().
char *cl_regopt_grain[MAX_GRAINS]
list of 'grains' (any matching string must contain one of these)
Referenced by cl_regopt_analyse(), make_jump_table(), regopt_data_copy_to_regex_object(), and update_grain_buffer().
Length of each grain (cl_regopt_grain_len); all the grains have the same length.
Referenced by cl_regopt_analyse(), make_jump_table(), regopt_data_copy_to_regex_object(), and update_grain_buffer().
int cl_regopt_grains
number of grains
Referenced by cl_regopt_analyse(), make_jump_table(), regopt_data_copy_to_regex_object(), and update_grain_buffer().
int cl_regopt_jumptable[256]
A jump table for the Boyer-Moore search algorithm; use an unsigned char as index.
Referenced by make_jump_table(), and regopt_data_copy_to_regex_object().
int cl_regopt_successes = 0
A counter of how many times the "grain" system has allowed us to avoid calling the regex engine.
Referenced by cl_regex_match(), cl_regopt_count_get(), and cl_regopt_count_reset().
char *grain_buffer[MAX_GRAINS]
Intermediate buffer for grains.
When a regex is parsed, grains for each segment are written to this intermediate buffer; if the new set of grains is better than the current one, it is copied to the cl_regopt_ variables.
Referenced by cl_regopt_analyse(), read_disjunction(), and update_grain_buffer().
int grain_buffer_grains = 0
The number of grains currently in the intermediate buffer.
Referenced by cl_regopt_analyse(), read_disjunction(), and update_grain_buffer().
char local_grain_data[CL_MAX_LINE_LENGTH]
A buffer for grain strings.
Referenced by cl_regopt_analyse(), and read_disjunction().
char public_grain_data[CL_MAX_LINE_LENGTH]