1 #ifndef MSP_STRINGS_REGEX_H_
2 #define MSP_STRINGS_REGEX_H_
10 This class provides regular expression matching. It supports a subset of
11 POSIX.2 extended regex syntax. Character classes, equivalence classes and
12 collating elements are not supported. Refer to the regex(7) manpage for more
15 A description of the internal workings of the class follows. You may skip this
16 if you are not a developer or otherwise interested.
18 The core of this class is a virtual machine running a non-deterministic finite
19 automaton (NFA). The state of the automaton is represented by an iterator into
20 the code. Transitions between states are represented by instructions.
21 Instructions may take multiple bytes, so a valid code iterator may not be a
24 The virtual machine may have any number of execution contexts at any one time.
25 On every cycle, each context is advanced until an input-consuming instruction
26 is encountered, keeping the input positions in sync with each other. Execution
27 continues as long as there's at least one context remaining.
29 The GROUP_BEGIN and GROUP_END instructions record the begin and end offset of a
30 match group, respectively. The GROUP_END instruction also marks the group as
31 successfully matched. If the target group was already matched, these
32 instructions do nothing.
34 The JUMP instruction causes the execution iterator to be modified by an offset.
36 The ND_JUMP instruction causes the execution context to be split in two. One
37 continues directly after the instruction and the other continues at an offset.
39 The NEGATE instruction causes the result of the next match instruction to be
42 Match instructions compare the input against a condition. If the match
43 succeeds, execution continues at the next instruction. If the match fails,
44 execution of that context is terminated.
46 The MATCH_BEGIN and MATCH_END instructions match the beginning and end of
47 the input string, respectively. They do not consume input.
49 The MATCH_CHAR instruction consumes the input character and matches it against
50 a single character. Since regexes often match sequences of printable character,
51 a match for such a character may be encoded as the character itself.
53 The MATCH_RANGE instruction consumes the input character and matches it against
54 an inclusive character range.
56 The MATCH_MASK instruction consumes the input character and matches it against
59 The MATCH_ANY instruction consumes the input character and always succeeds.
65 Constructs a new Regex object.
67 Regex(const std::string &expr);
70 Matches the regex against a string. Refer to RegMatch documentation for
71 more information on the resulting object.
73 RegMatch match(const std::string &str) const;
76 Returns a disassembled representation of the NFA bytecode. For debugging
79 std::string disassemble() const;
81 typedef std::string Code;
82 typedef unsigned short Count;
84 typedef unsigned short Index;
110 Code::const_iterator citer;
111 RegMatch::GroupArray groups;
118 Compiles a regular expression into NFA bytecode. When compiling a whole
119 regex, \a group should be set to 0. When the function returns, \a group will
120 be the index of the last subexpression and \a iter will point to the first
121 unused character in the expression.
123 \param expr Expression to be compiled
124 \param begin Iterator into the expression
125 \param group Group counter, gets incremented for each subregex
126 \param branch Whether we are compiling a branch
128 \return Compiled NFA bytecode
130 Code compile(const std::string &expr, std::string::const_iterator &iter, unsigned &group, bool branch);
131 Code parse_atom(const std::string &, std::string::const_iterator &i, unsigned &);
132 Code parse_brackets(const std::string &, std::string::const_iterator &);
133 bool parse_repeat(std::string::const_iterator &, Count &, Count &);
134 bool run(const std::string &, const std::string::const_iterator &, RegMatch::GroupArray &) const;
135 bool group_compare(const RegMatch::Group &, const RegMatch::Group &) const;
136 std::string disassemble_instruction(Code::const_iterator &) const;