Sunday, February 13, 2011

Homemade regular expression matcher

Here we try to build a program that matches a string against a regular expression over a simplified alphabet: base characters can be English letters only, and the only special characters allowed are '|' (or), '*' (closure) and parentheses; concatenation requires no character. (All the code is provided "as is" without any warranty, including but not limited to mental, psychic and wrong teaching... in short, the usual open source license.) All ideas given below can be found in Compilers: Principles, Techniques, and Tools, 2nd edition by A. Aho, M.S. Lam, R. Sethi, J.D. Ullman.


1. Basic Definitions and Notations
We give these definitions for consistency; the reader may skip this section.

Token is a basic block of input. In our case tokens will be lowercase letters of the English alphabet and four special tokens: |, *, ( and ).
Alphabet is some set of tokens. Unless stated otherwise we assume all tokens are non-special.
String is a sequence of tokens.
Language over an alphabet is some set of strings, each string consisting of tokens from that alphabet.
Regular expressions language is a language RE over non-special and special tokens with the following properties:
1) the empty string is in RE,
2) any string formed by a single non-special token is in RE,
3) if A and B are in RE, then AB, A|B, A* and (A) are in RE too.
Regular expression is an element of language RE.
Regular language L(regexp) determined by a regular expression regexp is defined as follows:
1) if regexp is an empty string, then L(regexp) contains only one string which is empty;
2) if regexp is a single token T, then L(regexp) has only one string T;
3) if regexp is of form AB where A and B are regular expressions, then L(regexp) consists of such and only such strings which can be represented as a concatenation XY of a string X from L(A) and a string Y from L(B);
4) if regexp is of form A|B where A and B are regular expressions, then L(regexp) consists of all strings from L(A) and L(B);
5) if regexp is of form (A) where A is a regular expression, then L(regexp) is exactly L(A);
6) if regexp is of form A* where A is a regular expression, then L(regexp) is the union of the empty-string language, L(A), L(AA), L(AAA), and so on infinitely. This operation in language algebra is called the Kleene closure.
Nondeterministic Finite Automaton (NFA) is a triple (S,A,T) where S is a finite set of states, A is an alphabet which may contain a special empty token, and T is a transition function which, given a state and a token, returns a set of possible next states. One state is designated as the starting state and some positive number of states as accepting states. An NFA is usually depicted as a directed graph: nodes are states, edges are labeled by tokens, and an edge labeled by token X goes from state Y to every state in T(Y,X). An NFA is said to accept a string R if there is a directed path in this graph from the starting state to one of the accepting states such that the string formed by the edge labels, with empty labels removed, is exactly R. For example, the NFA with states {1,2}, starting state 1, accepting state 2 and transitions T(1,a)={1}, T(1,b)={2} accepts exactly the strings of zero or more a's followed by a single b, i.e. L(a*b). Note that T need not be defined for every state and token; equivalently, we can add a special non-accepting state Q to which all undefined transitions go.

Theorem 1. For each regular expression regexp there is an NFA that accepts strings from, and only from, L(regexp). The proof is by induction on the construction of regexp: NFAs accepting the empty string and single tokens are obvious, and NFAs for the other cases can be built from the NFAs of the subexpressions. We will give the actual constructions in the real code below.
The reverse (for every NFA there is a regular expression defining the same language) is also true. We will not need this fact later, so we leave the proof to the reader.

Context-Free Grammar (CFG) is a triple (T,N,P) where T is a set of terminals, N is a set of nonterminals, and P is a set of productions. Each production has a head, which is always a nonterminal, and a body, which is a sequence of terminals and nonterminals. One of the nonterminals is designated as the start symbol.
A CFG defines a language in the following way. At first we have only the start symbol. Then, while there are nonterminals left, one of them is picked and substituted by the body of some production having this nonterminal as its head. We stop when no nonterminals are left. All strings which can be derived in such a way form the language defined by this CFG.
To write down a CFG we will use a notation similar to yacc(1)'s: on each line we write a nonterminal, then ":" and then the bodies of all productions having this nonterminal as their head, separated by "," (we choose the comma instead of yacc(1)'s pipe because the pipe is already an RE token while the comma is completely free); the special word 'epsilon' denotes an empty production body.

Parse Tree for a string s derived using the productions of a CFG G is a rooted ordered tree with labels at its nodes: the root is labeled by the start symbol of G; each inner node corresponds to a nonterminal substitution and is labeled by that nonterminal, and its children are labeled by the terminals and nonterminals of the corresponding production body, in the same order. All leaves are labeled by terminals and, traversed from left to right, form the string s.
Parsing is the process of constructing a parse tree from a string.
Parser is a program capable of performing parsing.

LL(1) parsing algorithm is defined as follows. We read the input string from left to right. The parse state consists of the current position in the input and a parse stack, which initially holds only the start symbol of the CFG. At each step we pop a symbol from the top of the stack. If it is a terminal, we check that it matches the current input token: if it does not, a parse error occurs, otherwise we advance the input. If the symbol from the stack is a nonterminal, we choose a production which has this symbol as its head and can derive a string starting with the current input terminal, and push its body onto the stack. For an LL(1) grammar there is at most one such production; if there is none, a parse error occurs.
LL(1) context-free grammar is a CFG for which an LL(1) parser exists.
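
To make this loop concrete before we apply it to regular expressions, here is a small self-contained sketch for a toy LL(1) grammar of balanced parentheses, S : (S)S , epsilon. The grammar and all names here are ours, chosen purely for illustration; the regular expression parser of section 7 follows exactly the same scheme.

/* A sketch of the LL(1) parsing loop for the toy grammar  S : (S)S , epsilon */
#include <string>
#include <vector>

bool parse_balanced(const std::string &input) {
    std::vector<char> stack(1, 'S');   /* 'S' is the only nonterminal */
    unsigned int pos = 0;
    while (!stack.empty()) {
        char top = stack.back();
        stack.pop_back();
        char lookahead = pos < input.size() ? input[pos] : '\0';
        if (top != 'S') {                       /* terminal on top of the stack */
            if (top != lookahead) return false; /* parse error */
            ++pos;                              /* matched, advance the input */
        } else if (lookahead == '(') {          /* choose the production S -> (S)S */
            /* push the body in reverse so that '(' ends up on top */
            stack.push_back('S');
            stack.push_back(')');
            stack.push_back('S');
            stack.push_back('(');
        }                                       /* otherwise choose S -> epsilon */
    }
    return pos == input.size();                 /* all input must be consumed */
}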


2. LL(1) Grammar for Regular Expressions

There is a natural CFG for the language RE obtained from the definition of RE:
S : E
E : E|T , T
T : TF , F
F : F* , R
R : (E), epsilon, L
L : q,w,e,r,t,y,u,i,o,p,a,s,d,f,g,h,j,k,l,z,x,c,v,b,n,m

Unfortunately this grammar is not LL(1), since the choice between productions cannot always be made by looking at the current symbol only. We have to transform it into another CFG which is LL(1). There are standard techniques for this, namely left recursion elimination and left factoring, but we will not describe them here. The LL(1) grammar for the RE language obtained by the author is simply presented below without explanation.

S  :  S'
S' :  TT'
T' :  |S' , epsilon
T  :  PP'
P' :  PP' , epsilon
P  :  RR'
R' :  *R' , epsilon
R  :  (S') , epsilon , L
L  :  q,w,e,r,t,y,u,i,o,p,a,s,d,f,g,h,j,k,l,z,x,c,v,b,n,m
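
For example, here is the leftmost derivation of the string ab in this grammar. Note that at every step the production to apply is determined by the next unread input character alone, which is exactly the LL(1) property:

S => S' => TT' => PP'T' => RR'P'T' => LR'P'T' => aR'P'T' => aP'T' => aPP'T'
  => aRR'P'T' => aLR'P'T' => abR'P'T' => abP'T' => abT' => ab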


3. Token Class

Now that we have a grammar, we can present the Token class. We have tokens for epsilon, for the nonterminals of the grammar and for input characters.
typedef enum {
    TOKEN_EPSILON, /* empty token */
    /* nonterminals of regexp LL(1) grammar */
    TOKEN_START,   /* S  -> S' */
    TOKEN_SP,      /* S' -> TT' */
    TOKEN_TP,      /* T' -> |S'  or  epsilon */
    TOKEN_T,       /* T  -> PP' */
    TOKEN_PP,      /* P' -> PP'  or  epsilon */
    TOKEN_P,       /* P  -> RR' */
    TOKEN_RP,      /* R' -> *R'  or  epsilon */
    TOKEN_R,       /* R  -> (S')  or  a,b,c,...,z  or  epsilon */
    /*end nonterminals */
    TOKEN_NORMAL   /* token for the input character */
} TokenType;

class Token {
    TokenType type;  /* type of the token */
    char normal;     /* character for TOKEN_NORMAL token */
public:
    /* create special token of specified type */
    Token(TokenType t) : type(t) {}
    /* create a token for the specified character */
    Token(char n) : type(TOKEN_NORMAL), normal(n) {}

    bool is_epsilon() const { return type == TOKEN_EPSILON; }
    TokenType get_type() const { return type; }
    char get_char() const { return normal; }
    
    /* this is used by the std::map data structure. only tokens of
     * TOKEN_NORMAL and TOKEN_EPSILON types are held there */
    bool operator <(const Token& t) const { 
        if ((type == t.type) && (type != TOKEN_EPSILON))
            return normal < t.normal;
        return type < t.type;
    }
    bool operator ==(const Token &t) const {
        if ((type == t.type) && (type != TOKEN_EPSILON))
            return normal == t.normal;
        return type == t.type;
    }
};
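
As a quick illustration of why this ordering is needed (the snippet is ours, not part of the matcher), Token values are used below as std::map keys:

/* Illustration only: Token as a std::map key relies on operator< above */
#include <map>
#include <vector>

void token_map_example() {
    std::map<Token, std::vector<int> > edges;
    edges[Token('a')].push_back(1);            /* normal token key */
    edges[Token(TOKEN_EPSILON)].push_back(2);  /* epsilon token key */
    edges[Token('a')].push_back(3);            /* reuses the first key */
    /* edges now holds two entries: Token('a') -> {1,3} and epsilon -> {2} */
}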


4. NFA Class

We start with a single NFA state. The class represents the state itself and stores all transitions out of this state: target states for the same token are grouped in vectors, and the tokens are used as keys in a std::map data structure. A state also has a temporary bit is_on, used later by algorithms to mark visited states.

class NFAState;

typedef std::vector<NFAState*> Svector;
typedef std::map<Token, Svector> ASmap;

/* Single NFA state */
class NFAState {
    private:
    ASmap moves;  /* outgoing edges of this state. */

    public:
    bool is_on; /* temporary bit to use by others */
    NFAState() : is_on(false) {}
    /* get vector of transitions for a token */
    const Svector & get_move(Token a) { return moves[a]; }
    /* get all transitions for this state */
    const ASmap & get_all_moves() { return moves; }
    /* add transition */
    void add_move(Token a, NFAState *target) {
        moves[a].push_back(target);
    }
    
};

Next is the whole NFA class. Since we only need one finish state in this program, the class does not support the general case; however, the general case can be reduced to ours by simply adding a new finish state and epsilon transitions to it from the old finish states. The NFA class contains routines to construct a new NFA out of existing NFAs. We said earlier that for every regular expression we can build an NFA accepting the same language by induction on the structure of the regular expression, but we omitted the actual constructions; now we give them in the code.

/* Nondeterministic Finite Automata */
class NFA {
    public:
    NFAState *start;   /* start state */
    NFAState *finish;  /* finish state */

    NFA(const NFA& nfa) : start(nfa.start), finish(nfa.finish) {}
    NFA() : start(NULL), finish(NULL) {}

    NFA& operator=(const NFA &nfa) {
        this->start = nfa.start;
        this->finish = nfa.finish;
        return *this;
    }

    /* NFA which accepts a specified token */
    void init_token(Token t) {
        start = new NFAState;
        finish = new NFAState;
        start->add_move(t, finish);
    }

    /* NFA which accepts the strings that either _s_ or _t_ accepts */
    void init_or(NFA &s, NFA &t) {
        start = new NFAState;
        finish = new NFAState;
        Token epsilon = Token(TOKEN_EPSILON);

        start->add_move(epsilon, s.start);
        start->add_move(epsilon, t.start);
        s.finish->add_move(epsilon, finish);
        t.finish->add_move(epsilon, finish);
    }

    /* NFA which accepts concatenations of a string accepted by _s_ and a string accepted by _t_ */
    void init_concat(NFA &s, NFA &t) {
        start = s.start;
        finish = t.finish;
        Token epsilon = Token(TOKEN_EPSILON);

        s.finish->add_move(epsilon, t.start);
    }

    /* NFA which accepts a closure of _s_ */
    void init_asterisk(NFA &s) {
        start = new NFAState;
        finish = new NFAState;
        Token epsilon = Token(TOKEN_EPSILON);

        start->add_move(epsilon, finish);
        start->add_move(epsilon, s.start);
        s.finish->add_move(epsilon, finish);
        s.finish->add_move(epsilon, s.start);
    }

    /* states are allocated on the heap, we need to free them explicitly */
    void delete_states() {
        Svector states, processing;
        processing.push_back(start);
        /* mark start so that an edge back to it cannot enqueue it twice */
        start->is_on = true;

        /* collect all states in the _states_ vector */
        while (!processing.empty()) {
            NFAState *p = processing.back();
            processing.pop_back();
            states.push_back(p);
            ASmap all_moves = p->get_all_moves();
            for (ASmap::iterator it = all_moves.begin(), end = all_moves.end();
                    it != end; ++it) {
                Svector &moves = (*it).second;
                for (Svector::iterator iit = moves.begin(), eend = moves.end();
                        iit < eend; ++iit) {
                    if ((*iit)->is_on) continue;
                    (*iit)->is_on = 1;
                    processing.push_back(*iit);
                }
            }
        }

        for (Svector::iterator it = states.begin(), end = states.end();
                it < end; ++it)
            delete *it;
    }
};
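
As a small illustration of these constructions (the snippet is ours, not part of the matcher), here is how the NFA for the regular expression a|b* can be assembled by hand:

/* Illustration only: hand-building the NFA for a|b* from the pieces above */
void build_a_or_b_star_example() {
    NFA a, b, b_star, result;
    a.init_token(Token('a'));      /* accepts the single string "a" */
    b.init_token(Token('b'));      /* accepts the single string "b" */
    b_star.init_asterisk(b);       /* accepts "", "b", "bb", ... */
    result.init_or(a, b_star);     /* accepts L(a|b*) */
    /* ... use result, e.g. with the NFAEmulator of the next section ... */
    result.delete_states();        /* free the heap-allocated states */
}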

5. NFA Emulation

To run an NFA we first need to introduce the notion of a hyperstate. Since an NFA can have several possible transitions for the same input, we need to simulate them all at once. A hyperstate is the collection of NFA states the automaton can be in for the input read so far. Hyperstates are constructed with the help of the epsilon-closure: given a set of states, its epsilon-closure is this set plus all states reachable from it by epsilon transitions. The emulation then goes as follows: the initial hyperstate is the epsilon-closure of the start state; at each step the next hyperstate is the epsilon-closure of the set reached by transitions on the current token from the current hyperstate. At the end of the input we check whether the finish state is in the current hyperstate: if it is, the NFA accepts the input, otherwise it rejects.

class NFAHyperState {
    Svector states;
    private:
    /* add a state together with its epsilon-closure to the hyperstate,
     * skipping states which are already there */
    void add_closure(NFAState &s) {
        if (s.is_on) return;
        std::queue<NFAState*> Q;
        Q.push(&s);
        s.is_on = true; /* mark when enqueuing so each state is added only once */
        while (!Q.empty()) {
            NFAState *p = Q.front();
            Q.pop();
            states.push_back(p);
            Svector moves = p->get_move(Token(TOKEN_EPSILON));
            for (Svector::iterator it = moves.begin(), end = moves.end();
                    it < end; ++it) {
                if ((*it)->is_on) continue;
                (*it)->is_on = true;
                Q.push(*it);
            }
        }
    }

    /* clear temporary markings in states. Execute after step is done */
    void clear_marks() {
        for (Svector::iterator it = states.begin(), end = states.end();
                it < end; ++it) {
            (*it)->is_on = false;
        }
    }

    public:
    NFAHyperState(NFAState *s) {
        /* add epsilon-closure of start state */
        add_closure(*s);
        clear_marks();
    }

    /* perform a single transition */
    void move(Token a) {
        Svector old; /* the previous hyperstate */
        old.swap(states);
        states.clear();
        /* now current hyperstate is clear, fill it using moves from
         * states in old hyperstate */
        for (Svector::iterator it = old.begin(), end = old.end();
                it < end; ++it) {
            Svector moves = (*it)->get_move(a);
            for (Svector::iterator jt = moves.begin(), jend = moves.end();
                    jt < jend; ++jt) {
                add_closure(**jt);
            }
        }
        clear_marks();
    }

    /* check if a specified NFA state is in this hyperstate */
    bool is_in(NFAState *f) {
        for (Svector::iterator it = states.begin(), end = states.end();
                it < end; ++it) {
            if ((*it) == f) return true;
        }
        return false;
    }
};

/* NFA Emulator. */
class NFAEmulator {
    private:
    NFAHyperState hyperstate;  /* all work is done here */
    NFAState *finish_state;    /* NFA finish state */

    public:
    NFAEmulator(NFA &nfa) : hyperstate(nfa.start), finish_state(nfa.finish) {}
    /* hyperstate transition */
    void move(Token a) { hyperstate.move(a); }
    /* check if NFA accepts the input */
    bool is_accepted() { return hyperstate.is_in(finish_state); }
};
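
Continuing the small example from section 4 (again the snippet is ours), here is how the NFA for a|b* is run over an input string:

/* Illustration only: emulating the hand-built NFA for a|b* on a string */
#include <string>

bool matches_a_or_b_star(const std::string &input) {
    NFA a, b, b_star, re;
    a.init_token(Token('a'));
    b.init_token(Token('b'));
    b_star.init_asterisk(b);
    re.init_or(a, b_star);

    NFAEmulator emu(re);
    for (unsigned int i = 0; i < input.length(); ++i)
        emu.move(Token(input[i]));   /* one hyperstate transition per character */
    bool accepted = emu.is_accepted();

    re.delete_states();              /* free the heap-allocated states */
    return accepted;  /* true for "", "a", "b", "bb", ...; false for e.g. "ab" */
}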

6. Regular Expression Parse Tree

Now we return to our LL(1) grammar and present the parser. Since the grammar is not very friendly to the NFA construction operations, we have to build a parse tree first. After the parse tree is constructed, build_nfa can be called on its root to construct the NFA for the parsed regular expression. build_nfa works by recursively generating NFAs for the children and then for the parent. Sometimes it passes an NFA generated by one child to its sibling. The reader is advised to draw a parse tree for some simple expression to see why this happens; one such tree is sketched below.
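
For example, the parse tree our grammar produces for the regular expression ab looks like this (epsilon productions are shown explicitly):

S
`-- S'
    |-- T
    |   |-- P
    |   |   |-- R -- L(a)
    |   |   `-- R' -- epsilon
    |   `-- P'
    |       |-- P
    |       |   |-- R -- L(b)
    |       |   `-- R' -- epsilon
    |       `-- P' -- epsilon
    `-- T' -- epsilon

The NFA accepting a is built in the subtree rooted at the first P; the T node then hands it to its P' child, which concatenates it with the NFA accepting b and passes the result up unchanged through the remaining epsilon nodes.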

/* grammar productions */
typedef enum {
    RULE_S_SP,     /* S  -> S' */
    RULE_SP_TTP,   /* S' -> TT' */
    RULE_TP_SP,    /* T' -> |S' */
    RULE_TP_E,     /* T' -> epsilon */
    RULE_T_PPP,    /* T  -> PP' */
    RULE_PP_PPP,   /* P' -> PP' */
    RULE_PP_E,     /* P' -> epsilon */
    RULE_P_RRP,    /* P  -> RR' */
    RULE_RP_ARP,   /* R' -> *R' */
    RULE_RP_E,     /* R' -> epsilon */
    RULE_R_E,      /* R  -> epsilon */
    RULE_R_LETTER, /* R  -> L */
    RULE_R_BSPB,   /* R  -> (S') */
    RULE_LETTER,   /* L  -> letter */
} GrammarRule;

class ParseTreeNode;

typedef std::vector<ParseTreeNode*> Nvector;

/* Node of a Parse tree */
class ParseTreeNode {
    GrammarRule type;       /* grammar production corresponding to this node */
    Nvector childs;         /* child nodes */
    ParseTreeNode *parent;  /* parent node */
    char letter;            /* character for RULE_LETTER type */
    
    /* nfa corresponding to this node. When constructing a parse tree this is
     * used as a parameter to _inherit_ an nfa from the parent or siblings */
    NFA nfa;

    public:
    /* node for a new parse tree with the starting production */
    ParseTreeNode() : type(RULE_S_SP), parent(NULL) {}
    /* new node for existing parse tree */
    ParseTreeNode(ParseTreeNode &p) : parent(&p) {
        p.childs.push_back(this);
    }
    void set_type(GrammarRule t) { type = t; }
    void set_letter(char c) { letter = c; }
    ~ParseTreeNode() {
        for (Nvector::iterator it=childs.begin(),end=childs.end();
                it<end; ++it)
            delete *it;
        childs.clear();
    }

    /* build NFA by traversing a parse tree
     * should be called from the root of the parse tree */
    NFA & build_nfa() {
        NFA new_nfa; /* temporary NFA */
        switch(type) {
        case RULE_S_SP: /* S -> S' */
            /* NFA for this node is just the NFA of its child */
            nfa = childs[0]->build_nfa();
            break;
        case RULE_SP_TTP: /* S' -> TT' */
            /* construct NFA for T-child and pass it to the T'-child */
            childs[1]->nfa = childs[0]->build_nfa();
            /* NFA for this node is the T'-child's NFA */
            nfa = childs[1]->build_nfa();
            break;
        case RULE_TP_SP: /* T' -> |S' */
            /* NFA is OR of inherited nfa and S'-child's NFA */
            new_nfa.init_or(nfa, childs[1]->build_nfa());
            nfa = new_nfa;
            break;
        case RULE_TP_E:
            /* NFA is just the inherited nfa, do nothing */
            break;
        case RULE_T_PPP: /* T -> PP' */
            /* construct NFA for P-child, pass it to P'-child */
            childs[1]->nfa = childs[0]->build_nfa();
            /* NFA is P'-child's nfa */
            nfa = childs[1]->build_nfa();
            break;
        case RULE_PP_PPP: /* P' -> PP' */
            /* construct concatenation NFA of inherited nfa and P-child's */
            new_nfa.init_concat(nfa, childs[0]->build_nfa());
            /* pass new_nfa to the P'-child */
            childs[1]->nfa = new_nfa;
            /* NFA is P'-child's NFA */
            nfa = childs[1]->build_nfa();
            break;
        case RULE_PP_E:
            /* NFA is inherited nfa, do nothing */
            break;
        case RULE_P_RRP:  /* P -> RR' */
            /* construct NFA for R-child and pass it to R'-child */
            childs[1]->nfa = childs[0]->build_nfa();
            /* NFA is R'-child's NFA */
            nfa = childs[1]->build_nfa();
            break;
        case RULE_RP_ARP: /* R' -> *R' */
            /* construct asterisk NFA from inherited nfa, pass it to child */
            childs[1]->nfa.init_asterisk(nfa);
            /* NFA is R'-child's nfa */
            nfa = childs[1]->build_nfa();
            break;
        case RULE_RP_E:
            /* NFA is inherited nfa, do nothing */
            break;
        case RULE_R_E:
            /* NFA accepting only the epsilon token */
            nfa.init_token(Token(TOKEN_EPSILON));
            break;
        case RULE_R_LETTER: /* R -> L */
            /* NFA is child's NFA */
            nfa = childs[0]->build_nfa();
            break;
        case RULE_R_BSPB:
            /* NFA is S'-child's NFA */
            nfa = childs[1]->build_nfa();
            break;
        case RULE_LETTER: /* letter -> a|...|z */
            /* NFA is accepting only letter token */
            nfa.init_token(Token(letter));
            break;
        }
        return nfa;
    }
};


7. Regular Expression Parser

Now we can finally present the parser itself. The LL(1) parsing algorithm has been discussed above; the only difference is that the parse stack now holds parse tree nodes alongside the tokens.

typedef std::pair<Token, ParseTreeNode*> TNpair;
typedef std::vector<TNpair> TNvector;

/* Regular Expression Parser */
class REParse {
    std::string regexp;         /* regular expression to parse */
    unsigned int position;      /* current position in regexp */
    TNvector parse_stack;       /* LL(1) parse stack */
    ParseTreeNode *parse_tree;  /* parse tree */

    /* push token-node pair on a parse stack */
    void push_parse_state(Token &t, ParseTreeNode *node) {
        parse_stack.push_back(TNpair(t, node));
    }

    /* grammar production for a letter terminal in a body */
    void production(ParseTreeNode &node, GrammarRule rule, char letter) {
        node.set_type(rule);
        node.set_letter(letter);
    }
    /* grammar production with empty body */
    void production(ParseTreeNode &node, GrammarRule rule) {
        node.set_type(rule);
    }
    /* grammar production with one grammar symbol in the body */
    void production(ParseTreeNode &node, GrammarRule rule, Token t1) {
        ParseTreeNode *c1 = new ParseTreeNode(node);
        push_parse_state(t1, c1);
        node.set_type(rule);
    }
    /* grammar production with two grammar symbols in the body */
    void production(ParseTreeNode &node, GrammarRule rule, Token t1, Token t2) {
        ParseTreeNode *c1 = new ParseTreeNode(node);
        ParseTreeNode *c2 = new ParseTreeNode(node);
        push_parse_state(t2, c2);
        push_parse_state(t1, c1);
        node.set_type(rule);
    }
    /* grammar production with three grammar symbols in the body */
    void production(ParseTreeNode &node, GrammarRule rule, Token t1, Token t2, Token t3) {
        ParseTreeNode *c1 = new ParseTreeNode(node);
        ParseTreeNode *c2 = new ParseTreeNode(node);
        ParseTreeNode *c3 = new ParseTreeNode(node);
        push_parse_state(t3, c3);
        push_parse_state(t2, c2);
        push_parse_state(t1, c1);
        node.set_type(rule);
    }

    /* can the current character begin a string derived from P' */
    bool is_first_PP(char c) const { return (isalpha(c) || (c == '(')); }
    /* can the current character begin a string derived from T' */
    bool is_first_TP(char c) const { return (is_first_PP(c) || (c == '|')); }

    /* A step of LL(1) parsing: process a token from the top of the stack */
    void process_token(Token &t, ParseTreeNode &node) {
#define T(x) Token(x)
        /* lookahead character, or '\0' when the input has been consumed */
        char lookup = position < regexp.length() ? regexp[position] : '\0';
        switch (t.get_type()) {
        case TOKEN_START:
            production(node, RULE_S_SP, T(TOKEN_SP)); break;
        case TOKEN_SP:
            production(node, RULE_SP_TTP, T(TOKEN_T), T(TOKEN_TP)); break;
        case TOKEN_T:
            production(node, RULE_T_PPP, T(TOKEN_P), T(TOKEN_PP)); break;
        case TOKEN_P:
            production(node, RULE_P_RRP, T(TOKEN_R), T(TOKEN_RP)); break;
        case TOKEN_R:
            if (lookup == '(')
                production(node, RULE_R_BSPB, T('('), T(TOKEN_SP), T(')'));
            else if (isalpha(lookup))
                production(node, RULE_R_LETTER, T(lookup));
            else
                production(node, RULE_R_E);
            break;
        case TOKEN_TP:
            if (lookup == '|')
                production(node, RULE_TP_SP, T('|'), T(TOKEN_SP));
            else
                production(node, RULE_TP_E);
            break;
        case TOKEN_PP:
            if (is_first_PP(lookup))
                production(node, RULE_PP_PPP, T(TOKEN_P), T(TOKEN_PP));
            else
                production(node, RULE_PP_E);
            break;
        case TOKEN_RP:
            if (lookup == '*')
                production(node, RULE_RP_ARP, T('*'), T(TOKEN_RP));
            else
                production(node, RULE_RP_E);
            break;
        case TOKEN_NORMAL:
            /* check if input doesn't finish unexpectedly */
            if (position >= regexp.length()) {
                std::cerr << "Parse error: went past input" << std::endl;
                throw 10; /* ABORT */
            }
            /* this production is also used for passing through
             * special tokens, so we have to check the character */
            if (lookup != t.get_char()) {
                std::cerr << "Parse error: expected '" << t.get_char()
                    << "' got '" << lookup << "'" << std::endl;
                throw 20; /* ABORT */
            }
            production(node, RULE_LETTER, lookup);
            ++position; /* all right, advance input position */
            break;
        default:
            std::cerr << "Parse error" << std::endl;
            throw 30; /* ABORT */
        }
    }

    public:
    REParse(std::string re) : regexp(re), position(0) {
        /* initialize parse tree and stack */
        parse_tree = new ParseTreeNode();
        parse_stack.push_back(TNpair(Token(TOKEN_START), parse_tree));
        /* LL(1) parsing cycle */
        while (!parse_stack.empty()) {
            /* copy the pair: process_token may push new entries onto
             * parse_stack and invalidate references into it */
            TNpair tnpair = parse_stack.back();
            parse_stack.pop_back();
            process_token(tnpair.first, *tnpair.second);
        }
        /* check that the whole input regexp has been consumed */
        if (position != regexp.length()) {
            std::cerr << "Parse error: finished early" << std::endl;
            throw 40; /* ABORT */
        }

    }

    /* get parse tree */
    ParseTreeNode * get_parse_tree() const { return parse_tree; }

    ~REParse() {
        delete parse_tree;
    }

};
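
REParse reports errors by throwing plain integer codes (the 10, 20, 30, 40 above). A caller that does not want a malformed pattern to terminate the program might wrap construction as in this sketch (ours, not part of the original program; the main() below keeps things simple and catches nothing):

/* Sketch only: catching the integer error codes thrown by REParse */
#include <iostream>
#include <string>

bool parses_ok(const std::string &re) {
    try {
        REParse parser(re);
        return true;
    } catch (int code) {
        std::cerr << "regexp '" << re << "' rejected, error code "
                  << code << std::endl;
        return false;
    }
}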


8. Epilogue

We have presented all the parts needed to build a regular expression matcher. First we parse the regular expression, building a parse tree along the way; next we build an NFA from the parse tree; then we emulate the NFA and finally check whether it accepts the input string. As a final example here is the main() function of our program. It reads two strings: the first is a regular expression, the second is the string to match. It outputs "matched" if the input string appears to be in the language defined by the regular expression, otherwise it outputs "not matched".
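
To compile everything above as a single file, the following headers cover the standard library facilities used (std::map, std::vector, std::queue, std::string, std::pair, the iostreams and isalpha(3)):

#include <cctype>
#include <iostream>
#include <map>
#include <queue>
#include <string>
#include <utility>
#include <vector>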

int main() {
    std::string re;
    std::cin >> re;  /* read regular expression */
    REParse parser(re);  /* parse regular expression */
    NFA nfa = parser.get_parse_tree()->build_nfa(); /* build NFA */
    NFAEmulator nfaemu(nfa); /* Initialize NFA emulator */

    std::string match;
    std::cin >> match; /* read input string */

    /* process string */
    for (unsigned int i=0; i < match.length(); ++i)
        nfaemu.move(match[i]);
    if (nfaemu.is_accepted()) std::cout << "matched";
    else std::cout << "not matched";
    std::cout << std::endl;

    /* free states allocated in heap */
    nfa.delete_states();
}
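
Assuming the program is compiled into an executable called rematch (the name is ours), a couple of sessions should look like this:

$ ./rematch
a(b|c)*
abcb
matched

$ ./rematch
a(b|c)*
abd
not matched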

9. Further Reading

The aforementioned Compilers: Principles, Techniques, and Tools contains every idea presented here. Introduction to Automata Theory, Languages, and Computation by Hopcroft, Motwani and Ullman explains the mathematics behind automata and context-free grammars. Finally, this is a very good online resource about regular expressions in practice, with parsing done by Dijkstra's shunting yard algorithm and reverse polish notation processing, which is much simpler than the LL(1) parser presented here.