Friday, January 16, 2015

Sayonara 2014

Being a Ph.D. student, I've spent a lot of time optimizing the algorithm of Cohen et al. for the 2-hop cover (Hub Labels). Finally we published our results at the Symposium on Experimental Algorithms:

Daniel Delling, Andrew V. Goldberg, Ruslan Savchenko, and Renato F. Werneck. Hub Labels: Theory and Practice

After the talk I reimplemented the Hub Labeling approximation algorithm from scratch and published it: https://github.com/savrus/hl

Also, in 2014 we solved some theoretical problems related to Hub Labels, including NP-hardness of the minimum Hub Labeling problem and an approximation guarantee for greedy Hierarchical Hub Labeling algorithms. The preprint is available from arXiv: http://arxiv.org/abs/1501.02492

That's it for 2014, that's it.

Monday, January 6, 2014

Sayonara 2013

Just a few updates.

First, I finally have a scientific publication. Together with Andrew Goldberg and Ilya Razenshteyn we worked on hub labels and wrote a paper, Separating Hierarchical and General Hub Labelings. It was accepted to the MFCS 2013 conference and published in the proceedings. You can get a preprint from arXiv or the final paper from the publisher.

Second, I realized there are many textbook algorithms almost everybody has implemented and I haven't. I started the algory project on GitHub to fill this gap.
I don't promise a cool graph library in the near future, but I hope to add new algorithms there from time to time.


Sunday, February 13, 2011

Homemade regular expression matcher

Here we try to build a program (all the code is provided "as is" without any warranty, including but not limited to mental, psychic and wrong teaching... in short, the usual open source license) that matches a string against a regular expression over a simplified alphabet: the base characters can be English letters only, and the only special characters allowed are '|' (or) and '*' (closure); concatenation requires no character. All the ideas given below can be found in Compilers: Principles, Techniques, and Tools, 2nd edition by A. Aho, M.S. Lam, R. Sethi and J.D. Ullman.


1. Basic Definitions and Notations
We give these definitions for consistency. The reader may just skip this section.

A token is a basic block of input. In our case tokens will be lowercase letters of the English alphabet and four special tokens: |, *, ( and ).
An alphabet is some set of tokens. Unless stated otherwise we assume all tokens are non-special.
A string is a sequence of tokens.
A language over an alphabet is some set of strings, each string consisting of tokens from the alphabet.
The regular expressions language is a language RE over non-special and special tokens with the following properties:
1) the empty string is in RE,
2) any string formed by a single non-special token is in RE,
3) if A and B are in RE, then AB, A|B, A* and (A) are in RE too.
A regular expression is an element of the language RE.
The regular language L(regexp) determined by a regular expression regexp is defined as follows:
1) if regexp is the empty string, then L(regexp) contains only one string, which is empty;
2) if regexp is a single token T, then L(regexp) has only one string, T;
3) if regexp is of the form AB where A and B are regular expressions, then L(regexp) consists of exactly those strings which can be represented as a concatenation XY of a string X from L(A) and a string Y from L(B);
4) if regexp is of the form A|B where A and B are regular expressions, then L(regexp) consists of all strings from L(A) and L(B);
5) if regexp is of the form (A) where A is a regular expression, then L(regexp) is exactly L(A);
6) if regexp is of the form A* where A is a regular expression, then L(regexp) is the union of the empty string language, L(A), L(AA), L(AAA), and so on infinitely. This operation in language algebra is called Kleene closure.
A Nondeterministic Finite Automaton (NFA) is a triple (S,A,T) where S is a finite set of states, A is an alphabet which can contain a special empty token, and T is a transition function which, given a state and a token, returns a set of successor states. One state is designated as the starting state and some positive number of states as accepting states. An NFA is usually depicted as a directed graph where nodes are states and edges are labeled by tokens: an edge labeled by a token X goes from a state Y to each state in T(Y,X). An NFA is said to accept a string R if there is a directed path in the graph representing this NFA from the starting state to one of the accepting states such that the string formed by the edge labels, with empty labels removed, is exactly R. Note that T need not be defined on every state-token pair: we can make it total by adding a special non-accepting state Q to which all otherwise undefined transitions go.

Theorem 1. For each regular expression regexp there is an NFA that accepts exactly the strings from L(regexp). The proof is by induction on the construction of regexp: NFAs accepting the empty string and single tokens are obvious, and NFAs for the other cases can be made from the NFAs of the subexpressions. We will give the actual constructions in the real code below.
The reverse (for every NFA there is a regular expression defining the same language) is also true. We will not need this fact later, so we leave the proof to the reader.

A Context-Free Grammar (CFG) is a triple (T,N,P) where T is a set of terminals, N is a set of nonterminals, and P is a set of productions. Each production has a head, which is always a nonterminal, and a body, which is a sequence of terminals and nonterminals. One of the nonterminals is designated as the start symbol.
A CFG defines a language in the following way. At first we have only the start symbol. Then, while there are nonterminals left, one is picked and substituted by the body of some production having this nonterminal as its head. We stop when there are no nonterminals left. All strings which can be derived in such a way form the language defined by this CFG.
To write down a CFG we will use a notation similar to yacc(1)'s: on each line we write a nonterminal, then ":", and then all production bodies having this nonterminal as a head, separated by "," (we choose comma instead of yacc(1)'s pipe because pipe is used as an RE language token and comma is absolutely vacant); the special word 'epsilon' denotes an empty production body.

A parse tree for a string s derived using CFG G's productions is a rooted ordered tree with labels at its nodes: the root is labeled by the start symbol of G; each inner node corresponds to a nonterminal substitution and is labeled by that nonterminal, and its children are labeled by the terminals and nonterminals from the corresponding production body, in the same order. All leaves are labeled by terminals and, traversed from left to right, form the string s.
Parsing is the process of constructing a parse tree from a string.
A parser is a program capable of performing parsing.

The LL(1) parsing algorithm is defined as follows. We read the input string from left to right. To represent the parse state we remember the current position in the string and keep a parse stack, initially holding only the start symbol of the CFG. At each step we pop a token from the top of the stack; if it is a terminal, we check whether it is the same as the current token in the input. If it is not, a parse error occurs; otherwise we advance the input. If the token from the stack is a nonterminal, we choose a production which has this token as its head and can derive a string starting with the current input terminal, and push its body onto the stack. By the definition of an LL(1) parser there is exactly one such production; otherwise a parse error occurs.
An LL(1) context-free grammar is a CFG for which an LL(1) parser exists.


2. LL(1) Grammar for Regular Expressions

There is a natural CFG for the language RE obtained from the definition of RE:
S : E
E : E|T , T
T : TF , F
F : F* , R
R : (E), epsilon, L
L : q,w,e,r,t,y,u,i,o,p,a,s,d,f,g,h,j,k,l,z,x,c,v,b,n,m

Unfortunately this grammar is not LL(1), since the choice between productions can't be made by looking at the current symbol only. We have to alter this grammar to make another CFG which is LL(1). There are standard techniques for this, namely left recursion elimination and left factoring, but we will not describe them here. An LL(1) grammar for the RE language obtained by the author is presented below without explanation.

S  :  S'
S' :  TT'
T' :  |S' , epsilon
T  :  PP'
P' :  PP' , epsilon
P  :  RR'
R' :  *R' , epsilon
R  :  (S') , epsilon , L
L  :  q,w,e,r,t,y,u,i,o,p,a,s,d,f,g,h,j,k,l,z,x,c,v,b,n,m
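
As a sanity check, here is a derivation of the regular expression a|b in this grammar (always expanding the leftmost nonterminal; the step R -> L -> a is collapsed to R -> a for brevity):

S -> S' -> TT' -> PP'T' -> RR'P'T' -> aR'P'T' -> aP'T' -> aT'
  -> a|S' -> a|TT' -> a|PP'T' -> a|RR'P'T' -> a|bR'P'T' -> a|bP'T' -> a|bT' -> a|b

Note that at every step the choice of production is determined by the next unread input character, which is exactly the LL(1) property.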


3. Token Class

Now that we have a grammar we can present the Token class. We have tokens for epsilon, grammar symbols and input characters.
typedef enum {
    TOKEN_EPSILON, /* empty token */
    /* nonterminals of regexp LL(1) grammar */
    TOKEN_START,   /* S  -> S' */
    TOKEN_SP,      /* S' -> TT' */
    TOKEN_TP,      /* T' -> |S'  or  epsilon */
    TOKEN_T,       /* T  -> PP' */
    TOKEN_PP,      /* P' -> PP'  or  epsilon */
    TOKEN_P,       /* P  -> RR' */
    TOKEN_RP,      /* R' -> *R'  or  epsilon */
    TOKEN_R,       /* R  -> (S')  or  a,b,c,...,z  or  epsilon */
    /*end nonterminals */
    TOKEN_NORMAL   /* token for the input character */
} TokenType;

class Token {
    TokenType type;  /* type of the token */
    char normal;     /* character for TOKEN_NORMAL token */
public:
    /* create special token of specified type */
    Token(TokenType t) : type(t) {}
    /* create  token for the specified character */
    Token(char n) : type(TOKEN_NORMAL), normal(n) {}

    bool is_epsilon() const { return type == TOKEN_EPSILON; }
    TokenType get_type() const { return type; }
    char get_char() const { return normal; }
    
    /* this is used by the std::map data structure. Only tokens of
     * TOKEN_NORMAL and TOKEN_EPSILON types are held there */
    bool operator <(const Token& t) const { 
        if ((type == t.type) && (type != TOKEN_EPSILON))
            return normal < t.normal;
        return type < t.type;
    }
    bool operator ==(const Token &t) const {
        if ((type == t.type) && (type != TOKEN_EPSILON))
            return normal == t.normal;
        return type == t.type;
    }
};


4. NFA Class

We start with a single NFA state. The class represents the state itself and contains all transitions from this state: endpoints for the same token are grouped in vectors, and tokens are used as keys in a std::map data structure. A state also has a temporary bit is_on, used later by algorithms to mark visited states.

class NFAState;

typedef std::vector<NFAState*> Svector;
typedef std::map<Token, Svector> ASmap;

/* Single NFA state */
class NFAState {
    private:
    ASmap moves;  /* outgoing edges of this state. */

    public:
    bool is_on; /* temporary bit to use by others */
    NFAState() : is_on(false) {}
    /* get vector of transitions for a token */
    const Svector & get_move(Token a) { return moves[a]; }
    /* get all transitions for this state */
    const ASmap & get_all_moves() { return moves; }
    /* add transition */
    void add_move(Token a, NFAState *target) {
        moves[a].push_back(target);
    }
    
};

Next is the whole NFA class. Since we only require one finish state in this program, the class does not support the general case. However, the general case can be reduced to ours by simply adding a new finish state and epsilon transitions to it from the old finish states. The NFA class contains routines to construct a new NFA from existing NFAs. We said earlier that for every regular expression we can build an NFA accepting the same language by recursion on the regular expression, but we omitted the actual constructions; now we give them in the code.

/* Nondeterministic Finite Automata */
class NFA {
    public:
    NFAState *start;   /* start state */
    NFAState *finish;  /* finish state */

    NFA(const NFA& nfa) : start(nfa.start), finish(nfa.finish) {}
    NFA() {}

    NFA& operator=(const NFA &nfa) {
        this->start = nfa.start;
        this->finish = nfa.finish;
        return *this;
    }

    /* NFA which accepts a specified token */
    void init_token(Token t) {
        start = new NFAState;
        finish = new NFAState;
        start->add_move(t, finish);
    }

    /* NFA which accepts the strings that either _s_ or _t_ accepts */
    void init_or(NFA &s, NFA &t) {
        start = new NFAState;
        finish = new NFAState;
        Token epsilon = Token(TOKEN_EPSILON);

        start->add_move(epsilon, s.start);
        start->add_move(epsilon, t.start);
        s.finish->add_move(epsilon, finish);
        t.finish->add_move(epsilon, finish);
    }

    /* NFA which accepts concatenations of strings accepted by _s_ and _t_ */
    void init_concat(NFA &s, NFA &t) {
        start = s.start;
        finish = t.finish;
        Token epsilon = Token(TOKEN_EPSILON);

        s.finish->add_move(epsilon, t.start);
    }

    /* NFA which accepts a closure of _s_ */
    void init_asterisk(NFA &s) {
        start = new NFAState;
        finish = new NFAState;
        Token epsilon = Token(TOKEN_EPSILON);

        start->add_move(epsilon, finish);
        start->add_move(epsilon, s.start);
        s.finish->add_move(epsilon, finish);
        s.finish->add_move(epsilon, s.start);
    }

    /* states are allocated on the heap, we need to free them explicitly */
    void delete_states() {
        Svector states, processing;
        processing.push_back(start);
        start->is_on = true; /* mark on push so no state is collected twice */

        /* collect all states in the _states_ vector */
        while (!processing.empty()) {
            NFAState *p = processing.back();
            processing.pop_back();
            states.push_back(p);
            ASmap all_moves = p->get_all_moves();
            for (ASmap::iterator it = all_moves.begin(), end = all_moves.end();
                    it != end; ++it) {
                Svector &moves = (*it).second;
                for (Svector::iterator iit = moves.begin(), eend = moves.end();
                        iit < eend; ++iit) {
                    if ((*iit)->is_on) continue;
                    (*iit)->is_on = 1;
                    processing.push_back(*iit);
                }
            }
        }

        for (Svector::iterator it = states.begin(), end = states.end();
                it < end; ++it)
            delete *it;
    }
};
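
Before moving on, here is a minimal usage sketch (not part of the final program) showing how these constructors compose; it builds an NFA for the expression a|b:

    NFA a, b, ab;
    a.init_token(Token('a'));  /* accepts exactly "a" */
    b.init_token(Token('b'));  /* accepts exactly "b" */
    ab.init_or(a, b);          /* accepts "a" or "b" */
    /* ... run the emulator from the next section on ab ... */
    ab.delete_states();        /* also frees the states of a and b */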

5. NFA Emulation

To run an NFA we first need to introduce the notion of a hyperstate. Since in an NFA there can be multiple ways to make a transition, we need to simulate them all at once. A hyperstate is the collection of NFA states the automaton can be in for the current input. A hyperstate is constructed with the help of the epsilon-closure: given a set of states, its epsilon-closure is this set plus all states where we can get by epsilon transitions from this set. So emulation is done as follows: the initial hyperstate is the epsilon-closure of the start state; at each step the next hyperstate is the epsilon-closure of the set reached by current-token transitions from the current hyperstate. At the end of the input we check whether the finish state is in the current hyperstate. If it is, then the NFA accepts the input; otherwise it rejects.

class NFAHyperState {
    Svector states;
    private:
    /* add a state together with its epsilon-closure, skipping states
     * which are already in the hyperstate */
    void add_closure(NFAState &s) {
        if (s.is_on) return;
        std::queue<NFAState*> Q;
        Q.push(&s);
        s.is_on = true; /* mark on push so a state never enters the queue twice */
        while (!Q.empty()) {
            NFAState *p = Q.front();
            Q.pop();
            states.push_back(p);
            Svector moves = p->get_move(Token(TOKEN_EPSILON));
            for (Svector::iterator it = moves.begin(), end = moves.end();
                    it < end; ++it) {
                if ((*it)->is_on) continue;
                (*it)->is_on = true;
                Q.push(*it);
            }
        }
    }

    /* clear temporary markings in states. Execute after step is done */
    void clear_marks() {
        for (Svector::iterator it = states.begin(), end = states.end();
                it < end; ++it) {
            (*it)->is_on = false;
        }
    }

    public:
    NFAHyperState(NFAState *s) {
        /* add epsilon-closure of start state */
        add_closure(*s);
        clear_marks();
    }

    /* perform a single transition */
    void move(Token a) {
        Svector old; /* the previous hyperstate */
        old.swap(states); /* states is empty after the swap */
        /* now current hyperstate is clear, fill it using moves from
         * states in old hyperstate */
        for (Svector::iterator it = old.begin(), end = old.end();
                it < end; ++it) {
            Svector moves = (*it)->get_move(a);
            for (Svector::iterator jt = moves.begin(), jend = moves.end();
                    jt < jend; ++jt) {
                add_closure(**jt);
            }
        }
        clear_marks();
    }

    /* check if a specified NFA state is in this hyperstate */
    bool is_in(NFAState *f) {
        for (Svector::iterator it = states.begin(), end = states.end();
                it < end; ++it) {
            if ((*it) == f) return true;
        }
        return false;
    }
};

/* NFA Emulator. */
class NFAEmulator {
    private:
    NFAHyperState hyperstate;  /* all work is done here */
    NFAState *finish_state;    /* NFA finish state */

    public:
    NFAEmulator(NFA &nfa) : hyperstate(nfa.start), finish_state(nfa.finish) {}
    /* hyperstate transition */
    void move(Token a) { hyperstate.move(a); }
    /* check if NFA accepts the input */
    bool is_accepted() { return hyperstate.is_in(finish_state); }
};
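
Continuing the sketch from the previous section, matching the string "a" against the a|b automaton looks like this:

    NFAEmulator emu(ab);          /* ab is the NFA for a|b built above */
    emu.move(Token('a'));         /* feed one input character */
    bool ok = emu.is_accepted();  /* true: "a" is in L(a|b) */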

6. Regular Expression Parse Tree

Now we return to our LL(1) grammar and present the parser. Since the grammar appears to be not very friendly to the NFA construction operations, we have to build a parse tree first. After the parse tree is constructed, build_nfa can be called on its root to construct the NFA for the parsed regular expression. The routine build_nfa works by recursively generating NFAs for the children and then for the parent. Sometimes it passes an NFA generated by one child to its sibling. The reader is advised to draw a parse tree for some simple expression to see why this happens.
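
For example, here is the parse tree for the expression ab (productions are shown in parentheses). The NFA for 'a' is built at the first P, inherited by its P' sibling, and concatenated there with the NFA for 'b' built at the inner P:

S (S -> S')
└ S' (S' -> TT')
  ├ T (T -> PP')
  │ ├ P (P -> RR')
  │ │ ├ R (R -> L) ── a
  │ │ └ R' (R' -> epsilon)
  │ └ P' (P' -> PP')
  │   ├ P (P -> RR')
  │   │ ├ R (R -> L) ── b
  │   │ └ R' (R' -> epsilon)
  │   └ P' (P' -> epsilon)
  └ T' (T' -> epsilon)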

/* grammar productions */
typedef enum {
    RULE_S_SP,     /* S  -> S' */
    RULE_SP_TTP,   /* S' -> TT' */
    RULE_TP_SP,    /* T' -> |S' */
    RULE_TP_E,     /* T' -> epsilon */
    RULE_T_PPP,    /* T  -> PP' */
    RULE_PP_PPP,   /* P' -> PP' */
    RULE_PP_E,     /* P' -> epsilon */
    RULE_P_RRP,    /* P  -> RR' */
    RULE_RP_ARP,   /* R' -> *R' */
    RULE_RP_E,     /* R' -> epsilon */
    RULE_R_E,      /* R  -> epsilon */
    RULE_R_LETTER, /* R  -> L */
    RULE_R_BSPB,   /* R  -> (S') */
    RULE_LETTER,   /* L  -> letter */
} GrammarRule;

class ParseTreeNode;

typedef std::vector<ParseTreeNode*> Nvector;

/* Node of a Parse tree */
class ParseTreeNode {
    GrammarRule type;       /* grammar production corresponding to this node */
    Nvector childs;         /* child nodes */
    ParseTreeNode *parent;  /* parent node */
    char letter;            /* character for RULE_LETTER type */
    
    /* nfa corresponding to this node. When constructing a parse tree this is
     * used as a parameter to _inherit_ an nfa from the parent or siblings */
    NFA nfa;

    public:
    /* node for a new parse tree with the starting production */
    ParseTreeNode() : type(RULE_S_SP) {}
    /* new node for existing parse tree */
    ParseTreeNode(ParseTreeNode &p) : parent(&p) {
        p.childs.push_back(this);
    }
    void set_type(GrammarRule t) { type = t; }
    void set_letter(char c) { letter = c; }
    ~ParseTreeNode() {
        for (Nvector::iterator it=childs.begin(),end=childs.end();
                it<end; ++it)
            delete *it;
        childs.clear();
    }

    /* build NFA by traversing a parse tree
     * should be called from the root of the parse tree */
    NFA & build_nfa() {
        NFA new_nfa; /* temporary NFA */
        switch(type) {
        case RULE_S_SP: /* S -> S' */
            /* NFA for this node is just the NFA for its child */
            nfa = childs[0]->build_nfa();
            break;
        case RULE_SP_TTP: /* S' -> TT' */
            /* construct NFA for T-child and pass it to the T'-child */
            childs[1]->nfa = childs[0]->build_nfa();
            /* NFA for this node is the same as its T'-child's */
            nfa = childs[1]->build_nfa();
            break;
        case RULE_TP_SP: /* T' -> |S' */
            /* NFA is OR of inherited nfa and S'-child's NFA */
            new_nfa.init_or(nfa, childs[1]->build_nfa());
            nfa = new_nfa;
            break;
        case RULE_TP_E:
            /* NFA is just the inherited nfa, do nothing */
            break;
        case RULE_T_PPP: /* T -> PP' */
            /* construct NFA for P-child, pass it to P'-child */
            childs[1]->nfa = childs[0]->build_nfa();
            /* NFA is P'-child's nfa */
            nfa = childs[1]->build_nfa();
            break;
        case RULE_PP_PPP: /* P' -> PP' */
            /* construct concatenation NFA of inherited nfa and P-child's */
            new_nfa.init_concat(nfa, childs[0]->build_nfa());
            /* pass new_nfa to the P'-child */
            childs[1]->nfa = new_nfa;
            /* NFA is P'-child's NFA */
            nfa = childs[1]->build_nfa();
            break;
        case RULE_PP_E:
            /* NFA is inherited nfa, do nothing */
            break;
        case RULE_P_RRP:  /* P -> RR' */
            /* construct NFA for R-child and pass it to R'-child */
            childs[1]->nfa = childs[0]->build_nfa();
            /* NFA is R'-child's NFA */
            nfa = childs[1]->build_nfa();
            break;
        case RULE_RP_ARP: /* R' -> *R' */
            /* construct asterisk NFA from inherited nfa, pass it to child */
            childs[1]->nfa.init_asterisk(nfa);
            /* NFA is R'-child's nfa */
            nfa = childs[1]->build_nfa();
            break;
        case RULE_RP_E:
            /* NFA is inherited nfa, do nothing */
            break;
        case RULE_R_E:
            /* NFA accepting only the epsilon token */
            nfa.init_token(Token(TOKEN_EPSILON));
            break;
        case RULE_R_LETTER: /* R -> L */
            /* NFA is child's NFA */
            nfa = childs[0]->build_nfa();
            break;
        case RULE_R_BSPB:
            /* NFA is S'-child's NFA */
            nfa = childs[1]->build_nfa();
            break;
        case RULE_LETTER: /* letter -> a|...|z */
            /* NFA is accepting only letter token */
            nfa.init_token(Token(letter));
            break;
        }
        return nfa;
    }
};


7. Regular Expression Parser

Now we can finally present the parser itself. The LL(1) parsing algorithm has been discussed above; the only difference is that the parse stack now holds parse tree nodes alongside tokens.

typedef std::pair<Token, ParseTreeNode*> TNpair;
typedef std::vector<TNpair> TNvector;

/* Regular Expression Parser */
class REParse {
    std::string regexp;         /* regular expression to parse */
    unsigned int position;      /* current position in regexp */
    TNvector parse_stack;       /* LL(1) parse stack */
    ParseTreeNode *parse_tree;  /* parse tree */

    /* push token-node pair on a parse stack */
    void push_parse_state(Token &t, ParseTreeNode *node) {
        parse_stack.push_back(TNpair(t, node));
    }

    /* grammar production for a letter terminal in a body */
    void production(ParseTreeNode &node, GrammarRule rule, char letter) {
        node.set_type(rule);
        node.set_letter(letter);
    }
    /* grammar production with empty body */
    void production(ParseTreeNode &node, GrammarRule rule) {
        node.set_type(rule);
    }
    /* grammar production with one nonterminal in a body */
    void production(ParseTreeNode &node, GrammarRule rule, Token t1) {
        ParseTreeNode *c1 = new ParseTreeNode(node);
        push_parse_state(t1, c1);
        node.set_type(rule);
    }
    /* grammar production with two nonterminals in a body */
    void production(ParseTreeNode &node, GrammarRule rule, Token t1, Token t2) {
        ParseTreeNode *c1 = new ParseTreeNode(node);
        ParseTreeNode *c2 = new ParseTreeNode(node);
        push_parse_state(t2, c2);
        push_parse_state(t1, c1);
        node.set_type(rule);
    }
    /* grammar production with three nonterminals in a body */
    void production(ParseTreeNode &node, GrammarRule rule, Token t1, Token t2, Token t3) {
        ParseTreeNode *c1 = new ParseTreeNode(node);
        ParseTreeNode *c2 = new ParseTreeNode(node);
        ParseTreeNode *c3 = new ParseTreeNode(node);
        push_parse_state(t3, c3);
        push_parse_state(t2, c2);
        push_parse_state(t1, c1);
        node.set_type(rule);
    }

    /* can the current character begin a string derived from P' */
    bool is_first_PP(char c) const { return (isalpha(c) || (c == '(')); }
    /* can the current character begin a string derived from T' */
    bool is_first_TP(char c) const { return (is_first_PP(c) || (c == '|')); }

    /* A step of LL(1) parsing: process a token from the top of the stack */
    void process_token(Token &t, ParseTreeNode &node) {
#define T(x) Token(x)
        /* safe lookahead: '\0' when past the end of the input */
        char lookup = (position < regexp.length()) ? regexp[position] : '\0';
        switch (t.get_type()) {
        case TOKEN_START:
            production(node, RULE_S_SP, T(TOKEN_SP)); break;
        case TOKEN_SP:
            production(node, RULE_SP_TTP, T(TOKEN_T), T(TOKEN_TP)); break;
        case TOKEN_T:
            production(node, RULE_T_PPP, T(TOKEN_P), T(TOKEN_PP)); break;
        case TOKEN_P:
            production(node, RULE_P_RRP, T(TOKEN_R), T(TOKEN_RP)); break;
        case TOKEN_R:
            if (lookup == '(')
                production(node, RULE_R_BSPB, T('('), T(TOKEN_SP), T(')'));
            else if (isalpha(lookup))
                production(node, RULE_R_LETTER, T(lookup));
            else
                production(node, RULE_R_E);
            break;
        case TOKEN_TP:
            if (lookup == '|')
                production(node, RULE_TP_SP, T('|'), T(TOKEN_SP));
            else
                production(node, RULE_TP_E);
            break;
        case TOKEN_PP:
            if (is_first_PP(lookup))
                production(node, RULE_PP_PPP, T(TOKEN_P), T(TOKEN_PP));
            else
                production(node, RULE_PP_E);
            break;
        case TOKEN_RP:
            if (lookup == '*')
                production(node, RULE_RP_ARP, T('*'), T(TOKEN_RP));
            else
                production(node, RULE_RP_E);
            break;
        case TOKEN_NORMAL:
            /* check if input doesn't finish unexpectedly */
            if (position >= regexp.length()) {
                std::cerr << "Parse error: went past input" << std::endl;
                throw 10; /* ABORT */
            }
            /* this production is also used for passing through
             * special tokens, so we have to check the character */
            if (lookup != t.get_char()) {
                std::cerr << "Parse error: expected '" << t.get_char()
                    << "' got '" << lookup << "'" << std::endl;
                throw 20; /* ABORT */
            }
            production(node, RULE_LETTER, lookup);
            ++position; /* all right, advance input position */
            break;
        default:
            std::cerr << "Parse error" << std::endl;
            throw 30; /* ABORT */
        }
    }

    public:
    REParse(std::string re) : regexp(re), position(0) {
        /* initialize parse tree and stack */
        parse_tree = new ParseTreeNode();
        parse_stack.push_back(TNpair(Token(TOKEN_START), parse_tree));
        /* LL(1) parsing cycle */
        while (!parse_stack.empty()) {
            /* copy, not a reference: pop_back and later pushes invalidate it */
            TNpair tnpair = parse_stack.back();
            parse_stack.pop_back();
            process_token(tnpair.first, *tnpair.second);
        }
        /* check that the whole input regexp has been consumed */
        if (position != regexp.length()) {
            std::cerr << "Parse error: finished early" << std::endl;
            throw 40; /* ABORT */
        }

    }

    /* get parse tree */
    ParseTreeNode * get_parse_tree() const { return parse_tree; }

    ~REParse() {
        delete parse_tree;
    }

};


8. Epilogue

We have presented all the parts needed to build a regular expression matcher. First we parse a regular expression, building a parse tree along the way; next we build an NFA from the parse tree; then we emulate the NFA and finally check whether it accepts the input string. As a final example here is the main() function of our program. It reads two strings: the first one is a regular expression, the second is the string to match. It outputs "matched" if the input string appears to be in the language defined by the regular expression; otherwise the program outputs "not matched".

int main() {
    std::string re;
    std::cin >> re;  /* read regular expression */
    REParse parser(re);  /* parse regular expression */
    NFA nfa = parser.get_parse_tree()->build_nfa(); /* build NFA */
    NFAEmulator nfaemu(nfa); /* Initialize NFA emulator */

    std::string match;
    std::cin >> match; /* read input string */

    /* process string */
    for (unsigned int i=0; i < match.length(); ++i)
        nfaemu.move(match[i]);
    if (nfaemu.is_accepted()) std::cout << "matched";
    else std::cout << "not matched";
    std::cout << std::endl;

    /* free states allocated in heap */
    nfa.delete_states();
}
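
Assuming the program is saved in a single file (the file name below is made up), a sample session looks like this:

$ g++ -o rematch rematch.cpp
$ ./rematch
(a|b)*c
abbac
matched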

9. Further Reading

The aforementioned Compilers: Principles, Techniques, and Tools contains every idea presented here. Introduction to Automata Theory, Languages, and Computation by Hopcroft, Motwani and Ullman explains the mathematics behind automata and context-free grammars. Finally, this is a very good online resource about regular expressions in practice, with parsing done by Dijkstra's shunting yard algorithm and reverse Polish notation processing, which is much simpler than the LL(1) parser presented here.

Saturday, December 25, 2010

updatable SQL table with external ordering

Nearly half a year ago a crucial bug was found in uguu. It was fixed at the time and it hasn't bothered us since. I've always wanted to write about it here.

We store all files in a single SQL table. Each row has a path_id and a file_id. The first is the same for all files from the same directory. The second represents a (lexicographical) order on files. We assign file ids sequentially, without gaps. While updating the contents of a share we need to insert/delete some rows in the files table, preserving the sequential order.


First solution.

To insert a row with file id N, increment the ids of rows with ids >= N:
UPDATE files SET file_id = file_id + 1 WHERE file_id >= N AND path_id = some_path
then insert the new row with file_id N.

Deletion is performed analogously:
DELETE FROM files WHERE file_id = N AND path_id = some_path
UPDATE files SET file_id = file_id - 1 WHERE file_id > N AND path_id = some_path

Clearly, for a directory with M files, adding K new files with ids 1, 2, ..., K can take O(M*K^2) time. Even more frustrating, if all changes are done in a single transaction, the amount of additional space can grow dramatically on PostgreSQL 8.4. The same is true for deletion.


Linear solution.

The work is done in two passes. In the first pass removed files are deleted from the table and new files are inserted into a new temporary table. In the second pass file ids are corrected in a single loop with the help of PostgreSQL cursors: we loop over the rows of the old table and use a cursor pointing to the current entry in the new table. After the second pass the gaps in the file ids of the old table are exactly the ids of the new files (in the temporary table). Finally, we insert everything from the temporary table into the old one, as sketched below. For an implementation of the second pass see the function push_path_files here.
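
To illustrate the idea of the second pass, here is a minimal C++ sketch of the merge (the real implementation is a PL/pgSQL function operating on tables and cursors; the names below are made up). Both row sets are sorted by file name; one loop assigns sequential ids, and the ids handed to new rows are exactly the gaps left between the renumbered old rows:

#include <cstddef>
#include <string>
#include <vector>

struct Row { std::string name; int file_id; };

/* Merge old and new rows (both sorted by name), assigning sequential
 * file ids in a single pass. */
void renumber(std::vector<Row> &old_rows, std::vector<Row> &new_rows) {
    std::size_t i = 0, j = 0;
    int id = 1;
    while (i < old_rows.size() || j < new_rows.size()) {
        bool take_new = (i == old_rows.size()) ||
            (j < new_rows.size() && new_rows[j].name < old_rows[i].name);
        if (take_new)
            new_rows[j++].file_id = id++;  /* row in the temporary table */
        else
            old_rows[i++].file_id = id++;  /* UPDATE of an old row */
    }
}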

Thursday, November 11, 2010

using template translator in binary translation system

Our bosses at work asked us to give a talk about the template translator (a rather simple binary translation layer) at a conference. The text and slides are now available (in Russian).

Monday, August 23, 2010

a contribution

It seems a little patch I sent to the btpd project more than a year ago has finally been accepted: commit, original message

Thursday, April 22, 2010

Evolution of uguu scanning process

Scanning is an essential part of a search engine: it determines how the database is filled and updated, and you can't search if nothing was scanned before. In this post I'd like to cover how scanning is implemented in uguu and how it has evolved over time.

From the very start it was decided that scanning should be split between two programs. The first one connects to a share, somehow constructs a list of the share's contents and prints it to the standard output. The second one is a protocol-independent script which reads the contents from the standard input and executes a sequence of SQL commands to insert them into the database. This sounded very reasonable at the time, since I had already written smbscan in C and I didn't want to deal with SQL in C; Python appeared to be easier for this task. Also, simple text output helps a lot with testing and debugging.

smbscan itself also consisted of a protocol-dependent part (connecting to a share, going up, going to a subdirectory, reading a directory) and an independent one (constructing the file tree). When Radist (the other uguu developer) wanted to port his FTP scanner to uguu, I asked him to use the tree constructor from smbscan so we could change the scanner output format more easily if necessary. Later this common part evolved into libuguu.

Now it's time to turn back to the scanning process. At first we thought that when the time to scan a share comes, all contents of the share should be dropped from the database and the share itself should be rescanned by the protocol-dependent scanner, with its output read by the database updating script. If an error occurs, an SQL transaction rollback is executed and the scanning process is marked as failed. If the scanner reports success, the changes are committed to the database. The whole process is done inside a single SQL transaction, so there is no need to care about what our users see in the middle of scanning: they see the old version until the new one is committed.

When we first launched uguu it turned out scanning was very slow. Some optimizations were required. The first idea that came to my mind was not to perform the database update if the contents of a share were unchanged. I thought of executing diff on scanner outputs saved to files, but Radist suggested comparing only cryptographic hash digests. We started to keep the SHA1 digest of the scanner output in the database and checked against it before each update. Since many shares don't change their contents often, uguu gained a big speedup here.

By that time I was thinking: "If only it were possible to diff the old and new scanner outputs and commit only the actual changes..." But scanning speed wasn't the biggest problem anymore; we needed to improve search time next. At the beginning of April a database schema change was committed and scanning became the major issue again.

The next long-awaited feature was automatic recursion detection in the tree constructor. The source of the problem is that filesystem links are not always detectable by the client. Suppose you have a directory A and you create a symbolic link B in it pointing to A. Then an SMB protocol client would see that B is just a plain directory which has a subdirectory B, and so on infinitely. If A has a huge subtree, it is copied each time the SMB client enters B, resulting in a very huge scanner output and of course a big database workload. The solution was to calculate an MD5 digest for each directory and compare it against those of its ancestors. If a match with some ancestor occurs, then the directory's parent's MD5 is compared to that ancestor's parent's MD5, and so on. Recursion is declared detected if along the path to the root we find a chain of directories whose MD5s are the same as those of that chain shifted to the end of the path (the current directory), and the overall number of items in the chain's directories is bigger than a certain threshold. The latter condition provides a scale for the chain size: from 1 if the linked directory is large, up to the threshold if there is a directory with only one child, the link itself.
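
For what it's worth, here is a C++ sketch of one possible reading of this rule (the Dir structure, the threshold value and the exact matching strategy are all made up; the real check lives in libuguu):

#include <cstddef>
#include <string>
#include <vector>

struct Dir { std::string md5; unsigned items; };

const unsigned THRESHOLD = 16; /* made-up value */

/* path[0] is the root, path.back() is the current directory. Recursion is
 * suspected if the chain of digests ending at the current directory also
 * appears ending at some proper ancestor, and the chain's directories
 * contain more than THRESHOLD items in total. */
bool recursion_detected(const std::vector<Dir> &path) {
    std::size_t n = path.size();
    for (std::size_t anc = 0; anc + 1 < n; ++anc) {
        std::size_t k = n - 1 - anc;      /* chain length to compare */
        if (k > anc + 1) continue;        /* chain must fit above anc */
        bool match = true;
        unsigned items = 0;
        for (std::size_t i = 0; i < k && match; ++i) {
            match = (path[anc - i].md5 == path[n - 1 - i].md5);
            items += path[n - 1 - i].items;
        }
        if (match && items > THRESHOLD)
            return true;
    }
    return false;
}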

Recursion detection is a very useful feature when there are shares with recursive links. But there are very few such shares in our environment.

The next idea was quite simple to implement: if two shares have the same SHA1 digest, keep only one copy of the contents in the database. This was supposed to reduce the number of files in the database for hosts serving the same contents via both SMB and FTP. But in reality this wasn't very helpful, since some changes can occur between the SMB and FTP scans, and, more frustratingly, when a filename is not good enough it is shown differently via SMB and FTP. The latter argument kicks the whole idea to the ground.

So only the last hope remained: constructing a diff against the old contents. Compared to the previous work this was a big task with a huge amount of new code in libuguu, but it is finally done and is currently being tested on our installation.

Tree diffing is very straightforward: first the whole old tree is reconstructed from a file; then, while constructing the new tree using the protocol-dependent walker, the tree constructor compares the new contents against the old ones. If a new directory is added (respectively, an old one is deleted), the tree constructor prints the whole subtree with the prefix '+' (respectively '-'). The same goes for files. If only the size or the number of items in a directory (or just the size, in the case of a file) has changed, then only that directory (file) is printed, with the '*' modifier. Strictly speaking, the output is a bit more complex, to help further processing by the database updating script.
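
Schematically (the real format carries more detail), a diff for a share where a directory docs was renamed to papers and one file changed its size might look like this (all names are made up):

+ papers
+ papers/automata.pdf
- docs
- docs/automata.pdf
* movies/cam.avi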

The dullness of the algorithm shows in a situation when just a directory's name has changed and that directory has a huge subtree: two whole trees will be present in the diff (one with the '-' prefix and the other with '+'). However, in uguu we keep a tsvector of the entire path for each file to allow searching in full paths, so we would have to update the entire subtree anyway.

Now uguu has scanners for FTP, SMB and WebDAV (the latter was written last weekend and is still quite unreliable). All of them use the tree constructor from libuguu and thus have all the features described above.

PS. Thanks to Radist for pointing out some errors.