Thursday, April 22, 2010

Evolution of uguu scanning process

Scanning is very essential part of a search engine. It is about how database is filled and updated. You can't search if nothing was scanned before. In this post I'd like to cover how scanning in uguu is implemented and how it has evolved over time.

From the very start it was decided that scanning should be split between two programs. The first one connects to a share, somehow constructs a list of contents of the share and prints it to the standard output. The second one is protocol-independent script which reads contents from the standard input and executes a sequence of SQL commands to insert contents into the database. This sounded very reasonable that time since I had already written smbscan in C and I didn't want to deal with SQL in C. Python appeared to be easier for this task. Also, simple text output helps a lot with testing and debugging.

smbscan itself also consisted of protocol dependent part (connecting to a share, go up, go to subdir, readdir) and independent one (constructing file tree). When Radist (the other one uguu developer) wanted to port his ftp scanner to uguu I asked him to use tree constructor from smbscan so we could change scanner output format more easily if necessary. Later this common part evolved into libuguu.

Now it's time to turn back to the scanning process. At first we thought that when time to scan a share comes all contents of the share should be dropped from the database and the share itself should be rescanned by protocol-dependent scanner having it's output read by the database updating script. If some error occurs SQL transaction rollback is executed and scanning process is marked as failed. If the scanner report success then changes are committed to the database. All described process is done inside a single SQL transaction so there is no need to care about what our users see in the middle of scanning process - they see old version until new one is committed.

When we first launched uguu it appeared scanning was very slow. Some optimizations were required. First idea came to my mind was not to perform database updating if contents of a share were unchanged. I thought to execute diff on scanner outputs saved to files, but Radist suggested comparing only security hash digests. We started to keep SHA1 digest of the scanner output in the database and checked against it before each update. Since many shares don't change their contents often uguu gained a big speedup here.

By that time I thought "If it were possible to diff old and new scanner outputs and commit only actual changes...". But scanning speed wasn't the biggest problem anymore - we needed to improve search time next. At the beginning of April database schema change was committed and scanning became major issue again.

Next long-awaited feature was automatic recursion detection in the tree constructor. The source of the problem is that filesystem links are not always detectable by the client. Suppose you have a directory A and you create a symbolic link B in it pointed to A. Then SMB protocol client would see that B is just a plain directory and it has a subdirectory B and so on infinitely. If A has a huge subtree then it is copied each time SMB client enters B resulting very huge scanner output and of course big database workload. Solution was to calculate MD5 digest for each directory and compare against it's ancestors. If a match occurs then directory parent's MD5 is compared to that ancestor's MD5. Recursion is said to be detected if along the path to the root we find a chain of directories which MD5s are the same as of such chain moved to the end of the path (current directory) and overall number of items in that chain's directories is bigger than a certain threshold. The latter condition provides a scale for a chain size: from 1 if linked directory is large to the threshold if there is a directory with only one child - the link itself.

Recursion detection is very useful feature if there are shares with recursive links. But there are very few such shares in our environment.

The next idea was quite simple to implement: if two shares have the same SHA1 digest then keep only one copy of contents in the database. This was supposed to reduce amount of files in the database for hosts with same contents via both SMB and FTP. But in reality this wasn't very helpful since some changes can occur between SMB and FTP scanning and which is more frustrating when a filename is not good enough it is shown differently via SMB and FTP. The latter argument kicks the whole idea to the ground.

So the only last hope remained - to construct a diff against old contents. Compared to previous work this was a big task with a huge amount of new code in libuguu but it is finally done and currently is being tested on our installation.

Tree diffing is very straightforward: first the whole old tree is reconstructed from a file, then while constructing a new tree using protocol-dependent walker, tree constructor compares new contents against old ones. If a new directory is added (respectively an old one is deleted) tree constructor prints the whole subtree with prefix '+' (respectively '-'). Same goes for files. If only size or number of items in a directory (or just size in case of a file) has been changed then only the directory (file) is printed with '*' modifier. Strictly speaking, output is a bit more complex to help further processing by the database updating script.

Dullness of the algorithm can be viewed in a situation when just some directory's name has been changed and that directory has a huge subtree. Then two trees will be presented in diff (one with '-' prefix and the other with '+'). However in uguu we keep tsvector of entire path for each file to allow searching in full paths. So we would have to update the entire subtree anyway.

Now uguu has scanners for FTP, SMB and WebDAV (the latter was written last weekend and is still quite unreliable). All of them use the tree constructor from libuguu and thus have all the features described above.

PS. Thanks Radist for pointing out some errors.