Categories
Uncategorized

NoteRanger Beta – a basic text based search

I started taking notes in markdown format following David Branner’s suggestion. Here are the two main reasons.

  1. They look much better on a browser than on a text editor on a terminal.
  2. I can access them anywhere.

All I needed was now a search mechanism like the one David has. I forked his notes repo but on a particular Thursday I started to build one of my own for reasons that do not fall under the scope of this blog.

The NoteRanger has two components – Indexer and the Search. I wrote my own indexer in C++ and I am using David’s Javascript code with some addition and modification for the search.

Here’s the search and here’s the code.

Indexing

To use NoteRanger, you have to first index the documents. The indexer then generates two databases which the Search uses to show you the search results.

Usage

Collect list of the documents and put them in a file. Here’s my file containing the list of files that I need to perform search on.

snip20160914_4

I created a softlink to my notes directory in the directory where my executable is. Hence I use the following command to generate the index in the js directory.


$./noteranger notes_filelist js/

Implementation

The indexing part is written in C++. I used some of the features of C++11. It can be built using the Makefile. You will need a G++/Clang++ that supports C++11 and the Boost libraries.

It has a parser which parses (duh!) the documents, discards invalid tokens (for instance, I wouldn’t store special characters like “/”, “#”, “*” etc.). Then it builds two databases. One is called TermDB – the database storing the terms. The TermDB is in the format of an inverted index or a posting list. Posting list is just a fancy name of list of all the documents in which the term occurs.

The TermDB stores the terms and the list of documents in which the term occurs in a map. The list is actually a list of document IDs. The document IDs are generated by hashing the string containing the path to the document.

The other database is called DocumentDB. The DocumentDB is a map of document IDs and the path to the document. It also contains the header of the document.

After indexing the files, the two databases are written as two Javascript files.

  1. IndexEntries.js – the TermDB
  2. TupleStorage.js – the DocumentDB

Since they are in Javascript, you can guess that they are going to be used in the search directly. Hence, I have to check them in to my github repo everytime I make any change in the notes. I am yet to make a cronjob for it, but that’s the plan.

Search

The search is written in Javascript. It’s mostly David’s code. But I added the support to perform AND query (which I shall explain in a bit) and made it a bit more readable with help from Jake (a fellow Recurser).

Usage

If you want to test it locally, open the file index.html in browser and it should work and it should look like this.

Implementation

The search code is written in test.js. If you open the index.html, you’ll see that it includes two more javascript files that the Indexer has generated along with test.js. When you search a term, for instance “variables”, it looks up the map defined in IndexEntries.js and gets the document IDs of the documents in which the term occurs. And using the document IDs, it looks up the TupleStorage.js to get the document name and the header. And then it shows a list of links to the documents.

If you enter multiple terms, for instance “automatic variables”, currently I do the following –

1. Get the posting list for the term “automatic”.
2. Get the posting list for the term “variables”.
3. Perform an intersection of these two lists. Since I store the docuement IDs in the list in a sorted order, my intersection is at most the size of the smaller list.

So that was my AND query. Now this is not the most correct way of doing things. Because it will return a lot of false positives in case there are documents which contain both “automatic” and “variables” but not together. And I probably wanted to know the documents in which “automatic” and “variables” are together. But that’s a feature that I’ll start working on after some time.

Plans

A couple of them.
1. Adding the support of phrase queries will be interesting
2. Adding rank will also be interesting.
3. Know more about Information Retrieval.

Please feel free to fork it, use it, review it, report bugs in it and share your ideas for the “Plans” section.