I started taking notes in markdown format following David Branner’s suggestion. Here are the two main reasons.
- They look much better in a browser than in a text editor in a terminal.
- I can access them anywhere.
All I needed now was a search mechanism like the one David has. I forked his notes repo, but on a particular Thursday I started to build one of my own, for reasons that fall outside the scope of this blog.
To use NoteRanger, you have to first index the documents. The indexer then generates two databases which the Search uses to show you the search results.
Collect the list of documents and put it in a file. Here’s my file containing the list of files that I need to search.
I created a softlink to my notes directory in the directory where my executable is. Hence I use the following command to generate the index in the js directory.
$ ./noteranger notes_filelist js/
The indexing part is written in C++. I used some of the features of C++11. It can be built using the Makefile. You will need a G++/Clang++ that supports C++11 and the Boost libraries.
It has a parser which parses (duh!) the documents and discards invalid tokens (for instance, I don’t store special characters like “/”, “#”, “*” etc.). It then builds two databases. The first is called TermDB, the database storing the terms. The TermDB is an inverted index: each term maps to its posting list. A posting list is just a fancy name for the list of all the documents in which the term occurs.
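To make the token-filtering step concrete, here is a minimal sketch of what such a parser pass could look like. The function name `tokenize`, the lowercasing, and the ASCII-only treatment are all assumptions made for illustration; none of them is taken from NoteRanger's source.

```cpp
#include <cassert>
#include <cctype>
#include <string>
#include <vector>

// Hypothetical helper (not from NoteRanger): split text into lowercase
// ASCII alphanumeric tokens. Anything else, such as '/', '#' or '*',
// acts as a separator and is never stored.
std::vector<std::string> tokenize(const std::string& text) {
    std::vector<std::string> tokens;
    std::string current;
    for (unsigned char c : text) {
        if (std::isalnum(c)) {
            current.push_back(static_cast<char>(std::tolower(c)));
        } else if (!current.empty()) {
            // A separator ends the token collected so far.
            tokens.push_back(current);
            current.clear();
        }
    }
    if (!current.empty()) tokens.push_back(current);
    return tokens;
}
```

With this sketch, a markdown line such as `# Automatic *variables*` yields the two tokens `automatic` and `variables`, and the markup characters disappear.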
The TermDB stores the terms and the list of documents in which the term occurs in a map. The list is actually a list of document IDs. The document IDs are generated by hashing the string containing the path to the document.
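A rough sketch of that structure, under a few assumptions: `std::hash` stands in for whatever hash function the indexer really uses, a `std::set` stands in for the list so the IDs stay sorted and unique, and the names `TermDB`, `make_doc_id` and `add_document` are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <map>
#include <set>
#include <string>
#include <vector>

using DocId = std::size_t;
// term -> posting list (document IDs, kept sorted by std::set)
using TermDB = std::map<std::string, std::set<DocId>>;

// Assumption: the document ID is simply std::hash of the path string.
DocId make_doc_id(const std::string& path) {
    return std::hash<std::string>{}(path);
}

// Record that every term in `terms` occurs in the document at `path`.
void add_document(TermDB& db, const std::string& path,
                  const std::vector<std::string>& terms) {
    assert(!path.empty());
    const DocId id = make_doc_id(path);
    for (const auto& term : terms) {
        db[term].insert(id);  // repeated terms collapse into one entry
    }
}
```

Note that a hash of the path can collide in principle, so a real indexer may need to detect two paths that map to the same ID.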
The other database is called DocumentDB. The DocumentDB is a map from each document ID to the path of the document. It also contains the header of the document.
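A sketch of how such a map might be laid out. What counts as the "header" is an assumption here: the first non-empty line of the document with any leading `#` markers stripped. The names `DocumentInfo`, `extract_header` and `register_document` are hypothetical, and `std::hash` again stands in for the real hash function.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <map>
#include <sstream>
#include <string>

struct DocumentInfo {
    std::string path;
    std::string header;
};
// document ID -> path and header
using DocumentDB = std::map<std::size_t, DocumentInfo>;

// Assumption: the header is the first line with real content, with
// leading '#' and whitespace and trailing whitespace removed.
std::string extract_header(const std::string& text) {
    std::istringstream in(text);
    std::string line;
    while (std::getline(in, line)) {
        const std::size_t start = line.find_first_not_of("# \t\r");
        if (start == std::string::npos) continue;  // blank or only '#'
        const std::size_t end = line.find_last_not_of(" \t\r");
        return line.substr(start, end - start + 1);
    }
    return "";
}

// Store the path and header under the hashed document ID.
std::size_t register_document(DocumentDB& db, const std::string& path,
                              const std::string& text) {
    assert(!path.empty());
    const std::size_t id = std::hash<std::string>{}(path);
    db[id] = DocumentInfo{path, extract_header(text)};
    return id;
}
```

The search side can then turn a document ID from a posting list back into something readable: a path to link to and a header to display.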
- IndexEntries.js – the TermDB
- TupleStorage.js – the DocumentDB
If you want to test it locally, open the file index.html in a browser; it should work and look like this.
If you enter multiple terms, for instance “automatic variables”, I currently do the following:
1. Get the posting list for the term “automatic”.
2. Get the posting list for the term “variables”.
3. Intersect these two lists. Since I store the document IDs in each list in sorted order, the intersection can be computed in a single pass over both lists, and the result is at most the size of the smaller list.
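The three steps above can be sketched as follows. The search itself runs in the browser, so this C++ version is only an illustration kept in the indexer's language; `intersect` is a hypothetical name, and the posting lists are assumed to be sorted vectors of document IDs.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <iterator>
#include <vector>

// AND query over two posting lists. Both inputs must already be sorted;
// std::set_intersection then walks them once, in step.
std::vector<std::size_t> intersect(const std::vector<std::size_t>& a,
                                   const std::vector<std::size_t>& b) {
    assert(std::is_sorted(a.begin(), a.end()));
    assert(std::is_sorted(b.begin(), b.end()));
    std::vector<std::size_t> result;
    // The result can never be larger than the smaller input.
    result.reserve(std::min(a.size(), b.size()));
    std::set_intersection(a.begin(), a.end(), b.begin(), b.end(),
                          std::back_inserter(result));
    return result;
}
```

For example, intersecting the lists `{1, 4, 7, 9}` and `{2, 4, 9, 12, 15}` gives `{4, 9}`: the documents that contain both terms.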
So that was my AND query. This is not the most correct way of doing things, because it returns a lot of false positives when documents contain both “automatic” and “variables” but not next to each other. What I probably wanted were the documents in which “automatic” and “variables” appear together. But that’s a feature that I’ll start working on after some time.
A couple of them.
1. Adding support for phrase queries will be interesting.
2. Adding ranking will also be interesting.
3. Learn more about Information Retrieval.
Please feel free to fork it, use it, review it, report bugs in it and share your ideas for the “Plans” section.