Cool bitwise stuff

I just learned this cool bit manipulation staff yesterday and I am dying to write it down. Here it is –

Let’s write a program to decide if a number is a power of 2.

snip20161004_1

The above code executes at least 31 times on a 32 bit or 64 bit Linux machine.

Can it be made better? Can a number be checked whether it’s power of 2 in less number of instructions?

Let’s park that question for a bit and let’s talk about number representations. The computers represent the numbers in 2’s complement representation. What that means is that the negative numbers are represented in the following form –

(2^b - N)

N = the number
b = number of bits

You can get 2’s complement representation of the number -X by doing the following in binary –

~X + 1

X = binary representation of the positive integer

Let’s take an example.

The 8 bit binary representation of 4 is = 0000 0100
The 8 bit binary representation of -4 is = ~(0000 0100) + 1 = 1111 1011 + 1 = 1111 1100

Let’s take another example.

The 8 bit binary representation of 6 is = 0000 0110
The 8 bit binary representation of -6 is = ~(0000 0110) + 1 = 1111 1001 + 1 = 1111 1010

Now if we AND a number with its 2’s complement representation we see something useful –

In case of 4, AND = 0000 0100
In case of 6, AND = 0000 0010

What does the above pattern say. It says that the first set bit in the actual number is also set in the result of the AND. In case of 4, the third bit was the first set bit. And it’s set in the result of the AND too. In case of 6, the second bit was the first set bit. And it’s set in the result of the AND too.

Now if you think of all the powers of two they have exactly one bit set. If we represent integers in 8 bits, then the possible powers of 2 in this range are –

0000 0001
0000 0010
0000 0100
0000 1000
0001 0000
0010 0000
0100 0000
1000 0000

Which means for the powers of two, the result of the AND operation will always give us the same number. Hence we can just check the equality between the input number and the result of the AND to detect if the number is power of 2 and it’s faster than the previous code too.

Here’s the code
snip20161004_2

I profiled both the functions by calling them on 1 billion numbers. The above code using 2’s complement representation takes the following time

real    0m5.846s
user    0m5.830s
sys     0m0.010s

The old iterative version on the same set of numbers takes the following time

real    2m9.093s
user    2m7.756s
sys     0m0.660s

And as you can see the faster version is almost 25 times faster than the iterative version.

Cool bitwise stuff

NoteRanger Beta – a basic text based search

I started taking notes in markdown format following David Branner’s suggestion. Here are the two main reasons.

  1. They look much better on a browser than on a text editor on a terminal.
  2. I can access them anywhere.

All I needed was now a search mechanism like the one David has. I forked his notes repo but on a particular Thursday I started to build one of my own for reasons that do not fall under the scope of this blog.

The NoteRanger has two components – Indexer and the Search. I wrote my own indexer in C++ and I am using David’s Javascript code with some addition and modification for the search.

Here’s the search and here’s the code.

Indexing

To use NoteRanger, you have to first index the documents. The indexer then generates two databases which the Search uses to show you the search results.

Usage

Collect list of the documents and put them in a file. Here’s my file containing the list of files that I need to perform search on.

snip20160914_4

I created a softlink to my notes directory in the directory where my executable is. Hence I use the following command to generate the index in the js directory.


$./noteranger notes_filelist js/

Implementation

The indexing part is written in C++. I used some of the features of C++11. It can be built using the Makefile. You will need a G++/Clang++ that supports C++11 and the Boost libraries.

It has a parser which parses (duh!) the documents, discards invalid tokens (for instance, I wouldn’t store special characters like “/”, “#”, “*” etc.). Then it builds two databases. One is called TermDB – the database storing the terms. The TermDB is in the format of an inverted index or a posting list. Posting list is just a fancy name of list of all the documents in which the term occurs.

The TermDB stores the terms and the list of documents in which the term occurs in a map. The list is actually a list of document IDs. The document IDs are generated by hashing the string containing the path to the document.

The other database is called DocumentDB. The DocumentDB is a map of document IDs and the path to the document. It also contains the header of the document.

After indexing the files, the two databases are written as two Javascript files.

  1. IndexEntries.js – the TermDB
  2. TupleStorage.js – the DocumentDB

Since they are in Javascript, you can guess that they are going to be used in the search directly. Hence, I have to check them in to my github repo everytime I make any change in the notes. I am yet to make a cronjob for it, but that’s the plan.

Search

The search is written in Javascript. It’s mostly David’s code. But I added the support to perform AND query (which I shall explain in a bit) and made it a bit more readable with help from Jake (a fellow Recurser).

Usage

If you want to test it locally, open the file index.html in browser and it should work and it should look like this.

Implementation

The search code is written in test.js. If you open the index.html, you’ll see that it includes two more javascript files that the Indexer has generated along with test.js. When you search a term, for instance “variables”, it looks up the map defined in IndexEntries.js and gets the document IDs of the documents in which the term occurs. And using the document IDs, it looks up the TupleStorage.js to get the document name and the header. And then it shows a list of links to the documents.

If you enter multiple terms, for instance “automatic variables”, currently I do the following –

1. Get the posting list for the term “automatic”.
2. Get the posting list for the term “variables”.
3. Perform an intersection of these two lists. Since I store the docuement IDs in the list in a sorted order, my intersection is at most the size of the smaller list.

So that was my AND query. Now this is not the most correct way of doing things. Because it will return a lot of false positives in case there are documents which contain both “automatic” and “variables” but not together. And I probably wanted to know the documents in which “automatic” and “variables” are together. But that’s a feature that I’ll start working on after some time.

Plans

A couple of them.
1. Adding the support of phrase queries will be interesting
2. Adding rank will also be interesting.
3. Know more about Information Retrieval.

Please feel free to fork it, use it, review it, report bugs in it and share your ideas for the “Plans” section.

NoteRanger Beta – a basic text based search

Last 2 days of week 2 at the Recurse Center

Day 3: Coding-wise, haven’t done much. But otherwise the day was well-spent in other sense. Met David Branner. David took my first interview when I applied for RC. And he is super-helpful. He gave me quite a few tips on time-management and tracking my progress. Here’s how he searches his notes on coding.

Did a code retreat convened by Emil. It was fun!. We worked on to build the Game of Life following TDD. It consisted of four 45-minutes sessions where we paired up with different people and did ping-pong pairing. At the end of each session, we were to throw away the code we have written and start with a fresh mind with a new person. In each session, the rules kept of changing. For instance, the second session had the rule that each function can not be more than 3 lines. The rules of last session were interesting where you can either not talk to the partner or not run the test case for more that twice etc.

Missed the last session to have a one-to-one with James, one of the facilitators at RC. We talked about my goals and the projects that I am going to focus on for the rest of the weeks at RC. That was helpful. There’s another post for what I have decided to do at RC.

Then went on to attend a Goal Setting workshop by Stacey which was helpful too.

Finally, tried to generate doxygen docs for Bitfunnel. But it didn’t turn around to be much useful since the source files do not have doxy formatted comments. But I can see the class diagrams now.

Day 4: Dan shares what he and Mike has thought I should focus on. He gives quite a few options and I like one of them. Will start exploring them from next week.

Spent pretty much the entire day writing Kernighan-Lin in C++ and I am not done yet.

Hang around the web-architecture white-boarding interview prep and listen to how people are approaching a design problem.

Last 2 days of week 2 at the Recurse Center

What I am going to work on at RC (v1.0)

This is almost the end of week 2 at RC. And after much deliberation on my side and discussion with James (one of the facilitators at RC), here are my goals v 1.0 (To have the scope of them evolving even more) –

1. BitFunnel – I am going to keep on contributing to the open source text based search project. It’s in C++. Although contributing to an open source project was quite down below on the list of possible things I thought I would do at RC. But this project has a few things that got me hooked.

First, text-based search is quite an interesting area with the scope of learning new algorithms and tricks. Performance is an important factor in the field. Second, I get to work with two very experienced programmers. Third, since it’s in C++, I only have to learn about the domain without worrying about the programming language I am going to use in this project. And as an icing on the cake, it’s using C++17 which is something I always wanted to do.

2. Writing a graph partitioning library in Haskell. I worked on an FPGA prototyping tool in my last company and we used some very cool graph partitioning algorithms to partition the SoC designs. I always wanted to understand the algorithms and implementing some of well known algorithms like Kernighan-Lin, Fiduccia-Mattheyses looks like a good way to understand them.

I want to first implement them using C++, the language I am currently most comfortable with and then in Haskell, the language I am learning at present. Eventually I want to write METIS in Haskell.

And I want to end this post with the most important lesson I learned today. And that is – don’t worry too much about not having something cool and extra-ordinary built at the end of RC. Don’t worry too much about not making the best use of your precious time here at RC. Enjoy the process of learning and working on something which you are enthusiastic about. Meet and talk to the people in the RC community. There are some really cool and smart people here at RC and there’s so much to learn from them.

Have fun and never graduate!

What I am going to work on at RC (v1.0)

First 3 days of week 2 at the Recurse Center

Here’s my lazy summary of the first three days of week 2 at RC.

Day 1: Continue working on BitFunnel. By working, I mean trying to integrate a piece of old code into the new codebase. My goal is to gain an understanding of how a text-based search engine works.

Submit the first challenge for Haskell at Exercism. Discuss the second and the third problem with Katherine. Have started to get a little bit bored with Haskell too, specially I am not sure what I am going to build with Haskell.

The best part of the day is the talk by Emil Sit, the resident for next two weeks where he gives a very high level overview of the distributed systems and shares his experience.

Feeling: Worried over what I am going to work on at RC.

Day 2: Fix build errors and unit test failure in integrating a small component from the old BitFunnel code to the new code base. Spend some time understanding the data structures used. Also spend some time reading up basic ideas in text-based search. BitFunnel now has started to become interesting!

Spend one hour in reading up Haskell and solve the second challenge on Exercism.

Attend code dojo and pair with Cihan to solve the Euler Project problem 35 in Python. You have to find circular primes and as it turned out most of the people implemented the sieve of Eratosthenes to find out prime numbers in range.

Feeling: Still searching for what I am going to work on at RC.

Day3: Solved two problems in today’s Dropin Warmup. Dropin Warmups are daily one hour sessions where Rose, one of the facilitators, writes down two programming problems. The problems and my solutions to them are here.

Read a some more Haskell. Now I know currying, higher order functions, map, filter, lambda etc.

Started implementing the Kernighan-Lin graph partitioning algorithm in C++ and eventually I want to do it in Haskell.

Went to the New York Haskell Meetup group’s presentation on Concurrent Bloom Filter. Could grasp some of the Haskell code which felt nice.

Did the third challenge of Exercism.

First 3 days of week 2 at the Recurse Center