January 16, 2016
Having finished Coursera’s Algorithms course recently, I looked for more interesting algorithms to implement. I found one data structure that can store large amount of data, use less memory than the competition, it’s implementation is quite short and concise, and gives an opportunity to practice some probability theory. That is the Bloom Filter.
You can use a Bloom Filter to answer a question similar to “did I see this item before?”, and you want the answer quickly—usually to avoid an otherwise costly operation to locate the item.
For example, imagine a server whose data is stored across the network. When getting a request, it’s worth checking whether the item even exists before performing a network call to retrieve the item.
Bloom Filter uses a bit array to know which items it has seen before. Each item is saved by having a few bits turned from 0 to 1. The exact bits are determined by running the item through a few hash function and converting the resulting hash number to a number from zero to n where n is the last bit.
The other data structure you might use for this purpose is a hash table. However, because a hash table needs to store the complete key, it uses a lot more memory. If we have 1 billion keys and each key is 255 bytes we will need at least 255GB. That’s more than most EC2 machines will have. However a comparable bloom filter will only need 12GB to store 1 billion keys.
1 << 128 = 1
In c++ the same operation would have resulted in 0.
My implementation can be found here. It uses murmurhash npm package for hashing the values.
When researching bloom filters, these articles were useful:
December 28, 2015
I recently finished Princeton’s Algorithms I and II on Coursera. The course covers a lot of material, from data structures (List, Queue, Hash Table, Trees, Graphs, Tries) to algorithms (Sorting, BFS, DFS, Shortest Path, RegEx, Compression).
The course requires more time than other courses. Expect to spend 3 hours a week watching lectures, an hour answering the quiz for every chapter, and the weekly exercises can take a few hours to solve.
Learning the history of CS was also interesting. How David Hilbert, Kurt Godel, Alan Turning, Alonzo Church and John von Neumann asked the basic questions about computing. And that Alan Turing did his PhD in Princeton.
Previously I used to think search problems are related to searching arrays. Now I know it means problems where the answer is a yes/no and that an answer can be checked in polynomial time. Contrast that with other types of problems mentioned in the course: optimization and decision problems.
The exercises are fun. They are done in Java and you will get to learn which algorithms Java uses for sorting / searching / string operations. I was especially impressed with solving a baseball elimination using max flow.
Another very useful data structure is the Interval Tree. Interview questions involving intervals are very popular and this structure makes interacting with intervals much more efficient.
The only topic I wish would have been covered more is dynamic programming. Dynamic programming is a technique of breaking down big calculation into small subproblems, solve the subproblems from the easiest once and use those solutions as part of solving the bigger subproblems.
There’s a Dynamic Programming chapter in this book: Introduction to Algorithms, 3rd Edition. And it’s covered in Stanford’s Algorithms: Design and Analysis, Part 2.
Lastly, the course ended with a song “longest path” recorded by Daniel J. Barrett in 1988 while a student at Johns Hopkins. Unlike the shortest path, there is no known polynomial time algorithm for the longest path. And one can spend their entire graduate school trying to find an algorithm for this problem. Listen to the song here
May 16, 2015
There are two Single Variable Calculus courses starting soon:
EDX - Calculus 1A starts in June 2 and split into 3 parts. Getting a verified certificate costs $100 for each part, or $300 in total. This course should be equivalent to MIT 18.01 course.
Coursera - Single Variable starts in May 22. Getting a verified certificate costs $50.
When it comes to books, MIT recommends MIT - Calculus, Second Edition by Gilbert Strang
April 05, 2015
I just finished taking the Machine Learning class from Coursera. Machine Learning started in the 80s, became more popular in the 90s and with the advent of powerful computers, is now a hot topic. The Machine Learning course came out in 2011, the New Yorker wrote about Machine Learning in 2012 and 2013. By now, everyone is talking about it. This morning I saw A Ted talk by Fei Fei Li about How we’re teaching computers to understand pictures
There are so many excellent online courses available these days. Scott Young took a whole CS degree just by taking MIT courses online. This inspires me to get back to learn Linear Algebra and Calculus. I spent some time researching which courses are available, and compiled them into a list. I also added the list of courses i’m planning to take during 2015.
When taking classes online there’s the question of how to take notes. I started taking notes on paper, until I stumbled upon Nathan Typanski’s blog. He talks about using Latex and Markdown which allow notes to be saved and viewed later. I found that writing formulas in Latex can be time consuming so I’m going back and fourth between taking screenshots of slides and using Latex.
If you are in San Francisco and taking online courses, check out the Colearning: Study Group for Online Courses meetup. Join us if you are around.