---
title: "Unhashing LSH: Addressing the gap between theory and practice"
author: James
type: post
date: -001-11-30T00:00:00+00:00
draft: true
url: /?p=308
categories:
  - Uncategorized
---

I've recently been trying to become intimately familiar with how LSH works in theory and practice, in order to solve some prickly comparison problems that would otherwise require O(n²) comparisons.

For the uninitiated, LSH, or Locality-Sensitive Hashing, is a method frequently used for “sketching” large data structures in a way that allows quick comparison and grouping of similar items without having to compare every item with every other item in the dataset. It's often mentioned in the same breath as “the curse of dimensionality”: the problem that complex data structures like documents and images must be represented in terms of the words or pixels they contain, which quickly add up and require enormous amounts of memory and compute time to process.

The literature on LSH is factually accurate and mathematically complete, but at the same time it's really hard going. At the other end of the spectrum are some incredibly helpful blog posts that explain how LSH works in practice. This post aims to explain the connections between the two.

Nearest-Neighbour (NN) Problem

The nearest neighbour problem is that of finding the data points or items “most similar” to a particular starting point or “query”. For example, suppose we are building a music recommendation system and we know that the user likes song **q**. We want to find songs similar to song **q** to recommend to the user. We can represent each song as a vector of its attributes. Let's say for simplicity that we're using 3 dimensions, each on a scale of 1-10: Tempo (slow to fast), Singer Pitch (deep to high) and Heaviness (Pop Rock to Death Metal). If you plot all of the songs in your catalogue in this way, the ones with a similar sound should end up clustered together.

Steven Tyler, David Lee Roth and Freddie Mercury all have relatively high voices, and Aerosmith, Queen and Van Halen are of a similar “heaviness”, but Aerosmith's DWTMAT is a slow ballad compared to the upbeat Jump, and most of Boh-Rhap is speedy too.

In order to find the nearest neighbours for a given data point, for example where **q** is “Van Halen Jump”, we have to loop over all items, find the Euclidean distance between the points and then take the point with the smallest distance as the most similar. In this case, it's Queen's Boh-Rhap! Here there are 6 songs and therefore only 5 comparisons. What if we have a music library of millions of songs? That's an awful lot of comparisons!
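That brute-force loop can be sketched in a few lines of Python. The attribute values below are invented purely for illustration; the point is that every song in the catalogue costs us one distance computation:

```python
import math

# Each song is a (tempo, singer pitch, heaviness) vector on a 1-10 scale.
# These values are made up for the sake of the example.
songs = {
    "Aerosmith - DWTMAT": (2, 8, 5),
    "Queen - Boh-Rhap": (7, 9, 5),
    "Van Halen - Jump": (8, 8, 5),
    "Dragonforce - Through the Fire and Flames": (10, 7, 9),
    "Slipknot - Duality": (8, 3, 10),
    "System of a Down - Chop Suey!": (9, 5, 9),
}

def euclidean(a, b):
    """Straight-line distance between two attribute vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def nearest_neighbour(query, catalogue):
    """Brute force: one distance computation per song in the catalogue."""
    return min(
        (name for name in catalogue if catalogue[name] != query),
        key=lambda name: euclidean(query, catalogue[name]),
    )

print(nearest_neighbour(songs["Van Halen - Jump"], songs))
# → Queen - Boh-Rhap
```

With 6 songs this is 5 distance computations; with a million songs it is 999,999, every time a user asks for a recommendation.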

It's also reasonable to assume that we'd be interested in comparing more than 3 attributes of a song. We can't render more than 3 dimensions on a diagram, but as you can imagine, if there are 1,000 or even 10,000 attributes then working out the Euclidean distance becomes much more expensive. How can we reduce the number of comparisons needed?

Approximate Nearest-Neighbour (ANN) Problem

In order to speed up the recommendation process, we need to artificially reduce the number of comparisons that our system has to make. What if we had some prior knowledge about which part of the feature space song q is in and chose only to compare it with other songs from that space?

Drawing on the example above, let's imagine that we divide our space into two buckets: rock and metal. If we already know that Van Halen's Jump is a rock song then we can immediately discount Dragonforce, Slipknot and System of a Down as possible nearest neighbours and compare only with Aerosmith and Queen.

You may have already noticed that there's a catch here. We're at risk of missing potential nearest neighbours that sit on the border of our divisions. Metallica's Nothing Else Matters is an unusually slow, balladic number from the thrash metal heavyweights, and many people who don't otherwise like Metallica might enjoy it, especially if they like Aerosmith's Don't Wanna Miss a Thing and other pop-rock ballads. The trade-off here is one of speed versus accuracy. By drawing lines of division down through our collection, we reduce the number of comparisons we usually need to make, but risk missing near neighbours that are “on the edge” in the process. We can somewhat address this problem by dividing our collection up into buckets in a few different ways and checking a handful of them. For example, Queen's Bohemian Rhapsody could belong to both “singers with high voices” and “rock ballads”.
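The “divide in a few different ways” trick can be sketched as checking a song's bucket under several independent partitions and taking the union of the candidates. The partition assignments below are invented for illustration, but this structure of multiple hash tables queried together is the shape that LSH ends up taking:

```python
# Two independent ways of partitioning the same catalogue. A song is a
# candidate neighbour if it shares a bucket under *any* partition.
# The bucket assignments are made up for the sake of the example.
partitions = [
    {  # partition 1: by broad genre
        "Queen - Bohemian Rhapsody": "rock",
        "Aerosmith - Don't Wanna Miss a Thing": "rock",
        "Metallica - Nothing Else Matters": "metal",
    },
    {  # partition 2: by song style
        "Queen - Bohemian Rhapsody": "ballad",
        "Aerosmith - Don't Wanna Miss a Thing": "ballad",
        "Metallica - Nothing Else Matters": "ballad",
    },
]

def candidates(query):
    """Union of the query's bucket-mates across every partition."""
    found = set()
    for partition in partitions:
        bucket = partition[query]
        found.update(s for s, b in partition.items() if b == bucket and s != query)
    return found

print(candidates("Aerosmith - Don't Wanna Miss a Thing"))
# Metallica is missed by the genre partition but caught by the style one.
```

Each extra partition costs a little more lookup work but recovers some of the “on the edge” neighbours that any single division would cut off.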