---
title: 'Unhashing LSH: Addressing the gap between theory and practice'
author: James
type: post
date: -001-11-30T00:00:00+00:00
draft: true
url: /?p=308
categories:
  - Uncategorized
---

I’ve recently been trying to become intimately familiar with how LSH works in theory and practice, in order to solve some prickly similarity problems that would otherwise require O(n²) comparisons.

For the uninitiated, LSH, or Locality-Sensitive Hashing, is a method frequently used for “sketching” large data structures in a way that allows quick comparison and grouping of similar items without having to compare every item with every other item in the dataset. It’s often mentioned in the same breath as “the curse of dimensionality”: the problem that complex data structures like documents and images must be represented in terms of the words or pixels they contain, and those representations quickly add up, requiring enormous amounts of memory and compute time to process.

The academic literature on LSH is factually accurate and mathematically complete, but it’s really hard going. At the other end of the spectrum are some incredibly helpful blog posts that tell you how LSH works in practice. This post aims to explain the connections between the two.

## Nearest-Neighbour (NN) Problem

The nearest-neighbour problem is that of finding the data points or items “most similar” to a particular starting point or “query”. For example, suppose we are building a music recommendation system and we know that the user likes song **q**. We want to find songs similar to **q** to recommend to the user. We can represent each song as a vector of its attributes – let’s say for simplicity that we’re using 3 dimensions, each on a scale of 1-10: tempo (slow to fast), singer pitch (deep to high) and heaviness (pop rock to death metal). If you plot all of the songs in your catalogue in this way, the ones with a similar sound should end up clustered together.
Steven Tyler, David Lee Roth and Freddie Mercury all have relatively high voices, and Aerosmith, Queen and Van Halen are of a similar “heaviness”, but Aerosmith’s “I Don’t Want to Miss a Thing” is a slow ballad compared to the upbeat “Jump”, and most of Boh-Rhap is speedy too.
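To make this concrete, here’s a minimal sketch in Python of the toy catalogue as 3-dimensional vectors, together with the exhaustive search described next. The attribute scores are illustrative guesses of my own, not real data.

```python
import math

# Toy catalogue: each song is a 3-dimensional vector of
# (tempo, singer pitch, heaviness), each on a 1-10 scale.
# The scores below are illustrative guesses, not real data.
catalogue = {
    "Aerosmith - I Don't Want to Miss a Thing":  (2, 8, 4),
    "Queen - Bohemian Rhapsody":                 (6, 9, 5),
    "Van Halen - Jump":                          (7, 8, 5),
    "Dragonforce - Through the Fire and Flames": (10, 7, 9),
    "Slipknot - Duality":                        (8, 4, 10),
    "System of a Down - Chop Suey!":             (9, 6, 9),
}

def euclidean(a, b):
    """Straight-line distance between two attribute vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def nearest_neighbour(query, catalogue):
    """Exhaustive search: compare the query against every other song."""
    return min(
        (title for title in catalogue if title != query),
        key=lambda title: euclidean(catalogue[query], catalogue[title]),
    )

print(nearest_neighbour("Van Halen - Jump", catalogue))
```

With these made-up scores the search returns “Queen - Bohemian Rhapsody”, at the cost of one distance computation per song in the catalogue.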
In order to find the nearest neighbours for a given data point – for example **q** = “Van Halen – Jump” – we have to loop over all items, find the [euclidean distance][1] between the points and then take the point with the smallest distance as the most similar. In this case, it’s Queen’s Boh-Rhap! Here there are 6 songs and therefore only 5 comparisons, but what if we have a music library of millions of songs? That’s an awful lot of comparisons! It’s also reasonable to assume that we’d be interested in comparing more than 3 attributes of a song – we can’t render more than 3 dimensions on a diagram, but as you can imagine, if there are 1,000 or even 10,000 attributes then working out the euclidean distance becomes much more expensive. How can we reduce the number of comparisons needed?

[1]: https://en.wikipedia.org/wiki/Euclidean_distance

## Approximate Nearest-Neighbour (ANN) Problem

In order to speed up the recommendation process, we need to artificially reduce the number of comparisons that our system has to make. What if we had some prior knowledge about which part of the feature space song **q** sits in, and chose only to compare it with other songs from that part? Continuing the example above, let’s imagine that we divide our space into two buckets, “rock” and “metal”: if we already know that “Van Halen – Jump” is a rock song, then we can immediately discount Dragonforce, Slipknot and System of a Down as possible nearest neighbours and compare only against Aerosmith and Queen.

You may have already noticed that there’s a catch here: we’re at risk of missing potential nearest neighbours that sit on the border of our divisions. Metallica’s “Nothing Else Matters” is an unusually slow, balladic number from the thrash-metal heavyweights, and many people who don’t otherwise like Metallica might enjoy it – especially if they like Aerosmith’s “I Don’t Want to Miss a Thing” and other pop-rock ballads.

The trade-off here is one of speed versus accuracy. By drawing lines of division down through our collection, we reduce the number of comparisons we usually need to make, but risk missing near neighbours that are “on the edge” in the process. We can somewhat address this problem by dividing our collection up into buckets in a few different ways and checking a handful of them. For example, “Queen – Bohemian Rhapsody” could belong to both “singers with high voices” and “rock ballads”.
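To show what dividing the collection up “in a few different ways” might look like in code, here’s a sketch of exactly that using random hyperplanes – one classic family of LSH functions for vector data, and only one of several possible choices. The hash construction and parameter values are illustrative assumptions rather than a canonical recipe, and the sketch reuses the `catalogue` dict and `euclidean` helper from the earlier snippet.

```python
import random

# Assumes `catalogue` and `euclidean` from the previous sketch are in scope.

def random_hyperplane(dim):
    # A random direction in attribute space; which side of it a song
    # falls on contributes one bit to the song's bucket label.
    return [random.gauss(0, 1) for _ in range(dim)]

def bucket_label(vector, hyperplanes):
    # One bit per hyperplane: songs on the same side of every
    # hyperplane get the same label, i.e. land in the same bucket.
    return tuple(
        int(sum(h_i * v_i for h_i, v_i in zip(h, vector)) >= 0)
        for h in hyperplanes
    )

def build_tables(catalogue, num_tables=3, bits=2, dim=3, seed=0):
    # Several independent bucketings ("tables"): a near neighbour that
    # straddles a boundary in one table may share a bucket in another.
    random.seed(seed)
    tables = []
    for _ in range(num_tables):
        hyperplanes = [random_hyperplane(dim) for _ in range(bits)]
        buckets = {}
        for title, vector in catalogue.items():
            buckets.setdefault(bucket_label(vector, hyperplanes), []).append(title)
        tables.append((hyperplanes, buckets))
    return tables

def candidates(query_vector, tables):
    # Union of the query's buckets across all tables: the only songs
    # we bother comparing exhaustively.
    pool = set()
    for hyperplanes, buckets in tables:
        pool.update(buckets.get(bucket_label(query_vector, hyperplanes), []))
    return pool

tables = build_tables(catalogue)
pool = candidates(catalogue["Van Halen - Jump"], tables)
pool.discard("Van Halen - Jump")
print(pool)  # the candidate pool we'd then rank with euclidean()
```

Each table gives every song a short bit-label, and only songs sharing the query’s label in at least one table are compared exhaustively with `euclidean`. More tables mean fewer missed “edge” neighbours at the cost of a larger candidate pool – the same speed-versus-accuracy dial described above. (For real data you’d typically centre the vectors first, since raw 1-10 scores all sit in the positive orthant and a hyperplane through the origin may not split them well.)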