import numpy as np # library for array and matrix manipulation
import pprint # utilities for console printing
from utils_nb import plot_vectors # helper function to plot vectors
import matplotlib.pyplot as plt # visualization library
= pprint.PrettyPrinter(indent=4) # Instantiate a pretty printer pp

In this lab, we are going to practice the most important concepts related to the hash functions explained in the videos. We will be using these in this week’s assignment.
A key point for the lookup using hash functions is the calculation of the hash key or bucket id that we assign for a given entry. In this notebook, we will cover:
- Basic hash tables
- Multiplanes
- Random planes
Basic Hash tables
Hash tables are data structures that allow indexing data to make lookup tasks more efficient. In this part, we will see the implementation of the simplest hash function.
In the next cell, we will define a straightforward hash function for integer numbers. The function will receive a list of integer numbers and the desired amount of buckets. The function will produce a hash table stored as a dictionary, where keys contain the hash keys, and the values will provide the hashed elements of the input list.
The hash function is just the remainder of the integer division between each element and the desired number of buckets.
def basic_hash_table(value_l, n_buckets):
def hash_function(value, n_buckets):
return int(value) % n_buckets
= {i:[] for i in range(n_buckets)} # Initialize all the buckets in the hash table as empty lists
hash_table
for value in value_l:
= hash_function(value,n_buckets) # Get the hash key for the given value
hash_value # Add the element to the corresponding bucket
hash_table[hash_value].append(value)
return hash_table
Now let’s see the hash table function in action. The pretty print function (pprint()
) will produce a visually appealing output.
= [100, 10, 14, 17, 97] # Set of values to hash
value_l = basic_hash_table(value_l, n_buckets=10)
hash_table_example pp.pprint(hash_table_example)
{ 0: [100, 10],
1: [],
2: [],
3: [],
4: [14],
5: [],
6: [],
7: [17, 97],
8: [],
9: []}
In this case, the bucket key must be the rightmost digit of each number.
Planes
Multiplanes hash functions are other types of hash functions. Multiplanes hash functions are based on the idea of numbering every single region that is formed by the intersection of n planes. In the following code, we show the most basic forms of the multiplanes principle. First, with a single plane:
= np.array([[1, 1]]) # Define a single plane.
P = plt.subplots(figsize=(8, 8)) # Create a plot
fig, ax1
=[2, 2], ax=ax1) # Plot the plane P as a vector
plot_vectors([P], axes
# Plot random points.
for i in range(0, 10):
= np.array(np.random.uniform(-2, 2, 2)) # Get a pair of random numbers between -4 and 4
v1 = np.sign(np.dot(P, v1.T))
side_of_plane
# Color the points depending on the sign of the result of np.dot(P, point.T)
if side_of_plane == 1:
0]], [v1[1]], 'bo') # Plot blue points
ax1.plot([v1[else:
0]], [v1[1]], 'ro') # Plot red points
ax1.plot([v1[
plt.show()
The first thing to note is that the vector that defines the plane does not mark the boundary between the two sides of the plane. It marks the direction in which we find the ‘positive’ side of the plane. Not intuitive at all!
If we want to plot the separation plane, we need to plot a line that is perpendicular to our vector P
. We can get such a line using a 90^o rotation matrix.
Feel free to change the direction of the plane P
.
= np.array([[1, 2]]) # Define a single plane. We may change the direction
P
# Get a new plane perpendicular to P. We use a rotation matrix
= np.dot([[0, 1], [-1, 0]], P.T).T
PT
= plt.subplots(figsize=(8, 8)) # Create a plot with custom size
fig, ax1
=['b'], axes=[2, 2], ax=ax1) # Plot the plane P as a vector
plot_vectors([P], colors
# Plot the plane P as a 2 vectors.
# We scale by 2 just to get the arrows outside the current box
* 4, PT * -4], colors=['k', 'k'], axes=[4, 4], ax=ax1)
plot_vectors([PT
# Plot 20 random points.
for i in range(0, 20):
= np.array(np.random.uniform(-4, 4, 2)) # Get a pair of random numbers between -4 and 4
v1 = np.sign(np.dot(P, v1.T)) # Get the sign of the dot product with P
side_of_plane # Color the points depending on the sign of the result of np.dot(P, point.T)
if side_of_plane == 1:
0]], [v1[1]], 'bo') # Plot a blue point
ax1.plot([v1[else:
0]], [v1[1]], 'ro') # Plot a red point
ax1.plot([v1[
plt.show()
Now, let us see what is inside the code that color the points.
= np.array([[1, 1]]) # Single plane
P = np.array([[1, 2]]) # Sample point 1
v1 = np.array([[-1, 1]]) # Sample point 2
v2 = np.array([[-2, -1]]) # Sample point 3 v3
np.dot(P, v1.T)
array([[3]])
np.dot(P, v2.T)
array([[0]])
np.dot(P, v3.T)
array([[-3]])
The function below checks in which side of the plane P is located the vector v
def side_of_plane(P, v):
= np.dot(P, v.T) # Get the dot product P * v'
dotproduct = np.sign(dotproduct) # The sign of the elements of the dotproduct matrix
sign_of_dot_product = sign_of_dot_product.item() # The value of the first item
sign_of_dot_product_scalar return sign_of_dot_product_scalar
# In which side is [1, 2] side_of_plane(P, v1)
1
# In which side is [-1, 1] side_of_plane(P, v2)
0
# In which side is [-2, -1] side_of_plane(P, v3)
-1
Hash Function with multiple planes
In the following section, we are going to define a hash function with a list of three custom planes in 2D.
= np.array([[1, 1]]) # First plane 2D
P1 = np.array([[-1, 1]]) # Second plane 2D
P2 = np.array([[-1, -1]]) # Third plane 2D
P3 = [P1, P2, P3] # List of arrays. It is the multi plane
P_l
# Vector to search
= np.array([[2, 2]]) v
The next function creates a hash value based on a set of planes. The output value is a combination of the side of the plane where the vector is localized with respect to the collection of planes.
We can think of this list of planes as a set of basic hash functions, each of which can produce only 1 or 0 as output.
def hash_multi_plane(P_l, v):
= 0
hash_value for i, P in enumerate(P_l):
= side_of_plane(P,v)
sign = 1 if sign >=0 else 0
hash_i += 2**i * hash_i
hash_value return hash_value
# Find the number of the plane that containes this value hash_multi_plane(P_l, v)
3
Random Planes
In the cell below, we create a set of three random planes
0)
np.random.seed(= 2 # is 300 in assignment
num_dimensions = 3 # is 10 in assignment
num_planes = np.random.normal(
random_planes_matrix =(num_planes,
size
num_dimensions))print(random_planes_matrix)
[[ 1.76405235 0.40015721]
[ 0.97873798 2.2408932 ]
[ 1.86755799 -0.97727788]]
= np.array([[2, 2]]) v
The next function is similar to the side_of_plane()
function, but it evaluates more than a plane each time. The result is an array with the side of the plane of v
, for the set of planes P
# Side of the plane function. The result is a matrix
def side_of_plane_matrix(P, v):
= np.dot(P, v.T)
dotproduct = np.sign(dotproduct) # Get a boolean value telling if the value in the cell is positive or negative
sign_of_dot_product return sign_of_dot_product
Get the side of the plane of the vector [2, 2]
for the set of random planes.
= side_of_plane_matrix(
sides_l
random_planes_matrix, v) sides_l
array([[1.],
[1.],
[1.]])
Now, let us use the former function to define our multiplane hash function
def hash_multi_plane_matrix(P, v, num_planes):
= side_of_plane_matrix(P, v) # Get the side of planes for P and v
sides_matrix = 0
hash_value for i in range(num_planes):
= sides_matrix[i].item() # Get the value inside the matrix cell
sign = 1 if sign >=0 else 0
hash_i += 2**i * hash_i # sum 2^i * hash_i
hash_value
return hash_value
Print the bucket hash for the vector v = [2, 2]
.
hash_multi_plane_matrix(random_planes_matrix, v, num_planes)
7
Note
This showed we how to make one set of random planes. We will make multiple sets of random planes in order to make the approximate nearest neighbors more accurate.
Document vectors
Before we finish this lab, remember that we can represent a document as a vector by adding up the word vectors for the words inside the document. In this example, our embedding contains only three words, each represented by a 3D array.
= {"I": np.array([1,0,1]),
word_embedding "love": np.array([-1,0,1]),
"learning": np.array([1,0,1])
}= ['I', 'love', 'learning', 'not_a_word']
words_in_document = np.array([0,0,0])
document_embedding for word in words_in_document:
+= word_embedding.get(word,0)
document_embedding
print(document_embedding)
[1 0 3]
Congratulations! You’ve now completed this lab on hash functions and multiplanes!
Citation
@online{bochman2020,
author = {Bochman, Oren},
title = {Hash Functions and Multiplanes},
date = {2020-10-13},
url = {https://orenbochman.github.io/notes-nlp/notes/c1w4/lab02.html},
langid = {en}
}