Our schema consists of a students table and three other tables to store scores for each subject.
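A minimal version of such a schema might look like this (the table and column names here are assumptions):

```sql
CREATE TABLE students (
  id   INT PRIMARY KEY,
  name VARCHAR(50) NOT NULL
);

-- One table per subject; english_scores and biology_scores
-- follow the same shape as math_scores.
CREATE TABLE math_scores (
  student_id INT REFERENCES students (id),
  score      INT NOT NULL
);
```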


Let’s start by selecting the names and scores of the students:
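A sketch of such a query, assuming per-subject score tables shaped like math_scores(student_id, score):

```sql
SELECT s.name,
       m.score AS math_score,
       e.score AS english_score,
       b.score AS biology_score
FROM students s
JOIN math_scores    m ON m.student_id = s.id
JOIN english_scores e ON e.student_id = s.id
JOIN biology_scores b ON b.student_id = s.id;
```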

Giving us the following results:

First, we order the results of our initial query by descending math score and select the results into a derived table, data_ordered_by_math_score. Next, we initialize our user-defined variable, @math_rank, by selecting it into a derived table, math_rank_derived_table, and setting its value to zero. Finally, we select every record in data_ordered_by_math_score and increment the value of @math_rank for each row returned by the statement.
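A sketch of that step (the SELECT inside the first derived table stands in for the initial query, using the assumed schema from earlier):

```sql
SELECT data_ordered_by_math_score.*,
       @math_rank := @math_rank + 1 AS math_rank
FROM (
  SELECT s.name, m.score AS math_score
  FROM students s
  JOIN math_scores m ON m.student_id = s.id
  ORDER BY math_score DESC
) AS data_ordered_by_math_score,
(SELECT @math_rank := 0) AS math_rank_derived_table;
```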

Here are the results with math rank:

The next step is to order the results of the above query by english_score and compute @english_rank just like we did with @math_rank.

Results with math and english rank:

Following the same pattern, we add the biology rank to the query:

Results with math, english and biology ranks:

And that’s it! We can follow the same pattern to grab the overall rank: just order by the total score and add a new user-defined variable.
You may have noticed that the query above doesn’t account for ties. For example, Steve and Beck both have a math score of 90, yet Beck has the higher math rank. We can solve this by modifying the part of the query that updates the user-defined variable from this:


to this:

Basically, we create two more user-defined variables: one to track the current math score and another to track whether we need to increment the math rank (i.e. whether the current math score is lower than the previous one, since we order descending). As you can see, it gets ugly fast. The final query, accounting for ties in the math score only, looks like this:
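One way to sketch the tie-aware version — a dense ranking, where tied scores share a rank (variable and alias names are assumptions):

```sql
SELECT data_ordered_by_math_score.*,
       @math_rank := IF(@current_math_score = math_score,
                        @math_rank,
                        @math_rank + 1) AS math_rank,
       @current_math_score := math_score
FROM (
  SELECT s.name, m.score AS math_score
  FROM students s
  JOIN math_scores m ON m.student_id = s.id
  ORDER BY math_score DESC
) AS data_ordered_by_math_score,
(SELECT @math_rank := 0, @current_math_score := NULL) AS init;
```

On the first row the IF comparison against NULL is false, so the rank starts at 1; on later rows the rank only increments when the score changes.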

The only difference in the results is the math ranks: Steve and Beck, who are tied at 90, now share the same math rank instead of Beck ranking higher.
I’ve put an SQL Fiddle of this code here. If there are better alternatives to this approach, please leave me a note, I would love to hear them!
The k-means algorithm is what we will use to cluster our data. It is an unsupervised machine learning algorithm that groups data points into k clusters, minimizing the distance from each data point to the centroid of the cluster it belongs to.
Let’s start by implementing classes to store the businesses, centroids and clusters.
The DataPoint class will represent both the businesses and the centroids. A DataPoint needs to support the operations described below.
Since we are dealing with geographical coordinates, for this implementation I’ve elected to use the geographic distance between points instead of the Euclidean distance, using utilities from the Geocoder library. I would like to pass instances of the DataPoint class to the Geocoder library for distance calculations, so we will also need to implement an instance method, to_coordinates, that returns an array of latitude and longitude.
To store the data points, we will use a Set because it will make cluster comparison easier (for step 4 of the algorithm). The Ruby Set class uses the eql? and hash methods for equality comparisons, so our DataPoint class will also need to implement them.
Now that we have the basic gist of the requirements for the DataPoint class, let’s write a few test cases to assert them.

The class implementation is pretty simple.
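A sketch of what that implementation might look like (attribute names are assumptions, and the Geocoder-based distance method is omitted to keep the snippet self-contained):

```ruby
require 'set'

class DataPoint
  attr_reader :latitude, :longitude

  def initialize(latitude, longitude)
    @latitude  = latitude
    @longitude = longitude
  end

  # Geocoder expects a [lat, lng] array for distance calculations.
  def to_coordinates
    [latitude, longitude]
  end

  # Set uses eql? and hash to detect duplicates, so two points with the
  # same coordinates must be treated as equal.
  def eql?(other)
    other.is_a?(DataPoint) && to_coordinates == other.to_coordinates
  end
  alias_method :==, :eql?

  def hash
    to_coordinates.hash
  end
end
```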

Up next is the Cluster class, which we will use to group data points around a centroid.
With these specifications in mind, we write the tests:

Followed by the class implementation:
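A sketch along those lines, with a simplified Struct standing in for DataPoint (method names are assumptions):

```ruby
require 'set'

# Simplified stand-in for the DataPoint class; Struct provides
# value-based eql?/hash, which the Set relies on.
DataPoint = Struct.new(:latitude, :longitude) do
  def to_coordinates
    [latitude, longitude]
  end
end

# The Cluster owns a centroid and a Set of points, and can recompute
# the centroid as the mean of its points' coordinates.
class Cluster
  attr_accessor :centroid
  attr_reader :points

  def initialize(centroid)
    @centroid = centroid
    @points   = Set.new
  end

  def add(point)
    points << point
  end

  # Step 3 of the algorithm: move the centroid to the mean position.
  def recompute_centroid
    return centroid if points.empty?
    lat = points.sum { |p| p.latitude }  / points.size.to_f
    lng = points.sum { |p| p.longitude } / points.size.to_f
    self.centroid = DataPoint.new(lat, lng)
  end
end
```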

Finally, we implement the algorithm in a class called Clusterer. This class needs to run through the four steps of the algorithm and terminate on convergence or after a maximum number of iterations. It should also export the clustered data in a format that we can plot on a map. Let’s assert these requirements.

Here’s an abridged version of the implementation, showcasing the meat of the algorithm, which is pretty straightforward. The full Clusterer class is here.
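As a rough sketch of the core loop in plain Ruby — simplified to use squared Euclidean distance on coordinate pairs instead of Geocoder’s geographic distance, and plain arrays instead of the Cluster class:

```ruby
def kmeans(points, centroids, max_iterations: 100)
  clusters = nil
  max_iterations.times do
    # Step 2: assign each point to its nearest centroid.
    new_clusters = centroids.map { [] }
    points.each do |point|
      nearest = (0...centroids.size).min_by do |i|
        (point[0] - centroids[i][0])**2 + (point[1] - centroids[i][1])**2
      end
      new_clusters[nearest] << point
    end
    # Step 4: stop once the assignments no longer change (convergence).
    break if new_clusters == clusters
    clusters = new_clusters
    # Step 3: move each centroid to the mean of its assigned points.
    centroids = clusters.each_with_index.map do |cluster, i|
      next centroids[i] if cluster.empty?
      [cluster.sum { |p| p[0] } / cluster.size.to_f,
       cluster.sum { |p| p[1] } / cluster.size.to_f]
    end
  end
  clusters
end
```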

Since the k-means algorithm is not guaranteed to find the global optimum, we can’t expect the data to be clustered the same way after each run. However, given that we have some knowledge of the data we are clustering, we can provide initial centroids that will lead to the global optimum. For the chart below I picked 500 businesses grouped by state to make the clustering easier, and I chose the initial centroids to ensure that the algorithm found the global optimum. I plotted the chart using a Google Visualization.
You can find the complete code for this blog post in my algorithms repository on GitHub.
The Hash#new method accepts an optional argument that is returned when you try to access a key that does not exist in the hash (the default is nil). For example:
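A minimal illustration of the default-value form (the values here are my own example):

```ruby
h = Hash.new("n/a")   # "n/a" is returned for any missing key
h[:present] = 1
h[:present]           # => 1
h[:missing]           # => "n/a"
h.key?(:missing)      # => false — the default is returned, not stored
```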

While the first example may not be very useful, Hash#new also accepts a block that will be called with the hash object and the key whenever you try to access a key that does not exist in the hash. For example:
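One illustrative possibility (the hash here is my own example):

```ruby
# The block receives the hash and the missing key; storing the computed
# value back into the hash means later lookups hit the cache.
squares = Hash.new { |hash, key| hash[key] = key * key }
squares[4]    # => 16
squares.keys  # => [4] — the accessed key was added by the block
```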

You can also modify the hash within the callback. Here’s a more useful example where we create a memoized version of the Fibonacci sequence:
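One compact way to write it:

```ruby
# Each missing key n triggers the block, which recursively computes
# fib(n) and caches it back into the hash before returning.
fib = Hash.new { |h, n| h[n] = n < 2 ? n : h[n - 1] + h[n - 2] }
fib[10]  # => 55
```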

Using this hash, whenever we compute the nth Fibonacci number, all the previous n − 1 Fibonacci numbers will be cached in the hash, significantly reducing the number of recursive calls needed to compute the next Fibonacci number.

Here are a few test cases to check that our implementation does what it’s supposed to do:

And then the implementation:
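A minimal sketch of such a Quicksort — a Lomuto-style partition with a random pivot, matching the walkthrough below (the module and method names are assumptions):

```ruby
module Quicksort
  module_function

  def sort!(array, left = 0, right = array.length - 1)
    if left < right
      pivot_index = partition(array, left, right)
      sort!(array, left, pivot_index - 1)
      sort!(array, pivot_index + 1, right)
    end
    array
  end

  def partition(array, left, right)
    pivot_index = rand(left..right)  # pick a random pivot
    # Move the pivot out of the way by swapping it with the rightmost element.
    array[pivot_index], array[right] = array[right], array[pivot_index]
    pivot = array[right]
    new_pivot_index = left
    (left...right).each do |i|
      next unless array[i] <= pivot
      array[i], array[new_pivot_index] = array[new_pivot_index], array[i]
      new_pivot_index += 1
    end
    # Swap the pivot into its final position and return its index.
    array[new_pivot_index], array[right] = array[right], array[new_pivot_index]
    new_pivot_index
  end
end
```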

Given the following array:


The first step is to pick a pivot. The choice of pivot can significantly improve the performance of Quicksort. For this implementation we will use a random element of the array as the pivot. One alternative would be to use the ‘median-of-three’ as suggested by Robert Sedgewick (pick the first, middle, and last elements of the array and use their median as the pivot).


Say the result of the above operation is pivot_index = 0. The next step is to call the partition function, which will move all elements less than 2 to the left of the pivot and all elements greater than 2 to its right, and return the new position of the pivot.
We need to track the position of the new pivot index, so we start off by setting it to the leftmost element. Next, we move the pivot out of the way by swapping it with the rightmost element, resulting in the following changes:




Now we loop through all the elements excluding the pivot — left to right — comparing them with the pivot. If an element is less than or equal to the pivot, we swap that element with the element at the position of the new pivot index and increment the new pivot index by 1.
The first two elements, 5 and 3, are greater than 2, so no operations are performed. On the third comparison, we swap 1 and 5 and increment the new pivot index by 1, resulting in the following changes:


The loop is complete so the next step is to swap the pivot with the element at the new pivot index resulting in the following changes:


We return the new pivot index to the calling function as now we have successfully partitioned the array around the pivot, 2.
Next, we call sort! recursively on the elements to the left of 2. However, since that is an array with a single element, 1, that recursive fork ends with no additional operations.
For the elements to the right of 2, [5, 3], we follow the same steps as above:




The array is successfully partitioned and we return the new pivot index to the calling function.
Continuing with the recursion, we call sort! recursively on the elements to the left and right of the pivot, 3. But since there are no elements to the left of the pivot and only a single element to its right, the right fork of the main recursion is done.
Finally we return the array, which is completely sorted at this point!
You can find the complete code on GitHub.
It is a selection algorithm with a worst-case O(n) complexity for selecting the kth order statistic (the kth smallest number) in an unsorted array of length n.
Say you have an unsorted array and you would like to pick the fourth smallest element. One approach would be to just sort the array and pick the fourth element. However, the best-case performance for this approach is O(n log n).
Given an array of length n = 10: [10, 6, 8, 3, 7, 1, 2, 4, 9, 5], we would like to find the kth smallest element, where k = 3.
Since I tagged this post with ‘TDD’, we’ll need to write a few test cases first. My favorite testing framework for Ruby is currently RSpec with Guard.

Then we fix the breaking specs by writing the lib:
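A minimal sketch of the selection routine. Note this version uses a random pivot, which gives expected linear time; the worst-case O(n) guarantee requires a median-of-medians pivot choice instead. Names are assumptions, and k is 1-indexed (k = 1 returns the minimum):

```ruby
module Quickselect
  module_function

  # Returns the kth smallest element of the array (mutates the array).
  def select(array, k, left = 0, right = array.length - 1)
    loop do
      pivot_index = partition(array, left, right)
      rank = pivot_index - left + 1   # pivot's rank within [left, right]
      return array[pivot_index] if rank == k
      if k < rank
        right = pivot_index - 1       # answer lies left of the pivot
      else
        k -= rank                     # discard the pivot and everything left of it
        left = pivot_index + 1
      end
    end
  end

  # Same Lomuto-style partition used by Quicksort: random pivot,
  # smaller-or-equal elements moved left, pivot's final index returned.
  def partition(array, left, right)
    pivot_index = rand(left..right)
    array[pivot_index], array[right] = array[right], array[pivot_index]
    pivot = array[right]
    store = left
    (left...right).each do |i|
      next unless array[i] <= pivot
      array[i], array[store] = array[store], array[i]
      store += 1
    end
    array[store], array[right] = array[right], array[store]
    store
  end
end
```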

You can find the complete code on GitHub.