Let’s say we have a collection of businesses in three distinct regions:
the East Coast, the West Coast, and flyover country, and we would like to
classify these businesses by region. Clustering algorithms, which aim to
partition a set of data points into a small number of clusters, are some
of the most commonly used solutions for data classification problems.
We will use the k-means algorithm to cluster our data. It is an unsupervised machine
learning algorithm that groups data points into k clusters, minimizing the distance from each data point to the centroid of the cluster it belongs to.
The Algorithm
1. Choose k points to represent the initial centroids for the data. These may be assigned randomly or specified by the user.
2. Assign each data point to the closest centroid (by Euclidean distance or another distance metric), creating k clusters.
3. Recalculate each centroid’s position as the center (mean) of the data points in its cluster.
4. Repeat steps 2 and 3 until convergence (no data points change cluster) or up to a maximum number of iterations.
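The four steps can be sketched in a few lines on plain [x, y] points with Euclidean distance. This is only an illustration; the names (k_means, euclidean_distance) are not part of the classes we build below.

```ruby
# Distance between two [x, y] points (step 2's metric).
def euclidean_distance(a, b)
  Math.sqrt((a[0] - b[0])**2 + (a[1] - b[1])**2)
end

def k_means(points, centroids, max_iterations: 100)
  clusters = nil
  max_iterations.times do
    # Step 2: group each point under the index of its closest centroid.
    new_clusters = points.group_by do |point|
      centroids.each_index.min_by { |i| euclidean_distance(point, centroids[i]) }
    end
    # Step 4: stop once no point changes cluster.
    break if new_clusters == clusters
    clusters = new_clusters
    # Step 3: move each centroid to the mean of its cluster's points.
    clusters.each do |i, members|
      centroids[i] = [
        members.sum { |p| p[0] } / members.size.to_f,
        members.sum { |p| p[1] } / members.size.to_f
      ]
    end
  end
  clusters.values
end

k_means([[0, 0], [1, 0], [10, 10], [11, 10]], [[0.0, 0.0], [10.0, 10.0]])
# => [[[0, 0], [1, 0]], [[10, 10], [11, 10]]]
```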
The Code
Let’s start by implementing classes to store the businesses, centroids and clusters.
The DataPoint class will represent the businesses and centroids. A
DataPoint should be able to do the following:
- Store the latitude, longitude, and state belonging to a business.
- Calculate the distance between itself and another point or centroid.
Since we are dealing with geographic coordinates, for this implementation I’ve elected to use the geographic distance between points instead of the Euclidean distance, using utilities from the Geocoder library. We want to pass instances of the DataPoint class to Geocoder for distance calculations, so we will also need to implement an instance method, to_coordinates, that returns an array of latitude and longitude.
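Geographic distance here means great-circle distance. For intuition, here is a standalone sketch of the haversine formula that such calculations are based on; the names (EARTH_RADIUS_MILES, haversine_distance) are just for this illustration, and this is not Geocoder's actual implementation.

```ruby
EARTH_RADIUS_MILES = 3958.8

# Great-circle (haversine) distance in miles between two [lat, lng]
# pairs given in degrees. Illustrative only, not Geocoder's code.
def haversine_distance(a, b)
  lat1, lng1, lat2, lng2 = (a + b).map { |deg| deg * Math::PI / 180 }
  h = Math.sin((lat2 - lat1) / 2)**2 +
      Math.cos(lat1) * Math.cos(lat2) * Math.sin((lng2 - lng1) / 2)**2
  2 * EARTH_RADIUS_MILES * Math.asin(Math.sqrt(h))
end

haversine_distance([-23.03, -77.25], [-33.03, -77.25]) # roughly 690 miles
```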
To store the data points, we will use a Set because it will make cluster comparison easier (for step 4 of the algorithm). The Ruby Set class uses the eql? and hash methods for equality comparisons, so our DataPoint class will also need to implement them.
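To see why eql? and hash matter, here is a minimal standalone illustration (the Point class exists only for this example):

```ruby
require 'set'

# Two value-equal points: without the eql?/hash overrides, Set would
# treat them as distinct objects; with them, it deduplicates by value.
class Point
  attr_reader :latitude, :longitude

  def initialize(latitude, longitude)
    @latitude  = latitude
    @longitude = longitude
  end

  def eql?(other)
    latitude == other.latitude && longitude == other.longitude
  end

  def hash
    [latitude, longitude].hash
  end
end

Set.new([Point.new(1.0, 2.0), Point.new(1.0, 2.0)]).size # => 1
```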
Now that we have the basic gist of the requirements for the DataPoint class, let’s write a few test cases to assert them.
```ruby
subject { Algorithms::KMeans::DataPoint }
let(:args)  { { latitude: -23.03, longitude: -77.25 } }
let(:args2) { { latitude: -33.03, longitude: -77.25 } }

context '#initialize' do
  it 'raises an error if initialized without latitude or longitude' do
    expect { subject.new(foo: 'bar') }.to raise_error(ArgumentError)
  end

  it 'does not raise an error if initialized with latitude and longitude' do
    expect { subject.new(args) }.not_to raise_error
  end
end

context '#geographic_distance' do
  it 'calculates the geographic distance between the same datapoint as zero' do
    expect(subject.new(args).geographic_distance(subject.new(args))).to eq(0)
  end

  it 'calculates the geographic distance between two datapoints' do
    expect(subject.new(args).geographic_distance(subject.new(args2))).to be > 500
  end
end

context 'external dependencies' do
  it 'implements geocoder methods' do
    expect(subject.new(args)).to respond_to(:to_coordinates)
  end

  it 'implements set comparison methods' do
    expect(subject.new(args)).to respond_to(:eql?, :hash)
  end
end
```
The class implementation is pretty simple.
```ruby
class DataPoint
  attr_reader :latitude, :longitude, :state

  def initialize(options)
    raise ArgumentError, "Please initialize with :latitude, :longitude" if options[:latitude].nil? || options[:longitude].nil?
    @latitude  = options[:latitude].to_f
    @longitude = options[:longitude].to_f
    @state     = options[:state]
  end

  def geographic_distance(business)
    Geocoder::Calculations.distance_between(self, business)
  end

  # Geocoder dependency
  def to_coordinates
    [latitude, longitude]
  end

  # Set equality dependencies
  def eql?(other_object)
    latitude == other_object.latitude && longitude == other_object.longitude
  end

  def hash
    [latitude, longitude].hash
  end
end
```
Up next is the Cluster class, which we will use to perform the following functions:
- Store a centroid and a set of data points.
- Recalculate the position of the centroid so that it sits at the geographic center of the cluster’s points (within a very small margin of error).
- Implement a method to compare itself with another cluster.
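As a rough sketch, the geographic center of tightly grouped points can be approximated by averaging latitudes and longitudes. (Geocoder ships a more careful Geocoder::Calculations.geographic_center that works in Cartesian space; the approximate_center name below is just for this illustration.)

```ruby
# Approximate the geographic center of [lat, lng] points as the mean of
# each coordinate. A simplification: fine for nearby points, inaccurate
# across large spans or near the antimeridian.
def approximate_center(points)
  [
    points.sum { |p| p[0] } / points.size.to_f,
    points.sum { |p| p[1] } / points.size.to_f
  ]
end

approximate_center([[-23.03, -77.25], [-33.03, -77.25]]) # roughly [-28.03, -77.25]
```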
With these specifications in mind, we write the tests:
```ruby
subject { Algorithms::KMeans::Cluster.new(data_points: [data_point1, data_point2]) }
let(:data_point1) { Algorithms::KMeans::DataPoint.new(latitude: -23.03, longitude: -77.25) }
let(:data_point2) { Algorithms::KMeans::DataPoint.new(latitude: -33.03, longitude: -77.25) }
let(:margin_of_error) { 0.00189394 } # 10 ft

describe '#recompute_centroid!' do
  it 'sets the centroid to the geographic center of the cluster (within 10ft)' do
    subject.recompute_centroid!
    new_centroid = subject.centroid
    expect(data_point1.geographic_distance(new_centroid)).to be_within(margin_of_error).of(data_point2.geographic_distance(new_centroid))
  end
end

describe '#clear!' do
  it 'removes all the data points' do
    subject.clear!
    expect(subject.data_points).to be_empty
  end
end

describe '#add_datapoint' do
  it 'adds a data point' do
    new_data_point = Algorithms::KMeans::DataPoint.new(latitude: -32.32, longitude: 78.2)
    expect { subject.add_datapoint(new_data_point) }.to change { subject.data_points.size }.by(1)
  end
end

describe '#==' do
  it 'returns true if two clusters have the same data points (order is irrelevant)' do
    cluster2 = Algorithms::KMeans::Cluster.new(data_points: [data_point2.dup, data_point1.dup])
    expect(cluster2).to be == subject
  end
end
```
Finally, we implement the algorithm in a class called Clusterer.
This class needs to run through the four steps of the algorithm and
terminate on convergence or after a maximum number of iterations. It
should also export the clustered data in a format that we can plot on a
map. Let’s assert these requirements.
```ruby
let(:filename) { "#{File.dirname(__FILE__)}/../../../fixtures/k_means/data.json" }
subject { Algorithms::KMeans::Clusterer.new(filename: filename) }
let(:east_states) { %w(NY NJ PA MA VA) }
let(:west_states) { %w(CA NV OR WA) }
let(:fly_over_states) { %w(IA MO NE OK SD) }
let(:clusters) do
  lat_lngs = [
    { latitude: 37.757717, longitude: -122.410499 }, # San Francisco, CA
    { latitude: 40.764684, longitude: -73.988990 },  # Manhattan, NY
    { latitude: 42.137687, longitude: -100.178348 }, # Goose Creek, NE
  ]
  lat_lngs.collect { |ll| Algorithms::KMeans::Cluster.new(centroid: Algorithms::KMeans::DataPoint.new(ll)) }
end

describe '#run' do
  context 'with pre-determined clusters' do
    it 'clusters the data into east, west and flyover states' do
      subject.clusters = clusters
      subject.run
      states_by_cluster = subject.clusters.map { |s| s.data_points.classify(&:state).keys.to_set }.to_set
      expect(states_by_cluster).to eq(Set.new([east_states.to_set, west_states.to_set, fly_over_states.to_set]))
    end
  end

  context 'without pre-determined clusters' do
    it 'divides the data into distinct clusters' do
      subject.run
      expect(subject.clusters.map { |cluster| cluster.data_points.size }.inject(:+)).to eq(subject.data_points.to_set.size)
    end
  end
end

describe '#to_chart_data' do
  it 'returns an array of data points grouped by cluster' do
    subject.clusters = clusters
    subject.run
    expect(subject.to_chart_data.group_by(&:last).size).to eq(3)
  end
end
```
Here’s an abridged version of the algorithm’s implementation, showcasing the meat of the algorithm, which is pretty straightforward. The full Clusterer class is
here.
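To make the loop concrete, here is a hedged reconstruction of the run method implied by the specs above. It swaps in a minimal stand-in Cluster that holds [x, y] points and uses squared Euclidean distance so the sketch runs on its own; names not found in the specs (closest_cluster, MAX_ITERATIONS) are assumptions, and the real class loads its data points from a JSON file and measures geographic distance instead.

```ruby
require 'set'

# Minimal stand-in for the Cluster class so this sketch is self-contained.
class Cluster
  attr_reader :centroid, :data_points

  def initialize(centroid:)
    @centroid = centroid
    @data_points = Set.new
  end

  def add_datapoint(point)
    data_points << point
  end

  def clear!
    data_points.clear
  end

  def recompute_centroid!
    return if data_points.empty? # leave an empty cluster's centroid in place
    @centroid = [
      data_points.sum { |p| p[0] } / data_points.size.to_f,
      data_points.sum { |p| p[1] } / data_points.size.to_f
    ]
  end
end

# Reconstruction of the Clusterer's core loop, per the specs above.
class Clusterer
  MAX_ITERATIONS = 50 # assumption; the real limit isn't shown in the specs

  attr_accessor :clusters
  attr_reader :data_points

  def initialize(data_points:, clusters:)
    @data_points = data_points
    @clusters = clusters
  end

  def run
    MAX_ITERATIONS.times do
      previous = clusters.map { |c| c.data_points.dup }
      clusters.each(&:clear!)
      # Step 2: assign each point to the cluster with the closest centroid.
      data_points.each { |point| closest_cluster(point).add_datapoint(point) }
      # Step 3: move each centroid to the center of its cluster.
      clusters.each(&:recompute_centroid!)
      # Step 4: stop once no data point changes cluster.
      break if clusters.map(&:data_points) == previous
    end
  end

  private

  def closest_cluster(point)
    clusters.min_by do |cluster|
      (point[0] - cluster.centroid[0])**2 + (point[1] - cluster.centroid[1])**2
    end
  end
end
```

Seeded with one centroid near each expected group, run converges in a couple of passes, mirroring the "with pre-determined clusters" spec.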
Since the k-means algorithm is not guaranteed to find the global optimum,
we can’t expect the data to be clustered the same way after
each run. However, given some knowledge of
the data we are clustering, we can provide initial centroids that will
lead to the global optimum. For the chart below I picked 500
businesses in groupings of states to make the clustering easier, and I
chose the initial centroids to ensure that the algorithm found the global optimum. I plotted the chart using a Google Visualization.
You can find the complete code for this blog post in my algorithms repository on
github.