9、非监督学习--聚类算法

0、

非监督学习：Unsupervised Learning

聚类：Clustering，聚类属于非监督学习

0.1 监督学习与非监督学习？

监督学习提供分类标记（class label），知道每一个实例的标签
非监督学习不知道实例的分类标记

1、示例：病人肿瘤数据

9、非监督学习--聚类算法

x：肿瘤直径

y：肿瘤周长

当前不知道肿瘤是否是良性，而是根据肿瘤之间的关系来分类，可以得到红蓝绿三个团簇

2、K-means 算法：

2.1 Clustering 中的经典算法，数据挖掘十大经典算法之一

2.2 算法接受参数 k ；然后将事先输入的n个数据对象划分为 k个聚类以便使得所获得的聚类满足：同一聚类中的对象相似度较高；而不同聚类中的对象相似度较小。

2.3 算法思想：

以空间中k个点为中心进行聚类，对最靠近他们的对象归类。通过迭代的方法，逐次更新各聚类中心的值，直至得到最好的聚类结果

2.4 算法描述：

适当选择c个类的初始中心；
在第k次迭代中，对任意一个样本，求其到c各中心的距离，将该样本归到距离最短的中心所在的类；

利用均值等方法更新该类的中心值；
对于所有的c个聚类中心，如果利用（2）（3）的迭代法更新后，值保持不变，则迭代结束，否则继续迭代。

2.5 算法流程：

9、非监督学习--聚类算法

输入：k, data[n];

选择k个初始中心点，例如c[0]=data[0],…c[k-1]=data[k-1];
对于data[0]….data[n], 分别与c[0]…c[k-1]比较，假定与c[i]差值最少，就标记为i;
对于所有标记为i点，重新计算c[i]={ 所有标记为i的data[j]之和}/标记为i的个数；
重复(2)(3),直到所有c[i]值的变化小于给定阈值。

3、示例说明：四种药分为两类

9、非监督学习--聚类算法

假设c_1和c_2为分类的中心点，那么D^0的每一行表示每个点到中心点c的距离；

G^0表示把每个点归为哪一类（离那个距离近就归为哪一类）

下一步如何更新中心点？

第一类中只有一个点，那么第一类的中心点就是该点

第二类中有三个点，那么第二类的中心点为三个点的平均值

然后重复以上步骤。

再次计算中心点，如果中心点不变，那么算法终止；否则重复以上步骤。

9、非监督学习--聚类算法

优点：速度快，简单

缺点：最终结果跟初始点选择相关，容易陷入局部最优，需直到k值Reference:http://croce.ggf.br/dados/K%20mean%20Clustering1.pdf

4、Python示例

import numpy as np

# Function: K Means
# -------------
# K-Means is an algorithm that takes in a dataset and a constant
# k and returns k centroids (which define clusters of data in the
# dataset which are similar to one another).
# X 数据集；k 分为几类；maxIt 最大迭代次数
def kmeans(X, k, maxIt):
	# X的点数（行数），维度（列数）
	numPoints, numDim = X.shape
# 创建一个比X多一维的全零集合
dataSet = np.zeros((numPoints, numDim + 1))
# 除去最后一列，之前传入X的值
dataSet[:, :-1] = X
	
	# Initialize centroids randomly
# 随机选取中心点
	centroids = dataSet[np.random.randint(numPoints, size = k), :]
	# centroids = dataSet[0:2, :]
	#Randomly assign labels to initial centorid
	centroids[:, -1] = range(1, k +1)
	
	# Initialize book keeping vars.
	iterations = 0
	oldCentroids = None
	
	# Run the main k-means algorithm
	while not shouldStop(oldCentroids, centroids, iterations, maxIt):
		print "iteration: \n", iterations
		print "dataSet: \n", dataSet
		print "centroids: \n", centroids
		# Save old centroids for convergence test. Book keeping.
		oldCentroids = np.copy(centroids)
		iterations += 1
		
		# Assign labels to each datapoint based on centroids
		updateLabels(dataSet, centroids)
		
		# Assign centroids based on datapoint labels
		centroids = getCentroids(dataSet, k)
		
	# We can get the labels too by calling getLabels(dataSet, centroids)
	return dataSet
# Function: Should Stop
# -------------
# Returns True or False if k-means is done. K-means terminates either
# because it has run a maximum number of iterations OR the centroids
# stop changing.
def shouldStop(oldCentroids, centroids, iterations, maxIt):
	if iterations > maxIt:
		return True
	return np.array_equal(oldCentroids, centroids)  

# Function: Get Labels
# -------------
# Update a label for each piece of data in the dataset. 
# 根据数据集每一行到中心点的距离，对点分类
def updateLabels(dataSet, centroids):
	# For each element in the dataset, chose the closest centroid. 
	# Make that centroid the element's label.
	numPoints, numDim = dataSet.shape
	for i in range(0, numPoints):
		dataSet[i, -1] = getLabelFromClosestCentroid(dataSet[i, :-1], centroids)

# 计算行的每一列到中心点的距离，哪一个最小，就属于哪一类
# centroids中心点集合的结构和dataSet类似
def getLabelFromClosestCentroid(dataSetRow, centroids):
# 初始化，赋予centroids最后一列的分类的一个值
	label = centroids[0, -1];
	minDist = np.linalg.norm(dataSetRow - centroids[0, :-1])
	for i in range(1 , centroids.shape[0]):
		dist = np.linalg.norm(dataSetRow - centroids[i, :-1])
		if dist < minDist:
			minDist = dist
			label = centroids[i, -1]
	print "minDist:", minDist
	return label

# Function: Get Centroids
# -------------
# Returns k random centroids, each of dimension n.
def getCentroids(dataSet, k):
	# Each centroid is the geometric mean of the points that
	# have that centroid's label. Important: If a centroid is empty (no points have
	# that centroid's label) you should randomly re-initialize it.
	result = np.zeros((k, dataSet.shape[1]))
	for i in range(1, k + 1):
		oneCluster = dataSet[dataSet[:, -1] == i, :-1]
		result[i - 1, :-1] = np.mean(oneCluster, axis = 0)
		result[i - 1, -1] = i
	
	return result
    
    
x1 = np.array([1, 1])
x2 = np.array([2, 1])
x3 = np.array([4, 3])
x4 = np.array([5, 4])
testX = np.vstack((x1, x2, x3, x4))

result = kmeans(testX, 2, 10)
print "final result:"
print result

5、Hierarchical clustering层次聚类

5.1 假设有N个待聚类的样本，对于层次聚类来说，步骤：（有点类似于使用归并排序的思想）

（初始化）把每个样本归为一类，计算每两个类之间的距离，也就是样本与样本之间的相似度；
寻找各个类之间最近的两个类，把他们归为一类（这样类的总数就少了一个）；
重新计算新生成的这个类与各个旧类之间的相似度；
重复2和3直到所有样本点都归为一类，结束 [具体的k类可以通过设定阈值来提前结束]

9、非监督学习--聚类算法

5.2 分析：整个聚类过程其实是建立了一棵树，在建立的过程中，可以通过在第二步上设置一个阈值，当最近的两个类的距离大于这个阈值，则认为迭代可以终止。另外关键的一步就是第三步，如何判断两个类之间的相似度有不少种方法。这里介绍一下三种：

SingleLinkage：又叫做 nearest-neighbor ，就是取两个类中距离最近的两个样本的距离作为这两个集合的距离，也就是说，最近两个样本之间的距离越小，这两个类之间的相似度就越大。容易造成一种叫做 Chaining 的效果，两个 cluster 明明从“大局”上离得比较远，但是由于其中个别的点距离比较近就被合并了，并且这样合并之后 Chaining 效应会进一步扩大，最后会得到比较松散的 cluster 。
CompleteLinkage：这个则完全是 Single Linkage 的反面极端，取两个集合中距离最远的两个点的距离作为两个集合的距离。其效果也是刚好相反的，限制非常大，两个 cluster 即使已经很接近了，但是只要有不配合的点存在，就顽固到底，老死不相合并，也是不太好的办法。这两种相似度的定义方法的共同问题就是指考虑了某个有特点的数据，而没有考虑类内数据的整体特点。
Average-linkage：这种方法就是把两个集合中的点两两的距离全部放在一起求一个平均值，相对也能得到合适一点的结果。

PS：average-linkage的一个变种就是取两两距离的中值，与取均值相比更加能够解除个别偏离样本对结果的干扰。

6、Python示例

from numpy import *

"""
Code for hierarchical clustering, modified from Programming Collective Intelligence by Toby Segaran (O'Reilly Media 2007, page 33). 
"""

class cluster_node:
	def __init__(self,vec,left=None,right=None,distance=0.0,id=None,count=1):
		self.left=left
		self.right=right
		self.vec=vec
		self.id=id
		self.distance=distance
		self.count=count #only used for weighted average 

def L2dist(v1,v2):
	return sqrt(sum((v1-v2)**2))

def L1dist(v1,v2):
	return sum(abs(v1-v2))

# def Chi2dist(v1,v2):
# 	return sqrt(sum((v1-v2)**2))

def hcluster(features,distance=L2dist):
	#cluster the rows of the "features" matrix
	distances={}
	currentclustid=-1

	# clusters are initially just the individual rows
	clust=[cluster_node(array(features[i]),id=i) for i in range(len(features))]

	while len(clust)>1:
		lowestpair=(0,1)
		closest=distance(clust[0].vec,clust[1].vec)
		
		# loop through every pair looking for the smallest distance
		for i in range(len(clust)):
			for j in range(i+1,len(clust)):
				# distances is the cache of distance calculations
				if (clust[i].id,clust[j].id) not in distances: 
					distances[(clust[i].id,clust[j].id)]=distance(clust[i].vec,clust[j].vec)
				
				d=distances[(clust[i].id,clust[j].id)]
		
				if d<closest:
					closest=d
					lowestpair=(i,j)
		
		# calculate the average of the two clusters
		mergevec=[(clust[lowestpair[0]].vec[i]+clust[lowestpair[1]].vec[i])/2.0 \ for i in range(len(clust[0].vec))]
		
		# create the new cluster
		newcluster=cluster_node(array(mergevec),left=clust[lowestpair[0]], right=clust[lowestpair[1]], distance=closest,id=currentclustid)
		
		# cluster ids that weren't in the original set are negative
		currentclustid-=1
		del clust[lowestpair[1]]
		del clust[lowestpair[0]]
		clust.append(newcluster)
	
	return clust[0]

def extract_clusters(clust,dist):
	# extract list of sub-tree clusters from hcluster tree with distance<dist
	clusters = {}
	if clust.distance<dist:
		# we have found a cluster subtree
		return [clust] 
	else:
		# check the right and left branches
		cl = []
		cr = []
		if clust.left!=None: 
			cl = extract_clusters(clust.left,dist=dist)
		if clust.right!=None: 
			cr = extract_clusters(clust.right,dist=dist)
		return cl+cr 

def get_cluster_elements(clust):
	# return ids for elements in a cluster sub-tree
	if clust.id>=0:
		# positive id means that this is a leaf
		return [clust.id]
	else:
		# check the right and left branches
		cl = []
		cr = []
		if clust.left!=None: 
			cl = get_cluster_elements(clust.left)
		if clust.right!=None: 
			cr = get_cluster_elements(clust.right)
		return cl+cr

def printclust(clust,labels=None,n=0):
	# indent to make a hierarchy layout
	for i in range(n): print ' ',
		# negative id means that this is branch
		print '-'
	else:
		# positive id means that this is an endpoint
		if labels==None: print clust.id
		else: print labels[clust.id]
	
	# now print the right and left branches
	if clust.left!=None: printclust(clust.left,labels=labels,n=n+1)
	if clust.right!=None: printclust(clust.right,labels=labels,n=n+1)

def getheight(clust):
	# Is this an endpoint? Then the height is just 1
	if clust.left==None and clust.right==None: return 1
	
	# Otherwise the height is the same of the heights of
	# each branch
	return getheight(clust.left)+getheight(clust.right)

def getdepth(clust):
	# The distance of an endpoint is 0.0
	if clust.left==None and clust.right==None: return 0
	
	# The distance of a branch is the greater of its two sides
	# plus its own distance
	return max(getdepth(clust.left),getdepth(clust.right))+clust.distance

7、Python示例：夕阳图测试代码

import os
from PIL import Image, ImageDraw
from Clustering.HierarchicalClustering import hcluster
from Clustering.HierarchicalClustering import getheight
from Clustering.HierarchicalClustering import getdepth
import numpy as np

def drawdendrogram(clust, imlist, jpeg='clusters.jpg'):
    # height width
    h = getheight(clust) * 20
    w = 1200
    depth = getdepth(clust)

    # width is fixed, so scale distances accordingly
    scaling = float(w - 150) / depth

    # Create a new image with a white background
    img = Image.new('RGB', (w, h), (255, 255, 255))
    draw = ImageDraw.Draw(img)
    draw.line((0, h/2, 10, h/2), fill = (255, 0, 0))

    #Draw the first node
    drawnode(draw, clust, 10, int(h/2), scaling, imlist, img)
    img.save(jpeg)

def drawnode(draw, clust, x, y, scaling, imlist, img):
    if clust.id < 0:
        h1 = getheight(clust.left) * 20
        h2 = getheight(clust.right) * 20
        top = y - (h1 + h2) / 2
        bottom = y + (h1 + h2) / 2
        # line length
        ll = clust.distance * scaling
        # Vertical line from this cluster to children
        draw.line((x, top + h1 / 2, x, bottom - h2 / 2), fill = (255, 0, 0))

        # Call the function to draw the left and right nodes
        drawnode(draw, clust.left, x + ll, top + h1 / 2, scaling, imlist, img)
        drawnode(draw, clust.right, x + ll, bottom - h2 / 2, scaling, imlist, img)
    else:
        # If this is an endpoint, draw a thumbnail image
        nodeim = Image.open(imlist[clust.id])
        nodeim.thumbnail((20, 20))
        ns = nodeim.size
        print x, y - ns[1]//2
        print x + ns[0]
        print img.paste(nodeim, (int(x), int(y - ns[1]//2), int(x + ns[0]), int(y + ns[1] - ns[1]//2))

#Create a list of images
imlist = []
folderPath = r'D:\dataset\flickr-sunsets-small'
for filename in os.listdir(folderPath):
    if os.path.splitext(filename)[1] == '.jpg':
        imlist.append(os.path.join(folderPath, filename))
n = len(imlist)
print n

# extract feature vector for each image
features = np.seros((n, 3))
for i in range(n):
    im = np.array(Image.open(imlist[i]))
    R = np.mean(im[:, :, 0].flatten())
    G = np.mean(im[:, :, 1].flatten())
    B = np.mean(im[:, :, 2].flatten())
    features[i] = np.array([R, G, B])

tree = hcluster(features)
frawdendrogram(tree, imlist, jpeg='sunset.jpg')

PS：材料学习自麦子学院，如有雷同，纯属学习