Web Crawler Text Cloud: Answers

【Question title】: Web Crawler Text Cloud
【Posted】: 2012-05-31 05:48:08
【Question】:

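I need help with a text cloud program I'm working on. I realize this is homework, but I've gotten pretty far on my own and have now been stuck for hours. I'm stuck on the web crawler part. The program is supposed to open a page, collect all the words from that page, and sort them by frequency. Then it should open any links on that page, collect the words on those pages, and so on. The depth is controlled by the global variable DEPTH. Finally, it should put all the words from all the pages together to form a text cloud.

I'm trying to use recursion, calling a function that keeps opening links until the depth is reached. The import statement at the top just brings in a function called getHTML(URL), which returns a tuple of the list of words on the page and any links on the page.

Here is my code so far. Every function works as it should except getRecursiveURLs(url, DEPTH) and makeWords(i). I'm also not 100% sure about the counter(List) function at the bottom.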

from hmc_urllib import getHTML

MAXWORDS = 50
DEPTH = 2

all_links = []

def getURL():
    """Asks the user for a URL"""

    URL = input('Please enter a URL: ')

    #all_links.append(URL)

    return makeListOfWords(URL), getRecursiveURLs(URL, DEPTH)


def getRecursiveURLs(url, DEPTH):
    """Opens up all links and adds them to global all_links list,
    if they're not in all_links already"""

    s = getHTML(url)
    links = s[1]
    if DEPTH > 0:
        for i in links:
            getRecursiveURLs(i, DEPTH - 1)
            if i not in all_links:
                all_links.append(i)
                #print('This is all_links in the IF', all_links)
                makeWords(i)#getRecursiveURLs(i, DEPTH - 1)
            #elif i in all_links:

             #   print('This is all_links in the ELIF', all_links)
              #  makeWords(i) #getRecursiveURLs(i, DEPTH - 1)
    #print('All_links at the end', all_links)
    return all_links





def makeWords(i):
    """Take all_links and create a dictionary for each page.
    Then, create a final dictionary of all the words on all pages."""

    for i in all_links:
        FinalDict = makeListOfWords(i)
        #print(all_links)
        #makeListOfWords(i))
    return FinalDict


def makeListOfWords(URL):
    """Gets the text from a webpage and puts the words into a list"""

    text = getHTML(str(URL))
    L = text[0].split()
    return cleaner(L)


def cleaner(L):

    """Cleans the text of punctuation and removes words if they are in the stop list."""

    stopList = ['', 'a', 'i', 'the', 'and', 'an', 'in', 'with', 'for',
                'it', 'am', 'at', 'on', 'of', 'to', 'is', 'so', 'too',
                'my', 'but', 'are', 'very', 'here', 'even', 'from',
                'them', 'then', 'than', 'this', 'that', 'though']

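    x = [dePunc(c) for c in L]

    # Build a new list instead of calling remove() while iterating,
    # which silently skips the element after each removal
    x = [c for c in x if c not in stopList]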

    a = [stemmer(c) for c in x]

    return counter(a)


def dePunc( rawword ):
    """ de-punctuationifies the input string """

    L = [ c for c in rawword if 'A' <= c <= 'Z' or 'a' <= c <= 'z' ]
    word = ''.join(L)
    return word


def stemmer(word):

    """Stems the words"""

    # List of endings
    endings = ['ed', 'es', 's', 'ly', 'ing', 'er', 'ers']

    # This first case handles 3 letter suffixes WITH a doubled consonant. I.E. spammers -> spam
    if word[len(word)-3:len(word)] in endings and word[-4] == word[-5]:
        return word[0:len(word)-4]

    # This case handles 3 letter suffixes WITHOUT a doubled consonant. I.E. players -> play
    elif word[len(word)-3:len(word)] in endings and word[-4] != word[-5]:
        return word[0:len(word)-3]

    # This case handles 2 letter suffixes WITH a doubled consonant. I.E. spammed -> spam
    elif word[len(word)-2:len(word)] in endings and word[-3] == word[-4]:
        return word[0:len(word)-3]

    # This case handles 2 letter suffixes WITHOUT a doubled consonant. I.E. played -> play
    elif word[len(word)-2:len(word)] in endings and word[-3] != word[-4]:
        return word[0:len(word)-2]

    # If word not inflected, return as-is.
    else:
        return word

def counter(List):
    """Creates dictionary of words and their frequencies, 'sorts' them,
    and prints them from most to least frequent"""

    freq = {}
    result = {}
    # Assign frequency to each word
    for item in List:
        freq[item] = freq.get(item,0) + 1

    # 'Sort' the dictionary by frequency
    for i in sorted(freq, key=freq.get, reverse=True):
        if len(result) < MAXWORDS:
            print(i, '(', freq[i], ')', sep='')
            result[i] = freq[i]
    return result

【Question comments】:

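There are plenty of tutorials on crawling websites, for example: ms4py.org/2010/4/10/python-search-engine-crawler-part-1.
Are there restrictions on what you can use? I'd suggest crawling with a queue and threads rather than recursion.
Also, what exactly is wrong with getRecursiveURLS()?
def makeWords(i): ... for i in all_links: You pass i in as a parameter but also assign it in the for loop. Not necessarily a problem, but I'd consider changing it to def makeWords():
@Joel, right now the code returns the words from a page multiple times instead of just once. I think the problem is in makeWords() or getRecursiveURLs(), since everything else seems to work. I also don't know how to assign all of those words to the final dictionary.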
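
The queue-based crawl suggested in the comments above could look roughly like the sketch below. It is a single-threaded illustration only, not the asker's or any commenter's code. The names crawl_words, get_html, FAKE_PAGES, and fake_get_html are hypothetical; get_html is assumed to return a (text, links) tuple the way the question's getHTML does, and a small in-memory dictionary stands in for real pages because hmc_urllib is not available here.

```python
from collections import deque


def crawl_words(start_url, max_depth, get_html):
    """Breadth-first crawl: return the words of every page within max_depth links.

    get_html is assumed to behave like the question's getHTML(url):
    it returns a (text, links) tuple.
    """
    visited = {start_url}            # pages already queued, so none is fetched twice
    queue = deque([(start_url, 0)])  # (url, depth) pairs waiting to be fetched
    words = []

    while queue:
        url, depth = queue.popleft()
        text, links = get_html(url)
        words.extend(text.split())
        if depth < max_depth:
            for link in links:
                if link not in visited:
                    visited.add(link)
                    queue.append((link, depth + 1))
    return words


# Hypothetical in-memory "web" standing in for real pages.
FAKE_PAGES = {
    'a': ('spam spam egg', ['b', 'c']),
    'b': ('egg page', ['a', 'c']),
    'c': ('love', ['d']),
    'd': ('never reached at depth 1', []),
}


def fake_get_html(url):
    """Stand-in for getHTML: look the page up in FAKE_PAGES."""
    return FAKE_PAGES[url]


print(crawl_words('a', 1, fake_get_html))
```

Marking a link as visited when it is queued, rather than when it is fetched, is what keeps a page that is linked from several places from being counted more than once.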

Tags: python

【Solution 1】:

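It isn't entirely clear what the assignment requires, but as far as I can tell you want to visit every page up to DEPTH once and only once. You also want to gather all the words from all the pages and work with the aggregated result. The snippet below is what you're looking for, but it is untested (I don't have hmc_urllib). all_links, makeWords, and makeListOfWords have been removed, but the rest of the code stays the same.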

visited_links = []

def getURL():
    url = input('Please enter a URL: ')
    word_list = getRecursiveURLs(url, DEPTH)
    return cleaner(word_list) # this prints the word count for all pages

def getRecursiveURLs(url, DEPTH):
    text, links  = getHTML(url)
    visited_links.append(url)
    returned_word_list = text.split()
    #cleaner(text.split()) # this prints the word count for the current page

    if DEPTH > 0:
        for link in links:
            if link not in visited_links:
                returned_word_list += getRecursiveURLs(link, DEPTH - 1)
    return returned_word_list

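Once you have a list of cleaned and stemmed words, you can use the following functions to build the word-count dictionary and to print it, respectively: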

def counter(words):
    """
    Example Input: ['spam', 'egg', 'egg', 'egg', 'spam', 'spam', 'egg', 'egg']
    Example Output: {'spam': 3, 'egg': 5}
    """
    # Count each distinct word once; the original referenced an undefined name x
    return dict((word, words.count(word)) for word in set(words))

def print_count(word_count, word_max):
    """
    Example Input: {'spam': 3, 'egg': 5}
    Prints the word list up to the word_max sorted by frequency
    """
    for word in sorted(word_count, key=word_count.get, reverse=True)[:word_max]:
        print(word, '(', word_count[word], ')', sep='')
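
As an alternative to the two functions above, the standard library's collections.Counter can do the counting, the merging across pages, and the top-N selection. This is a sketch under assumptions rather than part of the original answer: page_one and page_two are made-up word lists, and MAXWORDS is set to 3 only to keep the output short (the question uses 50).

```python
from collections import Counter

MAXWORDS = 3  # the question uses 50; a small value keeps the demo short

# Made-up, already cleaned word lists for two pages.
page_one = ['spam'] * 8 + ['page', 'love']
page_two = ['stem'] * 4 + ['page', 'page', 'these', 'these']

total = Counter()
for page_words in (page_one, page_two):
    total.update(page_words)  # adds this page's counts to the running total

# most_common(n) returns the n highest counts as (word, count) pairs.
top = total.most_common(MAXWORDS)
for word, count in top:
    print(word, '(', count, ')', sep='')
```

Counter makes one pass over the words, whereas calling words.count(word) for every distinct word rescans the whole list each time. Because update() adds to existing counts, a word such as page that appears on both pages ends up with a single combined total.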

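【Comments】:

Thanks for the reply! The code you gave me correctly prints each word and its frequency per page, but the dictionary it creates only contains the entire text of each page. I need the final dictionary to have the individual words from all pages as keys and their frequencies as values. That would let me sort them the way I need. Right now it returns something like: spam(8) page(1) love(1). That's the first page. The next page is: stem(4) page(2) these(2), and so on. The final result needs to be spam(8) stem(4) page(3) these(2) love(1)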
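I don't know why it isn't working for you. By the time you reach return cleaner(global_word_list), global_word_list should contain all the words from all the pages. I've read through the code you posted several times and there's no obvious reason for the behavior you describe. Did you make any changes to cleaner or counter? Also, you shouldn't use List as the parameter name for counter. It is easily confused with Python's built-in list type, and shadowing the lowercase built-in name can cause unexpected behavior.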
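Sorry for all the confusing comments. I found the problem. First, the line global_word_list += text.split() raised an error. It said the local variable was referenced before assignment. So I changed it to global_word_list.append(text.split()). The problem is that split creates a list. So when it comes time to build the dictionary, it sees a list of lists, each of which is the text of one page. I need to figure out how to make it just a single list.
I just saw your last comment. global_word_list += text.split() should work fine. The error sounds like global_word_list is undefined. Make sure global_word_list = [] is at the top level and comes before def getRecursiveURL. If it still gives you trouble, let me know and I can refactor getRecursiveURL so that the word list bubbles up instead of using a global variable.
I have it before any of the functions, and it still gives me an error saying it's undefined.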
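
A likely cause of the error described in these comments, assuming global_word_list is a module-level list as stated: an augmented assignment such as += inside a function makes Python treat the name as local to that function, so reading it raises UnboundLocalError no matter where the global is defined. The sketch below reproduces the error and shows two ways around it. The function names are hypothetical and chosen only for illustration.

```python
global_word_list = []


def add_words_broken(text):
    # The += assignment makes global_word_list local to this function,
    # so reading it here raises UnboundLocalError.
    global_word_list += text.split()


def add_words_with_global(text):
    global global_word_list  # tell Python to rebind the module-level name
    global_word_list += text.split()


def add_words_with_extend(text):
    # extend() mutates the existing list in place; there is no assignment,
    # so no global declaration is needed.
    global_word_list.extend(text.split())


try:
    add_words_broken('spam egg')
except UnboundLocalError as err:
    print('broken version failed:', err)

add_words_with_global('spam egg')
add_words_with_extend('egg page')
print(global_word_list)

# append() adds the whole list as ONE element, which is how the
# list-of-lists mentioned above comes about.
nested = []
nested.append('spam egg'.split())
print(nested)
```

Either the global declaration or extend() keeps the result a single flat list of words, which is what counter expects.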