【Title】: How to determine which forks on GitHub are ahead?
【Posted】: 2019-07-19 00:30:47
【Question】:

Sometimes the original GitHub repository of a piece of software I use, such as linkchecker, sees little or no development, while a lot of forks have been created (in this case: 142, at the time of writing).

For all the forks, I want to know:

  • which forks are ahead of the original master branch

and, for each such fork:

  • how many commits it is ahead of the original
  • how many commits it is behind

GitHub has a web interface for comparing forks, but I don't want to do that manually for every fork; I just want a single CSV file with the results for all forks. How can I script this? The GitHub API can list the forks, but I don't see how to compare a fork against the original with it. Cloning each fork in turn and comparing locally seems rather crude.
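For reference, the API does expose a compare endpoint (`GET /repos/{owner}/{repo}/compare/{base}...{head}`) whose response includes `ahead_by` and `behind_by` fields, which is exactly what the CSV needs. A minimal sketch of such a script (the repository name, the CSV layout, and the helper names here are illustrative, and unauthenticated requests are rate-limited to roughly 60 per hour):

```python
import csv
import requests

API = "https://api.github.com"

def head_ref(fork):
    # "owner:branch" spec for the right-hand side of a compare
    return fork["owner"]["login"] + ":" + fork["default_branch"]

def list_forks(owner, repo):
    # forks are paginated; keep fetching pages until one comes back empty
    page = 1
    while True:
        r = requests.get(API + "/repos/{}/{}/forks".format(owner, repo),
                         params={"per_page": 100, "page": page})
        r.raise_for_status()
        batch = r.json()
        if not batch:
            return
        yield from batch
        page += 1

def write_fork_csv(owner, repo, path="forks.csv"):
    parent = requests.get(API + "/repos/{}/{}".format(owner, repo)).json()
    base = parent["default_branch"]
    with open(path, "w", newline="") as f:
        w = csv.writer(f)
        w.writerow(["fork", "ahead_by", "behind_by"])
        for fork in list_forks(owner, repo):
            cmp_url = API + "/repos/{}/{}/compare/{}...{}".format(
                owner, repo, base, head_ref(fork))
            cmp = requests.get(cmp_url).json()  # has 'ahead_by'/'behind_by'
            w.writerow([fork["html_url"], cmp["ahead_by"], cmp["behind_by"]])

if __name__ == "__main__":
    write_fork_csv("linkchecker", "linkchecker")
```

With 142 forks this makes one request per fork plus a few for listing, so an access token (5000 requests/hour) is advisable for larger repositories.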

【Comments】:

  • ++, but note that this approach has at least one problem… a fork can diverge wildly from the original repository, in ways that may be good and/or bad, so knowing which fork has more commits doesn't necessarily tell you which fork is "ahead" of the original.
  • I'm looking for a quick way to select the forks that are worth a closer look. If you have a better idea, I'm all ears!
  • Related, possibly even a duplicate: Github, forked repositories ahead of master: active users.
  • Oh! I didn't know about that feature. I don't think this question is a duplicate (I still want what I'm asking for), but it definitely helps, thanks!

Tags: github git-fork


【Solution 1】:

After clicking "Insights" at the top and then "Forks" on the left, the following bookmarklet prints the information directly onto the web page.

The code, to add as a bookmarklet (or paste into the console):

javascript:(async () => {
  /* while on the forks page, collect all the hrefs and pop off the first one (original repo) */
  const aTags = [...document.querySelectorAll('div.repo a:last-of-type')].slice(1);

  for (const aTag of aTags) {
    /* fetch the forked repo as html, search for the "This branch is [n commits ahead,] [m commits behind]", print it directly onto the web page */
    await fetch(aTag.href)
      .then(x => x.text())
      .then(html => aTag.outerHTML += `${html.match(/This branch is.*/).pop().replace('This branch is', '').replace(/([0-9]+ commits? ahead)/, '<font color="#0c0">$1</font>').replace(/([0-9]+ commits? behind)/, '<font color="red">$1</font>')}`)
      .catch(console.error);
  }
})();

You can also paste the code into the address bar, but note that some browsers strip the leading javascript: on paste, so you have to type javascript: yourself.

Adapted from this answer.


Bonus

The following bookmarklet additionally prints links to the ZIP files:

The code, to add as a bookmarklet (or paste into the console):

javascript:(async () => {
  /* while on the forks page, collect all the hrefs and pop off the first one (original repo) */
  const aTags = [...document.querySelectorAll('div.repo a:last-of-type')].slice(1);

  for (const aTag of aTags) {
    /* fetch the forked repo as html, search for the "This branch is [n commits ahead,] [m commits behind]", print it directly onto the web page */
    await fetch(aTag.href)
      .then(x => x.text())
      .then(html => aTag.outerHTML += `${html.match(/This branch is.*/).pop().replace('This branch is', '').replace(/([0-9]+ commits? ahead)/, '<font color="#0c0">$1</font>').replace(/([0-9]+ commits? behind)/, '<font color="red">$1</font>')}` + " <a " + `${html.match(/href="[^"]*\.zip">/).pop() + "Download ZIP</a>"}`)
      .catch(console.error);
  }
})();

【Discussion】:

  • Tested with Firefox; works for me, and it looks nice too (you can watch it make progress).
  • I must say: the results on the first page I tried it on suggest that GitHub should make toy forks and stale forks easier to spot.
  • Love it - I'm stealing this :)
【Solution 2】:

Had exactly the same itch, and wrote a scraper that prints the information rendered in the HTML for the forks: https://github.com/hbbio/forkizard

Definitely not perfect, just a temporary solution.

【Discussion】:

  • As far as I can tell, GitHub still doesn't show this information, right? E.g. the repo github.com/alormil/ipa-rest-api/network/members. Or is there some other way to do this these days? Otherwise your initiative sounds great and I would use it!
【Solution 3】:

Late to the party - I think this is the second time I've come across this SO post, so I'll share my js-based solution (I ended up making a bookmarklet that fetches and searches the HTML pages). You can create a bookmarklet from it, or simply paste the whole thing into the console. Works on Chromium-based browsers and Firefox:

Edit: if there are more than 10 or so forks on the page, you may get locked out for scraping too quickly (too many requests, 429 in the network tab). Use async / await instead:

javascript:(async () => {
  /* while on the forks page, collect all the hrefs and pop off the first one (original repo) */
  const forks = [...document.querySelectorAll('div.repo a:last-of-type')].map(x => x.href).slice(1);

  for (const fork of forks) {
    /* fetch the forked repo as html, search for the "This branch is [n commits ahead,] [m commits behind]", print it to console */
    await fetch(fork)
      .then(x => x.text())
      .then(html => console.log(`${fork}: ${html.match(/This branch is.*/).pop().replace('This branch is ', '')}`))
      .catch(console.error);
  }
})();

Or you can process them in batches, but it is still easy to get locked out:

javascript:(async () => {
  /* while on the forks page, collect all the hrefs and pop off the first one (original repo) */
  const forks = [...document.querySelectorAll('div.repo a:last-of-type')].map(x => x.href).slice(1);

  const getfork = (fork) => {
    return fetch(fork)
      .then(x => x.text())
      .then(html => console.log(`${fork}: ${html.match(/This branch is.*/).pop().replace('This branch is ', '')}`))
      .catch(console.error);
  }

  while (forks.length) {
    await Promise.all(forks.splice(0, 2).map(getfork));
  }
})();

Original (this fires all the requests at once, which may lock you out if there are more of them than GitHub allows):

javascript:(() => {
  /* while on the forks page, collect all the hrefs and pop off the first one (original repo) */
  const forks = [...document.querySelectorAll('div.repo a:last-of-type')].map(x => x.href).slice(1);

  for (const fork of forks) {
    /* fetch the forked repo as html, search for the "This branch is [n commits ahead,] [m commits behind]", print it to console */
    fetch(fork)
      .then(x => x.text())
      .then(html => console.log(`${fork}: ${html.match(/This branch is.*/).pop().replace('This branch is ', '')}`))
      .catch(console.error);
  }
})();

This will print something like:

https://github.com/user1/repo: 289 commits behind original:master.
https://github.com/user2/repo: 489 commits behind original:master.
https://github.com/user2/repo: 1 commit ahead, 501 commits behind original:master.
...

to the console.

Edit: replaced the comments with block comments for pasteability.

【Discussion】:

  • How is this supposed to work? In Firefox and Chrome I can only paste it as a single line, and when I click the resulting bookmarklet on the forks page for rclone it doesn't show any results. However, when I reloaded that page in Chrome, the page said "Access has been restricted - You have triggered an abuse detection mechanism." So something must have happened.
  • Ah yes, that one has a lot of forks. I assume you just got rate limited. I'll update the description with a throttled version - the one I tested on only had about 20 forks.
  • @reinierpost Also note - as a bookmarklet it's just convenient to click - you still need to open the console to see the fetch results. What happened is that you got rate limited, and since reloading just issues another request, that got blocked too. I've updated the description to scrape them 1 by 1 instead.
  • Hmm, console.log, I should have checked that. It's working now! My Firefox console now has a long list of lines like https://github.com/adragomir/rclone: 1 commit ahead, 4251 commits behind rclone:master. followed by a chunk of the bookmarklet source and a location like :1:442
  • Ah, I see. On Firefox, when using the bookmarklet, the console prints line numbers along with the script. With a dark theme, the grey text on the left is the console output and the blue text on the right is the line number the console.log call sits on. If you paste it into the console instead of using it as a bookmarklet, it shows "debugger eval code" - it's working, it just looks funny.
【Solution 4】:

active-forks doesn't do exactly what I want, but it comes close and is very easy to use.

【Discussion】:

【Solution 5】:

Here is a Python script that lists and clones the forks that are ahead. The script uses the API in part, so it runs into the rate limit (you can raise the rate limit (though not remove it) by adding GitHub API authentication to the script; please edit or post it).

Originally I tried to do everything via the API, but that triggered the rate limit too quickly, so now I use is_fork_ahead_HTML instead of is_fork_ahead_API. This may need adjusting if the GitHub website design changes.

Because of the rate limit, I prefer the other answer I posted here.
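For reference, authentication can be bolted onto the script's JSON helper by sending an Authorization header with every request; GitHub then applies the authenticated limit (5000 requests/hour as documented at the time of writing, vs 60 unauthenticated). A hedged sketch - `auth_headers` is an illustrative name, and the token value is a placeholder you must supply yourself:

```python
import requests

def auth_headers(token):
    # GitHub accepts an "Authorization: token <PAT>" header;
    # with no token we fall back to unauthenticated requests
    return {"Authorization": "token " + token} if token else {}

def obj_from_json_from_url(url, token=None):
    # same shape as the helper in the script below, but optionally authenticated
    response = requests.get(url, headers=auth_headers(token))
    response.raise_for_status()
    return response.json(), response.text
```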

import requests, json, os, re

def obj_from_json_from_url(url):
    # TODO handle the internet being down and similar failures
    text = requests.get(url).text
    obj = json.loads(text)
    return obj, text

def is_fork_ahead_API(fork, default_branch_of_parent):
    """ Use the GitHub API to check whether `fork` is ahead.
        This triggers the rate limit, so prefer the non-API version below instead.
    """
    # Compare the default branch of the original repo with the default branch of the fork.
    comparison, comparison_json = obj_from_json_from_url('https://api.github.com/repos/'+user+'/'+repo+'/compare/'+default_branch_of_parent+'...'+fork['owner']['login']+':'+fork['default_branch'])
    if comparison['ahead_by'] > 0:
        return comparison_json
    else:
        return False

def is_fork_ahead_HTML(fork):
    """ Use the GitHub website to check whether `fork` is ahead.
    """
    htm = requests.get(fork['html_url']).text
    match = re.search('<div class="d-flex flex-auto">[^<]*?([0-9]+ commits? ahead(, [0-9]+ commits? behind)?)', htm)
    # TODO if the website design changes, fall back to checking whether 'ahead'/'behind'/'even with' appears only once on the entire page - in that case it is not part of a username etc.
    if match:
        return match.group(1) # for example '1 commit ahead, 114 commits behind'
    else:
        return False

def clone_ahead_forks(user, repo):
    obj, _ = obj_from_json_from_url('https://api.github.com/repos/'+user+'/'+repo)
    default_branch_of_parent = obj["default_branch"]

    page = 0
    while True:
        page += 1
        forks, _ = obj_from_json_from_url('https://api.github.com/repos/'+user+'/'+repo+'/forks?per_page=100&page='+str(page))
        if not forks: # empty page: no more forks
            break

        for fork in forks:
            aheadness = is_fork_ahead_HTML(fork)
            if aheadness:
                #dir = fork['owner']['login']+' ('+str(comparison['ahead_by'])+' commits ahead, '+str(comparison['behind_by'])+' commits behind)'
                dir = fork['owner']['login']+' ('+aheadness+')'
                print(dir)
                os.mkdir(dir)
                os.chdir(dir)
                os.system('git clone '+fork['clone_url'])
                print()

                # recurse into forks of forks
                if fork['forks_count'] > 0:
                    clone_ahead_forks(fork['owner']['login'], fork['name'])

                os.chdir('..')

user = 'cifkao'
repo = 'tonnetz-viz'

clone_ahead_forks(user, repo)

【Discussion】:

【Solution 6】:

Here is a Python script that lists and clones all forks that are ahead.

It does not use the API, so it does not run into the rate limit and needs no authentication, but it may need adjusting if the GitHub website design changes.

Unlike the bookmarklets in the other answers, which show links to ZIP files, this script also saves information about the commits, because it uses git clone and creates a commits.htm file with an overview.

import requests, re, os, sys, time

def content_from_url(url):
    # TODO handle the internet being down and similar failures
    return requests.get(url).text

def clone_ahead_forks(forklist_url):
    forklist_htm = content_from_url(forklist_url)
    with open("forklist.htm", "w") as text_file:
        text_file.write(forklist_htm)

    is_root = True
    # not working if there are no forks: '<a class="(Link--secondary)?" href="(/([^/"]*)/[^/"]*)">'
    for match in re.finditer('<a (class=""|data-pjax="#js-repo-pjax-container") href="(/([^/"]*)/[^/"]*)">', forklist_htm):
        fork_url = 'https://github.com'+match.group(2)
        fork_owner_login = match.group(3)
        fork_htm = content_from_url(fork_url)

        match2 = re.search('<div class="d-flex flex-auto">[^<]*?([0-9]+ commits? ahead(, [0-9]+ commits? behind)?)', fork_htm)
        # TODO if the website design changes, fall back to checking whether 'ahead'/'behind'/'even with' appears only once on the entire page - in that case it is not part of a username etc.

        sys.stdout.write('.')
        if match2 or is_root:
            if match2:
                aheadness = match2.group(1) # for example '1 commit ahead, 2 commits behind'
            else:
                aheadness = 'root repo'
                is_root = False # for subsequent iterations

            dir = fork_owner_login+' ('+aheadness+')'
            print(dir)

            os.mkdir(dir)
            os.chdir(dir)

            # save commits.htm
            commits_htm = content_from_url(fork_url+'/commits')
            with open("commits.htm", "w") as text_file:
                text_file.write(commits_htm)

            # git clone
            os.system('git clone '+fork_url+'.git')
            print()

            # no need to recurse into forks of forks because they are all listed on the initial page and traversed already

            os.chdir('..')


base_path = os.getcwd()
match_disk_letter = re.search(r'^([a-zA-Z]:\\)', base_path)

with open('repo_urls.txt') as url_file:
    for url in url_file:
        url = url.strip()
        match = re.search('github.com/([^/]*)/([^/]*)$', url)
        if match:
            user_name = match.group(1)
            repo_name = match.group(2)
            print(repo_name)
            dirname_for_forks = repo_name+' ('+user_name+')'
            if not os.path.exists(dirname_for_forks):
                url += "/network/members" # the page that lists the forks

                TMP_DIR = 'tmp_'+time.strftime("%Y%m%d-%H%M%S")
                if match_disk_letter: # on Windows, i.e. if the path starts with A:\ or similar, run git in A:\tmp_... instead of .\tmp_..., to prevent "filename too long" errors
                    TMP_DIR = match_disk_letter.group(1)+TMP_DIR
                print(TMP_DIR)

                os.mkdir(TMP_DIR)
                os.chdir(TMP_DIR)
                clone_ahead_forks(url)
                print()
                os.chdir(base_path)
                os.rename(TMP_DIR, dirname_for_forks)
            else:
                print(dirname_for_forks+' already exists, skipping.')

If you create the file repo_urls.txt with contents like the following (you can put several URLs in it, one URL per line):

https://github.com/cifkao/tonnetz-viz

then you will get the following directories, each containing the corresponding cloned repository:

tonnetz-viz (cifkao)
  bakaiadam (2 commits ahead)
  chumo (2 commits ahead, 4 commits behind)
  cifkao (root repo)
  codedot (76 commits ahead, 27 commits behind)
  k-hatano (41 commits ahead)
  shimafuri (11 commits ahead, 8 commits behind)

If it doesn't work, try earlier versions.

【Discussion】:

  • I think we should add the --mirror flag to git clone as described here, right?
【Solution 7】:

Here is a Python script that uses the GitHub API. I wanted to include the date and the last commit message. You need to include a personal access token (PAT) if you need the 5k requests/hour limit.

Usage: python3 list-forks.py https://github.com/itinance/react-native-fs

import requests, sys, datetime
from urllib.parse import urlparse

GITHUB_PAT = 'ghp_...' # placeholder - put your own personal access token here

def json_from_url(url):
    response = requests.get(url, headers={ 'Authorization': 'token {}'.format(GITHUB_PAT) })
    return response.json()

def iso8601_date_to_text(date):
    return datetime.datetime.strptime(date, '%Y-%m-%dT%H:%M:%SZ').strftime('%Y-%m-%d')

def process_repo(repo_author, repo_name, fork_of_fork):
    page = 1

    while True:
        forks_url = 'https://api.github.com/repos/{}/{}/forks?per_page=100&page={}'.format(repo_author, repo_name, page)
        forks_json = json_from_url(forks_url)

        if not forks_json:
            break

        for fork_info in forks_json:
            fork_author = fork_info['owner']['login']
            fork_name = fork_info['name']
            forks_count = fork_info['forks_count']
            fork_url = 'https://github.com/{}/{}'.format(fork_author, fork_name)

            # compare against the base repo (repo_author/repo_name), not the fork's name
            compare_url = 'https://api.github.com/repos/{}/{}/compare/master...{}:master'.format(repo_author, repo_name, fork_author)
            compare_json = json_from_url(compare_url)

            if 'status' in compare_json:
                items = []

                status = compare_json['status']
                ahead_by = compare_json['ahead_by']
                behind_by = compare_json['behind_by']
                total_commits = compare_json['total_commits']
                commits = compare_json['commits']

                if fork_of_fork:
                    items.append('   ')

                items.append(fork_url)
                items.append(status)

                if ahead_by != 0:
                    items.append('+{}'.format(ahead_by))

                if behind_by != 0:
                    items.append('-{}'.format(behind_by))

                if total_commits > 0:
                    last_commit = commits[total_commits-1]
                    commit = last_commit['commit']
                    author = commit['author']
                    items.append(iso8601_date_to_text(author['date']))
                    items.append('"{}"'.format(commit['message'].replace('\n', ' ')))

                if ahead_by > 0:
                    print(' '.join(items))

            if forks_count > 0:
                process_repo(fork_author, fork_name, True)

        page += 1

url_parsed = urlparse(sys.argv[1].strip())
path_array = url_parsed.path.split('/')
root_author = path_array[1]
root_name = path_array[2]

root_url = 'https://github.com/{}/{}'.format(root_author, root_name)
commits_url = 'https://api.github.com/repos/{}/{}/commits/master'.format(root_author, root_name)
commits_json = json_from_url(commits_url)
commit = commits_json['commit']
author = commit['author']
print('{} root {} "{}"'.format(root_url, iso8601_date_to_text(author['date']), commit['message'].replace('\n', ' ')))

process_repo(root_author, root_name, False)

【Discussion】:
