Crawling and Analyzing Trends in Popular Programming Languages

type

Post

status

Published

date

May 24, 2022

slug

crawly_end

summary

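Final project for a web crawling course. It analyzes popular-language trends on GitHub and Stack Overflow: language tags from the GitHub trending list and the Stack Overflow hot-questions list are collected over several days and turned into line charts, pie charts, and other charts that give a direct view of which languages are currently popular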

Environment

Environment setup

Run date

May 2022

Sites crawled

Github

Stack Overflow

Main project structure


E:.
│  work.ipynb
│  网络爬虫结课论文.docx
│
├─github_today
│      github_today_05_06.html
│      github_today_05_07.html
│      github_today_05_08.html
│      github_today_05_09.html
│      github_today_05_10.html
│
│
└─stackoverflow
        stackoverflow_05_06.html
        stackoverflow_05_07.html
        stackoverflow_05_08.html
        stackoverflow_05_09.html
        stackoverflow_05_10.html

Overall function call graph

Dependencies

name	version
python	3.6.13
requests	2.27.1
lxml	4.8.0
pyecharts	1.9.1
jieba	0.42

Project source code

GitHub - txuw/crawly_end

https://github.com/txuw/crawly_end

(a proxy is required)

crawly_end.zip

2519.4KB

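Demo

The pie chart shows each language's share for a given day

stackoverflow饼状图.html

The line chart shows how each language trends over time

The grouped bar chart shows how the languages are distributed

The word cloud shows the popular keywords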

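Crawl targets

Extract the hot lists of two currently popular programming community sites, GitHub and Stack Overflow

Count which programming languages are currently popular

All crawling complies with each site's robots.txt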
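A robots.txt check can be automated with the standard library. The sketch below is only an illustration: is_allowed is a hypothetical helper, and SAMPLE_RULES is a made-up rule set, not the actual robots.txt of GitHub or Stack Overflow.

```python
from urllib.robotparser import RobotFileParser

def is_allowed(robots_lines, user_agent, url):
    """Return True if the given robots.txt lines let user_agent fetch url."""
    parser = RobotFileParser()
    # parse() takes the robots.txt content as a list of lines
    parser.parse(robots_lines)
    return parser.can_fetch(user_agent, url)

# Made-up rules for illustration only
SAMPLE_RULES = [
    'User-agent: *',
    'Disallow: /private/',
]

print(is_allowed(SAMPLE_RULES, 'my-crawler', 'https://example.com/trending'))
print(is_allowed(SAMPLE_RULES, 'my-crawler', 'https://example.com/private/page'))
```

For a live site, the same parser can load the real file with set_url() and read() before can_fetch() is called.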

https://github.com/trending

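GitHub is one of the most popular open-source sites for programmers today (arguably the most popular). It hosts open-source projects from many companies, which are maintained and released through Git and come with documentation for programmers to read. It is a good place to learn about cutting-edge popular projects, and the GitHub trending list is often a bellwether for the languages that working programmers favor

Stack Overflow is one of the most popular Q&A communities for programmers. It is typically used to ask questions, which tend to receive fairly effective answers, so it reflects well which languages the most people are currently learning. The difference from GitHub is that Stack Overflow is a learning community while GitHub is a community of mature products: Stack Overflow reflects trends in language learning, while GitHub reflects trends in product usage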

https://stackoverflow.com/?tab=hot

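Page crawling

The requests library is used to crawl the Stack Overflow hot-questions list and the GitHub daily trending list

Page-crawling function call graph

get_today_data() returns the current date

The date makes it easy to group the saved pages by time when they are read back later; it is passed to get_html as the postifix (suffix) argument

Return format


'05_07'

get_html(url,Folder='default',postifix='',cookie='') downloads the target page

The HTML it receives is saved in the folder named after the site, with the date appended to the filename (see the filenames under the main project structure)

Parameters

url: the link to crawl

Folder: the folder in which the page is stored

postifix: the filename suffix, used here for the date

cookie: the cookie sent with the request

Nothing meaningful is returned; the page source is saved in a local folder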
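The naming rule for saved pages can be sketched on its own. build_page_path is a hypothetical helper, not part of the project; it only shows how the folder name and the date suffix combine into filenames like the ones in the project structure.

```python
from pathlib import Path

def build_page_path(folder, postfix):
    """Return the path a downloaded page is saved to: <folder>/<folder><postfix>.html"""
    return Path(folder) / (folder + postfix + '.html')

# The suffix is an underscore followed by the crawl date
print(build_page_path('github_today', '_05_06'))
print(build_page_path('stackoverflow', '_05_10').name)
```

Because the folder name is repeated in the filename, the crawl date can later be recovered from the filename alone.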

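Code walkthrough

Page download module: get_html

Downloads the HTML to local disk, storing it in the folder whose name is passed as Folder

A proxy is used to reach sites hosted abroad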


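# Page download module
import os
import requests

def get_html(url,Folder='default',postifix='',cookie=''):
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/99.0.4844.74 Safari/537.36',
        'Cookie':cookie
    }
    # Local proxy used to reach sites hosted abroad
    proxies={'http':'http://127.0.0.1:7890','https':'http://127.0.0.1:7890'}
    try:
        # Fetch the page with a GET request
        response = requests.get(url,headers=headers,proxies=proxies,timeout=30)
        response.raise_for_status()
        # Create the target folder if it does not exist yet
        os.makedirs(Folder,exist_ok=True)
        # Build the output path, e.g. ./github_today/github_today_05_06.html
        path = './'+Folder+'/'+Folder+postifix+".html"
        # Write the raw response bytes; the with block closes the file
        with open(path,"wb") as f:
            f.write(response.content)
    except Exception as e:
        print(e)
        return "error"
    return "ok"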

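Date module: get_today_data

Gets the current month and day


# Date module
from datetime import datetime

def get_today_data():
    cur_time = datetime.now()
    # Format the date as a zero-padded month_day string, e.g. '05_07'
    today = cur_time.strftime('%m_%d')
    return today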
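As a quick check of the '%m_%d' format, the sketch below applies it to fixed dates instead of the current time, so the result is reproducible. date_suffix is a hypothetical name used only for this illustration.

```python
from datetime import datetime

def date_suffix(moment):
    """Return the zero-padded month_day string used in the saved filenames."""
    return moment.strftime('%m_%d')

# Fixed dates stand in for datetime.now()
print(date_suffix(datetime(2022, 5, 7)))    # 05_07
print(date_suffix(datetime(2022, 12, 31)))  # 12_31
```

Single-digit months and days are zero-padded, which keeps the filenames in date order when they are sorted as strings.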

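Driver code

The functions are called here; keeping the code decoupled lets each step be run separately

The arguments are, in order: the page to crawl, the folder to store it in, the filename suffix (the current date), and cookies (crawling also works with this left empty)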


today = get_today_data()
stackoverflow_cookie =''
github_cookie =''
get_html('https://stackoverflow.com/?tab=hot','stackoverflow','_'+today,stackoverflow_cookie)
get_html('https://github.com/trending?since=daily','github_today','_'+today,github_cookie)
get_html('https://github.com/trending?since=weekly','github_week','_'+today,github_cookie)
get_html('https://github.com/trending?since=monthly','github_monthly','_'+today,github_cookie)

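Parsing the data

Extract the required data from the page source

Data-parsing function call graph

get_path(Folder_name) returns the list of files in a folder

It is used to get the filenames in the GitHub folder and in the Stack Overflow folder

Parameters

Folder_name: the folder name

Return format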


['github_today_05_06.html', 'github_today_05_07.html', 'github_today_05_08.html'...]

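analyze_html(name) decides which site is being analyzed

It hands each saved page to the analysis function for that site and collects the data those functions return

Parameters

name: the site to analyze

Return format (a list of records, each shaped like this)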


{'time': '05_06',
  'title': 'metaseq',
  'text': 'Repo for external large-scale work',
  'tags': ['Python']}

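analyze_github(html,time) parses GitHub with XPath (through lxml)

It extracts the language tag, title, summary, and time for each GitHub project

Parameters

html: the tree structure of the page to analyze, passed in from analyze_html for XPath queries

time: the time label

The returned data has the same format as analyze_html

analyze_stackoverflow(html,time)

Parses Stack Overflow with the re library to get the programming languages and titles

Parameters

html: the HTML source, used for matching with the re library

time: the time label

The returned data has the same format as analyze_html, except that each record carries an id field (the question id) instead of text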

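Code walkthrough

Filename extraction module: get_path

Pages from different sources are stored in different folders

so the filenames have to be looked up by folder name

and the list of filenames is returned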


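# Lists the filenames in a local folder
import os

def get_path(Folder_name):
    # os.path.join keeps the path portable across operating systems
    folder = os.path.join(os.path.abspath('.'),str(Folder_name))
    # Sort so that pages are processed in date order
    path_list = sorted(os.listdir(folder))
    return path_list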

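Top-level module: analyze_html

Takes the name of the site to analyze

Gets the filenames through the filename extraction module

Strips the time suffix out of each filename so the crawl time is known

Dispatches to the branch for the matching site based on the name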


# Page analysis: top-level module
from lxml import etree

def analyze_html(name):
    datas = []
    # List the saved pages in the folder named after the site
    paths_lists = get_path(name)
    for path in paths_lists:
        # Recover the crawl date from the filename with string replacement
        time = path.replace(name+"_","").replace(".html","")
        data = []
        # GitHub branch
        if name.find("github") != -1:
            # Parse into an element tree so it can be queried with XPath
            html = etree.parse(name+'/'+path,etree.HTMLParser())
            data = analyze_github(html,time)
        # Stack Overflow branch
        elif name == 'stackoverflow':
            # Read the raw HTML; the re library matches on the source directly
            with open(name+'/'+path,"r",encoding="utf-8") as f:
                html = f.read()
            data = analyze_stackoverflow(html,time)
        # Collect the records returned for this page into datas
        datas.extend(data)
    return datas
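The date recovery step relies on the filename convention <site>_<date>.html. A slightly stricter variant, sketched here under that assumption, slices off the known prefix and suffix and fails loudly on unexpected filenames; crawl_date is a hypothetical helper, not part of the project.

```python
def crawl_date(site, filename):
    """Extract the date part from a filename of the form <site>_<date>.html."""
    prefix = site + '_'
    suffix = '.html'
    if not (filename.startswith(prefix) and filename.endswith(suffix)):
        raise ValueError('unexpected filename: ' + filename)
    # Drop the site prefix and the extension, leaving only the date
    return filename[len(prefix):-len(suffix)]

print(crawl_date('github_today', 'github_today_05_06.html'))   # 05_06
print(crawl_date('stackoverflow', 'stackoverflow_05_10.html'))  # 05_10
```

Unlike chained replace() calls, slicing cannot accidentally remove the same text from the middle of a filename.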

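GitHub page analysis module: analyze_github

Data is extracted with XPath

The XPath expressions were copied straight from Chrome's developer tools; the projects are then looked up by index

Takes the parsed GitHub page and the crawl time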


def analyze_github(html,time):
    data = []
    base = '//*[@id="js-pjax-container"]/div[3]/div/div[2]/article['
    # Each page lists up to 25 projects; XPath indexes start at 1
    for i in range(1,26):
        # Record fields: time, title, text (summary), tags (programming language)
        item = {}
        # time is the crawl time that was passed in
        item["time"]=time
        # Get the title text nodes with XPath
        titles = html.xpath(base+str(i)+']/h1/a/text()')
        # Skip an index with no matching project (a page may list fewer than 25)
        if len(titles) < 3:
            continue
        # Strip surrounding newlines and spaces
        item["title"]=titles[2].strip()
        # Get the summary with XPath
        text = html.xpath(base+str(i)+']/p//text()')
        # The summary may be missing; strip surrounding whitespace when present
        item["text"]=text[0].strip() if len(text) >=1 else ''
        tag = html.xpath(base+str(i)+']/div[2]/span[1]/span[2]//text()')
        # The language tag may be missing as well
        item["tags"]=[tag[0]] if len(tag) >=1 else []
        data.append(item)
    return data
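Absolute, index-based XPath copied from the browser tends to break when the page layout shifts. One alternative is to iterate over the article elements and query relative to each one. The sketch below illustrates that idea with the standard library's xml.etree.ElementTree (swapped in for lxml so it runs without extra packages) on a made-up, well-formed snippet; the markup, the itemprop attribute, and the parse_articles name are assumptions for illustration, not the real GitHub page.

```python
import xml.etree.ElementTree as ET

# Made-up markup that only imitates the shape of a trending list
SAMPLE = """
<div>
  <article>
    <h1><a href="/owner/repo-one"><span>owner / </span>repo-one</a></h1>
    <p> First sample project </p>
    <span itemprop="programmingLanguage">Python</span>
  </article>
  <article>
    <h1><a href="/owner/repo-two"><span>owner / </span>repo-two</a></h1>
  </article>
</div>
"""

def parse_articles(markup, time):
    """Build one record per article using paths relative to that article."""
    root = ET.fromstring(markup)
    records = []
    for article in root.findall('.//article'):
        link = article.find('./h1/a')
        summary = article.find('./p')
        language = article.find(".//span[@itemprop='programmingLanguage']")
        records.append({
            'time': time,
            # The repository name is the last segment of the link target
            'title': link.get('href').rsplit('/', 1)[-1],
            'text': (summary.text or '').strip() if summary is not None else '',
            'tags': [language.text.strip()] if language is not None else [],
        })
    return records

for record in parse_articles(SAMPLE, '05_06'):
    print(record)
```

With lxml the same relative queries can be written as article.xpath('./h1/a') and so on, and the loop no longer needs to know how many projects a page lists.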

Stack Overflow page analysis module: analyze_stackoverflow

re.findall matches every question

and the required fields are then matched within each one


# Stack Overflow page data extraction: submodule
import re

def analyze_stackoverflow(html,time):
    data = []
    # Match every question on the hot list
    text_all = re.findall(r'"([0-9]+)" data-post-type-id="1">([\s\S]*?)</time>',html)
    # Split out each question and extract its id, title, and language tags
    for text in text_all:
        # Record fields: time, title,
        # id (question id), tags (keyword tags)
        item = {}
        item["time"]=time
        # int() converts the id without evaluating arbitrary text
        item["id"]=int(text[0])
        # Match the title
        titles = re.findall(r'class="s-link">(.*?)</a>',text[1],re.S)
        item["title"]=titles[0]
        # Match the tags
        item["tags"]=re.findall(r'rel="tag">([a-z0-9+#.\-]*)',text[1])
        data.append(item)
    return data
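The title and tag patterns can be tried in isolation on a small fragment. SAMPLE_QUESTION below is made up to imitate the relevant attributes and is not copied from Stack Overflow; extract_title_and_tags is a hypothetical helper name.

```python
import re

# Made-up fragment containing one title link and three tag links
SAMPLE_QUESTION = (
    '<a href="/questions/1" class="s-link">How do I sort a list?</a>'
    '<a href="/questions/tagged/python" rel="tag">python</a>'
    '<a href="/questions/tagged/c%2b%2b" rel="tag">c++</a>'
    '<a href="/questions/tagged/c%23" rel="tag">c#</a>'
)

# Raw strings avoid accidental escape sequences; the hyphen is escaped
# so it is read as a literal character inside the character class
TAG_PATTERN = re.compile(r'rel="tag">([a-z0-9+#.\-]*)')
TITLE_PATTERN = re.compile(r'class="s-link">(.*?)</a>', re.S)

def extract_title_and_tags(fragment):
    """Return (title, tags) for one question fragment; title is '' if absent."""
    titles = TITLE_PATTERN.findall(fragment)
    title = titles[0] if titles else ''
    return title, TAG_PATTERN.findall(fragment)

print(extract_title_and_tags(SAMPLE_QUESTION))
```

The character class covers lowercase letters, digits, and the symbols needed for tags such as c++, c#, and asp.net.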

Analysis driver code

The Stack Overflow data is stored in stackoverflow_datas

The GitHub data is stored in github_today_datas


stackoverflow_datas = analyze_html('stackoverflow')
github_today_datas = analyze_html('github_today')

Output format of the parsed data

stackoverflow_datas and github_today_datas are lists of records with the following JSON-like structure


{'time': '05_06',
  'title': 'metaseq',
  'text': 'Repo for external large-scale work',
  'tags': ['Python']}

(I originally meant to load this into MongoDB for analysis, which is why it has this shape, but I got a bit lazy later on)

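Data categorization

The parsed data is fairly scattered, which makes chart generation awkward, so it has to be transformed and grouped

Transformation goals

Using time as the key, count how often each programming-language tag appears, to produce a line chart with time on the x axis and number of occurrences on the y axis

Split each title into words and count word frequencies, to produce a word cloud

Data-categorization function call graph

categorical_time_data(datas) separates time and tag out of the parsed data

It accepts data from either Stack Overflow or GitHub, extracts each time and tag, and hands them to merge_all_time_data

Parameters

datas: the data received from Stack Overflow or GitHub

Return format

(time,tag), yielded one pair at a time; both time and tag are strings

merge_all_time_data(all_datas) groups the separated (time,tag) pairs into a time-to-tags structure

It accepts data from either Stack Overflow or GitHub, uses categorical_time_data to separate time and tag, and assembles them into a new data structure

Parameters

all_datas: a list of datasets received from Stack Overflow or GitHub

Return format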


{
'tot':['Python', 'TypeScript', 'SWIG', 'Python', 'Go', 'Jupyter Notebook'...],
'05_06': ['Python', 'TypeScript', 'SWIG', 'Python', 'Go', 'Jupyter Notebook'...],
'05_07': ['Python', 'TypeScript', 'SWIG', 'Python', 'Go', 'Jupyter Notebook'...],
'05_08': ['Python', 'TypeScript', 'SWIG', 'Python', 'Go', 'Jupyter Notebook'...]
}
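A structure of this shape can be built from (time, tag) pairs in a few lines. The sketch below uses dict.setdefault; group_tags_by_time and the sample pairs are made up for illustration.

```python
def group_tags_by_time(pairs):
    """Group tags under their date, plus a 'tot' list holding every tag."""
    grouped = {'tot': []}
    for time, tag in pairs:
        grouped['tot'].append(tag)
        # setdefault creates the list for a date the first time it is seen
        grouped.setdefault(time, []).append(tag)
    return grouped

# Made-up pairs for illustration
sample_pairs = [('05_06', 'Python'), ('05_06', 'Go'), ('05_07', 'Python')]
print(group_tags_by_time(sample_pairs))
# {'tot': ['Python', 'Go', 'Python'], '05_06': ['Python', 'Go'], '05_07': ['Python']}
```

Keeping the raw tag lists rather than counts means the same structure can feed both the overall totals and the per-day statistics.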

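get_top_7(data) picks out the 7 most popular programming languages

It runs the input through merge_all_time_data and counts how often each tag appears

Parameters

data: a list of datasets, passed on to merge_all_time_data

Return format (the per-day counts; the function also returns the date labels and the top-7 labels)


{'05_06': [('javascript', 22),
  ('python', 18),
  ('java', 9),
  ('html', 7),
  ('c++', 5),
  ('reactjs', 5),
  ('pandas', 2)]...}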

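categorical_title_data(datas) separates the titles out of the parsed data and splits them into words

It accepts data from either Stack Overflow or GitHub, extracts each title, and segments it into words

which are handed to merge_all_title_data

Parameters

datas: the data received from Stack Overflow or GitHub

Return format (one word yielded at a time)


'jcc'

merge_all_title_data(all_datas) counts the frequency of the segmented words

It gets the words in the titles from categorical_title_data, counts their frequencies, and returns the result

Parameters

all_datas: a list of datasets received from Stack Overflow or GitHub

Return format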


[('closed', 118), ('using', 107), ('function', 102)....]

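Code walkthrough

time and tag separation module: categorical_time_data

Extracts the time and tag from each record passed in and yields them as (time,tag) tuples


def categorical_time_data(datas):
    # Walk through the dataset
    for data in datas:
        # Extract the time and the programming-language tags
        time = data["time"]
        tags = data["tags"]
        for tag in tags :
            # Skip empty tags
            if tag == "":
                continue
            # Each iteration yields one (time,tag) tuple;
            # yield makes this function a generator
            yield (time,tag)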

time/tag merging module: merge_all_time_data

Passes the records above to the time and tag separation module to get (time,tag) tuples

then pushes them into a new structure keyed by time, with the list of tags as the value


def merge_all_time_data(all_datas):
    tag_datas={}
    # tot collects every tag regardless of date
    tag_datas["tot"] = []
    # Walk through the datasets and extract time and tag
    for datas in all_datas:
        # Get (time,tag) pairs from the separation module
        for time,tag in categorical_time_data(datas):
            # Push every tag into tot
            tag_datas["tot"].append(tag)
            # Group the tag under its date, creating the list on first use
            tag_datas.setdefault(time,[]).append(tag)
    return tag_datas

Top-7 language ranking (the fiddly part): get_top_7

The time/tag merging module only stores each occurrence in a list

without turning the lists into counts; the counting happens here, and only the seven most frequent languages are kept

Code:


# get_top_7: get the top 7 and how they trend from day to day
import collections

def get_top_7(data):
    # Get the time-to-tags structure from the merging module
    time_datas = merge_all_time_data(data)

    # collections.Counter counts how often each language appears across all days
    dict_tmp = collections.Counter(time_datas['tot'])
    # Sort by count, highest first
    tot_rank=sorted(dict_tmp.items(),key=lambda x:x[1],reverse=True)
    # (language, count) tuples for the top 7; slicing copes with fewer than 7
    top_list = tot_rank[:7]
    # Labels of the top 7 languages
    top_lab = [name for name,_ in top_list]
    time_lab=[] # date labels
    top_list_everyday = {} # counts for each date

    # Build the per-date counts, restricted to the top 7 languages
    for time,tags in time_datas.items():
        if time != "tot":
            time_lab.append(time)
            # Count tag frequencies for this date
            dict_tmp = collections.Counter(tags)
            top_list_everyday[time] = []
            for name,value in dict_tmp.items():
                # Keep the language only if it is one of the top 7
                if name in top_lab:
                    top_list_everyday[time].append((name,value))
            # Sort this date's counts, highest first
            top_list_everyday[time] = sorted(top_list_everyday[time],key=lambda x:x[1],reverse=True)
    # Add the overall totals as well
    top_list_everyday["tot"]=top_list
    # Reverse the label order (cosmetic only)
    top_lab.reverse()
    # Return the date labels, the top-7 language labels, and the per-date counts
    return time_lab,top_lab,top_list_everyday
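The count-then-rank step can also be expressed with Counter.most_common, which returns the most frequent items already sorted. This is a minimal sketch of that one step, not a replacement for the whole function; top_languages and the sample tags are made up for illustration.

```python
from collections import Counter

def top_languages(tags, limit=7):
    """Return up to `limit` (tag, count) pairs, most frequent first."""
    return Counter(tags).most_common(limit)

# Made-up tags for illustration
sample_tags = ['python', 'javascript', 'python', 'java', 'python', 'javascript']
print(top_languages(sample_tags, limit=2))
# [('python', 3), ('javascript', 2)]
```

most_common simply returns fewer pairs when there are fewer distinct tags than the limit, so no index check is needed.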

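Title separation module: categorical_title_data

Segments each title with the jieba library and yields the words to the caller


import jieba

def categorical_title_data(datas):
    for data in datas:
        # Extract the title
        title = data["title"]
        # Segment the title with jieba (returns a generator of words)
        words = jieba.cut(title)
        # Walk through each segmented word
        for word in words:
            # Hand the word back to the caller
            yield word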

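Word-frequency module: merge_all_title_data

Once all the words are available, a dictionary accumulates the count for each word

Sorting in descending order then gives the word frequencies

Code:


def merge_all_title_data(all_datas):
    words = {}
    for datas in all_datas:
        # Count word frequencies
        for word in categorical_title_data(datas):
            # Drop words shorter than 5 characters so that pronouns and
            # other filler words do not interfere (admittedly crude, but it will do for now)
            if len(word) < 5:
                continue
            # Accumulate the count in the dictionary
            words[word] = words.get(word,0)+1
    # Sort by count, highest first
    words = sorted(words.items(),key = lambda x:x[1],reverse=True)
    return words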
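The effect of the length filter is easy to see on a small word list. In this sketch, count_long_words is a hypothetical helper and the words are made up; it shows that short language names such as java and go are dropped along with the filler words, which is the crude side of the approach.

```python
from collections import Counter

def count_long_words(words, min_length=5):
    """Count words of at least min_length characters, most frequent first."""
    kept = [word for word in words if len(word) >= min_length]
    return sorted(Counter(kept).items(), key=lambda item: item[1], reverse=True)

# Made-up words for illustration
sample_words = ['using', 'java', 'function', 'using', 'go', 'function', 'using']
print(count_long_words(sample_words))
# [('using', 3), ('function', 2)]
```

A stop-word list would be a gentler filter if short but meaningful words need to survive.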

Categorization driver code

The top-7 language ranking function and the word-frequency function are called here


all_time_lab,all_top_lab,all_top_list = get_top_7([github_today_datas,stackoverflow_datas])
github_time_lab,github_top_lab,github_top_list = get_top_7([github_today_datas])
stackoverflow_time_lab,stackoverflow_top_lab,stackoverflow_top_list = get_top_7([stackoverflow_datas])
all_title_data = merge_all_title_data([github_today_datas,stackoverflow_datas])

Output of the data categorization

Top-seven language data


{'05_06': [('javascript', 22),
  ('python', 18),
  ('java', 9),
  ('html', 7),
  ('c++', 5),
  ('reactjs', 5),
  ('pandas', 2)]...}
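A line chart needs one y value per date for every language, including a 0 on days when a language does not appear. The sketch below shows one way to reshape per-day (language, count) lists into such series; build_line_series and the second day's numbers are assumptions made for illustration.

```python
def build_line_series(time_labels, top_labels, per_day):
    """Return {language: [count for each date in time_labels]}, filling gaps with 0."""
    series = {}
    for label in top_labels:
        counts = []
        for time in time_labels:
            # Turn that day's (language, count) pairs into a lookup table
            day_counts = dict(per_day.get(time, []))
            counts.append(day_counts.get(label, 0))
        series[label] = counts
    return series

# Made-up counts for illustration
sample_per_day = {
    '05_06': [('javascript', 22), ('python', 18)],
    '05_07': [('python', 20)],
}
print(build_line_series(['05_06', '05_07'], ['python', 'javascript'], sample_per_day))
# {'python': [18, 20], 'javascript': [22, 0]}
```

Each resulting list lines up with the date labels, so it can be passed to a charting library as one y-axis series.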

Word-frequency data


[('closed', 118), ('using', 107), ('function', 102)....]

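Chart drawing

The pyecharts library is called here; feeding it the data obtained above produces good-looking analysis charts

It is easy to learn and highly recommended: my roommate Ding Yi was full of praise after using it, and data-analysis charts are nothing to be afraid of anymore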

pyecharts - A Python Echarts Plotting Library built with love.

https://pyecharts.org/#/zh-cn/quickstart
