What Are the Common Techniques Used in Web Crawlers?

Author: 课课家教育 (Kokojia Education)    Updated: 2018-12-09 15:53:09

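1. Network requests in Python

This section introduces three common ways to send HTTP requests in Python: urllib, urllib3, and requests.

1) The urllib module

urllib is part of the Python standard library. Its urllib.request submodule provides the urlopen() function, which sends a request to the given URL and returns the response data. urllib is made up of several submodules; their names and purposes are listed in the table below.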

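Module	Description
urllib.request	Defines the functions and classes for opening URLs (mostly HTTP), including authentication, redirection, and cookie handling
urllib.error	Contains the exception classes; the base exception class is URLError
urllib.parse	Provides two groups of functions: URL parsing and URL quoting
urllib.robotparser	Parses robots.txt files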
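
As a rough illustration of the last two rows of the table, the sketch below uses urllib.parse to split and quote a URL and urllib.robotparser to check URLs against robots.txt rules. The robots.txt lines, the crawler name my-crawler, and the example.com URLs are made up for this example; a real crawler would usually load the rules from the target site with set_url() and read().

```python
from urllib.parse import quote, unquote, urlparse
from urllib.robotparser import RobotFileParser

# URL parsing: split a URL into its components
parts = urlparse("http://httpbin.org/get?key=val#top")
print(parts.scheme, parts.netloc, parts.path, parts.query, parts.fragment)

# URL quoting: percent-encode characters that are not safe in a URL
keyword = "爬虫 demo"          # "爬虫" means "crawler"; any non-ASCII text works
quoted = quote(keyword)
print(quoted)
print(unquote(quoted) == keyword)

# robots.txt parsing: feed the rules in as lines (hypothetical rules)
rules = RobotFileParser()
rules.parse([
    "User-agent: *",
    "Disallow: /private/",
])
allowed_index = rules.can_fetch("my-crawler", "http://example.com/index.html")
allowed_private = rules.can_fetch("my-crawler", "http://example.com/private/data.html")
print(allowed_index, allowed_private)
```

Everything here runs locally; no request is sent.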

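A simple example that sends a request with urllib.request and reads the page content:

import urllib.request  # Import the module

# Open the page to crawl
response = urllib.request.urlopen('http://www.baidu.com')
html = response.read()    # Read the page source as bytes
print(html)               # Print what was read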
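
urlopen() returns raw bytes and raises an exception when a request fails, and the short example does not deal with either. A minimal sketch of a more defensive helper follows. The name fetch_text, the 10-second timeout, and the UTF-8 fallback are assumptions made for this example, and the data: URL is used only so that the demo runs without network access.

```python
import urllib.error
import urllib.request


def fetch_text(url, timeout=10):
    """Return the body of `url` as str, or None if the request fails."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            # Use the charset declared in the response headers, if any
            charset = response.headers.get_content_charset() or "utf-8"
            return response.read().decode(charset, errors="replace")
    except urllib.error.URLError as exc:  # HTTPError is a subclass of URLError
        print("request failed:", exc.reason)
        return None


# A data: URL is handled locally by urllib, so no network is needed here
print(fetch_text("data:text/plain;charset=utf-8,hello"))
```

For a real page, the call would look like fetch_text('http://www.baidu.com').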

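The example above fetches the Baidu home page with a GET request. The next example uses urllib.request to send a POST request and read the response:

import urllib.parse
import urllib.request

# URL-encode the form data with urlencode(), then convert it to UTF-8 bytes
data = bytes(urllib.parse.urlencode({'word': 'hello'}), encoding='utf8')
# Open the page to crawl; passing data makes this a POST request
response = urllib.request.urlopen('http://httpbin.org/post', data=data)
html = response.read()      # Read the response body as bytes
print(html)                 # Print what was read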
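
To make the encoding step of the POST example more concrete, the sketch below builds the same kind of request without sending it. The extra lang field is an assumption, added only to show how non-ASCII values are percent-encoded.

```python
import urllib.parse
import urllib.request

form = {"word": "hello", "lang": "中文"}  # "lang" is a made-up extra field

# urlencode() produces a str; the request body must be bytes
body = urllib.parse.urlencode(form).encode("utf-8")
print(body)

# A Request that carries data is sent as POST; without data it would be GET
request = urllib.request.Request("http://httpbin.org/post", data=body)
print(request.get_method())

# parse_qs() reverses the encoding, which is handy for checking the body
decoded = urllib.parse.parse_qs(body.decode("utf-8"))
print(decoded)
```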

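2) The urllib3 module

urllib3 is a powerful, well-organized HTTP client library for Python, and much of the Python ecosystem already relies on it. It provides many important features that the Python standard library lacks: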

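Thread safety.

Connection pooling.

Client-side SSL/TLS verification.

File uploads with multipart encoding.

Helpers for retrying requests and handling HTTP redirects.

Support for gzip and deflate encoding.

Proxy support for HTTP and SOCKS.

100% test coverage.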

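The following example sends a request with urllib3:

import urllib3

# Create a PoolManager, which handles all the details of connection pooling and thread safety
http = urllib3.PoolManager()
# Send a request to the page to crawl
response = http.request('GET', 'https://www.baidu.com/')
print(response.data)    # Print the response body (bytes)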

To fetch a page with a POST request, the key code is:

# Send a POST request with form fields (reuses the http PoolManager created above)
response = http.request('POST',
                        'http://httpbin.org/post',
                        fields={'word': 'hello'})
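
The feature list above mentions retry helpers and connection pooling. A minimal sketch of configuring both on a PoolManager is shown below. The retry count, back-off factor, status codes, and timeout values are illustrative assumptions, not recommendations from the article. Nothing is sent until http.request() is called.

```python
import urllib3
from urllib3.util import Retry, Timeout

# Retry up to 3 times, backing off between attempts, on common transient errors
retry = Retry(total=3, backoff_factor=0.5, status_forcelist=[500, 502, 503, 504])
# Separate limits (in seconds) for connecting and for reading the response
timeout = Timeout(connect=2.0, read=5.0)

# These settings become the defaults for every request made through this manager
http = urllib3.PoolManager(retries=retry, timeout=timeout)

# Typical usage (left commented out because it needs network access):
# response = http.request('GET', 'http://httpbin.org/get')
# print(response.status, response.data.decode('utf-8'))
```

Setting these once on the manager keeps individual request calls short; both values can still be overridden per request.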

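3) The requests module

requests is a third-party library for making HTTP requests in Python. It is much simpler to use than urllib and has a friendlier API. Because it is not part of the standard library, install it first by running pip install requests. Its features include: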

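Keep-Alive and connection pooling.

International domains and URLs.

Sessions with cookie persistence.

Browser-style SSL verification.

Automatic content decoding.

Basic/Digest authentication.

Elegant key/value cookies.

Automatic decompression.

Unicode response bodies.

HTTP(S) proxy support.

Multipart file uploads.

Streaming downloads.

Connection timeouts.

Chunked requests.

.netrc support.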

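Taking a GET request as an example, the following code prints several pieces of information about the response:

import requests                 # Import the module

response = requests.get('http://www.baidu.com')
print(response.status_code)     # Print the status code
print(response.url)             # Print the requested URL
print(response.headers)         # Print the response headers
print(response.cookies)         # Print the cookies
print(response.text)            # Print the page source as text
print(response.content)         # Print the page source as bytes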
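
The GET example assumes that the request succeeds. A hedged sketch of a more defensive version is shown below: it sets a timeout, turns 4xx/5xx responses into exceptions with raise_for_status(), and catches the base exception class of requests. The helper name fetch_page and the 10-second timeout are assumptions made for this example.

```python
import requests


def fetch_page(url, timeout=10):
    """Return the page text, or None if the request fails for any reason."""
    try:
        response = requests.get(url, timeout=timeout)
        response.raise_for_status()      # Raise HTTPError for 4xx/5xx status codes
    except requests.exceptions.RequestException as exc:
        # Covers invalid URLs, connection errors, timeouts and HTTP errors
        print("request failed:", exc)
        return None
    return response.text


# An invalid URL is rejected before any network traffic, so this returns None
print(fetch_page("not-a-valid-url"))
```

Returning None keeps the calling code simple; a crawler that needs to tell failure types apart could let the exception propagate instead.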

The following example sends an HTTP request with the POST method:

import requests

data = {'word': 'hello'}       # Form parameters
# Send a request to the page to crawl
response = requests.post('http://httpbin.org/post', data=data)
print(response.content)        # Print the response body as bytes

In addition to these two common request methods, requests provides several others:

import requests

requests.put('http://httpbin.org/put', data={'key': 'value'})     # PUT request
requests.delete('http://httpbin.org/delete')                      # DELETE request
requests.head('http://httpbin.org/get')                           # HEAD request
requests.options('http://httpbin.org/get')                        # OPTIONS request

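When the parameters of a request URL follow a "?" (question mark), as in httpbin.org/get?key=val, requests can build that query string for you. Pass the parameters as a dictionary of strings through the params keyword argument. For example, to pass key1=value1 and key2=value2 to httpbin.org/get, use the following code:

import requests

payload = {'key1': 'value1', 'key2': 'value2'}   # Parameters to pass
# Send a request to the page to crawl
response = requests.get("http://httpbin.org/get", params=payload)
print(response.content)                          # Print the response body as bytes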
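
To see exactly which URL the params argument produces without sending anything, the request can be prepared first. This is a small sketch that uses requests.Request and its prepare() method; the variable names are illustrative.

```python
import requests

payload = {'key1': 'value1', 'key2': 'value2'}

# Build the request object and prepare it; nothing is sent over the network
prepared = requests.Request('GET', 'http://httpbin.org/get', params=payload).prepare()
print(prepared.url)   # http://httpbin.org/get?key1=value1&key2=value2
```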

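2. HTML parsing with BeautifulSoup

BeautifulSoup is a Python library for extracting data from HTML and XML files. It provides a few simple functions for navigating, searching, and modifying the parse tree. Its search and extraction features are powerful and convenient, and it commonly saves programmers hours or days of work.

BeautifulSoup automatically converts incoming documents to Unicode and outgoing documents to UTF-8. You do not need to think about encodings unless the document does not declare one and BeautifulSoup cannot detect it; in that case you only need to specify the original encoding.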
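
As a sketch of that last case, the snippet below passes bytes that carry no encoding declaration and names the source encoding through the from_encoding argument. It assumes bs4 is already installed (installation is covered in the next subsection) and uses the built-in html.parser so that no extra parser is needed. The sample bytes are made up.

```python
from bs4 import BeautifulSoup

raw = "<p>café</p>".encode("latin-1")   # bytes with no charset declaration

# Tell BeautifulSoup which encoding the bytes were written in
soup = BeautifulSoup(raw, "html.parser", from_encoding="latin-1")
text = soup.p.string
print(text)

# Output is encoded as UTF-8 regardless of the input encoding
utf8_bytes = soup.encode("utf-8")
print(utf8_bytes)
```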

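1) Installing BeautifulSoup

BeautifulSoup 3 is no longer being developed, and BeautifulSoup 4 is the recommended version. It is now distributed as the bs4 package, so import it with from bs4 import BeautifulSoup. There are three ways to install BeautifulSoup:

(1) On a recent version of Debian or Ubuntu Linux, you can install BeautifulSoup with the system package manager: apt-get install python-bs4.

(2) BeautifulSoup 4 is published on PyPI, so it can be installed with easy_install or pip. The package name is beautifulsoup4, and it works with both Python 2 and Python 3. The install command is easy_install beautifulsoup4 or pip install beautifulsoup4.

(3) If the packaged BeautifulSoup is not the version you want, you can install it from source. Download the source from https://www.crummy.com/software/BeautifulSoup/bs4/download/, open a console in the extracted source directory, and run python setup.py install, as shown in the figure below.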

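BeautifulSoup supports the HTML parser included in the Python standard library, and it also supports a number of third-party Python parsers. One of them is lxml. Depending on your operating system, you can install lxml with one of the following commands:

apt-get install python-lxml

easy_install lxml

pip install lxml

Another parser is html5lib, a Python library that parses HTML the way a web browser does. You can install html5lib with one of the following commands:

apt-get install python-html5lib

easy_install html5lib

pip install html5lib

The table below summarizes the advantages and disadvantages of each parser.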
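
Because the parsers differ mainly in how they repair invalid markup, a quick way to compare them is to feed each one the same broken fragment. The sketch below tries the three parsers named in this section and skips any that are not installed. The fragment <a></p> is an illustrative input, and only html.parser is guaranteed to be available.

```python
from bs4 import BeautifulSoup, FeatureNotFound

broken = "<a></p>"   # invalid: closes a tag that was never opened
results = {}

for parser in ("html.parser", "lxml", "html5lib"):
    try:
        results[parser] = str(BeautifulSoup(broken, parser))
    except FeatureNotFound:
        results[parser] = None   # this parser is not installed

for parser, output in results.items():
    print(parser, "->", output)
```

If the same crawler has to run on several machines, naming the parser explicitly, as done here, avoids getting different trees depending on what happens to be installed.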

2) Using BeautifulSoup

With BeautifulSoup installed, the following steps show how to parse HTML with it:

(1) Import the bs4 library and create a string that holds sample HTML code:

from bs4 import BeautifulSoup      # Import the BeautifulSoup class


# A string that holds the sample HTML code
html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p>
"""
(2) Create a BeautifulSoup object that uses the lxml parser, then print the parsed HTML to the console:

# Create a BeautifulSoup object to parse the page content
soup = BeautifulSoup(html_doc, features="lxml")
print(soup)          # Print the parsed HTML code

The output is shown in the figure.

Note: If the code in the html_doc string is saved in a file named index.html, you can parse it by opening the HTML file, and you can format the output with the prettify() method:

# Open the HTML file to parse and create a BeautifulSoup object from it
with open('index.html', encoding='utf-8') as f:
    soup = BeautifulSoup(f, 'lxml')
print(soup.prettify())        # Print the formatted code
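
Once a soup object exists, its contents can be navigated and searched. The following is a minimal sketch that uses a shortened copy of the sample document. It uses the built-in html.parser instead of lxml so that it runs without extra dependencies, which is an assumption of this example rather than the article's choice.

```python
from bs4 import BeautifulSoup

html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="story">
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>
</p>
</body></html>
"""

soup = BeautifulSoup(html_doc, "html.parser")

title = soup.title.string                             # Text of the <title> tag
links = [a["href"] for a in soup.find_all("a")]       # All link targets
names = [a.get_text() for a in soup.find_all("a", class_="sister")]
third = soup.find(id="link3")                         # Look up a tag by its id

print(title)
print(links)
print(names)
print(third["href"])
```

find_all() returns every match as a list, while find() returns only the first match, or None when nothing matches.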