Using pyquery, a Python library for parsing HTML

Date: 2020-11-27 11:09:08 | Category: Python code

For example, take the following HTML fragment from a Douban movie page:

 
<div id="info">
<span><span class='pl'>导演</span>: <a href="/celebrity/1047989/" rel="v:directedBy">汤姆·提克威</a> / <a href="/celebrity/1161012/" rel="v:directedBy">拉娜·沃卓斯基</a> / <a href="/celebrity/1013899/" rel="v:directedBy">安迪·沃卓斯基</a></span><br/>
<span><span class='pl'>编剧</span>: <a href="/celebrity/1047989/">汤姆·提克威</a> / <a href="/celebrity/1013899/">安迪·沃卓斯基</a> / <a href="/celebrity/1161012/">拉娜·沃卓斯基</a></span><br/>
<span><span class='pl'>主演</span>: <a href="/celebrity/1054450/" rel="v:starring">汤姆·汉克斯</a> / <a href="/celebrity/1054415/" rel="v:starring">哈莉·贝瑞</a> / <a href="/celebrity/1019049/" rel="v:starring">吉姆·布劳德本特</a> / <a href="/celebrity/1040994/" rel="v:starring">雨果·维文</a> / <a href="/celebrity/1053559/" rel="v:starring">吉姆·斯特吉斯</a> / <a href="/celebrity/1057004/" rel="v:starring">裴斗娜</a> / <a href="/celebrity/1025149/" rel="v:starring">本·卫肖</a> / <a href="/celebrity/1049713/" rel="v:starring">詹姆斯·达西</a> / <a href="/celebrity/1027798/" rel="v:starring">周迅</a> / <a href="/celebrity/1019012/" rel="v:starring">凯斯·大卫</a> / <a href="/celebrity/1201851/" rel="v:starring">大卫·吉雅西</a> / <a href="/celebrity/1054392/" rel="v:starring">苏珊·萨兰登</a> / <a href="/celebrity/1003493/" rel="v:starring">休·格兰特</a></span><br/>
<span class="pl">类型:</span> <span property="v:genre">剧情</span> / <span property="v:genre">科幻</span> / <span property="v:genre">悬疑</span><br/>
<span class="pl">官方网站:</span> <a href="http://cloudatlas.warnerbros.com" rel="nofollow" target="_blank">cloudatlas.warnerbros.com</a><br/>
<span class="pl">制片国家/地区:</span> 德国 / 美国 / 香港 / 新加坡<br/>
<span class="pl">语言:</span> 英语<br/>
<span class="pl">上映日期:</span> <span property="v:initialReleaseDate" content="2013-01-31(中国大陆)">2013-01-31(中国大陆)</span> / <span property="v:initialReleaseDate" content="2012-10-26(美国)">2012-10-26(美国)</span><br/>
<span class="pl">片长:</span> <span property="v:runtime" content="134">134分钟(中国大陆)</span> / 172分钟(美国)<br/>
<span class="pl">IMDb链接:</span> <a href="http://www.imdb.com/title/tt1371111" target="_blank" rel="nofollow">tt1371111</a><br>
<span class="pl">官方小站:</span>
<a href="http://site.douban.com/202494/" target="_blank">电影《云图》</a>
</div>

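The following code prints the text of every element whose class is pl:

from pyquery import PyQuery as pq

# Fetch and parse the page (requires network access).
doc = pq(url='http://movie.douban.com/subject/3530403/')

# Select every element with class "pl".
data = doc('.pl')

# Iterating yields lxml elements, so wrap each one in PyQuery again.
for i in data:
    print(pq(i).text())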

Output (for the labels in the fragment above):

导演
编剧
主演
类型:
官方网站:
制片国家/地区:
语言:
上映日期:
片长:
IMDb链接:
官方小站:

Usage

You can use the PyQuery class to load an XML or HTML document from a string, an lxml object, a file, or a URL:

>>> from pyquery import PyQuery as pq
>>> from lxml import etree
>>> doc = pq("<html></html>")
>>> doc = pq(etree.fromstring("<html></html>"))
>>> doc = pq(filename=path_to_html_file)
>>> doc = pq(url='http://movie.douban.com/subject/3530403/')

You can then select elements just as you would in jQuery:

>>> doc('.pl')
[<span.pl>, <span.pl>, <span.pl>, <span.pl>, <span.pl>, <span.pl>, <span.pl>, <span.pl>, <span.pl>, <span.pl>, <span.pl>, <span#rateword.pl>, <span.pl>, <span.pl>, <span.pl>, <span.pl>, <span.pl>, <span.pl>, <span.pl>, <p.pl>]

This selects every element whose class is 'pl'.

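When iterating, however, each item is a raw lxml element and must be wrapped in PyQuery again before text() can be called on it:

for para in doc('.pl'):
    para = pq(para)
    print(para.text())

Output:

导演
编剧
主演
类型:
官方网站:
制片国家/地区:
语言:
上映日期:
片长:
IMDb链接:
官方小站: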

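The text returned here is a Unicode string; to write it to a file, encode it first (for example as UTF-8) or open the file with an explicit encoding.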

You can also use some pseudo-classes that jQuery provides but that are not standard CSS, such as :first:

>>> doc('.pl:first')
[<span.pl>]
>>> print(doc('.pl:first').text())
导演

Attributes

Getting the attributes of an HTML element:

>>> p = pq('<p id="hello" class="hello"></p>')('p')
>>> p.attr('id')
'hello'
>>> p.attr.id
'hello'
>>> p.attr['id']
'hello'

Setting attributes:

>>> p.attr.id = 'plop'
>>> p.attr.id
'plop'
>>> p.attr['id'] = 'ola'
>>> p.attr.id
'ola'
>>> p.attr(id='hello', class_='hello2')
[<p#hello.hello2>]

Traversing

Filtering:

>>> d = pq('<p id="hello" class="hello"><a/>hello</p><p id="test"><a/>world</p>')
>>> d('p').filter('.hello')
[<p#hello.hello>]
>>> d('p').filter('#test')
[<p#test>]
>>> d('p').filter(lambda i: i == 1)
[<p#test>]
>>> d('p').filter(lambda i: i == 0)
[<p#hello.hello>]
>>> d('p').filter(lambda i: pq(this).text() == 'hello')
[<p#hello.hello>]

Selecting by position:

>>> d('p').eq(0)
[<p#hello.hello>]
>>> d('p').eq(1)
[<p#test>]

Selecting nested elements:

>>> d('p').eq(1).find('a')
[<a>]

Going back to the previous selection with end():

>>> d = pq('<p><span><em>Whoah!</em></span></p><p><em> there</em></p>')
>>> d('p').eq(1).find('em')
[<em>]
>>> d('p').eq(1).find('em').end()
[<p>]
>>> d('p').eq(1).find('em').end().text()
'there'
>>> d('p').eq(1).find('em').end().end()
[<p>, <p>]


Article URL: http://www.codeinn.net/misctech/26791.html
