XPath is a language for locating information in XML documents: it lets you traverse a document's elements and attributes. HTML documents have essentially the same tree structure, so XPath works just as well for querying HTML content.
First, a few terms, illustrated by the book XML used in the code example below:
1. book, id, name, price, etc. are all called nodes.
2. id, name, price, and author are child nodes of book.
3. book is the parent node of id, name, price, and author.
4. id, name, price, and author are sibling nodes of one another.
With these basics in place, we can move on to XPath's core syntax.
To use XPath in Python, install the lxml module:
pip install lxml
Usage:
1. Build an etree object from the HTML/XML to be parsed.
2. Call the etree object's xpath() method with an XPath expression to extract the data.
from lxml import etree

html = """
<book>
    <id>1</id>
    <name>野花遍地⾹</name>
    <price>1.23</price>
    <nick>臭⾖腐</nick>
    <author>
        <nick>周⼤强</nick>
        <nick>周芷若</nick>
        <nick>周杰伦</nick>
        <nick>蔡依林</nick>
        <div>
            <nick>胖胖陈</nick>
        </div>
    </author>
    <partner>
        <nick>胖胖不陈</nick>
    </partner>
</book>
"""
et = etree.XML(html)

# Searching by node path:
# result = et.xpath("/book")        # the root book node
# result = et.xpath("/book/id")     # a leading / means the document root; / in the middle means direct child
# result = et.xpath("/book//nick")  # // matches descendants at any depth
result = et.xpath("/book/*/nick")   # * is a wildcard matching any single node
print(result)
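The difference between `/`, `//`, and the `*` wildcard is easy to verify on a tiny standalone document (a minimal sketch, independent of the book example above):

```python
from lxml import etree

# One nick is a grandchild of book; another is nested one level deeper inside a div
doc = etree.XML("<book><author><nick>a</nick><div><nick>b</nick></div></author></book>")

direct = doc.xpath("/book/*/nick/text()")  # * is exactly one step: grandchildren only
deep = doc.xpath("/book//nick/text()")     # //: descendants at any depth

print(direct)  # ['a']
print(deep)    # ['a', 'b']
```

The deeper nick is invisible to `/book/*/nick` because `*` consumes exactly one level, while `//` reaches it.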
How do we extract attribute values with XPath? Let's walk through a small piece of real-world HTML:
Suppose 1.html contains something like the following (the href values and the job div's text are placeholders; the original file's exact contents were lost):

<html>
<body>
    <ul>
        <li><a href="http://www.baidu.com">百度</a></li>
        <li><a href="http://www.google.com">⾕歌</a></li>
        <li><a href="http://www.sogou.com">搜狗</a></li>
    </ul>
    <ol>
        <li><a href="plane">⻜机</a></li>
        <li><a href="cannon">⼤炮</a></li>
        <li><a href="train">⽕⻋</a></li>
    </ol>
    <div class="job">程序员</div>
</body>
</html>
from lxml import etree

tree = etree.parse("1.html")

result = tree.xpath("/html/body/ul/li/a/@href")  # @href pulls the attribute's value
print(result)

result = tree.xpath("/html/body/ul/li")
for li in result:
    print(li.xpath("./a/@href"))  # relative path: search within this li only

result = tree.xpath("//div[@class='job']/text()")
# [@class='xxx'] selects nodes by attribute; text() extracts the text content
print(result)
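The relative-path ("local parsing") idea can be tried without any file on disk (the hrefs here are made up for the sketch):

```python
from lxml import etree

html = etree.HTML("""
<ul>
    <li><a href="/a">one</a></li>
    <li><a href="/b">two</a></li>
</ul>
""")

results = []
for li in html.xpath("//ul/li"):
    # "./" anchors the query at this li element rather than at the document root
    results.append(li.xpath("./a/@href")[0])
print(results)  # ['/a', '/b']
```

Without the leading `.`, the inner query would restart from the document root and return every matching node each time through the loop.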
Practical examples:
Example 1: scraping second-hand housing titles from 58.com
#!/usr/bin/env python
# -*- coding:utf-8 -*-
from lxml import etree
import requests

if __name__ == "__main__":
    headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.25 Safari/537.36 Core/1.70.3884.400 QQBrowser/10.8.4560.400'}
    url = "https://bj.58.com/ershoufang/"
    page_text = requests.get(url, headers=headers).text
    tree = etree.HTML(page_text)
    titles = tree.xpath("//h3/text()")
    for title in titles:
        print(title)
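The live page's markup can change at any time, so here is the same `//h3/text()` pattern on a fixed snippet (the titles are invented):

```python
from lxml import etree

page_text = """
<html><body>
<h3>两室一厅 南向</h3>
<h3>三室两厅 精装</h3>
</body></html>
"""
tree = etree.HTML(page_text)
titles = tree.xpath("//h3/text()")
for title in titles:
    print(title)
```

If the real page nests its titles differently (say `//h3/a/text()`), only the expression changes; the surrounding code stays the same.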
Example 2: downloading wallpapers from netbian (彼岸)
#!/usr/bin/env python
# -*- coding:utf-8 -*-
from lxml import etree
import requests
import os

if __name__ == "__main__":
    headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.25 Safari/537.36 Core/1.70.3884.400 QQBrowser/10.8.4560.400'}
    url = "https://pic.netbian.com/4kfengjing/"
    # The site serves GBK; re-encode the mis-decoded text back into raw bytes for lxml
    page_text = requests.get(url, headers=headers).text.encode("ISO-8859-1")
    tree = etree.HTML(page_text)
    li_list = tree.xpath('//div[@class="slist"]//li')
    if not os.path.exists("./piclibs"):
        os.mkdir('./piclibs')
    for li in li_list:
        img_src = 'https://pic.netbian.com/' + li.xpath('./a/img/@src')[0]
        img_name = li.xpath('./a/img/@alt')[0] + '.jpg'
        img_data = requests.get(url=img_src, headers=headers).content
        img_path = "piclibs/" + img_name
        with open(img_path, "wb") as fp:
            fp.write(img_data)
        print(img_name, "downloaded")
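The `.encode("ISO-8859-1")` call above is a re-encoding trick: the site serves GBK-encoded pages, but requests may fall back to decoding the body as ISO-8859-1, producing mojibake. Since ISO-8859-1 maps every byte one-to-one, encoding the garbled text back recovers the original bytes, which can then be decoded (or parsed) correctly. In isolation:

```python
raw = "风景图片".encode("gbk")      # bytes as a GBK page would send them
garbled = raw.decode("ISO-8859-1")  # the lossless but wrong default decode
fixed = garbled.encode("ISO-8859-1").decode("gbk")  # round-trip back, decode as GBK
print(fixed)  # 风景图片
```

An alternative is to set `response.encoding = "gbk"` before reading `.text`, which avoids the round-trip entirely.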
Example 3: scraping city names nationwide
#!/usr/bin/env python
# -*- coding:utf-8 -*-
from lxml import etree
import requests

if __name__ == "__main__":
    headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.25 Safari/537.36 Core/1.70.3884.400 QQBrowser/10.8.4560.400'}
    url = "https://www.aqistudy.cn/historydata/"
    page_text = requests.get(url, headers=headers).text
    tree = etree.HTML(page_text)

    # All cities
    all_city_list = tree.xpath('//div[@class="bottom"]//li')
    all_city_names = []
    for li in all_city_list:
        city_name = li.xpath("./a/text()")[0]
        all_city_names.append(city_name)
    for city in all_city_names:
        print(city)

    # Hot cities
    # hot_city_names = []
    # hot_city_list = tree.xpath('//div[@class="hot"]//li')
    # for li in hot_city_list:
    #     city_name = li.xpath("./a/text()")[0]
    #     hot_city_names.append(city_name)
    # print(hot_city_names)
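If the hot-city and all-city lists ever need to come from a single query, XPath's `|` operator unions two result sets, returned in document order. A minimal sketch on made-up markup (not the live page's actual structure):

```python
from lxml import etree

page = etree.HTML("""
<div class="hot"><ul><li><a>北京</a></li></ul></div>
<div class="bottom"><ul><li><a>上海</a></li><li><a>广州</a></li></ul></div>
""")

# One expression covering both containers
names = page.xpath('//div[@class="hot"]//li/a/text() | //div[@class="bottom"]//li/a/text()')
print(names)  # ['北京', '上海', '广州']
```

This keeps one loop instead of two, at the cost of losing the hot/all distinction in the result.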