Hands-On Python Crawler Development: The XPath Parsing Library


XPath is a language for finding information in XML documents: it lets you traverse the elements and attributes of an XML document. HTML has essentially the same tree structure, so XPath works just as well for locating content in an HTML page.

First, let's go over a few concepts.


Take an XML document whose root is a book element with children such as id, name, price, and author (the sample document parsed in the code further down). In that document:

1. book, id, name, price, and so on are all called nodes.

2. id, name, price, and author are called child nodes of book.

3. book is called the parent node of id, name, price, and author.

4. id, name, price, and author are called sibling nodes of one another.

OK~ With these basics in place, we can move on to XPath's basic syntax.

To use XPath in Python, install the lxml module:

pip install lxml

Usage:

1. Build an etree object from the HTML/XML content you want to parse.

2. Call the etree object's xpath() method with an XPath expression to extract the data.

from lxml import etree

# Sample XML document (tag structure assumed to match the node relationships described above)
html = """
<book>
    <id>1</id>
    <name>野花遍地⾹</name>
    <price>1.23</price>
    <nick>臭⾖腐</nick>
    <author>
        <nick>周⼤强</nick>
        <nick>周芷若</nick>
        <nick>周杰伦</nick>
        <nick>蔡依林</nick>
        <div>
            <nick>惹了</nick>
        </div>
    </author>
    <partner>
        <nick>胖胖陈</nick>
        <nick>胖胖不陈</nick>
    </partner>
</book>
"""

et = etree.XML(html)
# Searching by node
# result = et.xpath("/book")        # a leading / means the document root
# result = et.xpath("/book/id")     # a / in the middle means a direct child
# result = et.xpath("/book//nick")  # // means any descendant
result = et.xpath("/book/*/nick")   # * is a wildcard matching any child element
print(result)
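
The calls above return lists of lxml Element objects. If you only want the text inside a node, append text() to the expression. A minimal sketch, continuing with the et object built from the (assumed) sample document above:

# text() returns the node's text content instead of Element objects
names = et.xpath("/book/name/text()")       # e.g. ['野花遍地⾹']
all_nicks = et.xpath("/book//nick/text()")  # text of every descendant <nick>
print(names, all_nicks)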

How do we extract attribute values with XPath? Let's walk through a more realistic piece of HTML, saved locally as 1.html. (The attribute values below, such as the href links and class="job", are representative stand-ins; what matters is that they line up with the XPath expressions in the code that follows.)

<html>
<body>
    <ul>
        <li><a href="http://www.baidu.com">百度</a></li>
        <li><a href="http://www.google.com">⾕歌</a></li>
        <li><a href="http://www.sogou.com">搜狗</a></li>
    </ul>
    <ol>
        <li>⻜机</li>
        <li>⼤炮</li>
        <li>⽕⻋</li>
    </ol>
    <div class="job">李嘉诚</div>
    <div class="common">胡辣汤</div>
</body>
</html>
from lxml import etree

tree = etree.parse("1.html")

# @attr extracts an attribute's value
result = tree.xpath("/html/body/ul/li/a/@href")
print(result)

# Grab the <li> elements first, then parse each one locally with a relative path
result = tree.xpath("/html/body/ul/li")
for li in result:
    print(li.xpath("./a/@href"))  # ./ means "relative to this element"

# [@class='xxx'] filters by attribute; text() grabs the text content
result = tree.xpath("//div[@class='job']/text()")
print(result)
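
XPath also supports positional predicates, which help when several siblings match the same path. A brief sketch, again against the assumed 1.html above:

# Indexing is 1-based in XPath; last() selects the final matching sibling
first_link = tree.xpath("/html/body/ul/li[1]/a/text()")    # e.g. ['百度']
last_item = tree.xpath("/html/body/ol/li[last()]/text()")  # e.g. ['⽕⻋']
print(first_link, last_item)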

Practical examples:

Example 1: Second-hand housing listing titles on 58.com

#!/usr/bin/env python
# -*- coding:utf-8 -*-
from lxml import etree
import requests

if __name__ == "__main__":
    headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.25 Safari/537.36 Core/1.70.3884.400 QQBrowser/10.8.4560.400'}
    url = "https://bj.58.com/ershoufang/"
    # Fetch the listing page and build an etree object from the HTML text
    page_text = requests.get(url, headers=headers).text
    tree = etree.HTML(page_text)
    # Listing titles sit inside <h3> tags
    titles = tree.xpath("//h3/text()")
    for title in titles:
        print(title)
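
If you want to keep the results instead of just printing them, a minimal optional continuation (the file name 58_titles.txt is arbitrary) can write the titles list built above to disk:

# Optional continuation: persist the scraped titles, one per line
with open("58_titles.txt", "w", encoding="utf-8") as f:
    for title in titles:
        f.write(title.strip() + "\n")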


Example 2: Downloading wallpapers from pic.netbian.com (彼岸图网)

#!/usr/bin/env python
# -*- coding:utf-8 -*-
from lxml import etree
import requests
import os

if __name__ == "__main__":
    headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.25 Safari/537.36 Core/1.70.3884.400 QQBrowser/10.8.4560.400'}
    url = "https://pic.netbian.com/4kfengjing/"
    # The site is not UTF-8; re-encoding the decoded text back to raw bytes lets
    # lxml pick up the page's own charset and avoids garbled Chinese file names
    page_text = requests.get(url, headers=headers).text.encode("ISO-8859-1")

    tree = etree.HTML(page_text)
    # Each wallpaper thumbnail is an <li> inside the div with class "slist"
    li_list = tree.xpath('//div[@class="slist"]//li')

    # Create the output directory on first run
    if not os.path.exists("./piclibs"):
        os.mkdir('./piclibs')

    for li in li_list:
        # The thumbnail src is a site-relative path, so prepend the domain
        img_src = 'https://pic.netbian.com' + li.xpath('./a/img/@src')[0]
        img_name = li.xpath('./a/img/@alt')[0] + '.jpg'
        # Download the binary image data and write it to disk
        img_data = requests.get(url=img_src, headers=headers).content
        img_path = "piclibs/" + img_name
        with open(img_path, "wb") as fp:
            fp.write(img_data)
            print(img_name, "downloaded successfully")
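
The .encode("ISO-8859-1") trick above works, but an arguably clearer alternative (a sketch, not part of the original script) is to tell requests which encoding the page really uses before reading .text:

# Alternative: let requests decode the page correctly in the first place
resp = requests.get(url, headers=headers)
resp.encoding = resp.apparent_encoding  # or set it explicitly, e.g. "gbk"
tree = etree.HTML(resp.text)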

Example 3: Scraping city names nationwide from aqistudy.cn

#!/usr/bin/env python
# -*- coding:utf-8 -*-
from lxml import etree
import requests

if __name__ == "__main__":
    headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.25 Safari/537.36 Core/1.70.3884.400 QQBrowser/10.8.4560.400'}
    url = "https://www.aqistudy.cn/historydata/"
    page_text = requests.get(url, headers=headers).text

    tree = etree.HTML(page_text)
    # All cities: every <li> under the div with class "bottom"
    all_city_list = tree.xpath('//div[@class="bottom"]//li')
    all_city_names = []
    for li in all_city_list:
        city_name = li.xpath("./a/text()")[0]
        all_city_names.append(city_name)
    # Print the collected names once the loop has finished
    for city in all_city_names:
        print(city)

    # Hot cities: every <li> under the div with class "hot"
    # hot_city_names = []
    # hot_city_list = tree.xpath('//div[@class="hot"]//li')
    # for li in hot_city_list:
    #     city_name = li.xpath("./a/text()")[0]
    #     hot_city_names.append(city_name)
    # print(hot_city_names)
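
The hot-city and all-city lists can also be collected with a single query using XPath's | union operator, which merges the matches of two expressions. A sketch continuing from the tree object above:

# One expression, two node sets: | unions the "hot" and "bottom" <li> matches
a_list = tree.xpath('//div[@class="hot"]//li/a | //div[@class="bottom"]//li/a')
city_names = [a.xpath("./text()")[0] for a in a_list]
print(len(city_names), city_names[:5])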