Python Data Science Tutorial: Reading and Processing HTML Pages with the beautifulsoup Library

2018-9-30

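The beautifulsoup library parses HTML documents. With it, you can look up the values of HTML tags and pull out specific data such as the page title or the list of headings on a page.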

Installing Beautifulsoup
Use the Anaconda package manager to install the package together with its dependencies.

conda install beautifulsoup4
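
A quick way to confirm the installation worked is to import the package and print its version. This is only a minimal check; the variable name installed_version is chosen here for illustration.

```python
import bs4

# The package is imported as "bs4", regardless of the name used to install it
installed_version = bs4.__version__
print("Beautiful Soup version:", installed_version)
```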

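Reading the HTML file

In the example below, we request a URL and load the response into the Python environment. We then pass the 'html.parser' argument to BeautifulSoup to parse the whole HTML document, format it with prettify(), and print the first 225 characters of the result.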

import urllib3
from bs4 import BeautifulSoup

# Fetch the html file
http = urllib3.PoolManager()
response = http.request('GET', 'http://www.yiibai.com/python/features.html')
html_doc = response.data
# Parse the html file
soup = BeautifulSoup(html_doc, 'html.parser')
# Format the parsed html file
strhtm = soup.prettify()
# Print the first 225 characters
print(strhtm[:225])

Running the example above produces the following output:

<!DOCTYPE html>
<!--[if IE 8]><html class="ie ie8"> <![endif]-->
<!--[if IE 9]><html class="ie ie9"> <![endif]-->
<!--[if gt IE 9]><!-->
<html>
 <!--<![endif]-->
 <head>
  <!-- Basic -->
  <meta charset="utf-8"/>
  <title>
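
The same parsing steps can be tried without a network connection. The sketch below assumes a small inline HTML string (made up for illustration, not taken from the page above) in place of the downloaded document; BeautifulSoup accepts a str as well as the bytes returned by response.data.

```python
from bs4 import BeautifulSoup

# Hypothetical inline document standing in for a downloaded page
html_doc = "<html><head><title>Demo page</title></head><body><p>Hello</p></body></html>"

# Parse with Python's built-in HTML parser
soup = BeautifulSoup(html_doc, "html.parser")

# prettify() returns the document as a string with one tag per line, indented
pretty = soup.prettify()
print(pretty[:40])
```

Working from a string like this is convenient for experimenting with selectors before pointing the code at a live URL.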

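Extracting tag values

The following code extracts the value of the first instance of a tag. Accessing a tag name as an attribute of soup (for example soup.title) returns the first matching tag in the document.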

import urllib3
from bs4 import BeautifulSoup
# Fetch the html file
http = urllib3.PoolManager()
response = http.request('GET','http://www.yiibai.com/python/features.html')
html_doc = response.data
# Parse the html file
soup = BeautifulSoup(html_doc, 'html.parser')

print(soup.title)
print(soup.title.string)
print(soup.a.string)
print(soup.b.string)

Running the example above produces the following output:

<title>易百教程™ - 专注于IT教程和实例</title>
易百教程™ - 专注于IT教程和实例
None
友情链接:
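
The None in the output comes from how .string works: it returns text only when a tag has a single child that ultimately resolves to one string. The sketch below illustrates this on a made-up HTML fragment (the markup and variable names are assumptions, not the real page), and shows get_text() as an alternative that joins all nested text.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment: one link wraps an image, the other mixes text and a <b> tag
html_doc = (
    '<p><a href="/home"><img src="logo.png"/></a>'
    '<a href="/docs">Read <b>the docs</b></a></p>'
)
soup = BeautifulSoup(html_doc, "html.parser")

first_link = soup.a                   # first <a> in the document
second_link = soup.find_all("a")[1]   # second <a>

print(first_link.string)       # None: the only child is an <img> with no text
print(second_link.string)      # None: the tag has two children (text and <b>)
print(second_link.get_text())  # Read the docs
print(second_link.b.string)    # the docs
```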

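Extracting all tags

The following code extracts the values of all instances of a tag. find_all() returns every matching tag in the document, so the loop prints one value per h1 element.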

import urllib3
from bs4 import BeautifulSoup
# Fetch the html file
http = urllib3.PoolManager()
response = http.request('GET','https://www.yiibai.com/python/features.html')
html_doc = response.data
# Parse the html file
soup = BeautifulSoup(html_doc, 'html.parser')

for x in soup.find_all('h1'):
    print(x.string)

Running the example above produces the following output:

None
Python功能特点
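
A heading that contains nested tags prints as None when read through .string, which is one plausible explanation for the first line of the output above. As a minimal sketch on a made-up fragment (the markup is an assumption, not the real page), get_text(strip=True) collects the text of every h1, and Tag.get() reads an attribute such as href.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment: the first heading wraps a link with an image and text
html_doc = """
<h1><a href="/"><img src="logo.png" alt="logo"/> Site</a></h1>
<h1>Python Features</h1>
"""
soup = BeautifulSoup(html_doc, "html.parser")

# .string yields None for the nested heading, text for the plain one
raw_strings = [h.string for h in soup.find_all("h1")]

# get_text(strip=True) gathers all nested text and trims surrounding whitespace
headings = [h.get_text(strip=True) for h in soup.find_all("h1")]

# Tag.get() returns an attribute value, or None if the attribute is missing
links = [a.get("href") for a in soup.find_all("a")]

print(raw_strings)
print(headings)
print(links)
```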
