Day 45 - Intermediate+ Web Scraping with Beautiful Soup

sky · 2021-07-14 13:35

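This chapter was presented by Shadow (the original notes are linked here). Sky compiled the summary below with permission.

Time: June 27, 2021, 20:00-20:30
Attendees: Shadow, Dot, Wayne, 玉米, Yeh, Sky
Presenter: Shadow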

Topic

Scraping the Web with BeautifulSoup

Parsing HTML and Making Soup

Related resources

Beautiful Soup documentation

Beautiful Soup Documentation - Beautiful Soup 4.9.0 documentation

Introduction to Beautiful Soup

A Python library for pulling data out of HTML and XML files.

Basics

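How to load HTML into BeautifulSoup

from bs4 import BeautifulSoup
# Read website.html and pass its contents to BeautifulSoup
with open("website.html") as file:
    contents = file.read()
soup = BeautifulSoup(contents, 'html.parser')
# If html.parser cannot handle the page, consider the lxml parser instead (it must be installed separately)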
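
When no website.html file is at hand, the same step can be tried on an inline string. The following is a minimal sketch; `sample_html` is a made-up document used only for illustration and is not part of the original notes.

```python
from bs4 import BeautifulSoup

# sample_html is a made-up document used only for illustration
sample_html = """
<html>
  <head><title>test</title></head>
  <body>
    <h1 id="name">Sky</h1>
    <p>Read <a href="https://example.com/a">first link</a></p>
  </body>
</html>
"""

# BeautifulSoup accepts a string (or an open file object) plus a parser name
soup = BeautifulSoup(sample_html, "html.parser")

print(soup.title.string)  # test
print(soup.h1.string)     # Sky
```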

Getting the title

print(soup.title)         # <title>test</title>
print(soup.title.name)    # title
print(soup.title.string)  # test

Pretty-printing the HTML

print(soup.prettify())  # Re-indents the parsed HTML so it is easier to read

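Using find_all

all_anchor_tags = soup.find_all(name='a')  # Collects every matching tag in a list-like result
# findAll() is an older alias of find_all(), so this can also be written as all_anchor_tags = soup.findAll(name='a')

# Print the text and link of each tag
for tag in all_anchor_tags:
    print(tag.text)  # can also be written as tag.getText()
    print(tag.get("href"))  # the hyperlink URL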
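
To see what find_all returns when some anchors lack an href, here is a small self-contained sketch. The markup in `sample_html` is invented for illustration; only `find_all`, `get`, and `get_text` come from Beautiful Soup itself.

```python
from bs4 import BeautifulSoup

# Made-up markup for illustration only
sample_html = """
<ul>
  <li><a href="https://example.com/one">One</a></li>
  <li><a href="https://example.com/two">Two</a></li>
  <li><a>No link here</a></li>
</ul>
"""
soup = BeautifulSoup(sample_html, "html.parser")

# find_all returns a list-like result; it is empty when nothing matches
all_anchor_tags = soup.find_all(name="a")

# tag.get("href") returns None when the attribute is missing,
# whereas tag["href"] would raise KeyError
links = [(tag.get_text(), tag.get("href")) for tag in all_anchor_tags]
print(links)

# href=True keeps only the anchors that actually carry an href attribute
with_href = soup.find_all("a", href=True)
print(len(with_href))  # 2
```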

Using find (returns only the first match)

# Search by id
heading = soup.find(id="name")

# Search by class (class is a reserved word in Python, hence class_)
section_heading = soup.find(name="h3", class_="heading")
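
Because find returns None when nothing matches, it is usually worth guarding the result before reading from it. A minimal sketch, assuming an invented `sample_html` snippet with two headings of the same class:

```python
from bs4 import BeautifulSoup

# Made-up markup for illustration only
sample_html = """
<h1 id="name">Sky</h1>
<h3 class="heading">Books</h3>
<h3 class="heading">Hobbies</h3>
"""
soup = BeautifulSoup(sample_html, "html.parser")

# Two h3 tags share the class, but find stops at the first one
section_heading = soup.find(name="h3", class_="heading")
print(section_heading.get_text())  # Books

# No element has this id, so find returns None instead of raising
missing = soup.find(id="does-not-exist")
print(missing)  # None

# Guard against None before calling methods on the result
text = missing.get_text() if missing is not None else ""
```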

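Using select_one (returns only the first match); use select when you need every match

# Find a specific anchor with a CSS selector
url = soup.select_one(selector="p a")  # select_one returns the first match
# Find a specific element by id
url = soup.select_one(selector="#name")  # the selector= keyword can be omitted
# Find a specific element by class
heading = soup.select_one(selector=".heading")  # the selector= keyword can be omitted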
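
The difference between select_one and select is easiest to see side by side. The sketch below uses an invented `sample_html` snippet; the selector "p a" matches only anchors nested inside a paragraph.

```python
from bs4 import BeautifulSoup

# Made-up markup for illustration only
sample_html = """
<p><a href="https://example.com/one">One</a> and <a href="https://example.com/two">Two</a></p>
<div><a href="https://example.com/three">Three</a></div>
<h3 class="heading">Books</h3>
"""
soup = BeautifulSoup(sample_html, "html.parser")

# select_one returns the first matching tag (or None when nothing matches)
first = soup.select_one("p a")
print(first["href"])  # https://example.com/one

# select returns every match; the anchor inside <div> is not under a <p>
hrefs = [a["href"] for a in soup.select("p a")]
print(hrefs)

# Class and id selectors work the same way as in CSS
heading = soup.select_one(".heading")
not_found = soup.select_one("#name")
```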

Hacker News exercise (finding the highest-scoring article)

Hacker News
https://news.ycombinator.com/

```
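import requests
from bs4 import BeautifulSoup

res = requests.get("https://news.ycombinator.com/")
res.raise_for_status()  # stop early if the request failed

webpage = res.text

soup = BeautifulSoup(webpage, 'html.parser')
urlList = []
titleList = []

# 'storylink' was the class on title links when these notes were written;
# adjust the selector if the page markup has changed since then
articles = soup.find_all(class_='storylink')
for urlTitle in articles:
    url = urlTitle.get('href')
    urlList.append(url)
    title = urlTitle.text  # same as urlTitle.getText()
    titleList.append(title)

# Each score reads like "123 points"; keep only the number
scoreList = [int(score.text.split()[0]) for score in soup.find_all(class_='score')]

# Look up the title and URL at the same index as the highest score
# (this assumes every story on the page has a score)
maxScore = max(scoreList)
maxScore_index = scoreList.index(maxScore)
print(titleList[maxScore_index], urlList[maxScore_index], maxScore)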
```
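
The code above pairs titles and scores by list position, which goes wrong as soon as one story has no score (a job posting, for example). One possible alternative is to read each title together with the score in the row that follows it. This is only a sketch: `sample_html` is a simplified, made-up stand-in for the page layout the code above assumes, so the selectors may need adjusting for the live site.

```python
from bs4 import BeautifulSoup

# Simplified, made-up rows modeled on the layout the code above assumes:
# one row with the title link, followed by a row that may hold the score.
sample_html = """
<table>
  <tr class="athing"><td><a class="storylink" href="https://example.com/a">Story A</a></td></tr>
  <tr><td class="subtext"><span class="score">120 points</span></td></tr>
  <tr class="athing"><td><a class="storylink" href="https://example.com/job">Job posting</a></td></tr>
  <tr><td class="subtext">2 hours ago</td></tr>
  <tr class="athing"><td><a class="storylink" href="https://example.com/b">Story B</a></td></tr>
  <tr><td class="subtext"><span class="score">345 points</span></td></tr>
</table>
"""
soup = BeautifulSoup(sample_html, "html.parser")

stories = []
for row in soup.find_all("tr", class_="athing"):
    link = row.find(class_="storylink")
    detail_row = row.find_next_sibling("tr")  # the row right after the title
    score_tag = detail_row.find(class_="score") if detail_row else None
    if link is None or score_tag is None:
        continue  # skip entries that have no score
    score = int(score_tag.get_text().split()[0])
    stories.append((score, link.get_text(), link.get("href")))

# Pick the entry with the highest score
top_score, top_title, top_url = max(stories, key=lambda story: story[0])
print(top_title, top_url, top_score)
```

With this sample, the position-based approach would report "Job posting" as the winner, because the second score would line up with the second title; pairing by row avoids that shift.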

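Recommended practices

Use a public API when one is available
Respect the website owner (appending /robots.txt to the site's root URL may show some guidance on what may be crawled)
Limit your scraping rate so the site can keep serving its users normally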
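
The last two points can be sketched with the standard library alone. The example below is an illustration under stated assumptions: the robots.txt rules, the `study-bot` user agent, the `allowed_urls` helper, and the example.com URLs are all made up, and the rules are parsed from a list so the sketch runs offline. For a real site, one would call `parser.set_url(...)` with the site's robots.txt address and then `parser.read()`.

```python
import time
from urllib.robotparser import RobotFileParser

# Made-up robots.txt rules, supplied as lines so no network access is needed
sample_rules = [
    "User-agent: *",
    "Disallow: /private/",
]

parser = RobotFileParser()
parser.parse(sample_rules)


def allowed_urls(urls, user_agent="study-bot", delay=1.0):
    """Return the URLs that robots.txt permits, pausing after each one."""
    allowed = []
    for url in urls:
        if parser.can_fetch(user_agent, url):
            allowed.append(url)  # a real scraper would download the page here
            time.sleep(delay)    # pause so the server is not flooded
    return allowed


sample_urls = [
    "https://example.com/news",
    "https://example.com/private/data",
]
print(allowed_urls(sample_urls, delay=0.1))
```

The one-second default delay is an arbitrary choice for the sketch; a site's own terms or robots.txt may call for a longer pause.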