Web Scraping
The Requests package
BS4
https://github.com/johnny12150/IM711/blob/master/inClass02/0420.ipynb
Advanced usage
https://github.com/johnny12150/IM711/blob/master/inClass03/0504.ipynb
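The notebooks linked above pair Requests with BS4. As a rough sketch of that flow (not taken from the notebooks), the snippet below downloads a page with Requests and collects absolute link URLs with BeautifulSoup. The names `HEADERS`, `fetch_html`, and `extract_links`, the User-Agent string, and the 10-second timeout are all illustrative assumptions.

```python
from urllib.parse import urljoin

import requests
from bs4 import BeautifulSoup

# Hypothetical User-Agent; some sites reject the default one sent by Requests.
HEADERS = {"User-Agent": "Mozilla/5.0 (compatible; example-crawler)"}


def fetch_html(url, timeout=10):
    """Download a page and return its HTML text, raising on HTTP errors."""
    res = requests.get(url, headers=HEADERS, timeout=timeout)
    res.raise_for_status()  # turn 4xx/5xx responses into exceptions
    return res.text


def extract_links(html, base_url):
    """Return every <a href> in the markup as an absolute URL."""
    soup = BeautifulSoup(html, "html.parser")
    # href=True skips <a> tags without an href attribute
    return [urljoin(base_url, a["href"]) for a in soup.find_all("a", href=True)]
```

Under these assumptions, `extract_links(fetch_html(url), url)` would list the links on a page. Keeping the download and the parsing in separate functions lets the parsing step be tried on a saved HTML string without touching the network.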
BeautifulSoup
https://beautifulsoup.readthedocs.io/zh_CN/v4.4.0/
from bs4 import BeautifulSoup
soup = BeautifulSoup(res, 'html.parser')  # res: the HTML markup to parse (str or bytes)
soup.title
# <title>The Dormouse's story</title>
soup.title.name
# 'title'
soup.title.string
# "The Dormouse's story"
soup.title.parent.name
# 'head'
soup.p
# <p class="title"><b>The Dormouse's story</b></p>
soup.p['class']
# ['title']  (class is a multi-valued attribute, so bs4 returns a list)
soup.a
# <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>
soup.find_all('a')
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
# <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
# <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
# Find an element by id
soup.find(id="link3")
# <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
# Print the link of every <a> tag
for link in soup.find_all('a'):
    print(link.get('href'))
# Get all the text content of the page
print(soup.get_text())
find(), find_all(), select()
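The three lookup methods differ mainly in what they return and how the target is described. The sketch below compares them on a small sample document, reconstructed here from the outputs shown earlier; the variable names are illustrative only.

```python
from bs4 import BeautifulSoup

# Sample markup, reconstructed to match the outputs shown earlier
html = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">
<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
</p>
</body></html>
"""
soup = BeautifulSoup(html, "html.parser")

# find(): the first matching tag, or None when nothing matches
first = soup.find("a", class_="sister")
missing = soup.find("table")

# find_all(): a list of every matching tag (empty when nothing matches)
all_links = soup.find_all("a", class_="sister")

# select(): like find_all(), but the target is written as a CSS selector
by_css = soup.select("p.story > a#link2")

# select_one(): the CSS-selector counterpart of find()
one = soup.select_one("a#link3")

print(first["id"], len(all_links), by_css[0].get_text(), one["href"], missing)
```

Because `find()` and `select_one()` return `None` on a miss, it is worth checking the result before indexing into it.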
Selenium / PhantomJS
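import os
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
options = Options()
# Path to the Chrome binary, read from an environment variable
options.binary_location = os.environ.get('GOOGLE_CHROME_BIN')
options.add_argument('--headless')  # run without opening a browser window
options.add_argument('--disable-gpu')
options.add_argument('--no-sandbox')
options.add_argument('--remote-debugging-port=9222')
# executable_path is the Selenium 3 style; recent Selenium 4 releases expect a Service object instead
web = webdriver.Chrome(executable_path=str(os.environ.get('CHROMEDRIVER_PATH')), options=options)
web.get(link)  # link: URL of the page to load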
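Once the browser has loaded a page, a common next step is to hand the rendered HTML to BeautifulSoup. The sketch below assumes a driver configured as above; `rendered_soup` is a hypothetical helper name, and the code uses only the WebDriver `get()` method and `page_source` attribute.

```python
from bs4 import BeautifulSoup


def rendered_soup(web, url):
    """Load url in an already-configured WebDriver and parse the rendered DOM.

    `web` is assumed to be a Selenium WebDriver such as the one created above.
    """
    web.get(url)  # let the browser fetch the page and run its JavaScript
    # page_source holds the DOM as currently rendered by the browser
    return BeautifulSoup(web.page_source, "html.parser")
```

A typical call would be `soup = rendered_soup(web, link)`, followed by `web.quit()` when scraping is finished so the headless Chrome process does not linger. Pages that load content late may also need an explicit wait before `page_source` is read.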
Examples
Scraping UberEats
https://github.com/johnny12150/line-bot-flask/blob/master/app.py#L482
Scraping PTT
https://github.com/johnny12150/line-bot-flask/blob/master/app.py#L99
Scraping TripAdvisor
Fetch the list of users who have stayed at a given hotel
https://github.com/johnny12150/19-lab-summer-training/blob/master/0813/craw_user.ipynb