Introduction to Web Crawling

Web Crawling

The Requests package
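A minimal sketch of fetching a page with Requests, assuming the package is installed. The helper names `build_url` and `fetch_html`, the example.com URL, and the User-Agent value are illustrative assumptions and do not come from the linked notebooks.

```python
import requests


def build_url(base_url, params):
    """Return the URL Requests would send, without touching the network."""
    prepared = requests.Request("GET", base_url, params=params).prepare()
    return prepared.url


def fetch_html(url, params=None, timeout=10):
    """Download a page and return its HTML as text (hypothetical helper)."""
    # Assumed header value; some sites reject the default Requests user agent.
    headers = {"User-Agent": "Mozilla/5.0"}
    res = requests.get(url, params=params, headers=headers, timeout=timeout)
    res.raise_for_status()  # raise an exception on 4xx/5xx responses
    return res.text


# Query parameters are URL-encoded and appended for you.
search_url = build_url("https://example.com/search", {"q": "python"})
print(search_url)

# Network call, shown as a comment so the sketch stays offline:
# html = fetch_html("https://example.com")
```

Passing `timeout` and calling `raise_for_status()` keeps a crawler from hanging on a dead server or silently parsing an error page.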

BS4

https://github.com/johnny12150/IM711/blob/master/inClass02/0420.ipynb

Advanced usage

https://github.com/johnny12150/IM711/blob/master/inClass03/0504.ipynb

BeautifulSoup

https://beautifulsoup.readthedocs.io/zh_CN/v4.4.0/

from bs4 import BeautifulSoup
# res holds the HTML markup as a string, e.g. the .text of a Requests response
soup = BeautifulSoup(res, 'html.parser')

soup.title
# <title>The Dormouse's story</title>

soup.title.name
# 'title'

soup.title.string
# "The Dormouse's story"

soup.title.parent.name
# 'head'

soup.p
# <p class="title"><b>The Dormouse's story</b></p>

soup.p['class']
# ['title']  (class is a multi-valued attribute, so a list is returned)

soup.a
# <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>

soup.find_all('a')
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
#  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
#  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

# Find an element by id

soup.find(id="link3")
# <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>

# Print the link (href) of every <a> tag
for link in soup.find_all('a'):
    print(link.get('href'))

# Get all of the text content
print(soup.get_text())
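The loop above prints each href exactly as it appears in the page, which may be a relative path. A small sketch of collecting absolute links instead, assuming a hypothetical helper `extract_links` and made-up sample markup in place of a downloaded page:

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

# Hypothetical markup standing in for a downloaded page.
html_doc = """
<html><head><title>Sample</title></head>
<body>
<a href="/elsie" id="link1">Elsie</a>
<a href="http://example.com/lacie" id="link2">Lacie</a>
<a id="link3">No href here</a>
</body></html>
"""


def extract_links(html, base_url):
    """Return every link in the page as an absolute URL."""
    soup = BeautifulSoup(html, "html.parser")
    links = []
    # href=True skips <a> tags that have no href attribute at all.
    for a in soup.find_all("a", href=True):
        links.append(urljoin(base_url, a["href"]))
    return links


links = extract_links(html_doc, "http://example.com/")
print(links)
```

`urljoin` leaves absolute URLs untouched and resolves relative ones against the page address, so the result can be fed straight back into the next request.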

find(), find_all(), select()
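A short sketch comparing the three lookup methods. The sample markup is made up for illustration; `select()` and `select_one()` take CSS selectors, while `find()` and `find_all()` take a tag name plus attribute filters.

```python
from bs4 import BeautifulSoup

# Hypothetical markup used only for this comparison.
html_doc = """
<div id="menu">
  <p class="title"><b>Stories</b></p>
  <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>
  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
</div>
"""
soup = BeautifulSoup(html_doc, "html.parser")

# find(): the first matching tag, or None when nothing matches.
first = soup.find("a", class_="sister")
missing = soup.find("table")

# find_all(): a list of every matching tag (empty when nothing matches).
all_links = soup.find_all("a", class_="sister")

# select(): a list of tags matching a CSS selector.
selected = soup.select("div#menu a.sister")

# select_one(): the first CSS match, or None.
one = soup.select_one("#link2")

first_id = first["id"]
names = [a.get_text() for a in all_links]
print(first_id, names, len(selected), one.get_text(), missing)
```

Because `find()` and `select_one()` return None on a miss, check the result before indexing into it when the page layout is not guaranteed.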

Selenium / PhantomJS

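import os

from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.chrome.service import Service

options = Options()
# The Chrome binary and chromedriver paths are read from environment variables
options.binary_location = os.environ.get('GOOGLE_CHROME_BIN')
options.add_argument('--headless')
options.add_argument('--disable-gpu')
options.add_argument('--no-sandbox')
options.add_argument('--remote-debugging-port=9222')
# Selenium 4 takes the driver path through a Service object;
# newer releases no longer accept the executable_path keyword
service = Service(executable_path=str(os.environ.get('CHROMEDRIVER_PATH')))
web = webdriver.Chrome(service=service, options=options)
web.get(link)  # link: URL of the page to load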

Examples

Scraping UberEats

https://github.com/johnny12150/line-bot-flask/blob/master/app.py#L482

Scraping PTT

https://github.com/johnny12150/line-bot-flask/blob/master/app.py#L99

Scraping TripAdvisor

Fetch the list of users who have stayed at a given hotel

https://github.com/johnny12150/19-lab-summer-training/blob/master/0813/craw_user.ipynb
