Confirming the mojibake

For example, the O'Reilly Japan site comes out garbled. Here is a Python script that prints the title of a web page.

import requests
import lxml.html

res = requests.get('https://www.oreilly.co.jp/books/9784873118864/')
root = lxml.html.fromstring(res.content)

print(root.xpath('//title')[0].text)

Running the script prints the following:

O'Reilly Japan - ã¬ã¬ã·ã¼ã³ã¼ãããã®è  ´

The page's title is set to "O'Reilly Japan - レガシーコードからの脱却", and that is exactly what I want printed.

This blog fails too.

'å\x87ºå\x8a\x9bã\x82\x92å\x85¥å\x8a\x9bã\x81¸'

Yahoo! News, on the other hand, works fine.

'Yahoo!ニュース'

Cause

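The cause is, of course, a wrong encoding: the character encoding of the page fetched with Requests is not being identified correctly, so its bytes get decoded with the wrong one. The following post covers this in detail.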

orangain.hatenablog.com
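
The garbled strings above have a recognizable shape: each Japanese character has turned into three Latin-1 characters, which is what UTF-8 bytes look like when decoded as ISO-8859-1. Note also that the script passes res.content (raw bytes) to lxml, so the decoding is left to the parser rather than to Requests. The sketch below uses only the standard library and no network access. It assumes the wrong codec was Latin-1, which is an inference from the output rather than something confirmed here, and uses this blog's title 出力を入力へ as the sample string.

```python
# Reproduce the mojibake: UTF-8 bytes decoded with the wrong codec (Latin-1).
title = '出力を入力へ'

raw = title.encode('utf-8')       # the bytes a UTF-8 page would send
garbled = raw.decode('latin-1')   # wrong codec: every byte becomes one character

print(repr(garbled))

# Latin-1 maps bytes to code points one-to-one, so the damage is reversible.
repaired = garbled.encode('latin-1').decode('utf-8')
print(repaired)
```

The first print should show the same string as the garbled blog title above, and the second should restore the original title.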

Indeed, checking the HTTP responses gives:

$ wget --server-response https://www.oreilly.co.jp https://www.oreilly.co.jp/books/9784873118864/
...
HTTP request sent, awaiting response... 
  HTTP/1.1 200 OK
  Date: Sat, 12 Oct 2019 16:42:53 GMT
  Content-Type: text/html
  Content-Length: 23094
  Connection: keep-alive
  Server: Apache
  Last-Modified: Thu, 10 Oct 2019 05:06:47 GMT
  ETag: "5a36-594875e4b2e01"
  Accept-Ranges: bytes
  Vary: Accept-Encoding

...

$ wget --server-response https://thaim.hatenablog.jp/ 
...
HTTP request sent, awaiting response... 
  HTTP/1.1 200 OK
  Server: nginx
  Date: Sat, 12 Oct 2019 16:25:25 GMT
  Content-Type: text/html; charset=utf-8
  Transfer-Encoding: chunked
  Connection: keep-alive
  Vary: Accept-Encoding
  Vary: User-Agent, X-Forwarded-Host, X-Device-Type
  Access-Control-Allow-Origin: *
  Content-Security-Policy-Report-Only: block-all-mixed-content; report-uri https://blog.hatena.ne.jp/api/csp_report
  P3P: CP="OTI CUR OUR BUS STA"
  X-Cache-Only-Varnish: 1
  X-Content-Type-Options: nosniff
  X-Dispatch: Hatena::Epic::Web::Blogs::Index#index
  X-Frame-Options: DENY
  X-Page-Cache: hit
  X-Revision: a7694746800267be0e2d318311d7b13e
  X-XSS-Protection: 1
  X-Runtime: 0.042860
  X-Varnish: 34747135
  Age: 0
  Via: 1.1 varnish-v4
  X-Cache: MISS
  Cache-Control: private
  Accept-Ranges: bytes
...

$ wget --server-response https://news.yahoo.co.jp 
...
HTTP request sent, awaiting response... 
  HTTP/1.1 200 OK
  Cache-Control: private, no-cache, no-store, must-revalidate
  Content-Type: text/html;charset=UTF-8
  Date: Sat, 12 Oct 2019 16:26:09 GMT
  Set-Cookie: B=1qlmjdleq3vl1&b=3&s=l0; expires=Tue, 12-Oct-2021 16:26:09 GMT; path=/; domain=.yahoo.co.jp
  Vary: Accept-Encoding
  X-Content-Type-Options: nosniff
  X-Download-Options: noopen
  X-Frame-Options: DENY
  X-Vcap-Request-Id: 9826dce8-3f25-4d7b-7749-c0e9fdcd8fba
  X-Xss-Protection: 1; mode=block
  Age: 0
  Server: ATS
  Transfer-Encoding: chunked
  Connection: keep-alive
  Via: http/1.1 edge2502.img.umd.yahoo.co.jp (ApacheTrafficServer [c sSf ])
  Set-Cookie: XB=1qlmjdleq3vl1&b=3&s=l0; expires=Sat, 19-Oct-2019 16:26:09 GMT; path=/; domain=.yahoo.co.jp; secure; samesite=none
...

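The O'Reilly response has no charset in its Content-Type, which would explain that case. But Hatena Blog actually sets Content-Type properly and still fails. I wondered whether specifying the charset in lowercase was the cause, but according to the RFC either case is fine.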

tools.ietf.org
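
As a quick check on the case question, the sketch below parses the three Content-Type values from the responses above with the standard library's email.message.Message, and confirms that Python's codec lookup treats UTF-8 and utf-8 as the same codec. The helper name charset_of is made up for illustration. This is not how Requests parses headers internally; it only illustrates that the charset parameter is case-insensitive and that the O'Reilly response carries none.

```python
import codecs
from email.message import Message


def charset_of(content_type):
    """Return the charset parameter of a Content-Type value, or None."""
    msg = Message()
    msg['Content-Type'] = content_type
    return msg.get_content_charset()  # lowercased by the email package


# Content-Type values copied from the wget output above.
headers = {
    'oreilly': 'text/html',
    'hatena': 'text/html; charset=utf-8',
    'yahoo': 'text/html;charset=UTF-8',
}
charsets = {site: charset_of(value) for site, value in headers.items()}
print(charsets)

# Codec names are matched case-insensitively.
same_codec = codecs.lookup('UTF-8').name == codecs.lookup('utf-8').name
print(same_codec)
```

According to the Requests documentation, when a text Content-Type has no charset, res.encoding falls back to ISO-8859-1. That affects res.text, but not the raw bytes in res.content.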

Requests also guesses the encoding with chardet, so I checked that as well, but both the blog and Yahoo! News are recognized as UTF-8.

>>> import requests, lxml.html
>>> res = requests.get('https://thaim.hatenablog.jp/')
>>> res.encoding
'utf-8'
>>> res.apparent_encoding
'utf-8'
>>> res = requests.get('https://news.yahoo.co.jp/')
>>> res.encoding
'UTF-8'
>>> res.apparent_encoding
'utf-8'

I am stumped on this one: I could not work out why Yahoo! News works but Hatena Blog does not.

Workaround

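Whatever the root cause, the bytes just need to be processed with the correct encoding. To get past this problem, it is enough to decode them with an explicit encoding before handing them to lxml for scraping.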

A script that behaves as intended:

import requests
import lxml.html

res = requests.get('https://www.oreilly.co.jp/books/9784873118864/')
root = lxml.html.fromstring(res.content.decode('utf-8'))

print(root.xpath('//title')[0].text)

UTF-8 is hard-coded here, but you could also pass res.apparent_encoding, or follow the encoding declared in the HTML if there is one.
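
One way to combine those options is sketched below: prefer a charset from the HTTP header, then a meta declaration near the top of the HTML, and finally fall back to UTF-8. The function name decode_html, the 2048-byte scan window, and the regular expression are assumptions for illustration, not part of Requests or lxml, and the pattern is deliberately loose rather than a full implementation of HTML encoding sniffing.

```python
import codecs
import re

# Loose match for <meta charset="..."> and
# <meta http-equiv="Content-Type" content="text/html; charset=...">.
META_CHARSET = re.compile(
    rb'<meta[^>]*?charset\s*=\s*["\']?\s*([A-Za-z0-9_.:-]+)', re.IGNORECASE)


def decode_html(raw, header_charset=None, default='utf-8'):
    """Decode HTML bytes using the header charset, a <meta> charset, or a default."""
    candidates = [header_charset]
    match = META_CHARSET.search(raw[:2048])
    if match:
        candidates.append(match.group(1).decode('ascii'))
    candidates.append(default)

    for name in candidates:
        if not name:
            continue
        try:
            codecs.lookup(name)  # skip names Python does not know
        except LookupError:
            continue
        try:
            return raw.decode(name)
        except UnicodeDecodeError:
            continue  # the declared encoding does not fit the bytes
    return raw.decode(default, errors='replace')


# Sample page with no charset in the header, as with the O'Reilly response.
page = ('<html><head><meta charset="utf-8">'
        '<title>出力を入力へ</title></head></html>').encode('utf-8')
decoded = decode_html(page)
print(decoded)
```

With Requests, this could be called as decode_html(res.content, header_charset), where header_charset is passed only when the Content-Type header actually declares one. A guessed ISO-8859-1 decodes any byte sequence without error, so passing it would reproduce the mojibake.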

出力を入力へ

A blog that mostly collects my own thoughts about programming

Dealing with mojibake on websites fetched with Requests
