醉裡挑燈看Code: Parsing My Google Play Books List with Python

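Today, while parsing my Google Play Books list, I picked up one more Python regex trick: making a later part of the pattern match exactly what an earlier named group captured (a named backreference).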

So why did I need this trick?

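1. Google's HTML contains no newline characters.

2. With no newlines, a greedy match runs as far as it can, so I get one huge string back instead of one match per book.

3. The title attribute happens to hold the same string as the element's text, so requiring the two to be equal works around the greedy problem.

4. Purely because I'm lazy and didn't want to handle it the C way (locate with strstr -> ptr++ -> loop).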
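Points 1 and 2 are easy to reproduce on a tiny sample. The sketch below is only an illustration: the two book links and their ids are made up, and the pattern is a simplified one, not the one used later in this post.

```python
import re

# Hypothetical one-line sample: two book links with no newline between them.
sample = (
    '<a class="title" href="/books/reader?id=AAAAAAAAAAAA" title="Book A"> Book A </a>'
    '<a class="title" href="/books/reader?id=BBBBBBBBBBBB" title="Book B"> Book B </a>'
)

# Greedy .* runs to the LAST '">' on the line, so both books land in one match.
greedy = re.findall(r'title="(.*)">', sample)

print(len(greedy))  # number of matches found
print(greedy[0])    # the single oversized capture
```

Instead of one capture per book, everything from the first title up to the last title comes back as a single string.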

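Note
For now, the id part of the pattern only covers the character combinations that appear in the ids of the books I have bought. If a character outside that set shows up, just add it to the character class by hand ([0-9a-zA-Z\-_]).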
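One way to notice when the character class needs extending is to pull the ids out with a looser pattern and check each one against the strict one. This is just a sketch: the helper name `unexpected_ids` and the second sample id (with a dot in it) are invented for illustration.

```python
import re

# The id character class from the note; fullmatch() below requires the whole id to fit it.
KNOWN_ID = re.compile(r'[0-9a-zA-Z\-_]{12}')

def unexpected_ids(html):
    """Return ids from reader links that the known id pattern would not match."""
    # Permissive capture: everything between 'id=' and the closing quote.
    ids = re.findall(r'href="/books/reader\?id=([^"]*)"', html)
    return [book_id for book_id in ids if not KNOWN_ID.fullmatch(book_id)]

# Hypothetical sample: the first id is the one from this post, the second is made up.
sample = (
    '<a href="/books/reader?id=jrlyDwAAQBAJ" title="x"> x </a>'
    '<a href="/books/reader?id=abc.efghijkl" title="y"> y </a>'
)
print(unexpected_ids(sample))
```

Any id this returns contains a character (or has a length) the pattern does not cover yet.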

Enough talk, here is the code. The key part is "(?P=TITLE)", which makes the later part of the pattern match whatever the earlier named group TITLE captured.


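import re

'''
HTML example
<a _ngcontent-c9="" class="title" href="/books/reader?id=jrlyDwAAQBAJ" title="DRAGON BALL超 七龍珠超 (1)"> DRAGON BALL超 七龍珠超 (1) </a>
'''

# A raw string keeps the backslashes intact for the regex engine.
# (?P<TITLE>...) names the captured title; (?P=TITLE) must match that same text again.
pattern = r'href="/books/reader\?id=[0-9a-zA-Z\-_]{12}" title="(?P<TITLE>.*)"> (?P=TITLE) </a><!---->'

# content is assumed to already hold the page's HTML source as a single string.
res = re.findall(pattern, content)

# With exactly one group in the pattern, findall returns a list of the captured titles.
for book in res:
    print(book)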
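To try the backreference without a saved page, here is a self-contained sketch. The sample is an assumption: it reuses the HTML example above with a trailing <!----> appended (the pattern expects one), plus a made-up second book. The extra named group ID is my addition for illustration, to show that finditer lets you read several named groups per match.

```python
import re

# Same idea as the pattern above, with a hypothetical extra named group for the id.
PATTERN = re.compile(
    r'href="/books/reader\?id=(?P<ID>[0-9a-zA-Z\-_]{12})" '
    r'title="(?P<TITLE>.*)"> (?P=TITLE) </a><!---->'
)

# Two links on one line, no newline in between (the second book is made up).
sample = (
    '<a class="title" href="/books/reader?id=jrlyDwAAQBAJ" '
    'title="DRAGON BALL超 七龍珠超 (1)"> DRAGON BALL超 七龍珠超 (1) </a><!---->'
    '<a class="title" href="/books/reader?id=AAAAAAAAAAAA" '
    'title="Sample Book"> Sample Book </a><!---->'
)

# The greedy .* first overshoots, then backtracks until the text after '"> '
# equals the captured title, which only happens at each book's own title.
books = [(m.group("ID"), m.group("TITLE")) for m in PATTERN.finditer(sample)]
for book_id, title in books:
    print(book_id, title)
```

Each book comes back as its own (id, title) pair even though the whole sample is a single line.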

Monday, April 15, 2019
