Channel: Grey Lee » appengine

Parsing HTML on Google App Engine

April 7, 2011, 7:14 am


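There are many libraries for parsing XML/HTML, but most of them only support well-formed XML and cannot parse malformed input; the parsers in the Standard Library all fall into this category. The libraries that do accept malformed XML are: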

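lxml: reputedly the best XML parser for Python at the moment, but it is not written in pure Python, so it cannot be used on Google App Engine¹. Someone has opened a ticket asking Google to install it, but its priority is not high.
html5lib: a library that implements the HTML5 parsing standard. For querying elements it can hand off to existing libraries such as Beautiful Soup, ElementTree, and lxml, and it also provides a new format of its own, simpletree. simpletree aims to be a bare-minimum implementation with no plans for extension, so querying it is slow and laborious.
Beautiful Soup: a long-established XML parser with its own way of querying elements.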
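
To make the well-formed versus malformed distinction concrete, here is a minimal sketch using the Standard Library's `xml.etree.ElementTree`. The sample strings and the helper name `try_parse` are made up for illustration; the code is written for current Python, though the same calls should also exist in Python 2.7.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample inputs, not taken from any real page.
WELL_FORMED = "<ul><li>one</li><li>two</li></ul>"
# Typical tag soup: unclosed <li> and <br>, which browsers tolerate.
MALFORMED = "<ul><li>one<li>two<br></ul>"


def try_parse(markup):
    """Return the root tag if markup is well-formed XML, else None."""
    try:
        return ET.fromstring(markup).tag
    except ET.ParseError:
        # A strict XML parser gives up at the first mismatched tag.
        return None


print(try_parse(WELL_FORMED))  # ul
print(try_parse(MALFORMED))    # None
```

The second input is the kind of markup that a tolerant parser such as the ones listed above is expected to repair rather than reject.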

So the possible solutions are:

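Beautiful Soup: this approach has the fewest dependencies. Beautiful Soup's own query methods are actually quite usable, but because queries are often written as regular expressions, the maintenance cost may be higher when the HTML layout changes.
html5lib + ElementTree: here html5lib turns the raw HTML into a well-formed structure², and ElementTree is then used to query it. ElementTree's query syntax closely resembles XPath³ and is pleasant to use.
Beautiful Soup + ElementTree + ElementSoup: here Beautiful Soup turns the raw HTML into a well-formed structure, and ElementTree is then used to query it. ElementSoup helps by converting the Beautiful Soup structure into an ElementTree one.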
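
As a rough illustration of the XPath-like querying that the two ElementTree-based options rely on, the sketch below runs path queries with the Standard Library's `xml.etree.ElementTree`. It assumes the markup has already been repaired into well-formed form by a tolerant parser; the `TIDY` document and all names in it are invented for the example, and the repair step itself is not shown.

```python
import xml.etree.ElementTree as ET

# Assumption: this stands in for the already-repaired, well-formed output
# of a tolerant parser. The content is made up for illustration.
TIDY = """
<html>
  <body>
    <div id="posts">
      <h2>First</h2>
      <p>Hello <a href="/a">link A</a></p>
      <h2>Second</h2>
      <p>World <a href="/b">link B</a></p>
    </div>
  </body>
</html>
""".strip()

root = ET.fromstring(TIDY)

# Child path from the root element: every <h2> directly under body/div.
titles = [h2.text for h2 in root.findall("./body/div/h2")]

# Descendant path: every <a> anywhere below the root.
hrefs = [a.get("href") for a in root.findall(".//a")]

# find() returns only the first match (or None if nothing matches).
first_par = root.find(".//p")

print(titles)  # ['First', 'Second']
print(hrefs)   # ['/a', '/b']
```

Because the queries are expressed as paths rather than regular expressions, a change in the page layout usually means adjusting a path string instead of rewriting a pattern.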

For now I am using Beautiful Soup. In the end I decided to parse locally and then upload the result… XD

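¹ Google App Engine SDK 1.5.5 and later add support for Python 2.7; if you specify the Python 2.7 runtime, you can use the lxml provided by the system.
² html5lib works by processing the raw data itself and building a DOM or ElementTree structure, which is then queried through DOM or ElementTree, so parsing is no faster than with Beautiful Soup.
³ ElementTree before version 1.3 does not support querying by attribute, and Python only updated its bundled ElementTree to 1.3 in version 2.7. However, you can take the latest ElementTree from the Python SVN repository and deploy it to Google App Engine; the drawback is that it is not as fast as cElementTree.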
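
The attribute queries mentioned in the last note can be sketched as follows, using the Standard Library's `xml.etree.ElementTree`. The sample document and variable names are hypothetical. The first two queries need ElementTree 1.3 or later; the last one shows a by-hand filter that does not depend on predicate support.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample document for illustration.
DOC = (
    "<div>"
    "<a class='nav' href='/home'>Home</a>"
    "<a class='ext' href='http://example.com/'>Example</a>"
    "<a href='/plain'>Plain</a>"
    "</div>"
)
root = ET.fromstring(DOC)

# ElementTree 1.3+: attribute predicate written directly in the path.
with_predicate = [a.get("href") for a in root.findall(".//a[@class='ext']")]

# ElementTree 1.3+: match any <a> that has a class attribute at all.
has_class = root.findall(".//a[@class]")

# Without predicate support: select by tag, then filter attributes by hand.
by_hand = [a.get("href") for a in root.findall(".//a") if a.get("class") == "ext"]

print(with_predicate)  # ['http://example.com/']
print(len(has_class))  # 2
print(by_hand)         # ['http://example.com/']
```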

The post Parsing HTML on Google App Engine appeared first on Grey Lee.
