醉裡挑燈看Code: Parse pure text from EPUB.

Saturday, February 1, 2020

Parse pure text from EPUB.

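Lately, while running tests, I often have to keep my eyes on the screen or I will miss some of the output. The computer can't be used for anything else during that time, but staring at the console is boring, and it isn't convenient to read other technical documents either.

So I decided to write a small program that extracts the plain text from a novel's EPUB. That way I can keep a tiny window open next to the console without interfering with the tests.

At first I just wanted to do it in plain C, but EPUB content is UTF-8 encoded, which makes C a poor fit for the job.

I also thought about using this as a chance to get familiar with Python, but I didn't want to spend too much time on it. After all, this is only to kill the boredom while tests run; what matters is still my test results.

After some thought I went with Node.js, since our project's back end is all written in it, even though I'm not that familiar with it either XD

In less than half an hour I had a program written in a C-like style, and it works well enough for me.

Still, I have never properly used real parsing tools; I have always treated HTML as a raw string and parsed it by brute force. It feels like everyone else is already in space while I'm still learning to walk?

After a quick search I decided to use the parse5 module. The point is still the testing, so there is no need to waste too much time on this.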

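Below are the test programs and their timings. The test file is 10,303 bytes:

parse5 - Parsing the HTML into a tree and walking the tree takes 29 ms in total; walking the tree alone takes only 11 ms.
Regex + string functions - The regex time cannot be separated out; the whole run takes 14 ms.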
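
The timings above come from millisecond-resolution timestamps, which is fairly coarse for runs this short. As a rough sketch only, Node's built-in process.hrtime.bigint() gives a nanosecond counter that can be wrapped like this. The helper name timeIt and the sample input are made up for illustration and are not part of the test programs.

```javascript
// Hypothetical helper (not from the original programs): runs fn once and
// returns its result together with the elapsed time in milliseconds.
function timeIt(fn) {
    const start = process.hrtime.bigint();        // monotonic clock, in nanoseconds
    const result = fn();
    const elapsedNs = process.hrtime.bigint() - start;
    return { result, ms: Number(elapsedNs) / 1e6 };
}

// Example: time a trivial tag-stripping call on a made-up string.
const { result, ms } = timeIt(() => 'a<b>c'.replace(/<[^>]*>/g, ''));
console.log(result, ms.toFixed(3), 'ms');
```

For a single run of a few milliseconds the numbers will still vary from run to run, so they are best read as rough indications.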

Parse5 Example


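const fs = require('fs');
const parse5 = require('parse5');

// Text collected since the last <p> was printed.
let s = '';

// Post-order walk: children are visited first, so by the time a <p> node
// itself is handled, all of the text inside it has been appended to s.
function parseTree(root) {
    if (Array.isArray(root.childNodes)) {
        for (const child of root.childNodes) {
            parseTree(child);
        }
    }

    if (root.tagName === 'p') {
        // End of a paragraph: print what was collected and start over.
        console.log(s);
        s = '';
    } else if (root.value) {
        // Only text nodes carry a value property.
        s += root.value;
    }
}

const content = fs.readFileSync('p-06.xhtml', {encoding: 'utf8'});

const t1 = Date.now();

const document = parse5.parse(content);

const t2 = Date.now();

parseTree(document);

const t3 = Date.now();

// First number: parse + walk. Second number: walk only.
console.log('\nUsed ms :', t3 - t1, t3 - t2);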
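
As a variation on the walk above, here is a minimal sketch that returns the paragraphs as an array instead of printing through a shared variable. It assumes nodes shaped like parse5's default tree (childNodes and tagName on elements, nodeName '#text' and value on text nodes). The function names and the hand-built sample tree are made up so the sketch runs without the module installed. Unlike the program above, it drops text that sits outside any p element.

```javascript
// Returns the concatenated text of every text node under node.
function textOf(node) {
    if (node.nodeName === '#text') return node.value;
    return (node.childNodes || []).map(textOf).join('');
}

// Collects the text of each <p> element, in document order.
function paragraphs(node, out = []) {
    if (node.tagName === 'p') {
        out.push(textOf(node));
    } else {
        for (const child of node.childNodes || []) paragraphs(child, out);
    }
    return out;
}

// Hand-built stand-in for a parsed document (an assumption, for illustration only).
const tree = { nodeName: 'body', tagName: 'body', childNodes: [
    { nodeName: 'p', tagName: 'p', childNodes: [
        { nodeName: '#text', value: 'Hello ' },
        { nodeName: 'b', tagName: 'b', childNodes: [
            { nodeName: '#text', value: 'world' } ] } ] },
    { nodeName: '#text', value: '\n' },
    { nodeName: 'p', tagName: 'p', childNodes: [
        { nodeName: '#text', value: 'Bye' } ] } ] };

console.log(paragraphs(tree)); // [ 'Hello world', 'Bye' ]
```

With the real module, the result of parse5.parse(content) would be passed in place of the sample tree.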

Regex Example


const fs = require('fs');

// Removes every <...> tag from str and returns the remaining text.
function arrangeStr(str) {
    let newStr = '';

    while (true) {
        const begin = str.indexOf('<');
        if (begin === -1) {
            newStr += str;
            break;
        }
        newStr += str.substring(0, begin);
        const end = str.indexOf('>', begin);
        if (end === -1) {
            // Unterminated tag: drop the rest instead of looping forever.
            break;
        }
        str = str.substring(end + 1);
    }
    return newStr;
}

const content = fs.readFileSync('p-06.xhtml', {encoding: 'utf8'});

const t1 = Date.now();

// Greedy and line-based ('.' does not match newlines): captures everything
// between the first '>' and the last '<' on each line.
const regex = />(.+)</g;

let info;
while ((info = regex.exec(content)) !== null) {
    console.log(arrangeStr(info[1]));
    console.log('');
}

console.log('\nUsed ms :', Date.now() - t1);
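
One difference between the two approaches worth keeping in mind: an HTML parser such as parse5 decodes character references like &amp;amp; in text nodes, while plain tag stripping leaves them in the output as-is. A rough sketch of a decoder for the string-based version follows. The function name decodeEntities and the short table of named references are assumptions for illustration; the table covers only a handful of names, not the full HTML list.

```javascript
// Small, deliberately incomplete table of named references (an assumption).
const NAMED = { amp: '&', lt: '<', gt: '>', quot: '"', apos: "'", nbsp: '\u00a0' };

// Replaces &name;, &#123; and &#x1F; style references; unknown ones are left untouched.
function decodeEntities(text) {
    return text.replace(/&(#[xX][0-9a-fA-F]+|#[0-9]+|[a-zA-Z]+);/g, (match, body) => {
        if (body[0] === '#') {
            const isHex = body[1] === 'x' || body[1] === 'X';
            const code = isHex ? parseInt(body.slice(2), 16) : parseInt(body.slice(1), 10);
            // Code points beyond the Unicode range are left as written.
            return code <= 0x10ffff ? String.fromCodePoint(code) : match;
        }
        return Object.prototype.hasOwnProperty.call(NAMED, body) ? NAMED[body] : match;
    });
}

console.log(decodeEntities('Tom &amp; Jerry &#x4E2D;&#25991; &unknown;'));
// Tom & Jerry 中文 &unknown;
```

It would be applied to the string returned by arrangeStr, after the tags have been removed.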
