Extractors

An extractor matches content against a given rule and extracts data from it.

Interface definition

IExtractor

interface IExtractor<T, I, O> {
    extractOne(selector: T, content: I): O;
    extractAll(selector: T, content: I): O[];
}
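
Because every extractor shares this interface, calling code can stay independent of the concrete extractor. The sketch below only illustrates that idea: `extractFields` and `PrefixWordExtractor` are hypothetical names that are not part of nula-extractor, and the interface is redeclared locally so the snippet is self-contained.

```typescript
// Local copy of the interface shown above, so the snippet is self-contained.
interface IExtractor<T, I, O> {
    extractOne(selector: T, content: I): O;
    extractAll(selector: T, content: I): O[];
}

// Hypothetical helper: applies a "field name -> selector" map with any extractor.
function extractFields<T, I, O>(
    extractor: IExtractor<T, I, O>,
    rules: Record<string, T>,
    content: I,
): Record<string, O> {
    const result: Record<string, O> = {};
    for (const field of Object.keys(rules)) {
        result[field] = extractor.extractOne(rules[field], content);
    }
    return result;
}

// Toy extractor used only for this demo: the selector is a prefix, and the
// matches are the whitespace-separated words that start with that prefix.
class PrefixWordExtractor implements IExtractor<string, string, string | null> {
    extractAll(selector: string, content: string): string[] {
        return content.split(/\s+/).filter((word) => word.indexOf(selector) === 0);
    }

    extractOne(selector: string, content: string): string | null {
        const all = this.extractAll(selector, content);
        return all.length > 0 ? all[0] : null;
    }
}

// fields is { a: 'alpha', b: 'beta' }
const fields = extractFields(new PrefixWordExtractor(), { a: 'al', b: 'be' }, 'alpha beta alto');
```

Any of the built-in extractors described below could presumably be passed to a helper like this in place of the toy class, as long as the selector and content types line up.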

Built-in extractors

CSSExtractor

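CSSExtractor extracts data using CSS selectors. It is built on the cheerio library. Because cheerio's selector implementation is almost identical to jQuery's, nearly all jQuery selectors can be used directly.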

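CSS selectors match elements and cannot extract an element's attributes on their own, so CSSExtractor extends the selector syntax with @property (where property is an attribute name) and @outerHTML. See the example for usage.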

Usage example

import { CSSExtractor } from 'nula-extractor';

const html = `
<h1>这是标题</h1>
<author>LiesAuer</author>
<div id="meta">
    <div id="time" timestamp="1660392672">2022-08-13 20:11:12</div>
</div>
<div id="content">test</div>
`;

const extractor = new CSSExtractor();

const title = extractor.extractOne('h1', html);
const author = extractor.extractOne('author', html);

// Extract the outerHTML of the element
const timeHtml = extractor.extractOne('#time @outerHTML', html);
const date = extractor.extractOne('#time', timeHtml);
// Extract the timestamp attribute
const timestamp = extractor.extractOne('#time @timestamp', timeHtml);

const content = extractor.extractOne('div#content', html);

console.log(JSON.stringify({
    title,
    author,
    timeHtml,
    date,
    timestamp,
    content,
}, null, 4));

Output

{
    "title": "这是标题",
    "author": "LiesAuer",
    "timeHtml": "<div id=\"time\" timestamp=\"1660392672\">2022-08-13 20:11:12</div>",
    "date": "2022-08-13 20:11:12",
    "timestamp": "1660392672",
    "content": "test"
}
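
To make the `@property` / `@outerHTML` extension more concrete, here is a minimal sketch of how such a rule could be split into a plain CSS selector and an attribute name. This is not the library's actual parser: `parseRule` and `ParsedRule` are hypothetical, and the sketch assumes the `@name` part is always the last token and is preceded by whitespace.

```typescript
interface ParsedRule {
    selector: string;          // plain CSS selector, e.g. '#time'
    attribute: string | null;  // attribute name, 'outerHTML', or null if absent
}

// Hypothetical parser: splits 'selector @name' into its two parts.
function parseRule(rule: string): ParsedRule {
    const match = /^(.+?)\s+@([\w:-]+)\s*$/.exec(rule);
    if (match === null) {
        // No '@name' suffix: the whole rule is an ordinary selector.
        return { selector: rule.trim(), attribute: null };
    }
    return { selector: match[1], attribute: match[2] };
}

const attributeRule = parseRule('#time @timestamp');
const plainRule = parseRule('div#content');
```

With the parts separated, the selector can be handed to the selector engine, and the attribute name decides whether to read the element's text, one of its attributes, or its outer HTML.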

XPathExtractor

XPathExtractor extracts data using XPath expressions. It is built on the jsdom and xpath-ts libraries.

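Like CSSExtractor, it adds the @outerHTML syntax to extract an element's outerHTML. The @property syntax for attributes is already supported natively by XPath. See the example for usage.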

Usage example

import { XPathExtractor } from 'nula-extractor';

const html = `
<h1>这是标题</h1>
<author>LiesAuer</author>
<div id="meta">
    <div id="time" timestamp="1660392672">2022-08-13 20:11:12</div>
</div>
<div id="content">test</div>
`;

const extractor = new XPathExtractor();

const title = extractor.extractOne('//h1', html);
const author = extractor.extractOne('//author', html);

// Extract the outerHTML of the element
const timeHtml = extractor.extractOne('//*[@id="time"]/@outerHTML', html) as string;
const date = extractor.extractOne('//*[@id="time"]', timeHtml);
// Extract the timestamp attribute
const timestamp = extractor.extractOne('//*[@id="time"]/@timestamp', timeHtml);

const content = extractor.extractOne('//div[@id="content"]', html);

console.log(JSON.stringify({
    title,
    author,
    timeHtml,
    date,
    timestamp,
    content,
}, null, 4));

Output

{
    "title": "这是标题",
    "author": "LiesAuer",
    "timeHtml": "<div id=\"time\" timestamp=\"1660392672\">2022-08-13 20:11:12</div>",
    "date": "2022-08-13 20:11:12",
    "timestamp": "1660392672",
    "content": "test"
}

RegexExtractor

RegexExtractor extracts data using regular expressions.

Usage example

import { RegexExtractor } from 'nula-extractor';

const html = `
<h1>这是标题</h1>
<author>LiesAuer</author>
<div id="meta">
    <div id="time" timestamp="1660392672">2022-08-13 20:11:12</div>
</div>
<div id="content">test</div>
`;

const extractor = new RegexExtractor();

const title = extractor.extractOne('<h1>(.*?)</h1>', html);
const author = extractor.extractOne('<author>(.*?)</author>', html);

const date = extractor.extractOne('<div id="time" timestamp="\\d+">(.*?)</div>', html);
const timestamp = extractor.extractOne('timestamp="(\\d+)"', html);

const content = extractor.extractOne('<div id="content">(.*?)</div>', html);

console.log(JSON.stringify({
    title,
    author,
    date,
    timestamp,
    content,
}, null, 4));

Output

{
    "title": "这是标题",
    "author": "LiesAuer",
    "date": "2022-08-13 20:11:12",
    "timestamp": "1660392672",
    "content": "test"
}
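
The output above shows that `extractOne` returns the content of the first capture group. A minimal sketch of that behaviour using only the built-in `RegExp` follows. The function names are hypothetical, and two details are assumptions rather than documented behaviour: the whole match is returned when the pattern has no capture group, and `null` is returned when nothing matches.

```typescript
// Hypothetical stand-in for extractOne: first capture group of the first match.
function regexExtractOne(pattern: string, content: string): string | null {
    const match = new RegExp(pattern).exec(content);
    if (match === null) {
        return null;
    }
    return match[1] !== undefined ? match[1] : match[0];
}

// Hypothetical stand-in for extractAll: first capture group of every match.
function regexExtractAll(pattern: string, content: string): string[] {
    const results: string[] = [];
    const regex = new RegExp(pattern, 'g');
    let match: RegExpExecArray | null;
    while ((match = regex.exec(content)) !== null) {
        if (match[0] === '') {
            regex.lastIndex++; // step past empty matches to avoid an infinite loop
        }
        results.push(match[1] !== undefined ? match[1] : match[0]);
    }
    return results;
}

const listHtml = '<li>a</li><li>b</li>';
const firstItem = regexExtractOne('<li>(.*?)</li>', listHtml); // 'a'
const allItems = regexExtractAll('<li>(.*?)</li>', listHtml);  // ['a', 'b']
```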

JSONPathExtractor

JSONPathExtractor extracts data using JSONPath expressions.

Usage example

import { JSONPathExtractor } from 'nula-extractor';

const json = JSON.parse(`
{
    "title":"这是标题",
    "author":"LiesAuer",
    "time":"2022-08-13 20:11:12",
    "timestamp":"1660392672",
    "content":"test"
}
`);

const extractor = new JSONPathExtractor();

const title = extractor.extractOne('$.title', json);
const author = extractor.extractOne('$.author', json);

const date = extractor.extractOne('$.time', json);
const timestamp = extractor.extractOne('$.timestamp', json);

const content = extractor.extractOne('$.content', json);

console.log(JSON.stringify({
    title,
    author,
    date,
    timestamp,
    content,
}, null, 4));

Output

{
    "title": "这是标题",
    "author": "LiesAuer",
    "date": "2022-08-13 20:11:12",
    "timestamp": "1660392672",
    "content": "test"
}

JMESPathExtractor

JMESPathExtractor extracts data using JMESPath expressions. Compared with JSONPath, JMESPath offers more advanced syntax and supports functions.

Usage example

import { JMESPathExtractor } from 'nula-extractor';

const json = JSON.parse(`
{
    "title":"这是标题",
    "author":"LiesAuer",
    "time":"2022-08-13 20:11:12",
    "timestamp":"1660392672",
    "content":"test"
}
`);

const extractor = new JMESPathExtractor();

const title = extractor.extractOne('title', json);
const author = extractor.extractOne('author', json);

const date = extractor.extractOne('time', json);
const timestamp = extractor.extractOne('timestamp', json);

const content = extractor.extractOne('content', json);

console.log(JSON.stringify({
    title,
    author,
    date,
    timestamp,
    content,
}, null, 4));

Output

{
    "title": "这是标题",
    "author": "LiesAuer",
    "date": "2022-08-13 20:11:12",
    "timestamp": "1660392672",
    "content": "test"
}

TextExtractor

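TextExtractor borrows the mathematical interval notation () and [] to extract data. The full syntax is (left text,right text) or [left text,right text], which extracts the text found between the left text and the right text. An opening ( excludes the left text from the result, while [ includes it. A closing ) excludes the right text from the result, while ] includes it.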

Usage example

import { TextExtractor } from 'nula-extractor';

const html = `
<h1>这是标题</h1>
<author>LiesAuer</author>
<div id="meta">
    <div id="time">2022-08-13 20:11:12</div>
</div>
<div id="content">test</div>
`;

const extractor = new TextExtractor();

const title = extractor.extractOne('(<h1>,</h1>)', html);
const author = extractor.extractOne('(<author>,</author>)', html);

const date = extractor.extractOne('(<div id="time">,</div>)', html);

const content = extractor.extractOne('(<div id="content">,</div>)', html);

console.log(JSON.stringify({
    title,
    author,
    date,
    content,
}, null, 4));

Output

{
    "title": "这是标题",
    "author": "LiesAuer",
    "date": "2022-08-13 20:11:12",
    "content": "test"
}
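
The interval semantics can be sketched in a few lines of plain string handling. This is an illustration, not the library's implementation: `extractBetween` is a hypothetical function, it assumes the rule is split at its first comma (so the left text cannot contain a comma), and it assumes the two brackets can be mixed, since each one is described independently.

```typescript
// Hypothetical sketch of interval-style extraction.
// '(' / ')' exclude the boundary text, '[' / ']' include it.
function extractBetween(rule: string, content: string): string | null {
    const open = rule.charAt(0);
    const close = rule.charAt(rule.length - 1);
    const comma = rule.indexOf(',');
    if ((open !== '(' && open !== '[') || (close !== ')' && close !== ']') || comma === -1) {
        throw new Error('Invalid rule: ' + rule);
    }
    const left = rule.slice(1, comma);
    const right = rule.slice(comma + 1, rule.length - 1);

    // Locate the left text, then the first right text that follows it.
    const start = content.indexOf(left);
    if (start === -1) {
        return null;
    }
    const end = content.indexOf(right, start + left.length);
    if (end === -1) {
        return null;
    }

    const from = open === '[' ? start : start + left.length;
    const to = close === ']' ? end + right.length : end;
    return content.slice(from, to);
}

const sample = '<h1>Title</h1>';
const inner = extractBetween('(<h1>,</h1>)', sample);     // 'Title'
const outer = extractBetween('[<h1>,</h1>]', sample);     // '<h1>Title</h1>'
const rightOnly = extractBetween('(<h1>,</h1>]', sample); // 'Title</h1>'
```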

RawExtractor

A special-purpose extractor: it returns the content exactly as given and performs no extraction.

Custom extractors

MyExtractor

You can write a custom extractor to add a data extraction method that suits your own needs.
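
As a rough sketch of what a custom extractor might look like, the class below implements the `IExtractor` interface for text made of `key=value` lines. The `key=value` format is invented for illustration, the interface is redeclared locally to keep the snippet self-contained, and how a custom extractor is wired into the rest of the library is not covered here.

```typescript
// Local copy of the interface from the "Interface definition" section.
interface IExtractor<T, I, O> {
    extractOne(selector: T, content: I): O;
    extractAll(selector: T, content: I): O[];
}

// Hypothetical custom extractor: the selector is a key, and the content is
// text made of "key=value" lines.
class MyExtractor implements IExtractor<string, string, string | null> {
    // Returns the values of every line whose key equals the selector.
    extractAll(selector: string, content: string): string[] {
        const values: string[] = [];
        for (const line of content.split('\n')) {
            const separator = line.indexOf('=');
            if (separator === -1) {
                continue; // not a key=value line
            }
            if (line.slice(0, separator).trim() === selector) {
                values.push(line.slice(separator + 1).trim());
            }
        }
        return values;
    }

    // Returns the first matching value, or null when the key is absent.
    extractOne(selector: string, content: string): string | null {
        const values = this.extractAll(selector, content);
        return values.length > 0 ? values[0] : null;
    }
}

const text = 'title=Hello\nauthor=LiesAuer\ntag=a\ntag=b';
const myExtractor = new MyExtractor();
const author = myExtractor.extractOne('author', text); // 'LiesAuer'
const tags = myExtractor.extractAll('tag', text);      // ['a', 'b']
```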