Web Scraping with PHP

1 year ago

科, 颖

$html = '<html>
    <head>
            <meta content="htmlスクレイピング - Qiita" property="og:title">
            <meta content="https://cdn.qiita.com/assets/qiita-fb-2887e7b4aad86fd8c25cea84846f2236.png" property="og:image">
            <meta content="ogのdescription" property="og:description">
        </head>
    <body>
        <p id="first">上</p>
        <p id="second">中</p>
        <p id="third" class="test">下</p>
        <div>sampleA</div>
        <div>
            <p class="test">sampleB</p>
        </div>
        <div id="area">
            <p>
                <span>sampleC</span>
            </p>
            <div>sampleD</div>
            sampleE
        </div>
    </body>
</html>';

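$dom_document = new DOMDocument();
// Convert non-ASCII characters to entities so loadHTML() does not garble the UTF-8 text
// (the 'HTML-ENTITIES' encoding is deprecated as of PHP 8.2); @ hides the parser warnings.
@$dom_document->loadHTML(mb_convert_encoding($html, 'HTML-ENTITIES', 'UTF-8'));
// Wrap the same document in a SimpleXMLElement for the XPath examples.
$xml_object = simplexml_import_dom($dom_document);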
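
If you would rather see what the parser complained about than silence it with @, libxml_use_internal_errors() is one option. This is only a sketch: the HTML is a hypothetical, trimmed ASCII-only stand-in for the sample above, and $previous, $loaded and $warnings are names I made up for illustration.

```php
<?php
// Hypothetical, trimmed ASCII-only version of the sample document.
$html = '<html><body><p id="first">A</p><p id="second">B</p></body></html>';

// Collect libxml warnings in a buffer instead of emitting them.
$previous = libxml_use_internal_errors(true);

$dom_document = new DOMDocument();
$loaded = $dom_document->loadHTML($html);

// Keep the collected warnings for logging, then reset the buffer
// and restore the previous error-handling mode.
$warnings = libxml_get_errors();
libxml_clear_errors();
libxml_use_internal_errors($previous);

if (!$loaded) {
    throw new RuntimeException('HTML could not be parsed');
}

$xml_object = simplexml_import_dom($dom_document);
echo (string)$xml_object->body->p[0], PHP_EOL;
printf("%d parser warning(s)\n", count($warnings));
```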

Using DOMDocument

foreach ($dom_document->getElementsByTagName('p') as $item)
{
    foreach ($item->childNodes as $node)
    {
        var_dump($node->nodeValue, $node->textContent);
    }
}
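
The loop above only dumps child nodes. As a rough sketch of doing something with them, the following collects the text of every p that has an id. The HTML is a hypothetical ASCII-only cut of the sample, and $texts_by_id is an illustrative name, not something from the original code.

```php
<?php
// Hypothetical, trimmed ASCII-only version of the sample document.
$html = '<html><body>'
      . '<p id="first">A</p><p id="second">B</p><p id="third" class="test">C</p>'
      . '<div><p class="test">sampleB</p></div>'
      . '</body></html>';

$dom_document = new DOMDocument();
$dom_document->loadHTML($html);

$texts_by_id = [];
foreach ($dom_document->getElementsByTagName('p') as $item) {
    // getAttribute() returns an empty string when the attribute is absent.
    $id = $item->getAttribute('id');
    if ($id !== '') {
        // textContent is the concatenated text of the element and its descendants.
        $texts_by_id[$id] = $item->textContent;
    }
}

print_r($texts_by_id);
```

The p inside the div has no id, so it is skipped.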

Using SimpleXMLElement

Basics

(string)$xml_object->body->p[0];
// string(3) "上"

(string)$xml_object->body->p[0]->attributes()->id;
// string(5) "first"

(string)$xml_object->body->p[0]['id'];
// string(5) "first"
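
The same property and array access also works in a loop. A minimal sketch, assuming a hypothetical ASCII-only cut of the sample ($rows is an illustrative name):

```php
<?php
// Hypothetical, trimmed ASCII-only version of the sample document.
$html = '<html><body>'
      . '<p id="first">A</p><p id="second">B</p><p id="third" class="test">C</p>'
      . '</body></html>';

$dom_document = new DOMDocument();
$dom_document->loadHTML($html);
$xml_object = simplexml_import_dom($dom_document);

$rows = [];
foreach ($xml_object->body->p as $p) {
    // attributes() iterates as name => SimpleXMLElement pairs.
    $attributes = [];
    foreach ($p->attributes() as $name => $value) {
        $attributes[$name] = (string)$value;
    }
    $rows[] = ['text' => (string)$p, 'attributes' => $attributes];
}

var_dump(count($xml_object->body->p));            // int(3)
var_dump(isset($xml_object->body->p[0]['class'])); // bool(false)
print_r($rows);
```

isset() is a handy way to check for an attribute before casting it.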

XPath

Selecting elements

Selecting by element content

$xml_object->xpath('/html/body/p[.="中"]'); // path given from the root
$xml_object->xpath('//body/p[.="中"]'); // path given from an intermediate node
// array(1) { [0]=> object(SimpleXMLElement)#3 (2) { ["@attributes"]=> array(1) { ["id"]=> string(6) "second" } [0]=> string(3) "中" } }

Selecting by attribute

$xml_object->xpath('//p[@id="first"]');
// array(1) { [0]=> object(SimpleXMLElement)#3 (2) { ["@attributes"]=> array(1) { ["id"]=> string(5) "first" } [0]=> string(3) "上" } }

$xml_object->xpath('//p[@class="test"]');
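// array(2) { [0]=> object(SimpleXMLElement)#3 (2) { ["@attributes"]=> array(2) { ["id"]=> string(5) "third" ["class"]=> string(4) "test" } [0]=> string(3) "下" } [1]=> object(SimpleXMLElement)#4 (2) { ["@attributes"]=> array(1) { ["class"]=> string(4) "test" } [0]=> string(7) "sampleB" } }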

Selecting by position (the first element is 1)

$xml_object->xpath('//p[2]');
// array(1) { [0]=> object(SimpleXMLElement)#3 (2) { ["@attributes"]=> array(1) { ["id"]=> string(6) "second" } [0]=> string(3) "中" } }
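
One thing that is easy to trip over: in XPath, //p[N] means "every p that is the N-th p child of its own parent", not "the N-th p in the document". The sketch below shows the difference on a hypothetical ASCII-only cut of the sample; $each_first and $overall_first are illustrative names.

```php
<?php
// Hypothetical, trimmed ASCII-only version of the sample document.
$html = '<html><body>'
      . '<p id="first">A</p><p id="second">B</p><p id="third">C</p>'
      . '<div><p class="test">sampleB</p></div>'
      . '</body></html>';

$dom_document = new DOMDocument();
$dom_document->loadHTML($html);
$xml_object = simplexml_import_dom($dom_document);

// //p[1]: every p that is the first p child of its own parent.
$each_first = $xml_object->xpath('//p[1]');

// (//p)[1]: the first p in the whole document.
$overall_first = $xml_object->xpath('(//p)[1]');

foreach ($each_first as $p) {
    echo (string)$p, PHP_EOL; // prints A, then sampleB
}
echo count($overall_first), PHP_EOL; // prints 1
```

Wrap the path in parentheses when you want to count across the whole document.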

Selecting elements that have a particular child element

$xml_object->xpath('//div[p]'); // div elements that have a p as a direct child
// array(2) { [0]=> object(SimpleXMLElement)#3 (1) { ["p"]=> string(7) "sampleB" } [1]=> object(SimpleXMLElement)#4 (2) { ["p"]=> object(SimpleXMLElement)#5 (1) { ["span"]=> string(7) "sampleC" } ["div"]=> string(7) "sampleD" } }

Selecting by hierarchy

$xml_object->xpath('//div/p[1]/../../p[1]');
// array(1) { [0]=> object(SimpleXMLElement)#3 (2) { ["@attributes"]=> array(1) { ["id"]=> string(5) "first" } [0]=> string(3) "上" } }

Getting values

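$xml_object->xpath($xpath_string) returns an empty array when nothing matches and false when an error occurs.

The examples below assume the target exists, i.e. that `! empty($xml_object->xpath($xpath_string))` has already been checked.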
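
That check can be folded into a small helper so the [0] access never runs on an empty result. This is a sketch only: xpath_first_string() is a hypothetical function of my own, not part of SimpleXML, and the HTML is an ASCII-only cut of the sample.

```php
<?php
/**
 * Hypothetical helper: returns the first XPath match as a string,
 * or $default when nothing matches or the query fails.
 */
function xpath_first_string(SimpleXMLElement $node, string $xpath_string, ?string $default = null): ?string
{
    $matches = $node->xpath($xpath_string);

    // empty() covers both the empty array (no match) and false (error).
    if (empty($matches)) {
        return $default;
    }

    return (string)$matches[0];
}

// Hypothetical, trimmed ASCII-only version of the sample document.
$html = '<html><body><p id="first">A</p><p id="second">B</p></body></html>';

$dom_document = new DOMDocument();
$dom_document->loadHTML($html);
$xml_object = simplexml_import_dom($dom_document);

var_dump(xpath_first_string($xml_object, '//p[2]'));     // string(1) "B"
var_dump(xpath_first_string($xml_object, '//p[2]/@id')); // string(6) "second"
var_dump(xpath_first_string($xml_object, '//h1'));       // NULL
```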

Getting element content

(string)$xml_object->xpath('//p[2]')[0];
// string(3) "中"

Getting element attributes

(string)$xml_object->xpath('//p[2]/@id')[0];
// string(6) "second"

(string)$xml_object->xpath('//p[3]/@class')[0];
// string(4) "test"

(string)$xml_object->xpath('//meta[@property="og:title"]/@content')[0];
// string(33) "htmlスクレイピング - Qiita"

(string)$xml_object->xpath('//meta[@property="og:image"]/@content')[0];
// string(74) "https://cdn.qiita.com/assets/qiita-fb-2887e7b4aad86fd8c25cea84846f2236.png"

(string)$xml_object->xpath('//meta[@property="og:description"]/@content')[0];
// string(16) "ogのdescription"

You can also split an XPath query into chained calls.

$xml_object->xpath('//div')[1]->xpath('p');
// array(1) { [0]=> object(SimpleXMLElement)#6 (2) { ["@attributes"]=> array(1) { ["class"]=> string(4) "test" } [0]=> string(7) "sampleB" } }

Getting all the text inside an element

If a SimpleXMLElement contains another SimpleXMLElement, a (string) cast does not return the text of the nested element.
The cast only returns the direct child text that is not wrapped in a tag.

$xml_object->xpath('//div[@id="area"]');

array(1) {
  [0]=>
  object(SimpleXMLElement)#10948 (3) {
    ["@attributes"]=>
    array(1) {
      ["id"]=>
      string(4) "area"
    }
    ["p"]=>
    object(SimpleXMLElement)#10950 (1) {
      ["span"]=>
      string(7) "sampleC"
    }
    ["div"]=>
    string(7) "sampleD"
  }
}

(string)$xml_object->xpath('//div[@id="area"]')[0];

string(87) "


                    sampleE
                "

To get all of the text underneath the element, dump it as XML and strip the tags:

strip_tags($xml_object->xpath('//div[@id="area"]')[0]->asXml());

string(147) "

                        sampleC

                    sampleD
                    sampleE
                "

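As an alternative to strip_tags(), the node can be handed back to DOM with dom_import_simplexml() and its text nodes collected one by one, which also gets rid of the indentation. A minimal sketch, assuming a hypothetical ASCII-only cut of the sample; $pieces and $all_text are illustrative names.

```php
<?php
// Hypothetical, trimmed ASCII-only version of the "area" part of the sample.
$html = '<html><body><div id="area">
    <p><span>sampleC</span></p>
    <div>sampleD</div>
    sampleE
</div></body></html>';

$dom_document = new DOMDocument();
$dom_document->loadHTML($html);
$xml_object = simplexml_import_dom($dom_document);

$area = $xml_object->xpath('//div[@id="area"]')[0];

// Switch the same node back to its DOM view to reach the text nodes.
$dom_node = dom_import_simplexml($area);
$dom_xpath = new DOMXPath($dom_node->ownerDocument);

$pieces = [];
foreach ($dom_xpath->query('.//text()', $dom_node) as $text_node) {
    // Drop indentation and skip whitespace-only nodes.
    $piece = trim($text_node->nodeValue);
    if ($piece !== '') {
        $pieces[] = $piece;
    }
}

$all_text = implode(' ', $pieces);
echo $all_text, PHP_EOL; // sampleC sampleD sampleE
```
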
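Be careful with XPaths that end in /@...: asXml() on an attribute node returns the whole attribute (/@id becomes id="XXXXXX"), so strip_tags() does not give you just the value.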

(string)$xml_object->xpath('//@id')[0];

string(5) "first"

strip_tags($xml_object->xpath('//@id')[0]->asXml());

string(11) " id="first""

When I actually scraped pages, some strings contained HTML entities, so I clean them up as follows.

mb_ereg_replace('\s', '', html_entity_decode($string, ENT_QUOTES)); // removes whitespace characters other than full-width spaces
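
Wrapped into a function, the cleanup might look like the sketch below. Note that it swaps in preg_replace() for mb_ereg_replace() so it does not depend on mbstring; normalize_scraped_text() is a hypothetical name, and the explicit ENT_HTML5 flag and 'UTF-8' argument are my own assumptions.

```php
<?php
/**
 * Hypothetical helper: decodes HTML entities, then removes ASCII whitespace.
 */
function normalize_scraped_text(string $string): string
{
    $decoded = html_entity_decode($string, ENT_QUOTES | ENT_HTML5, 'UTF-8');

    // Without the /u modifier, \s only matches ASCII whitespace,
    // so a full-width space (U+3000) is left untouched.
    return preg_replace('/\s+/', '', $decoded);
}

echo normalize_scraped_text(" Tom &amp; Jerry\n&quot;v2&quot; "), PHP_EOL;
// Tom&Jerry"v2"
```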