webextractor | PyF, flow-based python programming

webextractor

package name: pyf.components.producers.webextractor

“webextractor” plugin

class pyf.components.producers.webextractor.WebExtractor(config_node, process_name)

This producer takes URLs and outputs items selected with XPath expressions.

Available configuration:
  • advanced(label: Advanced): Compound key (each sub key is an individual tag)
    • separate_process(label: Separate Process): boolean
  • name(label: Name): Simple key/value (text-based)

    unique name

  • start_urls(label: Start_Urls): Key with repeated start_url content (default: “[‘’]”)
    Key contains repeated items “start_url”:
  • item_selector(label: Individual item XPath): Simple key/value (text-based)

    ex. ‘//ul[1]/li’

  • fields(label: Fields): Key with repeated field content (default: “[{‘xpath’: ‘’, ‘name’: ‘’}]”)
    Key contains repeated items “field”:
    • field(label: Field): Compound key (“xpath” key is the text content of the node)
      • name(label: Attribute): input

        Target attribute

      • xpath(label: Field XPath): input

        Path to search (ex. “p/*/text()”)

  • link_selector(label: Other pages urls xpath): Simple key/value (text-based)

    ex. “p[@id=’links’]/a/@href” (optional)

  • url_base(label: Base url for links): Simple key/value (text-based)

    ex. “http://wwww.example.com/”

  • page_limit(label: Limit to N Pages): Simple key/value (text-based) (default: “10”)
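
The way these keys could fit together is sketched below in Python. This is a minimal illustration, not the plugin's implementation: the crawl, extract_items and extract_links functions, the CONFIG dict and the in-memory PAGES mapping are all hypothetical, and it assumes that url_base is used to resolve relative links and that page_limit counts fetched pages. It uses the standard library's xml.etree.ElementTree, which supports only a subset of XPath, so the selectors here pick elements and the text or href is read in Python rather than through text() or @href steps.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urljoin

# Hypothetical pages standing in for real HTTP responses.
PAGES = {
    "http://www.example.com/": (
        "<html><body><ul>"
        "<li><a href='/a'>Alpha</a><span>1</span></li>"
        "<li><a href='/b'>Beta</a><span>2</span></li>"
        "</ul><p id='links'><a href='page2.html'>next</a></p></body></html>"
    ),
    "http://www.example.com/page2.html": (
        "<html><body><ul>"
        "<li><a href='/c'>Gamma</a><span>3</span></li>"
        "</ul></body></html>"
    ),
}

# Hypothetical configuration mirroring the documented keys.
CONFIG = {
    "start_urls": ["http://www.example.com/"],
    "item_selector": ".//ul[1]/li",
    "fields": [
        {"name": "title", "xpath": "a"},
        {"name": "rank", "xpath": "span"},
    ],
    "link_selector": ".//p[@id='links']/a",
    "url_base": "http://www.example.com/",
    "page_limit": "10",
}


def extract_items(root, config):
    """Yield one dict per node matched by item_selector."""
    for node in root.findall(config["item_selector"]):
        item = {}
        for field in config["fields"]:
            # Each field XPath is evaluated relative to the item node.
            match = node.find(field["xpath"])
            item[field["name"]] = match.text if match is not None else None
        yield item


def extract_links(root, config):
    """Yield absolute URLs for the links matched by link_selector."""
    for anchor in root.findall(config["link_selector"]):
        href = anchor.get("href")
        if href:
            # Assumption: url_base is the base for relative links.
            yield urljoin(config["url_base"], href)


def crawl(config, fetch):
    """Visit start_urls and discovered links, up to page_limit pages."""
    queue = list(config["start_urls"])
    seen = set()
    while queue and len(seen) < int(config["page_limit"]):
        url = queue.pop(0)
        if url in seen:
            continue
        seen.add(url)
        root = ET.fromstring(fetch(url))
        yield from extract_items(root, config)
        queue.extend(extract_links(root, config))
```

Calling list(crawl(CONFIG, PAGES.__getitem__)) walks both sample pages. The fetch argument is any callable that returns markup for a URL, so a real HTTP client could be swapped in for the dictionary lookup.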

launch(progression_callback=None, message_callback=None, params=None)

Extracts the data from a file using the passed descriptor. If there is a data item in params, just yield it.

Available params in the params dict:
  • data: if provided, iterates over the lines in data and yields them.
  • descriptor: use this descriptor to read the data.
  • source: use this file-like object as the data source.
  • source_filename: use this file as the data source.

Requires the source_encoding config key.
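
The handling of the params dict described for launch can be pictured with the following Python sketch. It is a hypothetical stand-in, not PyF code: the launch_sketch name, the order in which source and source_filename are checked, the stripping of trailing newlines and the error raised when nothing is supplied are all assumptions. The descriptor param is left out because its type is not described here.

```python
import io


def launch_sketch(params=None, source_encoding="utf-8"):
    """Hypothetical illustration of the documented params handling."""
    params = params or {}

    if "data" in params:
        # Documented behaviour: iterate over data and yield each line.
        yield from params["data"]
        return

    if "source" in params:
        # Assumption: a text-mode file-like object, read line by line.
        for line in params["source"]:
            yield line.rstrip("\n")
        return

    if "source_filename" in params:
        # Assumption: the file is decoded with the source_encoding value.
        with open(params["source_filename"], encoding=source_encoding) as handle:
            for line in handle:
                yield line.rstrip("\n")
        return

    raise ValueError("params needs data, source or source_filename")
```

For example, list(launch_sketch({"source": io.StringIO("x\ny\n")})) reads from an in-memory file-like object, while passing a data list returns its elements unchanged.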