
Query Language

WebRobot web scraping Query Language

This article introduces the basics of the WebRobot web scraping query language. It is aimed both at developers who use our unmanaged web scraping backend and at external collaborators who join our supply chain to help us scale the managed web scraping service.

1. Clauses

1.1. FETCH

Coordinates web page acquisition by defining the sequence of actions to run, chosen from the available actions.
It is the starting clause of an acquisition pipeline, and its expressions can be interpolated with values from the reference input dataset.

Example:

FETCH WHERE ACTIONS ARE (

(action = VISIT WITH ARGS (url = '{expression}'))

THEN

(action = CLICK WITH ARGS (selector = '{expression}'))

)

Each action is enclosed in round brackets, and the keyword THEN separates consecutive actions in the sequence.
The parameters of each action are defined within the WITH ARGS directive, with the keyword AND separating multiple parameters. For details on the parameters of each action, see the Actions section.
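When queries are generated from code rather than written by hand, the structure described above (bracketed actions chained with THEN, parameters joined with AND) maps onto a small string builder. The Python sketch below is illustrative only: `render_action` and `render_fetch` are hypothetical helpers, not part of WebRobot, and the output layout is an assumption based on the example above.

```python
def render_action(name, args):
    """Render one action as: (action = NAME WITH ARGS (k = 'v' AND ...))."""
    rendered_args = " AND ".join(f"{key} = '{value}'" for key, value in args.items())
    return f"(action = {name} WITH ARGS ({rendered_args}))"


def render_fetch(actions):
    """Chain the rendered actions with THEN inside a FETCH clause."""
    body = "\nTHEN\n".join(render_action(name, args) for name, args in actions)
    return f"FETCH WHERE ACTIONS ARE (\n{body}\n)"


# Hypothetical usage: a VISIT followed by a CLICK.
query = render_fetch([
    ("VISIT", {"url": "{expression}"}),
    ("CLICK", {"selector": "{expression}"}),
])
print(query)
```

Keeping each action as a (name, arguments) pair makes it easy to append or reorder steps before the query text is produced.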

1.2. JOIN

Manages navigation to detail pages by defining the segmentation filter (the splitter), the actions to run on each segment, and the fields to extract. Combined with the CURRENT and PIVOTED clauses, it lets you refer to fields of the parent page and fields of the child page using CSS selectors.

Example:

JOIN (

(PIVOTED('div[class="{class}"] span'))

)

WHERE (

splitter = 'div[class="{class}"]'

AND

ACTIONS ARE

(

(action = {action} WITH ARGS (param = '{expression}'))

)

)

1.3. SELECT

Corresponds to the selection and projection operators in SQL and defines the fields to extract.

Example:

SELECT (

(CURRENT('div[class="{class}"]')) AS field1

THEN

(CURRENT('div[class="{class}"]')) AS field2

)

1.4. FLATSELECT

Combines the features of FLATTEN and SELECT. Selection and projection apply to each segment identified by the splitter attribute.

FLATSELECT (

(PIVOTED('div[class="{class}"]')) AS field1

THEN

(PIVOTED('div[class="{class}"]')) AS field2

)

WHERE (

splitter = 'div[class="{class}"]'

)

1.5. FLATTEN

Splits the page into the segments matched by the splitter attribute.

Example:

FLATTEN WHERE (

splitter = 'div[class="{class}"]'

)
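One way to picture FLATTEN is as an "explode" step: each page row becomes one row per segment matched by the splitter. The Python sketch below models that idea on plain dictionaries. The row layout, the `segments` key and the `flatten` function are assumptions made for illustration; the real engine works on page content through a CSS selector rather than on a pre-split list.

```python
def flatten(rows, alias="segment"):
    """Yield one output row per segment, keeping the parent row's other fields."""
    for row in rows:
        parent = {key: value for key, value in row.items() if key != "segments"}
        for segment in row["segments"]:
            yield {**parent, alias: segment}


# Hypothetical input: two pages that were already split into segments.
pages = [
    {"url": "http://example.com/a", "segments": ["item 1", "item 2"]},
    {"url": "http://example.com/b", "segments": ["item 3"]},
]
flat = list(flatten(pages, alias="item"))
print(len(flat))  # one row per segment
```

The `alias` argument plays the role of the name under which each segment is exposed to later steps.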

1.6. EXPLORE

Manages a recursive traversal of pages, consistent with the defined selection operator, by running the actions and selecting the fields on each detail page.

Example:

EXPLORE (

(PIVOTED('div[class="{class}"]')) AS field1

THEN

(PIVOTED('div[class="{class}"]')) AS field2

)

WHERE (

splitter = 'div[class="{class}"]'

AND

ACTIONS ARE

(

(action = {action} WITH ARGS (param = '{expression}'))

)

)

1.7. VISITJOIN

Shorthand JOIN operator that runs the VISIT navigation action without an explicit selection of fields.

VISITJOIN WHERE

(

splitter = 'div[class="{class}"]'

)

1.8. WGETJOIN

Shorthand JOIN operator that runs the WGET navigation action (an HTTP request that does not involve a headless browser) without an explicit selection of fields.

Example:

WGETJOIN WHERE

(

splitter = 'div[class="{class}"]'

)

1.9. VISITEXPLORE

Shorthand EXPLORE operator that runs the VISIT navigation action without an explicit selection of fields.

Example:

VISITEXPLORE WHERE

(

splitter = 'div[class="{class}"]'

)

2. Actions

The actions below can be used in the FETCH, JOIN and EXPLORE clauses.

2.1. VISIT

Browsing action using a headless browser.

Example:

FETCH WHERE ACTIONS ARE (

(action = VISIT WITH ARGS ( url = '{expression}'))

)

The expression in curly brackets can be a dynamic interpolating expression with variables and fields from the input dataset.

2.2. WGET

Navigation action performed as a plain HTTP request. It is faster than VISIT, but it is limited to static HTML pages that do not rely on complex JavaScript interactions.

FETCH WHERE ACTIONS ARE (

(action = WGET WITH ARGS (url = '{expression}'))

)

2.3. CLICK

Using the headless browser, clicks the element identified by the given CSS selector.

FETCH WHERE ACTIONS ARE (

( action = VISIT WITH ARGS ( url = '{expression}'))

THEN

(action = CLICK WITH ARGS (selector = 'div[class="{class}"]'))

)

2.4. CLICKNEXT

Using the headless browser, clicks the next matching element that has not yet been clicked during the user session.

FETCH WHERE ACTIONS ARE (

( action = VISIT WITH ARGS ( url = '{expression}'))

THEN

(action = CLICKNEXT WITH ARGS (selector = 'div[class="{class}"]'))

)

2.5. TEXTINPUT

Using the headless browser, fills in a text field with the given value.

FETCH WHERE ACTIONS ARE (

(action = VISIT WITH ARGS ( url = '{expression}'))

THEN

(action = TEXTINPUT WITH ARGS (selector = 'div[class="{class}"]' AND value = '{expression}'))

)

2.6. DELAY

Suspends execution for the time given by the duration parameter, in milliseconds.

FETCH WHERE ACTIONS ARE (

(action = VISIT WITH ARGS ( url = '{expression}'))

THEN

(action = DELAY WITH ARGS ( duration = '10'))

)

2.7. RANDOMDELAY

Suspends execution for a randomly selected time, in milliseconds, between the minimum and maximum values.

FETCH WHERE ACTIONS ARE (

(action = VISIT WITH ARGS (url = '{expression}'))

THEN

(action = RANDOMDELAY WITH ARGS (min = '2' AND max = '10'))

)
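The behaviour described for RANDOMDELAY can be pictured with a few lines of Python. This is only a sketch of the idea, not the engine's implementation: the function name `random_delay` is hypothetical, and treating both bounds as inclusive whole milliseconds is an assumption.

```python
import random
import time


def random_delay(min_ms, max_ms, sleep=time.sleep):
    """Sleep for a random whole number of milliseconds in [min_ms, max_ms]."""
    if min_ms > max_ms:
        raise ValueError("min_ms must not exceed max_ms")
    chosen_ms = random.randint(min_ms, max_ms)  # both bounds inclusive
    sleep(chosen_ms / 1000.0)  # time.sleep expects seconds
    return chosen_ms


# Mirrors min = '2' AND max = '10' from the example above.
waited = random_delay(2, 10)
print(f"waited {waited} ms")
```

Randomising the pause between actions makes the request pattern less regular than a fixed DELAY.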

2.8. SCREENSHOT

Takes a screenshot of the current page and saves it as an image.

FETCH WHERE ACTIONS ARE (

(action = VISIT WITH ARGS (url = '{expression}'))

THEN

(action = SCREENSHOT WITH ARGS (filter = '{MustHaveTitle|NoFilter}'))

)

2.9. SUBMIT

Executes the submit command associated with a data submission form.

FETCH WHERE ACTIONS ARE (

(action = VISIT WITH ARGS (url = '{expression}'))

THEN

(action = SUBMIT WITH ARGS (selector = '{expression}'))

)

2.10. DROPDOWNSELECT

Select the value of a DropDown control.

FETCH WHERE ACTIONS ARE (

(action = VISIT WITH ARGS (url = '{expression}'))

THEN

(action = DROPDOWN WITH ARGS (selector = '{expression}' AND value = '{expression}'))

)

2.11. EXESCRIPT

Executes injected client-side javascript code.

{idScript} refers to the javascript script code defined in the database or the XML configuration file of the internal tool in managed mode.

FETCH WHERE ACTIONS ARE (

(action = VISIT WITH ARGS (url = '{expression}'))

THEN

(action = EXESCRIPT WITH ARGS (selector = '{div[class="{class}"]}' AND idClient = '{idScript}'))

)

FETCH WHERE ACTIONS ARE (

(action = VISIT WITH ARGS (url = '{expression}'))

THEN

(action = EXESCRIPT WITH ARGS (selector = '{div[class="{class}"]}' AND value = '{script}'))

)

2.12. DRAGSLIDER

Sets the scroll bar identified by the selector to a specific percentage.

FETCH WHERE ACTIONS ARE (

(action = VISIT WITH ARGS (url = '{expression}'))

THEN

(action = DRAGSLIDER WITH ARGS (selector = '{div[class="{class}"]}' AND percentage = '{percentage}'))

)

2.13. LOOP

Loop according to a specific condition.

FETCH WHERE ACTIONS ARE (

(action = VISIT WITH ARGS (url = '{expression}'))

THEN

(

action = LOOP WHERE SUBACTIONS ARE (

(action = CLICK WITH ARGS (selector = '{div[class="{class}"]}'))

)

WITH ARGS (limit = '{limit}')

)

)

2.14. WAITFOR

Wait until the element indicated by the CSS selector has been loaded on the page before continuing with the flow.

FETCH WHERE ACTIONS ARE (

(action = VISIT WITH ARGS (url = '{expression}'))

THEN

(action = WAITFOR WITH ARGS (selector = '{cssselector}'))

)
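Waiting for an element is usually implemented as a polling loop. The Python sketch below shows that general pattern under stated assumptions: `wait_for` is a hypothetical helper, the element check is simulated by a plain function, and the timeout is an addition of this sketch (the article does not say whether WAITFOR has one).

```python
import time


def wait_for(is_loaded, timeout_s=5.0, poll_interval_s=0.1):
    """Poll is_loaded() until it returns True or the timeout expires."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        if is_loaded():
            return True
        time.sleep(poll_interval_s)
    return False


# Simulate an element that only appears on the third check.
checks = {"count": 0}


def element_present():
    checks["count"] += 1
    return checks["count"] >= 3


found = wait_for(element_present, timeout_s=2.0, poll_interval_s=0.01)
print(found)
```

Compared with a fixed DELAY, this kind of wait continues as soon as the element is available instead of always pausing for the full duration.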

2.15. TRY

Applies a retry policy to the sub-actions, bounded by the defined number of attempts.

FETCH WHERE ACTIONS ARE (

(action = VISIT WITH ARGS (url = '{expression}'))

THEN

(action = TRY

WHERE SUBACTIONS ARE (

(action = CLICK WITH ARGS (selector = '{selector}'))

)

WITH ARGS (selector = '{cssselector}')

)

)
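A retry policy bounded by a number of attempts can be sketched in Python as follows. The names `run_with_retries` and `max_attempts` are hypothetical, and the flaky click is simulated; how WebRobot configures the attempt count is not shown in this article.

```python
def run_with_retries(subaction, max_attempts):
    """Call subaction() until it succeeds or max_attempts is reached."""
    last_error = None
    for attempt in range(1, max_attempts + 1):
        try:
            return subaction(), attempt
        except Exception as error:  # a real runner would catch narrower errors
            last_error = error
    raise RuntimeError(f"gave up after {max_attempts} attempts") from last_error


# Simulate a click that fails twice before succeeding.
calls = {"count": 0}


def flaky_click():
    calls["count"] += 1
    if calls["count"] < 3:
        raise TimeoutError("element not clickable yet")
    return "clicked"


result, attempts_used = run_with_retries(flaky_click, max_attempts=5)
print(result, attempts_used)
```

Returning the attempt count alongside the result makes it easy to log how often a step needed to be repeated.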

3. Dynamic Expressions

You can associate variables with our selection expressions.

FETCH WHERE ACTIONS ARE (

(action = VISIT WITH ARGS (

url = (('http://{url}') + $(key_url))

))

)

key_url is a variable that is either assigned by the AS projection operator or read from a field of the input dataset associated with the query script.
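To make the idea of variable references concrete, the Python sketch below resolves `$(name)` references against a row of the input dataset. It is a simplified illustration: `resolve_variables` is a hypothetical helper, the reference is embedded directly in the string instead of being concatenated with `+`, and `example.com` and the row contents are made-up values.

```python
import re

# Matches references of the form $(name).
VARIABLE_REF = re.compile(r"\$\((\w+)\)")


def resolve_variables(template, variables):
    """Replace each $(name) reference with its value from variables."""
    def lookup(match):
        name = match.group(1)
        if name not in variables:
            raise KeyError(f"undefined variable: {name}")
        return str(variables[name])

    return VARIABLE_REF.sub(lookup, template)


# Hypothetical input row providing the key_url field.
row = {"key_url": "/products/42"}
url = resolve_variables("http://example.com$(key_url)", row)
print(url)  # http://example.com/products/42
```

Failing loudly on an undefined variable is a design choice of this sketch; it surfaces typos in variable names early.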

4. Parameters

We can define action-level parameters with the WITH ARGS directive, already used in the examples above, or clause-level parameters with the WITH PARAMETERS ARE directive.

Below are the parameters for each clause. In most cases you do not need to set them.

4.1. FETCH

numPartitions: number of partitions allocated by Spark when segmenting the dataset.

4.2. JOIN

numPartitions: number of partitions allocated by Spark during segmentation.

flattenJoinType: {Inner|LeftOuter|Replace|Append|Merge}
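The difference between the Inner and LeftOuter values can be pictured with the Python sketch below, which assumes they carry their conventional relational meaning. This is an assumption, as is everything in the code: `join_segments` and the row layout are hypothetical, and the Replace, Append and Merge values are not modelled because the article does not define them.

```python
def join_segments(parents, children_by_parent, join_type="Inner"):
    """Pair each parent row with its child rows using Inner or LeftOuter semantics."""
    if join_type not in ("Inner", "LeftOuter"):
        raise ValueError(f"unsupported join type in this sketch: {join_type}")
    joined = []
    for parent in parents:
        children = children_by_parent.get(parent["id"], [])
        if children:
            joined.extend({**parent, "child": child} for child in children)
        elif join_type == "LeftOuter":
            joined.append({**parent, "child": None})  # keep the unmatched parent
    return joined


parents = [{"id": 1}, {"id": 2}]
children = {1: ["detail-a", "detail-b"]}  # parent 2 has no detail pages
inner = join_segments(parents, children, "Inner")
left_outer = join_segments(parents, children, "LeftOuter")
print(len(inner), len(left_outer))  # 2 3
```

Under this reading, Inner drops parent rows without any matching detail page, while LeftOuter keeps them with an empty child.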

4.3. VISITJOIN

numPartitions: number of partitions allocated by Spark during segmentation.

flattenJoinType: {Inner|LeftOuter|Replace|Append|Merge}

4.4. WGETJOIN

numPartitions: number of partitions allocated by Spark during segmentation.

flattenJoinType: {Inner|LeftOuter|Replace|Append|Merge}

4.5. EXPLORE

numPartitions: number of partitions allocated by Spark during segmentation.

maxDepth: maximum level of traversal during the recursive crawling phase.
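A depth limit on a recursive crawl can be illustrated with a breadth-first traversal over an in-memory link graph. The Python sketch below is a generic illustration, not the engine's algorithm: `explore` and the `site` graph are hypothetical, and counting the start page as depth 0 is an assumption.

```python
from collections import deque


def explore(start, links, max_depth):
    """Breadth-first traversal that stops following links beyond max_depth."""
    visited = {start: 0}  # page -> depth at which it was first reached
    queue = deque([start])
    while queue:
        page = queue.popleft()
        depth = visited[page]
        if depth == max_depth:
            continue  # do not expand pages at the depth limit
        for target in links.get(page, []):
            if target not in visited:
                visited[target] = depth + 1
                queue.append(target)
    return visited


# Hypothetical site structure: page -> pages it links to.
site = {
    "/": ["/page/1", "/page/2"],
    "/page/1": ["/page/1/detail"],
    "/page/1/detail": ["/page/1/detail/more"],
}
reached = explore("/", site, max_depth=2)
print(sorted(reached))
```

With `max_depth=2` the page three links away from the start is never visited, which is how a depth limit keeps a recursive crawl bounded.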

4.6. VISITEXPLORE

numPartitions: number of partitions allocated by Spark during segmentation.

maxDepth: maximum level of traversal during the recursive crawling phase.

4.7. WGETEXPLORE

numPartitions: number of partitions allocated by Spark during segmentation.

maxDepth: maximum level of traversal during the recursive crawling phase.

4.8. FLATTEN

alias: name of the variable bound to each segment; it can optionally be projected into the output dataset.

4.9. FLATSELECT

alias: name of the variable bound to each segment; it can optionally be projected into the output dataset.