(19)
(11)EP 3 502 925 A1

(12)EUROPEAN PATENT APPLICATION

(43)Date of publication:
26.06.2019 Bulletin 2019/26

(21)Application number: 17209553.1

(22)Date of filing:  21.12.2017
(51)International Patent Classification (IPC): 
G06F 17/30(2006.01)
(84)Designated Contracting States:
AL AT BE BG CH CY CZ DE DK EE ES FI FR GB GR HR HU IE IS IT LI LT LU LV MC MK MT NL NO PL PT RO RS SE SI SK SM TR
Designated Extension States:
BA ME
Designated Validation States:
MA MD TN

(71)Applicant: Urban Software Institute GmbH
09116 Chemnitz (DE)

(72)Inventors:
  • Bittner, Christopher
    10717 Berlin (DE)
  • Müller, Sebastian
    10717 Berlin (DE)

(74)Representative: Bittner, Peter et al
Peter Bittner und Partner Seegarten 24
69190 Walldorf (DE)

  


(54)COMPUTER SYSTEM AND METHOD FOR EXTRACTING DYNAMIC CONTENT FROM WEBSITES


(57) Computer system (100), computer-implemented method and computer program product are provided for extracting dynamic content data (221) from a website (220) in a machine-readable format. The system has an interface (110) to access configuration data (250) reflecting the structure of the website (220). The configuration data includes at least a website specific scraping script and one or more website specific XPath statements. Further, the interface receives a data retrieval request (210) specifying the website (220) and corresponding dynamic content data (221) to be retrieved. A scraper module (120) provides the scraping script (2050) for execution wherein the scraping script is configured to perform one or more parameterized navigation steps on the website (220) to access the dynamic content data (221). A script module (140) triggers execution of the scraping script and receives HTML/XML data associated with the dynamic content data from the website (220) in response to the scraping script execution. An XPath extraction module (150) extracts machine-readable content data (222) from the HTML/XML data wherein the XPath extraction module is pre-configured with the website specific XPath statements in accordance with the structure of the website (220).




Description

Technical Field



[0001] The present invention generally relates to systems for data retrieval, and more particularly, relates to methods, computer program products and systems for extracting dynamic content from websites in machine-readable format.

Background



[0002] Web scraping or web data extraction methods are known in the art. Web scraping is used to access the World Wide Web directly using the Hypertext Transfer Protocol, or through a web browser. Web scraping typically refers to automated processes implemented using a bot or web crawler. It is a form of copying, in which specific data is gathered and copied from the web, typically into a central local database or spreadsheet, for later retrieval or analysis.

[0003] Web scraping a web page involves retrieving a predefined Hypertext Markup Language (HTML) page and extracting data from it. Fetching is the downloading of a page which is stored under a static Web address typically specified by a Uniform Resource Locator (URL). Once the page is fetched from where it had been stored, extraction can take place. The content of a page may then be parsed, searched, reformatted, etc. Web scrapers typically extract certain parts of a page to make use of it for another purpose. An example is to find and copy names and phone numbers, or companies and their URLs, to a list (so-called contact scraping).

[0004] Prior art Web scraping tools can retrieve web page content from pages which are stored as predefined HTML data. Such content is referred to as static content herein because it relates to content provided by static web pages. However, current web technology allows web pages to be generated dynamically on a web server in response to requests which may be received from a user or a computer system. As a consequence, data shown on websites can continuously change. A web page containing respective data can change its layout and new data fields may be introduced at any time. The content of such dynamic web pages (dynamic content) typically depends on the navigation history through a website. In other words, it depends on where the user currently is and which information and requests have been sent previously. Current Web scraping tools fail to scrape dynamic content data from such dynamically generated web pages and to provide respective content data in a machine-readable format so that the content can be further processed by other computer systems provided with the extracted data.

Summary



[0005] Hence, there is a need for providing improved methods and systems to enable web scraping for dynamic content on dynamic web pages.

[0006] This technical problem is solved by a computer system, a computer-implemented method and a computer program product as disclosed in the independent claims. The disclosed embodiments define a screen-scraping framework which addresses the above problem by automatically connecting to a target website and extracting dynamic data from said target website.

[0007] In one embodiment, a computer system is provided for extracting dynamic content data from a website in a machine-readable format. The system includes an interface to receive configuration data reflecting the structure of the website. The configuration data includes at least a website specific scraping script and one or more website specific XPath statements with the scraping script(s) and XPath statements being predefined (e.g., by a user). The computer system can then deploy the received configuration data to respective modules of the computer system. Such modules will be explained in detail in the following description. Further, the interface receives a data retrieval request specifying the website and corresponding dynamic content data to be retrieved from the website. The data retrieval request may be received from a human user or it may be received from another computer system requiring the to-be-retrieved data for further processing. In the latter case it is advantageous to provide the retrieved data in a machine-readable format.

[0008] The computer system further has a scraper module to provide the predefined scraping script to a script module of the system for triggering its execution on the website. The scraping script is configured in such a way that it allows one or more parameterized navigation steps to be performed on the website to access the dynamic content. In other words, the scraping script has instructions which allow parameterized navigation steps to be performed automatically by emulating a browser accessing the website. Each step description of the scraping script may include placeholders (parameters) which may be either replaced with data received from a human user or another computer system via the data retrieval request, or with values of responses extracted from previous navigation steps. That is, the scraping script may use results from preceding server responses allowing for dynamic navigation through the website and enabling the script to work without relying on predefined URLs. The screen scraper can be seen as a central module controlling at least the script module and the module for data extraction.
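
The placeholder replacement described in paragraph [0008] can be illustrated with a minimal Python sketch. The step layout, the `${...}` placeholder syntax, the example URL and the names `fill_step`, `request_params` and `previous_results` are assumptions made for illustration only; the application does not prescribe them.

```python
from string import Template


def fill_step(step, request_params, previous_results):
    """Replace ${name} placeholders in a step description.

    Values extracted from earlier navigation steps take precedence over
    values from the data retrieval request (an arbitrary choice here).
    """
    values = {**request_params, **previous_results}
    return {key: Template(text).substitute(values) for key, text in step.items()}


# Hypothetical step: the URL prefix comes from a previous server response,
# the search term from the data retrieval request.
step = {"method": "get", "url": "${form_action}?search=${searchstring}"}
filled = fill_step(
    step,
    request_params={"searchstring": "Chemnitz"},
    previous_results={"form_action": "https://example.org/w/index.php"},
)
print(filled["url"])
```

Because the URL prefix is taken from a preceding response rather than hard-coded, such a step keeps working when the website changes the address of the form target.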

[0009] The script module triggers execution of the scraping script. In other words, the script module executes the steps as defined in the scraping script and triggers parameterized requests to the web server. In response to the scraping script execution, the script module receives from the website HTML data or XML data representing the dynamic content data defined by the data retrieval request.

[0010] The computer system further has an XPath extraction module which is pre-configured with the website specific XPath statements in accordance with the structure of the website. The HTML/XML data received by the script module can be directly provided to the XPath extraction module via the screen scraper. The XPath extraction module extracts machine-readable content data from the HTML/XML data. XPath (XML Path Language) is a query language for selecting nodes from an XML document. Further, XPath can be used to compute values (e.g., strings, numbers, or Boolean values) from the content of an XML document. Further, XPath can also be used for parsing HTML pages that have been previously transferred into XML documents. XPath uses a compact, non-XML syntax to facilitate use of XPath within XML attribute values and XML nodes. XPath operates on the abstract, logical structure of an XML document, rather than its surface syntax. As a consequence, XPath statements return structured data which is readable by computers.
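
As a rough illustration of XPath-based extraction, the following Python sketch queries a small, well-formed document with the standard library module `xml.etree.ElementTree`. The page fragment is hypothetical. Note that this module implements only a subset of XPath, so the attribute value is read with `.get()` instead of a trailing `/@action` step; a full XPath engine would be needed for statements of the kind used in the application.

```python
import xml.etree.ElementTree as ET

# Hypothetical page fragment (HTML already converted to well-formed XML).
page = """
<html>
  <body>
    <div><div>
      <form id="searchform" action="/w/index.php">
        <input name="search"/>
      </form>
    </div></div>
    <ul>
      <li><a href="/wiki/A">A</a></li>
      <li><a href="/wiki/B">B</a></li>
    </ul>
  </body>
</html>
"""

root = ET.fromstring(page)

# Select a node by attribute value, regardless of how deeply it is nested.
form = root.find(".//form[@id='searchform']")
action = form.get("action")

# Select all link nodes and collect their targets as structured data.
links = [a.get("href") for a in root.findall(".//a")]

print({"action": action, "links": links})
```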

[0011] In one embodiment, the computer system further has a request queue to buffer the data retrieval request amongst a plurality of further data retrieval requests. For example, the buffer can be implemented as a data storage structure which supports Piping and Queueing (FIFO-Buffer) or Stacking (LIFO-Buffer). Using a request queue allows job scheduling to be performed for a plurality of data retrieval requests.
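
The difference between the two buffering disciplines can be shown with a short Python sketch. The request names are placeholders, and the choice of `collections.deque` and a plain list is an assumption of this example, not part of the disclosure.

```python
from collections import deque

requests = ["request-1", "request-2", "request-3"]

# FIFO buffer (queueing): requests leave in the order in which they arrived.
fifo = deque(requests)
fifo_order = [fifo.popleft() for _ in range(len(requests))]

# LIFO buffer (stacking): the most recently added request leaves first.
lifo = list(requests)
lifo_order = [lifo.pop() for _ in range(len(requests))]

print(fifo_order)
print(lifo_order)
```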

[0012] In one embodiment, the computer system further has a session management module. The session management module is useful in cases where the website requires login credentials. In this embodiment, the data retrieval request further specifies user authentication data which are necessary for a user to login to the website. The session management module can determine a session ID for the data retrieval request of the user to access the dynamic content data on the website. The interface is communicatively coupled with an account database. The account database can be an integral component of the computer system or it can be stored on a remote computer (e.g. a cloud server) which can be accessed by the computer system through standard network communication interfaces. The session management module can provide the authentication data to the account database. In case of successful authentication of the user, the session management module receives from the account database (via the interface) the login credentials for the authenticated user in response to a credential request. More details of this authentication process are disclosed in the detailed description.

[0013] The session management module can check if an open session is already available for the received data retrieval request. An open session, as used herein, refers to an active session which is already running. If an open session is available, it provides the session ID of the open session to the scraper module. The scraper and the script module can then use the existing session to trigger the execution of the predefined scraping script. If no open session is available for the received data retrieval request, the session management module can initiate (via the scraper) the execution of a pre-configured login script by the script module in accordance with the data retrieval request. As a response to the execution of the login script the website provides one or more cookies which are provided as the session ID to the scraper.

[0014] In one embodiment, a computer-implemented method is provided for extracting dynamic content data in a machine-readable format from a website provided by a server. Thereby, dynamic content data relates to content data which is generated by the server in response to a request. The method can be executed by the modules of the disclosed computer system. The method includes the steps: accessing configuration data reflecting the structure of the website, the configuration data including at least a website specific scraping script and one or more website specific XPath statements (such configuration data may be generated by a user or by a machine); receiving, via an interface, a data retrieval request specifying the website and corresponding dynamic content data to be retrieved from the website; executing the scraping script wherein the scraping script is configured to perform one or more parameterized navigation steps on the website to access the dynamic content data; receiving, from the website in response to the scraping script execution, HTML/XML data representing the dynamic content data; providing the HTML/XML data to an XPath extraction module, wherein the XPath extraction module is pre-configured with the website specific XPath statements in accordance with the structure of the website; and receiving, from the XPath extraction module, machine-readable content data extracted from the HTML/XML data. The machine-readable content data include the extracted dynamic content in a format which can be further processed by a machine.

[0015] In one embodiment, the website requires login credentials and the data retrieval request further specifies user authentication data. In this embodiment, the method further includes the steps: providing the authentication data to an account database; and in case of successful authentication of the user, receiving from the account database the login credentials for the authenticated user in response to a credential request.

[0016] In a further embodiment with a login requirement for the website the method includes the further step: determining a session ID for the data retrieval request to access the dynamic content data on the website. Determining a session ID may further include: if an open session is available for the received data retrieval request, providing the session ID of the open session; if no open session is available for the received data retrieval request, executing a pre-configured login script in accordance with the data retrieval request, and receiving, in response to the executed login script, one or more cookies as the session ID.

[0017] In one embodiment, a computer program product is provided that, when loaded into a memory of a computing device and executed by at least one processor of the computing device, executes the steps of the computer-implemented method as disclosed herein.

[0018] Further aspects of the invention will be realized and attained by means of the elements and combinations particularly depicted in the appended claims. It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the invention as described.

Brief Description of the Drawings



[0019] 

FIG. 1 shows a simplified diagram of an embodiment of a computer system for extracting dynamic content data from a website into a machine-readable format;

FIG. 2A is a simplified flowchart of a computer-implemented dynamic content extraction method which can be performed by embodiments of the computer system;

FIG. 2B is a simplified flowchart of sub-steps for determining a session ID;

FIG. 3 is a swim lane diagram illustrating data flows between modules of a particular embodiment for extracting dynamic content data from a website in a machine-readable format;

FIGs. 4A to 4E illustrate coding portions of example implementations for configuration data including website specific scripts and website specific XPath statements according to an embodiment; and

FIG. 5 is a diagram that shows an example of a generic computer device and a generic mobile computer device, which may be used with the techniques described here.


Detailed Description



[0020] FIG. 1 shows a simplified diagram of an embodiment of a computer system 100 for extracting dynamic content data 221 from a website 220 in a machine-readable format. FIG. 1 is described in the context of FIG. 2A which is a simplified flowchart of a computer-implemented method 1000 which can be performed by embodiments of the computer system 100. Method steps illustrated by dashed boxes are optional steps of the method 1000. The following description of FIG. 1 in the context of FIG. 2A refers to reference numbers of both figures.

[0021] The computer system 100 has an interface 110 to access 1100 configuration data 250 reflecting the structure of the website 220. The configuration data 250 include at least a website specific scraping script and one or more website specific XPath statements. The website specific scraping script includes script statements which are configured to interact with the website, for example, by addressing a certain URL and defining methods to be performed on this URL. A detailed example of a scraping script is discussed in FIG. 4B. The website specific XPath statements are used for extracting data (in a machine-readable format) from HTML/XML data provided by the website in response to the execution of the scraping script.

[0022] Further, the interface receives 1200 a data retrieval request 210 specifying the website 220 and corresponding dynamic content data 221 to be retrieved. In other words, the data retrieval request specifies the parts of the website 220 which correspond to the data of interest of a requesting user or system. The data retrieval request 210 can be phrased by a user who wants to retrieve specific dynamic content data from the website and provide such data for further processing to a computer system. The data retrieval request may also be a machine generated request which is automatically composed by a computer system in accordance with respective generation rules. For example, the data retrieval request may include one or more dynamic parameters which may lead to different content 221 being generated by website 220. The value of a dynamic parameter can be the result value of a particular query, for example, a value returned by a search result of a respective search query. This value is not known beforehand and is subject to change with each new search query. The value can be used as a parameter for the next navigation step (e.g., another search query with the returned result value, such as a URL, as a new parameter), for example, to retrieve more details with a corresponding follow-up request.

[0023] In an optional embodiment, the system 100 further has a request queue 115 which may be a memory component configured to buffer 1230 the received data retrieval request amongst a plurality of further data retrieval requests. The request queue 115 can be used for job scheduling of multiple data retrieval requests. Such a scheduling function may be part of the request queue. In other words, once multiple jobs are stored in the request queue, the system can process the data retrieval requests in a controlled order (e.g., FIFO or LIFO), or in parallel (if parallel processing is supported by the hardware of computer system 100).

[0024] The scraper module 120 of the system 100 is preconfigured with the received configuration data in that it manages the received scraping script. This includes, but is not limited to, provisioning the scraping script to the script module and further processing the result(s) of the script execution. The scraper 120 receives the data retrieval request 210 either directly via the interface 110 or, in the optional embodiment using the request queue, via the request queue 115, and provides the scraping script for execution to a script module 140. The scraping script is configured to perform, when executed, one or more parameterized navigation steps on the website 220 to access the dynamic content data 221. The script module 140 executes 1300 the scraping script and receives 1400, in response to the scraping script execution, HTML/XML data representing the dynamic content data 221 from the website 220. In the example embodiment of FIG. 1, the scraper 120 acts as a communication management module which handles the communication between other modules of the system. However, a person skilled in the art can also design the computer system 100 in such a way that the other modules may have communication interfaces which allow them to directly communicate with other modules bypassing the scraper.

[0025] In the example embodiment, the received HTML/XML data is provided 1500 to an XPath extraction module 150. The provisioning 1500 may occur either directly from the script module 140 (not shown), or the HTML/XML data may be routed through the scraper 120 to XPath extractor 150 as illustrated in FIG. 1 of the example embodiment. The XPath extractor 150 is pre-configured with the website specific XPath statements in accordance with the structure of the website 220 to extract machine-readable content data 222 from the HTML/XML data. The retrieved dynamic content data 221 is received 1600 in a machine-readable format (i.e., a format suitable for further machine processing) as machine-readable content 222 from the XPath extractor 150. In the example embodiment of FIG. 1, the scraper 120 receives the result from the XPath extractor and forwards the machine-readable content 222 to the requesting entity via the interface 110. Alternatively, the interface 110 may directly receive the machine-readable content 222 from the XPath extractor.

[0026] As a result, the requested dynamic content 221 of the website 220 is automatically retrieved by the computer system in a flexible and robust manner and provided in a machine-readable format to the requesting entity. The flexibility is improved through the preconfigured scraping script which allows flexible navigation through the website to identify the dynamic content data based on HTML element values or labels rather than based on a rigid HTML structure. The robustness comes primarily through XPath statements using values of the HTML document rather than using rigid HTML structure paths. Therefore, a web page can have a changed layout, added fields or even removed fields, but the scraping script and the XPath statements do not need to be adapted, because the respective values are addressed via XPath statements rather than via fixed structural positions.
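
The robustness argument can be made concrete with a small Python sketch that compares a value-based path with a rigid positional path on two hypothetical layouts of the same page. The documents, the label "Temperature" and both path expressions are invented for this illustration; `xml.etree.ElementTree` supports only a subset of XPath, which is sufficient here.

```python
import xml.etree.ElementTree as ET

# Value-based path: addresses the cell via the text of its label cell.
BY_VALUE = ".//tr[th='Temperature']/td"
# Rigid path: addresses the cell via its position in the element tree.
RIGID = "./body/table/tr[1]/td"

old_layout = (
    "<html><body><table>"
    "<tr><th>Temperature</th><td>21.5</td></tr>"
    "</table></body></html>"
)

# Changed layout: an extra wrapper element and an added row in front.
new_layout = """<html><body><div class="box"><table>
<tr><th>Humidity</th><td>40</td></tr>
<tr><th>Temperature</th><td>21.5</td></tr>
</table></div></body></html>"""


def extract(document, path):
    """Return the text of the first match, or None if nothing matches."""
    node = ET.fromstring(document).find(path)
    return None if node is None else node.text


for name, document in (("old", old_layout), ("new", new_layout)):
    print(name, extract(document, BY_VALUE), extract(document, RIGID))
```

In this sketch the value-based path returns the same cell for both layouts, whereas the positional path no longer matches once the wrapper element is introduced.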

[0027] Some websites require authentication of the requesting user before allowing to access the dynamic content data. For such scenarios, in an optional embodiment, the computer system 100 includes account/credential management module 160 which is communicatively coupled with an account database 230. The account database 230 may be an internal component of the computer system 100 or it may be stored on a remote computer accessible through standard communication technology. Further, a session management module 130 is used.

[0028] In case the website 220 requires login credentials from the requesting entity the data retrieval request 210 further specifies user authentication data. The authentication data is then provided 1210 to the account database 230 via the account/credential management module 160. The account database 230 stores information about the users and respective credentials for accessing the website 220. In response to the authentication data the account database 230 provides a corresponding user to the interface 110. In a subsequent step, the interface launches a credential request for the received user via the account/credential management module 160. In case of successful authentication of the user this credential request is answered by the account database 230 with corresponding login credentials for the authenticated user. The login credentials are received 1220 by the interface 110 and provided to the session management module 130.

[0029] Turning briefly to FIG. 2B, the session management module 130 determines 1240 a session ID for the data retrieval request to access the dynamic content data on the website 220. For this purpose, a check 1241 is performed whether an open session is already available. If an open session is available for the received data retrieval request 210, the session management module 130 provides 1242 the respective session ID of the open session to the scraper 120 to be used for the execution of the scraping script. If no open session is available for the received data retrieval request, the scraper 120 provides a pre-configured login script to the script module 140. In this embodiment, the preconfigured login script is part of the configuration data initially received by computer system 100. The execution of the login script on the website is then triggered 1243 via the script module in accordance with the data retrieval request. In response to the executed login script, the script module receives 1244 one or more cookies as the session ID which is finally provided to the scraper. The one or more cookies can be stored in a respective cache memory.
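
The decision logic of FIG. 2B can be sketched in a few lines of Python. The cache layout (a dictionary keyed by a request key), the function name `determine_session_id` and the stand-in login function are assumptions of this example; the reference numbers in the comments map the lines to the steps described above.

```python
def determine_session_id(session_cache, request_key, run_login_script):
    """Return a session ID, reusing an open session when one exists.

    session_cache:    dict mapping a request key to stored cookies
    run_login_script: callable that performs the login and returns cookies
    """
    cookies = session_cache.get(request_key)   # check 1241: open session?
    if cookies is not None:
        return cookies                         # provide 1242
    cookies = run_login_script()               # trigger 1243, receive 1244
    session_cache[request_key] = cookies       # keep the cookies for reuse
    return cookies


# Usage with a stand-in login that counts how often it is invoked.
login_calls = []


def fake_login():
    login_calls.append("login")
    return {"sessionid": "abc123"}


cache = {}
first = determine_session_id(cache, ("user-17", "example.org"), fake_login)
second = determine_session_id(cache, ("user-17", "example.org"), fake_login)
print(first == second, len(login_calls))
```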

[0030] A detailed example embodiment is now described with FIG. 3. Code examples of FIGs. 4A to 4E illustrate specific code sections of a simplified JSON code example illustrating an implementation of the inventive approach to retrieve dynamic content data from the Wikipedia website. It is to be noted that this simple example is only used for explaining the concept. Very complex examples can be implemented by the inventive concept. Therefore, the shown example is not to be interpreted to be limiting the scope of protection. Rather, a person skilled in the art can apply the technical teaching of this example to very complex website structures with high benefit resulting from the high flexibility and robustness of the disclosed procedure.

[0031] FIG. 3 is a swim lane diagram 2000 illustrating data flows between modules of a particular embodiment for extracting dynamic content data from a website in a machine-readable format. FIGs. 4A to 4E illustrate coding portions of example implementations for website specific scripts and website specific XPath statements according to an embodiment. FIG. 3 will be described in the context of the JSON code portions of FIGs. 4A to 4E. Other data description languages (e.g., markup languages or data serialization languages, such as for example, XML, YAML or BSON) may be used by a skilled person instead. In FIG. 3 the reference numbers of FIG. 1 are reused for the respective system modules.

[0032] The vertical bars of FIG. 3 represent the following entities: requesting entity R (10), interface I (110), request queue RQ (115), scraper GS (120), session module SeM (130), script module ScM (140), XPath extractor XE (150), account database AD (230), and website WS (220). It is to be noted that for the reason of simplicity the communication between the interface I and the account database AD is illustrated as a direct communication leaving out the account/credential management module which facilitates this communication as already explained in the description of FIG. 1. The vertical dimension of FIG. 3 can be interpreted as a time axis where time progresses top down. The horizontal arrows in FIG. 3 illustrate messages which are exchanged between the respective entities. The direction of each arrow indicates respective sender and recipient of the message.

[0033] The requesting entity R sends a data retrieval request 2010 to the interface I of the computer system. For example, interface I can be implemented as a REST interface. Representational state transfer (REST) or RESTful web services is a way of providing interoperability between computer systems on the Internet. REST-compliant Web services allow requesting systems to access and manipulate textual representations of Web resources using a uniform and predefined set of stateless operations. Other forms of Web services exist, which expose their own arbitrary sets of operations such as WSDL and SOAP. In a RESTful Web service, requests which are made to the uniform resource identifier (URI) of a resource will elicit a response that may be in XML, HTML, JSON or some other defined format. The response may confirm that some alteration has been made to the stored resource, and it may provide hypertext links to other related resources or collections of resources. Using HTTP, the kinds of operations available include those predefined by the HTTP verbs GET, POST, PUT, DELETE, and so on.

[0034] FIG. 4A illustrates a JSON example 400 which includes some coding sections to instantiate and configure some of the modules of the computer system. For example, section 402 can be used to instantiate the interface I as a REST interface and section 403 can be used to instantiate the session management module SeM.

[0035] In the embodiment of FIG. 3, it is assumed that the optional modules of FIG. 1 are included in the computer system. It is further assumed that the website of interest requires authentication. Therefore, the data retrieval request includes authentication data. The authentication data is used, in accordance with the description of FIG. 2B, to request 2020 user data from AD. When the requestor R (e.g., a user) issues a request to the interface I, he/she may need to authenticate via HTTP basic authentication. A lookup in the credential database can be used to verify that R is allowed to trigger a scraping process. If the authentication was successful, the credential management module furthermore provides the associated site credentials that will be used, if needed, for logging in to the site that is to be scraped. In other words, if authentication via the account database is successful, AD sends a user ID 2021 to the interface I in accordance with the provided authentication data. In a second step, I sends a credential request 2022 for said user to AD. AD in turn provides login credentials 2023 in response to the request 2022. In the JSON example of FIG. 4, the optional account/credential management module (for managing the communication between the interface I and AD; not shown) is instantiated by section 404. Section 405 shows configuration for the credential management aspects. For example, it specifies username, password, server address and other configuration parameters such as, for example, the configuration for the AD access to the database or the configuration for the database encryption.
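
The two-step lookup (authentication data to user ID, then user ID to login credentials) can be sketched in Python as follows. The in-memory dictionaries stand in for the account database AD, and all user names, passwords and function names are hypothetical. Only the decoding of the HTTP basic authentication header follows the standard scheme (Base64 of "user:password").

```python
import base64

# Hypothetical in-memory stand-ins for the account database AD.
USERS = {("alice", "secret"): "user-17"}  # authentication data -> user ID
SITE_CREDENTIALS = {"user-17": {"username": "a.site", "password": "pw"}}


def parse_basic_auth(header):
    """Decode the value of an HTTP 'Authorization: Basic ...' header."""
    scheme, _, encoded = header.partition(" ")
    if scheme.lower() != "basic":
        raise ValueError("not a Basic authorization header")
    decoded = base64.b64decode(encoded).decode("utf-8")
    user, _, password = decoded.partition(":")
    return user, password


def lookup_login_credentials(header):
    """Return the site login credentials, or None if authentication fails."""
    user_id = USERS.get(parse_basic_auth(header))  # steps 2020 / 2021
    if user_id is None:
        return None
    return SITE_CREDENTIALS.get(user_id)           # steps 2022 / 2023


good = "Basic " + base64.b64encode(b"alice:secret").decode("ascii")
bad = "Basic " + base64.b64encode(b"alice:wrong").decode("ascii")
print(lookup_login_credentials(good))
print(lookup_login_credentials(bad))
```

A real deployment would not store passwords in plain text; the sketch only mirrors the message sequence 2020 to 2023.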

[0036] Upon successful triggering, the interface I can put 2030 the data retrieval request in the optional request queue RQ from which it gets consumed and forwarded to the configured scraper GS as soon as a new request to the site is allowed. For example, only a configurable number of maximum parallel requests may be allowed for each site in order to avoid too much load on the website WS from specific accounts. The forwarding of requests buffered in RQ to GS is illustrated by the circular arrow.
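
One possible way to enforce a maximum number of parallel requests per site is sketched below in Python. The class name `RequestQueue`, its methods and the site names are assumptions of this example; the application leaves the scheduling mechanism open.

```python
from collections import defaultdict, deque


class RequestQueue:
    """FIFO buffer that releases a request only if its site has a free slot."""

    def __init__(self, max_parallel_per_site):
        self.max_parallel = max_parallel_per_site
        self.pending = deque()
        self.active = defaultdict(int)  # site -> requests currently running

    def put(self, site, request):
        self.pending.append((site, request))

    def next_allowed(self):
        """Return the oldest request whose site is below its limit, else None."""
        for index, (site, request) in enumerate(list(self.pending)):
            if self.active[site] < self.max_parallel:
                del self.pending[index]
                self.active[site] += 1
                return site, request
        return None

    def done(self, site):
        """Signal that a request to the given site has finished."""
        self.active[site] -= 1


queue = RequestQueue(max_parallel_per_site=1)
queue.put("site-a", "r1")
queue.put("site-a", "r2")
queue.put("site-b", "r3")

released = [queue.next_allowed(), queue.next_allowed(), queue.next_allowed()]
queue.done("site-a")
released.append(queue.next_allowed())
print(released)
```

With a limit of one, the second request for "site-a" stays buffered until the first one is reported as done, while the request for "site-b" is released in the meantime.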

[0037] When the GS receives a request 2031, it orchestrates the steps necessary for scraping the website WS. In case the WS requires a login, GS checks 2032 if there is already an active session available (open session "OS?") in the optional session management module SeM. If so (YES), the login process is skipped, and GS can directly proceed to the scraping process 2050 by providing the respective stored session cookie 2042-1. If not (NO), GS triggers the login to WS by invoking 2040 the script module ScM to trigger the execution of a respective login script. The login script is provided to GS during the deployment and configuration of the GS module. FIG. 4C shows a JSON example 450 with the login script 451 which is part of the scripts section of the GS instance 450. The login script includes a sequence of parametrized HTTP calls that are issued to the target website WS. For execution of the login script 451, ScM is instantiated via section 401 (cf. FIG. 4A). Upon successful execution 2041 of the login script on WS, WS provides one or more cookies as session ID back to ScM from where the provided cookie is forwarded 2042, 2043 to GS and SeM to be stored by the session management module SeM as a new session ID which can now be used by GS for future scraping. To provide feedback to R (e.g., the user) about the login process and to avoid massive re-login attempts, a built-in counter can capture failed attempts. The credential management module may show an alert associated with credentials that have an increased failure counter. This information can be used to avoid accounts being banned due to too many failed logins, and to inform the administrator(s) of the scraper to solve the login problem. Additionally, accounts can be thereby marked as broken, preventing them from being used in the future.
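
The failure counter mentioned at the end of paragraph [0037] might look like the following Python sketch. The threshold of three failed attempts, the reset on a successful login and the class name `LoginFailureTracker` are assumptions made for illustration.

```python
class LoginFailureTracker:
    """Counts consecutive failed logins per account."""

    def __init__(self, max_failures=3):
        self.max_failures = max_failures
        self.failures = {}

    def record(self, account, success):
        # A successful login resets the counter; a failure increments it.
        if success:
            self.failures[account] = 0
        else:
            self.failures[account] = self.failures.get(account, 0) + 1

    def is_broken(self, account):
        """True once the account reached the failure limit and should be skipped."""
        return self.failures.get(account, 0) >= self.max_failures


tracker = LoginFailureTracker(max_failures=3)
for _ in range(3):
    tracker.record("account-1", success=False)
tracker.record("account-2", success=False)
tracker.record("account-2", success=True)

print(tracker.is_broken("account-1"), tracker.is_broken("account-2"))
```

Checking `is_broken` before each login attempt keeps the scraper from retrying credentials that are likely to get the account banned.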

[0038] In situations where no login is required by WS, the steps 2020 to 2023 and 2040 to 2043 are not required. GS can proceed with the scraping script right upon the receipt of the data retrieval request 2031. FIG. 4B illustrates the predefined scraping script 413 in the scripts section of the scraper. The scraping script 413 is received via interface I as part of the configuration of the scraper and includes a sequence of parametrized HTTP calls that are executed through the ScM by navigating on the target website. In the example of the scraping script 413, the HTTP method "get" is used to call the URL https://de.wikipedia.org/. A successful call of the URL can be verified by a respective "checkSuccessXpath" statement. The required parameter "searchstring" is a potential user input in the data retrieval request specifying the dynamic content to be retrieved. This content can be hidden on the website in a form which may be placed anywhere on the website. By specifying the form via its ID "//form[@id='searchform']/@action" it can be quickly identified on the website no matter how nested the structure of WS may be. The "searchstring" is then sent 2051 to the server and the respective URL is called with the parameter "searchstring". The closing "checkSuccessXpath" statement is optional and checks whether the call was successful. This function provides a debugging opportunity in cases where the execution of the scraping was not successful and changes to the scripts may be necessary. The system can immediately localize where the scraping script has failed by using such "checkSuccessXpath" debugging statements. The scraping script can use results from preceding server responses. Therefore, a dynamic navigation through the website is possible, enabling the script to work without relying on predefined URLs. The result of the query is received 2052 by the screen scraper as HTML/XML data representing the dynamic content data.

[0039] Section 412 specifies the endpoint under which the screen scraper is accessible from outside the computer system. In this example the screen scraper is accessible via HTTP on port 9090. This interface can be used by a human user or equally by a machine.

[0040] To summarize, FIGs. 4A to 4C illustrate by way of example how the various modules of the computer system can be instantiated and configured. In particular, the scraping script example in FIG. 4B illustrates a predefined scraping script which is specific for retrieving dynamic content from the Wikipedia website. In an optional embodiment, the login script in FIG. 4C can be used to access the website based on authentication data provided with the data retrieval request when a login is required from the requesting entity R. Successful execution of the scraping script provides the respective HTML/XML data to the GS.

[0041] As explained earlier, besides such scripts the configuration data further includes one or more website specific XPath statements which are used to retrieve the requested dynamic content data in a machine-readable format (for further data processing) from the received HTML/XML data. FIG. 4D illustrates a JSON example with a set 420 of flexible and robust XPath statements which can be used for said purpose. FIG. 4E illustrates an example set 430 of XPath statements which can provide the same result as the set of FIG. 4D but is less robust. The main difference is that in set 420 navigation occurs through identifiers only, whereas in set 430 the navigation occurs via specific structure elements of the target website. Once the structure is slightly modified (e.g., by moving a structure element or by insertion/deletion of structure elements), the set 430 will fail to retrieve the requested data while the set 420 will still provide the correct result.
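The difference in robustness can be illustrated with a small Python sketch using the XPath subset of the standard library module xml.etree.ElementTree. The markup, the identifier "population", the value 12345 and both XPath statements are invented for this illustration and are not the statements of FIG. 4D or FIG. 4E.

```python
import xml.etree.ElementTree as ET

# Invented sample page (assumption, not taken from the target website).
ORIGINAL = """<html><body>
<div><h1>City</h1></div>
<div><span id="population">12345</span></div>
</body></html>"""

# The same content after a banner has been inserted above the existing blocks.
MODIFIED = """<html><body>
<div class="banner">News</div>
<div><h1>City</h1></div>
<div><span id="population">12345</span></div>
</body></html>"""

ROBUST = ".//span[@id='population']"   # navigates through an identifier only
BRITTLE = "./body/div[2]/span"         # navigates via position in the structure


def extract(markup, xpath):
    """Return the text of the first node matched by the XPath, or None."""
    node = ET.fromstring(markup).find(xpath)
    return node.text if node is not None else None


before = (extract(ORIGINAL, ROBUST), extract(ORIGINAL, BRITTLE))
after = (extract(MODIFIED, ROBUST), extract(MODIFIED, BRITTLE))
```

Both statements return the value on the original markup. After the insertion, the position-based statement points at a different div and finds nothing, while the identifier-based statement still returns the value.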

[0042] Turning back to FIG. 3, the GS provides 2053 the HTML/XML data to the XPath extraction module XE. There, the HTML/XML data is cleaned (e.g., using the library htmlcleaner available at htmlcleaner.sourceforge.net) and parsed into a DOM tree. The XE is instantiated in section 411 of the example in FIG. 4B. The configuration of the XPath extraction module in this example refers to the file "wikipedia-robust.json" which corresponds to the set 420 of FIG. 4D. That is, the XPath extraction module is pre-configured with the website specific XPath statements in accordance with the logical structure of the website WS, as for example illustrated by the set 420. The XPath statements point to respective locations within the HTML/XML data contents with the respective information. XE applies the website specific XPath statements to the received HTML/XML data (e.g., the respective DOM tree) and extracts from the HTML/XML data the requested dynamic content in a machine-readable format. Such extracted machine-readable content data is then forwarded 2054, 2055, 2056 to the requesting entity R.

[0043] The skilled person will be able to understand the functioning of the set 420 of FIG. 4D. Nevertheless, a short explanation is given with regard to the XPath statement referred to as "pageLinks" (the last statement of set 420). In this statement, each XML/HTML "h2" node is addressed. The result is filtered to return only those nodes that have a "span" node with the id "Weblink" as a child. From such a node, the next sibling node of type "div" is selected. Finally, a filter is applied to find nodes of type "a" anywhere in the subtree of the selected node. From each such "a" node the value of the "href" attribute is extracted as a "string".
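The steps of the "pageLinks" statement can be traced in the following Python sketch. The XPath in the comment is a reconstruction from the explanation above and is not copied from FIG. 4D; the sample markup and the function name page_links are likewise assumptions. Because the standard library module xml.etree.ElementTree does not support the following-sibling axis, the steps are walked manually, and "next sibling" is interpreted as the first following sibling of type "div".

```python
import xml.etree.ElementTree as ET

# Hedged reconstruction of the described statement (assumption):
#   //h2[span[@id='Weblink']]/following-sibling::div[1]//a/@href

SAMPLE = """<body>
<h2><span id="History">History</span></h2>
<div><a href="https://example.org/ignored">x</a></div>
<h2><span id="Weblink">Weblinks</span></h2>
<p>intro</p>
<div><ul><li><a href="https://example.org/a">A</a></li>
<li><a href="https://example.org/b">B</a></li></ul></div>
</body>"""


def page_links(root):
    """Collect href strings as described for the pageLinks statement."""
    links = []
    for parent in root.iter():
        children = list(parent)
        for position, child in enumerate(children):
            # An h2 node that has a span child with the id "Weblink".
            if child.tag != "h2" or child.find("span[@id='Weblink']") is None:
                continue
            # The first following sibling of type div.
            sibling = next(
                (c for c in children[position + 1:] if c.tag == "div"), None
            )
            if sibling is None:
                continue
            # Every a node in the subtree, reduced to the value of its href attribute.
            links.extend(a.get("href") for a in sibling.iter("a") if a.get("href"))
    return links


links = page_links(ET.fromstring(SAMPLE))
```

The link below the "History" heading is not returned because its "h2" node has no "span" child with the id "Weblink".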

[0044] The computer system may further include an error handling module. In case the data retrieval request cannot be successfully processed, the error handling module can provide an error code which is generated based on the respective checkSuccessXpath statements as explained earlier. Examples of possible errors include but are not limited to: Unable to login; Unknown HTTP request exception while requesting site; Connection to scraped site cannot be established; Site URL could not be resolved; Problem with establishing an SSL connection to scraped site; Unexpected response from scraped site; Login to target site failed; Wrong parameters given in request; Unable to load user credentials for site to be scraped; Extraction failed; Scraper not registered; Missing required parameters; Scraped site didn't respond in time; Internal scraper server error; Internal scraper timeout error; Internal scraper no handler error; Internal scraper recipient failure error.

[0045] In one embodiment, a scraping script can include a JSON array of JSON objects. Each JSON object represents a single step. A subsequent step can take the response returned by the preceding step as input and behave as configured in its JSON object. In such a JSON object several properties from the below table 1 can be given. Each step has at least one of the properties url, urlFromXpath or urlFromHeader indicating where the request is sent to. All other fields are optional.
Table 1: JSON object property examples
Key | Type | Variables allowed | Description
url | String | Yes | URL to send the request to
urlFromXpath | String | No | Takes the body of the last response, applies the XPath and uses the first result as URL
urlFromHeader | String | No | Looks in the last response's header for the given key and uses the result as URL
method | String | No | HttpMethod which is used for the request (defaults to GET)
urlParams | Object | Yes | The key-value items are transformed into GET params ({a:"1"} -> ...?a=1)
formData | Object | Yes (keys & values) | The key-value items are transformed into HTTP body content form data
formDataFromXpath | String | No | Takes the body of the last response and looks for the element specified by this XPath. In the first result, it looks for input elements with attributes name and value (<xpathValue>//input[@name][@value]). These values are overwritten by formData, if keys are the same (ignores case).
responseCookieFilter | Array | No | By default, all cookies are looped through all requests. If this array is given, the response of the current request will only return the cookies with the cookie names which are in the list. If the list is empty, no cookies will be passed to the next request.
variablesTransformation | Object | No | String source, String regex and an optional Integer matchGroup. If a variable named like the value of source exists, the regex is applied to this value (Pattern/Match in Java). If a matchGroup is given, the specified group is used, otherwise the whole match result is stored in the new variable (i.e. X). Important: where variables are allowed, %varname% is replaced by the variable varname (case ignored)
isXml | Boolean | No | If set to true, the response is handled as XML instead of HTML. This means it's not cleaned but directly used to apply XPaths on it. It's optional and defaults to false.
xmlNamespaces | Object | No | This key-value store is used to declare namespaces for the DOM extraction. The key is the namespace prefix and value is the namespace uri. This is only necessary for XML files, as namespaces are cleared with HTML files beforehand. If the value of the key is a zero-length string, the URI is set as the default namespace for elements and types. By default, no namespaces are declared.
checkSuccessXpath | String | Yes | XPath which is applied to the response's HTML to check for success
checkSuccessExpectedValue | String | Yes | Is used to assert the evaluated result of the XPath. If it's equal, the response was successful (requires checkSuccessXpath)
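How a single step object could be turned into a request can be sketched in Python as follows. The property names url, urlParams, formData and method are those of table 1; the function names, the example values, the treatment of unknown variables, and the choice to substitute only the values of urlParams are assumptions of this sketch and not part of the described system.

```python
import re
from urllib.parse import urlencode


def substitute(text, variables):
    """Replace %varname% by the variable value, ignoring the case of the name."""
    lowered = {name.lower(): value for name, value in variables.items()}

    def replace(match):
        # Unknown names are left untouched (assumption of this sketch).
        return str(lowered.get(match.group(1).lower(), match.group(0)))

    return re.sub(r"%(\w+)%", replace, text)


def transform(variables, source, regex, match_group=None):
    """variablesTransformation: apply the regex to the value of an existing variable."""
    if source not in variables:
        return None
    match = re.search(regex, variables[source])
    if match is None:
        return None
    return match.group(match_group) if match_group is not None else match.group(0)


def merge_form(scraped, form_data):
    """Values scraped via formDataFromXpath are overwritten by formData (case ignored)."""
    merged = dict(scraped)
    for key, value in form_data.items():
        for existing in [k for k in merged if k.lower() == key.lower()]:
            del merged[existing]
        merged[key] = value
    return merged


def build_request(step, variables, scraped_form=None):
    """Return (method, url, form body) for one step object."""
    url = substitute(step["url"], variables)
    params = {k: substitute(v, variables) for k, v in step.get("urlParams", {}).items()}
    if params:
        url += "?" + urlencode(params)
    form = {substitute(k, variables): substitute(v, variables)
            for k, v in step.get("formData", {}).items()}
    return step.get("method", "GET").upper(), url, merge_form(scraped_form or {}, form)


step = {"url": "https://example.org/search",
        "urlParams": {"q": "%SearchString%"},
        "formData": {"Token": "%token%"}}
variables = {"searchstring": "Chemnitz", "token": "t-1"}
request = build_request(step, variables, scraped_form={"token": "old", "lang": "de"})
page_id = transform({"location": "/page/42?x=1"}, "location", r"/page/(\d+)", 1)
```

In the usage example the scraped value of "token" is replaced by the configured form value because the keys match when case is ignored, while the scraped "lang" field is kept.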


[0046] To summarize the approach for extracting dynamic content from websites, the system initially receives a data retrieval request which specifies the target website and the corresponding dynamic content data to be retrieved from the website. For example, a user wants to retrieve a train connection from A to B at a given time from a train connection service website TCS. The data retrieval request provides the initial information of the website TCS, start location A, destination location B and departure time t. In a first scraping script step, a corresponding request is submitted to the website TCS. As a response the website TCS may provide a URL where the data can be retrieved. In a next scraping script step, the system may send a request with the URL as a parameter and TCS may provide the respective content data as HTML/XML data to the scraper.

[0047] A person skilled in the art can apply this approach to other scenarios, such as, for example, a request to retrieve the number of inhabitants of a particular city from Wikipedia. In a first scraping script step, the city name is sent to the Wikipedia website and the URL of a page with information about the respective city is provided to the scraper as a response. The received URL is then used as a parameter for the next scraping script step, which launches a request to access the page under the received URL. This page includes the dynamic content data of the city and is provided as HTML/XML data including the number of inhabitants. A corresponding preconfigured XPath statement can then extract this information from the HTML/XML data. Depending on the complexity of the website structure, scraping scripts can include a plurality of navigation steps to finally get access to the requested dynamic content data. FIG. 4D illustrates how the XPath statements 420 can then be used to retrieve all kinds of different content data from the received HTML/XML data.
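The two-step navigation of the city example can be sketched in Python with a stubbed website, so that the sketch runs without network access. The URLs, the markup, the identifiers "result" and "inhabitants", the value 12345 and all function names are invented for illustration. The helper first_attribute handles only statements of the form "element path/@attribute", because the standard library module xml.etree.ElementTree cannot select attributes directly.

```python
import xml.etree.ElementTree as ET

# Stub standing in for the website: maps a requested URL to a response body.
FAKE_SITE = {
    "https://example.org/search?q=Chemnitz":
        "<html><body><a id='result' href='https://example.org/city/chemnitz'>"
        "Chemnitz</a></body></html>",
    "https://example.org/city/chemnitz":
        "<html><body><span id='inhabitants'>12345</span></body></html>",
}

# Hypothetical two-step scraping script using the keys url and urlFromXpath.
SCRIPT = [
    {"url": "https://example.org/search?q=%searchstring%"},
    {"urlFromXpath": ".//a[@id='result']/@href"},
]


def first_attribute(markup, xpath):
    """Evaluate '<element path>/@<attribute>' and return the first result."""
    path, attribute = xpath.rsplit("/@", 1)
    node = ET.fromstring(markup).find(path)
    return node.get(attribute) if node is not None else None


def run_script(steps, searchstring, fetch):
    """Execute the steps in order; each step may use the preceding response."""
    body = None
    for step in steps:
        if "url" in step:
            url = step["url"].replace("%searchstring%", searchstring)
        else:
            url = first_attribute(body, step["urlFromXpath"])
        body = fetch(url)
    return body


page = run_script(SCRIPT, "Chemnitz", FAKE_SITE.__getitem__)
inhabitants = ET.fromstring(page).find(".//span[@id='inhabitants']").text
```

The second step has no predefined URL; it takes the URL from the response to the first step, which is the kind of dynamic navigation described above.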

[0048] FIG. 5 is a diagram that shows an example of a generic computer device 900 and a generic mobile computer device 950, which may be used with the techniques described here. Computing device 900 is intended to represent various forms of digital computers, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other appropriate computers. Generic computer device 900 may correspond to the computer system 100 of FIG. 1. Computing device 950 is intended to represent various forms of mobile devices, such as personal digital assistants, cellular telephones, smart phones, and other similar computing devices. For example, computing device 950 may be used as a frontend by a user to interact with the computing device 900 (e.g., for providing the data retrieval request and for receiving the machine-readable content result). The components shown here, their connections and relationships, and their functions, are meant to be exemplary only, and are not meant to limit implementations of the inventions described and/or claimed in this document.

[0049] Computing device 900 includes a processor 902, memory 904, a storage device 906, a high-speed interface 908 connecting to memory 904 and high-speed expansion ports 910, and a low speed interface 912 connecting to low speed bus 914 and storage device 906. Each of the components 902, 904, 906, 908, 910, and 912, are interconnected using various busses, and may be mounted on a common motherboard or in other manners as appropriate. The processor 902 can process instructions for execution within the computing device 900, including instructions stored in the memory 904 or on the storage device 906 to display graphical information for a GUI on an external input/output device, such as display 916 coupled to high speed interface 908. In other implementations, multiple processors and/or multiple buses may be used, as appropriate, along with multiple memories and types of memory. Also, multiple computing devices 900 may be connected, with each device providing portions of the necessary operations (e.g., as a server bank, a group of blade servers, or a multi-processor system).

[0050] The memory 904 stores information within the computing device 900. In one implementation, the memory 904 is a volatile memory unit or units. In another implementation, the memory 904 is a non-volatile memory unit or units. The memory 904 may also be another form of computer-readable medium, such as a magnetic or optical disk.

[0051] The storage device 906 is capable of providing mass storage for the computing device 900. In one implementation, the storage device 906 may be or contain a computer-readable medium, such as a floppy disk device, a hard disk device, an optical disk device, or a tape device, a flash memory or other similar solid-state memory device, or an array of devices, including devices in a storage area network or other configurations. A computer program product can be tangibly embodied in an information carrier. The computer program product may also contain instructions that, when executed, perform one or more methods, such as those described above. The information carrier is a computer- or machine-readable medium, such as the memory 904, the storage device 906, or memory on processor 902.

[0052] The high-speed controller 908 manages bandwidth-intensive operations for the computing device 900, while the low speed controller 912 manages lower bandwidth-intensive operations. Such allocation of functions is exemplary only. In one implementation, the high-speed controller 908 is coupled to memory 904, display 916 (e.g., through a graphics processor or accelerator), and to high-speed expansion ports 910, which may accept various expansion cards (not shown). In the implementation, low-speed controller 912 is coupled to storage device 906 and low-speed expansion port 914. The low-speed expansion port, which may include various communication ports (e.g., USB, Bluetooth, Ethernet, wireless Ethernet) may be coupled to one or more input/output devices, such as a keyboard, a pointing device, a scanner, or a networking device such as a switch or router, e.g., through a network adapter.

[0053] The computing device 900 may be implemented in a number of different forms, as shown in the figure. For example, it may be implemented as a standard server 920, or multiple times in a group of such servers. It may also be implemented as part of a rack server system 924. In addition, it may be implemented in a personal computer such as a laptop computer 922. Alternatively, components from computing device 900 may be combined with other components in a mobile device (not shown), such as device 950. Each of such devices may contain one or more of computing device 900, 950, and an entire system may be made up of multiple computing devices 900, 950 communicating with each other.

[0054] Computing device 950 includes a processor 952, memory 964, an input/output device such as a display 954, a communication interface 966, and a transceiver 968, among other components. The device 950 may also be provided with a storage device, such as a microdrive or other device, to provide additional storage. Each of the components 950, 952, 964, 954, 966, and 968, are interconnected using various buses, and several of the components may be mounted on a common motherboard or in other manners as appropriate.

[0055] The processor 952 can execute instructions within the computing device 950, including instructions stored in the memory 964. The processor may be implemented as a chipset of chips that include separate and multiple analog and digital processors. The processor may provide, for example, for coordination of the other components of the device 950, such as control of user interfaces, applications run by device 950, and wireless communication by device 950.

[0056] Processor 952 may communicate with a user through control interface 958 and display interface 956 coupled to a display 954. The display 954 may be, for example, a TFT LCD (Thin-Film-Transistor Liquid Crystal Display) or an OLED (Organic Light Emitting Diode) display, or other appropriate display technology. The display interface 956 may comprise appropriate circuitry for driving the display 954 to present graphical and other information to a user. The control interface 958 may receive commands from a user and convert them for submission to the processor 952. In addition, an external interface 962 may be provided in communication with processor 952, so as to enable near area communication of device 950 with other devices. External interface 962 may provide, for example, for wired communication in some implementations, or for wireless communication in other implementations, and multiple interfaces may also be used.

[0057] The memory 964 stores information within the computing device 950. The memory 964 can be implemented as one or more of a computer-readable medium or media, a volatile memory unit or units, or a non-volatile memory unit or units. Expansion memory 984 may also be provided and connected to device 950 through expansion interface 982, which may include, for example, a SIMM (Single In Line Memory Module) card interface. Such expansion memory 984 may provide extra storage space for device 950, or may also store applications or other information for device 950. Specifically, expansion memory 984 may include instructions to carry out or supplement the processes described above, and may include secure information also. Thus, for example, expansion memory 984 may act as a security module for device 950, and may be programmed with instructions that permit secure use of device 950. In addition, secure applications may be provided via the SIMM cards, along with additional information, such as placing the identifying information on the SIMM card in a non-hackable manner.

[0058] The memory may include, for example, flash memory and/or NVRAM memory, as discussed below. In one implementation, a computer program product is tangibly embodied in an information carrier. The computer program product contains instructions that, when executed, perform one or more methods, such as those described above. The information carrier is a computer- or machine-readable medium, such as the memory 964, expansion memory 984, or memory on processor 952 that may be received, for example, over transceiver 968 or external interface 962.

[0059] Device 950 may communicate wirelessly through communication interface 966, which may include digital signal processing circuitry where necessary. Communication interface 966 may provide for communications under various modes or protocols, such as GSM voice calls, SMS, EMS, or MMS messaging, CDMA, TDMA, PDC, WCDMA, CDMA2000, or GPRS, among others. Such communication may occur, for example, through radio-frequency transceiver 968. In addition, short-range communication may occur, such as using a Bluetooth, WiFi, or other such transceiver (not shown). In addition, GPS (Global Positioning System) receiver module 980 may provide additional navigation- and location-related wireless data to device 950, which may be used as appropriate by applications running on device 950.

[0060] Device 950 may also communicate audibly using audio codec 960, which may receive spoken information from a user and convert it to usable digital information. Audio codec 960 may likewise generate audible sound for a user, such as through a speaker, e.g., in a handset of device 950. Such sound may include sound from voice telephone calls, may include recorded sound (e.g., voice messages, music files, etc.) and may also include sound generated by applications operating on device 950.

[0061] The computing device 950 may be implemented in a number of different forms, as shown in the figure. For example, it may be implemented as a cellular telephone 980. It may also be implemented as part of a smart phone 982, personal digital assistant, or another similar mobile device.

[0062] Various implementations of the systems and techniques described here can be realized in digital electronic circuitry, integrated circuitry, specially designed ASICs (application specific integrated circuits), computer hardware, firmware, software, and/or combinations thereof. These various implementations can include implementation in one or more computer programs that are executable and/or interpretable on a programmable system including at least one programmable processor, which may be special or general purpose, coupled to receive data and instructions from, and to transmit data and instructions to, a storage system, at least one input device, and at least one output device.

[0063] These computer programs (also known as programs, software, software applications or code) include machine instructions for a programmable processor, and can be implemented in a high-level procedural and/or object-oriented programming language, and/or in assembly/machine language. As used herein, the terms "machine-readable medium" and "computer-readable medium" refer to any computer program product, apparatus and/or device (e.g., magnetic discs, optical disks, memory, Programmable Logic Devices (PLDs)) used to provide machine instructions and/or data to a programmable processor, including a machine-readable medium that receives machine instructions as a machine-readable signal. The term "machine-readable signal" refers to any signal used to provide machine instructions and/or data to a programmable processor.

[0064] To provide for interaction with a user, the systems and techniques described here can be implemented on a computer having a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to the user and a keyboard and a pointing device (e.g., a mouse or a trackball) by which the user can provide input to the computer. Other kinds of devices can be used to provide for interaction with a user as well; for example, feedback provided to the user can be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user can be received in any form, including acoustic, speech, or tactile input.

[0065] The systems and techniques described here can be implemented in a computing device that includes a back end component (e.g., as a data server), or that includes a middleware component (e.g., an application server), or that includes a front end component (e.g., a client computer having a graphical user interface or a Web browser through which a user can interact with an implementation of the systems and techniques described here), or any combination of such back end, middleware, or front end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include a local area network ("LAN"), a wide area network ("WAN"), and the Internet.

[0066] The computing device can include clients and servers. A client and server are generally remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other.

[0067] A number of embodiments have been described. Nevertheless, it will be understood that various modifications may be made without departing from the spirit and scope of the invention.

[0068] In addition, the logic flows depicted in the figures do not require the particular order shown, or sequential order, to achieve desirable results. In addition, other steps may be provided, or steps may be eliminated, from the described flows, and other components may be added to, or removed from, the described systems. Accordingly, other embodiments are within the scope of the following claims.


Claims

1. A computer-implemented method (1000) for extracting dynamic content data (221) in a machine-readable format from a website (220) provided by a server wherein dynamic content data (221) relates to content data which is generated by the server in response to a request, the method comprising:

accessing (1100) configuration data reflecting the structure of the website (220), the configuration data including at least a website specific scraping script and one or more website specific XPath statements;

receiving (1200), via an interface (110), a data retrieval request (210) specifying the website (220) and corresponding dynamic content data (221) to be retrieved from the website (220);

triggering (1300) execution of the scraping script wherein the scraping script is configured to perform one or more parameterized navigation steps on the website (220) to access the dynamic content data (221);

receiving (1400), from the website (220) in response to the scraping script execution, HTML/XML data representing the dynamic content data;

providing (1500) the HTML/XML data to an XPath extraction module (150), wherein the XPath extraction module is pre-configured with the website specific XPath statements in accordance with the structure of the website (220); and

receiving (1600), from the XPath extraction module (150), machine-readable content data (222) extracted from the HTML/XML data.


 
2. The method of claim 1, wherein the website requires login credentials and wherein the data retrieval request further specifies user authentication data, the method further comprising:

providing (1210) the authentication data to an account database (230); and

in case of successful authentication of the user, receiving (1220) from the account database (230) the login credentials for the authenticated user in response to a credential request.


 
3. The method of claim 1 or 2, wherein the data retrieval request is buffered in a request queue (115) amongst a plurality of further data retrieval requests.
 
4. The method of any of the previous claims, further comprising:

in case of a login requirement for the website, determining (1240) a session ID for the data retrieval request to access the dynamic content data on the website.


 
5. The method of claim 4, wherein determining (1240) a session ID further comprises:

if an open session is available for the received data retrieval request, providing (1242) the session ID of the open session;

if no open session is available for the received data retrieval request, triggering (1243) execution of a pre-configured login script in accordance with the data retrieval request, and receiving (1244), in response to the executed login script, one or more cookies as the session ID.


 
6. The method of any of the previous claims, wherein the one or more parameterized navigation steps are enabled by the scraping script using results from preceding server responses allowing for dynamic navigation through the website and enabling the scraping script to work without relying on predefined URLs.
 
7. The method of any of the previous claims, wherein the data retrieval request includes at least one dynamic parameter leading to different content (221) generated by the website (220) in response to the execution of a particular scraping script step, wherein the dynamic parameter value is not known beforehand.
 
8. A computer program product that when loaded into a memory of a computing device and executed by at least one processor of the computing device executes the steps of the computer implemented method according to any of the previous claims.
 
9. A computer system (100) for extracting dynamic content data (221) from a website (220) in a machine-readable format, the system comprising:

an interface (110) configured to access configuration data (250) reflecting the structure of the website (220), the configuration data including at least a website specific scraping script and one or more website specific XPath statements, and further to receive a data retrieval request (210) specifying the website (220) and corresponding dynamic content data (221) to be retrieved;

a scraper module (120) configured to provide the scraping script (2050) for execution wherein the scraping script is configured to perform one or more parameterized navigation steps on the website (220) to access the dynamic content data (221);

a script module (140) configured to trigger execution of the scraping script and to receive, from the website (220) in response to the scraping script execution, HTML/XML data representing the dynamic content data; and

an XPath extraction module (150), wherein the XPath extraction module is pre-configured with the website specific XPath statements in accordance with the structure of the website (220) to extract machine-readable content data (222) from the HTML/XML data.


 
10. The system (100) of claim 9, further comprising:

a request queue (115) configured to buffer the data retrieval request (210) amongst a plurality of further data retrieval requests.


 
11. The system (100) of claim 9 or 10, wherein the website requires login credentials and wherein the data retrieval request (2010) further specifies user authentication data (2020), further comprising:

a session management module (130) configured to determine a session ID for the data retrieval request to access the dynamic content data on the website.


 
12. The system (100) of claim 11, the interface (110) being communicatively coupled with an account database (230), and being further configured to provide the authentication data (2020) to the account database (230), and in case of successful authentication of the user, to receive from the account database (230) the login credentials (2023) for the authenticated user in response to a credential request (2022).
 
13. The system (100) of claim 11 or 12, wherein the session management module (130) is further configured to:

check if an open session is available for the received data retrieval request;

if an open session is available, provide the session ID of the open session to the scraper module (120);

if no open session is available for the received data retrieval request, trigger the execution of a pre-configured login script (2040) by the script module (140) in accordance with the data retrieval request (2010), and provide one or more cookies (2042), received in response to the executed login script, as the session ID to the scraper module (120).


 
14. The system (100) of any of claims 9 to 13, wherein the computer system (100) is further configured to deploy configuration data to the computer system modules.
 
15. The system (100) of any of claims 9 to 14, wherein the one or more parameterized navigation steps are enabled by the scraping script using results from preceding server responses allowing for dynamic navigation through the website and enabling the scraping script to work without relying on predefined URLs.
 




Drawing

Search report