(19)
(11)EP 2 959 384 B1

(12)EUROPEAN PATENT SPECIFICATION

(45)Mention of the grant of the patent:
21.09.2022 Bulletin 2022/38

(21)Application number: 14753812.8

(22)Date of filing:  14.02.2014
(51)International Patent Classification (IPC): 
G06F 16/2452(2019.01)
G06F 16/2453(2019.01)
(52)Cooperative Patent Classification (CPC):
G06F 16/24542; G06F 16/24524; G06F 16/24532
(86)International application number:
PCT/US2014/016596
(87)International publication number:
WO 2014/130371 (28.08.2014 Gazette  2014/35)

(54)

DATA ANALYTICS PLATFORM OVER PARALLEL DATABASES AND DISTRIBUTED FILE SYSTEMS

DATENANALYSEPLATTFORM ÜBER PARALLELE DATENBANKEN UND VERTEILTE DATEISYSTEME

PLATEFORME D'ANALYSE DE DONNÉES DANS DES BASES DE DONNÉES PARALLÈLES ET DES SYSTÈMES DE FICHIERS DISTRIBUÉS


(84)Designated Contracting States:
AL AT BE BG CH CY CZ DE DK EE ES FI FR GB GR HR HU IE IS IT LI LT LU LV MC MK MT NL NO PL PT RO RS SE SI SK SM TR

(30)Priority: 25.02.2013 US 201361769043 P
15.03.2013 US 201313840912

(43)Date of publication of application:
30.12.2015 Bulletin 2015/53

(73)Proprietor: EMC Corporation
Hopkinton, MA 01748 (US)

(72)Inventors:
  • WELTON, Caleb, E.
    Hopkinton, MA 01748 (US)
  • YANG, Shengwen
    Hopkinton, MA 01748 (US)

(74)Representative: Gill, David Alan 
WP Thompson 138 Fetter Lane
London EC4A 1BT (GB)


(56)References cited:
WO-A1-2012/050582
US-A1- 2011 246 511
US-A1- 2009 254 916
US-B1- 7 984 043
  
      
    Note: Within nine months from the publication of the mention of the grant of the European patent, any person may give notice to the European Patent Office of opposition to the European patent granted. Notice of opposition shall be filed in a written reasoned statement. It shall not be deemed to have been filed until the opposition fee has been paid. (Art. 99(1) European Patent Convention).


    Description

    BACKGROUND OF THE INVENTION



    [0001] Distributed storage systems enable databases, files, and other objects to be stored in a manner that distributes data across large clusters of commodity hardware. For example, Hadoop® is an open-source software framework to distribute data and associated computing (e.g., execution of application tasks) across large clusters of commodity hardware.

    [0002] EMC Greenplum® provides a massively parallel processing (MPP) architecture for data storage and analysis. Typically, data is stored in segment servers, each of which stores and manages a portion of the overall data set. Advanced MPP database systems such as EMC Greenplum® provide the ability to perform data analytics processing on huge data sets, including by enabling users to use familiar and/or industry standard languages and protocols, such as SQL, to specify data analytics and/or other processing to be performed. Examples of data analytics processing include, without limitation, Logistic Regression, Multinomial Logistic Regression, K-means clustering, Association Rules based market basket analysis, Latent Dirichlet based topic modeling, etc.

    [0003] While distributed storage systems, such as Hadoop®, provide the ability to reliably store huge amounts of data on commodity hardware, such systems have not to date been optimized to support data mining and analytics processing with respect to the data stored in them.

    [0004] US 2009 254916 describes computing resources assigned to sub-plans within a query plan to effect parallel execution of the query plan. For example, computing resources in a grid can be represented by nodes, and a shortest path technique can be applied to allocate machines to the sub-plans. Computing resources can be provisionally allocated as the query plan is divided into query plan segments containing one or more sub-plans. Based on provisional allocations to the segments, the computing resources can then be allocated to the sub-plans within respective segments.

    [0005] US 7984043 B1 describes a system and method for distributed query processing that may compile and optimize query plans for incoming query requests independent of hardware configurations and/or physical locations of data partitions in a distributed storage system (e.g., a data grid). The query plan may be divided into segments, and each segment may be instantiated on a remote query processing node of the distributed system by a query coordinator node according to metadata accessed at runtime by remote sub-query operators in the query plan. The metadata may include an indication of the physical locations of data partitions in the system and may be stored on one or more of the query processing nodes. The remote query processing nodes may execute the query plan segments and return results to the requestor. Cached query plans may be re-executed without recompilation, according to current metadata, even in the event of a node failure or data partition move.

    [0006] The present invention is defined in the claims.

    BRIEF DESCRIPTION OF THE DRAWINGS



    [0007] Various embodiments of the invention are disclosed in the following detailed description and the accompanying drawings.

    Figure 1 is a block diagram illustrating an embodiment of a large scale distributed system.

    Figure 2 is a block diagram illustrating an embodiment of a data analytics architecture of a large scale distributed system.

    Figure 3 is a flow chart illustrating an embodiment of a database query processing process.

    Figure 4 is a block diagram illustrating an embodiment of a segment server.

    Figure 5 is a flow chart illustrating an embodiment of a process to perform data analytics processing.


    DETAILED DESCRIPTION



    [0008] The invention can be implemented in numerous ways, including as a process; an apparatus; a system; a composition of matter; a computer program product embodied on a computer readable storage medium; and/or a processor, such as a processor configured to execute instructions stored on and/or provided by a memory coupled to the processor. In this specification, these implementations, or any other form that the invention may take, may be referred to as techniques. In general, the order of the steps of disclosed processes may be altered within the scope of the invention. Unless stated otherwise, a component such as a processor or a memory described as being configured to perform a task may be implemented as a general component that is temporarily configured to perform the task at a given time or a specific component that is manufactured to perform the task. As used herein, the term 'processor' refers to one or more devices, circuits, and/or processing cores configured to process data, such as computer program instructions.

    [0009] A detailed description of one or more embodiments of the invention is provided below along with accompanying figures that illustrate the principles of the invention. The invention is described in connection with such embodiments, but the invention is not limited to any embodiment. Numerous specific details are set forth in the following description in order to provide a thorough understanding of the invention. These details are provided for the purpose of example and the invention may be practiced according to the claims without some or all of these specific details. For the purpose of clarity, technical material that is known in the technical fields related to the invention has not been described in detail so that the invention is not unnecessarily obscured.

    [0010] Providing advanced data analytics capabilities in the context of a large distributed data storage system is disclosed. In various embodiments, a massively parallel processing (MPP) database system is adapted to manage and provide data analytics with respect to data stored in a large distributed storage layer, e.g., an implementation of the Hadoop® distributed storage framework. Examples of data analytics processing include, without limitation, Logistic Regression, Multinomial Logistic Regression, K-means clustering, Association Rules based market basket analysis, Latent Dirichlet based topic modeling, etc. In some embodiments, advanced data analytics functions, such as statistical and other analytics functions, are embedded in each of a plurality of segment servers comprising the MPP database portion of the system. In some embodiments, to perform a data analytics task, such as computing statistics, performing an optimization, etc., a master node selects a subset of segments to perform associated processing, and sends to each segment an indication of the data analytics processing to be performed by that segment, including for example an identification of the embedded data analytics function(s) to be used, and associated metadata required to locate and/or access the subset of data on which that segment is to perform the indicated processing.
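
    Purely as an illustration of the dispatch pattern described above, the following Python sketch shows a master selecting a subset of segments and sending each one an indication of the embedded analytics function to run together with metadata locating its data subset. The names AnalyticsAssignment, select_segments and dispatch_task, and the structure of the message, are assumptions made for this example and are not taken from any particular embodiment.

from dataclasses import dataclass, field


@dataclass
class AnalyticsAssignment:
    """One per-segment assignment: which embedded function to run, and where the data is."""
    function: str        # name of an analytics function embedded at the segment
    data_location: dict  # metadata used by the segment to locate/access its data subset
    params: dict = field(default_factory=dict)


def select_segments(segments, count):
    # Pick a subset of the available segments to perform the analytics task.
    return segments[:count]


def dispatch_task(task, segments, data_splits, send):
    # One assignment per selected segment, pairing each segment with one data split.
    for segment, split in zip(select_segments(segments, len(data_splits)), data_splits):
        send(segment, AnalyticsAssignment(function=task["function"],
                                          data_location=split,
                                          params=task.get("params", {})))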

    [0011] Figure 1 is a block diagram illustrating an embodiment of a large scale distributed system. In the example shown, the large scale distributed system includes a large cluster of commodity servers. The master hosts include a primary master 102 and a standby master 104. The primary master 102 is responsible for accepting queries; planning queries, e.g., based at least in part on system metadata 106, which in various embodiments includes information indicating where data is stored within the system; dispatching queries to segments for execution; and collecting the results from segments. The standby master 104 is a warm backup of the primary master 102. The network interconnect 108 is used to communicate tuples between execution processes. The compute unit of the database engine is called a "segment". Each of a large number of segment hosts, represented in Figure 1 by hosts 110, 112, and 114, can have multiple segments. The segments on segment hosts 110, 112, 114, for example, are configured to execute tasks assigned by the primary master 102, such as to perform assigned portions of a query plan with respect to data stored in distributed storage layer 116, e.g., a Hadoop® or other storage layer.

    [0012] When the master node 102 accepts a query, the query is parsed and planned according to the statistics of the tables in the query, e.g., based on metadata 106. After the planning phase, a query plan is generated. The query plan is sliced into many slices. In the query execution phase, for each slice a group of segments, typically comprising a subset of the segments hosted on segment hosts 1 through s, is selected to execute the slice. In various embodiments, the size of the group may be dynamically determined using knowledge of the data distribution and available resources, e.g., the workload on respective segments.
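
    As a hedged sketch of the dynamic group sizing mentioned above, the Python heuristic below chooses more segments when a slice must scan more data and skips segments that are already heavily loaded. The thresholds, field names, and scoring are assumptions for illustration only, not the claimed method.

def choose_group(slice_bytes, segments, bytes_per_segment=256 << 20, max_load=0.8):
    """Return a subset of segments to execute one query plan slice."""
    # Prefer lightly loaded segments; 'load' is a 0..1 utilisation estimate.
    candidates = sorted((s for s in segments if s["load"] < max_load),
                        key=lambda s: s["load"])
    # Use enough segments that each scans roughly bytes_per_segment of data.
    wanted = max(1, -(-slice_bytes // bytes_per_segment))   # ceiling division
    return candidates[:min(wanted, len(candidates))]


segments = [{"id": i, "load": 0.1 * i} for i in range(8)]
print([s["id"] for s in choose_group(slice_bytes=1 << 30, segments=segments)])  # -> [0, 1, 2, 3]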

    [0013] In various embodiments, a data analytics job or other query may be expressed in whole or in part using SQL and/or any other specified language or syntax. A master node, such as primary master 102, parses the SQL or other input and invokes scripts or other code available on the master to perform the top level processing required to carry out the requested processing. In various embodiments, a query plan generated by the master 102, for example, may identify for each of a plurality of segments a corresponding portion of the global data set to be processed by that segment. Metadata identifying the location of the data to be processed by a particular segment, e.g., within distributed storage layer 116, is sent to the segment by the master 102. In various embodiments, the distributed storage layer 116 comprises data stored in an instance of the Hadoop Distributed File System (HDFS) and the metadata indicates a location within the HDFS of data to be processed by that segment. The master 102 in addition will indicate to the segment the specific processing to be performed. In various embodiments, the indication from the master may indicate, directly or indirectly, one or more analytics functions embedded at each segment which is/are to be used by the segment to perform the required processing.
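
    Purely as an illustration of what such a per-segment message might contain (the field names and values below are invented for this example and are not taken from the description), the metadata could identify an HDFS split together with the embedded function to apply to it:

assignment_message = {
    "plan_slice": 3,
    "function": "logistic_regression_step",  # hypothetical embedded function name
    "data_location": {                        # metadata locating this segment's data subset
        "scheme": "hdfs",
        "path": "/warehouse/events/part-00017",
        "offset": 0,
        "length": 134_217_728,                # one 128 MiB split
        "preferred_hosts": ["host110", "host112"],
    },
}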

    [0014] Figure 2 is a block diagram illustrating an embodiment of a data analytics architecture of a large scale distributed system. In various embodiments, the data analytics architecture 200 of Figure 2 is implemented in a large scale distributed system, such as the large scale distributed system of Figure 1. In the example shown, the data analytics architecture 200 includes a user interface 202 that enables data analytics requests to be expressed using SQL, e.g., as indicated by a specification. Various driver functions 204, e.g., Python or other scripts with templated SQL in this example, may be invoked to perform, for example, the outer loops of iterative algorithms, optimizer invocations, etc. A high level abstraction layer 206, in this example also comprising Python scripts, provides functionality such as an iteration controller, convex optimizers, etc. The upper layers 202, 204, and 206 interact with RDBMS built-in functions 208 and/or with inner loops 210 and/or low-level abstraction layer 212, comprising compiled C++ in this example, to perform the lower level tasks required to carry out a task received via user interface 202. Data is accessed to perform analytics computations and/or other processing by interacting with an underlying RDBMS query processing layer 214. In various embodiments, one or more of the components shown in Figure 2 may be implemented across nodes comprising the system, such as across the segments or other processing units comprising the MPP database portion of a large scale distributed system such as the one shown in Figure 1. In various embodiments, core data analytics processing is performed at least in part using functions embedded in each of the segments (or other processing units) included in the system. In some embodiments, the functions comprise a "shared object" or library of functions comprising compiled C++ or other compiled code, such as Java or Fortran. As a portion of a broader task is assigned to a segment, the segment uses the embedded function(s) implicated by the assignment to perform at least part of the data analytics and/or other processing that has been assigned to the segment.
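
    A minimal sketch of a driver function in the sense of layer 204 is given below: a Python script keeps the outer loop of an iterative algorithm and pushes each iteration into the database as templated SQL. The table and function names (points, kmeans_step) and the run_query callable are assumptions for illustration only, not part of any described interface.

KMEANS_STEP_SQL = "SELECT kmeans_step(pt, {k}, {iteration}) FROM {table};"


def kmeans_driver(table, k, iterations, run_query=print):
    # The outer loop of the iterative algorithm stays in the driver script;
    # the per-iteration work is expressed as SQL and executed by the database layer.
    for iteration in range(iterations):
        run_query(KMEANS_STEP_SQL.format(table=table, k=k, iteration=iteration))


kmeans_driver(table="points", k=5, iterations=3)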

    [0015] Figure 3 is a flow chart illustrating an embodiment of a database query processing process. In some embodiments, a master node, such as primary master 102 of Figure 1, implements the process of Figure 3. In the example shown, a query is received (302). Examples of a query include, without limitation, an advanced data analytics request expressed in whole or in part as a set of SQL statements. A query plan is generated (304). The plan is divided into a plurality of slices, and for each slice a corresponding set of segments ("gang") is identified to participate in execution of that slice of the query plan (306). For each slice of the query plan, the segments selected to perform processing required by that slice are sent a communication that includes both the applicable portion of the plan to be performed by that segment and metadata that may be required by the receiving segment to perform tasks assigned to that segment (308). In some embodiments, the metadata included in the query plan slice and/or other communication sent to the respective segments selected to participate in execution of that slice of the plan includes metadata from a central metadata store, e.g., metadata 106 of Figure 1, and includes information indicating to the segment the location of data with respect to which that segment is to perform query plan slice related processing. In past approaches, a segment typically would store and manage a corresponding portion of the overall data, and sending metadata to enable the segment to perform query plan related tasks would not have been necessary. In some embodiments, metadata and/or other data included in assignments sent to selected segments may indicate data analytics processing to be performed, in whole or in part, by the segment using one or more data analytics functions that have been embedded in each of the segments in the distributed system. Query results are received from the respective segments to which query tasks were dispatched, and processed to generate, e.g., at the master node, a master or overall response to the query (310).
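
    The flow of Figure 3 can be summarized in the skeleton below. Every helper is a trivially stubbed stand-in (not a real planner, dispatcher, or Greenplum API) so that the mapping to steps 302-310 is easy to follow; all names are assumptions made for this sketch.

def parse_and_plan(query, metadata_store):                 # 302, 304: receive query, build plan
    return {"query": query, "slices": [0, 1]}


def select_gang(plan_slice, segments):                     # 306: pick a gang for this slice
    return segments[:2]


def lookup_metadata(plan_slice, segment, metadata_store):  # metadata from the central store
    return metadata_store.get((plan_slice, segment), {})


def dispatch(segment, plan_slice, metadata):               # 308: send plan portion + metadata
    return {"segment": segment, "slice": plan_slice, "rows": 0}


def process_query(query, segments, metadata_store):
    plan = parse_and_plan(query, metadata_store)
    results = [dispatch(seg, s, lookup_metadata(s, seg, metadata_store))
               for s in plan["slices"]
               for seg in select_gang(s, segments)]
    return results                                          # 310: combined at the master


print(process_query("SELECT ...", ["seg0", "seg1", "seg2"], metadata_store={}))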

    [0016] Figure 4 is a block diagram illustrating an embodiment of a segment server. In various embodiments, one or more segment servers such as segment server 402 may be deployed in each of a plurality of segment hosts, such as segment hosts 110, 112, and 114 of Figure 1. In the example shown, the segment server 402 includes a communication interface 404 configured to receive, e.g., via a network interconnect such as interconnect 108 of Figure 1, a network communication comprising an assignment sent by a master node such as primary master 102 of Figure 1. A query executor 406 performs processing required to complete tasks assigned by the master node, using in this example a storage layer interface 408 to access data stored in a distributed storage layer, such as distributed storage layer 116 of Figure 1. One or more data analytics functions included in a shared data analytics library 410 embedded in each segment server in the distributed system may be called to perform data analytics processing, as required to perform the assigned task. Examples of functions that may be embedded in segment servers in various embodiments include, without limitation: User-Defined Functions (e.g., a UDF which randomly initializes an array with values in a specified range, a UDF which transposes a matrix, a UDF which un-nests a 2-dimensional array into a set of 1-dimensional arrays, etc.), as well as step functions and final functions of various User-Defined Aggregators.
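
    The three User-Defined Functions named above can be sketched in Python for illustration. The description characterizes the embedded library as compiled code (e.g., C++), so these definitions are stand-ins that show the behaviour only.

import random


def random_array(n, low, high):
    # UDF which randomly initializes an array with values in a specified range.
    return [random.uniform(low, high) for _ in range(n)]


def transpose(matrix):
    # UDF which transposes a matrix given as a list of equal-length rows.
    return [list(row) for row in zip(*matrix)]


def unnest(matrix):
    # UDF which un-nests a 2-dimensional array into a set of 1-dimensional arrays.
    return [list(row) for row in matrix]


print(transpose([[1, 2, 3], [4, 5, 6]]))   # [[1, 4], [2, 5], [3, 6]]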

    [0017] Figure 5 is a flow chart illustrating an embodiment of a process to perform data analytics processing. In various embodiments, the process of Figure 5 is performed by a segment server in response to receiving an assignment, e.g., from a master node, to perform an assigned part of a data analytics query plan. In the example shown, an assigned task is received (502). Metadata embedded in the assigned task is used to access data as needed to perform the assigned task(s) (504). Data analytics functions embedded at the segment server or other processing unit are invoked as needed to perform the assigned task (506). Examples of functions that may be embedded in segment servers in various embodiments include, without limitation, a function which performs Gibbs sampling for the inference of Latent Dirichlet Allocation and a function which generates association rules. Once processing has been completed, a result is returned, for example to the master node from which the assignment was received (508).
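
    A segment-side counterpart of this flow is sketched below. The EMBEDDED_FUNCTIONS registry and read_split stand in for the compiled analytics library and the storage layer interface respectively; both, together with the message layout, are assumptions made for this example.

EMBEDDED_FUNCTIONS = {
    "row_count": lambda rows: {"rows": len(rows)},           # stand-in analytics function
}


def read_split(data_location):
    # 504: in a real segment this would open the HDFS split named in the metadata;
    # here it simply returns rows carried in the message so the sketch runs.
    return data_location.get("sample_rows", [])


def run_assignment(assignment):                               # 502: assigned task received
    rows = read_split(assignment["data_location"])            # 504: metadata -> data
    function = EMBEDDED_FUNCTIONS[assignment["function"]]     # 506: invoke embedded function
    return function(rows)                                     # 508: result back to the master


print(run_assignment({"function": "row_count",
                      "data_location": {"sample_rows": [1, 2, 3]}}))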

    [0018] Using techniques disclosed herein, a scalable and high-performance data analytics platform can be provided over a high-performance parallel database system built upon a scalable distributed file system. The advantages of parallel databases and distributed file systems are combined to overcome the challenges of big data analytics. Finally, in various embodiments, users are able to use familiar SQL queries to run analytic tasks, and the underlying parallel database engines translate these SQL queries into a set of execution plans, optimized according to data locality and load balance.

    [0019] Although the foregoing embodiments have been described in some detail for purposes of clarity of understanding, the invention is not limited to the details provided. There are many alternative ways of implementing the invention. The disclosed embodiments are illustrative and not restrictive.


    Claims

    1. A method, comprising:

    receiving (302), by a master node (102), a data analysis request;

    creating (304), by the master node (102), a plan to generate a response to the request;

    assigning (306) to each of a plurality of distributed processing segments a corresponding portion of the plan to be performed by that segment, including by invoking as indicated in the assignment one or more data analytical functions embedded in the processing segment; and characterized by

    sending (308), by the master node (102), to each of the plurality of distributed processing segments for which a portion of the plan is assigned, the corresponding portion of the plan to be performed by that segment and metadata (106), wherein the metadata (106) is used to locate or access a subset of data in a distributed storage layer (116) on which the segment is to perform the indicated processing.


     
    2. The method of claim 1, wherein creating (304) a plan to generate a response to the request includes creating a query plan, slicing the query plan into a plurality of slices, and identifying for each slice a group of processing segments to perform tasks comprising that slice of the query plan.
     
    3. The method of claim 1, wherein the data analysis request comprises one or more SQL statements, to compute one or more of the following: Logistic Regression, Multinomial Logistic Regression, K-means clustering, Association Rules based market basket analysis, and Latent Dirichlet based topic modeling.
     
    4. The method of claim 1, wherein the data analysis request is received at a master node (102) of a large scale distributed system.
     
    5. The method of claim 1, further comprising embedding in each of said plurality of distributed processing segments a library or other shared object comprising said one or more data analytical functions.
     
    6. The method of claim 5, wherein said library or other shared object is included in the processing segments as deployed.
     
    7. The method of claim 5, wherein said library or other shared object embodies said one or more data analytical functions in the form of one or more of the following: compiled C++ code, compiled Java, compiled Fortran, or other compiled code.
     
    8. The method of claim 1, wherein the plurality of distributed processing segments comprise a subset of parallel processing segments comprising a massively parallel processing (MPP) database system.
     
    9. The method of claim 1, wherein assigning (306) to each of a plurality of distributed processing segments (110) a corresponding portion of the plan to be performed by that segment includes embedding in an assignment communication to be sent to one or more of said plurality of distributed processing segments the metadata (106) indicating a location, within the distributed data storage layer (116), of data to be processed by that segment.
     
    10. The method of claim 9, wherein each of said distributed processing segments is configured to use the metadata (106) to access said data to be processed by that segment.
     
    11. The method of claim 9, wherein the distributed data storage layer (116) comprises data stored in an instance of the Hadoop Distributed File System (HDFS) and the metadata (106) indicates a location within the HDFS of data to be processed by that segment.
     
    12. The method of claim 1, further comprising: obtaining, by the master node (102), metadata (106) associated with one or more portions of the plan to be performed by one or more corresponding segments, wherein the master node (102) obtains the metadata (106) from a central metadata store, wherein the metadata identifies a location of data corresponding to the one or more portions of the plan and at least a part of one or more data analytics processing operations to be performed in connection with processing the corresponding one or more portions of the plan.
     
    13. A system, comprising:

    a communication interface (404); and

    a processor coupled to the communication interface (404) and configured to:

    receive (302) a data analysis request;

    create (304) a plan to generate a response to the request;

    assign (306) to each of a plurality of distributed processing segments, via a communication sent via the communication interface (404), a corresponding portion of the plan to be performed by that segment, including by invoking as indicated in the assignment one or more data analytical functions embedded in the processing segment; and characterized by

    send (308) to each of the plurality of distributed processing segments for which a portion of the plan is assigned, the corresponding portion of the plan to be performed by that segment and metadata (106), wherein the metadata (106) is used to locate or access a subset of data in a distributed data storage layer (116) on which the segment is to perform the indicated processing.


     
    14. The system of claim 13, wherein the data analysis request comprises one or more SQL statements, to compute one or more of the following: Logistic Regression, Multinomial Logistic Regression, K-means clustering, Association Rules based market basket analysis, and Latent Dirichlet based topic modeling.
     
    15. The system of claim 13, wherein the data analysis request is received at a master node (102) of a large scale distributed system.
     
    16. The system of claim 13, wherein the processor is configured to create the plan to generate a response to the request at least in part by creating a query plan, slicing the query plan into a plurality of slices, and identifying for each slice a group of processing segments to perform tasks comprising that slice of the query plan.
     
    17. The system of claim 13, wherein the processor is configured to assign to each of a plurality of distributed processing segments a corresponding portion of the plan to be performed by that segment at least in part by embedding in the communication to be sent via the communication interface to one or more of said plurality of distributed processing segments the metadata (106) indicating a location, within the distributed data storage layer (116), of data to be processed by that segment.
     
    18. The system of claim 17, wherein each of said distributed processing segments is configured to use the metadata (106) to access said data to be processed by that segment.
     
    19. The system of claim 13, wherein each of said distributed processing segments has embedded therein a library or other shared object comprising said one or more data analytical functions.
     
    20. The system of claim 13, wherein the plurality of distributed processing segments comprise a subset of parallel processing segments comprising a massively parallel processing (MPP) database system.
     
    21. A computer program product embodied in a tangible, non-transitory computer readable storage medium, comprising computer instructions for:

    receiving (302) a data analysis request;

    creating (304) a plan to generate a response to the request;

    assigning (306) to each of a plurality of distributed processing segments (110) a corresponding portion of the plan to be performed by that segment, including by invoking as indicated in the assignment one or more data analytical functions embedded in the processing segment; and characterized by

    sending (308) to each of the plurality of distributed processing segments for which a portion of the plan is assigned, the corresponding portion of the plan to be performed by that segment and metadata (106), wherein the metadata (106) is used to locate or access a subset of data in a distributed data storage layer (116) on which the segment is to perform the indicated processing.


     
    22. The computer program product of claim 21, wherein assigning (306) to each of a plurality of distributed processing segments (110) a corresponding portion of the plan to be performed by that segment includes embedding in an assignment communication to be sent to one or more of said plurality of distributed processing segments a metadata (106) indicating a location, within a distributed data storage layer (116), of data to be processed by that segment.
     


    Ansprüche

    1. Verfahren, das Folgendes beinhaltet:

    Empfangen (302), durch einen Master-Knoten (102), einer Datenanalyseanforderung;

    Erzeugen (304), durch den Master-Knoten (102), eines Plans zum Erzeugen einer Antwort auf die Anforderung;

    Zuweisen (306), zu jedem aus einer Mehrzahl von verteilten Verarbeitungssegmenten, eines entsprechenden Teils des von diesem Segment auszuführenden Plans, einschließlich durch Aufrufen einer oder mehrerer in dem Verarbeitungssegment eingebetteter Datenanalysefunktionen wie in der Zuweisung angegeben; und gekennzeichnet durch

    Senden (308), durch den Master-Knoten (102), zu jedem aus der Mehrzahl von verteilten Verarbeitungssegmenten, für die ein Teil des Plans zugewiesen ist, des entsprechenden Teils des von diesem Segment auszuführenden Plans und von Metadaten (106), wobei die Metadaten (106) verwendet werden, um eine Teilmenge von Daten in einer verteilten Speicherschicht (116) zu lokalisieren oder darauf zuzugreifen, auf der das Segment die angegebene Verarbeitung ausführen soll.


     
    2. Verfahren nach Anspruch 1, wobei das Erzeugen (304) eines Plans zum Erzeugen einer Antwort auf die Anforderung das Erzeugen eines Abfrageplans, das Unterteilen des Abfrageplans in mehrere Slices und das Identifizieren, für jedes Slice, einer Gruppe von Verarbeitungssegmenten umfasst, um Aufgaben durchzuführen, die dieses Slice des Abfrageplans umfassen.
     
    3. Verfahren nach Anspruch 1, wobei die Datenanalyseanforderung eine oder mehrere SQL-Statements umfasst, um eines oder mehrere der folgenden zu berechnen: Logistische Regression, Multinomiale logistische Regression, K-Means-Clustering, auf Assoziationsregeln basierende Warenkorbanalyse und auf Latent Dirichlet basierende Themenmodellierung.
     
    4. Verfahren nach Anspruch 1, wobei die Datenanalyseanforderung an einem Master-Knoten (102) eines groß angelegten verteilten Systems empfangen wird.
     
    5. Verfahren nach Anspruch 1, das ferner das Einbetten einer Bibliothek oder eines anderen gemeinsam genutzten Objekts, die/das die genannten ein oder mehreren Datenanalysefunktionen umfasst, in jedes aus der genannten Mehrzahl von verteilten Verarbeitungssegmenten beinhaltet.
     
    6. Verfahren nach Anspruch 5, wobei die genannte Bibliothek oder das andere gemeinsam genutzte Objekt in den Verarbeitungssegmenten wie eingesetzt enthalten ist.
     
    7. Verfahren nach Anspruch 5, wobei die genannte Bibliothek oder das andere gemeinsam genutzte Objekt die genannten ein oder mehreren Datenanalysefunktionen in Form von einem oder mehreren der folgenden ausgestaltet: kompilierter C++-Code, kompiliertes Java, kompiliertes Fortran oder anderer kompilierter Code.
     
    8. Verfahren nach Anspruch 1, wobei die Mehrzahl von verteilten Verarbeitungssegmenten eine Teilmenge von Parallelverarbeitungssegmenten umfasst, die ein MPP-(Massively Parallel Processing)-Datenbanksystem umfassen.
     
    9. Verfahren nach Anspruch 1, wobei das Zuweisen (306), zu jedem aus einer Mehrzahl von verteilten Verarbeitungssegmenten (110), eines entsprechenden Teils des von diesem Segment auszuführenden Plans das Einbetten der Metadaten (106), die einen Ort von von diesem Segment zu verarbeitenden Daten innerhalb der verteilten Datenspeicherschicht (116) angeben, in eine Zuweisungsmitteilung beinhaltet, die zu einem oder mehreren aus der genannten Mehrzahl von verteilten Verarbeitungssegmenten zu senden ist.
     
    10. Verfahren nach Anspruch 9, wobei jedes der genannten verteilten Verarbeitungssegmente so konfiguriert ist, dass es die Metadaten (106) verwendet, um auf die genannten von diesem Segment zu verarbeitenden Daten zuzugreifen.
     
    11. Verfahren nach Anspruch 9, wobei die verteilte Datenspeicherschicht (116) Daten umfasst, die in einer Instanz des HDFS (Hadoop Distributed File System) gespeichert sind, und die Metadaten (106) einen Ort innerhalb des HDFS von Daten angeben, die von diesem Segment verarbeitet werden sollen.
     
    12. Verfahren nach Anspruch 1, das ferner Folgendes beinhaltet: Erhalten, durch den Master-Knoten (102), von Metadaten (106), die mit einem oder mehreren Teilen des von einem oder mehreren entsprechenden Segmenten auszuführenden Plans assoziiert sind, wobei der Master-Knoten (102) die Metadaten (106) von einem zentralen Metadatenspeicher erhält, wobei die Metadaten Ortsdaten entsprechend den ein oder mehreren Teilen des Plans und mindestens einen Teil von einer oder mehreren Datenanalyseverarbeitungen identifiziert, die in Verbindung mit der Verarbeitung der entsprechenden ein oder mehreren Teile des Plans ausgeführt werden sollen.
     
    13. System, das Folgendes umfasst:

    eine Kommunikationsschnittstelle (404); und

    einen Prozessor, der mit der Kommunikationsschnittstelle (404) gekoppelt und konfiguriert ist zum:

    Empfangen (302) einer Datenanalyseanforderung;

    Erzeugen (304) eines Plans zum Erzeugen einer Antwort auf die Anforderung;

    Zuweisen (306), zu jedem aus einer Mehrzahl von verteilten Verarbeitungssegmenten, über eine über die Kommunikationsschnittstelle (404) gesendete Mitteilung, eines entsprechenden Teils des von diesem Segment auszuführenden Plans, einschließlich durch Aufrufen einer oder mehrerer in dem Verarbeitungssegment eingebetteter Datenanalysefunktionen wie in der Zuweisung angegeben; und gekennzeichnet durch

    Senden (308), zu jedem aus der Mehrzahl von verteilten Verarbeitungssegmenten, für die ein Teil des Plans zugewiesen ist, des entsprechenden Teils des von diesem Segment auszuführenden Plans und von Metadaten (106), wobei die Metadaten (106) verwendet werden, um eine Teilmenge von Daten in einer verteilten Datenspeicherschicht (116) zu lokalisieren oder darauf zuzugreifen, auf der das Segment die angegebene Verarbeitung ausführen soll.


     
    14. System nach Anspruch 13, wobei die Datenanalyseanforderung eine oder mehrere SQL-Statements umfasst, um eine oder mehrere der folgenden zu berechnen: Logistische Regression, Multinomiale logistische Regression, K-means-Clustering, auf Assoziationsregeln basierende Warenkorbanalyse und auf Latent Dirichlet basierende Themenmodellierung.
     
    15. System nach Anspruch 13, wobei die Datenanalyseanforderung an einem Master-Knoten (102) eines groß angelegten verteilten Systems empfangen wird.
     
    16. System nach Anspruch 13, wobei der Prozessor zum Erzeugen des Plans konfiguriert ist, um eine Antwort auf die Anforderung zumindest teilweise durch Erzeugen eines Abfrageplans, Unterteilen des Abfrageplans in eine Mehrzahl von Slices und Identifizieren einer Gruppe von Verarbeitungssegmenten für jedes Slice zur Durchführung von Aufgaben zu erzeugen, die dieses Slice des Abfrageplans umfassen.
     
    17. System nach Anspruch 13, wobei der Prozessor zum Zuweisen, zu jedem aus einer Mehrzahl von verteilten Verarbeitungssegmenten, eines entsprechenden Teils des von diesem Segment auszuführenden Plans zumindest teilweise durch Einbetten der Metadaten (106), die einen Ort von von diesem Segment zu verarbeitenden Daten innerhalb der verteilten Datenspeicherschicht (116) angeben, in die Mitteilung konfiguriert ist, die über die Kommunikationsschnittstelle zu einem oder mehreren aus der genannten Mehrzahl von verteilten Verarbeitungssegmenten zu senden ist.
     
    18. System nach Anspruch 17, wobei jedes der genannten verteilten Verarbeitungssegmente zum Verwenden der Metadaten (106) konfiguriert ist, um auf die genannten von diesem Segment zu verarbeitenden Daten zuzugreifen.
     
    19. System nach Anspruch 13, wobei in jedem der genannten verteilten Verarbeitungssegmente eine Bibliothek oder ein anderes gemeinsam genutztes Objekt eingebettet ist, die/das die genannten ein oder mehreren Datenanalysefunktionen umfasst.
     
    20. System nach Anspruch 13, wobei die Mehrzahl von verteilten Verarbeitungssegmenten eine Teilmenge von Parallelverarbeitungssegmenten umfasst, die ein MPP-(Massively Parallel Processing)-Datenbanksystem umfassen.
     
    21. Computerprogrammprodukt, das in einem greifbaren, nichtflüchtigen, computerlesbaren Speichermedium ausgestaltet ist, das Computerbefehle umfasst zum:

    Empfangen (302) einer Datenanalyseanforderung;

    Erzeugen (304) eines Plans zum Erzeugen einer Antwort auf die Anforderung;

    Zuweisen (306), zu jedem aus einer Mehrzahl von verteilten Verarbeitungssegmenten (110), eines entsprechenden Teils des von diesem Segment auszuführenden Plans, einschließlich durch Aufrufen einer oder mehrerer in dem Verarbeitungssegment eingebetteter Datenanalysefunktionen wie in der Zuweisung angegeben; und gekennzeichnet durch

    Senden (308), zu jedem aus der Mehrzahl von verteilten Verarbeitungssegmenten, für die ein Teil des Plans zugewiesen ist, des entsprechenden Teils des von diesem Segment auszuführenden Plans und von Metadaten (106), wobei die Metadaten (106) verwendet werden, um eine Teilmenge von Daten in einer verteilten Datenspeicherschicht (116) zu lokalisieren oder darauf zuzugreifen, auf der das Segment die angegebene Verarbeitung ausführen soll.


     
    22. Computerprogrammprodukt nach Anspruch 21, wobei das Zuweisen (306), zu jedem aus einer Mehrzahl von verteilten Verarbeitungssegmenten (110), eines entsprechenden Teils des von diesem Segment auszuführenden Plans das Einbetten von Metadaten (106), die einen Ort von von diesem Segment zu verarbeitenden Daten innerhalb einer verteilten Datenspeicherungsschicht (116) angeben, in eine Zuweisungsmitteilung beinhaltet, die zu einem oder mehreren aus der genannten Mehrzahl von verteilten Verarbeitungssegmenten zu senden ist.
     


    Revendications

    1. Un procédé comprenant :

    la réception (302), par un noeud maître (102), d'une demande d'analyse de données,

    la création (304), par le noeud maître (102), d'un plan de génération d'une réponse à la demande,

    l'affectation (306) à chaque segment d'une pluralité de segments de traitement distribués d'une partie correspondante du plan à exécuter par ce segment, y compris par l'invocation comme indiqué dans l'affectation d'une ou de plusieurs fonctions d'analyse de données imbriquées dans le segment de traitement, et caractérisé par

    l'envoi (308), par le noeud maître (102), à chaque segment de la pluralité de segments de traitement distribués auquel une partie du plan est affectée, de la partie correspondante du plan à exécuter par ce segment et de métadonnées (106), où les métadonnées (106) sont utilisées pour localiser ou accéder à un sous-ensemble de données dans une couche d'espace mémoire distribué (116) sur laquelle le segment doit exécuter le traitement indiqué.


     
    2. Le procédé selon la Revendication 1, où la création (304) d'un plan de génération d'une réponse à la demande comprend la création d'un plan d'interrogation, le découpage du plan d'interrogation en une pluralité de tranches et l'identification pour chaque tranche d'un groupe de segments de traitement de façon à exécuter des tâches comprenant cette tranche du plan d'interrogation.
     
    3. Le procédé selon la Revendication 1, où la demande d'analyse de données comprend une ou plusieurs instructions SQL destinées au calcul d'un ou de plusieurs des éléments suivants : régression logistique, régression logistique multinomiale, mise en grappes par la méthode des K-moyennes, analyse de panier de marché basée sur des règles d'association et modélisation de sujet basée sur Dirichlet latente.
     
    4. Le procédé selon la Revendication 1, où la demande d'analyse de données est reçue au niveau d'un noeud maître (102) d'un système distribué de grande taille.
     
    5. Le procédé selon la Revendication 1, comprenant en outre l'imbrication dans chaque segment de ladite pluralité de segments de traitement distribués d'une bibliothèque ou d'un autre objet partagé comprenant lesdites une ou plusieurs fonctions d'analyse de données.
     
    6. Le procédé selon la Revendication 5, où ladite bibliothèque ou ledit autre objet partagé est inclus dans les segments de traitement tels que déployés.
     
    7. Le procédé selon la Revendication 5, où ladite bibliothèque ou ledit autre objet partagé englobe lesdites une ou plusieurs fonctions d'analyse de données sous la forme d'un ou de plusieurs des éléments suivants : code C++ compilé, Java compilé, Fortran compilé ou autre code compilé.
     
    8. Le procédé selon la Revendication 1, où la pluralité de segments de traitement distribués comprennent un sous-ensemble de segments de traitement parallèles comprenant un système de base de données à traitement massivement parallèle (MPP).
     
    9. Le procédé selon la Revendication 1, où l'affectation (306) à chaque segment d'une pluralité de segments de traitement distribués (110) d'une partie correspondante du plan à exécuter par ce segment comprend l'imbrication dans une communication d'affectation à envoyer à un ou plusieurs segments de ladite pluralité de segments de traitement distribués des métadonnées (106) indiquant un emplacement, à l'intérieur de la couche d'espace mémoire de données distribué (116), de données à traiter par ce segment.
     
    10. Le procédé selon la Revendication 9, où chacun desdits segments de traitement distribués est configuré de façon à utiliser les métadonnées (106) pour accéder auxdites données à traiter par ce segment.
     
    11. Le procédé selon la Revendication 9, où la couche d'espace mémoire de données distribué (116) contient des données conservées en mémoire dans une instance du système de fichiers distribués de Hadoop (HDFS) et les métadonnées (106) indiquent un emplacement à l'intérieur du HDFS de données à traiter par ce segment.
     
    12. Le procédé selon la Revendication 1, comprenant en outre : l'obtention, par le noeud maître (102), de métadonnées (106) associées à une ou plusieurs parties du plan à exécuter par un ou plusieurs segments correspondants, où le noeud maître (102) obtient les métadonnées (106) à partir d'un espace mémoire de métadonnées central, où les métadonnées identifient des données d'emplacement correspondant aux une ou plusieurs parties du plan et au moins une partie d'un ou de plusieurs traitements d'analyse de données à exécuter en relation avec le traitement des une ou plusieurs parties correspondantes du plan.
     
    13. Un système comprenant :

    une interface de communication (404), et

    un processeur couplé à l'interface de communication (404) et configuré de façon à :

    recevoir (302) une demande d'analyse de données,

    créer (304) un plan de génération d'une réponse à la demande,

    affecter (306) à chaque segment d'une pluralité de segments de traitement distribués, par l'intermédiaire d'une communication envoyée par l'intermédiaire de l'interface de communication (404), une partie correspondante du plan à exécuter par ce segment, y compris par l'invocation comme indiqué dans l'affectation d'une ou de plusieurs fonctions d'analyse de données imbriquées dans le segment de traitement, et caractérisé par

    l'envoi (308) à chaque segment de la pluralité de segments de traitement distribués auquel une partie du plan est affectée, de la partie correspondante du plan à exécuter par ce segment et de métadonnées (106), où les métadonnées (106) sont utilisées pour localiser ou accéder à un sous-ensemble de données dans une couche d'espace mémoire de données distribué (116) sur laquelle le segment doit exécuter le traitement indiqué.


     
    14. Le système selon la Revendication 13, où la demande d'analyse de données comprend une ou plusieurs instructions SQL, destinées au calcul d'un ou de plusieurs des éléments suivants : régression logistique, régression logistique multinomiale, mise en grappes par la méthode des K-moyennes, analyse de panier de marché basée sur des règles d'association et modélisation de sujet basée sur Dirichlet latente.
     
    15. Le système selon la Revendication 13, où la demande d'analyse de données est reçue au niveau d'un noeud maître (102) d'un système distribué de grande taille.
     
    16. Le système selon la Revendication 13, où le processeur est configuré de façon à créer le plan de génération d'une réponse à la demande au moins en partie par la création d'un plan d'interrogation, le découpage du plan d'interrogation en une pluralité de tranches, et l'identification pour chaque tranche d'un groupe de segments de traitement de façon à exécuter des tâches comprenant cette tranche du plan d'interrogation.
     
    17. Le système selon la Revendication 13, où le processeur est configuré de façon à affecter à chaque segment d'une pluralité de segments de traitement distribués une partie correspondante du plan à exécuter par ce segment au moins en partie par l'imbrication dans la communication à envoyer par l'intermédiaire de l'interface de communication à un ou plusieurs segments de ladite pluralité de segments de traitement distribués des métadonnées (106) indiquant un emplacement, à l'intérieur de la couche d'espace mémoire de données distribué (116), de données à traiter par ce segment.
     
    18. Le système selon la Revendication 17, où chacun desdits segments de traitement distribués est configuré de façon à utiliser les métadonnées (106) de façon à accéder auxdites données à traiter par ce segment.
     
    19. Le système selon la Revendication 13, où chacun desdits segments de traitement distribués possède imbriqué dans celui-ci une bibliothèque ou un autre objet partagé comprenant lesdites une ou plusieurs fonctions d'analyse de données.
     
    20. Le système selon la Revendication 13, où la pluralité de segments de traitement distribués comprennent un sous-ensemble de segments de traitement parallèles comprenant un système de base de données à traitement massivement parallèle (MPP).
     
    21. Un produit de programme informatique incorporé dans un support à mémoire lisible par ordinateur non transitoire tangible, contenant des instructions informatiques destinées à :

    la réception (302) d'une demande d'analyse de données,

    la création (304) d'un plan de génération d'une réponse à la demande,

    l'affectation (306) à chaque segment d'une pluralité de segments de traitement distribués (110) d'une partie correspondante du plan à exécuter par ce segment, y compris par l'invocation comme indiqué dans l'affectation d'une ou de plusieurs fonctions d'analyse de données imbriquées dans le segment de traitement, et caractérisé par

    l'envoi (308) à chaque segment de la pluralité de segments de traitement distribués auquel une partie du plan est affectée, de la partie correspondante du plan à exécuter par ce segment et de métadonnées (106), où les métadonnées (106) sont utilisées pour localiser ou accéder à un sous-ensemble de données dans une couche d'espace mémoire de données distribué (116) sur laquelle le segment doit exécuter le traitement indiqué.


     
    22. Le produit de programme informatique selon la Revendication 21, où l'affectation (306) à chaque segment d'une pluralité de segments de traitement distribués (110) d'une partie correspondante du plan à exécuter par ce segment comprend l'imbrication dans une communication d'affectation à envoyer à un ou plusieurs segments de ladite pluralité de segments de traitement distribués de métadonnées (106) indiquant un emplacement, à l'intérieur d'une couche d'espace mémoire de données distribué (116), de données à traiter par ce segment.
     




    Drawing

    [Figures 1 to 5, as identified in the Brief Description of the Drawings above, appear here in the original publication.]

    Cited references

    REFERENCES CITED IN THE DESCRIPTION



    This list of references cited by the applicant is for the reader's convenience only. It does not form part of the European patent document. Even though great care has been taken in compiling the references, errors or omissions cannot be excluded and the EPO disclaims all liability in this regard.

    Patent documents cited in the description