EP 4418141 A1 20240821 - DOCUMENT CLUSTERING USING NATURAL LANGUAGE PROCESSING

Title (en)

DOCUMENT CLUSTERING USING NATURAL LANGUAGE PROCESSING

Title (de)

DOKUMENTGRUPPIERUNG MITTELS VERARBEITUNG NATÜRLICHER SPRACHE

Title (fr)

REGROUPEMENT DE DOCUMENTS UTILISANT UN TRAITEMENT DE LANGAGE NATUREL

Publication

EP 4418141 A1 20240821 (EN)

Application

EP 23157416 A 20230217

Priority

EP 23157416 A 20230217

Abstract (en)

A Cluster System (10) for automatically clustering together a plurality of data files (200) based on semantics of textual content of each of the plurality of data files (200) is described. The Cluster System comprises a Cluster Processor for processing the plurality of data files (200) created externally and independently of the Cluster System. The Cluster Processor comprises: an Event Data Ingester (26) for receiving each of the plurality of data files (200); a Vector Creator (28) configured to apply each of a plurality of different stored vector templates (40) to a data file received by the Event Data Ingester (26) to determine a plurality of numerical vector representations (202a, 202b, 202c, 202d) of the textual content of the data file (200), each vector template (40) defining a specific subset of characteristics of the textual content of the data file (200) and each vector representation (202a, 202b, 202c, 202d) being specific to each vector template (40) and indicating the degree to which the specific subset of characteristics of textual content of the vector template (40) is present in the data file (200); a Vector Aggregator (30) configured to combine the plurality of numerical vector representations (202a, 202b, 202c, 202d) of the textual content of the data file (200) into an aggregated summary vector (204) representing the textual content of the data file (200), the Vector Aggregator (30) being configured to use a set of vector weights (42) to determine the contribution of each of the different vector representations (202a, 202b, 202c, 202d) to the aggregated summary vector (204); and a Cluster Manager (34) configured to use the aggregated summary vector (204) to cluster together the data file (200) with other data files each having their own aggregated summary vector (204).
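
The pipeline in the abstract (templates applied to a file, per-template vectors, weighted aggregation, clustering) can be illustrated with a minimal sketch. Everything concrete here is an assumption made for illustration and is not taken from the patent: the templates are modeled as hypothetical keyword lists, the "degree of presence" as relative term frequency, the aggregation as a weighted concatenation, and the clustering as a greedy cosine-similarity threshold. The names `VECTOR_TEMPLATES`, `VECTOR_WEIGHTS`, `create_vectors`, `aggregate`, `cluster` and `cluster_documents` are likewise hypothetical.

```python
import math
import re
from collections import Counter

# Hypothetical vector templates: each one names a subset of textual
# characteristics (modeled here as keywords). This form is an assumption.
VECTOR_TEMPLATES = {
    "network": ["router", "packet", "latency"],
    "storage": ["disk", "volume", "backup"],
}

# Hypothetical vector weights: how much each template's vector contributes
# to the aggregated summary vector.
VECTOR_WEIGHTS = {"network": 1.0, "storage": 0.5}


def create_vectors(text):
    """Apply every template to the text, giving one numerical vector per template.

    Each component is the relative frequency of one keyword, used here as a
    stand-in for the degree to which a characteristic is present.
    """
    words = re.findall(r"[a-z]+", text.lower())
    counts = Counter(words)
    total = max(len(words), 1)  # avoid division by zero on empty text
    return {
        name: [counts[keyword] / total for keyword in keywords]
        for name, keywords in VECTOR_TEMPLATES.items()
    }


def aggregate(vectors):
    """Combine per-template vectors into one summary vector.

    Assumption: a weighted concatenation, so each template keeps its own
    dimensions and the weight scales its influence on similarity.
    """
    summary = []
    for name in VECTOR_TEMPLATES:
        summary.extend(VECTOR_WEIGHTS[name] * x for x in vectors[name])
    return summary


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def cluster(summaries, threshold=0.5):
    """Greedy clustering: a vector joins the first cluster whose seed is similar enough."""
    clusters = []  # list of (seed_vector, member_indices)
    for index, vector in enumerate(summaries):
        for seed, members in clusters:
            if cosine(seed, vector) >= threshold:
                members.append(index)
                break
        else:
            clusters.append((vector, [index]))
    return [members for _, members in clusters]


def cluster_documents(texts, threshold=0.5):
    """Run the whole sketch: vectorize, aggregate, then cluster."""
    return cluster([aggregate(create_vectors(t)) for t in texts], threshold)
```

In this sketch, calling `cluster_documents` with a list of texts returns lists of input indices, one list per cluster. Weighted concatenation was chosen because the per-template vectors may differ in length; a weighted sum would be an alternative if all templates produced vectors of the same dimension.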

IPC 8 full level

G06F 16/35 (2019.01)

CPC (source: EP)

G06F 16/355 (2019.01); G06N 3/042 (2023.01); G06N 3/045 (2023.01); G06N 3/09 (2023.01); G06N 5/01 (2023.01); G06N 20/00 (2019.01); G06N 5/022 (2013.01)

Citation (search report)

[I] CN 101035128 A 20070912 - UNIV DALIAN TECH [CN]

Designated contracting state (EPC)

AL AT BE BG CH CY CZ DE DK EE ES FI FR GB GR HR HU IE IS IT LI LT LU LV MC ME MK MT NL NO PL PT RO RS SE SI SK SM TR

Designated extension state (EPC)

BA

Designated validation state (EPC)

KH MA MD TN

DOCDB simple family (publication)

EP 4418141 A1 20240821; EP 4418142 A1 20240821

DOCDB simple family (application)

EP 23157416 A 20230217; EP 24158243 A 20240216