EP 4418141 A1 20240821 - DOCUMENT CLUSTERING USING NATURAL LANGUAGE PROCESSING
Title (en)
DOCUMENT CLUSTERING USING NATURAL LANGUAGE PROCESSING
Title (de)
DOKUMENTGRUPPIERUNG MITTELS VERARBEITUNG NATÜRLICHER SPRACHE
Title (fr)
REGROUPEMENT DE DOCUMENTS UTILISANT UN TRAITEMENT DE LANGAGE NATUREL
Publication
EP 4418141 A1 20240821
Application
EP 24158243 A 20240216
Priority
EP 23157416 A 20230217
Abstract (en)
A Cluster System (10) for automatically clustering together a plurality of data files (200) based on semantics of textual content of each of the plurality of data files (200) is described. The Cluster System comprises a Cluster Processor for processing the plurality of data files (200) created externally and independently of the Cluster System. The Cluster Processor comprises: an Event Data Ingester (26) for receiving each of the plurality of data files (200); a Vector Creator (28) configured to apply each of a plurality of different stored vector templates (40) to a data file received by the Event Data Ingester (26) to determine a plurality of numerical vector representations (202a, 202b, 202c, 202d) of the textual content of the data file (200), each vector template (40) defining a specific subset of characteristics of the textual content of the data file (200) and each vector representation (202a, 202b, 202c, 202d) being specific to each vector template (40) and indicating the degree to which the specific subset of characteristics of textual content of the vector template (40) is present in the data file (200); a Vector Aggregator (30) configured to combine the plurality of numerical vector representations (202a, 202b, 202c, 202d) of the textual content of the data file (200) into an aggregated summary vector (204) representing the textual content of the data file (200), the Vector Aggregator (30) being configured to use a set of vector weights (42) to determine the contribution of each of the different vector representations (202a, 202b, 202c, 202d) to the aggregated summary vector (204); and a Cluster Manager (34) configured to use the aggregated summary vector (204) to cluster together the data file (200) with other data files each having their own aggregated summary vector (204).
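The pipeline in the abstract (vector templates, per-template vectors, weighted aggregation, clustering) can be illustrated with a minimal Python sketch. It is not the applicant's implementation, and several details are assumptions the abstract does not specify: the templates are modelled as hypothetical keyword-frequency scorers (`keyword_template`), aggregation is assumed to be a weighted concatenation, and clustering is a simple greedy cosine-similarity rule with an arbitrary threshold. All function names are illustrative.

```python
import math
from typing import Callable, List, Sequence

# ASSUMPTION: a "vector template" is modelled as a function that maps text to a
# numeric vector scoring how strongly one subset of characteristics is present.
Template = Callable[[str], List[float]]


def keyword_template(keywords: Sequence[str]) -> Template:
    """Hypothetical template: relative frequency of each keyword in the text."""
    def apply(text: str) -> List[float]:
        tokens = text.lower().split()
        total = max(len(tokens), 1)  # avoid division by zero on empty text
        return [tokens.count(k) / total for k in keywords]
    return apply


def create_vectors(text: str, templates: Sequence[Template]) -> List[List[float]]:
    """Vector Creator role: one vector representation per template."""
    return [template(text) for template in templates]


def aggregate(vectors: Sequence[Sequence[float]], weights: Sequence[float]) -> List[float]:
    """Vector Aggregator role. ASSUMPTION: weighted concatenation, i.e. each
    template's vector is scaled by its weight and the results are joined."""
    if len(vectors) != len(weights):
        raise ValueError("need exactly one weight per vector representation")
    summary: List[float] = []
    for vector, weight in zip(vectors, weights):
        summary.extend(weight * x for x in vector)
    return summary


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def cluster(summaries: Sequence[Sequence[float]], threshold: float = 0.8) -> List[int]:
    """Cluster Manager role. ASSUMPTION: greedy clustering in which a summary
    joins the first cluster whose seed vector is similar enough, and otherwise
    starts a new cluster. Returns one cluster label per summary."""
    seeds: List[Sequence[float]] = []
    labels: List[int] = []
    for summary in summaries:
        for index, seed in enumerate(seeds):
            if cosine(summary, seed) >= threshold:
                labels.append(index)
                break
        else:
            seeds.append(summary)
            labels.append(len(seeds) - 1)
    return labels


# Illustrative data only: two templates, equal weights, three short "data files".
templates = [keyword_template(["network", "outage"]),
             keyword_template(["invoice", "payment"])]
weights = [1.0, 1.0]
documents = ["network outage reported network down",
             "outage on the network again",
             "invoice payment overdue"]
summaries = [aggregate(create_vectors(d, templates), weights) for d in documents]
print(cluster(summaries))  # [0, 0, 1]
```

In this sketch, raising a template's weight makes its characteristics dominate the similarity comparison, which is one plausible reading of how the vector weights (42) determine each representation's contribution to the aggregated summary vector (204).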
IPC 8 full level
G06F 16/35 (2019.01)
CPC (source: EP)
G06F 16/355 (2019.01); G06N 3/042 (2023.01); G06N 3/045 (2023.01); G06N 3/09 (2023.01); G06N 5/01 (2023.01); G06N 20/00 (2019.01); G06N 5/022 (2013.01)
Citation (search report)
[I] CN 101035128 A 20070912 - UNIV DALIAN TECH [CN]
Designated contracting state (EPC)
AL AT BE BG CH CY CZ DE DK EE ES FI FR GB GR HR HU IE IS IT LI LT LU LV MC ME MK MT NL NO PL PT RO RS SE SI SK SM TR
Designated extension state (EPC)
BA
Designated validation state (EPC)
KH MA MD TN
DOCDB simple family (publication)
DOCDB simple family (application)
EP 23157416 A 20230217; EP 24158243 A 20240216