SYSTEM AND METHOD FOR SECURING DATA FILES

(11)	EP 4 379 583 A1

(12)	EUROPEAN PATENT APPLICATION

(43)	Date of publication:
	05.06.2024 Bulletin 2024/23

(21)	Application number: 23186088.3

(22)	Date of filing: 18.07.2023

(51)	International Patent Classification (IPC):
	G06F 21/56 (2013.01); G06N 3/02 (2006.01)

(52)	Cooperative Patent Classification (CPC):
	G06F 21/562; G06N 3/02

(84)	Designated Contracting States:
	AL AT BE BG CH CY CZ DE DK EE ES FI FR GB GR HR HU IE IS IT LI LT LU LV MC ME MK MT NL NO PL PT RO RS SE SI SK SM TR
	Designated Extension States:
	BA
	Designated Validation States:
	KH MA MD TN

(30)	Priority: 01.12.2022 SG 10202260284

(71)	Applicant: FLEXXON PTE. LTD.
	Singapore 349585 (SG)

(72)	Inventors:
	CHAN, Mei Ling, 349585 Singapore (SG)
	PARAMASIVAM, Rajamohan, 349585 Singapore (SG)
	TAN, Hong Chuan, 349585 Singapore (SG)

(74)	Representative: HGF
	HGF Limited 1 City Walk Leeds LS11 9DX (GB)

(54)	SYSTEM AND METHOD FOR SECURING DATA FILES

(57) This document discloses a system and method for securing data files selected from a series of data files. The system comprises a transformation module, an artificial neural network (ANN), a clustering module and a backpropagation module whereby these modules are configured to identify data files that contain malware or anomalies. When such data files are detected, the system will then initiate a series of measures to identify other data files that may be similarly afflicted by the detected malware. These data files are then secured to prevent the malware from affecting a host machine and/or any storage/peripheral devices linked to the host machine.

Description

Field of the Invention

[0001] This invention relates to a system and method for securing data files selected from a series of data files. The system comprises a transformation module, an artificial neural network (ANN), a clustering module, and a backpropagation module whereby these modules are configured to identify data files that contain malware or anomalies. When such data files are detected, the system will then initiate a series of measures to identify other data files that may be similarly afflicted by the detected malware. These data files are then secured to prevent the malware from affecting a host machine and/or any storage/peripheral devices linked to the host machine.

Summary of Prior Art

[0002] Malware is a constant threat in the modern-day computing landscape. As more and more computing devices are connected through wireless or wired networks, malware may spread easily through the network, infecting all connected computing devices. Infected computers will reduce the productivity of individuals and organizations, and additionally may cause irreparable harm. For example, infected computers may have damaged operating systems or corrupted data, making them slower, or rendering them completely non-functional. In a worst-case scenario, valuable data and information may be misappropriated, hijacked, and used or ransomed. Additionally, such infected systems may continue to spread the malware to other devices outside of their network.

[0003] Conventional anti-malware software may attempt to identify and detect known malware by detecting the signatures of such malware or by finding patterns within programs that are associated with other known malware. Such anti-malware methods rely on signature-based detection, which compares executables and data to patterns of known malware. However, zero-day exploits/attacks will typically not fit any pattern of known malware. Consequently, existing anti-malware solutions may fail to detect zero-day exploits, which are unpredictably different from known malware.

[0004] As mentioned above, valuable data and information may be misappropriated by malware and such data are often irreplaceable. Hence, precautionary measures are often taken such as backing up the data frequently or encrypting and storing the encrypted data in a secure environment.

[0005] Unfortunately, such measures face many issues as the backed-up data may be hacked or corrupted, and there may be hardware failures, accidental damage, ransomware or malware infections, or accidental deletions. In the case of ransomware, the attacker will encrypt the user's files and demand a ransom from the victim to restore access to the data. Generally, in such types of attacks, there are a number of vectors that have to be compromised, one of the most common being phishing spam. In a phishing spam attack, a corrupted attachment is downloaded and executed by the unsuspecting user. Once these files are executed, they then proceed to take control of the device's operating system by self-installing relevant scripts and by corrupting data contained therein.

[0006] Some forms of malicious scripts do not need to be executed by the user. Such scripts can inflict damage to the computing device by exploiting security vulnerabilities in the operating system to infect files and data contained therein. Additionally, there are also cyber-attacks known as leak-ware and dox-ware. In such attacks, a malicious actor who has gained access to a computing device will threaten to publicize sensitive data contained in the computing device's storage media.

[0007] Malicious third parties may also employ Trojan attacks to disrupt, damage and steal a user's data. Trojans also give rise to viruses, worms, spyware, zombies, botnets, logic bombs and trap doors. In general, all these various types of malicious cyberattacks have the single aim of compromising the security of a user's data. These attacks usually corrupt the operating system and kernel layers of the operating system along with the address pointers at which the data are stored, thereby corrupting the data itself.

[0008] For the above reasons, those skilled in the art are constantly striving to come up with a system and method that is capable of securing files in a computing device by identifying and detecting data files that may contain malware.

Summary of the Invention

[0009] The above and other problems are solved and an advance in the art is made by systems and methods provided by embodiments in accordance with the invention.

[0010] A first advantage of embodiments of systems and methods in accordance with the invention is that the system is configured to secure all types of files, regardless of whether the files are executable or non-executable.

[0011] A second advantage of embodiments of systems and methods in accordance with the invention is that the system is configured to store the data files in a format that is difficult for unauthorized users to manipulate and reconstruct, even if the host computing device's operating system has been compromised.

[0012] A third advantage of embodiments of systems and methods in accordance with the invention is that once the files have been stored in accordance with embodiments of the invention, the system is able to recover corrupted files easily and efficiently through the use of a mapping function that was used to map the RREF matrices to the stored files.

[0013] A fourth advantage of embodiments of systems and methods in accordance with the invention is that the system is configured to detect malware and zero-day type malware contained within the data files as the files are transformed into a numeric representation before the files are analysed using a trained neural network.

[0014] A fifth advantage of embodiments of systems and methods in accordance with the invention is that the system is configured to store files that are determined to contain malware in a virtual secure area in the memory whereby the stored files may only be retrieved by authorized users.

[0015] The above advantages are provided by embodiments of a system and/or method in accordance with the invention operating in the following manner.

[0016] According to a first aspect of the invention, a system for securing a series of data files is disclosed, the system comprising: a transformation module comprising sets of reduced row echelon form (RREF) matrices that have been transformed from the series of data files, whereby the series of data files have been mapped to the sets of RREF matrices, the transformation module being configured to: retrieve an altered data file from the system and transform the altered data file into an altered set of RREF matrices; map the altered data file to the altered set of RREF matrices, and provide the altered set of RREF matrices to a trained artificial neural network (ANN) module; the trained ANN module configured to: determine if the altered set of RREF matrices comprise malicious activity, whereby when it is determined that the altered set of RREF matrices comprise malicious activity, the altered set of RREF matrices and the altered data file are moved to a virtual secure area in a memory of the system; a clustering module configured to: retrieve, using a backpropagation module, variations of the altered set of RREF matrices from the sets of RREF matrices; cluster the retrieved variations of the altered set of RREF matrices and the altered set of RREF matrices to identify sets of RREF matrices that contain malicious activity; provide the identified sets of RREF matrices that contain malicious activity to the trained ANN module, whereby the trained ANN module is configured to identify a type of malicious activity associated with the identified sets of RREF matrices; and secure data files mapped to the identified sets of RREF matrices that contain malicious activity according to the type of malicious activity associated with the data files as identified by the trained ANN module.

[0017] In accordance with embodiments of the first aspect of the invention, the transformation of the altered data file into the altered set of RREF matrices by the transformation module comprises the transformation module being configured to: convert the altered data file into an intermediate data frame, wherein the intermediate data frame comprises a multimedia data frame or a character-based data frame; transform the intermediate data frame into a set of matrices using a first linear function; and transform the set of matrices into the altered set of RREF matrices.

[0018] In accordance with embodiments of the first aspect of the invention, the clustering module is further configured to provide sets of RREF matrices that do not contain malicious activity to the transformation module, whereby the transformation module is further configured to convert the provided sets of RREF matrices into data files, and map the data files to the provided sets of RREF matrices.

[0019] In accordance with embodiments of the first aspect of the invention, the variations of the altered set of RREF matrices comprise related sets of RREF matrices that have differing timestamps.

[0020] In accordance with embodiments of the first aspect of the invention, the mapping of the series of data files to the sets of RREF matrices, and the mapping of the altered data file to the altered set of RREF matrices, are done for each data file or each altered data file by applying a hashing function to an address of the data file and to a contextual metadata layer of the data file, and using the result of this hashing function to link each data file to its set of RREF matrices.

[0021] In accordance with embodiments of the first aspect of the invention, the trained ANN module is further configured to further train the neural network in the ANN module based on the identified sets of RREF matrices that contain malicious activity.

[0022] In accordance with embodiments of the first aspect of the invention, the transformation of the series of data files into the sets of RREF matrices by the transformation module comprises the transformation module being configured to: convert each of the data files into an intermediate data frame, wherein the intermediate data frame comprises a multimedia data frame or a character-based data frame; transform the intermediate data frame into a set of matrices using a first linear function; and transform the set of matrices into a set of RREF matrices.

[0023] According to a second aspect of the invention, a method for securing a series of data files using a system comprising a transformation module that comprises sets of reduced row echelon form (RREF) matrices that have been transformed from the series of data files, whereby the series of data files have been mapped to the sets of RREF matrices is disclosed, the method comprising the steps of: retrieving, using the transformation module, an altered data file from the system and transforming the altered data file into an altered set of RREF matrices; mapping, using the transformation module, the altered data file to the altered set of RREF matrices, and providing the altered set of RREF matrices to a trained artificial neural network (ANN) module; determining, using the trained ANN module, if the altered set of RREF matrices comprise malicious activity, whereby when it is determined that the altered set of RREF matrices comprise malicious activity, the altered set of RREF matrices and the altered data file are moved to a virtual secure area in a memory of the system; retrieving, using a backpropagation module, variations of the altered set of RREF matrices from the sets of RREF matrices; clustering, using a clustering module, the retrieved variations of the altered set of RREF matrices and the altered set of RREF matrices to identify sets of RREF matrices that contain malicious activity; providing, using the clustering module, the identified sets of RREF matrices that contain malicious activity to the trained ANN module, whereby the trained ANN module is configured to identify a type of malicious activity associated with the identified sets of RREF matrices; and securing, using the clustering module, data files mapped to the identified sets of RREF matrices that contain malicious activity according to the type of malicious activity associated with the data files as identified by the trained ANN module.

[0024] In accordance with embodiments of the second aspect of the invention, the transforming of the altered data file into the altered set of RREF matrices comprises: converting, using the transformation module, the altered data file into an intermediate data frame, wherein the intermediate data frame comprises a multimedia data frame or a character-based data frame; transforming, using the transformation module, the intermediate data frame into a set of matrices using a first linear function; and transforming, using the transformation module, the set of matrices into the altered set of RREF matrices.

[0025] In accordance with embodiments of the second aspect of the invention, the method further comprises the step of: providing, using the clustering module, sets of RREF matrices that do not contain malicious activity to the transformation module, whereby the transformation module is further configured to convert the provided sets of RREF matrices into data files, and map the data files to the provided sets of RREF matrices.

[0026] In accordance with embodiments of the second aspect of the invention, the variations of the altered set of RREF matrices comprise related sets of RREF matrices that have differing timestamps.

[0027] In accordance with embodiments of the second aspect of the invention, the mapping of the series of data files to the sets of RREF matrices, and the mapping of the altered data file to the altered set of RREF matrices, are done for each data file or each altered data file by applying a hashing function to an address of the data file and to a contextual metadata layer of the data file, and using the result of this hashing function to link each data file to its set of RREF matrices.

[0028] In accordance with embodiments of the second aspect of the invention, the method further comprises the step of training the neural network in the ANN module based on the identified sets of RREF matrices that contain malicious activity.

[0029] In accordance with embodiments of the second aspect of the invention, the transforming of the series of data files into the sets of RREF matrices by the transformation module comprises the steps of: converting, using the transformation module, each of the data files into an intermediate data frame, wherein the intermediate data frame comprises a multimedia data frame or a character-based data frame; transforming, using the transformation module, the intermediate data frame into a set of matrices using a first linear function; and transforming, using the transformation module, the set of matrices into a set of RREF matrices.

Brief Description of the Drawings

[0030] The above and other problems are solved by features and advantages of a system and method in accordance with the present invention described in the detailed description and shown in the following drawings.

Figure 1 illustrating a block diagram of a system for securing data files from a host machine in accordance with embodiments of the invention;

Figure 2 illustrating a block diagram representative of components provided within a module or computing device for executing embodiments in accordance with embodiments of the invention;

Figure 3 illustrating a flow diagram for transforming a series of data files into sets of reduced row echelon form (RREF) matrices in accordance with embodiments of the invention;

Figure 4 illustrating a flow diagram for securing data files in accordance with embodiments of the invention;

Figure 5 illustrating a diagram showing the transformation of a data file to an intermediate data frame in accordance with embodiments of the invention; and

Figure 6 illustrating a diagram showing the transformation of the intermediate data frame to a reduced row echelon form matrix in accordance with embodiments of the invention.

Detailed Description

[0031] This invention relates to a system and method for securing data files selected from a series of data files. The system comprises a transformation module, an artificial neural network (ANN), a clustering module and a backpropagation module and these modules are configured to identify data files that contain malware or anomalies. When such data files are detected, the system will then initiate a series of measures to identify other data files that may be similarly afflicted by the detected malware. These data files are then secured to prevent the malware from affecting a host machine and/or any storage/peripheral devices linked to the host machine.

[0032] In particular, a transformation module comprising sets of reduced row echelon form (RREF) matrices that have been transformed from the series of data files, whereby the series of data files have been mapped to the sets of RREF matrices, is configured to retrieve an altered data file from the system and transform the altered data file into an altered set of RREF matrices. The altered data file is then mapped to the altered set of RREF matrices, and this is then provided to a trained artificial neural network (ANN) module. Upon receiving the altered set of RREF matrices, the trained ANN module is then configured to determine if the altered set of RREF matrices comprise malicious activity, whereby when it is determined that the altered set of RREF matrices comprise malicious activity, the altered set of RREF matrices and the altered data file are moved to a virtual secure area in a memory of the system. A clustering module then retrieves, using a backpropagation module, variations of the altered set of RREF matrices from the sets of RREF matrices and then proceeds to cluster the retrieved variations of the altered set of RREF matrices and the altered set of RREF matrices in the secure area to identify sets of RREF matrices that contain malicious activity. Once identified, the identified sets of RREF matrices that contain malicious activity are then provided to the trained ANN module and data files mapped to the identified sets of RREF matrices that contain malicious activity will be secured.

[0033] The present invention will now be described in detail with reference to several embodiments thereof as illustrated in the accompanying drawings. In the following description, numerous specific features are set forth in order to provide a thorough understanding of the embodiments of the present invention. It will be apparent, however, to one skilled in the art, that embodiments may be realised without some or all of the specific features. Such embodiments should also fall within the scope of the current invention. Further, certain process steps and/or structures in the following may not have been described in detail and the reader will be referred to a corresponding citation so as to not obscure the present invention unnecessarily.

[0034] Further, one skilled in the art will recognize that many functional units in this description have been labelled as modules throughout the specification. The person skilled in the art will also recognize that a module may be implemented as circuits, logic chips or any sort of discrete component. Still further, one skilled in the art will also recognize that a module may be implemented in software which may then be executed by a variety of processor architectures. In embodiments of the invention, a module may also comprise computer instructions, firmware or executable code that may instruct a computer processor to carry out a sequence of events based on instructions received. The choice of the implementation of the modules is left as a design choice to a person skilled in the art and does not limit the scope of this invention in any way.

[0035] An exemplary process or method for securing data files in accordance with embodiments of the invention is set out in the steps below.

Step 1: retrieve, using a transformation module, an altered data file from the system and transform the altered data file into an altered set of RREF matrices; map the altered data file to the altered set of RREF matrices, and provide the altered set of RREF matrices to a trained artificial neural network (ANN) module, whereby the transformation module comprises sets of reduced row echelon form (RREF) matrices that have been transformed from the series of data files, and whereby the series of data files have been mapped to the sets of RREF matrices,

Step 2: determine, using the trained ANN module, if the altered set of RREF matrices comprise malicious activity, whereby when it is determined that the altered set of RREF matrices comprise malicious activity, the altered set of RREF matrices and the altered data file are moved to a virtual secure area in a memory of the system,

Step 3: retrieve, using a backpropagation module, variations of the altered set of RREF matrices from the sets of RREF matrices,

Step 4: cluster, using a clustering module, the retrieved variations of the altered set of RREF matrices and the altered set of RREF matrices to identify sets of RREF matrices that contain malicious activity, and

Step 5: provide, using the clustering module, the identified sets of RREF matrices that contain malicious activity to the trained ANN module, and secure data files mapped to the identified sets of RREF matrices that contain malicious activity.
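The control flow of Steps 1 to 5 can be sketched as follows. This is a minimal illustration only: the class name `SecuringPipeline`, its field names, and the stub callables in the usage lines are hypothetical and do not appear in the specification, and the real transformation, ANN, backpropagation, and clustering modules are replaced by placeholder functions.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

Matrix = List[List[float]]
RrefSet = List[Matrix]


@dataclass
class SecuringPipeline:
    # Each callable stands in for one module; all names are hypothetical.
    transform: Callable[[bytes], RrefSet]                # Step 1: file -> RREF set
    is_malicious: Callable[[RrefSet], bool]              # Step 2: ANN verdict
    find_variations: Callable[[str], List[str]]          # Step 3: related file ids
    cluster_malicious: Callable[[Dict[str, RrefSet]], List[str]]  # Step 4
    classify: Callable[[RrefSet], str]                   # Step 5: activity type
    rref_by_file: Dict[str, RrefSet] = field(default_factory=dict)
    secure_area: Dict[str, RrefSet] = field(default_factory=dict)

    def handle_altered_file(self, file_id: str, content: bytes) -> Dict[str, str]:
        """Run Steps 1 to 5 and return {file id: type of malicious activity}."""
        altered = self.transform(content)
        self.rref_by_file[file_id] = altered      # map file to its RREF set
        if not self.is_malicious(altered):
            return {}
        self.secure_area[file_id] = altered       # move to the secure area
        candidates = {
            fid: self.rref_by_file[fid]
            for fid in self.find_variations(file_id)
            if fid in self.rref_by_file
        }
        candidates[file_id] = altered
        secured = {}
        for fid in self.cluster_malicious(candidates):
            secured[fid] = self.classify(candidates[fid])
            self.secure_area[fid] = candidates[fid]
        return secured


# Usage with trivial stand-in modules (illustrative values only).
pipeline = SecuringPipeline(
    transform=lambda content: [[[float(b) for b in content[:2]]]],
    is_malicious=lambda rref_set: rref_set[0][0][0] > 100,
    find_variations=lambda file_id: ["report_v1"],
    cluster_malicious=lambda candidates: sorted(candidates),
    classify=lambda rref_set: "unknown",
)
pipeline.rref_by_file["report_v1"] = [[[120.0, 1.0]]]
result = pipeline.handle_altered_file("report_v2", bytes([200, 3]))
```

Passing the modules in as callables keeps the sketch independent of whether they are realised in hardware, firmware, or software.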

[0036] In accordance with embodiments of the invention, the steps set out above may be carried out or executed by a system or a hardware module (comprising sub-modules) that is communicatively connected to a host machine. The steps above may also be carried out or executed by a software module provided either at the host machine and/or at a connected peripheral device.

[0037] A block diagram of a system for securing data files in a host machine in accordance with embodiments of the invention is illustrated in Figure 1.

[0038] Figure 1 illustrates host machine 110 which generally may comprise any hardware device that has CPU 102, cache 104, memory 106 and/or storage 108. Some examples of host machines include, but are not limited to, computers, personal electronic devices, thin clients, and multi-functional devices. In particular, almost any kind of computer, including a centralized mainframe, a server or a desktop personal computer (PC) may be configured as a host machine.

[0039] CPU 102 may comprise any device or component that can process such instructions and may include: a microprocessor, microcontroller (MCU), programmable logic device or other computational device. That is, CPU 102 may be provided by any suitable logic circuitry for receiving inputs, processing them in accordance with instructions stored in memory 106 and generating outputs and may comprise a single core or multi-core processor with memory addressable space. Additionally, memory 106 may comprise volatile and non-volatile memory, and storage 108 may comprise (but is not limited to) solid state devices (SSDs), hard disk drives (HDDs), optical drives, or magnetic disc drives.

[0040] Host machine 110 is communicatively connected to system 120 through an I/O hub (not shown) which may comprise, but is not limited to, any type of microchip that may be used to manage data communications between CPU 102 and the various electronic components in host machine 110 and to manage data that is to be exchanged between host machine 110 and system 120.

[0041] System 120 comprises transformation module 122 that is configured to transform data files into reduced row echelon form (RREF) matrices and to reconstruct RREF matrices back to data files, artificial neural network (ANN) module 124 that comprises a trained neural network, memory 130 that comprise secure and unsecure volatile and non-volatile memories, backpropagation module 126 that is configured to identify variations of data files and/or sets of RREF matrices and clustering module 128 that is configured to cluster RREF matrices to generate clusters of RREF matrices to identify clusters of RREF matrices that may contain malicious activity. One skilled in the art will recognize that system 120 may comprise other types of additional modules and/or components without departing from the invention.

[0042] In accordance with embodiments of the invention, during an initial setup stage, system 120 will obtain the entire series, parts of the entire series, or selected portions of data files 111 from host machine 110, and data files 111 may comprise, but are not limited to, any type of files that may be stored within memory 106 or storage 108, or any files that may be internally communicated between or executed by the modules of host machine 110. Checksums will then be extracted from the data files that have checksums and this information is then stored in memory 130 of system 120 for future use.

[0043] Under the assumption that the entire series of data files 111 are obtained from host machine 110, system 120 then proceeds to bind various permission and user settings to each of data files 111. These settings are essential as the metadata associated with each of the data files may contain information relating to the authenticity of the bound files. Further, this metadata may be used as authentication means, e.g., an authentication certificate such as a .PEM file, for internal validation processes.

[0044] Transformation module 122 then proceeds to convert each of the data files (which may comprise a data file bound with permission and user settings or may comprise an unbound data file). Each data file is first converted into an intermediate data frame so that it may subsequently be easily converted into a set of matrices. In embodiments of the invention, the intermediate data frame may comprise a multimedia data frame, which comprises frames having an image file format. Given the red-green-blue (RGB) nature of multimedia data frames, converting them into numeric values is made easier. In other embodiments of the invention, the intermediate data frame may comprise a character-based data frame instead. In further embodiments of the invention, the intermediate data frame may comprise both multimedia and character-based data frames.

[0045] The intermediate data frame, regardless of whether it is a multimedia or character-based data frame, is then converted into a set of matrices. This is illustrated in Figure 5, which shows that a data file 502 is converted through a conversion process 506 to a matrix snapshot 504. Figure 6 then illustrates the conversion of the matrix snapshot 604 to a numerical matrix representation 602. In embodiments of the invention, this conversion may be done using a predetermined look-up table whereby each alphanumeric symbol and/or character and/or pixel and/or colour in the intermediate data frame may be represented by a unique number and/or symbol. A predetermined matrix generation algorithm or function may then be used together with the look-up table to convert the intermediate data frame into sets of matrices. In other embodiments of the invention, more complex mathematical functions may be used to convert the intermediate data frames into numerical data frames, and these mathematical functions may comprise functions such as shuffling, rotation, reordering, etc.
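As an illustration of the look-up-table conversion, the sketch below maps a character-based data frame to fixed-size integer matrices. The alphabet, the 2 x 3 matrix shape, the zero padding, and the function names `build_lookup_table` and `frame_to_matrices` are all assumptions chosen for the example; the specification does not fix any of them.

```python
from typing import Dict, List


def build_lookup_table(alphabet: str) -> Dict[str, int]:
    """Assign each symbol a unique positive integer (hypothetical scheme)."""
    return {symbol: index + 1 for index, symbol in enumerate(alphabet)}


def frame_to_matrices(frame: str, table: Dict[str, int],
                      rows: int, cols: int, pad: int = 0) -> List[List[List[int]]]:
    """Split a character-based frame into rows x cols integer matrices."""
    values = [table[ch] for ch in frame]
    size = rows * cols
    # Pad the tail so that the last matrix is complete.
    if len(values) % size:
        values += [pad] * (size - len(values) % size)
    matrices = []
    for start in range(0, len(values), size):
        block = values[start:start + size]
        matrices.append([block[r * cols:(r + 1) * cols] for r in range(rows)])
    return matrices


table = build_lookup_table("abcdefghijklmnopqrstuvwxyz ")
matrices = frame_to_matrices("hello world", table, rows=2, cols=3)
# matrices == [[[8, 5, 12], [12, 15, 27]], [[23, 15, 18], [12, 4, 0]]]
```

A multimedia data frame could be handled the same way by looking up pixel or colour values instead of characters.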

[0046] In a further embodiment of the invention, a partial differential equation function may be used with the numerical look-up table to derivate the data and to encrypt the data using dependent arguments. The dependent arguments may comprise the hash or the encryption keys with which the data is parameterized and segmented.

[0047] Once each of the intermediate data frames has been converted into sets of matrices, transformation module 122 then further transforms the sets of matrices (which may be represented by linear equations) into sets of reduced row echelon form (RREF) matrices, and this may be done, but is not limited to, through the use of various mathematical operations. This step of transforming the sets of matrices into sets of RREF matrices adds an additional layer of encryption to the data as the transformation process involves the use of linear equations which are known only to system 120. It can thereby be said that these linear equations act as unique encryption algorithms for system 120.
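For reference, the reduction of a single matrix to reduced row echelon form can be sketched with standard Gauss-Jordan elimination. This shows only the generic textbook reduction, using exact rational arithmetic; it is not the specific set of linear equations that the specification says is known only to system 120, and the function name `rref` is chosen for the example.

```python
from fractions import Fraction
from typing import List, Sequence


def rref(matrix: Sequence[Sequence[int]]) -> List[List[Fraction]]:
    """Return the reduced row echelon form using exact rational arithmetic."""
    rows = [[Fraction(v) for v in row] for row in matrix]
    n_rows = len(rows)
    n_cols = len(rows[0]) if rows else 0
    pivot_row = 0
    for col in range(n_cols):
        if pivot_row == n_rows:
            break
        # Find a row at or below pivot_row with a non-zero entry in this column.
        candidate = next(
            (r for r in range(pivot_row, n_rows) if rows[r][col] != 0), None
        )
        if candidate is None:
            continue
        rows[pivot_row], rows[candidate] = rows[candidate], rows[pivot_row]
        # Scale the pivot row so that the pivot entry becomes 1.
        pivot = rows[pivot_row][col]
        rows[pivot_row] = [v / pivot for v in rows[pivot_row]]
        # Clear this column in every other row.
        for r in range(n_rows):
            if r != pivot_row and rows[r][col] != 0:
                factor = rows[r][col]
                rows[r] = [a - factor * b
                           for a, b in zip(rows[r], rows[pivot_row])]
        pivot_row += 1
    return rows


reduced = rref([[1, 2, 3], [4, 5, 6]])
# reduced == [[1, 0, -1], [0, 1, 2]]
```

Different matrices can share the same reduced form, so recovering the original data requires additional stored information beyond the RREF matrix itself; the sketch covers only the forward reduction.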

[0048] Once the series of data files 111 has been converted into corresponding sets of RREF matrices by transformation module 122, each data file is then mapped to its set of RREF matrices and this may be done using, but is not limited to, a key-value pairing function or a key-value hash function. The function used to generate the pairing and the generated maps, along with the sets of RREF matrices, are then stored safely in memory 130.

[0049] In another embodiment of the invention, a hashing function may be applied to the metadata of a data file and/or the address of the data file. The result from the hashing function may then be used to map the data file to its corresponding set of RREF matrices. The usage of the contextual metadata layer of the data file in the hashing function adds a unique property to the mapping function as not only is the address of the data file verified, but the referenced metadata is verified as well.
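One possible reading of this mapping is sketched below: the file address and its metadata are hashed together, and the digest is used as the key under which the set of RREF matrices is stored. The choice of SHA-256, the JSON canonicalisation of the metadata, and the names `mapping_key` and `RrefStore` are assumptions made for illustration; the specification does not name a particular hashing function.

```python
import hashlib
import json
from typing import Any, Dict, List, Optional

RrefSet = List[List[List[int]]]


def mapping_key(address: str, metadata: Dict[str, Any]) -> str:
    """Hash the file address together with its metadata into one key."""
    # Canonical JSON so that equal metadata always hashes identically.
    canonical = json.dumps(metadata, sort_keys=True, separators=(",", ":"))
    digest = hashlib.sha256()
    digest.update(address.encode("utf-8"))
    digest.update(b"\x00")  # separator between the two hashed fields
    digest.update(canonical.encode("utf-8"))
    return digest.hexdigest()


class RrefStore:
    """Key-value map from (address, metadata) hash to a set of RREF matrices."""

    def __init__(self) -> None:
        self._by_key: Dict[str, RrefSet] = {}

    def put(self, address: str, metadata: Dict[str, Any], rref_set: RrefSet) -> str:
        key = mapping_key(address, metadata)
        self._by_key[key] = rref_set
        return key

    def get(self, address: str, metadata: Dict[str, Any]) -> Optional[RrefSet]:
        # Average-case constant-time dictionary lookup.
        return self._by_key.get(mapping_key(address, metadata))


store = RrefStore()
meta = {"owner": "alice", "mode": "0640"}
key = store.put("/data/report.txt", meta, [[[1, 0], [0, 1]]])
found = store.get("/data/report.txt", meta)
tampered = store.get("/data/report.txt", {"owner": "mallory", "mode": "0640"})
```

Because the metadata takes part in the hash, a lookup with the correct address but altered metadata returns nothing, which mirrors the verification property described above.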

[0050] In embodiments of the invention, the hashing function may comprise a combination of natural language processing (NLP) algorithms and key-value binding functions. The NLP algorithms are configured to process the contextual metadata and semi-tag the information, parametrizing it and storing it at a specific key address. This improves the search algorithm (of the mapping function) and gives it a complexity of O(1), as the task of locating the data frame comprises the system searching for the appropriate key, which may be generated together with the validation of the checksum for the data file. As a result, this key may be uniquely defined within the system.

[0051] This step is important because in traditional operating systems, the data files are simply stored in the host machine's hex memory address whereby pointers are used to identify the location of the data files. If these pointers were to be corrupted by malware, this would effectively render the data files useless as they would not be able to be retrieved by the host machine. Once this is done, system 120 then continuously monitors host machine 110 for changes to the series of data files 111.

[0052] In embodiments of the invention, when system 120 detects an altered data file in host machine 110, system 120 proceeds to retrieve the altered data file from host machine 110. In embodiments of the invention, system 120 will first validate the checksums of the altered data file to determine if the altered data file has been altered by an unauthorized user. If the altered data file contains a corrupted checksum or a mismatched checksum, this altered data file is then moved into a secure area to be further processed. However, malicious third parties have since uncovered ways to alter data files while ensuring that the checksum remains untouched. Hence, there is the need for the altered data file to be further processed by system 120.
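The checksum validation described above may be sketched as follows. The checksum algorithm is not specified, so the use of SHA-256 and the function names `sha256_checksum` and `needs_quarantine` are assumptions made for illustration.

```python
import hashlib
import hmac


def sha256_checksum(data: bytes) -> str:
    """Compute a hex digest for the contents of a data file."""
    return hashlib.sha256(data).hexdigest()


def needs_quarantine(data: bytes, recorded_checksum: str) -> bool:
    """Return True when the current contents do not match the recorded checksum."""
    current = sha256_checksum(data)
    # compare_digest performs a constant-time comparison of the two digests.
    return not hmac.compare_digest(current, recorded_checksum)
```

A matching checksum is only a first filter in this flow: a file that passes it is still transformed and examined further, as set out in the following paragraphs.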

[0053] Using transformation module 122, the retrieved altered data file is then transformed into an altered set of RREF matrices. In embodiments of the invention, this may be done by first converting the altered data file into an intermediate data frame before it is subsequently converted into sets of matrices as previously described. The sets of matrices are then transformed using mathematical operations into their corresponding altered sets of RREF matrices. Transformation module 122 then maps the altered data file to the altered sets of RREF matrices and stores this mapping securely in memory 130. The altered sets of RREF matrices are then provided to trained artificial neural network (ANN) module 124.

[0054] In other embodiments of the invention, it is noted that when a data file has been modified or altered, the roots of the polynomial equations used to transform the altered data file into an altered set of RREF matrices may be used to identify the numbers which are altered and/or changed in the altered set of RREF matrices. If the numbers in the altered set of RREF matrices comprise a multiplier of the root of the polynomial equations, this implies that the data file has not been corrupted. Conversely, if the numbers in the altered set of RREF matrices do not comprise a multiplier of the root of the polynomial equations, this implies that the data file has been corrupted.

[0055] In embodiments of the invention, the ANN may comprise a neural network having between 66 and 200 hidden layers. The ANN may be initially trained using an unsupervised learning methodology whereby the training dataset comprises sets of RREF matrices associated with known malware, breaches and/or cyberattacks, and/or sets of RREF matrices that are unaffected by malware. The sets of RREF matrices may be labelled according to their respective properties such as, but not limited to, read/write operations, metadata, access level threads, unique hardware identifiers, memory consumed, I/O bursts, physical addresses or software checksums, along with their respective timeframes. As the ANN is trained using an unsupervised learning approach, the trained ANN will keep learning of new threats as these threats are identified. In further embodiments of the invention, the ANN may be trained to identify parameters of the sets of RREF matrices that are breaking apart due to a particular type of malicious attack. Once these parameters are identified and associated with the type of malicious attack, the trained ANN will then be able to detect such attacks in the future.

[0056] In a further embodiment of the invention, the neural architecture of the ANN was designed using training sample data of 100 significant categories of known virus types. The features affected were identified based on the number of input nodes, the number of hidden layers, and the number of nodes in each hidden layer. The learning rate of the ANN was parametrized with a value of 0.01 at the start and then gradually increased, with the batch size being 1000. The number of epochs was started at a value of 500, and it was found that optimum results were obtained when a batch size of 800 and 250 epochs were used.

[0057] Trained ANN module 124 then proceeds to use the trained ANN to determine if the altered sets of RREF matrices comprise malicious activity. If the trained ANN determines that the altered sets of RREF matrices do not comprise any malware, ANN module 124 will then cause transformation module 122 to reconstruct the altered sets of RREF matrices back into the original data file and mark it as secure. These secure data files may then be returned to host machine 110 as data files 112 and the mapping will be kept in memory 130.

[0058] In embodiments of the invention, if the trained ANN determines that the altered sets of RREF matrices comprise malicious activity, the altered sets of RREF matrices along with their mapped altered data file are then moved to a virtual secure area in memory 130. In embodiments of the invention, this virtual secure area may be a virtual sandbox in memory 130.

[0059] Clustering module 128 is then configured to use backpropagation module 126 to retrieve variations of altered sets of RREF matrices from memory 130. In embodiments of the invention, these variations may comprise altered sets of RREF matrices over a series of past timeframes or altered sets of RREF matrices that have very similar values and/or numerical arrangements.

[0060] Clustering module 128 then adds these variations of the altered sets of RREF matrices to the virtual secure area in memory 130. The variations of the altered sets of RREF matrices together with the altered sets of RREF matrices are then clustered. In embodiments of the invention, a k-means clustering function is used to cluster these RREF matrices.
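A minimal sketch of k-means clustering applied to flattened RREF matrices is given below. The flattening of each matrix into a vector, the deterministic initialization from the first k points, the fixed iteration count, and the names `flatten`, `squared_distance` and `k_means` are assumptions made for illustration only.

```python
def flatten(matrix):
    """Turn a matrix (list of rows) into a flat list of floats."""
    return [float(x) for row in matrix for x in row]


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def k_means(points, k, iterations=20):
    """Cluster points into k groups; returns (labels, centroids)."""
    # Assumption: the first k points seed the centroids, for reproducibility.
    centroids = [list(p) for p in points[:k]]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        labels = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids
```

In this sketch, sets of RREF matrices with similar values end up sharing a label, which is the property the clustering module relies on when it relates stored variations to an altered set.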

[0061] Clustering module 128 then identifies sets of RREF matrices that are associated with the altered sets of RREF matrices that were found to have malicious activity. These previously unidentified sets of RREF matrices are then flagged as sets of RREF matrices that potentially may comprise malicious activity and are subsequently secured by system 120 so that they will not be inadvertently installed or executed by host machine 110. Sets of RREF matrices that are not associated with the altered sets of RREF matrices together with their corresponding mapped data files are then classified as safe files.

[0062] In embodiments of the invention, sets of RREF matrices that were found to have malicious activity are provided by clustering module 128 to trained ANN module 124. In an embodiment of the invention, this information is utilized to further train the trained ANN of ANN module 124 so that it may better understand and identify such similar malware in the future. In another embodiment of the invention, trained ANN module 124 is used to identify the types of malicious activities and/or malware associated with these sets of RREF matrices. This information is then provided to clustering module 128 so that module 128 may tailor its approach to secure the data files that correspond to these sets of RREF matrices. For example, if trained ANN module 124 identifies a previously undiscovered cluster of RREF matrices that are associated with a backdoor computing attack, trained ANN module 124 will provide this information to clustering module 128. Upon receiving this information, clustering module 128 (on its own or together with other modules of system 120) may then disable all incoming data transmissions relating to the data files associated with this cluster of RREF matrices to prevent the backdoor computing attack from ever taking place (i.e. by denying access to these files from external third parties).

[0063] In another embodiment of the invention, ANN module 124 will induce a lock key mechanism to secure the lowest bits of the RREF matrices found to be associated with malicious activity to ensure that these RREF matrices may not be easily altered by malicious third parties thereby disabling the vulnerabilities of their corresponding data files.

[0064] In accordance with embodiments of the invention, a block diagram representative of components of processing system 200 that may be provided within any of the modules for implementing embodiments of the invention is illustrated in Figure 2. One skilled in the art will recognize that the exact configuration of each processing system provided within these modules may be different, that the exact configuration of processing system 200 may vary, and that Figure 2 is provided by way of example only.

[0065] In embodiments of the invention, each of the modules may comprise controller 201 and user interface 202. User interface 202 is arranged to enable manual interactions between a user and each of these modules as required and for this purpose includes the input/output components required for the user to enter instructions to provide updates to each of these modules. A person skilled in the art will recognize that components of user interface 202 may vary from embodiment to embodiment but may typically include one or more of display 240, keyboard 235 and trackpad 236 or conversely may not include any of these components.

[0066] Controller 201 is in data communication with user interface 202 via bus 215 and includes memory 220, processor 205 mounted on a circuit board that processes instructions and data for performing the method of this embodiment, an operating system 206, an input/output (I/O) interface 230 for communicating with user interface 202 and a communications interface, in this embodiment in the form of a network card 250. Network card 250 may, for example, be utilized to send data from these modules via a wired or wireless network to other processing devices or to receive data via the wired or wireless network. Wireless networks that may be utilized by network card 250 include, but are not limited to, Wireless-Fidelity (Wi-Fi), Bluetooth, Near Field Communication (NFC), cellular networks, satellite networks, telecommunication networks, Wide Area Networks (WAN) etc.

[0067] Memory 220 and operating system 206 are in data communication with processor 205 via bus 210. The memory components include both volatile and non-volatile memory and more than one of each type of memory, including Random Access Memory (RAM) 223, Read Only Memory (ROM) 225 and storage memory (not shown), the last comprising one or more solid-state drives (SSDs). Memory 220 also includes secure storage 246 for securely storing secret keys, or private keys. One skilled in the art will recognize that the memory components described above comprise non-transitory computer-readable media and shall be taken to comprise all computer-readable media except for a transitory, propagating signal. Typically, the instructions are stored as program code in the memory components but can also be hardwired. Memory 220 may include a kernel and/or programming modules such as a software application that may be stored in either volatile or non-volatile memory.

[0068] Herein the term "processor" is used to refer generically to any device or component that can process such instructions and may include: a microprocessor, microcontroller, programmable logic device or other computational device. That is, processor 205 may be provided by any suitable logic circuitry for receiving inputs, processing them in accordance with instructions stored in memory and generating outputs (for example to the memory components or on display 240). In this embodiment, processor 205 may be a single core or multi-core processor with memory addressable space. In one example, processor 205 may be multi-core, comprising, for example, an 8-core CPU. In another example, it could be a cluster of CPU cores operating in parallel to accelerate computations.

[0069] Figure 3 illustrates process 300 for mapping a series of data files obtained from a host machine to their corresponding sets of RREF matrices in accordance with embodiments of the invention whereby process 300 may be implemented in system 120. It is assumed that system 120 is communicatively connected to host machine 110 thereby allowing system 120 to capture all data being transmitted on the bus system of host machine 110 or stored in the memory of host machine 110.

[0070] Process 300 begins at step 302 with process 300 selecting a data file from the series of data files stored within the host machine or being transmitted/received by the host machine. Process 300 then transforms the data file into a set of RREF matrices at step 304. At step 306, process 300 then stores the set of RREF matrices and maps the set of RREF matrices to the data file.

[0071] Process 300 then proceeds to select another data file at step 308. This other data file is then transformed into another set of RREF matrices at step 310. At step 312, process 300 then stores this other set of RREF matrices and maps it to the other data file. Process 300 then returns to step 308. The process 314 of selecting another data file, transforming the data file and mapping the data file repeats itself until all the data files in the series of data files have been processed by process 300. Process 300 then ends.
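The loop of process 300 may be sketched as below. The function name `map_series`, the `transform` callable (standing in for the transformation of a data file into a set of RREF matrices) and the `rref-<index>` key format are hypothetical and serve only to show the select, transform, store and map sequence.

```python
def map_series(data_files, transform):
    """Transform every data file and map it to its stored set of RREF matrices.

    data_files: dict of file name -> file contents (bytes)
    transform:  callable turning file contents into a set of RREF matrices
    """
    rref_store = {}  # key -> set of RREF matrices
    mapping = {}     # file name -> key
    for index, (name, content) in enumerate(data_files.items()):  # steps 302 / 308
        key = f"rref-{index}"
        rref_store[key] = transform(content)  # steps 304 / 310
        mapping[name] = key                   # steps 306 / 312
    return mapping, rref_store
```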

[0072] Figure 4 illustrates process 400 for securing data files obtained from a host machine in accordance with embodiments of the invention whereby process 400 may be implemented by the modules in system 120. Process 400 begins at step 402 whereby an altered data file, i.e. a first data file, is retrieved from the host machine by process 400. This altered data file may comprise a variation of a data file that was previously stored and processed by process 300 (as described in Figure 3). Process 400 then transforms the first data file into a first set of RREF matrices at step 404. At step 406, process 400 determines whether the first set of RREF matrices comprises malware or malicious activity. If process 400 determines that the first set of RREF matrices does not comprise malware, process 400 proceeds to step 408 whereby the first data file and its corresponding set of RREF matrices are stored along with their corresponding mapping.

[0073] Conversely, if process 400 determines at step 406 that the first set of RREF matrices is associated with malicious activity, process 400 proceeds to step 410. At step 410, the first set of RREF matrices along with its mapped first data file is moved to a virtual secure area. Process 400 then proceeds to retrieve all variations of the first set of RREF matrices from its memory at step 412. The variations of the first set of RREF matrices along with the first set of RREF matrices are then clustered by process 400. This takes place at step 414. Based on the outcome of the clustering that took place at step 414, at step 416, process 400 identifies variations of the first set of RREF matrices (that were previously considered to be safe) that contain malicious activity and also identifies variations of the first set of RREF matrices that do not contain malicious activity. The first data file is then restored by process 400 at step 418 based on a first set of RREF matrices that does not contain malicious activity. At step 420, variations of the first set of RREF matrices that contain malicious activity are then provided by process 400 to the trained ANN for further processing as described in the previous sections.
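The branching of process 400 from step 406 onwards may be sketched as follows. The callables `is_malicious` (standing in for the trained ANN) and `cluster_labels` (standing in for the clustering module) are hypothetical placeholders. The rule that a variation is flagged when it shares a cluster with the malicious first set is an assumption about how clusters are associated with malicious activity.

```python
def secure_altered_file(file_id, first_rref, stored_variations,
                        is_malicious, cluster_labels):
    """Sketch of steps 406 to 420 for one altered (first) data file.

    stored_variations: dict of variation id -> previously stored RREF set
    is_malicious:      callable(rref_set) -> bool
    cluster_labels:    callable(list of rref_sets) -> list of cluster labels
    """
    if not is_malicious(first_rref):  # step 406
        # Step 408: nothing suspicious, so the file is stored with its mapping.
        return {"status": "stored", "quarantined": [], "flagged": [], "safe": []}

    quarantined = [file_id]                # step 410: move to the secure area
    ids = list(stored_variations)          # step 412: retrieve the variations
    items = [first_rref] + [stored_variations[i] for i in ids]
    labels = cluster_labels(items)         # step 414: cluster everything
    bad_label = labels[0]                  # cluster of the malicious first set

    # Step 416: split the variations by whether they share the malicious cluster.
    flagged = [i for i, lab in zip(ids, labels[1:]) if lab == bad_label]
    safe = [i for i, lab in zip(ids, labels[1:]) if lab != bad_label]

    # Step 418 would restore the file from an entry in `safe`;
    # step 420 would hand the entries in `flagged` to the trained ANN.
    return {"status": "quarantined", "quarantined": quarantined,
            "flagged": flagged, "safe": safe}
```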

[0074] Numerous other changes, substitutions, variations and modifications may be ascertained by those skilled in the art and it is intended that the present invention encompass all such changes, substitutions, variations and modifications as falling within the scope of the appended claims.

Claims

1. A system for securing a series of data files, the system comprising:

a transformation module comprising sets of reduced row echelon form (RREF) matrices that have been transformed from the series of data files, whereby the series of data files have been mapped to the sets of RREF matrices, the transformation module being configured to:

retrieve an altered data file from the system and transform the altered data file into an altered set of RREF matrices;

map the altered data file to the altered set of RREF matrices, and provide the altered set of RREF matrices to a trained artificial neural network (ANN) module;

the trained ANN module configured to:
determine if the altered set of RREF matrices comprise malicious activity, whereby when it is determined that the altered set of RREF matrices comprise malicious activity, the altered set of RREF matrices and the altered data file are moved to a virtual secure area in a memory of the system;

a clustering module configured to:

retrieve, using a backpropagation module, variations of the altered set of RREF matrices from the sets of RREF matrices;

cluster the retrieved variations of the altered set of RREF matrices and the altered set of RREF matrices to identify sets of RREF matrices that contain malicious activity;

provide the identified sets of RREF matrices that contain malicious activity to the trained ANN module, whereby the trained ANN module is configured to identify a type of malicious activity associated with the identified sets of RREF matrices; and

secure data files mapped to the identified sets of RREF matrices that contain malicious activity according to the type of malicious activity associated with the data files as identified by the trained ANN module.

2. The system according to claim 1, wherein the transformation of the altered data file into the altered set of RREF matrices by the transformation module comprises the transformation module being configured to:

convert the altered data file into an intermediate data frame, wherein the intermediate data frame comprises a multimedia data frame or a character-based data frame;

transform the intermediate data frame into a set of matrices using a first linear function; and

transform the set of matrices into the altered set of RREF matrices.

3. The system according to claim 1 whereby the clustering module is further configured to provide sets of RREF matrices that do not contain malicious activity to the transformation module, whereby the transformation module is further configured to convert the provided sets of RREF matrices into data files, and map the data files to the provided sets of RREF matrices.

4. The system according to claim 1 whereby the variations of the altered set of RREF matrices comprise related sets of RREF matrices that have differing timestamps.

5. The system according to claim 1, whereby the mapping of the series of data files to the sets of RREF matrices, and the mapping of the altered data file to the altered set of RREF matrices is done for each data file or each altered data file by applying a hashing function to an address of the data file and to a contextual metadata layer of the data file, and using a result of this hashing function to link each data file to each sets of RREF matrices.

6. The system according to claim 1 whereby the trained ANN module is further configured to further train the neural network in the ANN module based on the identified sets of RREF matrices that contain malicious activity.

7. The system according to claim 1, wherein the transformation of the series of data files into the sets of RREF matrices by the transformation module comprises the transformation module being configured to:

convert each of the data files into an intermediate data frame, wherein the intermediate data frame comprises a multimedia data frame or a character-based data frame;

transform the intermediate data frame into a set of matrices using a first linear function; and

transform the set of matrices into a set of RREF matrices.

8. A method for securing a series of data files using a system comprising a transformation module that comprises sets of reduced row echelon form (RREF) matrices that have been transformed from the series of data files, whereby the series of data files have been mapped to the sets of RREF matrices, the method comprising the steps of:

retrieving, using the transformation module, an altered data file from the system and transforming the altered data file into an altered set of RREF matrices;

mapping, using the transformation module, the altered data file to the altered set of RREF matrices, and providing the altered set of RREF matrices to a trained artificial neural network (ANN) module;

determining, using the trained ANN module, if the altered set of RREF matrices comprise malicious activity, whereby when it is determined that the altered set of RREF matrices comprise malicious activity, the altered set of RREF matrices and the altered data file are moved to a virtual secure area in a memory of the system;

retrieving, using a backpropagation module, variations of the altered set of RREF matrices from the sets of RREF matrices;

clustering, using a clustering module, the retrieved variations of the altered set of RREF matrices and the altered set of RREF matrices to identify sets of RREF matrices that contain malicious activity;

providing, using the clustering module, the identified sets of RREF matrices that contain malicious activity to the trained ANN module, whereby the trained ANN module is configured to identify a type of malicious activity associated with the identified sets of RREF matrices; and

securing, using the clustering module, data files mapped to the identified sets of RREF matrices that contain malicious activity according to the type of malicious activity associated with the data files as identified by the trained ANN module.

9. The method according to claim 8, wherein the transforming of the altered data file into the altered set of RREF matrices comprises:

converting, using the transformation module, the altered data file into an intermediate data frame, wherein the intermediate data frame comprises a multimedia data frame or a character-based data frame;

transforming, using the transformation module, the intermediate data frame into a set of matrices using a first linear function; and

transforming, using the transformation module, the set of matrices into the altered set of RREF matrices.

10. The method according to claim 8 whereby the method further comprises the step of:
providing, using the clustering module, sets of RREF matrices that do not contain malicious activity to the transformation module, whereby the transformation module is further configured to convert the provided sets of RREF matrices into data files, and map the data files to the provided sets of RREF matrices.

11. The method according to claim 8 whereby the variations of the altered set of RREF matrices comprise related sets of RREF matrices that have differing timestamps.

12. The method according to claim 8, whereby the mapping of the series of data files to the sets of RREF matrices, and the mapping of the altered data file to the altered set of RREF matrices is done for each data file or each altered data file by applying a hashing function to an address of the data file and to a contextual metadata layer of the data file, and using a result of this hashing function to link each data file to each sets of RREF matrices.

13. The method according to claim 8 whereby the method further comprises the step of training the neural network in the ANN module based on the identified sets of RREF matrices that contain malicious activity.

14. The method according to claim 8, wherein the transforming of the series of data files into the sets of RREF matrices by the transformation module comprises the steps of:

converting, using the transformation module, each of the data files into an intermediate data frame, wherein the intermediate data frame comprises a multimedia data frame or a character-based data frame;

transforming, using the transformation module, the intermediate data frame into a set of matrices using a first linear function; and

transforming, using the transformation module, the set of matrices into a set of RREF matrices.

Drawing

Search report