DETAILED ACTION
This office action is in response to the amendments filed on 10/27/2025.
Claims 1, 6 and 14 are amended.
Claims 3-5, 11-13, and 16-18 have been cancelled.
Claims 1-2, 6-10, and 14-15 are presented for examination.
Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
Response to Arguments
Applicant's arguments filed 10/27/2025 regarding the 35 U.S.C. 103 rejections (Remarks, pp. 6-8) have been fully considered but are not persuasive.
Applicant argues in essence:
[a] “The present rejection alleges that it is obvious to modify Aharon in view of Lopinuso to use Lopinuso's process of "tak[ing] each executable object apart in chunks and removing the execution capability from each part by removing portions of the header that are needed to launch execution. The chunks may be of predetermined chunk sizes or random chunk sizes in various alternatives." Lospinuso, 9:13-18.
This process is incompatible with Aharon. Aharon does not acknowledge or discuss the presence of a header in any capacity. The term "header" does not appear at all in Aharon.”
In response to [a], examiner respectfully disagrees. The cited passage is relied upon not for the header analysis but for the segmentation of the code. Each part of this process is part of the “defanging” of the code to facilitate its safe handling.
For example, Lospinuso Col. 4 lines 5-23 discloses “Example embodiments may employ a strategy to “defang” or otherwise render malware safe not only for storage and transport, but also for analysis. Thus, example embodiments may make the malware safe to handle, analyze and store in a normal networked environment without requiring a quarantined network segment. By employing example embodiments, the malware is dismembered into component pieces, removing the execution capability from each part.” This passage explicitly describes the segmentation itself as removing execution capability.
Lospinuso Col. 6 line 60-Col. 7 line 10 further discloses “In use, if a user chooses not to install unpacking and de-archiving tools in system 10, systems 80 and 60 would need policies and procedures, respectively, for dismembering incoming files that can contain a number of files, so that a segment of a dismembered file cannot inadvertently contain an entire executable object packed or archived within it.”
As seen above, segmentation alone facilitates safe handling of code, ensuring that a single segment cannot contain an entire executable object. Therefore, although Lospinuso further discloses analysis of a header (which examiner notes is not a packet header, but rather a piece of information used by the executable in order to launch; Lospinuso Col. 4 lines 5-23), the segmentation of the code into random chunk sizes is still applicable to Aharon, as it would provide safer analysis of potentially malicious code (Lospinuso col. 7 lines 11-15; background, col. 1 lines 22-60).
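For illustration only, and not drawn from any cited reference, the random-size segmentation relied upon from Lospinuso can be sketched as follows; the function name, size bounds, and seed parameter are all hypothetical:

```python
import random

def defang_by_segmentation(data: bytes, min_size: int, max_size: int, seed=None) -> list[bytes]:
    """Split an executable object into chunks of random sizes within
    [min_size, max_size], so that no single chunk can contain the
    entire executable object (cf. Lospinuso col. 9 lines 10-21)."""
    rng = random.Random(seed)
    chunks = []
    i = 0
    while i < len(data):
        size = rng.randint(min_size, max_size)  # randomly selected length within a predetermined range
        chunks.append(data[i:i + size])
        i += size
    return chunks
```

Note that the chunks losslessly reassemble into the original object, which is consistent with segmentation being used for safe handling rather than for destruction of the sample.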
[b] “Aharon is directed to analyzing data traffic input to a gateway of a data network. Aharon summarizes the process for detecting suspicious code at a gateway in the following paragraph:
[0013] …
As indicated in the bolded language above, disassembly is performed by Aharon for the purpose of separating code into individual instructions and then analyzing each instruction for a threat potential. See also claim 1 ("for each instruction in said disassembled code, (c) assigning respectively a threat weight for each said instruction..."), etc.
In order to assign a threat weight for each instruction, each instruction is disassembled and analyzed separately. See, e.g., [0036], showing disassembly into specific instructions (XOR EAX,EAX NOP); see also [0041] discussing the disassembly process in detail ("The instruction is incremented (step 509) and each instruction is disassembled (step 507) until a branch (or conditional branch) instruction is reached (step 511)." Aharon's disassembly is deliberate and careful, and incompatible with random segmentation.
"If a proposed modification would render the prior art invention being modified unsatisfactory for its intended purpose, there may be no suggestion or motivation to make the proposed modification." MPEP 2143.01(V).
Modifying Aharon's process to separate code into random chunk sizes according to Lospinuso would render Aharon's process unsuitable for its intended purpose. Aharon requires disassembling individual instructions and analyzing each instruction for a threat level. Separating the code into random chunks would thwart this process since the instructions would be cut off arbitrarily and would not be separated individually as required by Aharon. In addition, it would not be possible to determine precisely when a branch is reached.”
In response to [b], Applicant argues that because Aharon analyzes each instruction separately, it cannot function if modified to use random chunk sizes of the entire executable code. Examiner respectfully disagrees.
Aharon Fig. 4 para.0040 “In the example of FIG. 4, it is clear where the code starts, i.e. after a vulnerable return address. However, in case a vulnerable return address is not detected, and without any advance knowledge regarding any other execution mechanism the attacker is attempting to use, then in order to perform a disassembly and analyze the input stream for malicious code, instruction analyzer 405 needs to perform a disassembly within the suspicious data staring from every possible offset.”
Para.0052-0054 “Valid instructions increase the threat weight. Specific instructions (or set of instructions) that are likely to appear in the initialization code of a worm attack will greatly increase the threat weight. For example in Windows®, the attacker does not know in advance at which absolute address the attacking code will be executed. In order to determine the absolute address the attacker can load the address (EIP) in runtime into a register (EBP) using the following sequence (the known “call delta” technique)
1: CALL 2 // Perform a function call to the code that starts at an offset of 2 bytes, which “happens” to be the next instruction, pushing-the current address onto the stack.
2: POP EBP // Pop the stack into the EBP register.”
When the system knows where the malicious code starts, the process of Fig. 4 is possible; however, Fig. 5 discloses a process that considers every possible offset in the code. The system analyzes the instructions at every possible offset in order to determine the threat weight of the code starting from various offsets. For example, para.0042-0049 address cases in which the starting offset for the code simply does not make sense or is invalid. Para.0043: “For instance, invalid instruction (data that is not executable code) will decrease the threat weight by a given amount.” Afterwards, the analysis moves on to a different offset back in step 505 of Fig. 5 and continues.
Therefore, as Aharon considers every single possible offset in the code and includes provisions for starting points of the analysis being invalid, the random chunks of code as produced in Lospinuso would not thwart this process as argued. Rather, that situation is already accounted for in Aharon's offset analysis: an invalid starting point caused by an offset simply changes the threat weight, and the analysis moves on to the next offset until it hits a valid executable instruction. Therefore the process of Aharon is compatible with the random segmentation of code in Lospinuso, and is relied upon in the rejection.
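As an illustrative sketch only, the every-offset analysis of Aharon Fig. 5 can be outlined as below; the decoder and weighting callables are hypothetical placeholders, not Aharon's actual implementation, and branch splitting into multiple flows is omitted for brevity:

```python
def analyze_all_offsets(data: bytes, decode, weight) -> int:
    """Attempt disassembly from every possible offset, accumulating a
    threat weight per flow; an invalid instruction merely lowers the
    weight and analysis resumes at the next offset (cf. Aharon Fig. 5,
    steps 505-515, and para.0043)."""
    best = 0
    for offset in range(len(data)):
        w = 0
        i = offset
        while i < len(data):
            instr, size = decode(data, i)  # (mnemonic, length) or (None, 1)
            if instr is None:              # data that is not executable code
                w -= 1                     # decrease the threat weight
                break                      # move on to the next offset
            w += weight(instr)             # valid instructions increase the weight
            i += size
        best = max(best, w)
    return best
```

Because every offset is tried and invalid starting points merely adjust the weight, an arbitrary segment boundary behaves the same as any other invalid offset.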
[c] “There must be some rationale for combining references to establish a prima facie case of obviousness. MPEP 2143. The present rejection offers the rationale of "the expected benefit of safe handling of malicious code". Rejection, page 9.
This rationale does not apply to Aharon. Aharon's process is performed by a gateway that does not execute any code. The gateway only analyzes code, and if any threats are identified, the gateway blocks traffic, preventing malicious code from being passed on to any computers which could execute the code. Aharon, [0027]. Because the gateway does not execute the code, safe handling of malicious code would not motivate a person of skill in the art to substitute Losponuso's process for Aharon's process, especially since doing so would cause Aharon's process to fail as discussed above.
Nevertheless, in the interest of advancing prosecution, the independent claims are amended to indicate that the randomly selected is within a predetermined range. Support for this amendment can be found in at least [0073]. This amendment further distinguishes the cited references.”
In response to [c], examiner respectfully disagrees. Malware can be unpredictable, and even Aharon acknowledges that devices may be tricked into inadvertently executing malicious code. Para.0030: “A “well designed” return address will cause the application to jump to the attacker-supplied message, thus executing the attacker's code. Therefore the attacking message typically contains executable code and contains a mechanism that forces execution of the code.”
Lospinuso recognizes a similar issue and addresses the dangers that come with malicious code analysis, namely that the code may inadvertently execute. Col. 1 lines 47-60: “Malware executable objects (e.g., computer viruses and worms) are dangerous in networked environments due to the risk that they will inadvertently execute and compromise network nodes. This makes it both difficult and costly to support forensic investigation and to develop a comprehensive malware processing and analysis flow in a networked environment, as the nodes that receive malware must be quarantined from the network while analysts typically do most of their work on a separate network”
There is a clear benefit to incorporating Lospinuso's safety features into the malicious code analysis of Aharon, as malicious code is dangerous and unpredictable and may have ways to force execution, something that Aharon acknowledges can be true. Therefore, even though Aharon does not intentionally execute the code, viruses may be designed to trick devices into forcing execution, and it would have been obvious to incorporate the safety features of Lospinuso into the process of Aharon.
One of ordinary skill in the art would have been motivated to combine because of the expected benefit of completely defanging the executable code such that analysis can be performed with no risk of accidentally executing malicious code, as described in Lospinuso col. 7 lines 11-15, background col.1 lines 22-60.
[d] “Claims 8, 9 and 10 were rejected under 35 U.S.C. 103 as being unpatentable over Aharon in view of Lospinuso in view of Tepper further in view of Eijndhovern further in view of Choi et al. (US 2022/0067579). Applicants respectfully traverse the rejection. Choi does not remedy the deficiencies of Aharon and Lopinuso discussed above.”
In response to [d], Choi is not relied upon for the rejection of the independent claims; therefore, this argument does not apply.
Claim Interpretation
This application includes one or more claim limitations that do not use the word “means,” but are nonetheless being interpreted under 35 U.S.C. 112(f) or pre-AIA 35 U.S.C. 112, sixth paragraph, because the claim limitation(s) uses a generic placeholder that is coupled with functional language without reciting sufficient structure to perform the recited function and the generic placeholder is not preceded by a structural modifier. Such claim limitation(s) is/are:
“a collector configured to generate an instruction code sequence by converting an input file, for which a malicious code is to be detected, into an assembly code and to generate a plurality of segment instruction code sequences by segmenting the instruction code sequence by a randomly selected length within a predetermined range;
an output unit configured to embed the instruction code sequence by using a prelearned assembly language model for instruction code embedding and to output an embedding result of the instruction code sequence; and a detector configured to detect whether or not the input file is a malicious code, by using a prelearned malicious code classification model with the embedding result as an input.” In Claim 14.
“a converter configured to generate an indexed instruction code sequence corresponding to the instruction code sequence by using an instruction code dictionary for indexing an instruction code by an integer and by indexing an instruction code in the instruction code sequence by an integer,” In Claim 14.
“wherein the collector is further configured to generate each of the plurality of segment instruction code sequences as an individual file” in Claim 14.
“wherein the collector is further configured to: extract an instruction from the assembly code, generate an instruction code by combining an opcode and an operand of the extracted instruction, and generate the instruction code sequence by using the instruction code.” In Claim 14.
“wherein the output unit is further configured to output an embedding result of the indexed instruction code sequence by embedding the indexed instruction code sequence.” In Claim 15.
The above elements correspond to a processor of a device in Fig. 15-16 and pg 26 line 20-pg 29 line 18, “In other words, it may be a hardware/software configuration playing a controlling role for controlling the above-described device 1600. In addition, the processor 1603 may be performed by modularizing the functions of the collector 1510, the converter 1520, the output unit 1530 and the detector 1540 of FIG. 15.”
Because this/these claim limitation(s) is/are being interpreted under 35 U.S.C. 112(f) or pre-AIA 35 U.S.C. 112, sixth paragraph, it/they is/are being interpreted to cover the corresponding structure described in the specification as performing the claimed function, and equivalents thereof.
If applicant does not intend to have this/these limitation(s) interpreted under 35 U.S.C. 112(f) or pre-AIA 35 U.S.C. 112, sixth paragraph, applicant may: (1) amend the claim limitation(s) to avoid it/them being interpreted under 35 U.S.C. 112(f) or pre-AIA 35 U.S.C. 112, sixth paragraph (e.g., by reciting sufficient structure to perform the claimed function); or (2) present a sufficient showing that the claim limitation(s) recite(s) sufficient structure to perform the claimed function so as to avoid it/them being interpreted under 35 U.S.C. 112(f) or pre-AIA 35 U.S.C. 112, sixth paragraph.
Claim Rejections - 35 USC § 103
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.
Claim(s) 1-2, 6-7, 14-15 are rejected under 35 U.S.C. 103 as being unpatentable over Aharon et al. (hereinafter Aharon, US 2007/0089171 A1) in view of Lospinuso et al. (hereinafter Lospinuso, US 10,291,647 B2) in view of Tepper et al. (hereinafter Tepper, US 2020/0326934 A1) further in view of Eijndhovern et al. (hereinafter Eij, US 9,081,928 B2).
Regarding Claim 1 Aharon discloses A method for detecting a malicious code, the method comprising: generating an instruction code sequence by converting an input file, for which a malicious code is to be detected, into an assembly code (Aharon: para.0016 “According to the present invention there is provided an apparatus for detecting malicious code in a stream of data traffic input to a gateway to data network, the apparatus including (a) a filter apparatus which filters and thereby detects suspicious data in the stream of data traffic; (b) a disassembler attempting to convert binary operation codes of the suspicious data into assembly instructions, thereby attempting to produce disassembled code” an instruction code sequence is generated by converting the suspicious data from the stream of data traffic into assembly instructions) and
detecting whether or not the input file is a malicious code, by using a prelearned malicious code classification model (Aharon: para.0039 “Subsequent to disassembly, an instruction analyzer 405 is used to determine if the code is executable code and malicious.” The instruction analyzer takes the assembly instructions and determines if the code is malicious. The analyzer is already programmed to detect malicious code, as in Figs. 4 and 5 (steps 509-513), and is therefore prelearned.), and
wherein the generating of the instruction code sequence extracts an instruction from the assembly code (Aharon: para.0016 “According to the present invention there is provided an apparatus for detecting malicious code in a stream of data traffic input to a gateway to data network, the apparatus including (a) a filter apparatus which filters and thereby detects suspicious data in the stream of data traffic; (b) a disassembler attempting to convert binary operation codes of the suspicious data into assembly instructions, thereby attempting to produce disassembled code” an instruction code sequence is generated by converting the suspicious data from the stream of data traffic into assembly instructions. The resulting instruction is seen in Fig. 4 XOR EAX, EAX.),
generates an instruction code by combining an opcode and an operand of the extracted instruction (Aharon: Fig. 4, para.0036-0037 “[0036] The data as shown above is disassembled by disassembler 403 to the following instructions: [0037] XOR EAX,EAX” the instruction code is obtained from the assembly code generated by the disassembler, which comprises operands and opcodes; in this case XOR is the opcode, and EAX, EAX are both operands.), and
generates the instruction code sequence by using the instruction code (Aharon: para.0041 “The instruction is incremented (step 509) and each instruction is disassembled (step 507) until a branch (or conditional branch) instruction is reached (step 511). For each instruction, between the chosen offset and a branch instruction a threat weight is thus calculated and accumulated (step 515). Another offset is then chosen (step 505), instructions are disassembled (step 507) and incremented (step 509) and a threat weight is accumulated and added to the accumulated value up to the branch point (step 515). The input stream including executable code is analyzed by dividing the executable code into "flows" including all instructions between the chosen offset (or a first branch instruction) and a subsequent branch instruction. Every time a conditional jump instruction is reached (step 511) the conditional branch instruction is disassembled (step 507) the flow is split into two flows in branch options 513a and 513b, each branch continuing in a different execution path. As a result the flows are linked into "spiders" containing a list (or tree) of flows. For each flow, a threat weight is maintained in memory and accumulated (step 515) and as the flow progresses its threat-weight is updated.” The set of instruction of the data stream is the instruction code sequence, shown here in at least the analysis in Fig. 5, each branch can also be considered a sequence, comprising instructions such as that of para.0037.).
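The opcode-and-operand combination mapped above can be illustrated with a minimal sketch; the token format with an underscore separator is an assumption for illustration, not taken from Aharon:

```python
def to_instruction_codes(assembly_lines: list[str]) -> list[str]:
    """Combine the opcode and operands of each disassembled instruction
    into a single instruction-code token, yielding the sequence."""
    codes = []
    for line in assembly_lines:
        parts = line.split(None, 1)  # "XOR EAX,EAX" -> ["XOR", "EAX,EAX"]
        opcode = parts[0]
        operand = parts[1].replace(" ", "") if len(parts) > 1 else ""
        codes.append(f"{opcode}_{operand}" if operand else opcode)
    return codes
```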
However Aharon does not explicitly disclose generating a plurality of segment instruction code sequences by segmenting the instruction code sequence by a randomly selected length within a predetermined range; embedding the instruction code sequence including the plurality of segment instruction code sequences by using a prelearned assembly language model for instruction code embedding and outputting an embedding result of the instruction code sequence; generating an indexed instruction code sequence corresponding to the instruction code sequence by using an instruction code dictionary for indexing an instruction code by an integer and by indexing an instruction code in the instruction code sequence by an integer; and detecting whether or not the input file is a malicious code, by using a prelearned malicious code classification model with the embedding result as an input, wherein the generating of the instruction code sequence generates each of the plurality of segment instruction code sequences as an individual file.
Lospinuso discloses generating a plurality of segment instruction code sequences by segmenting the instruction code sequence by a randomly selected length within a predetermined range (Lospinuso: col. 7 lines 11-15 “An example embodiment of the invention will now be described with reference to FIG. 2. FIG. 2 shows certain elements of an apparatus for provision of enabling safe handling of executable binaries (or other content that can carry malicious code) according to an example embodiment.” col. 9 lines 10-21 “The division module 180 manager may include tools to facilitate segmentation of executable objects according to the policy provided by the security policy engine 80. Thus, the division module 180 may be configured to take each executable object apart in chunks and remove the execution capability from each part by removing portions of the header that are needed to launch execution. The chunks may be of predetermined chunk sizes or random chunk sizes in various alternatives. In some cases, the header may be inspected to determine which parts are identified in the header, and division may be accomplished according to the parts identified in the header.” Claim 1 “divide the executable malware object into a plurality of malware segments, wherein the plurality of malware segments are divided based on a predetermined size or a random size” Malicious files can be segmented into randomly selected sizes for safe handling of the code. The random sizes fall within a predetermined range because a chunk cannot be larger than the code itself; the range is therefore bounded below by at least one bit and above by the total size of the executable code.);
wherein the generating of the instruction code sequence generates each of the plurality of segment instruction code sequences as an individual file (Lospinuso: “In this regard, a method for rendering malware files safe for handling according to one embodiment of the invention, as shown in FIG. 3, may include receiving an executable object at operation 300, and dividing the executable object into a plurality of segments or pieces at operation 310” the executable object is a malware file, and dividing the malware file into segments or pieces in operation 310 generates each segment instruction code sequence as an individual file.).
Therefore it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to combine Aharon with Lospinuso in order to incorporate generating a plurality of segment instruction code sequences by segmenting the instruction code sequence by a randomly selected length within a predetermined range, wherein the generating of the instruction code sequence generates each of the plurality of segment instruction code sequences as an individual file, and apply this concept of handling malicious code to the potentially malicious code of Aharon.
One of ordinary skill in the art would have been motivated to combine because of the expected benefit of safe handling of malicious code (Lospinuso: col. 7 lines 11-15, background col.1 lines 22-32).
However Aharon-Lospinuso does not explicitly disclose embedding the instruction code sequence including the plurality of segment instruction code sequences by using a prelearned assembly language model for instruction code embedding and outputting an embedding result of the instruction code sequence; generating an indexed instruction code sequence corresponding to the instruction code sequence by using an instruction code dictionary for indexing an instruction code by an integer and by indexing an instruction code in the instruction code sequence by an integer; and detecting whether or not the input file is a malicious code, by using a prelearned malicious code classification model with the embedding result as an input.
Tepper discloses embedding the instruction code sequence by using a prelearned language model for instruction code embedding and outputting an embedding result of the instruction code sequence (Tepper: para.0016 “The system 100 may receive source code 102 and compile the source code into intermediate representation (IR) code 104.” para.0020 “A set of graph embedding vectors (such as embedding vectors 108 in FIG. 1, already discussed) may be computed, one vector for each intermediate representation (IR) code instruction (e.g., each LLVM-IR instruction) of the input program (i.e., IR code 104 in FIG. 1, already discussed)…. Generation of embedding vectors, however, are part of the trainable elements in the neural network (trainable through backpropagation). Since training involves using dependence graphs (e.g., PDG), after training, the values of these embedding vectors will be influenced by the collection of the PDG graphs seen during training. Once trained, the neural network will be tuned to the selected task.” The input file is the source code 102, and converted into an instruction code sequence. In step 108 in Fig. 1 this is then embedded as vectors for each line of the Ir code.); and
detecting whether or not the input file is a malicious code, by using a prelearned malicious code classification model with the embedding result as an input (Tepper: Fig. 1, para.0017 “Through learning algorithm(s) 112, the graph attention neural network 110 may be trained to handle tasks such as software analysis (label 114), in which the system automatically extracts information about the software (such as, e.g., classifying as malicious code), or software enhancement (label 116), in which the system modifies the LLVM-IR bitcode to improve the software runtime performance. Examples of software analysis may include a software classification analysis, a thread coarsening analysis, or a heterogeneous scheduling analysis. Examples of software enhancement may include program modifications to improve performance via loop vectoring and/or optimization pass ordering.” The embedded vector graph form of the code is then input into the universal graph network, with in combination with learning algorithm 112 and software analysis 114, the input file is classified as malicious or not.).
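A toy sketch of the embed-then-classify pipeline attributed to Tepper follows; the byte-sum embedding and linear classifier are stand-ins for a trained neural network and are not Tepper's actual model:

```python
def embed_sequence(codes: list[str], dim: int = 8) -> list[float]:
    """Stand-in 'prelearned' embedding: fold each instruction code's
    bytes into a fixed-size vector and average over the sequence."""
    vec = [0.0] * dim
    for code in codes:
        for j, byte in enumerate(code.encode()):
            vec[j % dim] += byte
    n = max(len(codes), 1)
    return [v / n for v in vec]

def classify_malicious(embedding: list[float], weights: list[float], bias: float = 0.0) -> bool:
    """Stand-in linear classifier taking the embedding result as input:
    flags the input as malicious when the weighted score exceeds zero."""
    score = sum(w * x for w, x in zip(weights, embedding)) + bias
    return score > 0
```

The point of the sketch is the data flow: the instruction code sequence is embedded by a pretrained model, and the embedding result alone is the input to the classification model.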
Therefore it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to combine Aharon with Tepper in order to incorporate embedding the instruction code sequence by using a prelearned language model for instruction code embedding and outputting an embedding result of the instruction code sequence; and detecting whether or not the input file is a malicious code, by using a prelearned malicious code classification model with the embedding result as an input, applying the teachings of Tepper to the assembly language operation of Aharon while using the defanged instruction code sequence, including the plurality of segment instruction code sequences, of Lospinuso.
One of ordinary skill in the art would have been motivated to combine because of the expected benefit of improved accuracy that comes with incorporating a machine learning model (Tepper: para.0017).
However Aharon-Lospinuso-Tepper does not explicitly disclose generating an indexed instruction code sequence corresponding to the instruction code sequence by using an instruction code dictionary for indexing an instruction code by an integer and by indexing an instruction code in the instruction code sequence by an integer.
Eij discloses generating an indexed instruction code sequence corresponding to the instruction code sequence by using an instruction code dictionary for indexing an instruction code by an integer and by indexing an instruction code in the instruction code sequence by an integer (Eij: col.10 lines 25-65 “In addition to the information already present in the assembly instructions 1120, the object code format 1140 includes the following information: each source code instruction has been assigned a sequence number that is unique to the assembly function 1120 that it appears in; … This is usually necessary because, depending on the assembly language used, not all operands in the assembly instructions 1120 carry a datatype but for proper operation of the transform 2000 and build 4000 steps it is desirable that the datatypes of all values in the CDFG are known. The nodes in the CDFG are marked with the numbers of the corresponding instructions in the object code section, Such that the relationship between executed operations in the object code section and the nodes in the CDFG can be established in the analysis step 1200….All functions in the annotated executable 1158 are assigned a sequence number that is unique to the executable 1158. As a result, functions can be referred to by number instead of by name which is an advantage if the source program 996 is written in a language like C where function names are not necessarily unique in a program; the function CDFGs present in the object code 1140 are combined to form the overall program CDFG 1157 that represents the static structure of program 996.” The system links each instruction to a number, and each function can be called via the number. Either the logical mapping the system generates between integer and program function and/or the binary libraries 998 in fig. 9a-b, are instruction code dictionaries used for this process.).
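The integer indexing via an instruction code dictionary can be sketched minimally as below; building the dictionary on the fly is an assumption for illustration, whereas Eij assigns sequence numbers during compilation:

```python
def index_sequence(codes: list[str], dictionary=None):
    """Index each instruction code in the sequence by an integer using
    an instruction code dictionary, extending the dictionary when a
    previously unseen code is encountered."""
    if dictionary is None:
        dictionary = {}
    indexed = []
    for code in codes:
        if code not in dictionary:
            dictionary[code] = len(dictionary)  # assign the next unused integer
        indexed.append(dictionary[code])
    return indexed, dictionary
```

Repeated instruction codes map to the same integer, so the indexed sequence corresponds one-to-one with the original instruction code sequence.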
Therefore it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to combine Aharon-Lospinuso-Tepper with Eij in order to incorporate generating an indexed instruction code sequence corresponding to the instruction code sequence by using an instruction code dictionary for indexing an instruction code by an integer and by indexing an instruction code in the instruction code sequence by an integer.
One of ordinary skill in the art would have been motivated to combine because of the expected benefit of improved compiling and execution of code (Eij: col.10 lines 25-65).
Regarding Claim 2, Aharon-Lospinuso-Tepper-Eij discloses claim 1 as set forth above.
However Aharon-Lospinuso does not explicitly disclose wherein the outputting of the embedding result outputs an embedding result of the indexed instruction code sequence by embedding the indexed instruction code sequence.
Tepper further discloses wherein the outputting of the embedding result outputs an embedding result of the instruction code sequence by embedding the instruction code sequence (Tepper: para.0016 “The system 100 may receive source code 102 and compile the source code into intermediate representation (IR) code 104.” para.0020 “A set of graph embedding vectors (such as embedding vectors 108 in FIG. 1, already discussed) may be computed, one vector for each intermediate representation (IR) code instruction (e.g., each LLVM-IR instruction) of the input program (i.e., IR code 104 in FIG. 1, already discussed)…. Generation of embedding vectors, however, are part of the trainable elements in the neural network (trainable through backpropagation). Since training involves using dependence graphs (e.g., PDG), after training, the values of these embedding vectors will be influenced by the collection of the PDG graphs seen during training. Once trained, the neural network will be tuned to the selected task.” The input file is the source code 102, and converted into an instruction code sequence. In step 108 in Fig. 1 this is then embedded as vectors for each line of the Ir code. The software that performs this step is the output unit.).
Therefore it would have been obvious to one of ordinary skill in the art before the effective filing date to combine Aharon-Lospinuso with Tepper in order to incorporate wherein the outputting of the embedding result outputs an embedding result of the instruction code sequence by embedding the instruction code sequence.
One of ordinary skill in the art would have been motivated to combine because of the expected benefit of improved accuracy that comes with incorporating a machine learning model (Tepper: para.0017).
However Aharon-Lospinuso-Tepper does not explicitly disclose wherein the outputting of the embedding result outputs an embedding result of the indexed instruction code sequence by embedding the indexed instruction code sequence.
Eij discloses the indexed instruction code (Eij: col.10 lines 25-65 “In addition to the information already present in the assembly instructions 1120, the object code format 1140 includes the following information: each source code instruction has been assigned a sequence number that is unique to the assembly function 1120 that it appears in; … This is usually necessary because, depending on the assembly language used, not all operands in the assembly instructions 1120 carry a datatype but for proper operation of the transform 2000 and build 4000 steps it is desirable that the datatypes of all values in the CDFG are known. The nodes in the CDFG are marked with the numbers of the corresponding instructions in the object code section, Such that the relationship between executed operations in the object code section and the nodes in the CDFG can be established in the analysis step 1200….All functions in the annotated executable 1158 are assigned a sequence number that is unique to the executable 1158. As a result, functions can be referred to by number instead of by name which is an advantage if the source program 996 is written in a language like C where function names are not necessarily unique in a program; the function CDFGs present in the object code 1140 are combined to form the overall program CDFG 1157 that represents the static structure of program 996.” The system links each instruction to a number, and each function can be called via its number. The logical mapping that the system generates between integers and program functions, and/or the binary libraries 998 in Figs. 9a-9b, serve as the instruction code dictionary used for this process.).
Therefore it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to combine Aharon-Lospinuso-Tepper with Eij in order to incorporate the indexed instruction code, such that the learning process is performed on an integer-represented code in Tepper, i.e., the indexed instruction code.
One of ordinary skill in the art would have been motivated to combine because of the expected benefit of improved compiling and execution of code (Eij: col.10 lines 25-65).
Regarding Claim 6, Aharon discloses A method for detecting a malicious code, the method comprising: generating an instruction code sequence by converting each of a plurality of execution files into an assembly code (Aharon: para.0016 “According to the present invention there is provided an apparatus for detecting malicious code in a stream of data traffic input to a gateway to data network, the apparatus including (a) a filter apparatus which filters and thereby detects suspicious data in the stream of data traffic; (b) a disassembler attempting to convert binary operation codes of the suspicious data into assembly instructions, thereby attempting to produce disassembled code” an instruction code sequence is generated by converting the suspicious data from the stream of data traffic into assembly instructions) and
a malicious code classification model for detecting a malicious code (Aharon: para.0039 “Subsequent to disassembly, an instruction analyzer 405 is used to determine if the code is executable code and malicious.” The instruction analyzer takes the assembly instructions and determines if the code is malicious. The analyzer is already programmed to detect malicious code, as shown in Figs. 4 and 5, steps 509-513.).
wherein the generating of the instruction code sequence extracts an instruction from the assembly code (Aharon: para.0016 “According to the present invention there is provided an apparatus for detecting malicious code in a stream of data traffic input to a gateway to data network, the apparatus including (a) a filter apparatus which filters and thereby detects suspicious data in the stream of data traffic; (b) a disassembler attempting to convert binary operation codes of the suspicious data into assembly instructions, thereby attempting to produce disassembled code” an instruction code sequence is generated by converting the suspicious data from the stream of data traffic into assembly instructions. The resulting instruction is seen in Fig. 4 (XOR EAX, EAX).),
generates an instruction code by combining an opcode and an operand of the extracted instruction (Aharon: Fig. 4, para.0036-0037 “[0036] The data as shown above is disassembled by disassembler 403 to the following instructions: [0037] XOR EAX,EAX” the instruction code is obtained from the assembly code generated by the disassembler, which comprises opcodes and operands; in this case, XOR is the opcode and EAX, EAX are both operands.), and
generates the instruction code sequence by using the instruction code (Aharon: para.0041 “The instruction is incremented (step 509) and each instruction is disassembled (step 507) until a branch (or conditional branch) instruction is reached (step 511). For each instruction, between the chosen offset and a branch instruction a threat weight is thus calculated and accumulated (step 515). Another offset is then chosen (step 505), instructions are disassembled (step 507) and incremented (step 509) and a threat weight is accumulated and added to the accumulated value up to the branch point (step 515). The input stream including executable code is analyzed by dividing the executable code into "flows" including all instructions between the chosen offset (or a first branch instruction) and a subsequent branch instruction. Every time a conditional jump instruction is reached (step 511) the conditional branch instruction is disassembled (step 507) the flow is split into two flows in branch options 513a and 513b, each branch continuing in a different execution path. As a result the flows are linked into "spiders" containing a list (or tree) of flows. For each flow, a threat weight is maintained in memory and accumulated (step 515) and as the flow progresses its threat-weight is updated.” The set of instructions in the data stream is the instruction code sequence, shown here in at least the analysis of Fig. 5; each branch can also be considered a sequence comprising instructions such as that of para.0037.).
However Aharon does not explicitly disclose generating a plurality of segment instruction code sequences by segmenting the instruction code sequence by a randomly selected length within a predetermined range; learning an assembly language model for instruction code embedding by using the instruction code sequence including the plurality of segment instruction code sequences; generating an indexed instruction code sequence corresponding to the instruction code sequence by using an instruction code dictionary for indexing an instruction code by an integer and by indexing an instruction code in the instruction code sequence by an integer; and learning a malicious code classification model for detecting a malicious code based on the learned assembly language model, wherein the generating of the instruction code sequence generates each of the plurality of segment instruction code sequences as an individual file.
Lospinuso discloses generating a plurality of segment instruction code sequences by segmenting the instruction code sequence by a randomly selected length within a predetermined range (Lospinuso: col. 7 lines 11-15 “An example embodiment of the invention will now be described with reference to FIG. 2. FIG. 2 shows certain elements of an apparatus for provision of enabling safe handling of executable binaries (or other content that can carry malicious code) according to an example embodiment.” col. 9 lines 10-21 “The division module 180 manager may include tools to facilitate segmentation of executable objects according to the policy provided by the security policy engine 80. Thus, the division module 180 may be configured to take each executable object apart in chunks and remove the execution capability from each part by removing portions of the header that are needed to launch execution. The chunks may be of predetermined chunk sizes or random chunk sizes in various alternatives. In some cases, the header may be inspected to determine which parts are identified in the header, and division may be accomplished according to the parts identified in the header.” Claim 1 “divide the executable malware object into a plurality of malware segments, wherein the plurality of malware segments are divided based on a predetermined size or a random size” Malicious files can be segmented into randomly selected sizes for safe handling of the code. The random sizes are within a predetermined range because the chunks cannot be larger than the total size of the code; the range therefore spans from at least one bit to at most the total size of the executable code.);
wherein the generating of the instruction code sequence generates each of the plurality of segment instruction code sequences as an individual file (Lospinuso: “In this regard, a method for rendering malware files safe for handling according to one embodiment of the invention, as shown in FIG. 3, may include receiving an executable object at operation 300, and dividing the executable object into a plurality of segments or pieces at operation 310” the executable object is a malware file, and dividing the malware file into segments or pieces in operation 310 generates each segment instruction code sequence as an individual file.).
Therefore it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to combine Aharon with Lospinuso in order to incorporate generating a plurality of segment instruction code sequences by segmenting the instruction code sequence by a randomly selected length within a predetermined range, wherein the generating of the instruction code sequence generates each of the plurality of segment instruction code sequences as an individual file, and apply this concept of handling malicious code to the potentially malicious code of Aharon.
One of ordinary skill in the art would have been motivated to combine because of the expected benefit of safe handling of malicious code (Lospinuso: col. 7 lines 11-15, background col.1 lines 22-32).
However Aharon-Lospinuso does not explicitly disclose learning an assembly language model for instruction code embedding by using the instruction code sequence including the plurality of segment instruction code sequences; generating an indexed instruction code sequence corresponding to the instruction code sequence by using an instruction code dictionary for indexing an instruction code by an integer and by indexing an instruction code in the instruction code sequence by an integer; and learning a malicious code classification model for detecting a malicious code based on the learned assembly language model,
Tepper discloses learning a language model for instruction code embedding by using the instruction code sequence (Tepper: para.0016 “The system 100 may receive source code 102 and compile the source code into intermediate representation (IR) code 104.” para.0020 “A set of graph embedding vectors (such as embedding vectors 108 in FIG. 1, already discussed) may be computed, one vector for each intermediate representation (IR) code instruction (e.g., each LLVM-IR instruction) of the input program (i.e., IR code 104 in FIG. 1, already discussed)…. Generation of embedding vectors, however, are part of the trainable elements in the neural network (trainable through backpropagation). Since training involves using dependence graphs (e.g., PDG), after training, the values of these embedding vectors will be influenced by the collection of the PDG graphs seen during training. Once trained, the neural network will be tuned to the selected task.” The input file is the source code 102, which is converted into an instruction code sequence. In step 108 of Fig. 1, this sequence is then embedded as vectors, one for each line of the IR code. This is a process learned by a neural network.); and
learning a malicious code classification model for detecting a malicious code based on the learned language model (Tepper: Fig. 1, para.0017 “Through learning algorithm(s) 112, the graph attention neural network 110 may be trained to handle tasks such as software analysis (label 114), in which the system automatically extracts information about the software (such as, e.g., classifying as malicious code), or software enhancement (label 116), in which the system modifies the LLVM-IR bitcode to improve the software runtime performance. Examples of software analysis may include a software classification analysis, a thread coarsening analysis, or a heterogeneous scheduling analysis. Examples of software enhancement may include program modifications to improve performance via loop vectoring and/or optimization pass ordering.” The embedded vector graph form of the code is then input into the graph attention neural network 110; in combination with learning algorithm 112 and software analysis 114, the input file is classified as malicious or not. This is also a learned process.).
Therefore it would have been obvious to one of ordinary skill in the art before the effective filing date to combine Aharon with Tepper in order to incorporate learning a language model for instruction code embedding by using the instruction code sequence and learning a malicious code classification model for detecting a malicious code based on the learned assembly language model, and to apply the ideas of Tepper to the assembly language operation in Aharon, using the safe version of the instruction code sequence, including the plurality of segment instruction code sequences, of Lospinuso.
One of ordinary skill in the art would have been motivated to combine because of the expected benefit of improved accuracy that comes with incorporating a machine learning model (Tepper: para.0017).
However Aharon-Lospinuso-Tepper does not explicitly disclose generating an indexed instruction code sequence corresponding to the instruction code sequence by using an instruction code dictionary for indexing an instruction code by an integer and by indexing an instruction code in the instruction code sequence by an integer.
Eij discloses generating an indexed instruction code sequence corresponding to the instruction code sequence by using an instruction code dictionary for indexing an instruction code by an integer and by indexing an instruction code in the instruction code sequence by an integer (Eij: col.10 lines 25-65 “In addition to the information already present in the assembly instructions 1120, the object code format 1140 includes the following information: each source code instruction has been assigned a sequence number that is unique to the assembly function 1120 that it appears in; … This is usually necessary because, depending on the assembly language used, not all operands in the assembly instructions 1120 carry a datatype but for proper operation of the transform 2000 and build 4000 steps it is desirable that the datatypes of all values in the CDFG are known. The nodes in the CDFG are marked with the numbers of the corresponding instructions in the object code section, Such that the relationship between executed operations in the object code section and the nodes in the CDFG can be established in the analysis step 1200….All functions in the annotated executable 1158 are assigned a sequence number that is unique to the executable 1158. As a result, functions can be referred to by number instead of by name which is an advantage if the source program 996 is written in a language like C where function names are not necessarily unique in a program; the function CDFGs present in the object code 1140 are combined to form the overall program CDFG 1157 that represents the static structure of program 996.” The system links each instruction to a number, and each function can be called via its number. The logical mapping that the system generates between integers and program functions, and/or the binary libraries 998 in Figs. 9a-9b, serve as the instruction code dictionary used for this process.).
Therefore it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to combine Aharon-Lospinuso-Tepper with Eij in order to incorporate generating an indexed instruction code sequence corresponding to the instruction code sequence by using an instruction code dictionary for indexing an instruction code by an integer and by indexing an instruction code in the instruction code sequence by an integer.
One of ordinary skill in the art would have been motivated to combine because of the expected benefit of improved compiling and execution of code (Eij: col.10 lines 25-65).
Regarding Claim 7, Aharon-Lospinuso-Tepper-Eij discloses claim 6 as set forth above.
However Aharon-Lospinuso does not explicitly disclose wherein the learning of the assembly language model learns the assembly language model by using the indexed instruction code sequence.
Tepper further discloses wherein the learning of the assembly language model learns the assembly language model by using the instruction code sequence (Tepper: para.0016 “The system 100 may receive source code 102 and compile the source code into intermediate representation (IR) code 104.” para.0020 “A set of graph embedding vectors (such as embedding vectors 108 in FIG. 1, already discussed) may be computed, one vector for each intermediate representation (IR) code instruction (e.g., each LLVM-IR instruction) of the input program (i.e., IR code 104 in FIG. 1, already discussed)…. Generation of embedding vectors, however, are part of the trainable elements in the neural network (trainable through backpropagation). Since training involves using dependence graphs (e.g., PDG), after training, the values of these embedding vectors will be influenced by the collection of the PDG graphs seen during training. Once trained, the neural network will be tuned to the selected task.” The input file is the source code 102, which is converted into an instruction code sequence. In step 108 of Fig. 1, this sequence is then embedded as vectors, one for each line of the IR code. The generation of these embedding vectors is part of the trainable elements of the neural network, i.e., a learned process.).
Therefore it would have been obvious to one of ordinary skill in the art before the effective filing date to combine Aharon-Lospinuso with Tepper in order to incorporate wherein the learning of the assembly language model learns the assembly language model by using the instruction code sequence.
One of ordinary skill in the art would have been motivated to combine because of the expected benefit of improved accuracy that comes with incorporating a machine learning model (Tepper: para.0017).
However Aharon-Lospinuso-Tepper does not explicitly disclose wherein the learning of the assembly language model learns the assembly language model by using the indexed instruction code sequence.
Eij discloses the indexed instruction code (Eij: col.10 lines 25-65 “In addition to the information already present in the assembly instructions 1120, the object code format 1140 includes the following information: each source code instruction has been assigned a sequence number that is unique to the assembly function 1120 that it appears in; … This is usually necessary because, depending on the assembly language used, not all operands in the assembly instructions 1120 carry a datatype but for proper operation of the transform 2000 and build 4000 steps it is desirable that the datatypes of all values in the CDFG are known. The nodes in the CDFG are marked with the numbers of the corresponding instructions in the object code section, Such that the relationship between executed operations in the object code section and the nodes in the CDFG can be established in the analysis step 1200….All functions in the annotated executable 1158 are assigned a sequence number that is unique to the executable 1158. As a result, functions can be referred to by number instead of by name which is an advantage if the source program 996 is written in a language like C where function names are not necessarily unique in a program; the function CDFGs present in the object code 1140 are combined to form the overall program CDFG 1157 that represents the static structure of program 996.” The system links each instruction to a number, and each function can be called via its number. The logical mapping that the system generates between integers and program functions, and/or the binary libraries 998 in Figs. 9a-9b, serve as the instruction code dictionary used for this process.).
Therefore it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to combine Aharon-Lospinuso-Tepper with Eij in order to incorporate generating an indexed instruction code sequence such that the learning process is done by an integer-represented code in Tepper.
One of ordinary skill in the art would have been motivated to combine because of the expected benefit of improved compiling and execution of code (Eij: col.10 lines 25-65).
Regarding Claim 14, Aharon discloses An apparatus for detecting a malicious code (Aharon: para.0012 “According to the present invention there is provided a method for detecting malicious code in a stream of data traffic input to a gateway of a data network, the method includes monitoring by the gateway for a suspicious portion of data in the stream of data traffic.” The gateway detects malicious code), the apparatus comprising:
a collector configured to generate an instruction code sequence by converting an input file, for which a malicious code is to be detected, into an assembly code (Aharon: para.0016 “According to the present invention there is provided an apparatus for detecting malicious code in a stream of data traffic input to a gateway to data network, the apparatus including (a) a filter apparatus which filters and thereby detects suspicious data in the stream of data traffic; (b) a disassembler attempting to convert binary operation codes of the suspicious data into assembly instructions, thereby attempting to produce disassembled code” an instruction code sequence is generated by converting the suspicious data from the stream of data traffic into assembly instructions by the disassembler 403 in Fig. 4.) and
a detector configured to detect whether or not the input file is a malicious code, by using a prelearned malicious code classification model (Aharon: para.0039 “Subsequent to disassembly, an instruction analyzer 405 is used to determine if the code is executable code and malicious.” The instruction analyzer takes the assembly instructions and determines if the code is malicious. The analyzer is already programmed to detect malicious code, as shown in Figs. 4 and 5, steps 509-513, and is therefore prelearned.),
wherein the collector is further configured to: extract an instruction from the assembly code (Aharon: para.0016 “According to the present invention there is provided an apparatus for detecting malicious code in a stream of data traffic input to a gateway to data network, the apparatus including (a) a filter apparatus which filters and thereby detects suspicious data in the stream of data traffic; (b) a disassembler attempting to convert binary operation codes of the suspicious data into assembly instructions, thereby attempting to produce disassembled code” an instruction code sequence is generated by converting the suspicious data from the stream of data traffic into assembly instructions. The resulting instruction is seen in Fig. 4 (XOR EAX, EAX).),
generate an instruction code by combining an opcode and an operand of the extracted instruction (Aharon: Fig. 4, para.0036-0037 “[0036] The data as shown above is disassembled by disassembler 403 to the following instructions: [0037] XOR EAX,EAX” the instruction code is obtained from the assembly code generated by the disassembler, which comprises opcodes and operands; in this case, XOR is the opcode and EAX, EAX are both operands.), and
generate the instruction code sequence by using the instruction code (Aharon: para.0041 “The instruction is incremented (step 509) and each instruction is disassembled (step 507) until a branch (or conditional branch) instruction is reached (step 511). For each instruction, between the chosen offset and a branch instruction a threat weight is thus calculated and accumulated (step 515). Another offset is then chosen (step 505), instructions are disassembled (step 507) and incremented (step 509) and a threat weight is accumulated and added to the accumulated value up to the branch point (step 515). The input stream including executable code is analyzed by dividing the executable code into "flows" including all instructions between the chosen offset (or a first branch instruction) and a subsequent branch instruction. Every time a conditional jump instruction is reached (step 511) the conditional branch instruction is disassembled (step 507) the flow is split into two flows in branch options 513a and 513b, each branch continuing in a different execution path. As a result the flows are linked into "spiders" containing a list (or tree) of flows. For each flow, a threat weight is maintained in memory and accumulated (step 515) and as the flow progresses its threat-weight is updated.” The set of instructions in the data stream is the instruction code sequence, shown here in at least the analysis of Fig. 5; each branch can also be considered a sequence comprising instructions such as that of para.0037.).
However Aharon does not explicitly disclose to generate a plurality of segment instruction code sequences by segmenting the instruction code sequence by a randomly selected length within a predetermined range; an output unit configured to embed the instruction code sequence including the plurality of segment instruction code sequences by using a prelearned assembly language model for instruction code embedding and to output an embedding result of the instruction code sequence; a converter configured to generate an indexed instruction code sequence corresponding to the instruction code sequence by using an instruction code dictionary for indexing an instruction code by an integer and by indexing an instruction code in the instruction code sequence by an integer; and a detector configured to detect whether or not the input file is a malicious code, by using a prelearned malicious code classification model with the embedding result as an input, wherein the collector is further configured to generate each of the plurality of segment instruction code sequences as an individual file.
Lospinuso discloses to generate a plurality of segment instruction code sequences by segmenting the instruction code sequence by a randomly selected length within a predetermined range (Lospinuso: col. 7 lines 11-15 “An example embodiment of the invention will now be described with reference to FIG. 2. FIG. 2 shows certain elements of an apparatus for provision of enabling safe handling of executable binaries (or other content that can carry malicious code) according to an example embodiment.” col. 9 lines 10-21 “The division module 180 manager may include tools to facilitate segmentation of executable objects according to the policy provided by the security policy engine 80. Thus, the division module 180 may be configured to take each executable object apart in chunks and remove the execution capability from each part by removing portions of the header that are needed to launch execution. The chunks may be of predetermined chunk sizes or random chunk sizes in various alternatives. In some cases, the header may be inspected to determine which parts are identified in the header, and division may be accomplished according to the parts identified in the header.” Claim 1 “divide the executable malware object into a plurality of malware segments, wherein the plurality of malware segments are divided based on a predetermined size or a random size” Malicious files can be segmented into randomly selected sizes for safe handling of the code. The random sizes are within a predetermined range because the chunks cannot be larger than the total size of the code; the range therefore spans from at least one bit to at most the total size of the executable code.)
wherein the collector is further configured to generate each of the plurality of segment instruction code sequences as an individual file (Lospinuso: “In this regard, a method for rendering malware files safe for handling according to one embodiment of the invention, as shown in FIG. 3, may include receiving an executable object at operation 300, and dividing the executable object into a plurality of segments or pieces at operation 310” the executable object is a malware file, and dividing the malware file into segments or pieces in operation 310 generates each segment instruction code sequence as an individual file.).
Therefore it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to combine Aharon with Lospinuso in order to incorporate generating a plurality of segment instruction code sequences by segmenting the instruction code sequence by a randomly selected length within a predetermined range, wherein the collector is further configured to generate each of the plurality of segment instruction code sequences as an individual file, and apply this concept of handling malicious code to the potentially malicious code of Aharon.
One of ordinary skill in the art would have been motivated to combine because of the expected benefit of safe handling of malicious code (Lospinuso: col. 7 lines 11-15, background col.1 lines 22-32).
However Aharon-Lospinuso does not explicitly disclose an output unit configured to embed the instruction code sequence including the plurality of segment instruction code sequences by using a prelearned assembly language model for instruction code embedding and to output an embedding result of the instruction code sequence; a converter configured to generate an indexed instruction code sequence corresponding to the instruction code sequence by using an instruction code dictionary for indexing an instruction code by an integer and by indexing an instruction code in the instruction code sequence by an integer; and a detector configured to detect whether or not the input file is a malicious code, by using a prelearned malicious code classification model with the embedding result as an input,
Tepper discloses an output unit configured to embed the instruction code sequence by using a prelearned language model for instruction code embedding and to output an embedding result of the instruction code sequence (Tepper: para.0016 “The system 100 may receive source code 102 and compile the source code into intermediate representation (IR) code 104.” para.0020 “A set of graph embedding vectors (such as embedding vectors 108 in FIG. 1, already discussed) may be computed, one vector for each intermediate representation (IR) code instruction (e.g., each LLVM-IR instruction) of the input program (i.e., IR code 104 in FIG. 1, already discussed)…. Generation of embedding vectors, however, are part of the trainable elements in the neural network (trainable through backpropagation). Since training involves using dependence graphs (e.g., PDG), after training, the values of these embedding vectors will be influenced by the collection of the PDG graphs seen during training. Once trained, the neural network will be tuned to the selected task.” The input file is the source code 102, which is converted into an instruction code sequence. In step 108 of Fig. 1, this sequence is then embedded as vectors, one for each line of the IR code. The software that performs this step is the output unit.); and
a detector configured to detect whether or not the input file is a malicious code, by using a prelearned malicious code classification model with the embedding result as an input (Tepper: Fig. 1, para.0017 “Through learning algorithm(s) 112, the graph attention neural network 110 may be trained to handle tasks such as software analysis (label 114), in which the system automatically extracts information about the software (such as, e.g., classifying as malicious code), or software enhancement (label 116), in which the system modifies the LLVM-IR bitcode to improve the software runtime performance. Examples of software analysis may include a software classification analysis, a thread coarsening analysis, or a heterogeneous scheduling analysis. Examples of software enhancement may include program modifications to improve performance via loop vectoring and/or optimization pass ordering.” The embedded vector graph form of the code is then input into the graph attention neural network 110, which, in combination with learning algorithm 112 and software analysis 114, classifies the input file as malicious or not.).
Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date to combine Aharon-Lospinuso with Tepper in order to incorporate an output unit configured to embed the instruction code sequence by using a prelearned language model for instruction code embedding and to output an embedding result of the instruction code sequence, and a detector configured to detect whether or not the input file is a malicious code, by using a prelearned malicious code classification model with the embedding result as an input, and to apply the ideas of Tepper to the assembly language operation in Aharon, using the safe version of the instruction code sequence including the plurality of segment instruction code sequences in Lospinuso.
One of ordinary skill in the art would have been motivated to combine because of the expected benefit of improved accuracy that comes with incorporating a machine learning model (Tepper: para.0017).
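For reference, the mapping asserted above — each instruction in a sequence converted into a fixed-length vector, the collection of which forms the "embedding result" — can be illustrated with the following minimal sketch. This is a toy, hash-based stand-in written solely to illustrate the claim mapping; all function names and the embedding scheme itself are hypothetical and are not taken from Tepper, whose embedding vectors are learned through backpropagation.

```python
# Illustrative sketch only: a toy per-instruction embedding, loosely analogous
# to the per-IR-instruction embedding vectors of Tepper's Fig. 1 (element 108).
# The hash-style embedding below is hypothetical, not the reference's method.

def embed_instruction(instr, dim=4):
    """Map one instruction string to a fixed-length vector (toy scheme)."""
    vec = [0.0] * dim
    for i, ch in enumerate(instr):
        vec[i % dim] += ord(ch) / 100.0  # accumulate character codes per slot
    return vec

def embed_sequence(instructions):
    """Embed every instruction; the list of vectors is the 'embedding result'."""
    return [embed_instruction(instr) for instr in instructions]

sequence = ["push ebp", "mov ebp, esp", "call sub_401000"]
result = embed_sequence(sequence)
# One vector per instruction, each of the chosen dimension
assert len(result) == 3 and all(len(v) == 4 for v in result)
```

In Tepper the vectors are trainable parameters tuned to the downstream classification task; the sketch only fixes the shape of the data flowing from the output unit to the detector.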
However, Aharon-Lospinuso-Tepper does not explicitly disclose a converter configured to generate an indexed instruction code sequence corresponding to the instruction code sequence by using an instruction code dictionary for indexing an instruction code by an integer and by indexing an instruction code in the instruction code sequence by an integer.
Eij discloses a converter configured to generate an indexed instruction code sequence corresponding to the instruction code sequence by using an instruction code dictionary for indexing an instruction code by an integer and by indexing an instruction code in the instruction code sequence by an integer (Eij: col.10 lines 25-65 “In addition to the information already present in the assembly instructions 1120, the object code format 1140 includes the following information: each source code instruction has been assigned a sequence number that is unique to the assembly function 1120 that it appears in; … This is usually necessary because, depending on the assembly language used, not all operands in the assembly instructions 1120 carry a datatype but for proper operation of the transform 2000 and build 4000 steps it is desirable that the datatypes of all values in the CDFG are known. The nodes in the CDFG are marked with the numbers of the corresponding instructions in the object code section, Such that the relationship between executed operations in the object code section and the nodes in the CDFG can be established in the analysis step 1200….All functions in the annotated executable 1158 are assigned a sequence number that is unique to the executable 1158. As a result, functions can be referred to by number instead of by name which is an advantage if the source program 996 is written in a language like C where function names are not necessarily unique in a program; the function CDFGs present in the object code 1140 are combined to form the overall program CDFG 1157 that represents the static structure of program 996.” The system links each instruction to a number, and each function can be called via that number. Either the logical mapping the system generates between integers and program functions, and/or the binary libraries 998 in Figs. 9a-b, serves as the instruction code dictionary used for this process. The converter is the linking module 1150 or 1151 in Figs. 9a-b.).
Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to combine Aharon-Lospinuso-Tepper with Eij in order to incorporate a converter configured to generate an indexed instruction code sequence corresponding to the instruction code sequence by using an instruction code dictionary for indexing an instruction code by an integer and by indexing an instruction code in the instruction code sequence by an integer.
One of ordinary skill in the art would have been motivated to combine because of the expected benefit of improved compiling and execution of code (Eij: col.10 lines 25-65).
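For reference, the claimed converter function — building an instruction code dictionary that assigns each distinct instruction code a unique integer, then rewriting the sequence by those integers — can be illustrated with the following minimal sketch. The names are hypothetical and the sketch is not Eij's implementation; it merely demonstrates the indexing concept to which Eij's unique sequence numbers are mapped.

```python
# Illustrative sketch only: an "instruction code dictionary" assigning each
# distinct instruction code a unique integer, analogous in spirit to Eij's
# assignment of unique sequence numbers to instructions and functions.

def build_dictionary(instruction_sequence):
    """Assign each distinct instruction code the next unused integer index."""
    dictionary = {}
    for instr in instruction_sequence:
        if instr not in dictionary:
            dictionary[instr] = len(dictionary)
    return dictionary

def index_sequence(instruction_sequence, dictionary):
    """Produce the indexed instruction code sequence."""
    return [dictionary[instr] for instr in instruction_sequence]

seq = ["push", "mov", "push", "call", "mov"]
dictionary = build_dictionary(seq)
indexed = index_sequence(seq, dictionary)
# push→0, mov→1, call→2, so the indexed sequence is [0, 1, 0, 2, 1]
assert indexed == [0, 1, 0, 2, 1]
```

As in Eij, referring to codes by integers rather than by (possibly non-unique) names gives every instruction an unambiguous identity for downstream processing.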
Regarding Claim 15, Aharon-Lospinuso-Tepper-Eij discloses claim 14 as set forth above.
However, Aharon-Lospinuso does not explicitly disclose wherein the output unit is further configured to output an embedding result of the indexed instruction code sequence by embedding the indexed instruction code sequence.
Tepper further discloses wherein the output unit is further configured to output an embedding result of the instruction code sequence by embedding the instruction code sequence (Tepper: para.0016 “The system 100 may receive source code 102 and compile the source code into intermediate representation (IR) code 104.” para.0020 “A set of graph embedding vectors (such as embedding vectors 108 in FIG. 1, already discussed) may be computed, one vector for each intermediate representation (IR) code instruction (e.g., each LLVM-IR instruction) of the input program (i.e., IR code 104 in FIG. 1, already discussed)…. Generation of embedding vectors, however, are part of the trainable elements in the neural network (trainable through backpropagation). Since training involves using dependence graphs (e.g., PDG), after training, the values of these embedding vectors will be influenced by the collection of the PDG graphs seen during training. Once trained, the neural network will be tuned to the selected task.” The input file is the source code 102, which is converted into an instruction code sequence. In step 108 of Fig. 1, this sequence is then embedded as a vector for each line of the IR code for the learning process. The software that performs this step is the output unit.).
Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date to combine Aharon-Lospinuso with Tepper in order to incorporate wherein the output unit is further configured to output an embedding result of the instruction code sequence by embedding the instruction code sequence.
One of ordinary skill in the art would have been motivated to combine because of the expected benefit of improved accuracy that comes with incorporating a machine learning model (Tepper: para.0017).
However, Aharon-Lospinuso-Tepper does not explicitly disclose wherein the output unit is further configured to output an embedding result of the indexed instruction code sequence by embedding the indexed instruction code sequence.
Eij discloses the indexed instruction code (Eij: col.10 lines 25-65 “In addition to the information already present in the assembly instructions 1120, the object code format 1140 includes the following information: each source code instruction has been assigned a sequence number that is unique to the assembly function 1120 that it appears in; … This is usually necessary because, depending on the assembly language used, not all operands in the assembly instructions 1120 carry a datatype but for proper operation of the transform 2000 and build 4000 steps it is desirable that the datatypes of all values in the CDFG are known. The nodes in the CDFG are marked with the numbers of the corresponding instructions in the object code section, Such that the relationship between executed operations in the object code section and the nodes in the CDFG can be established in the analysis step 1200….All functions in the annotated executable 1158 are assigned a sequence number that is unique to the executable 1158. As a result, functions can be referred to by number instead of by name which is an advantage if the source program 996 is written in a language like C where function names are not necessarily unique in a program; the function CDFGs present in the object code 1140 are combined to form the overall program CDFG 1157 that represents the static structure of program 996.” The system links each instruction to a number, and each function can be called via that number. Either the logical mapping the system generates between integers and program functions, and/or the binary libraries 998 in Figs. 9a-b, serves as the instruction code dictionary used for this process.).
Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to combine Aharon-Lospinuso-Tepper with Eij in order to incorporate generating an indexed instruction code sequence such that the learning process in Tepper is performed on an integer-represented code.
One of ordinary skill in the art would have been motivated to combine because of the expected benefit of improved compiling and execution of code (Eij: col.10 lines 25-65).
Claim(s) 8-10 are rejected under 35 U.S.C. 103 as being unpatentable over Aharon et al. (hereinafter Aharon, US 2007/0089171 A1) in view of Lospinuso et al. (hereinafter Lospinuso, US 10,291,647 B2), further in view of Tepper et al. (hereinafter Tepper, US 2020/0326934 A1), further in view of Eijndhoven et al. (hereinafter Eij, US 9,081,928 B2), and further in view of Choi et al. (hereinafter Choi, US 2022/0067579 A1).
Regarding Claim 8, Aharon-Lospinuso-Tepper-Eij discloses claim 7 as set forth above.
While Tepper discloses, in at least para.0020, the embedding module processing each instruction in the code as vectors using natural language and word processing techniques, it does not specifically disclose which algorithms are used. Therefore, Aharon-Lospinuso-Tepper-Eij does not explicitly disclose wherein the learning of the assembly language model learns the assembly language model by performing a masked language model (MLM) task and a next sentence prediction (NSP) task of the assembly language model by using the indexed instruction code sequence.
Choi discloses wherein the learning of the language model learns the language model by performing a masked language model (MLM) task and a next sentence prediction (NSP) task of the language model by using the sentence sequence (Choi: para.0051 “BERT architecture 216 uses bidirectionality by pre-training on a couple of tasks—Masked Language Model and Next Sentence Prediction. The masked language model randomly masks some of the tokens from the input, and the objective is to predict the original vocabulary id of the masked word based only on its context. a “next sentence prediction” model jointly pretrains text-pair representations by splitting the corpus into sentence pairs. For 50% of the pairs, the second sentence would actually be the next sentence to the first sentence, labeled ‘IsNext’. For the remaining 50% of the pairs, the second sentence would be a random sentence from the corpus labeled ‘NotNext’.” Para.0052 “As used herein, a “sentence” can be an arbitrary span of contiguous text, rather than an actual linguistic sentence. A “sequence” refers to the input token sequence to BERT, which may be a single sentence or multiple sentences packed together.” The language model is trained using MLM and NSP tasks, with a series of sentences being input into the learning process.).
Therefore, it would have been obvious to combine Aharon-Lospinuso-Tepper-Eij with Choi in order to incorporate wherein the learning of the language model learns the language model by performing a masked language model (MLM) task and a next sentence prediction (NSP) task of the language model by using the sentence sequence, and to apply this technique to the machine learning process based on the assembly language model and the indexed instruction code sequence as established in Aharon-Lospinuso-Tepper-Eij.
One of ordinary skill in the art would have been motivated to combine because of the expected benefit of improved efficiency and outcome that comes with BERT architecture (Choi: para.0053).
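For reference, the MLM and NSP pretraining tasks that Choi describes in para.0051 can be illustrated with the following minimal sketch of how the training examples are constructed. This is not Choi's or BERT's implementation; the function names, mask rate, and seeding are hypothetical, and a real system would feed these examples to a trained transformer rather than merely shaping them.

```python
# Illustrative sketch only: constructing MLM and NSP training examples in the
# BERT style described by Choi (para.0051). Names and parameters hypothetical.
import random

MASK = "[MASK]"

def make_mlm_example(tokens, mask_prob=0.15, seed=0):
    """Randomly mask tokens; the model's objective is to predict the originals."""
    rng = random.Random(seed)
    masked, labels = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            masked.append(MASK)
            labels.append(tok)    # the original token is the prediction target
        else:
            masked.append(tok)
            labels.append(None)   # unmasked positions carry no MLM loss
    return masked, labels

def make_nsp_pair(sentences, i, seed=0):
    """50% of pairs keep the true next sentence ('IsNext'); 50% substitute a
    random sentence from the corpus ('NotNext')."""
    rng = random.Random(seed)
    if rng.random() < 0.5 and i + 1 < len(sentences):
        return sentences[i], sentences[i + 1], "IsNext"
    return sentences[i], rng.choice(sentences), "NotNext"

masked, labels = make_mlm_example(["mov", "eax", "ebx"])
first, second, label = make_nsp_pair([["push", "ebp"], ["mov", "ebp"]], 0)
```

Under the proposed combination, the "sentences" here would be indexed instruction code sequences rather than natural-language text, consistent with Choi's para.0052 definition of a sentence as an arbitrary span of contiguous text.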
Regarding Claim 9, Aharon-Lospinuso-Tepper-Eij discloses claim 7 as set forth above.
While Tepper discloses, in at least para.0020, the embedding module processing each instruction in the code as vectors using natural language and word processing techniques, it does not specifically disclose which algorithms are used, stating only generally that n-grams are used. Therefore, Aharon-Lospinuso-Tepper-Eij does not explicitly disclose wherein the learning of the assembly language model learns the assembly language model by treating the indexed instruction code sequence as a sentence and by treating each instruction code as a token.
Choi discloses wherein the learning of the language model learns the language model by treating the input sequence as a sentence and by treating each word as a token (Choi: para.0051 “BERT architecture 216 uses bidirectionality by pre-training on a couple of tasks—Masked Language Model and Next Sentence Prediction. The masked language model randomly masks some of the tokens from the input, and the objective is to predict the original vocabulary id of the masked word based only on its context. a “next sentence prediction” model jointly pretrains text-pair representations by splitting the corpus into sentence pairs. For 50% of the pairs, the second sentence would actually be the next sentence to the first sentence, labeled ‘IsNext’. For the remaining 50% of the pairs, the second sentence would be a random sentence from the corpus labeled ‘NotNext’.” Para.0052 “As used herein, a “sentence” can be an arbitrary span of contiguous text, rather than an actual linguistic sentence. A “sequence” refers to the input token sequence to BERT, which may be a single sentence or multiple sentences packed together.” The language model is trained using MLM and NSP tasks, with a series of sentences being input into the learning process.).
Therefore, it would have been obvious to combine Aharon-Lospinuso-Tepper-Eij with Choi in order to incorporate wherein the learning of the language model learns the language model by treating the input sequence as a sentence and by treating each word as a token, and to apply this technique to the indexed instruction code sequence and instruction codes as established in Aharon-Lospinuso-Tepper-Eij. Choi leaves the construction of the “sentence,” “token,” and “sequence” open-ended for implementation. As seen in para.0052, a sequence can generally be a plurality of sentences, and a sentence can be set to any span of contiguous text, or equally to a series of tokens. It would be obvious, in implementation, to set an instruction code as a token and a series of tokens as a sentence to improve the learning algorithm in Tepper.
One of ordinary skill in the art would have been motivated to combine because of the expected benefit of improved efficiency and outcome that comes with BERT architecture (Choi: para.0053).
Regarding Claim 10, Aharon-Lospinuso-Tepper-Eij-Choi discloses claim 9 as set forth above.
While Tepper discloses, in at least para.0020, the embedding module processing each instruction in the code as vectors using natural language and word processing techniques, it does not specifically disclose which algorithms are used, stating only generally that n-grams are used. Therefore, Aharon-Lospinuso-Tepper-Eij does not explicitly disclose wherein the learning of the assembly language model learns the assembly language model by using a vector that adds token embedding for the indexed instruction code sequence, position embedding for a position of an instruction code, and segment embedding for distinguishing two indexed instruction code sequences.
Choi discloses wherein the learning of the language model learns the language model by using a vector that adds token embedding for the document, position embedding for a position of a word, and segment embedding for distinguishing two sentences (Choi: para.0064 “Embeddings 310 are vector representation of words in the natural language descriptions found in documents. In the BERT architecture, each of embeddings 310 is a combination of three embeddings: positional embeddings to express the position of words in a sentence, segment embedding to distinguish between sentence pairs, and token embeddings learned for the specific token from the pretraining corpus token vocabulary.” Choi performs machine learning process using vector embeddings including position, segment and token embedding.).
Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date to combine Aharon-Lospinuso-Tepper-Eij with Choi in order to incorporate wherein the learning of the language model learns the language model by using a vector that adds token embedding for the document, position embedding for a position of a word, and segment embedding for distinguishing two sentences, and to apply this technique to the machine learning process of Aharon-Lospinuso-Tepper-Eij for the assembly language model for the indexed instruction code sequence.
One of ordinary skill in the art would have been motivated to combine because of the expected benefit of improved efficiency and outcome that comes with BERT architecture (Choi: para.0053).
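For reference, the three-part input embedding that Choi describes in para.0064 — each input vector formed by summing a token embedding, a position embedding, and a segment embedding — can be illustrated with the following minimal sketch. The tables below hold random toy values rather than learned parameters, and all names and sizes are hypothetical, not taken from the reference.

```python
# Illustrative sketch only: BERT-style input embeddings per Choi para.0064,
# where each input vector is the elementwise sum of token, position, and
# segment embeddings. Toy random tables stand in for learned parameters.
import random

DIM = 4
rng = random.Random(0)

def toy_table(rows):
    """A stand-in embedding table of random values (learned in a real model)."""
    return [[rng.random() for _ in range(DIM)] for _ in range(rows)]

token_table = toy_table(100)    # one row per vocabulary entry (token id)
position_table = toy_table(32)  # one row per position in the sequence
segment_table = toy_table(2)    # segment A (id 0) vs segment B (id 1)

def input_embedding(token_ids, segment_ids):
    """Sum the three embeddings componentwise for each input token."""
    out = []
    for pos, (tok, seg) in enumerate(zip(token_ids, segment_ids)):
        out.append([t + p + s for t, p, s in
                    zip(token_table[tok], position_table[pos], segment_table[seg])])
    return out

# Three tokens: the first two in segment A, the third in segment B
vectors = input_embedding([5, 7, 9], [0, 0, 1])
assert len(vectors) == 3 and all(len(v) == DIM for v in vectors)
```

Under the proposed combination, the token ids would be the integer indices of instruction codes, the positions would be instruction positions in the sequence, and the segments would distinguish two indexed instruction code sequences.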
Conclusion
The prior art made of record and not relied upon is considered pertinent to applicant's disclosure. Sawant et al., US 11,604,626 B1: see the abstract, Fig. 6, and the sections relevant to Fig. 6, showing detection of code that matches a previous embedding satisfying a particular requirement.
THIS ACTION IS MADE FINAL. Applicant is reminded of the extension of time policy as set forth in 37 CFR 1.136(a).
A shortened statutory period for reply to this final action is set to expire THREE MONTHS from the mailing date of this action. In the event a first reply is filed within TWO MONTHS of the mailing date of this final action and the advisory action is not mailed until after the end of the THREE-MONTH shortened statutory period, then the shortened statutory period will expire on the date the advisory action is mailed, and any nonprovisional extension fee (37 CFR 1.17(a)) pursuant to 37 CFR 1.136(a) will be calculated from the mailing date of the advisory action. In no event, however, will the statutory period for reply expire later than SIX MONTHS from the mailing date of this final action.
Any inquiry concerning this communication or earlier communications from the examiner should be directed to EUI H KIM whose telephone number is (571)272-8133. The examiner can normally be reached 7:30-5 M-R, M-F alternating.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Kamal B Divecha can be reached on 5712725863. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.
/EUI H KIM/ Examiner, Art Unit 2453
/KAMAL B DIVECHA/ Supervisory Patent Examiner, Art Unit 2453