DETAILED ACTION
Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
Claim Rejections - 35 USC § 101
35 U.S.C. 101 reads as follows:
Whoever invents or discovers any new and useful process, machine, manufacture, or composition of matter, or any new and useful improvement thereof, may obtain a patent therefor, subject to the conditions and requirements of this title.
Claim 8 is rejected under 35 U.S.C. 101 because the claimed invention is directed to non-statutory subject matter.
Claim 8 recites a “computer readable storage medium.” The specification does not define this term, and therefore, under the broadest reasonable interpretation (BRI), it can be interpreted as encompassing transitory, propagating signals such as carrier waves, which do not fall within any statutory category of subject matter. Thus, the claim is considered directed to non-statutory subject matter. See MPEP § 2106.
Specification
The disclosure is objected to because of the following informalities: the summary of the invention repeats the claim language verbatim. Appropriate correction is required.
Claim Rejections - 35 USC § 103
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.
Claims 1-9 are rejected under 35 U.S.C. 103 as being unpatentable over Ji et al. (US 2022/0244953) in view of Koo et al. (US 2023/0161879).
Regarding claims 1 and 5
Ji et al. teaches
a memory storing a binary code similarity detection program [fig 3, 0015] in order to overcome those drawbacks in the prior art, a binary code similarity detection system is provided];
a processor configured to execute the binary code similarity detection program [0074] conventional binary code similarity detection methods first disassemble the binary code to assembly code, in which the statement is combined by operation code (opcode) and operand. Further, the control flow operations (e.g., branch statement) split the assembly code into multiple basic blocks, where either all the statements inside one basic block will execute together];
wherein the binary code similarity detection program performs a preprocessing operation of generating an assembly expression for the binary code by converting a machine language of an input binary code into an assembly language and extracting an assembly function or a command from the binary code converted to the assembly language, and detects a similarity to the assembly expression of a pre-stored binary code by inputting the assembly expression generated by the preprocessing operation to a trained model based on [bidirectional encoder representations from transformers (BERT)] [0010] the assembly codes 120 and 130 of FIGS. 1B and 1C are similar because they share the same compiler family (llvm), optimization level (O1), and target architecture (x86), with the only difference being compiler version (version 3.3 for assembly code 120 and version 3.5 for assembly code 130). In contrast, the assembly code 140 of FIG. 1D is drastically different, due to its choice of compiling configuration (gcc version 4.8.5 with O3 for the x64 architecture). In the assembly code 140 of FIG. 1D, both the code size and the control flow differ significantly from the examples in FIGS. 1B and 1C, mainly because of loop related optimization techniques (e.g., tree vectorization and loop unrolling). For the reasons discussed above, binary-binary similarity detection methods that rely on a single, binary level model for similarity analysis have difficulty in fully accounting for the differences that arise solely from the different compiling configurations] and [0056] to extract the instruction-level features 620, the system 300 extracts the unique instruction patterns and their combinations. FIG. 6B illustrates the extracted instruction-level features 620 of the assembly code 420 shown in FIG. 4B. To improve the representativeness of the extracted instruction-level features 620, the system 300 may add a wildcard to represent any instruction. 
For example, the extracted instruction-level features 620 are shown in FIG. 6B with “|” as the instruction split symbol and [0016] the compiling configuration of the target binary code may be identified by a neural network trained on a training dataset of binary codes compiled using known configurations, for example a graph attention network trained on attributed function call graphs of binary codes. The target binary code and the comparing binary may be compared using a graph neural network (e.g., a graph triplet loss network) that compares attributed control flow graphs of the target binary code and the comparing binary];
the trained model is generated by performing a pre-training step of causing the assembly expression to be understood and a fine-tuning step of inputting an assembly expression of a first binary code and an assembly expression of a second binary code to a pre-trained model and then fine-tuning the pre-trained model based on a similarity between the first binary code and the second binary code [0066] having generated an AFCG 315 for the target binary code 310 and the binary codes 332 in the training set 330, the system 300 identifies the target compiling configuration 318 using a graph neural network (GNN) trained on the training dataset 330, which is able to learn an embedding for a graph and further tune the model based on the downstream task (i.e., multi-graph classification). More specifically, the system 300 may use a specific type of GNN, known as a graph attention network (GAT) 700] and [0016] the compiling configuration of the target binary code may be identified by a neural network trained on a training dataset of binary codes compiled using known configurations, for example a graph attention network trained on attributed function call graphs of binary codes. The target binary code and the comparing binary may be compared using a graph neural network (e.g., a graph triplet loss network) that compares attributed control flow graphs of the target binary code and the comparing binary] and [0074] conventional binary code similarity detection methods first disassemble the binary code to assembly code, in which the statement is combined by operation code (opcode) and operand. Further, the control flow operations (e.g., branch statement) split the assembly code into multiple basic blocks, where either all the statements inside one basic block will execute together, or none of them will execute. Taking each basic block as a node and the control flow relationship as an edge, prior art methods generate a control flow graph (CFG).
As control flow graphs maintain code structures, they are an essential representation for code analysis. However, only using the control flow graph without the specific assembly code ignores the syntax features of the binary code]. Ji et al. teaches binary code similarity detection and a trained model but does not explicitly teach [bidirectional encoder representations from transformers (BERT)]; however, Koo et al. teaches [0077] embodiments of the present disclosure provide an assembly language model for embedding an instruction code and a malware binary classification model using the assembly language model, for example, a malicious code detection model. Herein, the assembly language model may be based on the BERT (Bidirectional Encoder Representation from Transformer) model proposed by Google, but the assembly language model is not necessarily restricted or limited to the BERT model and may be generated by using every scheme and artificial intelligence capable of generating an assembly language model]. Therefore, it would have been obvious to a person of ordinary skill in the art before the effective filing date of the claimed invention to modify Ji et al. to incorporate the BERT-based assembly language model taught by Koo et al. The modification would have been obvious because one of ordinary skill in the art would have been motivated by BERT's demonstrated strength as a tool for understanding language context, used extensively for tasks such as interpreting search queries, assessing document relevance, and determining whether two sentences have the same meaning.
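For illustration only — the helper name below is hypothetical and does not appear in the cited references — the instruction-level feature extraction that Ji et al. describes in [0056], which keeps each opcode, wildcards its operands, and joins the features with “|” as the instruction split symbol, might be sketched as:

```python
# Illustrative sketch of instruction-level feature extraction per Ji et al.
# [0056]: keep each opcode, replace its operands with a wildcard "*", and
# join the resulting features with "|" as the instruction split symbol.
# The function name extract_features is a hypothetical label, not from the record.

def extract_features(assembly_lines):
    """Keep opcodes, wildcard the operands, join features with '|'."""
    feats = []
    for line in assembly_lines:
        parts = line.strip().split(None, 1)  # opcode, then optional operands
        feats.append(parts[0] + " *" if len(parts) > 1 else parts[0])
    return "|".join(feats)

print(extract_features(["mov eax, 1", "add eax, ebx", "ret"]))  # mov *|add *|ret
```

The wildcard makes the feature robust to operand differences (e.g., register allocation) introduced by different compiling configurations.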
Regarding claims 2 and 6
Koo et al. teaches
the binary code similarity detection program, in the pre-training step, replaces some words in the assembly expression with mask words and performs masked language modeling (MLM) training to match words before being replaced with the mask words [0078] herein, in embodiments of the present disclosure, an assembly language model for embedding an instruction code is pre-learned, and a masked language model (MLM) task using an instruction code sequence and a next sentence prediction (NSP) task are performed for the pre-learning of the assembly language model]. The feature of providing MLM would have been obvious because masked language modeling enables a model to learn a deep, bidirectional contextual understanding of language by predicting intentionally hidden (masked) words in a sentence, for the reasons set forth in the rejection of claim 1.
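For illustration only — the helper name and masking rate below are assumptions, not from the record — the MLM task Koo et al. describes in [0078] replaces some tokens of an instruction-code sequence with a mask token and trains the model to recover the originals at the masked positions:

```python
import random

# Hypothetical sketch of MLM data preparation per Koo et al. [0078]:
# some tokens of an instruction-code sequence are replaced with "[MASK]",
# and the positions/originals are kept as training targets. The function
# name mask_tokens and the 0.3 masking rate are illustrative assumptions.

def mask_tokens(tokens, mask_rate=0.15, seed=0):
    """Return the masked sequence and a position -> original-token map."""
    rng = random.Random(seed)  # seeded for a reproducible illustration
    masked, labels = [], {}
    for i, tok in enumerate(tokens):
        if rng.random() < mask_rate:
            masked.append("[MASK]")
            labels[i] = tok  # the model is trained to predict these
        else:
            masked.append(tok)
    return masked, labels

sequence = ["push", "ebp", "mov", "ebp", "esp", "sub", "esp", "0x10"]
masked, labels = mask_tokens(sequence, mask_rate=0.3)
```

During pre-training, the model's loss is computed only at the masked positions, which is what forces it to use bidirectional context.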
Regarding claims 3 and 7
Ji et al. teaches
the binary code similarity detection program, in the fine-tuning step, constructs a first pre-trained model and a second pre-trained model according to a siamese neural network and fine-tunes the first pre-trained model and the second pre-trained model based on a similarity between a first embedding vector output by inputting the assembly expression of the first binary code to the first pre-trained model and a second embedding vector output by inputting the assembly expression of the second binary code to the second pre-trained model [0066] Having generated an AFCG 315 for the target binary code 310 and the binary codes 332 in the training set 330, the system 300 identifies the target compiling configuration 318 using a graph neural network (GNN) trained on the training dataset 330, which is able to learn an embedding for a graph and further tune the model based on the downstream task (i.e., multi-graph classification). More specifically, the system 300 may use a specific type of GNN, known as a graph attention network (GAT) 700] and [0010] the assembly codes 120 and 130 of FIGS. 1B and 1C are similar because they share the same compiler family (llvm), optimization level (O1), and target architecture (x86), with the only difference being compiler version (version 3.3 for assembly code 120 and version 3.5 for assembly code 130). In contrast, the assembly code 140 of FIG. 1D is drastically different, due to its choice of compiling configuration (gcc version 4.8.5 with O3 for the x64 architecture). In the assembly code 140 of FIG. 1D, both the code size and the control flow differ significantly from the examples in FIGS. 1B and 1C, mainly because of loop related optimization techniques (e.g., tree vectorization and loop unrolling). 
For the reasons discussed above, binary-binary similarity detection methods that rely on a single, binary level model for similarity analysis have difficulty in fully accounting for the differences that arise solely from the different compiling configurations] and [0016] the compiling configuration of the target binary code may be identified by a neural network trained on a training dataset of binary codes compiled using known configurations, for example a graph attention network trained on attributed function call graphs of binary codes. The target binary code and the comparing binary may be compared using a graph neural network (e.g., a graph triplet loss network) that compares attributed control flow graphs of the target binary code and the comparing binary]. The feature of providing a pre-trained model and embedding vectors would be obvious for the reasons set forth in the rejection of claim 1.
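For illustration only — the toy encoder below is a stand-in assumption, not Ji's GAT or the claimed BERT model — the siamese arrangement recited in claims 3 and 7 has two weight-shared encoders embed the two assembly expressions, with fine-tuning driven by the similarity between the resulting embedding vectors:

```python
import math
import zlib

# Toy sketch of a siamese similarity computation. The "encoder" here is a
# deterministic bag-of-tokens hashing embedding, an illustrative assumption;
# in the claims it would be a pre-trained BERT-style model. The same embed()
# plays both towers of the siamese pair, i.e., the weights are shared.

def embed(tokens, dim=8):
    """Map a token sequence to a fixed-size count vector (toy encoder)."""
    vec = [0.0] * dim
    for tok in tokens:
        vec[zlib.crc32(tok.encode()) % dim] += 1.0
    return vec

def cosine(u, v):
    """Cosine similarity of two vectors; 0.0 if either is all-zero."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

first = embed(["mov", "eax", "1", "ret"])    # first assembly expression
second = embed(["mov", "eax", "2", "ret"])   # second assembly expression
similarity = cosine(first, second)           # drives the fine-tuning loss
```

In actual fine-tuning, the similarity score would feed a loss (e.g., contrastive) whose gradient updates the shared encoder weights.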
Regarding claim 4
Ji et al. teaches
the binary code similarity detection program generates a plurality of assembly expressions obtained by dividing the binary code through the preprocessing operation for the input binary code and detects a similarity between respective assembly expressions and the assembly expression of the pre-stored binary code [0006] when source code is unavailable, binary code similarity detection may be used to perform vulnerability detection, malware analysis, security patch analysis, and even plagiarism detection. The traditional approach for binary code similarity detection takes two different binary codes as the inputs (e.g., the whole binary, functions, or basic blocks) and computes a measurement of similarity between them. If two binary codes were compiled from the same or similar source code, this binary-binary code similarity approach produces a high similarity score. To compare binary code from a device or closed-source application to source code, however, requires source-binary code similarity detection, where the code to be analyzed is in the binary format while the one for comparison is in the source code format. A traditional approach is to first compile the source code with a particular compiling configuration and then compare the compiled source code to the target binary code using binary-binary code similarity detection methods. However, such an approach faces two major challenges that prevent them from achieving high accuracy and coverage]. The feature of detecting similarity between respective assembly expressions would be obvious for the reasons set forth in the rejection of claim 1.
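For illustration only — the "ret"-boundary split and the Jaccard score below are assumptions, not the claimed BERT-based detection — claim 4's division of one binary into a plurality of assembly expressions, each scored against a pre-stored expression, might be sketched as:

```python
# Hypothetical sketch of claim 4: divide one binary's assembly listing into
# several assembly expressions (here, naively split at each "ret" as a
# function boundary) and score each piece against a single pre-stored
# expression. Token-set Jaccard stands in for the claimed model-based score.

def split_expressions(instructions):
    """Split a flat instruction list into chunks at each 'ret'."""
    chunks, current = [], []
    for ins in instructions:
        current.append(ins)
        if ins == "ret":
            chunks.append(current)
            current = []
    if current:  # trailing instructions without a final 'ret'
        chunks.append(current)
    return chunks

def jaccard(a, b):
    """Token-set Jaccard similarity between two instruction sequences."""
    sa, sb = set(a), set(b)
    return len(sa & sb) / len(sa | sb) if sa | sb else 0.0

listing = ["push ebp", "mov ebp, esp", "ret", "xor eax, eax", "ret"]
stored = ["push ebp", "mov ebp, esp", "ret"]
scores = [jaccard(chunk, stored) for chunk in split_expressions(listing)]
# scores -> [1.0, 0.25]: the first chunk matches exactly, the second barely
```

Each per-chunk score corresponds to the claimed similarity between a respective assembly expression and the pre-stored expression.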
Regarding claims 8-9
Ji et al. teaches
a computer program that is stored in a computer-readable storage medium and performs the binary code similarity detection method [0089] Referring back to FIG. 3, the binary code similarity detection system 300 may be realized using any hardware computing device (e.g., a server, a personal computer, etc.). The source code database 372 and the compiling configuration training dataset 330 may be stored on any non-transitory computer readable storage media internal to the hardware computing device or externally accessible to the hardware computing device via a wired, wireless, or network connection]. The feature of providing a storage medium would be obvious for the reasons set forth in the rejection of claim 1.
Relevant Prior Art
US 12175225 B2 (Ševčenko et al.) teaches a System and Method for Binary Code De-compilation Using Machine Learning.
US 11972333 B1 (Horesh et al.) teaches Supervisory Systems for Generative Artificial Intelligence Models.
US 12073195 B2 (Duan et al.) teaches Retrieval-Augmented Code Completion.
Conclusion
Any inquiry concerning this communication or earlier communications from the examiner should be directed to Anil Khatri whose telephone number is (571)272-3725. The examiner can normally be reached M-F 8:30-5:00.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Wei Zhen can be reached at 571-272-3708. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.
/ANIL KHATRI/ Primary Examiner, Art Unit 2191