DETAILED ACTION
Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
This is in response to the amendment filed 01/19/2026.
Status of the claims
Claims 1-8 were pending; claims 1 and 8 have been amended. Accordingly, claims 1-8 remain pending for examination.
Claim Rejections - 35 USC § 102
The following is a quotation of the appropriate paragraphs of 35 U.S.C. 102 that form the basis for the rejections under this section made in this Office action:
A person shall be entitled to a patent unless –
(a)(1) the claimed invention was patented, described in a printed publication, or in public use, on sale, or otherwise available to the public before the effective filing date of the claimed invention.
Claims 1-8 are rejected under 35 U.S.C. 102(a)(1) as being anticipated by Kim et al. “Enhancing Code Similarity with Augmented Data Filtering and Ensemble Strategies”, September 2022 (hereafter Kim).
Regarding claim 1, Kim discloses: A processor-implemented method of generating a similarity determination model of programming codes performed by a computing device including at least a processor, the method comprising (Abstract):
performing, by the processor, preprocessing on raw data written in any one language ([page 678, section A. “Data Preprocessing” and fig. 1] discloses: the data preprocessing step processes the raw text);
performing, by the processor, a three-step data filtering process on the preprocessed data, thereby removing duplicate data included in the preprocessed data ([page 678, section B. “Data Filtering Strategies”] discloses: three-step data filtering strategies for removing duplicates);
generating, by the processor, positive pairs and negative pairs using the filtered data for use in training ([page 678, section C. “Forming positive and negative pairs”] discloses: the training data comprises positive pairs, which indicate that two codes are similar, and negative pairs, which indicate that two codes are not similar); and
training, by the processor, a pre-trained language model using the generated positive pairs and negative pairs to generate the similarity determination model for determining a similarity of the programming codes ([page 678, section C. “Forming positive and negative pairs”] discloses: the training data comprises positive pairs, which indicate that two codes are similar, and negative pairs, which indicate that two codes are not similar);
wherein the training uses a cross-validation ensemble technique to eliminate redundancy in the training of the pre-trained language model ([page 679, section D. “Cross-Validation Ensemble”]); and
determining, by the processor, a similarity of two programming codes using the similarity determination model to eliminate code duplication in programs ([page 678, section B. “Data Filtering Strategies” and section C. “Forming positive and negative pairs”] discloses: the training data comprises positive pairs, which indicate that two codes are similar, and negative pairs, which indicate that two codes are not similar, and data filtering strategies for removing duplicates),
wherein the two programming codes are written in different programming languages and the similarity between the two programming codes is determined at a sentence-level granularity ([page 677, 2nd column, 1st paragraph] discloses: cross-language source code detection (i.e., detecting similarity even when source codes are written in different PLs (programming languages)); and [page 678, section C. “Forming positive and negative pairs”] discloses: determining the similarity of two codes includes ranking matching sentences according to their relevance to a given query).
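For clarity of the record, the cross-validation ensemble technique relied upon from Kim’s section D may be illustrated by the following minimal sketch. The fold-splitting scheme and the names (cross_validation_ensemble, train_fn, predict_fn) are illustrative assumptions, with simple callables standing in for fine-tuning and scoring a pre-trained language model; this is not Kim’s or Applicant’s code.

```python
def cross_validation_ensemble(examples, labels, k, train_fn, predict_fn, x):
    """Sketch of a k-fold cross-validation ensemble: train one model per
    fold on the out-of-fold data, then average the k models' predictions.
    train_fn and predict_fn are placeholders for model fine-tuning/scoring."""
    folds = [examples[i::k] for i in range(k)]         # round-robin k-way split
    fold_labels = [labels[i::k] for i in range(k)]
    models = []
    for i in range(k):
        # train on every fold except fold i
        train_x = [e for j in range(k) if j != i for e in folds[j]]
        train_y = [y for j in range(k) if j != i for y in fold_labels[j]]
        models.append(train_fn(train_x, train_y))
    preds = [predict_fn(m, x) for m in models]
    return sum(preds) / k                              # ensemble average
```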
Regarding claim 2, Kim discloses: The method of claim 1, wherein the performing of the preprocessing comprises: at least one of removing a new line, removing a space, removing a comment, and removing a null space ([page 678, section A. “Data Preprocessing”] discloses: unnecessary new lines and spaces are removed. Comments (sequences marked with ‘#’) written to help understand the code are also somewhat unnecessary for our training process. Hence, they are removed along with white spaces).
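The preprocessing operations disclosed by Kim at section A may be illustrated by the following sketch; the function name preprocess and its exact steps are assumptions for illustration, not Kim’s or Applicant’s code.

```python
import re

def preprocess(raw_code: str) -> str:
    """Illustrative preprocessing sketch: strip '#' comments, empty (null)
    lines, and surrounding white space, then collapse internal spaces."""
    cleaned_lines = []
    for line in raw_code.split("\n"):
        line = line.split("#", 1)[0]   # drop '#' comments
        line = line.strip()            # drop leading/trailing spaces
        if line:                       # drop now-empty lines
            cleaned_lines.append(line)
    # collapse any remaining runs of internal white space
    return " ".join(re.sub(r"\s+", " ", l) for l in cleaned_lines)
```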
Regarding claim 3, Kim discloses: The method of claim 1, wherein the performing of the three-step data filtering process comprises: performing a first filtering step of generating a hash table for first data included in the preprocessed data, and removing duplicate data from the first data and second data included in the preprocessed data using the hash table ([page 678, section B. “Data Filtering Strategies”] discloses: Deduplication of Data: First, the data corresponding to A and B, and the TEST codes are loaded. The first filtering strategy filters overlapped sequence data using a hash table (HT). This method records all values of A in the HT and then checks whether these values exist in B; this is the most common and widely used method. For the first filtering of Algorithm 1 (1~6 lines), (1) the sequence input to the DEDUPLICATION function and initialized HT creates a new HT from the A, and (2) the new HT is compared with B to create First_filtered_codes (6 lines). Most duplicate data are filtered out in this process).
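The hash-table deduplication of Kim’s first filtering step may be sketched as follows; the names deduplicate, a_codes, and b_codes are illustrative stand-ins for Kim’s A, B, and HT, and this is not Kim’s Algorithm 1 itself.

```python
def deduplicate(a_codes, b_codes):
    """Illustrative first filtering step: record all sequences of A in a
    hash table (a Python set here), then keep only the B sequences that
    do not already appear in it."""
    ht = set(a_codes)                                  # hash table built from A
    return [code for code in b_codes if code not in ht]
```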
Regarding claim 4, Kim discloses: The method of claim 3, wherein the performing of the three-step data filtering comprises: performing a second filtering step of removing all white spaces existing before and after each line and all new lines between character strings by concatenating all new lines for data first-filtered by the first filtering step, and removing only intersection values ([page 678, section B. “Data Filtering Strategies”] discloses: Deletion of Intersection: Second, the purpose of the Second Filtering is to filter out data not completely filtered out by the HT owing to reasons such as trailing space. This includes spaces such as white spaces ‘ ’ in the right or left edge of sentences and tabs ‘\t’. In the second filtering of Algorithm 1 (7~10 lines), (1) the SIMPLIFY function concatenates all newlines existing in the character strings of the code. (2) All spaces and newlines before and after the character strings are removed. (3) Filtering is performed once more by taking the intersection of these filtered sequences of the test code and using first_filtered_codes to generate second_filtered_codes).
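The SIMPLIFY-and-intersect behavior of Kim’s second filtering step may be sketched as follows; the names simplify and second_filter are illustrative, and the sketch assumes (rather than reproduces) the details of Kim’s Algorithm 1.

```python
def simplify(code: str) -> str:
    """Concatenate all lines after stripping edge spaces and tabs,
    mirroring the described SIMPLIFY normalization."""
    return "".join(line.strip(" \t") for line in code.splitlines())

def second_filter(first_filtered, test_codes):
    """Drop any first-filtered sequence whose simplified form also occurs
    among the simplified test codes (deletion of the intersection)."""
    simplified_tests = {simplify(t) for t in test_codes}
    return [c for c in first_filtered if simplify(c) not in simplified_tests]
```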
Regarding claim 5, Kim discloses: The method of claim 4, further comprising: performing a third filtering step of removing duplicate data through a comparison between all words based on the white spaces, for data second-filtered by the second filtering step ([page 678, section B. “Data Filtering Strategies”] discloses: Exhaustive Search: Most of the duplicated data are removed by the second filtering process. However, a method that completely eliminates duplication is required. Therefore, an exhaustive search is performed, and s and TEST_codes are mutually compared in the third filtering of Algorithm 1 to remove the few remaining duplicate data traces (11~20 lines)).
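The word-level exhaustive search of Kim’s third filtering step may be sketched as follows; the name third_filter and the exact comparison are illustrative assumptions, not Kim’s code.

```python
def third_filter(second_filtered, test_codes):
    """Illustrative exhaustive search: split every remaining sequence on
    white space and drop those whose word lists exactly match a test code."""
    test_word_lists = [t.split() for t in test_codes]
    return [c for c in second_filtered if c.split() not in test_word_lists]
```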
Regarding claim 6, Kim discloses: The method of claim 5, wherein the generating of the positive and negative pairs comprises: generating the positive pairs and the negative pairs from the data thirdly filtered by the third filtering step using a BM25 algorithm or BM25L algorithm ([page 678 section C. “Forming positive and negative pairs”] discloses: BM25 algorithm of the Okapi system, which is based on probabilistic retrieval research, is a ranking function utilized by search engines to rank matching sentences according to their relevance to a given query. For length normalization of BM25, BM25L, a newer variant to boost scores of very long documents and more effective than BM25, was proposed [25]).
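The BM25 ranking function relied upon from Kim’s section C may be illustrated by the following textbook-style sketch (Okapi BM25 over tokenized code sequences); the function name, parameter defaults, and token representation are assumptions for illustration, and the sketch omits the BM25L length-normalization variant.

```python
import math

def bm25_scores(query_tokens, corpus, k1=1.5, b=0.75):
    """Minimal Okapi BM25 sketch: score each tokenized document in the
    corpus against the query, higher scores meaning greater relevance."""
    n = len(corpus)
    avgdl = sum(len(d) for d in corpus) / n            # average document length
    df = {}                                            # document frequencies
    for doc in corpus:
        for term in set(doc):
            df[term] = df.get(term, 0) + 1
    scores = []
    for doc in corpus:
        score = 0.0
        for term in query_tokens:
            f = doc.count(term)                        # term frequency in doc
            if f == 0:
                continue
            idf = math.log((n - df[term] + 0.5) / (df[term] + 0.5) + 1)
            score += idf * f * (k1 + 1) / (f + k1 * (1 - b + b * len(doc) / avgdl))
        scores.append(score)
    return scores
```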
Regarding claim 7, Kim discloses: The method of claim 1, wherein graphcodebert and codebert-mlm are used as the pre-trained language model ([page 679 section E. “Experiment setup”] discloses: The graphcodebert and codebert-mlm models were used for the proposed method. Both models were trained on a large dataset used for researching the probing capability of code-based PLMs).
Regarding claim 8, Kim discloses: A processor-implemented method of generating a similarity determination model of programming codes, performed by a computing device including at least a processor, the method comprising: performing, by the processor, preprocessing on raw data written in any one language ([page 678, section A. “Data Preprocessing” and fig. 1] discloses: data processing process the raw text);
performing, by the processor, a three-step data filtering process on the preprocessed data, thereby removing duplicate data included in the preprocessed data ([page 678 section B. “Data Filtering Strategies”] discloses: three step data filtering strategies for removing duplicate);
generating, by the processor, positive pairs and negative pairs using the filtered data for use in training ([page 678 section C. “Forming positive and negative pairs”] discloses: training data is positive pairs which can determine whether the two codes are similar, and the other is negative pairs which can determine whether the two codes are not similar);
training, by the processor, a pre-trained language model using the generated positive pairs and negative pairs to generate the similarity determination model for determining a similarity of the programming codes ([page 678 section C. “Forming positive and negative pairs”] discloses: training data is positive pairs which can determine whether the two codes are similar, and the other is negative pairs which can determine whether the two codes are not similar);
wherein the training of the pre-trained language model comprises using a cross-validation ensemble technique to eliminate redundancy in the training of the pre-trained language model ([page 679, section D. “Cross-Validation Ensemble”]); and
determining, by the processor, a similarity of two programming codes using the similarity determination model, wherein the similarity between the two programming codes is determined at a sentence-level granularity, thereby improving software productivity by eliminating duplication in code written in different programming languages ([page 677, 2nd column] discloses: cross-language source code detection (i.e., detecting similarity even when source codes are written in different PLs (programming languages)); and [page 678, section C. “Forming positive and negative pairs”] discloses: eliminating duplication between training and testing, and determining the similarity of two codes includes ranking matching sentences according to their relevance to a given query); and
wherein the performing of the three-step data filtering process comprises ([page 678, section B. “Data Filtering Strategies” and section C. “Forming positive and negative pairs”] discloses: training data is positive pairs which can determine whether the two codes are similar, and the other is negative pairs which can determine whether the two codes are not similar and data filtering strategies for removing duplicate):
performing a first filtering step of generating a hash table for first data included in the preprocessed data, and removing duplicate data from the first data and second data included in the preprocessed data using the hash table ([page 678, section B. “Data Filtering Strategies”] discloses: Deduplication of Data: First, the data corresponding to A and B, and the TEST codes are loaded. The first filtering strategy filters overlapped sequence data using a hash table (HT). This method records all values of A in the HT and then checks whether these values exist in B; this is the most common and widely used method. For the first filtering of Algorithm 1 (1~6 lines), (1) the sequence input to the DEDUPLICATION function and initialized HT creates a new HT from the A, and (2) the new HT is compared with B to create First_filtered_codes (6 lines). Most duplicate data are filtered out in this process),
performing a second filtering step of removing all white spaces existing before and after each line and all new lines between character strings by concatenating all new lines for data first-filtered by the first filtering step, and removing only intersection values ([page 678, section B. “Data Filtering Strategies”] discloses: Deletion of Intersection: Second, the purpose of the Second Filtering is to filter out data not completely filtered out by the HT owing to reasons such as trailing space. This includes spaces such as white spaces ‘ ’ in the right or left edge of sentences and tabs ‘\t’. In the second filtering of Algorithm 1 (7~10 lines), (1) the SIMPLIFY function concatenates all newlines existing in the character strings of the code. (2) All spaces and newlines before and after the character strings are removed. (3) Filtering is performed once more by taking the intersection of these filtered sequences of the test code and using first_filtered_codes to generate second_filtered_codes), and
performing a third filtering step of removing duplicate data through a comparison between all words based on the white spaces, for data second-filtered by the second filtering step ([page 678, section B. “Data Filtering Strategies”] discloses: Exhaustive Search: Most of the duplicated data are removed by the second filtering process. However, a method that completely eliminates duplication is required. Therefore, an exhaustive search is performed, and s and TEST_codes are mutually compared in the third filtering of Algorithm 1 to remove the few remaining duplicate data traces (11~20 lines)).
Response to Arguments
Applicant's arguments have been fully considered but they are not persuasive.
Applicant argues that Kim does not anticipate or teach the amended claims, specifically, for claim 1, that the codes are in different programming languages. Examiner disagrees. Kim [page 677, 2nd column] discloses cross-language source code detection (i.e., detecting similarity even when source codes are written in different PLs (programming languages)); therefore, Kim teaches that the two programming codes are written in different programming languages. Kim at page 678 further discloses that the similarity between codes is determined at a sentence-level granularity, e.g., determining the similarity of two codes includes ranking matching sentences according to their relevance to a given query, and discloses use of the model to eliminate duplication, e.g., in section C, “Forming positive and negative pairs,” eliminating duplication between training and testing. Thus, Kim teaches the amended limitations that the two programming codes are written in different programming languages, that the similarity between the two programming codes is determined at a sentence-level granularity, and that code duplication in programs is eliminated, as recited in the independent claims.
Conclusion
THIS ACTION IS MADE FINAL. Applicant is reminded of the extension of time policy as set forth in 37 CFR 1.136(a).
A shortened statutory period for reply to this final action is set to expire THREE MONTHS from the mailing date of this action. In the event a first reply is filed within TWO MONTHS of the mailing date of this final action and the advisory action is not mailed until after the end of the THREE-MONTH shortened statutory period, then the shortened statutory period will expire on the date the advisory action is mailed, and any nonprovisional extension fee (37 CFR 1.17(a)) pursuant to 37 CFR 1.136(a) will be calculated from the mailing date of the advisory action. In no event, however, will the statutory period for reply expire later than SIX MONTHS from the mailing date of this final action.
Contact Information
Any inquiry concerning this communication or earlier communications from the examiner should be directed to CINDY NGUYEN whose telephone number is (571)272-4025. The examiner can normally be reached M-F 8:00-4:30.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Ajay Bhatia, can be reached at 571-272-3906. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.
/CINDY NGUYEN/Examiner, Art Unit 2156
/AJAY M BHATIA/Supervisory Patent Examiner, Art Unit 2156