DETAILED ACTION
Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
This is in response to the amendment filed 01/19/2026.
Status of the claims
Claims 1-8 were pending; claims 1 and 8 have been amended. Accordingly, claims 1-8 remain pending for examination.
Claim Rejections - 35 USC § 102
The following is a quotation of the appropriate paragraphs of 35 U.S.C. 102 that form the basis for the rejections under this section made in this Office action:
A person shall be entitled to a patent unless –
(a)(1) the claimed invention was patented, described in a printed publication, or in public use, on sale, or otherwise available to the public before the effective filing date of the claimed invention.
Claims 1-8 are rejected under 35 U.S.C. 102(a)(1) as being anticipated by Kim et al. “Enhancing Code Similarity with Augmented Data Filtering and Ensemble Strategies”, September 2022 (hereafter Kim).
Regarding claim 1, Kim discloses: A processor-implemented method of generating a similarity determination model of programming codes performed by a computing device including at least a processor, the method comprising (Abstract):
performing, by the processor, preprocessing on raw data written in any one language ([page 678, section A. “Data Preprocessing” and fig. 1] discloses: the data preprocessing step processes the raw text);
performing, by the processor, a three-step data filtering process on the preprocessed data, thereby removing duplicate data included in the preprocessed data ([page 678, section B. “Data Filtering Strategies”] discloses: three-step data filtering strategies for removing duplicates);
generating, by the processor, positive pairs and negative pairs using the filtered data for use in training ([page 678, section C. “Forming positive and negative pairs”] discloses: the training data comprises positive pairs, which indicate that two codes are similar, and negative pairs, which indicate that two codes are not similar); and
training, by the processor, a pre-trained language model using the generated positive pairs and negative pairs to generate the similarity determination model for determining a similarity of the programming codes ([page 678, section C. “Forming positive and negative pairs”] discloses: the training data comprises positive pairs, which indicate that two codes are similar, and negative pairs, which indicate that two codes are not similar);
wherein the training uses a cross-validation ensemble technique to eliminate redundancy in the training of the pre-trained language model ([page 679, section D. “Cross-Validation Ensemble”]); and
determining, by the processor, a similarity of two programming codes using the similarity determination model to eliminate code duplication in programs ([page 678, section B. “Data Filtering Strategies” and section C. “Forming positive and negative pairs”] discloses: the training data comprises positive pairs, which indicate that two codes are similar, and negative pairs, which indicate that two codes are not similar, and data filtering strategies for removing duplicates),
wherein the two programming codes are written in different programming languages and the similarity between the two programming codes is determined at a sentence-level granularity ([page 677, 2nd column, 1st paragraph] discloses: cross-language source code detection (i.e., detecting similarity even when source codes are written in different PLs (programming languages)); and [page 678, section C. “Forming positive and negative pairs”] discloses: determining the similarity of two codes includes ranking matching sentences according to their relevance to a given query).
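For clarity of the record, the cross-validation ensemble technique relied upon from Kim’s section D may be illustrated by the following minimal sketch. The fold-splitting scheme and the names (cross_validation_ensemble, train_fn, predict_fn) are illustrative assumptions, with simple callables standing in for fine-tuning and scoring a pre-trained language model; this is not Kim’s or Applicant’s code.

```python
def cross_validation_ensemble(examples, labels, k, train_fn, predict_fn, x):
    """Sketch of a k-fold cross-validation ensemble: train one model per
    fold on the out-of-fold data, then average the k models' predictions.
    train_fn and predict_fn are placeholders for model fine-tuning/scoring."""
    folds = [examples[i::k] for i in range(k)]         # round-robin k-way split
    fold_labels = [labels[i::k] for i in range(k)]
    models = []
    for i in range(k):
        # train on every fold except fold i
        train_x = [e for j in range(k) if j != i for e in folds[j]]
        train_y = [y for j in range(k) if j != i for y in fold_labels[j]]
        models.append(train_fn(train_x, train_y))
    preds = [predict_fn(m, x) for m in models]
    return sum(preds) / k                              # ensemble average
```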
Regarding claim 2, Kim discloses: The method of claim 1, wherein the performing of the preprocessing comprises: at least one of removing a new line, removing a space, removing a comment, and removing a null space ([page 678, section A. “Data Preprocessing”] discloses: unnecessary new lines and spaces are removed. Comments (sequences marked with ‘#’) written to help understand the code are also somewhat unnecessary for our training process. Hence, they are removed along with white spaces).
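The preprocessing operations disclosed by Kim at section A may be illustrated by the following sketch; the function name preprocess and its exact steps are assumptions for illustration, not Kim’s or Applicant’s code.

```python
import re

def preprocess(raw_code: str) -> str:
    """Illustrative preprocessing sketch: strip '#' comments, empty (null)
    lines, and surrounding white space, then collapse internal spaces."""
    cleaned_lines = []
    for line in raw_code.split("\n"):
        line = line.split("#", 1)[0]   # drop '#' comments
        line = line.strip()            # drop leading/trailing spaces
        if line:                       # drop now-empty lines
            cleaned_lines.append(line)
    # collapse any remaining runs of internal white space
    return " ".join(re.sub(r"\s+", " ", l) for l in cleaned_lines)
```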
Regarding claim 3, Kim discloses: The method of claim 1, wherein the performing of the three-step data filtering process comprises: performing a first filtering step of generating a hash table for first data included in the preprocessed data, and removing duplicate data from the first data and second data included in the preprocessed data using the hash table ([page 678, section B. “Data Filtering Strategies”] discloses: Deduplication of Data: First, the data corresponding to A and B, and the TEST codes are loaded. The first filtering strategy filters overlapped sequence data using a hash table (HT). This method records all values of A in the HT and then checks whether these values exist in B; this is the most common and widely used method. For the first filtering of Algorithm 1 (1~6 lines), (1) the sequence input to the DEDUPLICATION function and initialized HT creates a new HT from the A, and (2) the new HT is compared with B to create First_filtered_codes (6 lines). Most duplicate data are filtered out in this process).
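The hash-table deduplication of Kim’s first filtering step may be sketched as follows; the names deduplicate, a_codes, and b_codes are illustrative stand-ins for Kim’s A, B, and HT, and this is not Kim’s Algorithm 1 itself.

```python
def deduplicate(a_codes, b_codes):
    """Illustrative first filtering step: record all sequences of A in a
    hash table (a Python set here), then keep only the B sequences that
    do not already appear in it."""
    ht = set(a_codes)                                  # hash table built from A
    return [code for code in b_codes if code not in ht]
```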
Regarding claim 4, Kim discloses: The method of claim 3, wherein the performing of the three-step data filtering comprises: performing a second filtering step of removing all white spaces existing before and after each line and all new lines between character strings by concatenating all new lines for data first-filtered by the first filtering step, and removing only intersection values ([page 678, section B. “Data Filtering Strategies”] discloses: Deletion of Intersection: Second, the purpose of the Second Filtering is to filter out data not completely filtered out by the HT owing to reasons such as trailing space. This includes spaces such as white spaces ‘ ’ in the right or left edge of sentences and tabs ‘\t’. In the second filtering of Algorithm 1 (7~10 lines), (1) the SIMPLIFY function concatenates all newlines existing in the character strings of the code. (2) All spaces and newlines before and after the character strings are removed. (3) Filtering is performed once more by taking the intersection of these filtered sequences of the test code and using first_filtered_codes to generate second_filtered_codes).
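The SIMPLIFY-and-intersect behavior of Kim’s second filtering step may be sketched as follows; the names simplify and second_filter are illustrative, and the sketch assumes (rather than reproduces) the details of Kim’s Algorithm 1.

```python
def simplify(code: str) -> str:
    """Concatenate all lines after stripping edge spaces and tabs,
    mirroring the described SIMPLIFY normalization."""
    return "".join(line.strip(" \t") for line in code.splitlines())

def second_filter(first_filtered, test_codes):
    """Drop any first-filtered sequence whose simplified form also occurs
    among the simplified test codes (deletion of the intersection)."""
    simplified_tests = {simplify(t) for t in test_codes}
    return [c for c in first_filtered if simplify(c) not in simplified_tests]
```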
Regarding claim 5, Kim discloses: The method of claim 4, further comprising: performing a third filtering step of removing duplicate data through a comparison between all words based on the white spaces, for data second-filtered by the second filtering step ([page 678, section B. “Data Filtering Strategies”] discloses: Exhaustive Search: Most of the duplicated data are removed by the second filtering process. However, a method that completely eliminates duplication is required. Therefore, an exhaustive search is performed, and s and TEST_codes are mutually compared in the third filtering of Algorithm 1 to remove the few remaining duplicate data traces (11~20 lines)).
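The word-level exhaustive search of Kim’s third filtering step may be sketched as follows; the name third_filter and the exact comparison are illustrative assumptions, not Kim’s code.

```python
def third_filter(second_filtered, test_codes):
    """Illustrative exhaustive search: split every remaining sequence on
    white space and drop those whose word lists exactly match a test code."""
    test_word_lists = [t.split() for t in test_codes]
    return [c for c in second_filtered if c.split() not in test_word_lists]
```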
Regarding claim 6, Kim discloses: The method of claim 5, wherein the generating of the positive and negative pairs comprises: generating the positive pairs and the negative pairs from the data thirdly filtered by the third filtering step using a BM25 algorithm or BM25L algorithm ([page 678 section C. “Forming positive and negative pairs”] discloses: BM25 algorithm of the Okapi system, which is based on probabilistic retrieval research, is a ranking function utilized by search engines to rank matching sentences according to their relevance to a given query. For length normalization of BM25, BM25L, a newer variant to boost scores of very long documents and more effective than BM25, was proposed [25]).
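The BM25 ranking function relied upon from Kim’s section C may be illustrated by the following textbook-style sketch (Okapi BM25 over tokenized code sequences); the function name, parameter defaults, and token representation are assumptions for illustration, and the sketch omits the BM25L length-normalization variant.

```python
import math

def bm25_scores(query_tokens, corpus, k1=1.5, b=0.75):
    """Minimal Okapi BM25 sketch: score each tokenized document in the
    corpus against the query, higher scores meaning greater relevance."""
    n = len(corpus)
    avgdl = sum(len(d) for d in corpus) / n            # average document length
    df = {}                                            # document frequencies
    for doc in corpus:
        for term in set(doc):
            df[term] = df.get(term, 0) + 1
    scores = []
    for doc in corpus:
        score = 0.0
        for term in query_tokens:
            f = doc.count(term)                        # term frequency in doc
            if f == 0:
                continue
            idf = math.log((n - df[term] + 0.5) / (df[term] + 0.5) + 1)
            score += idf * f * (k1 + 1) / (f + k1 * (1 - b + b * len(doc) / avgdl))
        scores.append(score)
    return scores
```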
Regarding claim 7, Kim discloses: The method of claim 1, wherein graphcodebert and codebert-mlm are used as the pre-trained language model ([page 679 section E. “Experiment setup”] discloses: The graphcodebert and codebert-mlm models were used for the proposed method. Both models were trained on a large dataset used for researching the probing capability of code-based PLMs).
Regarding claim 8, Kim discloses: A processor-implemented method of generating a similarity determination model of programming codes, performed by a computing device including at least a processor, the method comprising: performing, by the processor, preprocessing on raw data written in any one language ([page 678, section A. “Data Preprocessing” and fig. 1] discloses: data processing process the raw text);
performing, by the processor, a three-step data filtering process on the preprocessed data, thereby removing duplicate data included in the preprocessed data ([page 678 section B. “Data Filtering Strategies”] discloses: three step data filtering strategies for removing duplicate);
generating, by the processor, positive pairs and negative pairs using the filtered data for use in training ([page 678 section C. “Forming positive and negative pairs”] discloses: training data is positive pairs which can determine whether the two codes are similar, and the other is negative pairs which can determine whether the two codes are not similar);
training, by the processor, a pre-trained language model using the generated positive pairs and negative pairs to generate the similarity determination model for determining a similarity of the programming codes ([page 678 section C. “Forming positive and negative pairs”] discloses: training data is positive pairs which can determine whether the two codes are similar, and the other is negative pairs which can determine whether the two codes are not similar);
wherein the training of the pre-trained language model comprises using a cross-validation ensemble technique to eliminate redundancy in the training of the pre-trained language model ([page 679, section D. “Cross-Validation Ensemble”]); and
determining, by the processor, a similarity of two programming codes using the similarity determination model, wherein the similarity between the two programming codes is determined at a sentence-level granularity, thereby improving software productivity by eliminating duplication in code written in different programming languages ([page 677, 2nd column] discloses: cross-language source code detection (i.e., detecting similarity even when source codes are written in different PLs (programming languages)); and [page 678, section C. “Forming positive and negative pairs”] discloses: eliminating duplication between training and testing, and determining the similarity of two codes includes ranking matching sentences according to their relevance to a given query); and
wherein the performing of the three-step data filtering process comprises ([page 678, section B. “Data Filtering Strategies” and section C. “Forming positive and negative pairs”] discloses: training data is positive pairs which can determine whether the two codes are similar, and the other is negative pairs which can determine whether the two codes are not similar and data filtering strategies for removing duplicate):
performing a first filtering step of generating a hash table for first data included in the preprocessed data, and removing duplicate data from the first data and second data included in the preprocessed data using the hash table ([page 678, section B. “Data Filtering Strategies”] discloses: Deduplication of Data: First, the data corresponding to A and B, and the TEST codes are loaded. The first filtering strategy filters overlapped sequence data using a hash table (HT). This method records all values of A in the HT and then checks whether these values exist in B; this is the most common and widely used method. For the first filtering of Algorithm 1 (1~6 lines), (1) the sequence input to the DEDUPLICATION function and initialized HT creates a new HT from the A, and (2) the new HT is compared with B to create First_filtered_codes (6 lines). Most duplicate data are filtered out in this process),
performing a second filtering step of removing all white spaces existing before and after each line and all new lines between character strings by concatenating all new lines for data first-filtered by the first filtering step, and removing only intersection values ([page 678, section B. “Data Filtering Strategies”] discloses: Deletion of Intersection: Second, the purpose of the Second Filtering is to filter out data not completely filtered out by the HT owing to reasons such as trailing space. This includes spaces such as white spaces ‘ ’ in the right or left edge of sentences and tabs ‘\t’. In the second filtering of Algorithm 1 (7~10 lines), (1) the SIMPLIFY function concatenates all newlines existing in the character strings of the code. (2) All spaces and newlines before and after the character strings are removed. (3) Filtering is performed once more by taking the intersection of these filtered sequences of the test code and using first_filtered_codes to generate second_filtered_codes), and
performing a third filtering step of removing duplicate data through a comparison between all words based on the white spaces, for data second-filtered by the second filtering step ([page 678, section B. “Data Filtering Strategies”] discloses: Exhaustive Search: Most of the duplicated data are removed by the second filtering process. However, a method that completely eliminates duplication is required. Therefore, an exhaustive search is performed, and s and TEST_codes are mutually compared in the third filtering of Algorithm 1 to remove the few remaining duplicate data traces (11~20 lines)).
Response to Arguments
Applicant's arguments have been fully considered but they are not persuasive.
Applicant argues that Kim does not anticipate or teach the amended claims, specifically, for claim 1, that the codes are in different programming languages. Examiner disagrees. Kim [page 677, 2nd column] discloses cross-language source code detection (i.e., detecting similarity even when source codes are written in different PLs (programming languages)); therefore, Kim teaches that the two programming codes are written in different programming languages. Kim at page 678 further discloses that the similarity between codes is determined at a sentence-level granularity, e.g., determining the similarity of two codes includes ranking matching sentences according to their relevance to a given query, and discloses use of the model to eliminate duplication, e.g., in section C, “Forming positive and negative pairs,” eliminating duplication between training and testing. Thus, Kim teaches the amended limitations that the two programming codes are written in different programming languages, that the similarity between the two programming codes is determined at a sentence-level granularity, and that code duplication in programs is eliminated, as recited in the independent claims.
Conclusion
THIS ACTION IS MADE FINAL. Applicant is reminded of the extension of time policy as set forth in 37 CFR 1.136(a).
A shortened statutory period for reply to this final action is set to expire THREE MONTHS from the mailing date of this action. In the event a first reply is filed within TWO MONTHS of the mailing date of this final action and the advisory action is not mailed until after the end of the THREE-MONTH shortened statutory period, then the shortened statutory period will expire on the date the advisory action is mailed, and any nonprovisional extension fee (37 CFR 1.17(a)) pursuant to 37 CFR 1.136(a) will be calculated from the mailing date of the advisory action. In no event, however, will the statutory period for reply expire later than SIX MONTHS from the mailing date of this final action.
Contact Information
Any inquiry concerning this communication or earlier communications from the examiner should be directed to CINDY NGUYEN whose telephone number is (571)272-4025. The examiner can normally be reached M-F 8:00-4:30.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Ajay Bhatia, can be reached at 571-272-3906. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.
/CINDY NGUYEN/Examiner, Art Unit 2156
/AJAY M BHATIA/Supervisory Patent Examiner, Art Unit 2156