Last updated: May 29, 2026

Application No. 17/805,310

GRAPH-BASED SEMI-SUPERVISED GENERATION OF FILES

Non-Final OA §103

Filed: Jun 03, 2022

Examiner: DASGUPTA, SHOURJO

Art Unit: 2144

Tech Center: 2100 (Computer Architecture & Software)

Assignee: International Business Machines Corporation

OA Round: 3 (Non-Final)

Interview Optional

An interview has already been conducted in this application's prosecution history. This examiner has a 65% grant rate with a +38.6% interview lift. Because an interview has already been tried, a written response with narrowed claims, based on precedent claim evolution patterns, is recommended.

Based on 454 resolved cases, 2023–2026

Examiner Intelligence

DASGUPTA, SHOURJO

Career Allowance Rate: 65% (297 granted / 454 resolved), above average, +10.4% vs TC avg

Interview Lift: +38.6% (strong; without vs. with interview, resolved cases with interview)

Typical timeline

Avg Prosecution: 3y 5m (17 currently pending)

Career history

Total Applications: 486 (across all art units)

Statute-Specific Performance

§101: 2.3% (-37.7% vs TC avg)

§103: 91.8% (+51.8% vs TC avg)

§102: 2.5% (-37.5% vs TC avg)

§112: 2.8% (-37.2% vs TC avg)

Tech Center averages are estimates. Based on career data from 454 resolved cases.

Office Action

§103

Notice of Pre-AIA or AIA Status
1.	The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.

Continued Examination Under 37 CFR 1.114
2.	A request for continued examination under 37 CFR 1.114, including the fee set forth in 37 CFR 1.17(e), was filed in this application after final rejection.  Since this application is eligible for continued examination under 37 CFR 1.114, and the fee set forth in 37 CFR 1.17(e) has been timely paid, the finality of the previous Office Action has been withdrawn pursuant to 37 CFR 1.114.  

Detailed Action
3.	This Non-Final Office Action is responsive to Applicants’ amendments and arguments, as first received on 2/2/26 and asserted via the RCE submission received 3/5/26.  Claims 1, 3-8, 10-15, and 17-20 are presently pending, of which claims 1, 8, and 15 are independent.


Claim Rejections - 35 USC § 103
4.	In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis (i.e., changing from AIA to pre-AIA) for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.

5.	The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office Action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

6.	The factual inquiries for establishing a background for determining obviousness under 35 U.S.C. 103 are summarized as follows:
1. Determining the scope and contents of the prior art.
2. Ascertaining the differences between the prior art and the claims at issue.
3. Resolving the level of ordinary skill in the pertinent art.
4. Considering objective evidence present in the application indicating obviousness or nonobviousness.

7.	Claims 1, 3-6, 8, 10-13, 15, and 17-20 are rejected under 35 U.S.C. 103 as being unpatentable over Non-Patent Literature “Learning from, Understanding, and Supporting DevOps Artifacts for Docker” (“Henkel”) in view of Non-Patent Literature “A Novel Neural Source Code Representation Based on Abstract Syntax Tree” (“Zhang”) and further in view of U.S. Patent Application Publication No. 2023/0036159 (“Duppils”).
Regarding claim 1, HENKEL teaches A system for graph-based ... generation of files (Henkel: Abstract’s final paragraph discussing how the framework contributes to and improves upon the creation of Dockerfiles), the system comprising:
a memory; and a processor in communication with the memory (Henkel: regarding both the “memory” and “processor” recitations, the Examiner reasons that the taught binnacle toolset (Abstract and Introduction sections on page 1), which is contemplated to facilitate improved DevOps functions, necessarily runs on a computer, e.g. one used by developer professionals in the discussed DevOps capacity and that such a computer would be understood by one of ordinary skill in the art to feature memory and processor elements to function as intended in relation to the reference’s teachings), the processor being configured to perform operations comprising:
collecting a set of repositories (Henkel: page 2’s Fig. 1 showing a Setup and Collect framework layer, specifically with a GitHub repository as subjected to a Repository Ingestor and a File Downloader, where in the paragraph beginning at the bottom of page 2’s first column, it is discussed that “As a prerequisite to our analysis and experimentation, we also collected approximately 900,000 GitHub repositories, and from these repositories, captured approximately 219,000 Dockerfiles (of which 178,000 are unique). Within this large corpus of Dockerfiles, we identified a subset written by Docker experts: this Gold Set is a collection of high-quality Dockerfiles that our techniques use as an oracle for Docker best practices”);
filtering the set of repositories based on one or more predefined rules and dividing data from the set of repository into a quality subset and an uncertain subset (Henkel: the collection of GitHub repositories as referenced above, as further clarified by page 3, second column, second full paragraph, which discusses “Using binnacle, we ingested every public repository on GitHub with ten or more stars”, such that the collection of the approximately 900,000 GitHub repositories involves a filtering step based on a predefined rule (e.g., only collecting repositories with 10+ stars), such that the result of the filtering by 10+ stars is that the repositories are grouped into data deemed of a higher quality and data that is not), wherein the quality subset is one that provides one or more datum (e.g., the repositories that have been filtered based on a predefined rule, as discussed just above – for example, 10+ stars repositories) and the uncertain subset has data that does not meet all predefined rules or a set of threshold values (i.e., the repositories that do not meet the 10+ stars filtering rule);
splitting the one or more datum into a quality dataset and an uncertain dataset (Henkel: page 3, second column, third full paragraph which discusses “Although both the number of repositories we ingested and the number of Dockerfiles we collected were large, we still had not addressed challenge (D2): high-quality data. To find high-quality data, we looked within our Dockerfile corpus and extracted every Dockerfile that originally came from the docker-library/ GitHub organization. This organization is run by Docker, and houses a set of official Dockerfiles written by and maintained by Docker experts. There are approximately 400 such files in our Dockerfile corpus. We will refer to this smaller subset of Dockerfiles as the Gold Set. Because these files are Dockerfiles created and maintained by Docker’s own experts, they presumably represent a higher standard of quality than those produced by non-experts. This set provides us with a solution to challenge (D2)—the Gold Set can be used as an oracle for good Dockerfile hygiene.  In addition to the Gold Set, we also collected approximately 5,000 Dockerfiles from several industrial repositories, with the hope that these files would also be a source of high-quality data.”, such that obtaining a gold set as taught from a corpus is akin to splitting the corpus into those deemed to be the gold set (i.e., “high quality” as recited) and those that are not (i.e., “uncertain” as recited)).
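The filtering and splitting steps, as the Examiner maps them onto Henkel, can be sketched as follows. This is only an illustration of that reading, not Henkel's binnacle code: the `Repo` record, its fields, and the sample repositories are hypothetical, while the ten-star rule and the `docker-library` organization come from the passages quoted above.

```python
from typing import NamedTuple

class Repo(NamedTuple):
    full_name: str  # "owner/name"; hypothetical field for illustration
    stars: int

MIN_STARS = 10               # "ten or more stars" rule quoted from Henkel
GOLD_ORG = "docker-library"  # organization Henkel uses for its Gold Set

def filter_by_stars(repos, min_stars=MIN_STARS):
    """Apply the predefined rule: return (meets rule, fails rule)."""
    kept = [r for r in repos if r.stars >= min_stars]
    rest = [r for r in repos if r.stars < min_stars]
    return kept, rest

def split_gold(repos, gold_org=GOLD_ORG):
    """Split filtered data into (Gold Set, everything else)."""
    gold = [r for r in repos if r.full_name.split("/", 1)[0] == gold_org]
    other = [r for r in repos if r.full_name.split("/", 1)[0] != gold_org]
    return gold, other

# Hypothetical sample data.
repos = [
    Repo("docker-library/python", 2400),
    Repo("someone/app", 12),
    Repo("someone/toy", 3),
]
kept, rest = filter_by_stars(repos)
gold, other = split_gold(kept)
```

Under this sketch, `kept` versus `rest` corresponds to the quality and uncertain subsets, and `gold` versus `other` corresponds to the quality and uncertain datasets as the rejection reads them.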

Applicants’ claim now further recites the additional limitations, which Henkel further reads on:
extracting, from the quality dataset and the uncertain dataset, respective codebases (Henkel: page 3, second column, third full paragraph which discusses (i) “... To find high-quality data, we looked within our Dockerfile corpus and extracted every Dockerfile that originally came from the docker-library/ GitHub organization. This organization is run by Docker, and houses a set of official Dockerfiles written by and maintained by Docker experts. ... We will refer to this smaller subset of Dockerfiles as the Gold Set. Because these files are Dockerfiles created and maintained by Docker’s own experts, they presumably represent a higher standard of quality than those produced by non-experts. This set provides us with a solution to challenge (D2)—the Gold Set can be used as an oracle for good Dockerfile hygiene.” and then also from the same paragraph, (ii) “In addition to the Gold Set, we also collected approximately 5,000 Dockerfiles from several industrial repositories, with the hope that these files would also be a source of high-quality data.”, such that these obtained datasets are clearly of two types – a first one per (i) that is a Gold Set and hence “high-quality” and the second one per (ii) which is less certain, and where (i) and (ii) are understood by the Examiner to constitute the corpus per Fig. 1’s first diagram block in its Enforce Rules framework layer, and hence subject to parsing and rule enforcement as successive diagram blocks indicate (where the parsing at minimum involves or is equivalent to “extracting” as recited)); and 
generating one or more codebase feature encodings/embeddings/representations for both the quality and the uncertain dataset (Henkel: page 2, second column, first full paragraph discussing “... we introduced a novel technique for generating structured representations of DevOps artifacts in the presence of nested languages, which we call phased parsing”, where the result is akin to the AST as shown on page 4’s Fig. 2(d), which embodies antecedents and consequents (page 5, second column, first full paragraph), to facilitate further parsing and rule enforcement (e.g., per Fig. 1’s Enforce Rules framework layer)).
Regarding the bulleted limitations discussed just prior, Henkel does not teach that its encodings/embeddings/representations are feature vectors specifically as per Applicants’ further limitation.  Rather, the Examiner relies upon ZHANG to teach what Henkel otherwise lacks, see e.g., Zhang’s comparable code analysis framework (Abstract), and particularly one where the analysis can lend itself to a classification task (Abstract again, also page 784’s 2nd column) such that the analyzed source code can be digested into an understandable grammar that lends itself to representations using nodes (as shown per Fig. 1 on page 784) in a manner that is similar to Henkel’s approach per Henkel’s Fig. 2, for example.  Zhang, specifically, teaches that these representations are then traversed for a transformation into vector form (page 785, section III’s introduction, and then later in sections III(B-C) and IV on pages 786-788 generally (see specifically section IV’s opening remarks and discussion of source code classification found at the bottom of page 788’s left column)).
Like Henkel, Zhang relates to analysis of a codebase to generate a machine learning result/benefit, and hence is similarly directed and therefore analogous.  It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to incorporate a vector-based encoding/embedding aspect (as Zhang contemplates), post-extraction, to prepare the extracted data into a format that facilitates further processing, with a reasonable expectation of success, such as to develop an improved understanding of the underlying codebase data/information (as both references generally contemplate) by using an approach per Zhang’s vectors that is amenable to and widely practiced for machine-learning objectives in the state of the art.
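To make the idea of traversing a tree representation and emitting a vector concrete, here is a minimal sketch. It is deliberately not Zhang's neural encoder: it swaps in a simple bag-of-node-types count over Python's standard `ast` module, and the `VOCAB` list and function name are assumptions chosen for illustration.

```python
import ast
from collections import Counter

# Hypothetical, fixed vocabulary of AST node types (one vector slot each).
VOCAB = ["FunctionDef", "Return", "BinOp", "Name", "Constant", "If"]

def ast_count_vector(source, vocab=VOCAB):
    """Parse source, walk every AST node, and count node types per slot."""
    tree = ast.parse(source)
    counts = Counter(type(node).__name__ for node in ast.walk(tree))
    return [counts[name] for name in vocab]

vec = ast_count_vector("def add(a, b):\n    return a + b\n")
# One FunctionDef, one Return, one BinOp, two Name nodes (a and b).
```

A learned encoder of the kind Zhang describes would replace the counting step, but the shape of the pipeline (parse, traverse, fixed-length vector) is the same.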

Applicants’ claim further recites that the generation of files is semi-supervised and now features an active step for performing a semi-supervised learning formulation and storing results to use in future codebase generation.  The Examiner does not believe Henkel or Zhang to teach this, and rather relies upon DUPPILS to teach what Henkel etc. otherwise lack, see e.g. Duppils’s comparable framework that identifies vulnerabilities/deficiencies in program code, and explicitly uses semi-supervised learning ([0084], [0088], [0111] for example) in relation to a gold standard basis for its classification and learning ([0116]) such that the ability to manage/maintain the code in view of new vulnerabilities persists with time ([0002]-[0003], [0006]), thereby implying the development and persistence (i.e., “storing ... to use in future ...”) of the learned features.
Like Henkel and Zhang, Duppils relates to analysis of a codebase to generate a machine learning result/benefit, and hence is similarly directed and therefore analogous.  It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to incorporate Duppils’s semi-supervised learning aspect to evaluate codebases, with a reasonable expectation of success, such that a credible ability to manage/maintain the code in view of new vulnerabilities persists with time.

Regarding claim 3, Henkel in view of Zhang and further in view of Duppils teach the system of claim 1, as discussed above.  The aforementioned references further teach the additional limitations wherein the processor is further configured to perform operations comprising: computing a pairwise similarity utilizing the one or more codebase feature vectors (Zhang’s discussion on code clone detection, starting at the bottom of page 788’s left column, which is distance-based in terms of the code as represented in vector form and is explicitly a similarity comparison, as would be applied to address known codebase analysis considerations as discussed in section IV(B) on page 792); and generating a codebase level graph (Henkel’s FIG. 2 showing the development of a codebase graph across different levels based on parsing, where the developed graph is then used in rule enforcement per FIGs. 4 and 6 for example).  The motivation for combining the references is as discussed above in relation to claim 1.
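The two claim 3 limitations (pairwise similarity over feature vectors, then a codebase-level graph) can be sketched as below. This is an assumption-laden illustration rather than either reference's method: cosine similarity is swapped in for Zhang's distance-based comparison, and the `threshold` value, function names, and sample vectors are hypothetical.

```python
import math
from itertools import combinations

def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def similarity_graph(vectors, threshold=0.8):
    """Build an undirected weighted graph: one node per codebase,
    an edge wherever pairwise similarity meets the threshold."""
    graph = {name: {} for name in vectors}
    for a, b in combinations(vectors, 2):
        sim = cosine(vectors[a], vectors[b])
        if sim >= threshold:
            graph[a][b] = sim
            graph[b][a] = sim
    return graph

# Hypothetical feature vectors for three codebases.
vectors = {
    "repo_a": [1.0, 0.0, 1.0],
    "repo_b": [1.0, 0.0, 0.9],
    "repo_c": [0.0, 1.0, 0.0],
}
graph = similarity_graph(vectors)
```

Here `repo_a` and `repo_b` end up connected, while `repo_c` is orthogonal to both and stays isolated.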

Regarding claim 4, Henkel in view of Zhang and further in view of Duppils teach the system of claim 3, as discussed above.  The aforementioned references further teach the additional limitations wherein the processor is further configured to perform operations comprising: incorporating a target node (Henkel’s section 3.5, starting on page 7, discussing rule enforcement such that the input for this process could be understood to be a target that can be subject to a graph-based approach as shown per Fig. 6, where portions of an input tree (constituting target nodes) and a developed rule-based tree are subject to comparison/matching) (apart from that, see Henkel’s rule formulation per its FIG. 2 where the rule-based tree is developed feasibly one node at a time based on extraction, and would necessarily involve the incorporation of data that is new at the time and hence a type of target).  The motivation for combining the references is as discussed above in relation to claim 1.

Regarding claim 5, Henkel in view of Zhang and further in view of Duppils teach the system of claim 4, as discussed above.  The aforementioned references further teach the additional limitations wherein the processor is further configured to perform operations comprising: generating one or more learning node representations (see Henkel’s rule formulation per its FIG. 2 where the rule-based tree is developed feasibly one node at a time based on extraction).  The motivation for combining the references is as discussed above in relation to claim 1.

Regarding claim 6, Henkel in view of Zhang and further in view of Duppils teach the system of claim 5, as discussed above.  The aforementioned references further teach the additional limitations wherein the processor is further configured to perform operations comprising: generating a file, wherein the file is generated based on the one or more learning node representations, and wherein the file includes at least one of the one or more datum (Henkel’s page 1, Abstract, discussing “... The learned rules and analyzer in binnacle can be used to aid developers in the IDE when creating Dockerfiles, and in a post-hoc fashion to identify issues in, and to improve, existing Dockerfiles.”, where the improvement of an existing file can be understood to generate a new version of the file with a new improvement).  The motivation for combining the references is as discussed above in relation to claim 1.

Regarding claim 8, the claim includes the same or similar limitations as claim 1 discussed above, and is therefore rejected under the same rationale.

Regarding claim 10, the claim includes the same or similar limitations as claim 3 discussed above, and is therefore rejected under the same rationale.

Regarding claim 11, the claim includes the same or similar limitations as claim 4 discussed above, and is therefore rejected under the same rationale.

Regarding claim 12, the claim includes the same or similar limitations as claim 5 discussed above, and is therefore rejected under the same rationale.

Regarding claim 13, the claim includes the same or similar limitations as claim 6 discussed above, and is therefore rejected under the same rationale.

Regarding claim 15, the claim includes the same or similar limitations as claim 1 discussed above, and is therefore rejected under the same rationale.  In particular, the present claim recites “A computer program product ... comprising a computer readable storage medium having program instructions embodied therewith ...” which the Examiner believes is implicit to Henkel’s teachings, see e.g., claim 1’s discussion of the processor and memory elements, which would be understood to respectively execute and store code/instructions that define/constitute the taught binnacle toolset.

Regarding claim 17, the claim includes the same or similar limitations as claim 3 discussed above, and is therefore rejected under the same rationale.

Regarding claim 18, the claim includes the same or similar limitations as claim 4 discussed above, and is therefore rejected under the same rationale.

Regarding claim 19, the claim includes the same or similar limitations as claim 5 discussed above, and is therefore rejected under the same rationale.

Regarding claim 20, the claim includes the same or similar limitations as claim 6 discussed above, and is therefore rejected under the same rationale.


8.	Claims 7 and 14 are rejected under 35 U.S.C. 103 as being unpatentable over Henkel in view of Zhang and Duppils and further in view of Non-Patent Literature “Supervised and unsupervised learning: the two approaches that we should know in the world of machine learning” (“Rosidi”).
Regarding claim 7, Henkel in view of Zhang and further in view of Duppils teach the system of claim 5, as discussed above.  The aforementioned references teach the additional limitation wherein the one or more learning node representations include at least the target node and one or more unlabeled nodes (Henkel: page 3, second column, third full paragraph which discusses (i) “... To find high-quality data, we looked within our Dockerfile corpus and extracted every Dockerfile that originally came from the docker-library/ GitHub organization. This organization is run by Docker, and houses a set of official Dockerfiles written by and maintained by Docker experts. ... We will refer to this smaller subset of Dockerfiles as the Gold Set. Because these files are Dockerfiles created and maintained by Docker’s own experts, they presumably represent a higher standard of quality than those produced by non-experts. This set provides us with a solution to challenge (D2)—the Gold Set can be used as an oracle for good Dockerfile hygiene.” and then also from the same paragraph, (ii) “In addition to the Gold Set, we also collected approximately 5,000 Dockerfiles from several industrial repositories, with the hope that these files would also be a source of high-quality data.”, such that these obtained datasets are clearly of two types – a first one per (i) that is a Gold Set and hence “high-quality” (i.e., ground truth as Henkel refers to it and hence labeled) and the second one per (ii) which is less certain (unlabeled)) but does not explicitly teach that wherein the one or more unlabeled nodes are utilized as propagation bridges.  Rather, the Examiner relies upon ROSIDI to teach what Henkel etc. otherwise lack, see e.g., Rosidi’s very bottom of page 16 leading into page 17, as subject to the Examiner’s pagination of the reference as provided, where in this cited-to portion, Rosidi establishes a label propagation method to implement semi-supervised learning (initially introduced in the middle of page 15), such that (back on page 17) “With label propagation, we don’t turn our supervised model into a semi-supervised model per se, but rather it’s an algorithm where we can turn our unlabeled data into labeled data. It works by connecting the whole dataset based on their distance, which typically is computed with Euclidean distance.  Label propagation treats a dataset as a graph, where each data point can be seen as a node, and the edge connecting two nodes can be seen as the notion of similarity between them (distance between two nodes). If the distance between two nodes is small, then it can be inferred that the two nodes have the same label and vice versa.”
Like Henkel, Rosidi is directed to semi-supervised machine learning to help better understand unlabeled data.  Hence, the references are similarly directed and therefore analogous.  It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to incorporate label propagation per Rosidi to understand unlabeled data per Henkel’s framework, with a reasonable expectation of success, such that the distance based approach used per Rosidi and widely practiced in the state of the art can help facilitate the overall machine learning objectives for codebase information per Henkel for example.
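The label propagation idea quoted from Rosidi (nodes connected by Euclidean distance, labels spreading between close nodes) can be sketched as follows. This is a minimal illustration under stated assumptions, not Rosidi's or the Applicants' implementation: the Gaussian edge weighting, the `sigma` and `iterations` parameters, the class names, and the sample points are all hypothetical choices.

```python
import math

def label_propagation(points, labels, sigma=1.0, iterations=50):
    """points: list of feature tuples; labels: class name or None per point.
    Returns a predicted class name for every point."""
    classes = sorted({lab for lab in labels if lab is not None})
    n = len(points)
    # Edge weight shrinks as Euclidean distance grows (Gaussian kernel).
    w = [[math.exp(-math.dist(points[i], points[j]) ** 2 / (2 * sigma ** 2))
          if i != j else 0.0 for j in range(n)] for i in range(n)]
    # Per-node class scores; labeled nodes start (and stay) one-hot.
    scores = [[1.0 if labels[i] == c else 0.0 for c in classes]
              for i in range(n)]
    for _ in range(iterations):
        updated = []
        for i in range(n):
            if labels[i] is not None:
                updated.append(scores[i])  # clamp known labels
                continue
            total = sum(w[i])
            updated.append([sum(w[i][j] * scores[j][k] for j in range(n)) / total
                            for k in range(len(classes))])
        scores = updated
    return [classes[max(range(len(classes)), key=lambda k: s[k])]
            for s in scores]

# Two labeled end points and four unlabeled points in between.
points = [(0.0,), (1.0,), (2.0,), (8.0,), (9.0,), (10.0,)]
labels = ["quality", None, None, None, None, "other"]
predicted = label_propagation(points, labels)
```

In this toy run the unlabeled points near each labeled end point inherit its label, with each intermediate unlabeled node relaying scores to its neighbors, which is the sense in which unlabeled nodes act as bridges.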

Regarding claim 14, the claim includes the same or similar limitations as claim 7 discussed above, and is therefore rejected under the same rationale.


Conclusion
9.	The prior art made of record and not relied upon is considered pertinent to Applicants’ disclosure:
U.S. Patent Application Publication No. 2021/0240826
Non-Patent Literature “A Systematic Mapping Study on Analysis of Code Repositories”
Non-Patent Literature “Unsupervised Classifying of Software Source Code Using Graph Neural Networks”
Non-Patent Literature “The Vectors of Code: On Machine Learning for Software”

10.	Any inquiry concerning this communication or earlier communications from the examiner should be directed to SHOURJO DASGUPTA whose telephone number is (571)272-7207. The examiner can normally be reached M-F 8am-5pm CST.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Tamara Kyle, can be reached at 571 272 4241. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.





/SHOURJO DASGUPTA/Primary Examiner, Art Unit 2144


Prosecution Timeline


Dec 02, 2025: Final Rejection mailed (§103)

Jan 23, 2026: Interview Requested

Jan 30, 2026: Applicant Interview (Telephonic)

Jan 30, 2026: Examiner Interview Summary

Feb 02, 2026: Response after Non-Final Action

Mar 05, 2026: Request for Continued Examination

Mar 14, 2026: Response after Non-Final Action

Mar 20, 2026: Non-Final Rejection mailed (§103, current)

Precedent Cases

Applications granted by this same examiner with similar technology

17/562,124

Patent 12626174

METHOD OF DRIVING A QUANTUM COMPUTER TO FIND ONE OR MORE STATES OF INTEREST OF A NETWORK

4y 4m to grant Granted May 12, 2026

17/270,853

Patent 12614058

ARCHITECTURE OF A COMPUTER FOR CALCULATING A CONVOLUTION LAYER IN A CONVOLUTIONAL NEURAL NETWORK

5y 2m to grant Granted Apr 28, 2026

17/479,547

Patent 12608535

AUTOMATED DIGITAL TEXT OPTIMIZATION AND MODIFICATION

4y 7m to grant Granted Apr 21, 2026

17/491,240

Patent 12591802

GENERATING ESTIMATES BY COMBINING UNSUPERVISED AND SUPERVISED MACHINE LEARNING

4y 6m to grant Granted Mar 31, 2026

17/342,719

Patent 12586371

SENSOR DATA PROCESSING

4y 9m to grant Granted Mar 24, 2026

Study what changed to get past this examiner. Based on 5 most recent grants.


Prosecution Projections

Expected OA Rounds: 3-4

Grant Probability: 65%

With Interview (+38.6%): 99%

Median Time to Grant: 3y 5m (~0m remaining)

PTA Risk: High

Based on 454 resolved cases by this examiner. Grant probability derived from career allowance rate.