Prosecution Insights
Last updated: April 19, 2026
Application No. 18/635,759

LLM-DRIVEN SYSTEM TO GENERATE DESCRIPTIONS OF MANUFACTURING PROCESSES IN REAL-TIME

Non-Final OA: §101, §103, §112
Filed: Apr 15, 2024
Examiner: FATIMA, UROOJ
Art Unit: 2676
Tech Center: 2600 — Communications
Assignee: DELL PRODUCTS, L.P.
OA Round: 1 (Non-Final)
Grant Probability: 100% (Favorable)
OA Rounds: 1-2
To Grant: 2y 9m
With Interview: 99%

Examiner Intelligence

Career Allow Rate: 100% (1 granted / 1 resolved; +38.0% vs TC avg; above average)
Interview Lift: +100.0% (strong), based on resolved cases with interview
Avg Prosecution: 2y 9m (typical timeline)
Career History: 17 total applications across all art units (16 currently pending)

Statute-Specific Performance

§101: 24.6% (-15.4% vs TC avg)
§103: 41.5% (+1.5% vs TC avg)
§102: 12.3% (-27.7% vs TC avg)
§112: 20.0% (-20.0% vs TC avg)
Tech Center averages are estimates • Based on career data from 1 resolved case
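The "vs TC avg" figures read as simple differences against a Tech Center baseline allow rate; notably, all four published deltas are consistent with a single baseline of 40%. A minimal sketch of that reading (the baseline value is inferred from the figures, not stated in the report):

```python
# Illustrative only: how the "vs TC avg" deltas above appear to be derived.
# TC_AVG is an assumption; all four deltas are consistent with a single
# Tech Center 2600 baseline allow rate of 40%.
TC_AVG = 40.0

# Examiner's per-statute allow rates, as reported above (percent).
examiner_rates = {"101": 24.6, "103": 41.5, "102": 12.3, "112": 20.0}

# delta = examiner's per-statute rate minus the Tech Center baseline
deltas = {s: round(r - TC_AVG, 1) for s, r in examiner_rates.items()}
print(deltas)  # {'101': -15.4, '103': 1.5, '102': -27.7, '112': -20.0}
```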

Office Action

§101, §103, §112
DETAILED ACTION

Notice of Pre-AIA or AIA Status

The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.

Information Disclosure Statement

The information disclosure statement (IDS) submitted on 05/15/2024 has been considered by the examiner.

Specification

The disclosure is objected to because of the following informalities: Specification paragraphs [0001], [0005], [0008], and [0019] recite “a Large Langue Model (LLM)”. This should read “a Large Language Model (LLM)”. Appropriate correction is required.

Claim Rejections - 35 USC § 112

The following is a quotation of 35 U.S.C. 112(b):

(b) CONCLUSION.—The specification shall conclude with one or more claims particularly pointing out and distinctly claiming the subject matter which the inventor or a joint inventor regards as the invention.

The following is a quotation of 35 U.S.C. 112 (pre-AIA), second paragraph:

The specification shall conclude with one or more claims particularly pointing out and distinctly claiming the subject matter which the applicant regards as his invention.

Claims 1-20 are rejected under 35 U.S.C. 112(b) or 35 U.S.C. 112 (pre-AIA), second paragraph, as being indefinite for failing to particularly point out and distinctly claim the subject matter which the inventor or a joint inventor (or for applications subject to pre-AIA 35 U.S.C. 112, the applicant) regards as the invention.

Regarding claims 1 and 11, the phrase "and/or" renders the claims indefinite because it is unclear whether the limitation following the phrase is part of the claimed invention. See MPEP 2173.05(d). For purposes of examination, the limitation will be read as "or". Dependent claims 2-10 and 12-20 are rejected for the same reason.

Claim Rejections - 35 USC § 101

35 U.S.C.
101 reads as follows: Whoever invents or discovers any new and useful process, machine, manufacture, or composition of matter, or any new and useful improvement thereof, may obtain a patent therefor, subject to the conditions and requirements of this title.

Claims 1-20 are rejected under 35 U.S.C. 101 because the claimed invention is directed to a judicial exception (i.e., abstract idea – mental process) without significantly more.

Step (1): Are the claims directed to a process, machine, manufacture, or composition of matter?
Step (2A), Prong One: Are the claims directed to a judicially recognized exception, i.e., a law of nature, a natural phenomenon, or an abstract idea?
Prong Two: If the claims are directed to a judicial exception under Prong One, is the judicial exception integrated into a practical application?
Step (2B): If the claims are directed to a judicial exception and do not integrate the judicial exception, do the claims provide an inventive concept?

Step 1: Claim 1 recites a method. Therefore, the claim is directed to the statutory category of process.

Step 2A, Prong One: Claim 1 recites “detecting one or more objects and/or one or more actions in the single image frame”. Under its broadest reasonable interpretation in light of the specification, this limitation encompasses the mental process of detecting objects and/or actions in an image frame, which is practically capable of being performed in the human mind with the assistance of pen and paper.

Claim 1 also recites “concatenating the first text description of the single image frame with a plurality of previously generated second text descriptions of previously collected single image frames”. Under its broadest reasonable interpretation in light of the specification, this limitation encompasses the mental process of simply putting the first text description and the second text descriptions together, which is practically capable of being performed in the human mind with the assistance of pen and paper.
Claim 1 further recites “analyzing…the text description of the scene in real-time.” Under its broadest reasonable interpretation in light of the specification, this limitation encompasses the mental process of analyzing the text description of the scene, which is practically capable of being performed in the human mind with the assistance of pen and paper.

Further, the above steps collectively amount to collecting information, analyzing the information, and presenting results of the analysis. The Federal Circuit has held that claims directed to collecting, analyzing, and displaying information are abstract ideas. See: Electric Power Group v. Alstom, 830 F.3d 1350 (Fed. Cir. 2016); SAP America v. InvestPic, 898 F.3d 1161 (Fed. Cir. 2018).

Prong Two: This judicial exception is not integrated into a practical application. The additional elements of “collecting a single image frame of a workstation where a manufacturing step of a manufacturing process is performed”, “generating a first text description of the single image frame based on the one or more detected objects and/or the one or more actions”, “providing the concatenation of the first text description and the plurality of previously generated second text descriptions to a Large Language Model (LLM) to thereby cause the LLM to generate a text description of a scene that is representative of the manufacturing step in the manufacturing process”, and “visualizing the text description of the scene in real-time” amount to no more than mere necessary data gathering and applying, because the claim simply uses hardware, or a generic computer, as a tool to perform the abstract idea. Thus, they are insignificant extra-solution activity. Even when viewed in combination, these additional elements do not integrate the abstract idea into a practical application, and the claims are thus directed to the abstract idea.

Step 2B: Claim 1 does not include additional elements that are sufficient to amount to significantly more than the judicial exception.
The limitations of “collecting a single image frame of a workstation where a manufacturing step of a manufacturing process is performed”, “generating a first text description of the single image frame based on the one or more detected objects and/or the one or more actions”, “providing the concatenation of the first text description and the plurality of previously generated second text descriptions to a Large Language Model (LLM) to thereby cause the LLM to generate a text description of a scene that is representative of the manufacturing step in the manufacturing process”, and “visualizing the text description of the scene in real-time” amount to no more than mere data outputting and extra-solution activity. These elements, individually and in combination, are well-understood, routine, conventional activity. As such, the claim is ineligible.

Further, the claim does not improve the functioning of a computer itself (see Enfish), improve a specific technological process (see McRO), solve a technical problem rooted in computer technology (see DDR Holdings), or recite a specific technical implementation of image processing or LLM architecture. Instead, the computer components are used as tools to perform the abstract idea. Merely applying the abstract idea in a manufacturing environment does not integrate the exception into a practical application. Accordingly, Claim 1 does not integrate the judicial exception into a practical application.

Regarding claim 11 (non-transitory storage medium): Claim 11 is directed to a non-transitory computer-readable medium containing instructions to perform the method of Claim 1. Because the underlying method is directed to an abstract idea without significantly more, the storage medium claim likewise does not recite patent-eligible subject matter.

Step 1: Claims 2-10 recite a method. Therefore, the claims are directed to the statutory category of process.
Step 2A, Prong One: Claims 2-10 merely narrow the previously recited abstract-idea limitations. The claims recite limitations similar to those described for the independent claims above and do not provide anything more than a mental process that is practically capable of being performed in the human mind with the assistance of pen and paper.

Prong Two: These judicial exceptions are not integrated into a practical application, nor do the claims include additional elements that are sufficient to amount to significantly more. Thus, the claims are ineligible.

In detail, claims 2-10 depend from Claim 1 and add:
- specification of an RGB or depth camera (Claim 2)
- pretraining the LLM (Claim 3)
- generating a prompt (Claim 4)
- performance analysis, incident detection, conformity checks (Claim 5)
- providing output to management or workers (Claims 6-7)
- use of a short-term cache (Claims 8-9)
- storing in a database (Claim 10)

These additional elements are generic technological components, or recite further data analysis and reporting functions. None of these limitations improves computer functionality or adds significantly more than the abstract idea. Accordingly, Claims 2-10 are rejected under §101 for the same reasons as Claim 1.

Regarding claims 12-20: Claims 12-20 mirror Claims 2-10 in storage-medium format and fail for the same reasons.

Claim Rejections - 35 USC § 103

In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis (i.e., changing from AIA to pre-AIA) for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.

The following is a quotation of 35 U.S.C.
103, which forms the basis for all obviousness rejections set forth in this Office action:

A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

Claims 1-8, 10-18, and 20 are rejected under 35 U.S.C. 103 as being unpatentable over Bhattacharyya et al. (US 2025/0119625 A1) (hereinafter, “Bhattacharyya”) in view of Neblett (US 2025/0245486 A1).

Regarding claim 1, Bhattacharyya discloses a method comprising:

collecting a single image frame (Paragraph [0064]: “In identifying various types of text data based on analysis of the video, such as video descriptions, video captions, and video objects, keyframes of the video may be identified and/or extracted for analysis. As used herein, a keyframe refers to a video frame that is analyzed for text data. Keyframes may be identified from among the video frames in any number of ways. As one example, an optical flow-based heuristic may be used to identify, select, or extract keyframes from a set of frames.”) [of a workstation where a manufacturing step of a manufacturing process is performed];

detecting one or more objects and/or one or more actions in the single image frame (Paragraph [0138]: “The computing device 900 may be equipped with depth cameras, such as stereoscopic camera systems, infrared camera systems, RGB camera systems, and combinations of these, for gesture detection and recognition.
Additionally, the computing device 900 may be equipped with accelerometers or gyroscopes that enable detection of motion.”);

generating a first text description (the text data in Paragraph [0124] equates to the text description) of the single image frame based on the one or more detected objects and/or the one or more actions (Paragraph [0121]: “an OCR module 520 can be applied to identify text descriptions 522 associated with the keyframes. In this regard, text descriptions 522 include text presented in the keyframe. BLIP-2 technology 524 is used to verbalize the video to generate captions 526 and objects 528.”; Paragraph [0124]: “text data associated with a video is obtained. As described herein, the text data can correspond with a plurality of modalities of the video. For example, some text data corresponds with the audio of the video, while other text data corresponds with images of the video. Examples of text data includes video metadata, video descriptions, video captions, video objects, and a video transcription. The video descriptions, video captions, and/or video objects can be identified in association with keyframes extracted from the video.”);

concatenating the first text description of the single image frame with a plurality of previously generated second text descriptions (the video metadata and the transcript in Paragraph [0122] equate to previously generated second text descriptions) of previously collected single image frames (Paragraph [0122]: “video metadata 506, the descriptions 522, the captions 526, the objects 528, and the transcript 512 can be concatenated into the prompt 530.”);

providing the concatenation of the first text description and the plurality of previously generated second text descriptions to a Large Language Model (LLM) to thereby cause the LLM to generate a text description of a scene (Paragraph [0122]: “A prompt 530 is then generated using the various types of text data.
For example, the video metadata 506, the descriptions 522, the captions 526, the objects 528, and the transcript 512 can be concatenated into the prompt 530. The prompt 530 is provided to a LLM to generate a text representation 532 that represents the text associated with the video.”) [that is representative of the manufacturing step in the manufacturing process];

and analyzing and visualizing the text description of the scene in real-time (Paragraph [0115]: “a video insight is generated and provided for display in real time. In this way, in response to an indication to identify a video insight, a video insight is identified and provided for display in real time.”; Paragraph [0118]: “video insights can be provided for utilization, for example, to analyze videos, to provide recommendations, and/or the like. In this regard, the video insight provider 236 provides video insights and, in some cases, corresponding data (e.g., probabilities) for analysis.”; see Figure 5, element 532, showing the generated text description).

However, Bhattacharyya fails to teach a workstation where a manufacturing step of a manufacturing process is performed and a description of a scene that is representative of the manufacturing step in the manufacturing process.

Neblett teaches a workstation where a manufacturing step of a manufacturing process is performed (Paragraph [0044]: “the digital work environment, in conjunction with the AI assistant may inform manufacturers of work instruction processes and procedures in context. The AI assistant may be provisioned to retrieve relevant contextual information related to a work protocol, and to provide that contextual information to a user through the digital work environment.”; Paragraph [0060]: “a user 900 equipped with an augmented-reality device 902 in a work environment 904.
Augmented reality device 902 may be an example of computing device 300…Work environment 904 shows a portion of a workstation 910 with three similar, but non-identical substations (911, 912, 913).”; Paragraph [0061]: “Computer vision may be used for object recognition and/or image & object classification in a work environment, enabling work in AR. A user may thus get information as to where they are in context”.)

Neblett also teaches a description of a scene that is representative of the manufacturing step in the manufacturing process (Paragraph [0046]: “Prompts may be provided by users performing installation, repair, maintenance, training, inspection, design (e.g., visualizing in context).”; Paragraph [0066]: “… an AI assist may be provided to an engineer drafting work protocols. Further, the AI assistant may be utilized to generate low-granularity, high-level work protocols based on training on existing work instructions and knowledge databases. For example, the AI assistant may generate instructions to “route and install wiring harness”. Such an instruction may be verified by a human, and then additional contextually relevant content may be retrieved that can be folded into the new work protocol. In this way, the LLM and human users may collaborate to generate usable work instructions.”).

Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date to modify Bhattacharyya to include a workstation where a manufacturing step of a manufacturing process is performed and a description of a scene that is representative of the manufacturing step in the manufacturing process, as taught by Neblett. The motivation for doing so would have been that LLMs intelligently aggregate, provide access to, and deliver work protocol information to users, providing advantages over traditional methods that require users to manually find the instructions, as suggested by Neblett (see Neblett, Paragraph [0025]).
Further, one skilled in the art could have combined the elements described above by known methods with no change to their respective functions, and the combination would have yielded nothing more than predictable results. Therefore, it would have been obvious to combine Neblett with Bhattacharyya to obtain the invention specified in claim 1.

Regarding claim 2, which incorporates claim 1, Bhattacharyya discloses wherein the single image frame is collected by an RGB or a depth camera that is configured to monitor the workstation where the manufacturing step of the manufacturing process is performed (Paragraph [0138]: “The computing device 900 may be equipped with depth cameras, such as stereoscopic camera systems, infrared camera systems, RGB camera systems, and combinations of these, for gesture detection and recognition. Additionally, the computing device 900 may be equipped with accelerometers or gyroscopes that enable detection of motion.”).

Regarding claim 3, which incorporates claim 1, Bhattacharyya fails to teach wherein the LLM is pretrained using a description of the manufacturing process that includes the manufacturing step. Neblett teaches wherein the LLM is pretrained using a description of the manufacturing process that includes the manufacturing step (Paragraph [0004]: “One or more large language models are trained on at least the work protocols, and text and metadata are extracted from the technical documents”; Paragraph [0040]: “training one or more large language models on at least the work protocols, extracted text, and metadata as described with regard to LLM 134 and FIG. 1. Training may further involve supporting documentation and/or required specifications, such as those referenced but not included in the work protocols themselves…the one or more large language models may be trained on a per-program basis. A program may comprise a particular manufacturing end-point (e.g., an aircraft) or a subset of that endpoint (e.g., jet engine)”).
Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date to modify Bhattacharyya to include wherein the LLM is pretrained using a description of the manufacturing process that includes the manufacturing step, as taught by Neblett. The motivation for doing so would have been to reduce errors in procedure and execution, and to reduce rework due to error, as suggested by Neblett (see Neblett, Paragraph [0025]). Further, one skilled in the art could have combined the elements described above by known methods with no change to their respective functions, and the combination would have yielded nothing more than predictable results. Therefore, it would have been obvious to combine Neblett with Bhattacharyya to obtain the invention specified in claim 3.

Regarding claim 4, which incorporates claim 1, Bhattacharyya discloses wherein providing the concatenation of the first and second text descriptions comprises: generating a prompt based on the concatenation (Paragraph [0122]: “video metadata 506, the descriptions 522, the captions 526, the objects 528, and the transcript 512 can be concatenated into the prompt 530.”); and providing the prompt to the LLM (Paragraph [0122]: “A prompt 530 is then generated using the various types of text data. For example, the video metadata 506, the descriptions 522, the captions 526, the objects 528, and the transcript 512 can be concatenated into the prompt 530. The prompt 530 is provided to a LLM to generate a text representation 532 that represents the text associated with the video.”).

Regarding claim 5, which incorporates claim 1, Bhattacharyya discloses wherein analyzing and visualizing the text description of the scene in real-time comprises one or more of (Paragraph [0118]: “video insights can be provided for utilization, for example, to analyze videos, to provide recommendations, and/or the like.
In this regard, the video insight provider 236 provides video insights and, in some cases, corresponding data (e.g., probabilities) for analysis.”): generating a real-time visualization of the scene (Paragraph [0115]: “a video insight is generated and provided for display in real time. In this way, in response to an indication to identify a video insight, a video insight is identified and provided for display in real time.”); performing performance analysis of the scene; performing incident detection in the scene; and performing a conformity check of the scene (note that the claim requires only one of the recited alternatives, and the real-time visualization alternative is met as shown above).

Regarding claim 6, which incorporates claim 5, Bhattacharyya discloses wherein one or more of the real-time visualization, the performance analysis, the incident detection, and the conformity check are provided to a management and engineering group for further analysis (Paragraph [0033]: “a user device, such as user device 110, can facilitate generating and/or providing text representations and/or video insights. A user device 110, as described herein, is generally operated by an individual or entity that may initiate generation and/or that views text representation(s) and/or video insight(s). In some cases, such an individual may be, or be associated with, a contributor, manager, developer, or creator of a video (e.g., a video being analyzed to generate the text representation and/or video insight).
In this regard, the user may be interested in text representations and/or video insights, for example, to understand how to enhance or improve the video, to understand how to market or advertise the video, etc.”).

Regarding claim 7, which incorporates claim 5, Bhattacharyya fails to teach wherein the real-time visualization of the scene is provided to a worker who is performing the manufacturing step of the manufacturing process at the workstation, the real-time visualization providing instructions on how to perform the manufacturing step in the manufacturing process to the worker. Neblett teaches this limitation (Paragraph [0044]: “The digital work environment may enable contextual deployment of a somewhat generalized work instruction at execution time… the digital work environment, in conjunction with the AI assistant may inform manufacturers of work instruction processes and procedures in context. The AI assistant may be provisioned to retrieve relevant contextual information related to a work protocol, and to provide that contextual information to a user through the digital work environment.”).

Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date to modify Bhattacharyya to include wherein the real-time visualization of the scene is provided to a worker who is performing the manufacturing step of the manufacturing process at the workstation, the real-time visualization providing instructions on how to perform the manufacturing step in the manufacturing process to the worker, as taught by Neblett.
The motivation for doing so would have been to reduce errors in procedure and execution, and to reduce rework due to error, as suggested by Neblett (see Neblett, Paragraph [0025]). Further, one skilled in the art could have combined the elements described above by known methods with no change to their respective functions, and the combination would have yielded nothing more than predictable results. Therefore, it would have been obvious to combine Neblett with Bhattacharyya to obtain the invention specified in claim 7.

Regarding claim 8, which incorporates claim 1, Bhattacharyya discloses wherein the first text description and the plurality of previously generated second text descriptions are stored in a short-term cache prior to being concatenated (Paragraph [0047]: “The data store 214 is configured to store various types of information accessible by the video insight service 212 or other server or service. In embodiments, user devices (such as user devices 110 of FIG. 1), data sources (such as data sources 116 of FIG. 1), a video service (such as video service 118 of FIG. 1), and/or servers or services can provide data to the data store 214 for storage, which may be retrieved or referenced by any such component. As such, the data store 214 may store videos, text data (e.g., metadata, descriptions, captions, etc.”; Paragraph [0120]: “The video information may include a video name, a channel, etc. In some cases, such data is used to identify more video metadata 506, such as the company name and business name... In addition, the automatic speech recognition 510 is performed to identify video transcript 512.”).
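For orientation, the claimed flow mapped above is: detect objects and actions in a single frame, generate a first text description, hold previously generated descriptions in a short-term cache (claims 8-9), concatenate them into an LLM prompt, and store and emit the resulting scene description (claim 10). A minimal, hypothetical sketch of that flow (every name and stub here is illustrative, not taken from the application or the cited references):

```python
from collections import deque

def detect_objects_and_actions(frame):
    # Stand-in for an object/action detector run on one frame.
    return ["worker", "screwdriver"], ["fasten screw"]

def describe_frame(objects, actions):
    # Generate the "first text description" from the detections.
    return f"Objects: {', '.join(objects)}. Actions: {', '.join(actions)}."

class ScenePipeline:
    def __init__(self, llm, cache_size=5):
        # Short-term cache of previously generated descriptions (claims 8-9).
        self.cache = deque(maxlen=cache_size)
        self.llm = llm          # callable: prompt -> scene description
        self.database = []      # long-term store (claim 10)

    def process(self, frame):
        objects, actions = detect_objects_and_actions(frame)
        first = describe_frame(objects, actions)
        # Concatenate prior "second" descriptions with the new one.
        prompt = " ".join([*self.cache, first])
        scene = self.llm(prompt)
        self.cache.append(first)
        self.database.append((first, scene))
        return scene

# Usage with a trivial stand-in for the LLM:
pipeline = ScenePipeline(llm=lambda p: f"Scene: {p}")
print(pipeline.process(frame=None))
```

The deque with `maxlen` makes the cache genuinely short-term: once it is full, the oldest description is evicted before each concatenation.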
Regarding claim 10, which incorporates claim 1, Bhattacharyya discloses wherein the first text description and the text description of the scene are stored in a database prior to analyzing and visualizing the text description of the scene in real-time (Paragraph [0037]: “video service 118 may initiate generation of text representations and/or video insights on a periodic basis. Such video information can then be stored and, thereafter, accessed by the video service 118 to provide to a user device for viewing (e.g., based on a user navigating to a particular video, for instance, in a video store, a video search, etc.).”; Paragraph [0043]: “other cases, the video insights service 112 outputs a video insight(s) to another service, such as video service 118, or a data store, such as data store 114. For example, upon generating a video insight(s), the video insight(s) can be provided to video service 118 and/or data store 114 for subsequent use.”).

Regarding claim 11, Bhattacharyya discloses a non-transitory storage medium having stored therein instructions that are executable by one or more hardware processors to perform operations comprising (Paragraph [0133]: “computer-readable media may comprise computer storage media and communication media. Computer storage media includes both volatile and nonvolatile, removable and non-removable media implemented in any method or technology for storage of information such as computer-readable instructions, data structures, program sub-modules, or other data.”): collecting a single image frame (Paragraph [0064]: “In identifying various types of text data based on analysis of the video, such as video descriptions, video captions, and video objects, keyframes of the video may be identified and/or extracted for analysis. As used herein, a keyframe refers to a video frame that is analyzed for text data. Keyframes may be identified from among the video frames in any number of ways.
As one example, an optical flow-based heuristic may be used to identify, select, or extract keyframes from a set of frames.”) [of a workstation where a manufacturing step of a manufacturing process is performed]; detecting one or more objects and/or one or more actions in the single image frame (Paragraph [0138] “The computing device 900 may be equipped with depth cameras, such as stereoscopic camera systems, infrared camera systems, RGB camera systems, and combinations of these, for gesture detection and recognition. Additionally, the computing device 900 may be equipped with accelerometers or gyroscopes that enable detection of motion.”); generating a first text description (text data in Paragraph [0124] equates to text description) of the single image frame based on the one or more detected objects and/or the one or more actions (Paragraph [0121] “an OCR module 520 can be applied to identify text descriptions 522 associated with the keyframes. In this regard, text descriptions 522 include text presented in the keyframe. BLIP-2 technology 524 is used to verbalize the video to generate captions 526 and objects 528.”; Paragraph [0124] “text data associated with a video is obtained. As described herein, the text data can correspond with a plurality of modalities of the video. For example, some text data corresponds with the audio of the video, while other text data corresponds with images of the video. Examples of text data includes video metadata, video descriptions, video captions, video objects, and a video transcription. The video descriptions, video captions, and/or video objects can be identified in association with keyframes extracted from the video.”). 
concatenating the first text description of the single image frame with a plurality of previously generated second text descriptions (video metadata and the transcript in Paragraph [0122] equate to previously generated second text descriptions) of previously collected single image frames (Paragraph [0122] “video metadata 506, the descriptions 522, the captions 526, the objects 528, and the transcript 512 can be concatenated into the prompt 530.”); providing the concatenation of the first text description and the plurality of previously generated second text descriptions to a Large Language Model (LLM) to thereby cause the LLM to generate a text description of a scene (Paragraph [0122] “A prompt 530 is then generated using the various types of text data. For example, the video metadata 506, the descriptions 522, the captions 526, the objects 528, and the transcript 512 can be concatenated into the prompt 530. The prompt 530 is provided to a LLM to generate a text representation 532 that represents the text associated with the video.”) [that is representative of the manufacturing step in the manufacturing process]; and analyzing and visualizing the text description of the scene in real-time (Paragraph [0115] “a video insight is generated and provided for display in real time. In this way, in response to an indication to identify a video insight, a video insight is identified and provided for display in real time.”; Paragraph [0118] “video insights can be provided for utilization, for example, to analyze videos, to provide recommendations, and/or the like. In this regard, the video insight provider 236 provides video insights and, in some cases, corresponding data (e.g., probabilities) for analysis.”; See Figure 5 element 532 showing the generated text description). 
However, Bhattacharyya fails to teach a workstation where a manufacturing step of a manufacturing process is performed and a description of a scene that is representative of the manufacturing step in the manufacturing process. Neblett teaches a workstation where a manufacturing step of a manufacturing process is performed (Paragraph [0044] “the digital work environment, in conjunction with the AI assistant may inform manufacturers of work instruction processes and procedures in context. The AI assistant may be provisioned to retrieve relevant contextual information related to a work protocol, and to provide that contextual information to a user through the digital work environment.”; Paragraph [0060] “a user 900 equipped with an augmented-reality device 902 in a work environment 904. Augmented reality device 902 may be an example of computing device 300…Work environment 904 shows a portion of a workstation 910 with three similar, but non-identical substations (911, 912, 913).”; Paragraph [0061] “Computer vision may be used for object recognition and/or image & object classification in a work environment, enabling work in AR. A user may thus get information as to where they are in context”.) and a description of a scene that is representative of the manufacturing step in the manufacturing process (Paragraph [0046] “Prompts may be provided by users performing installation, repair, maintenance, training, inspection, design (e.g., visualizing in context)."; Paragraph [0066] “… an AI assist may be provided to an engineer drafting work protocols. Further, the AI assistant may be utilized to generate low-granularity, high-level work protocols based on training on existing work instructions and knowledge databases. For example, the AI assistant may generate instructions to “route and install wiring harness”. Such an instruction may be verified by a human, and then additional contextually relevant content may be retrieved that can be folded into the new work protocol. 
In this way, the LLM and human users may collaborate to generate usable work instructions.”). Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date to modify Bhattacharyya to include a workstation where a manufacturing step of a manufacturing process is performed and a description of a scene that is representative of the manufacturing step in the manufacturing process, as taught by Neblett. The motivation for doing so would have been that an LLM can intelligently aggregate, provide access to, and deliver work protocol information, providing advantages over traditional methods that require users to manually find the instructions, as suggested by Neblett (see Neblett, Paragraph [0025]). Further, one skilled in the art could have combined the elements described above by known methods with no change to their respective functions, and the combination would have yielded nothing more than predictable results. Therefore, it would have been obvious to combine Neblett with Bhattacharyya to obtain the invention specified in claim 11.

Regarding claim 12 (drawn to a non-transitory storage medium), claim 12 is rejected on the same grounds as claim 2; the arguments presented above for claim 2 apply equally to claim 12, and the limitations shared with claim 2 are not repeated herein but are incorporated by reference.

Regarding claim 13 (drawn to a non-transitory storage medium), claim 13 is rejected on the same grounds as claim 3; the arguments presented above for claim 3 apply equally to claim 13, and the limitations shared with claim 3 are not repeated herein but are incorporated by reference.
Regarding claim 14 (drawn to a non-transitory storage medium), claim 14 is rejected on the same grounds as claim 4; the arguments presented above for claim 4 apply equally to claim 14, and the limitations shared with claim 4 are not repeated herein but are incorporated by reference.

Regarding claim 15 (drawn to a non-transitory storage medium), claim 15 is rejected on the same grounds as claim 5; the arguments presented above for claim 5 apply equally to claim 15, and the limitations shared with claim 5 are not repeated herein but are incorporated by reference.

Regarding claim 16 (drawn to a non-transitory storage medium), claim 16 is rejected on the same grounds as claim 6; the arguments presented above for claim 6 apply equally to claim 16, and the limitations shared with claim 6 are not repeated herein but are incorporated by reference.

Regarding claim 17 (drawn to a non-transitory storage medium), claim 17 is rejected on the same grounds as claim 7; the arguments presented above for claim 7 apply equally to claim 17, and the limitations shared with claim 7 are not repeated herein but are incorporated by reference.

Regarding claim 18 (drawn to a non-transitory storage medium), claim 18 is rejected on the same grounds as claim 8; the arguments presented above for claim 8 apply equally to claim 18, and the limitations shared with claim 8 are not repeated herein but are incorporated by reference.

Regarding claim 20 (drawn to a non-transitory storage medium), claim 20 is rejected on the same grounds as claim 10; the arguments presented above for claim 10 apply equally to claim 20, and the limitations shared with claim 10 are not repeated herein but are incorporated by reference.

Claims 9 and 19 are rejected under 35 U.S.C. 103 as being unpatentable over Bhattacharyya et al.
(US 2025/0119625 A1) (hereinafter, “Bhattacharyya”) in view of Neblett (US 2025/0245486 A1), and further in view of Najork et al. (US 6,301,614 B1) (hereinafter, “Najork”).

Regarding claim 9, which incorporates claim 8, Bhattacharyya and Neblett fail to teach wherein the short-term cache is initially empty and the first text description and the plurality of previously generated second text descriptions are not concatenated until a predetermined number of first and second text descriptions have been stored in the short-term cache. Najork teaches this limitation (Column 6, line 65 through Column 7, line 6: “that means the specified URL is a “new URL” for a document not previously known to the web crawler. In this case, the URL numeric representation N is added to cache B (step 214). If adding the URL numeric representation to cache B causes cache B to become full (216-Yes), then the contents of cache B are merged with the disk file (step 218) and cache B is reset to a predefined initial (i.e., empty) state. During the merging process, the stored numerical representations in cache B 126 and in the disk file 136 are combined and reorganized into a sorted order.”). Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date to modify Bhattacharyya in view of Neblett to include this limitation, as taught by Najork.
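The gating behavior cited from Najork, an initially empty cache whose contents are combined only once a predetermined number of entries has accumulated, can be sketched as follows. This is a hypothetical illustration mapped onto the claim language; Najork's own mechanism merges URL fingerprints into a disk file rather than concatenating text descriptions:

```python
class ShortTermCache:
    """Initially empty cache that defers concatenation until a threshold is
    met (analogous to Najork's cache B merging into the disk file when full,
    then resetting to its predefined empty state)."""

    def __init__(self, threshold: int):
        self.threshold = threshold
        self.entries = []          # starts empty

    def add(self, description: str):
        """Store a text description; return the concatenation only once the
        predetermined number of entries has been reached, else None."""
        self.entries.append(description)
        if len(self.entries) >= self.threshold:
            combined = " ".join(self.entries)
            self.entries = []      # reset to the initial (empty) state
            return combined
        return None                # not enough entries yet

cache = ShortTermCache(threshold=3)
assert cache.add("frame 1: bolt tightened") is None
assert cache.add("frame 2: panel aligned") is None
result = cache.add("frame 3: panel attached")
# result == "frame 1: bolt tightened frame 2: panel aligned frame 3: panel attached"
```

The key claimed behavior is simply the threshold check before the join, plus the reset afterward.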
The motivation for doing so would have been to reset the cache to its initial state, as suggested by Najork (see Najork, Abstract). Further, one skilled in the art could have combined the elements described above by known methods with no change to their respective functions, and the combination would have yielded nothing more than predictable results. Therefore, it would have been obvious to combine Najork with Bhattacharyya and Neblett to obtain the invention specified in claim 9.

Regarding claim 19 (drawn to a non-transitory storage medium), claim 19 is rejected on the same grounds as claim 9; the arguments presented above for claim 9 apply equally to claim 19, and the limitations shared with claim 9 are not repeated herein but are incorporated by reference.

Conclusion

The prior art made of record and not relied upon is considered pertinent to applicant's disclosure. Namjoshi et al. (“A Mask-RCNN based object detection and captioning framework for industrial videos.” International Journal of Advanced Technology and Engineering Exploration 8.84 (2021): 1466) discloses automated surveillance video analysis for industrial videos by detecting objects and captioning the video frames. Patalas-Maliszewska et al. (“An automated recognition of work activity in industrial manufacturing using convolutional neural networks.” Electronics 10.23 (2021): 2946) discloses assessing and analyzing employee activity in an industrial setting by recognizing and detecting work activity using Convolutional Neural Networks (CNNs). Maalej (US 2025/0267315 A1) discloses a method for capturing an image from a video stream and using an image-to-text AI model able to detect objects and actions in the image and generate text describing the context.

Any inquiry concerning this communication or earlier communications from the examiner should be directed to UROOJ FATIMA, whose telephone number is (571) 272-2096. The examiner can normally be reached M-F 8:00-5:00.
Examiner interviews are available via telephone, in person, and video conferencing using a USPTO-supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice. If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Henok Shiferaw, can be reached at (571) 272-4637. The fax number for the organization where this application or proceeding is assigned is 571-273-8300.

Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (in USA or Canada) or 571-272-1000.

/UROOJ FATIMA/
Examiner, Art Unit 2676

/Henok Shiferaw/
Supervisory Patent Examiner, Art Unit 2676

Prosecution Timeline

Apr 15, 2024
Application Filed
Mar 03, 2026
Non-Final Rejection — §101, §103, §112 (current)

Prosecution Projections

Expected OA Rounds: 1-2
Grant Probability: 100%
With Interview: 99% (+100.0%)
Median Time to Grant: 2y 9m
PTA Risk: Low

Based on 1 resolved case by this examiner. Grant probability derived from career allow rate.
