DETAILED ACTION
Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
Response to Amendment and Argument
Applicant’s amendments and arguments with respect to pending claims 1-20, filed on 09/15/2025, have been fully considered, but the arguments are moot in view of the new ground(s) of rejection necessitated by the amendment of the pending claims.
Claim Rejections - 35 USC § 103
In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis (i.e., changing from AIA to pre-AIA) for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.
Claims 1-20 are rejected under 35 U.S.C. 103 as being unpatentable over Wu (US 11951833 B1) in view of Xiao et al. (US 20250209815 A1) and Gu et al., “A Systematic Survey of Prompt Engineering on Vision-Language Foundation Models,” arXiv:2307.12980v1 [cs.CV], 24 Jul. 2023.
Regarding claim 1, Wu discloses one or more processors comprising processing circuitry to: identify one or more frames of image data that depict at least a portion of an interior space of an ego-machine (Figs. 1-4, col. 18, lines 37-67: A camera (e.g., the lens 112a and the capture device 102a) is shown capturing an interior of the ego vehicle 50 (e.g., detecting the driver 202)…By analyzing video of the driver 202 and/or other occupants of the ego vehicle 50 (e.g., extracting video data from the captured video), the processors 106a-106n may determine a body position and/or body characteristics (e.g., a distance, orientation and/or location of the body and/or head) of one or more occupants of the ego vehicle 50 and/or objects within the ego vehicle 50. See also col. 10, lines 53-64: determine a likelihood that pixels in the video frames belong to a particular object and/or objects, e.g., a person, car seat, seatbelt, etc.); generate, based at least on applying (col. 24, line 43 to col. 25, line 30, col. 27, lines 1-16, col. 53, lines 24-44: The training data 252a-252n may comprise a large amount of information (e.g., input video frames)...The circuits 256a-256n may implement artificial intelligence models…The circuit 254 may be configured to receive the training data 252a-252n...if the artificial intelligence model 256b is configured to detect a driver (or driver behavior), the training data 252a-252n may provide a ground truth sample of a person performing a particular behavior (e.g., driving)); and control one or more operations of the ego-machine based at least on the one or more responses (col. 53, line 46 to col. 54, line 7: If the driver 202 is the person detected providing input to the infotainment unit 352, then the method 600 may move to the step 618. In the step 618, the processors 106a-106n may generate the control signal VCTRL. The signal VCTRL may be configured to cause the CPU 354 of the infotainment unit 352 to disable input (e.g., ignore the input TOUCH_DRV)).
Wu does not disclose “generate a multimodal prompt comprising the one or more frames of image data and a text prompt representative of a query to determine whether the one or more frames of image data depict one or more conditions associated with at least one of an operator or an occupant; generate, based at least on applying the multimodal prompt to a vision-language model (VLM), one or more responses indicating whether the one or more conditions associated with at least one of the operator or the occupant are detected in the one or more frames of image data.”
However, Xiao discloses generate a multimodal prompt comprising the one or more frames of image data (abstract, ¶0002: A description of the content item may be received based on the respective visual prompt and the respective audio prompt for each frame of the plurality of frames input to a large language model (LLM) trained to output descriptive information for content items).
It would have been obvious to a person having ordinary skill in the art before the effective filing date of the claimed invention to modify Wu by incorporating the teaching of Xiao to enable a deep understanding of content items while reducing or eliminating the computational resources, coding, and complex systems required to implement other machine learning and artificial intelligence techniques (Xiao ¶0009).
Furthermore, Gu discloses “Prompt engineering is a technique that involves augmenting a large pre-trained model with task-specific hints, known as prompts, to adapt the model to new tasks…Prompt engineering enables the ability to perform predictions based solely on prompts without updating model parameters, and the easier application of large pre-trained models in real-world tasks…This paper aims to provide a comprehensive survey of cutting-edge research in prompt engineering on three types of vision-language models: multimodal-to-text generation models (e.g., Flamingo), image-text matching models (e.g., CLIP), and text-to-image generation models (e.g., Stable Diffusion).” Abstract; see also Fig. 1(a) (Multimodal-to-Text Generation), illustrating image and text prompts, pages 1-2. Gu further discloses “Compared to the traditional paradigm, prompt engineering has multiple advantages. Firstly, it requires a few labeled data to adapt a pre-trained model to new tasks, which greatly reduces the effort of human supervision and computation resource for finetuning. Secondly, prompt engineering enables a pre-trained model to perform predictions on new tasks solely based on the prompt without updating any of the model’s parameters, allowing serving a large scale of downstream tasks using the same model. This makes it possible to apply large-scale pre-trained models for real-world applications.” Introduction section, page 1. “Incorporating the visual modality into LLMs has opened up exciting opportunities for various applications, such as visual commonsense reasoning [15], visual question answering [1, 16, 17, 15], multimodal dialogue systems [18, 1, 12], etc. By combining textual and visual cues, VLMs have the potential to provide a more comprehensive understanding of multimodal data and produce outputs that align with human-like reasoning and perception [19]. Furthermore, the fusion of text and visual features within VLMs plays a crucial role in seamlessly integrating information from both modalities. This fusion process enables the model to capture interdependencies and interactions between textual and visual elements, resulting in more accurate and contextually grounded generations [19].” Section 3.1, pages 2-3.
It has been held by the courts that combining prior art elements according to known methods to yield predictable results, simple substitution of one known element for another to obtain predictable results, or choosing from a finite number of identified, predictable solutions, with a reasonable expectation of success, is not sufficient to distinguish over the prior art, as it requires only ordinary skill in the art. KSR International Co. v. Teleflex Inc., 82 USPQ2d 1385, 1397 (2007).
In this case, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the teaching of Wu in view of Xiao as noted above, by incorporating the teaching of Gu, to arrive at the claimed invention of “generate a multimodal prompt comprising the one or more frames of image data and a text prompt representative of a query to determine whether the one or more frames of image data depict one or more conditions associated with at least one of an operator or an occupant; generate, based at least on applying the multimodal prompt to a vision-language model (VLM), one or more responses indicating whether the one or more conditions associated with at least one of the operator or the occupant are detected in the one or more frames of image data” as recited in claim 1, since such a modification requires combining the prior art elements according to known methods (i.e., incorporating the multimodal technique of Gu into the LLM technique of Xiao) to yield the predictable result of improving the accuracy of tasks such as driver monitoring.
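Examiner's note: solely as a non-limiting illustration of the combined teaching applied to claim 1, the following Python sketch shows one way image frames and a text query could be assembled into a multimodal prompt, applied to a VLM, and used to control an operation of an ego-machine. All identifiers (e.g., MultimodalPrompt, vlm.generate, vehicle.disable_infotainment_input) are hypothetical and appear in none of the cited references.

from dataclasses import dataclass
from typing import Any, List

@dataclass
class MultimodalPrompt:
    frames: List[Any]  # one or more in-cabin image frames
    text: str          # text query about operator/occupant conditions

def query_vlm(vlm: Any, frames: List[Any]) -> dict:
    # Assemble the multimodal prompt: image data plus a text prompt
    # representative of a query about operator/occupant conditions.
    prompt = MultimodalPrompt(
        frames=frames,
        text=("Do these frames show the driver distracted or drowsy, or an "
              "occupant without a seatbelt? Answer yes/no per condition."),
    )
    # Apply the multimodal prompt to the VLM; the responses indicate
    # whether the queried conditions are detected in the frames.
    return vlm.generate(prompt.frames, prompt.text)

def control_ego_machine(responses: dict, vehicle: Any) -> None:
    # Control an operation of the ego-machine based on the responses,
    # e.g., disabling infotainment input (cf. Wu, col. 53-54).
    if responses.get("driver_distracted"):
        vehicle.disable_infotainment_input()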
Regarding claim 2, Wu in view of Xiao and Gu discloses the one or more processors of claim 1. Wu discloses the use of video frames as training data (col. 24, lines 14-22: the reference video frame and/or the current video frame may be used as training data for the CNN module 150. Col. 25, lines 1-34: if the artificial intelligence model 256b is configured to detect a driver (or driver behavior), the training data 252a-252n may provide a ground truth sample of a person performing a particular behavior (e.g., driving)).
As stated above, Wu discloses that video frames may be used as training data for the CNN module; however, Wu does not explicitly disclose updating the VLM, or one or more captions generated by a large language model based at least on metadata associated with the one or more training frames.
However, Xiao-Gu discloses updating the VLM (¶0032, 0040: Attention mechanisms can be employed to enable machine learning models of the content analysis module 132 to focus on different parts of content item data (e.g., frames, images, etc.) when determining relationships between visual elements/objects to enable the machine learning models to handle frames/scenes with many visual elements/objects or complex interactions. Based on the evaluation, machine learning models of the content analysis module 132 may be fine-tuned to improve accuracy), and one or more captions generated by a large language model based at least on metadata associated with the one or more training frames (¶0078: system server(s) 126 receives the description of the content item based on the respective visual prompt and the respective audio prompt for each frame of the plurality of frames input to a large language model (LLM) trained to output descriptive information for content items…system server(s) 126 receives the description of the content item based on an additional prompt for each frame of the plurality of frames generated from at least one of metadata associated with the content item, closed captioning data associated with the content item...). Gu discloses the VLM. See Fig. 1, abstract, Section 3.1. The motivation statement set forth above with respect to claim 1 applies here.
Regarding claim 3, Wu in view of Xiao and Gu discloses the one or more processors of claim 1. Xiao-Gu further discloses wherein the VLM is updated using one or more captions generated by a large language model based at least on metadata that is associated with one or more training frames (¶0032, 0040, 0078: system server(s) 126 receives the description of the content item based on the respective visual prompt and the respective audio prompt for each frame of the plurality of frames input to a large language model (LLM) trained to output descriptive information for content items…system server(s) 126 receives the description of the content item based on an additional prompt for each frame of the plurality of frames generated from at least one of metadata associated with the content item, closed captioning data associated with the content item...) and represents at least one of: one or more driver or occupant attributes, one or more instructed actions, or one or more session attributes represented in the one or more training frames (¶0037: By iterating through each frame of a content item, identifying objects and their relationships within the frames, and converting the information into prompts for an LLM, textual representations of the visual elements/objects within each frame may be generated, stored, analyzed, and/or displayed. To determine relationships between visual elements/objects depicted in frames of a content item, once visual elements/objects are identified, feature vectors representing attributes of the visual elements/objects may be extracted from the content item). Gu discloses the VLM. See Fig. 1, abstract, Section 3.1. The motivation statement set forth above with respect to claim 1 applies here.
Regarding claim 4, Wu in view of Xiao and Gu discloses the one or more processors of claim 1. Wu further discloses the use of video frames as training data (col. 24, lines 14-22: the reference video frame and/or the current video frame may be used as training data for the CNN module 150. Col. 25, lines 1-34: if the artificial intelligence model 256b is configured to detect a driver (or driver behavior), the training data 252a-252n may provide a ground truth sample of a person performing a particular behavior (e.g., driving)).
As stated above, Wu discloses that video frames may be used as training data for the CNN module; however, Wu does not explicitly disclose updating the VLM.
However, Xiao-Gu discloses updating the VLM (¶0032, 0040: Attention mechanisms can be employed to enable machine learning models of the content analysis module 132 to focus on different parts of content item data (e.g., frames, images, etc.) when determining relationships between visual elements/objects to enable the machine learning models to handle frames/scenes with many visual elements/objects or complex interactions. Based on the evaluation, machine learning models of the content analysis module 132 may be fine-tuned to improve accuracy), and one or more captions generated by a large language model based at least on metadata associated with the one or more training frames (¶0036, 0078: system server(s) 126 receives the description of the content item based on the respective visual prompt and the respective audio prompt for each frame of the plurality of frames input to a large language model (LLM) trained to output descriptive information for content items…system server(s) 126 receives the description of the content item based on an additional prompt for each frame of the plurality of frames generated from at least one of metadata associated with the content item, closed captioning data associated with the content item...). Gu discloses the VLM. See Fig. 1, abstract, Section 3.1. The motivation statement set forth above with respect to claim 1 applies here.
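Examiner's note: solely as a non-limiting illustration of the mapping for claims 2-4, the following Python sketch shows one way a VLM could be updated (fine-tuned) using captions that a large language model generates from metadata associated with training frames (cf. Xiao ¶0032, 0040, 0078). The llm and vlm objects and their methods are hypothetical, not APIs disclosed by the cited references.

from typing import Any, List, Tuple

def build_caption_dataset(training_frames: List[Any], llm: Any) -> List[Tuple[Any, str]]:
    dataset = []
    for frame in training_frames:
        # Generate a caption with a large language model from the metadata
        # associated with the training frame (cf. Xiao ¶0078).
        caption = llm.generate(
            "Describe the cabin scene given this metadata: " + str(frame.metadata)
        )
        dataset.append((frame.image, caption))
    return dataset

def update_vlm(vlm: Any, training_frames: List[Any], llm: Any) -> None:
    # Update the VLM on frame/caption pairs (cf. Xiao ¶0040, fine-tuning
    # machine learning models to improve accuracy).
    vlm.fine_tune(build_caption_dataset(training_frames, llm))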
Regarding claim 5, Wu in view of Xiao and Gu teaches the one or more processors of claim 1. Wu in view of Xiao and Gu further teaches wherein the processing circuitry is further to periodically prompt (Wu: col. 32, lines 30-56: Referring to FIG. 4, a diagram illustrating an object comparison between a reference video frame and a captured video frame is shown. The reference video frame 300 and the current video frame 300′ may be video frames processed by the processors 106a-106n (e.g., generated in response to the signals FRAMES_A-FRAMES_N by one of the capture devices 102a-102n)…the reference video frame 300 may be captured periodically. Col. 33, lines 26-41: the reference video frame 300 may provide a reference for when the status of the seat belt 304 is unused (e.g., not being worn by a passenger/driver)). Gu discloses the VLM. See Fig. 1, abstract, Section 3.1. The motivation statement set forth above with respect to claim 1 applies here.
Regarding claim 6, Wu in view of Xiao and Gu teaches the one or more processors of claim 1. Wu further teaches a sliding window of video frames generated using one or more cameras of a driver or occupant monitoring system (col. 10, lines 37-50, col. 46, lines 63-67: The processors 106a-106n may analyze the video frame 500 and other temporally related video frames (e.g., video frames captured before and after the video frame 500) to determine the movement and/or behavior of the passenger 502b). Furthermore, Xiao-Gu teaches wherein the processing circuitry is further to prompt the VLM using a representation of the one or more frames sampled from a sliding window of video frames generated (¶0071: visual features from video data and audio features from audio data associated with each frame of a content item may be transformed into a temporal sequence of embeddings. The temporal sequence of embeddings may correspond to a sequence of occurrence for each frame of the content item. As described herein the LLM may use the visual and audio embeddings to generate an output that describes what is happening in the content item). Gu discloses the VLM. See Fig. 1, abstract, Section 3.1. The motivation statement set forth above with respect to claim 1 applies here.
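Examiner's note: solely as a non-limiting illustration of the sliding-window limitation of claim 6, the following Python sketch samples frames from a sliding window of camera frames and prompts a VLM with the sampled representation. The window length, stride, and vlm.generate call are hypothetical choices, not taught values.

from collections import deque
from typing import Any, Optional

WINDOW = 30  # sliding window length in frames (hypothetical)
STRIDE = 10  # sample every tenth frame from the window (hypothetical)

window: deque = deque(maxlen=WINDOW)  # oldest frames drop off automatically

def on_new_frame(frame: Any, vlm: Any) -> Optional[dict]:
    # Maintain a sliding window of video frames generated by one or more
    # cameras of a driver or occupant monitoring system.
    window.append(frame)
    if len(window) < WINDOW:
        return None
    # Prompt the VLM using a representation of frames sampled from the
    # sliding window (cf. Xiao ¶0071, temporal sequence of embeddings).
    sampled = list(window)[::STRIDE]
    return vlm.generate(sampled, "Describe occupant movement and behavior across these frames.")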
Regarding claim 7, Wu in view of Xiao and Gu teaches the one or more processors of claim 1. Wu further teaches wherein the one or more responses by the (col. 40, line 15 to col. 41, line 14: The CNN module 150 may be configured to detect features and/or descriptors in the example video frame 450 and compare the features and/or descriptors against the features and/or descriptors learned from the training data 252a-252n in order to recognize the pixels of the video frame 500 that correspond to the body parts of the occupants). Furthermore, Xiao-Gu discloses one or more responses by the VLM (¶0060-0063: As described, an LLM of the content analysis module 132 may output the following Example Response to the Example Prompts. According to some aspects, Example Response is just an example and other responses may be output based on deep video understanding with large language models). Gu discloses the VLM. See Fig. 1, abstract, Section 3.1. The motivation statement set forth above with respect to claim 1 applies here.
Regarding claim 8, Wu in view of Xiao and Gu discloses the one or more processors of claim 1. Wu further teaches wherein the one or more responses indicate one or more results of occlusion detection performed based at least on the (col. 43, lines 48-54: the arm 462 may not be visibly connected to the body of the driver 202 (e.g., partially blocked by the steering wheel 408). The processors 106a-106n may determine how likely that the arm 462 is the arm of the driver 202 (e.g., the more likely that the arm 462 belongs to the driver 202, the more the confidence level may be increased)). Furthermore, Xiao-Gu discloses a predictive model trained for object detection and classification iterating through each frame of the plurality of frames of the content item (Xiao ¶0032, 0075). Gu discloses the VLM. See Fig. 1, abstract, Section 3.1. The motivation statement set forth above with respect to claim 1 applies here.
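Examiner's note: solely as a non-limiting illustration of the occlusion-detection mapping of claim 8, the following Python sketch reports an occlusion result with a confidence value reflecting how likely a partially blocked body part belongs to the driver (cf. Wu, col. 43, lines 48-54). The detection fields and the 0.5 threshold are hypothetical.

from typing import Any

def occlusion_response(detection: Any) -> dict:
    # Combine the visible fraction of the body part with the score that it
    # belongs to the driver to produce a confidence level (cf. Wu).
    confidence = detection.visible_fraction * detection.association_score
    return {
        "body_part": detection.label,
        "occluded": detection.visible_fraction < 0.5,  # hypothetical threshold
        "confidence": confidence,
    }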
Regarding claim 9, Wu discloses wherein the one or more operations of the ego-machine comprise at least one of: issuing an audible or visual alert, adjusting one or more in-vehicle infotainment settings, or activating one or more safety systems (col. 3, lines 36-56, col. 54, lines 1-8: If the driver 202 is the person detected providing input to the infotainment unit 352, then the method 600 may move to the step 618. In the step 618, the processors 106a-106n may generate the control signal VCTRL. The signal VCTRL may be configured to cause the CPU 354 of the infotainment unit 352 to disable input (e.g., ignore the input TOUCH_DRV)).
Regarding claim 10, Wu discloses wherein the one or more processors are comprised in at least one of: a control system for an autonomous or semi-autonomous machine; a perception system for an autonomous or semi-autonomous machine; a system for performing simulation operations; a system for performing digital twin operations; a system for performing light transport simulation; a system for performing collaborative content creation for 3D assets; a system for performing deep learning operations; a system for performing remote operations; a system for performing real-time streaming; a system for generating or presenting one or more of augmented reality content, virtual reality content, or mixed reality content; a system implemented using an edge device; a system implemented using a robot; a system for performing conversational AI operations; a system implementing one or more language models; a system implementing one or more large language models (LLMs); a system implementing one or more vision language models (VLMs); a system for generating synthetic data; a system for generating synthetic data using AI; a system for performing one or more generative AI operations; a system incorporating one or more virtual machines (VMs); a system implemented at least partially in a data center; or a system implemented at least partially using cloud computing resources (col. 13, lines 33-45: an autonomous vehicle implementation. Col. 10, lines 23-36, col. 25, lines 32-40: deep learning techniques).
Regarding claims 11-20, the claims are drawn to a system and a method and recite limitations analogous to those of claims 1-7 and 10, and are rejected for the same reasons set forth above with respect to claims 1-7 and 10.
The following prior art, made of record and not relied upon, is considered pertinent to applicant's disclosure.
Alpert et al. (US 20210312193 A1) describes “devices for assisting operation of vehicles, and more particularly, to devices and methods for predicting intersection violations, predicting collisions, or both.” ¶0001.
Palan (US 20140266655 A1) describes “a vehicle driving assistance system configured to assist a driver with respect to the operation of a vehicle.” ¶0001.
Conclusion
Applicant's amendment necessitated the new ground(s) of rejection presented in this Office action. Accordingly, THIS ACTION IS MADE FINAL. See MPEP § 706.07(a). Applicant is reminded of the extension of time policy as set forth in 37 CFR 1.136(a).
A shortened statutory period for reply to this final action is set to expire THREE MONTHS from the mailing date of this action. In the event a first reply is filed within TWO MONTHS of the mailing date of this final action and the advisory action is not mailed until after the end of the THREE-MONTH shortened statutory period, then the shortened statutory period will expire on the date the advisory action is mailed, and any nonprovisional extension fee (37 CFR 1.17(a)) pursuant to 37 CFR 1.136(a) will be calculated from the mailing date of the advisory action. In no event, however, will the statutory period for reply expire later than SIX MONTHS from the mailing date of this final action.
Any inquiry concerning this communication or earlier communications from the examiner should be directed to NATHNAEL AYNALEM whose telephone number is (571) 270-1482. The examiner can normally be reached M-F, 9:00 AM-5:30 PM ET.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, SATH PERUNGAVOOR can be reached at 571-272-7455. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.
/NATHNAEL AYNALEM/Primary Examiner, Art Unit 2488