Prosecution Insights
Last updated: April 19, 2026
Application No. 18/656,309

SYSTEM(S) AND METHOD(S) FOR UTILIZING GENERATIVE MODEL(S) TO GENERATE CONTENT RESPONSIVE TO VIDEO DATA

Status: Non-Final Office Action (§102)
Filed: May 06, 2024
Examiner: TRAN, DUY ANH
Art Unit: 2674
Tech Center: 2600 (Communications)
Assignee: Google LLC
OA Round: 1 (Non-Final)

Grant Probability: 81% (Favorable)
Expected OA Rounds: 1-2
Time to Grant: 3y 1m
With Interview: 99%

Examiner Intelligence

Career Allow Rate: 81% (104 granted / 128 resolved), +19.3% vs Tech Center average (above average)
Interview Lift: strong, +17.5% higher allowance across resolved cases with an interview than without
Typical Timeline: 3y 1m average prosecution; 29 applications currently pending
Career History: 157 total applications across all art units

Statute-Specific Performance

§101: 12.9% (-27.1% vs TC avg)
§103: 42.0% (+2.0% vs TC avg)
§102: 26.7% (-13.3% vs TC avg)
§112: 11.3% (-28.7% vs TC avg)
Deltas are measured against an estimated Tech Center average. Based on career data from 128 resolved cases.
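Read literally, each delta is the examiner's per-statute figure minus that Tech Center baseline. A quick check under this assumption (illustrative Python, not the tool's code) shows every row implies the same baseline near 40%, which fits the note that the baseline is an estimate:

```python
# Assumed reading of the deltas: delta = examiner figure - Tech Center average (percentage points).
figures = {"101": (12.9, -27.1), "103": (42.0, 2.0), "102": (26.7, -13.3), "112": (11.3, -28.7)}
for statute, (examiner_pct, delta) in figures.items():
    implied_tc_avg = examiner_pct - delta  # each row works out to roughly 40%
    print(f"§{statute}: examiner {examiner_pct}%, implied TC average {implied_tc_avg:.1f}%")
```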

Office Action

§102
DETAILED ACTION Notice of Pre-AIA or AIA Status The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA . Information Disclosure Statement The information disclosure statement (IDS) submitted on 05/06/2024, 06/17/2025, 08/06/2025 is in compliance with the provisions of 37 CFR 1.97. Accordingly, the information disclosure statement is being considered by the examiner. Claim Status Claim(s) 1-20 is/are rejected under 35 U.S.C. 102(a)(1) as being anticipated by Divakaran et al (U.S. 20170160813 A1; Divakaran). Claim Rejections - 35 USC § 102 In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis (i.e., changing from AIA to pre-AIA ) for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status. The following is a quotation of the appropriate paragraphs of 35 U.S.C. 102 that form the basis for the rejections under this section made in this Office action: A person shall be entitled to a patent unless – (a)(1) the claimed invention was patented, described in a printed publication, or in public use, on sale, or otherwise available to the public before the effective filing date of the claimed invention. (a)(2) the claimed invention was described in a patent issued under section 151, or in an application for patent published or deemed published under section 122(b), in which the patent or application, as the case may be, names another inventor and was effectively filed before the effective filing date of the claimed invention. Claim(s) 1-20 is/are rejected under 35 U.S.C. 102(a)(1) as being anticipated by Divakaran et al (U.S. 20170160813 A1; Divakaran). Regarding claim 1, Divakaran discloses a method implemented by one or more processors, (Paragraph 3: “methods, systems including computing devices, and computer-program products are provided for implementing a virtual personal assistant.” Paragraph 344: “The program code may be executed by a processor, which may include one or more processors”) the method comprising: receiving a stream of vision data, the stream of vision data being generated based on sensor data from one or more vision components of a client device; (Figs. 1 and 5; Paragraph 50: “ A person 100 using a smartphone 102 that includes a virtual personal assistant 150 can interact with the smartphone 102 using various sensory input, such as audio input 110, image input 120, and/or tactile input 130. … As another example, the person 100 can provide image input 120, captured for example by a camera.”; Paragraph 95: “FIG. 5 illustrates in greater detail an example of the audio understanding 414 and image understanding 416 components of the virtual personal assistant platform 410. … the image understanding 416 component can extract and interpret information in images.”) receiving a representation of a spoken utterance, the spoken utterance being captured in audio data generated by one or more microphones of the client device; (Figs. 1 and 5; Paragraph 50: “ A person 100 using a smartphone 102 that includes a virtual personal assistant 150 can interact with the smartphone 102 using various sensory input, such as audio input 110, image input 120, and/or tactile input 130. 
the person 100 can provide audio input 110, captured for example by a microphone, by speaking to the smartphone 102.” ; Paragraph 95: “FIG. 5 illustrates in greater detail an example of the audio understanding 414 and image understanding 416 components of the virtual personal assistant platform 410. As noted above, the audio understanding 414 component can extract and interpret information from audio input”.) processing, using a generative model (GM), first GM input (Fig.5: the audio, video, and/or tactile input 502) to generate corresponding first GM output (Fig.5: populated intent 530; Fig.6 :populated intent 630), (Figs. 5 and 22 ; Paragraph 99: “the parameter extractor 518 can associate the word classification 526 with the generalized unpopulated intent 528 so that the intent expressed in the audio, video, and/or tactile input 502 can be made more definite. The result is the populated intent 530.”; Paragraphs 267-282: “FIG. 22 illustrates an example of an interaction assistant 2210. An interaction assistant 2210 can be configured to analyze and interpret both verbal 2230 and non-verbal 2228 inputs and identify therefrom the various types of verbal and/or non-verbal behavioral cues 2232 that may be expressed by a person who has provided the input. … The interaction modeler 2214 can enable modeling of an interaction within the interaction's context, as it evolves over time … generative models (such as Hidden Markov Models (HMMs)) or a combination of discriminative and generative models may be used to model certain aspects of an interaction.”) the first GM input comprising at least the stream of vision data and the representation of the spoken utterance; (Paragraph 95: “FIG. 5 illustrates an example of a general understanding system 500 that can convert user input into a user intent. In various implementations, the understanding system 500 receives audio, video, and/or tactile input 502 and events 504.”) determining, based on the first GM output, a subset of the stream of vision data; (Paragraph 279: “generative models (such as Hidden Markov Models (HMMs)) or a combination of discriminative and generative models may be used to model certain aspects of an interaction. For example, in some embodiments, HMMs may be used to identify transition points in the interaction (such as conversational turns or the beginning or end of a phase of the interaction),”) processing, using the GM, second GM input to generate corresponding second GM output, the second GM input comprising at least the subset of the stream of vision data and the representation of the spoken utterance; (Figs. 4-6: Interpretation 418 and Paragraph 78: “The interpretation 418 component can use the information extracted from audio and visual information by the audio understanding 414 and image understanding 416 components, and attempt to determine a person's current intent. “Intent” in this context means an objective, goal, task, purpose, request, or meaning intended by the verbal and/or visual input.”; Paragraphs 101-102 and 328) determining, based on the second GM output, responsive content, wherein the responsive content is responsive to the spoken utterance and the stream of vision data; (Figs. 4-6: Reasoning 420, Output generation 422 and Paragraph 80- 81: “The reasoning 420 component can receive the input intent and input state and determine a reasoned task, or course of action. … The reasoning 420 component may be assisted by pre-defined workflows, including domain-specific workflows, as well as models and rules. 
… The output generation 422 component can create responses, which can be output using natural language and/or a visual display. For example, the output generation 422 can formulate a textual response, and indicate whether the textual response should be displayed on a screen or vocalized. As another example, the output generation 422 can assemble. a combined textual and visual response.”) and causing the client device to render the responsive content.(Figs. 4-6: Text to Speech 424 and Paragraph 82: “ The text to speech 424 component can convert text output, such as may be provided by the output generation 422 component, to audio output. Other output, such as text output to be displayed or graphic output to be displayed on a screen, can be provided directly to the user interface of the user interface and virtual personal assistant client application 450.”) Regarding claim 2, Divakaran discloses wherein the stream of vision data comprises a plurality of sequential image frames. (Paragraph 223: “the video classification system 1912 can develop and use the video event model 1914 to identify simple and complex events that likely are depicted in still and moving images (e.g., videos) provided by the video capture device 1902. In some cases, the images being provided by the video capture device 1902 are real-time, that is, being delivered as they are being captured.”; Paragraph 225: “a complex event may include elements that have been juxtaposed (e.g., they may occur together, either in the same frame or in a sequence of frames of the video)”) Regarding claim 3, Divakaran discloses wherein the plurality of sequential image frames corresponds to a time period in which the spoken utterance was spoken. (Paragraph 78: “ The interpretation 418 component can use the information extracted from audio and visual information by the audio understanding 414 and image understanding 416 components … a person may say, “what is that?” and point at an object. In this example, the interpretation 418 component may, from the verbal input, determine that the person's intent is for the virtual personal assistant system 400 to identify something.”) Regarding claim 4, Divakaran discloses wherein the subset of the stream of vision data comprises a subset of the plurality of sequential image frames. (Paragraph 223: “the video classification system 1912 can develop and use the video event model 1914 to identify simple and complex events that likely are depicted in still and moving images (e.g., videos) provided by the video capture device 1902. In some cases, the images being provided by the video capture device 1902 are real-time, that is, being delivered as they are being captured.”; Paragraph 225: “a complex event may include elements that have been juxtaposed (e.g., they may occur together, either in the same frame or in a sequence of frames of the video)”) Regarding claim 5, Divakaran discloses wherein the stream of vision data captures an environment of the client device, wherein the responsive content is responsive to an object in the environment captured by the stream of vision data. (Paragraph 78: “ The interpretation 418 component can use the information extracted from audio and visual information by the audio understanding 414 and image understanding 416 components … a person may say, “what is that?” and point at an object. 
In this example, the interpretation 418 component may, from the verbal input, determine that the person's intent is for the virtual personal assistant system 400 to identify something.”) Regarding claim 6, Divakaran discloses wherein the stream of vision data includes one or more frames capturing a hand of a user pointing toward the object. (Paragraph 78: “ The interpretation 418 component can use the information extracted from audio and visual information by the audio understanding 414 and image understanding 416 components … a person may say, “what is that?” and point at an object. In this example, the interpretation 418 component may, from the verbal input, determine that the person's intent is for the virtual personal assistant system 400 to identify something.”) Regarding claim 7, Divakaran discloses wherein the spoken utterance identifies the object based on one or more properties of the object. (Paragraph 78: “ The interpretation 418 component can use the information extracted from audio and visual information by the audio understanding 414 and image understanding 416 components … a person may say, “what is that?” and point at an object. In this example, the interpretation 418 component may, from the verbal input, determine that the person's intent is for the virtual personal assistant system 400 to identify something.” ; (Paragraph 244: “ Extracted features can also include, for example, the direction a person is looking when speaking particular words (e.g. “what is that?”) and objects detected as probably being within the person's field of view.”) Regarding claim 8, Divakaran discloses wherein the properties of the object comprise a location of the object in the environment captured in the stream of vision data. (Paragraph 244: “ Extracted features can also include, for example, the direction a person is looking when speaking particular words (e.g. “what is that?”) and objects detected as probably being within the person's field of view.”) Regarding claim 9, Divakaran discloses wherein the properties of the object comprise a color of the object. (Paragraph 120: “in developing a virtual personal assistant system for an e-commerce vendor that sells jeans, the shareable ontology 912 may be defined to include “jeans” as an ontological concept having properties of style, color, size, and care instructions.”; Paragraph 227: “the complex event recognition engine 1950 includes a feature recognition module 1952, a semantic representation module 1954, and a complex event classification module 1956. … Some examples of static visual feature detectors include Gist, Scale-Invariant Feature Transform (SIFT), and colorSIFT.”) Regarding claim 10, Divakaran discloses wherein the spoken utterance includes a request to identify the object, from among a plurality of objects present in the environment, based on a prominence of the object in the stream of vision data. (Paragraph 78: “ the interpretation 418 component may, from the verbal input, determine that the person's intent is for the virtual personal assistant system 400 to identify something. 
Furthermore, the interpretation 418 component may, from image input, determine that the thing to be identified is an object being pointed at.”) Regarding claim 11, Divakaran discloses wherein the prominence of the object is determined based on one or more of: a size of the object in the stream of vision data, a number and/or percentage of frames of the stream of vision data capturing the object, and a determined distance between the client device and the object. (Paragraph 244: “Features extracted by the eye gaze detection and feature extractor 2006b can be used to determine where a person is looking. … his information can be used to identify an object that the person is looking at. Extracted features include, for example, whether the person is looking at the system display, what percentage of the time the person spends looking at the display, what parts of the display the person focuses on, how close the person's focus is to the desired areas of focus, and what percentage of the time the person spends looking at the desired area of focus.”) Regarding claim 12, Divakaran discloses further comprising: receiving subsequent user input; responsive to determining, based on the subsequent user input, to determine an alternative subset of the stream of vision data; (Figs. 10 and 24-25 and Paragraph 141-142: “ the reasoner 1018 may include a dialog manager module, which keeps track of the current state and flow of each conversation or dialog that occurs between a person and the virtual personal assistant system 1010. … Once the reasoner 1018 has determined an appropriate course of action by which to respond to the person's inputs, the reasoner 1018 can communicate an “output intent” to the output generator 1020. … if the user intent is “buy product” but the reasoner 1018 determines by executing a “check stock” task flow that the product the user wants to buy is not available for purchase, the output intent may be “offer alternative product.”) processing, using the GM, third GM input to generate corresponding third GM output, the third GM input comprising at least the subset of the stream of vision data, the representation of the spoken utterance, and a representation of the subsequent user input; (Figs. 5 and 22 ; Paragraph 99: “the parameter extractor 518 can associate the word classification 526 with the generalized unpopulated intent 528 so that the intent expressed in the audio, video, and/or tactile input 502 can be made more definite. The result is the populated intent 530.”; Paragraphs 267-282: “FIG. 22 illustrates an example of an interaction assistant 2210. An interaction assistant 2210 can be configured to analyze and interpret both verbal 2230 and non-verbal 2228 inputs and identify therefrom the various types of verbal and/or non-verbal behavioral cues 2232 that may be expressed by a person who has provided the input. … The interaction modeler 2214 can enable modeling of an interaction within the interaction's context, as it evolves over time … generative models (such as Hidden Markov Models (HMMs)) or a combination of discriminative and generative models may be used to model certain aspects of an interaction.”) determining, based on the third GM output, the alternative subset of the stream of vision data; (Paragraph 279: “generative models (such as Hidden Markov Models (HMMs)) or a combination of discriminative and generative models may be used to model certain aspects of an interaction. 
For example, in some embodiments, HMMs may be used to identify transition points in the interaction (such as conversational turns or the beginning or end of a phase of the interaction),”) processing, using the GM, fourth GM input to generate corresponding fourth GM output, the fourth GM input comprising at least the alternative subset of the stream of vision data, the representation of the spoken utterance, and optionally the representation of the subsequent user input; (Figs. 4-6: Interpretation 418 and Paragraph 78: “The interpretation 418 component can use the information extracted from audio and visual information by the audio understanding 414 and image understanding 416 components, and attempt to determine a person's current intent. “Intent” in this context means an objective, goal, task, purpose, request, or meaning intended by the verbal and/or visual input.”; Paragraphs 101-102 and 328) determining, based on the fourth GM output, additional responsive content, wherein the additional responsive content is responsive to the spoken utterance, the stream of vision data and the subsequent user input; (Figs. 4-6: Reasoning 420, Output generation 422 and Paragraph 80- 81: “The reasoning 420 component can receive the input intent and input state and determine a reasoned task, or course of action. … The reasoning 420 component may be assisted by pre-defined workflows, including domain-specific workflows, as well as models and rules. … The output generation 422 component can create responses, which can be output using natural language and/or a visual display. For example, the output generation 422 can formulate a textual response, and indicate whether the textual response should be displayed on a screen or vocalized. As another example, the output generation 422 can assemble. a combined textual and visual response.”) and causing the client device to render the additional responsive content. (Figs. 4-6: Text to Speech 424 and Paragraph 82: “ The text to speech 424 component can convert text output, such as may be provided by the output generation 422 component, to audio output. Other output, such as text output to be displayed or graphic output to be displayed on a screen, can be provided directly to the user interface of the user interface and virtual personal assistant client application 450.”, a person of ordinary skill in the art would know that the virtual personal assistant performs the same function when it receives subsequent input, such as a conversation or dialog that occurs between a person and the virtual personal assistant.) Regarding claim 13, Divakaran discloses further comprising: receiving subsequent user input; responsive to determining, based on the subsequent user input, to determine additional responsive content without determining an alternative subset of the stream of vision data, (Figs.
10 and 24-25 and Paragraph 141-142: “From this analysis, the reasoner 1018 can determine a likely appropriate task to execute on the person's behalf and/or a likely appropriate system response to the person's intended goal or objective as derived from the meaning of the inputs and reflected in the user intent … he likely appropriate system task or response may be to ask the person for additional information, while in other cases, the likely appropriate system task or response may involve building a search query based on the inputs and execute an information retrieval process … the reasoner 1018 may include a dialog manager module, which keeps track of the current state and flow of each conversation or dialog that occurs between a person and the virtual personal assistant system 1010.) processing, using the GM, fifth GM input to generate corresponding fifth GM output, the fifth GM input comprising at least the subset of the stream of vision data, the representation of the spoken utterance, and a representation of the subsequent user input; (Figs. 5 and 22 ; Paragraph 99: “the parameter extractor 518 can associate the word classification 526 with the generalized unpopulated intent 528 so that the intent expressed in the audio, video, and/or tactile input 502 can be made more definite. The result is the populated intent 530.”; Paragraphs 267-282: “FIG. 22 illustrates an example of an interaction assistant 2210. An interaction assistant 2210 can be configured to analyze and interpret both verbal 2230 and non-verbal 2228 inputs and identify therefrom the various types of verbal and/or non-verbal behavioral cues 2232 that may be expressed by a person who has provided the input. … The interaction modeler 2214 can enable modeling of an interaction within the interaction's context, as it evolves over time … generative models (such as Hidden Markov Models (HMMs)) or a combination of discriminative and generative models may be used to model certain aspects of an interaction.”) determining, based on the fifth GM output, additional responsive content, wherein the additional responsive content is responsive to the spoken utterance, the stream of vision data, and the subsequent user input; (Figs. 4-6: Reasoning 420, Output generation 422 and Paragraph 80- 81: “The reasoning 420 component can receive the input intent and input state and determine a reasoned task, or course of action. … The reasoning 420 component may be assisted by pre-defined workflows, including domain-specific workflows, as well as models and rules. … The output generation 422 component can create responses, which can be output using natural language and/or a visual display. For example, the output generation 422 can formulate a textual response, and indicate whether the textual response should be displayed on a screen or vocalized. As another example, the output generation 422 can assemble. a combined textual and visual response.”) and causing the client device to render the additional responsive content. .(Figs. 4-6: Text to Speech 424 and Paragraph 82: “ The text to speech 424 component can convert text output, such as may be provided by the output generation 422 component, to audio output. 
Other output, such as text output to be displayed or graphic output to be displayed on a screen, can be provided directly to the user interface of the user interface and virtual personal assistant client application 450.” A person of ordinary skill in the art would know that the virtual personal assistant performs the same function when it receives subsequent input, such as a conversation or dialog that occurs between a person and the virtual personal assistant.) Regarding claim 14, Divakaran discloses wherein the responsive content is responsive to an object from among a plurality of objects captured by the stream of vision data, and wherein the subsequent user input is indicative of a request for additional responsive content responsive to another of the plurality of objects captured by the stream of vision data. (Figs.24-25 and Paragraph 321-322: “At step 2404, the person 2400 asks: “Can you find me a Chinese restaurant in Menlo Park?” Using automatic speech recognition and natural language understanding, the virtual personal assistant can determine that the person's 2400 intent is to location a particular type of restaurant (Chinese) in a particular city (Menlo Park). … At step 2408, the person 2400 may be satisfied with the virtual personal assistant's suggestion, but may further ask a seemingly unrelated question: “Sure. Can you show me the bookstore on the map?”; Paragraphs 328-329) Regarding claim 15, Divakaran discloses wherein the representation of the subsequent user input is indicative of a request to generate additional responsive content which is not responsive to the object. (Figs.24-25 and Paragraph 321-322: “At step 2404, the person 2400 asks: “Can you find me a Chinese restaurant in Menlo Park?” Using automatic speech recognition and natural language understanding, the virtual personal assistant can determine that the person's 2400 intent is to location a particular type of restaurant (Chinese) in a particular city (Menlo Park). … At step 2408, the person 2400 may be satisfied with the virtual personal assistant's suggestion, but may further ask a seemingly unrelated question: “Sure. Can you show me the bookstore on the map?”; Regarding claim 16, Divakaran discloses a method implemented by one or more processors (Paragraph 3: “methods, systems including computing devices, and computer-program products are provided for implementing a virtual personal assistant.”; Paragraph 344: “The program code may be executed by a processor, which may include one or more processors”), the method comprising: obtaining sensor data captured by one or more sensors of a client device, wherein the sensor data comprises at least a stream of vision data generated by one or more vision components of the client device, and audio data generated by one or more microphones of the client device; (Figs. 1 and 5; Paragraph 50: “ A person 100 using a smartphone 102 that includes a virtual personal assistant 150 can interact with the smartphone 102 using various sensory input, such as audio input 110, image input 120, and/or tactile input 130. the person 100 can provide audio input 110, captured for example by a microphone, by speaking to the smartphone 102 … the person 100 can provide image input 120, captured for example by a camera.”) determining, based on the audio data, a representation of a spoken utterance captured in the audio data; Paragraph 95: “FIG.
5 illustrates in greater detail an example of the audio understanding 414 and image understanding 416 components of the virtual personal assistant platform 410. As noted above, the audio understanding 414 component can extract and interpret information from audio input”.) determining a subset of the stream of vision data; (Figs. 5 and 22 ; Paragraph 99: “the parameter extractor 518 can associate the word classification 526 with the generalized unpopulated intent 528 so that the intent expressed in the audio, video, and/or tactile input 502 can be made more definite. The result is the populated intent 530.”; Paragraph 279: “generative models (such as Hidden Markov Models (HMMs)) or a combination of discriminative and generative models may be used to model certain aspects of an interaction. For example, in some embodiments, HMMs may be used to identify transition points in the interaction (such as conversational turns or the beginning or end of a phase of the interaction),”) sending, to a remote computing device (Fig.10 and Paragraph 133: “he verbal and non-verbal inputs captured by the multi-modal user interface 1012 can be processed by the virtual personal assistant platform 1014. … the virtual personal assistant platform 1014, may be located external to the virtual personal assistant 1010, and communicate with the virtual personal assistant 1010 by a communication link, such as for example a network connection.”), the subset of the stream of vision data and the representation of the spoken utterance; (Figs. 4-6: Interpretation 418 and Paragraph 78: “The interpretation 418 component can use the information extracted from audio and visual information by the audio understanding 414 and image understanding 416 components, and attempt to determine a person's current intent. “Intent” in this context means an objective, goal, task, purpose, request, or meaning intended by the verbal and/or visual input.”; Paragraphs 101-102 and 328; Paragraph 138: “(Fig.10 and Paragraph 133: “he verbal and non-verbal inputs captured by the multi-modal user interface 1012 can be processed by the virtual personal assistant platform 1014. … the virtual personal assistant platform 1014, may be located external to the virtual personal assistant 1010, and communicate with the virtual personal assistant 1010 by a communication link, such as for example a network connection.”) and receiving, from the remote computing device, (Fig.10 and Paragraph 133: “he verbal and non-verbal inputs captured by the multi-modal user interface 1012 can be processed by the virtual personal assistant platform 1014. … the virtual personal assistant platform 1014, may be located external to the virtual personal assistant 1010, and communicate with the virtual personal assistant 1010 by a communication link, such as for example a network connection.”) responsive content, wherein the responsive content is responsive to the stream of vision data and the spoken utterance; (Figs. 4-6: Reasoning 420, Output generation 422 and Paragraph 80- 81: “The reasoning 420 component can receive the input intent and input state and determine a reasoned task, or course of action. … The reasoning 420 component may be assisted by pre-defined workflows, including domain-specific workflows, as well as models and rules. … The output generation 422 component can create responses, which can be output using natural language and/or a visual display. 
For example, the output generation 422 can formulate a textual response, and indicate whether the textual response should be displayed on a screen or vocalized. As another example, the output generation 422 can assemble. a combined textual and visual response.”; Paragraph 142: “Once the reasoner 1018 has determined an appropriate course of action by which to respond to the person's inputs, the reasoner 1018 can communicate an “output intent” to the output generator 1020.) and rendering, at the client device, the responsive content. (Figs. 4-6: Text to Speech 424 and Paragraph 82: “ The text to speech 424 component can convert text output, such as may be provided by the output generation 422 component, to audio output. Other output, such as text output to be displayed or graphic output to be displayed on a screen, can be provided directly to the user interface of the user interface and virtual personal assistant client application 450.”) Regarding claim 17, Divakaran discloses wherein determining the subset of the stream of vision data comprises: determining, based on the audio data, which frames of the stream of vision data correspond to a period of time in which the spoken utterance was spoken. (Paragraph 279: “ HMMs may be used to identify transition points in the interaction (such as conversational turns or the beginning or end of a phase of the interaction), while CRFs may be used to capture and analyze the non-stationarity of the behavioral cues during the segments of the interaction identified by the HMMs.” ;Paragraph 236: “ a set of input images, the video-specific model 1924 may contain information about instances of semantic elements 1928 detected in the input images by type (e.g., actors, scenes, actions, objects, audio, text, or geographic location) and information about each complex event 1926 detected in the input images. Further, the video-specific model 1924 can map the complex event and semantic element information to the location(s) in the images at which they occur (e.g., frame number).” … ;) Regarding claim 18, Divakaran discloses wherein determining, based on the audio data, which frames of the stream of vision data correspond to a period of time in which the spoken utterance was spoken comprises: determining a starting frame from among the stream of vision data corresponding to a time when the spoken utterance was started; and excluding frames of the stream of vision data which were captured prior to the starting frame from being included in the subset of the stream of vision data. (Paragraph 279: “ HMMs may be used to identify transition points in the interaction (such as conversational turns or the beginning or end of a phase of the interaction), while CRFs may be used to capture and analyze the non-stationarity of the behavioral cues during the segments of the interaction identified by the HMMs.”; Paragraph 236: “ a set of input images, the video-specific model 1924 may contain information about instances of semantic elements 1928 detected in the input images by type (e.g., actors, scenes, actions, objects, audio, text, or geographic location) and information about each complex event 1926 detected in the input images. Further, the video-specific model 1924 can map the complex event and semantic element information to the location(s) in the images at which they occur (e.g., frame number).”; Paragraphs 285-286: “each or any of the analyzers 2264, 2266, 2268 may analyze temporal patterns of the non-verbal cues and verbal content. 
For instance, if the verbal content of one participant includes the word “sorry” at the beginning of an interaction and the word “great” at the end of the interaction, the results of the temporal analysis performed by the analyzer 2264 may be different than if “great” occurred early in the interaction and “sorry” occurred later.. … The temporal dynamics analyzer 2264 can also consider the time interval in which behavioral cues occur in relation to other time intervals.) Regarding claim 19, Divakaran discloses wherein determining, based on the audio data, which frames of the stream of vision data correspond to a period of time in which the spoken utterance was spoken comprises: determining an ending frame from among the stream of vision data corresponding to a time when the spoken utterance ended; and excluding frames of the stream of vision data captured after the ending frame from being included in the subset of the stream of vision data. (Paragraph 279: “ HMMs may be used to identify transition points in the interaction (such as conversational turns or the beginning or end of a phase of the interaction), while CRFs may be used to capture and analyze the non-stationarity of the behavioral cues during the segments of the interaction identified by the HMMs.” Paragraph 236: “ a set of input images, the video-specific model 1924 may contain information about instances of semantic elements 1928 detected in the input images by type (e.g., actors, scenes, actions, objects, audio, text, or geographic location) and information about each complex event 1926 detected in the input images. Further, the video-specific model 1924 can map the complex event and semantic element information to the location(s) in the images at which they occur (e.g., frame number).”; Paragraphs 285-286: “each or any of the analyzers 2264, 2266, 2268 may analyze temporal patterns of the non-verbal cues and verbal content. For instance, if the verbal content of one participant includes the word “sorry” at the beginning of an interaction and the word “great” at the end of the interaction, the results of the temporal analysis performed by the analyzer 2264 may be different than if “great” occurred early in the interaction and “sorry” occurred later … The temporal dynamics analyzer 2264 can also consider the time interval in which behavioral cues occur in relation to other time intervals.”) Regarding claim 20, Divakaran discloses a system (Paragraph 3: “methods, systems including computing devices, and computer-program products are provided for implementing a virtual personal assistant.”) comprising: at least one processor; and memory storing instructions that, when executed by the at least one processor, cause the at least one processor (Paragraphs 343-344: “The techniques described herein may also be implemented in electronic hardware, computer software, firmware, or any combination thereof. … The computer-readable data storage medium may form part of a computer program product, which may include packaging materials … The program code may be executed by a processor, which may include one or more processors”) to be operable to: receive a stream of vision data, the stream of vision data being generated based on sensor data from one or more vision components of a client device; (Figs. 1 and 5; Paragraph 50: “ A person 100 using a smartphone 102 that includes a virtual personal assistant 150 can interact with the smartphone 102 using various sensory input, such as audio input 110, image input 120, and/or tactile input 130. 
… the person 100 can provide image input 120, captured for example by a camera.”; Paragraph 95: “FIG. 5 illustrates in greater detail an example of the audio understanding 414 and image understanding 416 components of the virtual personal assistant platform 410. … the image understanding 416 component can extract and interpret information in images.”) receive a representation of a spoken utterance, the spoken utterance being captured in audio data generated by one or more microphones of the client device; (Figs. 1 and 5; Paragraph 50: “ A person 100 using a smartphone 102 that includes a virtual personal assistant 150 can interact with the smartphone 102 using various sensory input, such as audio input 110, image input 120, and/or tactile input 130. the person 100 can provide audio input 110, captured for example by a microphone, by speaking to the smartphone 102.”; Paragraph 95: “FIG. 5 illustrates in greater detail an example of the audio understanding 414 and image understanding 416 components of the virtual personal assistant platform 410. As noted above, the audio understanding 414 component can extract and interpret information from audio input”.)) process, using a generative model (GM), first GM input (Fig.5: the audio, video, and/or tactile input 502) to generate corresponding first GM output (Fig.5: populated intent 530; Fig.6 :populated intent 630), (Figs. 5 and 22 ; Paragraph 99: “the parameter extractor 518 can associate the word classification 526 with the generalized unpopulated intent 528 so that the intent expressed in the audio, video, and/or tactile input 502 can be made more definite. The result is the populated intent 530.”; Paragraphs 267-282: “FIG. 22 illustrates an example of an interaction assistant 2210. An interaction assistant 2210 can be configured to analyze and interpret both verbal 2230 and non-verbal 2228 inputs and identify therefrom the various types of verbal and/or non-verbal behavioral cues 2232 that may be expressed by a person who has provided the input. … The interaction modeler 2214 can enable modeling of an interaction within the interaction's context, as it evolves over time … generative models (such as Hidden Markov Models (HMMs)) or a combination of discriminative and generative models may be used to model certain aspects of an interaction.”) the first GM input comprising at least the stream of vision data and the representation of the spoken utterance; (Paragraph 95: “FIG. 5 illustrates an example of a general understanding system 500 that can convert user input into a user intent. In various implementations, the understanding system 500 receives audio, video, and/or tactile input 502 and events 504.”) determine, based on the first GM output, a subset of the stream of vision data; (Paragraph 279: “generative models (such as Hidden Markov Models (HMMs)) or a combination of discriminative and generative models may be used to model certain aspects of an interaction. For example, in some embodiments, HMMs may be used to identify transition points in the interaction (such as conversational turns or the beginning or end of a phase of the interaction),”) process, using the GM, second GM input to generate corresponding second GM output, the second GM input comprising at least the subset of the stream of vision data and the representation of the spoken utterance; (Figs. 
4-6: Interpretation 418 and Paragraph 78: “The interpretation 418 component can use the information extracted from audio and visual information by the audio understanding 414 and image understanding 416 components, and attempt to determine a person's current intent. “Intent” in this context means an objective, goal, task, purpose, request, or meaning intended by the verbal and/or visual input.”; Paragraphs 101-102 and 328) determine, based on the second GM output, responsive content, wherein the responsive content is responsive to the spoken utterance and the stream of vision data; (Figs. 4-6: Reasoning 420, Output generation 422 and Paragraph 80- 81: “The reasoning 420 component can receive the input intent and input state and determine a reasoned task, or course of action. … The reasoning 420 component may be assisted by pre-defined workflows, including domain-specific workflows, as well as models and rules. … The output generation 422 component can create responses, which can be output using natural language and/or a visual display. For example, the output generation 422 can formulate a textual response, and indicate whether the textual response should be displayed on a screen or vocalized. As another example, the output generation 422 can assemble. a combined textual and visual response.”) and cause the client device to render the responsive content. (Figs. 4-6: Text to Speech 424 and Paragraph 82: “ The text to speech 424 component can convert text output, such as may be provided by the output generation 422 component, to audio output. Other output, such as text output to be displayed or graphic output to be displayed on a screen, can be provided directly to the user interface of the user interface and virtual personal assistant client application 450.”) Conclusion The prior art made of record and not relied upon is considered pertinent to applicant's disclosure. Nguyen et al (U.S. 20220310094 A1), “Automated Assistant Interaction Prediction Using Fusion Of Visual and Audio Input”, teaches about determining whether detected voice activity or various physical movements of a user represent an intent to interact with an automated assistant or automated assistant device. These determinations can be made when the user provides audio and/or visual input to the automated assistant device without requiring that the automated assistant first be explicitly invoked and transitioned into a fully listening/responsive state in which the automated assistant attempts to respond to any captured utterance. Bobbili et al (U.S. 20230197078 A1), “Multiple Virtual Assistants”, teaches about a speech-processing system may provide access to multiple virtual assistants via one or more voice-controlled devices. Each assistant may leverage language processing and language generation features of the speech-processing system, while handling different commands and/or providing access to different back applications. Different assistants may be available for use with a particular voice-controlled device based on time, location, the particular user, etc. The voice-controlled device may include components for facilitating user interaction with multiple assistants. Sethi et al (U.S. 
20220374605 A1), “Continuous Learning For Natural-Language Understanding Models For Assistant Systems”, teaches about method includes receiving a user request to automatically debug a natural-language understanding (NLU) model, accessing a plurality of predicted semantic representations generated by the NLU model; generating a plurality of expected semantic representations associated with the plurality of dialog sessions based on an auto-correction model; identifying incorrect semantic representations of the predicted semantic representations based on a comparison between the predicted semantic representations and the expected semantic representations, and automatically correcting the incorrect semantic representations by replacing them with respective expected semantic representations generated by the auto-correction model. Any inquiry concerning this communication or earlier communications from the examiner should be directed to Duy A Tran whose telephone number is (571)272-4887. The examiner can normally be reached Monday-Friday 8:00 am - 5:00 pm. Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice. If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, ONEAL R MISTRY can be reached at (313)-446-4912. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300. Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000. /DUY TRAN/Examiner, Art Unit 2674 /ONEAL R MISTRY/Supervisory Patent Examiner, Art Unit 2674
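For orientation, independent claim 1 as mapped above recites a two-pass generative-model flow: a first GM pass over the full vision stream and the utterance representation is used to select a subset of the stream, and a second GM pass over that subset plus the utterance produces the responsive content. The sketch below only illustrates that flow; every name in it is hypothetical, and it is not code from the application or from Divakaran.

```python
# Illustrative sketch of the two-pass flow recited in claim 1 (hypothetical names throughout).
from dataclasses import dataclass
from typing import List

@dataclass
class Frame:
    index: int
    relevance: float  # stand-in for whatever signal a real first GM pass would produce

def first_gm_pass(frames: List[Frame], utterance: str) -> List[float]:
    """Hypothetical first GM call: score each frame of the vision stream against the utterance."""
    return [f.relevance for f in frames]

def second_gm_pass(frames: List[Frame], utterance: str) -> str:
    """Hypothetical second GM call: generate content from the selected subset plus the utterance."""
    return f"response to '{utterance}' grounded in frames {[f.index for f in frames]}"

def respond(stream: List[Frame], utterance: str, threshold: float = 0.5) -> str:
    scores = first_gm_pass(stream, utterance)                       # first GM output
    subset = [f for f, s in zip(stream, scores) if s >= threshold]  # subset of the vision stream
    return second_gm_pass(subset, utterance)                        # second GM output -> responsive content

if __name__ == "__main__":
    stream = [Frame(0, 0.2), Frame(1, 0.9), Frame(2, 0.7)]
    print(respond(stream, "what is that?"))
```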

Prosecution Timeline

May 06, 2024
Application Filed
Feb 21, 2026
Non-Final Rejection — §102 (current)

Precedent Cases

Applications granted by this same examiner with similar technology

Patent 12573024
IMAGE AUGMENTATION FOR MACHINE LEARNING BASED DEFECT EXAMINATION
Granted Mar 10, 2026 (2y 5m to grant)
Patent 12561934
AUTOMATIC ORIENTATION CORRECTION FOR CAPTURED IMAGES
Granted Feb 24, 2026 (2y 5m to grant)
Patent 12548284
METHOD FOR ANALYZING ONE OR MORE ELEMENT(S) OF ONE OR MORE PHOTOGRAPHED OBJECT(S) IN ORDER TO DETECT ONE OR MORE MODIFICATION(S), AND ASSOCIATED ANALYSIS DEVICE
Granted Feb 10, 2026 (2y 5m to grant)
Patent 12530798
LEARNED FORENSIC SOURCE SYSTEM FOR IDENTIFICATION OF IMAGE CAPTURE DEVICE MODELS AND FORENSIC SIMILARITY OF DIGITAL IMAGES
Granted Jan 20, 2026 (2y 5m to grant)
Patent 12505539
CELL BODY SEGMENTATION USING MACHINE LEARNING
Granted Dec 23, 2025 (2y 5m to grant)
Study what changed to get past this examiner. Based on 5 most recent grants.

AI Strategy Recommendation

Get an AI-powered prosecution strategy using examiner precedents, rejection analysis, and claim mapping.

Prosecution Projections

Expected OA Rounds: 1-2
Grant Probability: 81%
Grant Probability With Interview: 99% (+17.5%)
Median Time to Grant: 3y 1m
PTA Risk: Low
Based on 128 resolved cases by this examiner. Grant probability derived from career allow rate.
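The projections appear to combine the career allow rate with the interview lift; a small sanity check under that assumption (not the tool's published methodology):

```python
# Assumed relationship between the projection figures (a sanity check, not the tool's methodology).
granted, resolved = 104, 128
base = 100 * granted / resolved   # career allow rate: 81.25, displayed as 81%
interview_lift = 17.5             # percentage points added with an examiner interview
print(f"base {base:.0f}%, with interview {base + interview_lift:.0f}%")  # base 81%, with interview 99%
```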
