Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
Response to Filed Amendments
Applicant’s Amendments/Remarks filed on 01/23/2026 have been received and made of record.
Claims 1-20 have been amended.
Claims 1-20 remain pending.
Please refer to the action below.
Examiner Notes
The claims in this application are given their broadest reasonable interpretation using the plain meaning of the claim language in light of the specification as it would be understood by one of ordinary skill in the art. However, the claimed subject matter, not the specification, is the measure of the invention.
Response to Remarks/Arguments
Applicant’s arguments filed 01/23/2026 (pages 8-11), pertaining to the prior art of record and to the newly added limitations of amended independent claims 1, 17, and 20, state: “Applicant respectfully disagrees with the rejections and reasoning in the Office Action. Without acquiescing or otherwise agreeing with the rejection, Applicant has amended independent claim 1, as discussed during the interview, to advance the prosecution in the interest of compact prosecution. As discussed during the interview, Applicant submits that Yao, Kelly, and Ghosh do not disclose, teach, or suggest features of amended claim 1 reciting, in part: "...inputting the portion of the text to a language model to predict an intent of displaying an image relevant to the text and, based on the intent, obtain at least one source from among a plurality of personal and public resources of images, the intent of displaying the image is based on a participant or conversation; determining at least one image based on the at least one source of the images;..." However, Yao, Kelly, and Ghosh do not teach, disclose or suggest: "...language model to predict an intent of displaying an image" based on "a participant or conversation" ... "at least one source from among a plurality of personal and public resources of images," and "determining at least one image based on the at least one source of the images." Reconsideration is respectfully requested.” These arguments have been fully considered; however, the newly added claim amendments and the accompanying remarks are moot in view of the new grounds of rejection set forth below.
Claim Rejections - 35 USC § 103
In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.
Claims 1, 17, and 20 are rejected under 35 U.S.C. 103 as being unpatentable over Yao et al. (CN 110457673, cited in NPL) in view of Iwasaki et al. (JP 2010218181 A1).
Regarding claim 1, Yao teaches a computer-implemented method (the trained deep learning model of the disclosure, as illustrated in at least the Abstract, supports said method and a device employing at least an NLU configured to receive a natural language audio input and to convert it into text data, as cited in the disclosure: “converting the audio data into text data, and the text data as the identified natural language sentence”), comprising:
receiving audio data via a sensor of a computing device (the obtained audio data of at least the disclosure and Fig. 1 is understood to be received via a sensor or receiving means of the computing device embodying at least the trained deep learning model and the NLU of the disclosure);
converting the audio data to a text and extracting a portion of the text (Figs. 1-2 and 6 further note the converting of the speech utterance to text data, which the system understandably supplies to the trained machine learning model for later processing; the supplied text may obviously comprise one or more recognized portions of said text);
inputting the portion of the text to a language model to predict an intent of displaying an image relevant to the text and, based on the intent, to obtain at least one of a type of visual images, a source of the visual images, a content of the visual images, or a confidence score for each of the visual images (the system further inputs the text data, or a portion thereof, to a neural network-based language model to output at least a type of visual sign action images corresponding to at least an inputted action word or text data, based further on the implied user intent);
determining at least one visual image based on at least one of the type of the visual images, the source of the visual images, the content of the visual images, or the confidence score for each of the visual images (the deep learning system of Figs. 1-2 and 6 is further configured to determine, at least impliedly, one or more visual/sign action images based on at least the type of the visual images); and
outputting the at least one visual image on a display of the computing device (Figs. 1-6 and the Abstract further illustrate the outputting of said at least one visual sign action image on a display of the computing device).
However, Yao is silent regarding predicting the intent and, based on the intent, obtaining at least one source from among a plurality of personal and public resources of images, the intent of displaying the image being based on a participant or conversation; determining at least one image based on the at least one source of the images; and causing display of the at least one image corresponding to the text on a display of the computing device.
Iwasaki teaches, at least in the Abstract, a retrieval device and method for properly displaying an image close to the retrieval intention of a user based on a user-inputted text query. The system essentially determines a user image-display intent corresponding to the inputted text, where the request and/or intent of displaying is obviously based on the user’s natural-language text conversation with the system. The system further accesses or obtains a source from among a plurality of sources including an index DB, an image DB, and a historical image DB 3, which may obviously comprise personal and/or public image resources, and retrieves the desired images based at least on a detected click log related to the image source DB 3. The system further causes, as noted in the disclosure and the Abstract, display of the intended images corresponding to the queried text on a display of the device.
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to combine the teachings of Yao in view of Iwasaki to include predicting the intent and, based on the intent, obtaining at least one source from among a plurality of personal and public resources of images, the intent of displaying the image being based on a participant or conversation; determining at least one image based on the at least one source of the images; and causing display of the at least one image corresponding to the text on a display of the computing device, as discussed above. Yao and Iwasaki are in the same field of endeavor of providing or retrieving visual content to supplement verbal or non-verbal queries and displaying, based on user intent, the desired images. Iwasaki’s combination of obtaining candidate image sources, identifying user image-display intents, and displaying the desired images according to a degree-of-certainty measure complements the intent-based visual content of Yao: when combined with Yao’s deep learning model and natural language processing system, Iwasaki enables the system of Yao not only to predict a displayed-image intent or type but also to access and obtain a source of the intended image from one or more image database sources, to retrieve the intended visual images according to at least a calculated certainty measure, and ultimately to cause display of the intended images based on the provided text query. The combination applies known methods to yield predictable results, since known work in one field of endeavor may prompt variations of it for use in either the same field or a different one based on design incentives or other market forces if the variations are predictable to one of ordinary skill in the art; the combination is thus the adaptation of a known technique using commonly available and understood technology (see MPEP 2143, KSR Exemplary Rationale F).
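By way of illustration only, the following minimal Python sketch models the pipeline described in the combination above: speech-derived text is passed to an intent predictor, a source is selected from among personal and public image resources based on the predicted intent and the participant, and an image is retrieved for display. All function names, the heuristic logic, and the data are hypothetical assumptions, not disclosures of Yao or Iwasaki.

```python
# Illustrative sketch only; all names, interfaces, and data below are
# hypothetical and are not drawn from Yao or Iwasaki.
from dataclasses import dataclass

@dataclass
class IntentResult:
    show_image: bool  # predicted intent of displaying an image
    source: str       # selected source: "personal" or "public"
    query: str        # text portion driving the retrieval

def predict_intent(text_portion: str, participant: str) -> IntentResult:
    """Stand-in for a language-model call that predicts the display intent
    based on the conversation text and the participant (toy heuristic)."""
    wants_image = any(w in text_portion.lower() for w in ("show", "picture", "photo"))
    source = "personal" if participant.lower() in text_portion.lower() else "public"
    return IntentResult(wants_image, source, text_portion)

def retrieve_image(intent: IntentResult, sources: dict) -> str | None:
    """Determine at least one image based on the selected source."""
    if not intent.show_image:
        return None
    candidates = sources.get(intent.source, [])
    return candidates[0] if candidates else None

# Usage: a toy transcript portion and two resource pools.
sources = {"personal": ["vacation_photo.jpg"], "public": ["stock_eiffel.jpg"]}
intent = predict_intent("show me a picture of the Eiffel Tower", participant="Alice")
print(intent.source, retrieve_image(intent, sources))  # public stock_eiffel.jpg
```

In the combination as mapped above, the toy heuristic would be replaced by the trained language model, and the source pools by the index, image, and historical image databases of Iwasaki.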
Regarding claim 17, Yao teaches a computing device (at least the Abstract teaches said device employing at least an NLU and the trained deep learning model of the disclosure, configured to receive a natural language audio input and to convert it into text data, as cited in the disclosure: “converting the audio data into text data, and the text data as the identified natural language sentence”), comprising:
at least one processor (the device of the Abstract and Fig. 1 inherently includes at least one processor); and
a memory (the device of the Abstract and Fig. 1 inherently further includes at least one memory device) storing instructions that, when executed by the at least one processor, configures the at least one processor to:
receive audio data via a sensor of the computing device (the obtained audio data of at least the disclosure and Fig. 1 is understood to be received via a sensor or receiving means of the computing device embodying at least the trained deep learning model and the NLU of the disclosure);
convert the audio data to a text and extract a portion of the text (Figs. 1-2 and 6 further note the converting of the speech utterance to text data, which the system understandably supplies to the trained machine learning model for later processing; the supplied text may obviously comprise one or more recognized portions of said text);
input the portion of the text to a language model to predict an intent of displaying an image relevant to the text and, based on the intent, to obtain at least one of a type of visual images, a source of the visual images, a content of the visual images, or a confidence score for each of the visual images (the system further inputs the text data, or a portion thereof, to a neural network-based language model to output at least a type of visual sign action images corresponding to at least an inputted action word or text data, based further on the implied user intent);
determine at least one visual image based on the type of the visual images, the source of the visual images, the content of the visual images, or the confidence score for each of the visual images (the deep learning system of Figs. 1-2 and 6 is further configured to determine, at least impliedly, one or more visual/sign action images based on at least the type of the visual images); and
output the at least one visual image on a display of the computing device (Figs. 1-6 and the Abstract further illustrate the outputting of said at least one visual sign action image on a display of the computing device).
However, Yao is silent regarding predicting the intent and, based on the intent, obtaining at least one source from among a plurality of personal and public resources of images, the intent of displaying the image being based on a participant or conversation; determining at least one image based on the at least one source of the images; and causing display of the at least one image corresponding to the text on a display of the computing device.
Iwasaki teaches these features, and the rationale and motivation to combine Yao with Iwasaki set forth above regarding claim 1 apply equally to claim 17 (see MPEP 2143, KSR Exemplary Rationale F).
Regarding claim 20, Yao teaches a computer-implemented method for providing visual captions (the disclosure and the Abstract teach processing of input natural language data, e.g., “(or the text data of audio data conversion, and the information contained in the text data as the caption information), the method provided by the embodiment of the present invention, the caption information into the sign language, and storing it”, which further comprises said method for providing visual captions),
the method comprising:
receiving audio data via a sensor of a computing device (the obtained audio data of at least the disclosure and Fig. 1 is understood to be received via a sensor or receiving means of the computing device embodying at least the trained deep learning model and the NLU of the disclosure);
converting the audio data to a text and extracting a portion of the text (Figs. 1-2 and 6 further note the converting of the speech utterance to text data, which the system understandably supplies to the trained machine learning model for later processing; the supplied text may obviously comprise one or more recognized portions of said text);
inputting the portion of the text to one or more machine language (ML) models to predict an intent of displaying an image relevant to the text and, based on the intent, to obtain at least one of a type of visual images, a source of the visual images, a content of the visual images, or a confidence score for each of the visual images (the system further inputs the text data, or a portion thereof, to at least a neural network-based language model and a natural language processing model to output at least a type of visual sign action images corresponding to at least an inputted action word or text data, based further on the implied user intent);
determining at least one visual image based on the type of the visual images, the source of the visual images, the content of the visual images, or the confidence score for each of the visual images (the deep learning system of Figs. 1-2 and 6 is further configured to determine, at least impliedly, one or more visual/sign action images based on at least the type of the visual images); and
outputting the at least one visual image on a display of the computing device (Figs. 1-6 and the Abstract further illustrate the outputting of said at least one visual sign action image on a display of the computing device).
However, Yao is silent regarding predicting the intent and, based on the intent, obtaining at least one source from among a plurality of personal and public resources of images, the intent of displaying the image being based on a participant or conversation; determining at least one image based on the at least one source of the images; and causing display of the at least one image corresponding to the text on a display of the computing device.
Iwasaki teaches these features, and the rationale and motivation to combine Yao with Iwasaki set forth above regarding claim 1 apply equally to claim 20 (see MPEP 2143, KSR Exemplary Rationale F).
Claims 1, 17, and 20 are further rejected under 35 U.S.C. 103 as being unpatentable over Kelly et al. (US 2023/0343011, previously cited) in view of Iwasaki et al.
Regarding claim 1, Kelly teaches a computer-implemented method (the method of at least Figs. 1-2 employs the trained generative model of at least para. 0021 to conduct a real-time conversation or meeting between at least two users; the method is further configured to receive a natural language audio input and to output, via at least a user interface device, captions and sign language to at least one participant of the conversation according to a calculated confidence score of at least para. 0075-0077), comprising:
receiving audio data via a sensor of a computing device (the obtained audio data of at least para. 0021 and Fig. 2 is understood to be received via a sensor or receiving means of the computing device);
converting the audio data to a text and extracting a portion of the text (para. 0021 further notes the converting of the speech utterance to text data, which the system understandably supplies to the trained machine learning model; the supplied text may obviously comprise one or more recognized portions of said text);
inputting the portion of the text to a language model to predict an intent of displaying an image relevant to the text and, based on the intent, to obtain at least one of a type of visual images, a source of the visual images, a content of the visual images, or a confidence score for each of the visual images (the method of para. 0021, further supported by at least para. 0075-0077 and 0026-0028, illustrates inputting the text data, or a portion thereof, to a generative network-based language model and accessing a dictionary, a lemmatizer, and/or a database management system (DBMS), per para. 0021: “In an example, a lemma-lookup dictionary or a lemmatizer can be used to map the resulting string into a resulting lemmatized string”; the dictionary, lemmatizer, or DBMS serves as a form of source of the visual images based on the further implied sign-language image display intent, for accurately predicting the visual display (at least para. 0075-0076) and for determining and displaying at least one of a type of visual sign action images, a content of the visual images, or a confidence score for each of the visual images corresponding to at least the inputted text data);
determining at least one visual image based on at least one of the type of the visual images, the source of the visual images, the content of the visual images, or the confidence score for each of the visual images (the deep learning system of para. 0075-0077 and 0021 is further configured to determine, at least impliedly, one or more visual/sign action images based on at least one of the type of the visual images, the content of the visual images, or the confidence score for each of the visual images); and
outputting the at least one visual image on a display of the computing device (Figs. 1-2 and para. 0021 and 0077 further illustrate the outputting of said at least one visual sign action image on a display of the computing device).
However, Kelly is silent regarding predicting the intent and, based on the intent, obtaining at least one source from among a plurality of personal and public resources of images, the intent of displaying the image being based on a participant or conversation; determining at least one image based on the at least one source of the images; and causing display of the at least one image corresponding to the text on a display of the computing device.
Iwasaki teaches, at least in the Abstract, a retrieval device and method for properly displaying an image close to the retrieval intention of a user based on a user-inputted text query. The system essentially determines a user image-display intent corresponding to the inputted text, where the request and/or intent of displaying is obviously based on the user’s natural-language text conversation with the system. The system further accesses or obtains a source from among a plurality of sources including an index DB, an image DB, and a historical image DB 3, which may obviously comprise personal and/or public image resources, and retrieves the desired images based at least on a detected click log related to the image source DB 3. The system further causes, as noted in the disclosure and the Abstract, display of the intended images corresponding to the queried text on a display of the device.
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to combine the teachings of Kelly in view of Iwasaki to include predicting the intent and, based on the intent, obtaining at least one source from among a plurality of personal and public resources of images, the intent of displaying the image being based on a participant or conversation; determining at least one image based on the at least one source of the images; and causing display of the at least one image corresponding to the text on a display of the computing device, as discussed above. Kelly and Iwasaki are in the same field of endeavor of providing or retrieving visual content to supplement verbal or non-verbal queries and displaying, based on user intent, the desired images. Iwasaki’s combination of obtaining candidate image sources, identifying user image-display intents, and displaying the desired images according to a degree-of-certainty measure complements the predicted display-image intent of Kelly: when combined with Kelly’s deep learning model and intent-prediction display system, Iwasaki enables the system of Kelly not only to predict a displayed-image intent or type but also to access and obtain a source of the intended image from one or more image database sources, to retrieve the intended visual images according to at least a calculated certainty measure, and ultimately to cause display of the intended images based on the provided text query. The combination applies known methods to yield predictable results, since known work in one field of endeavor may prompt variations of it for use in either the same field or a different one based on design incentives or other market forces if the variations are predictable to one of ordinary skill in the art; the combination is thus the adaptation of a known technique using commonly available and understood technology (see MPEP 2143, KSR Exemplary Rationale F).
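For context on the lemma-lookup passage quoted from Kelly’s para. 0021, the sketch below shows one conventional way a lemma-lookup dictionary may map a recognized string to a lemmatized string; the dictionary contents and the function are illustrative assumptions, not Kelly’s implementation.

```python
# Hypothetical lemma-lookup dictionary; the entries and fallback behavior
# are illustrative assumptions and do not reproduce Kelly's actual mapping.
LEMMA_LOOKUP = {
    "running": "run",
    "ran": "run",
    "pictures": "picture",
    "showed": "show",
}

def lemmatize(recognized: str) -> str:
    """Map each token of the recognized string to its lemma, keeping the
    token unchanged when no dictionary entry exists."""
    return " ".join(LEMMA_LOOKUP.get(tok, tok) for tok in recognized.lower().split())

print(lemmatize("She showed running pictures"))  # prints: she show run picture
```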
Regarding claim 17, Kelly teaches a computing device (the system of at least Figs. 1-2 and para. 0021 employs a trained generative model to conduct a real-time conversation or meeting between at least two users and is further configured to receive a natural language audio input and to output captions and sign language to at least one participant of the conversation according to a calculated confidence score of at least para. 0075), comprising:
at least one processor (para. 0084-0085); and
a memory (para. 0084-0085) storing instructions that, when executed by the at least one processor, configures the at least one processor to:
receive audio data via a sensor of the computing device (the obtained audio data of at least para. 0021 and Fig. 2 is understood to be received via a sensor or receiving means of the computing device);
convert the audio data to a text and extract a portion of the text (para. 0021 further notes the converting of the speech utterance to text data, which the system understandably supplies to the trained machine learning model; the supplied text may obviously comprise one or more recognized portions of said text);
input the portion of the text to a language model to predict an intent of displaying an image relevant to the text and, based on the intent, to obtain at least one of a type of visual images, a source of the visual images, a content of the visual images, or a confidence score for each of the visual images (the method of para. 0021, further supported by at least para. 0075-0077 and 0026-0028, illustrates inputting the text data, or a portion thereof, to a generative network-based language model and accessing a dictionary, a lemmatizer, and/or a database management system (DBMS), per para. 0021: “In an example, a lemma-lookup dictionary or a lemmatizer can be used to map the resulting string into a resulting lemmatized string”; the dictionary, lemmatizer, or DBMS serves as a form of source of the visual images based on the further implied sign-language image display intent, for determining and displaying at least one of a type of visual sign action images, a content of the visual images, or a confidence score for each of the visual images corresponding to at least the inputted text data);
determine at least one visual image based on at least one of the type of the visual images, the source of the visual images, the content of the visual images, or the confidence score for each of the visual images (the deep learning system of para. 0075-0077 and 0021 is further configured to determine, at least impliedly, one or more visual/sign action images based on at least one of the type of the visual images, the content of the visual images, or the confidence score for each of the visual images); and
output the at least one visual image on a display of the computing device (Figs. 1-2 and para. 0021 and 0077 further illustrate the outputting of said at least one visual sign action image on a display of the computing device).
However, Kelly is silent regarding predicting the intent and, based on the intent, obtaining at least one source from among a plurality of personal and public resources of images, the intent of displaying the image being based on a participant or conversation; determining at least one image based on the at least one source of the images; and causing display of the at least one image corresponding to the text on a display of the computing device.
Iwasaki teaches these features, and the rationale and motivation to combine Kelly with Iwasaki set forth above regarding claim 1 apply equally to claim 17 (see MPEP 2143, KSR Exemplary Rationale F).
Regarding claim 20, Kelly teaches, in at least para. 0021 and Fig. 2, a computer-implemented method for providing visual captions (the method of at least Figs. 1-2 employs the trained generative model of at least para. 0021 to conduct a real-time conversation or meeting between at least two users; the method is further configured to receive a natural language audio input and to output captions and sign language to at least one participant of the conversation according to a calculated confidence score of at least para. 0075), the method comprising:
receiving audio data via a sensor of a computing device (the obtained audio data of at least para. 0021 and Fig. 2 is understood to be received via a sensor or receiving means of the computing device);
converting the audio data to a text and extracting a portion of the text (para. 0021 further notes the converting of the speech utterance to text data, which the system understandably supplies to the trained machine learning model; the supplied text may obviously comprise one or more recognized portions of said text);
inputting the portion of the text to one or more machine language (ML) models to predict an intent of displaying an image relevant to the text and, based on the intent, to obtain at least one of a type of visual images, a source of the visual images, a content of the visual images, or a confidence score for each of the visual images (the method of para. 0021, further supported by at least para. 0075-0077 and 0026-0028, illustrates inputting the text data, or a portion thereof, to a generative network-based language model and accessing a dictionary, a lemmatizer, and/or a database management system (DBMS), per para. 0021: “In an example, a lemma-lookup dictionary or a lemmatizer can be used to map the resulting string into a resulting lemmatized string”; the dictionary, lemmatizer, or DBMS serves as a form of source of the visual images based on the further implied sign-language image display intent, for determining and displaying at least one of a type of visual sign action images, a content of the visual images, or a confidence score for each of the visual images corresponding to at least the inputted text data);
determining at least one visual image by inputting at least one of the type of the visual images, the source of the visual images, the content of the visual images, and the confidence score for each of the visual images to another ML model (the deep learning system of para. 0075 and 0021 is further configured to determine, at least impliedly, one or more visual/sign action images based on at least one of the type of the visual images, the content of the visual images, or the confidence score for each of the visual images); and
outputting the at least one visual image on a display of the computing device (Figs. 1-2 and para. 0021 further illustrate the outputting of said at least one visual sign action image on a display of the computing device).
However, Kelly is silent regarding predicting the intent and, based on the intent, obtaining at least one source from among a plurality of personal and public resources of images, the intent of displaying the image being based on a participant or conversation; determining at least one image based on the at least one source of the images; and causing display of the at least one image corresponding to the text on a display of the computing device.
Iwasaki teaches these features, and the rationale and motivation to combine Kelly with Iwasaki set forth above regarding claim 1 apply equally to claim 20 (see MPEP 2143, KSR Exemplary Rationale F).
Claims 1, 17, and 20 are further rejected under 35 U.S.C. 103 as being unpatentable over Ghosh et al. (US 2023/007746, previously cited) in view of Iwasaki et al.
Regarding claim 1, Ghosh teaches a computer-implemented method (Figs. 3-8 and para. 0026), comprising:
receiving audio data via a sensor of a computing device (Fig. 5 further illustrates, in one case, audio data received via a sensor 340 of the computing device);
converting the audio data to a text and extracting a portion of the text (para. 0026 and Fig. 5 further cite converting audio data to text and feeding the text data to a trained neural network-based language model; the fed or ingested text may obviously comprise one or more recognized, extracted portions of said text);
inputting the portion of the text to a language model to predict an intent of displaying an image relevant to the text and, based on the intent, to obtain at least one of a type of visual images, a source of the visual images, a content of the visual images, or a confidence score for each of the visual images (the received inputs 510 of Fig. 5, para. 0025-0028, and para. 0043-0046 comprise natural language inputs including at least text input data and gaze input data, which are fed to a neural network-based language model configured to detect and predict a sign-image display intent relevant to the text and gaze data and, based on the intent, to obtain, from a source output of para. 0049, at least one of a type of visual sign images and a content of the visual images);
determining at least one visual image based on at least one of the type of the visual images, the source of the visual images, the content of the visual images, or the confidence score for each of the visual images (Fig. 5, para. 0025-0028, and para. 0049 further show processing and determining at least one visual sign/expression image based on the type of the visual images, the source of the visual images, and the content of the visual images); and
outputting the at least one visual image on a display of the computing device (Fig. 5, para. 0025-0028, and para. 0049 further provide the output and display of at least one visual image on a display of the computing device).
However, Ghosh is silent regarding specifically predicting the intent of displaying said image and, based on the intent, obtaining at least one source from among a plurality of personal and public resources of images, the intent of displaying the image being based on a participant or conversation; determining at least one image based on the at least one source of the images; and causing display of the at least one image corresponding to the text on a display of the computing device.
Iwasaki teaches, at least in the Abstract, a retrieval device and method for properly displaying an image close to the retrieval intention of a user based on a user-inputted text query. The system essentially determines a user image-display intent corresponding to the inputted text, where the request and/or intent of displaying is obviously based on the user’s natural-language text conversation with the system. The system further accesses or obtains a source from among a plurality of sources including an index DB, an image DB, and a historical image DB 3, which may obviously comprise personal and/or public image resources, and retrieves the desired images based at least on a detected click log related to the image source DB 3. The system further causes, as noted in the disclosure and the Abstract, display of the intended images corresponding to the queried text on a display of the device.
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to combine the teachings of Ghosh in view of Iwasaki to include predicting the intent and, based on the intent, obtaining at least one source from among a plurality of personal and public resources of images, the intent of displaying the image being based on a participant or conversation; determining at least one image based on the at least one source of the images; and causing display of the at least one image corresponding to the text on a display of the computing device, as discussed above. Ghosh and Iwasaki are in the same field of endeavor of providing or retrieving visual content to supplement verbal or non-verbal queries and displaying, based on user intent, the desired images. Iwasaki’s combination of obtaining candidate image sources, identifying user image-display intents, and displaying the desired images according to a degree-of-certainty measure complements the predicted visual display content of Ghosh: when combined with Ghosh’s deep learning model and visual-display intent prediction system, Iwasaki enables the system of Ghosh not only to predict a display intent but also to access and obtain a source of the intended image from one or more image database sources, to retrieve the intended visual images according to at least a calculated certainty measure, and ultimately to cause display of the intended images based on the provided text query. The combination applies known methods to yield predictable results, since known work in one field of endeavor may prompt variations of it for use in either the same field or a different one based on design incentives or other market forces if the variations are predictable to one of ordinary skill in the art; the combination is thus the adaptation of a known technique using commonly available and understood technology (see MPEP 2143, KSR Exemplary Rationale F).
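To illustrate the degree-of-certainty retrieval attributed to Iwasaki above, the sketch below ranks candidate images pooled from several database sources by a certainty score derived from a click log; the scoring formula, the database names, and the data are hypothetical assumptions, not Iwasaki’s disclosed computation.

```python
# Hypothetical certainty-ranked retrieval across multiple image DB sources;
# the click-log weighting is an assumed stand-in, not Iwasaki's disclosed formula.
from collections import Counter

click_log = Counter({"img_a.jpg": 12, "img_b.jpg": 3})  # assumed click history

databases = {
    "index_db":   ["img_a.jpg"],
    "image_db":   ["img_b.jpg", "img_c.jpg"],
    "history_db": ["img_a.jpg", "img_c.jpg"],
}

def certainty(image: str) -> float:
    """Toy degree-of-certainty: this image's clicks over all logged clicks."""
    total = sum(click_log.values()) or 1
    return click_log[image] / total

# Pool the candidates from every source, then keep the highest-certainty image.
candidates = {img for imgs in databases.values() for img in imgs}
best = max(candidates, key=certainty)
print(best, round(certainty(best), 2))  # img_a.jpg 0.8
```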
Regarding claim 17, Ghosh teaches a computing device (device 300 of Figs. 3-8 and para. 0026), comprising:
at least one processor (the one or more processors of para. 0026 and 0065-0066); and
a memory (the one or more memories of para. 0065-0066) storing instructions that, when executed by the at least one processor, configures the at least one processor to: receive audio data via a sensor of the computing device (Fig. 5 further illustrates, in one case, audio data received via a sensor 340 of the computing device);
convert the audio data to a text and extract a portion of the text (para. 0026 and Fig. 5 further cite converting audio data to text and feeding the text data to a trained neural network-based language model; the fed or ingested text may obviously comprise one or more recognized, extracted portions of said text);
input the portion of the text to a language model to predict an intent of displaying an image relevant to the text and, based on the intent, to obtain at least one of a type of visual images, a source of the visual images, a content of the visual images, or a confidence score for each of the visual images (the received inputs 510 of Fig. 5, para. 0025-0028, and para. 0043-0046 comprise natural language inputs including at least text input data and gaze input data, which are fed to a neural network-based language model configured to detect and predict a sign-image display intent relevant to the text and gaze data and, based on the intent, to obtain, from a source output of para. 0049, at least one of a type of visual sign images and a content of the visual images);
determine at least one visual image based on the type of the visual images, the source of the visual images, the content of the visual images, or the confidence score for each of the visual images (Fig. 5 and para. 0025-0028 further show processing and determining at least one visual sign/expression image based on the type of the visual images, the source of the visual images, and the content of the visual images); and
output the at least one visual image on a display of the computing device (Fig. 5 and para. 0025-0028 further provide the output and display of at least one visual image on a display of the computing device).
However, Ghosh is silent regarding specifically predicting the intent of displaying said image and, based on the intent, obtaining at least one source from among a plurality of personal and public resources of images, the intent of displaying the image being based on a participant or conversation; determining at least one image based on the at least one source of the images; and causing display of the at least one image corresponding to the text on a display of the computing device.
Iwasaki teaches these features, and the rationale and motivation to combine Ghosh with Iwasaki set forth above regarding claim 1 apply equally to claim 17 (see MPEP 2143, KSR Exemplary Rationale F).
Regarding claim 20, Ghosh teaches a computer-implemented method for providing visual captions (the smart glasses of Figs. 2-5 are further configured to output or provide caption data and/or sign language from converted text data), the method comprising:
receiving audio data via a sensor of a computing device (Fig. 5 further illustrates, in one case, audio data received via a sensor 340 of the computing device);
converting the audio data to a text and extracting a portion of the text (para. 0026 and Fig. 5 further cite converting audio data to text and feeding the text data to a trained neural network-based language model; the fed or ingested text may obviously comprise one or more recognized, extracted portions of said text);
inputting the portion of the text to one or more machine language (ML) models to predict an intent of displaying an image relevant to the text and, based on the intent, to obtain at least one of a type of visual images, a source of the visual images, a content of the visual images, or a confidence score for each of the visual images (the received inputs 510 of Fig. 5, para. 0025-0028, and para. 0043-0046 comprise natural language inputs including at least text input data and gaze input data, which are fed to a neural network-based language model configured to detect and predict a sign-image display intent relevant to the text and gaze data and, based on the intent, to obtain, from a source output of para. 0049, at least one of a type of visual sign images and a content of the visual images);
determining at least one visual image by inputting at least one of the type of the visual images, the source of the visual images, the content of the visual images, and the confidence score for each of the visual images to another ML model (Fig. 5 and para. 0025-0028 further show processing and determining at least one visual sign/expression image based on the type of the visual images, the source of the visual images, and the content of the visual images); and
outputting the at least one visual image on a display of the computing device (Fig. 5 and para. 0025-0028 further provide the output and display of at least one visual image on a display of the computing device).
However, Ghosh is silent regarding wherein specifically predict intent of displaying said image, and based on the intent, obtain at least one source from among a plurality of personal and public resources of images, and the intent of displaying the image is based on a participant or conversation;
determining at least one image based on the at least one source of the images; and causing display of the at least one image corresponding to the text on a display of the computing device.
Iwasaki teaches, at least in the Abstract, a retrieval device and method for properly displaying an image close to the retrieval intention of a user based on a user-inputted text query. This at least implies that the system determines a user image-display intent corresponding to the inputted text, and that the request and/or intent of displaying is obviously based on the user's natural-language text conversation with the system. The system further accesses or obtains a DB 3 source from among a plurality of sources including an index DB source, an image DB source, and a historical image DB 3, which may obviously comprise personal and/or public image resources, to retrieve the desired images based at least on a detected click log further related to the image source DB 3. The system further causes, as noted in the disclosure and the Abstract, display of the intended images corresponding to the queried text on a display of the device. It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to combine the teachings of Ghosh in view of Iwasaki to include said predicting of an intent and, based on the intent, obtaining at least one source from among a plurality of personal and public resources of images, the intent of displaying the image being based on a participant or conversation; determining at least one image based on the at least one source of the images; and causing display of the at least one image corresponding to the text on a display of the computing device, as discussed above. Ghosh and Iwasaki are in the same field of endeavor of employing methods and systems for providing or retrieving visual or display content to supplement verbal or non-verbal queries, at least for displaying, based on user intent, the user's desired images. Iwasaki's combination of obtaining candidate image sources, identifying user image-display intents, and displaying the desired images based on a further degree-of-certainty intention complements the predicted visual display content of Ghosh: when combined with the deep-learning model and the visual-display-intent prediction of Ghosh, Iwasaki enables the system of Ghosh not only to predict a display intent but also to access and obtain a source of the intended image from one or more image database sources, to retrieve the intended visual images according to at least a calculated certainty measure, and ultimately to cause display of the intended images based on the provided text query, according to known methods and with predictable results, since known work in one field of endeavor may prompt variations of it for use in either the same field or a different one based on design incentives or other market forces if the variations are predictable to one of ordinary skill in the art. The combination is thus the adaptation of an old idea or invention using newer technology that is commonly available and understood in the art, and thereby a variation on already-known art (see MPEP 2143, KSR Exemplary Rationale F).
Claim(s) 1-5, 9-10, 17, 20 is/are further rejected under 35 U.S.C. 103 as being obvious over Ghosh in view of Guy et al. (US 2022/0413688, A1).
Regarding claim 1, Ghosh teaches a computer-implemented method (Figs. 3-8; para. 0026) comprising:
receiving audio data via a sensor of a computing device (Fig. 5 further illustrates, in one case, receiving audio data via a sensor 340 of the computing device);
converting the audio data to a text and extracting a portion of the text (para. 0026 and Fig. 5 further cite converting audio data to text and feeding the text data to a trained neural network-based language model; the fed or ingested text may obviously comprise one or more recognized extracted portions of said text);
inputting the portion of the text to a language model to predict an intent of displaying an image relevant to the text and, based on the intent, to obtain at least one of a type of visual images, a source of the visual images, a content of the visual images, or a confidence score for each of the visual images (the received inputs 510 of Fig. 5 and para. 0025-0028 and 0043-0046 comprise natural-language inputs, including at least text input data and gaze input data, fed to a neural network-based language model configured to detect and predict a display intent for a sign image relevant to the text and gaze data and, based on the intent, to obtain, from the source output of para. 0049, at least one of a type of visual sign images and a content of the visual images);
determining at least one visual image based on at least one of the type of the visual images, the source of the visual images, the content of the visual images, or the confidence score for each of the visual images (as further analyzed in Fig. 5 and para. 0025-0028 and 0049, at least one visual sign/expression image is processed and determined based on the type of the visual images, the source of the visual images, and the content of the visual images); and
outputting the at least one visual image on a display of the computing device (Fig. 5 and para. 0025-0028 and 0049 provide the output and display of at least one visual image on a display of the computing device).
However, Ghosh is silent regarding specifically predicting an intent of displaying said image and, based on the intent, obtaining at least one source from among a plurality of personal and public resources of images, the intent of displaying the image being based on a participant or conversation;
determining at least one image based on the at least one source of the images; and causing display of the at least one image corresponding to the text on a display of the computing device.
Guy teaches, at least in para. 0065-0068, determining and predicting a visual display intent corresponding to visual images that participants in a conversation may desire to display and, based on the intent, obtaining at least one local image memory source and/or remote DB 230 from among a plurality of personal or public resources of images, the intent of displaying the image being based on a participant or conversation. The system, further in para. 0065-0068, determines at least one presentation image based on the at least one source of the images and causes display of the at least one presentation image corresponding, in one case, to the queried text on a display of the computing device. It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to combine the teachings of Ghosh in view of Guy to include said predicting of a display intent and, based on the intent, obtaining at least one source from among a plurality of personal and public resources of images, the intent of displaying the image being based on a participant or conversation; determining at least one image based on the at least one source of the images; and causing display of the at least one image corresponding to the text on a display of the computing device, as discussed above. Ghosh and Guy are in the same field of endeavor of employing methods and systems for providing or retrieving visual or display content to supplement verbal or non-verbal queries, at least for displaying, based on user intent, the user's desired images. Guy's combination of obtaining candidate image sources, identifying predicted display-image intents, and displaying the desired images complements the predicted visual display content of Ghosh: when combined with the deep-learning model and the visual-display-intent prediction of Ghosh, Guy enables the system of Ghosh not only to predict a display intent but also to access and obtain a source of the intended image from one or more image database sources, to retrieve the intended visual images so that said participants may continue an ongoing conversation seamlessly, according to known methods and with predictable results, since known work in one field of endeavor may prompt variations of it for use in either the same field or a different one based on design incentives or other market forces if the variations are predictable to one of ordinary skill in the art. The combination is thus the adaptation of an old idea or invention using newer technology that is commonly available and understood in the art, and thereby a variation on already-known art (see MPEP 2143, KSR Exemplary Rationale F).
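Purely as a hypothetical illustration of the claim 1 limitations as mapped above, and not as code from Ghosh, Guy, or any other reference of record, the recited pipeline (receiving audio, converting it to text, extracting a portion, predicting a display intent with a language model based on a participant or conversation, selecting among personal and public image resources based on that intent, and causing display) might be sketched as follows; every name, value, and the toy intent logic are invented placeholders standing in for the actual models and databases.

```python
# Hypothetical sketch of the claim 1 pipeline as mapped above. All names and
# the toy "language model" are invented placeholders, not code from Ghosh,
# Guy, or any other reference of record.
from dataclasses import dataclass

PERSONAL_IMAGES = {"alice_birthday": "album/alice_birthday.jpg"}  # personal resource
PUBLIC_IMAGES = {"eiffel_tower": "web/eiffel_tower.jpg"}          # public resource

@dataclass
class Intent:
    display_image: bool  # whether an image relevant to the text should be shown
    source: dict         # at least one source, chosen based on the intent
    query: str           # text the image should correspond to

def speech_to_text(audio: bytes) -> str:
    """Stand-in for the speech recognizer: convert audio data to text."""
    return audio.decode("utf-8")  # toy: the 'audio' here is already text bytes

def predict_intent(portion: str, participant: str) -> Intent:
    """Stand-in language model: the predicted display intent is based on the
    participant or conversation, per the amended claim language."""
    wants_image = "show" in portion.lower()
    source = PERSONAL_IMAGES if participant.lower() in portion.lower() else PUBLIC_IMAGES
    return Intent(wants_image, source, portion)

def run_pipeline(audio: bytes, participant: str) -> None:
    text = speech_to_text(audio)                   # convert the audio data to text
    portion = " ".join(text.split()[-10:])         # extract a portion of the text
    intent = predict_intent(portion, participant)  # language-model intent prediction
    if intent.display_image:
        # determine at least one image based on the at least one source
        for key, path in intent.source.items():
            if key.split("_")[0] in portion.lower():
                print(f"displaying {path}")        # cause display on the device
                break

run_pipeline(b"can you show alice her birthday photo", participant="alice")
```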
Regarding claim 2 (according to the method of claim 1), Ghosh further teaches wherein the computing device is a head mounted smart glasses (the one or more smart glasses 300 of Fig. 3).
Regarding claim 3 (according to the method of claim 1), Ghosh further teaches wherein the computing device is a smart display configured for video conferencing (the smart glasses 300 of para. 0045, 0025 and Fig. 5 further comprise an artificial-intelligence system connected to at least a network system and a display device adapted for receiving/displaying data from various sources, obviously including for video conferencing via said cited network system).
Regarding claim 4 (according to the method of claim 2), Ghosh further teaches further comprising a smart phone in communication with the head mounted smart glasses and the neural network-based language model being disposed on the smart phone (Figs. 2-5 and para. 0018-0020 and 0067 further provide systems and methods comprising an external device/source (para. 0067) or another user (Fig. 5), which may embody said smart phone in communication with the head mounted smart glasses 300; in one case, the external device/environment or smart phone may obviously embody said neural network-based language model disposed on the smart phone).
Regarding claim 5 (according to the method of claim 1), Ghosh further teaches further comprising an external computing device in communication with the computing device and the neural network-based language model being disposed on the external computing device (Fig. 5 and para. 0067 further illustrate an embodiment where an external computing source/device may be in communication with the computing device 300, with said neural network-based language model, as understood, in one case disposed on the external computing device of at least para. 0067).
Regarding claim 9 (according to claim 1), Ghosh further teaches wherein the type of the visual images comprises at least one of a photo stored on the computing device, an emoji, an image, a video, a map, a personal photo from an album or a contact, a three dimensional (3D) model, a clip art, a poster, a visual representation of a Uniform Resource Locator (URL) for a website, a list, an equation, or an article (the processed output of the smart glasses of Fig. 5 and para. 0042 further comprises an output type of the visual images including at least one of an emoji, an image, and a video).
Regarding claim 10 (according to claim 1), Ghosh further implies wherein the portion of the text comprises at least a number of words from an end of the text greater than a threshold (the smart glasses of Figs. 2-5, further configured to output caption data and/or sign language from converted text data, obviously comprise a portion of the text that further obviously comprises at least a number of words from an end of the text greater than a threshold).
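Purely as a hypothetical illustration of this claim 10 limitation, and not as code from Ghosh or any other reference of record, extracting a trailing portion whose word count exceeds a threshold might look like the following minimal sketch; the function name and threshold value are invented placeholders.

```python
def trailing_portion(text: str, threshold: int = 5) -> str:
    """Return a portion comprising more words from the end of the text than
    the threshold; hypothetical illustration of claim 10 only."""
    words = text.split()
    n = min(len(words), threshold + 1)  # strictly greater than the threshold
    return " ".join(words[-n:])

print(trailing_portion("the quick brown fox jumps over the lazy dog"))
# -> "fox jumps over the lazy dog"  (6 words, exceeding the threshold of 5)
```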
Regarding claim 17, Ghosh teaches a computing device 300 (Figs. 3-8; para. 0026) comprising:
at least one processor (the one or more processors of para. 0026 and 0065-0066); and
a memory (the one or more memories of para. 0065-0066) storing instructions that, when executed by the at least one processor, configure the at least one processor to: receive audio data via a sensor of the computing device (Fig. 5 further illustrates, in one case, receiving audio data via a sensor 340 of the computing device);
convert the audio data to a text and extract a portion of the text (para. 0026 and Fig. 5 further cite converting audio data to text and feeding the text data to a trained neural network-based language model; the fed or ingested text may obviously comprise one or more recognized extracted portions of said text);
input the portion of the text to a language model to predict an intent of displaying an image relevant to the text and, based on the intent, to obtain at least one of a type of visual images, a source of the visual images, a content of the visual images, or a confidence score for each of the visual images (the received inputs 510 of Fig. 5 and para. 0025-0028 and 0043-0046 comprise natural-language inputs, including at least text input data and gaze input data, fed to a neural network-based language model configured to detect and predict a display intent for a sign image relevant to the text and gaze data and, based on the intent, to obtain, from the source outputs of para. 0049, at least one of a type of visual sign images and a content of the visual images);
determine at least one visual image based on the type of the visual images, the source of the visual images, the content of the visual images, or the confidence score for each of the visual images (as further analyzed in Fig. 5 and para. 0025-0028, at least one visual sign/expression image is processed and determined based on the type of the visual images, the source of the visual images, and the content of the visual images); and
output the at least one visual image on a display of the computing device (Fig. 5 and para. 0025-0028 provide the output and display of at least one visual image on a display of the computing device).
However, Ghosh is silent regarding specifically predicting an intent of displaying said image and, based on the intent, obtaining at least one source from among a plurality of personal and public resources of images, the intent of displaying the image being based on a participant or conversation;
determining at least one image based on the at least one source of the images; and causing display of the at least one image corresponding to the text on a display of the computing device.
Guy teaches, at least in para. 0065-0068, determining and predicting a visual display intent corresponding to visual images that participants in a conversation may desire to display and, based on the intent, obtaining at least one local image memory source and/or remote DB 230 from among a plurality of personal or public resources of images, the intent of displaying the image being based on a participant or conversation. The system, further in para. 0065-0068, determines at least one presentation image based on the at least one source of the images and causes display of the at least one presentation image corresponding, in one case, to the queried text on a display of the computing device. It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to combine the teachings of Ghosh in view of Guy to include said predicting of a display intent and, based on the intent, obtaining at least one source from among a plurality of personal and public resources of images, the intent of displaying the image being based on a participant or conversation; determining at least one image based on the at least one source of the images; and causing display of the at least one image corresponding to the text on a display of the computing device, as discussed above. Ghosh and Guy are in the same field of endeavor of employing methods and systems for providing or retrieving visual or display content to supplement verbal or non-verbal queries, at least for displaying, based on user intent, the user's desired images. Guy's combination of obtaining candidate image sources, identifying predicted display-image intents, and displaying the desired images complements the predicted visual display content of Ghosh: when combined with the deep-learning model and the visual-display-intent prediction of Ghosh, Guy enables the system of Ghosh not only to predict a display intent but also to access and obtain a source of the intended image from one or more image database sources, to retrieve the intended visual images so that said participants may continue an ongoing conversation seamlessly, according to known methods and with predictable results, since known work in one field of endeavor may prompt variations of it for use in either the same field or a different one based on design incentives or other market forces if the variations are predictable to one of ordinary skill in the art. The combination is thus the adaptation of an old idea or invention using newer technology that is commonly available and understood in the art, and thereby a variation on already-known art (see MPEP 2143, KSR Exemplary Rationale F).
Regarding claim 20, Ghosh teaches a computer-implemented method for providing visual captions (the smart glasses of Figs. 2-5 are further configured to output or provide caption data and/or sign language from converted text data), the method comprising:
receiving audio data via a sensor of a computing device (Fig. 5 further illustrates, in one case, receiving audio data via a sensor 340 of the computing device);
converting the audio data to a text and extracting a portion of the text (para. 0026 and Fig. 5 further cite converting audio data to text and feeding the text data to a trained neural network-based language model; the fed or ingested text may obviously comprise one or more recognized extracted portions of said text);
inputting the portion of the text to a language model to predict an intent of displaying an image relevant to the text and, based on the intent, to obtain at least one of a type of visual images, a source of the visual images, a content of the visual images, or a confidence score for each of the visual images (the received inputs 510 of Fig. 5 and para. 0025-0028 and 0043-0046 comprise natural-language inputs, including at least text input data and gaze input data, fed to a neural network-based language model configured to detect and predict a display intent for a sign image relevant to the text and gaze data and, based on the intent, to obtain, from the source outputs of para. 0049, at least one of a type of visual sign images and a content of the visual images);
determining at least one visual image by inputting at least one of the type of the visual images, the source of the visual images, the content of the visual images, and the confidence score for each of the visual images to another ML model (as further analyzed in Fig. 5 and para. 0025-0028, at least one visual sign/expression image is processed and determined based on the type of the visual images, the source of the visual images, and the content of the visual images); and
outputting the at least one visual image on a display of the computing device (Fig. 5 and para. 0025-0028 provide the output and display of at least one visual image on a display of the computing device).
However, Ghosh is silent regarding specifically predicting an intent of displaying said image and, based on the intent, obtaining at least one source from among a plurality of personal and public resources of images, the intent of displaying the image being based on a participant or conversation;
determining at least one image based on the at least one source of the images; and causing display of the at least one image corresponding to the text on a display of the computing device.
Guy teaches, at least in para. 0065-0068, determining and predicting a visual display intent corresponding to visual images that participants in a conversation may desire to display and, based on the intent, obtaining at least one local image memory source and/or remote DB 230 from among a plurality of personal or public resources of images, the intent of displaying the image being based on a participant or conversation. The system, further in para. 0065-0068, determines at least one presentation image based on the at least one source of the images and causes display of the at least one presentation image corresponding, in one case, to the queried text on a display of the computing device. It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to combine the teachings of Ghosh in view of Guy to include said predicting of a display intent and, based on the intent, obtaining at least one source from among a plurality of personal and public resources of images, the intent of displaying the image being based on a participant or conversation; determining at least one image based on the at least one source of the images; and causing display of the at least one image corresponding to the text on a display of the computing device, as discussed above. Ghosh and Guy are in the same field of endeavor of employing methods and systems for providing or retrieving visual or display content to supplement verbal or non-verbal queries, at least for displaying, based on user intent, the user's desired images. Guy's combination of obtaining candidate image sources, identifying predicted display-image intents, and displaying the desired images complements the predicted visual display content of Ghosh: when combined with the deep-learning model and the visual-display-intent prediction of Ghosh, Guy enables the system of Ghosh not only to predict a display intent but also to access and obtain a source of the intended image from one or more image database sources, to retrieve the intended visual images so that said participants may continue an ongoing conversation seamlessly, according to known methods and with predictable results, since known work in one field of endeavor may prompt variations of it for use in either the same field or a different one based on design incentives or other market forces if the variations are predictable to one of ordinary skill in the art. The combination is thus the adaptation of an old idea or invention using newer technology that is commonly available and understood in the art, and thereby a variation on already-known art (see MPEP 2143, KSR Exemplary Rationale F).
Claim(s) 8 is/are rejected under 35 U.S.C. 103 as obvious over Ghosh in view of Guy, and further in view of Kelly.
Regarding claim 8 (according to claim 1), Ghosh in view of Guy are silent regarding wherein the confidence score for each of the visual images is between 0 and 1, and the method further comprises: omitting the outputting of a visual image, in response to the respective confidence score of the visual image not meeting a threshold confidence score of 0.5.
Kelly further teaches, in at least para. 0075, a trained learning model configured to output caption visual images from ingested or fed text data, with a confidence score for each of the output or predicted visual images between 0 and 1; said method and/or the output confidence result further allows the system, obviously in one case, to omit the outputting of a visual image in response to the respective confidence score of the visual image not meeting a threshold confidence score of 0.5. It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to combine the teachings of Ghosh in view of Guy, and further in view of Kelly, to include wherein the confidence score for each of the visual images is between 0 and 1 and the method further comprises omitting the outputting of a visual image in response to the respective confidence score of the visual image not meeting a threshold confidence score of 0.5, as discussed above. Ghosh, Guy, and Kelly are in the same field of endeavor of translating real-time caption and sign-language data from inputted text data between at least two communicating users. Kelly further complements the smart-glasses real-time translation of text to caption/sign-language data of Ghosh in view of Guy with supplemented predicted visual-image content and types presented via a user-interface device according to a calculated confidence score correlating to the predicted visual images, which, when supplemented to the smart-glasses output data, further allows the system, in one case, to omit those visual images not meeting a certain threshold, according to known methods and with predictable results, since known work in one field of endeavor may prompt variations of it for use in either the same field or a different one based on design incentives or other market forces if the variations are predictable to one of ordinary skill in the art. The combination is thus the adaptation of an old idea or invention using newer technology that is commonly available and understood in the art, and thereby a variation on already-known art (see MPEP 2143, KSR Exemplary Rationale F).
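Purely as a hypothetical illustration of the claim 8 threshold logic as mapped above, and not Kelly's actual implementation, filtering candidate visual images by a confidence score between 0 and 1 against a 0.5 threshold reduces to the following sketch; the candidate names and scores are invented, and treating "meeting" the threshold as inclusive is an assumption noted in the comments.

```python
# Hypothetical illustration of the claim 8 threshold logic; the candidate
# list and confidence scores are invented, not taken from Kelly.
candidates = [("emoji_thumbs_up", 0.91), ("map_snippet", 0.47), ("photo", 0.62)]

THRESHOLD = 0.5  # confidence scores fall between 0 and 1

# Omit outputting any visual image whose confidence does not meet the
# threshold (">=" assumes "meeting" is inclusive of exactly 0.5).
to_output = [(name, score) for name, score in candidates if score >= THRESHOLD]
print(to_output)  # [('emoji_thumbs_up', 0.91), ('photo', 0.62)]
```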
Claim(s) 11-14 and 16 is/are rejected under 35 U.S.C. 103 as obvious over Ghosh in view of Guy, and further in view of Fashimpaur et al (US 11726578, A1).
Regarding claim 11 (according to claim 1), Ghosh in view of Guy are silent regarding wherein the outputting of the at least one visual image comprises outputting the at least one visual image as a scrollable list proximate to a side of the display of the computing device.
Fashimpaur teaches, at least in Figs. 1 and 3, a head-mounted device including a smart-glasses device whose output further comprises at least one visual image, output (Fig. 5 and the disclosure) as a scrollable list proximate to a side of the display of the computing device. It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to combine the teachings of Ghosh in view of Guy, and further in view of Fashimpaur, to include wherein the outputting of the at least one visual image comprises outputting the at least one visual image as a scrollable list proximate to a side of the display of the computing device, as discussed above. Ghosh, Guy, and Fashimpaur are in the same field of endeavor of displaying, via at least a head-mounted display, translated real-time text-to-caption and sign-language data between at least two communicating users. Fashimpaur further complements the smart-glasses real-time translation of text to caption/sign-language data of Ghosh in view of Guy with a supplemented scrollable-list display proximate to a side of the head-mounted display which, when added to the smart glasses of Ghosh in view of Guy, further allows the system to scroll display contents with added convenience, either vertically or horizontally, through at least gesture means or the like, to yield predictable results, since known work in one field of endeavor may prompt variations of it for use in either the same field or a different one based on design incentives or other market forces if the variations are predictable to one of ordinary skill in the art. The combination is thus the adaptation of an old idea or invention using newer technology that is commonly available and understood in the art, and thereby a variation on already-known art (see MPEP 2143, KSR Exemplary Rationale F).
Regarding claim 12 (according to claim 11), Ghosh in view of Guy are silent regarding further comprising outputting the at least one visual image as a vertical scrollable list.
Fashimpaur further teaches, at least in the disclosure and Fig. 5, that the smart glasses and/or head-mounted device comprising the displayed scrollable list further comprise outputting the at least one visual image as a vertical scrollable list. It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to combine the teachings of Ghosh in view of Guy, and further in view of Fashimpaur, to include outputting the at least one visual image as a vertical scrollable list, as discussed above. Ghosh, Guy, and Fashimpaur are in the same field of endeavor of displaying, via at least a head-mounted display, translated real-time text-to-caption and sign-language data between at least two communicating users. Fashimpaur further complements the smart-glasses real-time translation of text to caption/sign-language data of Ghosh in view of Guy with a supplemented scrollable-list display proximate to a side of the head-mounted display which, when added to the smart glasses of Ghosh in view of Guy, further allows the system to scroll display contents with added convenience, either vertically or horizontally, through at least gesture means or the like, to yield predictable results, since known work in one field of endeavor may prompt variations of it for use in either the same field or a different one based on design incentives or other market forces if the variations are predictable to one of ordinary skill in the art. The combination is thus the adaptation of an old idea or invention using newer technology that is commonly available and understood in the art, and thereby a variation on already-known art (see MPEP 2143, KSR Exemplary Rationale F).
Regarding claim 13 (according to claim 11), Ghosh in view of Guy are silent regarding further comprising outputting the at least one visual image as a horizontal scrollable list, in response to the at least one visual image being an emoji.
Fashimpaur further teaches, at least in the disclosure and Fig. 5, that the smart glasses and/or head-mounted device comprising the displayed scrollable list further comprise outputting the at least one visual image as a horizontal scrollable list, in response to the at least one visual image obviously being an emoji. It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to combine the teachings of Ghosh in view of Guy, and further in view of Fashimpaur, to include outputting the at least one visual image as a horizontal scrollable list in response to the at least one visual image being an emoji, as discussed above. Ghosh, Guy, and Fashimpaur are in the same field of endeavor of displaying, via at least a head-mounted display, translated real-time text-to-caption and sign-language data between at least two communicating users. Fashimpaur further complements the smart-glasses real-time translation of text to caption/sign-language data of Ghosh in view of Guy with a supplemented scrollable-list display proximate to a side of the head-mounted display which, when added to the smart glasses of Ghosh in view of Guy, further allows the system to scroll display contents with added convenience, either vertically or horizontally, through at least gesture means or the like, to yield predictable results, since known work in one field of endeavor may prompt variations of it for use in either the same field or a different one based on design incentives or other market forces if the variations are predictable to one of ordinary skill in the art. The combination is thus the adaptation of an old idea or invention using newer technology that is commonly available and understood in the art, and thereby a variation on already-known art (see MPEP 2143, KSR Exemplary Rationale F).
Regarding claim 14 (according to claim 11), Ghosh in view of Guy are silent regarding further comprising publicly displaying an image from the scrollable list, in response to an input being received from a user of the computing device.
Fashimpaur further teaches, at least in the disclosure and Fig. 5, that the smart glasses and/or head-mounted device further comprise a displayed screen that publicly displays an image from the scrollable list, according to known design, in response to an input being received from a user of the computing device. It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to combine the teachings of Ghosh in view of Guy, and further in view of Fashimpaur, to include publicly displaying an image from the scrollable list in response to an input being received from a user of the computing device, as discussed above. Ghosh, Guy, and Fashimpaur are in the same field of endeavor of displaying, via at least a head-mounted display, translated real-time text-to-caption and sign-language data between at least two communicating users. Fashimpaur further complements the smart-glasses real-time translation of text to caption/sign-language data of Ghosh in view of Guy with a supplemented scrollable-list display proximate to a side of the head-mounted display which, when added to the smart glasses of Ghosh in view of Guy, further allows the system to scroll display contents with added convenience, either vertically or horizontally, through at least gesture means, and said scrollable-list display configuration may, in one case, allow the screen, as understood in the art, to be visually publicly displayable or the like according to known means and/or design choices, to yield predictable results, since known work in one field of endeavor may prompt variations of it for use in either the same field or a different one based on design incentives or other market forces if the variations are predictable to one of ordinary skill in the art. The combination is thus the adaptation of an old idea or invention using newer technology that is commonly available and understood in the art, and thereby a variation on already-known art (see MPEP 2143, KSR Exemplary Rationale F).
Regarding claim 16 (according to claim 11), Ghosh in view of Guy are silent regarding wherein: the scrollable list is displayed on the computing device and not visible to another computing device in communication with the computing device; and the scrollable list is displayed on the another computing device, in response to an input being received from a user of the computing device.
Fashimpaur further teaches, at least in the disclosure and Fig. 5, that the smart glasses and/or head-mounted device further comprise a displayed screen displaying the scrollable list, which may, according to a predetermined design as understood in the art, not be visible to another computing device in communication with the computing device, and said scrollable list may obviously be displayed on the another computing device in response to an input being received from a user of the computing device. It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to combine the teachings of Ghosh in view of Guy, and further in view of Fashimpaur, to include wherein the scrollable list is displayed on the computing device and not visible to another computing device in communication with the computing device, and the scrollable list is displayed on the another computing device in response to an input being received from a user of the computing device, as discussed above. Ghosh, Guy, and Fashimpaur are in the same field of endeavor of displaying, via at least a head-mounted display, translated real-time text-to-caption and sign-language data between at least two communicating users. Fashimpaur further complements the smart-glasses real-time translation of text to caption/sign-language data of Ghosh in view of Guy with a supplemented scrollable-list display proximate to a side of the head-mounted display which, when added to the smart glasses of Ghosh in view of Guy, further allows the system to scroll display contents with added convenience, either vertically or horizontally, through at least gesture means, and said scrollable-list display configuration may, in one case, allow the screen, as understood in the art, to be visually publicly displayable or the like according to known means and/or design choices, to yield predictable results, since known work in one field of endeavor may prompt variations of it for use in either the same field or a different one based on design incentives or other market forces if the variations are predictable to one of ordinary skill in the art. The combination is thus the adaptation of an old idea or invention using newer technology that is commonly available and understood in the art, and thereby a variation on already-known art (see MPEP 2143, KSR Exemplary Rationale F).
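Purely as a hypothetical illustration of the claim 16 behavior discussed above, and not code from Fashimpaur or any other reference of record, the private-until-shared visibility of the scrollable list can be modeled as a simple state toggle; all class and method names are invented placeholders.

```python
# Hypothetical model of the claim 16 visibility behavior; not from Fashimpaur.
class ScrollableList:
    def __init__(self, items):
        self.items = items
        self.shared = False  # initially visible only on the user's own device

    def visible_on(self, device: str) -> bool:
        # The list is always visible on the user's own device; it appears on
        # another communicating device only after the user elects to share it.
        return device == "own" or self.shared

    def share(self) -> None:
        # Triggered in response to an input received from the user.
        self.shared = True

lst = ScrollableList(["emoji", "photo", "map"])
print(lst.visible_on("other"))  # False: not visible to the other device
lst.share()                     # user input received
print(lst.visible_on("other"))  # True: now displayed on the other device
```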
Claim Standings
Claims 6-7, 15, and 18-19 remain objected to over the prior art of record as being dependent upon a rejected base claim, but would be allowable if rewritten in independent form including all of the limitations of the base claim and any intervening claims, and/or if properly incorporated into the independent claims including all of the limitations of the base claim and any intervening claims.
Conclusion
Applicant's amendment necessitated the new ground(s) of rejection presented in this Office action. Accordingly, THIS ACTION IS MADE FINAL. See MPEP § 706.07(a). Applicant is reminded of the extension of time policy as set forth in 37 CFR 1.136(a).
A shortened statutory period for reply to this final action is set to expire THREE MONTHS from the mailing date of this action. In the event a first reply is filed within TWO MONTHS of the mailing date of this final action and the advisory action is not mailed until after the end of the THREE-MONTH shortened statutory period, then the shortened statutory period will expire on the date the advisory action is mailed, and any extension fee pursuant to 37 CFR 1.136(a) will be calculated from the mailing date of the advisory action. In no event, however, will the statutory period for reply expire later than SIX MONTHS from the date of this final action.
Any inquiry concerning this communication or earlier communications from the examiner should be directed to MARCELLUS AUGUSTIN whose telephone number is (571)270-3384. The examiner can normally be reached 9 AM - 5 PM.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, BENNY TIEU can be reached on 571-272-7490. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.
/MARCELLUS J AUGUSTIN/Primary Examiner, Art Unit 2682 04/03/2026