Prosecution Insights
Last updated: April 19, 2026
Application No. 18/749,461

IMAGE GENERATION

Current status: Non-Final Office Action (§103)

Filed: Jun 20, 2024
Examiner: VU, KHOA
Art Unit: 2611
Tech Center: 2600 — Communications
Assignee: BEIJING BAIDU NETCOM SCIENCE TECHNOLOGY CO., LTD.
OA Round: 1 (Non-Final)

Grant Probability: 68% (Favorable)
Expected OA Rounds: 1-2
Median Time to Grant: 3y 1m
With Interview: 84%

Examiner Intelligence

Career Allow Rate: 68% — above average (234 granted / 345 resolved; +5.8% vs TC avg)
Interview Lift: +15.8% higher allow rate for resolved cases with an interview
Typical Timeline: 3y 1m average prosecution; 27 applications currently pending
Career History: 372 total applications across all art units
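
The headline figures above are simple derived statistics. A minimal sketch of the arithmetic, using only values shown on this page (the interview lift is read off the page, not recomputed from underlying case data):

```python
# Sketch of how the headline numbers relate; all inputs are from this page.
granted, resolved = 234, 345
allow_rate = granted / resolved               # 0.678 -> displayed as 68%
interview_lift = 0.158                        # reported allow-rate delta with interview
with_interview = allow_rate + interview_lift  # 0.836 -> displayed as 84%
print(f"career {allow_rate:.1%}, with interview {with_interview:.1%}")
```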

Statute-Specific Performance

§101: 8.2% (-31.8% vs TC avg)
§103: 73.3% (+33.3% vs TC avg)
§102: 8.1% (-31.9% vs TC avg)
§112: 5.9% (-34.1% vs TC avg)

Tech Center averages are estimates. Based on career data from 345 resolved cases.
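
Each delta implies a Tech Center baseline for that statute. A small sketch recovering those baselines from the figures above (the ~40% estimates are arithmetic consequences of the listed values, nothing more):

```python
# Recover the TC average estimate behind each delta (values from this page).
examiner = {"§101": 8.2, "§103": 73.3, "§102": 8.1, "§112": 5.9}    # percent
delta_vs_tc = {"§101": -31.8, "§103": 33.3, "§102": -31.9, "§112": -34.1}
for statute, pct in examiner.items():
    tc_avg = pct - delta_vs_tc[statute]   # each works out to ~40%
    print(f"{statute}: examiner {pct:.1f}% vs TC avg ~{tc_avg:.1f}%")
```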

Office Action — §103 (mailed Feb 18, 2026)

DETAILED ACTION

Notice of Pre-AIA or AIA Status

The present application, filed on or after March 16, 2013, is being examined under the first-inventor-to-file provisions of the AIA.

Claim Rejections — 35 U.S.C. § 103

The following is a quotation of 35 U.S.C. 103, which forms the basis for all obviousness rejections set forth in this Office action:

A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102 of this title, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

Claims 1-4, 6-13, and 15-19 are rejected under 35 U.S.C. 103 as being unpatentable over Gao et al. (U.S. 2019/0005948 A1) in view of Li et al. (U.S. 2025/0252267 A1).

Regarding Claim 1, Gao discloses a method (Gao, [0026] "A method"), comprising:

obtaining current dialogue data, wherein the current dialogue data comprises user input data of the current round of dialogue and historical dialogue data of the historical round of dialogue (Gao, [0009] "receiving current dialogue information; determining a user intention of the current dialogue information and historical multi-round slot distribution information of historical dialogue information" and [0029] "In detail, the current dialogue information may be a structural representation that is understandable by machine…are performed on a current speech input of a user". Gao teaches that obtaining current dialogue data includes the user's speech input data of the current round of dialogue and historical dialogue data);

determining a requirement type of the user in the current round of dialogue based on the current dialogue data (Gao, [0006] "the user implicitly express him/her requirements in one searching, an uncertainty of understanding the user's requirements by the natural language understanding (NLU) exists in a certain degree" and [0033] "In order to acquire the current single-round slot distribution information, it requires determining the conditions that each slot of the semantic list is filled with the current dialogue keywords firstly". Gao teaches determining the user's requirements in the current round based on the slot distribution over the current dialogue keywords);

in response to the requirement type being an image processing requirement, determining an action sequence for implementing the image processing requirement (Gao, [0112] "The NLU device is configured to understand the natural language of the user and to convert natural language requirements inputted by the user into the structural representations understandable by machine" and [0029] "the current dialogue information may be a structural representation that…is formed after a series of processes (such as automatic speech recognition, natural language understanding) are performed on a current speech input of a user". Gao teaches responding to a processing requirement (understanding the natural language of the user and converting natural-language requirements) and determining an action sequence for implementing it (the structural representation is formed after a series of processes are performed on a current speech input of a user)).
Gao teaches that research in this field includes image recognition, natural language processing, and expert systems (Gao, [0003]). However, Gao does not explicitly teach: in response to the requirement type being an image processing requirement, determining an action sequence for implementing the image processing requirement, wherein the action sequence comprises at least one image processing action; executing the action sequence to generate a target image; and generating response data corresponding to the user input data based on the target image.

However, Li teaches in response to the requirement type being an image processing requirement, determining an action sequence for implementing the image processing requirement, wherein the action sequence comprises at least one image processing action (Li, [0235] "The current dialogue environment data may be obtained based on a sensor, in a scenario of a dialogue between an indoor user and a smart speaker, the sensor may be an image capture apparatus" and [0035] "the second feedback dialogue data is determined in combination of the external knowledge, the second feedback dialogue data can meet the dialogue requirement of the user" and [0260] "in FIG. 8, in a scenario of a dialogue between the user and a mobile terminal, if the second input dialogue data is 'Panda in an xx zoo', the mobile terminal may determine that the second semantic keyword of the second input dialogue data is 'panda', knowledge about the panda 'Pandas are distributed in an xx city . . . ' is obtained from Wiki shown in FIG. 8". Li teaches an image processing requirement (an image captured in a dialogue scenario) and determining an action sequence of input dialogue data that includes one image processing action (an image is obtained from Wiki, FIG. 8));

executing the action sequence to generate a target image (Li, [0260], cited above. Li teaches executing the sequence of a dialogue between the user and a mobile terminal to generate a target image (an image is obtained from Wiki, FIG. 8));

and generating response data corresponding to the user input data based on the target image (Li, [0260], cited above. Li teaches generating response data corresponding to the user input data based on the target image ("Pandas are distributed in an xx city . . . " is obtained from Wiki as shown in FIG. 8)).

Gao and Li are combinable because they are from the same field of endeavor — systems and methods for image processing — and try to solve similar problems. It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the method of Gao to execute the action sequence to generate a target image (as taught by Li), because Li provides executing the sequence of a dialogue between the user and a mobile terminal to generate a target image (an image is obtained from Wiki, FIG. 8) (Li, FIG. 8, [0260]). Doing so may provide a computer system designed to assist humans and complete natural, coherent, and smooth communication tasks with humans (Li, [0090]).
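
For orientation, claim 1 recites a dialogue-driven image pipeline. A minimal sketch of that flow in the claim's own vocabulary; every helper below is a hypothetical stand-in, not anything disclosed by Gao or Li:

```python
# Hypothetical sketch of the claim 1 flow; all helpers are assumed stand-ins.
def classify_requirement(dialogue):           # requirement type of this round
    return "image_processing" if "image" in dialogue[-1] else "chat"

def plan_actions(requirement):                # action sequence (>= 1 action)
    return ["generate_image"]

def execute(actions):                         # execute sequence -> target image
    return f"<target image via {actions}>"

def respond(user_input, image):               # response data based on the image
    return {"text": f"Here is the result for: {user_input}", "image": image}

def handle_round(user_input, history):
    dialogue = history + [user_input]         # current + historical dialogue data
    if classify_requirement(dialogue) == "image_processing":
        return respond(user_input, execute(plan_actions("image_processing")))
    return respond(user_input, None)

print(handle_round("draw an image of a panda", []))
```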
Regarding Claim 2, the combination of Gao and Li discloses the method according to claim 1, wherein determining the requirement type of the user in the current round of dialogue comprises:

determining first input data for inputting into a first language model based on the current dialogue data (Gao, [0034] "the current dialogue information may be a structural representation that is understandable by machine and is formed after a series of processes (such as automatic speech recognition, natural language understanding) are performed on a current speech input of a user" and [0044] "In detail, a large amount of system actions is stored in the spoken dialogue system. When the current dialogue status satisfies the pre-configured rule, candidate system action is determined by the spoken dialogue system from the large amount of system actions. The extracted candidate system action feature will be inputted into the decision model subsequently". Gao teaches determining first input data (a large amount of the natural language) for inputting into a first language model (the decision model) based on the current dialogue data);

and inputting the first input data into the first language model to obtain the requirement type output by the first language model (Gao, [0112] "The NLU device is configured to understand the natural language of the user and to convert natural language requirements inputted by the user into the structural representations understandable by machine. Referring to FIGS. 8 and 9, the understanding result ot of the natural language is outputted by the NLU device to the DST module". Gao teaches inputting the first input data into the first language model (the decision model) to obtain the requirement type (the natural-language requirements) output by the first language model).

Regarding Claim 3, the combination of Gao and Li discloses the method according to claim 2, wherein determining the first input data for inputting into the first language model comprises:

obtaining a set first template, wherein the first template comprises first guidance information for guiding the first language model to identify the requirement type and a first slot to be filled (Gao, [0033] "In order to acquire the current single-round slot distribution information, it requires determining the conditions that each slot of the semantic list is filled with the current dialogue keywords firstly" and [0034] "Firstly, N current dialogue keywords of the current dialogue information are determined, where N is a natural number. Each of the plurality of segmentations corresponds to one dialogue keyword" and [0035] "Secondly, the semantic list corresponding to the user intention is acquired. The semantic list includes M slots, where M is a natural number". Gao teaches obtaining a set first template (each of the plurality of segmentations, as a set first template, corresponds to one dialogue keyword) for guiding the first language model to identify the requirement type and a first slot to be filled (each slot of the semantic list is filled with the current dialogue keywords first));

and filling the current dialogue data into the first slot to obtain the first input data (Gao, [0035] "The semantic list includes M slots, where M is a natural number, the spoken language system automatically generates the semantic list corresponding to the user intention of the speech input. It is to be noted that, when the N current dialogue keywords are determined, N semantic lists may be acquired" and [0038] "the current single-round slot distribution information covers all dialogue keywords of the current dialogue information and all conditions that each slot is filled with all dialogue keywords". Gao teaches filling the current dialogue data (keywords and conditions) into the first slot to obtain the first input data (the speech input)).
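
Claims 2 and 3 describe a fixed template: guidance text for the first language model plus a slot that the current dialogue data is filled into. A minimal sketch of that pattern; the template wording and the `llm` callable are assumptions, not taken from the record:

```python
# Hypothetical sketch of the claims 2-3 template/slot pattern.
FIRST_TEMPLATE = (                         # first guidance information + first slot
    "Identify the user's requirement type (image_processing or chat).\n"
    "Dialogue: {dialogue}\n"
    "Requirement type:"
)

def requirement_type(dialogue: str, llm) -> str:
    first_input = FIRST_TEMPLATE.format(dialogue=dialogue)  # fill the first slot
    return llm(first_input)                # first language model outputs the type

# Toy stand-in for the first language model:
print(requirement_type("please sharpen my photo",
                       lambda p: "image_processing" if "photo" in p else "chat"))
```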
Regarding Claim 4, the method according to claim 1: Gao does not explicitly teach wherein determining the requirement type of the user in the current round of dialogue comprises inputting the current dialogue data into a classification model to obtain the requirement type output by the classification model. However, Li teaches this limitation (Li, Abstract, "current dialogue environment data, and second input dialogue data input by a user" and [0178] "S402: Input the second sample data set into a pre-trained binary classification network model, to obtain a first classification result indicating whether external knowledge needs to be introduced". Li teaches inputting the current dialogue data into a classification network model to obtain the requirement type). Gao and Li are combinable; see the rationale in claim 1.

Regarding Claim 6, the method according to claim 1: Gao does not explicitly teach wherein determining the action sequence for implementing the image processing requirement comprises determining the action sequence based on a set corresponding relationship between a plurality of image processing requirements and a plurality of action sequences. However, Li teaches this limitation (Li, [0260] "in FIG. 8, in a scenario of a dialogue between the user and a mobile terminal, if the second input dialogue data is 'Panda in an xx zoo', the mobile terminal may determine that the second semantic keyword of the second input dialogue data is 'panda'. In this case, knowledge corresponding to 'panda' may be obtained from the network knowledge, for example, knowledge about the panda 'Pandas are distributed in an xx city . . . ' is obtained from Wiki shown in FIG. 8, or picture knowledge of the panda shown in FIG. 8 may be obtained" and [0261] "Correspondingly, the second feedback dialogue data output by the dialogue system in the mobile terminal may be 'Black and white, really cute' shown in FIG. 8". Li teaches determining the action sequence for implementing the image processing: from the input dialogue data "Panda in an xx zoo", the system outputs the feedback dialogue data "Black and white, really cute" and obtains knowledge about the panda ("Pandas are distributed in an xx city . . . ") from the Wiki image shown in FIG. 8). Gao and Li are combinable; see the rationale in claim 1.
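
Claim 6's "set corresponding relationship" between requirements and action sequences is, in effect, a lookup table. A minimal sketch with invented requirement and action names:

```python
# Hypothetical sketch of claim 6: requirement -> action sequence lookup.
ACTION_SEQUENCES = {                      # set corresponding relationship
    "remove_background": ["segment", "matte", "composite"],
    "style_transfer":    ["encode", "apply_style", "decode"],
}

def action_sequence_for(requirement: str) -> list[str]:
    return ACTION_SEQUENCES[requirement]  # at least one image processing action

print(action_sequence_for("style_transfer"))
```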
Regarding Claim 7, the method according to claim 1: Gao does not explicitly teach wherein executing the action sequence to generate the target image comprises: extracting target data for implementing the image processing requirement from the current dialogue data; and for any image processing action in the action sequence: determining input parameters of the image processing action based on the target data; and executing the image processing action to obtain a result image of the image processing action based on the input parameters.

However, Li teaches extracting target data for implementing the image processing requirement from the current dialogue data (Li, [0260] "in FIG. 8, in a scenario of a dialogue between the user and a mobile terminal, if the second input dialogue data is 'Panda in an xx zoo', the mobile terminal may determine that the second semantic keyword of the second input dialogue data is 'panda'" and [0261] "Correspondingly, the second feedback dialogue data output by the dialogue system in the mobile terminal may be 'Black and white, really cute' shown in FIG. 8". Li teaches extracting target data ("Panda in an xx zoo" and "Black and white, really cute", FIG. 8) from the current dialogue data);

for any image processing action in the action sequence (Li, [0260] "In this case, knowledge corresponding to 'panda' may be obtained from the network knowledge, for example, knowledge about the panda 'Pandas are distributed in an xx city . . . ' is obtained from Wiki shown in FIG. 8, or picture knowledge of the panda shown in FIG. 8 may be obtained". Li teaches any image (the Wiki image, FIG. 8) processing action in the sequence (the dialogue action sequence between the user and a mobile terminal, FIG. 8));

determining input parameters of the image processing action based on the target data (Li, [0260]-[0261], cited above. Li teaches determining input parameters of the image processing action ("Panda in an xx zoo" and "Black and white, really cute", FIG. 8) from the current dialogue data);

and executing the image processing action to obtain a result image of the image processing action based on the input parameters (Li, [0260] "for example, knowledge about the panda 'Pandas are distributed in an xx city . . . ' is obtained from Wiki shown in FIG. 8, or picture knowledge of the panda shown in FIG. 8 may be obtained". Li teaches executing the image processing to obtain a result image based on the input parameters ("Pandas are distributed in an xx city . . . " is obtained from the Wiki image shown in FIG. 8)).

Gao and Li are combinable; see the rationale in claim 1.
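
Claim 7 walks each action through the same loop: bind input parameters from the target data extracted from the dialogue, then execute to get a result image. A minimal sketch of that loop; the action representation is an assumption:

```python
# Hypothetical sketch of claim 7's per-action execution loop.
from dataclasses import dataclass
from typing import Callable

@dataclass
class ImageAction:
    name: str
    bind_params: Callable[[dict], dict]   # input parameters from target data
    run: Callable[[str, dict], str]       # (image, params) -> result image

def execute_sequence(actions: list[ImageAction], target_data: dict) -> str:
    image = target_data.get("source_image", "<blank>")
    for action in actions:                        # any action in the sequence
        params = action.bind_params(target_data)  # determine input parameters
        image = action.run(image, params)         # result image of this action
    return image                                  # final target image

crop = ImageAction("crop",
                   lambda t: {"subject": t["keyword"]},
                   lambda img, p: f"{img} cropped to {p['subject']}")
print(execute_sequence([crop], {"keyword": "panda"}))
```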
Regarding Claim 8, the method according to claim 1: Gao does not explicitly teach wherein generating the response data corresponding to the user input data comprises: inputting the target image and a set third template into a third language model to obtain explanation data for explaining the target image output by the third language model, wherein the third template is used to guide the third language model to generate the explanation data; and determining the target image and the explanation data as the response data.

However, Li teaches these limitations (Li, [0249] "FIG. 7, the plurality of different types of knowledge include: text knowledge, a knowledge graph, picture knowledge, multi-modal knowledge, and network knowledge. Because the picture knowledge may include both image content and text content" and [0254] "FIG. 7, the task-oriented dialogue system further includes the cross-attention network model. An input of the cross-attention network model is a result of separately encoding the obtained knowledge corresponding to the second semantic keyword, and an output of the cross-attention network model is the target fusion feature vector" and [0260] "for example, knowledge about the panda 'Pandas are distributed in an xx city . . . ' is obtained from Wiki shown in FIG. 8, or picture knowledge of the panda shown in FIG. 8 may be obtained". Li teaches inputting the target image (as picture knowledge) into a third language model (the cross-attention network model, FIG. 7) to obtain output for the target image (the target fusion feature vector) and the explanation data as the response data (e.g., "Pandas are distributed in an xx city . . . " obtained from the Wiki image shown in FIG. 8)). Gao and Li are combinable; see the rationale in claim 1.

Regarding Claim 9, the method according to claim 1: Gao does not explicitly teach wherein generating the response data corresponding to the user input data comprises: inputting the target image into an image-to-text model to obtain description text of the target image output by the image-to-text model; inputting the description text into a fourth language model to obtain explanation data for explaining the target image output by the fourth language model; and determining the target image and the explanation data as the response data.

However, Li teaches these limitations (Li, [0249] "FIG. 7, the plurality of different types of knowledge include: text knowledge, a knowledge graph, picture knowledge, multi-modal knowledge, and network knowledge. Because the picture knowledge may include both image content and text content" and [0157] "the task-oriented dialogue system includes the natural language understanding module" and [0256] "FIG. 7, the task-oriented dialogue system further includes a dialogue network model. An output of the cross-attention network model is an input of the dialogue network model, and an output of the dialogue network model is the second feedback dialogue data" and [0260] "For example, as shown in FIG. 8, in a scenario of a dialogue between the user and a mobile terminal, if the second input dialogue data is 'Panda in an xx zoo', knowledge about the panda 'Pandas are distributed in an xx city . . . ' is obtained from Wiki shown in FIG. 8, or picture knowledge of the panda shown in FIG. 8 may be obtained" and [0262] "in FIG. 9. In this case, knowledge corresponding to the 'key' may be obtained from the network knowledge, for example, knowledge about the key 'The key is used for unlocking . . . ' is obtained from Wiki shown in FIG. 9, or picture knowledge of the key shown in FIG. 9 may be obtained". Li teaches inputting the target image (picture knowledge may include both image content and text content) into an image-to-text model (the dialogue network model, FIG. 7) to obtain description text of the target image, and inputting the description text into a fourth language model (the natural language understanding module in the task-oriented dialogue system) to obtain explanation data for explaining the target image output by the fourth language model (e.g., the knowledge about the panda "Pandas are distributed in an xx city . . . " obtained from the Wiki image shown in FIG. 8)). Gao and Li are combinable; see the rationale in claim 1.

Regarding Claim 10, the combination of Gao and Li discloses an electronic device (Gao, [0010] "a computer device"), comprising: a processor (Gao, [0010] "a processor"); and a memory communicatively connected to the processor (Gao, [0010] "a memory"); wherein the memory stores instructions executable by the processor, and the instructions, when executed by the processor (Gao, [0010] "The processor is configured to run a program corresponding to executable program codes by reading the executable program codes stored in the memory"), cause the processor to perform operations comprising: obtaining current dialogue data, wherein the current dialogue data comprises user input data of the current round of dialogue and historical dialogue data of the historical round of dialogue; determining a requirement type of the user in the current round of dialogue based on the current dialogue data; in response to the requirement type being an image processing requirement, determining an action sequence for implementing the image processing requirement, wherein the action sequence comprises at least one image processing action; executing the action sequence to generate a target image; and generating response data corresponding to the user input data based on the target image. Claim 10 is substantially similar to claim 1 and is rejected based on similar analyses.

Regarding Claim 11, the combination of Gao and Li discloses the electronic device according to claim 10, wherein determining the requirement type of the user in the current round of dialogue comprises: determining first input data for inputting into a first language model based on the current dialogue data; and inputting the first input data into the first language model to obtain the requirement type output by the first language model. Claim 11 is substantially similar to claim 2 and is rejected based on similar analyses.

Regarding Claim 12, the combination of Gao and Li discloses the electronic device according to claim 11, wherein determining the first input data for inputting into the first language model comprises: obtaining a set first template, wherein the first template comprises first guidance information for guiding the first language model to identify the requirement type and a first slot to be filled; and filling the current dialogue data into the first slot to obtain the first input data. Claim 12 is substantially similar to claim 3 and is rejected based on similar analyses.
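
Claim 9 chains two models: an image-to-text model captions the target image, then a fourth language model turns the caption into explanation data, and image plus explanation become the response. A minimal sketch with stand-in models (nothing here is Li's actual architecture):

```python
# Hypothetical sketch of claim 9's response generation; both models are stubs.
def caption_model(image: str) -> str:          # image-to-text model
    return f"description of {image}"

def fourth_llm(prompt: str) -> str:            # fourth language model
    return f"Explanation: {prompt}"

def generate_response(target_image: str) -> dict:
    description = caption_model(target_image)  # description text of the image
    explanation = fourth_llm(f"Explain to the user: {description}")
    return {"image": target_image, "explanation": explanation}  # response data

print(generate_response("<panda image>"))
```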
Regarding Claim 13, the combination of Gao and Li discloses the electronic device according to claim 10, wherein determining the requirement type of the user in the current round of dialogue comprises: inputting the current dialogue data into a classification model to obtain the requirement type output by the classification model. Claim 13 is substantially similar to claim 4 and is rejected based on similar analyses.

Regarding Claim 15, the combination of Gao and Li discloses the electronic device according to claim 10, wherein determining the action sequence for implementing the image processing requirement comprises: determining the action sequence for implementing the image processing requirement based on a set corresponding relationship between a plurality of image processing requirements and a plurality of action sequences. Claim 15 is substantially similar to claim 6 and is rejected based on similar analyses.

Regarding Claim 16, the combination of Gao and Li discloses the electronic device according to claim 10, wherein executing the action sequence to generate the target image comprises: extracting target data for implementing the image processing requirement from the current dialogue data; and for any image processing action in the action sequence: determining input parameters of the image processing action based on the target data; and executing the image processing action to obtain a result image of the image processing action based on the input parameters. Claim 16 is substantially similar to claim 7 and is rejected based on similar analyses.

Regarding Claim 17, the combination of Gao and Li discloses the electronic device according to claim 10, wherein generating the response data corresponding to the user input data comprises: inputting the target image and a set third template into a third language model to obtain explanation data for explaining the target image output by the third language model, wherein the third template is used to guide the third language model to generate the explanation data; and determining the target image and the explanation data as the response data. Claim 17 is substantially similar to claim 8 and is rejected based on similar analyses.

Regarding Claim 18, the combination of Gao and Li discloses the electronic device according to claim 10, wherein generating the response data corresponding to the user input data comprises: inputting the target image into an image-to-text model to obtain description text of the target image output by the image-to-text model; inputting the description text into a fourth language model to obtain explanation data for explaining the target image output by the fourth language model; and determining the target image and the explanation data as the response data. Claim 18 is substantially similar to claim 9 and is rejected based on similar analyses.

Regarding Claim 19, the combination of Gao and Li discloses a non-transitory computer-readable storage medium storing computer instructions (Gao, [0012] "a non-transitory computer readable storage medium having a computer program stored thereon. When the computer program is executed by a processor"), wherein the computer instructions are configured to enable a computer to perform operations comprising: obtaining current dialogue data, wherein the current dialogue data comprises user input data of the current round of dialogue and historical dialogue data of the historical round of dialogue; determining a requirement type of the user in the current round of dialogue based on the current dialogue data; in response to the requirement type being an image processing requirement, determining an action sequence for implementing the image processing requirement, wherein the action sequence comprises at least one image processing action; executing the action sequence to generate a target image; and generating response data corresponding to the user input data based on the target image. Claim 19 is substantially similar to claim 1 and is rejected based on similar analyses.

Claims 5, 14, and 20 are rejected under 35 U.S.C. 103 as being unpatentable over Gao et al. (U.S. 2019/0005948 A1) in view of Li et al. (U.S. 2025/0252267 A1), and further in view of Wang et al. (U.S. 2022/0179888 A1).

Regarding Claim 5, the method according to claim 1: the combination of Gao and Li does not explicitly teach wherein determining the action sequence for implementing the image processing requirement comprises: obtaining a set second template, wherein the second template comprises second guidance information for guiding the second language model to generate the action sequence and a second slot to be filled; filling the image processing requirement into the second slot to obtain second input data for inputting into the second language model; and inputting the second input data into the second language model to obtain the action sequence output by the second language model.

However, Wang teaches obtaining a set second template, wherein the second template comprises second guidance information for guiding the second language model to generate the action sequence and a second slot to be filled (Wang, [0007] "FIG. 1B is a schematic diagram of a chat scene dialogue about emotion information" and [0035] "FIG. 20A is a user emotion type and a user emotion template" and [0036] "FIG. 20B is a chatbot emotion type and a chatbot emotion template" and FIG. 2, [0122] "a Natural Language Generation (NLG), which generates responses according to dialogue context information, individual database, and knowledge base" and [0252] "in FIG. 5A, the intent information is extracted from the historical dialogue (user input information and robot response information), including intent, slot filling information, and then in combination with the open database, such as 'Time: T1, Relation: (Athlete A, team, football team B), source: user', 'Time: T2, Relation: (Athlete A, score, 3 balls), source: Chatbot'". Wang teaches obtaining a set second template (e.g., the user emotion template and the chatbot emotion template, FIGS. 20A-20B) for a chat-scene dialogue about emotion information between a user and a robot (FIG. 1B), for guiding the second language model (the Natural Language Generation (NLG), FIG. 2) to generate the action sequence (FIG. 2) and the second slot to be filled (e.g., FIGS. 5A-5B, the slot-filling information for source: user (T1) and Chatbot (T2), robot response));

filling the image processing requirement into the second slot to obtain second input data for inputting into the second language model (Wang, [0252], cited above, and [0253] "the individual knowledge base records the key information in the dialogue between the user and the Chatbot through the form of the knowledge representation… is stored in the individual knowledge base through the form of the knowledge representation, as shown in FIG. 5B". Wang teaches filling the image processing requirement (recording the key information in the dialogue between the user and the Chatbot) into the slot (FIGS. 5A-5B) for inputting into the second language model);

and inputting the second input data into the second language model to obtain the action sequence output by the second language model (Wang, [0011] "FIG. 2A is a schematic structural diagram of an actively growing chatbot provided" and [0115] "Wherein, as shown in FIG. 2A, the architecture mainly includes the following four basic modules" and [0122] "a Natural Language Generation (NLG), which generates responses according to dialogue context information, individual database, and knowledge base". Wang teaches inputting the second input data (user utterance information, FIG. 2A) into the second language model (the NLG) to obtain the action sequence and generate output by the second language model (generating responses according to dialogue context information, individual database, and knowledge base)).

Gao, Li, and Wang are combinable because they are from the same field of endeavor — systems and methods for image processing — and try to solve similar problems. It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the method of Gao and Li to obtain a set second template of a second language model to generate the action sequence and a second slot to be filled (as taught by Wang), because Wang provides obtaining such a template (the user emotion template and the chatbot emotion template, FIGS. 20A-20B) for guiding the second language model (the NLG, FIG. 2) to generate the action sequence and the second slot to be filled (Wang, FIG. 2, [0122]; FIGS. 5A-5B, [0252]). Doing so may solve the problem of how to output response information more accurately when an intelligent chatbot interacts with a user (Wang, [0042]).
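
Claim 5 reuses the template/slot pattern one level up: the image processing requirement fills a second slot, and the second language model returns the action sequence. A minimal sketch; the template wording and the comma-separated output format are assumptions, not taken from the record:

```python
# Hypothetical sketch of claim 5's second template and second slot.
SECOND_TEMPLATE = (                        # second guidance information
    "List the image processing actions for this requirement, comma-separated.\n"
    "Requirement: {requirement}\n"         # second slot to be filled
    "Actions:"
)

def plan_action_sequence(requirement: str, llm) -> list[str]:
    second_input = SECOND_TEMPLATE.format(requirement=requirement)
    return [a.strip() for a in llm(second_input).split(",")]

# Toy stand-in for the second language model:
print(plan_action_sequence("remove_background",
                           lambda p: "segment, matte, composite"))
```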
Regarding Claim 14, the combination of Gao, Li, and Wang discloses the electronic device according to claim 10, wherein determining the action sequence for implementing the image processing requirement comprises: obtaining a set second template, wherein the second template comprises second guidance information for guiding the second language model to generate the action sequence and a second slot to be filled; filling the image processing requirement into the second slot to obtain second input data for inputting into the second language model; and inputting the second input data into the second language model to obtain the action sequence output by the second language model. Claim 14 is substantially similar to claim 5 and is rejected based on similar analyses.

Regarding Claim 20, the combination of Gao, Li, and Wang discloses the non-transitory computer-readable storage medium according to claim 19, wherein determining the action sequence for implementing the image processing requirement comprises: obtaining a set second template, wherein the second template comprises second guidance information for guiding the second language model to generate the action sequence and a second slot to be filled; filling the image processing requirement into the second slot to obtain second input data for inputting into the second language model; and inputting the second input data into the second language model to obtain the action sequence output by the second language model. Claim 20 is substantially similar to claim 5 and is rejected based on similar analyses.

Conclusion

The prior art made of record and not relied upon is considered pertinent to applicant's disclosure: Zhai et al. (U.S. 2021/0256969 A1) and Golpar et al. (U.S. 2019/0325089 A1).

Any inquiry concerning this communication or earlier communications from the examiner should be directed to KHOA VU, whose telephone number is (571) 272-5994. The examiner can normally be reached 8:00-4:00. Examiner interviews are available via telephone, in person, and video conferencing using a USPTO-supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice. If attempts to reach the examiner by telephone are unsuccessful, the examiner's supervisor, Kee Tung, can be reached at 571-272-7794. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.

Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.

/KHOA VU/
Examiner, Art Unit 2611

/KEE M TUNG/
Supervisory Patent Examiner, Art Unit 2611

Prosecution Timeline

Jun 20, 2024 — Application Filed
Feb 18, 2026 — Non-Final Rejection, §103 (current)

Precedent Cases

Applications granted by this examiner with similar technology:

Patent 12598266 — IMAGE PROCESSING APPARATUS, IMAGE PROCESSING METHOD, AND STORAGE MEDIUM (granted Apr 07, 2026; 2y 5m to grant)
Patent 12597087 — HIGH-PERFORMANCE AND LOW-LATENCY IMPLEMENTATION OF A WAVELET-BASED IMAGE COMPRESSION SCHEME (granted Apr 07, 2026; 2y 5m to grant)
Patent 12578941 — TECHNIQUE FOR INTER-PROCEDURAL MEMORY ADDRESS SPACE OPTIMIZATION IN GPU COMPUTING COMPILER (granted Mar 17, 2026; 2y 5m to grant)
Patent 12567181 — SYSTEMS AND METHODS FOR REAL-TIME PROCESSING OF MEDICAL IMAGING DATA UTILIZING AN EXTERNAL PROCESSING DEVICE (granted Mar 03, 2026; 2y 5m to grant)
Patent 12548431 — CONTEXTUALIZED AUGMENTED REALITY DISPLAY SYSTEM (granted Feb 10, 2026; 2y 5m to grant)

Study what changed in these cases to get past this examiner (based on the 5 most recent grants).


Prosecution Projections

Expected OA Rounds: 1-2
Grant Probability: 68% (84% with interview, +15.8%)
Median Time to Grant: 3y 1m
PTA Risk: Low

Based on 345 resolved cases by this examiner. Grant probability is derived from the career allow rate.

Free tier: 3 strategy analyses per month