DETAILED ACTION
Notice of AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
Priority
Acknowledgment is made of applicant’s claim for foreign priority under 35 U.S.C. 119(a)-(d). The certified copy has been filed in parent Application No. CN202310798631.1, filed on 06/30/2023.
Claim Rejections - 35 USC § 101
35 U.S.C. 101 reads as follows:
Whoever invents or discovers any new and useful process, machine, manufacture, or composition of matter, or any new and useful improvement thereof, may obtain a patent therefor, subject to the conditions and requirements of this title.
Claims 1-14 are rejected under 35 U.S.C. 101 because the claimed invention is directed to an
abstract idea without significantly more.
Independent claims 1 and 9 recite “obtaining an object image of a virtual object”; “determining target sound category characteristics corresponding to the virtual object based on the object image”; “obtaining text information, the text information being used to describe speech content that needs to be output by the virtual object”; “and generating speech data that conforms to the target sound category characteristics based on the text information and the target sound category characteristics”. The limitations above, as drafted, recite a process that, under its broadest reasonable interpretation, covers a mental process, as it could be performed in the human mind or with the aid of pen and paper.
The limitations of “obtaining...”, “determining...”, and “generating...”, as drafted, cover mental activities. More specifically, a person can obtain an image of an object, such as a game character in a game, can determine what type of sound category that character will have, can obtain text information describing how that game character will speak, and can generate speech based on that text. All of the steps above are examples of observation and evaluation that could be performed in the human mind or with the aid of pencil and paper.
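For illustration only, the sequence of steps characterized above can be summarized in the following minimal sketch; the function names, category labels, and mapping rule are hypothetical assumptions chosen for clarity and do not represent the applicant's disclosed implementation or any prior-art system.

```python
# Illustrative sketch only: hypothetical placeholder functions, not the claimed implementation.

def determine_sound_category(object_image_name: str) -> str:
    """Determine target sound category characteristics from an object image (assumed rule)."""
    return "child_voice" if "child" in object_image_name else "adult_voice"

def generate_speech(text: str, sound_category: str) -> dict:
    """Generate speech data that conforms to the sound category characteristics (stubbed)."""
    return {"text": text, "voice": sound_category}

object_image = "child_game_character.png"            # obtain an object image of a virtual object
category = determine_sound_category(object_image)    # determine target sound category characteristics
text = "Welcome to the game!"                        # obtain text information describing the speech content
speech_data = generate_speech(text, category)        # generate speech data with those characteristics
print(speech_data)
```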
Claim 9 recites a “non-transitory computer readable storage medium” and a “processor” for performing the method. These elements are recited at a high level of generality and are recited as performing generic computer functions routinely used in computer applications. The current specification, in paragraphs [0009], [00173], and [00176], describes the “processor” and the “non-transitory computer readable storage medium” as performing generic computer functions that are well-understood, routine, and conventional activities, amounting to no more than implementing the abstract idea with a computerized system. The claims, as drafted, are not patent eligible.
Thus, taken alone, the additional elements do not amount to significantly more than the above-identified judicial exception (the abstract idea). Looking at the limitations as an ordered combination adds nothing that is not already present when looking at the elements taken individually. There is no indication that the combination of elements improves the functioning of a computer or improves any other technology. Their collective functions merely provide conventional computer implementation. Claims 1 and 9 are therefore not drawn to eligible subject matter, as they are directed to an abstract idea without significantly more than the abstract idea.
Independent claim 7 recites “obtaining an object image used to construct the virtual object”; “determining target sound category characteristics corresponding to the virtual object based on the object image”; “and constructing the virtual object associated with the target sound category characteristics based on the object image”. The limitations above, as drafted, recite a process that, under its broadest reasonable interpretation, covers a mental process, as it could be performed in the human mind or with the aid of pen and paper.
The limitations of “obtaining...”, “determining...”, and “constructing...”, as drafted, cover mental activities. More specifically, a person can obtain an image of an object, such as a game character in a game, can determine what type of sound category that character will have, and can construct a game character based on that information. All of the steps above are examples of observation and evaluation that could be performed in the human mind or with the aid of pencil and paper.
Claim 7 does not recite any additional elements. Looking at the limitations as an ordered combination adds nothing that is not already present when looking at the elements taken individually. There is no indication that the combination of elements improves the functioning of a computer or improves any other technology. Their collective functions merely provide conventional computer implementation. Claim 7 is therefore not drawn to eligible subject matter, as it is directed to an abstract idea without significantly more than the abstract idea.
Claims 2 and 10 recite the additional limitation of “wherein determining the target sound category characteristics corresponding to the virtual object based on the object image includes: inputting the object image into an object classification model to obtain target object category characteristics of the virtual object identified by the object classification model”; “and determining the target object category characteristics of the virtual object as the target sound category characteristics corresponding to the virtual object”, where determining the object category of the virtual object and determining the corresponding sound category of the target virtual object could be performed in the human mind or with the aid of pen and paper. The claims recite the additional limitation of an object classification model, which is described in the specification in paragraphs [0057] and [0076] as a classification model that could be a convolutional neural network; this is not sufficient to amount to significantly more than the judicial exception. Claims 2 and 10, as drafted, are not patent eligible.
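As context for the convolutional-neural-network characterization above, the following is a minimal sketch of an object classification model whose identified object category is used directly as the target sound category; the architecture, category labels, and input shape are illustrative assumptions and not the model described in paragraphs [0057] and [0076].

```python
import torch
import torch.nn as nn

# Minimal sketch of an object classification model as a small CNN; architecture and labels are assumed.
class ObjectClassifier(nn.Module):
    def __init__(self, num_categories: int):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 16, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
        )
        self.head = nn.Linear(16, num_categories)

    def forward(self, image: torch.Tensor) -> torch.Tensor:
        return self.head(self.features(image).flatten(1))

categories = ["child", "adult_female", "adult_male", "elderly"]   # assumed category labels
model = ObjectClassifier(num_categories=len(categories))
object_image = torch.randn(1, 3, 224, 224)                        # stand-in for the object image
# The identified object category characteristics double as the target sound category characteristics.
target_sound_category = categories[model(object_image).argmax(dim=1).item()]
print(target_sound_category)
```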
Claims 3 and 11 recite “wherein: the object classification model is obtained by training using a first image sample in at least a sample group, and object category characteristics identified by the object classification model from the first sample image are the same as sound category characteristics identified by a sound classification model from a first sound sample corresponding to the first image sample as training objectives, the sample group including the first image sample and the first sound sample belonging to the same object”, where training an object classification model from an image sample and determining that the object category identified from the image sample and the sound category identified from the corresponding sound sample are the same could be performed in the human mind or with the aid of pen and paper. The claims recite the additional limitations of an object classification model and a sound classification model. The object classification model is described in the specification in paragraphs [0057] and [0076] as a classification model that could be a convolutional neural network, which is not sufficient to amount to significantly more than the judicial exception. The sound classification model is described in the specification in paragraph [0080] as an open-source sound classification model, which could be a neural network model, which is not sufficient to amount to significantly more than the judicial exception. Claims 3 and 11, as drafted, are not patent eligible.
Claims 4 and 12 recite “wherein: the sample group is labeled with an actual object identifier, and the training objectives also includes that predicted object information of the first image sample determined by using the object classification model is consistent with the actual object identifier labeled by the sample group to which the first image sample belongs”, where determining that the object identifier labeled on the sample group and the predicted object information are consistent is an observation and evaluation that could be performed in the human mind or with the aid of pen and paper. The claims recite the additional limitation of an object classification model, which is described in the specification in paragraphs [0057] and [0076] as a classification model that could be a convolutional neural network; this is not sufficient to amount to significantly more than the judicial exception. Claims 4 and 12, as drafted, are not patent eligible.
Claims 5 and 13 recite “wherein training the object classification model includes: obtaining at least one sample groups; for each sample group, inputting the first image sample in the sample group into an image classification model and the first sound sample in the sample group into the sound classification model, and extracting the sound category characteristics identified by the sound classification model and the object category characteristics identified by the image classification model to obtain the predicted object information corresponding to the first image sample determined by the image classification model; and in response to the training objectives not being met based on characteristic similarity, the predicted object information and the actual object identifier corresponding to each sample group, adjusting parameters of the image classification model, and returning to extraction of the sound category characteristics identified by the sound classification model and the object category characteristics identified by the image classification model until the training objectives are met, the image classification model being used to determine a trained object classification model”, where predicting the object information from the image samples and sound samples, and, in response to determining that the predicted object information and the actual object identifier do not match, adjusting parameters until they match, could be performed with the aid of pen and paper. The claims recite the additional limitations of an object classification model and a sound classification model. The object classification model is described in the specification in paragraphs [0057] and [0076] as a classification model that could be a convolutional neural network, which is not sufficient to amount to significantly more than the judicial exception. The sound classification model is described in the specification in paragraph [0080] as an open-source sound classification model, which could be a neural network model, which is not sufficient to amount to significantly more than the judicial exception. Claims 5 and 13, as drafted, are not patent eligible.
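The training procedure recited above can be illustrated, purely as a sketch, by the loop below: an image classification model is adjusted until its category characteristics are similar to those produced by a fixed sound classification model for the paired sound sample and its predicted object information matches the actual object identifier. The model definitions, loss terms, sample data, and stopping criterion are assumptions, not the applicant's disclosed training method.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Illustrative training-loop sketch of the recited objectives (assumed models and data).
image_model = nn.Sequential(nn.Flatten(), nn.Linear(3 * 64 * 64, 32))   # image classification model (trainable)
sound_model = nn.Sequential(nn.Flatten(), nn.Linear(128, 32))            # sound classification model (held fixed)
object_head = nn.Linear(32, 10)                                          # predicts the object identifier
optimizer = torch.optim.Adam(
    list(image_model.parameters()) + list(object_head.parameters()), lr=1e-3
)

# One assumed sample group: a first image sample, a first sound sample, and an actual object identifier.
image_sample = torch.randn(1, 3, 64, 64)
sound_sample = torch.randn(1, 128)
object_id = torch.tensor([3])

for step in range(100):
    image_feat = image_model(image_sample)                 # object category characteristics
    with torch.no_grad():
        sound_feat = sound_model(sound_sample)             # sound category characteristics
    # Objective 1: characteristic similarity between the image-derived and sound-derived features.
    similarity_loss = 1 - F.cosine_similarity(image_feat, sound_feat).mean()
    # Objective 2: predicted object information consistent with the actual object identifier.
    id_loss = F.cross_entropy(object_head(image_feat), object_id)
    loss = similarity_loss + id_loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    if loss.item() < 0.05:                                 # assumed criterion for "training objectives met"
        break
```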
Claims 6 and 14 recite “wherein generating the speech data that conforms to the target sound category characteristics based on the text information and the target sound category characteristics includes: using a speech synthesis model to construct the speech data to obtain the speech data with the target sound category characteristics based on the text information and the target sound category characteristics”, where generating speech data based on the text and the sound category could be performed in the human mind. The claims do not include additional elements that are sufficient to amount to significantly more than the judicial exception, as claims 6 and 14 do not recite any additional limitations. The claims, as drafted, are not patent eligible.
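For illustration, conditioning a speech synthesis model on both the text information and the target sound category characteristics might look like the sketch below; the SpeechSynthesisModel class is a hypothetical placeholder interface, not a specific library API or the applicant's model.

```python
# Hypothetical placeholder interface for a speech synthesis (TTS) model; illustrative only.
class SpeechSynthesisModel:
    def synthesize(self, text: str, sound_category: str) -> bytes:
        # A real model would return an audio waveform; this stub only illustrates the two inputs.
        return f"[{sound_category}] {text}".encode("utf-8")

tts = SpeechSynthesisModel()
speech_data = tts.synthesize(
    text="Welcome to the game!",           # text information describing the speech content
    sound_category="child_voice",          # target sound category characteristics from the object image
)
```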
Claim 8 recites “obtaining text information, the text information being used to describe speech content that needs to be output by the virtual object; and generating speech data with the target sound category characteristics for the virtual object based on the text information”, where generating speech data from the text information could be performed in the human mind. The claim does not include additional elements that are sufficient to amount to significantly more than the judicial exception, as claim 8 does not recite any additional limitations. The claim, as drafted, is not patent eligible.
Claim Rejections - 35 USC § 103
In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102 of this title, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.
Claims 1 and 9 are rejected under 35 U.S.C. 103 as being unpatentable over Xu et al. (CN 114596836 A), hereinafter referenced as Xu, in view of Chen et al. (US 20240107127 A1), hereinafter referenced as Chen.
Regarding Claim 1, Xu teaches a speech generation method comprising:
obtaining an object image of a virtual object (Xu: Page 6, para. [2], obtaining the target image and target text for speech synthesis, where the target image comprises a target object as a virtual character);
determining target sound category characteristics corresponding to the virtual object based on the object image (Xu: Page 6, para. [3], inputting the target image into the trained sound feature extraction model to obtain the sound feature of the target object (virtual character));
and generating speech data that conforms to the target sound category characteristics based on the text information and the target sound category characteristics (Xu: Page 6, para. [10], the sound feature of the target object and the target text of the to-be-synthesized speech are input into the trained speech synthesis model to obtain the target speech with the sound characteristic of the target object output by the speech synthesis model).
Xu, while teaching the method of claim 1, fails to explicitly teach the claimed obtaining text information, the text information being used to describe speech content that needs to be output by the virtual object.
However, Chen does teach the claimed obtaining text information, the text information being used to describe speech content that needs to be output by the virtual object (Chen: Para. [0251], obtaining subtitle text for the virtual character object, which will broadcast the subtitle text (such as a news broadcast)).
Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to incorporate Chen’s teaching of a video display and processing method, apparatus, and system in the field of multimedia technology into the system and method of voice synthesis taught by Xu, because this would improve the quality of the produced video and reduce the time cost of producing the video (Chen: Para. [0033]).
Claim 9 is a non-transitory computer readable storage medium claim containing computer-executable instructions that, when executed by one or more processors (Xu: Page 3, para. [4], a computer readable storage medium is provided which stores a computer program; when the computer program is executed by a processor, the processor executes the method), perform the steps in method claim 1 above. As such, claim 9 is similar in scope and content to claim 1, and claim 9 is therefore rejected under a similar rationale as presented against claim 1 above.
Regarding Claim 2, Xu in view of Chen teach the method of claim 1. Xu further teaches, wherein determining the target sound category characteristics corresponding to the virtual object based on the object image includes: inputting the object image into an object classification model to obtain target object category characteristics of the virtual object identified by the object classification model (Xu: Page 2, para. [3], inputting the target image corresponding to the virtual character into the sound feature extraction model);
and determining the target object category characteristics of the virtual object as the target sound category characteristics corresponding to the virtual object (Xu: Page 2, para. [6], obtaining the sound feature of the target object).
Claim 10 is a non-transitory computer readable storage medium claim performing the steps in method claim 2 above and, as such, claim 10 is similar in scope and content to claim 2; therefore, claim 10 is rejected under a similar rationale as presented against claim 2 above.
Regarding Claim 3, Xu in view of Chen teach the method of claim 2. Xu further teaches, wherein: the object classification model is obtained by training using a first image sample in at least a sample group, and object category characteristics identified by the object classification model from the first sample image are the same as sound category characteristics identified by a sound classification model from a first sound sample corresponding to the first image sample as training objectives, the sample group including the first image sample and the first sound sample belonging to the same object (Xu: Page 7, para. [8], [10], the training of the sound feature extraction model, which can be, for example, a deep learning Deep Neural Network (DNN) model, by obtaining the sample image and the annotation data of the sample image, where the sample image comprises a sample object as a virtual character image and the annotation data comprises the sample sound characteristic of the sample object).
Claim 11 is a non-transitory computer readable storage medium claim performing the steps in method claim 3 above and, as such, claim 11 is similar in scope and content to claim 3; therefore, claim 11 is rejected under a similar rationale as presented against claim 3 above.
Regarding Claim 4, Xu in view of Chen teach the method of claim 3. Xu further teaches, wherein: the sample group is labeled with an actual object identifier (Xu: Page 2, para. [4], obtaining the annotated sample image),
and the training objectives also includes that predicted object information of the first image sample determined by using the object classification model is consistent with the actual object identifier labeled by the sample group to which the first image sample belongs (Xu: Page 8, para. [8], the training process includes calculating a loss value based on the sample object prediction value and the real value).
Claim 12 is a non-transitory computer readable storage medium claim performing the steps in method claim 4 above and, as such, claim 12 is similar in scope and content to claim 4; therefore, claim 12 is rejected under a similar rationale as presented against claim 4 above.
Regarding Claim 5, Xu in view of Chen teach the method of claim 4. Xu further teaches, wherein training the object classification model includes: obtaining at least one sample groups (Xu: Page 10, para. [2], obtaining the sample image);
for each sample group, inputting the first image sample in the sample group into an image classification model and the first sound sample in the sample group into the sound classification model, and extracting the sound category characteristics identified by the sound classification model and the object category characteristics identified by the image classification model to obtain the predicted object information corresponding to the first image sample determined by the image classification model (Xu: Page 10, para. [2], the sample image is input into the sound feature extraction model, which outputs the first prediction sound feature of the sample object in the sample image);
and in response to the training objectives not being met based on characteristic similarity, the predicted object information and the actual object identifier corresponding to each sample group, adjusting parameters of the image classification model, and returning to extraction of the sound category characteristics identified by the sound classification model and the object category characteristics identified by the image classification model until the training objectives are met, the image classification model being used to determine a trained object classification model (Xu: Page 8, para. [7]; Page 9, para. [5], [10], the loss value is calculated based on the prediction voice spectrum characteristic (predicted value) and the sample voice spectrum characteristic (real value); subsequently, based on the loss value, the parameters of the sound feature extraction model are adjusted).
Claim 13 is a non-transitory computer readable storage medium claim performing the steps in method claim 5 above and, as such, claim 13 is similar in scope and content to claim 5; therefore, claim 13 is rejected under a similar rationale as presented against claim 5 above.
Regarding Claim 6, Xu in view of Chen teach the method of claim 1. Xu further teaches, wherein generating the speech data that conforms to the target sound category characteristics based on the text information and the target sound category characteristics includes: using a speech synthesis model to construct the speech data to obtain the speech data with the target sound category characteristics based on the text information and the target sound category characteristics (Xu: Page 3, para. [1], Page 6, para. [10], the speech synthesis model generates the speech corresponding to the sound feature of the target object and the target text).
Claim 14 is a non-transitory computer readable storage medium claim performing the steps in method claim 6 above and, as such, claim 14 is similar in scope and content to claim 6; therefore, claim 14 is rejected under a similar rationale as presented against claim 6 above.
Claims 7 and 8 are rejected under 35 U.S.C. 103 as being unpatentable over Chen et al. (US 20240107127 A1), hereinafter referenced as Chen, in view of Xu et al. (CN 114596836 A), hereinafter referenced as Xu.
Regarding Claim 7, Chen teaches a virtual object generation method comprising: obtaining an object image used to construct the virtual object (Chen: Para. [0255], a user image is input into a preset biometric feature extraction model to extract the user’s biometric features and obtain an avatar with the user’s biometric features).
Chen, while teaching the method of claim 7, fails to explicitly teach the claimed determining target sound category characteristics corresponding to the virtual object based on the object image, and constructing the virtual object associated with the target sound category characteristics based on the object image.
However, Xu does teach the claimed determining target sound category characteristics corresponding to the virtual object based on the object image (Xu: Page 6, para. [3], inputting the target image into the trained sound feature extraction model to obtain the sound feature of the target object (virtual character));
and constructing the virtual object associated with the target sound category characteristics based on the object image (Xu: Page 6, para. [2], obtaining the target image and target text for speech synthesis, where the target image comprises a target object as a virtual character; Page 6, para. [5], generating a virtual character corresponding to the sound feature).
Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to incorporate Xu’s teaching of a voice synthesis method and model training method into the video display and processing method, apparatus, and system in the field of multimedia technology taught by Chen, because this would improve the diversity and flexibility of man-machine voice interaction (Xu: Page 3).
Regarding Claim 8, Chen in view of Xu teach the method of claim 7. Chen further teaches, further comprising: obtaining text information, the text information being used to describe speech content that needs to be output by the virtual object (Chen: Para. [0251], obtaining subtitle text for the virtual character object, which will broadcast the subtitle text (such as a news broadcast));
Xu further teaches, and generating speech data with the target sound category characteristics for the virtual object based on the text information (Xu: Page 6, para. [10], the sound feature of the target object and the target text of the to-be-synthesized speech are input into the trained speech synthesis model to obtain the target speech with the sound characteristic of the target object output by the speech synthesis model).
Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to incorporate Xu’s teaching of a voice synthesis method and model training method into the video display and processing method, apparatus, and system in the field of multimedia technology taught by Chen, because this would improve the diversity and flexibility of man-machine voice interaction (Xu: Page 3).
Conclusion
The prior art made of record and not relied upon, listed below, is considered pertinent to applicant’s disclosure.
QI et al. (US 20240169687 A1) teaches model training methods, scene recognition methods, and related devices. One example method includes obtaining a first image, recognizing an image of a target object irrelevant to scene recognition in the first image by using an object detection model, performing masking on a region in which the target object is located in the first image to obtain a third image, then generating a plurality of sample object images that are irrelevant to the scene recognition through an image generative model, combining the sample object image and the third image to obtain a target image, inputting the target image to a first convolutional neural network for training, and inputting the third image to a second convolutional neural network for training to obtain a scene recognition model.
Liu et al. (CN 114121006 A) teaches an invention in the artificial intelligence field, claiming an image output method for a virtual character, together with a device and a storage medium. The method comprises: when receiving an interactive request from the target object, outputting a preset interactive response according to the interactive request, and collecting audio data and video data of the target object; extracting first voice data of the target object from the audio data; obtaining second voice data corresponding to the target object from the video data; determining target voice data of the target object according to the first voice data and the second voice data; obtaining target text information according to the target voice; using a semantic analysis model to perform semantic classification processing on the target text information to obtain a classification result; obtaining a target response scheme according to the classification result, and generating response voice information and facial image control information of the virtual character; and outputting the response voice information and controlling the virtual character's facial state display according to the facial image control information.
Wang et al. (US 20200296532 A1) teaches a sound reproduction method performed at a computing device. The method includes: detecting, by the computing device, a sound triggering event that corresponds to a first virtual object, the sound triggering event carrying sound source feature information used for matching a sound source; determining, by the computing device according to the sound source feature information, a sound source position at which the sound source is located, and obtaining a first transmission distance between the sound source position and a first position at which the first virtual object is located; determining, by the terminal according to the first transmission distance, a target sound of the sound source at the first position; and generating, by the terminal, the target sound at the first position in the virtual scene. This application resolves a technical problem that accuracy of sound reproduction is relatively low in a sound reproduction method.
Li et al. (“Learning Visual Styles from Audio-Visual Associations,” arXiv:2205.05072 [cs.CV], 10 May 2022) teaches a variety of methods for manipulating the style of an input image. From the patter of rain to the crunch of snow, the sounds we hear often convey the visual textures that appear within a scene. In this paper, we present a method for learning visual styles from unlabeled audio-visual data. Our model learns to manipulate the texture of a scene to match a sound, a problem we term audio-driven image stylization. Given a dataset of paired audio-visual data, we learn to modify input images such that, after manipulation, they are more likely to co-occur with a given input sound. In quantitative and qualitative evaluations, our sound-based model outperforms label-based approaches. We also show that audio can be an intuitive representation for manipulating images, as adjusting a sound’s volume or mixing two sounds together results in predictable changes to visual style.
Any inquiry concerning this communication or earlier communications from the examiner should be directed to NADIRA SULTANA, whose telephone number is (571) 272-4048. The examiner can normally be reached M-F, 7:30 am-5:00 pm.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Paras D. Shah can be reached on (571) 270-1650. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.
/NADIRA SULTANA/Examiner, Art Unit 2653 /DOUGLAS GODBOLD/Primary Examiner, Art Unit 2655