Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
DETAILED ACTION
Claim Rejections - 35 USC § 102
In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.
The following is a quotation of the appropriate paragraphs of 35 U.S.C. 102 that form the basis for the rejections under this section made in this Office action:
A person shall be entitled to a patent unless –
(a)(1) the claimed invention was patented, described in a printed publication, or in public use, on sale, or otherwise available to the public before the effective filing date of the claimed invention.
(a)(2) the claimed invention was described in a patent issued under section 151, or in an application for patent published or deemed published under section 122(b), in which the patent or application, as the case may be, names another inventor and was effectively filed before the effective filing date of the claimed invention.
Claims 1-3, 5-8, 10-15, 17, and 19-20 are rejected under 35 U.S.C. 102(a)(1) as being anticipated by Kothari et al. (US Pub 2021/0375023 A1).
As to claim 1, Kothari discloses a processor, comprising: one or more circuits to use one or more neural networks to generate one or more portions of one or more images based, at least in part, on one or more tasks to identify one or more objects in the one or more portions (Kothari, ¶0045, “AR headset 104 or another display device can include one or more front-facing cameras 106 that can capture image data for objects located within a field of view of these one or more cameras. In at least one embodiment, this image data can be analyzed on a device such as client device 110 or remote server 130, for purposes of generating content to be displayed on a display 108, such as a display screen of AR headset 104. In at least one embodiment, display 108 can be at least partially transparent so objects can be seen through this display, but image content can be rendered and displayed as a type of overlay with respect to this image content.” “this enables content to be generated and displayed that relates in some way to one or more objects of an environment or scene in which a user is located. It at least one embodiment, this enables a user to interact or direct a certain type of content to be generated based on objects that a user places within that field of view, or objects within that field of view for which a user provides input.” ¶0046, “an application can attempt to identify one or more objects represented or described in this text, and then generate image, video, or animation content to be displayed to user 102 via display 108. In at least one embodiment, as user 102 causes this text to change, such as my flipping pages of a book 150 or scrolling an interface, this generated content can update to represent this change in text.” ¶0048, “an image analysis component 114 can analyze this image data, such as to determine text included in this image data, as well as to determine changes to text represented in this data. 
In at least one embodiment, a content manager 116 can then utilize this text to generate AR content to be presented via headset 104. In at least one embodiment, this can include identifying objects, characters, or other elements represented in this text and generating images or animation representative of those objects, characters, or elements.” The “image data for objects located within a field of view of these one or more cameras” corresponds to the claimed one or more portions of one or more images. The step to “identify one or more objects represented or described in this text” corresponds to the claimed one or more tasks to identify one or more objects in the one or more portions. ¶0051, “one or more neural networks can be used attempt to animate an entire story or sequence represented in input text.” corresponds to generating one or more portions of one or more images. ¶0061, “use of cached latent space with this VAE enables extracted features of a scene, as captured by an AR camera, can be leveraged to build this cache, and groom a scene around that information. In at least one embodiment, if a real-world background includes a wall with a painting and text content also includes a painting, this real-world scene and object positioning can be leveraged while creating a clip.”).
As to claim 2, claim 1 is incorporated and Kothari discloses the one or more neural networks generate the one or more portions of one or more images based, at least in part, on comparing one or more loss values calculated by the one or more tasks (Kothari, ¶0384, “model training 3114 may include retraining or updating an initial model 3304 (e.g., a pre-trained model) using new training data (e.g., new input data, such as customer dataset 3306, and/or new ground truth data associated with input data). In at least one embodiment, to retrain, or update, initial model 3304, output or loss layer(s) of initial model 3304 may be reset, or deleted, and/or replaced with an updated or new output or loss layer(s). In at least one embodiment, initial model 3304 may have previously fine-tuned parameters (e.g., weights and/or biases) that remain from prior training, so training or retraining 3114 may not take as long or require as much processing as training a model from scratch. In at least one embodiment, during model training 3114, by having reset or replaced output or loss layer(s) of initial model 3304, parameters may be updated and re-tuned for a new data set based on loss calculations associated with accuracy of output or loss layer(s) at generating predictions on new, customer dataset 3306 (e.g., image data 3108 of FIG. 31).” Retraining can be a feature of the one or more neural networks.).
As to claim 3, claim 1 is incorporated and Kothari discloses the one or more neural networks include one or more transformer models each specific to one or more independent tasks (Kothari, ¶0057, “a segment of text can be encoded using layers of a convolutional neural network (CNN), as a CNN can determine an underlying pattern in this text, where different patterns can be used to generate different objects, actions, or occurrences in animation. In at least one embodiment, encoding from one or more CNNs 210 can be fed to a transformer 212, which can determine and understand how objects are related, which is important to bring consistency and coherence to a sequence of animation. In at least one embodiment, a transformer can attempt to understand an entire story in order to generate appropriate animation. In at least one embodiment, this can include determining which objects to emphasize, as well as a relative importance of various events to this overall story. In at least one embodiment, transformer 212 can have an ability to go back and modify already generated animation as transformer 212 better understands this story and importance of objects and events in that story.”).
As to claim 5, claim 1 is incorporated and Kothari discloses the one or more neural networks are trained to perform the one or more tasks to include identifying text in an image (Kothari, ¶0046, “an object within a field of view of a camera 106 of an AR headset 104 may include textual content. In at least one embodiment, this may include a book, magazine, digital interface, publication, or other source of text. In at least one embodiment, camera 106 can capture image or video data including a view of at least a portion of this text, and this text can be analyzed to determine content to be presented to user 102 through AR headset 104.” “an application can attempt to identify one or more objects represented or described in this text, and then generate image, video, or animation content to be displayed to user 102 via display 108. In at least one embodiment, as user 102 causes this text to change, such as my flipping pages of a book 150 or scrolling an interface, this generated content can update to represent this change in text.” ¶0053, “these images or video frames may be passed to an optical character recognition (OCR) application 206 or module that can generate text data corresponding to text represented in these images or video, wherein this text can be provided as an additional input to one or more neural networks to help improve inferencing results.”).
As to claim 6, claim 1 is incorporated and Kothari discloses the one or more tasks include one or more tasks to perform modification of an image (Kothari, ¶0053, “these images or video frames may be passed to an optical character recognition (OCR) application 206 or module that can generate text data corresponding to text represented in these images or video, wherein this text can be provided as an additional input to one or more neural networks to help improve inferencing results.” ¶0054, “these text frames and any other identified text data can be used to generate image, animation, or other media content to be displayed or presented through AR device 202 or client device 204” ¶0057, “a segment of text can be encoded using layers of a convolutional neural network (CNN), as a CNN can determine an underlying pattern in this text, where different patterns can be used to generate different objects, actions, or occurrences in animation. In at least one embodiment, encoding from one or more CNNs 210 can be fed to a transformer 212, which can determine and understand how objects are related, which is important to bring consistency and coherence to a sequence of animation. In at least one embodiment, a transformer can attempt to understand an entire story in order to generate appropriate animation. In at least one embodiment, this can include determining which objects to emphasize, as well as a relative importance of various events to this overall story. In at least one embodiment, transformer 212 can have an ability to go back and modify already generated animation as transformer 212 better understands this story and importance of objects and events in that story.”).
As to claim 7, claim 1 is incorporated and Kothari discloses the one or more circuits are to cause the one or more neural networks to be trained based, at least in part, on performing one or more independent tasks (Kothari, ¶0058, “a copy of an entire text can be obtained such that a system can have access to portions of that text other than are available from an AR device at a current time. In at least one embodiment, input text is pre-processed in order to detect sections that can be represented visually. In at least one embodiment, this can include using named entity recognition (NER) to detect objects and associated descriptions, as well as bi-directional encoder representations from transformers (BERT) to maintain context of these sections and group by relevance. In at least one embodiment, NER can be used to extract a real world noun entity from text data and classify this entity into a pre-defined category or classification, as may relate to a person, place, or object. In at least one embodiment, BERT can be used to perform transfer learning, sentiment analysis, sentence classification, and other such tasks useful for determining context. In at least one embodiment, NER can be performed unsupervised without labeled text using a BERT model that has been trained in an unsupervised fashion on a corpus with a specific language model objective.” ¶0067, ¶0084, ¶0384, “model training 3114 may include retraining or updating an initial model 3304 (e.g., a pre-trained model) using new training data (e.g., new input data, such as customer dataset 3306, and/or new ground truth data associated with input data). In at least one embodiment, to retrain, or update, initial model 3304, output or loss layer(s) of initial model 3304 may be reset, or deleted, and/or replaced with an updated or new output or loss layer(s).” ¶0384-0387.).
As to claim 8, Kothari discloses a system, comprising: one or more processors to use one or more neural networks to generate one or more portions of one or more images based, at least in part, on one or more tasks to identify one or more objects in the one or more portions (See claim 1 for detailed analysis.).
As to claim 10, claim 8 is incorporated and Kothari discloses the one or more neural networks include one or more transformer models each specific to one or more tasks (See claim 3 for detailed analysis.).
As to claim 11, claim 8 is incorporated and Kothari discloses the one or more neural networks include a visual decoder, masking autoencoder, and visual encoder (See claim 4 for detailed analysis.).
As to claim 12, claim 8 is incorporated and Kothari discloses the one or more neural networks include at least one neural network to perform one or more tasks of identifying text in the one or more portions of the one or more images (See claim 5 for detailed analysis.).
As to claim 13, claim 8 is incorporated and Kothari discloses the one or more tasks include tasks to perform modification of the one or more portions of the one or more images (See claim 6 for detailed analysis.).
As to claim 14, claim 8 is incorporated and Kothari discloses the one or more processors training the one or more neural networks based, at least in part, on one or more results generated by the one or more tasks (See claim 7 for detailed analysis.).
As to claim 15, Kothari discloses a method, comprising: using one or more neural networks to generate one or more portions of one or more images based, at least in part, on one or more tasks to identify one or more objects in the one or more portions (See claim 1 for detailed analysis.).
As to claim 17, claim 15 is incorporated and Kothari discloses the one or more neural networks include one or more transformer models each specific to one or more tasks (See claim 3 for detailed analysis.).
As to claim 19, claim 15 is incorporated and Kothari discloses the one or more neural networks are each trained to perform the one or more tasks including identifying text in an image (See claim 5 for detailed analysis.).
As to claim 20, claim 15 is incorporated and Kothari discloses the one or more tasks include performing modification of an image (See claim 6 for detailed analysis.).
Claim Rejections - 35 USC § 103
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102 of this title, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.
The factual inquiries set forth in Graham v. John Deere Co., 383 U.S. 1, 148 USPQ 459 (1966), that are applied for establishing a background for determining obviousness under 35 U.S.C. 103 are summarized as follows:
1. Determining the scope and contents of the prior art.
2. Ascertaining the differences between the prior art and the claims at issue.
3. Resolving the level of ordinary skill in the pertinent art.
4. Considering objective evidence present in the application indicating obviousness or nonobviousness.
Claims 4, 9, 16, and 18 are rejected under 35 U.S.C. 103 as being unpatentable over Kothari et al. (US Pub 2021/0375023 A1) in view of He, Kaiming, et al. ("Masked autoencoders are scalable vision learners." Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. 2022.).
As to claim 4, claim 1 is incorporated and Kothari discloses the one or more neural networks include a visual decoder (Kothari, ¶0054, “this information can then be provided as input to one or more neural networks, such as one or more variational auto-encoders (VAEs). In at least one embodiment, these VAEs 214, which can include appropriate encoders 216 and decoders 218, can then generate images or animation that is meaningful with respect to these input text images, while also being consistent over a sequence or period of time.” ¶0058, “a copy of an entire text can be obtained such that a system can have access to portions of that text other than are available from an AR device at a current time. In at least one embodiment, input text is pre-processed in order to detect sections that can be represented visually. In at least one embodiment, this can include using named entity recognition (NER) to detect objects and associated descriptions, as well as bi-directional encoder representations from transformers (BERT) to maintain context of these sections and group by relevance.” ¶0059, “an autoencoder such as a VAE 214 can be utilized whose latent space is guided by an accumulating transformer 212, along with data in a cache”).
Kothari does not explicitly disclose a masking autoencoder. However, Kothari discloses BERT (Kothari, ¶0058, “input text is pre-processed in order to detect sections that can be represented visually. In at least one embodiment, this can include using named entity recognition (NER) to detect objects and associated descriptions, as well as bi-directional encoder representations from transformers (BERT) to maintain context of these sections and group by relevance. In at least one embodiment, NER can be used to extract a real world noun entity from text data and classify this entity into a pre-defined category or classification, as may relate to a person, place, or object. In at least one embodiment, BERT can be used to perform transfer learning, sentiment analysis, sentence classification, and other such tasks useful for determining context. In at least one embodiment, NER can be performed unsupervised without labeled text using a BERT model that has been trained in an unsupervised fashion on a corpus with a specific language model objective.”). BERT is a well-known masked autoencoder.
He also teaches a masking autoencoder (He, abstract, “masked autoencoders (MAE) are scalable self-supervised learners for computer vision.” Page 1, “The idea of masked autoencoders, a form of more general denoising autoencoders [58], is natural and applicable in computer vision as well.” Page 3, “Masked language modeling and its autoregressive counterparts, e.g., BERT [14] and GPT [47, 48, 4]”).
Kothari and He are considered to be analogous art because both pertain to deep learning. It would have been obvious before the effective filing date of the claimed invention to have modified Kothari with the “masking autoencoder” feature as taught by He. The suggestion/motivation would have been that the idea of masked autoencoders, a form of more general denoising autoencoders, is natural and applicable in computer vision as well (He, Page 1).
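For illustration only, the masking step of a masked autoencoder of the general kind He describes can be sketched as follows. This is a minimal sketch and not a reproduction of He's or Kothari's implementation: the helper name `random_patch_mask`, the patch count, the seed, and the default masking ratio are assumptions chosen for the example.

```python
import random


def random_patch_mask(num_patches, mask_ratio=0.75, seed=0):
    """Split patch indices into visible and masked sets.

    Hypothetical helper for illustration; the name, the default ratio,
    and the seeding scheme are assumptions, not taken from the references.
    """
    if not 0.0 <= mask_ratio < 1.0:
        raise ValueError("mask_ratio must be in [0, 1)")
    rng = random.Random(seed)          # local RNG so the split is repeatable
    indices = list(range(num_patches))
    rng.shuffle(indices)               # random order of patch indices
    num_masked = int(num_patches * mask_ratio)
    masked = sorted(indices[:num_masked])    # patches hidden from the encoder
    visible = sorted(indices[num_masked:])   # patches the encoder would see
    return visible, masked


# Example: an image divided into a 4 x 4 grid of patches (16 patches).
visible, masked = random_patch_mask(num_patches=16)
print(len(visible), len(masked))
```

In such a scheme, an encoder would process only the visible patches and a decoder would be trained to reconstruct the masked ones; the sketch covers only the index split and omits both networks.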
As to claim 9, claim 8 is incorporated and the combination of Kothari and He discloses the one or more neural networks generate the one or more portions of one or more images from an image masked based, at least in part, on a masked autoencoder (See claim 4 for detailed analysis.).
As to claim 16, claim 15 is incorporated and the combination of Kothari and He discloses generating the one or more portions of the one or more images from one or more portions of one or more images masked based, at least in part, on a masked autoencoder (See claim 4 for detailed analysis. He, Fig. 1, Fig. 2. Page 3. Page 12. Fig. 11, Page 14.).
As to claim 18, claim 15 is incorporated and the combination of Kothari and He discloses the one or more neural networks include a visual decoder, masking autoencoder, and visual encoder (See claim 4 for detailed analysis.).
Conclusion
The prior art made of record and not relied upon is considered pertinent to applicant's disclosure.
Devlin, Jacob, et al. "BERT: Pre-training of deep bidirectional transformers for language understanding." Proceedings of the 2019 conference of the North American chapter of the association for computational linguistics: human language technologies, volume 1 (long and short papers). 2019. (Year: 2019) discloses pre-training deep bidirectional representations from unlabeled text by jointly conditioning on both left and right context in all layers.
Any inquiry concerning this communication or earlier communications from the examiner should be directed to YU CHEN whose telephone number is (571)270-7951. The examiner can normally be reached on M-F 8-5 PST Mid-day flex.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Xiao Wu, can be reached on 571-272-7761. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.
/YU CHEN/
Primary Examiner, Art Unit 2613