DETAILED ACTION
Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
Claim Rejections - 35 USC § 101
35 U.S.C. 101 reads as follows:
Whoever invents or discovers any new and useful process, machine, manufacture, or composition of matter, or any new and useful improvement thereof, may obtain a patent therefor, subject to the conditions and requirements of this title.
Claim 13, and therefore claims 14–18, which depend therefrom, are rejected under 35 U.S.C. 101 because the claimed invention is directed to non-statutory subject matter. The claims do not fall within at least one of the four categories of patent eligible subject matter because claim 13 recites “a machine-readable storage medium storing instructions,” and the specification does not provide a disavowal of scope precluding the broadest reasonable interpretation from including a signal per se. Therefore, the broadest reasonable interpretation of claim 13 includes a signal per se, which is non-statutory subject matter. To overcome this rejection, claim 13 should be amended to recite “non-transitory.”
Claim Rejections - 35 USC § 103
In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis (i.e., changing from AIA to pre-AIA) for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.
The factual inquiries for establishing a background for determining obviousness under 35 U.S.C. 103 are summarized as follows:
1. Determining the scope and contents of the prior art.
2. Ascertaining the differences between the prior art and the claims at issue.
3. Resolving the level of ordinary skill in the pertinent art.
4. Considering objective evidence present in the application indicating obviousness or nonobviousness.
Claims 1–18 are rejected under 35 U.S.C. 103 as being unpatentable over Wang et al., "Images Speak in Images: A Generalist Painter for In-Context Visual Learning," 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Vancouver, BC, Canada, 2023, pp. 6830-6839, doi: 10.1109/CVPR52729.2023.00660 (herein “Wang”) in view of Zhang et al., "Instruct Me More! Random Prompting for Visual In-Context Learning," 2024 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), Waikoloa, HI, USA, January 8, 2024, pp. 2585-2594, doi: 10.1109/WACV57701.2024.00258 (herein “Zhang”).
Regarding claims 1, 7 and 13, with claim 1 as exemplary, substantive differences between the claims noted in curly brackets {}, and deficiencies of Wang noted in square brackets [], Wang teaches {an image processing model training apparatus, the image processing model which is a convolutional neural network model including Y prompt modules and M image processing modules, each of the M image processing modules corresponding to at least one of the Y prompt modules, the apparatus comprising: a memory for storing instructions; and [one or more processors] for executing the instructions to cause the apparatus to: - claim 1 / an image processing model training method, the image processing model which is a convolutional neural network model including Y prompt modules and M image processing modules, each of the M image processing modules corresponding to at least one of the Y prompt modules, the method comprising: - claim 7 / a machine-readable storage medium storing instructions, which when executed by [one or more processors], cause the machine to:} (Wang pages 6830–6831, fig. 1, Abstract and Introduction section, and page 6834, Architecture section, a generalist image processing model called “Painter” which is trained, and is comprised of multiple processing paths (shown as rows in fig. 1), each path for executing different image processing tasks, thus the Painter model embodying M image processing modules (one module for each different image processing task performed) with Y prompt inputs respective to the task module associated with the prompt input, where Wang’s title section includes a GitHub repository https://github.com/baaivision/Painter, and thus teaches an apparatus of storage with instructions intended to be executed on a processor)
acquire sample data including N sample images and a reference result corresponding to each of the N sample images (Wang page 6833, fig. 2 image pairs, section 3.2 Input format, during training, each input sample includes a pair of images, each image pair consisting of one image (sample image) and its corresponding task output (reference result));
[add corresponding prompt information to the N sample images] based on the Y prompt modules, [to obtain N prompt sample images] corresponding to each image processing module, wherein the prompt information is related to each image processing task of each image processing module (Wang page 6833, fig. 2 two pairs, an input sample comprised of the concatenation of two pairs of images is masked in patches and replaced by learnable token-vectors, where page 6836, section 4.3 teaches that in a prompt tuning step, the learnable tensors define the task prompt (prompt information), and pages 6832–6833, section 3.1 teaching that the learnable tensor is redefined for each task (related to each image processing task));
predict the N prompt sample images corresponding to each image processing module, to obtain N prediction results corresponding to each image processing module (Wang pages 6834–6835, section 4.1 training details and figure 3, from multiple input images respective to different tasks, a prediction is made for each task (each image processing module) in a one-to-one correlation (N prediction results for N prompt sample images)); and
adjust parameters of the M image processing modules and [the Y prompt modules] based on the N prediction results and N reference results corresponding to each image processing module (Wang pages 6834–6835, section 4.1, training details, and section 3.2 loss function, an AdamW optimizer with a cosine learning rate scheduler is employed during training with a simple regression loss computed on the masked pixels, where different sampling weights are used respective to the different tasks).
Although Wang discloses its image processing methodology as embodied in software, stored on a GitHub repository https://github.com/baaivision/Painter, and thus teaches an apparatus of storage with instructions intended to be executed on a processor (see Wang title information), Wang nonetheless does not explicitly teach one or more processors. However, it would have been obvious to a person having ordinary skill in the art (herein “PHOSITA”) before the effective filing date of the claimed invention to have included one or more processors in the architecture of Wang, at least because Wang does at least teach the results of the execution of its software (see page 6835, results section), and therefore, such a modification would have been applying a known technique to a known device (method, or product) ready for improvement to yield predictable results. See MPEP § 2143(I)(D).
Further, while Wang teaches prompt information respective to each image processing task, Wang does not explicitly teach “add corresponding prompt information to the N sample images” “to obtain N prompt sample images.” Still further, while Wang does teach respective input paths for each task, and different prompt tuning for each task, thus at least teaching Y prompt modules, Wang does not teach a prompt module having parameters that are adjusted, and thus does not explicitly teach “adjust parameters of the Y prompt modules,” as claimed.
Zhang teaches add corresponding prompt information to the N sample images to obtain N prompt sample images (Zhang pages 2587–2588, figure 3, section 3.2, a prompt enhancer adapts the input data by applying a pixel-level perturbation (adding corresponding prompt information) to form learnable prompts that serve as a task identifier).
Zhang further teaches adjust parameters of the Y prompt modules (Zhang page 2588, section 3.4, the learnable prompt is trained for a specific task according to a Loss function which adjusts weights/probabilities of respective tokens as part of the training process).
Therefore, taking the teachings of Wang and Zhang together as a whole, it would have been obvious to a PHOSITA before the effective filing date of the claimed invention to have modified the multi-task image processing prompt tuning model disclosed in Wang to include adding prompt information to input images and adjusting parameters of the prompt processing as disclosed in Zhang, at least because doing so would allow for more fine-grained details in predictions, overcome the interference caused by low-quality in-context pairs that are not sufficiently similar to query images, and also provide robustness against domain shift. See Zhang page 2592, section 5.
Regarding claims 2, 8 and 14, with claim 2 as exemplary, Wang teaches wherein add corresponding prompt information to the N sample images based on the Y prompt modules comprises one of: changing pixel values of at least a portion of pixels in the N sample images; or, adding image areas around the N sample images to expand the N sample images (given that the claim recites limitations in the alternative, Wang teaches changing pixel values on pages 6833–6834 in that a mask is applied to the images used for training, thus removing and changing the pixel value information from the mask region (at least a portion)).
Regarding claims 3, 9 and 15, with claim 3 as exemplary, Wang teaches wherein an image processing task corresponding to an image processing module is image recognition (Wang page 6833, keypoint detection section, one of the prediction tasks handled by the disclosed model includes detecting objects (such as humans) and localizing their detailed keypoints). Wang does not explicitly teach, but Zhang teaches, and prompt information corresponding to the image processing task is added to the N sample images by changing pixel values of outline of each object to be recognized in the N sample images (Zhang pages 2587–2588, figs. 3 and 4, sections 3.2 and 3.5, the prompt enhancer adding perturbations to the in-context pair at a set of pixels around the edges of the image that are learnable to predict visual tokens in a vocabulary including animals such as “horse” and “monkey” and inanimate objects such as “broom” (objects to be recognized)).
Therefore, taking the teachings of Wang and Zhang together as a whole, it would have been obvious to a PHOSITA before the effective filing date of the claimed invention to have modified the multi-task image processing prompt tuning model disclosed in Wang to include the prompt information changing pixel values of the outline of objects as disclosed in Zhang, at least because doing so would allow for more fine-grained details in predictions, overcome the interference caused by low-quality in-context pairs that are not sufficiently similar to query images, and also provide robustness against domain shift. See Zhang page 2592, section 5.
Regarding claims 4, 10, and 16, with claim 4 as exemplary, and deficiencies of Wang noted in square brackets [], Wang teaches wherein an image processing task corresponding to an image processing module is semantic segmentation (Wang page 6833, semantic segmentation section, one of the prediction tasks performed by the disclosed model includes semantic segmentation); and prompt information corresponding to the image processing task [is added to the N sample images] by changing pixel values of at least a portion of pixels [of each object to be segmented] in the N sample images (Wang page 6833, fig. 2 two pairs, an input sample comprised of the concatenation of two pairs of images is masked in patches (changing pixel values of at least a portion – the masked portion) and replaced by learnable token-vectors, where page 6836, section 4.3 teaches that in a prompt tuning step, the learnable tensors define the task prompt (prompt information), and pages 6832–6833, section 3.1 teaching that the learnable tensor is redefined for each task (related to each image processing task)).
While Wang masks pixel values to indicate more prompting information, to the extent such masking is not explicitly “adding to the N sample images,” at least Zhang teaches “is added to the N sample images” (Zhang pages 2587–2588, figure 3, section 3.2, a prompt enhancer adapts the input data by applying a pixel-level perturbation (adding corresponding prompt information) to form learnable prompts that serve as a task identifier). Zhang further teaches changing pixel values of each object to be segmented (Zhang pages 2587–2588, fig. 3, sections 3.2 and 3.5, the prompt enhancer adding perturbations to the in-context pair at a set of pixels around the edges of the image that are learnable to predict visual tokens in a vocabulary (objects to be segmented)).
Therefore, taking the teachings of Wang and Zhang together as a whole, it would have been obvious to a PHOSITA before the effective filing date of the claimed invention to have modified the multi-task image processing prompt tuning model disclosed in Wang to include the prompt information changing pixel values of objects as disclosed in Zhang, at least because doing so would allow for more fine-grained details in predictions, overcome the interference caused by low-quality in-context pairs that are not sufficiently similar to query images, and also provide robustness against domain shift. See Zhang page 2592, section 5.
Regarding claims 5, 11 and 17, with claim 5 as exemplary, and deficiencies of Wang noted in square brackets [], Wang teaches wherein an image processing task corresponding to an image processing module is depth estimation (Wang pages 6832–6833, monocular depth estimation section, one of the prediction tasks handled by the disclosed model includes depth estimation to estimate a per-pixel depth value); and prompt information corresponding to the image processing task [is added] to the N sample images by changing pixel values of pixels of each object to be estimated in the N sample images to increase color comparisons between objects to be estimated (Wang pages 6832–6833, monocular depth estimation section, and page 6835, fig. 3, RGB pixel values in a greater value range of 0–255 are mapped to a lesser depth value range of 0–10, thus increasing color comparisons/contrast, where fig. 3 illustrates the effect in the row labeled “NYUv2-Depth”).
While Wang masks pixel values to indicate more prompting information, to the extent such masking is not explicitly “adding to the N sample images,” at least Zhang teaches “is added to the N sample images” (Zhang pages 2587–2588, figure 3, section 3.2, a prompt enhancer adapts the input data by applying a pixel-level perturbation (adding corresponding prompt information) to form learnable prompts that serve as a task identifier).
Therefore, taking the teachings of Wang and Zhang together as a whole, it would have been obvious to a PHOSITA before the effective filing date of the claimed invention to have modified the multi-task image processing prompt tuning model disclosed in Wang to include the added prompt information changing pixel values as disclosed in Zhang, at least because doing so would allow for more fine-grained details in predictions, overcome the interference caused by low-quality in-context pairs that are not sufficiently similar to query images, and also provide robustness against domain shift. See Zhang page 2592, section 5.
Regarding claims 6, 12 and 18, with claim 6 as exemplary, Wang teaches wherein the image processing model further comprises a feature extraction module which is configured to: perform feature extraction on the N prompt sample images output by each prompt module (Wang pages 6833–6834, architecture section, and page 6836, section 4.3 prompt tuning, feature maps are evenly sampled (feature extraction) using a vanilla vision transformer (module) on the input image including the patches, some of which are masked by way of the learnable tensors from the prompt tuning), and input the extracted features into the M image processing modules corresponding to the Y prompt modules (Wang page 6834, Architecture and In-Context Inference sections, the task prompt as the learnable tensors, where the feature maps are input to be added patch by patch in the vanilla vision transformer, where pages 6832–6833, section 3.1 teach that for each task, a tensor is redefined, thus providing different processing (modules) corresponding to the different learnable tensors defining different task prompts (Y prompt modules); see also fig. 1 illustrating separate processing paths for different task prompts).
Conclusion
The prior art made of record and not relied upon is considered pertinent to applicant's disclosure:
Chinese published patent application No. CN-117315070-A by Rui, published 12/19/2023, directed towards processing an image for semantic segmentation according to a text prompt.
Long et al., “Fine-grained Visual-Text Prompt-Driven Self-Training for Open-Vocabulary Object Detection,” arXiv:2211.00849v2 [cs.CV] 29 Jul 2023, directed towards prompt training for image processing that adds information to the prompt.
Any inquiry concerning this communication or earlier communications from the examiner should be directed to MICHELLE M KOETH whose telephone number is (571)272-5908. The examiner can normally be reached Monday-Thursday, 09:00-17:00, Friday 09:00-13:00, EDT/EST.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Vincent Rudolph can be reached at 571-272-8243. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.
MICHELLE M. KOETH
Primary Examiner
Art Unit 2671
/MICHELLE M KOETH/Primary Examiner, Art Unit 2671