DETAILED ACTION
Continued Examination Under 37 CFR 1.114
A request for continued examination under 37 CFR 1.114, including the fee set forth in 37 CFR 1.17(e), was filed in this application after final rejection. Since this application is eligible for continued examination under 37 CFR 1.114, and the fee set forth in 37 CFR 1.17(e) has been timely paid, the finality of the previous Office action has been withdrawn pursuant to 37 CFR 1.114. Applicant's submission filed on December 2, 2025 has been entered.
Applicant has amended claims 6-9, 12, 15-17, 20-23, and 25. Claims 6-25 are pending.
Response to Arguments
Applicant’s arguments with respect to the pending claims have been considered but are moot in view of the new ground(s) of rejection. The claim amendments changed the scope and content of the claims; therefore, the grounds of rejection are modified accordingly. It is noted that the previously applied prior art references remain applied.
Claim Rejections - 35 USC § 103
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.
Claims 6-9, 14-17, and 21-23 are rejected under 35 U.S.C. 103 as being unpatentable over Fu et al. (US PG Publication No. 2019/0295302 A1), hereafter referred to as Fu, applicant-cited prior art originally cited by the examiner during examination of the instant application, in view of Ioffe et al. (US PG Publication No. 2016/0217368 A1), hereafter referred to as Ioffe, applicant-cited prior art originally cited by the examiner during examination of the instant application, in further view of Schwartz et al. (US PG Publication No. 2018/0314932 A1), hereafter referred to as Schwartz, and in further view of Owechko et al. (US Patent No. 10,176,382), hereafter referred to as Owechko.
Regarding claim 6, Fu discloses a method (Par. [0003]: methods and systems for image generation through use of adversarial networks), comprising:
generating an image using one or more neural networks according to one or more semantic labels in a semantic layout (Par. [0004]: an image generator… the generator is implemented with a first neural network configured to generate a fake image based on a target segmentation. A fake image is a processor-generated image, where the processor may be a neural network… a target segmentation… is a set of segments, e.g., sets of pixels or set of contours, that correspond to portions or landmarks… of an image… at the end of the training period, the generator has its first neural network trained to generate the fake image based on the target segmentation with more accuracy than at the start of the training period; Par. [0034-77]: Segmentation Guided Generative Adversarial Network (SGGAN), which leverages semantic segmentation to improve image generation performance further and provide spatial mapping… employ a segmentor implemented with a neural network that is designed to impose semantic information on the generated images. Experimental results on multi-domain image-to-image translation empirically demonstrates the ability of embodiments to control spatial modification and the superior quality of images generated by embodiments compared to state-of-the-art methods… Image-to-Image translation aims to map an image in a source domain to the corresponding image in a target domain… Segmentation Guided Generative Adversarial Network (SGGAN), which fully leverages semantic segmentation information to guide the image generation (e.g., translation) process… the image semantic segmentation can be obtained through a variety of methodologies… guide the generator with pixel-level semantic segmentations and, thus, further boost the quality of generated images… receive an input image 266a and/or 266b and generate a corresponding segmentation 267a and/or 267b, respectively, indicating features of the input images 266a and/or 266b. The segmentor 260 imposes semantic information on the image generation process… During the training 270, the segmentor 260 provides spatial guidance to the generator 220 to ensure the generated images, e.g., 274, comply with input segmentations, e.g., 271. The discriminator 240 aims to ensure the translated images, e.g., 274, are as realistic as the real images, e.g., 273… In order to guide the generator by the target segmentation information, an additional network is built which takes an image as input and generates the image's corresponding semantic segmentation. This network is referred to as the segmentor network S which is trained together with the GAN framework… embodiment leverages semantic segmentation information in GAN based image translation tasks and also builds a segmentor which is trained together with the GAN framework to provide guidance in image translation… extracted landmarks 441 are processed to generate a pixel-wised semantic segmentation 442 where each pixel in the input image 440 is automatically classified into classes… the segmentor S is optimized by minimizing the difference between the landmark base segmentation 442 and segmentor generated segmentation 444. For instance, based upon differences between the landmark-based segmentation 442 and segmentor S generated segmentation 444, weights of the network implementing the segmentor S may be varied. As shown in FIG. 
4, the similarity between the landmark-based segmentation 442 and segmentor generated segmentation 444 reveals that a segmentor network, implemented according to the embodiments described herein, can successfully capture the semantic information from an input image… Based on the segmentor network S, the proposed SGGAN, e.g., the network depicted in FIG. 2D, comprises three networks, a segmentor, a generator, and a discriminator. The proposed SGGAN utilizes semantic segmentations as strong regulations and control signals in multi-domain image-to-image translation; Par. [0098-127]: because embodiments use semantic segmentation information, embodiments effectively transfer all the attributes and produce much sharper, clearer, and more realistic translation results… generate attribute-level semantic segmentations from input faces. Generated semantic information may be used together with input images to guide face image translation processes… Embodiments generate more realistic results with better image quality (sharper and clearer details) after image translation. Additional morphing features (face attribute reallocation, changing face shape, making a person gradually smile) may be provided in the translation process and no known existing methods can achieve the same effects. An embodiment can generate facial semantic segmentations directly from given input face images… invention detects faces from the input image and extracts corresponding semantic segmentations. Then, an image translation process uses trained models of a novel deep learning based adversarial network, referred to herein as Segmentation Guided Generative Adversarial Networks, which fully leverages semantic segmentation information to guide the image translation process. An example benefit of embodiments includes explicitly guiding the generator with pixel-wise and instance level segmentations, and, thus, further boosting the image quality. Another benefit is the semantic segmentation working well prior to the image generation, which is able to edit the image content… the proposed SGGAN model may employ three networks, i.e., generator, discriminator, and segmentor. The generator takes as inputs, a given image, multiple attributes, and a target segmentation and generates a target image. The discriminator pushes the generated images towards a target domain distribution, and meanwhile, utilizes an auxiliary attribute classifier to enable the SGGAN to generate images with multiple attributes. The segmentor may impose semantic information on the generation process. This framework is trained using a large dataset of face images with attribute-level labels. Further, it is noted that embodiments may implement segmentations of any desired features, e.g., features of faces, clothes, street views, cityscapes, room layouts, room designs, and building designs, amongst other examples… embodiments implement image generation with spatial constraints… embodiment implements a novel Spatially Constrained Generative Adversarial Network (SCGAN), which decouples the spatial constraints from the latent vector and makes them available as additional control signal inputs. An embodiment of the SCGAN includes a generator network, a discriminator network with an auxiliary classifier, and a segmentor network, which are trained together adversarially… the generator is specially designed to take a semantic segmentation, a latent vector, and an attribute label as inputs step by step to synthesize a fake image. 
According to an example embodiment, the discriminator network is configured to distinguish between real images and generated images as well as classify images into attributes. The discrimination and classification results may guide the generator to synthesize realistic images with correct target attributes. The segmentor network, according to an embodiment, attempts to determine semantic segmentations on both real images and fake images to deliver estimated segmentations to guide the generator in synthesizing spatially constrained images. With those networks, embodiments implementing the SCGAN generate realistic images guided by semantic segmentations and attribute labels… a SCGAN that takes latent vectors, attribute labels, and semantic segmentations as inputs, and decouples the image generation into three dimensions. As such, embodiments of the SCGAN are capable of generating images with controlled spatial contents and attributes and generate target images with a large diversity; generating an image using one or more neural networks according to one or more semantic labels in a semantic layout (e.g. generate images by using an image generator implemented with a first (i.e. one or more) neural network that is configured to generate a processor-generated image (i.e. generating an image using one or more neural networks) based on a target segmentation, including a set of segments (i.e. layouts, contours, silhouettes, boundaries, shapes, etc.) that include sets of pixels or set of contours that correspond to portions or landmarks of an image, for example, by employing a segmentor that is implemented with a neural network designed to impose semantic (i.e. type, category, class, keyword, topic, etc.) segmentation (i.e. layout) information (i.e. labels, tags, etc.) on generated images by guiding the image generator with pixel-wise and instance level segmentations (i.e. semantic layout), for example, and during neural network training, the segmentor provides spatial guidance to the generator in order to ensure that generated images comply with input semantic segmentations (i.e. generating an image using one or more neural networks according to one or more semantic labels in a semantic layout), for example, in order to generate more realistic results with better image quality by synthesizing spatially constrained images, as indicated above), for example), but fails to teach the following as further recited in claim 6.
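For illustration only and not as part of the applied reference, the segmentation-conditioned generation that Fu describes in prose might be sketched as follows, assuming a PyTorch-style framework; the class name, layer sizes, and parameters below are hypothetical and are not drawn from Fu.

```python
import torch
import torch.nn as nn

class SegmentationConditionedGenerator(nn.Module):
    """Hypothetical sketch of a generator that maps a one-hot semantic layout
    (N, num_classes, H, W) plus a latent code to an RGB image, loosely in the
    spirit of segmentation-guided generators; not Fu's actual architecture."""
    def __init__(self, num_classes: int, latent_dim: int = 128):
        super().__init__()
        self.latent_dim = latent_dim
        self.body = nn.Sequential(
            nn.Conv2d(num_classes + latent_dim, 64, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(64, 64, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(64, 3, kernel_size=3, padding=1),
            nn.Tanh(),  # RGB output in [-1, 1]
        )

    def forward(self, layout: torch.Tensor, z: torch.Tensor) -> torch.Tensor:
        # Broadcast the latent vector over the spatial grid and concatenate it with
        # the semantic layout so every pixel is conditioned on its semantic label.
        n, _, h, w = layout.shape
        z_map = z.view(n, self.latent_dim, 1, 1).expand(n, self.latent_dim, h, w)
        return self.body(torch.cat([layout, z_map], dim=1))
```

In a training setup of the kind Fu describes, a segmentor network would additionally re-estimate the segmentation of the generated image and penalize disagreement with the input layout.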
However, Ioffe teaches one or more neural networks that include one or more spatially-adaptive normalization layers that spatially vary one or more parameters of one or more transformations to modulate one or more activations (Par. [0002-3]: specification relates to processing inputs through the layers of neural networks to generate outputs… Neural networks are machine learning models that employ one or more layers of nonlinear units to predict an output for a received input. Some neural networks include one or more hidden layers in addition to an output layer. The output of each hidden layer is used as input to the next layer in the network, i.e., the next hidden layer or the output layer. Each layer of the network generates an output from a received input in accordance with current values of a respective set of parameters; Par. [0015-46]: neural network system 100 can be configured to receive any kind of digital data input and to generate any kind of score or classification output based on the input … if the inputs to the neural network system 100 are images or features that have been extracted from images, the output generated by the neural network system 100 for a given image may be scores for each of a set of object categories, with each score representing an estimated likelihood that the image contains an image of an object belonging to the category… In particular, each of the layers of the neural network is configured to receive an input and generate an output from the input and the neural network layers collectively process neural network inputs received by the neural network system 100 to generate a respective neural network output for each received neural network input. Some or all of the neural network layers in the sequence generate outputs from inputs in accordance with current values of a set of parameters for the neural network layer. For example, some layers may multiply the received input by a matrix of current parameter values as part of generating an output from the received input… The neural network system 100 also includes a batch normalization layer 108 between a neural network layer A 104 and a neural network layer B 112 in the sequence of neural network layers. The batch normalization layer 108 is configured to perform one set of operations on inputs received from the neural network layer A 104 during training of the neural network system 100 and another set of operations on inputs received from the neural network layer A 104 after the neural network system 100 has been trained… During training of the neural network system 100 on a given batch of training examples, the batch normalization layer 108 is configured to receive layer A outputs 106 generated by the neural network layer A 104 for the training examples in the batch, process the layer A outputs 106 to generate a respective batch normalization layer output 110 for each training example in the batch, and then provide the batch normalization layer outputs 110 as an input to the neural network layer B 112. The layer A outputs 106 include a respective output generated by the neural network layer A 104 for each training example in the batch. 
Similarly, the batch normalization layer outputs 110 include a respective output generated by the batch normalization layer 108 for each training example in the batch…the batch normalization layer 108 computes a set of normalization statistics for the batch from the layer A outputs 106, normalizes the layer A outputs 106 to generate a respective normalized output for each training example in the batch, and, optionally, transforms each of the normalized outputs before providing the outputs as input to the neural network layer B 112… the neural network layer A 104 generates outputs by modifying inputs to the layer in accordance with current values of a set of parameters for the first neural network layer, e.g., by multiplying the input to the layer by a matrix of the current parameter values. In these implementations, the neural network layer B 112 may receive an output from the batch normalization layer 108 and generate an output by applying a non-linear operation, i.e., a non-linear activation function, to the batch normalization layer output… the neural network layer A 104 generates the outputs by modifying layer inputs in accordance with current values of a set of parameters to generate a modified first layer inputs and then applying a non-linear operation to the modified first layer inputs before providing the output to the batch normalization layer 108… FIG. 2 is a flow diagram of an example process 200 for generating a batch normalization layer output during training of a neural network on a batch of training examples. For convenience, the process 200 will be described as being performed by a system of one or more computers located in one or more locations. For example, a batch normalization layer included in a neural network system, e.g., the batch normalization layer 108 included in the neural network system 100 of FIG. 1, appropriately programmed, can perform the process 200… the batch normalization layer computes, for each dimension, the mean and the standard deviation of the components of the lower layer outputs that correspond to the dimension. The batch normalization layer then normalizes each component of each of the lower level outputs using the means and standard deviations to generate a respective normalized output for each of the training examples in the batch… the batch normalization layer computes, for each possible feature index and spatial location index combination, the mean and the variance of the components of the lower layer outputs that have that feature index and spatial location index. The batch normalization layer then computes, for each feature index, the average of the means for the feature index and spatial location index combinations that include the feature index. The batch normalization layer also computes, for each feature index, the average of the variances for the feature index and spatial location index combinations that include the feature index. 
Thus, after computing the averages, the batch normalization layer has computed a mean statistic for each feature across all of the spatial locations and a variance statistic for each feature across all of the spatial locations… the batch normalization layer transforms each component of each normalized output … In cases where the layer below the batch normalization layer is a layer that generates an output that includes multiple components indexed by dimension, the batch normalization layer transforms, for each dimension, the component of each normalized output in the dimension in accordance with current values of a set of parameters for the dimension. That is, the batch normalization layer maintains a respective set of parameters for each dimension and uses those parameters to apply a transformation to the components of the normalized outputs in the dimension. The values of the sets of parameters are adjusted as part of the training of the neural network system; Par. [057-59]: the batch normalization layer transforms each component of the normalized output… If the outputs generated by the layer below the batch normalization layer are indexed by dimension, the batch normalization layer transforms, for each dimension, the component of the normalized output in the dimension in accordance with trained values of the set of parameters for the dimension. If the outputs generated by the layer below the batch normalization layer are indexed by feature index and spatial location index, the batch normalization layer transforms each component of the normalized output in accordance with trained values of the set of parameters for the feature index corresponding to the component… The batch normalization layer provides the normalized output or the transformed normalized output as input to the layer above the batch normalization layer in the sequence; one or more neural networks that include one or more spatially-adaptive normalization layers that spatially vary one or more parameters of one or more transformations to modulate one or more activations (e.g. neural network system and method include neural networks (i.e. one or more neural networks), which are machine learning models that employ one or more layers of nonlinear units to predict an output for a received input, for example, in which each layer of each network generates an output from a received input in accordance with current values of a respective set of parameters, including batch normalization layers that compute a mean statistic for features that have been extracted (i.e. segmented, derived, etc.) from given images, for example, including each feature across all of spatial locations and a variance statistic for each feature across all of the spatial locations (i.e. one or more neural networks that include one or more spatially-adaptive normalization layers), for example, in which each one of the batch normalization layers transforms (i.e. one or more transformations) each component of each normalized output to generate outputs from features extracted from input images in accordance with current values of a set of parameters for each one of the neural network layers (i.e. one or more parameters of one or more transformations), for example, including values of the sets of parameters that are adjusted (i.e. changed, adapted, variated, modified, updated, etc.) as part of the training of the neural network system (i.e. 
spatially vary one or more parameters of one or more transformations), for example, in order to generate an output by applying (i.e. modulating, controlling, instructing, forcing, etc.) a non-linear operation, such as a non-linear activation function (i.e. to modulate one or more activations), for example, to the batch normalization layer output (i.e. one or more neural networks that include one or more spatially-adaptive normalization layers that spatially vary one or more parameters of one or more transformations to modulate one or more activations), as indicated above), for example).
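For reference, the normalize-then-transform operations that the cited passages of Ioffe describe in prose follow the standard batch-normalization formulation, which may be summarized in generic notation (not taken verbatim from Ioffe) as:

\[
\mu_B=\frac{1}{m}\sum_{i=1}^{m}x_i,\qquad
\sigma_B^{2}=\frac{1}{m}\sum_{i=1}^{m}(x_i-\mu_B)^{2},\qquad
\hat{x}_i=\frac{x_i-\mu_B}{\sqrt{\sigma_B^{2}+\epsilon}},\qquad
y_i=\gamma\,\hat{x}_i+\beta,
\]

where \(\mu_B\) and \(\sigma_B^{2}\) are the per-feature batch statistics (computed over all spatial locations when the inputs are indexed by feature and spatial location), and \(\gamma\) and \(\beta\) are the learned parameters, adjusted during training, that transform each normalized component \(\hat{x}_i\) before the result \(y_i\) is provided to the next layer (e.g., ahead of a non-linear activation function).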
Fu and Ioffe are considered to be analogous art because they pertain to image processing applications based on neural networks. Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the method for image generation through use of adversarial networks (as disclosed by Fu) to include one or more neural networks that include one or more spatially-adaptive normalization layers that spatially vary one or more parameters of one or more transformations to modulate one or more activations (as taught by Ioffe, Abstract, Par. [0002-3, 15-46, 57-59]) in order to process given input images through the layers of neural networks, including normalization layers, so as to generate accurate neural network outputs (Ioffe, Abstract, Par. [0002-3, 6]).
The combination of Fu and Ioffe, as a whole, teaches the method, as indicated above, but fails to teach the following as further recited in claim 6.
However, Schwartz teaches vary one or more parameters of one or more affine transformations to modulate one or more activations (Par. [0119-160]: FIG. 5 illustrates a graphics processing pipeline 500… generative adversarial network (GAN) comprises a generator and a discriminator. In use, the generator attempts to classify real objects, typically using a model, and attempts to confuse the discriminator. The GAN implements a recursive process which enables the generator and the discriminator to learn how to model 3D shapes… in some examples it may be useful to generate synthetic data for GANs using a GPUs rendering capabilities… exemplary type of machine learning algorithm is a neural network… exemplary type of neural network is the Convolutional Neural Network (CNN). A CNN is a specialized feedforward neural network for processing data having a known, grid-like topology, such as image data. Accordingly, CNNs are commonly used for compute vision and image recognition applications, but they also may be used for other types of pattern recognition such as speech and language processing. The nodes in the CNN input layer are organized into a set of “filters” (feature detectors inspired by the receptive fields found in the retina), and the output of each set of filters is propagated to nodes in successive layers of the network. The computations for a CNN include applying the convolution mathematical operation to each filter to produce the output of that filter. Convolution is a specialized kind of mathematical operation performed by two functions to produce a third function that is a modified version of one of the two original functions. In convolutional network terminology, the first function to the convolution can be referred to as the input, while the second function can be referred to as the convolution kernel. The output may be referred to as the feature map. For example, the input to a convolution layer can be a multidimensional array of data that defines the various color components of an input image. The convolution kernel can be a multidimensional array of parameters, where the parameters are adapted by the training process for the neural network… computation stages within a convolutional layer of a CNN. Input to a convolutional layer 1112 of a CNN can be processed in three stages of a convolutional layer 1114. The three stages can include a convolution stage 1116, a detector stage 1118, and a pooling stage 1120. The convolution layer 1114 can then output data to a successive convolutional layer. The final convolutional layer of the network can generate output feature map data or provide input to a fully connected layer, for example, to generate a classification value for the input to the CNN… The convolution stage 1116 can include an affine transformation, which is any transformation that can be specified as a linear transformation plus a translation. Affine transformations include rotations, translations, scaling, and combinations of these transformations. The convolution stage computes the output of functions (e.g., neurons) that are connected to specific regions in the input, which can be determined as the local region associated with the neuron. The neurons compute a dot product between the weights of the neurons and the region in the local input to which the neurons are connected. 
The output from the convolution stage 1116 defines a set of linear activations that are processed by successive stages of the convolutional layer 1114… The linear activations can be processed by a detector stage 1118. In the detector stage 1118, each linear activation is processed by a non-linear activation function. The non-linear activation function increases the nonlinear properties of the overall network without affecting the receptive fields of the convolution layer. Several types of non-linear activation functions may be used. One particular type is the rectified linear unit (ReLU), which uses an activation function… The output from the convolutional layer 1114 can then be processed by the next layer 1122. The next layer 1122 can be an additional convolutional layer or one of the fully connected layers 1108. For example, the first convolutional layer 1104 of FIG. 11A can output to the second convolutional layer 1106, while the second convolutional layer can output to a first layer of the fully connected layers 1108; Par. [0171]: combining results and synchronizing the model parameters between each node. Exemplary approaches to combining data include parameter averaging and update based data parallelism. Parameter averaging trains each node on a subset of the training data and sets the global parameters (e.g., weights, biases) to the average of the parameters from each node. Parameter averaging uses a central parameter server that maintains the parameter data. Update based data parallelism is similar to parameter averaging except that instead of transferring parameters from the nodes to the parameter server, the updates to the model are transferred; vary one or more parameters of one or more affine transformations to modulate one or more activations (e.g. exemplary type of neural network includes a convolutional neural network (CNN), in which computation stages within convolutional layers of the CNN include an affine transformation, respectively (i.e. one or more affine transformations), which are any transformation that can be specified as a linear transformation plus a translation, for example, and each of the affine transformations include rotations, translations, scaling, and combinations of these transformations, for example, and each convolution stage of the CNN computes the output of functions that are connected to specific regions in the input, in which each convolution layer includes a multidimensional array of data that defines various components of an input image, and each convolution kernel includes a multidimensional array of parameters, where each of the parameters are adapted (i.e. changed, variated, modified, updated, etc.) by the training process for the neural network (i.e. vary one or more parameters of one or more affine transformations), for example, and the output each convolution stage defines (i.e. modulates, applies, controls, instructs, etc.) a set of linear activations that are processed by successive stages of each of convolutional layer (i.e. to modulate one or more activations), for example, and each linear activation is processed by a non-linear activation function, including the rectified linear unit (ReLU) non-linear activation function, as indicated above), for example).
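As a generic illustration only (not Schwartz's actual implementation), the convolution-stage/detector-stage sequence described above, in which an affine operation produces a set of linear activations that a non-linear activation function then processes, can be sketched as follows; the function name and shapes are hypothetical.

```python
import torch
import torch.nn.functional as F

def conv_stage_then_detector(x, weight, bias):
    """Sketch of one convolutional-layer stage pair: the convolution stage is an
    affine operation (a linear transformation plus translation by the bias) whose
    output is a set of linear activations; the detector stage then applies a
    non-linear activation function, here the rectified linear unit (ReLU).
    Shapes: x (N, C_in, H, W), weight (C_out, C_in, kH, kW), bias (C_out,)."""
    linear_activations = F.conv2d(x, weight, bias, padding=1)  # convolution (affine) stage
    return F.relu(linear_activations)                          # detector (non-linear) stage
```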
Fu, Ioffe, and Schwartz are considered to be analogous art because they pertain to image processing applications based on neural networks. Therefore, the combined teachings of Fu, Ioffe, and Schwartz, as a whole, would have rendered obvious the invention recited in claim 6 with a reasonable expectation of success by modifying the method for image generation through use of adversarial networks (as disclosed by Fu) to vary one or more parameters of one or more affine transformations to modulate one or more activations (as taught by Schwartz, Abstract, Par. [0148-160, 171]) in order to compute the output of functions that are connected to specific regions in the input and to produce a set of linear activations processed by a non-linear activation function (Schwartz, Abstract, Par. [0154-158]).
The combination of Fu, Ioffe, and Schwartz, as a whole, teaches the method, including an image generator that uses pixel-level semantic segmentations (i.e. a semantic layout) to provide spatial guidance to ensure that generated images comply with semantic information (i.e. labels, tags, etc.) of input segmentations, as indicated above, for example, including batch normalization layers that compute a mean statistic for features that have been extracted (i.e. segmented, derived, etc.) from given images, for example, including each feature across all of the spatial locations and a variance statistic for each feature across all of the spatial locations (i.e. one or more neural networks that include one or more spatially-adaptive normalization layers), in which each one of the batch normalization layers transforms (i.e. one or more transformations) each component of each normalized output to generate outputs from features extracted from input images in accordance with current values of a set of parameters for each one of the neural network layers, for example, including values of sets of parameters that are adjusted (i.e. changed, adapted, variated, modified, updated, etc.) as part of the training of the neural network system (i.e. spatially vary one or more parameters of one or more transformations), for example, in order to generate an output by applying (i.e. modulating, controlling, instructing, forcing, etc.) a non-linear operation, such as a non-linear activation function (i.e. to modulate one or more activations), to the batch normalization layer output, for example, including computation stages within convolutional layers that include an affine transformation (i.e. one or more affine transformations), and each convolution kernel includes a multidimensional array of parameters, where each of the parameters is adapted (i.e. changed, variated, modified, updated, etc.) by the training process for the neural network (i.e. vary one or more parameters of one or more affine transformations), and the output of each convolution stage defines (i.e. modulates, applies, controls, instructs, etc.) a set of linear activations that are processed by successive stages of each convolutional layer (i.e. to modulate one or more activations), as indicated above, for example, but does not expressly disclose “modulate” one or more activations, semantic “labels” and semantic layout “map”, respectively, as recited in claim 6.
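Solely to illustrate the claim language being mapped, and not as a disclosure of any applied reference, a spatially-adaptive normalization layer of the kind recited could, under one reading, be sketched as follows: per-location scale and shift parameters are predicted from the semantic layout and applied as an affine modulation of the normalized activations. The class name and layer choices below are hypothetical.

```python
import torch
import torch.nn as nn

class SpatiallyAdaptiveNorm(nn.Module):
    """Hypothetical sketch: activations are normalized, then modulated by an affine
    transformation whose scale (gamma) and shift (beta) vary per spatial location
    because they are predicted from the semantic layout. Not drawn from the cited art."""
    def __init__(self, num_features: int, num_classes: int):
        super().__init__()
        self.norm = nn.InstanceNorm2d(num_features, affine=False)
        self.gamma = nn.Conv2d(num_classes, num_features, kernel_size=3, padding=1)
        self.beta = nn.Conv2d(num_classes, num_features, kernel_size=3, padding=1)

    def forward(self, activations: torch.Tensor, layout: torch.Tensor) -> torch.Tensor:
        # 'layout' is a one-hot semantic layout already resized (e.g., with
        # F.interpolate) to the spatial resolution of 'activations'.
        normalized = self.norm(activations)
        return normalized * (1 + self.gamma(layout)) + self.beta(layout)
```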
However, Owechko teaches “modulate” one or more activations, semantic “labels”, and semantic layout “map” (Col. 2: a system for visual media reasoning using sparse associative recognition and recall. The system comprises one or more processors and a memory having instructions such that when the instructions are executed, the one or more processors perform multiple operations… An input image having input data is filtered using a non-linear sparse coding module and a first series of sparse coding filter kernels tuned to represent objects of general categories, followed by a second series of sparse coding filter kernels tuned to represent objects of specialized categories, resulting in a set of sparse codes. Object recognition is performed on the set of sparse codes by a neurally-inspired vision module to generate object and semantic labels for the set of sparse codes… selectively activating specific object or semantic label neurons in the neurally-inspired vision module; Col. 9-12: a method and system for recovering information pertaining to a photo or video about when it was taken, what is contained within, who is present, and where was the picture taken using sparse associative recognition and recall for visual media reasoning. The method and system is referred to as sparse associative recognition and recall, or SPARR. It can be implemented as a software system that will assist human analysts in rapidly extracting mission-relevant information from photos or video. This task is referred to as visual media reasoning (VMR). It includes reasoning with image content and preliminary annotations (e.g., who is in the photo?) to fully recover the 4 W's: “who”, “what”, “where”, and “when”… SPARR architecture comprises three tightly coupled layers that make use of bidirectional feedback between layers… layer 104 provides object and semantic labels… SPARR uses a general-purpose dictionary that can represent broad classes of natural images as a linear combinations of image prototypes… Starting with a general-purpose model (i.e., general categories pathway 206), the model predicts the existence of a certain coarse object category… in an image 210… Based on previous observations (i.e., training data) of images with semantic labels (i.e., annotated answers to all or some questions regarding who, what, where, when), or labeled images with associations… from the neurally-inspired visual layer (FIG. 1, element 104), the SAM layer… can predict with a certain confidence the semantics given a novel input image 102. These semantics (FIG. 1, labeled image with associations 118), which are in the form of activations of neurons; Col. 15-17: semantic features encode who, where, what, and when information. These semantic features are provided by preliminary analysis from other layers, in addition to results recalled from the spatiotemporal associative memory layer, and are used to top-down bias the Leabra Vision network to focus the object-recognition process… semantic features can be inferred from never-before-seen objects, leading to a rough classification and recovery of mission-relevant features. The authors of Literature Reference No. 19, for example, demonstrated that by presenting images, object labels, and semantic features for known objects, bidirectional connections from IT and semantic label layers allow generalization to never-before-seen images… During training, a bidirectionally connected associative network learns the joint distribution of object parts and relative location. 
During testing, as each object part is identified, it casts a vote as to the predicted object center through the bidirectional associative network. These predicted object center offsets are accumulated for a more robust estimate of object location with an object-center semantic 402… Object-center semantic maps enable 402 localization (i.e., estimating a bounding box) of objects… Similar to the object center map, each V1 receptive field image patch also has a silhouette map associated with it, conditioned on object type 404. During testing, output from the object layer intersects with the input image to create an approximate foreground-background segmentation 406. Through bottom-up/top-down iterative processing, this segmentation finds the most consistent interpretation of data against hypotheses. Rapid localization and segmentation, in turn, boosts object recognition by suppressing clutter and other nuisance factors. Silhouette semantic maps enable segmentation (i.e., estimation of the silhouettes) of objects in the present of clutter by explicitly learning the expected mask over the support (N×N pixel area) of each detected pattern, given the object identity and the location and strength of the detected pattern… Object-center and silhouette maps are extensions to the Leabra Vision system. Conditioning image patches (V1 receptive fields) on object type creates a hypothesis of where object centers and extents occur. Through the interaction between training assumptions and testing images, spatial pooling enhances localization and segmentation while suppressing clutter and other nuisance factors; Col. 20-21: general object labels layer 612 is the coarse categorization of objects in the image consisting of the set of objects the model is trained to learn. There is also a specialized object labels layer 614 and a semantics layer 616 which, as implied by its name, represent other more detailed information about the image (e.g., car model, person identity, plant type)… object label activations (i.e., truck, person, gun) are projected top-down to generate an “attentional” shroud (attentional feedback 202 in FIG. 2) over each object separately. This attentional shroud modulates the filter responses such that only the activated regions are strong enough to activate the layers above it on the next and final third pass 620, essentially focusing the attention of the hierarchy on specific regions of the image, determined by what general object category was found in the first pass 600. Furthermore, through the localization and segmentation augmentations of the Leabra Vision system, the object centers and the segments of the objects are also estimated… provide input to the SPARR system when performing recognition by selectively activating specific object or semantic label neurons in the upper layers or the attentional shroud; “modulate” one or more activations, semantic “labels”, and semantic layout “map” (e.g. system for visual media reasoning using sparse associative recognition and recall includes filtering an input image having input data using a non-linear sparse coding module and a first series of sparse coding filter kernels tuned to represent objects of general categories (i.e. types, classes, keywords, topics, etc.), for example, including performing object recognition on a set of sparse codes by using a neurally-inspired vision module to generate object and semantic labels (i.e. semantic “labels”) for a set of sparse codes, for example, including silhouette (i.e. 
layout, contour, boundary, shape etc.) semantic maps (i.e. semantic layout “map”) that enable localization and segmentation of objects, for example, and selectively activating specific object or semantic label neurons in the neurally-inspired vision module by using specialized object labels layer and a semantics layer which represent more detailed information about the image, including object label activations that are projected to generate an “attentional” shroud (attentional feedback) over each object separately, for example, and this attentional shroud modulates filter responses such that only the activated regions (i.e. “modulate” one or more activations) are strong enough to activate the layers on the next and final pass, essentially focusing the attention of the hierarchy on specific regions of the image, determined by what general object category was found, as indicated above), for example).
Fu, Ioffe, Schwartz, and Owechko are considered to be analogous art because they pertain to image processing applications based on neural networks. Therefore, the combined teachings of Fu, Ioffe, Schwartz, and Owechko, as a whole, would have rendered obvious the invention recited in claim 6 with a reasonable expectation of success by modifying the method for image generation through use of adversarial networks (as disclosed by Fu) to “modulate” one or more activations and to use semantic “labels” and a semantic layout “map” (as taught by Owechko, Abstract, Col. 2, 10-12, 15-17, 20-21) in order to modulate filter responses such that only the activated regions are strong enough to activate the layers above (Owechko, Abstract, Col. 21).
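As a generic, hypothetical illustration (not Owechko's implementation) of a top-down attentional mask modulating filter responses so that only sufficiently activated regions propagate to the layers above, consider the following sketch; the function name and threshold are assumptions.

```python
import torch

def apply_attentional_shroud(filter_responses: torch.Tensor,
                             shroud: torch.Tensor,
                             threshold: float = 0.5) -> torch.Tensor:
    """Modulate filter responses (N, C, H, W) with a top-down attention map
    ('shroud', shape (N, 1, H, W), values in [0, 1]); regions where the shroud
    is weak are suppressed so they cannot activate higher layers."""
    modulated = filter_responses * shroud                 # attenuate outside the attended regions
    gate = (shroud >= threshold).to(modulated.dtype)      # keep only strongly activated regions
    return modulated * gate
```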
Additionally, Lee et al. (“Context-Aware Synthesis and Placement of Object Instances”), hereafter referred to as Lee, although not relied upon, is considered pertinent to applicant’s disclosure because it also discloses a similar concept as claimed by applicant’s invention, indicated above. In particular, Lee teaches in Pg. 1-4, generate realistic images based on generative adversarial networks (GANs) … propose a conditional GAN framework for the task. Our generator learns to predict plausible locations to insert object instances into the input semantic label map and also generate object instance masks with semantically coherent scales, poses and shapes… Our conditional GANs consist of two modules tailored to address the where and what problems, respectively, in learning the distributions of object locations and shapes. For each module, we encode the corresponding distributions through a variational auto-encoder (VAE)… that follows a unit Gaussian distribution in order to introduce sufficient variations of locations and shapes… the object generation pipeline of this work is on the semantic layout rather than image domain. As a result, we simplify the network module on learning desirable object shape… algorithm learns to place and synthesize a new instance of a specific object category (e.g., car and pedestrian) into a semantic map… learning affine transformations with a spatial transformer network that transforms and places a unit bounding box at a plausible location within the input semantic map. Then, given the context from the input semantic map and the predicted locations from the where module, we predict plausible shapes of the object instance with the what module (see Figure 1 and Figure 3). Finally, with the affine transformation learned from the STN, the synthesized object instance is placed into the input semantic map as the final output… given the input semantic map x, the where module aims to learn the conditional distribution of the location and size of object instances valid for the given scene context. We represent such spatial (and size) variations of object instances with affine transformations of a unit bounding box b. Thus, the where module is a conditional GAN, where the generator Gl takes x and a random vector zl as input and outputs an affine transformation matrix A… We denote A(obj) as applying transformation A… From training data in the supervised path, for each existing instance, we can calculate the affine transformation matrix A, which maps a box onto the object. Furthermore, we learn a neural network Gl, shared by both paths, which predicts Aˆ conditioned on x, so that the preferred locations are determined according to the global context of the input. As such, we aim to find a realistic transform Aˆ which gives a result that is indistinguishable from the result of A… We use two discriminators… which focuses on finding whether the new bounding box fits into the layout of the input semantic map, and… which aims to distinguish whether the transformaion parameters are realistic, for example.
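For illustration only, the "where" step attributed to Lee above (learning an affine transformation with a spatial transformer network that places a unit bounding box, or an object instance mask, into the input semantic map) might be sketched as follows, assuming PyTorch; the function name and tensor shapes are hypothetical.

```python
import torch
import torch.nn.functional as F

def place_instance_mask(instance_mask: torch.Tensor, theta: torch.Tensor, out_size):
    """Warp an object-instance mask (N, 1, h, w) into the coordinate frame of the
    semantic map using an affine matrix theta of shape (N, 2, 3), as a spatial
    transformer network would. Note that grid_sample uses the inverse convention:
    theta maps output coordinates back to sampling locations in the mask.
    out_size is the output shape, e.g. (N, 1, H, W)."""
    grid = F.affine_grid(theta, out_size, align_corners=False)
    return F.grid_sample(instance_mask, grid, align_corners=False)
```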
Regarding claim 7, claim 6 is incorporated and the combination of Fu, Ioffe, Schwartz, and Owechko, as a whole, teaches the method (Fu, Par. [0003]), wherein
the one or more neural networks determine semantic information using the one or more semantic labels associated with the semantic layout map, the one or more semantic labels indicating respective types of image content; and
generate representations of the respective types of image content (Owechko, Col. 2: a system for visual media reasoning using sparse associative recognition and recall. The system comprises one or more processors and a memory having instructions such that when the instructions are executed, the one or more processors perform multiple operations… An input image having input data is filtered using a non-linear sparse coding module and a first series of sparse coding filter kernels tuned to represent objects of general categories, followed by a second series of sparse coding filter kernels tuned to represent objects of specialized categories, resulting in a set of sparse codes. Object recognition is performed on the set of sparse codes by a neurally-inspired vision module to generate object and semantic labels for the set of sparse codes… selectively activating specific object or semantic label neurons in the neurally-inspired vision module; Col. 9-12: a method and system for recovering information pertaining to a photo or video about when it was taken, what is contained within, who is present, and where was the picture taken using sparse associative recognition and recall for visual media reasoning. The method and system is referred to as sparse associative recognition and recall, or SPARR. It can be implemented as a software system that will assist human analysts in rapidly extracting mission-relevant information from photos or video. This task is referred to as visual media reasoning (VMR). It includes reasoning with image content and preliminary annotations (e.g., who is in the photo?) to fully recover the 4 W's: “who”, “what”, “where”, and “when”… SPARR architecture comprises three tightly coupled layers that make use of bidirectional feedback between layers… layer 104 provides object and semantic labels… SPARR uses a general-purpose dictionary that can represent broad classes of natural images as a linear combinations of image prototypes… Starting with a general-purpose model (i.e., general categories pathway 206), the model predicts the existence of a certain coarse object category… in an image 210… Based on previous observations (i.e., training data) of images with semantic labels (i.e., annotated answers to all or some questions regarding who, what, where, when), or labeled images with associations… from the neurally-inspired visual layer (FIG. 1, element 104), the SAM layer… can predict with a certain confidence the semantics given a novel input image 102. These semantics (FIG. 1, labeled image with associations 118), which are in the form of activations of neurons; Col. 15-17: semantic features encode who, where, what, and when information. These semantic features are provided by preliminary analysis from other layers, in addition to results recalled from the spatiotemporal associative memory layer, and are used to top-down bias the Leabra Vision network to focus the object-recognition process… semantic features can be inferred from never-before-seen objects, leading to a rough classification and recovery of mission-relevant features. The authors of Literature Reference No. 19, for example, demonstrated that by presenting images, object labels, and semantic features for known objects, bidirectional connections from IT and semantic label layers allow generalization to never-before-seen images… During training, a bidirectionally connected associative network learns the joint distribution of object parts and relative location. 
During testing, as each object part is identified, it casts a vote as to the predicted object center through the bidirectional associative network. These predicted object center offsets are accumulated for a more robust estimate of object location with an object-center semantic 402… Object-center semantic maps enable 402 localization (i.e., estimating a bounding box) of objects… Similar to the object center map, each V1 receptive field image patch also has a silhouette map associated with it, conditioned on object type 404. During testing, output from the object layer intersects with the input image to create an approximate foreground-background segmentation 406. Through bottom-up/top-down iterative processing, this segmentation finds the most consistent interpretation of data against hypotheses. Rapid localization and segmentation, in turn, boosts object recognition by suppressing clutter and other nuisance factors. Silhouette semantic maps enable segmentation (i.e., estimation of the silhouettes) of objects in the present of clutter by explicitly learning the expected mask over the support (N×N pixel area) of each detected pattern, given the object identity and the location and strength of the detected pattern… Object-center and silhouette maps are extensions to the Leabra Vision system. Conditioning image patches (V1 receptive fields) on object type creates a hypothesis of where object centers and extents occur. Through the interaction between training assumptions and testing images, spatial pooling enhances localization and segmentation while suppressing clutter and other nuisance factors; Col. 20-21: general object labels layer 612 is the coarse categorization of objects in the image consisting of the set of objects the model is trained to learn. There is also a specialized object labels layer 614 and a semantics layer 616 which, as implied by its name, represent other more detailed information about the image (e.g., car model, person identity, plant type)… object label activations (i.e., truck, person, gun) are projected top-down to generate an “attentional” shroud (attentional feedback 202 in FIG. 2) over each object separately. This attentional shroud modulates the filter responses such that only the activated regions are strong enough to activate the layers above it on the next and final third pass 620, essentially focusing the attention of the hierarchy on specific regions of the image, determined by what general object category was found in the first pass 600. Furthermore, through the localization and segmentation augmentations of the Leabra Vision system, the object centers and the segments of the objects are also estimated… provide input to the SPARR system when performing recognition by selectively activating specific object or semantic label neurons in the upper layers or the attentional shroud; wherein the one or more neural networks determine semantic information using the one or more semantic labels associated with the semantic layout map, the one or more semantic labels indicating respective types of image content; and generate representations of the respective types of image content (e.g. system for visual media reasoning using sparse associative recognition and recall includes filtering an input image having input data using a non-linear sparse coding module and a first series of sparse coding filter kernels tuned to represent objects of general categories (i.e. types, classes, keywords, topics, etc.) of input image content (i.e. 
respective types of image content), for example, including performing object recognition on a set of sparse codes by using a neurally-inspired vision module to generate object and semantic labels (i.e. one or more semantic labels indicating respective types of image content) for a set of sparse codes, for example, including silhouette (i.e. layout, contour, boundary, shape etc.) semantic maps (i.e. wherein the one or more neural networks determine semantic information using the one or more semantic labels associated with the semantic layout map, the one or more semantic labels indicating respective types of image content) that enable localization and segmentation of objects, for example, and selectively activating specific object or semantic label neurons in the neurally-inspired vision module by using specialized object labels layer and a semantics layer which represent more detailed information about the image, including object label activations that are projected to generate an “attentional” shroud (attentional feedback) over each object separately (i.e. generate representations of the respective types of image content), for example, and this attentional shroud modulates filter responses such that only the activated regions (i.e. “modulate” one or more activations”) are strong enough to activate the layers on the next and final pass, essentially focusing the attention of the hierarchy on specific regions of the image, determined by what general object category was found, as indicated above), for example).
The same motivation to combine the above-mentioned teachings applies, as previously set forth with respect to claim 6.
Regarding claim 8, claim 7 is incorporated and the combination of Fu, Ioffe, Schwartz, and Owechko, as a whole, teaches the method (Fu, Par. [0003]), wherein
the one or more neural networks receive a boundary input separating an image space of the semantic layout map; and further comprising:
receiving indication of the one or more semantic labels to be associated with the semantic layout map; and
generating the semantic layout map with the one or more semantic labels (Owechko, Col. 2: a system for visual media reasoning using sparse associative recognition and recall. The system comprises one or more processors and a memory having instructions such that when the instructions are executed, the one or more processors perform multiple operations… An input image having input data is filtered using a non-linear sparse coding module and a first series of sparse coding filter kernels tuned to represent objects of general categories, followed by a second series of sparse coding filter kernels tuned to represent objects of specialized categories, resulting in a set of sparse codes. Object recognition is performed on the set of sparse codes by a neurally-inspired vision module to generate object and semantic labels for the set of sparse codes… selectively activating specific object or semantic label neurons in the neurally-inspired vision module; Col. 9-12: a method and system for recovering information pertaining to a photo or video about when it was taken, what is contained within, who is present, and where was the picture taken using sparse associative recognition and recall for visual media reasoning. The method and system is referred to as sparse associative recognition and recall, or SPARR. It can be implemented as a software system that will assist human analysts in rapidly extracting mission-relevant information from photos or video. This task is referred to as visual media reasoning (VMR). It includes reasoning with image content and preliminary annotations (e.g., who is in the photo?) to fully recover the 4 W's: “who”, “what”, “where”, and “when”… SPARR architecture comprises three tightly coupled layers that make use of bidirectional feedback between layers… layer 104 provides object and semantic labels… SPARR uses a general-purpose dictionary that can represent broad classes of natural images as a linear combinations of image prototypes… Starting with a general-purpose model (i.e., general categories pathway 206), the model predicts the existence of a certain coarse object category… in an image 210… Based on previous observations (i.e., training data) of images with semantic labels (i.e., annotated answers to all or some questions regarding who, what, where, when), or labeled images with associations… from the neurally-inspired visual layer (FIG. 1, element 104), the SAM layer… can predict with a certain confidence the semantics given a novel input image 102. These semantics (FIG. 1, labeled image with associations 118), which are in the form of activations of neurons; Col. 15-17: semantic features encode who, where, what, and when information. These semantic features are provided by preliminary analysis from other layers, in addition to results recalled from the spatiotemporal associative memory layer, and are used to top-down bias the Leabra Vision network to focus the object-recognition process… semantic features can be inferred from never-before-seen objects, leading to a rough classification and recovery of mission-relevant features. The authors of Literature Reference No. 19, for example, demonstrated that by presenting images, object labels, and semantic features for known objects, bidirectional connections from IT and semantic label layers allow generalization to never-before-seen images… During training, a bidirectionally connected associative network learns the joint distribution of object parts and relative location. 
During testing, as each object part is identified, it casts a vote as to the predicted object center through the bidirectional associative network. These predicted object center offsets are accumulated for a more robust estimate of object location with an object-center semantic 402… Object-center semantic maps enable 402 localization (i.e., estimating a bounding box) of objects… Similar to the object center map, each V1 receptive field image patch also has a silhouette map associated with it, conditioned on object type 404. During testing, output from the object layer intersects with the input image to create an approximate foreground-background segmentation 406. Through bottom-up/top-down iterative processing, this segmentation finds the most consistent interpretation of data against hypotheses. Rapid localization and segmentation, in turn, boosts object recognition by suppressing clutter and other nuisance factors. Silhouette semantic maps enable segmentation (i.e., estimation of the silhouettes) of objects in the present of clutter by explicitly learning the expected mask over the support (N×N pixel area) of each detected pattern, given the object identity and the location and strength of the detected pattern… Object-center and silhouette maps are extensions to the Leabra Vision system. Conditioning image patches (V1 receptive fields) on object type creates a hypothesis of where object centers and extents occur. Through the interaction between training assumptions and testing images, spatial pooling enhances localization and segmentation while suppressing clutter and other nuisance factors; Col. 20-21: general object labels layer 612 is the coarse categorization of objects in the image consisting of the set of objects the model is trained to learn. There is also a specialized object labels layer 614 and a semantics layer 616 which, as implied by its name, represent other more detailed information about the image (e.g., car model, person identity, plant type)… object label activations (i.e., truck, person, gun) are projected top-down to generate an “attentional” shroud (attentional feedback 202 in FIG. 2) over each object separately. This attentional shroud modulates the filter responses such that only the activated regions are strong enough to activate the layers above it on the next and final third pass 620, essentially focusing the attention of the hierarchy on specific regions of the image, determined by what general object category was found in the first pass 600. Furthermore, through the localization and segmentation augmentations of the Leabra Vision system, the object centers and the segments of the objects are also estimated… provide input to the SPARR system when performing recognition by selectively activating specific object or semantic label neurons in the upper layers or the attentional shroud; wherein the one or more neural networks receive a boundary input separating an image space of the semantic layout map; and further comprising: receive receiving indication of the one or more semantic labels to be associated with the semantic layout map; and generating the semantic layout map with the one or more semantic labels (e.g. system for visual media reasoning using sparse associative recognition and recall includes filtering an input image having input data using a non-linear sparse coding module and a first series of sparse coding filter kernels tuned to represent objects of general categories (i.e. types, classes, keywords, topics, etc.) of input image content (i.e. 
respective types of image content), for example, including performing object recognition on a set of sparse codes by using a neurally-inspired vision module to generate object and semantic labels (i.e. one or more semantic labels indicating respective types of image content) for a set of sparse codes, for example, including silhouette (i.e. layout, contour, boundary, shape etc.) semantic maps (i.e. wherein the one or more neural networks receive a boundary input separating an image space of the semantic layout map) that enable localization and segmentation of objects (i.e. receive receiving indication of the one or more semantic labels to be associated with the semantic layout map; and generating the semantic layout map with the one or more semantic labels), for example, and selectively activating specific object or semantic label neurons in the neurally-inspired vision module by using specialized object labels layer and a semantics layer which represent more detailed information about the image, including object label activations that are projected to generate an “attentional” shroud (attentional feedback) over each object separately (i.e. generate representations of the respective types of image content), for example, and this attentional shroud modulates filter responses such that only the activated regions (i.e. “modulate” one or more activations”) are strong enough to activate the layers on the next and final pass, essentially focusing the attention of the hierarchy on specific regions of the image, determined by what general object category was found, as indicated above), for example).
The same motivation to combine the above-mentioned teachings applies, as previously set forth in the rejection of claim 6.
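For illustration only, and not as a characterization of any cited reference or of the pending claims: the limitation addressed above amounts to building a per-pixel, one-hot semantic layout map from user-indicated semantic labels and a boundary input that separates the image space. The following minimal sketch shows such a data structure; all names and sizes are assumptions made for the example.

```python
# Illustrative sketch only; not part of the cited references. Shows one way a
# one-hot semantic layout map could be assembled from user-chosen labels and a
# boundary that splits the image space into regions (names are hypothetical).
import numpy as np

def make_semantic_layout(height, width, boundary_col, labels, num_classes):
    """Build an (H, W, num_classes) one-hot layout map.

    boundary_col: column index separating the image space into two regions.
    labels: (left_label, right_label) class indices assigned to each region.
    """
    index_map = np.zeros((height, width), dtype=np.int64)
    index_map[:, :boundary_col] = labels[0]   # e.g. label for the left region
    index_map[:, boundary_col:] = labels[1]   # e.g. label for the right region
    # One-hot encode: each pixel becomes a vector of length num_classes.
    layout = np.eye(num_classes, dtype=np.float32)[index_map]
    return layout

layout_map = make_semantic_layout(256, 256, boundary_col=128,
                                  labels=(3, 7), num_classes=20)
print(layout_map.shape)  # (256, 256, 20)
```

The resulting (H, W, num_classes) tensor is the kind of per-pixel, one-hot segmentation input that the generators discussed in the cited passages consume.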
Regarding claim 9, claim 8 is incorporated and the combination of Fu, Ioffe, Schwartz, and Owechko, as a whole, teaches the method (Fu, Par. [0003]), further comprising:
selecting content, from a plurality of content options of types of image content associated with the one or more semantic labels (Owechko, Col. 2: a system for visual media reasoning using sparse associative recognition and recall. The system comprises one or more processors and a memory having instructions such that when the instructions are executed, the one or more processors perform multiple operations… An input image having input data is filtered using a non-linear sparse coding module and a first series of sparse coding filter kernels tuned to represent objects of general categories, followed by a second series of sparse coding filter kernels tuned to represent objects of specialized categories, resulting in a set of sparse codes. Object recognition is performed on the set of sparse codes by a neurally-inspired vision module to generate object and semantic labels for the set of sparse codes… selectively activating specific object or semantic label neurons in the neurally-inspired vision module; Col. 9-12: a method and system for recovering information pertaining to a photo or video about when it was taken, what is contained within, who is present, and where was the picture taken using sparse associative recognition and recall for visual media reasoning. The method and system is referred to as sparse associative recognition and recall, or SPARR. It can be implemented as a software system that will assist human analysts in rapidly extracting mission-relevant information from photos or video. This task is referred to as visual media reasoning (VMR). It includes reasoning with image content and preliminary annotations (e.g., who is in the photo?) to fully recover the 4 W's: “who”, “what”, “where”, and “when”… SPARR architecture comprises three tightly coupled layers that make use of bidirectional feedback between layers… layer 104 provides object and semantic labels… SPARR uses a general-purpose dictionary that can represent broad classes of natural images as a linear combinations of image prototypes… Starting with a general-purpose model (i.e., general categories pathway 206), the model predicts the existence of a certain coarse object category… in an image 210… Based on previous observations (i.e., training data) of images with semantic labels (i.e., annotated answers to all or some questions regarding who, what, where, when), or labeled images with associations… from the neurally-inspired visual layer (FIG. 1, element 104), the SAM layer… can predict with a certain confidence the semantics given a novel input image 102. These semantics (FIG. 1, labeled image with associations 118), which are in the form of activations of neurons; Col. 15-17: semantic features encode who, where, what, and when information. These semantic features are provided by preliminary analysis from other layers, in addition to results recalled from the spatiotemporal associative memory layer, and are used to top-down bias the Leabra Vision network to focus the object-recognition process… semantic features can be inferred from never-before-seen objects, leading to a rough classification and recovery of mission-relevant features. The authors of Literature Reference No. 
19, for example, demonstrated that by presenting images, object labels, and semantic features for known objects, bidirectional connections from IT and semantic label layers allow generalization to never-before-seen images… During training, a bidirectionally connected associative network learns the joint distribution of object parts and relative location. During testing, as each object part is identified, it casts a vote as to the predicted object center through the bidirectional associative network. These predicted object center offsets are accumulated for a more robust estimate of object location with an object-center semantic 402… Object-center semantic maps enable 402 localization (i.e., estimating a bounding box) of objects… Similar to the object center map, each V1 receptive field image patch also has a silhouette map associated with it, conditioned on object type 404. During testing, output from the object layer intersects with the input image to create an approximate foreground-background segmentation 406. Through bottom-up/top-down iterative processing, this segmentation finds the most consistent interpretation of data against hypotheses. Rapid localization and segmentation, in turn, boosts object recognition by suppressing clutter and other nuisance factors. Silhouette semantic maps enable segmentation (i.e., estimation of the silhouettes) of objects in the present of clutter by explicitly learning the expected mask over the support (N×N pixel area) of each detected pattern, given the object identity and the location and strength of the detected pattern… Object-center and silhouette maps are extensions to the Leabra Vision system. Conditioning image patches (V1 receptive fields) on object type creates a hypothesis of where object centers and extents occur. Through the interaction between training assumptions and testing images, spatial pooling enhances localization and segmentation while suppressing clutter and other nuisance factors; Col. 20-21: general object labels layer 612 is the coarse categorization of objects in the image consisting of the set of objects the model is trained to learn. There is also a specialized object labels layer 614 and a semantics layer 616 which, as implied by its name, represent other more detailed information about the image (e.g., car model, person identity, plant type)… object label activations (i.e., truck, person, gun) are projected top-down to generate an “attentional” shroud (attentional feedback 202 in FIG. 2) over each object separately. This attentional shroud modulates the filter responses such that only the activated regions are strong enough to activate the layers above it on the next and final third pass 620, essentially focusing the attention of the hierarchy on specific regions of the image, determined by what general object category was found in the first pass 600. Furthermore, through the localization and segmentation augmentations of the Leabra Vision system, the object centers and the segments of the objects are also estimated… provide input to the SPARR system when performing recognition by selectively activating specific object or semantic label neurons in the upper layers or the attentional shroud; further comprising:
selecting content, from a plurality of content options of types of image content associated with the one or more semantic labels (e.g. system for visual media reasoning using sparse associative recognition and recall includes filtering an input image having input data using a non-linear sparse coding module and a first series of sparse coding filter kernels tuned to represent objects of general categories (i.e. types, classes, keywords, topics, etc.) of input image content (i.e. respective types of image content), for example, including performing object recognition on a set of sparse codes by using a neurally-inspired vision module to generate object and semantic labels (i.e. one or more semantic labels indicating respective types of image content) for a set of sparse codes, for example, including silhouette (i.e. layout, contour, boundary, shape etc.) semantic maps (i.e. wherein the one or more neural networks receive a boundary input separating an image space of the semantic layout map) that enable localization and segmentation of objects (i.e. receive receiving indication of the one or more semantic labels to be associated with the semantic layout map; and generating the semantic layout map with the one or more semantic labels), for example, and selectively activating specific object or semantic label neurons in the neurally-inspired vision module by using specialized object labels layer and a semantics layer which represent more detailed information about the image, including object label activations that are projected to generate an “attentional” shroud (attentional feedback) over each object separately (i.e. selecting content, from a plurality of content options of types of image content associated with the one or more semantic labels), for example, and this attentional shroud modulates filter responses such that only the activated regions (i.e. “modulate” one or more activations”) are strong enough to activate the layers on the next and final pass, essentially focusing the attention of the hierarchy on specific regions of the image, determined by what general object category was found, as indicated above), for example).
The same motivation to combine the above-mentioned teachings applies, as previously set forth in the rejection of claim 6.
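For illustration only (hypothetical names throughout, not the claimed method or any cited disclosure): the "selecting content, from a plurality of content options" limitation can be pictured as a simple lookup that maps each semantic label to a set of candidate content types and selects one before that region of the image is rendered.

```python
# Illustrative sketch only; hypothetical labels and options. Demonstrates the
# general idea of selecting content from a plurality of content options that
# are associated with a semantic label.
import random

CONTENT_OPTIONS = {
    "tree": ["oak", "pine", "palm"],         # options for the "tree" label
    "sky": ["clear", "overcast", "sunset"],  # options for the "sky" label
}

def select_content(label, rng=random):
    """Pick one content option for the given semantic label."""
    options = CONTENT_OPTIONS[label]
    return rng.choice(options)

print(select_content("tree"))  # e.g. "pine"
```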
Regarding claim 14, claim 6 is incorporated and the combination of Fu, Ioffe, Schwartz, and Owechko, as a whole, teaches the method (Fu, Par. [0003]), wherein the one or more neural networks is a generative adversarial network (GAN) including a generator and a discriminator (Fu, Par. [0034-45]: a segmentor implemented with a neural network that is designed to impose semantic information on the generated images… GAN has the potential to provide realistic image generation… Segmentation Guided Generative Adversarial Network (SGGAN), which fully leverages semantic segmentation information to guide the image generation (e.g., translation) process… the image semantic segmentation can be obtained through a variety of methodologies, such as human annotations or any variety of existing segmentation methods… explicitly guide the generator with pixel-level semantic segmentations and, thus, further boost the quality of generated images. Further, the target segmentation employed in embodiments works as a strong prior, i.e., provides knowledge that stems from previous experience, for the image generator, which is able to use this prior knowledge to edit the spatial content… the segmentor neural network 260 includes a convolutional block 261… The segmentor network 260 implemented with the blocks 261-265 is configured to receive an input image 266a and/or 266b and generate a corresponding segmentation 267a and/or 267b, respectively, indicating features of the input images 266a and/or 266b. The segmentor 260 imposes semantic information on the image generation process … During training, estimated segmentations from the segmentor 260 are compared with their ground-truth values, which provides gradient information to optimize the generator 220. This optimization tends to teach the generator 220 to impose the spatial constraints indicated in an input segmentation 271 on the translated images, e.g., 274. During the training 270, the segmentor 260 provides spatial guidance to the generator 220 to ensure the generated images, e.g., 274, comply with input segmentations, e.g., 271. The discriminator 240 aims to ensure the translated images, e.g., 274, are as realistic as the real images, e.g., 273… the segmentor 260 receives a target segmentation 271 and a generated image 274 produced by the generator 220. Then, based upon a segmentation loss, i.e., the difference between a segmentation determined from the generated image 274 and the target segmentation 271, the segmentor 260 is adjusted, e.g., weights in a neural network implementing the segmentor 260 are modified so the segmentor 260 produces segmentations that are closer to the target segmentation 271. The generator 240 is likewise adjusted based upon the segmentation loss to generate images that are closer to the target segmentation 271; Par. [0050-68]: generator 301 is configured to receive three inputs, an input image (source image) 304, a target segmentation 305, and a vector of target attributes 306. A goal of the training process is to configure the generator 301 to translate the input image 304 into a generated image (fake image) 307, which complies with the target segmentation 305 and attribute labels 306… path of generator training is a reconstruction loss path which takes the generated image 307 as an input to the generator 301, as well as two other inputs, a source segmentation 315 (which may be a ground-truth landmark based segmentation) and a source attributes label 316. 
This path is expected to reconstruct an image 317 from the generated fake image 307 that should match the input source image 304… To train the discriminator 302, the input source image 304 is fed to the discriminator 302 which generates the discrimination result 319 and classification result 320. The discrimination result 319 is used to calculate a real adversarial loss term 321, and the classification result 320 is compared with the real source attributes label 316 to calculate a real classification loss 322. The fake adversarial losses 313, real adversarial losses 321, and the real classification loss 322 are summed up and fed to the optimizer 3laim 23 to optimize the discriminator 302 … a problem formulation, details of the segmentor network, and an overall objective function of an embodiment are provided… Let x, s, and c be an image of size (HxWx3), with the segmentation map (HxW ns) and attributes vector (1xnc) in the source domain; while y, s' and c' are its corresponding image, segmentation, and attributes in the target domain. The number of segmentation classes is denoted as ns as classes and the number of all the attributes is denoted as nc. Note, that for s and s', each pixel is represented by a one-hot vector of ns classes, while for c and c', they are binary vectors of multiple labels, in the scenario of multi-domain translation. Given this formulation, a goal of an embodiment is to find a mapping… To achieve this… G is formulated as the generator network in the proposed SGGAN model. Meanwhile, such an embodiment employs a discriminator D and a segmentor S to supervise the training of the generator G… In order to guide the generator by the target segmentation information, an additional network is built which takes an image as input and generates the image's corresponding semantic segmentation. This network is referred to as the segmentor network S which is trained together with the GAN framework… a processor-generated image, where the processor may be a neural network, and a target segmentation, according to an embodiment, is a set of segments, e.g., sets of pixels or set of contours, that correspond to portions or landmarks… of an image… extract features of the target segmentation, a first concatenation block configured to concatenate the extracted features with a latent vector, an up-sampling block configured to construct a layout of the fake image using the concatenated extracted features and latent vector, a second concatenation block configured to concatenate the layout with an attribute label to generate a multidimensional matrix representing features of the fake image; Par. [0115-134]: generate more realistic results with better image quality (sharper and clearer details) after image translation… embodiments implement image generation with spatial constraints. An example embodiment implements a novel Spatially Constrained Generative Adversarial Network (SCGAN), which decouples the spatial constraints from the latent vector and makes them available as additional control signal inputs… SCGAN includes a generator network, a discriminator network with an auxiliary classifier, and a segmentor network, which are trained together adversarially… the generator is specially designed to take a semantic segmentation, a latent vector, and an attribute label as inputs step by step to synthesize a fake image… the discriminator network is configured to distinguish between real images and generated images as well as classify images into attributes. 
The discrimination and classification results may guide the generator to synthesize realistic images with correct target attributes. The segmentor network… attempts to determine semantic segmentations on both real images and fake images to deliver estimated segmentations to guide the generator in synthesizing spatially constrained images. With those networks, embodiments implementing the SCGAN generate realistic images guided by semantic segmentations and attribute labels… provide image generation that synthesizes realistic images which cannot be distinguished from the real images of a given target dataset. Embodiments employ spatial constraints in generating high-quality images with target-oriented controllability… In an embodiment, (x, c, s) denotes the joint distribution of a target dataset, where x is a real image sample of size (HxWx3) with H and Was the height and width of x, c is its attribute label of size (1xnc) with nc as the number of attributes, and s is its semantic segmentation of size (HxWxns) with ns as the number of segmentation classes. Each pixel in s is represented by a one-hot vector with dimension ns, which codes the semantic index of that pixel… the problem is be defined as G (z, c, s)… where G( , , ) is the generating function, z is the latent vector of size (1xnz), c defines the target attributes, s acts as a high-level and pixel-wise spatial constraint, and y is the conditionally generated image which complies with the target c and s… SCGAN 1330 comprises three networks, a generator network G 1340, a discriminator network D 1360 with auxiliary classifier, and a segmentor network S 1380 which are trained together. The generator 1340 is designed such that a semantic segmentation, a latent vector, and an attribute label are input to the generator 1340 step by step to generate a fake image. The discriminator takes either fake or real images as input and outputs a discrimination result and a classification result. Similar to the discriminator 1360, the segmentor 1380 takes either a fake or real image as input and outputs a segmentation result which is compared to the ground-truth segmentation to calculate segmentation loss, which guides the generator to synthesize fake images that comply with the input segmentation… to construct the basic structure of the output image and attribute label c 1352 is fed into the generator 1340 through the expand block 1354 to guide the generator 1340 to generate attribute-specific images which share the similar basic image contents generated from s 1350 and z 1351…the expand block 1354 performs an expand operation to an input vector (attribute label 1352) by repeating the vector 1352 to the same width and height as a reference vector (the output of the block 1344). In such an embodiment, the expansion allows the vectors to be concatenated at the block 1345. The attribute label c 1352 is concatenated at block 1345 with the basic structure from the up-sampling blocks 1343 and 1344, to generate a multidimensional matrix representing features of the image being generated… To obtain realistic results which are difficult to distinguish from original data x (e.g., a real image 1356), the discriminator network D 1360 is implemented to form a GAN framework with the generator G 1340; Par. [0172-182]: generator of an embodiment of the SCGAN takes three inputs, semantic segmentation, latent vector, and attribute label. 
A critical issue is that the contents in the synthesized image should be decoupled well to be controlled by those inputs (semantic segmentation, latent vector, and attribute label)… a generator that functions in a step-by-step way to first, extract spatial information from semantic segmentation to construct the basic spatial structure of the synthesized image. Second, such an embodiment of the generator, takes the latent vector to add variations to the other unregulated components, and, in turn, uses the attribute label to render attribute-specific contents. As a result… the generator can successfully decouple the contents of synthesized images into controllable inputs. This approach solves the foreground-background merging problem and generates spatially controllable and attribute-specific images with variations on other unregulated contents… a conditional and target-oriented image generation task using a novel deep learning based adversarial network. In particular, one such example embodiment increases the controllability of image generation by using semantic segmentation as spatial constraints and attribute labels as conditional guidance. Embodiments can control the spatial contents as well as attribute-specific contents and generate diversified images with sharper and more realistic details… target-oriented image generation with spatial constraints… Spatially Constrained, Generative Adversarial Network (SCGAN) that decouples the spatial constraints from a latent vector and makes them available as additional control signal inputs. A SCGAN embodiment includes a generator network, a discriminator network with an auxiliary classifier, and a segmentor network, which are trained together adversarially… the generator is specially designed to take a semantic segmentation, a latent vector, and an attribute label as inputs step by step to synthesize a fake image. The discriminator network tries to distinguish between real images and generated images as well as classify the images into attributes. The discrimination and classification results guide the generator to synthesize realistic images with correct target attributes. The segmentor network attempts to conduct semantic segmentations on both real images and fake images to deliver estimated segmentations to guide the generator in synthesizing spatially constrained images. With those networks, example embodiments have increased controllability of an image synthesis task. Embodiment generate target-oriented realistic images guided by semantic segmentations and attribute labels; wherein the neural network is a generative adversarial network (GAN) including a generator and a discriminator (e.g. generate (i.e. produce, infer, construct, etc.) target-oriented realistic images guided by semantic segmentations (i.e. based on semantic layout, semantic segmentation mask, etc.) and attribute labels representing features of the image being generated, including target segmentation(s) comprising a set of segments, such as sets of pixels or set of contours (i.e. boundaries, shapes, silhouettes, etc.) that correspond to portions (i.e. regions, areas, etc.) or landmarks of an image being generated (i.e. received semantic layout/segmentation mask indicating a plurality (first, second, third… Nth) of regions of a digital representation of an image), by using a neural network, such as a Spatially Constrained Generative Adversarial Network (SCGAN) (i.e. 
the neural network is a generative adversarial network (GAN)), including a generator network and a discriminator network (i.e. including a generator and a discriminator), to generate diversified images with sharper and more realistic details, as indicated above), for example).
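For illustration only, a minimal PyTorch-style sketch of the generator/discriminator arrangement discussed above; the layer sizes and module names are assumptions made for the example and do not reproduce Fu's SGGAN/SCGAN architecture. The generator consumes a one-hot segmentation map, a latent vector, and an attribute label; the discriminator scores the realism of the generated image.

```python
# Minimal sketch, for illustration only; layer sizes and names are assumptions
# and do not reproduce Fu's disclosed networks. The generator takes a
# segmentation map, a latent vector, and an attribute label; the discriminator
# outputs patch-wise realism scores.
import torch
import torch.nn as nn

class Generator(nn.Module):
    def __init__(self, num_classes=20, latent_dim=64, num_attrs=5):
        super().__init__()
        self.seg_encoder = nn.Conv2d(num_classes, 32, 3, padding=1)
        self.decoder = nn.Sequential(
            nn.Conv2d(32 + latent_dim + num_attrs, 64, 3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(64, 3, 3, padding=1),
            nn.Tanh(),  # fake RGB image in [-1, 1]
        )

    def forward(self, segmentation, latent, attrs):
        b, _, h, w = segmentation.shape
        feats = self.seg_encoder(segmentation)
        # Broadcast the latent vector and attribute label over the spatial grid.
        z = latent.view(b, -1, 1, 1).expand(-1, -1, h, w)
        a = attrs.view(b, -1, 1, 1).expand(-1, -1, h, w)
        return self.decoder(torch.cat([feats, z, a], dim=1))

class Discriminator(nn.Module):
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3, 32, 4, stride=2, padding=1),
            nn.LeakyReLU(0.2, inplace=True),
            nn.Conv2d(32, 1, 4, stride=2, padding=1),  # patch-wise realism score
        )

    def forward(self, image):
        return self.net(image)

seg = torch.zeros(1, 20, 64, 64); seg[:, 3] = 1.0   # toy one-hot segmentation
fake = Generator()(seg, torch.randn(1, 64), torch.zeros(1, 5))
score = Discriminator()(fake)
print(fake.shape, score.shape)  # torch.Size([1, 3, 64, 64]) torch.Size([1, 1, 16, 16])
```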
Regarding claim 15, Fu discloses a system, comprising:
one or more processors (Par. [0004]: the system comprises a processor and a memory with computer code instructions stored thereon, wherein the processor and the memory, with the computer code instructions, are configured to cause the system to provide a generator, discriminator, and segmentor).
The steps of the program further recited in claim 15 correspond, when executed, to the steps of method claim 6 and are rejected as applied to method claim 6 above.
Regarding claim 16, claim 15 is incorporated and is a corresponding apparatus claim rejected as applied to the method claim 7 above.
Regarding claim 17, claim 15 is incorporated and is a corresponding apparatus claim rejected as applied to the method claim 8 above.
Regarding claim 21, Fu discloses a processor, comprising:
one or more circuits to use one or more neural networks (Par. [0003-4]: methods and systems for image generation through use of adversarial networks… the system comprises a processor and a memory with computer code instructions stored thereon, wherein the processor and the memory, with the computer code instructions, are configured to cause the system to provide a generator, discriminator, and segmentor).
The steps of the program further recited in claim 21 correspond, when executed, to the steps of method claim 6 and are rejected as applied to method claim 6 above.
Regarding claim 22, claim 21 is incorporated and is a corresponding apparatus claim rejected as applied to the method claim 7 above.
Regarding claim 23, claim 21 is incorporated and is a corresponding apparatus claim rejected as applied to the method claim 8 above.
Claims 10-11, 18-19, and 24 are rejected under 35 U.S.C. 103 as being unpatentable over Fu, in view of Ioffe, in view of Schwartz, in view of Owechko, as applied to claim 6 above, in further view of Suzuki et al. (“Collaging on Internal Representations: An Intuitive Approach for Semantic Transfiguration”), hereafter referred to as Suzuki, applicant cited prior art.
Regarding claim 10, claim 8 is incorporated and the combination of Fu, Ioffe, Schwartz, and Owechko, as a whole, teaches the method (Fu, Par. [0003]), but fails to teach the following as further recited in claim 10.
However, Suzuki teaches wherein the one or more least one spatially-adaptive normalization layers is a conditional layer configured to propagate semantic information from the semantic layout map throughout other layers of the neural network (Suzuki, Pg. 1, Abstract: CNN-based image editing method that allows the user to change the semantic information of an image over a user-specified region. Our method makes this possible by combining the idea of manifold projection with spatial conditional batch normalization (sCBN), a version of conditional batch normalization with userspecifiable spatial weight maps. With sCBN and manifold projection, our method lets the user perform (1) spatial class translation that changes the class of an object over an arbitrary region of user’s choice, and (2) semantic transplantation that transplants semantic information contained in an arbitrary region of the reference image to an arbitrary region in the target image; Pg. 1, Par. 1-2: deep generative models like generative adversarial networks (GANs) [10] and variational autoencoders (VAEs) [20] make possible the unsupervised learning of rich latent semantic information from images… Image conditional GANs [24, 40, 17] based on encoderdecoder architectures have been popular both for their convenient implemention in end-to-end differentiable ML frameworks, and their uncanny ability to produce photo- realistic images; Pg. 2, Par. 3: CNN-based image editing method that grants the user this very freedom. With our method, the user can transform a user-chosen part of image in a copy-paste fashion–and the user can do this all the while preserving semantic consistency. More precisely, we present a method that features two types of image transformation: (1) spatial class-translation that translates a class category of a region of interest, and (2) semantic transplantation that transplants a semantic feature of an user-selected region in an arbitrary image to a region of interest in the target image. To facilitate this editing process, we also propose an efficient optimization method to project images onto the latent space of generator; Pg. 2, Par. 7-8: Class-conditional GAN [29, 26, 43, 2] is a framework designed to learn an invariant latent representation among various classes, and it is capable of generating diverse images from a same latent code z by changing class embedding (Figure 2). The work of [26, 2], in particular, succeeded in producing an impressive results by interpolating the parameters of conditional batch normalization layer, which was first introduced in [31, 5]. Conditional batch normalization (CBN) is mechanism that learns conditional information by separately learning condition-specific scaling parameter and shifting parameter for batch normalization. Our method extends the technique used in [26] by restricting the region of interpolation to a region that corresponds to the region of interest in the pixel space. We will refer to our approach spatial conditional batch normalization (sCBN). Unlike the manipulation done in style transfer [12], we introduce the conditional information at multiple levels in the network, depending on the style preference of the user. As we will show, sCBN in the lower layers transforms global features, and sCBN at upper layers transforms local features… Semantic transformation. 
In order to grant the user with wide freedom of semantic transformation, there has to be some mechanism to finely adjust the user-suggested transformation so that the final product becomes natural; Pg. 3, Par. 3-4: Spatial Class-translation With our spatial class translation, the user can change the class of the object in the user-selected region of interest (ROI). The user can change the class of a part of the target objects in intuitive fashion… Spatial Semantic Transplantation With our semantic transplantation, the user can transplant a semantic feature of the user-selected object in the reference image to an object in the target image to be transformed. Our method first prompts the user to specify the region in the target image containing the object of interest, along with the reference image of equal size. The user will be also asked to specify the region in the reference image that contains the semantic information to be transplanted. The method then automatically transplants the semantic information of the specified region of the reference image into the target image; Pg. 4, Par. 2-5 and Pg. 5, Par. 1: spatial class translation Our method functions on a trained conditional generator G, paired with the discriminator D with which G was trained. Upon receiving the region of interest x clipped from the target image and the class c of the target object contained in x, the algorithm begins by looking for a latent variable z such that G(z; c) will be close to x in the feature space of D (Manifold Projection step). The class c can either be specified by the user or by a pre-trained classifier. Suppose that the user wants to partially translate a region R in x to a class c′, and let Vℓ be the set of features in ℓ-th conditional batch normalization(CBN) layers that correspond to R in the pixel space. Our method then simply substitutes the parameters governing the shift and mean parameters of Vℓ with those of c′ (Figure 6). This will result in a modification of G… in which the CBN parameters of Vℓ exclusively carry the style information of the class c′. A transformed image can be constructed by applying this modified G… to z… our spatial editing method is applicable to any generative model (e.g., GAN, VAE) that is equipped with a machanism to iteratively incorporate class information during its image generation process. We will next elaborate on the design of our sCBN and spatial semantic implantation, along with the other details omitted in the brief description above… Spatial conditional batch normalization (sCBN) is the core of our spatial class translation. As can be inferred from our naming, sCBN is based on batch normalization (BN) [16], a technique developed for the purpose of reducing the internal covariance shift to accelerate the training of neural network. More precisely, we will borrow our idea from conditional batch normalization (CBN) [8, 5], a variant of BN that incorporates the class specific semantic information in the parameters for BN. Given a set of batches sampled each from a single class, the conditional batch normalization [8, 5] works by modulating the set of intermediate features produced from each batch of inputs so that it follow a normal distribution with mean and variance that are specific to the corresponding class. Let us fix the layer ℓ, and let Fk,h,w represent the feature of ℓ-th layer at channel k, height location h, and width location w. 
Given a batch {Fi,k,h,w} of Fk,h,w s generated from class c, the CBN at layer ℓ then transforms Fi,k,h,w… In our implementation, we replaced CBN at each layer with sCBN; Pg. 6, Par. 3: generator used in our study is a ResNet-based generator trained as part of a conditional DCGAN. Each residua l block in our generator contains the conditional batch normalization (CBN) layer. At the time of inference, these CBN layers are replaced by the aforementioned sCBN layers that are tailored to the user’s preference. We base our architectures on those used in previous work [25, 26], and used the pre-trained model from [26]; Pg. 8, Par. 6: image transformation method that allows the user to translate the class of an object and transplant semantic features over a user-specified pixel region of the image. Indeed, there is still much room left for the exploration of the semantic information contained in the intermediate feature spaces of CNNs. We were, however, able to show that we can manipulate this information in a somewhat intuitive manner and produce customized photorealistic images; Pg. 11, Par. 9: we conducted a set of automatic spatial class translations. For each one the selected images, we (1) used a pre-trained model to extract the region of the object to be transformed (dog/cat), (2) conducted the manifold projection to obtain the z, (3) passed z to the generator with the class map corresponding to the segmented region, and (4) conducted a post-processing over the segmented region. For the semantic segmentation, we used a TensorFlow implementation of DeepLab v3 Xception model trained on MS COCO dataset; wherein the one or more least one spatially-adaptive normalization layers is a conditional layer configured to propagate semantic information from the semantic layout map throughout other layers of the neural network (e.g. image generative model which uses a neural network, such as a Generative Adversarial Network (GAN), that is equipped with a mechanism to iteratively incorporate class (i.e. feature, attribute, label, etc.) information during its image generation, including semantic features (i.e. class, attribute, label, etc.) of selected object regions (i.e. propagate semantic information), which are segmented/extracted in a reference (i.e. source, input, etc.) image, corresponding to object(s) in a target image to be transformed (i.e. an image to be generated), based on the set of features Vℓ in ℓ-th (first, second, third… Nth) conditional (i.e. adaptive, instance, etc.) batch normalization (CBN) layers (i.e. the one or more least one spatially-adaptive normalization layers is a conditional layer configured to propagate (i.e. transfer, pass, etc.) semantic information from the semantic layout map throughout other layers of the neural network) that correspond to a region R in the pixel space, including a number of spatial conditional batch normalization (sCBN) layers, as indicated above), for example).
Fu, Ioffe, Schwartz, Owechko, and Suzuki are considered to be analogous art because they pertain to image processing applications. Therefore, the combined teachings of Fu, Ioffe, Schwartz, Owechko, and Suzuki, as a whole, would have rendered obvious the invention recited in claim 10 with a reasonable expectation of success in order to modify the methods and systems for image generation through use of adversarial networks (as disclosed by Fu) such that the at least one spatially-adaptive normalization layer is a conditional layer configured to propagate semantic information from the semantic layout throughout other layers of the neural network (as taught by Suzuki, Abstract, Pg. 1-6, 8, 11) by training neural networks to perform semantic segmentation in order to produce customized photorealistic images based on a set of photorealistic transformations (Suzuki, Abstract, Pg. 1-2 and 8).
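For illustration only (shapes and names are assumptions; this is neither Suzuki's sCBN implementation nor the claimed layer): a spatially-adaptive normalization layer can normalize intermediate activations and then re-scale and re-shift them with per-pixel parameters predicted from the semantic layout map, which is one way semantic information from the map can condition each layer at which such a normalization is inserted.

```python
# Illustrative sketch only; not Suzuki's sCBN nor the claimed layer. Activations
# are normalized, then modulated by per-pixel scale (gamma) and shift (beta)
# predicted from the semantic layout map. All sizes here are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SpatiallyAdaptiveNorm(nn.Module):
    def __init__(self, feature_channels, num_classes, hidden=32):
        super().__init__()
        self.norm = nn.InstanceNorm2d(feature_channels, affine=False)
        self.shared = nn.Sequential(
            nn.Conv2d(num_classes, hidden, 3, padding=1), nn.ReLU(inplace=True))
        self.to_gamma = nn.Conv2d(hidden, feature_channels, 3, padding=1)
        self.to_beta = nn.Conv2d(hidden, feature_channels, 3, padding=1)

    def forward(self, activations, layout_map):
        # Resize the layout map to this layer's activation resolution so the
        # same map can condition normalization at several layers of the network.
        layout = F.interpolate(layout_map, size=activations.shape[2:],
                               mode="nearest")
        ctx = self.shared(layout)
        gamma, beta = self.to_gamma(ctx), self.to_beta(ctx)
        return self.norm(activations) * (1 + gamma) + beta

x = torch.randn(1, 64, 32, 32)                        # intermediate activations
layout = torch.zeros(1, 20, 128, 128); layout[:, 5] = 1.0
print(SpatiallyAdaptiveNorm(64, 20)(x, layout).shape)  # torch.Size([1, 64, 32, 32])
```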
Regarding claim 11, claim 10 is incorporated and the combination of Fu, Ioffe, Schwartz, Owechko, and Suzuki, as a whole, teaches the method (Fu, Par. [0003]), further comprising:
modulating (Owechko, Col. 21: object label activations (i.e., truck, person, gun) are projected top-down to generate an “attentional” shroud (attentional feedback 202 in FIG. 2) over each object separately. This attentional shroud modulates the filter responses such that only the activated regions are strong enough to activate the layers above it on the next and final third pass 620, essentially focusing the attention of the hierarchy on specific regions of the image, determined by what general object category was found in the first pass 600. Furthermore, through the localization and segmentation augmentations of the Leabra Vision system, the object centers and the segments of the objects are also estimated. As the second pass 618 concludes, the object labels and semantics do not change; modulating (e.g. object label activations are projected top-down to generate an “attentional” shroud over each object separately, for example, and the attentional shroud modulates filter responses such that only the activated regions are strong enough to activate the layers (i.e. modulate activations), as indicated above), for example), by the one or more spatially-adaptive normalization layers, a set of activations through a spatially-adaptive transformation (Ioffe, Par. [0002-3]: specification relates to processing inputs through the layers of neural networks to generate outputs… Neural networks are machine learning models that employ one or more layers of nonlinear units to predict an output for a received input. Some neural networks include one or more hidden layers in addition to an output layer. The output of each hidden layer is used as input to the next layer in the network, i.e., the next hidden layer or the output layer. Each layer of the network generates an output from a received input in accordance with current values of a respective set of parameters; Par. [0015-46]: neural network system 100 can be configured to receive any kind of digital data input and to generate any kind of score or classification output based on the input … if the inputs to the neural network system 100 are images or features that have been extracted from images, the output generated by the neural network system 100 for a given image may be scores for each of a set of object categories, with each score representing an estimated likelihood that the image contains an image of an object belonging to the category… In particular, each of the layers of the neural network is configured to receive an input and generate an output from the input and the neural network layers collectively process neural network inputs received by the neural network system 100 to generate a respective neural network output for each received neural network input. Some or all of the neural network layers in the sequence generate outputs from inputs in accordance with current values of a set of parameters for the neural network layer. For example, some layers may multiply the received input by a matrix of current parameter values as part of generating an output from the received input… The neural network system 100 also includes a batch normalization layer 108 between a neural network layer A 104 and a neural network layer B 112 in the sequence of neural network layers. 
The batch normalization layer 108 is configured to perform one set of operations on inputs received from the neural network layer A 104 during training of the neural network system 100 and another set of operations on inputs received from the neural network layer A 104 after the neural network system 100 has been trained… During training of the neural network system 100 on a given batch of training examples, the batch normalization layer 108 is configured to receive layer A outputs 106 generated by the neural network layer A 104 for the training examples in the batch, process the layer A outputs 106 to generate a respective batch normalization layer output 110 for each training example in the batch, and then provide the batch normalization layer outputs 110 as an input to the neural network layer B 112. The layer A outputs 106 include a respective output generated by the neural network layer A 104 for each training example in the batch. Similarly, the batch normalization layer outputs 110 include a respective output generated by the batch normalization layer 108 for each training example in the batch…the batch normalization layer 108 computes a set of normalization statistics for the batch from the layer A outputs 106, normalizes the layer A outputs 106 to generate a respective normalized output for each training example in the batch, and, optionally, transforms each of the normalized outputs before providing the outputs as input to the neural network layer B 112… the neural network layer A 104 generates outputs by modifying inputs to the layer in accordance with current values of a set of parameters for the first neural network layer, e.g., by multiplying the input to the layer by a matrix of the current parameter values. In these implementations, the neural network layer B 112 may receive an output from the batch normalization layer 108 and generate an output by applying a non-linear operation, i.e., a non-linear activation function, to the batch normalization layer output… the neural network layer A 104 generates the outputs by modifying layer inputs in accordance with current values of a set of parameters to generate a modified first layer inputs and then applying a non-linear operation to the modified first layer inputs before providing the output to the batch normalization layer 108… FIG. 2 is a flow diagram of an example process 200 for generating a batch normalization layer output during training of a neural network on a batch of training examples. For convenience, the process 200 will be described as being performed by a system of one or more computers located in one or more locations. For example, a batch normalization layer included in a neural network system, e.g., the batch normalization layer 108 included in the neural network system 100 of FIG. 1, appropriately programmed, can perform the process 200… the batch normalization layer computes, for each dimension, the mean and the standard deviation of the components of the lower layer outputs that correspond to the dimension. The batch normalization layer then normalizes each component of each of the lower level outputs using the means and standard deviations to generate a respective normalized output for each of the training examples in the batch… the batch normalization layer computes, for each possible feature index and spatial location index combination, the mean and the variance of the components of the lower layer outputs that have that feature index and spatial location index. 
The batch normalization layer then computes, for each feature index, the average of the means for the feature index and spatial location index combinations that include the feature index. The batch normalization layer also computes, for each feature index, the average of the variances for the feature index and spatial location index combinations that include the feature index. Thus, after computing the averages, the batch normalization layer has computed a mean statistic for each feature across all of the spatial locations and a variance statistic for each feature across all of the spatial locations… the batch normalization layer transforms each component of each normalized output … In cases where the layer below the batch normalization layer is a layer that generates an output that includes multiple components indexed by dimension, the batch normalization layer transforms, for each dimension, the component of each normalized output in the dimension in accordance with current values of a set of parameters for the dimension. That is, the batch normalization layer maintains a respective set of parameters for each dimension and uses those parameters to apply a transformation to the components of the normalized outputs in the dimension. The values of the sets of parameters are adjusted as part of the training of the neural network system; Par. [057-59]: the batch normalization layer transforms each component of the normalized output… If the outputs generated by the layer below the batch normalization layer are indexed by dimension, the batch normalization layer transforms, for each dimension, the component of the normalized output in the dimension in accordance with trained values of the set of parameters for the dimension. If the outputs generated by the layer below the batch normalization layer are indexed by feature index and spatial location index, the batch normalization layer transforms each component of the normalized output in accordance with trained values of the set of parameters for the feature index corresponding to the component… The batch normalization layer provides the normalized output or the transformed normalized output as input to the layer above the batch normalization layer in the sequence; by the one or more spatially-adaptive normalization layers, a set of activations through a spatially-adaptive transformation (e.g. neural network system and method include neural networks (i.e. one or more neural networks), which are machine learning models that employ one or more layers of nonlinear units to predict an output for a received input, for example, in which each layer of each network generates an output from a received input in accordance with current values of a respective set of parameters, including a batch normalization layer that computes a mean statistic for each feature across all of spatial locations and a variance statistic for each feature across all of the spatial locations (i.e. one or more spatially-adaptive normalization layers of one or more neural networks), for example, and the batch normalization layer transforms (i.e. use one or more transformations) each component of each normalized output to generate outputs from inputs (i.e. an input) in accordance with current values of a set of parameters for the neural network layer, for example, in order to generate an output by applying (i.e. modulating, controlling, instructing, forcing, etc.) a non-linear operation, such as a non-linear activation function (i.e. 
activations), to the batch normalization layer output from inputs (i.e. by the one or more spatially-adaptive normalization layers, a set of activations through a spatially-adaptive transformation), as indicated above), for example) in order to propagate semantic information from the semantic layout map throughout the other layers of the one or more neural networks (Suzuki, Pg. 1, Abstract: CNN-based image editing method that allows the user to change the semantic information of an image over a user-specified region. Our method makes this possible by combining the idea of manifold projection with spatial conditional batch normalization (sCBN), a version of conditional batch normalization with userspecifiable spatial weight maps. With sCBN and manifold projection, our method lets the user perform (1) spatial class translation that changes the class of an object over an arbitrary region of user’s choice, and (2) semantic transplantation that transplants semantic information contained in an arbitrary region of the reference image to an arbitrary region in the target image; Pg. 1, Par. 1-2: deep generative models like generative adversarial networks (GANs) [10] and variational autoencoders (VAEs) [20] make possible the unsupervised learning of rich latent semantic information from images… Image conditional GANs [24, 40, 17] based on encoderdecoder architectures have been popular both for their convenient implemention in end-to-end differentiable ML frameworks, and their uncanny ability to produce photo- realistic images; Pg. 2, Par. 3: CNN-based image editing method that grants the user this very freedom. With our method, the user can transform a user-chosen part of image in a copy-paste fashion–and the user can do this all the while preserving semantic consistency. More precisely, we present a method that features two types of image transformation: (1) spatial class-translation that translates a class category of a region of interest, and (2) semantic transplantation that transplants a semantic feature of an user-selected region in an arbitrary image to a region of interest in the target image. To facilitate this editing process, we also propose an efficient optimization method to project images onto the latent space of generator; Pg. 2, Par. 7-8: Class-conditional GAN [29, 26, 43, 2] is a framework designed to learn an invariant latent representation among various classes, and it is capable of generating diverse images from a same latent code z by changing class embedding (Figure 2). The work of [26, 2], in particular, succeeded in producing an impressive results by interpolating the parameters of conditional batch normalization layer, which was first introduced in [31, 5]. Conditional batch normalization (CBN) is mechanism that learns conditional information by separately learning condition-specific scaling parameter and shifting parameter for batch normalization. Our method extends the technique used in [26] by restricting the region of interpolation to a region that corresponds to the region of interest in the pixel space. We will refer to our approach spatial conditional batch normalization (sCBN). Unlike the manipulation done in style transfer [12], we introduce the conditional information at multiple levels in the network, depending on the style preference of the user. As we will show, sCBN in the lower layers transforms global features, and sCBN at upper layers transforms local features… Semantic transformation. 
In order to grant the user with wide freedom of semantic transformation, there has to be some mechanism to finely adjust the user-suggested transformation so that the final product becomes natural; Pg. 3, Par. 3-4: Spatial Class-translation With our spatial class translation, the user can change the class of the object in the user-selected region of interest (ROI). The user can change the class of a part of the target objects in intuitive fashion… Spatial Semantic Transplantation With our semantic transplantation, the user can transplant a semantic feature of the user-selected object in the reference image to an object in the target image to be transformed. Our method first prompts the user to specify the region in the target image containing the object of interest, along with the reference image of equal size. The user will be also asked to specify the region in the reference image that contains the semantic information to be transplanted. The method then automatically transplants the semantic information of the specified region of the reference image into the target image; Pg. 4, Par. 2-5 and Pg. 5, Par. 1: spatial class translation Our method functions on a trained conditional generator G, paired with the discriminator D with which G was trained. Upon receiving the region of interest x clipped from the target image and the class c of the target object contained in x, the algorithm begins by looking for a latent variable z such that G(z; c) will be close to x in the feature space of D (Manifold Projection step). The class c can either be specified by the user or by a pre-trained classifier. Suppose that the user wants to partially translate a region R in x to a class c′, and let Vℓ be the set of features in ℓ-th conditional batch normalization(CBN) layers that correspond to R in the pixel space. Our method then simply substitutes the parameters governing the shift and mean parameters of Vℓ with those of c′ (Figure 6). This will result in a modification of G… in which the CBN parameters of Vℓ exclusively carry the style information of the class c′. A transformed image can be constructed by applying this modified G… to z… our spatial editing method is applicable to any generative model (e.g., GAN, VAE) that is equipped with a machanism to iteratively incorporate class information during its image generation process. We will next elaborate on the design of our sCBN and spatial semantic implantation, along with the other details omitted in the brief description above… Spatial conditional batch normalization (sCBN) is the core of our spatial class translation. As can be inferred from our naming, sCBN is based on batch normalization (BN) [16], a technique developed for the purpose of reducing the internal covariance shift to accelerate the training of neural network. More precisely, we will borrow our idea from conditional batch normalization (CBN) [8, 5], a variant of BN that incorporates the class specific semantic information in the parameters for BN. Given a set of batches sampled each from a single class, the conditional batch normalization [8, 5] works by modulating the set of intermediate features produced from each batch of inputs so that it follow a normal distribution with mean and variance that are specific to the corresponding class. Let us fix the layer ℓ, and let Fk,h,w represent the feature of ℓ-th layer at channel k, height location h, and width location w. 
Given a batch {Fi,k,h,w} of Fk,h,w s generated from class c, the CBN at layer ℓ then transforms Fi,k,h,w… In our implementation, we replaced CBN at each layer with sCBN; Pg. 6, Par. 3: generator used in our study is a ResNet-based generator trained as part of a conditional DCGAN. Each residua l block in our generator contains the conditional batch normalization (CBN) layer. At the time of inference, these CBN layers are replaced by the aforementioned sCBN layers that are tailored to the user’s preference. We base our architectures on those used in previous work [25, 26], and used the pre-trained model from [26]; Pg. 8, Par. 6: image transformation method that allows the user to translate the class of an object and transplant semantic features over a user-specified pixel region of the image. Indeed, there is still much room left for the exploration of the semantic information contained in the intermediate feature spaces of CNNs. We were, however, able to show that we can manipulate this information in a somewhat intuitive manner and produce customized photorealistic images; Pg. 11, Par. 9: we conducted a set of automatic spatial class translations. For each one the selected images, we (1) used a pre-trained model to extract the region of the object to be transformed (dog/cat), (2) conducted the manifold projection to obtain the z, (3) passed z to the generator with the class map corresponding to the segmented region, and (4) conducted a post-processing over the segmented region. For the semantic segmentation, we used a TensorFlow implementation of DeepLab v3 Xception model trained on MS COCO dataset; in order to propagate semantic information from the semantic layout map throughout the other layers of the one or more neural networks (e.g. image generative model which uses a neural network, such as a Generative Adversarial Network (GAN), that is equipped with a mechanism to iteratively incorporate class (i.e. feature, attribute, label, etc.) information during its image generation, including semantic features (i.e. class, attribute, label, etc.) of selected object regions (i.e. propagate semantic information), which are segmented/extracted in a reference (i.e. source, input, etc.) image, corresponding to object(s) in a target image to be transformed (i.e. an image to be generated), based on the set of features Vℓ in ℓ-th (first, second, third… Nth) conditional (i.e. adaptive, instance, etc.) batch normalization (CBN) layers (i.e. spatially-adaptive normalization layer is a conditional layer configured to propagate (i.e. transfer, pass, etc.) semantic information from the semantic layout throughout other layers of the neural network) that correspond to a region R in the pixel space, by incorporating the class specific semantic information in the parameters for batch normalization (BN), and given a set of batches sampled each from a single class, the conditional batch normalization works by modulating the set of intermediate features produced from each batch of inputs so that it follow a normal distribution with mean and variance that are specific to the corresponding class (i.e. modulating, by the spatially-adaptive normalization layer, a set of activations through a spatially-adaptive transformation in order to propagate the semantic information throughout the other layers of the neural network), as indicated above), for example).
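For illustration only, and not drawn from Suzuki or any other cited reference, the spatial conditional batch normalization operation discussed above may be sketched as follows; the identifiers and tensor shapes (e.g. SpatialCBN, class_map) are hypothetical and are chosen solely to show how class-specific scale and shift parameters can be mixed per pixel after a plain normalization step.

```python
import torch
import torch.nn as nn

class SpatialCBN(nn.Module):
    """Hypothetical spatial conditional batch normalization layer (illustrative only).

    Activations are first normalized per channel; class-specific scale (gamma) and
    shift (beta) parameters are then mixed per pixel according to a spatial class
    weight map, so that semantic (class) information modulates the activations."""

    def __init__(self, num_classes: int, num_channels: int):
        super().__init__()
        self.bn = nn.BatchNorm2d(num_channels, affine=False)  # plain normalization step
        self.gamma = nn.Embedding(num_classes, num_channels)  # per-class scale
        self.beta = nn.Embedding(num_classes, num_channels)   # per-class shift

    def forward(self, x: torch.Tensor, class_map: torch.Tensor) -> torch.Tensor:
        # x: (N, C, H, W) activations; class_map: (N, K, H, W) per-pixel class weights
        normalized = self.bn(x)
        gamma = torch.einsum("nkhw,kc->nchw", class_map, self.gamma.weight)
        beta = torch.einsum("nkhw,kc->nchw", class_map, self.beta.weight)
        return gamma * normalized + beta

# Example usage with random tensors (hypothetical shapes):
layer = SpatialCBN(num_classes=5, num_channels=64)
x = torch.randn(2, 64, 32, 32)
class_map = torch.softmax(torch.randn(2, 5, 32, 32), dim=1)
out = layer(x, class_map)  # (2, 64, 32, 32)
```

In such a sketch, placing the layer in lower blocks of a generator would modulate global features and placing it in upper blocks would modulate local features, consistent with the behavior Suzuki describes.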
The same motivation to combine the above-mentioned teachings applies, as previously indicated with respect to claim 10.
Regarding claim 18, claim 15 is incorporated; claim 18 is a corresponding apparatus claim and is rejected as applied to method claim 10 above.
Regarding claim 19, claim 18 is incorporated; claim 19 is a corresponding apparatus claim and is rejected as applied to method claim 11 above.
Regarding claim 24, claim 21 is incorporated; claim 24 is a corresponding apparatus claim and is rejected as applied to method claim 11 above.
Claims 12-13, 20, and 25 are rejected under 35 U.S.C. 103 as being unpatentable over Fu, in view of Ioffe, in view of Schwartz, and in view of Owechko, as applied to claim 6 above, in further view of GAO, and in further view of SU et al. (PG Pub. No. 2021/0150812 A1), hereafter referred to as SU.
Regarding claim 12, claim 6 is incorporated and the combination of Fu, Ioffe, Schwartz, and Owechko, as a whole, teaches the method (Fu, Par. [0003]), but fails to teach the following as further recited in claim 12.
However, GAO teaches further comprising:
normalizing, by the spatially-adaptive normalization layer, layer activations to zero mean (Par. [0104-120]: convolutional neural network is a special type of neural network. The fundamental difference between a densely connected layer and a convolution layer is this: Dense layers learn global patterns in their input feature space, whereas convolution layers learn local patters: in the case of images, patterns found in small 2D windows of the inputs. This key characteristic gives convolutional neural networks two interesting properties: (1) the patterns they learn are translation invariant and (2) they can learn spatial hierarchies of patterns… convolutional neural network learns highly non-linear mappings by interconnecting layers of artificial neurons arranged in many different layers with activation functions that make the layers dependent. It includes one or more convolutional layers, interspersed with one or more sub-sampling layers and non-linear layers, which are typically followed by one or more fully connected layers. Each element of the convolutional neural network receives inputs from a set of features in the previous layer… Convolutions operate over 3D tensors, called feature maps, with two spatial axes (height and width) as well as a depth axis (also called the channels axis). For an RGB image, the dimension of the depth axis is 3, because the image has three color channels; red, green, and blue. For a black-and-white picture, the depth is 1 (levels of gray). The convolution operation extracts patches from its input feature map and applies the same transformation to all of these patches, producing an output feature map. This output feature map is still a 3D tensor: it has a width and a height … The algorithm includes computing the activation of all neurons in the network, yielding an output for the forward pass; Par. [0130]: convolution layers of the convolutional neural network serve as feature extractors. Convolution layers act as adaptive feature extractors capable of learning and decomposing the input data into hierarchical features; Par. [0164-171]: Batch normalization is a method for accelerating deep network training by making data standardization an integral part of the network architecture. Batch normalization can adaptively normalize data even as the mean and variance change over time during training. It works by internally maintaining an exponential moving average of the batch-wise mean and variance of the data seen during training… Batch normalization can be seen as yet another layer that can be inserted into the model architecture, just like the fully connected or convolutional layer. The BatchNormalization layer is typically used after a convolutional or densely connected layer. It can also be used before a convolutional or densely connected layer… Batch normalization provides a definition for feed-forwarding the input and computing the gradients with respect to the parameters and its own input via a backward pass. In practice, batch normalization layers are inserted after a convolutional or fully connected layer, but before the outputs are fed into an activation function. For convolutional layers, the different elements of the same feature map--i.e. the activations--at different locations are normalized in the same way in order to obey the convolutional property. 
Thus, all activations in a mini -batch are normalized over all locations, rather than per activation… The internal covariate shift is the phenomenon where the distribution of network activations change across layers due to the change in network parameters during training. Ideally, each layer should be transformed into a space where they have the same distribution but the functional relationship stays the same. In order to avoid costly calculations of covariance matrices to decorrelate and whiten the data at every layer and step, we normalize the distribution of each input feature in each layer across each mini-batch to have zero mean and a standard deviation of one… the batch normalization procedure is described herein per activation; Par. [0185]: batch normalization provides a new regularization method through normalization of scalar features for each activation within a mini-batch and learning each mean and variance as parameters; normalizing, by the spatially-adaptive normalization layer, layer activations to zero mean (e.g. convolutional neural network receives inputs from a set of features of previous layers, by using convolutions, which operate over three-dimensional (3D) tensors, called feature maps, with two spatial axes (height and width) as well as a depth axis (the channels axis), to perform Batch normalization, which adaptively normalizes data (i.e. spatially-adaptive normalization) even as the mean and variance change over time during training, by performing normalization of scalar features for each activation within a mini-batch and learning each mean and variance as parameters in order to normalize the distribution of each input feature in each layer across each mini-batch to have zero mean and a standard deviation of one (i.e. normalizing layer activations to zero mean), as indicated above), for example).
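For illustration only, and not taken from GAO, a minimal numerical sketch of the zero-mean, unit-standard-deviation normalization described above is shown below; the function name batch_normalize and the array shapes are hypothetical.

```python
import numpy as np

def batch_normalize(activations: np.ndarray, eps: float = 1e-5) -> np.ndarray:
    """Normalize a mini-batch of feature maps (N, C, H, W) to zero mean and unit
    standard deviation per channel; all spatial locations of a channel share the
    same statistics, following the convolutional property."""
    mean = activations.mean(axis=(0, 2, 3), keepdims=True)  # per-channel batch mean
    var = activations.var(axis=(0, 2, 3), keepdims=True)    # per-channel batch variance
    return (activations - mean) / np.sqrt(var + eps)

# The normalized activations have (approximately) zero mean per channel:
x = 50.0 + 10.0 * np.random.rand(8, 3, 16, 16)
print(batch_normalize(x).mean(axis=(0, 2, 3)))  # ~[0. 0. 0.]
```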
Fu, Ioffe, Schwartz, Owechko, and GAO are considered to be analogous art because they pertain to image processing applications. Therefore, the combined teachings of Fu, Ioffe, Schwartz, Owechko, and GAO, as a whole, would have rendered obvious the invention recited in claim 12 with a reasonable expectation of success in order to modify the methods and systems for image generation through use of adversarial networks (as disclosed by Fu) with wherein the spatially-adaptive normalization layer is a conditional normalization layer and further comprising normalizing, by the spatially-adaptive normalization layer, layer activations to zero mean (as taught by GAO, Abstract, Par. [0104-120, 130, 164-171, 185]) to perform batch normalization for accelerating deep neural network training by making data standardization an integral part of the network architecture and to adaptively normalize data even as the mean and variance change over time during training (GAO, Abstract, Par. [0164]).
The combination of Fu, Ioffe, Schwartz, Owechko, and GAO, as a whole, teaches the method indicated above but fails to teach the following as further recited in claim 12.
However, SU teaches and de-normalizing the normalized layer activations to modulate activation using the one or more parameters of the one or more affine transformations (Par. [0028-29]: one or more neural network (NN) models, each adapted to approximate an image… The encoder selects a neural network model from the variety of NN models to determine an output image which approximates the second image based on the first image and the second image. Next, it determines at least some values of the parameters of the selected NN model according to an optimizing criterion, the first image, and the second image, wherein the parameters comprise node weights and/or node biases to be used with an activation function for at least some of the nodes in at least one layer of the selected NN model… For one or more color components of the encoded image, the image metadata may comprise: the number of neural-net layers in the NN, the number of neural nodes for at least one layer, and weights and offsets to be used with an activation function in some nodes of the at least one layer. After decoding the encoded image, the decoder generates an output image in the second dynamic range based on the encoded image and the parameters of the NN model; Par. [0050]: performance can be improved by renormalizing the input signals to the range [-1 1]… the neural network needs to include… a pre-scaling stage (normalization), where each channel in the input signal is scaled… a post-scaling stage (de-normalization), where each channel in the output signal… is scaled back to the original range; Par. [0138-139]: the image metadata comprise parameters for a neural network (NN) model to map the encoded image to an output image, wherein the image metadata comprise for one or more color components of the encoded image a number of neural-net layers in the NN, a number of neural nodes for each layer, and a weight and an offset to be used with an activation function of each node… parameters for a neural network (NN) model to map the encoded image to an output image, wherein the image metadata comprise for one or more color components of the encoded image a number of neural-net layers in the NN, a number of neural nodes for each layer, and a weight and an offset to be used with an activation function of each node; and… generating an output image based on the encoded image and the parameters of the NN model… wherein the image metadata further comprise scaling metadata, wherein for each color component of the encoded image the scaling metadata comprise a gain, a minimum, and a maximum value, and the method further comprises generating a de-normalizing output image based on the scaling metadata and the output image; and de-normalizing the normalized layer activations to modulate activation using the one or more parameters of the one or more affine transformations (e.g. image metadata comprising parameters (i.e. the one or more parameters) for a neural network (NN) model to map (i.e. transform, translate, etc.) an encoded image to an output image (i.e. generate an image using an affine transformation), including a number of neural-net layers in the NN, a number of neural nodes for each layer, and a weight and an offset to be used with an activation function of each node parameters for a neural network (NN) model to map the encoded image to an output image (i.e. 
layer activations to modulate activation using an affine transformation), including a pre-scaling stage (normalization), where each channel in the input signal is scaled, and a post-scaling stage (de-normalization) (i.e. de-normalizing the normalized layer activations), where each channel in the output signal is scaled back to the original range, including scaling metadata, wherein for each color (i.e. feature, attribute, etc.) component of the encoded image the scaling metadata comprise a gain, a minimum, and a maximum value, and generating a de-normalizing output image based on the scaling metadata and the output image (i.e. de-normalizing the normalized layer activations to modulate (i.e. control, instruct, force, etc.) activation using the one or more parameters of the one or more affine transformations), as indicated above), for example).
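For illustration only, and not taken from SU, the de-normalization described above may be sketched as follows; the identifiers denormalize, gamma, and beta are hypothetical, and the sketch simply shows normalized activations being modulated by per-channel scale and offset (affine) parameters.

```python
import numpy as np

def denormalize(normalized: np.ndarray, gamma: np.ndarray, beta: np.ndarray) -> np.ndarray:
    """Modulate normalized (zero-mean, unit-variance) activations with affine
    parameters: a per-channel scale (gamma) and offset (beta)."""
    return gamma * normalized + beta

# Hypothetical per-channel affine parameters, broadcast over (N, C, H, W):
gamma = np.array([1.5, 0.8, 2.0]).reshape(1, 3, 1, 1)
beta = np.array([0.1, -0.3, 0.0]).reshape(1, 3, 1, 1)
x_hat = np.random.randn(4, 3, 8, 8)        # already-normalized activations
out = denormalize(x_hat, gamma, beta)      # de-normalized (modulated) activations
```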
Fu, Ioffe, Schwartz, Owechko, GAO, and SU are considered to be analogous art because they pertain to image processing applications based on neural networks. Therefore, the combined teachings of Fu, Ioffe, Schwartz, Owechko, GAO, and SU, as a whole, would have rendered obvious the invention recited in claim 12 with a reasonable expectation of success in order to modify the methods and systems for image generation through use of adversarial networks (as disclosed by Fu) with de-normalizing the normalized layer activations to modulate activation using the one or more parameters of the one or more affine transformations (as taught by SU, Abstract, Par. [0028-29, 50, 138-139]) to derive image-mapping functions based on neural networks and to improve performance by renormalizing (i.e. de-normalizing) the input signals (SU, Par. [0050]).
Regarding claim 13, claim 12 is incorporated and the combination of Fu, Ioffe, Schwartz, Owechko, GAO, and SU, as a whole teaches the method (Fu, Par. [0003]), wherein the de-normalizing uses different normalization parameter values of the semantic layout map (SU, Par. [0028-29]: a neural network model from the variety of NN models to determine an output image which approximates the second image based on the first image and the second image. Next, it determines at least some values of the parameters of the selected NN model according to an optimizing criterion, the first image, and the second image, wherein the parameters comprise node weights and/or node biases to be used with an activation function for at least some of the nodes in at least one layer of the selected NN model… For one or more color components of the encoded image, the image metadata may comprise: the number of neural-net layers in the NN, the number of neural nodes for at least one layer, and weights and offsets to be used with an activation function in some nodes of the at least one layer. After decoding the encoded image, the decoder generates an output image in the second dynamic range based on the encoded image and the parameters of the NN model; Par. [0045-55]: input and output parameters of a NN may be expressed in terms of the mapping in equation… The goal is to find the parameters… in all (L+1) layers, to minimize the total minimum square error (MSE) for all P pixels… An L-layer neural-network based mapping can be represented using the following parameters, which can be communicated to a receiver as metadata… the normalization parameters for each input component (e.g., gain, min, and max); Par. [0094]: As described earlier, NNM metadata include the input normalization parameters and the neural-network parameters. These values are typically floating-point numbers in single or double precision; Par. [0138-139]: the image metadata comprise parameters for a neural network (NN) model to map the encoded image to an output image, wherein the image metadata comprise for one or more color components of the encoded image a number of neural-net layers in the NN, a number of neural nodes for each layer, and a weight and an offset to be used with an activation function of each node… generating an output image based on the encoded image and the parameters of the NN model… the image metadata further comprise scaling metadata, wherein for each color component of the encoded image the scaling metadata comprise a gain, a minimum, and a maximum value, and the method further comprises generating a de-normalizing output image based on the scaling metadata and the output image; wherein the de-normalizing uses different normalization parameter values of the semantic layout map (e.g. encoder selects a neural network model from a variety of NN models to determine an output image which approximates an encoded input image and determines values of the parameters of the selected NN model according to an optimizing criterion, including image metadata comprising parameters for a neural network (NN) model to map the encoded image to an output image, and generating an output image based on the encoded image and the parameters of the NN model, the image metadata further comprising scaling metadata, wherein for each color component of the encoded image the scaling metadata comprise a gain, a minimum, and a maximum value (i.e. different parameter values), and generating a de-normalizing output image based on the scaling metadata (i.e. 
the de-normalizing uses different normalization parameter values of the semantic layout map, as indicated above), for example).
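For illustration only, and not taken from SU, the use of different per-component normalization parameter values (gain, minimum, maximum) described above may be sketched as follows; the parameter names, values, and functions are hypothetical.

```python
import numpy as np

def normalize_channel(x: np.ndarray, gain: float, x_min: float, x_max: float) -> np.ndarray:
    """Pre-scaling (normalization) of one component with its own parameter values."""
    return gain * (x - x_min) / (x_max - x_min)

def denormalize_channel(y: np.ndarray, gain: float, x_min: float, x_max: float) -> np.ndarray:
    """Post-scaling (de-normalization) back to the original range with the same
    per-component parameters."""
    return y * (x_max - x_min) / gain + x_min

# Different components carry different (gain, min, max) values (hypothetical numbers):
params = {"luma": (2.0, 16.0, 235.0), "chroma": (2.0, 16.0, 240.0)}
x = np.array([16.0, 128.0, 235.0])
y = normalize_channel(x, *params["luma"])
x_back = denormalize_channel(y, *params["luma"])  # recovers the original samples
```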
The same motivation to combine the above-mentioned teachings applies, as previously indicated with respect to claim 12.
Regarding claim 20, claim 15 is incorporated; claim 20 is a corresponding apparatus claim and is rejected as applied to method claim 12 above.
Regarding claim 25, claim 21 is incorporated; claim 25 is a corresponding apparatus claim and is rejected as applied to method claim 12 above.
Contact Information
Any inquiry concerning this communication or earlier communications from the examiner should be directed to GUILLERMO M RIVERA-MARTINEZ whose telephone number is (571) 272-4979. The examiner can normally be reached from 9 am to 5 pm.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Andrew Bee, can be reached on 571-270-5183. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of an application may be obtained from the Patent Application Information Retrieval (PAIR) system. Status information for published applications may be obtained from either Private PAIR or Public PAIR. Status information for unpublished applications is available through Private PAIR only. For more information about the PAIR system, see https://ppair-my.uspto.gov/pair/PrivatePair. Should you have questions on access to the Private PAIR system, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative or access to the automated information system, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.
/GUILLERMO M RIVERA-MARTINEZ/ Primary Examiner, Art Unit 2677