DETAILED ACTION
This office action is in response to the amendment/argument filed 12/10/2025.
Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
Priority
The present application is a National Stage entry of PCT/US2021/039865 under 35 U.S.C. 371. Accordingly, the international filing date of 06/30/2021 (35 U.S.C. 363) is acknowledged as the effective filing date of the claimed invention.
Response to Arguments
Regarding the rejection of claim 1 under 35 U.S.C. 102, applicant’s arguments regarding the interpretation of the phrase “multiple synthetic images of an object” have been fully considered and are persuasive. Therefore, the rejection has been withdrawn. However, upon further consideration, a new ground(s) of rejection is made in view of Rowell et al. (US 20200342652 A1), which teaches the claimed limitations according to their proper interpretation as clarified by amendment.
Regarding the rejection of claim 11 under 35 U.S.C. 103, applicant’s arguments filed 12/10/2025 have been fully considered but they are not persuasive.
Applicant argues that one of ordinary skill in the art would have had no strong motivation to combine the two distinct embodiments of Ferroni which together teach the claimed invention: “a synthetic digital image derived from CAD-data of the mobile unit” (identified as “possibility (2)”), and “a synthetic digital image derived from a transformation operation of a real image of the mobile unit” (labeled as “possibility (3)”); specifically, to apply the transformation operation of possibility (3) to the CAD-derived synthetic images of possibility (2).
Applicant alleges that, because Ferroni states that possibility (3) alone could “generate a virtually limitless amount of training data”, one of ordinary skill in the art would not have sought to generate an even greater amount of training data. However, applying the transformation operation of possibility (3) to the CAD-derived synthetic images of possibility (2) could also increase the potential variation within the generated training data, which is typically the goal when assembling training data for object recognition: CAD-derived images can be rendered from any angle, whereas real images are limited to the original angle of the camera.
Applicant also alleges that combining possibilities (2) and (3) would not preserve the advantage of automatic label generation for CAD-derived images because, if the images are transformed, the automatically generated labels may need to be transformed manually. Applicant is correct that Ferroni [0068] teaches that certain transformations require the associated label to be transformed as well, but nowhere does Ferroni state that these label transformations must be performed manually. In fact, one of ordinary skill in the art would recognize that, for every transformation taught by Ferroni, either the associated label transformation could be performed automatically (translation, rotation, shearing, perspective warping, flipping upside down) or no label transformation would be required at all (lighting filtering, noising filtering, colour change, change in colour balance, change in contrast, change in lighting).
For instance, for a translation transformation, the movement of the label can simply mirror the movement of the image: [0068] “if the original image is translated by 100 pixels to the left, the label pixel map also needs to be translated by 100 pixels to the left.” Similarly, for a rotation transformation, the label can be rotated about the same axis by the same amount; for a perspective warping transformation, the shape of the label region can be distorted using the same algorithm applied to the image; and so on. None of these operations requires additional information that only a human could provide.
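For purposes of illustration only, the following Python sketch shows how such label transformations can be carried out programmatically alongside the corresponding image transformations. The function names and the SciPy-based implementation are the examiner's illustrative assumptions and are not drawn from Ferroni.

# Illustrative sketch only (hypothetical names; not Ferroni's code): the same
# geometric transformation is applied to the image and to its label pixel map,
# so no manual re-labelling is required.
import numpy as np
from scipy.ndimage import shift, rotate

def translate_pair(image, label_map, dx_pixels):
    # Translate image and label identically; dx_pixels = -100 moves both 100 pixels to the left.
    moved_image = shift(image, shift=(0, dx_pixels), order=1, mode="constant", cval=0)
    moved_label = shift(label_map, shift=(0, dx_pixels), order=0, mode="constant", cval=0)
    return moved_image, moved_label

def rotate_pair(image, label_map, angle_degrees):
    # Rotate both arrays about the same centre by the same angle; nearest-neighbour
    # interpolation (order=0) keeps the discrete label values intact.
    rotated_image = rotate(image, angle_degrees, reshape=False, order=1, mode="constant", cval=0)
    rotated_label = rotate(label_map, angle_degrees, reshape=False, order=0, mode="constant", cval=0)
    return rotated_image, rotated_label

img = np.random.rand(128, 128)            # stand-in synthetic image
lbl = np.zeros((128, 128), dtype=np.uint8)
lbl[40:80, 40:80] = 1                     # stand-in label pixel map for one object
img_t, lbl_t = translate_pair(img, lbl, dx_pixels=-100)
img_r, lbl_r = rotate_pair(img, lbl, angle_degrees=10)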
Therefore, the rejection of claim 11 under 35 U.S.C. 103 is maintained.
Claim Interpretation
Under MPEP 2143.03, "All words in a claim must be considered in judging the patentability of that claim against the prior art." In re Wilson, 424 F.2d 1382, 1385, 165 USPQ 494, 496 (CCPA 1970). As a general matter, the grammar and the ordinary meaning of the terms used in a claim, as understood by one having ordinary skill in the art, dictate whether, and to what extent, the language limits the claim scope. Language that suggests or makes a feature or step optional, but does not require that feature or step, does not limit the scope of a claim under the broadest reasonable claim interpretation. In addition, when a claim requires selection of an element from a list of alternatives, the prior art teaches the element if one of the alternatives is taught by the prior art. See, e.g., Fresenius USA, Inc. v. Baxter Int’l, Inc., 582 F.3d 1288, 1298, 92 USPQ2d 1163, 1171 (Fed. Cir. 2009).
Claims 3, 17, 19, and 20 each recite “at least one of” followed by a list of elements. Since “at least one of” is disjunctive, the prior art need only teach one of the listed elements to meet the limitation. Because, on balance, the disjunctive interpretation appears to enjoy the most support in the specification, that interpretation (i.e., one of A, B, or C) is adopted for purposes of this Office Action. Applicant’s comments and/or amendments addressing this issue are invited to clarify the claim language and the prosecution history.
Claim Rejections - 35 USC § 102
The following is a quotation of the appropriate paragraphs of 35 U.S.C. 102 that form the basis for the rejections under this section made in this Office action:
A person shall be entitled to a patent unless –
(a)(1) the claimed invention was patented, described in a printed publication, or in public use, on sale, or otherwise available to the public before the effective filing date of the claimed invention.
(a)(2) the claimed invention was described in a patent issued under section 151, or in an application for patent published or deemed published under section 122(b), in which the patent or application, as the case may be, names another inventor and was effectively filed before the effective filing date of the claimed invention.
Claim(s) 1-5 and 16-17 is/are rejected under 35 U.S.C. 102(a)(1) and (a)(2) as being anticipated by Rowell et al. (US 20200342652 A1, hereinafter "Rowell").
Regarding claim 1, Rowell discloses: A computing device, comprising: a processor; and a memory ([0039] “The image computing device 100 may further include, a communication unit 110, memory 112, a processor 113, a storage device 111, and a display 116.”) communicatively coupled to the processor and storing executable instructions ([0040] “The memory 112 may store the code, routines, and data necessary for the imaging system 125 to provide its functionality. The memory 112 is coupled to the interconnect 115 for communication with the other components.”) that when executed cause the processor to:
generate multiple synthetic images of an object ([0058] describes object selection: “After generating the background image and adding the background image to a background portion of the image scene, the image scene generator 124 populates the image scene by adding foreground objects to the foreground portion of the image scene.”;
fig. 3 shows and [0067] describes the generation of multiple images of the same scene from various angles: “Depending on the position of the virtual camera module 204 within the 3D scene 200, different points included in the 3D object will be projected onto a 2D surface. Therefore, the image projector 126 may generate multiple camera views for each image scene with each camera view including a unique perspective of the image scene. In one embodiment, the image projector 126 generates a plurality of camera views by modifying one or more camera coordinates 210 included in a camera setting file. Depending on the application of the synthetic images produced by the synthetic image generation module 102, the camera coordinates 210 may be consistently and/or incrementally varied for each image scene generated during production of a particular synthetic image dataset to ensure the synthetic image dataset includes synthetic images capturing a consistent set of camera views for every scene. Alternatively, the camera coordinates 210 may be modified randomly by the image projector 126 to introduce additional variation into a synthetic image dataset.”;
fig. 4 shows the generation of multiple images of the same scene with various image augmentations), wherein the object is based on defined object parameters ([0060] “The 3D models and other foreground objects selected by the 3D model selection routine may be specific to the intended CV application of the synthetic images produced by the synthetic image generation system 101. A 3D model arrangement routine executed by the image scene generator 124 to position the 3D models and other foreground objects in a foreground portion of the scene may also be specific to the application of the synthetic images. The 3D model arrangement routine may define the horizontal and vertical location and depth of each object (e.g., 3D models and other foreground objects) in a scene. In one embodiment, to produce image scenes for generating synthetic images to train machine learning systems for object detection in autonomous vehicles, the image scene generator 124 selects objects commonly found near and/or on roads (e.g., cars, trucks, buses, trees, buildings, humans, animals, traffic lights, traffic signs, etc.) and positions the objects to generate a scene resembling the point of view from a car driving on the road. One or more affine transformations may also be applied to one or more background and/or foreground objects in the scene to simulate motion.”; also see paragraphs [0111] to [0113] describing “scene metadata”) and respective ones of the multiple synthetic images are based on randomized visual parameters applied to the object (fig. 4; [0091] “In some embodiments, it may be desirable to modify synthetic images provided by the synthetic image generation module 102 with noise and other defects. To increase the amount of variation in a synthetic image dataset, the image augmentation engine 104 may add noise and distortion to synthetic images.”;
[0099] teaches image augmentation based on randomness: “In one embodiment, the image augmentation engine 104 specifies a range of augmentation values for each augmentation operation then determines the augmentation to apply to each raw synthetic image 401 by sampling values within the range. For example, the image augmentation engine 104 may provide a range for a transformation augmentation operation between −20% and 20% of the image width for x and y, a range for a rotation augmentation operation from −17° to 17°, a range for a scaling rotation augmentation operation between 0.9 and 2.0, a range for a contrast augmentation operation from −0.8 to 0.4, and a range for multiplicative color changes to each RGB channel from 0.5 to 2. Values may be randomly, incrementally, or otherwise sampled from these ranges.”);
generate annotations of the object in multiple synthetic images based on the defined object parameters and the randomized visual parameters ([0027] “The synthetic image generation system also includes a synthetic image indexing module that provides image metadata for each synthetic image. The image metadata comprises synthetic image characteristics including scene composition (i.e., background, foreground objects, object arrangement, textures, etc.), camera settings (e.g., camera intrinsics, camera extrinsics, camera calibration metadata, etc.), camera capture settings (e.g., zoom, focus, baseline, zero disparity plane depth, lighting conditions, etc.), and image augmentations. By referencing image metadata during the dataset generation, the synthetic image generation system avoids producing duplicate images and provides precise control of synthetic image characteristics.”; also see paragraphs [0029] and [0125]); and
train a machine-learning (ML) model for detecting the object using the multiple synthetic images and annotations ([0060] “In one embodiment, to produce image scenes for generating synthetic images to train machine learning systems for object detection in autonomous vehicles, the image scene generator 124 selects objects commonly found near and/or on roads (e.g., cars, trucks, buses, trees, buildings, humans, animals, traffic lights, traffic signs, etc.) and positions the objects to generate a scene resembling the point of view from a car driving on the road.”;
[0121] describes how the training data includes annotations: “Training datasets including collections of synthetic image data generated by the synthetic image generation system 100 are stored in a training database 905. The training database 905 may also include synthetic image data metadata describing characteristics of synthetic images and additional image data channels as well as scene metadata describing attributes of image scenes captured in synthetic images and additional image data channels.”).
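As a purely illustrative aside, the range-based random sampling of augmentation values described in Rowell [0099] and relied upon above can be sketched in Python as follows. The names, the specific dictionary structure, and the uniform-sampling choice are illustrative assumptions and do not purport to reproduce Rowell's implementation; the numeric ranges mirror those quoted from [0099].

# Minimal illustrative sketch of range-based random sampling of augmentation
# parameters (hypothetical names; not Rowell's implementation).
import random

AUGMENTATION_RANGES = {
    "translate_fraction": (-0.20, 0.20),  # fraction of image width/height, x and y
    "rotation_degrees":   (-17.0, 17.0),
    "scale_factor":       (0.9, 2.0),
    "contrast_delta":     (-0.8, 0.4),
    "rgb_multiplier":     (0.5, 2.0),     # per-channel multiplicative colour change
}

def sample_augmentation_parameters(ranges=AUGMENTATION_RANGES, seed=None):
    # Draw one value uniformly at random from each configured range.
    rng = random.Random(seed)
    return {name: rng.uniform(low, high) for name, (low, high) in ranges.items()}

params = sample_augmentation_parameters(seed=42)
# params now holds one randomly sampled value per augmentation operation.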
Regarding claim 2, Rowell discloses: The computing device of claim 1, wherein the object comprises a geometric shape ([0058] “In some embodiments, foreground objects are selected from a 3D model library including 36,000 or more 3D objects including chairs, tables, cars, animals, faces, etc. In other embodiments, foreground objects include light sources illuminating image scenes to provide light captured by virtual camera modules during generation of synthetic images. 3D model foreground objects may comprise a three dimensional mesh structure comprising points and/or a polygon mesh outlining a specific shape.”).
Regarding claim 3, Rowell discloses: The computing device of claim 1, wherein the randomized visual parameters comprise at least one of lighting, object pose, object orientation or background image ([0057] “Background images are added to a background portion of the image scene by the image scene generator 124…The image scene generator 124 may assemble background images by forming a large textured ground plane comprising a solid color, gradient color, or textured 2D image overlaid with a selection of static background objects having shapes randomly chosen from cubes, polyhedrons, cuboids, and cylinders. To add variation to the background image, the synthetic image generation module 102 may scale, rotate, texture, and/or deform one or more objects before placing the object(s) in the 2D image.”).
Regarding claim 4, Rowell discloses: The computing device of claim 1, wherein the multiple synthetic images comprise photorealistic images (Abstract “Embodiments of the invention assemble image scenes from virtual objects and capture realistic perspectives of images scenes as synthetic images.”; [0054] “The parameters included in camera settings files may be setup identical to an actual camera device in order to create camera views capturing a realistic, camera specific representation of an image scene perspective. By including parameters for camera intrinsics, camera calibration metadata, and camera capture settings in the camera settings file that are identical to actual camera device, the synthetic image generation module 102 may generate synthetic images simulating capture performance of the actual camera device under different capture conditions.”; [0063] “In other embodiments, the image scene generator 124 selects a texture file associated with an object in the scene to give the object a realistic appearance.”) generated by a 3D rendering engine ([0064] “The image projector 126 further processes image scenes from the image scene generator 124 to generate camera views rendered as synthetic images by the graphics rendering engine 128. In some embodiments, the synthetic image generation module 102 produces many synthetic images for each created scene. Each synthetic image captures the scene from a unique perspective by generating a camera view (i.e., a 2D projection) of a 3D scene.”).
Regarding claim 5, Rowell discloses: The computing device of claim 1, wherein the defined object parameters define an object type ([0111] “Scene metadata files may also define the number of objects to place in a scene and/or list specific objects files or object types to incorporate in a scene.”).
Regarding claim 16, Rowell discloses: The computing device of claim 1, wherein the defined object parameters include dimensions of the object ([0022] “The systems and methods of the invention also enable more precise scene construction wherein exact values for scene dimensions, object sizes, object depths, and other scene characteristics are known and customizable throughout scene construction.”).
Regarding claim 17, Rowell discloses: The computing device of claim 1, wherein the ML model includes at least one of a convolutional neural network (CNN), a recurrent neural network (RNN), or a variant thereof (Rowell [0134] “Machine learning systems communicatively coupled to—and/or included in—the machine learning service 105 include rules based classification algorithms, neural networks, and deep learning methods. More specifically, Naïve Bayes classification algorithms, decision tree classification algorithms, convolutional neural networks (CNNs), convolutional recurrent neural networks (CRNNs), hierarchical recurrent convolutional neural networks (HRNN), and HRNNs with attention vectors implemented in a machine learning framework (e.g., Keras, Scikitlearn, MXNet, or Tensorflow).”).
Claim Rejections - 35 USC § 103
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.
Claim(s) 6-7, 9-10, and 18-19 is/are rejected under 35 U.S.C. 103 as being unpatentable over Rowell et al. (US 20200342652 A1, hereinafter "Rowell") in view of Wrenninge et al. (US 20190156151 A1, hereinafter "Wrenninge").
Regarding claim 6, Rowell partially teaches: A non-transitory computer-readable storage medium comprising instructions executable by a processor ([0040] “A memory 112 may include a non-transitory memory that stores data for providing the functionality described herein… The memory 112 may store the code, routines, and data necessary for the imaging system 125 to provide its functionality.”) to:
generate multiple synthetic images of multiple objects ([0058] describes object selection: “After generating the background image and adding the background image to a background portion of the image scene, the image scene generator 124 populates the image scene by adding foreground objects to the foreground portion of the image scene.”;
fig. 3 shows and [0067] describes the generation of multiple images of the same scene from various angles: “Depending on the position of the virtual camera module 204 within the 3D scene 200, different points included in the 3D object will be projected onto a 2D surface. Therefore, the image projector 126 may generate multiple camera views for each image scene with each camera view including a unique perspective of the image scene. In one embodiment, the image projector 126 generates a plurality of camera views by modifying one or more camera coordinates 210 included in a camera setting file. Depending on the application of the synthetic images produced by the synthetic image generation module 102, the camera coordinates 210 may be consistently and/or incrementally varied for each image scene generated during production of a particular synthetic image dataset to ensure the synthetic image dataset includes synthetic images capturing a consistent set of camera views for every scene. Alternatively, the camera coordinates 210 may be modified randomly by the image projector 126 to introduce additional variation into a synthetic image dataset.”;
fig. 4 shows the generation of multiple images of the same scene with various image augmentations) including, for respective objects of the multiple objects:
generating the respective object based on shape types of the respective object ([0060] “The 3D models and other foreground objects selected by the 3D model selection routine may be specific to the intended CV application of the synthetic images produced by the synthetic image generation system 101… In one embodiment, to produce image scenes for generating synthetic images to train machine learning systems for object detection in autonomous vehicles, the image scene generator 124 selects objects commonly found near and/or on roads (e.g., cars, trucks, buses, trees, buildings, humans, animals, traffic lights, traffic signs, etc.) and positions the objects to generate a scene resembling the point of view from a car driving on the road.” - objects are selected by type based on their relevance to the scene), and
generating the multiple synthetic images of the respective object by applying randomized parameters (fig. 4; [0091] “In some embodiments, it may be desirable to modify synthetic images provided by the synthetic image generation module 102 with noise and other defects. To increase the amount of variation in a synthetic image dataset, the image augmentation engine 104 may add noise and distortion to synthetic images.”;
[0099] teaches image augmentation based on randomness: “In one embodiment, the image augmentation engine 104 specifies a range of augmentation values for each augmentation operation then determines the augmentation to apply to each raw synthetic image 401 by sampling values within the range. For example, the image augmentation engine 104 may provide a range for a transformation augmentation operation between −20% and 20% of the image width for x and y, a range for a rotation augmentation operation from −17° to 17°, a range for a scaling rotation augmentation operation between 0.9 and 2.0, a range for a contrast augmentation operation from −0.8 to 0.4, and a range for multiplicative color changes to each RGB channel from 0.5 to 2. Values may be randomly, incrementally, or otherwise sampled from these ranges.”);
generate annotations for the multiple objects in the multiple synthetic images based on the shape types and the randomized parameters ([0027] “The synthetic image generation system also includes a synthetic image indexing module that provides image metadata for each synthetic image. The image metadata comprises synthetic image characteristics including scene composition (i.e., background, foreground objects, object arrangement, textures, etc.), camera settings (e.g., camera intrinsics, camera extrinsics, camera calibration metadata, etc.), camera capture settings (e.g., zoom, focus, baseline, zero disparity plane depth, lighting conditions, etc.), and image augmentations. By referencing image metadata during the dataset generation, the synthetic image generation system avoids producing duplicate images and provides precise control of synthetic image characteristics.”; also see paragraphs [0029] and [0125]); and
train a machine-learning (ML) model for detecting the multiple objects using the multiple synthetic images and annotations ([0060] “In one embodiment, to produce image scenes for generating synthetic images to train machine learning systems for object detection in autonomous vehicles, the image scene generator 124 selects objects commonly found near and/or on roads (e.g., cars, trucks, buses, trees, buildings, humans, animals, traffic lights, traffic signs, etc.) and positions the objects to generate a scene resembling the point of view from a car driving on the road.”;
[0121] describes how the training data includes annotations: “Training datasets including collections of synthetic image data generated by the synthetic image generation system 100 are stored in a training database 905. The training database 905 may also include synthetic image data metadata describing characteristics of synthetic images and additional image data channels as well as scene metadata describing attributes of image scenes captured in synthetic images and additional image data channels.”).
Rowell teaches generating multiple images from a single scene with a single selection of objects; therefore it does not explicitly teach generating multiple unique scenes of multiple objects. Rowell also may not explicitly teach generating the multiple synthetic images of the respective object by applying randomized parameters to the respective object, since its randomized image augmentation system is applied to the generated image as a whole.
Wrenninge teaches a method of generating multiple synthetic images in which a new scene is generated for each image (fig. 1A block S500, [0072] “Block S500 can include repeating Blocks S100, S200, S300, and S400 to build up a synthetic dataset of synthetic images. Repetition of the aforementioned Blocks can be performed any suitable number of times, to produce a synthetic image dataset of any suitable size.”, where Block S100 performs a new iteration of object selection.).
Wrenninge also teaches applying randomized parameters to each individual object when generating a synthetic image (fig. 1A, Block S100; [0041]-[0042] teach that the value of each parameter used in procedural scene generation is determined by sampling a probability density function associated with the parameter, and [0045] teaches that the sampling of the probability density function may be randomized).
Rowell and Wrenninge are both analogous to the claimed invention because they are in the same field of synthetic image generation for training a machine learning model for computer vision. It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified the synthetic image generator of Rowell, which generates multiple varying images of a single scene with consistent objects, with the synthetic image generator of Wrenninge, which generates individual images of unique scenes with randomized object types and parameters, to generate multiple scenes with randomized object types and parameters and to generate multiple varying images of each scene. The motivation would have been to combine both approaches to generate the largest possible variety of training data in terms of both depth and breadth.
Regarding claim 7, the combination of Rowell in view of Wrenninge teaches: The non-transitory computer-readable storage medium of claim 6, wherein the instructions to generate the multiple synthetic images comprise instructions executable by the processor to:
add randomized out-of-focus effects of a camera lens to the synthetic images (Rowell [0158] “In one embodiment, camera blur 750 and lens flare 752 effects may be included in augmentation settings 550 for virtual cameras generating synthetic images under specific capture conditions. One camera blur effect 750 includes motion blur (i.e., a streaking effect of some captured objects) resulting from long exposure and/or rapid movement of objects in a scene portion captured in a synthetic image. In other embodiments, a Camera blur effect 750 may be a Gaussian blur simulating capture with poor focus and/or a post capture smoothing and/or noise reducing function included in an actual camera device.”; [0099] describes how parameters for image augmentation are generated by randomly sampling from given ranges); and
simulate motion blurring in the synthetic images, wherein each of the synthetic images to include different randomized motion blurring (Rowell [0091] “In some embodiments, it may be desirable to modify synthetic images provided by the synthetic image generation module 102 with noise and other defects. To increase the amount of variation in a synthetic image dataset, the image augmentation engine 104 may add noise and distortion to synthetic images. By adding other effects (e.g., lens flare, lens distortion, motion blur, and Gaussian blur), the image augmentation engine 104 may also simulate realistic capture errors that commonly occur under certain conditions with specific camera hardware.”, [0099] describes how parameters for image augmentation are generated by randomly sampling from given ranges).
Regarding claim 9, the combination of Rowell in view of Wrenninge teaches: The non-transitory computer-readable storage medium of claim 6, wherein the instructions to generate the multiple synthetic images comprise instructions executable by the processor to:
add noise to the synthetic images (Rowell [0091] “In some embodiments, it may be desirable to modify synthetic images provided by the synthetic image generation module 102 with noise and other defects. To increase the amount of variation in a synthetic image dataset, the image augmentation engine 104 may add noise and distortion to synthetic images.”), wherein each of the synthetic images to include a randomized type of noise (Wrenninge [0069] “Examples of parameters on which augmentation can be based (e.g., utilized to augment a synthetic image) include… simulated sensor or camera parameters (e.g., governing exposure level, sensor dynamic range, sensor black level, light response curve, static noise, temporal noise, shot noise, photon noise, color filter array/CFA arrangement, CFA filter characteristics, demosaicing, etc.), and any other suitable parameters related to varying or altering the synthetic image subsequent to rendering.”; [0041]-[0042] teach that the value of each parameter is determined by sampling a probability density function associated with the parameter, and [0045] teaches that the sampling of the probability density function may be randomized).
Rowell and Wrenninge are both analogous to the claimed invention because they are in the same field of synthetic image generation for training a machine learning model for computer vision. It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified the invention of Rowell in view of Wrenninge with the additional teachings of Wrenninge to randomly select a type of noise to be applied to a synthetic image. The motivation would have been to diversify the generated training dataset by increasing the variety of parameters and depicting a wider variety of image conditions.
Regarding claim 19, the combination of Rowell in view of Wrenninge teaches: The non-transitory computer-readable storage medium of claim 9, wherein the type of noise includes at least one of a Gaussian noise, a salt-and-pepper noise, a Poisson noise, or a speckle noise (Rowell [0096] “Other image degradations provided by the image augmentation engine 104 include Gaussian noise and Gaussian blur. In some embodiments, the image augmentation engine 104 creates noise augmentation images 406 and Gaussian blur images 408 by applying a Gaussian augmentation operation to a raw synthetic image 401. Gaussian augmentation operations distribute a defined amount of noise (e.g., occlusion or blurring) evenly throughout an entire image.”).
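For illustration only, the following Python sketch shows random selection among the noise types recited in claim 19 and application of the selected noise to a synthetic image. The function name and noise parameters are the examiner's assumptions for illustration and are not drawn from Rowell or Wrenninge.

# Illustrative sketch (hypothetical; not drawn from the cited references) of
# randomly selecting one of the noise types recited in claim 19 and applying it
# to a synthetic image represented as a float array with values in [0, 1].
import numpy as np

def apply_random_noise(image, rng=None):
    rng = rng if rng is not None else np.random.default_rng()
    noise_type = rng.choice(["gaussian", "salt_and_pepper", "poisson", "speckle"])
    if noise_type == "gaussian":
        noisy = image + rng.normal(0.0, 0.05, image.shape)
    elif noise_type == "salt_and_pepper":
        noisy = image.copy()
        mask = rng.random(image.shape)
        noisy[mask < 0.02] = 0.0   # pepper
        noisy[mask > 0.98] = 1.0   # salt
    elif noise_type == "poisson":
        counts = 255.0             # simulated photon count scale
        noisy = rng.poisson(image * counts) / counts
    else:                          # speckle: multiplicative noise
        noisy = image * (1.0 + rng.normal(0.0, 0.1, image.shape))
    return np.clip(noisy, 0.0, 1.0), noise_type

noisy_image, chosen_type = apply_random_noise(np.random.rand(64, 64))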
Regarding claim 10, the combination of Rowell in view of Wrenninge teaches: The non-transitory computer-readable storage medium of claim 6, wherein the instructions to generate the annotations comprise instructions executable by the processor to:
record the shape type for each of the multiple objects in the multiple synthetic images (Rowell [0125] “In some embodiments, the image indexing module 108 may index images on metadata fields encoded into the synthetic image at capture by the synthetic image generation module 102. The metadata fields include… image metadata (e.g., type and/or number objects included in an image, depth of background image, depth of a foreground object, maximum depth of a foreground object, number and type image augmentations performed on a parent synthetic image, additional image data rendered for a camera view captured in a parent synthetic image, etc.)”);
determine a bounding box for each of the multiple objects in each of the multiple synthetic images (Wrenninge [0066] “The output of Block S300 preferably includes a two dimensional synthetic image that realistically depicts a realistic 3D scene. The synthetic image defines a set of pixels, and each pixel is preferably labeled with the object depicted by the pixel (e.g., intrinsically labeled based on the parameters used to generate the object rendered in the image). In this manner, a “pixel-perfect” intrinsically annotated synthetic image can be created. In alternative variations, labelling can be performed on a basis other than a pixel-by-pixel basis; for example, labelling can include automatically generating a bounding box around objects depicted in the image, a bounding polygon of any other suitable shape, a centroid point, a silhouette or outline, a floating label, and any other suitable annotation, wherein the annotation includes label metadata such as the object class and/or other object metadata.”); and
generate a segmentation mask for each of the multiple objects in each of the multiple synthetic images (Wrenninge fig. 5 shows generated segmentation mask, [0066] “Semantic segmentation can be performed with any suitable level of granularity, and labeled or annotated pixels can be grouped by any category or subcategory of label or annotation (e.g., defined by the parameters used to procedurally define the scene in the image) in producing a segmentation of the image. For example, pixels can be semantically segmented by object class, object subclass, orientation, any other suitable geometric or other parameter as determined in accordance with one or more variations of Block S100, and/or any suitable combination of the aforementioned.”; Rowell [0082] also teaches a segmentation mask, but teaches layers separated by depth rather than separating each individual object).
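As an illustrative aside, a per-pixel object-ID label map of the kind described in Wrenninge [0066] can be reduced automatically to a bounding box and a binary segmentation mask for each object. The following Python sketch (hypothetical names, not Wrenninge's code) shows one such reduction.

# Minimal sketch reducing a per-pixel object-ID label map to a bounding box and
# a binary segmentation mask per object (hypothetical names; illustration only).
import numpy as np

def boxes_and_masks(label_map):
    # label_map: 2-D integer array in which each pixel holds an object ID (0 = background).
    annotations = {}
    for object_id in np.unique(label_map):
        if object_id == 0:
            continue
        mask = (label_map == object_id)          # per-object segmentation mask
        rows, cols = np.nonzero(mask)
        bbox = (int(cols.min()), int(rows.min()), int(cols.max()), int(rows.max()))  # x0, y0, x1, y1
        annotations[int(object_id)] = {"bbox": bbox, "mask": mask}
    return annotations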
Rowell and Wrenninge are both analogous to the claimed invention because they are in the same field of synthetic image generation for training a machine learning model for computer vision. It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified the invention of Rowell in view of Wrenninge with the additional teachings of Wrenninge to generate bounding boxes and segmentation masks for each object in the synthetic images. The motivation would have been to improve neural network training by providing detailed ground truth information for supervised learning.
Regarding claim 18, the combination of Rowell in view of Wrenninge teaches: The non-transitory computer-readable storage medium of claim 6, wherein the instructions are further executable by the processor to select a background image, wherein the multiple synthetic images of the multiple objects includes the background image (Rowell [0057] “Background images are added to a background portion of the image scene by the image scene generator 124.”; the remainder of the paragraph explains the process).
Claim(s) 8 is/are rejected under 35 U.S.C. 103 as being unpatentable over Rowell (US 20200342652 A1) in view of Wrenninge (US 20190156151 A1) as applied to claim 6 above, and further in view of Jacobs et al. (US 20230351560 A1, hereinafter "Jacobs").
Regarding claim 8, the combination of Rowell in view of Wrenninge teaches the non-transitory computer-readable storage medium of claim 6, but does not explicitly teach: wherein the instructions to generate the multiple synthetic images comprise instructions executable by the processor to:
simulate shadows cast on the multiple objects in the synthetic images, wherein each of the synthetic images to include randomized shadows cast on the multiple objects.
Jacobs teaches: wherein the instructions to generate the multiple synthetic images comprise instructions executable by the processor to:
simulate shadows cast on the multiple objects in the synthetic images ([0027] “The first well-lit image and the second shadowed image are then processed via a two-dimensional image-based data generation pipeline to create a synthetic shadowed image. The synthetic shadowed image could be used to train the machine learning model. A series of steps are taken within the data pipeline. Once a well-lit image and a shadowed image of the subject are produced, the images can be combined in the pipeline using a mask. In an example embodiment, the mask may control the transparency/opacity of an image with respect to a background image. In some examples, a plurality of masks could correspond to different shadow shapes and/or occluder shapes. In various embodiments, the shadowed image could be layered over the well-lit image and the mask can be applied to the shadowed image. Such an arrangement could form a synthetic shadow by retaining a shadowed image portion that corresponds to a shape of the mask. In such scenarios, the remainder of the shadowed image may appear transparent so that the well-lit image is observable. In such a manner, this process forms a synthetically shadowed image.”), wherein each of the synthetic images to include randomized shadows cast on the multiple objects ([0030] “In some embodiments, the shape of the mask may be the principal identifier that the machine learning model could be trained to recognize among a plurality of different shaped shadows. The shape of the mask could be used to approximate real world shapes. A shadow mask may be produced by using 3D models of occluders and facial geometry to project shadows based on features such as the contours of the face. A shadow mask may also be produced from hand drawn 2D masks, randomly synthetically generated 2D masks, or a combination of any of the above with localized color and sharpness variations to emulate real world phenomena such as subsurface scattering in skin, spatially varying lighting environments, or shadow foreshortening.”, [0063] “In some embodiments, the shadow shapes could additionally or alternatively be gathered from two-dimensional shapes 808. For example, the two-dimensional shapes 808 could be hand drawn shapes of shadows, they could be random shapes, they could be traced shapes, they could include outlines from photographs, or they could be produced from any other manner not mentioned in order to obtain a two dimensional shape of a shadow.”).
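For illustration only, the mask-based layering described in Jacobs [0027] can be sketched in Python as a simple alpha-style composite; the function name and array conventions are the examiner's assumptions and do not reproduce Jacobs' pipeline.

# Illustrative sketch only (hypothetical; not Jacobs' implementation) of the
# mask-based layering described in Jacobs [0027]: the shadowed image is blended
# over the well-lit image, and the mask controls where the shadow appears.
import numpy as np

def composite_synthetic_shadow(well_lit, shadowed, mask):
    # well_lit, shadowed: float arrays in [0, 1] of shape (H, W) or (H, W, 3);
    # mask: float array in [0, 1] of shape (H, W); 1 keeps the shadowed pixel.
    if well_lit.ndim == 3 and mask.ndim == 2:
        mask = mask[..., np.newaxis]             # broadcast mask across colour channels
    return mask * shadowed + (1.0 - mask) * well_lit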
Jacobs and the combination of Rowell in view of Wrenninge are both analogous to the claimed invention because they are in the same field of generating synthetic images to train a machine learning model (Jacobs [0025]-[0027]). Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the invention to have modified the invention of Rowell in view of Wrenninge with the teachings of Jacobs to add simulated shadows as another option for parametrized, randomizable image augmentation, applying them to the multiple objects found in the images taught by Rowell in view of Wrenninge. The motivation would have been to improve the training dataset by depicting a wider variety of image conditions.
Claim(s) 11, 13, and 14 is/are rejected under 35 U.S.C. 103 as being unpatentable over Ferroni (US 20200034626 A1).
Regarding claim 11, Ferroni teaches a method, comprising:
generating a simulated object from a number of simulated subcomponents ([0066] “Another way for gathering training data pairs may be by rendering of CAD assemblies: The components, which should be classified/detected/segmented are known objects or components on e.g. trains. Hence, 3D CAD models of such trains are available. As a result, “synthetic” images and labels can be generated by rendering the 3D assemblies.”);
generating multiple synthetic images of the simulated object based on randomized visual parameters ([0077] “Developing a suitable synthetic image generator, which takes the collected training data and manually labelled data as “seed”, and performs randomized transformations (like image translation, rotation, shearing, lighting filtering, noising filtering, perspective warping) to generate a virtually limitless amount of training data that is somehow realistic, improving the detection accuracy and improving resistance to suboptimal image conditions.”);
generating annotations for the multiple synthetic images ([0066] “From the 3D CAD model, it is obviously possible not just to create the input “image” but also the input “label” in a semi-automatic or fully automatic way.”) based on information from the number of simulated subcomponents and randomized visual parameters ([0068] “Depending on the type of label that is needed and the type of transformation that is done, it may or may not be necessary to also apply the same transformation to the label (only to the input digital image). For example, if a classification task is done where the label is simply “wheel”, “brake pad”, “spring” a change of the label is not needed if any of the transformations listed above is done. In other words, if the image is slightly rotated by +/−10 degrees, it would still have the same label. In other situations, such as in detection or segmentation, the change of the label may be needed depending on the transformation in the data augmentation procedure. In other words, if the original image is translated by 100 pixels to the left, the label pixel map also needs to be translated by 100 pixels to the left. However, if the colour balance is changed in the original image by increasing the strength of the green colours there is no need to apply the transformation to the label.”); and
training a machine-learning (ML) model to detect an observed object and subcomponents of the observed object ([0052] “Due to careful training using data collected and labelled (details see below), the computer algorithm is capable of 1. categorising 2. detecting and 3. segmenting objects or parts purely from one or more frames of images taken from the partition of the mobile unit, e.g. the bogie system, taken while the unit (train) is in motion.”; [0057] “The present invention also refers to a method for training the method for machine learning of the recognition method.”) in images captured by a camera (Ferroni [0023] “The digital image may originate from any source feasible for a person skilled in the art, like a camera, a high resolution camera, an ultra sound device or an IR device.”; where [0010]-[0014] describe “the digital image” as the input for the object recognition functionality) using the multiple synthetic images and annotations ([0061] describes pairs of images and annotations used as training data, [0066] explains that the images may be generated synthetically based on 3D models).
Ferroni does not explicitly teach generating a simulated object from a number of simulated subcomponents and generating multiple synthetic images of the simulated object based on randomized visual parameters in the same embodiment of the invention; [0064] indicates that the visual transformations described in [0077] are to be applied to real captured training images, rather than the 3D model-derived synthetic training images described in [0066]. However, it would have been obvious to one of ordinary skill in the art before the effective filing date of the invention to have combined both of these embodiments to apply the visual transformations described in [0077] to the synthetic images described in [0066]. The motivation would have been to increase the possible variation in the training data and to preserve the advantages of both embodiments of training data generation: automatic label generation for the 3D model-derived synthetic images ([0066]), and the inclusion of suboptimal image conditions for the transformed images ([0077]).
Regarding claim 13, Ferroni teaches the method of claim 11, wherein generating the annotations for the multiple synthetic images comprises identifying a subcomponent of the simulated object based on a part type ([0061] describes the annotation of training data: “An example of “supervised” learning would be where the input example is complemented with a “ground-truth” label. For example, if the task is to classify an image into one of three different categories like “wheel”, “brake pad”, and “spring” a “database” or training set would be made out of the pairs: image of a wheel—example X1-label Y1 “wheel”, image of a brake pad—example X2-label Y2 “brake pad” and image of a spring—example Xn-label Yn “spring””; paragraphs [0019], [0045], [0076], and [0100] indicate that the objects the system is being trained to recognize (wheel, brake pad, spring, etc.) are subcomponents of a bogie (train wheel assembly)).
Regarding claim 14, Ferroni teaches the method of claim 11, further comprising running the ML model to detect the observed object and subcomponents in an image ([0071] “The present invention also refers to computer program comprising instructions which, when the program is executed by a computer, cause the computer to carry out the steps of the recognition method.”) captured by a camera ([0023] “The digital image may originate from any source feasible for a person skilled in the art, like a camera, a high resolution camera, an ultra sound device or an IR device.”, where [0010]-[0014] describe “the digital image” as the input for the object recognition functionality).
Claim(s) 12 is/are rejected under 35 U.S.C. 103 as being unpatentable over Ferroni (US 20200034626 A1) in view of Lundeen et al. (US 20220374834 A1, hereinafter "Lundeen").
Regarding claim 12, Ferroni teaches the method of claim 11; in Ferroni, the simulated object comprises a train bogie and the simulated subcomponents comprise components of the train bogie.
Ferroni does not explicitly teach: wherein the simulated object comprises a shipping package and the simulated subcomponents comprise components of the shipping package.
Lundeen teaches: wherein the simulated object comprises a shipping package ([0062] explains the idea of using machine learning to identify shipping packages and generate 3D bounding boxes for the packages. [0063]-[0064] describe one approach to doing this, which involves generating photorealistic simulated images of scenes including 3D models of shipping packages, in which the shipping boxes are annotated with their ground truth 3D bounding box data). Lundeen also teaches that shipping packages may have subcomponents including a label and/or one or more fiducial markers such as AprilTags or a QR code (fig. 2B, [0055]).
Lundeen and Ferroni are both analogous to the claimed invention because they are in the same field of generating training data to train a neural network in object recognition. Though Lundeen does not explicitly teach that the shipping package subcomponents are depicted in the simulated 3D model, it would have been obvious to one of ordinary skill in the art to modify the invention of Ferroni, in which simulated objects are generated and annotated based on their constituent subcomponents, with the teachings of Lundeen to apply the invention specifically toward shipping packages, including modeling their separate subcomponents. The motivation would have been to be able to use the invention to inspect packages in a similar manner as the invention of Ferroni inspects train parts, including helping to track and sort packages for delivery, as taught by Lundeen (“Background” section, [0001]-[0003]).
Claim(s) 15 and 20 is/are rejected under 35 U.S.C. 103 as being unpatentable over Ferroni (US 20200034626 A1) in view of Holzer et al. (US 20200234488 A1, hereinafter "Holzer").
Regarding claim 15, Ferroni teaches the method of claim 14, but does not explicitly teach: further comprising detecting a defect in the observed object based on the detected subcomponents.
Holzer teaches: further comprising detecting a defect in the observed object based on the detected subcomponents (fig. 4 presents an overview of a method for detecting damage in individual components of an object using object recognition, concept is summarized in paragraph [0007], details are explained in paragraphs [0079] to [0093]).
Holzer and Ferroni are both analogous to the claimed invention because they are in the same field of using a neural network to identify object components. Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the invention to have modified the invention of Ferroni with the teachings of Holzer to add the ability to detect damage or flaws in the identified object components. The motivation would have been to apply the invention toward automating vehicle inspections, replacing manual inspections which can be time-consuming and produce inconsistent results (Holzer [0006]), or automating inspections of other types of objects (Holzer [0108]).
Regarding claim 20, the combination of Ferroni in view of Holzer teaches: The method of claim 15, wherein the defect includes at least one of a no-pattern-found defect, a side glue defect, a wrong color defect, a dented surface defect, a large gap defect, a label alignment defect, or a side skew defect (Holzer teaches the detection of a dented surface defect: [0114] “The damage information may identify the damage type and/or severity. For example, the damage information may identify damage as being light, moderate, or severe. As another example, the damage information may identify the damage as a dent or a scratch.”).
Holzer and Ferroni are both analogous to the claimed invention because they are in the same field of using a neural network to identify object components. Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the invention to have modified the invention of Ferroni with the teachings of Holzer to add the ability to detect dents in the identified object components. The motivation would have been to apply the invention toward automating vehicle inspections, replacing manual inspections which can be time-consuming and produce inconsistent results (Holzer [0006]), or automating inspections of other types of objects (Holzer [0108]).
References Cited
The prior art made of record and not relied upon is considered pertinent to applicant's disclosure.
Avidan et al. (US 20200342242 A1) teaches a system which generates synthetic images of road scenes based on defined and randomized parameters, automatically annotates the images, and uses the synthetic images to train a machine learning model. The teachings of this reference are very similar to those of Wrenninge and cover many of the same limitations, but it is more explicit about the user’s ability to directly modify certain image generation parameters (fig. 4).
Liu et al. (US 20190012802 A1) teaches the generation of synthetic training images based on combinations of components identified in other images. These images are used to train a neural network capable of identifying an object based on only a subset of its components, while other components may be obscured.
Conclusion
Applicant's amendment necessitated the new ground(s) of rejection presented in this Office action. Accordingly, THIS ACTION IS MADE FINAL. See MPEP § 706.07(a). Applicant is reminded of the extension of time policy as set forth in 37 CFR 1.136(a).
A shortened statutory period for reply to this final action is set to expire THREE MONTHS from the mailing date of this action. In the event a first reply is filed within TWO MONTHS of the mailing date of this final action and the advisory action is not mailed until after the end of the THREE-MONTH shortened statutory period, then the shortened statutory period will expire on the date the advisory action is mailed, and any nonprovisional extension fee (37 CFR 1.17(a)) pursuant to 37 CFR 1.136(a) will be calculated from the mailing date of the advisory action. In no event, however, will the statutory period for reply expire later than SIX MONTHS from the mailing date of this final action.
Any inquiry concerning this communication or earlier communications from the examiner should be directed to BENJAMIN STATZ whose telephone number is (571)272-6654. The examiner can normally be reached Mon-Fri 8am-5pm.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Tammy Goddard, can be reached at (571)272-7773. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.
/BENJAMIN TOM STATZ/Examiner, Art Unit 2611
/TAMMY PAIGE GODDARD/Supervisory Patent Examiner, Art Unit 2611