DETAILED ACTION
Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis (i.e., changing from AIA to pre-AIA) for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.
The text of those sections of Title 35, U.S. Code not included in this action can be found in a prior Office action.
Response to Amendment
The amendment filed on 24 October 2025 has been entered.
The amendment of claims 1, 3, 8-10, 13, 15, and 20, the cancellation of claims 4 and 16, and the addition of claims 21 and 22 are acknowledged.
Response to Arguments
Applicant's arguments filed on 24 October 2025, with respect to the amended claims, have been fully considered but are not persuasive.
Applicant’s Representative submits that the prior art does not teach the amended limitations (e.g., the 2D image data and the geometric relationship within the 3D space).
The examiner respectfully disagrees. The prior art of record teaches that the input images used are from ImageNet, which is a 2D image database. The prior art further teaches that real-world coordinates are determined (Cioppa pg. 8, right column).
Regarding the geometric relationship, the secondary reference teaches using the geometric relationship within the 3D space (see Le Floch ¶¶ 0167, 0175, 0181, and 0265).
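For illustration of the 2D-to-3D mapping at issue, the following is a minimal sketch (not drawn from the cited references; all names and values are hypothetical) of projecting a 2D image point to real-world coordinates on a planar field through a homography, in the manner of the player-localization step Cioppa describes:

```python
import numpy as np

def image_to_field(H: np.ndarray, point_2d: tuple) -> np.ndarray:
    """Map a 2D image point to real-world field coordinates.

    H is an assumed 3x3 homography from the image plane to the field,
    e.g., as estimated by a calibration network. The field is taken to
    be planar (z = 0), so a single homography suffices.
    """
    u, v = point_2d
    p = H @ np.array([u, v, 1.0])   # homogeneous projection
    return p[:2] / p[2]             # dehomogenize to (x, y) on the field

# Example: the middle of the bottom edge of a player's bounding box
# (the point Cioppa projects) maps to the player's field position.
H = np.eye(3)  # placeholder homography for illustration only
print(image_to_field(H, (640.0, 360.0)))
```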
Applicant’s arguments with respect to the pending claims have been fully considered but are moot because they rely on newly added and/or amended claim limitations (e.g., in new claims 21 and 22). The examiner has revised the rejections to address the new claim limitations.
Claim Rejections - 35 USC § 112
The following is a quotation of the first paragraph of 35 U.S.C. 112(a):
(a) IN GENERAL.—The specification shall contain a written description of the invention, and of the manner and process of making and using it, in such full, clear, concise, and exact terms as to enable any person skilled in the art to which it pertains, or with which it is most nearly connected, to make and use the same, and shall set forth the best mode contemplated by the inventor or joint inventor of carrying out the invention.
The following is a quotation of the first paragraph of pre-AIA 35 U.S.C. 112:
The specification shall contain a written description of the invention, and of the manner and process of making and using it, in such full, clear, concise, and exact terms as to enable any person skilled in the art to which it pertains, or with which it is most nearly connected, to make and use the same, and shall set forth the best mode contemplated by the inventor of carrying out his invention.
Claims 21-22 are rejected under 35 U.S.C. 112(a) or 35 U.S.C. 112 (pre-AIA), first paragraph, as failing to comply with the written description requirement. The claims contain subject matter which was not described in the specification in such a way as to reasonably convey to one skilled in the relevant art that the inventor or a joint inventor, or for applications subject to pre-AIA 35 U.S.C. 112, the inventor(s), at the time the application was filed, had possession of the claimed invention.
Claims 21-22 recite that the system does not require a pretrained template, offline dictionary, or prior camera-specific training data.
However, the original disclosure does not describe performing the claimed calibration without a pretrained template, an offline dictionary, or prior camera-specific training data.
On the contrary, [0024] of the specification states “aspects of the present application enable automated camera calibration with a reduced amount of prior calibration or known information about the image capture device itself” (emphasis added), corresponding to the opposite feature (i.e., using the prior calibration information, albeit a reduced amount).
Claim Rejections - 35 USC § 103
Claims 1-3, 5-15, and 17-20 are rejected under 35 U.S.C. 103 as being unpatentable over Cioppa et al. (“Camera Calibration and Player Localization in SoccerNet-v2 and Investigation of their Representations for Action Spotting,” 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops), in view of Le Floch (US 2019/0266780 A1), hereinafter referred to as Cioppa and Le Floch, respectively.
Regarding claim 1, Cioppa teaches a system comprising:
a set of operations comprising:
obtaining image data captured by an image capture device of a three-dimensional (3D) scene (Cioppa Fig. 2: Broadcast videos → ResNet-152 or I3D or C3D);
processing the image data to identify features within the image data, wherein each feature of the identified features corresponds to a two-dimensional region of the image data (Cioppa Fig. 2: 512 features/frame – the captured scene is 3D, e.g., soccer field, and the image data is 2D, e.g., video image; Cioppa pg. 5 left column: “4.2 Feature vector representation … ResNet-34 (RN). This network has 2.8 million parameters and achieves 73.27% top-1 accuracy on ImageNet. We use a frozen ResNet-34 [20] and collect the feature vectors of dimension 512 in its penultimate layer” – ImageNet is a 2D image database; also see Cioppa pg. 4 right column: “we provide image representations of the player localization information”);
extracting a set of reference points based on the identified features, wherein each reference point of the set of reference points corresponds to a three-dimensional coordinate in space (Cioppa Fig. 2 & pg. 4 right column discussed above; Cioppa pg. 5 right column: “The World Cup 2014 dataset [21] stands as reference for evaluating soccer camera calibration methods. The test set comprises 186 calibrated images taken from 10 different games in various stadiums, perspectives, lighting conditions, and moments of the day”; Cioppa pg. 7 left column: “We process our top view with our own 3D convolutional network (3D)”; Cioppa pg. 8 right column: “We use our calibration and a player detection algorithm to obtain the player localization in real-world coordinates”);
generating, based on the set of reference points, a plurality of candidate sets of camera parameters (Cioppa pg. 4 left column: “Our training dataset is 60x larger than the World Cup 2014 dataset [21] used in [38] and contains a larger variety of camera viewpoints, making our student calibration network a valuable candidate for universal camera calibration in the context of soccer”); and
filtering the candidate sets of camera parameters to generate filtered candidate sets (Cioppa pg. 4 left column: “A binary score about the relevance of the calibration, set to 1 for frames with a plausible estimated calibration, is also computed by our student. This allows to discard cameras views that are not recorded by the main camera, such as close-up views, or public views”); and
processing the filtered candidate sets to generate a final set of camera parameters for the image capture device (Cioppa Fig. 2: Camera calibration CCBV-SN → calibrated data representation; Cioppa pg. 4 left column: “We release those estimates along our trained calibration network, which can be used with a wide variety of soccer videos. We denote CCBV-SN our student trained on SoccerNet”; Cioppa pg. 4 right column: “We use the homography computed by CCBV-SN to estimate the player localization on the field in real-world coordinates from the middle point of the bottom of their bounding box”).
However, Cioppa does not appear to explicitly teach at least one processor; memory storing instructions that, when executed by the at least one processor, cause the system to perform the set of operations described above; and using a geometric relationship within the 3D scene.
Pertaining to the same field of endeavor, Le Floch teaches at least one processor; and memory storing instructions that, when executed by the at least one processor, cause the system to perform a set of operations (Le Floch ¶ 0119: “a non-transitory computer-readable medium storing a program which, when executed by a microprocessor or computer system in a device, causes the device to perform any method as defined above according to the first to third approaches”); and
using a geometric relationship within the 3D scene to determine the reference points (Le Floch ¶ 0167: “The position and orientation of the scene volume V captured are known in the 3D coordinates system SYS (for instance the 3D shape is known, typically a cuboid or cube, and the 3D locations of four of its vertices are known)”; Le Floch ¶ 0175: “This matching is a pure geometrical issue based on known intrinsic and extrinsic parameters”; Le Floch ¶ 0181: “In case a volume V of the captured scene is delimited, its position and orientation are known in the 3D coordinates system SYS (for instance the 3D shape is known, typically a cuboid or cube, and the 3D locations of four of its vertices are known)”; Le Floch ¶ 0265: “This is a geometrical issue which depends only on the extrinsic and intrinsic parameters of each source camera 12i, given the elementary voxels considered”).
Cioppa and Le Floch are considered to be analogous art because they are directed to image processing for detecting objects. It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified the camera calibration and player localization for action spotting (as taught by Cioppa) to use a processor and memory and geometric constraints (as taught by Le Floch) because the combination allows the algorithm to be stored in memory and executed automatically by a processor rather than carried out manually each time, and because the combination works efficiently on complex scenes such as sporting events with multiple players in a stadium, with the ability to detect a large number of interoperating 3D objects such as multiple players (Le Floch Abstract).
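To situate the generate-and-filter limitations mapped above, the following sketch (illustrative only; not code from Cioppa or Le Floch, and the threshold and sampling scheme are assumptions) hypothesizes candidate homographies from minimal four-point samples of the reference points and filters them by reprojection error on the held-out points:

```python
import numpy as np
import cv2

rng = np.random.default_rng(0)

def generate_and_filter(img_pts, field_pts, n_candidates=100, thresh=2.0):
    """Generate candidate parameter sets (homographies) from sampled
    reference points, then filter by held-out reprojection error.

    img_pts and field_pts are assumed (n, 2) arrays of corresponding
    2D image points and ground-plane world points.
    """
    n = len(img_pts)
    survivors = []
    for _ in range(n_candidates):
        idx = rng.choice(n, size=4, replace=False)          # minimal sample
        try:
            H = cv2.getPerspectiveTransform(
                img_pts[idx].astype(np.float32),
                field_pts[idx].astype(np.float32))
        except cv2.error:                                   # degenerate sample
            continue
        test = np.setdiff1d(np.arange(n), idx)              # held-out points
        proj = cv2.perspectiveTransform(
            img_pts[test].reshape(-1, 1, 2).astype(np.float32), H)
        err = np.linalg.norm(proj.reshape(-1, 2) - field_pts[test], axis=1)
        if err.mean() < thresh:                             # filtering step
            survivors.append((H, float(err.mean())))
    # Process surviving candidates into a final set (here: lowest error).
    return min(survivors, key=lambda s: s[1])[0] if survivors else None
```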
Regarding claim 2, Cioppa, in view of Le Floch, teaches the system of claim 1, wherein processing the image data to identify features comprises:
identifying the features using a machine learning model, wherein the features are associated with a feature class (Cioppa Abstract: “neural network architecture … SoccerNet … distilled network”; Cioppa Fig. 2);
encoding the identified features in an intermediate image (Cioppa Fig. 2; Cioppa pg. 5 left column: “as universal networks trained on e.g., ImageNet have an excellent transfer capability to encode meaningful visual information about any given image”; Cioppa pg. 5 right column: “encoding per-frame players information in a graph”); and
processing the intermediate image using a set of geometric constraints associated with the feature class to extract the set of reference points (Le Floch ¶ 0167: “The position and orientation of the scene volume V captured are known in the 3D coordinates system SYS (for instance the 3D shape is known, typically a cuboid or cube, and the 3D locations of four of its vertices are known)”; Le Floch ¶ 0175: “This matching is a pure geometrical issue based on known intrinsic and extrinsic parameters”; Le Floch ¶ 0181: “In case a volume V of the captured scene is delimited, its position and orientation are known in the 3D coordinates system SYS (for instance the 3D shape is known, typically a cuboid or cube, and the 3D locations of four of its vertices are known)”; Le Floch ¶ 0265: “This is a geometrical issue which depends only on the extrinsic and intrinsic parameters of each source camera 12i, given the elementary voxels considered”).
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified the camera calibration and player localization for action spotting (as taught by Cioppa) to use a geometric constraint (as taught by Le Floch) because the combination works efficiently on complex scenes such as sporting events with multiple players in a stadium, with the ability to detect a large number of interoperating 3D objects such as multiple players (Le Floch Abstract).
Regarding claim 3, Cioppa, in view of Le Floch, teaches the system of claim 2, wherein:
the features comprise a first set of features associated with a first feature class (Cioppa pg. 5 right column: “Each player is represented with a node, whose features are defined by their associated RGB color, their position in real-world coordinates, and the area of the detected bounding box in the image frame”); and
the set of operations further comprises:
identifying a second set of features using the machine learning model, wherein the second set of features is associated with a second feature class and is encoded in the intermediate image using a different color or channel than a color or channel used to encode the first set of features (Cioppa Fig. 3: shows different feature classes, e.g., players, field, and borders, in different colors; Cioppa Table 3: shows different data representations, feature vectors, and networks for the detection results; Cioppa pg. 6 left column: “consists in 110,458 action timestamps spread over 17 classes within the 500 complete games of the SoccerNet [17] dataset”).
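As a concrete, purely illustrative reading of the per-class encoding limitation of claims 2-3 (the class names and channel convention below are assumptions, not from the cited art), each identified feature class can be written into its own channel of an intermediate image:

```python
import numpy as np

def encode_intermediate_image(masks: dict, shape: tuple) -> np.ndarray:
    """Encode each feature class in its own channel of an intermediate
    image, so downstream geometric processing can tell classes apart.

    masks maps a class name to a boolean (H, W) array, e.g., output by
    a segmentation model; the sorted channel order is an assumption.
    """
    classes = sorted(masks)                       # fixed channel order
    out = np.zeros((*shape, len(classes)), dtype=np.uint8)
    for ch, name in enumerate(classes):
        out[..., ch][masks[name]] = 255           # one class per channel
    return out

h, w = 72, 128
masks = {"players": np.zeros((h, w), bool), "field": np.ones((h, w), bool)}
inter = encode_intermediate_image(masks, (h, w))  # shape (72, 128, 2)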
Regarding claim 5, Cioppa, in view of Le Floch, teaches the system of claim 1, wherein the features are identified using a convolutional neural network trained to perform semantic segmentation (Cioppa pg. 5 left column: “they can be easily processed with convolutional neural networks in deep learning pipelines”; Cioppa pg. 5 right column: “The graph convolutional network (see next section) that processes the player graph aggregates features from neighboring players”).
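For context on CNN-based semantic segmentation generally, a generic sketch follows; the specific model is an assumption for illustration and is not the network used in the cited references:

```python
import torch
from torchvision.models.segmentation import deeplabv3_resnet50

# A generic pretrained segmentation CNN stands in for whichever model an
# implementation might use; weights and class set here are illustrative.
model = deeplabv3_resnet50(weights="DEFAULT").eval()

image = torch.rand(1, 3, 360, 640)        # placeholder normalized frame
with torch.no_grad():
    logits = model(image)["out"]          # (1, num_classes, H, W)
labels = logits.argmax(dim=1)             # per-pixel class ids
```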
Regarding claim 6, Cioppa, in view of Le Floch, teaches the system of claim 1, wherein each candidate set of the plurality of candidate sets is generated based on a subset of reference points sampled from the set of reference points (Cioppa pg. 5 right column: “The World Cup 2014 dataset [21] stands as reference for evaluating soccer camera calibration methods. The test set comprises 186 calibrated images taken from 10 different games in various stadiums, perspectives, lighting conditions, and moments of the day … for each test image, we compute the entire intersection over union (IoU entire) between the top view projections of the field model by the ground-truth camera and by the estimated camera”; Cioppa pg. 6 left column: “consists in 110,458 action timestamps spread over 17 classes within the 500 complete games of the SoccerNet [17] dataset, of which 22,551 actions are related to the 100 test games”).
Regarding claim 7, Cioppa, in view of Le Floch, teaches the system of claim 1, wherein filtering the candidate sets comprises at least one of:
evaluating a candidate set of camera parameters of the candidate sets using a test reference point that is local to a subset of reference points used to generate the candidate set; or evaluating the candidate set of camera parameters using a plurality of test reference points that are different from the subset of reference points used to generate the candidate set of camera parameters (Cioppa pg. 5 right column & Cioppa pg. 6 left column discussed above).
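The two recited filtering alternatives can be pictured as follows (an illustrative sketch; modeling "local" as simple image-distance proximity with an assumed radius is the examiner's hypothetical, not taken from the cited art):

```python
import numpy as np

def reproj_error(H, img_pt, field_pt):
    """Reprojection error of one 2D-to-ground-plane correspondence."""
    p = H @ np.array([*img_pt, 1.0])
    return float(np.linalg.norm(p[:2] / p[2] - np.asarray(field_pt)))

def eval_with_local_point(H, sample_img_pts, test_img, test_field, radius=50.0):
    """Alternative 1: score with a test point near the generating subset."""
    dists = [np.linalg.norm(np.asarray(test_img) - np.asarray(p))
             for p in sample_img_pts]
    if min(dists) > radius:            # not local to the sample; skip
        return None
    return reproj_error(H, test_img, test_field)

def eval_with_heldout_points(H, heldout):
    """Alternative 2: average error over points disjoint from the subset."""
    return float(np.mean([reproj_error(H, i, f) for i, f in heldout]))
```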
Regarding claim 8, Cioppa, in view of Le Floch, teaches the system of claim 1, wherein the set of operations further comprises at least one of:
identifying, based on the final set of camera parameters, an object of the 3D scene (Cioppa Fig. 3, Tables 2-3);
generating, based on the final set of camera parameters, movement information for the object (Cioppa Table 2: various action classes indicating movement information are determined); or
providing an indication of the final set of camera parameters to a computing device (Cioppa Fig. 3: successful camera calibrations are indicated).
Regarding claim 9, Cioppa teaches a method for automated calibration of an image capture device, comprising:
obtaining image data captured by an image capture device of a three-dimensional (3D) scene (Cioppa Fig. 2: Broadcast videos → ResNet-152 or I3D or C3D);
processing the image data using a machine learning model to identify features within the image data, wherein the features are associated with one or more feature classes defined by a set of rules, and wherein each feature of the identified features corresponds to a two-dimensional region of the image data (Cioppa Fig. 2: 512 features/frame – the captured scene is 3D, e.g., soccer field, and the image data is 2D, e.g., video image; Cioppa pg. 5 left column: “4.2 Feature vector representation … ResNet-34 (RN). This network has 2.8 million parameters and achieves 73.27% top-1 accuracy on ImageNet. We use a frozen ResNet-34 [20] and collect the feature vectors of dimension 512 in its penultimate layer” – ImageNet is a 2D image database; Cioppa pg. 5 right column: “The World Cup 2014 dataset [21] stands as reference for evaluating soccer camera calibration methods. The test set comprises 186 calibrated images taken from 10 different games in various stadiums, perspectives, lighting conditions, and moments of the day”; also see Cioppa pg. 4 right column: “we provide image representations of the player localization information”); and
generating, based on the set of reference points, a set of camera parameters for the image capture device (Cioppa Fig. 2: Camera calibration CCBV-SN → calibrated data representation; Cioppa pg. 4 left column: “We release those estimates along our trained calibration network, which can be used with a wide variety of soccer videos. We denote CCBV-SN our student trained on SoccerNet”; Cioppa pg. 4 right column: “We use the homography computed by CCBV-SN to estimate the player localization on the field in real-world coordinates from the middle point of the bottom of their bounding box”).
Cioppa further teaches that it was known to extract a set of reference points based on the identified features, wherein each reference point of the set of reference points corresponds to a three-dimensional coordinate in space (Cioppa Fig. 2, pg. 4 right column, pg. 5 right column, pg. 7 left column, & pg. 8 right column discussed above).
However, Cioppa does not appear to explicitly teach extracting a set of reference points based on one or more geometric constraints associated with the identified features, wherein the one or more geometric constraints are defined by the set of rules.
Pertaining to the same field of endeavor, Le Floch teaches extracting a set of reference points based on one or more geometric constraints associated with the identified features, wherein the one or more geometric constraints are defined by the set of rules, and wherein each reference point of the set of reference points corresponds to a three-dimensional coordinate in space determined from a geometric relationship within the 3D scene (Le Floch ¶ 0167: “The position and orientation of the scene volume V captured are known in the 3D coordinates system SYS (for instance the 3D shape is known, typically a cuboid or cube, and the 3D locations of four of its vertices are known)”; Le Floch ¶ 0175: “This matching is a pure geometrical issue based on known intrinsic and extrinsic parameters”; Le Floch ¶ 0181: “In case a volume V of the captured scene is delimited, its position and orientation are known in the 3D coordinates system SYS (for instance the 3D shape is known, typically a cuboid or cube, and the 3D locations of four of its vertices are known)”; Le Floch ¶ 0265: “This is a geometrical issue which depends only on the extrinsic and intrinsic parameters of each source camera 12i, given the elementary voxels considered”).
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified the camera calibration and player localization for action spotting (as taught by Cioppa) to use a geometric constraint (as taught by Le Floch) because the combination works efficiently on complex scenes such as sporting events with multiple players in a stadium, with the ability to detect a large number of interoperating 3D objects such as multiple players (Le Floch Abstract).
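As an illustration of rule-defined geometric constraints of the kind recited in claim 9 (a hypothetical rule set; the line parameters and world coordinates below are invented for the example and do not come from the cited references), reference points can be extracted by intersecting detected field lines whose real-world positions are fixed by the scene model:

```python
import numpy as np

# Assumed rule: each detected field line is matched to a model line with a
# known real-world position; intersections of matched lines yield reference
# points with 3D coordinates (z = 0 on the ground plane).

def line_intersection(l1, l2):
    """Intersect two image lines given in homogeneous form ax + by + c = 0."""
    p = np.cross(l1, l2)
    return p[:2] / p[2]

# Detected image lines (homogeneous form); values are illustrative.
detected = {"midline": np.array([1.0, 0.0, -640.0]),
            "sideline": np.array([0.0, 1.0, -100.0])}
# Known world coordinates (meters, z = 0) of the matched model intersection.
world = {"midline*sideline": np.array([0.0, -34.0, 0.0])}

img_pt = line_intersection(detected["midline"], detected["sideline"])
ref_point = (tuple(img_pt), tuple(world["midline*sideline"]))  # 2D-3D pair
```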
Regarding claim 10, Cioppa, in view of Le Floch, teaches the method of claim 9, wherein generating the set of camera parameters for the image capture device comprises:
generating, based on a subset of reference points sampled from the set of reference points, a plurality of candidate sets of camera parameters (Cioppa pg. 4 left column: “Our training dataset is 60x larger than the World Cup 2014 dataset [21] used in [38] and contains a larger variety of camera viewpoints, making our student calibration network a valuable candidate for universal camera calibration in the context of soccer”);
filtering the candidate sets of camera parameters to generate filtered candidate sets by at least one of:
evaluating a candidate set of camera parameters of the candidate sets using a test reference point that is local to a subset of reference points used to generate the candidate set; or evaluating the candidate set of camera parameters using a plurality of test reference points that are different from the subset of reference points used to generate the candidate set of camera parameters (Cioppa pg. 4 left column: “A binary score about the relevance of the calibration, set to 1 for frames with a plausible estimated calibration, is also computed by our student. This allows to discard cameras views that are not recorded by the main camera, such as close-up views, or public views”; Cioppa pg. 5 right column: “The World Cup 2014 dataset [21] stands as reference for evaluating soccer camera calibration methods. The test set comprises 186 calibrated images taken from 10 different games in various stadiums, perspectives, lighting conditions, and moments of the day … for each test image, we compute the entire intersection over union (IoU entire) between the top view projections of the field model by the ground-truth camera and by the estimated camera”; Cioppa pg. 6 left column: “consists in 110,458 action timestamps spread over 17 classes within the 500 complete games of the SoccerNet [17] dataset, of which 22,551 actions are related to the 100 test games”); and
processing the filtered candidate sets to generate the set of camera parameters for the image capture device (Cioppa Fig. 2: Camera calibration CCBV-SN → calibrated data representation; Cioppa pg. 4 left column: “We release those estimates along our trained calibration network, which can be used with a wide variety of soccer videos. We denote CCBV-SN our student trained on SoccerNet”; Cioppa pg. 4 right column: “We use the homography computed by CCBV-SN to estimate the player localization on the field in real-world coordinates from the middle point of the bottom of their bounding box”).
Regarding claim 11, Cioppa, in view of Le Floch, teaches the method of claim 9, wherein the set of rules and the machine learning model are each associated with a scene type (Cioppa Abstract: “SoccerNet dataset … broadcast videos of 500 soccer games”).
Regarding claim 12, Cioppa, in view of Le Floch, teaches the method of claim 11, wherein:
the scene type is a football game (Cioppa Abstract discussed above; Cioppa Fig. 1);
the machine learning model is trained to identify one or more of:
a set of yard line features; a set of sideline features; and a set of hash mark features (Cioppa Fig. 3); and
the one or more geometric constraints define a relationship between one or more of:
the set of yard line features; the set of sideline features; and the set of hash mark features (Cioppa Fig. 3; Le Floch ¶¶ 0167, 0175, 0181, and 0265 discussed above).
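By way of a concrete, hypothetical example of such a rule set for a football scene type (the constants below are standard field dimensions used only for illustration; they are not drawn from the cited references), world coordinates for yard-line and hash-mark features follow directly from fixed field geometry rather than from per-camera training:

```python
# Assumed rule set: yard lines are parallel, 10 yards apart, and
# perpendicular to the sidelines; hash marks sit at a fixed offset
# from each sideline. All values are illustrative.
YARD_LINE_SPACING_YD = 10.0
FIELD_WIDTH_YD = 160.0 / 3        # 53 1/3 yards between sidelines
HASH_OFFSET_YD = 70.75 / 3        # 70 ft 9 in from the sideline, in yards

def yard_line_world(i: int) -> list[tuple[float, float, float]]:
    """World endpoints (x, y, z) of the i-th yard line on the z = 0 plane,
    derived from the constraints above."""
    x = i * YARD_LINE_SPACING_YD
    return [(x, 0.0, 0.0), (x, FIELD_WIDTH_YD, 0.0)]

def hash_marks_world(i: int) -> list[tuple[float, float, float]]:
    """World positions of the two hash marks at the i-th yard line."""
    x = i * YARD_LINE_SPACING_YD
    return [(x, HASH_OFFSET_YD, 0.0), (x, FIELD_WIDTH_YD - HASH_OFFSET_YD, 0.0)]
```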
Regarding claim 13, Cioppa, in view of Le Floch, teaches that the system performs a method for automated calibration of an image capture device comprising the processes described in claim 1 (Cioppa Fig. 2). Therefore, claim 13 is rejected using the same rationale as applied to claim 1 discussed above.
Claim 14 is rejected using the same rationale as applied to claim 2 discussed above.
Claim 15 is rejected using the same rationale as applied to claim 3 discussed above.
Claim 17 is rejected using the same rationale as applied to claim 5 discussed above.
Claim 18 is rejected using the same rationale as applied to claim 6 discussed above.
Claim 19 is rejected using the same rationale as applied to claim 7 discussed above.
Claim 20 is rejected using the same rationale as applied to claim 8 discussed above.
Claims 21 and 22 are rejected under 35 U.S.C. 103 as being unpatentable over Cioppa et al. (2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops), in view of Le Floch (US 2019/0266780 A1), and further in view of Pham et al. (US 2020/0341466 A1), hereinafter referred to as Pham.
Regarding claim 21, Cioppa, in view of Le Floch, teaches the system of claim 1, but does not appear to explicitly teach not relying on a template or dictionary.
Pertaining to the same field of endeavor, Pham teaches that the features are identified within the image data without relying on a pretrained template or an offline dictionary (Pham ¶ 0025: “the system may learn to diagnose each intersection in real-time or near real-time, without requiring prior experience or knowledge”; Pham ¶ 0200: “Training may be executed according to any one or more classes of machine learning techniques, including, without limitation, classes such as: … unsupervised training, self learning …”).
Cioppa, Le Floch, and Pham are considered to be analogous art because they are directed to image processing for detecting objects. It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified the camera calibration and player localization for action spotting (as taught by Cioppa, in view of Le Floch) to not rely on a dictionary or template (as taught by Pham) because an algorithm that does not rely on a template or dictionary can detect and classify objects in real time (Pham ¶ 0025).
Regarding claim 22, Cioppa, in view of Le Floch, teaches the system of claim 1, but does not appear to explicitly teach not requiring prior camera training data.
Pertaining to the same field of endeavor, Pham teaches that the final set of camera parameters is generated in a single processing operation, without requiring prior camera-specific training data (Pham ¶ 0025 & ¶ 0200 discussed above).
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified the camera calibration and player localization for action spotting (as taught by Cioppa, in view of Le Floch) to generate the camera parameters without prior camera-specific training data (as taught by Pham) because the modification can detect and classify objects in real time (Pham ¶ 0025).
Conclusion
Applicant's amendment necessitated the new ground(s) of rejection presented in this Office action. Accordingly, THIS ACTION IS MADE FINAL. See MPEP § 706.07(a). Applicant is reminded of the extension of time policy as set forth in 37 CFR 1.136(a).
A shortened statutory period for reply to this final action is set to expire THREE MONTHS from the mailing date of this action. In the event a first reply is filed within TWO MONTHS of the mailing date of this final action and the advisory action is not mailed until after the end of the THREE-MONTH shortened statutory period, then the shortened statutory period will expire on the date the advisory action is mailed, and any nonprovisional extension fee (37 CFR 1.17(a)) pursuant to 37 CFR 1.136(a) will be calculated from the mailing date of the advisory action. In no event, however, will the statutory period for reply expire later than SIX MONTHS from the mailing date of this final action.
Any inquiry concerning this communication or earlier communications from the examiner should be directed to SOO J SHIN whose telephone number is (571)272-9753. The examiner can normally be reached M-F, 10-6.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Matthew Bella, can be reached at (571)272-7778. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.
/Soo Shin/Primary Examiner, Art Unit 2667