DETAILED ACTION
Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
Response to Arguments
Applicant’s response to the last Office Action, filed 1/8/2026, has been entered and made of record.
Applicant has amended claims 1, 10, and 11. Claims 1-20 are currently pending.
Applicant's arguments filed 1/8/2026, with respect to the rejection of claims 1, 10, and 11 under 35 U.S.C. 103 have been fully considered and are persuasive. Therefore, the rejection has been withdrawn. However, upon further consideration, a new ground of rejection is made in view of Wu (X. Wu, G. Li, Q. Cao, Q. Ji and L. Lin, "Interpretable Video Captioning via Trajectory Structured Localization," 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 2018, pp. 6829-6837.)
Claim Rejections - 35 USC § 103
In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.
The factual inquiries set forth in Graham v. John Deere Co., 383 U.S. 1, 148 USPQ 459 (1966), that are applied for establishing a background for determining obviousness under 35 U.S.C. 103 are summarized as follows:
1. Determining the scope and contents of the prior art.
2. Ascertaining the differences between the prior art and the claims at issue.
3. Resolving the level of ordinary skill in the pertinent art.
4. Considering objective evidence present in the application indicating obviousness or nonobviousness.
Claims 1-7, 11-17, and 20 are rejected under 35 U.S.C. 103 as being unpatentable over Krishnamurthy (U.S. Patent Pub. No. 2020/0134316) in view of Zhou et al., “Global Tracking Transformers,” arXiv:2203.13250v2, April 25, 2022, 10 pages (hereinafter “Zhou”), and further in view of X. Wu, G. Li, Q. Cao, Q. Ji and L. Lin, "Interpretable Video Captioning via Trajectory Structured Localization," 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 2018, pp. 6829-6837 (hereinafter “Wu”).
Regarding Claim 1, Krishnamurthy teaches a computer-implemented method to perform dense video object captioning, the method comprising (¶65 The Scene Annotation component module 120 uses an image frame from a video stream presented to a user to generate a text description of scene elements within the image frame:)
obtaining, by a computing system comprising one or more computing devices (¶41 Each module may be separate and independent or each module may simply be a process carried out by single general-purpose computer,) a video comprising a plurality of image frames, wherein the video depicts a plurality of objects (¶65 The Scene Annotation component module 120 uses an image frame from a video stream presented to a user to generate a text description of scene elements (objects) within the image frame;)
respectively processing, by the computing system, each image frame with a machine-learned object detection model to extract one or more sets of feature data respectively for one or more of the plurality of objects that are depicted in the image frame (¶66 The neural networks may be arranged as an encoder pair as shown in FIG. 5. The first NN, referred to herein as the encoder 501, is a deep convolutional network (CNN) type that outputs a feature vector 502; ¶67 The encoder 501 is trained to classify objects within the image frame;)
respectively processing, by the computing system, at least some of the respective sets of feature data that correspond to each object with a machine-learned text generation model to generate a textual caption for each object; and (¶66 The second NN, referred to herein as the decoder 503, is a deep network, e.g., a RNN or LSTM that outputs captions word by word representing the elements of the scene) providing, by the computing system, the textual caption for each object as an output (¶67 The decoder 503 takes feature vectors and outputs captions for the image frames.)
Krishnamurthy does not explicitly disclose processing, by the computing system, the sets of feature data extracted for the plurality of image frames with a machine-learned tracking model to generate a plurality of trajectories respectively for the plurality of objects, wherein the trajectory for each object identifies the sets of feature data that correspond to the object;
respectively processing, by the computing system and based on the plurality of trajectories, at least some of the respective sets of feature data that correspond to the trajectory of each object with a machine-learned text generation model to generate a textual caption for each object; and providing, by the computing system, the textual caption for each object as an output.
Zhou is in the same field of art of image analysis. Further, Zhou teaches processing, by the computing system, the sets of feature data extracted for the plurality of image frames with a machine-learned tracking model to generate a plurality of trajectories respectively for the plurality of objects, wherein the trajectory for each object identifies the sets of feature data that correspond to the object (Zhou, Abstract: Our network takes a short sequence of frames as input and produces global trajectories for all objects. The core component is a global tracking transformer that operates on objects from all frames in the sequence. The transformer encodes object features from all frames, and uses trajectory queries to group them into trajectories. The trajectory queries are object features from a single frame and naturally produce unique trajectories.)
Therefore, it would have been obvious to one having ordinary skill in the art before the effective filing date of the claimed invention to modify the invention of Krishnamurthy by generating trajectories for the objects, as taught by Zhou; one of ordinary skill in the art would have been motivated to combine the references to improve upon baselines based on pairwise association (Zhou, Abstract).
Thus, the claimed subject matter would have been obvious to a person having ordinary skill in the art before the effective filing date of the claimed invention.
Wu is in the same field of image analysis. Further, Wu teaches respectively processing, by the computing system and based on the plurality of trajectories, at least some of the respective sets of feature data that correspond to the trajectory of each object with a machine-learned (Section 3: As with the encoder, the architecture of the decoder can be CNN or RNN) text generation model to generate a textual caption for each object; and providing, by the computing system, the textual caption for each object as an output (Fig. 1; Section 1: we propose a trajectory structured attentional encoder-decoder framework (TSA-ED) which works by incorporating an attentive structured localization mechanism in a prevailing LSTM based encoder and decoder framework. In particular, our proposed TSA-ED is composed of a pre-processing stage for trajectory cluster feature representation and a structured aware encoder-decoder network framework. In the pre-processing stage, we extract a set of trajectory cluster features. Each trajectory cluster well captures one specific local motion pattern and it is used in the decoding phase for local spatial-temporal feature attention. During the decoding phase, we dynamically change the feature vectors of candidate spatial-temporal regions in video and simultaneously generate the caption.)
Therefore, it would have been obvious to one having ordinary skill in the art before the effective filing date of the claimed invention to modify the invention of Krishnamurthy by inputting trajectories into a model to generate a caption, as taught by Wu; one of ordinary skill in the art would have been motivated to combine the references because the approach of Wu is able to generate more elaborate and more accurate video captioning than existing traditional global image feature or static object representation based methods (Wu, Section 1).
Thus, the claimed subject matter would have been obvious to a person having ordinary skill in the art before the effective filing date of the claimed invention.
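For illustration, the three stages recited in claim 1 (per-frame detection, trajectory generation, and per-trajectory captioning) can be sketched as follows. This is a minimal sketch under stated assumptions and does not reproduce any model of Krishnamurthy, Zhou, or Wu: the functions `detect`, `track`, and `caption`, the `max_dist` parameter, and the toy `video` data are all hypothetical stand-ins for the machine-learned models.

```python
def detect(frame):
    # Hypothetical detector stand-in: a "frame" here is already a list of
    # per-object feature tuples, so detection simply returns them.
    return list(frame)

def track(per_frame_features, max_dist=1.0):
    # Hypothetical tracker stand-in: link each detection to the trajectory
    # whose most recent feature is nearest in the first coordinate;
    # otherwise start a new trajectory.
    trajectories = []  # each trajectory is a list of (frame_index, feature)
    for t, feats in enumerate(per_frame_features):
        for feat in feats:
            best, best_d = None, max_dist
            for traj in trajectories:
                last_t, last_feat = traj[-1]
                d = abs(last_feat[0] - feat[0])
                # A trajectory accepts at most one detection per frame.
                if last_t < t and d <= best_d:
                    best, best_d = traj, d
            if best is None:
                trajectories.append([(t, feat)])
            else:
                best.append((t, feat))
    return trajectories

def caption(trajectory):
    # Hypothetical text-generator stand-in: describe the trajectory in words.
    start = trajectory[0][0]
    return f"object tracked across {len(trajectory)} frames starting at frame {start}"

def dense_captions(video):
    # Detection per frame, then trajectories, then one caption per trajectory.
    per_frame = [detect(frame) for frame in video]
    return [caption(traj) for traj in track(per_frame)]

# Toy video: two frames, two objects per frame.
video = [
    [(0.0, 1.0), (5.0, 1.1)],
    [(0.1, 1.2), (5.1, 0.9)],
]
captions = dense_captions(video)
```

The point of the sketch is only the data flow: the captioning stage consumes the features grouped by trajectory rather than the features of a single frame.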
Regarding Claim 2, Krishnamurthy in view of Zhou in view of Wu discloses the computer-implemented method of claim 1, wherein two or more of the machine-learned object detection model, the machine-learned tracking model, and the machine-learned text generation model have been trained on two or more disjoint tasks (Krishnamurthy, ¶67 During training the encoder and decoder may be trained separately ... The inputs to the encoder during training are labeled image frames ... The labels are hidden from the encoder and checked with the encoder output during training ... The input to the decoder are image feature vectors having captions that are hidden from the decoder and checked during training.)
Regarding Claim 3, Krishnamurthy in view of Zhou in view of Wu discloses the computer-implemented method of claim 2, wherein the two or more disjoint tasks comprise two or more of: an object detection task (Krishnamurthy, ¶67 The encoder 501 is trained to classify objects within the image frame. The inputs to the encoder during training are labeled image frames ... The labels are hidden from the encoder and checked with the encoder output during training;) a dense captioning in images task (Krishnamurthy, ¶67 The decoder 503 takes feature vectors and outputs captions for the image frames. The input to the decoder are image feature vectors having captions that are hidden from the decoder and checked during training;) a global video captioning task; and a tracking task.
Regarding Claim 4, Krishnamurthy in view of Zhou in view of Wu discloses the computer-implemented method of claim 1, wherein two or more of the machine-learned object detection model, the machine-learned tracking model, and the machine-learned text generation model have been trained jointly together (Krishnamurthy, ¶67 In alternative implementations, a encoder-decoder architecture may be trained jointly to translate an image to text. By way of example, and not by way of limitation, the encoder, e.g., a deep CNN, may generate an image embedding from an image. The decoder, e.g., an RNN variant, may then take this image embedding and generate corresponding text. The NN algorithms discussed above are used for adjustment of weights and optimization.)
Regarding Claim 5, Krishnamurthy in view of Zhou in view of Wu discloses the computer-implemented method of claim 1, wherein:
the machine-learned object detection model has been trained using an object detection loss function (Krishnamurthy, ¶67 The encoder 501 is trained to classify objects within the image frame. The inputs to the encoder during training are labeled image frames... The labels are hidden from the encoder and checked with the encoder output during training; This will give a detection loss during the training. ¶50 also explains example loss functions while training neural networks;)
the machine-learned tracking model has been trained using an association loss function; and (Zhou, Section 4.2: We train L_asso jointly with standard detection losses [70], including classification and bounding-box regression losses, and optionally second stage classification and regression losses for multi-class tracking [13])
the machine-learned text generation has been trained using a caption loss function (Krishnamurthy, ¶67 The decoder 503 takes feature vectors and outputs captions for the image frames. The input to the decoder are image feature vectors having captions that are hidden from the decoder and checked during training; This will give a caption loss during the training. ¶50 explains example loss function while training neural networks.)
The reasons for combining Krishnamurthy and Zhou and Wu are similar to that stated in the rejection of claim 1. In addition, this same reasoning is pertinent and applicable to the rejections of claims 6 and 7 below.
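For illustration, training with a detection loss, an association loss, and a caption loss can be sketched as a single summed objective. This is a minimal sketch, assuming each loss is a cross-entropy over predicted probabilities and that the three terms are weighted equally; neither assumption is taken from Krishnamurthy or Zhou, and the function names and argument layout are hypothetical.

```python
import math

def cross_entropy(probs, target):
    # Negative log-probability that the model assigned to the correct index.
    return -math.log(probs[target])

def total_loss(det_probs, det_label, assoc_probs, assoc_label,
               token_probs, token_labels):
    # Detection term: classification of one detected object.
    detection = cross_entropy(det_probs, det_label)
    # Association term: which trajectory the object belongs to.
    association = cross_entropy(assoc_probs, assoc_label)
    # Caption term: mean cross-entropy over the caption's tokens.
    caption = sum(cross_entropy(p, y)
                  for p, y in zip(token_probs, token_labels)) / len(token_labels)
    # Equal weighting of the three terms is an assumption of this sketch.
    return detection + association + caption

perfect = total_loss([1.0, 0.0], 0, [0.0, 1.0], 1, [[1.0, 0.0]], [0])
worse = total_loss([0.5, 0.5], 0, [0.5, 0.5], 1, [[0.5, 0.5]], [0])
```

A confident, correct prediction drives every term toward zero, while an uncertain prediction raises each term, so the summed value gives one quantity to minimize across all three models.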
Regarding Claim 6, Krishnamurthy in view of Zhou in view of Wu discloses the computer-implemented method of claim 1, wherein processing, by the computing system, the sets of feature data extracted for the plurality of image frames with the machine-learned tracking model to generate the plurality of trajectories comprises generating, by the computing system using the machine-learned tracking model, a global association matrix that assigns the sets of feature data to the plurality of objects (Zhou, Section 4.4: The global tracking transformer takes a stack of object features F ∈ R^(N×D) as the encoder input, a matrix of queries Q ∈ R^(M×D) as the decoder input, and produces an association matrix G ∈ R^(M×N) between queries and objects.)
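For illustration, the shapes in the cited passage (N object features of dimension D, M queries of dimension D, and an M×N association matrix) can be sketched as follows. This is a minimal sketch, assuming a plain dot product as the query-to-object score; the transformer of Zhou learns this mapping, so the scoring function and the toy values below are stand-ins only.

```python
def association_matrix(queries, objects):
    # G[m][n] is the dot product between query m and object feature n.
    return [[sum(q * f for q, f in zip(query, obj)) for obj in objects]
            for query in queries]

F = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1]]  # N = 3 object features, D = 2
Q = [[1.0, 0.0], [0.0, 1.0]]              # M = 2 trajectory queries, D = 2
G = association_matrix(Q, F)              # M x N = 2 rows by 3 columns
```

Each row of G scores every detected object against one query, so a higher entry indicates that the object is more likely to belong to that query's trajectory.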
Regarding Claim 7, Krishnamurthy in view of Zhou in view of Wu discloses the computer-implemented method of claim 5, wherein processing, by the computing system, the sets of feature data extracted for the plurality of image frames with the machine-learned tracking model to generate the plurality of trajectories further comprises performing, by the computing system, a greedy grouping algorithm on the global association matrix (Zhou, Section 1: Local trackers [4, 5, 54, 55, 60, 66] primarily consider pairwise associations in a greedy way (Figure 1a). They maintain a status of each trajectory based on location [5, 68] and/or identity features [55, 66], and associate current-frame detections with each trajectory based on its last visible status; Section 3: Most prior works define the association greedily through pairwise matches between objects in adjacent or nearby frames [4, 5, 66, 68], or rely on offline combinatorial optimization for global association [6, 16, 65].)
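For illustration, one way a greedy grouping over an association matrix could operate is sketched below. This is a minimal sketch under stated assumptions and is not the procedure of Zhou or of the local trackers it cites: the function `greedy_group`, the `threshold` parameter, the `frame_of` list, and the rule of one object per trajectory per frame are all hypothetical.

```python
def greedy_group(G, frame_of, threshold=0.5):
    # Visit (score, query, object) triples from the highest score down.
    pairs = sorted(((G[m][n], m, n)
                    for m in range(len(G))
                    for n in range(len(G[m]))), reverse=True)
    assigned = {}   # object index -> trajectory (query) index
    used = set()    # (trajectory, frame) slots that are already filled
    for score, m, n in pairs:
        if score < threshold:
            break   # remaining pairs are too weak to associate
        if n in assigned or (m, frame_of[n]) in used:
            continue  # object taken, or trajectory already has this frame
        assigned[n] = m
        used.add((m, frame_of[n]))
    return assigned

# Two queries (rows), four objects (columns); objects 0-1 are in frame 0
# and objects 2-3 are in frame 1.
G = [[0.9, 0.1, 0.8, 0.2],
     [0.2, 0.7, 0.3, 0.6]]
frame_of = [0, 0, 1, 1]
groups = greedy_group(G, frame_of)
```

The greedy choice commits to the strongest remaining association at each step and never revisits it, which is what distinguishes it from a combinatorial optimization over all assignments.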
Regarding claim 11, claim 11 has been analyzed with regard to claim 1 and is rejected for the same reasons of obviousness as used above, as well as in accordance with Krishnamurthy further teaching: A computing system comprising one or more processors and one or more non-transitory computer-readable media that store computer-readable instructions that, when executed by the one or more processors, cause the computing system to perform operations, the operations comprising (¶21 a computer platform which is an electronic computing device that includes a processor which manipulates and transforms data represented as physical (e.g., electronic) quantities within the processor's registers and accessible platform memories into other data similarly represented as physical quantities within the computer platform memories, processor registers, or display screen; ¶22 A computer program may be stored in a computer readable storage medium, such as, but not limited to … any other type of non-transitory media suitable for storing electronic instructions.)
Claim 12 recites limitations similar to claim 2 and is rejected under the same rationale and reasoning.
Claim 13 recites limitations similar to claim 3 and is rejected under the same rationale and reasoning.
Claim 14 recites limitations similar to claim 4 and is rejected under the same rationale and reasoning.
Claim 15 recites limitations similar to claim 5 and is rejected under the same rationale and reasoning.
Claim 16 recites limitations similar to claim 6 and is rejected under the same rationale and reasoning.
Claim 17 recites limitations similar to claim 7 and is rejected under the same rationale and reasoning.
Regarding Claim 20, Krishnamurthy in view of Zhou in view of Wu discloses the computing system of claim 11, wherein the non-transitory computer readable media further stores the machine-learned object detection model, the machine-learned tracking model, and the machine-learned text generation model (Krishnamurthy, ¶21 Unless specifically stated or otherwise as apparent from the following discussion, it is to be appreciated that throughout the description, discussions utilizing terms such as “processing”, “computing”, “converting”, “reconciling”, “determining” or “identifying,” refer to the actions and processes of a computer platform which is an electronic computing device that includes a processor which manipulates and transforms data represented as physical (e.g., electronic) quantities within the processor's registers and accessible platform memories into other data similarly represented as physical quantities within the computer platform memories, processor registers, or display screen; ¶22 A computer program may be stored in a computer readable storage medium, such as, but not limited to, any type of disk including floppy disks, optical disks (e.g., compact disc read only memory (CD-ROMs), digital video discs (DVDs), Blu-Ray Discs™, etc.), and magnetic-optical disks, read-only memories (ROMs), random access memories (RAMs), EPROMs, EEPROMs, magnetic or optical cards, flash memories, or any other type of non-transitory media suitable for storing electronic instructions.)
Claim 10 is rejected under 35 U.S.C. 103 as being unpatentable over Krishnamurthy (U.S. Patent Pub. No. 2020/0134316) in view of (Zhou) in view of (Wu) in view of Cohen (U.S. Patent Pub. No. 2021/0027083).
Regarding Claim 10, Krishnamurthy teaches one or more non-transitory computer-readable media that store computer-readable instructions that, when executed by a computing system, cause the computing system to perform operations, the operations comprising (¶21 a computer platform which is an electronic computing device that includes a processor which manipulates and transforms data represented as physical (e.g., electronic) quantities within the processor's registers and accessible platform memories into other data similarly represented as physical quantities within the computer platform memories, processor registers, or display screen; ¶22 A computer program may be stored in a computer readable storage medium, such as, but not limited to … any other type of non-transitory media suitable for storing electronic instructions:)
obtaining, by the computing system, a video comprising a plurality of image frames, wherein the video depicts a plurality of objects (¶65 The Scene Annotation component module 120 uses an image frame from a video stream presented to a user to generate a text description of scene elements (objects) within the image frame;)
respectively processing, by the computing system, each image frame with a machine-learned object detection model to extract one or more sets of feature data respectively for one or more of the plurality of objects that are depicted in the image frame (¶66 The neural networks may be arranged as an encoder pair as shown in FIG. 5. The first NN, referred to herein as the encoder 501, is a deep convolutional network (CNN) type that outputs a feature vector 502; ¶67 The encoder 501 is trained to classify objects within the image frame.)
Krishnamurthy does not explicitly disclose to generate a set of bounding boxes and detection scores for the plurality of objects;
processing, by the computing system, the sets of feature data extracted for the plurality of image frames with a machine-learned tracking model to generate a plurality of trajectories respectively for the plurality of objects, wherein the trajectory for each object identifies the sets of feature data that correspond to the object;
obtaining, by the computing system, a textual query; and
identifying, by the computing system based on the set of bounding boxes and detection scores and using a machine-learned text generation model, one or more bounding boxes with a highest weighted likelihood for the textual query.
Zhou is in the same field of art of image analysis. Further, Zhou teaches processing, by the computing system, the sets of feature data extracted for the plurality of image frames with a machine-learned tracking model to generate a plurality of trajectories respectively for the plurality of objects, wherein the trajectory for each object identifies the sets of feature data that correspond to the object (Zhou, Abstract: Our network takes a short sequence of frames as input and produces global trajectories for all objects. The core component is a global tracking transformer that operates on objects from all frames in the sequence. The transformer encodes object features from all frames, and uses trajectory queries to group them into trajectories. The trajectory queries are object features from a single frame and naturally produce unique trajectories.)
Therefore, it would have been obvious to one having ordinary skill in the art before the effective filing date of the claimed invention to modify the invention of Krishnamurthy by generating trajectories for the objects, as taught by Zhou; one of ordinary skill in the art would have been motivated to combine the references to improve upon baselines based on pairwise association (Zhou, Abstract).
Thus, the claimed subject matter would have been obvious to a person having ordinary skill in the art before the effective filing date of the claimed invention.
Wu is in the same field of image analysis. Further, Wu teaches obtaining based on the trajectory for each object a textual caption (Fig. 1; Section 1: we propose a trajectory structured attentional encoder-decoder framework (TSA-ED) which works by incorporating an attentive structured localization mechanism in a prevailing LSTM based encoder and decoder framework. In particular, our proposed TSA-ED is composed of a pre-processing stage for trajectory cluster feature representation and a structured aware encoder-decoder network framework. In the pre-processing stage, we extract a set of trajectory cluster features. Each trajectory cluster well captures one specific local motion pattern and it is used in the decoding phase for local spatial-temporal feature attention. During the decoding phase, we dynamically change the feature vectors of candidate spatial-temporal regions in video and simultaneously generate the caption.)
Therefore, it would have been obvious to one having ordinary skill in the art before the effective filing date of the claimed invention to modify the invention of Krishnamurthy by inputting trajectories into a model to generate a caption, as taught by Wu; one of ordinary skill in the art would have been motivated to combine the references because the approach of Wu is able to generate more elaborate and more accurate video captioning than existing traditional global image feature or static object representation based methods (Wu, Section 1).
Thus, the claimed subject matter would have been obvious to a person having ordinary skill in the art before the effective filing date of the claimed invention.
Cohen is in the same field of art of image analysis. Further, Cohen teaches to generate a set of bounding boxes and detection scores for the plurality of objects (¶86 the object selection system 106 can utilize the object classification neural network to tag each bounding box with a prediction of one or more known objects identified within the bounding box. In some embodiments, the label includes known object detection confidence scores (e.g., prediction probability scores) for each of the object class tags;)
obtaining, by the computing system, a textual query; and (¶58 As shown in FIG. 2 in connection with the act 202, the user provides the query string of “sign” to be selected from the image (can use query taught by Wu))
identifying, by the computing system based on the set of bounding boxes and detection scores and using a machine-learned text generation model, one or more bounding boxes with a highest weighted likelihood for the textual query (¶130 the object selection system 106 can detect the query object from the potential objects. For example, as shown in FIG. 7, the act 310 further includes the act 714 of the object selection system 106 detecting the query object in the image based on the correlation scores. In one or more embodiments, the object selection system 106 can select the potential object that has the highest correlation score as the detected query object; ¶131 the object selection system 106 can label the bounding box of a detected query object. For example, upon determining that a potential object in the image correlates with the query object, the object selection system 106 can tag the bounding box of the detected query object with a label matching the query object and/or the object class of the query object.)
Therefore, it would have been obvious to one having ordinary skill in the art before the effective filing date of the claimed invention to modify the invention of Krishnamurthy in view of Zhou by generating bounding boxes and confidence scores and using those to determine the most likely bounding box caption, as taught by Cohen; one of ordinary skill in the art would have been motivated to combine the references to accurately detect and optionally automatically select user-requested objects (e.g., query objects) in digital images (Cohen, Abstract).
Thus, the claimed subject matter would have been obvious to a person having ordinary skill in the art before the effective filing date of the claimed invention.
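For illustration, identifying the bounding box with the highest weighted likelihood for a textual query can be sketched as follows. This is a minimal sketch, assuming the weighting is the product of the detection score and a per-box likelihood of the query; the function names, the dictionary layout of a box, and the toy `label_likelihood` scorer are hypothetical and are not taken from Cohen or from the claims.

```python
def label_likelihood(box, query):
    # Hypothetical stand-in for a text model's likelihood of the query
    # given the box: high when the query matches the box label.
    return 1.0 if box["label"] == query else 0.1

def best_box(boxes, query, likelihood=label_likelihood):
    # Weighted likelihood = detection score x likelihood of the query.
    return max(boxes, key=lambda b: b["score"] * likelihood(b, query))

boxes = [
    {"box": (0, 0, 10, 10), "score": 0.9, "label": "tree"},
    {"box": (5, 5, 20, 20), "score": 0.6, "label": "sign"},
]
selected = best_box(boxes, "sign")
```

In this toy input the box labeled "tree" has the higher detection score, but the box labeled "sign" is selected because the query likelihood outweighs the difference in detection scores.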
Allowable Subject Matter
Claims 8-9 and 18-19 are objected to as being dependent upon a rejected base claim, but would be allowable if rewritten in independent form including all of the limitations of the base claim and any intervening claims.
Regarding claims 8 and 18, No prior art teaches the computer-implemented method of claim 1, wherein respectively processing, by the computing system, at least some of the respective sets of feature data that correspond to each object with the machine-learned text generation model to generate the textual caption for each object comprises: uniformly sampling, by the computing system, from the sets of feature data that correspond to each object to obtain sampled sets of feature data for each object; and processing, by the computing system, the sampled sets of feature data for each object with the machine-learned text generation model to generate the textual caption for the object.
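For illustration, uniformly sampling from the sets of feature data of one trajectory can be sketched as follows. This is a minimal sketch, assuming the trajectory is an ordered list of per-frame features and that "uniformly" means evenly spaced positions along that list; the function name and the parameter `k` are hypothetical, and the application may define the sampling differently.

```python
def uniform_sample(features, k):
    # Pick k evenly spaced items from a trajectory's ordered feature list.
    n = len(features)
    if k >= n:
        return list(features)         # nothing to drop
    if k == 1:
        return [features[n // 2]]     # single sample: take the middle item
    # Spread k indices evenly from the first position to the last.
    return [features[round(i * (n - 1) / (k - 1))] for i in range(k)]

trajectory = list(range(10))          # stand-ins for 10 per-frame features
sampled = uniform_sample(trajectory, 4)
```

The sampled subset keeps the first and last features and bounds the input passed to the text generation model regardless of how long the trajectory is.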
Regarding claims 9 and 19, No prior art teaches the computer-implemented method of claim 1, wherein respectively processing, by the computing system, at least some of the respective sets of feature data that correspond to each object with the machine-learned text generation model to generate the textual caption for each object comprises: computing, by the computing system for each object, a weighted sum of features in the trajectory from other image frames; and processing, by the computing system, the weighted sum of features for each object with the machine-learned text generation model to generate the textual caption for the object.
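For illustration, computing a weighted sum of features in a trajectory from other image frames can be sketched as follows. This is a minimal sketch, assuming the weights are a softmax over dot-product similarity to the feature of the frame of interest; the application may compute the weights differently, and the function name and arguments are hypothetical.

```python
import math

def aggregate(trajectory, t):
    # Weighted sum of the features from every frame other than frame t.
    target = trajectory[t]
    others = [f for i, f in enumerate(trajectory) if i != t]
    if not others:
        return list(target)           # single-frame trajectory: nothing to sum
    # Similarity of each other-frame feature to the feature at frame t.
    sims = [sum(a * b for a, b in zip(target, f)) for f in others]
    # Softmax over similarities (shifted by the maximum for stability).
    peak = max(sims)
    exps = [math.exp(s - peak) for s in sims]
    total = sum(exps)
    weights = [e / total for e in exps]
    return [sum(w * f[d] for w, f in zip(weights, others))
            for d in range(len(target))]

trajectory = [[1.0, 0.0], [1.0, 0.0], [1.0, 0.0]]
pooled = aggregate(trajectory, 0)
```

Because the weights sum to one, the result stays in the same feature space as the inputs and can be passed to a text generation model in place of a single-frame feature.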
Conclusion
Applicant's amendment necessitated the new ground(s) of rejection presented in this Office action. Accordingly, THIS ACTION IS MADE FINAL. See MPEP § 706.07(a). Applicant is reminded of the extension of time policy as set forth in 37 CFR 1.136(a).
A shortened statutory period for reply to this final action is set to expire THREE MONTHS from the mailing date of this action. In the event a first reply is filed within TWO MONTHS of the mailing date of this final action and the advisory action is not mailed until after the end of the THREE-MONTH shortened statutory period, then the shortened statutory period will expire on the date the advisory action is mailed, and any nonprovisional extension fee (37 CFR 1.17(a)) pursuant to 37 CFR 1.136(a) will be calculated from the mailing date of the advisory action. In no event, however, will the statutory period for reply expire later than SIX MONTHS from the mailing date of this final action.
Any inquiry concerning this communication or earlier communications from the examiner should be directed to DUSTIN BILODEAU whose telephone number is (571)272-1032. The examiner can normally be reached 9am-5pm.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Jennifer Mehmood, can be reached at (571) 272-2976. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.
/DUSTIN BILODEAU/Examiner, Art Unit 2664
/JENNIFER MEHMOOD/Supervisory Patent Examiner, Art Unit 2664