DETAILED ACTION
Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
Information Disclosure Statement
The information disclosure statement (IDS) submitted on 05/13/2024 is being considered by the examiner.
Claim Objections
Claims 4, 9, 13, and 18 are objected to because of the following informalities:
In claim 4, line 3, the term “regions of the second image frame” should be changed to “regions of the second image” in order to maintain consistency in terminology, to avoid an insufficient antecedent basis issue, and to prevent a rejection under 35 U.S.C. 112(b) or pre-AIA 35 U.S.C. 112, second paragraph.
In claim 4, line 5, the term “recognition on the second image frame” should be changed to “recognition on the second image” in order to maintain consistency in terminology, to avoid an insufficient antecedent basis issue, and to prevent a rejection under 35 U.S.C. 112(b) or pre-AIA 35 U.S.C. 112, second paragraph.
In claim 4, lines 10-11, the term “location in the second and third image frames” should be changed to “location in the second image and third image frame” in order to maintain consistency in terminology, to avoid an insufficient antecedent basis issue, and to prevent a rejection under 35 U.S.C. 112(b) or pre-AIA 35 U.S.C. 112, second paragraph.
In claim 9, line 3, the term “when the object detector detects” should be changed to “when the main object detector detects” in order to maintain consistency in terminology, to avoid an insufficient antecedent basis issue, and to prevent a rejection under 35 U.S.C. 112(b) or pre-AIA 35 U.S.C. 112, second paragraph.
In claim 9, line 5, the term “bypass performing OCR on the object” should be changed to “bypass performing optical character recognition (OCR) on the object,” because an acronym should be presented with its meaning the first time it appears in the claim set.
In claim 9, line 8, the term “when the object detector does not” should be changed to “when the main object detector does not” in order to maintain consistency in terminology, to avoid an insufficient antecedent basis issue, and to prevent a rejection under 35 U.S.C. 112(b) or pre-AIA 35 U.S.C. 112, second paragraph.
In claim 9, line 15, the term “detected by the backup detector; and” should be changed to “detected by the backup object detector; and” in order to maintain consistency in terminology, to avoid an insufficient antecedent basis issue, and to prevent a rejection under 35 U.S.C. 112(b) or pre-AIA 35 U.S.C. 112, second paragraph.
In claim 13, line 3, the term “regions of the second image frame” should be changed to “regions of the second image” in order to maintain consistency in terminology, to avoid an insufficient antecedent basis issue, and to prevent a rejection under 35 U.S.C. 112(b) or pre-AIA 35 U.S.C. 112, second paragraph.
In claim 13, line 5, the term “recognition on the second image frame” should be changed to “recognition on the second image” in order to maintain consistency in terminology, to avoid an insufficient antecedent basis issue, and to prevent a rejection under 35 U.S.C. 112(b) or pre-AIA 35 U.S.C. 112, second paragraph.
In claim 13, lines 10-11, the term “location in the second and third image frames” should be changed to “location in the second image and third image frame” in order to maintain consistency in terminology, to avoid an insufficient antecedent basis issue, and to prevent a rejection under 35 U.S.C. 112(b) or pre-AIA 35 U.S.C. 112, second paragraph.
In claim 18, line 3, the term “when the object detector does not” should be changed to “when the main object detector does not” in order to maintain consistency in terminology, to avoid an insufficient antecedent basis issue, and to prevent a rejection under 35 U.S.C. 112(b) or pre-AIA 35 U.S.C. 112, second paragraph.
In claim 18, line 5, the term “bypass performing OCR on the object” should be changed to “bypass performing optical character recognition (OCR) on the object,” because an acronym should be presented with its meaning the first time it appears in the claim set.
In claim 18, line 8, the term “when the object detector does not” should be changed to “when the main object detector does not” in order to maintain consistency in terminology, to avoid an insufficient antecedent basis issue, and to prevent a rejection under 35 U.S.C. 112(b) or pre-AIA 35 U.S.C. 112, second paragraph.
In claim 18, line 15, the term “detected by the backup detector; and” should be changed to “detected by the backup object detector; and” in order to maintain consistency in terminology, to avoid an insufficient antecedent basis issue, and to prevent a rejection under 35 U.S.C. 112(b) or pre-AIA 35 U.S.C. 112, second paragraph.
Appropriate correction is required.
Claim Interpretation
The following is a quotation of 35 U.S.C. 112(f):
(f) Element in Claim for a Combination. – An element in a claim for a combination may be expressed as a means or step for performing a specified function without the recital of structure, material, or acts in support thereof, and such claim shall be construed to cover the corresponding structure, material, or acts described in the specification and equivalents thereof.
The following is a quotation of pre-AIA 35 U.S.C. 112, sixth paragraph:
An element in a claim for a combination may be expressed as a means or step for performing a specified function without the recital of structure, material, or acts in support thereof, and such claim shall be construed to cover the corresponding structure, material, or acts described in the specification and equivalents thereof.
The claims in this application are given their broadest reasonable interpretation using the plain meaning of the claim language in light of the specification as it would be understood by one of ordinary skill in the art. The broadest reasonable interpretation of a claim element (also commonly referred to as a claim limitation) is limited by the description in the specification when 35 U.S.C. 112(f) or pre-AIA 35 U.S.C. 112, sixth paragraph, is invoked.
As explained in MPEP § 2181, subsection I, claim limitations that meet the following three-prong test will be interpreted under 35 U.S.C. 112(f) or pre-AIA 35 U.S.C. 112, sixth paragraph:
(A) the claim limitation uses the term “means” or “step” or a term used as a substitute for “means” that is a generic placeholder (also called a nonce term or a non-structural term having no specific structural meaning) for performing the claimed function;
(B) the term “means” or “step” or the generic placeholder is modified by functional language, typically, but not always linked by the transition word “for” (e.g., “means for”) or another linking word or phrase, such as “configured to” or “so that”; and
(C) the term “means” or “step” or the generic placeholder is not modified by sufficient structure, material, or acts for performing the claimed function.
Use of the word “means” (or “step”) in a claim with functional language creates a rebuttable presumption that the claim limitation is to be treated in accordance with 35 U.S.C. 112(f) or pre-AIA 35 U.S.C. 112, sixth paragraph. The presumption that the claim limitation is interpreted under 35 U.S.C. 112(f) or pre-AIA 35 U.S.C. 112, sixth paragraph, is rebutted when the claim limitation recites sufficient structure, material, or acts to entirely perform the recited function.
Absence of the word “means” (or “step”) in a claim creates a rebuttable presumption that the claim limitation is not to be treated in accordance with 35 U.S.C. 112(f) or pre-AIA 35 U.S.C. 112, sixth paragraph. The presumption that the claim limitation is not interpreted under 35 U.S.C. 112(f) or pre-AIA 35 U.S.C. 112, sixth paragraph, is rebutted when the claim limitation recites function without reciting sufficient structure, material or acts to entirely perform the recited function.
Claim limitations in this application that use the word “means” (or “step”) are being interpreted under 35 U.S.C. 112(f) or pre-AIA 35 U.S.C. 112, sixth paragraph, except as otherwise indicated in an Office action. Conversely, claim limitations in this application that do not use the word “means” (or “step”) are not being interpreted under 35 U.S.C. 112(f) or pre-AIA 35 U.S.C. 112, sixth paragraph, except as otherwise indicated in an Office action.
This application includes one or more claim limitations that use a generic placeholder term coupled with functional language but are nonetheless not being interpreted under 35 U.S.C. 112(f) or pre-AIA 35 U.S.C. 112, sixth paragraph, because the claim limitation(s) recite(s) sufficient structure, materials, or acts to entirely perform the recited function.
Claims 1, 9-10, and 18-19 recite limitations that use words like “means” (or “step”) or similar terms with functional language but do not invoke 35 U.S.C. 112(f):
Claim 1 recites the limitation “a main object detector configured to…” [Line 3].
Claim 10 recites the limitation “a main object detector configured to…” [Line 5].
Claim 19 recites the limitation “a main object detector configured to…” [Line 4].
Claim 1 recites the limitation “a backup object detector configured to…” [Line 5].
Claim 10 recites the limitation “a backup object detector configured to…” [Line 8].
Claim 19 recites the limitation “a backup object detector configured to…” [Line 6].
Claim 9 recites the limitation “the object detector detects…” [Line 3].
Claim 9 recites the limitation “detection using the backup object detector…” [Line 9].
Claim 9 recites the limitation “the backup object detector detects…” [Line 10].
Claim 9 recites the limitation “detected by the backup object detector…” [Line 13].
Claim 9 recites the limitation “detected by the backup detector…” [Line 15].
Claim 9 recites the limitation “detected by the backup detector…” [Line 16].
Claim 9 recites the limitation “detected by the backup detector…” [Line 17].
Claim 18 recites the limitation “the object detector detects…” [Line 3].
Claim 18 recites the limitation “detection using the backup object detector…” [Line 9].
Claim 18 recites the limitation “the backup object detector detects…” [Line 10].
Claim 18 recites the limitation “detected by the backup object detector…” [Line 13].
Claim 18 recites the limitation “detected by the backup detector…” [Line 15].
Claim 18 recites the limitation “detected by the backup detector…” [Line 16].
Claim 18 recites the limitation “detected by the backup detector…” [Line 17].
Such claim limitation(s) is/are:
(i) “main object detector…” has structure associated with it, namely a detector;
(ii) “backup object detector…” has structure associated with it, namely a detector;
(iii) “object detector…” has structure associated with it, namely a detector; and
(iv) “backup detector…” has structure associated with it, namely a detector.
Because this/these claim limitation(s) is/are not being interpreted under 35 U.S.C. 112(f) or pre-AIA 35 U.S.C. 112, sixth paragraph, it/they is/are not being interpreted to cover only the corresponding structure, material, or acts described in the specification as performing the claimed function, and equivalents thereof.
If applicant intends to have this/these limitation(s) interpreted under 35 U.S.C. 112(f) or pre-AIA 35 U.S.C. 112, sixth paragraph, applicant may: (1) amend the claim limitation(s) to remove the structure, materials, or acts that performs the claimed function; or (2) present a sufficient showing that the claim limitation(s) does/do not recite sufficient structure, materials, or acts to perform the claimed function.
Claim Rejections - 35 USC § 102
In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis (i.e., changing from AIA to pre-AIA ) for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.
The following is a quotation of the appropriate paragraphs of 35 U.S.C. 102 that form the basis for the rejections under this section made in this Office action:
A person shall be entitled to a patent unless –
(a)(1) the claimed invention was patented, described in a printed publication, or in public use, on sale, or otherwise available to the public before the effective filing date of the claimed invention.
(a)(2) the claimed invention was described in a patent issued under section 151, or in an application for patent published or deemed published under section 122(b), in which the patent or application, as the case may be, names another inventor and was effectively filed before the effective filing date of the claimed invention.
Claims 1-3, 5-7, 10-12, 14-16, and 19-20 are rejected under 35 U.S.C. 102(a)(1)/(a)(2) as being anticipated by CHEN et al. (US 20190130580 A1), hereinafter referenced as CHEN.
Regarding claim 1, CHEN explicitly teaches a method comprising (Fig. 2. Paragraph [0143]-CHEN discloses the video analytics system (e.g., video analytics system 100) processing video frames across time t.):
receiving a video feed (Fig. 1. Paragraph [0136]-CHEN discloses FIG. 1 is a block diagram illustrating an example of a video analytics system 100. The video analytics system 100 receives video frames 102 from a video source 130.);
initializing a main object detector (Fig. 12, #1208 called a deep learning system. Paragraph [0210]-CHEN discloses the deep learning system 1208 can implement a complex object detector.) configured to receive image frames from the video feed (Fig. 12, #1202 called video frames. Paragraph [0210]-CHEN discloses the deep learning system 1208 can implement a complex object detector. For example, the complex object detector can be implemented using one or more trained neural networks (e.g., a deep learning network) to one or more of the frames 1202 of the received video sequence to locate and classify objects in the one or more frames.) and determine whether an object of interest is present in the image frames (Fig. 12. Paragraph [0210]-CHEN discloses an output of the deep learning system 1208 can include a set of detector bounding boxes representing the detected and classified objects.).);
executing a backup object detector (Fig. 2, #204N called a blob detection system. Paragraph [0144]) configured to perform operations comprising (Fig. 2. Paragraph [0144]-CHEN discloses the blob detection system 204N generates foreground blobs 208N for the frame N 202N. The object tracking system 206N can then perform temporal tracking of the blobs 208N.):
receiving a location of the object of interest in a first image of the video feed (Fig. 2. Paragraph [0142]-CHEN discloses the blob trackers can be updated, including in terms of positions of the trackers, according to the data association to generate updated blob trackers 310A. For example, a blob tracker's state and location for the video frame A 202A can be calculated and updated. The blob tracker's location in a next video frame N 202N can also be predicted from the current video frame A 202A. For example, the predicted location of a blob tracker for the next video frame N 202N can include the location of the blob tracker (and its associated blob) in the current video frame A 202A. Tracking of blobs of the current frame A 202A can be performed once the updated blob trackers 310A are generated (wherein the location of the object of interest is a blob).);
determining a background pattern of the object of interest in the first image (Fig. 3. Paragraph [0145]-CHEN discloses the blob detection system 104 includes a background subtraction engine 312 that receives video frames 302. The background subtraction engine 312 can perform background subtraction to detect foreground pixels in one or more of the video frames 302. For example, the background subtraction can be used to segment moving objects from the global background in a video sequence and to generate a foreground-background binary mask (referred to herein as a foreground mask (wherein the foreground-background binary mask is the background pattern).);
receiving a second image of the video feed (Fig. 2, illustrates the receiving of a second video frame, #202N called video frame N. Paragraph [0144]-CHEN discloses when a next video frame N 202N is received, the blob detection system 204N generates foreground blobs 208N for the frame N 202N.);
determining whether select regions (Fig. 8A, illustrates selected regions being boxes surrounding a cat and a dog.) of the second image in the received location comprise the background pattern (Fig. 8A-8C. Paragraph [0198]-CHEN discloses for each default box in each cell, the SSD neural network outputs a probability vector of length c, where c is the number of classes, representing the probabilities of the box containing an object of each class. In some cases, a background class is included that indicates that there is no object in the box (wherein a box is a select region). Further in paragraph [0198]-CHEN discloses for the image shown in FIG. 8A, all probability labels would indicate the background class with the exception of the three matched boxes (two for the cat, one for the dog).); and
labeling the second image (Fig. 10, illustrates a second image called a frame #1000.) as comprising the object of interest when the select regions in the received location comprise the background pattern (Fig. 10, illustrates a classified object with the background pattern in the selected region called bounding box #1004. Paragraph [0217]-CHEN discloses the deep learning system 1208 can generate and output classifications and confidence levels (also referred to as confidence values) for each object detected in a key frame. One illustrative example is shown by the classifications and confidence levels shown in FIG. 10 (the object classified as a person with a 93% confidence using bounding box 1004). A classification and confidence level determined for an object can be associated with the bounding box determined for the object. For instance, the deep learning network applied by the deep learning system 1208 may provide detector bounding boxes 1323 for a key frame, along with a category classification and a confidence level (CL) associated with each detector bounding box. The object classification indicates a category determined for an object detected in a key frame using the deep learning classification network. Any number of classes or categories can be determined for an object, such as a person, a car, or other suitable object class that the deep network is configured to detect and classify.).
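For clarity of the mapping above, the following is a minimal, non-limiting sketch in Python of the kind of background-pattern check recited in claim 1. It is not taken from CHEN or from applicant's disclosure; the function name, the (x0, y0, x1, y1) region format, and the tolerance value are hypothetical.

```python
import numpy as np

def backup_detector_labels_object(first_frame: np.ndarray,
                                  second_frame: np.ndarray,
                                  select_regions,
                                  tol: float = 10.0) -> bool:
    """Illustrative sketch only (hypothetical names): learn a background pattern
    from the select regions of the first frame, then label the second frame as
    comprising the object when those same regions still match that pattern."""
    # Background pattern: mean pixel value of each select region in the first frame.
    # Each region is an (x0, y0, x1, y1) box lying at the received object location.
    pattern = [first_frame[y0:y1, x0:x1].mean() for (x0, y0, x1, y1) in select_regions]

    # The second frame "comprises the background pattern" when every select region
    # still shows approximately the same pixel statistics as in the first frame.
    return all(
        abs(second_frame[y0:y1, x0:x1].mean() - p) <= tol
        for (x0, y0, x1, y1), p in zip(select_regions, pattern)
    )
```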
Regarding claim 2, CHEN explicitly teaches the method of claim 1,
CHEN further explicitly teaches wherein receiving the location comprises receiving coordinates of a geometric shape (Fig. 8A, illustrates geometric shapes, i.e. rectangular bounding boxes or blobs. Paragraph [0224]-CHEN discloses representing each bounding box with (x, y, w, h), where (x, y) is the upper-left coordinate of a bounding box, w and h are the width and height of the bounding box, respectively.) approximating the boundary of the object with margins outside of the object (Fig. 8A, illustrates a boundary of the object with margins outside of the object. Please see annotated Fig. 8A below. Paragraph [0224]) and wherein the select regions comprise portions of the geometric shape in the margins outside of the object (Fig. 8A, illustrates a region comprised of portions of a shape in the margins outside of the object (wherein the dog is the object, the bounding box is the margin, and tile floor and cat’s paw are select regions in the margins outside of the object). Paragraph [0197]-CHEN discloses FIG. 8A includes an image and FIG. 8B and FIG. 8C include diagrams illustrating how SSD (with the VGG deep network base model) operates. For example, SSD matches objects with default boxes of different aspect ratios (shown as dashed rectangles in FIG. 8B and FIG. 8C). Each element of the feature map has a number of default boxes associated with it. Please see annotated Fig. 8A below.)).
Annotated diagram of CHEN’s Fig. 8A illustrating a bounding box (i.e. a geometric shape with margins) surrounding an object (i.e. a dog) and select regions inside of the margins but outside of the object (i.e. the cat’s paw and the tile floor)
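As an aid to the annotation above, the following small sketch illustrates a bounding box whose coordinates approximate the object boundary with margins, and select regions taken from those margins. Only the (x, y, w, h) convention is taken from CHEN's paragraph [0224]; the names, the strip-based margin regions, and the margin width are hypothetical.

```python
from dataclasses import dataclass

@dataclass
class BoundingBox:
    # Per CHEN [0224]: (x, y) is the upper-left coordinate; w and h are width and height.
    x: int
    y: int
    w: int
    h: int

def margin_regions(box: BoundingBox, margin: int = 8):
    """Illustrative: thin strips just inside the box edges, i.e. portions of the
    geometric shape that lie in the margins outside of the enclosed object."""
    x, y, w, h = box.x, box.y, box.w, box.h
    return [
        (x, y, x + w, y + margin),          # top strip
        (x, y + h - margin, x + w, y + h),  # bottom strip
        (x, y, x + margin, y + h),          # left strip
        (x + w - margin, y, x + w, y + h),  # right strip
    ]
```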
Regarding claim 3, CHEN explicitly teaches the method of claim 1,
CHEN further explicitly teaches wherein the background pattern comprises pixel values (Fig. 3. Paragraph [0146]-CHEN discloses a classification of the pixel to either a foreground pixel or a background pixel is done by comparing the difference between the pixel value and the mean of the designated Gaussian model. In one illustrative example, if the distance of the pixel value and the Gaussian Mean is less than 3 times of the variance, the pixel is classified as a background pixel).
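The per-pixel classification quoted from CHEN's paragraph [0146] can be restated as the following illustrative sketch, assuming a single Gaussian model per pixel; the "less than 3 times of the variance" threshold is reproduced as quoted and is not asserted to be CHEN's only embodiment.

```python
def classify_pixel(pixel_value: float, gaussian_mean: float, gaussian_variance: float) -> str:
    """Background/foreground decision as quoted from CHEN [0146]: the pixel is
    background when the distance between the pixel value and the Gaussian mean
    is less than 3 times the variance; otherwise it is foreground."""
    distance = abs(pixel_value - gaussian_mean)
    return "background" if distance < 3.0 * gaussian_variance else "foreground"
```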
Regarding claim 5, CHEN explicitly teaches the method of claim 1,
CHEN further explicitly teaches wherein the main object detector is a trainable model using training data, and wherein the training data comprises a plurality of labeled second images (Fig. 12. Paragraph [0193]-CHEN discloses a complex object detector can be based on a trained classification neural network, such as a deep learning network (also referred to herein as a deep network and a deep neural network), that can be used to classify and/or localize objects in a video frame. A trained deep learning network can identify objects in an image based on knowledge gleaned from training images (or other data) that include similar objects and labels indicating the classification of those objects (wherein training images are labeled second images).).
Regarding claim 6, CHEN explicitly teaches the method of claim 1,
CHEN further explicitly teaches wherein determining the background pattern of the object in the first image comprises (Fig. 3. Paragraph [0146]-CHEN discloses the background subtraction engine 312 can model the background of a scene (e.g., captured in the video sequence) using any suitable background subtraction technique (also referred to as background extraction).):
receiving a geometric shape (Fig. 8A, illustrates geometric shapes, i.e. bounding boxes or blobs. Paragraph [0224]-CHEN discloses representing each bounding box with (x, y, w, h), where (x, y) is the upper-left coordinate of a bounding box, w and h are the width and height of the bounding box, respectively.) approximating boundary of the object in the first image with margins outside of the object (Fig. 8A, illustrates a boundary of the object with margins outside of the object. Paragraph [0197]-CHEN discloses FIG. 8A includes an image and FIG. 8B and FIG. 8C include diagrams illustrating how SSD (with the VGG deep network base model) operates. For example, SSD matches objects with default boxes of different aspect ratios (shown as dashed rectangles in FIG. 8B and FIG. 8C). Each element of the feature map has a number of default boxes associated with it. Any default box with an intersection-over-union with a ground truth box over a threshold (e.g., 0.4, 0.5, 0.6, or other suitable threshold) is considered a match for the object. Please see annotated Fig. 8A below.);
determining a range of pixel values (Fig. 3. Paragraph [0146]-CHEN discloses if the distance of the pixel value and the Gaussian Mean is less than 3 times of the variance, the pixel is classified as a background pixel (wherein less than 3 times of the variance is a range).) in the select regions in the geometric shape in first image in the received location (Figs. 33A-B, illustrate select regions that the range of pixel values is applied to. Paragraph [0415]-CHEN discloses FIG. 33A is a video frame 3300A with a person (represented by bounding box 3302) that is standing still. FIG. 33B is a portion of the foreground mask binary image 3300B corresponding to the bounding box 3302. As shown in the portion of the foreground mask binary image 3300B, multiple separated blobs are detected for the same person, due to background subtraction absorbing portions of the person into the background portion of the foreground mask binary image (wherein the select regions are blobs).), wherein the select regions are selected to be in the margins outside of the object (Fig. 8A, illustrates a region comprised of portions of a shape in the margins outside of the object (wherein the dog is the object, the bounding box is the margin, and tile floor and cat’s paw are select regions in the margins outside of the object). Paragraph [0197]-CHEN discloses FIG. 8A includes an image and FIG. 8B and FIG. 8C include diagrams illustrating how SSD (with the VGG deep network base model) operates. For example, SSD matches objects with default boxes of different aspect ratios (shown as dashed rectangles in FIG. 8B and FIG. 8C). Each element of the feature map has a number of default boxes associated with it. Please see annotated Fig. 8A below.));
determining the background pattern, at least in part, based on the range of pixel values (Fig. 3. Paragraph [0146]-CHEN discloses if the distance of the pixel value and the Gaussian Mean is less than 3 times of the variance, the pixel is classified as a background pixel (wherein less than 3 times of the variance is a range).).
Annotated diagram of CHEN’s Fig. 8A illustrating a bounding box (i.e. a geometric shape with margins) surrounding an object (i.e. a dog) and select regions inside of the margins but outside of the object (i.e. the cat’s paw and the tile floor)
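To illustrate the "range of pixel values" limitation mapped above, a minimal sketch follows. It is not taken from CHEN; the names and the (x0, y0, x1, y1) region format are hypothetical. The background pattern is derived, at least in part, from the range of pixel values observed in the margin regions of the first image.

```python
import numpy as np

def pattern_from_margin_regions(frame: np.ndarray, margin_regions):
    """Illustrative: a (low, high) pixel-value range taken over the select regions,
    which are chosen to lie in the margins outside of the object."""
    values = np.concatenate(
        [frame[y0:y1, x0:x1].ravel() for (x0, y0, x1, y1) in margin_regions]
    )
    return float(values.min()), float(values.max())
```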
Regarding claim 7, CHEN explicitly teaches the method of claim 1,
CHEN further explicitly teaches wherein the select regions are chosen to be in regions having minimum variation in pixel color values (Fig. 32. Paragraph [0146]-CHEN discloses a classification of the pixel to either a foreground pixel or a background pixel is done by comparing the difference between the pixel value and the mean of the designated Gaussian model. In one illustrative example, if the distance of the pixel value and the Gaussian Mean is less than 3 times of the variance, the pixel is classified as a background pixel. Otherwise, in this illustrative example, the pixel is classified as a foreground pixel. Further in paragraph [0414]-CHEN discloses FIG. 32B is a foreground mask binary image 3200B showing that the person 3202, the person 3204, and the person 3206 are detected as a merged blob (represented by bounding box 3210) (wherein the select region is the foreground mask comprised of foreground pixels).).
Regarding claim 10, CHEN explicitly teaches non-transitory computer storage that stores executable program instructions that (Fig. 12. Paragraph [0297]-CHEN discloses the code may be stored on a computer-readable or machine-readable storage medium, for example, in the form of a computer program comprising a plurality of instructions executable by one or more processors. The computer-readable or machine-readable storage medium may be non-transitory.), when executed by one or more computing devices (Fig. 12. Paragraph [0295]-CHEN discloses the process 1800 may be performed by a computing device or an apparatus, such as the video analytics system 100.), configure the one or more computing devices to perform operations comprising (Fig. 12. Paragraph [0295]-CHEN discloses the process 1800 may be performed by a computing device or an apparatus, such as the video analytics system 100. In one illustrative example, the process 1800 can be performed by the video analytics system 1200 shown in FIG. 12 and FIG. 13. In some cases, the computing device or apparatus may include a processor, microprocessor, microcomputer, or other component of a device that is configured to carry out the steps of process 1800.):
receiving a video feed (Fig. 1. Paragraph [0136]-CHEN discloses FIG. 1 is a block diagram illustrating an example of a video analytics system 100. The video analytics system 100 receives video frames 102 from a video source 130.);
initializing a main object detector (Fig. 12, #1208 called a deep learning system. Paragraph [0210]-CHEN discloses the deep learning system 1208 can implement a complex object detector.) configured to receive image frames from the video feed (Fig. 12, #1202 called video frames. Paragraph [0210]-CHEN discloses the deep learning system 1208 can implement a complex object detector. For example, the complex object detector can be implemented using one or more trained neural networks (e.g., a deep learning network) to one or more of the frames 1202 of the received video sequence to locate and classify objects in the one or more frames.) and determine whether an object of interest is present in the image frames (Fig. 12. Paragraph [0210]-CHEN discloses An output of the deep learning system 1208 can include a set of detector bounding boxes representing the detected and classified objects.).);
executing a backup object detector (Fig. 2, #204N called a blob detection system. Paragraph [0144]) configured to perform operations comprising (Fig. 2. Paragraph [0144]-CHEN discloses the blob detection system 204N generates foreground blobs 208N for the frame N 202N. The object tracking system 206N can then perform temporal tracking of the blobs 208N.):
receiving a location of the object of interest in a first image of the video feed (Fig. 2. Paragraph [0142]-CHEN discloses the blob trackers can be updated, including in terms of positions of the trackers, according to the data association to generate updated blob trackers 310A. For example, a blob tracker's state and location for the video frame A 202A can be calculated and updated. The blob tracker's location in a next video frame N 202N can also be predicted from the current video frame A 202A. For example, the predicted location of a blob tracker for the next video frame N 202N can include the location of the blob tracker (and its associated blob) in the current video frame A 202A. Tracking of blobs of the current frame A 202A can be performed once the updated blob trackers 310A are generated (wherein the location of the object of interest is a blob).);
determining a background pattern of the object of interest in the first image (Fig. 3. Paragraph [0145]-CHEN discloses the blob detection system 104 includes a background subtraction engine 312 that receives video frames 302. The background subtraction engine 312 can perform background subtraction to detect foreground pixels in one or more of the video frames 302. For example, the background subtraction can be used to segment moving objects from the global background in a video sequence and to generate a foreground-background binary mask (referred to herein as a foreground mask (wherein the foreground-background binary mask is the background pattern).);
receiving a second image of the video feed (Fig. 2, illustrates the receiving of a second video frame, #202N called video frame N. Paragraph [0144]-CHEN discloses when a next video frame N 202N is received, the blob detection system 204N generates foreground blobs 208N for the frame N 202N.);
determining whether select regions (Fig. 8A, illustrates selected regions being boxes surrounding a cat and a dog.) of the second image in the received location comprise the background pattern (Fig. 8A-8C. Paragraph [0198]-CHEN discloses for each default box in each cell, the SSD neural network outputs a probability vector of length c, where c is the number of classes, representing the probabilities of the box containing an object of each class. In some cases, a background class is included that indicates that there is no object in the box (wherein a box is a select region). Further in paragraph [0198]-CHEN discloses for the image shown in FIG. 8A, all probability labels would indicate the background class with the exception of the three matched boxes (two for the cat, one for the dog).); and
labeling the second image (Fig. 10, illustrates a second image called a frame #1000.) as comprising the object of interest when the select regions in the received location comprise the background pattern (Fig. 10, illustrates a classified object with the background pattern in the selected region called bounding box #1004. Paragraph [0217]-CHEN discloses the deep learning system 1208 can generate and output classifications and confidence levels (also referred to as confidence values) for each object detected in a key frame. One illustrative example is shown by the classifications and confidence levels shown in FIG. 10 (the object classified as a person with a 93% confidence using bounding box 1004). A classification and confidence level determined for an object can be associated with the bounding box determined for the object. For instance, the deep learning network applied by the deep learning system 1208 may provide detector bounding boxes 1323 for a key frame, along with a category classification and a confidence level (CL) associated with each detector bounding box. The object classification indicates a category determined for an object detected in a key frame using the deep learning classification network. Any number of classes or categories can be determined for an object, such as a person, a car, or other suitable object class that the deep network is configured to detect and classify.).
Regarding claim 11, CHEN explicitly teaches the non-transitory computer storage of claim 10,
CHEN further explicitly teaches wherein receiving the location comprises receiving coordinates of a geometric shape (Fig. 8A, illustrates geometric shapes, i.e. rectangular bounding boxes or blobs. Paragraph [0224]-CHEN discloses representing each bounding box with (x, y, w, h), where (x, y) is the upper-left coordinate of a bounding box, w and h are the width and height of the bounding box, respectively.) approximating the boundary of the object with margins outside of the object (Fig. 8A, illustrates a boundary of the object with margins outside of the object. Please see annotated Fig. 8A below. Paragraph [0224]) and wherein the select regions comprise portions of the geometric shape in the margins outside of the object (Fig. 8A, illustrates a region comprised of portions of a shape in the margins outside of the object (wherein the dog is the object, the bounding box is the margin, and tile floor and cat’s paw are select regions in the margins outside of the object). Paragraph [0197]-CHEN discloses FIG. 8A includes an image and FIG. 8B and FIG. 8C include diagrams illustrating how SSD (with the VGG deep network base model) operates. For example, SSD matches objects with default boxes of different aspect ratios (shown as dashed rectangles in FIG. 8B and FIG. 8C). Each element of the feature map has a number of default boxes associated with it. Please see annotated Fig. 8A below.)).
Annotated diagram of CHEN’s Fig. 8A illustrating a bounding box (i.e. a geometric shape with margins) surrounding an object (i.e. a dog) and select regions inside of the margins but outside of the object (i.e. the cat’s paw and the tile floor)
Regarding claim 12, CHEN explicitly teaches the non-transitory computer storage of claim 10,
CHEN further explicitly teaches wherein the background pattern comprises pixel values (Fig. 3. Paragraph [0146]-CHEN discloses a classification of the pixel to either a foreground pixel or a background pixel is done by comparing the difference between the pixel value and the mean of the designated Gaussian model. In one illustrative example, if the distance of the pixel value and the Gaussian Mean is less than 3 times of the variance, the pixel is classified as a background pixel).
Regarding claim 14, CHEN explicitly teaches the non-transitory computer storage of claim 10,
CHEN further explicitly teaches wherein the main object detector is a trainable model using training data, and wherein the training data comprises a plurality of labeled second images (Fig. 12. Paragraph [0193]-CHEN discloses a complex object detector can be based on a trained classification neural network, such as a deep learning network (also referred to herein as a deep network and a deep neural network), that can be used to classify and/or localize objects in a video frame. A trained deep learning network can identify objects in an image based on knowledge gleaned from training images (or other data) that include similar objects and labels indicating the classification of those objects (wherein training images are labeled second images).).
Regarding claim 15, CHEN explicitly teaches the non-transitory computer storage of claim 10,
CHEN further explicitly teaches wherein determining the background pattern of the object in the first image comprises (Fig. 3. Paragraph [0146]-CHEN discloses the background subtraction engine 312 can model the background of a scene (e.g., captured in the video sequence) using any suitable background subtraction technique (also referred to as background extraction).):
receiving a geometric shape (Fig. 8A, illustrates geometric shapes, i.e. bounding boxes or blobs. Paragraph [0224]-CHEN discloses representing each bounding box with (x, y, w, h), where (x, y) is the upper-left coordinate of a bounding box, w and h are the width and height of the bounding box, respectively.) approximating boundary of the object in the first image with margins outside of the object (Fig. 8A, illustrates a boundary of the object with margins outside of the object. Paragraph [0197]-CHEN discloses FIG. 8A includes an image and FIG. 8B and FIG. 8C include diagrams illustrating how SSD (with the VGG deep network base model) operates. For example, SSD matches objects with default boxes of different aspect ratios (shown as dashed rectangles in FIG. 8B and FIG. 8C). Each element of the feature map has a number of default boxes associated with it. Any default box with an intersection-over-union with a ground truth box over a threshold (e.g., 0.4, 0.5, 0.6, or other suitable threshold) is considered a match for the object. Please see annotated Fig. 8A below.);
determining a range of pixel values (Fig. 3. Paragraph [0146]-CHEN discloses if the distance of the pixel value and the Gaussian Mean is less than 3 times of the variance, the pixel is classified as a background pixel (wherein less than 3 times of the variance is a range).) in the select regions in the geometric shape in first image in the received location (Figs. 33A-B, illustrate select regions that the range of pixel values is applied to. Paragraph [0415]-CHEN discloses FIG. 33A is a video frame 3300A with a person (represented by bounding box 3302) that is standing still. FIG. 33B is a portion of the foreground mask binary image 3300B corresponding to the bounding box 3302. As shown in the portion of the foreground mask binary image 3300B, multiple separated blobs are detected for the same person, due to background subtraction absorbing portions of the person into the background portion of the foreground mask binary image (wherein the select regions are blobs).), wherein the select regions are selected to be in the margins outside of the object (Fig. 8A, illustrates a region comprised of portions of a shape in the margins outside of the object (wherein the dog is the object, the bounding box is the margin, and tile floor and cat’s paw are select regions in the margins outside of the object). Paragraph [0197]-CHEN discloses FIG. 8A includes an image and FIG. 8B and FIG. 8C include diagrams illustrating how SSD (with the VGG deep network base model) operates. For example, SSD matches objects with default boxes of different aspect ratios (shown as dashed rectangles in FIG. 8B and FIG. 8C). Each element of the feature map has a number of default boxes associated with it. Please see annotated Fig. 8A below.));
determining the background pattern, at least in part, based on the range of pixel values (Fig. 3. Paragraph [0146]-CHEN discloses if the distance of the pixel value and the Gaussian Mean is less than 3 times of the variance, the pixel is classified as a background pixel (wherein less than 3 times of the variance is a range).).
Annotated diagram of CHEN’s Fig. 8A illustrating a bounding box (i.e. a geometric shape with margins) surrounding an object (i.e. a dog) and select regions inside of the margins but outside of the object (i.e. the cat’s paw and the tile floor)
Regarding claim 16, CHEN explicitly teaches the non-transitory computer storage of claim 10,
CHEN further explicitly teaches wherein the select regions are chosen to be in regions having minimum variation in pixel color values (Fig. 32. Paragraph [0146]-CHEN discloses a classification of the pixel to either a foreground pixel or a background pixel is done by comparing the difference between the pixel value and the mean of the designated Gaussian model. In one illustrative example, if the distance of the pixel value and the Gaussian Mean is less than 3 times of the variance, the pixel is classified as a background pixel. Otherwise, in this illustrative example, the pixel is classified as a foreground pixel. Further in paragraph [0414]-CHEN discloses FIG. 32B is a foreground mask binary image 3200B showing that the person 3202, the person 3204, and the person 3206 are detected as a merged blob (represented by bounding box 3210) (wherein the select region is the foreground mask comprised of foreground pixels).).
Regarding claim 19, CHEN explicitly teaches a system comprising a processor (Fig. 12. Paragraph [0295]-CHEN discloses the process 1800 may be performed by a computing device or an apparatus, such as the video analytics system 100. In one illustrative example, the process 1800 can be performed by the video analytics system 1200 shown in FIG. 12 and FIG. 13. In some cases, the computing device or apparatus may include a processor, microprocessor, microcomputer, or other component of a device that is configured to carry out the steps of process 1800.), the processor configured to perform operations comprising (Fig. 12. Paragraph [0297]-CHEN discloses process 1800 may be performed under the control of one or more computer systems configured with executable instructions and may be implemented as code (e.g., executable instructions, one or more computer programs, or one or more applications) executing collectively on one or more processors, by hardware, or combinations thereof.):
receiving a video feed (Fig. 1. Paragraph [0136]-CHEN discloses FIG. 1 is a block diagram illustrating an example of a video analytics system 100. The video analytics system 100 receives video frames 102 from a video source 130.);
initializing a main object detector (Fig. 12, #1208 called a deep learning system. Paragraph [0210]-CHEN discloses the deep learning system 1208 can implement a complex object detector.) configured to receive image frames from the video feed (Fig. 12, #1202 called video frames. Paragraph [0210]-CHEN discloses the deep learning system 1208 can implement a complex object detector. For example, the complex object detector can be implemented using one or more trained neural networks (e.g., a deep learning network) to one or more of the frames 1202 of the received video sequence to locate and classify objects in the one or more frames.) and determine whether an object of interest is present in the image frames (Fig. 12. Paragraph [0210]-CHEN discloses An output of the deep learning system 1208 can include a set of detector bounding boxes representing the detected and classified objects.).);
executing a backup object detector (Fig. 2, #204N called a blob detection system. Paragraph [0144]) configured to perform operations comprising (Fig. 2. Paragraph [0144]-CHEN discloses the blob detection system 204N generates foreground blobs 208N for the frame N 202N. The object tracking system 206N can then perform temporal tracking of the blobs 208N.):
receiving a location of the object of interest in a first image of the video feed (Fig. 2. Paragraph [0142]-CHEN discloses the blob trackers can be updated, including in terms of positions of the trackers, according to the data association to generate updated blob trackers 310A. For example, a blob tracker's state and location for the video frame A 202A can be calculated and updated. The blob tracker's location in a next video frame N 202N can also be predicted from the current video frame A 202A. For example, the predicted location of a blob tracker for the next video frame N 202N can include the location of the blob tracker (and its associated blob) in the current video frame A 202A. Tracking of blobs of the current frame A 202A can be performed once the updated blob trackers 310A are generated (wherein the location of the object of interest is a blob).);
determining a background pattern of the object of interest in the first image (Fig. 3. Paragraph [0145]-CHEN discloses the blob detection system 104 includes a background subtraction engine 312 that receives video frames 302. The background subtraction engine 312 can perform background subtraction to detect foreground pixels in one or more of the video frames 302. For example, the background subtraction can be used to segment moving objects from the global background in a video sequence and to generate a foreground-background binary mask (referred to herein as a foreground mask (wherein the foreground-background binary mask is the background pattern).);
receiving a second image of the video feed (Fig. 2, illustrates the receiving of a second video frame, #202N called video frame N. Paragraph [0144]-CHEN discloses when a next video frame N 202N is received, the blob detection system 204N generates foreground blobs 208N for the frame N 202N.);
determining whether select regions (Fig. 8A, illustrates selected regions being boxes surrounding a cat and a dog.) of the second image in the received location comprise the background pattern (Fig. 8A-8C. Paragraph [0198]-CHEN discloses for each default box in each cell, the SSD neural network outputs a probability vector of length c, where c is the number of classes, representing the probabilities of the box containing an object of each class. In some cases, a background class is included that indicates that there is no object in the box (wherein a box is a select region). Further in paragraph [0198]-CHEN discloses for the image shown in FIG. 8A, all probability labels would indicate the background class with the exception of the three matched boxes (two for the cat, one for the dog).); and
labeling the second image (Fig. 10, illustrates a second image called a frame #1000.) as comprising the object of interest when the select regions in the received location comprise the background pattern (Fig. 10, illustrates a classified object with the background pattern in the selected region called bounding box #1004. Paragraph [0217]-CHEN discloses the deep learning system 1208 can generate and output classifications and confidence levels (also referred to as confidence values) for each object detected in a key frame. One illustrative example is shown by the classifications and confidence levels shown in FIG. 10 (the object classified as a person with a 93% confidence using bounding box 1004). A classification and confidence level determined for an object can be associated with the bounding box determined for the object. For instance, the deep learning network applied by the deep learning system 1208 may provide detector bounding boxes 1323 for a key frame, along with a category classification and a confidence level (CL) associated with each detector bounding box. The object classification indicates a category determined for an object detected in a key frame using the deep learning classification network. Any number of classes or categories can be determined for an object, such as a person, a car, or other suitable object class that the deep network is configured to detect and classify.).
Regarding claim 20, CHEN explicitly teaches the system of claim 19,
CHEN further explicitly teaches wherein determining the background pattern of the object in the first image comprises (Fig. 3. Paragraph [0146]-CHEN discloses the background subtraction engine 312 can model the background of a scene (e.g., captured in the video sequence) using any suitable background subtraction technique (also referred to as background extraction).):
receiving a geometric shape (Fig. 8A, illustrates geometric shapes, i.e. bounding boxes or blobs. Paragraph [0224]-CHEN discloses representing each bounding box with (x, y, w, h), where (x, y) is the upper-left coordinate of a bounding box, w and h are the width and height of the bounding box, respectively.) approximating boundary of the object in the first image with margins outside of the object (Fig. 8A, illustrates a boundary of the object with margins outside of the object. Paragraph [0197]-CHEN discloses FIG. 8A includes an image and FIG. 8B and FIG. 8C include diagrams illustrating how SSD (with the VGG deep network base model) operates. For example, SSD matches objects with default boxes of different aspect ratios (shown as dashed rectangles in FIG. 8B and FIG. 8C). Each element of the feature map has a number of default boxes associated with it. Any default box with an intersection-over-union with a ground truth box over a threshold (e.g., 0.4, 0.5, 0.6, or other suitable threshold) is considered a match for the object. Please see annotated Fig. 8A below.);
determining a range of pixel values (Fig. 3. Paragraph [0146]-CHEN discloses if the distance of the pixel value and the Gaussian Mean is less than 3 times of the variance, the pixel is classified as a background pixel (wherein less than 3 times of the variance is a range).) in the select regions in the geometric shape in first image in the received location (Figs. 33A-B, illustrate select regions that the range of pixel values is applied to. Paragraph [0415]-CHEN discloses FIG. 33A is a video frame 3300A with a person (represented by bounding box 3302) that is standing still. FIG. 33B is a portion of the foreground mask binary image 3300B corresponding to the bounding box 3302. As shown in the portion of the foreground mask binary image 3300B, multiple separated blobs are detected for the same person, due to background subtraction absorbing portions of the person into the background portion of the foreground mask binary image (wherein the select regions are blobs).), wherein the select regions are selected to be in the margins outside of the object (Fig. 8A, illustrates a region comprised of portions of a shape in the margins outside of the object (wherein the dog is the object, the bounding box is the margin, and tile floor and cat’s paw are select regions in the margins outside of the object). Paragraph [0197]-CHEN discloses FIG. 8A includes an image and FIG. 8B and FIG. 8C include diagrams illustrating how SSD (with the VGG deep network base model) operates. For example, SSD matches objects with default boxes of different aspect ratios (shown as dashed rectangles in FIG. 8B and FIG. 8C). Each element of the feature map has a number of default boxes associated with it. Please see annotated Fig. 8A below.));
determining the background pattern, at least in part, based on the range of pixel values (Fig. 3. Paragraph [0146]-CHEN discloses if the distance of the pixel value and the Gaussian Mean is less than 3 times of the variance, the pixel is classified as a background pixel (wherein less than 3 times of the variance is a range).).
Annotated diagram of CHEN’s Fig. 8A illustrating a bounding box (i.e. a geometric shape with margins) surrounding an object (i.e. a dog) and select regions inside of the margins but outside of the object (i.e. the cat’s paw and the tile floor)
Claim Rejections - 35 USC § 103
In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis (i.e., changing from AIA to pre-AIA ) for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.
Claims 4, 9, 13, and 18 are rejected under 35 U.S.C. 103 as being unpatentable over CHEN et al. (US 20190130580 A1), hereinafter referenced as CHEN, in view of HWANG et al. (US 20100202690 A1), hereinafter referenced as HWANG, and further in view of PORIKLI et al. (US 20190213406 A1), hereinafter referenced as PORIKLI.
Regarding claim 4, CHEN explicitly teaches the method of claim 1,
CHEN further explicitly teaches wherein the method further comprises (Fig. 2. Paragraph [0143]-CHEN discloses an example of the video analytics system (e.g., video analytics system 100) processing video frames across time t.):
storing the select regions of the second image frame (Fig. 4. Paragraph [0139]-CHEN discloses the prediction of the location of the blob tracker in the current frame can be based on the location of the blob in the previous frame. A history or motion model can be maintained for a blob tracker, including a history of various states, a history of the velocity, and a history of location, of continuous frames, for the blob tracker, as described in more detail below (wherein the blob tracker is a select region).) when the select regions comprise the background pattern in the received location (Fig. 8A-8C, illustrate select regions with the background pattern in received locations. Paragraph [0198]-CHEN discloses for each default box in each cell, the SSD neural network outputs a probability vector of length c, where c is the number of classes, representing the probabilities of the box containing an object of each class. In some cases, a background class is included that indicates that there is no object in the box (wherein a box is a select region). Further in paragraph [0198]-CHEN discloses for the image shown in FIG. 8A, all probability labels would indicate the background class with the exception of the three matched boxes (two for the cat, one for the dog).);
receiving a third image frame (Figs. 11A-C, illustrate three image frames. Paragraph [0126]-CHEN discloses the video analytics system 100 receives video frames 102 from a video source 130. The video frames 102 can also be referred to herein as a video picture or a picture. The video frames 102 can be part of one or more video sequences (wherein a video comprises multiple image frames).);
determining whether the third image frame in the select regions (Fig. 8A, illustrates selected regions being boxes surrounding a cat and a dog (wherein Fig. 8A is the third image frame from a video sequence).) comprises the background pattern in the received location (Fig. 8A-8C. Paragraph [0198]-CHEN discloses for each default box in each cell, the SSD neural network outputs a probability vector of length c, where c is the number of classes, representing the probabilities of the box containing an object of each class. In some cases, a background class is included that indicates that there is no object in the box (wherein a box represents the received location). Further in paragraph [0198]-CHEN discloses for the image shown in FIG. 8A, all probability labels would indicate the background class with the exception of the three matched boxes (two for the cat, one for the dog).);
CHEN fails to explicitly teach wherein the object comprises text.
However, HWANG explicitly teaches wherein the object comprises text (Fig. 3. Paragraph [0024]-HWANG discloses in FIG. 3, an object intended for character recognition according to an embodiment of the present invention is a signboard photographed image. Of the signboard photographed image, the background region 31, a center text region 32 necessary for delivering advertisement information and also an outer boundary region 33a and an inner boundary region 33b, 33c wrap the center text region 32 that corresponds to a character recognition object according to an embodiment of the present invention.)
Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to combine the teachings of CHEN of a method comprising: receiving a video feed; initializing a main object detector configured to receive image frames from the video feed and determine whether an object of interest is present in the image frames; executing a backup object detector configured to perform operations comprising: receiving a location of the object of interest in a first image of the video feed; determining a background pattern of the object of interest in the first image, with the teachings of HWANG, wherein the object comprises text.
The combination results in CHEN’s object detection system wherein the object comprises text.
The motivation behind the modification would have been to obtain an object detection system that enhances the efficiency and recognition ability of the detection system. Both CHEN and HWANG relate to detecting objects in images: in CHEN, the amount of computer resources (e.g., devices, storage, and processor usage) required to generate detection, tracking, and classification results is reduced, while HWANG improves upon the prior art by extracting and recognizing the text from signboards even when the text may not be recognized normally. Please see CHEN et al. (US 20190130580 A1), Paragraph [0207], and HWANG et al. (US 20100202690 A1), Paragraph [0007].
CHEN in view of HWANG fail to explicitly teach performing optical character recognition on the second image frame in the received location; performing a pixel comparison between the received location in the second and third image frames; and when a similarity threshold between the pixels in the received location in the second and third image frames are above a predetermined threshold, bypassing performing optical character recognition on the third image frame.
However, PORIKLI explicitly teaches performing optical character recognition on the second image frame in the received location (Fig. 4. Paragraph [0096]-PORIKLI discloses hand detection is triggered and performed in the second image frame. The global ROI detector may include a machine learning component to perform the object detection. The machine learning component may utilize deep learning technology (e.g., convolutional neural networks (CNN), recurrent neural networks (RNN), or long/short term memory (LSTM)) to learn to recognize features in an image that represent the first object of interest. In some aspects, these image features can include different shapes, colors, scales, and motions that indicate a hand (wherein the character being recognized is a hand).);
performing a pixel comparison (Fig. 4, #410 determine similarity score between images. Paragraph [0084]) between the received location in the second and third image frames (Fig. 4. Paragraph [0084]-PORIKLI discloses the raw image frames are fed into the spatial constraint component which may use a similarity estimation algorithm to determine a similarity score for the images. In the similarity estimation, similar images are given a higher similarity score. The similarity score may reflect global similarity between the first image and the second image or may reflect similarity between a first windowed portion of the first image frame and the first windowed portion of the second image frame (wherein the windowed portions of the first and second images are received locations of the object in the second and third images and the images and windowed portions of images are comprised of pixels).); and
when a similarity threshold between the pixels in the received location in the second and third image frames (Fig. 4. Paragraph [0084]-PORIKLI discloses the raw image frames are fed into the spatial constraint component which may use a similarity estimation algorithm to determine a similarity score for the images. In the similarity estimation, similar images are given a higher similarity score. The similarity score may reflect global similarity between the first image and the second image or may reflect similarity between a first windowed portion of the first image frame and the first windowed portion of the second image frame (wherein a similarity score is a similarity threshold, the first and second images and windowed portions of the first and second images are comprised of pixels, and the received locations are windowed portions).) are above a predetermined threshold, bypassing performing optical character recognition on the third image frame (Fig. 4. Paragraph [0085]-PORIKLI discloses the similarity score threshold that determines to skip an image may be manually specified (e.g., programmed) or can be learned by the activity recognition device from training data. The number of frames skipped or omitted in the hand detection processing is determined by the similarity score threshold (wherein hand detection processing is optical character recognition and the hand is the character being recognized).).
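For illustration only, and not as an implementation drawn from PORIKLI or any other cited reference, the frame-skipping behavior recited above (optical character recognition on the second image frame in the received location, a pixel comparison of that location across the second and third image frames, and bypassing OCR on the third frame when the similarity is above a predetermined threshold) can be sketched as follows; the similarity measure, the threshold value, and the run_ocr callable are assumptions.

    import numpy as np

    def location_similarity(frame_a, frame_b, location):
        """Pixel comparison of the received location (x, y, w, h) in two frames.
        Returns a score in [0, 1]; the normalized mean-absolute-difference used here
        is an assumed measure, not one disclosed by the cited references."""
        x, y, w, h = location
        roi_a = frame_a[y:y + h, x:x + w].astype(np.float32)
        roi_b = frame_b[y:y + h, x:x + w].astype(np.float32)
        return 1.0 - float(np.mean(np.abs(roi_a - roi_b))) / 255.0

    def ocr_with_bypass(second_frame, third_frame, location, run_ocr, threshold=0.95):
        """Perform OCR on the second frame in the received location; bypass OCR on the
        third frame when the pixels in that location are sufficiently similar."""
        second_text = run_ocr(second_frame, location)
        if location_similarity(second_frame, third_frame, location) >= threshold:
            return second_text, None          # similarity above threshold: skip OCR
        return second_text, run_ocr(third_frame, location)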
Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to combine the teachings of CHEN in view of HWANG of a method comprising: receiving a video feed; initializing a main object detector configured to receive image frames from the video feed and determine whether an object of interest is present in the image frames; executing a backup object detector configured to perform operations comprising: receiving a location of the object of interest in a first image of the video feed; determining a background pattern of the object of interest in the first image, with the teachings of PORIKLI of performing optical character recognition on the second image frame in the received location; performing a pixel comparison between the received location in the second and third image frames; and when a similarity threshold between the pixels in the received location in the second and third image frames are above a predetermined threshold, bypassing performing optical character recognition on the third image frame.
The combination results in CHEN’s object detection system performing optical character recognition on the second image frame in the received location; performing a pixel comparison between the received location in the second and third image frames; and, when a similarity threshold between the pixels in the received location in the second and third image frames are above a predetermined threshold, bypassing performing optical character recognition on the third image frame.
The motivation behind the modification would have been to obtain an object detection system that enhances the efficiency and recognition ability of the detection system. Both CHEN and PORIKLI relate to detecting objects in images: in CHEN, the amount of computer resources (e.g., devices, storage, and processor usage) required to generate detection, tracking, and classification results is reduced, while PORIKLI recognized a need for improved activity detection for vehicle perception. Please see CHEN et al. (US 20190130580 A1), Paragraph [0207], and PORIKLI et al. (US 20190213406 A1), Paragraph [0002].
Regarding claim 9, CHEN explicitly teaches the method of claim 1,
CHEN further explicitly teaches that the method further comprises (Fig. 2. Paragraph [0143]-CHEN discloses an example of the video analytics system (e.g., video analytics system 100) processing video frames across time t.):
when the object detector detects the object in an image frame (Figs. 2-3. Paragraph [0138]-CHEN discloses the blob detection system 104 can detect one or more blobs in video frames (e.g., video frames 102) of a video sequence, and the object tracking system 106 can track the one or more blobs across the frames of the video sequence. As used herein, a blob refers to foreground pixels of at least a portion of an object (e.g., a portion of an object or an entire object) in a video frame.), determine whether the object is the same object as a previous image frame (Fig. 5. Paragraph [0139]-CHEN discloses a bounding box for a blob tracker in a current frame can be the bounding box of a previous blob in a previous frame for which the blob tracker was associated. For instance, when the blob tracker is updated in the previous frame (after being associated with the previous blob in the previous frame), updated information for the blob tracker can include the tracking information for the previous frame and also prediction of a location of the blob tracker in the next frame (which is the current frame in this example). The prediction of the location of the blob tracker in the current frame can be based on the location of the blob in the previous frame (wherein a blob is indicative of an object).);
when the object is not the same object, store the object as the previous frame object to compare against future detected objects (Fig. 2. Paragraph [0139]-CHEN discloses a history or motion model can be maintained for a blob tracker, including a history of various states, a history of the velocity, and a history of location, of continuous frames, for the blob tracker, as described in more detail below. Further in paragraph [0143]-CHEN discloses the object tracking system 206A can perform data association to associate or match the blob trackers (e.g., blob trackers generated or updated based on a previous frame or newly generated blob trackers) and blobs 208A using the calculated costs (e.g., using a cost matrix or other suitable association technique). The blob trackers can be updated, including in terms of positions of the trackers, according to the data association to generate updated blob trackers 310A (wherein a not the same object is a newly generated blob).);
when the object detector does not detect the object in the image frame (Fig. 12. Paragraph [0205]-CHEN discloses whenever a person is moving with a normal speed in a scene, the person cannot be detected by a deep learning based complex object detector. FIG. 11A-FIG. 11C show an example of a person 1102 moving within a scene. As shown in FIG. 11A, the person 1102 is standing still, in which case the person is detected with confidence (illustrated by bounding box 1104, a “person” class, and a confidence level or value of 0.88) by a deep learning based detector. However, when the person 1102 is walking with normal speed, as shown in FIG. 11B and FIG. 11C, the person is not detected at all. Such a lack of detection is based on the latency required to detect an object being greater than real-time.), perform object detection using the backup object detector (Fig. 12. Paragraph [0207]-CHEN discloses by applying a combined video analytics object detection/tracking system and a complex object detection system, the problems described above can be avoided. For example, the hybrid video analytics system described herein provides object detection and tracking with high-accuracy, while achieving real-time performance without such latencies. Such high-accuracy, real-time performance can even be achieved by a device (e.g., an IP camera or other device) that does not have a graphics card (wherein the hybrid video analytics system is the backup object detector).);
when the backup object detector detects the object (Fig. 12. Paragraph [0207]-CHEN discloses by applying a combined video analytics object detection/tracking system and a complex object detection system, the problems described above can be avoided. For example, the hybrid video analytics system described herein provides object detection and tracking with high-accuracy, while achieving real-time performance without such latencies. Such high-accuracy, real-time performance can even be achieved by a device (e.g., an IP camera or other device) that does not have a graphics card (wherein the hybrid video analytics system is the backup object detector).), label the image frame as containing the object (Fig. 8A. Paragraph [0209]-CHEN discloses the blob detection system 1204 can perform object detection to detect one or more blobs (representing one or more objects) for the video frames 1202. Blob bounding boxes associated with the blobs are generated by the blob detection system 1204. The blob bounding boxes generated using blob detection can also be referred to as foreground bounding boxes. The blobs and/or the blob bounding boxes can be output for further processing by the video analytics system 1200 (wherein a blob is a label). Further in paragraph [0194]-CHEN discloses the output can include probability values indicating probabilities (or confidence levels or confidence values) that the object includes one or more classes of objects (e.g., a probability the object is a person, a probability the object is a dog, a probability the object is a cat, or the like).) and determine whether the object is the same as object in a previous image frame (Fig. 2. Paragraph [0139]-CHEN discloses a history or motion model can be maintained for a blob tracker, including a history of various states, a history of the velocity, and a history of location, of continuous frames, for the blob tracker, as described in more detail below. Further in paragraph [0143]-CHEN discloses the object tracking system 206A can perform data association to associate or match the blob trackers (e.g., blob trackers generated or updated based on a previous frame or newly generated blob trackers) and blobs 208A using the calculated costs (e.g., using a cost matrix or other suitable association technique). The blob trackers can be updated, including in terms of positions of the trackers, according to the data association to generate updated blob trackers 310A (wherein a blob represents an object).);
when the object detected by the backup object detector is not similar to the object in the previous frame (Fig. 2. Paragraph [0143]-CHEN discloses the object tracking system 206A can perform data association to associate or match the blob trackers (e.g., blob trackers generated or updated based on a previous frame or newly generated blob trackers) and blobs 208A using the calculated costs (e.g., using a cost matrix or other suitable association technique) (wherein a not similar object is indicated by a newly generated blob).), store the object detected by the backup object detector as the previous image frame object to compare against future detected objects (Fig. 2. Paragraph [0139]-CHEN discloses a history or motion model can be maintained for a blob tracker, including a history of various states, a history of the velocity, and a history of location, of continuous frames, for the blob tracker, as described in more detail below. Further in paragraph [0143]-CHEN discloses the object tracking system 206A can perform data association to associate or match the blob trackers (e.g., blob trackers generated or updated based on a previous frame or newly generated blob trackers) and blobs 208A using the calculated costs (e.g., using a cost matrix or other suitable association technique). The blob trackers can be updated, including in terms of positions of the trackers, according to the data association to generate updated blob trackers 310A (wherein object tracking system 206A is the backup object detector).).
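As a hedged sketch of the control flow recited in this claim (a main object detector, a fallback to the backup object detector, a same-object comparison against a previous frame, storage of new objects for future comparison, and the selective OCR bypass addressed in the remainder of this claim), and not code taken from CHEN, HWANG, or PORIKLI, one illustrative arrangement is shown below; the detector objects, the is_same_object comparison, and the run_ocr callable are assumed interfaces.

    def process_frame(frame, main_detector, backup_detector, is_same_object,
                      run_ocr, state):
        """One illustrative pass over the detect-then-compare flow recited above.

        state["previous_object"] caches the last stored object so that later frames
        can be compared against it; every helper used here is an assumed interface
        rather than an API from any cited reference.
        """
        obj = main_detector.detect(frame)
        if obj is None:
            # Main detector found nothing: fall back to the backup object detector.
            obj = backup_detector.detect(frame)
        if obj is None:
            return {"label": "no_object", "text": None}
        prev = state.get("previous_object")
        if prev is not None and is_same_object(obj, prev):
            # Same (or sufficiently similar) object as in a previous frame: bypass OCR.
            return {"label": "contains_object", "text": None}
        # New or dissimilar object: store it to compare against future detections.
        state["previous_object"] = obj
        return {"label": "contains_object", "text": run_ocr(frame, obj)}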
CHEN fails to explicitly teach wherein the object comprises text.
However, HWANG explicitly teaches wherein the object comprises text (Fig. 3, illustrates an object comprised of text with a border region surrounding the text. Paragraph [0024]-HWANG discloses in FIG. 3, an object intended for character recognition according to an embodiment of the present invention is a signboard photographed image. Of the signboard photographed image, the background region 31, a center text region 32 necessary for delivering advertisement information and also an outer boundary region 33 a and an inner boundary region 33 b, 33 c wrap the center text region 32 that corresponds to a character recognition object according to an embodiment of the present invention.)
Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to combine the teachings of CHEN of a method comprising: receiving a video feed; initializing a main object detector configured to receive image frames from the video feed and determine whether an object of interest is present in the image frames; executing a backup object detector configured to perform operations comprising: receiving a location of the object of interest in a first image of the video feed; determining a background pattern of the object of interest in the first image, with the teachings of HWANG, wherein the object comprises text.
The combination results in CHEN’s object detection system wherein the object comprises text.
The motivation behind the modification would have been to obtain an object detection system that enhances the efficiency and recognition ability of the detection system. Both CHEN and HWANG relate to detecting objects in images: in CHEN, the amount of computer resources (e.g., devices, storage, and processor usage) required to generate detection, tracking, and classification results is reduced, while HWANG improves upon the prior art by extracting and recognizing the text from signboards even when the text may not be recognized normally. Please see CHEN et al. (US 20190130580 A1), Paragraph [0207], and HWANG et al. (US 20100202690 A1), Paragraph [0007].
CHEN in view of HWANG fail to explicitly teach when the object is the same object, bypass performing OCR on the object; when the object detected by the backup object detector is similar to an object in a previous frame within a same threshold, bypass performing OCR on the object detected by the backup detector;
However, PORIKLI explicitly teaches when the object is the same object (Fig. 4. Paragraph [0084]-PORIKLI discloses the raw image frames are fed into the spatial constraint component which may use a similarity estimation algorithm to determine a similarity score for the images. In the similarity estimation, similar images are given a higher similarity score.), bypass performing OCR on the object (Fig. 4. Paragraph [0085]-PORIKLI discloses the similarity score threshold that determines to skip an image may be manually specified (e.g., programmed) or can be learned by the activity recognition device from training data. The number of frames skipped or omitted in the hand detection processing is determined by the similarity score threshold (wherein hand detection processing is optical character recognition and the hand is the character being recognized).);
when the object detected by the backup object detector is similar to an object (Fig. 4. Paragraph [0084]-PORIKLI discloses the raw image frames are fed into the spatial constraint component which may use a similarity estimation algorithm to determine a similarity score for the images. In the similarity estimation, similar images are given a higher similarity score. The similarity score may reflect global similarity between the first image and the second image or may reflect similarity between a first windowed portion of the first image frame and the first windowed portion of the second image frame.) in a previous frame within a same threshold, bypass performing OCR on the object detected by the backup detector (Fig. 4. Paragraph [0085]-PORIKLI discloses the similarity score threshold that determines to skip an image may be manually specified (e.g., programmed) or can be learned by the activity recognition device from training data. The number of frames skipped or omitted in the hand detection processing is determined by the similarity score threshold (wherein hand detection processing is optical character recognition and the hand is the character being recognized).).
Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to combine the teachings of CHEN in view of HWANG of a method comprising: receiving a video feed; initializing a main object detector configured to receive image frames from the video feed and determine whether an object of interest is present in the image frames; executing a backup object detector configured to perform operations comprising: receiving a location of the object of interest in a first image of the video feed; determining a background pattern of the object of interest in the first image, with the teachings of PORIKLI of when the object is the same object, bypass performing OCR on the object; when the object detected by the backup object detector is similar to an object in a previous frame within a same threshold, bypass performing OCR on the object detected by the backup detector.
The combination results in CHEN’s object detection system that, when the object is the same object, bypasses performing OCR on the object and, when the object detected by the backup object detector is similar to an object in a previous frame within a same threshold, bypasses performing OCR on the object detected by the backup detector.
The motivation behind the modification would have been to obtain an object detection system that enhances the efficiency and recognition ability of the detection system. Both CHEN and PORIKLI relate to detecting objects in images: in CHEN, the amount of computer resources (e.g., devices, storage, and processor usage) required to generate detection, tracking, and classification results is reduced, while PORIKLI recognized a need for improved activity detection for vehicle perception. Please see CHEN et al. (US 20190130580 A1), Paragraph [0207], and PORIKLI et al. (US 20190213406 A1), Paragraph [0002].
Regarding claim 13, CHEN explicitly teaches the non-transitory computer storage of claim 10,
CHEN further explicitly teaches that the operations further comprise (Fig. 2. Paragraph [0143]-CHEN discloses an example of the video analytics system (e.g., video analytics system 100) processing video frames across time t.):
storing the select regions of the second image frame (Fig. 4. Paragraph [0139]-CHEN discloses the prediction of the location of the blob tracker in the current frame can be based on the location of the blob in the previous frame. A history or motion model can be maintained for a blob tracker, including a history of various states, a history of the velocity, and a history of location, of continuous frames, for the blob tracker, as described in more detail below (wherein the blob tracker is a select region).) when the select regions comprise the background pattern in the received location (Fig. 8A-8C, illustrate select regions with the background pattern in received locations. Paragraph [0198]-CHEN discloses for each default box in each cell, the SSD neural network outputs a probability vector of length c, where c is the number of classes, representing the probabilities of the box containing an object of each class. In some cases, a background class is included that indicates that there is no object in the box (wherein a box is a select region). Further in paragraph [0198]-CHEN discloses for the image shown in FIG. 8A, all probability labels would indicate the background class with the exception of the three matched boxes (two for the cat, one for the dog).);
receiving a third image frame (Figs. 11A-C, illustrate three image frames. Paragraph [0126]-CHEN discloses the video analytics system 100 receives video frames 102 from a video source 130. The video frames 102 can also be referred to herein as a video picture or a picture. The video frames 102 can be part of one or more video sequences (wherein a video comprises multiple image frames).);
determining whether the third image frame in the select regions comprises (Fig. 8A, illustrates selected regions being boxes surrounding a cat and a dog (wherein Fig. 8A is the third image frame from a video sequence).) the background pattern in the received location (Fig. 8A-8C. Paragraph [0198]-CHEN discloses for each default box in each cell, the SSD neural network outputs a probability vector of length c, where c is the number of classes, representing the probabilities of the box containing an object of each class. In some cases, a background class is included that indicates that there is no object in the box (wherein a box represents the received location). Further in paragraph [0198]-CHEN discloses for the image shown in FIG. 8A, all probability labels would indicate the background class with the exception of the three matched boxes (two for the cat, one for the dog).);
CHEN fails to explicitly teach wherein the object comprises text.
However, HWANG explicitly teaches wherein the object comprises text (Fig. 3. Paragraph [0024]-HWANG discloses in FIG. 3, an object intended for character recognition according to an embodiment of the present invention is a signboard photographed image. Of the signboard photographed image, the background region 31, a center text region 32 necessary for delivering advertisement information and also an outer boundary region 33a and an inner boundary region 33b, 33c wrap the center text region 32 that corresponds to a character recognition object according to an embodiment of the present invention.)
Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to combine the teachings of CHEN of non-transitory computer storage that stores executable program instructions that, when executed by one or more computing devices, configure the one or more computing devices to perform operations comprising: receiving a video feed; initializing a main object detector configured to receive image frames from the video feed and determine whether an object of interest is present in the image frames; executing a backup object detector configured to perform operations comprising, with the teachings of HWANG, wherein the object comprises text.
The combination results in CHEN’s object detection system wherein the object comprises text.
The motivation behind the modification would have been to obtain an object detection system that enhances the efficiency and recognition ability of the detection system. Both CHEN and HWANG relate to detecting objects in images: in CHEN, the amount of computer resources (e.g., devices, storage, and processor usage) required to generate detection, tracking, and classification results is reduced, while HWANG improves upon the prior art by extracting and recognizing the text from signboards even when the text may not be recognized normally. Please see CHEN et al. (US 20190130580 A1), Paragraph [0207], and HWANG et al. (US 20100202690 A1), Paragraph [0007].
CHEN in view of HWANG fail to explicitly teach performing optical character recognition on the second image frame in the received location; performing a pixel comparison between the received location in the second and third image frames; and when a similarity threshold between the pixels in the received location in the second and third image frames are above a predetermined threshold, bypassing performing optical character recognition on the third image frame.
However, PORIKLI explicitly teaches performing optical character recognition on the second image frame in the received location (Fig. 4. Paragraph [0096]-PORIKLI discloses hand detection is triggered and performed in the second image frame. The global ROI detector may include a machine learning component to perform the object detection. The machine learning component may utilize deep learning technology (e.g., convolutional neural networks (CNN), recurrent neural networks (RNN), or long/short term memory (LSTM)) to learn to recognize features in an image that represent the first object of interest. In some aspects, these image features can include different shapes, colors, scales, and motions that indicate a hand (wherein the character being recognized is a hand).);
performing a pixel comparison (Fig. 4, #410 determine similarity score between images. Paragraph [0084]) between the received location in the second and third image frames (Fig. 4. Paragraph [0084]-PORIKLI discloses the raw image frames are fed into the spatial constraint component which may use a similarity estimation algorithm to determine a similarity score for the images. In the similarity estimation, similar images are given a higher similarity score. The similarity score may reflect global similarity between the first image and the second image or may reflect similarity between a first windowed portion of the first image frame and the first windowed portion of the second image frame (wherein the windowed portions of the first and second images are received locations of the object in the second and third images and the images and windowed portions of images are comprised of pixels).); and
when a similarity threshold between the pixels in the received location in the second and third image frames (Fig. 4. Paragraph [0084]-PORIKLI discloses the raw image frames are fed into the spatial constraint component which may use a similarity estimation algorithm to determine a similarity score for the images. In the similarity estimation, similar images are given a higher similarity score. The similarity score may reflect global similarity between the first image and the second image or may reflect similarity between a first windowed portion of the first image frame and the first windowed portion of the second image frame (wherein a similarity score is a similarity threshold, the first and second images and windowed portions of the first and second images are comprised of pixels, and the received locations are windowed portions).) are above a predetermined threshold, bypassing performing optical character recognition on the third image frame (Fig. 4. Paragraph [0085]-PORIKLI discloses the similarity score threshold that determines to skip an image may be manually specified (e.g., programmed) or can be learned by the activity recognition device from training data. The number of frames skipped or omitted in the hand detection processing is determined by the similarity score threshold (wherein hand detection processing is optical character recognition and the hand is the character being recognized).).
Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to combine the teachings of CHEN in view of HWANG of non-transitory computer storage that stores executable program instructions that, when executed by one or more computing devices, configure the one or more computing devices to perform operations comprising: receiving a video feed; initializing a main object detector configured to receive image frames from the video feed and determine whether an object of interest is present in the image frames; executing a backup object detector configured to perform operations comprising, with the teachings of PORIKLI of performing optical character recognition on the second image frame in the received location; performing a pixel comparison between the received location in the second and third image frames; and when a similarity threshold between the pixels in the received location in the second and third image frames are above a predetermined threshold, bypassing performing optical character recognition on the third image frame.
The combination results in CHEN’s object detection system performing optical character recognition on the second image frame in the received location; performing a pixel comparison between the received location in the second and third image frames; and, when a similarity threshold between the pixels in the received location in the second and third image frames are above a predetermined threshold, bypassing performing optical character recognition on the third image frame.
The motivation behind the modification would have been to obtain an object detection system that enhances the efficiency and recognition ability of the detection system. Both CHEN and PORIKLI relate to detecting objects in images: in CHEN, the amount of computer resources (e.g., devices, storage, and processor usage) required to generate detection, tracking, and classification results is reduced, while PORIKLI recognized a need for improved activity detection for vehicle perception. Please see CHEN et al. (US 20190130580 A1), Paragraph [0207], and PORIKLI et al. (US 20190213406 A1), Paragraph [0002].
Regarding claim 18, CHEN explicitly teaches the non-transitory computer storage of claim 10,
CHEN further explicitly teaches that the operations further comprise (Fig. 2. Paragraph [0143]-CHEN discloses an example of the video analytics system (e.g., video analytics system 100) processing video frames across time t.):
when the object detector detects the object in an image frame (Figs. 2-3. Paragraph [0138]-CHEN discloses the blob detection system 104 can detect one or more blobs in video frames (e.g., video frames 102) of a video sequence, and the object tracking system 106 can track the one or more blobs across the frames of the video sequence. As used herein, a blob refers to foreground pixels of at least a portion of an object (e.g., a portion of an object or an entire object) in a video frame.), determine whether the object is the same object as a previous image frame (Fig. 5. Paragraph [0139]-CHEN discloses a bounding box for a blob tracker in a current frame can be the bounding box of a previous blob in a previous frame for which the blob tracker was associated. For instance, when the blob tracker is updated in the previous frame (after being associated with the previous blob in the previous frame), updated information for the blob tracker can include the tracking information for the previous frame and also prediction of a location of the blob tracker in the next frame (which is the current frame in this example). The prediction of the location of the blob tracker in the current frame can be based on the location of the blob in the previous frame (wherein a blob is indicative of an object).);
when the object is not the same object, store the object as the previous frame object to compare against future detected objects (Fig. 2. Paragraph [0139]-CHEN discloses a history or motion model can be maintained for a blob tracker, including a history of various states, a history of the velocity, and a history of location, of continuous frames, for the blob tracker, as described in more detail below. Further in paragraph [0143]-CHEN discloses the object tracking system 206A can perform data association to associate or match the blob trackers (e.g., blob trackers generated or updated based on a previous frame or newly generated blob trackers) and blobs 208A using the calculated costs (e.g., using a cost matrix or other suitable association technique). The blob trackers can be updated, including in terms of positions of the trackers, according to the data association to generate updated blob trackers 310A (wherein a not the same object is a newly generated blob).);
when the object detector does not detect the object in the image frame (Fig. 12. Paragraph [0205]-CHEN discloses whenever a person is moving with a normal speed in a scene, the person cannot be detected by a deep learning based complex object detector. FIG. 11A-FIG. 11C show an example of a person 1102 moving within a scene. As shown in FIG. 11A, the person 1102 is standing still, in which case the person is detected with confidence (illustrated by bounding box 1104, a “person” class, and a confidence level or value of 0.88) by a deep learning based detector. However, when the person 1102 is walking with normal speed, as shown in FIG. 11B and FIG. 11C, the person is not detected at all. Such a lack of detection is based on the latency required to detect an object being greater than real-time.), perform object detection using the backup object detector (Fig. 12. Paragraph [0207]-CHEN discloses by applying a combined video analytics object detection/tracking system and a complex object detection system, the problems described above can be avoided. For example, the hybrid video analytics system described herein provides object detection and tracking with high-accuracy, while achieving real-time performance without such latencies. Such high-accuracy, real-time performance can even be achieved by a device (e.g., an IP camera or other device) that does not have a graphics card (wherein the hybrid video analytics system is the backup object detector).);
when the backup object detector detects the object (Fig. 12. Paragraph [0207]-CHEN discloses by applying a combined video analytics object detection/tracking system and a complex object detection system, the problems described above can be avoided. For example, the hybrid video analytics system described herein provides object detection and tracking with high-accuracy, while achieving real-time performance without such latencies. Such high-accuracy, real-time performance can even be achieved by a device (e.g., an IP camera or other device) that does not have a graphics card (wherein the hybrid video analytics system is the backup object detector).), label the image frame as containing the object (Fig. 8A. Paragraph [0209]-CHEN discloses the blob detection system 1204 can perform object detection to detect one or more blobs (representing one or more objects) for the video frames 1202. Blob bounding boxes associated with the blobs are generated by the blob detection system 1204. The blob bounding boxes generated using blob detection can also be referred to as foreground bounding boxes. The blobs and/or the blob bounding boxes can be output for further processing by the video analytics system 1200 (wherein a blob is a label). Further in paragraph [0194]-CHEN discloses the output can include probability values indicating probabilities (or confidence levels or confidence values) that the object includes one or more classes of objects (e.g., a probability the object is a person, a probability the object is a dog, a probability the object is a cat, or the like).) and determine whether the object is the same as object in a previous image frame (Fig. 2. Paragraph [0139]-CHEN discloses a history or motion model can be maintained for a blob tracker, including a history of various states, a history of the velocity, and a history of location, of continuous frames, for the blob tracker, as described in more detail below. Further in paragraph [0143]-CHEN discloses the object tracking system 206A can perform data association to associate or match the blob trackers (e.g., blob trackers generated or updated based on a previous frame or newly generated blob trackers) and blobs 208A using the calculated costs (e.g., using a cost matrix or other suitable association technique). The blob trackers can be updated, including in terms of positions of the trackers, according to the data association to generate updated blob trackers 310A (wherein a blob represents an object).);
when the object detected by the backup object detector is not similar to the object in the previous frame (Fig. 2. Paragraph [0143]-CHEN discloses the object tracking system 206A can perform data association to associate or match the blob trackers (e.g., blob trackers generated or updated based on a previous frame or newly generated blob trackers) and blobs 208A using the calculated costs (e.g., using a cost matrix or other suitable association technique) (wherein a not similar object is indicated by a newly generated blob).), store the object detected by the backup object detector as the previous image frame object to compare against future detected objects (Fig. 2. Paragraph [0139]-CHEN discloses a history or motion model can be maintained for a blob tracker, including a history of various states, a history of the velocity, and a history of location, of continuous frames, for the blob tracker, as described in more detail below. Further in paragraph [0143]-CHEN discloses the object tracking system 206A can perform data association to associate or match the blob trackers (e.g., blob trackers generated or updated based on a previous frame or newly generated blob trackers) and blobs 208A using the calculated costs (e.g., using a cost matrix or other suitable association technique). The blob trackers can be updated, including in terms of positions of the trackers, according to the data association to generate updated blob trackers 310A (wherein object tracking system 206A is the backup object detector).).
CHEN fails to explicitly teach wherein the object comprises text.
However, HWANG explicitly teaches wherein the object comprises text (Fig. 3, illustrates an object comprised of text with a border region surrounding the text. Paragraph [0024]-HWANG discloses in FIG. 3, an object intended for character recognition according to an embodiment of the present invention is a signboard photographed image. Of the signboard photographed image, the background region 31, a center text region 32 necessary for delivering advertisement information and also an outer boundary region 33 a and an inner boundary region 33 b, 33 c wrap the center text region 32 that corresponds to a character recognition object according to an embodiment of the present invention.)
Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to combine the teachings of CHEN of a method comprising: receiving a video feed; initializing a main object detector configured to receive image frames from the video feed and determine whether an object of interest is present in the image frames; executing a backup object detector configured to perform operations comprising: receiving a location of the object of interest in a first image of the video feed; determining a background pattern of the object of interest in the first image, with the teachings of HWANG, wherein the object comprises text.
The combination results in CHEN’s object detection system wherein the object comprises text.
The motivation behind the modification would have been to obtain an object detection system that enhances the efficiency and recognition ability of the detection system. Both CHEN and HWANG relate to detecting objects in images: in CHEN, the amount of computer resources (e.g., devices, storage, and processor usage) required to generate detection, tracking, and classification results is reduced, while HWANG improves upon the prior art by extracting and recognizing the text from signboards even when the text may not be recognized normally. Please see CHEN et al. (US 20190130580 A1), Paragraph [0207], and HWANG et al. (US 20100202690 A1), Paragraph [0007].
CHEN in view of HWANG fail to explicitly teach when the object is the same object, bypass performing OCR on the object; when the object detected by the backup object detector is similar to an object in a previous frame within a same threshold, bypass performing OCR on the object detected by the backup detector;
However, PORIKLI explicitly teaches when the object is the same object (Fig. 4. Paragraph [0084]-PORIKLI discloses the raw image frames are fed into the spatial constraint component which may use a similarity estimation algorithm to determine a similarity score for the images. In the similarity estimation, similar images are given a higher similarity score.), bypass performing OCR on the object (Fig. 4. Paragraph [0085]-PORIKLI discloses the similarity score threshold that determines to skip an image may be manually specified (e.g., programmed) or can be learned by the activity recognition device from training data. The number of frames skipped or omitted in the hand detection processing is determined by the similarity score threshold (wherein hand detection processing is optical character recognition and the hand is the character being recognized).);
when the object detected by the backup object detector is similar to an object (Fig. 4. Paragraph [0084]-PORIKLI discloses the raw image frames are fed into the spatial constraint component which may use a similarity estimation algorithm to determine a similarity score for the images. In the similarity estimation, similar images are given a higher similarity score. The similarity score may reflect global similarity between the first image and the second image or may reflect similarity between a first windowed portion of the first image frame and the first windowed portion of the second image frame.) in a previous frame within a same threshold, bypass performing OCR on the object detected by the backup detector (Fig. 4. Paragraph [0085]-PORIKLI discloses the similarity score threshold that determines to skip an image may be manually specified (e.g., programmed) or can be learned by the activity recognition device from training data. The number of frames skipped or omitted in the hand detection processing is determined by the similarity score threshold (wherein hand detection processing is optical character recognition and the hand is the character being recognized).).
Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to combine the teachings of CHEN in view of HWANG of a method comprising: receiving a video feed; initializing a main object detector configured to receive image frames from the video feed and determine whether an object of interest is present in the image frames; executing a backup object detector configured to perform operations comprising: receiving a location of the object of interest in a first image of the video feed; determining a background pattern of the object of interest in the first image, with the teachings of PORIKLI of when the object is the same object, bypass performing OCR on the object; when the object detected by the backup object detector is similar to an object in a previous frame within a same threshold, bypass performing OCR on the object detected by the backup detector.
The combination results in CHEN’s object detection system that, when the object is the same object, bypasses performing OCR on the object and, when the object detected by the backup object detector is similar to an object in a previous frame within a same threshold, bypasses performing OCR on the object detected by the backup detector.
The motivation behind the modification would have been to obtain an object detection system that enhances the efficiency and recognition ability of the detection system. Both CHEN and PORIKLI relate to detecting objects in images: in CHEN, the amount of computer resources (e.g., devices, storage, and processor usage) required to generate detection, tracking, and classification results is reduced, while PORIKLI recognized a need for improved activity detection for vehicle perception. Please see CHEN et al. (US 20190130580 A1), Paragraph [0207], and PORIKLI et al. (US 20190213406 A1), Paragraph [0002].
Claims 8 and 17 are rejected under 35 U.S.C. 103 as being unpatentable over CHEN et al. (US 20190130580 A1), hereinafter referenced as CHEN, in view of HWANG et al. (US 20100202690 A1), hereinafter referenced as HWANG.
Regarding claim 8, CHEN explicitly teaches the method of claim 1,
CHEN fails to explicitly teach wherein the object comprises text and the select regions comprise a border surrounding the text.
However, HWANG explicitly teaches wherein the object comprises text and the select regions comprise a border surrounding the text (Fig. 3, illustrates an object comprised of text with a border region surrounding the text. Paragraph [0024]-HWANG discloses in FIG. 3, an object intended for character recognition according to an embodiment of the present invention is a signboard photographed image. Of the signboard photographed image, the background region 31, a center text region 32 necessary for delivering advertisement information and also an outer boundary region 33 a and an inner boundary region 33 b, 33 c wrap the center text region 32 that corresponds to a character recognition object according to an embodiment of the present invention (wherein the outer boundary region 33 a is the select region).).
Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to combine the teachings of CHEN of a method comprising: receiving a video feed; initializing a main object detector configured to receive image frames from the video feed and determine whether an object of interest is present in the image frames; executing a backup object detector configured to perform operations comprising: receiving a location of the object of interest in a first image of the video feed; determining a background pattern of the object of interest in the first image, with the teachings of HWANG, wherein the object comprises text and the select regions comprise a border surrounding the text.
The combination results in CHEN’s object detection system wherein the object comprises text and the select regions comprise a border surrounding the text.
The motivation behind the modification would have been to obtain an object detection system that enhances the efficiency and recognition ability of the detection system. Both CHEN and HWANG relate to detecting objects in images: in CHEN, the amount of computer resources (e.g., devices, storage, and processor usage) required to generate detection, tracking, and classification results is reduced, while HWANG improves upon the prior art by extracting and recognizing the text from signboards even when the text may not be recognized normally. Please see CHEN et al. (US 20190130580 A1), Paragraph [0207], and HWANG et al. (US 20100202690 A1), Paragraph [0007].
Regarding claim 17, CHEN explicitly teaches the non-transitory computer storage of claim 10,
CHEN fails to explicitly teach wherein the object comprises text and the select regions comprise a border surrounding the text.
However, HWANG explicitly teaches wherein the object comprises text and the select regions comprise a border surrounding the text (Fig. 3, illustrates an object comprised of text with a border region surrounding the text. Paragraph [0024]-HWANG discloses in FIG. 3, an object intended for character recognition according to an embodiment of the present invention is a signboard photographed image. Of the signboard photographed image, the background region 31, a center text region 32 necessary for delivering advertisement information and also an outer boundary region 33 a and an inner boundary region 33 b, 33 c wrap the center text region 32 that corresponds to a character recognition object according to an embodiment of the present invention (wherein the outer boundary region 33 a is the select region).).
Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to combine the teachings of CHEN of non-transitory computer storage that stores executable program instructions that, when executed by one or more computing devices, configure the one or more computing devices to perform operations comprising: receiving a video feed; initializing a main object detector configured to receive image frames from the video feed and determine whether an object of interest is present in the image frames; executing a backup object detector configured to perform operations comprising, with the teachings of HWANG, wherein the object comprises text and the select regions comprise a border surrounding the text.
The combination results in CHEN’s object detection system wherein the object comprises text and the select regions comprise a border surrounding the text.
The motivation behind the modification would have been to obtain an object detection system that enhances the efficiency and recognition ability of the detection system. Both CHEN and HWANG relate to detecting objects in images: in CHEN, the amount of computer resources (e.g., devices, storage, and processor usage) required to generate detection, tracking, and classification results is reduced, while HWANG improves upon the prior art by extracting and recognizing the text from signboards even when the text may not be recognized normally. Please see CHEN et al. (US 20190130580 A1), Paragraph [0207], and HWANG et al. (US 20100202690 A1), Paragraph [0007].
Conclusion
Listed below is the prior art made of record and not relied upon that is considered pertinent to applicant’s disclosure.
TAKEDA et al. (US 20180068431 A1) - Various aspects of a video-processing system and method for object detection in a sequence of image frames are disclosed herein. The system includes an image-processing device configured to receive a first object template for an object in a first image frame that includes one or more objects. A plurality of object candidates that corresponds to the object for a second image frame are determined by use of the shape of the received first object template. One of the determined plurality of object candidates is selected as a second object template, based on one or more parameters. The received first object template is updated to the selected second object template to enable segmentation of the object in the second image frame and/or subsequent image frames…Fig. 4A-B, Abstract.
TAHERI et al. (US 20210117724 A1) - Methods, systems, and apparatus, including computer programs encoded on computer storage media, for model co-occurrence object detection. One of the methods includes accessing, for a training image, first data that indicates a detected bounding box for a first object depicted in the training image and a predicted type label, accessing, for the training image, ground truth data for one or more ground truth objects, determining, using the first data and the ground truth data, that i) the detected bounding box represents an object that is not a ground truth object represented by the ground truth data or ii) the predicted type label for the first object does not match a ground truth label for the first object identified by the ground truth data, determining a penalty to adjust the model using a distance between the detected bounding box and the labeled bounding box, and training the model using the penalty…Fig. 1, Abstract.
BOULT et al. (US 20190378283 A1) - The present invention is a computer-implemented system and method for transforming video data into directional object counts. The method of transforming video data is uniquely efficient in that it uses only a single column or row of pixels in a video camera to define the background from a moving object, count the number of objects and determine their direction. By taking an image of a single column or row every frame and concatenating them together, the result is an image of the object that has passed, referred to herein as a sweep image. In order to determine the direction, two different methods can be used. Method one involves constructing another image using the same method. The two images are then compared, and the direction is determined by the location of the object in the second image compared to the location of the object in the first image.…Fig. 5, Abstract.
Any inquiry concerning this communication or earlier communications from the examiner should be directed to ETHAN N WOLFSON whose telephone number is (571)272-1898. The examiner can normally be reached Monday - Friday 8:00 am - 5:00 pm.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Chineyere Wills-Burns can be reached at (571) 272-9752. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.
/ETHAN N WOLFSON/ Examiner, Art Unit 2673
/CHINEYERE WILLS-BURNS/Supervisory Patent Examiner, Art Unit 2673