DETAILED ACTION
Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
Response to Arguments
Applicant’s arguments, see Applicant’s Remarks, pages 10-17, filed 02/18/26, with respect to the rejection(s) of claim(s) 1-6, 8-13, 15-17, 19 and 20 have been fully considered and are persuasive. Therefore, the rejection has been withdrawn. However, upon further consideration, a new ground(s) of rejection is made in view of Boult US 2019/0378283.
Regarding claims 1, 10 and 17, Applicant has amended claims 1, 10 and 17 with the objected-to limitations of claims 7, 14 and 17 to recite that the machine learning model is trained with training data comprising a combination of document boundary detection, hand detection and a change in document histograms between two image frames (Applicant’s Remarks, pages 10-17). Boult teaches that bounding boxes in “ground truth” sweep images can then be used for machine learning to search for the parameters that optimize the sweep image-based transformation for the particular deployment location/conditions (paragraph 0062). The mapping between ground truth boxes in sweep images and frames is a bi-directional transformation of data using coordinates in the sweep images as a time index into the sequence of video frames. Thus, a bounding box in the sweep image can also be used to determine a sequence of bounding boxes in the raw video, and labeling data in the sweep image can be used to determine ground truth for training data from the video (change in document histograms between two image frames). With the front edge detected and tracked 830 and the trailing edge 832 identified, the object in the frame can be converted to a template 850 and tracked in the video (document boundary detection), and the boxes in the sweep image could be hand-drawn ground truth which could then be transferred to the video and used to speed up the ground-truth labeling of video data for training a video-based object detector (hand detection) (paragraph 0063). Note: the bounding box is used to train the machine learning model in paragraph 0062. The bounding box contains ground truth and training from detecting the leading edge and trailing edge (document boundary detection), ground truth between image frames (the transformation of data between the image frames is read as histogram data) and hand-drawn ground truth (hand detection).
Therefore, the bounding box contains document boundary detection, hand detection and a change in document histograms between two image frames as required to train the machine learning model. The claims are rejected.
Claim Rejections - 35 USC § 103
In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis (i.e., changing from AIA to pre-AIA) for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.
Claims 1-4, 6, 8-11, 13 and 15-16 are rejected under 35 U.S.C. 103 as being unpatentable over Emmett et al US 2018/0278845 in view of Manohar et al US 9191554, further in view of Boult US 2019/0378283, and further in view of Mathy US 2017/0272651.
Regarding claim 1, Emmett et al teaches a system comprising:
a memory component (memory (paragraph 0025)); and
one or more processing devices coupled to the memory component (processor and memory (paragraph 0025)), the one or more processing devices to perform operations comprising:
receiving event data comprising: a video stream and sensor data captured from one or more sensors of a user device, wherein the video stream includes image frames that capture a plurality of pages of a document (the video capture module to receive an image sequence of: (ii) a multi-page document while the user turns the pages; (iii) a multi-page document (paragraph 0038)). Note: each page of a document, or each of the multiple pages that the smartphone videos, is an image frame. Emmett et al teaches that the artifact is a document such as a single-page document or multi-page document. The sensing modules (sensor data) may include a video capture module of a mobile electronic device that, when initiated or launched, captures a video 301 of a scene that includes the artifact. The video capture may occur while the user moves the artifact, the image capture device, or both so that the video includes a sequence of image frames, at least some of which contain images of the artifact (paragraph 0038), which teaches that each page is an artifact and is an image frame of different pages of a document;
detecting a new page event, wherein the new page event indicates that a page of the plurality of pages available for scanning has changed from a first page to a second page (the video capture module to receive an image sequence of: (i) a single-page document while the user moves the electronic device around an area of where the artifact exists; (ii) a multi-page document while the user turns the pages; (iii) a multi-page document where the pages are laid out side by side on a surface (paragraph 0038). Note: Emmett et al teaches that video capture may occur while the user moves the artifact, the image capture device, or both so that the video includes a sequence of image frames, at least some of which contain images of the artifact (paragraph 0038). In other words, once a page of the document has been scanned for video, the user turns to the next page to be scanned for video, and the video capture module senses this event)
Emmett et al teaches, based on the detection of the new page event, capturing an image frame of the page from the video stream (video capture may occur while the user moves the artifact, the image capture device, or both so that the video includes a sequence of image frames, at least some of which contain images of the artifact (paragraph 0038). In other words, once a page of the document has been scanned for video, the user turns to the next page to be scanned for video, and the video capture module senses this event).
Emmett et al fails to teach detecting, via a machine learning model trained to infer events from the event data, a new page event; and detecting, via the machine learning model, a page capture event, wherein the page capture event indicates that at least one image from the image frames comprises a stable image of the page; and based on a combination of the detection by the machine learning model of both the new page event and the page capture event, capturing an image frame of the page from the video stream.
Manohar et al teaches detecting, via a machine learning model trained to infer events from the video stream, a new page event (A page-turn detector that is trained using machine-learning techniques may be used to identify frames of the video in which a page-turn event occurs (column 1, lines 63-67));
detecting, via the machine learning model, a page capture event, wherein the page capture event indicates that at least one image from the image frames comprises a stable image of the page (using machine learning algorithms, a classifier may be trained to detect page-turn events in video data to create the page-turn detector 124. By detecting where in the video data the page-turn events occur, frames (images) corresponding to each page in the book may be identified (column 7, lines 15-24). Note: a frame can be read as a stable image since a frame is a single still image); and
based on a combination of the detection by the machine learning model of both the new page event and the page capture event, capturing an image frame of the page from the video stream (using machine learning algorithms, a classifier may be trained to detect page-turn events (new page event) in video data to create the page-turn detector 124. The output of the page-turn detector 124 may include a start frame and an end frame (page capture event) for each page-turn event (new page event) that occurs in the video data (column 7, lines 15-24). By detecting the start frame and end frame of a page-turn event, this would read on the machine learning model detecting both the new page event and the page capture event)
Therefore, it would have been obvious to a person with ordinary skill in the art to have modified Emmett et al’s mobile device to include: detecting, via a machine learning model trained to infer events from the video stream, a new page event; detecting, via the machine learning model, a page capture event, wherein the page capture event indicates that at least one image from the image frames comprises a stable image of the page; and based on a combination of the detection by the machine learning model of both the new page event and the page capture event, capturing an image frame of the page from the video stream.
The reason for doing so would be to effectively and efficiently identify new or turned pages in a document or book.
Emmett et al in view of Manohar et al fails to teach wherein the machine learning model is trained with training data based on document boundary detection, hand detection and a change in document histogram between two data frames, wherein the document boundary detection model and the hand detection model compute the training data from ground truth training data;
Boult teaches wherein the machine learning model is trained with training data based on document boundary detection, hand detection and a change in document histogram between two data frames, wherein the document boundary detection model and the hand detection model compute the training data from ground truth training data (bounding boxes in “ground truth” sweep images can then be used for machine learning (paragraph 0062). The mapping between ground truth boxes in sweep images and frames is a bi-directional transformation of data using coordinates in the sweep images as a time index into the sequence of video frames. Thus, a bounding box in the sweep image can also be used to determine a sequence of bounding boxes in the raw video, and labeling data in the sweep image can be used to determine ground truth for training data from the video (change in document histograms between two image frames). With the front edge detected and tracked 830 and the trailing edge 832 identified, the object in the frame can be tracked in the video (document boundary detection), and the boxes in the sweep image could be hand-drawn ground truth which could then be transferred to the video and used to speed up the ground-truth labeling of video data for training a video-based object detector (hand detection) (paragraph 0063). Note: the bounding box is used to train the machine learning model in paragraph 0062. The bounding box contains ground truth and training from detecting the leading edge and trailing edge (document boundary detection), ground truth between image frames (the transformation of data between the image frames is read as histogram data) and hand-drawn ground truth (hand detection). Therefore, the bounding box contains document boundary detection, hand detection and a change in document histograms between two image frames as required to train the machine learning model).
Therefore, it would have been obvious to a person with ordinary skill in the art to have modified Emmett et al in view of Manohar et al’s mobile device with machine learning to include: wherein the machine learning model is trained with training data based on document boundary detection, hand detection and a change in document histogram between two data frames, wherein the document boundary detection model and the hand detection model compute the training data from ground truth training data.
The reason for doing so would be to improve the operability of the page turning in a document.
Emmett et al in view of Manohar et al further in view of Boult fails to teach wherein the sensor data includes sensor data frames comprising depth data representing a change in page depth, the sensor data frames synchronized in time with the image frames;
Mathy teaches wherein the sensor data includes sensor data frames comprising depth data representing a change in page depth, the sensor data frames synchronized in time with the image frames (machine learning algorithms performing feature extraction on depth images/videos usually assume that the scene of interest is described using a sequence (video) of frames at a given frame rate (typically 30 FPS). Each frame can either contain only a depth image, only an image obtained from a time-insensitive sensor, or a combination of both. This disclosure focuses on the case when each frame contains at least a depth image acquired by a time-of-flight camera (paragraphs 0097 and 0106). An algorithm is described that enables the updating of depth images in a sequence of frames by tracking objects undergoing translation without having to continuously turn on the laser. This procedure allows for the estimation of how the depth of each object changes given an initial depth measurement that is obtained using nominal laser power and a series of “cheaply acquired” images (paragraph 0140));
Therefore, it would have been obvious to a person with ordinary skill in the art to have modified Emmett et al in view of Manohar et al further in view of Boult’s mobile device with machine learning to include: wherein the sensor data includes sensor data frames comprising depth data representing a change in page depth, the sensor data frames synchronized in time with the image frames.
The reason for doing so would be to improve the operability of the page turning in a document or book.
Regarding claim 2, Emmett et al in view of Manohar et al further in view of Boult teaches after detecting the new page event (Emmett et al: A “decision module” refers to a software application that receives data or analysis from the video capture module and other sensors and applies one or more rules to determine whether the data or analysis satisfies one or more pre-defined criteria for triggering still photo capture (paragraph 0029)),
detecting the page capture event based on a page capture event confidence value generated by the model (the criteria used to identify a suitable instance may include a requirement that the image have at least a threshold image quality score representing machine readability of the printed artifact. The system may use any suitable method such as retrieving a score for the frame by processing the pooled features via a classifier (paragraph 0012). The module may employ different strategies to rank, weight and combine the analysis corresponding to the various sensing modules in order to make a final decision towards a suitable instance for still capture (paragraph 0029))
machine learning model (Manohar et al: A page-turn detector that is trained using machine-learning techniques may be used to identify frames of the video in which a page-turn event occurs (column 1, lines 63-67))
Therefore, it would have been obvious to a person with ordinary skill in the art to have modified Emmett et al’s mobile device to include: the machine learning model.
The reason for doing so would be to effectively and efficiently identify new or turned pages in a document or book.
Regarding claim 3, Emmett et al in view of Manohar et al further in view of Boult teaches receiving sensor data from one or more sensors of a user device, a weighted combination of the sensor data and the video stream (Emmett et al: A “decision module” refers to a software application that receives data or analysis from the video capture module and other sensors and applies one or more rules to determine whether the data or analysis satisfies one or more pre-defined criteria for triggering still photo capture. The module may employ different strategies to rank, weight and combine the analysis corresponding to the various sensing modules in order to make a final decision towards a suitable instance for still capture (paragraph 0029))
wherein the machine learning model is trained to detect the new page event based on (Manohar et al: using machine learning algorithms, a classifier may be trained to detect page-turn events in video data to create the page-turn detector 124. By detecting where in the video data the page-turn events occur, frames corresponding to each page in the book may be identified. Identifying page-turn events may enable a temporal extent of each page in the frames of the video data to be determined (column 7, lines 15-24))
Therefore, it would have been obvious to a person with ordinary skill in the art to have modified Emmett et al’s mobile device to include: wherein the machine learning model is trained to detect the new page event.
The reason for doing so would be to effectively and efficiently identify new or turned pages in a document or book.
Regarding claim 4, Emmett et al in view of Manohar et al further in view of Boult teaches wherein the one or more sensors comprise at least one of: a depth sensor; an audio sensor; or an inertial measurement sensor (Emmett et al: the system may use data from the mobile electronic device’s inertial or motion sensors such as an accelerometer for detecting frames (paragraph 0053), and optionally additional sensing modules such as the motion sensor 405 and audio sensor 407 may be fed to a decision module 401 (paragraph 0067))
Regarding claim 6, Emmett et al in view of Manohar et al further in view of Boult teaches processing a float value vector computed by the machine learning model from at least a first image frame to detect events from a second image frame (Manohar et al: using machine learning algorithms, a classifier may be trained to detect page-turn events in video data to create the page-turn detector 124. By detecting where in the video data the page-turn events (float vector) occur, frames corresponding to each page in the book may be identified (column 7, lines 15-18). Note: the page-turn event is read as a float vector because it is used to identify frames. In order to calculate the page-turn event, a first page or frame would have had to be detected, and data from the initial detection is used to detect the next pages, etc., indicating a page-turn event)
Regarding claim 8, Emmett et al in view of Manohar et al further in view of Boult teaches wherein the training data comprises one or more of: audio samples, page depth data, and inertial measurement data (Emmett et al: inputs from the video sensing and analysis module 403, and optionally additional sensing modules such as the motion sensor 405 and audio sensor 407, may be fed to a decision module 401 (paragraph 0067). Note: A “decision module” refers to a software application that receives data or analysis from the video capture module and other sensors and applies one or more rules to determine whether the data or analysis satisfies one or more pre-defined criteria for triggering still photo capture (paragraph 0029)).
the machine learning model is trained (Manohar et al: using machine learning algorithms, a classifier may be trained to detect page-turn events in video data to create the page-turn detector 124 (column 7, lines 15-24))
Therefore, it would have been obvious to a person with ordinary skill in the art to have modified Emmett et al’s mobile device to include: the machine learning model is trained.
The reason for doing so would be to effectively and efficiently identify new or turned pages in a document or book.
Regarding claim 9, Emmett et al in view of Manohar et al further in view of Boult teaches wherein the machine learning model generates an indication of the new page event in response to detecting a turn of a page from the video stream from the first page to the second page, or detecting a change in view from the video stream from the first page to the second page (Manohar et al: using machine learning algorithms, a classifier may be trained to detect page-turn events in video data to create the page-turn detector 124. By detecting where in the video data the page-turn events occur, frames corresponding to each page in the book may be identified (column 7, lines 15-18). Note: the machine learning algorithm allows for a frame of a page to be identified, indicating a page-turn event)
Regarding claim 10, Emmett et al teaches a non-transitory computer-readable medium storing executable instructions (paragraph 0073), which, when executed by a processing device, cause the processing device to perform operations comprising:
receiving event data comprising: a video stream and sensor data from one or more sensors of a user device (inputs from the video sensing and analysis module 403, and optionally additional sensing modules such as the motion sensor 405 and audio sensor 407, may be fed to a decision module 401 (paragraph 0067));
detecting, based on the sensor data, a new page event, wherein detection of the new page event indicates that a page of the plurality of pages available for scanning has changed from a first page to a second page (A “decision module” refers to a software application that receives data or analysis from the video capture module and other sensors and applies one or more rules to determine whether the data or analysis satisfies one or more pre-defined criteria for triggering still photo capture. The module may employ different strategies to rank, weight and combine the analysis corresponding to the various sensing modules in order to make a final decision towards a suitable instance for still capture (paragraph 0029). The video capture module to receive an image sequence of: (i) a single-page document while the user moves the electronic device around an area of where the artifact exists; (ii) a multi-page document while the user turns the pages; (iii) a multi-page document where the pages are laid out side by side on a surface (paragraph 0038). Note: Emmett et al teaches that video capture may occur while the user moves the artifact, the image capture device, or both so that the video includes a sequence of image frames, at least some of which contain images of the artifact (paragraph 0038). In other words, once a page of the document has been scanned for video, the user turns to the next page to be scanned for video, and the video capture module senses this event); and
detecting, by a model based on the event data, a page capture event, wherein detection of the page capture event indicates that the event data comprises a stable image of the page (A “decision module” refers to a software application that receives data or analysis from the video capture module and other sensors and applies one or more rules to determine whether the data or analysis satisfies one or more pre-defined criteria for triggering still photo capture. The module may employ different strategies to rank, weight and combine the analysis corresponding to the various sensing modules in order to make a final decision towards a suitable instance for still capture (paragraph 0029));
capturing an image frame of the page from the event data based on detection of the page capture event by the model (A “decision module” refers to a software application that receives data or analysis from the video capture module and other sensors and applies one or more rules to determine whether the data or analysis satisfies one or more pre-defined criteria for triggering still photo capture. The module may employ different strategies to rank, weight and combine the analysis corresponding to the various sensing modules in order to make a final decision towards a suitable instance for still capture (paragraph 0029). Video capture may occur while the user moves the artifact, the image capture device, or both so that the video includes a sequence of image frames, at least some of which contain images of the artifact (paragraph 0038). In other words, once a page of the document has been scanned for video, the user turns to the next page to be scanned for video, and the video capture module senses this event).
Emmett et al fails to teach capturing an image frame of the page based on a combination of both the detection of the new page event and the detection of the page capture event by the machine learning model,
Manohar et al teaches capturing an image frame of the page based on a combination of both the detection of the new page event and the detection of the page capture event by the machine learning model (using machine learning algorithms, a classifier may be trained to detect page-turn events (new page event) in video data to create the page-turn detector 124. The output of the page-turn detector 124 may include a start frame and an end frame (page capture event) for each page-turn event (new page event) that occurs in the video data (column 7, lines 15-24). Note: by detecting the start frame and end frame of a page-turn event, this would read on the machine learning model detecting both the new page event and the page capture event)
Therefore, it would have been obvious to a person with ordinary skill in the art to have modified Emmett et al’s mobile device to include: capturing an image frame of the page based on a combination of both the detection of the new page event and the detection of the page capture event by the machine learning model.
The reason for doing so would be to effectively and efficiently identify images in new or turned pages in a document or book.
Emmett et al in view of Manohar et al fails to teach wherein the machine learning model is trained with training data based on document boundary detection, hand detection and a change in document histogram between two data frames, wherein the document boundary detection model and the hand detection model compute the training data from ground truth training data;
Boult teaches wherein the machine learning model is trained with training data based on document boundary detection, hand detection and a change in document histogram between two data frames, wherein the document boundary detection model and the hand detection model compute the training data from ground truth training data (bounding boxes in “ground truth” sweep images can then be used for machine learning (paragraph 0062). The mapping between ground truth boxes in sweep images and frames is a bi-directional transformation of data using coordinates in the sweep images as a time index into the sequence of video frames. Thus, a bounding box in the sweep image can also be used to determine a sequence of bounding boxes in the raw video, and labeling data in the sweep image can be used to determine ground truth for training data from the video (change in document histograms between two image frames). With the front edge detected and tracked 830 and the trailing edge 832 identified, the object in the frame can be tracked in the video (document boundary detection), and the boxes in the sweep image could be hand-drawn ground truth which could then be transferred to the video and used to speed up the ground-truth labeling of video data for training a video-based object detector (hand detection) (paragraph 0063). Note: the bounding box is used to train the machine learning model in paragraph 0062. The bounding box contains ground truth and training from detecting the leading edge and trailing edge (document boundary detection), ground truth between image frames (the transformation of data between the image frames is read as histogram data) and hand-drawn ground truth (hand detection). Therefore, the bounding box contains document boundary detection, hand detection and a change in document histograms between two image frames as required to train the machine learning model).
Therefore, it would have been obvious to a person with ordinary skill in the art to have modified Emmett et al in view of Manohar et al’s mobile device with machine learning to include: wherein the machine learning model is trained with training data based on document boundary detection, hand detection and a change in document histogram between two data frames, wherein the document boundary detection model and the hand detection model compute the training data from ground truth training data.
The reason for doing so would be to improve the operability of the page turning in a document.
Emmett et al in view of Manohar et al further in view of Boult fails to teach wherein the sensor data includes sensor data frames comprising depth data representing a change in page depth, the sensor data frames synchronized in time with the image frames;
Mathy teaches wherein the sensor data includes sensor data frames comprising depth data representing a change in page depth, the sensor data frames synchronized in time with the image frames (machine learning algorithms performing feature extraction on depth images/videos usually assume that the scene of interest is described using a sequence (video) of frames at a given frame rate (typically 30 FPS). Each frame can either contain only a depth image, only an image obtained from a time-insensitive sensor, or a combination of both. This disclosure focuses on the case when each frame contains at least a depth image acquired by a time-of-flight camera (paragraphs 0097 and 0106). An algorithm is described that enables the updating of depth images in a sequence of frames by tracking objects undergoing translation without having to continuously turn on the laser. This procedure allows for the estimation of how the depth of each object changes given an initial depth measurement that is obtained using nominal laser power and a series of “cheaply acquired” images (paragraph 0140));
Therefore, it would have been obvious to a person with ordinary skill in the art to have modified Emmett et al in view of Manohar et al further in view of Boult’s mobile device with machine learning to include: wherein the sensor data includes sensor data frames comprising depth data representing a change in page depth, the sensor data frames synchronized in time with the image frames.
The reason for doing so would be to improve the operability of the page turning in a document or book.
Regarding claim 11, Emmett et al in view of Manohar et al further in view of Boult teaches after detecting the new page event (Emmett et al: A “decision module” refers to a software application that receives data or analysis from the video capture module and other sensors and applies one or more rules to determine whether the data or analysis satisfies one or more pre-defined criteria for triggering still photo capture (paragraph 0029)),
detecting the page capture event based on a page capture event confidence value generated by the model (the criteria used to identify a suitable instance may include a requirement that the image have at least a threshold image quality score representing machine readability of the printed artifact. The system may use any suitable method such as retrieving a score for the frame by processing the pooled features via a classifier (paragraph 0012). The module may employ different strategies to rank, weight and combine the analysis corresponding to the various sensing modules in order to make a final decision towards a suitable instance for still capture (paragraph 0029))
machine learning model (Manohar et al: A page-turn detector that is trained using machine-learning techniques may be used to identify frames of the video in which a page-turn event occurs (column 1, lines 63-67))
Therefore, it would have been obvious to a person with ordinary skill in the art to have modified Emmett et al's mobile device to include: the machine learning model.
The reason for doing so would be to effectively and efficiently identify new or turned pages in a document or book.
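For illustration only, the cited decision-module behavior (ranking, weighting and combining per-module scores against a quality threshold before triggering still capture) can be sketched as follows. All names and the threshold value are illustrative assumptions, not code from Emmett et al.

```python
def combined_score(scores, weights):
    """Weighted combination of per-sensing-module confidence scores,
    as the decision module ranks, weights and combines module analyses."""
    total_weight = sum(weights.values())
    return sum(scores[name] * weights[name] for name in scores) / total_weight

def should_capture(scores, weights, threshold=0.8):
    """Trigger still photo capture only when the combined confidence
    meets the pre-defined image quality threshold."""
    return combined_score(scores, weights) >= threshold
```

For example, image-quality and motion scores of 0.9 and 0.7, weighted 2:1, combine to about 0.83 and would trigger capture at a 0.8 threshold.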
Regarding claim 13, Emmett et al in view of Manohar et al further in view of Boult teaches processing a float value vector computed by the machine learning model from at least a first image frame from the sensor data to detect events from a second image frame of the sensor data (Manohar et al: using machine learning algorithms, a classifier may be trained to detect page-turn events in video data to create the page-turn detector 124. By detecting where in the video data the page-turn events (float vector) occur, frames corresponding to each page in the book may be identified (column 7, lines 15-18). Note: the page-turn event is read as a float vector because it is used to identify frames. In order to calculate the page-turn event, a first page or frame would have had to be detected, and data from the initial detection is used to detect the next pages, etc., indicating a page-turn event).
Regarding claim 15, Emmett et al in view of Manohar et al further in view of Boult teaches wherein the model detects the new page event based on detecting a turn of one or more pages of the plurality of pages, or detecting of a change in view from the sensor data from a first document page to a second document page (Emmett et al: the user may operate the video capture module to receive an image sequence of: (i) a single-page document while the user moves the electronic device around an area of where the artifact exists; (ii) a multi-page document while the user turns the pages; (iii) a multi-page document where the pages are laid out side by side on a surface; (iv) a multi-sided document while the user flips the document over (paragraph 0038));
the machine learning model (Manohar et al: A page-turn detector that is trained using machine-learning techniques may be used to identify frames of the video in which a page-turn event occurs (column 1, lines 63-67)).
Therefore, it would have been obvious to a person with ordinary skill in the art to have modified Emmett et al's mobile device to include: the machine learning model.
The reason for doing so would be to effectively and efficiently identify new or turned pages in a document or book.
Regarding claim 16, Emmett et al in view of Manohar et al further in view of Boult teaches wherein the model detects the page capture event at least in part based on a combination of image stream data and inertial measurements from the one or more sensors (Emmett et al: the system may receive motion sensor data as the video is captured and correlate the motion sensor data to the analyzed image frames (paragraph 0009). A decision module 401 analyzes various input data to determine a suitable instance of an image frame that will trigger the system to switch from image capture. Input may include: image frames and/or analyzed data about image frames from the video capture module 403; data from one or more motion sensors 405 (paragraph 0046));
the machine learning model (Manohar et al: A page-turn detector that is trained using machine-learning techniques may be used to identify frames of the video in which a page-turn event occurs (column 1, lines 63-67)).
Therefore, it would have been obvious to a person with ordinary skill in the art to have modified Emmett et al's mobile device to include: the machine learning model.
The reason for doing so would be to effectively and efficiently identify new or turned pages in a document or book.
Claim(s) 17, 19, 20 is/are rejected under 35 U.S.C. 103 as being unpatentable over Emmett et al US 2018/0278845 in view of Manohar et al US 9191554 further in view of Boult US 2019/0378283.
Regarding claim 17, Emmett et al teaches a method comprising:
receiving training dataset comprising a video stream, wherein the video stream includes image frames that capture a plurality of pages of a document (the video capture module to receive an image sequence of: (ii) a multi-page document while the user turns the pages; (iii) a multi-page document (paragraph 0038). Note: each page of a document, or each of the multiple pages that the smartphone videos, is an image frame. Emmett et al teaches the artifact is a document such as a single-page document or multi-page document. The sensing modules may include a video capture module of a mobile electronic device that, when initiated or launched, captures a video 301 of a scene that includes the artifact. The video capture may occur while the user moves the artifact, the image capture device, or both so that the video includes a sequence of image frames, at least some of which contain images of the artifact (paragraph 0038), which teaches that each page is an artifact and an image frame of a different page of a document); and
wherein the new page event indicates that a page available for scanning has changed from a first page to a second page (the video capture module to receive an image sequence of: (i) a single-page document while the user moves the electronic device around an area of where the artifact exists; (ii) a multi-page document while the user turns the pages; (iii) a multi-page document where the pages are laid out side by side on a surface (paragraph 0038). Note: Emmett et al teaches video capture may occur while the user moves the artifact, the image capture device, or both so that the video includes a sequence of image frames, at least some of which contain images of the artifact (paragraph 0038). In other words, once a page of the document has been scanned for video, the user turns to the next page to be scanned for video, and the video capture module senses this event).
Emmett et al fails to teach using the training dataset, to detect from a set of one or more image frames from the video stream, a new page event and a page capture event;
training a machine learning model,
wherein the page capture event indicates that the video frame comprises a stable image of the page;
Manohar et al teaches using the training dataset, to detect from a set of one or more image frames from the video stream, a new page event and a page capture event (the training videos 206 (training data set) may identify which frames in the training videos 206 include page-turn events (a new page event) and/or a temporal extent of each page in the frames (a page capture event) of the training videos 206. Each of the training videos 206 may identify which frames include a page-turn event (a new page event), which frames include a particular page in a book (a page capture event), or both (column 6, lines 49-60). Thus, using machine learning algorithms, a classifier may be trained to detect page-turn events in video data to create the page-turn detector 124. By detecting where in the video data the page-turn events occur, frames corresponding to each page in the book may be identified. Identifying page-turn events may enable a temporal extent of each page in the frames of the video data to be determined (column 7, lines 16-24));
training a machine learning model (A page-turn detector that is trained using machine-learning techniques may be used to identify frames of the video in which a page-turn event occurs (column 1, lines 63-67));
wherein the page capture event indicates that the video frame comprises a stable image of the page (using machine learning algorithms, a classifier may be trained to detect page-turn events in video data to create the page-turn detector 124. By detecting where in the video data the page-turn events occur, frames (images) corresponding to each page in the book may be identified (column 7, lines 15-24). Note: a frame can be read as a stable image since a frame is a single still image).
Therefore, it would have been obvious to a person with ordinary skill in the art to have modified Emmett et al's mobile device to include: using the training dataset, to detect from a set of one or more image frames from the video stream, a new page event and a page capture event; training a machine learning model; and wherein the page capture event indicates that the video frame comprises a stable image of the page.
The reason for doing so would be to effectively and efficiently identify new or turned pages in a document or book.
Emmett et al in view of Manohar et al fails to teach wherein the machine learning model is trained with training data produced from document boundary detection, hand detection and a change in document histogram between two data frames, wherein the document boundary detection model and the hand detection model compute the training data from ground truth training data;
Boult teaches wherein the machine learning model is trained with training data produced from document boundary detection, hand detection and a change in document histogram between two data frames, wherein the document boundary detection model and the hand detection model compute the training data from ground truth training data (bounding boxes in "ground truth" sweep images can then be used for machine learning (paragraph 0062). The mapping between ground truth boxes in sweep images and frames is a bi-directional transformation of data using coordinates in the sweep images as a time index into the sequence of video frames. Thus, a bounding box in the sweep image can also be used to determine a sequence of bounding boxes in the raw video, and labeling data in the sweep image can be used to determine ground truth for training data from the video (change in document histograms between two image frames). With the front edge detected and tracked 830 and the trailing edge 832 identified, the object in the frame can be tracked in the video (document boundary detection), and the boxes in the sweep image could be hand-drawn ground truth which could then be transferred to the video and used to speed up the ground-truth labeling of video data for training a video-based object detector (hand detection) (paragraph 0063). Note: the bounding box is used to train the machine learning model in paragraph 0062. The bounding box contains ground truth and training from detecting the leading edge and trailing edge (document boundary detection), ground truth between image frames (the transformation of data between the image frames is read as histogram data) and hand-drawn ground truth (hand detection). Therefore, the bounding box contains the document boundary detection, hand detection and change in document histograms between two image frames required to train the machine learning model);
Therefore, it would have been obvious to a person with ordinary skill in the art to have modified Emmett et al in view of Manohar et al's mobile device with machine learning to include: wherein the machine learning model is trained with training data produced from document boundary detection, hand detection and a change in document histogram between two data frames, wherein the document boundary detection model and the hand detection model compute the training data from ground truth training data.
The reason for doing so would be to improve the operability of the page turning in a document.
Regarding claim 19, Emmett et al in view of Manohar et al further in view of Boult teaches wherein with training data produced from one or both of a document boundary detection model and a hand detection model (Emmett et al: the step of determining whether a frame or set of frames satisfies image quality criteria may include determining whether the image contained in the frames exhibits movement of an object such as page turn or hand interaction or camera motion (paragraph 0045). Note: Emmett et al looks at the image in the frame (document boundary) and hand interaction (hand detection). Edge detection is also used (paragraph 0045));
the machine learning model is trained at least in part (Manohar et al: A page-turn detector that is trained using machine-learning techniques may be used to identify frames of the video in which a page-turn event occurs (column 1, lines 63-67)).
Therefore, it would have been obvious to a person with ordinary skill in the art to have modified Emmett et al's mobile device to include: the machine learning model is trained at least in part.
The reason for doing so would be to effectively and efficiently identify new or turned pages in a document or book.
Regarding claim 20, Emmett et al in view of Manohar et al further in view of Boult teaches wherein is further trained at least in part with training data comprising one or more of: audio samples, page depth data, and inertial measurement data (Emmett et al: the system may use data from the mobile electronic device's inertial or motion sensors such as an accelerometer for detecting frames (paragraph 0053), and optionally additional sensing modules such as the motion sensor 405 and audio sensor 407 may be fed to a decision module 401 (paragraph 0067));
the machine learning model is trained (Manohar et al: using machine learning algorithms, a classifier may be trained to detect page-turn events in video data to create the page-turn detector 124 (column 7, lines 15-24)).
Therefore, it would have been obvious to a person with ordinary skill in the art to have modified Emmett et al's mobile device to include: the machine learning model is trained.
The reason for doing so would be to effectively and efficiently identify new or turned pages in a document or book.
Claim(s) 5 and 12 is/are rejected under 35 U.S.C. 103 as being unpatentable over Emmett et al US 2018/0278845 in view of Manohar et al US 9191554 further in view of Boult US 2019/0378283 further in view of Mathy US 2017/0272651 further in view of Swaminathan et al US 2012/0243732.
Regarding claim 5, Emmett et al in view of Manohar et al further in view of Boult further in view of Mathy teaches all of the limitations of claims 1 and 10.
Emmett et al in view of Manohar et al further in view of Boult further in view of Mathy fails to teach wherein the new page event is determined by the machine learning model based on a change in image histogram over a plurality of frames of the video stream;
Swaminathan et al teaches wherein the new page event is determined by the machine learning model based on a change in image histogram over a plurality of frames of the video stream (a typical book-flipping use case in which five pages are turned in 50 seconds, based on a combined optical flow and histogram-based scene change detector (SCD) (as described in FIG. 6) along with the reference-based tracker 314 and the timing manager 305 (FIG. 3) (Figs. 6-7 and paragraph 0062));
Therefore, it would have been obvious to a person with ordinary skill in the art to have modified Emmett et al in view of Manohar et al further in view of Boult further in view of Mathy's mobile device with machine learning to include: wherein the new page event is determined by the machine learning model based on a change in image histogram over a plurality of frames of the video stream.
The reason for doing so would be to improve the operability of the page turning in a document or book.
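For illustration only, the histogram-based scene change detection cited from Swaminathan et al can be sketched as follows: a coarse intensity histogram is computed for each frame, and a large shift between consecutive histograms flags a page change. All names and the threshold are illustrative assumptions, not code from the reference.

```python
def grayscale_histogram(frame, bins=8, max_value=256):
    """Count pixel intensities of a 2-D grayscale frame into coarse bins."""
    hist = [0] * bins
    bin_width = max_value // bins
    for row in frame:
        for pixel in row:
            hist[min(pixel // bin_width, bins - 1)] += 1
    return hist

def histogram_change(frame_a, frame_b):
    """L1 distance between the two frames' histograms, normalized by pixel count."""
    hist_a = grayscale_histogram(frame_a)
    hist_b = grayscale_histogram(frame_b)
    total = sum(hist_a) or 1
    return sum(abs(a - b) for a, b in zip(hist_a, hist_b)) / total

def is_new_page(frame_a, frame_b, threshold=0.5):
    """Flag a new page event when the histogram shift exceeds the threshold."""
    return histogram_change(frame_a, frame_b) > threshold
```

In practice the reference combines such a histogram comparison with optical flow; this sketch shows only the histogram component.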
Regarding claim 12, Emmett et al in view of Manohar et al further in view of Boult further in view of Mathy teaches all of the limitations of claim 10.
Manohar et al teaches wherein the new page event and the page capture event are determined by the machine learning model based on a plurality of frames of a video stream (Manohar et al: using machine learning algorithms, a classifier may be trained to detect page-turn events in video data to create the page-turn detector 124. By detecting where in the video data the page-turn events occur, frames corresponding to each page in the book may be identified. Identifying page-turn events may enable a temporal extent of each page in the frames of the video data to be determined (column 7, lines 15-24)).
Emmett et al in view of Manohar et al further in view of Boult further in view of Mathy fails to teach a change in image histogram over a plurality of frames of the video stream;
Swaminathan et al teaches a change in image histogram over a plurality of frames of the video stream (a typical book-flipping use case in which five pages are turned in 50 seconds, based on a combined optical flow and histogram-based scene change detector (SCD) (as described in FIG. 6) along with the reference-based tracker 314 and the timing manager 305 (FIG. 3) (Figs. 6-7 and paragraph 0062));
Therefore, it would have been obvious to a person with ordinary skill in the art to have modified Emmett et al in view of Manohar et al further in view of Boult further in view of Mathy's mobile device with machine learning to include: a change in image histogram over a plurality of frames of the video stream.
The reason for doing so would be to improve the operability of the page turning in a document or book.
Conclusion
Any inquiry concerning this communication should be directed to Michael Burleson whose telephone number is (571) 272-7460 and fax number is (571) 273-7460. The examiner can normally be reached Monday through Friday from 8:00 a.m. to 4:30 p.m. If attempts to reach the examiner by telephone are unsuccessful, the examiner's supervisor, Akwasi Sarpong, can be reached at (571) 270-3438.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.
Michael Burleson
Patent Examiner
Art Unit 2681
March 21, 2026
/MICHAEL BURLESON/
/AKWASI M SARPONG/ SPE, Art Unit 2681