DETAILED ACTION
Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
Response to Arguments
Applicant’s arguments, see Applicant’s Remarks, pages 10-17, filed 02/18/26, with respect to the rejection(s) of claim(s) 1-6, 8-13, 15-17, 19 and 20 have been fully considered and are persuasive. Therefore, the rejection has been withdrawn. However, upon further consideration, a new ground(s) of rejection is made in view of Boult US 2019/0378283.
Regarding claims 1, 10 and 17, Applicant has amended claims 1, 10 and 17 with the objected-to limitations of claims 7, 14 and 17 to recite that the machine learning model is trained with training data comprising a combination of document boundary detection, hand detection and a change in document histograms between two image frames (Applicant’s Remarks, pages 10-17). Boult teaches that bounding boxes in “ground truth” sweep images can then be used for machine learning to search for the parameters that optimize the sweep image-based transformation for the particular deployment location/conditions (paragraph 0062). The mapping between ground truth boxes in sweep images and frames is a bi-directional transformation of data using coordinates in the sweep images as a time index into the sequence of video frames. Thus, a bounding box in the sweep image can also be used to determine a sequence of bounding boxes in the raw video, and labeling data in the sweep image can be used to determine ground truth for training data from the video (change in document histograms between two image frames). With the front edge detected and tracked 830 and the trailing edge 832 identified, the object in the frame can be converted to a template 850 and tracked in the video (document boundary detection), and the boxes in the sweep image could be hand-drawn ground truth which could then be transferred to the video and used to speed up the ground-truth labeling of video data for training a video-based object detector (hand detection) (paragraph 0063). Note: the bounding box is used to train the machine learning model in paragraph 0062. The bounding box contains ground truth and training from detecting the leading edge and trailing edge (document boundary detection), ground truth between image frames (the transformation of data between the image frames is read as histogram data) and hand-drawn ground truth (hand detection).
Therefore, the bounding box contains document boundary detection, hand detection and a change in document histograms between two image frames as required to train the machine learning model. The claims are rejected.
Claim Rejections - 35 USC § 103
In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis (i.e., changing from AIA to pre-AIA) for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.
Claims 1-4, 6, 8-11, 13 and 15-16 are rejected under 35 U.S.C. 103 as being unpatentable over Emmett et al US 2018/0278845 in view of Manohar et al US 9191554, further in view of Boult US 2019/0378283, and further in view of Mathy US 2017/0272651.
Regarding claim 1, Emmett et al teaches a system comprising:
a memory component (memory (paragraph 0025)); and
one or more processing devices coupled to the memory component (processor and memory (paragraph 0025)), the one or more processing devices to perform operations comprising:
receiving event data comprising: a video stream and sensor data captured from one or more sensors of a user device, wherein the video stream includes image frames that capture a plurality of pages of a document (the video capture module to receive an image sequence of: (ii) a multi-page document while the user turns the pages; (iii) a multi-page document (paragraph 0038)). Note: each page of a document, or each of the multiple pages that the smartphone videos, is an image frame. Emmett et al teaches that the artifact is a document such as a single-page document or multi-page document. The sensing modules (sensor data) may include a video capture module of a mobile electronic device that, when initiated or launched, captures a video 301 of a scene that includes the artifact. The video capture may occur while the user moves the artifact, the image capture device, or both so that the video includes a sequence of image frames, at least some of which contain images of the artifact (paragraph 0038), which teaches that each page is an artifact and is an image frame of different pages of a document;
detecting a new page event, wherein the new page event indicates that a page of the plurality of pages available for scanning has changed from a first page to a second page (the video capture module to receive an image sequence of: (i) a single-page document while the user moves the electronic device around an area of where the artifact exists; (ii) a multi-page document while the user turns the pages; (iii) a multi-page document where the pages are laid out side by side on a surface (paragraph 0038). Note: Emmett et al teaches that video capture may occur while the user moves the artifact, the image capture device, or both so that the video includes a sequence of image frames, at least some of which contain images of the artifact (paragraph 0038). In other words, once a page of the document has been scanned for video, the user turns to the next page to be scanned for video, and the video capture module senses this event)
Emmett et al teaches, based on the detection of the new page event, capturing an image frame of the page from the video stream (video capture may occur while the user moves the artifact, the image capture device, or both so that the video includes a sequence of image frames, at least some of which contain images of the artifact (paragraph 0038). In other words, once a page of the document has been scanned for video, the user turns to the next page to be scanned for video, and the video capture module senses this event).
Emmett et al fails to teach detecting, via a machine learning model trained to infer events from the event data, a new page event; and detecting, via the machine learning model, a page capture event, wherein the page capture event indicates that at least one image from the image frames comprises a stable image of the page; and based on a combination of the detection by the machine learning model of both the new page event and the page capture event, capturing an image frame of the page from the video stream.
Manohar et al teaches detecting, via a machine learning model trained to infer events from the video stream, a new page event (A page-turn detector that is trained using machine-learning techniques may be used to identify frames of the video in which a page-turn event occurs (column 1, lines 63-67));
detecting, via the machine learning model, a page capture event, wherein the page capture event indicates that at least one image from the image frames comprises a stable image of the page (using machine learning algorithms, a classifier may be trained to detect page-turn events in video data to create the page-turn detector 124. By detecting where in the video data the page-turn events occur, frames (images) corresponding to each page in the book may be identified (column 7, lines 15-24). Note: a frame can be read as a stable image since a frame is a single still image); and
based on a combination of the detection by the machine learning model of both the new page event and the page capture event, capturing an image frame of the page from the video stream (using machine learning algorithms, a classifier may be trained to detect page-turn events (new page event) in video data to create the page-turn detector 124. The output of the page-turn detector 124 may include a start frame and an end frame (page capture event) for each page-turn event (new page event) that occurs in the video data (column 7, lines 15-24). By detecting the start frame and end frame of a page-turn event, this would read on the machine learning model detecting both the new page event and the page capture event)
Therefore, it would have been obvious to a person with ordinary skill in the art to have modified Emmett et al’s mobile device to include: detecting, via a machine learning model trained to infer events from the video stream, a new page event; detecting, via the machine learning model, a page capture event, wherein the page capture event indicates that at least one image from the image frames comprises a stable image of the page; and based on a combination of the detection by the machine learning model of both the new page event and the page capture event, capturing an image frame of the page from the video stream.
The reason for doing so would be to effectively and efficiently identify new or turned pages in a document or book.
Emmett et al in view of Manohar et al fails to teach wherein the machine learning model is trained with training data based on document boundary detection, hand detection and a change in document histogram between two data frames, wherein the document boundary detection model and the hand detection model compute the training data from ground truth training data;
Boult teaches wherein the machine learning model is trained with training data based on document boundary detection, hand detection and a change in document histogram between two data frames, wherein the document boundary detection model and the hand detection model compute the training data from ground truth training data (bounding boxes in “ground truth” sweep images can then be used for machine learning (paragraph 0062). The mapping between ground truth boxes in sweep images and frames is a bi-directional transformation of data using coordinates in the sweep images as a time index into the sequence of video frames. Thus, a bounding box in the sweep image can also be used to determine a sequence of bounding boxes in the raw video, and labeling data in the sweep image can be used to determine ground truth for training data from the video (change in document histograms between two image frames). With the front edge detected and tracked 830 and the trailing edge 832 identified, the object in the frame can be tracked in the video (document boundary detection), and the boxes in the sweep image could be hand-drawn ground truth which could then be transferred to the video and used to speed up the ground-truth labeling of video data for training a video-based object detector (hand detection) (paragraph 0063). Note: the bounding box is used to train the machine learning model in paragraph 0062. The bounding box contains ground truth and training from detecting the leading edge and trailing edge (document boundary detection), ground truth between image frames (the transformation of data between the image frames is read as histogram data) and hand-drawn ground truth (hand detection). Therefore, the bounding box contains document boundary detection, hand detection and a change in document histograms between two image frames as required to train the machine learning model).
Therefore, it would have been obvious to a person with ordinary skill in the art to have modified Emmett et al in view of Manohar et al’s mobile device with machine learning to include: wherein the machine learning model is trained with training data based on document boundary detection, hand detection and a change in document histogram between two data frames, wherein the document boundary detection model and the hand detection model compute the training data from ground truth training data.
The reason for doing so would be to improve the operability of the page turning in a document.
Emmett et al in view of Manohar et al further in view of Boult fails to teach wherein the sensor data includes sensor data frames comprising depth data representing a change in page depth, the sensor data frames synchronized in time with the image frames;
Mathy teaches wherein the sensor data includes sensor data frames comprising depth data representing a change in page depth, the sensor data frames synchronized in time with the image frames (machine learning algorithms performing feature extraction on depth images/videos usually assume that the scene of interest is described using a sequence (video) of frames at a given frame rate (typically 30 FPS). Each frame can either contain only a depth image, only an image obtained from a time-insensitive sensor, or a combination of both. This disclosure focuses on the case when each frame contains at least a depth image acquired by a time-of-flight camera (paragraphs 0097 and 0106). An algorithm is described that enables the updating of depth images in a sequence of frames by tracking objects undergoing translation without having to continuously turn on the laser. This procedure allows for the estimation of how the depth of each object changes given an initial depth measurement that is obtained using nominal laser power and a series of “cheaply acquired” images (paragraph 0140));
Therefore, it would have been obvious to a person with ordinary skill in the art to have modified Emmett et al in view of Manohar et al further in view of Boult’s mobile device with machine learning to include: wherein the sensor data includes sensor data frames comprising depth data representing a change in page depth, the sensor data frames synchronized in time with the image frames.
The reason for doing so would be to improve the operability of the page turning in a document or book.
Regarding claim 2, Emmett et al in view of Manohar et al further in view of Boult teaches after detecting the new page event (Emmett et al: A “decision module” refers to a software application that receives data or analysis from the video capture module and other sensors and applies one or more rules to determine whether the data or analysis satisfies one or more pre-defined criteria for triggering still photo capture (paragraph 0029)),
detecting the page capture event based on a page capture event confidence value generated by the model (the criteria used to identify a suitable instance may include a requirement that the image have at least a threshold image quality score representing machine readability of the printed artifact. The system may use any suitable method such as retrieving a score for the frame by processing the pooled features via a classifier (paragraph 0012). The module may employ different strategies to rank, weight and combine the analysis corresponding to the various sensing modules in order to make a final decision towards a suitable instance for still capture (paragraph 0029))
machine learning model (Manohar et al: A page-turn detector that is trained using machine-learning techniques may be used to identify frames of the video in which a page-turn event occurs (column 1, lines 63-67))
Therefore, it would have been obvious to a person with ordinary skill in the art to have modified Emmett et al’s mobile device to include: the machine learning model.
The reason for doing so would be to effectively and efficiently identify new or turned pages in a document or book.
Regarding claim 3, Emmett et al in view of Manohar et al further in view of Boult teaches receiving sensor data from one or more sensors of a user device, a weighted combination of the sensor data and the video stream (Emmett et al: A “decision module” refers to a software application that receives data or analysis from the video capture module and other sensors and applies one or more rules to determine whether the data or analysis satisfies one or more pre-defined criteria for triggering still photo capture. The module may employ different strategies to rank, weight and combine the analysis corresponding to the various sensing modules in order to make a final decision towards a suitable instance for still capture (paragraph 0029))
wherein the machine learning model is trained to detect the new page event based on (Manohar et al: using machine learning algorithms, a classifier may be trained to detect page-turn events in video data to create the page-turn detector 124. By detecting where in the video data the page-turn events occur, frames corresponding to each page in the book may be identified. Identifying page-turn events may enable a temporal extent of each page in the frames of the video data to be determined (column 7, lines 15-24))
Therefore, it would have been obvious to a person with ordinary skill in the art to have modified Emmett et al’s mobile device to include: wherein the machine learning model is trained to detect the new page event.
The reason for doing so would be to effectively and efficiently identify new or turned pages in a document or book.
Regarding claim 4, Emmett et al in view of Manohar et al further in view of Boult teaches wherein the one or more sensors comprise at least one of: a depth sensor; an audio sensor; or an inertial measurement sensor (Emmett et al: the system may use data from the mobile electronic device’s inertial or motion sensors such as an accelerometer for detecting frames (paragraph 0053), and optionally additional sensing modules such as the motion sensor 405 and audio sensor 407 may be fed to a decision module 401 (paragraph 0067))
Regarding claim 6, Emmett et al in view of Manohar et al further in view of Boult teaches processing a float value vector computed by the machine learning model from at least a first image frame to detect events from a second image frame (Manohar et al: using machine learning algorithms, a classifier may be trained to detect page-turn events in video data to create the page-turn detector 124. By detecting where in the video data the page-turn events (float vector) occur, frames corresponding to each page in the book may be identified (column 7, lines 15-18). Note: the page-turn event is read as a float vector because it is used to identify frames. In order to calculate the page-turn event, a first page or frame would have had to be detected, and data from the initial detection is used to detect the next pages, etc., indicating a page-turn event)
Regarding claim 8, Emmett et al in view of Manohar et al further in view of Boult teaches wherein the training data comprises one or more of: audio samples, page depth data, and inertial measurement data (Emmett et al: inputs from the video sensing and analysis module 403, and optionally additional sensing modules such as the motion sensor 405 and audio sensor 407, may be fed to a decision module 401 (paragraph 0067). Note: A “decision module” refers to a software application that receives data or analysis from the video capture module and other sensors and applies one or more rules to determine whether the data or analysis satisfies one or more pre-defined criteria for triggering still photo capture (paragraph 0029)).
the machine learning model is trained (Manohar et al: using machine learning algorithms, a classifier may be trained to detect page-turn events in video data to create the page-turn detector 124 (column 7, lines 15-24))
Therefore, it would have been obvious to a person with ordinary skill in the art to have modified Emmett et al’s mobile device to include: the machine learning model is trained.
The reason for doing so would be to effectively and efficiently identify new or turned pages in a document or book.
Regarding claim 9, Emmett et al in view of Manohar et al further in view of Boult teaches wherein the machine learning model generates an indication of the new page event in response to detecting a turn of a page from the video stream from the first page to the second page, or detecting a change in view from the video stream from the first page to the second page (Manohar et al: using machine learning algorithms, a classifier may be trained to detect page-turn events in video data to create the page-turn detector 124. By detecting where in the video data the page-turn events occur, frames corresponding to each page in the book may be identified (column 7, lines 15-18). Note: the machine learning algorithm allows for a frame of a page to be identified, indicating a page-turn event)
Regarding claim 10, Emmett et al teaches a non-transitory computer-readable medium storing executable instructions (paragraph 0073), which, when executed by a processing device, cause the processing device to perform operations comprising:
receiving event data comprising: a video stream and sensor data from one or more sensors of a user device (inputs from the video sensing and analysis module 403, and optionally additional sensing modules such as the motion sensor 405 and audio sensor 407, may be fed to a decision module 401 (paragraph 0067));
detecting, based on the sensor data, a new page event, wherein detection of the new page event indicates that a page of the plurality of pages available for scanning has changed from a first page to a second page (A “decision module” refers to a software application that receives data or analysis from the video capture module and other sensors and applies one or more rules to determine whether the data or analysis satisfies one or more pre-defined criteria for triggering still photo capture. The module may employ different strategies to rank, weight and combine the analysis corresponding to the various sensing modules in order to make a final decision towards a suitable instance for still capture (paragraph 0029). The video capture module to receive an image sequence of: (i) a single-page document while the user moves the electronic device around an area of where the artifact exists; (ii) a multi-page document while the user turns the pages; (iii) a multi-page document where the pages are laid out side by side on a surface (paragraph 0038). Note: Emmett et al teaches that video capture may occur while the user moves the artifact, the image capture device, or both so that the video includes a sequence of image frames, at least some of which contain images of the artifact (paragraph 0038). In other words, once a page of the document has been scanned for video, the user turns to the next page to be scanned for video, and the video capture module senses this event); and
detecting, by a model based on the event data, a page capture event, wherein detection of the page capture event indicates that the event data comprises a stable image of the page (A “decision module” refers to a software application that receives data or analysis from the video capture module and other sensors and applies one or more rules to determine whether the data or analysis satisfies one or more pre-defined criteria for triggering still photo capture. The module may employ different strategies to rank, weight and combine the analysis corresponding to the various sensing modules in order to make a final decision towards a suitable instance for still capture (paragraph 0029));
capturing an image frame of the page from the event data based on detection of the page capture event by the model (A “decision module” refers to a software application that receives data or analysis from the video capture module and other sensors and applies one or more rules to determine whether the data or analysis satisfies one or more pre-defined criteria for triggering still photo capture. The module may employ different strategies to rank, weight and combine the analysis corresponding to the various sensing modules in order to make a final decision towards a suitable instance for still capture (paragraph 0029). Video capture may occur while the user moves the artifact, the image capture device, or both so that the video includes a sequence of image frames, at least some of which contain images of the artifact (paragraph 0038). In other words, once a page of the document has been scanned for video, the user turns to the next page to be scanned for video, and the video capture module senses this event).
Emmett et al fails to teach capturing an image frame of the page based on a combination of both the detection of the new page event and the detection of the page capture event by the machine learning model,
Manohar et al teaches capturing an image frame of the page based on a combination of both the detection of the new page event and the detection of the page capture event by the machine learning model (using machine learning algorithms, a classifier may be trained to detect page-turn events (new page event) in video data to create the page-turn detector 124. The output of the page-turn detector 124 may include a start frame and an end frame (page capture event) for each page-turn event (new page event) that occurs in the video data (column 7, lines 15-24). Note: by detecting the start frame and end frame of a page-turn event, this would read on the machine learning model detecting both the new page event and the page capture event)
Therefore, it would have been obvious to a person with ordinary skill in the art to have modified Emmett et al’s mobile device to include: capturing an image frame of the page based on a combination of both the detection of the new page event and the detection of the page capture event by the machine learning model.
The reason for doing so would be to effectively and efficiently identify images in new or turned pages in a document or book.
Emmett et al in view of Manohar et al fails to teach wherein the machine learning model is trained with training data based on document boundary detection, hand detection and a change in document histogram between two data frames, wherein the document boundary detection model and the hand detection model compute the training data from ground truth training data;
Boult teaches wherein the machine learning model is trained with training data based on document boundary detection, hand detection and a change in document histogram between two data frames, wherein the document boundary detection model and the hand detection model compute the training data from ground truth training data (bounding boxes in “ground truth” sweep images can then be used for machine learning (paragraph 0062). The mapping between ground truth boxes in sweep images and frames is a bi-directional transformation of data using coordinates in the sweep images as a time index into the sequence of video frames. Thus, a bounding box in the sweep image can also be used to determine a sequence of bounding boxes in the raw video, and labeling data in the sweep image can be used to determine ground truth for training data from the video (change in document histograms between two image frames). With the front edge detected and tracked 830 and the trailing edge 832 identified, the object in the frame can be tracked in the video (document boundary detection), and the boxes in the sweep image could be hand-drawn ground truth which could then be transferred to the video and used to speed up the ground-truth labeling of video data for training a video-based object detector (hand detection) (paragraph 0063). Note: the bounding box is used to train the machine learning model in paragraph 0062. The bounding box contains ground truth and training from detecting the leading edge and trailing edge (document boundary detection), ground truth between image frames (the transformation of data between the image frames is read as histogram data) and hand-drawn ground truth (hand detection). Therefore, the bounding box contains document boundary detection, hand detection and a change in document histograms between two image frames as required to train the machine learning model).
Therefore, it would have been obvious to a person with ordinary skill in the art to have modified Emmett et al in view of Manohar et al’s mobile device with machine learning to include: wherein the machine learning model is trained with training data based on document boundary detection, hand detection and a change in document histogram between two data frames, wherein the document boundary detection model and the hand detection model compute the training data from ground truth training data.
The reason for doing so would be to improve the operability of the page turning in a document.
Emmett et al in view of Manohar et al further in view of Boult fails to teach wherein the sensor data includes sensor data frames comprising depth data representing a change in page depth, the sensor data frames synchronized in time with the image frames;
Mathy teaches wherein the sensor data includes sensor data frames comprising depth data representing a change in page depth, the sensor data frames synchronized in time with the image frames (machine learning algorithms performing feature extraction on depth images/videos usually assume that the scene of interest is described using a sequence (video) of frames at a given frame rate (typically 30 FPS). Each frame can either contain only a depth image, only an image obtained from a time-insensitive sensor, or a combination of both. This disclosure focuses on the case when each frame contains at least a depth image acquired by a time-of-flight camera (paragraphs 0097 and 0106). An algorithm is described that enables the updating of depth images in a sequence of frames by tracking objects undergoing translation without having to continuously turn on the laser. This procedure allows for the estimation of how the depth of each object changes given an initial depth measurement that is obtained using nominal laser power and a series of “cheaply acquired” images (paragraph 0140));
Therefore, it would have been obvious to a person with ordinary skill in the art to have modified Emmett et al in view of Manohar et al further in view of Boult’s mobile device with machine learning to include: wherein the sensor data includes sensor data frames comprising depth data representing a change in page depth, the sensor data frames synchronized in time with the image frames.
The reason for doing so would be to improve the operability of the page turning in a document or book.
Regarding claim 11, Emmett et al in view of Manohar et al further in view of Boult teaches after detecting the new page event (Emmett et al: A “decision module” refers to a software application that receives data or analysis from the video capture module and other sensors and applies one or more rules to determine whether the data or analysis satisfies one or more pre-defined criteria for triggering still photo capture (paragraph 0029)),
detecting the page capture event based on a page capture event confidence value generated by the model (the criteria used to identify a suitable instance may include a requirement that the image have at least a threshold image quality score representing machine readability of the printed artifact. The system may use any suitable method such as retrieving a score for the frame by processing the pooled features via a classifier (paragraph 0012). The module may employ different strategies to rank, weight and combine the analysis corresponding to the various sensing modules in order to make a final decision towards a suitable instance for still capture (paragraph 0029))
machine learning model (Manohar et al: A page-turn detector that is trained using machine-learning techniques may be used to identify frames of the video in which a page-turn event occurs (column 1, lines 63-67))
Therefore, it would have been obvious to a person with ordinary skill in the art to have modified Emmett et al's mobile device to include: the machine learning model.
The reason for doing so would be to effectively and efficiently identify new or turned pages in a document or book.
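For illustration only, the cited decision-module behavior (ranking, weighting and combining per-module scores against a quality threshold before triggering still capture) can be sketched as follows. All names and the threshold value are illustrative assumptions, not code from Emmett et al.

```python
def combined_score(scores, weights):
    """Weighted combination of per-sensing-module confidence scores,
    as the decision module ranks, weights and combines module analyses."""
    total_weight = sum(weights.values())
    return sum(scores[name] * weights[name] for name in scores) / total_weight

def should_capture(scores, weights, threshold=0.8):
    """Trigger still photo capture only when the combined confidence
    meets the pre-defined image quality threshold."""
    return combined_score(scores, weights) >= threshold
```

For example, image-quality and motion scores of 0.9 and 0.7, weighted 2:1, combine to about 0.83 and would trigger capture at a 0.8 threshold.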
Regarding claim 13, Emmett et al in view of Manohar et al further in view of Boult teaches processing a float value vector computed by the machine learning model from at least a first image frame from the sensor data to detect events from a second image frame of the sensor data (Manohar et al: using machine learning algorithms, a classifier may be trained to detect page-turn events in video data to create the page-turn detector 124. By detecting where in the video data the page-turn events (float vector) occur, frames corresponding to each page in the book may be identified (column 7, lines 15-18). Note: the page-turn event is read as a float vector because it is used to identify frames. In order to calculate the page-turn event, a first page or frame would have had to be detected, and data from the initial detection is used to detect the next pages, etc., indicating a page-turn event).
Regarding claim 15, Emmett et al in view of Manohar et al further in view of Boult teaches wherein the model detects the new page event based on detecting a turn of one or more pages of the plurality of pages, or detecting of a change in view from the sensor data from a first document page to a second document page (Emmett et al: the user may operate the video capture module to receive an image sequence of: (i) a single-page document while the user moves the electronic device around an area of where the artifact exists; (ii) a multi-page document while the user turns the pages; (iii) a multi-page document where the pages are laid out side by side on a surface; (iv) a multi-sided document while the user flips the document over (paragraph 0038));
the machine learning model (Manohar et al: A page-turn detector that is trained using machine-learning techniques may be used to identify frames of the video in which a page-turn event occurs (column 1, lines 63-67)).
Therefore, it would have been obvious to a person with ordinary skill in the art to have modified Emmett et al's mobile device to include: the machine learning model.
The reason for doing so would be to effectively and efficiently identify new or turned pages in a document or book.
Regarding claim 16, Emmett et al in view of Manohar et al further in view of Boult teaches wherein the model detects the page capture event at least in part based on a combination of image stream data and inertial measurements from the one or more sensors (Emmett et al: the system may receive motion sensor data as the video is captured and correlate the motion sensor data to the analyzed image frames (paragraph 0009). A decision module 401 analyzes various input data to determine a suitable instance of an image frame that will trigger the system to switch from image capture. Input may include: image frames and/or analyzed data about image frames from the video capture module 403; data from one or more motion sensors 405 (paragraph 0046));
the machine learning model (Manohar et al: A page-turn detector that is trained using machine-learning techniques may be used to identify frames of the video in which a page-turn event occurs (column 1, lines 63-67)).
Therefore, it would have been obvious to a person with ordinary skill in the art to have modified Emmett et al's mobile device to include: the machine learning model.
The reason for doing so would be to effectively and efficiently identify new or turned pages in a document or book.
Claim(s) 17, 19, 20 is/are rejected under 35 U.S.C. 103 as being unpatentable over Emmett et al US 2018/0278845 in view of Manohar et al US 9191554 further in view of Boult US 2019/0378283.
Regarding claim 17, Emmett et al teaches a method comprising:
receiving training dataset comprising a video stream, wherein the video stream includes image frames that capture a plurality of pages of a document (the video capture module to receive an image sequence of: (ii) a multi-page document while the user turns the pages; (iii) a multi-page document (paragraph 0038). Note: each page of a document, or each of the multiple pages that the smartphone videos, is an image frame. Emmett et al teaches the artifact is a document such as a single-page document or multi-page document. The sensing modules may include a video capture module of a mobile electronic device that, when initiated or launched, captures a video 301 of a scene that includes the artifact. The video capture may occur while the user moves the artifact, the image capture device, or both so that the video includes a sequence of image frames, at least some of which contain images of the artifact (paragraph 0038), which teaches that each page is an artifact and an image frame of a different page of a document); and
wherein the new page event indicates that a page available for scanning has changed from a first page to a second page (the video capture module to receive an image sequence of: (i) a single-page document while the user moves the electronic device around an area of where the artifact exists; (ii) a multi-page document while the user turns the pages; (iii) a multi-page document where the pages are laid out side by side on a surface (paragraph 0038). Note: Emmett et al teaches video capture may occur while the user moves the artifact, the image capture device, or both so that the video includes a sequence of image frames, at least some of which contain images of the artifact (paragraph 0038). In other words, once a page of the document has been scanned for video, the user turns to the next page to be scanned for video, and the video capture module senses this event).
Emmett et al fails to teach using the training dataset, to detect from a set of one or more image frames from the video stream, a new page event and a page capture event;
training a machine learning model,
wherein the page capture event indicates that the video frame comprises a stable image of the page;
Manohar et al teaches using the training dataset, to detect from a set of one or more image frames from the video stream, a new page event and a page capture event (the training videos 206 (training data set) may identify which frames in the training videos 206 include page-turn events (a new page event) and/or a temporal extent of each page in the frames (a page capture event) of the training videos 206. Each of the training videos 206 may identify which frames include a page-turn event (a new page event), which frames include a particular page in a book (a page capture event), or both (column 6, lines 49-60). Thus, using machine learning algorithms, a classifier may be trained to detect page-turn events in video data to create the page-turn detector 124. By detecting where in the video data the page-turn events occur, frames corresponding to each page in the book may be identified. Identifying page-turn events may enable a temporal extent of each page in the frames of the video data to be determined (column 7, lines 16-24));
training a machine learning model (A page-turn detector that is trained using machine-learning techniques may be used to identify frames of the video in which a page-turn event occurs (column 1, lines 63-67));
wherein the page capture event indicates that the video frame comprises a stable image of the page (using machine learning algorithms, a classifier may be trained to detect page-turn events in video data to create the page-turn detector 124. By detecting where in the video data the page-turn events occur, frames (images) corresponding to each page in the book may be identified (column 7, lines 15-24). Note: a frame can be read as a stable image since a frame is a single still image).
Therefore, it would have been obvious to a person with ordinary skill in the art to have modified Emmett et al's mobile device to include: using the training dataset, to detect from a set of one or more image frames from the video stream, a new page event and a page capture event; training a machine learning model; and wherein the page capture event indicates that the video frame comprises a stable image of the page.
The reason for doing so would be to effectively and efficiently identify new or turned pages in a document or book.
Emmett et al in view of Manohar et al fails to teach wherein the machine learning model is trained with training data produced from document boundary detection, hand detection and a change in document histogram between two data frames, wherein the document boundary detection model and the hand detection model compute the training data from ground truth training data;
Boult teaches wherein the machine learning model is trained with training data produced from document boundary detection, hand detection and a change in document histogram between two data frames, wherein the document boundary detection model and the hand detection model compute the training data from ground truth training data (bounding boxes in "ground truth" sweep images can then be used for machine learning (paragraph 0062). The mapping between ground truth boxes in sweep images and frames is a bi-directional transformation of data using coordinates in the sweep images as a time index into the sequence of video frames. Thus, a bounding box in the sweep image can also be used to determine a sequence of bounding boxes in the raw video, and labeling data in the sweep image can be used to determine ground truth for training data from the video (change in document histograms between two image frames). With the front edge detected and tracked 830 and the trailing edge 832 identified, the object in the frame can be tracked in the video (document boundary detection), and the boxes in the sweep image could be hand-drawn ground truth which could then be transferred to the video and used to speed up the ground-truth labeling of video data for training a video-based object detector (hand detection) (paragraph 0063). Note: the bounding box is used to train the machine learning model in paragraph 0062. The bounding box contains ground truth and training from detecting the leading edge and trailing edge (document boundary detection), ground truth between image frames (the transformation of data between the image frames is read as histogram data) and hand-drawn ground truth (hand detection). Therefore, the bounding box contains the document boundary detection, hand detection and change in document histograms between two image frames required to train the machine learning model);
Therefore, it would have been obvious to a person with ordinary skill in the art to have modified Emmett et al in view of Manohar et al's mobile device with machine learning to include: wherein the machine learning model is trained with training data produced from document boundary detection, hand detection and a change in document histogram between two data frames, wherein the document boundary detection model and the hand detection model compute the training data from ground truth training data.
The reason for doing so would be to improve the operability of the page turning in a document.
Regarding claim 19, Emmett et al in view of Manohar et al further in view of Boult teaches wherein with training data produced from one or both of a document boundary detection model and a hand detection model (Emmett et al: the step of determining whether a frame or set of frames satisfies image quality criteria may include determining whether the image contained in the frames exhibits movement of an object such as page turn or hand interaction or camera motion (paragraph 0045). Note: Emmett et al looks at the image in the frame (document boundary) and hand interaction (hand detection). Edge detection is also used (paragraph 0045));
the machine learning model is trained at least in part (Manohar et al: A page-turn detector that is trained using machine-learning techniques may be used to identify frames of the video in which a page-turn event occurs (column 1, lines 63-67)).
Therefore, it would have been obvious to a person with ordinary skill in the art to have modified Emmett et al's mobile device to include: the machine learning model is trained at least in part.
The reason for doing so would be to effectively and efficiently identify new or turned pages in a document or book.
Regarding claim 20, Emmett et al in view of Manohar et al further in view of Boult teaches wherein is further trained at least in part with training data comprising one or more of: audio samples, page depth data, and inertial measurement data (Emmett et al: the system may use data from the mobile electronic device's inertial or motion sensors such as an accelerometer for detecting frames (paragraph 0053), and optionally additional sensing modules such as the motion sensor 405 and audio sensor 407 may be fed to a decision module 401 (paragraph 0067));
the machine learning model is trained (Manohar et al: using machine learning algorithms, a classifier may be trained to detect page-turn events in video data to create the page-turn detector 124 (column 7, lines 15-24)).
Therefore, it would have been obvious to a person with ordinary skill in the art to have modified Emmett et al's mobile device to include: the machine learning model is trained.
The reason for doing so would be to effectively and efficiently identify new or turned pages in a document or book.
Claim(s) 5 and 12 is/are rejected under 35 U.S.C. 103 as being unpatentable over Emmett et al US 2018/0278845 in view of Manohar et al US 9191554 further in view of Boult US 2019/0378283 further in view of Mathy US 2017/0272651 further in view of Swaminathan et al US 2012/0243732.
Regarding claim 5, Emmett et al in view of Manohar et al further in view of Boult further in view of Mathy teaches all of the limitations of claims 1 and 10.
Emmett et al in view of Manohar et al further in view of Boult further in view of Mathy fails to teach wherein the new page event is determined by the machine learning model based on a change in image histogram over a plurality of frames of the video stream;
Swaminathan et al teaches wherein the new page event is determined by the machine learning model based on a change in image histogram over a plurality of frames of the video stream (a typical book-flipping use case in which five pages are turned in 50 seconds, based on a combined optical flow and histogram-based scene change detector (SCD) (as described in FIG. 6) along with the reference-based tracker 314 and the timing manager 305 (FIG. 3) (Figs. 6-7 and paragraph 0062));
Therefore, it would have been obvious to a person with ordinary skill in the art to have modified Emmett et al in view of Manohar et al further in view of Boult further in view of Mathy's mobile device with machine learning to include: wherein the new page event is determined by the machine learning model based on a change in image histogram over a plurality of frames of the video stream.
The reason for doing so would be to improve the operability of the page turning in a document or book.
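For illustration only, the histogram-based scene change detection cited from Swaminathan et al can be sketched as follows: a coarse intensity histogram is computed for each frame, and a large shift between consecutive histograms flags a page change. All names and the threshold are illustrative assumptions, not code from the reference.

```python
def grayscale_histogram(frame, bins=8, max_value=256):
    """Count pixel intensities of a 2-D grayscale frame into coarse bins."""
    hist = [0] * bins
    bin_width = max_value // bins
    for row in frame:
        for pixel in row:
            hist[min(pixel // bin_width, bins - 1)] += 1
    return hist

def histogram_change(frame_a, frame_b):
    """L1 distance between the two frames' histograms, normalized by pixel count."""
    hist_a = grayscale_histogram(frame_a)
    hist_b = grayscale_histogram(frame_b)
    total = sum(hist_a) or 1
    return sum(abs(a - b) for a, b in zip(hist_a, hist_b)) / total

def is_new_page(frame_a, frame_b, threshold=0.5):
    """Flag a new page event when the histogram shift exceeds the threshold."""
    return histogram_change(frame_a, frame_b) > threshold
```

In practice the reference combines such a histogram comparison with optical flow; this sketch shows only the histogram component.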
Regarding claim 12, Emmett et al in view of Manohar et al further in view of Boult further in view of Mathy teaches all of the limitations of claim 10.
Manohar et al teaches wherein the new page event and the page capture event are determined by the machine learning model based on a plurality of frames of a video stream (Manohar et al: using machine learning algorithms, a classifier may be trained to detect page-turn events in video data to create the page-turn detector 124. By detecting where in the video data the page-turn events occur, frames corresponding to each page in the book may be identified. Identifying page-turn events may enable a temporal extent of each page in the frames of the video data to be determined (column 7, lines 15-24)).
Emmett et al in view of Manohar et al further in view of Boult further in view of Mathy fails to teach a change in image histogram over a plurality of frames of the video stream;
Swaminathan et al teaches a change in image histogram over a plurality of frames of the video stream (a typical book-flipping use case in which five pages are turned in 50 seconds, based on a combined optical flow and histogram-based scene change detector (SCD) (as described in FIG. 6) along with the reference-based tracker 314 and the timing manager 305 (FIG. 3) (Figs. 6-7 and paragraph 0062));
Therefore, it would have been obvious to a person with ordinary skill in the art to have modified Emmett et al in view of Manohar et al further in view of Boult further in view of Mathy's mobile device with machine learning to include: a change in image histogram over a plurality of frames of the video stream.
The reason for doing so would be to improve the operability of the page turning in a document or book.
Conclusion
Any inquiry concerning this communication should be directed to Michael Burleson whose telephone number is (571) 272-7460 and fax number is (571) 273-7460. The examiner can normally be reached Monday through Friday from 8:00 a.m. to 4:30 p.m. If attempts to reach the examiner by telephone are unsuccessful, the examiner's supervisor, Akwasi Sarpong, can be reached at (571) 270-3438.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.
Michael Burleson
Patent Examiner
Art Unit 2681
March 21, 2026
/MICHAEL BURLESON/
/AKWASI M SARPONG/ SPE, Art Unit 2681