Prosecution Insights
Last updated: April 19, 2026
Application No. 18/185,791

CONDITIONED SMART IMAGE CROPPING

Final Rejection (§101, §103, §112)
Filed: Mar 17, 2023
Examiner: DIGUGLIELMO, DANIELLA MARIE
Art Unit: 2666
Tech Center: 2600 — Communications
Assignee: Microsoft Technology Licensing, LLC
OA Round: 2 (Final)

Grant Probability: 81% (Favorable)
OA Rounds: 3-4
To Grant: 2y 9m
With Interview: 99%

Examiner Intelligence

Career Allow Rate: 81% (above average; 137 granted / 170 resolved; +18.6% vs TC avg)
Interview Lift: +26.4% (strong; across resolved cases with an interview)
Avg Prosecution: 2y 9m (typical timeline; 25 applications currently pending)
Total Applications: 195 (career history, across all art units)

Statute-Specific Performance

Statute   Rate    vs TC Avg
§101      12.9%   -27.1%
§103      35.5%   -4.5%
§102      10.4%   -29.6%
§112      33.1%   -6.9%
Tech Center averages are estimates. Based on career data from 170 resolved cases.
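
For readers who want to reproduce the headline figures, here is a minimal Python sketch built only from the counts shown above. The "with interview" arithmetic at the end is an assumption made for illustration; the dashboard does not disclose its exact formula.

```python
# Reproduce the headline examiner figures from the raw counts shown above.
granted, resolved = 137, 170

allow_rate = granted / resolved
print(f"Career allow rate: {allow_rate:.1%}")   # 80.6%, displayed as 81%

# The +18.6% delta implies this Tech Center average:
tc_avg = allow_rate - 0.186
print(f"Implied TC average: {tc_avg:.1%}")      # about 62.0%

# Assumption (not a documented formula): if the +26.4-point interview lift
# is the gap between interviewed and non-interviewed resolved cases, a
# non-interview baseline near 72.6% lands at the 99% shown with interview.
lift, baseline_without = 0.264, 0.726
print(f"With interview: {baseline_without + lift:.0%}")  # 99%
```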

Office Action

§101 §103 §112
DETAILED ACTION

Notice of Pre-AIA or AIA Status

The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.

Status of Claims

Claims 1-7, 11-17, and 20 are pending. Claims 8-10 and 18-19 are canceled.

Information Disclosure Statement

The information disclosure statement (IDS) submitted on 1/6/26 was filed after the mailing date of the Non-final Office Action on 10/27/25. The submission is in compliance with the provisions of 37 CFR 1.97. Accordingly, the information disclosure statement is being considered by the examiner.

Response to Arguments

Applicant's arguments, see p. 10, filed 2/27/26, with respect to the drawings have been fully considered and are persuasive. The drawing objection of 10/27/25 has been withdrawn. Applicant's arguments, see p. 10, filed 2/27/26, with respect to the specification have been fully considered and are persuasive. The specification objections of 10/27/25 have been withdrawn. Applicant's arguments, see p. 11, filed 2/27/26, with respect to claims 7 and 17 have been fully considered and are persuasive. The 35 U.S.C. 112(b) rejections of 10/27/25 have been withdrawn.

Applicant's arguments filed 2/27/26 with respect to the 35 U.S.C. 101 rejections and the 35 U.S.C. 103 rejections have been fully considered, but they are not persuasive.

First, Applicant argues, in pp. 11-15 of the remarks, that the 35 U.S.C. 101 rejections should be withdrawn. The Examiner respectfully disagrees. Claims 1-7, 11-17, and 20 still recite an abstract idea, such as a process that, under its broadest reasonable interpretation, covers performance of the limitation manually or in the mind by a human. For example, with respect to claim 11 (method claim), a person/user may identify features in an image, determine a feature of interest/target feature in the image, determine which features in the image are relevant to the target feature (i.e., identify and prioritize features that have aesthetic ambiguity), and crop/cut areas of the image to generate cropped images in which the images are cropped with and without having the optimal/ideal size or aesthetic quality for the features of interest. This is a concept that falls under the mental-processes grouping of abstract ideas, i.e., a concept performed in the human mind: evaluation, judgment, and/or opinion of a human. The amendments therefore do not provide a technical solution/practical application. To advance prosecution, the Examiner suggests that Applicant include details of the ML model and how the ML model is specifically trained.

Second, Applicant argues, in pp. 15-18 of the remarks, that the prior art of record does not teach the following limitation in claim 1: "cropping, based on the one or more cropping candidate portions, the source image multiple times to generate a plurality of cropped images, wherein the source image is cropped based on a set of cropping rules such that each cropped image is in the optimal image configuration but has a different image configuration, characteristics, visual impression, or aesthetical value." The Examiner respectfully disagrees. The prior art of record Horanyi in the combination teaches generating bounding box annotations for each user description (i.e., a bounding box annotation is generated for the user descriptions of two people sitting on a bench on the grassland and a happy man wearing a white shirt is standing on the green grass. The same image is cropped to produce a cropped image of a happy man wearing a white shirt and a cropped image of two people sitting on a bench, and the image is cropped so that the cropped image includes the two people sitting on a bench) (see Fig. 1 and Pg. 2). Horanyi also teaches different captions generating different crops of an image (see Fig. 8). Additionally, Horanyi teaches an optimization process in which the scale is reduced at each iteration (see Pg. 4 and Algorithm 1), there are multiple cropping regions for each iteration (i.e., yellow bounding boxes), the average of the optimization is output (i.e., red bounding box), and the image is cropped (see Fig. 4). Since the optimal image for each iteration is of a different size/scale, the optimal images have a different image configuration/characteristic.

Third, Applicant argues, in pp. 15-18 of the remarks, that the prior art of record does not teach the following limitations in claim 1: "determining contextual broadness or ambiguousness of each of the plurality of target features; prioritizing the plurality of target features based on the contextual broadness or ambiguousness of each target feature; and identifying a portion of the source image contextually relevant to each target feature." The Examiner respectfully disagrees. Deng in the combination teaches scores that determine aesthetic ambiguity, in which higher scores are treated as positive examples and lower scores are treated as negative examples (see Pg. 85 and Fig. 3), classifying features to distinguish images of ambiguous quality (see Pg. 87), and cropped images that are inherently ambiguous (see Pg. 99). The Examiner interprets the high/low scoring as an example of prioritizing. Horanyi in the combination teaches generating bounding box annotations for each user description/caption to crop the image (i.e., a bounding box annotation is generated for the user description of two people sitting on a bench on the grassland) (see Fig. 1 and Pg. 2) and bounding box annotations for different images (see Fig. 7). The Examiner interprets the bounding box for the descriptions/captions as the portion of the image that is contextually relevant.

Claim Objections

Claim 1 is objected to because of the following informalities: in line 20, "characteristics" should read –characteristic–. Appropriate correction is required.
Claim 3 is objected to because of the following informalities: in lines 4 and 7-8, "plurality of target features" should read –the plurality of target features–. Appropriate correction is required.
Claim 11 is objected to because of the following informalities: in line 15, "characteristics" should read –characteristic–. Appropriate correction is required.
Claim 12 is objected to because of the following informalities: in line 3, "plurality of target features" should read –the plurality of target features–. Appropriate correction is required.
Claim 13 is objected to because of the following informalities: in lines 4 and 6-7, "plurality of target features" should read –the plurality of target features–. Appropriate correction is required.
Claim 14 is objected to because of the following informalities: in lines 3, 4, and 6, "plurality of target features" should read –the plurality of target features–. Appropriate correction is required.
Claim 20 is objected to because of the following informalities: in line 17, "characteristics" should read –characteristic–. Appropriate correction is required.

Claim Rejections - 35 USC § 112

The following is a quotation of 35 U.S.C. 112(b):

(b) CONCLUSION.—The specification shall conclude with one or more claims particularly pointing out and distinctly claiming the subject matter which the inventor or a joint inventor regards as the invention.

The following is a quotation of 35 U.S.C. 112 (pre-AIA), second paragraph:

The specification shall conclude with one or more claims particularly pointing out and distinctly claiming the subject matter which the applicant regards as his invention.

Claims 1-7, 11-17, and 20 are rejected under 35 U.S.C. 112(b) or 35 U.S.C. 112 (pre-AIA), second paragraph, as being indefinite for failing to particularly point out and distinctly claim the subject matter which the inventor or a joint inventor (or for applications subject to pre-AIA 35 U.S.C. 112, the applicant) regards as the invention.

Claim 1 recites the limitation "contextually relevant" in line 30. It is unclear and indefinite whether this is the same as the "contextual relevance" previously recited in the claim. Claims 2-7 depend on claim 1 and are therefore also rejected under 112(b). Claim 11 recites the limitation "contextually relevant" in line 23. It is unclear and indefinite whether this is the same as the "contextual relevance" previously recited in the claim. Claims 12-17 depend on claim 11 and are therefore also rejected under 112(b). Claim 20 recites the limitation "contextually relevant" in line 27. It is unclear and indefinite whether this is the same as the "contextual relevance" previously recited in the claim.

Claim Rejections - 35 USC § 101

35 U.S.C. 101 reads as follows:

Whoever invents or discovers any new and useful process, machine, manufacture, or composition of matter, or any new and useful improvement thereof, may obtain a patent therefor, subject to the conditions and requirements of this title.

Claims 1-7, 11-17, and 20 are rejected under 35 U.S.C. 101 because the claimed invention is directed to an abstract idea without significantly more. The claims recite a system, method, and non-transitory computer-readable medium for image cropping. With respect to the analysis of claim 11 (claims 1 and 20 recite similar limitations):

Step 1: The claim is directed to a method; therefore, the claim is directed to one of the statutory categories of inventions.
Step 2A, Prong One: The limitations "determining, based on the user intention data, a plurality of target features; identifying a plurality of visual features within the source image; determining a contextual relevance between each target feature and each visual feature of the source image; identifying, based on the determined contextual relevance between each target feature and each identified visual feature of the source image, one or more cropping candidate portions within the source image, each cropping candidate portion showing at least one of the plurality of target features without being in an optimal image configuration; cropping, based on the one or more cropping candidate portions, the source image to generate a plurality of cropped images, wherein the source image is cropped based on a set of cropping rules such that each cropped image is in the optimal image configuration but has a different image configuration, characteristics, visual impression or aesthetical value; wherein determining the contextual relevance between each target feature and each visual feature comprises: determining contextual broadness or ambiguousness of each of the plurality of target features; prioritizing the plurality of target features based on the contextual broadness or ambiguousness of each target feature; and identifying a portion of the source image contextually relevant to each target feature," as drafted, recite an abstract idea, such as a process that, under its broadest reasonable interpretation, covers performance of the limitation manually or in the mind by a human. That is, a person/user may identify features in an image, determine a feature of interest/target feature in the image, determine which features in the image are relevant to the target feature (i.e., identify and prioritize features that have aesthetic ambiguity), and crop/cut areas of the image to generate cropped images in which the images are cropped with and without having the optimal/ideal size or aesthetic quality for the features of interest. This is a concept that falls under the mental-processes grouping of abstract ideas, i.e., a concept performed in the human mind: evaluation, judgment, and/or opinion of a human.

Step 2A, Prong Two: The 2019 PEG defines the phrase "integration into a practical application" to require an additional element or a combination of additional elements in the claim to apply, rely on, or use the judicial exception.
In the instant case, there are no additional steps/elements/limitations in the claims, with the exception of the following: "a processor", "a computer-readable medium in communication with the processor, the computer-readable medium comprising instructions that, when executed by the processor, cause the processor to control the system to perform functions of: receiving a source image and user intention data", "causing the plurality of cropped images to be displayed on a display", and "the instructions, when executed by the processor, further cause the processor to control the system to perform functions of" in claim 1; "receiving a source image and user intention data" and "causing the plurality of cropped images to be displayed on a display" in claim 11; and "a non-transitory computer-readable medium comprising instructions that, when executed by a processor, cause the processor to control a system to perform: receiving a source image and user intention data", "and causing the plurality of cropped images to be displayed on a display", and "the instructions, when executed by the processor, further cause the processor to control the system to perform functions of" in claim 20.

The receiving limitation is just data gathering/data acquisition (i.e., data input). The processor, computer-readable medium/non-transitory computer-readable medium, and display are generic computer components. Additionally, the "causing…display" limitation is just data output. These limitations are regarded as adding routine and conventional elements to perform the judicial exception, and they do not apply it in a practical application. Accordingly, the above-mentioned additional elements/limitations do not integrate the abstract idea into a practical application; therefore, the claims recite an abstract idea.

Step 2B: Because the claims fail under Step 2A, the claims are further evaluated under Step 2B. The claims herein do not include additional elements that are sufficient to amount to significantly more than the judicial exception because, as discussed above with respect to integration of the abstract idea into a practical application, the additional elements/limitations amount to no more than insignificant routine and conventional elements. Mere instructions to apply an exception using generic components cannot provide an inventive concept. Therefore, independent claims 1, 11, and 20 are not patent eligible.

Furthermore, with regard to dependent claims 2-7 and 12-17 viewed individually, these additional steps, under their broadest reasonable interpretation, provide extra-solution activities that cover performance of the limitations as an abstract idea, and do not provide meaningful limitations to transform the abstract idea into a patent-eligible application of the abstract idea such that the claims amount to significantly more than the abstract idea itself. Accordingly, they are not patent eligible.

Claim Rejections - 35 USC § 103

The following is a quotation of 35 U.S.C. 103, which forms the basis for all obviousness rejections set forth in this Office action:

A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

Claims 1-2, 4-7, 11-12, 14-17, and 20 are rejected under 35 U.S.C. 103 as being unpatentable over "Repurposing existing deep networks for caption and aesthetic-guided image cropping" by Horanyi et al. (hereinafter "Horanyi") in view of Csurka (US 2009/0208118 A1), and further in view of "Learning the Relation Between Interested Objects and Aesthetic Region for Image Cropping" by Lu et al. (hereinafter "Lu") and "Image Aesthetic Assessment" by Deng et al. (hereinafter "Deng").

Regarding claim 1, Horanyi teaches a system for cropping an image, comprising (Abstract: "We propose a novel optimization framework that crops a given image based on user description and aesthetics"; Figs. 1 and 2; As shown in Pg. 7, the system produces more visually appealing crops of the original image):

receiving a source image and user intention data (As shown in Fig. 1, an image is cropped based on different users' descriptions; Fig. 2; Pg. 2-3; Pg. 4: "ensure that the crop will include all the relevant parts of the image based on the user's description"; Pg. 4, Algorithm 1: an image and user caption are input; As shown in Pg. 6 and Fig. 8, different captions lead to different crops);

determining, based on the user intention data, a plurality of target features (As shown in Fig. 1 and Pg. 2, bounding box annotations are generated for each user description/caption to crop the image; As shown in Pg. 5 and Fig. 7, the ground truth annotation regions/bounding box annotations are selected in the images based on the captions);

identifying a plurality of visual features within the source image (As shown in Fig. 1, different features are identified by bounding box annotations in the same image, such as a man wearing a white shirt and two people sitting on a bench; Pg. 2);

determining a contextual relevance between each target feature and each visual feature of the source image (As shown in Fig. 1 and Pg. 2, bounding box annotations are generated for each user description/caption to crop the image (i.e., a bounding box annotation is generated for the user description of two people sitting on a bench on the grassland); Fig. 7);

identifying, based on the determined contextual relevance between each target feature and each identified visual feature of the source image, one or more cropping candidate portions within the source image, each cropping candidate portion showing at least one of the plurality of target features (As shown in Fig. 1 and Pg. 2, bounding box annotations are generated for each user description/caption to crop the image (i.e., a bounding box annotation is generated for the user descriptions of two people sitting on a bench on the grassland and a happy man wearing a white shirt is standing on the green grass. The same image is cropped to produce a cropped image of a happy man wearing a white shirt and a cropped image of two people sitting on a bench, and the image is cropped so that the cropped image includes the two people sitting on a bench); Fig. 8; As shown in Pg. 4 and Algorithm 1, there is optimization in which the scale is reduced at each iteration; As shown in Fig. 4, there are multiple cropping regions for each iteration (i.e., yellow bounding boxes), the average of the optimization is output (i.e., red bounding box), and the image is cropped. The optimal image for each iteration is of a different size/scale (i.e., has a different image configuration/characteristic));

cropping, based on the one or more cropping candidate portions, the source image multiple times to generate a plurality of cropped images, wherein the source image is cropped based on a set of cropping rules such that each cropped image is in the optimal image configuration but has a different image configuration, characteristics, visual impression or aesthetical value (As shown in Fig. 1 and Pg. 2, bounding box annotations are generated for each user description/caption to crop the image (i.e., a bounding box annotation is generated for the user descriptions of two people sitting on a bench on the grassland and a happy man wearing a white shirt is standing on the green grass. The same image is cropped to produce a cropped image of a happy man wearing a white shirt and a cropped image of two people sitting on a bench, and the image is cropped so that the cropped image includes the two people sitting on a bench); Fig. 8; As shown in Pg. 4 and Algorithm 1, there is optimization in which the scale is reduced at each iteration; As shown in Fig. 4, there are multiple cropping regions for each iteration (i.e., yellow bounding boxes), the average of the optimization is output (i.e., red bounding box), and the image is cropped. The optimal image for each iteration is of a different size/scale (i.e., has a different image configuration/characteristic));

wherein, for determining the contextual relevance between each target feature and each visual feature (As shown in Fig. 1 and Pg. 2, bounding box annotations are generated for each user description/caption to crop the image (i.e., a bounding box annotation is generated for the user description of two people sitting on a bench on the grassland); Fig. 7) and identifying a portion of the source image contextually relevant to each target feature (As shown in Fig. 1 and Pg. 2, bounding box annotations are generated for each user description/caption to crop the image (i.e., a bounding box annotation is generated for the user description of two people sitting on a bench on the grassland); Fig. 7; Note: the Examiner interprets the bounding box as the portion of the image).
Horanyi does not expressly disclose the following limitations: a processor; and a computer-readable medium in communication with the processor, the computer-readable medium comprising instructions that, when executed by the processor, cause the processor to control the system to perform functions of: without being in an optimal image configuration; and causing the plurality of cropped images to be displayed on a display, the instructions, when executed by the processor, further cause the processor to control the system to perform functions of: determining contextual broadness or ambiguousness of each of the plurality of target features; prioritizing the plurality of target features based on the contextual broadness or ambiguousness of each target feature.

However, Csurka teaches a processor (Para. 0034: the processor executes the instructions for performing the method); and a computer-readable medium in communication with the processor, the computer-readable medium comprising instructions that, when executed by the processor, cause the processor to control the system to perform functions of (Para. 0034: the processor executes the instructions for performing the method; Para. 0043: the method may be implemented in a computer program product, such as a computer-readable recording medium, executed on a computer); and causing the plurality of cropped images to be displayed on a display (As shown in Para. 0001, the exemplary embodiment involves context dependent cropping of images for generation of thumbnail images; Para. 0003: "A thumbnail image is derived from a source image by resizing and/or cropping the source image"; Para. 0018; As shown in Para. 0109, a candidate region is identified and the source image is cropped to generate a thumbnail to be displayed on a screen of a mobile device), the instructions, when executed by the processor, further cause the processor to control the system to perform functions of (Para. 0034: the processor executes the instructions for performing the method; Para. 0043: the method may be implemented in a computer program product, such as a computer-readable recording medium, executed on a computer).

It would have been obvious, before the effective filing date of the claimed invention, to one of ordinary skill in the art to combine a system including a computer-readable medium comprising instructions in which a processor performs instructions, such as displaying cropped images on a display, as taught by Csurka with the method of Horanyi in order to assist a user in assessing the relevance of the document (Csurka, Para. 0018). Therefore, one of ordinary skill in the art would have been capable of combining the elements as claimed by known methods, and, in combination, each element merely performs the same function as it does separately.

The combination of Horanyi and Csurka does not expressly disclose the following limitations: without being in an optimal image configuration; determining contextual broadness or ambiguousness of each of the plurality of target features; prioritizing the plurality of target features based on the contextual broadness or ambiguousness of each target feature. However, Lu teaches, without being in an optimal image configuration (As shown in Pg. 3626 and Fig. 6, the optimized cropping windows are the red boxes and the detected IOLs are the green boxes; Note: the Examiner interprets the IOLs (i.e., green boxes) as the cropping portions not in an optimal image configuration).
It would have been obvious, before the effective filing date of the claimed invention, to one of ordinary skill in the art to combine cropped image portions with image features not being in an optimal image configuration, as taught by Lu, with the combined method of Horanyi and Csurka in order to see that the cropped images obtain better composition and aspect ratio than the original images (Lu, Pg. 3626). Therefore, one of ordinary skill in the art would have been capable of combining the elements as claimed by known methods, and, in combination, each element merely performs the same function as it does separately.

The combination of Horanyi, Csurka, and Lu does not expressly disclose the following limitations: determining contextual broadness or ambiguousness of each of the plurality of target features; prioritizing the plurality of target features based on the contextual broadness or ambiguousness of each target feature. However, Deng teaches, determining contextual broadness or ambiguousness of each of the plurality of target features (As shown in Pg. 85, images—which contain attributes—receive an aesthetic quality classification score. The scores determine aesthetic ambiguity, in which higher scores are treated as positive examples and lower scores are treated as negative examples; Fig. 3; As shown in Pg. 87, features are classified to distinguish images of ambiguous quality; Pg. 99: cropped patches were inherently ambiguous); prioritizing the plurality of target features based on the contextual broadness or ambiguousness of each target feature (As shown in Pg. 85, images—which contain attributes—receive an aesthetic quality classification score. The scores determine aesthetic ambiguity, in which higher scores are treated as positive examples and lower scores are treated as negative examples; Fig. 3; As shown in Pg. 87, features are classified to distinguish images of ambiguous quality; Pg. 99: cropped patches were inherently ambiguous; Note: the Examiner interprets the high/low scoring as an example of prioritizing).

It would have been obvious, before the effective filing date of the claimed invention, to one of ordinary skill in the art to combine determining the ambiguousness of features and prioritizing features based on ambiguousness, as taught by Deng, with the combined method of Horanyi, Csurka, and Lu in order to assess image aesthetic quality and distinguish high-quality from low-quality photos (Deng, Pg. 80). Therefore, one of ordinary skill in the art would have been capable of combining the elements as claimed by known methods, and, in combination, each element merely performs the same function as it does separately. It is for at least the aforementioned reasons that the Examiner has reached a conclusion of obviousness with respect to claim 1.

Regarding claim 2, the combination of Horanyi, Csurka, Lu, and Deng teaches the limitations as explained above in claim 1. The combination of Horanyi, Csurka, Lu, and Deng further teaches the system of claim 1 (see claim 1 above), wherein the user intention data includes at least one of text data (Horanyi: As shown in Fig. 1, an image is cropped based on different users' descriptions (i.e., text of "two people sitting on a bench on the grassland" and "a happy man wearing a white shirt is standing on the green grass"); Horanyi, Fig. 2: user caption; Horanyi: Pg. 2-4), audio data (Csurka: As shown in Para. 0029, there is audio information; Csurka: As shown in Para. 0107, audio transcript can provide the focus of interest of the video shot), image data (Horanyi: As shown in Fig. 1, cropped images have different users' descriptions; Horanyi: As shown in Pg. 6 and Fig. 8, different captions lead to different image crops) or video data containing content characterizing each target feature (Csurka: As shown in Paras. 0028-0029, there are video images; Csurka: As shown in Para. 0107, there are video shots/frames and a relevant thumbnail is created from the frames). The proposed combination as well as the motivation for combining the Horanyi, Csurka, Lu, and Deng references presented in the rejection of claim 1 apply to claim 2 and are incorporated herein by reference. Thus, the system recited in claim 2 is met by Horanyi, Csurka, Lu, and Deng.

Regarding claim 4, the combination of Horanyi, Csurka, Lu, and Deng teaches the limitations as explained above in claim 1. The combination of Horanyi, Csurka, Lu, and Deng further teaches the system of claim 1 (see claim 1 above), wherein: the user intention data includes audio data capturing a speech characterizing each target feature (Csurka: As shown in Para. 0029, the source image may be a video with audio information; Csurka: As shown in Para. 0107, audio transcript can provide the focus of interest of the video shot), and for determining each target feature to be extracted from the source image (Csurka: As shown in Para. 0029, the source image may be a video with audio information; Csurka: As shown in Para. 0107, an audio transcript can provide the focus of interest of the video shot and a thumbnail is created from the keyframes using the derived transcript), the instructions, when executed by the processor, further cause the processor to control the system to perform functions of (Csurka, Para. 0034: the processor executes the instructions for performing the method; Csurka, Para. 0043: the method may be implemented in a computer program product, such as a computer-readable recording medium, executed on a computer): converting the speech captured in the audio data to a text (Csurka: As shown in Para. 0029, a text transcript is derived from audio information of a video); and analyzing the text to identify each target feature (Csurka: As shown in Para. 0107, an audio transcript can provide the focus of interest of the video shot and a thumbnail is created from the keyframes using the derived transcript). The proposed combination as well as the motivation for combining the Horanyi, Csurka, Lu, and Deng references presented in the rejection of claim 1 apply to claim 4 and are incorporated herein by reference. Thus, the system recited in claim 4 is met by Horanyi, Csurka, Lu, and Deng.

Regarding claim 5, the combination of Horanyi, Csurka, Lu, and Deng teaches the limitations as explained above in claim 1. The combination of Horanyi, Csurka, Lu, and Deng further teaches the system of claim 1 (see claim 1 above), wherein the instructions, when executed by the processor, further cause the processor to control the system to perform a function of (Csurka, Para. 0034: the processor executes the instructions for performing the method; Csurka, Para. 0043: the method may be implemented in a computer program product, such as a computer-readable recording medium, executed on a computer) providing the source image to a machine learning (ML) engine trained to perform the functions of (Horanyi: As shown in Pg. 2, a new deep networks repurposing framework is proposed to optimize crop parameters directly using a bilinear sampler, a pre-trained image captioning network, and a pre-trained aesthetic estimation network; Horanyi: Pg. 3; Horanyi: As shown in Fig. 2, images are input into the bilinear sampler and then the image captioning network and aesthetic network; Horanyi: Pg. 5, section 4.1.1.): identifying the plurality of visual features within the source image (Horanyi: As shown in Fig. 1, different features are identified by bounding box annotations in the same image, such as a man wearing a white shirt and two people sitting on a bench; Horanyi: Pg. 2); determining the contextual relevance between each target feature and each visual feature of the source image (Horanyi: As shown in Fig. 1 and Pg. 2, bounding box annotations are generated for each user description/caption to crop the image (i.e., a bounding box annotation is generated for the user description of two people sitting on a bench on the grassland); Horanyi: Fig. 7); and identifying, based on the determined contextual relevance, the plurality of cropping candidate portions within the source image (Horanyi: As shown in Fig. 1 and Pg. 2, bounding box annotations are generated for each user description/caption to crop the image (i.e., a bounding box annotation is generated for the user description of two people sitting on a bench on the grassland, and the image is cropped so that the cropped image includes the two people sitting on a bench); Horanyi: Fig. 8). The proposed combination as well as the motivation for combining the Horanyi, Csurka, Lu, and Deng references presented in the rejection of claim 1 apply to claim 5 and are incorporated herein by reference. Thus, the system recited in claim 5 is met by Horanyi, Csurka, Lu, and Deng.

Regarding claim 6, the combination of Horanyi, Csurka, Lu, and Deng teaches the limitations as explained above in claim 1. The combination of Horanyi, Csurka, Lu, and Deng further teaches the system of claim 1 (see claim 1 above), wherein, for cropping the source image to generate the plurality of cropped images (Horanyi: As shown in Fig. 1 and Pg. 2, bounding box annotations are generated for each user description/caption to crop the image (i.e., a bounding box annotation is generated for the user descriptions of two people sitting on a bench on the grassland and a happy man wearing a white shirt is standing on the green grass. The same image is cropped to produce a cropped image of a happy man wearing a white shirt and a cropped image of two people sitting on a bench, and the image is cropped so that the cropped image includes the two people sitting on a bench); Horanyi: Fig. 8), the instructions, when executed by the processor, further cause the processor to control the system to perform cropping (Csurka: As shown in Para. 0027, source images are cropped to generate thumbnail images; Csurka, Para. 0034: the processor executes the instructions for performing the method and controls the overall operation of the computing device; Csurka, Para. 0043: the method may be implemented in a computer program product, such as a computer-readable recording medium, executed on a computer), based on the set of cropping rules, the source image, the set of cropping rules being determined based on at least one of usage data/statistics (Csurka: As shown in Para. 0027, the source image is cropped based on its relevance to a visual class. One or more probability maps are generated in which pixels of the source image are assigned a probability of being associated with a set of visual classes, and then regions of the image are identified as candidates for cropping; Note: the identification of regions to crop using data such as probability is interpreted as usage data), user preferences (Horanyi: As shown in Pg. 3, visual aesthetic preference is described; Horanyi: As shown in Pg. 7, users are asked to select the crop described by the caption that looks the best) or esthetical evaluation statistics (Horanyi: As shown in Pg. 3, aesthetic scores are provided for the image; Horanyi: Fig. 2, aesthetic network). The proposed combination as well as the motivation for combining the Horanyi, Csurka, Lu, and Deng references presented in the rejection of claim 1 apply to claim 6 and are incorporated herein by reference. Thus, the system recited in claim 6 is met by Horanyi, Csurka, Lu, and Deng.

Regarding claim 7, the combination of Horanyi, Csurka, Lu, and Deng teaches the limitations as explained above in claim 1. The combination of Horanyi, Csurka, Lu, and Deng further teaches the system of claim 1 (see claim 1 above), wherein the set of cropping rules includes at least one of an image size or aspect ratio (Horanyi: As shown in Pg. 3-4, there are cropping sizes with respect to the original image size in which the correct scale is chosen to ensure that the crop includes all the relevant parts of the image based on the user's description; Note: the Examiner selects the image size limitation).

Claim 11 recites a method with steps corresponding to the elements of the system recited in claim 1. Therefore, the recited steps of this claim are mapped to the proposed combination in the same manner as the corresponding elements in its corresponding system claim. Additionally, the rationale and motivation to combine the Horanyi, Csurka, Lu, and Deng references, presented in the rejection of claim 1, apply to this claim.

Regarding claim 12, the combination of Horanyi, Csurka, Lu, and Deng teaches the limitations as explained above in claim 11. The combination of Horanyi, Csurka, Lu, and Deng further teaches the method of claim 11 (see claim 11 above), wherein the user intention data includes at least one of text data (Horanyi: As shown in Fig. 1, an image is cropped based on different users' descriptions (i.e., text of "two people sitting on a bench on the grassland" and "a happy man wearing a white shirt is standing on the green grass"); Horanyi, Fig. 2: user caption; Horanyi: Pg. 2-4), audio data (Csurka: As shown in Para. 0029, there is audio information; Csurka: As shown in Para. 0107, audio transcript can provide the focus of interest of the video shot), image data (Horanyi: As shown in Fig. 1, cropped images have different users' descriptions; Horanyi: As shown in Pg. 6 and Fig. 8, different captions lead to different image crops) or video data containing content characterizing plurality of target features (Csurka: As shown in Paras. 0028-0029, there are video images; Csurka: As shown in Para. 0107, there are video shots/frames and a relevant thumbnail is created from the frames). The proposed combination as well as the motivation for combining the Horanyi, Csurka, Lu, and Deng references presented in the rejection of claim 11 apply to claim 12 and are incorporated herein by reference. Thus, the method recited in claim 12 is met by Horanyi, Csurka, Lu, and Deng.
Regarding claim 14, the combination of Horanyi, Csurka, Lu, and Deng teaches the limitations as explained above in claim 11. The combination of Horanyi, Csurka, Lu, and Deng further teaches the method of claim 11 (see claim 11 above), wherein: the user intention data includes audio data capturing a speech characterizing plurality of target features (Csurka: As shown in Para. 0029, the source image may be a video with audio information; Csurka: As shown in Para. 0107, audio transcript can provide the focus of interest of the video shot), and determining plurality of target features comprises (Csurka: As shown in Para. 0029, the source image may be a video with audio information; Csurka: As shown in Para. 0107, an audio transcript can provide the focus of interest of the video shot and a thumbnail is created from the keyframes using the derived transcript): converting the speech captured in the audio data to a text (Csurka: As shown in Para. 0029, a text transcript is derived from audio information of a video); and analyzing the text to identify plurality of target features (Csurka: As shown in Para. 0107, an audio transcript can provide the focus of interest of the video shot and a thumbnail is created from the keyframes using the derived transcript). The proposed combination as well as the motivation for combining the Horanyi, Csurka, Lu, and Deng references presented in the rejection of claim 11 apply to claim 14 and are incorporated herein by reference. Thus, the method recited in claim 14 is met by Horanyi, Csurka, Lu, and Deng.

Claim 15 is rejected for similar reasons as those described in claim 5. The proposed combination as well as the motivation for combining the Horanyi, Csurka, Lu, and Deng references presented in the rejection of claim 5 apply to claim 15 and are incorporated herein by reference. Thus, the method recited in claim 15 is met by Horanyi, Csurka, Lu, and Deng.

Claim 16 is rejected for similar reasons as those described in claim 6. The proposed combination as well as the motivation for combining the Horanyi, Csurka, Lu, and Deng references presented in the rejection of claim 6 apply to claim 16 and are incorporated herein by reference. Thus, the method recited in claim 16 is met by Horanyi, Csurka, Lu, and Deng.

Claim 17 is rejected for similar reasons as those described in claim 7. The proposed combination as well as the motivation for combining the Horanyi, Csurka, Lu, and Deng references presented in the rejection of claim 7 apply to claim 17 and are incorporated herein by reference. Thus, the method recited in claim 17 is met by Horanyi, Csurka, Lu, and Deng.

Claim 20 recites a non-transitory computer-readable storage medium comprising instructions corresponding to the steps recited in claim 1. Therefore, the recited instructions of this claim are mapped to the proposed combination in the same manner as the corresponding steps in its corresponding system claim. Additionally, the rationale and motivation to combine the Horanyi, Csurka, Lu, and Deng references, presented in the rejection of claim 1, apply to this claim. Finally, the combination of the Horanyi, Csurka, Lu, and Deng references discloses a non-transitory computer-readable storage medium (for example, see Csurka, Para. 0043: the method may be implemented in a computer program product, such as a computer-readable recording medium, executed on a computer).

Claims 3 and 13 are rejected under 35 U.S.C. 103 as being unpatentable over "Repurposing existing deep networks for caption and aesthetic-guided image cropping" by Horanyi et al. (hereinafter "Horanyi") in view of Csurka (US 2009/0208118 A1), and further in view of "Learning the Relation Between Interested Objects and Aesthetic Region for Image Cropping" by Lu et al. (hereinafter "Lu"), "Image Aesthetic Assessment" by Deng et al. (hereinafter "Deng"), and Jones et al. (US 2019/0042850 A1; hereinafter "Jones").

Regarding claim 3, the combination of Horanyi, Csurka, Lu, and Deng teaches the limitations as explained above in claim 1. The combination of Horanyi, Csurka, Lu, and Deng further teaches the system of claim 1 (see claim 1 above), wherein: the user intention data includes video data containing content characterizing each target feature (Csurka: As shown in Paras. 0028-0029, there are video images; Csurka: As shown in Para. 0107, there are video shots/frames and a relevant thumbnail is created from the frames), and for determining plurality of target features (Horanyi: As shown in Fig. 1 and Pg. 2, bounding box annotations are generated for each user description/caption to crop the image; Horanyi: As shown in Pg. 5 and Fig. 7, the ground truth annotation regions/bounding box annotations are selected in the images based on the captions), the instructions, when executed by the processor, further cause the processor to control the system to perform functions of (Csurka, Para. 0034: the processor executes the instructions for performing the method; Csurka, Para. 0043: the method may be implemented in a computer program product, such as a computer-readable recording medium, executed on a computer). The proposed combination as well as the motivation for combining the Horanyi, Csurka, Lu, and Deng references presented in the rejection of claim 1 apply to claim 3 and are incorporated herein by reference.

The combination of Horanyi, Csurka, Lu, and Deng does not expressly disclose the following limitations: converting the video data to one or more images; and analyzing the one or more images to identify plurality of target features. However, Jones teaches, converting the video data to one or more images (Para. 0017: videos are partitioned into video frames); and analyzing the one or more images to identify plurality of target features (Para. 0017: a bounding box is located around the tracked object and images are cropped using the bounding box). It would have been obvious, before the effective filing date of the claimed invention, to one of ordinary skill in the art to combine converting video data into images and analyzing the images to identify target features, as taught by Jones, with the combined method of Horanyi, Csurka, Lu, and Deng in order to track the object (Jones, Para. 0017) and detect instances of objects (Jones, Para. 0008). Therefore, one of ordinary skill in the art would have been capable of combining the elements as claimed by known methods, and, in combination, each element merely performs the same function as it does separately. It is for at least the aforementioned reasons that the Examiner has reached a conclusion of obviousness with respect to claim 3.

Claim 13 is rejected for similar reasons as those described in claim 3. The proposed combination as well as the motivation for combining the Horanyi, Csurka, Lu, Deng, and Jones references presented in the rejection of claim 3 apply to claim 13 and are incorporated herein by reference. Thus, the method recited in claim 13 is met by Horanyi, Csurka, Lu, Deng, and Jones.
Conclusion

THIS ACTION IS MADE FINAL. Applicant is reminded of the extension of time policy as set forth in 37 CFR 1.136(a).

A shortened statutory period for reply to this final action is set to expire THREE MONTHS from the mailing date of this action. In the event a first reply is filed within TWO MONTHS of the mailing date of this final action and the advisory action is not mailed until after the end of the THREE-MONTH shortened statutory period, then the shortened statutory period will expire on the date the advisory action is mailed, and any nonprovisional extension fee (37 CFR 1.17(a)) pursuant to 37 CFR 1.136(a) will be calculated from the mailing date of the advisory action. In no event, however, will the statutory period for reply expire later than SIX MONTHS from the mailing date of this final action.

Contact Information

Any inquiry concerning this communication or earlier communications from the examiner should be directed to Daniella M. DiGuglielmo, whose telephone number is (571) 272-0183. The examiner can normally be reached Monday - Friday, 8:00 AM - 4:00 PM.

Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO-supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.

If attempts to reach the examiner by telephone are unsuccessful, the examiner's supervisor, Emily Terrell, can be reached at (571) 270-3717. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.

Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.

/Daniella M. DiGuglielmo/
Examiner, Art Unit 2666

/EMILY C TERRELL/
Supervisory Patent Examiner, Art Unit 2666
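
The Office Action above repeatedly paraphrases the claim-1 pipeline: user intention data yields target features, target features are prioritized by contextual broadness/ambiguousness, contextual relevance to detected visual features yields cropping candidate portions, and a set of cropping rules drives multiple crops. For readers mapping the claim language, here is a minimal Python sketch of that flow. Every function name, field, and scoring heuristic is a hypothetical stand-in for the unspecified ML components; this illustrates the claim wording, not the applicant's implementation or any cited reference's code.

```python
from dataclasses import dataclass

@dataclass
class TargetFeature:
    label: str
    ambiguity: float  # contextual broadness/ambiguousness: 0 = specific, 1 = broad

def determine_target_features(user_intention: str) -> list:
    # Stand-in for intent analysis over text/audio/image/video (claims 2-4).
    return [TargetFeature(word, ambiguity=0.5) for word in user_intention.split()]

def contextual_relevance(target: TargetFeature, visual: dict) -> float:
    # Stand-in for the claimed relevance scoring between a target feature
    # and a visual feature detected in the source image.
    return 1.0 if target.label in visual["tags"] else 0.0

def apply_rule(box: tuple, rule: str) -> dict:
    # Stand-in for one cropping rule (e.g., an image size or aspect-ratio
    # constraint, per claim 7) applied to a candidate region.
    return {"region": box, "rule": rule}

def smart_crop(visual_features: list, user_intention: str, rules: list) -> list:
    targets = determine_target_features(user_intention)
    # Prioritize target features by contextual broadness/ambiguousness.
    targets.sort(key=lambda t: t.ambiguity, reverse=True)
    # Identify candidate portions: regions contextually relevant to a target.
    candidates = [(t, vf["box"]) for t in targets for vf in visual_features
                  if contextual_relevance(t, vf) > 0.5]
    # Crop multiple times: one output per candidate portion per rule, so each
    # cropped image shares a target feature but differs in configuration.
    return [apply_rule(box, rule) for _, box in candidates for rule in rules]

crops = smart_crop(
    visual_features=[{"tags": {"bench", "people"}, "box": (40, 60, 300, 200)}],
    user_intention="people bench",
    rules=["16:9", "1:1"],
)
print(crops)  # four crops: two matched targets x two rules
```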
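
Several of the Examiner's mappings rest on Horanyi's Algorithm 1, in which the search scale is reduced at each iteration, multiple candidate boxes (the yellow boxes of Fig. 4) are evaluated per iteration, and their average (the red box) becomes the output crop. Below is a rough Python sketch of that shrinking-scale search, with a toy scoring function standing in for Horanyi's pre-trained captioning and aesthetic networks; the sampling details are assumptions for illustration, not a transcription of the paper's algorithm.

```python
import random

def score(box) -> float:
    # Toy stand-in for Horanyi's combined caption-match + aesthetic score,
    # which the paper computes with pre-trained networks; here we simply
    # prefer square-ish crops so the example runs end to end.
    x, y, w, h = box
    return -abs(w - h)

def optimize_crop(width, height, iterations=10, samples=20, shrink=0.9):
    """Shrinking-scale search, loosely after Horanyi's Algorithm 1: sample
    candidate boxes around the current estimate at a decreasing scale, keep
    the top scorers (the yellow boxes), and average them into the output
    box (the red box)."""
    best = (0, 0, width, height)  # start from the full frame
    scale = 1.0
    for _ in range(iterations):
        step = max(1, int(scale * min(width, height) * 0.25))
        candidates = []
        for _ in range(samples):
            x = min(max(0, best[0] + random.randint(-step, step)), width - 32)
            y = min(max(0, best[1] + random.randint(-step, step)), height - 32)
            w = max(16, min(width - x, best[2] + random.randint(-step, step)))
            h = max(16, min(height - y, best[3] + random.randint(-step, step)))
            candidates.append((x, y, w, h))
        top = sorted(candidates, key=score, reverse=True)[:5]
        best = tuple(sum(b[i] for b in top) // len(top) for i in range(4))
        scale *= shrink  # the scale is reduced at each iteration
    return best

print(optimize_crop(640, 480))  # e.g., a roughly square box inside 640x480
```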

Prosecution Timeline

Mar 17, 2023
Application Filed
Oct 21, 2025
Non-Final Rejection — §101, §103, §112
Nov 10, 2025
Interview Requested
Dec 10, 2025
Examiner Interview Summary
Dec 10, 2025
Applicant Interview (Telephonic)
Feb 27, 2026
Response Filed
Mar 10, 2026
Final Rejection — §101, §103, §112
Apr 15, 2026
Interview Requested

Precedent Cases

Applications granted by this examiner in similar technology areas

Patent 12586401
SYSTEMS AND METHODS FOR REPRESENTING AND SEARCHING CHARACTERS
Granted Mar 24, 2026 (2y 5m to grant)
Patent 12567228
IMAGE DATA PROCESSING METHOD, IMAGE DATA PROCESSING APPARATUS, AND COMMERCIAL USE
Granted Mar 03, 2026 (2y 5m to grant)
Patent 12567266
IMAGE RECOGNITION SYSTEM AND IMAGE RECOGNITION METHOD
Granted Mar 03, 2026 (2y 5m to grant)
Patent 12555372
IMAGE SENSOR EVALUATION METHOD USING COMPUTING DEVICE INCLUDING PROCESSOR
Granted Feb 17, 2026 (2y 5m to grant)
Patent 12548147
Systems and Methods Related to Age-Related Macular Degeneration
Granted Feb 10, 2026 (2y 5m to grant)
Study what changed to get past this examiner. Based on the 5 most recent grants.


Prosecution Projections

Expected OA Rounds: 3-4
Grant Probability: 81%
With Interview: 99% (+26.4%)
Median Time to Grant: 2y 9m
PTA Risk: Moderate
Based on 170 resolved cases by this examiner. Grant probability derived from career allow rate.
