DETAILED ACTION
Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
Information Disclosure Statement
The information disclosure statement (IDS) submitted on 09/20/2024 has been considered by the examiner.
Specification
The disclosure is objected to because paragraphs [0019] and [0023] contain embedded hyperlinks and/or other forms of browser-executable code. Applicant is required to delete the embedded hyperlinks and/or other forms of browser-executable code; references to websites should be limited to the top-level domain name without any prefix such as http:// or other browser-executable code. See MPEP § 608.01.
Double Patenting
The nonstatutory double patenting rejection is based on a judicially created doctrine grounded in public policy (a policy reflected in the statute) so as to prevent the unjustified or improper timewise extension of the “right to exclude” granted by a patent and to prevent possible harassment by multiple assignees. A nonstatutory double patenting rejection is appropriate where the conflicting claims are not identical, but at least one examined application claim is not patentably distinct from the reference claim(s) because the examined application claim is either anticipated by, or would have been obvious over, the reference claim(s). See, e.g., In re Berg, 140 F.3d 1428, 46 USPQ2d 1226 (Fed. Cir. 1998); In re Goodman, 11 F.3d 1046, 29 USPQ2d 2010 (Fed. Cir. 1993); In re Longi, 759 F.2d 887, 225 USPQ 645 (Fed. Cir. 1985); In re Van Ornum, 686 F.2d 937, 214 USPQ 761 (CCPA 1982); In re Vogel, 422 F.2d 438, 164 USPQ 619 (CCPA 1970); In re Thorington, 418 F.2d 528, 163 USPQ 644 (CCPA 1969).
A timely filed terminal disclaimer in compliance with 37 CFR 1.321(c) or 1.321(d) may be used to overcome an actual or provisional rejection based on nonstatutory double patenting provided the reference application or patent either is shown to be commonly owned with the examined application, or claims an invention made as a result of activities undertaken within the scope of a joint research agreement. See MPEP § 717.02 for applications subject to examination under the first inventor to file provisions of the AIA, as explained in MPEP § 2159. See MPEP § 2146 et seq. for applications not subject to examination under the first inventor to file provisions of the AIA. A terminal disclaimer must be signed in compliance with 37 CFR 1.321(b).
The filing of a terminal disclaimer by itself is not a complete reply to a nonstatutory double patenting (NSDP) rejection. A complete reply requires that the terminal disclaimer be accompanied by a reply requesting reconsideration of the prior Office action. Even where the NSDP rejection is provisional, the reply must be complete. See MPEP § 804, subsection I.B.1. For a reply to a non-final Office action, see 37 CFR 1.111(a). For a reply to a final Office action, see 37 CFR 1.113(c). A request for reconsideration, while not provided for in 37 CFR 1.113(c), may be filed after final for consideration. See MPEP §§ 706.07(e) and 714.13.
The USPTO Internet website contains terminal disclaimer forms which may be used. Please visit www.uspto.gov/patent/patents-forms. The actual filing date of the application in which the form is filed determines what form (e.g., PTO/SB/25, PTO/SB/26, PTO/AIA/25, or PTO/AIA/26) should be used. A web-based eTerminal Disclaimer may be filled out completely online using web-screens. An eTerminal Disclaimer that meets all requirements is auto-processed and approved immediately upon submission. For more information about eTerminal Disclaimers, refer to www.uspto.gov/patents/apply/applying-online/eterminal-disclaimer.
Claims 21-40 are rejected on the ground of nonstatutory double patenting as being unpatentable over claims 1-20 of U.S. Patent No. 12014259 B2. Although the claims at issue are not identical, they are not patentably distinct from each other because independent claims 21, 33, and 40 of instant application 18662584 are anticipated by corresponding claims 1, 13, and 20 of U.S. Patent No. 12014259 B2. The rejected dependent claims of the instant application appear nearly word for word in the corresponding claims of U.S. Patent No. 12014259 B2.
Instant Application 18662584
U.S. Patent No. 12014259 B2
21. A method performed by one or more computers, the method comprising: obtaining an input image; processing the input image to generate an alternative representation of the input image in an embedding space representative of features for images rather than text; and processing the alternative representation of the input image using a neural network to generate a sequence of words in a target natural language relating to content of the input image, including selecting at least one word for inclusion in the sequence of words by conditioning the neural network on a preceding word in the sequence of words.
1. A method performed by one or more computers, the method comprising: obtaining an input image; processing the input image to generate an alternative representation of the input image in an embedding space representative of features for images rather than text; and processing the alternative representation of the input image using a neural network to generate a sequence of words in a target natural language that describes the input image, including using the neural network to select words for inclusion in the sequence of words until a stop condition occurs that indicates an end to the sequence of words, wherein each word in the sequence of words after an initial word is selected by conditioning the neural network on a preceding word in the sequence of words.
33. A system comprising one or more computers and one or more storage devices storing instructions that are operable, when executed by the one or more computers, to cause the one or more computers to perform operations comprising: obtaining an input image; processing the input image to generate an alternative representation of the input image in an embedding space representative of features for images rather than text; and processing the alternative representation of the input image using a neural network to generate a sequence of words in a target natural language relating to content of the input image, including selecting at least one word for inclusion in the sequence of words by conditioning the neural network on a preceding word in the sequence of words.
13. A system comprising one or more computers and one or more storage devices storing instructions that are operable, when executed by the one or more computers, to cause the one or more computers to perform operations comprising: obtaining an input image; processing the input image to generate an alternative representation of the input image in an embedding space representative of features for images rather than text; and processing the alternative representation of the input image using a neural network to generate a sequence of words in a target natural language that describes the input image, including using the neural network to select words for inclusion in the sequence of words until a stop condition occurs that indicates an end to the sequence of words, wherein each word in the sequence of words after an initial word is selected by conditioning the neural network on a preceding word in the sequence of words.
40. A computer program product encoded on one or more non-transitory computer storage media, the computer program product comprising instructions that, when executed by one or more computers, cause the one or more computers to perform operations comprising: obtaining an input image; processing the input image to generate an alternative representation of the input image in an embedding space representative of features for images rather than text; and processing the alternative representation of the input image using a neural network to generate a sequence of words in a target natural language relating to content of the input image, including selecting at least one word for inclusion in the sequence of words by conditioning the neural network on a preceding word in the sequence of words.
20. A computer program product encoded on one or more non-transitory computer storage media, the computer program product comprising instructions that, when executed by one or more computers, cause the one or more computers to perform operations comprising: obtaining an input image; processing the input image to generate an alternative representation of the input image in an embedding space representative of features for images rather than text; and processing the alternative representation of the input image using a neural network to generate a sequence of words in a target natural language that describes the input image, including using the neural network to select words for inclusion in the sequence of words until a stop condition occurs that indicates an end to the sequence of words, wherein each word in the sequence of words after an initial word is selected by conditioning the neural network on a preceding word in the sequence of words.
Claim Rejections - 35 USC § 103
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.
Claims 21-23, 27, 31-35 and 39-40 are rejected under 35 U.S.C. 103 as being unpatentable over Koros et al. (arXiv:1411.2539v1, 10 Nov 2014), hereinafter Koros, in view of Karpathy et al. (arXiv:1406.5679, 22 Jun 2014), hereinafter Karpathy.
-Regarding claim 21, Koros discloses a method performed by one or more computers (one or more processors/computers and memories must be used to implement the system and method in Koros’ Figure 2), the method comprising (Abstract; Figures 1-5): obtaining an input image (Figure 2, input image); processing the input image to generate an alternative representation of the input image in an embedding space representative of features for both the image and a sentence (Figure 2, Encoder, Multimodal space; Sections 1.2-1.3; Section 2.2); and processing the alternative representation of the input image using a neural network to generate a sequence of words in a target natural language relating to content of the input image (Figures 2-3; SC-NLM decoder; Section 2.5; equations (11), (7)-(10)), including selecting at least one word for inclusion in the sequence of words (Figures 2-3; Sections 2.4-2.5) by conditioning the neural network on a preceding word in the sequence of words (Section 2.5, 1st paragraph, “distribution … from previous word context”, 3rd paragraph, “condition on the SC-NLM”).
Koros does not disclose an embedding space representative of features for images rather than text.
In the same field of endeavor, Karpathy teaches a model for bidirectional retrieval of images and sentences through a multi-modal embedding of visual and natural language data (Karpathy: Abstract; Figures 1-5). Karpathy further teaches an embedding space representative of features for image only rather than text (Karpathy: Figure 2, Image Fragments, footnote, “Left: CNN representations (green) of detected objects are mapped to the fragment embedding space (blue, Section 3.2) …”; Page 3, Sec. 3.2, equation (2)).
Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to combine the teaching of Koros with the teaching of Karpathy by using an embedding space representative of features for images rather than text, in order to break down and embed fragments so that a fragment-level loss function can be imposed (Karpathy: Page 1, Sec. 1, 3rd paragraph).
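For illustration only, the pipeline recited in claim 21 — encode the input image into an embedding, then generate words one at a time, each selected by conditioning on the preceding word until a stop condition occurs — can be sketched as follows. This is a minimal toy sketch, not the implementation of the claims or of the cited references; the vocabulary, the stand-in "encoder," and the scoring rule are all hypothetical.

```python
# Toy sketch of the claimed captioning pipeline; every name and the
# scoring rule below are hypothetical placeholders.

VOCAB = ["<start>", "a", "dog", "runs", "<eos>"]

def embed_image(pixels):
    # Stand-in for a CNN encoder: map the image to a fixed-length
    # vector in an image-feature embedding space.
    return [sum(pixels) % 7, len(pixels) % 5]

def next_word_scores(image_embedding, prev_word):
    # Stand-in for the decoder network: one score per vocabulary word,
    # conditioned on the image embedding and on the preceding word.
    prev_idx = VOCAB.index(prev_word)
    favored = min(prev_idx + 1, len(VOCAB) - 1)
    return [(1.0 if i == favored else 0.0)
            + 0.001 * image_embedding[i % len(image_embedding)]
            for i in range(len(VOCAB))]

def generate_caption(pixels, max_len=10):
    emb = embed_image(pixels)
    words, prev = [], "<start>"
    for _ in range(max_len):
        scores = next_word_scores(emb, prev)
        word = VOCAB[max(range(len(VOCAB)), key=scores.__getitem__)]
        if word == "<eos>":   # stop condition ends the sequence
            break
        words.append(word)
        prev = word           # condition the next step on the preceding word
    return words
```

Each loop iteration selects one word from the score distribution and feeds it back in as the conditioning context for the next step, which is the autoregressive structure the claim recites.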
-Regarding claim 33, Koros discloses a system comprising one or more computers and one or more storage devices storing instructions that are operable, when executed by the one or more computers (one or more processors/computers and memories must be used to implement the system and method in Koros’ Figure 2), to cause the one or more computers to perform operations comprising (Abstract; Figures 1-5; Section 5.2): obtaining an input image (Figure 2, input image); processing the input image to generate an alternative representation of the input image in an embedding space representative of features for both the image and a sentence (Figure 2, Encoder, Multimodal space; Sections 1.2-1.3; Section 2.2); and processing the alternative representation of the input image using a neural network to generate a sequence of words in a target natural language relating to content of the input image (Figures 2-3; SC-NLM decoder; Section 2.5; equations (11), (7)-(10)), including selecting at least one word for inclusion in the sequence of words (Figures 2-3; Sections 2.4-2.5) by conditioning the neural network on a preceding word in the sequence of words (Section 2.5, 1st paragraph, “distribution … from previous word context”, 3rd paragraph, “condition on the SC-NLM”).
Koros does not disclose an embedding space representative of features for images rather than text.
In the same field of endeavor, Karpathy teaches a model for bidirectional retrieval of images and sentences through a multi-modal embedding of visual and natural language data (Karpathy: Abstract; Figures 1-5). Karpathy further teaches an embedding space representative of features for image only rather than text (Karpathy: Figure 2, Image Fragments, footnote, “Left: CNN representations (green) of detected objects are mapped to the fragment embedding space (blue, Section 3.2) …”; Page 3, Sec. 3.2, equation (2)).
Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to combine the teaching of Koros with the teaching of Karpathy by using an embedding space representative of features for images rather than text, in order to break down and embed fragments so that a fragment-level loss function can be imposed (Karpathy: Page 1, Sec. 1, 3rd paragraph).
-Regarding claim 40, Koros discloses a computer program product encoded on one or more non-transitory computer storage media, the computer program product comprising instructions that, when executed by one or more computers (one or more processors/computers and memories must be used to implement the system and method in Koros’ Figure 2), cause the one or more computers to perform operations comprising (Abstract; Figures 1-5; Section 5.2): obtaining an input image (Figure 2, input image); processing the input image to generate an alternative representation of the input image in an embedding space representative of features for both the image and a sentence (Figure 2, Encoder, Multimodal space; Sections 1.2-1.3; Section 2.2); and processing the alternative representation of the input image using a neural network to generate a sequence of words in a target natural language relating to content of the input image (Figures 2-3; SC-NLM decoder; Section 2.5; equations (11), (7)-(10)), including selecting at least one word for inclusion in the sequence of words (Figures 2-3; Sections 2.4-2.5) by conditioning the neural network on a preceding word in the sequence of words (Section 2.5, 1st paragraph, “distribution … from previous word context”, 3rd paragraph, “condition on the SC-NLM”).
Koros does not disclose an embedding space representative of features for images rather than text.
In the same field of endeavor, Karpathy teaches a model for bidirectional retrieval of images and sentences through a multi-modal embedding of visual and natural language data (Karpathy: Abstract; Figures 1-5). Karpathy further teaches an embedding space representative of features for image only rather than text (Karpathy: Figure 2, Image Fragments, footnote, “Left: CNN representations (green) of detected objects are mapped to the fragment embedding space (blue, Section 3.2) …”; Page 3, Sec. 3.2, equation (2)).
Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to combine the teaching of Koros with the teaching of Karpathy by using an embedding space representative of features for images rather than text, in order to break down and embed fragments so that a fragment-level loss function can be imposed (Karpathy: Page 1, Sec. 1, 3rd paragraph).
-Regarding claims 22 and 34, Koros in view of Karpathy teaches the method of claim 21 and the system of claim 33. The combination further teaches processing the input image with a convolutional neural network to generate the alternative representation of the input image (Koros: Figure 2, CNN-LSTM Encoder, Multimodal space).
-Regarding claims 23 and 35, Koros in view of Karpathy teaches the method of claim 22 and the system of claim 34. The combination further teaches wherein the convolutional neural network comprises a plurality of core neural network layers each having a respective set of parameters, wherein processing the input image with the convolutional neural network comprises processing the input through each of the core neural network layers, and wherein the alternative representation of the input image is the output generated by a last core neural network layer in the plurality of core neural network layers (Koros: Figure 2, CNN-LSTM Encoder, Multimodal space).
-Regarding claim 27, Koros in view of Karpathy teaches the method of claim 21.
Koros does not disclose wherein the alternative representation of the input image is generated using a second neural network that is initially trained to perform an image classification task and is subsequently trained in a process that involves backpropagating gradients from the first neural network.
In the same field of endeavor, Karpathy teaches a model for bidirectional retrieval of images and sentences through a multi-modal embedding of visual and natural language data (Karpathy: Abstract; Figures 1-5). Karpathy further teaches wherein the alternative representation of the input image is generated using a second neural network that is initially trained to perform an image classification task and is subsequently trained in a process that involves backpropagating gradients from the first neural network (Karpathy: Figure 2, footnote, “CNN representations (green) of detected objects are mapped to the fragment embedding space”; Page 3, Sec. 3.2, equation (2); Page 7, 4th paragraph, “backpropagate gradient”).
Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to combine the teaching of Koros with the teaching of Karpathy by using a second neural network which provides an embedding space representative of features for images rather than text, in order to break down and embed fragments so that a fragment-level loss function can be imposed (Karpathy: Page 1, Sec. 1, 3rd paragraph).
-Regarding claim 31, Koros in view of Karpathy teaches the method of claim 21. The combination further teaches wherein conditioning the neural network on a preceding word in the sequence of words comprises conditioning the neural network on a numeric representation of the preceding word (Koros: Page 5, Section 2.3, equation (8); Section 2.4, last paragraph).
-Regarding claims 32 and 39, Koros in view of Karpathy teaches the method of claim 21 and the system of claim 33. The combination further teaches wherein the sequence of words is arranged according to an output order, and selecting a word for a current position in the output order comprises conditioning the neural network using a word that was selected at a preceding position in the output order that precedes the current position (Koros: Page 5, Section 2.3, equation (8); Section 2.4, last paragraph).
Claims 25-26, 28-30 and 37-38 are rejected under 35 U.S.C. 103 as being unpatentable over Koros et al. (arXiv:1411.2539v1, 10 Nov 2014), hereinafter Koros, in view of Karpathy et al. (arXiv:1406.5679, 22 Jun 2014), hereinafter Karpathy, and further in view of Sutskever et al. (arXiv:1409.3215v2, 29 Oct 2014), hereinafter Sutskever.
-Regarding claims 25 and 37, Koros in view of Karpathy teaches the method of claim 21 and the system of claim 33.
Koros in view of Karpathy does not teach wherein the neural network is a long short-term memory (LSTM) neural network.
Sutskever is an analogous art pertinent to the problem to be solved in this application and teaches sequence to sequence learning with neural networks (Sutskever: Abstract; Figures 1-3). Sutskever further teaches wherein the neural network is a long short-term memory (LSTM) neural network (Sutskever: Abstract, “then another deep LSTM to decode the target sequence from the vector”; Page 3, Section 2).
Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to combine the teaching of Koros in view of Karpathy with the teaching of Sutskever by using an LSTM neural network as a decoder in order to learn problems with long-range temporal dependencies and map a fixed-dimensional vector representation to a variable-length sentence.
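For illustration, the gated update that gives an LSTM its long-range memory — the property the rationale above relies on — can be sketched for a single scalar time step. This is a generic textbook-style toy, not the network of the claims or of any cited reference; the weight names in `w` are hypothetical.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def lstm_step(x, h_prev, c_prev, w):
    # One scalar LSTM time step. The forget/input gates decide how
    # much of the old cell state to keep and how much new candidate
    # information to write, which lets the cell carry information
    # across long temporal ranges.
    i = sigmoid(w["wi"] * x + w["ui"] * h_prev)    # input gate
    f = sigmoid(w["wf"] * x + w["uf"] * h_prev)    # forget gate
    o = sigmoid(w["wo"] * x + w["uo"] * h_prev)    # output gate
    g = math.tanh(w["wg"] * x + w["ug"] * h_prev)  # candidate cell value
    c = f * c_prev + i * g                         # new cell state
    h = o * math.tanh(c)                           # new hidden state
    return h, c
```

In a decoder, `x` would be the representation of the current word and `h` would feed the next-word scoring layer at each generation step.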
-Regarding claims 26 and 38, Koros in view of Karpathy teaches the method of claim 25 and the system of claim 37.
Koros in view of Karpathy does not teach wherein the LSTM neural network is configured to process a representation of a current word in the sequence to generate, in accordance with a current hidden state of the LSTM neural network and current values of a set of parameters of the LSTM neural network, a respective word score for each word in a set of words that represents a respective likelihood that the word is a next word in the sequence.
Sutskever is an analogous art pertinent to the problem to be solved in this application and teaches sequence to sequence learning with neural networks (Sutskever: Abstract; Figures 1-3). Sutskever further teaches wherein the LSTM neural network is configured to process a representation of a current word in the sequence to generate, in accordance with a current hidden state of the LSTM neural network and current values of a set of parameters of the LSTM neural network (Sutskever: Abstract; Figures 1-2; Page 3, Section 2), a respective word score for each word in a set of words that represents a respective likelihood that the word is a next word in the sequence (Sutskever: Page 4, Section 3.2, last paragraph; Page 2, 4th-5th paragraphs).
Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to combine the teaching of Koros in view of Karpathy with the teaching of Sutskever by using an LSTM neural network as a decoder in order to learn problems with long-range temporal dependencies and map a fixed-dimensional vector representation to a variable-length sentence.
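The per-word likelihood scoring recited in claims 26 and 38 — a respective score for each word in a set of words, computed from the current hidden state and the network's parameters — is conventionally a softmax over the vocabulary. The sketch below is illustrative only; the parameter layout is a hypothetical stand-in, not the claimed or cited model.

```python
import math

def word_scores(hidden_state, params, vocab):
    # Map the decoder's current hidden state to a likelihood for each
    # vocabulary word. `params[word]` is a hypothetical per-word weight
    # vector; the softmax turns the dot-product logits into a
    # probability distribution over possible next words.
    logits = [sum(h * w for h, w in zip(hidden_state, params[word]))
              for word in vocab]
    m = max(logits)                            # subtract max for stability
    exps = [math.exp(l - m) for l in logits]
    total = sum(exps)
    return {word: e / total for word, e in zip(vocab, exps)}
```

The scores sum to one, so each value can be read directly as the predicted likelihood that its word is the next word in the sequence.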
-Regarding claim 28, Koros in view of Karpathy teaches the method of claim 26.
Koros in view of Karpathy teaches wherein the set of words includes a vocabulary of words in the target natural language (Koros: Figures 1-5).
Koros in view of Karpathy does not teach a special stop word.
Sutskever is an analogous art pertinent to the problem to be solved in this application and teaches sequence to sequence learning with neural networks (Sutskever: Abstract; Figures 1-3). Sutskever further teaches a stop condition occurs that indicates an end to the sequence of words (Sutskever: Figure 1, footnote, “The model stops making predictions after outputting the end-of-sentence token”; Page 3, 3rd paragraph, “special end-of-sentence symbol “<EOS>””).
Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to combine the teaching of Koros in view of Karpathy with the teaching of Sutskever by using a stop word or end-of-sentence token in order to accurately indicate the end of the selected word sequence.
-Regarding claim 29, Koros in view of Karpathy teaches the method of claim 25.
Koros in view of Karpathy teaches processing the alternative representation of the input image using the neural network comprises: processing the alternative representation using the neural network with left-to-right beam search decoding to generate a plurality of possible sequences and a respective sequence score for each of the possible sequences; and selecting one or more highest-scoring possible sequences as descriptions of the input image (Koros: Abstract; Figures 1-5; Page 13, Section 5.2, 5th-7th paragraphs).
Koros in view of Karpathy does not teach processing the alternative representation using the LSTM neural network using a left-to-right beam search.
Sutskever is an analogous art pertinent to the problem to be solved in this application and teaches sequence to sequence learning with neural networks (Sutskever: Abstract; Figures 1-3). Sutskever further teaches processing the alternative representation using the LSTM neural network using a left to right beam search decoding to generate a plurality of possible sequences (Sutskever: Abstract; Figures 1-2; Page 4, Section 3.2, “a simple left-to-right beam search decoder”; Page 2, 4th paragraph) and a respective sequence score for each of the possible sequences (Sutskever: Page 4, Section 3.2, last paragraph; Page 2, 4th-5th paragraphs).
Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to combine the teaching of Koros in view of Karpathy with the teaching of Sutskever by using an LSTM neural network as a decoder in order to learn problems with long-range temporal dependencies and map a fixed-dimensional vector representation to a variable-length sentence.
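For illustration, the left-to-right beam search decoding addressed by claim 29 — generating a plurality of candidate sequences with a score for each, then keeping the best — can be sketched as follows. This is a generic sketch under stated assumptions, not the decoder of the claims or references; `step_scores` is a hypothetical caller-supplied stand-in for the neural network.

```python
def beam_search(step_scores, beam_width=2, max_len=5, eos="<eos>"):
    # Left-to-right beam search: keep the `beam_width` best partial
    # sequences, extend each with every candidate next word, and retain
    # the highest-scoring extensions. `step_scores(seq)` returns a
    # {word: log_probability} map for the sequence generated so far.
    beams = [([], 0.0)]          # (sequence, cumulative log-probability)
    finished = []
    for _ in range(max_len):
        candidates = []
        for seq, score in beams:
            for word, logp in step_scores(seq).items():
                if word == eos:  # sequence complete: record it with its score
                    finished.append((seq + [word], score + logp))
                else:
                    candidates.append((seq + [word], score + logp))
        if not candidates:
            break
        beams = sorted(candidates, key=lambda t: t[1], reverse=True)[:beam_width]
    pool = finished if finished else beams
    return max(pool, key=lambda t: t[1])[0]   # highest-scoring sequence
```

Each candidate carries its own cumulative sequence score, and the highest-scoring completed sequence is selected as the description, mirroring the claim's "respective sequence score" language.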
-Regarding claim 30, Koros in view of Karpathy teaches the method of claim 21.
Koros in view of Karpathy does not teach selecting the initial word in the sequence by initializing a hidden state of the neural network.
Sutskever is an analogous art pertinent to the problem to be solved in this application and teaches sequence to sequence learning with neural networks (Sutskever: Abstract; Figures 1-3). Sutskever further teaches selecting the initial word in the sequence by initializing a hidden state of the neural network (Sutskever: Page 3, Section 2, 3rd paragraph; equation (1)).
Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to combine the teaching of Koros in view of Karpathy with the teaching of Sutskever by using an LSTM neural network as a decoder in order to learn problems with long-range temporal dependencies and map a fixed-dimensional vector representation to a variable-length sentence.
Claims 24 and 36 are rejected under 35 U.S.C. 103 as being unpatentable over Koros et al. (arXiv:1411.2539v1, 10 Nov 2014), hereinafter Koros, in view of Karpathy et al. (arXiv:1406.5679, 22 Jun 2014), hereinafter Karpathy, and further in view of Erhan et al. (CVPR 2014), hereinafter Erhan.
-Regarding claims 24 and 36, Koros in view of Karpathy teaches the method of claim 23 and the system of claim 35.
Koros in view of Karpathy does not teach a third neural network generating a respective score for each of a plurality of object categories, the respective score for each of the plurality of object categories representing a predicted likelihood that the training image contains an image of an object from the object category.
However, Erhan is an analogous art pertinent to the problem to be solved in this application and teaches a method for scalable object detection using deep neural networks (Erhan: Abstract; Section 3, “Proposed approach”). Erhan further teaches a third neural network generating a respective score for each of a plurality of object categories, the respective score for each of the plurality of object categories representing a predicted likelihood that the training image contains an image of an object from the object category (Erhan: Abstract, “detection, which predicts a set of class-agnostic bounding boxes along with a single score for each box, corresponding to its likelihood of containing any object of interest”; Page 2, 1st Col., Section 3, 1st paragraph; Page 2, 2nd Col., 1st and 2nd paragraphs; equation (1)).
Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the teaching of Koros in view of Karpathy with the teaching of Erhan by training a third neural network on a plurality of training images, wherein the third neural network includes the plurality of core neural network layers and generates a respective score for each of a plurality of object categories, the respective score for each of the plurality of object categories representing a predicted likelihood that the training image contains an image of an object from the object category, in order to provide a more accurate representation of the image.
Conclusion
Any inquiry concerning this communication or earlier communications from the examiner should be directed to XIAO LIU, whose telephone number is (571) 272-4539. The examiner can normally be reached Monday-Thursday and alternate Fridays, 8:30-4:30.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Jennifer Mehmood can be reached at (571) 272-2976. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.
/XIAO LIU/Primary Examiner, Art Unit 2664