Prosecution Insights
Last updated: April 19, 2026
Application No. 18/064,977

ARTIFICIAL INTELLIGENCE CAPTIONS USING AN ENSEMBLE METHOD FOR AUDIO TEMPO AND PITCH

Non-Final OA: §102, §103
Filed
Dec 13, 2022
Examiner
THOMAS-HOMESCU, ANNE L
Art Unit
2656
Tech Center
2600 — Communications
Assignee
International Business Machines Corporation
OA Round
1 (Non-Final)
77%
Grant Probability
Favorable
1-2
OA Rounds
2y 8m
To Grant
99%
With Interview

Examiner Intelligence

Grants 77% — above average
77%
Career Allow Rate
276 granted / 360 resolved
+14.7% vs TC avg
Strong +37% interview lift
+36.7%
Interview Lift (with vs. without interview)
Resolved cases with interview
Typical timeline
2y 8m
Avg Prosecution
34 currently pending
Career history
394
Total Applications
across all art units

Statute-Specific Performance

§101: 16.7% (-23.3% vs TC avg)
§103: 50.7% (+10.7% vs TC avg)
§102: 19.9% (-20.1% vs TC avg)
§112: 7.5% (-32.5% vs TC avg)
TC average values are estimates • Based on career data from 360 resolved cases

Office Action

§102 §103
DETAILED ACTION

The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.

Information Disclosure Statement

The information disclosure statement (IDS) submitted on 13 December 2022 is in compliance with the provisions of 37 CFR 1.97. Accordingly, the information disclosure statement is being considered by the examiner.

Compact Prosecution

In the interest of compact prosecution, the examiner suggests more clearly defining what is meant by “plurality of next word predictions”. Under broadest reasonable interpretation, the phrase “plurality of next word predictions” may have various interpretations. For example, the phrase may be interpreted in terms of translating a word from one language to another (as done in the rejections below). Or, the phrase may be interpreted in terms of the probability of a specific phonetic sound following a current word.
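To make the two readings the examiner contrasts concrete, the minimal Python sketch below models each as a small data structure. The structures and names are illustrative assumptions, not taken from the application or the cited references.

```python
from dataclasses import dataclass

# Illustrative only: these structures are assumptions, not the application's
# actual data model; they simply contrast the two readings the examiner notes.

@dataclass
class TranslationPrediction:
    """Reading 1: a word in another language predicted from a source word."""
    source_word: str
    candidates: list[str]                 # plural candidate translations

@dataclass
class NextSoundPrediction:
    """Reading 2: probabilities of phonetic sounds following the current word."""
    current_word: str
    next_sound_probs: dict[str, float]    # phoneme -> probability

if __name__ == "__main__":
    reading_1 = TranslationPrediction("hello", ["hola", "bonjour"])
    reading_2 = NextSoundPrediction("hello", {"w": 0.4, "th": 0.35, "ih": 0.25})
    print(reading_1)
    print(reading_2)
```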
Claim Rejections - 35 USC § 102

The following is a quotation of the appropriate paragraphs of 35 U.S.C. 102 that form the basis for the rejections under this section made in this Office action:

A person shall be entitled to a patent unless – (a)(2) the claimed invention was described in a patent issued under section 151, or in an application for patent published or deemed published under section 122(b), in which the patent or application, as the case may be, names another inventor and was effectively filed before the effective filing date of the claimed invention.

Claim(s) 1-2, 5-9, 12-16, and 19-20 is/are rejected under 35 U.S.C. 102(a)(2) as being anticipated by US 20240087557, hereinafter referred to as Levine et al.

Regarding claim 1, Levine et al. discloses a processor-implemented method for generating captions, the method comprising:

capturing input audio comprising audiovisual content (“To this end, FIG. 2 is a flow chart for a method 200 of generating a dubbed video, according to an embodiment of the present disclosure. In an embodiment, the second electronic device 110 can receive video data including audio data in the first language in step 205,” Levine et al., para [0032].);

processing the input audio to extract an input rate of speech (“The audio segments 405, 410, 415 can be allowed to speed up to a predetermined maximum rate to stay within the corresponding time window 420, 425, 430,” Levine et al., para [0065]. This excerpt makes clear that an original (i.e., input) rate of speech is extracted.), a plurality of input word timings (“The video ingestion process can take the video data and extract audio from video, separate a background audio track from a voice track, extract text from the voice track, translate text, find desirable timings of the text based on the translated text and original speech timing, create a new voice track using the translated text and the desirable timings, merge a new voice track with a background audio track, and merge the combined audio track into the video,” Levine et al., para [0050].), and a plurality of input word predictions (“With reference to step 210 of method 200, the creator can optionally also submit an original transcript along with the video data to the second electronic device 110,” Levine et al., para [0047]. The original transcript is interpreted as word predictions – i.e., a written word is being predicted based on the audio version.);

generating one or more new audio files by altering the input rate of speech of the input audio to fall within a pre-determined range (“The audio segments 405, 410, 415 can be allowed to speed up to a predetermined maximum rate to stay within the corresponding time window 420, 425, 430…That is to say, the sped up dubbed audio needs the width of the lines representing the corresponding time window 420, 425, 430 to fully play, but only has the width of the box of the audio segments 405, 410, 415 to do so,” Levine et al., para [0065]. Here, the sped up dubbed audio is allowed to speed up to a predetermined maximum rate (i.e., a range).);

processing the one or more new audio files to extract a plurality of new word timings (“Additionally, during automated text-to-speech dubbing, the creator or automation can produce a timed text. This can be translated text that needs to be rendered via text to speech at predetermined timestamps within the video,” Levine et al., para [0064]. Thus, the timestamps mark new word timings.) and a plurality of new word predictions (“In an embodiment, the second electronic device 110 can align timing windows of portions of the translated preliminary transcript with corresponding segments of the audio data in the first language based on the video data to generate a translated aligned transcript in step 220. Notably, the method can automatically speed up segments, re-align segments, and merge segments automatically to try and find a suitable alignment,” Levine et al., para [0035]. The translated preliminary transcript is interpreted as word predictions – i.e., a new written (i.e., translated) word is being predicted based on the word in the original transcript.);

creating a mapping that pairs the plurality of input word timings with corresponding new word timings of the plurality of new word timings (“In an embodiment, the second electronic device 110 can align timing windows of portions of the translated preliminary transcript with corresponding segments of the audio data in the first language based on the video data to generate a translated aligned transcript in step 220,” Levine et al., para [0035]. The re-alignment is interpreted as a mapping between input word timings (i.e., audio data in the first language) to new word timings (i.e., translated aligned transcript).);

selecting a word prediction for each paired input word timing and new word timing based on the mapping (Levine et al., para [0035]. The aligned transcript indicates selected word predictions and associated timings.); and

integrating the selected word predictions into the audiovisual content for display (“In an embodiment, the video, either in the original native language (original video data) or the dubbed language (the first dubbed video), can be displayed alongside the original transcript or the translated aligned transcript and configured to play at the time in the video corresponding to the selected text in the original transcript or the translated aligned transcript. Further, as the video plays, the text in the original transcript or the translated aligned transcript can have a formatting applied (see below) to show where the audio is currently playing relative to either of the transcript documents,” Levine et al., para [0068]. The translated aligned transcript is interpreted as selected word predictions – i.e., a new written (i.e., translated) word aligned to the word in the original transcript is being predicted based on the word in the original transcript.).
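As a rough, non-authoritative illustration of the mapping and selection steps recited in claim 1, the Python sketch below pairs input word timings with new word timings and picks one prediction per pair. All structures, function names, and the order-based pairing are assumptions for illustration; they are not taken from the application or from Levine.

```python
# Hypothetical sketch of claim 1's mapping/selection steps; not the applicant's
# implementation. Word timings are (word, start_seconds) tuples and
# "predictions" are candidate strings with confidence scores.

def create_mapping(input_timings, new_timings):
    """Pair the i-th input word timing with the i-th new word timing."""
    return list(zip(input_timings, new_timings))

def select_predictions(mapping, predictions):
    """For each (input, new) timing pair, keep the highest-confidence candidate."""
    selected = []
    for (in_word, in_start), (_new_word, new_start) in mapping:
        candidates = predictions.get(in_word, [(in_word, 1.0)])
        best_word, _ = max(candidates, key=lambda c: c[1])
        # caption text uses the selected word, keyed to both timing streams
        selected.append({"word": best_word,
                         "input_timing": in_start,
                         "new_timing": new_start})
    return selected

if __name__ == "__main__":
    input_timings = [("hello", 0.0), ("world", 0.6)]
    new_timings = [("hello", 0.0), ("world", 0.5)]   # after tempo adjustment
    predictions = {"hello": [("hello", 0.9), ("hallo", 0.1)],
                   "world": [("word", 0.3), ("world", 0.7)]}
    for entry in select_predictions(create_mapping(input_timings, new_timings),
                                    predictions):
        print(entry)
```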
As to claim 8, system claim 8 and method claim 1 are related as method and system of using same, with each claimed element’s function corresponding to the method step. Accordingly, claim 8 is similarly rejected under the same rationale as applied above with respect to the method claim. And, Levine et al., para [0116]-[0117], teaches processor, CRM, storage medium, and memory.

As to claim 15, computer program claim 15 and method claim 1 are related as method and computer program of using same, with each claimed element’s function corresponding to the method step. Accordingly, claim 15 is similarly rejected under the same rationale as applied above with respect to the method claim. And, Levine et al., para [0116]-[0117], teaches processor, CRM, storage medium, and memory.

Regarding claim 2, Levine et al. discloses the method of claim 1, wherein the pre-determined range is between 120 words per minute and 160 words per minute (The selection of specific values for a pre-determined range is a matter of design choice.).

As to claim 9, system claim 9 and method claim 2 are related as method and system of using same, with each claimed element’s function corresponding to the method step. Accordingly, claim 9 is similarly rejected under the same rationale as applied above with respect to the method claim. And, Levine et al., para [0116]-[0117], teaches processor, CRM, storage medium, and memory.

As to claim 16, computer program claim 16 and method claim 2 are related as method and computer program of using same, with each claimed element’s function corresponding to the method step. Accordingly, claim 16 is similarly rejected under the same rationale as applied above with respect to the method claim. And, Levine et al., para [0116]-[0117], teaches processor, CRM, storage medium, and memory.

Regarding claim 5, Levine et al. discloses the method of claim 1, wherein the input audio and the one or more new audio files have a same pitch (“(10) The method of any one of (1) to (9), further comprising…adjusting a speed of the flagged transcript portions while maintaining a pitch of the resulting first speech dub,” Levine et al., para [0166].).

As to claim 12, system claim 12 and method claim 5 are related as method and system of using same, with each claimed element’s function corresponding to the method step. Accordingly, claim 12 is similarly rejected under the same rationale as applied above with respect to the method claim. And, Levine et al., para [0116]-[0117], teaches processor, CRM, storage medium, and memory.

As to claim 19, computer program claim 19 and method claim 5 are related as method and computer program of using same, with each claimed element’s function corresponding to the method step. Accordingly, claim 19 is similarly rejected under the same rationale as applied above with respect to the method claim. And, Levine et al., para [0116]-[0117], teaches processor, CRM, storage medium, and memory.
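Claims 2 and 5 recite a 120-160 words-per-minute target range and pitch preservation. As one hedged way to picture that combination, the sketch below uses librosa's phase-vocoder time stretch, which changes tempo without shifting pitch; the file names, the word-count input, and the clamping policy are illustrative assumptions rather than the claimed implementation.

```python
# Hedged sketch: one way to bring an input rate of speech into a 120-160 wpm
# window while keeping pitch unchanged, assuming librosa is acceptable for the
# time stretch. Not the application's actual processing chain.
import librosa
import soundfile as sf

def adjust_tempo(in_path, out_path, word_count, wpm_min=120.0, wpm_max=160.0):
    y, sr = librosa.load(in_path, sr=None)      # keep native sample rate
    duration_min = len(y) / sr / 60.0
    wpm = word_count / duration_min             # input rate of speech

    if wpm < wpm_min:
        rate = wpm_min / wpm                    # > 1.0: speed up
    elif wpm > wpm_max:
        rate = wpm_max / wpm                    # < 1.0: slow down
    else:
        rate = 1.0                              # already within the range

    # time_stretch changes tempo only; pitch is preserved (cf. claim 5)
    y_new = librosa.effects.time_stretch(y, rate=rate)
    sf.write(out_path, y_new, sr)
    return rate

# Example (hypothetical files): adjust_tempo("input.wav", "new_audio.wav", word_count=240)
```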
Regarding claim 6, Levine et al. discloses the method of claim 1, wherein the selected word predictions are associated with the input word timings (“The speech recognition process can take the video data and extract just the audio track and then run the audio track through a text to speech engine to generate an auto-generated original transcript. To produce high-quality dubs, it is important to have precise word-level timings in the original transcript. If a creator-generated original transcript is available, the original transcript generally only includes approximate word-level timings,” Levine et al., para [0052]. The original transcript (i.e., word predictions) has corresponding word-level timings (i.e., input word timings).).

As to claim 13, system claim 13 and method claim 6 are related as method and system of using same, with each claimed element’s function corresponding to the method step. Accordingly, claim 13 is similarly rejected under the same rationale as applied above with respect to the method claim. And, Levine et al., para [0116]-[0117], teaches processor, CRM, storage medium, and memory.

As to claim 20, computer program claim 20 and method claim 6 are related as method and computer program of using same, with each claimed element’s function corresponding to the method step. Accordingly, claim 20 is similarly rejected under the same rationale as applied above with respect to the method claim. And, Levine et al., para [0116]-[0117], teaches processor, CRM, storage medium, and memory.

Regarding claim 7, Levine et al. discloses the method of claim 1, wherein the integrating comprises aligning the plurality of input word timings with one or more corresponding timecodes of the audiovisual content (Levine et al., para [0052]. Here, timings in the original transcript (i.e., input word timings) are aligned to auto-generated original transcript timings.).

As to claim 14, system claim 14 and method claim 7 are related as method and system of using same, with each claimed element’s function corresponding to the method step. Accordingly, claim 14 is similarly rejected under the same rationale as applied above with respect to the method claim. And, Levine et al., para [0116]-[0117], teaches processor, CRM, storage medium, and memory.

Claim Rejections - 35 USC § 103

The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:

A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

Claim(s) 3, 10, and 17 is/are rejected under 35 U.S.C. 103 as being unpatentable over US 20240087557, hereinafter referred to as Levine et al., in view of US 20240143936, hereinafter referred to as Giovanardi et al.

Regarding claim 3, Levine et al. discloses the method of claim 1, but not wherein the selecting is performed using an ensemble method. Giovanardi et al. is cited to disclose wherein the selecting is performed using an ensemble method (“…using a number of classifiers from the trained model to generate ensemble class scores; and using the ensemble class scores to predict one or more next step sentences from the sentences in the transcript,” Giovanardi et al. [0023].). Giovanardi et al. benefits Levine et al. by incorporating an ensemble classifier to produce a single, more accurate, and robust word selection. Therefore, it would be obvious for one skilled in the art to combine the teachings of Levine et al. with those of Giovanardi et al. to better align dubbed audio with original audio from a video-based source.
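For a concrete picture of what an ensemble selection step can look like, the sketch below averages candidate-word scores from several hypothetical recognizers, loosely in the spirit of the "ensemble class scores" quoted from Giovanardi; it is an illustration of ensembling in general, not the claimed method or either reference.

```python
# Hedged sketch of ensemble word selection: each hypothetical model scores
# candidate words for one caption slot; averaged scores pick the winner.
from collections import defaultdict

def ensemble_select(candidate_scores_per_model):
    """candidate_scores_per_model: list of {word: score} dicts, one per model."""
    totals = defaultdict(float)
    for scores in candidate_scores_per_model:
        for word, score in scores.items():
            totals[word] += score / len(candidate_scores_per_model)
    return max(totals, key=totals.get)

if __name__ == "__main__":
    models = [{"world": 0.7, "word": 0.3},
              {"world": 0.6, "whirled": 0.4},
              {"word": 0.55, "world": 0.45}]
    print(ensemble_select(models))   # -> "world"
```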
As to claim 10, system claim 10 and method claim 3 are related as method and system of using same, with each claimed element’s function corresponding to the method step. Accordingly, claim 10 is similarly rejected under the same rationale as applied above with respect to the method claim. And, Levine et al., para [0116]-[0117], teaches processor, CRM, storage medium, and memory.

As to claim 17, computer program claim 17 and method claim 3 are related as method and computer program of using same, with each claimed element’s function corresponding to the method step. Accordingly, claim 17 is similarly rejected under the same rationale as applied above with respect to the method claim. And, Levine et al., para [0116]-[0117], teaches processor, CRM, storage medium, and memory.

Claim(s) 4, 11, and 18 is/are rejected under 35 U.S.C. 103 as being unpatentable over US 20240087557, hereinafter referred to as Levine et al., in view of WO 2017124116, hereinafter referred to as Bao et al.

Regarding claim 4, Levine et al. discloses the method of claim 1, but not wherein the selecting is performed using a Boyer-Moore Majority Voting Algorithm. Bao et al. is cited to disclose wherein the selecting is performed using a Boyer-Moore Majority Voting Algorithm (“Text-based search/matching: In the embodiments of the disclosed subject matter where audio information are transcribed into text in this disclosed subject matter, many string matching algorithms can be employed to find the matches for a query, including but not limited to, Naive string searching algorithm, Rabin-Karp algorithm, Finite-state automaton search, Knuth-Morris-Pratt algorithm, Boyer-Moore algorithm,” Bao et al., para [00255].). Bao et al. benefits Levine et al. by providing the Boyer-Moore algorithm as an efficient string-searching algorithm to serve as a benchmark for practical pattern matching. Therefore, it would be obvious for one skilled in the art to combine the teachings of Levine et al. with those of Bao et al. to align dubbed audio with original audio from a video-based source.

As to claim 11, system claim 11 and method claim 4 are related as method and system of using same, with each claimed element’s function corresponding to the method step. Accordingly, claim 11 is similarly rejected under the same rationale as applied above with respect to the method claim. And, Levine et al., para [0116]-[0117], teaches processor, CRM, storage medium, and memory.

As to claim 18, computer program claim 18 and method claim 4 are related as method and computer program of using same, with each claimed element’s function corresponding to the method step. Accordingly, claim 18 is similarly rejected under the same rationale as applied above with respect to the method claim. And, Levine et al., para [0116]-[0117], teaches processor, CRM, storage medium, and memory.
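Note that the Boyer-Moore Majority Voting Algorithm recited in claim 4 is the linear-time majority-vote procedure, which is distinct from the Boyer-Moore string-search algorithm listed in the Bao passage quoted above. The sketch below shows the majority-vote version applied to hypothetical word candidates for a single caption slot.

```python
# Hedged sketch of the Boyer-Moore majority vote algorithm named in claim 4.
# The candidate words are invented for illustration.

def boyer_moore_majority(votes):
    """Return the strict-majority element of `votes` if one exists, else None."""
    candidate, count = None, 0
    for item in votes:                    # phase 1: find a candidate
        if count == 0:
            candidate, count = item, 1
        elif item == candidate:
            count += 1
        else:
            count -= 1
    # phase 2: verify the candidate actually holds a strict majority
    if candidate is not None and votes.count(candidate) > len(votes) // 2:
        return candidate
    return None

if __name__ == "__main__":
    # e.g., word predictions for one caption slot from five sources
    print(boyer_moore_majority(["world", "word", "world", "world", "whirled"]))
```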
Conclusion

The prior art made of record and not relied upon is considered pertinent to applicant's disclosure. See attached PTO-892. In particular, the examiner notes Carmiel et al., Federico et al., Ren et al., Crawford, Thomson et al., and Candelore et al. as generally addressing methods for synchronizing text with audiovisual material.

Any inquiry concerning this communication or earlier communications from the examiner should be directed to ANNE L THOMAS-HOMESCU whose telephone number is (571) 272-0899. The examiner can normally be reached Mon-Fri 8-6.

Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice. If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Bhavesh M Mehta, can be reached on 571-272-7453. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.

Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.

/ANNE L THOMAS-HOMESCU/
Primary Examiner, Art Unit 2656

Prosecution Timeline

Dec 13, 2022
Application Filed
Oct 26, 2023
Response after Non-Final Action
Jan 20, 2026
Non-Final Rejection — §102, §103 (current)

Precedent Cases

Applications granted by this same examiner with similar technology

Patent 12592241
METHOD AND APPARATUS FOR ENCODING AND DECODING AUDIO SIGNAL USING COMPLEX POLAR QUANTIZER
2y 5m to grant Granted Mar 31, 2026
Patent 12591741
VIOLATION PREDICTION APPARATUS, VIOLATION PREDICTION METHOD AND PROGRAM
2y 5m to grant Granted Mar 31, 2026
Patent 12573369
METHOD FOR CONTROLLING UTTERANCE DEVICE, SERVER, UTTERANCE DEVICE, AND PROGRAM
2y 5m to grant Granted Mar 10, 2026
Patent 12561684
Evaluating User Status Via Natural Language Processing and Machine Learning
2y 5m to grant Granted Feb 24, 2026
Patent 12554926
METHOD, DEVICE, COMPUTER EQUIPMENT AND STORAGE MEDIUM FOR DETERMINING TEXT BLOCKS OF PDF FILE
2y 5m to grant Granted Feb 17, 2026
Study what changed to get past this examiner. Based on 5 most recent grants.

Prosecution Projections

1-2
Expected OA Rounds
77%
Grant Probability
99%
With Interview (+36.7%)
2y 8m
Median Time to Grant
Low
PTA Risk
Based on 360 resolved cases by this examiner. Grant probability derived from career allow rate.
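For readers who want to trace the 77% figure, it follows directly from the counts shown in the Examiner Intelligence section (276 granted of 360 resolved); the short sketch below reproduces that arithmetic. How the dashboard folds the +36.7% interview lift into the 99% with-interview figure is not specified here, so that step is not recomputed.

```python
# Reproduces the career allow rate from the counts shown above; the
# interview-adjusted figure depends on the dashboard's own (unspecified) model.
granted, resolved = 276, 360
print(f"Career allow rate: {granted / resolved:.1%}")  # 76.7%, displayed as 77%
```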
