Prosecution Insights
Last updated: April 19, 2026
Application No. 18/744,758

CAPTION DISPLAY CONTROL SYSTEM AND CAPTION DISPLAY CONTROL METHOD

Non-Final OA: §102, §103, §112
Filed
Jun 17, 2024
Examiner
BEUTEL, WILLIAM A
Art Unit
2616
Tech Center
2600 — Communications
Assignee
Sharp Kabushiki Kaisha
OA Round
1 (Non-Final)
70%
Grant Probability
Favorable
1-2
OA Rounds
2y 7m
To Grant
90%
With Interview

Examiner Intelligence

Grants 70% — above average
70%
Career Allow Rate
328 granted / 469 resolved
+7.9% vs TC avg
Strong +20% interview lift
[Chart: allow rate without vs. with an interview]
+20.4%
Interview Lift
among resolved cases with an interview
Typical timeline
2y 7m
Avg Prosecution
28 currently pending
Career history
497
Total Applications
across all art units
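
How these card figures fit together appears to be straightforward arithmetic, though the methodology is not stated here. The sketch below is an assumption-labeled reconstruction: it treats the 70% career allow rate as 328 granted over 469 resolved, the 497 total as resolved plus pending, and the 90% with-interview figure as the allow rate plus the +20.4-point lift; names and rounding are illustrative, not the dashboard's actual formula.

```python
# Hedged sketch (assumed derivation, not the dashboard's documented formula):
# how the Examiner Intelligence card figures appear to relate to one another.
granted, resolved, pending = 328, 469, 28        # counts shown on the card

career_allow_rate = granted / resolved           # 328/469 = 0.699... -> displayed as 70%
total_applications = resolved + pending          # 469 + 28 = 497 total applications
with_interview = career_allow_rate + 0.204       # +20.4-point interview lift -> ~90%

print(f"Career allow rate:              {career_allow_rate:.1%}")   # 69.9%
print(f"Total applications:             {total_applications}")      # 497
print(f"Grant probability w/ interview: {with_interview:.0%}")      # 90%
```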

Statute-Specific Performance

§101
9.9%
-30.1% vs TC avg
§103
49.8%
+9.8% vs TC avg
§102
10.7%
-29.3% vs TC avg
§112
22.0%
-18.0% vs TC avg
Black line = Tech Center average estimate • Based on career data from 469 resolved cases
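
One detail worth noting: all four "vs TC avg" deltas in this chart are consistent with a single Tech Center reference value of 40.0%, presumably the black-line estimate mentioned in the legend. That reading is an inference from the displayed numbers rather than a documented methodology; the short sketch below simply reproduces the deltas under that assumption.

```python
# Hedged sketch: reproduce the "vs TC avg" deltas assuming a single 40.0% TC reference
# (an inference from the displayed figures, not a stated methodology).
statute_rates = {"§101": 9.9, "§103": 49.8, "§102": 10.7, "§112": 22.0}  # examiner rates, %
tc_average_estimate = 40.0                                               # assumed black-line value

for statute, rate in statute_rates.items():
    delta = rate - tc_average_estimate
    print(f"{statute}: {rate:.1f}%  ({delta:+.1f}% vs TC avg)")          # matches the chart deltas
```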

Office Action

§102 §103 §112
DETAILED ACTION Notice of Pre-AIA or AIA Status The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA . In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis (i.e., changing from AIA to pre-AIA ) for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status. Claim Interpretation The following is a quotation of 35 U.S.C. 112(f): (f) Element in Claim for a Combination. – An element in a claim for a combination may be expressed as a means or step for performing a specified function without the recital of structure, material, or acts in support thereof, and such claim shall be construed to cover the corresponding structure, material, or acts described in the specification and equivalents thereof. The following is a quotation of pre-AIA 35 U.S.C. 112, sixth paragraph: An element in a claim for a combination may be expressed as a means or step for performing a specified function without the recital of structure, material, or acts in support thereof, and such claim shall be construed to cover the corresponding structure, material, or acts described in the specification and equivalents thereof. The claims in this application are given their broadest reasonable interpretation using the plain meaning of the claim language in light of the specification as it would be understood by one of ordinary skill in the art. The broadest reasonable interpretation of a claim element (also commonly referred to as a claim limitation) is limited by the description in the specification when 35 U.S.C. 112(f) or pre-AIA 35 U.S.C. 112, sixth paragraph, is invoked. As explained in MPEP § 2181, subsection I, claim limitations that meet the following three-prong test will be interpreted under 35 U.S.C. 112(f) or pre-AIA 35 U.S.C. 112, sixth paragraph: (A) the claim limitation uses the term “means” or “step” or a term used as a substitute for “means” that is a generic placeholder (also called a nonce term or a non-structural term having no specific structural meaning) for performing the claimed function; (B) the term “means” or “step” or the generic placeholder is modified by functional language, typically, but not always linked by the transition word “for” (e.g., “means for”) or another linking word or phrase, such as “configured to” or “so that”; and (C) the term “means” or “step” or the generic placeholder is not modified by sufficient structure, material, or acts for performing the claimed function. Use of the word “means” (or “step”) in a claim with functional language creates a rebuttable presumption that the claim limitation is to be treated in accordance with 35 U.S.C. 112(f) or pre-AIA 35 U.S.C. 112, sixth paragraph. The presumption that the claim limitation is interpreted under 35 U.S.C. 112(f) or pre-AIA 35 U.S.C. 112, sixth paragraph, is rebutted when the claim limitation recites sufficient structure, material, or acts to entirely perform the recited function. Absence of the word “means” (or “step”) in a claim creates a rebuttable presumption that the claim limitation is not to be treated in accordance with 35 U.S.C. 112(f) or pre-AIA 35 U.S.C. 112, sixth paragraph. The presumption that the claim limitation is not interpreted under 35 U.S.C. 112(f) or pre-AIA 35 U.S.C. 
112, sixth paragraph, is rebutted when the claim limitation recites function without reciting sufficient structure, material or acts to entirely perform the recited function. Claim limitations in this application that use the word “means” (or “step”) are being interpreted under 35 U.S.C. 112(f) or pre-AIA 35 U.S.C. 112, sixth paragraph, except as otherwise indicated in an Office action. Conversely, claim limitations in this application that do not use the word “means” (or “step”) are not being interpreted under 35 U.S.C. 112(f) or pre-AIA 35 U.S.C. 112, sixth paragraph, except as otherwise indicated in an Office action. This application includes one or more claim limitations that do not use the word “means,” but are nonetheless being interpreted under 35 U.S.C. 112(f) or pre-AIA 35 U.S.C. 112, sixth paragraph, because the claim limitation(s) uses a generic placeholder that is coupled with functional language without reciting sufficient structure to perform the recited function and the generic placeholder is not preceded by a structural modifier. Such claim limitation(s) is/are: “a feature extractor” and “a display form determiner” in claim1, and incorporated by reference in claims 2-16, “a signal separator” in claim 2 and incorporated by reference into claims 2-8, 11-12, and 14-16, “a sound feature extractor” in claim 4, “a caption feature extractor” in claims 7 and 11 and incorporated by reference in claims 8 and 12, and “a sound data extractor” in claim 16. Because this/these claim limitation(s) is/are being interpreted under 35 U.S.C. 112(f) or pre-AIA 35 U.S.C. 112, sixth paragraph, it/they is/are being interpreted to cover the corresponding structure described in the specification as performing the claimed function, and equivalents thereof. If applicant does not intend to have this/these limitation(s) interpreted under 35 U.S.C. 112(f) or pre-AIA 35 U.S.C. 112, sixth paragraph, applicant may: (1) amend the claim limitation(s) to avoid it/them being interpreted under 35 U.S.C. 112(f) or pre-AIA 35 U.S.C. 112, sixth paragraph (e.g., by reciting sufficient structure to perform the claimed function); or (2) present a sufficient showing that the claim limitation(s) recite(s) sufficient structure to perform the claimed function so as to avoid it/them being interpreted under 35 U.S.C. 112(f) or pre-AIA 35 U.S.C. 112, sixth paragraph. Claim Rejections - 35 USC § 112 The following is a quotation of 35 U.S.C. 112(b): (b) CONCLUSION.—The specification shall conclude with one or more claims particularly pointing out and distinctly claiming the subject matter which the inventor or a joint inventor regards as the invention. The following is a quotation of 35 U.S.C. 112 (pre-AIA ), second paragraph: The specification shall conclude with one or more claims particularly pointing out and distinctly claiming the subject matter which the applicant regards as his invention. Claim(s) 16 is/are rejected under 35 U.S.C. 112(b) or 35 U.S.C. 112 (pre-AIA ), second paragraph, as being indefinite for failing to particularly point out and distinctly claim the subject matter which the inventor or a joint inventor (or for applications subject to pre-AIA 35 U.S.C. 112, the applicant), regards as the invention. Claim 16 recites the limitation "the sound data extraction processor" in lines 3-4. There is insufficient antecedent basis for this limitation in the claim. Claim Rejections - 35 USC § 102 The following is a quotation of the appropriate paragraphs of 35 U.S.C. 
102 that form the basis for the rejections under this section made in this Office action: A person shall be entitled to a patent unless – (a)(1) the claimed invention was patented, described in a printed publication, or in public use, on sale, or otherwise available to the public before the effective filing date of the claimed invention. Claim(s) 1 and 17 is/are rejected under 35 U.S.C. 102(a)(1) as being anticipated by Ikegami (US 2009/0278986 A1). Regarding claim 1, Ikegami discloses: A caption display control system, (Ikegami, Abstract, Fig. 1 and ¶91: closed caption production device) comprising: a display that displays content; (Ikegami, ¶95: the video output device 300 is composed of a liquid crystal display panel or a plasma display panel) a feature extractor that extracts a feature of the content (Applicant’s specification, recites a number of different feature extractors, any one of which can be used as the corresponding structure, i.e. a processor with the corresponding algorithm - citations made to PG PUB 2024/0428484 A1 of Applicant’s specification, hereinafter “Spec.”: ¶29 of Spec., disclosing the caption feature extractor 6, the image feature extractor 7, the sound feature extractor 8; Caption feature extractor extracts a character string included in text data of a caption by searching text data for the string (see Spec. ¶37). Image feature extractor extracts color information as RGB values (Spec. ¶42) and using a “general image recognition technique” for extracting a person or object (Spec. ¶44). Examiner notes that this is an admission of use of known prior art in the form of object recognition. Sound feature extractor extracts a sound volume and a pitch based on a sound waveform of a sound signal, “extract[ing] the specific type of sound by means of a general sound recognition technique” (Spec. ¶48). Examiner notes that this is an admission of use of known prior art in the form of sound recognition to extract a specific type of sound. 
Ikegami, ¶17 discloses processing video signal and audio signal to obtain object information and sound position estimation, including “a video position detection means that detects a display position of the video of the sound generating object based on the video information acquired by the video recognition means, a sound recognition means that acquires information related to the sound of the sound generating object as sound information based on the sound signal, a sound position estimation means that estimates a generation position of the sound of the sound generating object based on the sound information acquired by the sound recognition means”; ¶99: video/sound recognizer; ¶104: CPU controls operations) a display form determiner that determines a caption display form based on the content feature (Ikegami, ¶17: “association means that associates the display position of the video of the sound generating object detected by the video position detection means and the generation position of the sound of the sound generating object estimated by the sound position estimation means, and a synthesized video production means that produces the synthesized video by arranging the closed caption produced by the closed caption production means in a position corresponding to the display position of the video detected by the video position detection means based on the display position of the video and the generation position of the sound that have been associated with each other by the association means.”; ¶104: CPU controls operations); and a display controller that displays the caption in the display in the display form determined by the display form determiner. (Ikegami, ¶30: “the closed caption of the sound of each working object is displayed in the synthesized video produced by the synthesized video production means such that the distance between the display position of the video of each working object and the display position of the closed caption of the sound of the working object is shorter than the distance between the display position of the video of the working object and the display position of the closed caption of the sound of the another working object”; ¶94: The closed caption production device 200 outputs the closed caption synthesized video signal WV to the video output device 300; ¶95: the video output device 300 is composed of a liquid crystal display panel or a plasma display panel, which causes video to which a closed caption is added based on the closed caption synthesized video signal WV to be displayed on a screen provided in the video output device 300; also Fig. 4 and ¶166: closed caption video signal WS of FIG. 4 (b) and the video signal V1 are synthesized to produce the closed caption synthesized video signal WV, so that the closed caption synthesized video shown in FIG. 4 (c) is displayed on the screen SC1 of the video output device 300)
Regarding claim 17, the system of claim 1 performs the method of claim 17 and as such claim 17 is rejected based on the same rationale as claim 1 set forth above.
Claim Rejections - 35 USC § 103
The following is a quotation of 35 U.S.C.
103 which forms the basis for all obviousness rejections set forth in this Office action: A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made. Claim(s) 2-7 and 16 is/are rejected under 35 U.S.C. 103 as being unpatentable over: Ikegami (US 2009/0278986 A1) modified by Sharp KK (JP 2007-300323 A, reference included in applicant’s IDS filed 6/17/2024, reference made to English Translation from EPO website) Regarding claim 2, the limitations included from claim 1 are rejected based on the same rationale as claim 1 set forth above. Further regarding claim 2, Ikegami further discloses: Wherein the feature extractor extracts a feature of an image as the content feature from the image signal (Ikegami, ¶23: the video position detection means may detect the display position of the video of each working object based on the video information acquired by the video recognition means), and The display form determiner determines a display form of the caption included in the caption signal based on the feature (Ikegami, ¶23: the synthesized video production means may produce the synthesized video by arranging the closed caption of the sound generated from each working object produced by the closed caption production means in the position corresponding to the display position of the video of each working object detected by the video position detection means based on the display position of the video of each working object; ¶29: The position corresponding to the display position of the video of each working object may be set such that a distance between the display position of the video of each working object and a display position of the closed caption of the sound of the working object is shorter than a distance between the display position of the video of the working object and a display position of a closed caption of sound of another working object) Ikegami does not explicitly disclose the signal separator to divide a multiplexed reception signal of the content into a caption signal, an image signal, and a sound signal Sharp KK discloses: A signal separator that divides a multiplexed reception signal of the content into a caption signal, an image signal, and a sound signal, (Sharp KK, ¶16: signal received from signal transmitting side is separated into the original subtitle signal, audio signal and video signal by a signal separating unit 2) Both Ikegami and Sharp KK are directed to systems and techniques for controlling the display of captions or subtitles for improved display in video. Moreover, it would have been obvious to one of ordinary skill in the art, before the effective filing date of the claimed invention and with a reasonable expectation of success, to modify the system and technique for determining caption audio and text and coordinating placement and effects of the caption text with recognized image data as provided by Ikegami, by including the technique for dividing a signal to obtain video, audio and caption data for processing as provided by Sharp KK, using known electronic interfacing and programming techniques. 
The modification merely substitutes one known type of video and caption data for another to obtain predictable results of obtaining the video, caption, and audio data utilized by the system in Ikegami for determining a coordination of image objects and audio timing with caption formatting, i.e. by receiving the data in a known data format that can be split to obtain the relevant components as provided by Sharp KK, instead of the format used by Ikegami. Instead of the extraction of the data to obtain the three relevant data elements as used by Ikegami, which are then utilized for assembling the synthesized video, the three relevant data elements are extracted from a multiplexed signal as taught by Sharp KK, for use with the object and sound recognition association processing of the data elements. Moreover, the modification results in an improved video processing system by allowing for extracting the data from different types of input video data for a more useful and versatile system. Regarding claim 3, Ikegami further discloses: The display form determiner determines the display form with reference to a table indicating corresponding relationship between the image feature and the display form set in advance based on the image feature (Ikegami, Figs. 3-4 disclose sound classification information produced by sound classifier and text production table used for closed caption synthesized video; ¶158: controller 4 compares the positions of the persons P1, P2, P3 on the screen recognized by the video recognition processing with the estimated positions of the presumptive persons Q1, Q2, Q3 stored as the sound classification information CT2; ¶161: As shown in FIG. 4 (a), the produced closed caption text signal WC is stored in the text production table in the storage device 4R; ¶165: “Moreover, the closed caption based on the closed caption text signal WC produced from the flight noise E1 of the helicopter HC of FIG. 3 is shown by the character string CX. Note that the character strings C1, C2, C3, CX are displayed with the respective display colors set in the text production table.”) Also note that Sharp KK discloses using tables 1 and 2 stored in advance for determining display information of the subtitle/caption (Sharp KK ¶20 discloses use of tables stored in advance; ¶25 discloses position of subtitles outputted by referring to stored tables). The modification is merely a simple substitution of known table types for caption data, yielding predictable results of using a pre-stored data table for caption data for presenting video captions in a synthesized video. 
Regarding claim 4, Ikegami further discloses: A sound feature extractor that extracts a sound feature of the content from the sound signal, wherein the display form determiner determines the display form based on the image feature and the sound feature (Ikegami, ¶108: sound classifier performs sound classification based on sound recognition information; ¶110: linking processing in which the working object and the sound source are associated with each other based on the video classification information CT1 and the sound classification information CT2; ¶112: When a result of the comparison indicates that the coordinate value of the working object in the video and the estimated coordinate value of the sound source are close to each other, that is, when the result of the comparison indicates not more than a predetermined threshold value, the controller 4 recognizes that the working object and the sound source correspond to each other; linking also discussed in ¶¶126-127; ¶¶158-159: controller 4 compares the positions of the persons P1, P2, P3 on the screen recognized by the video recognition processing with the estimated positions of the presumptive persons Q1, Q2, Q3 stored as the sound classification information CT2, where the persons P1, P2, P3 that are the working objects in the video are associated with the sounds of the speeches made by the respective persons by the linking processing; Fig. 3 and ¶¶162-163 discloses setting closed caption positions and linking sounds with positions – see e.g. Fig. 3c showing color and position information; Fig. 4 and ¶166 discloses producing video using table data)
Regarding claim 5, Ikegami further discloses: Wherein the display form includes at least one of elements selected from among a character size, a font, a color and a display position of the caption (Ikegami, ¶60: associating the display position of the video of the sound generating object that has been detected and the generation position of the sound of the sound generating object that has been estimated, and producing the synthesized video by arranging the closed caption that has been produced in a position corresponding to the display position of the video that has been detected based on the display position of the video and the generation position of the sound that have been associated with each other; ¶155: the sound classifier 5b sets the display time periods of the closed captions of the speeches for the respective persons Q1, Q2, Q3 based on the time information SY, and classifies the display time periods, and in addition, the sound classifier 5b sets display colors of the closed captions of the speeches for the respective persons Q1, Q2, Q3, and classifies the display colors; ¶196: the closed caption position of the person A on the screen is stored in the sound classification table; ¶201: the closed caption of the new speech made by the person A is displayed in the vicinity of the video of the person A on the screen SC1 of the video output device 300 as shown in FIG. 7 (d))
Regarding claim 6, Ikegami further discloses: Wherein the display form determiner determines a display position as the caption display form based on a position of a sound source serving as the image feature (Ikegami, ¶60: detecting a display position of the video of the sound generating object based on the video information that has been acquired, acquiring information related to the sound of the sound generating object as sound information based on the sound signal, estimating a generation position of the sound of the sound generating object based on the sound information that has been acquired, producing a closed caption that represents by a character the sound generated from the sound generating object based on the sound information that has been acquired, associating the display position of the video of the sound generating object that has been detected and the generation position of the sound of the sound generating object that has been estimated, and producing the synthesized video by arranging the closed caption that has been produced in a position corresponding to the display position of the video that has been detected based on the display position of the video and the generation position of the sound that have been associated with each other; ¶64: The closed caption of the sound generated from the sound generating object is arranged in the position corresponding to the display position of the video of the sound generating object based on the display position of the video and the generation position of the sound that have been associated with each other, so that the synthesized video is produced)
Regarding claim 7, Ikegami further discloses: A caption feature extractor that extracts a caption feature of the content from the caption signal, (Examiner notes that “caption feature” is just text data – see Spec. ¶36; Ikegami, ¶102: character codes representing sounds divided into text units – i.e. the text is the caption signal; Ikegami, ¶114: produces a closed caption text signal WC by performing a Japanese syllabary character--Chinese character conversion (Kana-Kanji conversion) to the series of character codes, and provides the closed caption text signal WC to the controller 4; Also ¶129: closed caption production device 200 converts the series of character codes into Chinese characters and Japanese syllabary characters; ¶¶146-147 discloses sound recognizer turning speeches into character codes and dividing the character codes into respective texts, and also turning sounds other than speech into character codes – i.e. this is a caption signal; ¶150: the sound recognition processing is performed by the sound recognizer 3b, so that the sound recognition information IF2 of FIG. 3 (b) is stored in the storage device 4R, and after that, the sound classification processing is performed by the sound classifier 5b – i.e. sound classifier performs functions on the caption signal; Fig. 3(b) and 3(c) and ¶154: the sound classifier 5b associates the character codes of "Speech: KO N NI CHI WA", the character codes of "Speech: WA TA SHI WA PI I TSU U DE SU" and the character codes of "Speech: HA JI ME MA SHI TE" obtained by the sound recognition processing with the persons Q1, Q2, Q3 – i.e. extracts text from the full speech caption to assign to persons Q1, Q2, Q3; Also ¶175: “the sound classifier 5b recognizes the texts from the series of character codes, and separates the character codes for each text.”) wherein the display form determiner determines the display form based on the image feature and the caption feature (Ikegami, Fig. 3(b) and 3(c) and ¶154: the sound classifier 5b associates the character codes of "Speech: KO N NI CHI WA", the character codes of "Speech: WA TA SHI WA PI I TSU U DE SU" and the character codes of "Speech: HA JI ME MA SHI TE" obtained by the sound recognition processing with the persons Q1, Q2, Q3, while associating the estimated positions of the sound sources corresponding to the respective speeches with the persons Q1, Q2, Q3; ¶¶158-159: the controller 4 compares the positions of the persons P1, P2, P3 on the screen recognized by the video recognition processing with the estimated positions of the presumptive persons Q1, Q2, Q3 stored as the sound classification information CT2 and associates the persons P1, P2, P3 recognized as the working objects in the video with the presumptive persons Q1, Q2, Q3. In this manner, the persons P1, P2, P3 that are the working objects in the video are associated with the sounds of the speeches made by the respective persons by the linking processing)
Regarding claim 16, Ikegami modified by Sharp KK further discloses: A sound data extractor that extracts sound indicated by the sound signal (Sharp KK, ¶17: voice extraction unit 3 extracts acoustic features such as the volume and pitch of the voice from the separated voice); and A conversion processor that converts the sound extracted by the sound data extraction processor into the caption signal indicating the characters indicating sound, wherein the display form determiner determines a display form of a caption indicated by the caption signal converted by the conversion processor (Sharp KK, ¶17: voice extraction unit 3 extracts acoustic features such as the volume and pitch of the voice from the separated voice, and subtitle display control unit 4 refers to Table 1, Table 2, etc. stored in advance in the storage device 5 based on the input audio feature information, where if information that the volume of the voice is “loud” and the pitch of the voice is “low” is extracted, information that the size of the characters is “large” and the color of the characters is “blue” is output, such that the character display information is added to the caption signal separated by the signal separation unit 2 and passed to the next stage caption superimposing unit 6 as caption display information) Both Ikegami and Sharp KK are directed to systems and techniques for controlling the display of captions or subtitles for improved display in video. Moreover, it would have been obvious to one of ordinary skill in the art, before the effective filing date of the claimed invention and with a reasonable expectation of success, to modify the system and technique for determining caption audio and text and coordinating placement and effects of the caption text with recognized image data as provided by Ikegami, including the technique for dividing a signal to obtain video, audio and caption data for processing as provided by Sharp KK, by further using the caption formatting based on extracted sound characteristics as provided by Sharp KK, using known electronic interfacing and programming techniques.
The modification results in an improved caption display by presenting formatting that conveys information in addition to the text, providing easier to understand visual information to assist with understanding the related sounds. Claim(s) 8 is/are rejected under 35 U.S.C. 103 as being unpatentable over: Ikegami (US 2009/0278986 A1) modified by Sharp KK (JP 2007-300323 A, reference included in applicant’s IDS filed 6/17/2024, reference made to English Translation from EPO website) and in further view of Nagaishi et al. (US 2004/0008277 A1). Regarding claim 8, the limitations included from claim 7 are rejected based on the same rationale as claim 7 set forth above. Further regarding claim 8, Ikegami further discloses: Wherein the display form determiner determines, (Ikegami, ¶¶259-260 discloses changing closed caption formatting based on the user speaking, including using different colors set for each person by sound classifier) Ikegami does not explicitly disclose the color information determined when a specific character string is included in the caption as the caption feature. Nagaishi discloses: Wherein the display form determiner determines, when a specific character string is included in the caption as the caption feature, the display form based on color information of the image serving as the image feature (Nagaishi, ¶10: character recognition unit recognizes character strings contained in the extracted superimposed captions on a real-time basis, and outputs character information containing character codes corresponding to the recognized character strings, and a display unit which displays the character strings based on the character information; ¶16: “in the caption extraction device of the present invention, a character processing unit may also be additionally provided that adds additional information for processing the recognized character strings by enlarging, deforming or coloring and so forth to the character information.”; ¶71: Reference symbol 1l indicates a keyword judgment section that judges whether or not a keyword registered in keyword DB 1m is present in a character string recognized by character recognition section 1c; ¶86: On the other hand, although voice synthesis section 1h synthesizes a voice based on a character code received from character recognition section 1c and outputs that voice from speaker 1i, when a predetermined keyword is received from keyword judgment section 1l, the tone quality of the voice is changed and output corresponding to that keyword or corresponding to the color of characters contained in the caption (and this color information is provided by color information extraction section 1k)). Ikegami, Sharp KK and Nagaishi are directed to systems and techniques for controlling the display of captions or subtitles for improved display in video. Moreover, it would have been obvious to one of ordinary skill in the art, before the effective filing date of the claimed invention and with a reasonable expectation of success, to modify the system and technique for determining caption audio and text and coordinating placement and effects of the caption text with recognized image data as provided by Ikegami, including the technique for dividing a signal to obtain video, audio and caption data for processing as provided by Sharp KK, by further using the technique for coloring captions as provided by Nagaishi, using known electronic interfacing and programming techniques. 
The modification results in improved caption display by providing better visual indicators to assist a user with reading and understanding the content. Claim(s) 9 is/are rejected under 35 U.S.C. 103 as being unpatentable over: Ikegami (US 2009/0278986 A1) modified by Catalano et al. (US 10,299,008 B1). Regarding claim 9, the limitations included from claim 1 are rejected based on the same rationale as claim 1 set forth above. Further regarding claim 9, Catalano discloses: Wherein the display form determiner determines a font as the display form based on a category of the content (Catalano, Fig. 4 and [8:42-9:9] discloses a closed captioning component for generating closed captions with optimal positioning and audio source output styles where “The characteristics of the font style could be unique to the character, and may be visually associated with the character through font color or other styling. The font styling may also convey the sentiment of the statement made by the character.”) Ikegami and Catalano are directed to systems and techniques for controlling the display of captions or subtitles for improved display in video. Moreover, it would have been obvious to one of ordinary skill in the art, before the effective filing date of the claimed invention and with a reasonable expectation of success, to modify the system and technique for determining caption audio and text and coordinating placement and effects of the caption text with recognized image data as provided by Ikegami, by further using the font styling based on video content as provided by Catalano, using known electronic interfacing and programming techniques. The modification results in improved video captioning by changing the font styles of captions to convey more information to a user for easier understanding of video context. Claim(s) 10 is/are rejected under 35 U.S.C. 103 as being unpatentable over: Ikegami (US 2009/0278986 A1) modified by Yang et al. (US 2009/0027552 A1) and in further view of Mears (US 2010/0225807 A1). Regarding claim 10, the limitations included from claim 1 are rejected based on the same rationale as claim 1 set forth above. Further regarding claim 10, Ikegami further discloses: Wherein caption on-screen for displaying the caption superimposed on the image and caption out-screen for displaying the caption (Ikegami, ¶¶158-159: the controller 4 compares the positions of the persons P1, P2, P3 on the screen recognized by the video recognition processing with the estimated positions of the presumptive persons Q1, Q2, Q3 stored as the sound classification information CT2 and associates the persons P1, P2, P3 recognized as the working objects in the video with the presumptive persons Q1, Q2, Q3. In this manner, the persons P1, P2, P3 that are the working objects in the video are associated with the sounds of the speeches made by the respective persons by the linking processing; ¶216: “In this example, the person C stored in the sound classification table does not exist on the screen. In this case, the text producer 6 determines that a sound source of the character codes stored in the sound classification table does not exist on the screen. Then, the text producer 6 decides a closed caption position in a predetermined region (standard position) on the screen.”; Fig. 
10(j) and ¶217; ¶293: superimpose closed caption on video) The display form determiner adds to the caption as the display form when the caption on-screen is selected depending on a category of the content (Ikegami, ¶155: the sound classifier 5b sets the display time periods of the closed captions of the speeches for the respective persons Q1, Q2, Q3 based on the time information SY, and classifies the display time periods, and in addition, the sound classifier 5b sets display colors of the closed captions of the speeches for the respective persons Q1, Q2, Q3, and classifies the display colors; Note, ¶23: the synthesized video production means may produce the synthesized video by arranging the closed caption ) Ikegami does not explicitly disclose the out-screen for displaying the caption as without being superimposed on the image and does not explicitly disclose the effect as an animation. Yang discloses: caption out-screen for displaying the caption without being superimposed on the image (Yang, ¶23: automatic adjusting text within video by determining edge of video and the text on the display is moved so that the viewer can watch the video and text without the text overlapping the video – See Fig. 5) Ikegami and Yang are directed to systems and techniques for controlling the display of captions or subtitles for improved display in video. Moreover, it would have been obvious to one of ordinary skill in the art, before the effective filing date of the claimed invention and with a reasonable expectation of success, to modify the system and technique for determining caption audio and text and coordinating placement and effects of the caption text with recognized image data as provided by Ikegami, by using the caption positioning location of Yang, using known electronic interfacing and programming techniques. The modification merely substitutes one known out-screen position of captions for another, yielding predictable results of ensuring that when a default position is used for displaying captions, the image is not obstructed, by locating the caption off the image. Furthermore, the modification results in an improved display of captions not related to any particular object on screen by ensuring the image itself is not obstructed for easier viewing of both the image data and the caption presented. This improves the readability of all the information provided while still maintaining the relevant location data for other caption data. Mears discloses: The display form determiner adds animation to the caption as the display form when the caption on-screen is selected depending on a category of the content (Mears, ¶20: replace select caption text with icons, which may be animated, e.g. laughing replaced by icon of face laughing) Ikegami, Yang and Mears are directed to systems and techniques for controlling the display of captions or subtitles for improved display in video. Moreover, it would have been obvious to one of ordinary skill in the art, before the effective filing date of the claimed invention and with a reasonable expectation of success, to modify the system and technique for determining caption audio and text and coordinating placement and effects of the caption text with recognized image data as provided by Ikegami, including using the caption positioning location of Yang, by further using the use of different characters for sounds as provided by Mears, using known electronic interfacing and programming techniques. 
The modification results in improved visualization of sounds for easier and faster recognition (see e.g. Mears, ¶20).
Claim(s) 11-12 is/are rejected under 35 U.S.C. 103 as being unpatentable over: Ikegami (US 2009/0278986 A1) modified by Sharp KK (JP 2007-300323 A, reference included in applicant’s IDS filed 6/17/2024, reference made to English Translation from EPO website) and in further view of Mears (US 2010/0225807 A1).
Regarding claim 11, the limitations included from claim 2 are rejected based on the same rationale as claim 2 set forth above. Further regarding claim 11, Mears discloses: A caption feature extractor that extracts a caption feature of the content from the caption signal, wherein when explanation about sound is included as the caption feature, the display form determiner replaces the explanation about sound by characters indicating the sound represented by the explanation about sound (Mears, ¶20: The graphical content may be introduced by replacing select caption text (such as text that is repetitively used) with icons (which may optionally be animated). For example, commonly used words may be replaced with associated icons indicative of the replaced words. "Laughter" may be replaced by an icon of a face laughing, while "applause" may be replaced by an icon of two hands clapping, for example. By way of further non-limiting example, when the word "whispering" is detected in a digital caption, an icon associated with and indicative of whispering (e.g., a profile of a person's head with hand put to side of mouth) may be displayed instead of the word "whispering". ¶22: the case of laughter, a simple "emoticon" with eyes closed and mouth open in a half-moon shape may be used; ¶25: If a match is found (step 230), the matching text may be replaced with the replacement character or icon (step 240)) Ikegami, Sharp KK and Mears are directed to systems and techniques for controlling the display of captions or subtitles for improved display in video. Moreover, it would have been obvious to one of ordinary skill in the art, before the effective filing date of the claimed invention and with a reasonable expectation of success, to modify the system and technique for determining caption audio and text and coordinating placement and effects of the caption text with recognized image data as provided by Ikegami, dividing a signal to obtain video, audio and caption data for processing as provided by Sharp KK, by further using the use of different characters for sounds as provided by Mears, using known electronic interfacing and programming techniques. The modification results in improved visualization of sounds for easier and faster recognition (see e.g. Mears, ¶20).
Regarding claim 12, Ikegami further discloses: Wherein the display form determiner changes the display form of the characters representing sound obtained by the replacement based on whether a sound source of the sound represented by the explanation of the sound is included in the image as the image feature (Ikegami, ¶216: “In this example, the person C stored in the sound classification table does not exist on the screen. In this case, the text producer 6 determines that a sound source of the character codes stored in the sound classification table does not exist on the screen. Then, the text producer 6 decides a closed caption position in a predetermined region (standard position) on the screen.”; Fig. 10(j) and ¶217)
Claim(s) 13 is/are rejected under 35 U.S.C. 103 as being unpatentable over: Ikegami (US 2009/0278986 A1) modified by Shibata (US 2013/0293775 A1)
Regarding claim 13, the limitations included from claim 1 are rejected based on the same rationale as claim 1 set forth above. Further regarding claim 13, Shibata discloses: A plurality of signal receivers that receive a caption signal, an image signal, a sound signal of the content respectively (Shibata, Abstract: recorder for digital caption broadcasts includes a video signal receiver, an audio signal receiver, and a caption data signal receiver that respectively receive video signals, audio signals, and caption data signals from a digital caption broadcast; ¶11: invention is a recorder for digital caption broadcast composed of: a video signal receiver, an audio signal receiver, and a caption data signal receiver that respectively receive video signals, audio signals, and caption data signals from a digital caption broadcast; a video signal transmitter, an audio signal transmitter, and a caption data signal transmitter that respectively receive signals from the video signal receiver, the audio signal receiver, and the caption data signal receiver and transmit the received signals to a viewer monitor; and a variable display adjustor provided between the video signal receiver and audio signal receiver, and the video signal transmitter and audio signal transmitter, the variable display adjustor being capable of adjusting timings of video display and sound generation to match caption display on the viewer monitor.) Both Ikegami and Shibata are directed to systems and techniques for processing video data including processing audio, image and caption data. Moreover, it would have been obvious to one of ordinary skill in the art, before the effective filing date of the claimed invention and with a reasonable expectation of success, to modify the system and technique for determining caption audio and text and coordinating placement and effects of the caption text with recognized image data as provided by Ikegami, by including the receiver architecture as provided by Shibata, using known electronic interfacing and programming techniques. The modification merely substitutes one known type of video data communication architecture for another to obtain predictable results of obtaining the video, caption, and audio data utilized by the system in Ikegami for determining a coordination of image objects and audio timing with caption formatting, i.e. by receiving the data in a known data format as provided by Shibata, instead of the format used by Ikegami. Instead of the extraction of the data to obtain the three relevant data elements as used by Ikegami, which are then utilized for assembling the synthesized video, the three relevant data elements are obtained by the architecture provided by Shibata, for use with the object and sound recognition association processing of the data elements. Moreover, the modification results in an improved video processing system by allowing for extracting the data from different types of input video data for a more useful and versatile system. Alternatively, the modification allows for recording of the processed video data in separated elements, allowing for easier replayability and reducing the need for redundant processing every time a user wants to view media.
Claim(s) 14 is/are rejected under 35 U.S.C.
103 as being unpatentable over: Ikegami (US 2009/0278986 A1) modified by Sharp KK (JP 2007-300323 A, reference included in applicant’s IDS filed 6/17/2024, reference made to English Translation from EPO website) and in further view of Cuthbert et al. (US 2015/0134318 A1). Regarding claim 14, the limitations included from claim 2 are rejected based on the same rationale as claim 2 set forth above. Further regarding claim 14, Cuthbert discloses: An image data extraction processor that extracts characters included in the image indicating by the image signal (Cuthbert, ¶56 discloses processing images to extract text, where “The translator 115 includes a text identifier 120 that can identify text in images and other types of documents. In some implementations, the text identifier 120 analyzes images using optical character recognition ("OCR") to identify text depicted by the images”; Also ¶57; and A conversion processor that converts the characters extracted by the image data extraction processor into the caption signal (Cuthbert, ¶56-¶57: text identifier determines text in an image and sends data regarding the text to translation image, where data can be the text itself), wherein The display form determiner determines a display form of a caption indicated by the caption signal converted by the conversion processor (Cuthbert, ¶58: translation engine translates text to different language; ¶59: user interface can define how information is presented, e.g. overlay over image 107 or using context selector based on arrangement of other characteristics of text depicted in the image; See Fig. 6 and ¶84 showing image processed with resulting overlaid text translation in particular location of translated text in image) Ikegami, Sharp KK and Cuthbert are directed to systems and techniques for controlling the display of captions or subtitles for improved display in processed image data. Moreover, it would have been obvious to one of ordinary skill in the art, before the effective filing date of the claimed invention and with a reasonable expectation of success, to modify the system and technique for determining caption audio and text and coordinating placement and effects of the caption text with recognized image data as provided by Ikegami, dividing a signal to obtain video, audio and caption data for processing as provided by Sharp KK, by further using the text image processing for changing image-based text for presentation in a different caption format as provided by Cuthbert, using known electronic interfacing and programming techniques. The modification improves upon the presentation of image media with related text-captions by allowing for more robust processing of the media by extracting image data, as well as audio and subtitle data, for providing easier understanding of all the content presented to a user. Claim(s) 15 is/are rejected under 35 U.S.C. 103 as being unpatentable over: Ikegami (US 2009/0278986 A1) modified by Sharp KK (JP 2007-300323 A, reference included in applicant’s IDS filed 6/17/2024, reference made to English Translation from EPO website) and Cuthbert et al. (US 2015/0134318 A1) and in further view of Cowburn et al. (US 2022/0076017 A1). Regarding claim 15, the limitations included from claim 14 are rejected based on the same rationale as claim 14 set forth above. 
Further regarding claim 15, Cowburn discloses: An image signal processor that performs the image processing to remove the characters extracted by the image data extraction processor in the image indicated by the image signal (Cowburn, ¶78: At operation 610, the presentation module 502 generates a presentation of the image, wherein the presentation of the image includes an updated text string based on the personalization request. In some example embodiments, the updated text string may be presented based on the form of the surface identified by the identification module 504, based on deformities of the typeface. For example, one or more characters of the text string may be changed or removed from the image, and replaced with new characters, or simply filled in with the background pattern – see Figs. 8A to 8B) Ikegami, Sharp KK, Cuthbert and Cowburn are directed to systems and techniques for controlling the display of text related to image content using image processing. Moreover, it would have been obvious to one of ordinary skill in the art, before the effective filing date of the claimed invention and with a reasonable expectation of success, to modify the system and technique for determining caption audio and text and coordinating placement and effects of the caption text with recognized image data as provided by Ikegami, dividing a signal to obtain video, audio and caption data for processing as provided by Sharp KK, and using the text image processing for changing image-based text for presentation in a different caption format as provided by Cuthbert, by further removing unwanted image data that has been identified and processed as provided by Cowburn, using known electronic interfacing and programming techniques. The modification improves upon the presentation of image media with related text-caption processing related to the image content by allowing removal of unwanted text from images that might otherwise be confusing to a viewer or unnecessarily clutter an image, allowing easier focus by the user on relevant data for presentation.
Conclusion
The prior art made of record and not relied upon is considered pertinent to applicant's disclosure. Chou et al. (US 2013/0066623 A1) teaches that insertion, translation and removal of objects was known at the time of the effective filing date of the claimed invention, wherein such processing was performed dynamically and within media streams or videos, (Chou, ¶13: network-based dynamic insertion and removal of video objects in accordance with one embodiment of the present disclosure. As used herein, the terminology "video object" includes any type of video data, image data, graphics, watermarking, text, audio, etc.)
Any inquiry concerning this communication or earlier communications from the examiner should be directed to WILLIAM A BEUTEL whose telephone number is (571)272-3132. The examiner can normally be reached Monday-Friday 9:00 AM - 5:00 PM (EST). Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice. If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, DANIEL HAJNIK can be reached at 571-272-7642. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000. /WILLIAM A BEUTEL/Primary Examiner, Art Unit 2616

Prosecution Timeline

Jun 17, 2024
Application Filed
Jan 13, 2026
Non-Final Rejection — §102, §103, §112 (current)

Precedent Cases

Applications granted by this same examiner with similar technology

Patent 12581262
AUGMENTED REALITY INTERACTION METHOD AND ELECTRONIC DEVICE
2y 5m to grant • Granted Mar 17, 2026
Patent 12572258
APPARATUS AND METHOD WITH IMAGE PROCESSING USER INTERFACE
2y 5m to grant • Granted Mar 10, 2026
Patent 12566531
CONFIGURING A 3D MODEL WITHIN A VIRTUAL CONFERENCING SYSTEM
2y 5m to grant • Granted Mar 03, 2026
Patent 12561927
MEDIA RESOURCE DISPLAY METHOD AND APPARATUS, DEVICE, AND STORAGE MEDIUM
2y 5m to grant • Granted Feb 24, 2026
Patent 12554384
SYSTEMS AND METHODS FOR IMPROVED CONTENT EDITING AT A COMPUTING DEVICE
2y 5m to grant • Granted Feb 17, 2026
Study what changed to get past this examiner. Based on 5 most recent grants.

AI Strategy Recommendation

Get an AI-powered prosecution strategy using examiner precedents, rejection analysis, and claim mapping.

Prosecution Projections

1-2
Expected OA Rounds
70%
Grant Probability
90%
With Interview (+20.4%)
2y 7m
Median Time to Grant
Low
PTA Risk
Based on 469 resolved cases by this examiner. Grant probability derived from career allow rate.
