Prosecution Insights
Last updated: April 17, 2026
Application No. 18/408,622

SYSTEM FOR ENABLING PROCESSING RICH MEDIA DATA

Status: Non-Final OA (§101, §102)
Filed: Jan 10, 2024
Examiner: SHARMA, NEERAJ
Art Unit: 2659
Tech Center: 2600 — Communications
Assignee: unknown
OA Round: 1 (Non-Final)

Grant Probability: 85% (Favorable)
Expected OA Rounds: 1-2
Time to Grant: 2y 9m
Grant Probability with Interview: 96%

Examiner Intelligence

Grants 85% — above average

Career Allow Rate: 85% (387 granted / 457 resolved; +22.7% vs TC avg)
Interview Lift: +11.5% among resolved cases with interview (moderate, ~+12%)
Avg Prosecution: 2y 9m (typical timeline)
Total Applications: 476 across all art units (19 currently pending)
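The headline figures above are simple arithmetic over the examiner's career record. A minimal sketch, assuming the 96% "with interview" figure is just the career allow rate plus the reported interview lift (the dashboard does not state its exact model):

```python
granted, resolved = 387, 457

# Career allow rate: share of resolved cases that were granted.
career_allow_rate = granted / resolved * 100   # ~84.7%, displayed as 85%

# Interview lift reported by the dashboard.
interview_lift = 11.5

# Assumed model: probability with interview = base rate + lift.
with_interview = career_allow_rate + interview_lift   # ~96.2%, displayed as 96%

print(round(career_allow_rate), round(with_interview))  # prints: 85 96
```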

Statute-Specific Performance

§101: 13.9% (-26.1% vs TC avg)
§103: 39.5% (-0.5% vs TC avg)
§102: 28.7% (-11.3% vs TC avg)
§112: 6.4% (-33.6% vs TC avg)
Black line = Tech Center average estimate • Based on career data from 457 resolved cases
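A quick consistency check on these figures: subtracting each "vs TC avg" delta from the examiner's rate should recover the Tech Center baseline. A short sketch, assuming delta = examiner rate minus TC average, which is how the chart caption reads:

```python
# (examiner allow rate %, delta vs TC average %) per statute, from the table above
stats = {
    "101": (13.9, -26.1),
    "103": (39.5, -0.5),
    "102": (28.7, -11.3),
    "112": (6.4, -33.6),
}

# Recover the implied TC average for each statute: rate - delta.
tc_avg = {s: round(rate - delta, 1) for s, (rate, delta) in stats.items()}
print(tc_avg)  # every statute recovers the same 40.0% baseline
```

Every entry comes out to 40.0%, consistent with the single "Tech Center average estimate" line described in the chart caption.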

Office Action

Rejections under §101 and §102
DETAILED ACTION

Introduction

1. This office action is in response to Applicant's submission filed on 01/10/2024. The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA. Claims 1-33 are currently pending and examined below.

Drawings

2. The drawings filed on 01/10/2024 have been accepted and considered by the Examiner.

Claim Rejections - 35 USC § 101

35 U.S.C. 101 reads as follows:

Whoever invents or discovers any new and useful process, machine, manufacture, or composition of matter, or any new and useful improvement thereof, may obtain a patent therefor, subject to the conditions and requirements of this title.

3. Claims 1-33 are rejected under 35 U.S.C. 101 as being nothing more than an abstract idea. As an example, regarding claim 1, the limitations of obtaining speech input, obtaining a transcript from it, segmenting said transcript, labeling the segments and then reflecting those segments with regards to the input data all fall under the category of mental processes. These steps are drafted at a high level of generality without tying them to a specific technological improvement. More specifically, these steps can be performed in the mind of a human being with at most the aid of a pen and paper but for the recitation of generic computer components, and thus the claim falls within the "Mental Processes" grouping of abstract ideas. Accordingly, this claim recites an abstract idea. This judicial exception is not integrated into a practical application because the recitation of a device, a system, processor and/or a computer readable medium merely reads on generalized computer components, based upon the claim interpretation wherein the structure is interpreted using the specification. The claim does not include additional elements that are sufficient to amount to significantly more than the judicial exception.
As discussed above with respect to the integration of the abstract idea into a practical application, the additional element of using generalized computer components to generate, extract, determine, and generate amounts to no more than mere instructions to apply the exception using a generic computer component. Mere instructions to apply an exception using a generic computer component cannot provide an inventive concept. The claim is therefore not patent eligible. Claims 2-33 only provide certain details of the mental processes outlined above, such as completing the process of claim 1 in real-time, detecting pauses in the input audio, clipping said segments into still shorter portions, matching the segments or clippings with predefined key strings, etc. These are all steps which themselves can also be accomplished by a human being with at most the aid of a pen and paper and hence also do not amount to significantly more than the judicial exception.

Claim Rejections - 35 USC § 102

In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.

The following is a quotation of the appropriate paragraphs of 35 U.S.C. 102 that form the basis for the rejections under this section made in this Office action:

A person shall be entitled to a patent unless – (a)(1) The claimed invention was patented, described in a printed publication, or in public use, on sale or otherwise available to the public before the effective filing date of the claimed invention.

4. Claims 1-5, 7-9, 13-22 and 29-33 are rejected under 35 U.S.C. 102(a)(1) as being anticipated by Sandison (U.S. Patent Application Publication # 2017/0092277 A1).
With regards to claim 1, Sandison teaches a system for processing rich media data comprising audio data, wherein a server comprising one or more processors is configured to receive the rich media data comprising the audio data (Para 32, teaches a rich media content or RMC file stored on one or more of the storage devices. A user input provided through a selected client device results in the transfer and display of respective video and audio components at the client device); process the audio data to generate a transcript of the audio data (Para 44, teaches audio-to-text capability of the system from an audio signal); process the transcript to automatically segregate the transcript into a plurality of segments (Para 35, teaches that the audio frames are separated and forwarded to an audio decoder circuit); associate each of the segments with at least one label, among a plurality of labels that are predefined (Para 49, teaches that each new audio segment can be classified through heuristic mechanisms to identify/assign that segment to an existing known speaker. A new speaker not matching the database may be identified as an “unknown speaker” until such time that further analysis can determine that the speaker is an existing speaker, or labeling information is entered to identify the new speaker under its own heading in the VCDL); and reflect the segments and the label associated with each of the segments on the rich media data (Para 51, teaches a VCDL that is organized on a per-speaker basis. For each identified speaker, each occurrence of each word spoken by that speaker is logged as a separate entry in the table).
With regards to claim 2, Sandison teaches the system of claim 1, wherein the server is configured to source the rich media data from a live stream and the segregation occurs in real-time during the live stream (Para 47, teaches that the viseme recognition circuit operates to apply viseme recognition to sequences of video frames in the video stream having a visible human speaker).

With regards to claim 3, Sandison teaches the system of claim 1, wherein the server is configured to segregate the rich media data into a plurality of segments upon detecting at least one discontinuation in the received rich media data (Para 47, teaches that the end user can select the associated recording, and the application can be configured to begin playback at the number of seconds associated with the object, minus a small interval to enable playback of the entire quote or phrase spoken by the speaker).

With regards to claim 4, Sandison teaches the system of claim 3, wherein the discontinuation corresponds to instances when at least one speaker ceases speaking or takes a pause in the rich media data (Para 47, teaches that the end user can select the associated recording, and the application can be configured to begin playback at the number of seconds associated with the object, minus a small interval to enable playback of the entire quote or phrase spoken by the speaker).

With regards to claim 5, Sandison teaches the system according to claim 1, where two of the segments linked to a first label among the plurality of labels have varying durations (Paragraphs 62-64, teach providing a brief description of the associated file along with other information such as an excerpt including the search input, the start time of the location of the excerpt, a duration of the entire file, etc. The playback will initiate within a selected time frame of the detected text, such as five seconds prior to the detected occurrence.
This time frame may be adjusted through user selection e.g., from one second to 30 seconds prior to the occurrence, etc.).

With regards to claim 7, Sandison teaches the system according to claim 1, comprising a database comprising the plurality of labels, with each of the labels associated with a list of key strings, wherein the list of key strings is predefined (Para 56, teaches a VCDL processing circuit that accesses an associated VCDL from memory and performs a search to match the input text search string to the contents of the VCDL. Para 24, teaches that the voice characteristics dynamic library or VCDL data structure may be sorted to arrange the spoken words by source e.g., individual speakers, etc.).

With regards to claim 8, Sandison teaches the system according to claim 7, wherein the server is configured to screen data in the transcript in a sequence to identify presence of at least one of the key strings and associate each of the key strings with at least one label (Para 19, teaches that the system generates a VCDL data structure that allows a user to locate particular words that were spoken or otherwise presented within the content of the files, and initiates playback of the RMC file at that location).

With regards to claim 9, Sandison teaches the system according to claim 1, comprising a database comprising the plurality of labels, with each of the labels associated with a list of key strings, wherein the server is configured to screen data in the transcript in a sequence to identify presence of at least one of the key strings and associate one segment from the plurality of segments with at least one label, wherein said segment comprises of the at least one of the key strings (Para 49, teaches that each new audio segment can be classified through heuristic mechanisms to identify/assign that segment to an existing known speaker.
A new speaker not matching the database may be identified as an “unknown speaker” until such time that further analysis can determine that the speaker is an existing speaker, or labeling information is entered to identify the new speaker under its own heading in the VCDL. The key string here is interpreted as the actual name of the speaker, while a label is the category of speaker names).

With regards to claim 13, Sandison teaches the system of claim 1, wherein the server is configured to automatically retrieve at least one of the segments, associated with at least one of the labels, from the plurality of labels, wherein the at least one of the labels is received by the server (Paragraphs 58-59 and figure 8, teach different search strategies based on different search input values. The different search strategies are labeled from A to E. The first search involves a simple search for all occurrences of the word “the” in the VCDL. As can be seen, this results in a first, large number of matches. This result is based on the fact that the word “the” is commonly employed and all hits by all speakers would be grouped in the output search results. Search B uses the term “the” plus the identification of a particular speaker. This narrows the number of matches, and represents all of the occurrences of the word “the” as spoken by the selected Speaker A. Search C adds a selected time frame to the strategy for Search B. Depending on the range of the time frame, this may result in a narrowed search set).

With regards to claim 14, Sandison teaches the system of claim 1, wherein the server is configured to automatically retrieve at least two of the segments, each one associated with at least one of two labels from the plurality of labels, wherein the two labels are received by the server (Paragraphs 58-59 and figure 8, teach different search strategies based on different search input values. The different search strategies are labeled from A to E.
The first search involves a simple search for all occurrences of the word “the” in the VCDL. As can be seen, this results in a first, large number of matches. This result is based on the fact that the word “the” is commonly employed and all hits by all speakers would be grouped in the output search results. Search B uses the term “the” plus the identification of a particular speaker. This narrows the number of matches, and represents all of the occurrences of the word “the” as spoken by the selected Speaker A. Search C adds a selected time frame to the strategy for Search B. Depending on the range of the time frame, this may result in a narrowed search set).

With regards to claim 15, Sandison teaches the system of claim 1, wherein the server is configured to automatically retrieve at least one of the segments, associated with one of the labels from the plurality of labels, to a user based on the label associated with said user (Paragraphs 58-59 and figure 8, teach different search strategies based on different search input values. The different search strategies are labeled from A to E. The first search involves a simple search for all occurrences of the word “the” in the VCDL. As can be seen, this results in a first, large number of matches. This result is based on the fact that the word “the” is commonly employed and all hits by all speakers would be grouped in the output search results. Search B uses the term “the” plus the identification of a particular speaker. This narrows the number of matches, and represents all of the occurrences of the word “the” as spoken by the selected Speaker A. Search C adds a selected time frame to the strategy for Search B. Depending on the range of the time frame, this may result in a narrowed search set).
With regards to claim 16, Sandison teaches the system of claim 1, wherein the server is configured to automatically reflect at least two of the segments on the rich media data, wherein each of the at least two of the segments are reflected separately (Para 51, teaches a VCDL that is organized on a per-speaker basis. For each identified speaker, each occurrence of each word spoken by that speaker is logged as a separate entry in the table).

With regards to claim 17, Sandison teaches the system of claim 1, wherein the server is configured to automatically reflect at least two of the segments on the rich media data, wherein each of the at least two of the segments are reflected together (Para 51, further teaches that for each speaker entry, other information is stored together with the speaker information as well, such as an associated timestamp and an RMC file name).

With regards to claim 18, Sandison teaches the system of claim 1, wherein the server is configured to automatically reflect at least one of the segments based on a key string received, wherein the segment comprises at least one of the key strings from a list of key strings (Para 56, teaches a VCDL processing circuit that accesses an associated VCDL from memory and performs a search to match the input text search string to the contents of the VCDL. Para 24, teaches that the voice characteristics dynamic library or VCDL data structure may be sorted to arrange the spoken words by source e.g., individual speakers, etc.).

With regards to claim 19, Sandison teaches the system of claim 1, wherein the server is configured to automatically reflect at least one of the segments comprising at least one of the key strings from the list of key strings, wherein the at least one of the key strings is received from a user (Para 56, teaches a VCDL processing circuit that accesses an associated VCDL from memory and performs a search to match the input text search string to the contents of the VCDL.
Para 24, teaches that the voice characteristics dynamic library or VCDL data structure may be sorted to arrange the spoken words by source e.g., individual speakers, etc.).

With regards to claim 20, Sandison teaches the system of claim 1, wherein the server is configured to reflect at least one of the segments on the rich media data to a user, associated with one of the labels from the plurality of labels, based on an input received from said user, wherein the server is configured to enable the user to select from the plurality of labels (Para 56, teaches a VCDL processing circuit that accesses an associated VCDL from memory and performs a search to match the input text search string to the contents of the VCDL. This results in the identification of a selected RMC file at the starting point at which the audio text corresponding to the input search terms commences within the file. A playback device accesses an RMC file repository, such as a suitable memory, to initiate playback of the selected RMC file at the associated location).

With regards to claim 21, Sandison teaches the system of claim 1, wherein the server is configured to reflect plurality of segments on the rich media data to a user, associated with at least two of the labels from the plurality of labels, based on an input received from said user, wherein the server is configured to enable the user to select at least two of the labels from the plurality of labels (Para 60 and figure 8, teach an example search D which uses a longer search term, “the three bears.” Adding additional terms to the search string significantly narrows the search set to 2 hits in this example. It will be appreciated that the search can be for the specific string, or can be tailored for hits involving all three of these words in any order within a given elapsed time interval. Finally, Search E adds a file name to Search D, which is interpreted as the second label. This narrows the search further to a single output).
With regards to claim 22, Sandison teaches the system of claim 1, wherein the server is configured to create a clipped rich media data from the received rich media data, wherein duration of the clipped rich media data is shorter than the rich media data (Para 64, teaches that the playback will initiate within a selected time frame of the detected text, such as five seconds prior to the detected occurrence. This time frame may be adjusted through user selection e.g., from one second to 30 seconds prior to the occurrence, etc. In other cases, a repeating loop option can be generated whereby a selected clip of selected duration, such as 15 seconds, may be repeated until the user ceases further playback).

With regards to claim 29, Sandison teaches the system of claim 1, wherein the server is configured to generate a clipped rich media data from the received rich media data, wherein the clipped rich media data comprises of at least one of the segments associated with a label from the predefined plurality of labels, wherein duration of the clipped rich media data is shorter than the rich media data (Para 49, teaches that each new audio segment can be classified through heuristic mechanisms to identify/assign that segment to an existing known speaker. A new speaker not matching the database may be identified as an “unknown speaker” until such time that further analysis can determine that the speaker is an existing speaker, or labeling information is entered to identify the new speaker under its own heading in the VCDL. The key string here is interpreted as the actual name of the speaker, while a label is the category of speaker names. Both key string and actual name can be a search string. Para 64, teaches that the playback will initiate within a selected time frame of the detected text, such as five seconds prior to the detected occurrence. This time frame may be adjusted through user selection e.g., from one second to 30 seconds prior to the occurrence, etc.
In other cases, a repeating loop option can be generated whereby a selected clip of selected duration, such as 15 seconds, may be repeated until the user ceases further playback).

With regards to claim 30, Sandison teaches the system of claim 1, wherein the server is configured to generate a clipped rich media data from the received rich media data, wherein the clipped rich media data comprises of at least one of the segment associated with at least two of the labels from the predefined plurality of labels, wherein duration of the clipped rich media data is shorter than the rich media data (Paragraphs 58-59 and figure 8, teach different search strategies based on different search input values. The different search strategies are labeled from A to E. The first search involves a simple search for all occurrences of the word “the” in the VCDL. As can be seen, this results in a first, large number of matches. This result is based on the fact that the word “the” is commonly employed and all hits by all speakers would be grouped in the output search results. Search B uses the term “the” plus the identification of a particular speaker. This narrows the number of matches, and represents all of the occurrences of the word “the” as spoken by the selected Speaker A. Search C adds a selected time frame to the strategy for Search B. Depending on the range of the time frame, this may result in a narrowed search set).
With regards to claim 31, Sandison teaches the system of claim 1, wherein the server is configured to generate a clipped rich media data from the received rich media data, wherein the clipped rich media data comprises of at least two of the segments associated with at least two of the labels from the predefined plurality of labels, wherein each of the at least two segments is associated with a plurality of labels and the duration of the clipped rich media data is shorter than the rich media data (Figure 8, shows different search inputs which are interpreted as segments. Each of the segments is associated with a plurality of labels, e.g. speaker name with operator “the” or without “the”, with duration or without duration. The duration of the clipped rich media data being shorter than the rich media data is shown in para 64 as pointed out earlier).

With regards to claim 32, Sandison teaches the system of claim 1, wherein the server is configured to generate a clipped rich media data from the received rich media data, wherein the clipped rich media data is linked to at least one of the segments comprising at least one key string (Para 2, teaches a data structure stored in a memory that links each portion of the reference audio sequence with an associated time stamp that identifies a time location of the associated portion of the reference audio sequence within the media content file with respect to a reference point of the media content file. Figure 8 along with paragraphs 58-60, teach how the search strings are linked to input segments).

With regards to claim 33, Sandison teaches the system of claim 1, wherein the server is configured to provide a plurality of markers to a user (Para 69, teaches that the user selects a desired RMC file and the system initiates playback at an intermediate point in the RMC file in the vicinity of, and just prior to, the occurrence of the search input string in the selected file);
allow the user to position the plurality of markers on the rich media data (Para 69, teaches that the intermediate point is in the vicinity of the search string input by the user); and create a clipped rich media data from the received rich media data based on the placement of the markers on the rich media data (Para 70, teaches that the playback of the RMC file on a display device is initiated beginning at the intermediate point of the RMC file responsive to the time stamp associated with the selected spoken word, and that portion of the RMC file prior to the intermediate point is not displayed to the user. This eliminates the need for a manual search operation to locate the intermediate point, since the system uses the timestamp data to calculate an appropriate starting point to begin playback).

Allowable Subject Matter

5. Claims 6, 10-12 and 23-28 are objected to as being dependent upon a rejected base claim, but would be allowable if rewritten in independent form including all of the limitations of the base claim and any intervening claims, and if the rejection under 35 U.S.C. 101 is overcome. The prior art of record, alone or in combination, does not currently suggest or teach the invention as outlined in these claims. More detailed reasons for allowance will be outlined as and when the Application proceeds to allowability.

Conclusion

6. The following prior art, made of record but not relied upon, is considered pertinent to applicant's disclosure: Vennelakanti (U.S. Patent Application Publication # 2013/0241834 A1) and Deutscher (U.S. Patent Application Publication # 2004/0001106 A1). These references are also included in the PTO-892 form attached with this office action.

Information regarding the status of an application may be obtained from the Patent Application Information Retrieval (PAIR) system. Status information for published applications may be obtained from either Private PAIR or Public PAIR.
Status information for unpublished applications is available through Private PAIR only. For more information about the PAIR system, see http://pair-direct.uspto.gov. If you have questions on access to the Private PAIR system, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). In case you would like assistance from a USPTO Customer Service Representative or access to the automated information system, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.

Any inquiry concerning this communication or earlier communications from the examiner should be directed to NEERAJ SHARMA whose contact information is given below. The examiner can normally be reached on Monday to Friday 8 am to 5 pm. If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Pierre Louis-Desir, can be reached on 571-272-7799 (Direct Phone). The fax number for the organization where this application or proceeding is assigned is 571-273-8300.

/NEERAJ SHARMA/
Primary Examiner, Art Unit 2659
571-270-5487 (Direct Phone)
571-270-6487 (Direct Fax)
neeraj.sharma@uspto.gov (Direct Email)
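The claim 1 pipeline that the §101 rejection characterizes as a mental process (transcribe, segment, match key strings, assign labels) can be illustrated with a toy sketch. Everything below is invented for illustration: the labels, key strings, and function names come from this sketch, not from the application or the Sandison reference.

```python
# Illustrative only: labels and key strings are hypothetical, not from the application.
LABELS = {
    "greeting": ["hello", "welcome"],
    "pricing": ["cost", "price"],
}

def segment_transcript(transcript):
    """Split a transcript into segments at sentence boundaries
    (a stand-in for the claimed pause/discontinuation detection)."""
    return [s.strip() for s in transcript.split(".") if s.strip()]

def label_segments(segments):
    """Associate each segment with every predefined label whose key strings it contains."""
    labeled = []
    for seg in segments:
        hits = [label for label, keys in LABELS.items()
                if any(k in seg.lower() for k in keys)]
        labeled.append((seg, hits or ["unknown"]))
    return labeled

transcript = "Hello and welcome to the show. The price went up again"
print(label_segments(segment_transcript(transcript)))
# [('Hello and welcome to the show', ['greeting']), ('The price went up again', ['pricing'])]
```

The rejection's point is precisely that each of these steps, at this level of generality, could be done by hand; the sketch shows the steps, not a technological improvement.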

Prosecution Timeline

Jan 10, 2024: Application Filed
Jan 09, 2026: Non-Final Rejection under §101 and §102 (current)

Precedent Cases

Applications granted by this same examiner with similar technology

Patent 12597428: DISPLAY DEVICE, CONTROL METHOD OF DISPLAY DEVICE, AND RECORDING MEDIUM
Granted Apr 07, 2026 (2y 5m to grant)

Patent 12591736: FINE-TUNED LARGE LANGUAGE MODELS FOR CAPABILITY CONTROLLER
Granted Mar 31, 2026 (2y 5m to grant)

Patent 12579983: SPEECH RECOGNITION DEVICE, SPEECH RECOGNITION METHOD, AND NON-TRANSITORY COMPUTER-READABLE MEDIUM
Granted Mar 17, 2026 (2y 5m to grant)

Patent 12573403: SCENE-AWARE SPEECH RECOGNITION USING VISION-LANGUAGE MODELS
Granted Mar 10, 2026 (2y 5m to grant)

Patent 12566076: AD-HOC NAVIGATION INSTRUCTIONS
Granted Mar 03, 2026 (2y 5m to grant)
Study what changed to get past this examiner. Based on 5 most recent grants.

AI Strategy Recommendation

Get an AI-powered prosecution strategy using examiner precedents, rejection analysis, and claim mapping.

Prosecution Projections

Expected OA Rounds: 1-2
Grant Probability: 85%
With Interview: 96% (+11.5%)
Median Time to Grant: 2y 9m
PTA Risk: Low
Based on 457 resolved cases by this examiner. Grant probability derived from career allow rate.
