Prosecution Insights
Last updated: April 19, 2026
Application No. 18/928,032

DATA PROCESSING TECHNOLOGY USING AUDIO FEATURE ENCODING

Status: Non-Final OA (§103)
Filed: Oct 27, 2024
Examiner: HAGHANI, SHADAN E
Art Unit: 2485
Tech Center: 2400 (Computer Networks)
Assignee: Hanwha Vision Co., Ltd.
OA Round: 2 (Non-Final)

Grant Probability: 60% (Moderate)
Expected OA Rounds: 2-3
Median Time to Grant: 2y 11m
Grant Probability With Interview: 79%

Examiner Intelligence

Grants 60% of resolved cases.

Career Allow Rate: 60% (221 granted / 366 resolved; +2.4% vs TC avg)
Interview Lift: strong, +18.6% for resolved cases with an interview
Avg Prosecution: 2y 11m typical timeline (33 applications currently pending)
Career History: 399 total applications across all art units
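
The headline numbers above follow from simple arithmetic on the examiner's career counts. Below is a minimal sketch reproducing them; the 0.79 with-interview rate is read off this page, the rest is derived, and the variable names are illustrative.

```python
# A minimal sketch reproducing the headline statistics above from the raw
# career counts shown on this page. The 0.79 with-interview rate is taken
# from the page; everything else is derived.

granted, resolved = 221, 366
career_allow_rate = granted / resolved                       # ~0.604 -> "60%"

with_interview_rate = 0.79                                   # "With Interview"
interview_lift = with_interview_rate - career_allow_rate     # ~0.186 -> "+18.6%"

print(f"Career allow rate: {career_allow_rate:.1%}")         # 60.4%
print(f"Interview lift:    {interview_lift:+.1%}")           # +18.6%
```
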

Statute-Specific Performance

§101: 2.1% (-37.9% vs TC avg)
§103: 60.3% (+20.3% vs TC avg)
§102: 13.8% (-26.2% vs TC avg)
§112: 16.1% (-23.9% vs TC avg)

Tech Center averages are estimates. Based on career data from 366 resolved cases.

Office Action (§103)
DETAILED ACTION

Notice of Pre-AIA or AIA Status

The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.

Claim Rejections - 35 USC § 103

The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action: A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

Claim(s) 1-5, 8 are rejected under 35 U.S.C. 103 as being unpatentable over Kim (US PG Publication 2024/0233385) in view of Bernal (US PG Publication 2018/0063538).

Regarding Claim 1, Kim discloses a method for processing data (video captioning server 120, Fig. 1 [0053]), comprising: obtaining video data (video caption unit 123 can separate video data into vision data and audio data [0055]; vision server 121 collecting vision data of video data [0054]) and extracting a video feature (creating a vision attention vector [0058]) from the obtained video data (vision data [0058]); obtaining audio data (audio server 122 collecting audio data of video data [0054]) related to the video data (of video data [0054]; video caption unit 123 can separate video data into vision data and audio data [0055]) and extracting an audio feature (audio attention vector [0058]) from the obtained audio data (audio data [0058]); detecting a preset event (separate time-series sections by setting behavior stop points [0055]; automatically detect behavior events [0063]) on the basis of one or more of the video feature and the audio feature (feature values of an I3D model and a VGGish model can be configured into a multi-modal type in a vanilla transformer architecture [0063]); upon detecting the preset event (when a specific dangerous behavior is sensed [0050]) … generating transmission data (reported to a manager and detailed information about a criminal situation is transmitted [0050]) ….
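
For orientation, the detection step Kim is cited for combines an I3D-style video feature with a VGGish-style audio feature and classifies events from the combined features. Below is a minimal sketch of that flow; the extractors, the threshold classifier standing in for Kim's multimodal transformer, and all names, shapes, and values are hypothetical illustrations, not Kim's implementation.

```python
# A minimal, hypothetical sketch of the detection flow Kim is cited for:
# extract a video feature and an audio feature, then detect a preset event
# from the combined features. The stand-ins below replace real I3D/VGGish
# extractors and Kim's multimodal transformer.
import numpy as np

def extract_video_feature(frames: np.ndarray) -> np.ndarray:
    # Stand-in for an I3D-style video feature (Kim [0063]).
    return frames.mean(axis=(0, 1))

def extract_audio_feature(samples: np.ndarray) -> np.ndarray:
    # Stand-in for a VGGish-style 128-d audio embedding (Kim [0062]).
    hist, _ = np.histogram(samples, bins=128, range=(-1.0, 1.0))
    return hist / len(samples)

def detect_preset_event(video_feat, audio_feat, threshold=0.4) -> bool:
    # Stand-in for the multimodal classifier: score the fused features
    # against a preset (arbitrary) threshold.
    score = float(np.concatenate([video_feat, audio_feat]).max())
    return score > threshold

frames = np.random.rand(16, 224, 3)          # toy "video"
samples = np.random.uniform(-1, 1, 16000)    # toy 1 s of audio
if detect_preset_event(extract_video_feature(frames), extract_audio_feature(samples)):
    print("preset event detected -> perform conversion/encoding, build transmission data")
```
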
Kim does not disclose, but Bernal (US PG Publication 2018/0063538) teaches upon detecting the preset event (once the classification is learned [0038]), performing conversion processing (vector encoding [0038]) and encoding processing (Huffman, arithmetic or Lempel Ziv coding [0038]) on the video feature (features descriptive of objects of interest classified as vehicles, features descriptive of objects of interest classified as structures, features descriptive of actions of interest, features that discriminate actions A, B, C and D from each other [0021]) and the audio feature (frequency and phase descriptors in the case of one-dimensional sequential data such as audio [0030]) (the feature space [0038]); and generating transmission data (compressed data stream can be stored or transmitted [0031]) including the video feature ([0021]) and the audio feature ([0030]) (the compression module may generate a compressed data representation of the feature representation extracted by the feature extraction module [0028]) on which the conversion processing and the encoding processing have been performed (vector quantization encoding and arithmetic/Huffman/Lempel Ziv [0038]).

One of ordinary skill in the art before the application was filed would have been motivated to supplement the event detection of Kim with the content-based compression of Bernal because Bernal teaches that compressed image/video data comprises less data than raw image/video data, facilitating communication of the information over a bandwidth-constrained communication link and alleviating the bottleneck that communication link 360 could otherwise have posed in the data collection process [0021].
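
The conversion + encoding combination Bernal is cited for reduces the feature vectors to a compact transmissible payload. Below is a minimal sketch under stated assumptions: a uniform 8-bit quantizer stands in for Bernal's vector quantization, and zlib's DEFLATE (a Lempel-Ziv-family coder) stands in for the Huffman/arithmetic/Lempel-Ziv options; all names are illustrative.

```python
# A minimal sketch of the conversion + encoding combination Bernal is cited
# for: quantize each feature vector (conversion processing), entropy-code it
# (encoding processing), and pack the results into transmission data.
import json
import zlib
import numpy as np

def convert(feature: np.ndarray) -> np.ndarray:
    # Conversion processing: uniform 8-bit quantization (a stand-in for
    # Bernal's vector quantization [0038]).
    lo, hi = float(feature.min()), float(feature.max())
    return np.round(255 * (feature - lo) / (hi - lo + 1e-9)).astype(np.uint8)

def encode(quantized: np.ndarray) -> bytes:
    # Encoding processing: Lempel-Ziv-family compression via DEFLATE.
    return zlib.compress(quantized.tobytes())

video_feature = np.random.rand(1024).astype(np.float32)
audio_feature = np.random.rand(128).astype(np.float32)

transmission_data = json.dumps({
    "video_feature": encode(convert(video_feature)).hex(),
    "audio_feature": encode(convert(audio_feature)).hex(),
}).encode()
print(len(transmission_data), "bytes to transmit")  # far smaller than raw frames
```
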
Regarding Claim 2, Kim discloses the method of claim 1, wherein, in the detecting of the preset event, upon satisfaction of a condition related to the preset event in both the video feature and the audio feature, it is determined that the preset event has been detected (automatically detect behavior events and create video caption information in an artificial intelligence model; accordingly, it is possible to easily figure out the context in each section by automatically setting breakpoints (behavior stop points) using all of vision and audio information [0063]).

Regarding Claim 3, Kim discloses the method of claim 1, wherein, in the detecting of the preset event, upon satisfaction of a condition related to the preset event in any one of the video feature and the audio feature, it is determined that the preset event has been detected (setting behavior stop points on the basis of vision data [0055]).

Regarding Claim 4, Kim discloses the method of claim 1, wherein the performing of the conversion processing and the encoding processing includes: converting the video feature (feature values of an I3D model [0063]) and encoding the converted video feature to generate video feature encoded data (the vision encoder vector created by the encoder unit 210 [0060]); and converting the audio feature (feature values of a VGGish model [0063]) and encoding the converted audio feature to generate audio feature encoded data (the audio encoder vector created by the encoder unit 210 [0060]).

Regarding Claim 5, Kim discloses the method of claim 4, wherein the transmission data includes the video feature encoded data and the audio feature encoded data (the vision encoder vector and the audio encoder vector created by the encoder unit 210 [0060]).

Regarding Claim 8, Kim discloses the method of claim 1, wherein the transmission data includes metadata related to the detected event (subtitle data related to video data may be obtained by a caption unit 242 [0060], Fig. 2).

Claim(s) 6-7, 9 are rejected under 35 U.S.C. 103 as being unpatentable over Kim (US PG Publication 2024/0233385) in view of Li (NPL: "Video Description Combining Visual and Audio Features," SPIE, April 2023).

Regarding Claim 6, Kim discloses the method of claim 1, wherein the performing of the conversion processing and the encoding processing includes: converting the video feature (feature values of an I3D model [0063]); converting the audio feature (feature values of a VGGish model [0063]). Kim does not disclose, but Li teaches fusing the converted video feature and the converted audio feature to generate fusion feature data (by splicing the two features, the spliced features are linearly transformed to unify the features, and then the average pooling operation is used to obtain the global features; features are embedded in time sequence, Section 3.1). One of ordinary skill in the art before the application was filed would have been motivated to fuse the visual and audio feature vectors of Kim as taught by Li because Li suggests that fusing the audio-visual data can generate more accurate description text (Section 5) and improve the convergence speed of the model (Section 3.4).

Regarding Claim 7, Kim discloses the method of claim 6. Kim does not disclose, but Li teaches wherein the transmission data includes fusion-encoded data generated by encoding the fusion feature data (features are embedded in time sequence through time sequence encoding and sent to the encoder of the transformer). The motivation to combine is the same as for Claim 6.
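
The fusion step Li is cited for in Claims 6-7 splices the two converted features, linearly transforms the splice to a common width, and average-pools it into a global feature (Li, Section 3.1). A minimal sketch follows; the dimensions and the random projection matrix are placeholders, not Li's trained weights.

```python
# A minimal sketch of the fusion Li is cited for (Section 3.1): splice the
# converted video and audio feature sequences, linearly transform the splice
# to a common width, and average-pool into a global fusion feature.
import numpy as np

T, d_video, d_audio, d_model = 16, 1024, 128, 512
video_seq = np.random.rand(T, d_video)     # converted video features per step
audio_seq = np.random.rand(T, d_audio)     # converted audio features per step

spliced = np.concatenate([video_seq, audio_seq], axis=-1)  # (T, 1152)
W = np.random.randn(d_video + d_audio, d_model) * 0.01     # "unifying" linear map
fused_seq = spliced @ W                                    # (T, 512)
global_feature = fused_seq.mean(axis=0)                    # average pooling -> (512,)
print(global_feature.shape)
```
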
Regarding Claim 9, Kim discloses the method of claim 1. Kim does not disclose, but Li teaches wherein the transmission data includes time stamp information for synchronization of the video feature and the audio feature (as a sequence file, the video needs to be embedded with timing encoding to enable the model to learn the context information, Section 3.4). One of ordinary skill in the art before the application was filed would have been motivated to embed the feature vectors of Kim with timing information because Li teaches that by embedding timing information, the transformer encoder can better recognize the sequence information in the video, resulting in a more accurate predictive description of the content (Section 5).
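
The time stamp limitation of Claim 9 maps onto Li's timing encoding: each feature timestep is tagged with timing information so that the video and audio features can be aligned downstream. A minimal sketch, using the standard sinusoidal transformer encoding as a stand-in for Li's timing encoding (Section 3.4); shapes and the 30-steps-per-second rate are illustrative.

```python
# A minimal sketch of timing encoding as a stand-in for Li's time-sequence
# embedding (Section 3.4): tag each feature timestep with timing information
# so downstream models can recover sequence order and synchronization.
import numpy as np

def timing_encoding(num_steps: int, dim: int) -> np.ndarray:
    pos = np.arange(num_steps)[:, None]                       # (T, 1)
    i = np.arange(dim)[None, :]                               # (1, d)
    angles = pos / np.power(10000.0, (2 * (i // 2)) / dim)
    return np.where(i % 2 == 0, np.sin(angles), np.cos(angles))

T, d = 16, 512
fused_seq = np.random.rand(T, d)
timestamps = np.arange(T) / 30.0        # per-step time stamps for synchronization
encoded_seq = fused_seq + timing_encoding(T, d)
print(encoded_seq.shape, timestamps[:3])
```
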
Claim(s) 10, 11, 12, 15 are rejected under 35 U.S.C. 103 as being unpatentable over Kim (US PG Publication 2024/0233385) in view of Bernal (US PG Publication 2018/0063538) and Fujimura (US PG Publication 2023/0153610).

Regarding Claim 10, Kim discloses a method comprising: obtaining audio data (audio server 122 collecting audio data of video data [0054]; video caption unit 123 can separate video data into vision data and audio data [0055]) and extracting an audio feature (audio attention vector [0058]) from the obtained audio data (audio data [0058]); detecting a preset event (detecting behavior events [0063]) on the basis of the audio feature (feature values of the VGGish model [0063]; VGGish converting in-image audio into a 128-d feature and providing the feature as input to a downstream classification model [0062]); upon detecting the preset event (when a specific dangerous behavior is sensed [0050]) … generating transmission data (reported to a manager and detailed information about a criminal situation is transmitted [0050]) ….

Kim does not disclose, but Bernal teaches performing conversion processing (vector encoding [0038]) and encoding processing (Huffman, arithmetic or Lempel Ziv coding [0038]) on the video feature ([0021]) and the audio feature ([0030]) (the feature space [0038]); and generating transmission data (compressed data stream can be stored or transmitted [0031]) including the video feature ([0021]) and the audio feature ([0030]) (the compression module may generate a compressed data representation of the feature representation extracted by the feature extraction module [0028]) on which the conversion processing and the encoding processing have been performed (vector quantization encoding and arithmetic/Huffman/Lempel Ziv [0038]).

Kim does not disclose, but Fujimura (US PG Publication 2023/0153610) teaches upon detecting the preset event (the precise moment of impact from [0067]), extracting a video feature from video data related to the detected event (the video-processing machine trained network 310 also receives the output 125 of the sound-processing machine trained network 110 for each golf swing [0067]; each portion in the video that is associated with a golf swing is fed to the video-processing machine trained network 310 as input to produce the video-processing machine trained network 310 output data 325 [0068]).

One of ordinary skill in the art before the application was filed would have been motivated to supplement the event detection of Kim with the content-based compression of Bernal because Bernal teaches that compressed image/video data comprises less data than raw image/video data, facilitating communication of the information over a bandwidth-constrained communication link and alleviating the bottleneck that communication link 360 could otherwise have posed in the data collection process [0021]. One of ordinary skill in the art would further have been motivated to modify Kim using the teachings of Fujimura to identify video segments based on precise moments identified through sound because Fujimura teaches that video does not have sufficient temporal resolution to identify all major events, and relying on sound data can enable the system to interpolate additional data samples in video [0067], providing information to users of the system and improving user experience.
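
The audio-first ordering Fujimura is cited for in Claim 10 locates the event moment in the audio stream, then extracts video features only from the frames around that moment. A minimal sketch follows; the energy-spike detector, threshold, and window size are hypothetical stand-ins for Fujimura's trained networks.

```python
# A minimal sketch of the audio-first ordering Fujimura is cited for: locate
# the event moment in the audio stream, then extract video features only
# from the frames around that moment.
import numpy as np

def detect_event_sample(audio: np.ndarray, threshold: float = 0.8):
    # Stand-in detector: index of the loudest sample, if loud enough.
    idx = int(np.argmax(np.abs(audio)))
    return idx if abs(audio[idx]) > threshold else None

sr, fps = 16000, 30
audio = np.random.uniform(-0.2, 0.2, sr * 4)
audio[37000] = 0.95                      # synthetic "impact" at ~2.3 s

hit = detect_event_sample(audio)
if hit is not None:
    frame = int(hit / sr * fps)          # map audio sample index -> video frame
    lo, hi = max(0, frame - 15), frame + 15
    print(f"extract video features for frames {lo}..{hi}")
```
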
Regarding Claim 11, the claim is rejected on the grounds provided in Claim 4. Regarding Claim 12, the claim is rejected on the grounds provided in Claim 5. Regarding Claim 15, the claim is rejected on the grounds provided in Claim 8.

Claim(s) 13-14, 16-19 are rejected under 35 U.S.C. 103 as being unpatentable over Kim (US PG Publication 2024/0233385) in view of Bernal (US PG Publication 2018/0063538), Fujimura (US PG Publication 2023/0153610), and Li (NPL: "Video Description Combining Visual and Audio Features," SPIE, April 2023).

Regarding Claim 13, the claim is rejected on the grounds provided in Claim 6. Regarding Claim 14, the claim is rejected on the grounds provided in Claim 7. Regarding Claim 16, the claim is rejected on the grounds provided in Claim 9.

Regarding Claim 17, Kim does not disclose, but Li teaches a data processing device (Intel(R) Xeon(R) Bronze 3106 CPU, Section 4.2) comprising: a memory configured to store input data (inherent); and a processor coupled to the memory, wherein the processor is configured to perform operations (Intel(R) Xeon(R) Bronze 3106 CPU @ 1.70GHz and a GPU (NVIDIA-SMI), using Python 3, Section 4.2). The remainder of Claim 17 is rejected on the grounds provided in Claim 10. One of ordinary skill in the art before the application was filed would have been motivated to implement the teachings of Kim on a computer because computer-software implementation of algorithms is one of the most common ways to automate complex mathematical algorithms.

Regarding Claim 18, the claim is rejected on the grounds provided in Claim 8. Regarding Claim 19, the claim is rejected on the grounds provided in Claim 9.

Response to Arguments

Applicant's arguments filed 2/27/2026 are persuasive in part.

Applicant argues on Page 3 that Kim teaches away from the invention because Kim performs "conversion processing" regardless of whether the preset event is detected. Remarks at 3. This argument is unpersuasive because (1) the doctrine of "teaching away" does not apply to a 102 rejection (see MPEP 2131.05); and (2) the claims do not require that "conversion processing" not be performed when an event is not detected. The claims only require conversion processing upon an event being detected, and because Kim always performs conversion processing, Kim performs it upon an event being detected. Applicant may amend the claims to require that no conversion be performed when the preset event is not detected. This rebuttal applies to the new reference relied upon for teaching conversion processing.

Applicant argues on Page 4 that the Examiner has cited the wrong paragraph to demonstrate that Kim discloses "detecting a preset event," while also conceding that Kim discloses "detecting a preset event." Remarks at 4. Applicant's argument is therefore moot. Regardless, Kim explains that by detecting "behavior events" it sets "behavior stop points." Kim at [0063]. The citation to "behavior stop points" therefore provides applicant with sufficient information to discern that Kim discloses "detecting a preset event."

Applicant argues on Page 5 that Kim does not disclose "detecting a preset event on the basis of audio feature," as required by Claim 10. Remarks at 5. This is not persuasive because Kim discloses, "Feature values of an I3D model and a VGGish model can be configured into a multi-modal type in a vanilla transformer architecture… to automatically detect behavior events." Kim at [0063]. The "feature values" from "VGGish" are audio features. Kim at [0062]. Applicant further argues that the audio feature must be "extracted" and "not any audio feature that has been calculated with added information such as learning." Remarks at 6. This is not persuasive because the claims do not exclude additional features from being considered, and Applicant has not demonstrated how Kim's audio features generated by the VGGish model fail to be "extracted." This rebuttal applies to Applicant's arguments regarding Claim 17 at Page 8 of the Remarks.

Examiner believes that Kim does not disclose "upon detecting the preset event, extracting a video feature from video data related to the detected event," as required by Claim 10, and a new reference is relied upon to teach that step.

Conclusion

The prior art made of record and not relied upon is considered pertinent to applicant's disclosure:

Liu, "Visually-aware audio captioning with adaptive audio-visual attention," arXiv:2210.16428v3, May 2023: an encoder-decoder architecture for combining visual and audio features for captioning video, using an adaptive self-attention block.

Lin, "TAVT: Towards Transferable Audio-Visual Text Generation," Association for Computational Linguistics, July 2023: an encoder-decoder architecture for adding text to videos using a meta-mapper to identify semantic audio features.

US-20230020834-A1: an encoder-decoder network for captioning video.

Any inquiry concerning this communication or earlier communications from the examiner should be directed to SHADAN E HAGHANI, whose telephone number is (571) 270-5631. The examiner can normally be reached M-F, 9AM-5PM. Examiner interviews are available via telephone, in person, and video conferencing using a USPTO-supplied web-based collaboration tool.

To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice. If attempts to reach the examiner by telephone are unsuccessful, the examiner's supervisor, Jay Patel, can be reached at 571-272-2988. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.

Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.

/SHADAN E HAGHANI/
Examiner, Art Unit 2485

Prosecution Timeline

Oct 27, 2024: Application Filed
Nov 25, 2025: Non-Final Rejection (§103)
Feb 27, 2026: Response Filed
Mar 17, 2026: Non-Final Rejection (§103), current

Precedent Cases

Applications granted by this examiner in similar technology

Patent 12604020: VIDEO DECODING METHOD AND DECODER DEVICE (granted Apr 14, 2026; 2y 5m to grant)
Patent 12598323: INTER PREDICTION-BASED VIDEO ENCODING AND DECODING (granted Apr 07, 2026; 2y 5m to grant)
Patent 12586336: WEARABLE DEVICE, METHOD, AND NON-TRANSITORY COMPUTER READABLE STORAGE MEDIUM CONTROLLING LIGHT RADIATION OF LIGHT SOURCE (granted Mar 24, 2026; 2y 5m to grant)
Patent 12574549: CHROMA INTRA PREDICTION WITH FILTERING (granted Mar 10, 2026; 2y 5m to grant)
Patent 12568225: LIMITING A NUMBER OF CONTEXT CODED BINS FOR RESIDUE CODING (granted Mar 03, 2026; 2y 5m to grant)
Study what changed to get past this examiner, based on the 5 most recent grants.

Prosecution Projections

Expected OA Rounds: 2-3
Grant Probability: 60%
Grant Probability With Interview: 79% (+18.6%)
Median Time to Grant: 2y 11m
PTA Risk: Moderate

Based on 366 resolved cases by this examiner. Grant probability derived from career allow rate.

Free tier: 3 strategy analyses per month