Prosecution Insights
Last updated: April 19, 2026
Application No. 18/607,290

Transcriptive Biomechanical System And Method

Status: Non-Final OA (§103)
Filed: Mar 15, 2024
Examiner: MA, MICHELLE HAU
Art Unit: 2617
Tech Center: 2600 — Communications
Assignee: Odd Thinking LLC
OA Round: 1 (Non-Final)

Grant Probability: 81% (Favorable)
Expected OA Rounds: 1-2
Time to Grant: 2y 7m
Grant Probability With Interview: 99%

Examiner Intelligence

Career Allow Rate: 81% (17 granted / 21 resolved; +19.0% vs TC avg, above average)
Interview Lift: +36.4% (allow rate with vs. without interview, based on resolved cases with an interview)
Avg Prosecution: 2y 7m (typical timeline; 35 applications currently pending)
Total Applications: 56 (across all art units)

Statute-Specific Performance

§101: 3.0% (-37.0% vs TC avg)
§103: 84.2% (+44.2% vs TC avg)
§102: 6.4% (-33.6% vs TC avg)
§112: 5.5% (-34.5% vs TC avg)
Tech Center averages are estimates. Based on career data from 21 resolved cases.

Office Action

§103
DETAILED ACTION

Notice of Pre-AIA or AIA Status

The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.

Drawings

The drawings are objected to as failing to comply with 37 CFR 1.84(p)(5) because they include the following reference character(s) not mentioned in the description: “234” and “238” in Fig. 2, “316” in Fig. 3. Corrected drawing sheets in compliance with 37 CFR 1.121(d), or amendment to the specification to add the reference character(s) in the description in compliance with 37 CFR 1.121(b), are required in reply to the Office action to avoid abandonment of the application. Any amended replacement drawing sheet should include all of the figures appearing on the immediate prior version of the sheet, even if only one figure is being amended. Each drawing sheet submitted after the filing date of an application must be labeled in the top margin as either “Replacement Sheet” or “New Sheet” pursuant to 37 CFR 1.121(d). If the changes are not accepted by the examiner, the applicant will be notified and informed of any required corrective action in the next Office action. The objection to the drawings will not be held in abeyance.

Specification

The disclosure is objected to because of the following informalities:
- In paragraph 12 lines 11-12, “to have augment their experience” should be revised for clarity. Perhaps “to have augment their experience” should read “to have their experience augmented” or “to augment their experience”.
- In paragraph 15 line 6, “there is long pauses” should read “there are long pauses”.
- In paragraph 23 line 1, “either” should be removed.
- In paragraph 23 line 4, “display 140 of to the user device” should read “display 140 of the user device”.
Appropriate correction is required.

Claim Objections

Claims 2, 4, 6-9, 11-15, and 16-20 are objected to because of the following informalities:
- In claim 2 line 2, it is unclear whether “input data” refers to the “acoustic, language, or biomechanical inputs” from claim 1 or to a different kind of input data.
- Claim 4 recites the limitation "the lexical, syntactic, semantic data" in lines 3-4. There is insufficient antecedent basis for this limitation in the claim.
- In claim 6, “The method of claim 6” should read “The method of claim 5”. For the sake of examination, claim 6 is taken to be dependent on claim 5.
- In claim 7 line 3, “to deriving” should read “to derive”.
- In claim 8 line 1, “wherein the part of speech” should read “wherein part of speech”.
- In claim 9 line 3, “motion data;” should read “motion data.”.
- In claim 11 line 6, “3-Dimensional objects include components from the” should read “3-Dimensional objects include components from which the”.
- In claim 11 line 9, “such that the animated biomechanical model of the interaction of the elements” may be revised for clarity to read “such that interaction of elements of the animated biomechanical model”.
- In claim 11 line 12, “based on the generated motion” should read “based on generated motion”.
- Claims 12-15 are objected to because of their dependency on claim 11.
- In claim 16 line 7, “atleast one of acoustic phoenetic” should read “at least one of acoustic, phonetic”.
- In claim 16 line 8, “atleast one of acoustic, phoenetic” should read “at least one of acoustic, phonetic”.
- In claim 16 line 13, “matching the part of the speech” should read “matching part of the speech”.
- In claim 17 line 1, “the part of speech” should read “the part of the speech”.
- In claim 18 line 3, “motion data;” should read “motion data.”.
- Claims 17-20 are objected to because of their dependency on claim 16.
Appropriate correction is required.

Claim Rejections - 35 USC § 103

In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis (i.e., changing from AIA to pre-AIA) for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.

The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:

A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

Claims 1-2, 4, 7, 9, 16, and 18 are rejected under 35 U.S.C. 103 as being unpatentable over Quatieri et al. (US 20220079511 A1) in view of Huang et al. (Model-based articulatory phonetic features for improved speech recognition) and Xie et al. (CN 112381913 A), hereinafter Quatieri, Huang, and Xie respectively.

Regarding claim 1, Quatieri teaches a computer-implemented method for generating biomechanical data output (Paragraph 0008, 0028 – “The method may also display an image of a vocal tract on a display device. The display device may be configured to play the audio recording and simultaneously animate the image of the vocal tract to display the physical configuration of the vocal tract of the speaker to provide visualization of where an articulatory deviation may occur due to a disorder… system 100 may comprise various software modules, libraries, and the like that may be stored on a non-transitory medium which, when executed by a processor, cause the processor and/or an associated computer system to perform the functions and implement the features described”; Note: the method generates an animation of the vocal tract, which is equivalent to a biomechanical data output. The method is performed by a computer system), the method comprising:

generating translated data from acoustic, language, or biomechanical inputs and outputs (Paragraph 0030, 0033 – “The system 100 may include a feature extractor 204 that receives the digital version of waveform 102 and produces feature coefficients Y.sub.M representing characteristics of at least a portion of the acoustic waveform 102…The system 100 also includes a vocal tract variable generator 208 that receives the feature coefficients Y.sub.M and produces vocal tract variable TV.sub.N vectors. The vocal tract variables are numerical representations, specified in terms, for example representing a state of the user's 104 vocal tract 108 during articulation of the sound in the waveform 102… vocal tract variables that can be included may describe features or positions of the nasal cavity, buccal cavity, nostrils, epiglottis, trachea, hard palate, or any other element of a person's vocal tract”; Note: vocal tract variables, which are equivalent to translated data, are generated from feature coefficients of input audio, which is acoustic input);

generating a stack of 2-Dimensional objects using a biomechanical model that maps each sound unit or language chunk to 2-Dimensional objects that are "rigged" (Fig. 4, Paragraph 0048, 0051 – “the system 100 may generate TV vectors that represent articulation and position of elements of the speaker's vocal tract. These variables may indicate the position and/or relative position or movement of articulatory vocal elements such as the lips, teeth, vocal folds, etc.…The display 400 may include an animation 402 of the vocal tract that shows how elements of the vocal tract move while the audio recording 102 is played, an image 404 of the recorded waveform, one or more panels 406 displaying information about the user 104 and the analyzed speech…”; Note: the vocal tract shown in 402 of Fig. 4 is equivalent to the stack of objects; see screenshot of Fig. 4 below. It is based on a biomechanical model of a human’s vocal organs. The TV vectors “rig” the objects/body parts by mapping speech to the movement of the individual objects); and

generating an animated biomechanical data output that moves across coordinate space (Fig. 4, Paragraph 0036, 0051 – “the TV vectors include trajectory data (referred to as pellet trajectory) recorded for the individual articulators: e.g. Upper Lip, Lower Lip, Tongue Tip, Tongue Blade, Tongue Dorsum, Tongue Root, Lower Front Tooth (Mandible Incisor), Lower Back Tooth (Mandible Molar). These data may represent the way the articulators move during utterance as opposed to absolute position of the individual articulators. Because the physical X-Y positions of the pellets may be closely tied to the anatomy of the user 104, the pellet trajectories may provide relative measures of the articulators that reduce or remove dependence on the individual user's 104 anatomy…The display 400 may include an animation 402 of the vocal tract that shows how elements of the vocal tract move while the audio recording 102 is played, an image 404 of the recorded waveform, one or more panels 406 displaying information about the user 104 and the analyzed speech”; Note: the output in the display shown in Fig. 4, including the animation of the vocal tract, is equivalent to the biomechanical data output. The positions of the articulators/objects/body parts are represented by X-Y coordinates and are thus in a coordinate space. When they move in the animation, they move across the coordinate space; see screenshot of Fig. 4 below).

[Screenshot of Fig. 4 (taken from Quatieri)]

Quatieri does not teach generating translated data from acoustic, language, or biomechanical inputs and outputs by using codex mapping; generating a stack of 3-Dimensional objects using a biomechanical model that maps each sound unit or language chunk to 3-Dimensional objects that are "rigged".

However, Huang teaches generating translated data from acoustic, language, or biomechanical inputs and outputs by using codex mapping (Paragraph 1 in 1st Col. of Page 2 – “We used the articulatory synthesizer which was initially proposed by Mermelstein [9] to approximate the speech production process of average adult speakers. The 28 controlling variables, named after the muscles, in the 12 major articulatory organs are shown in Table I. The synthesizer serves to produce an overall map of the acoustic outputs coupled with the articulatory gestures”; Note: the articulatory synthesizer is equivalent to a codex that maps acoustics to movement, which is equivalent to translated data). It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified Quatieri to incorporate the teachings of Huang to map sound to movement using a codex for the benefit of being able to consistently and quickly identify which body parts contribute to the sound being made. Without the level of organization from a codex, the system would not be able to know all the important organs related to speech, as shown in Table 1 of Huang.

Quatieri modified by Huang still does not teach generating a stack of 3-Dimensional objects using a biomechanical model that maps each sound unit or language chunk to 3-Dimensional objects that are "rigged". However, Xie teaches generating a stack of 3-Dimensional objects using a biomechanical model that maps each sound unit or language chunk to 3-Dimensional objects that are "rigged" (Paragraph 0046-0050, 0053-0055 – “An embodiment of the present invention provides a method for constructing a dynamic pronunciation teaching model based on 3D modeling and oral anatomy…According to the oral anatomy model in the head and neck anatomy model, the physiological structure of each vocal organ is obtained; Based on the pronunciation properties of each phoneme, a three-dimensional description of the pronunciation process of each phoneme is obtained;…A three-dimensional animated interactive teaching model is produced by combining the physiological structure of each vocal organ, the three-dimensional description of the pronunciation process of each phoneme, and the pronunciation teaching process…the vocal organs in the oral anatomical model include: a tuning organ, a resonance cavity and a sound source; The modulation organs include active pronunciation organs and passive pronunciation organs. The active pronunciation organs include lips, tongue, soft palate, uvula, etc…The pronunciation attributes of each phoneme include the pronunciation position, pronunciation method, and whether the vocal cords vibrate. That is to say, according to the pronunciation properties, each phoneme can be described using the place of articulation, the method of articulation, and whether the vocal cords vibrate”; Note: the three-dimensional animated interactive teaching model is equivalent to the stack of 3D objects, and it is generated using the physiological structure of each vocal organ, which is equivalent to the biomechanical model. Phonemes, which are sound units, are mapped to the organs/objects by articulation, making them “rigged”).

It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified Quatieri to incorporate the teachings of Xie to have a stack of 3-Dimensional objects instead of 2-Dimensional objects because “Most of the existing pronunciation teaching auxiliary materials are oral side profiles, real-person pronunciation videos and two-dimensional flat animations. These simulation methods cannot see the changes in the pronunciation organs inside the mouth, especially the most important pronunciation organ - the tongue, making it difficult to achieve ideal teaching results. The rapid development of 3D technology in recent years has led to the wider application of 3D technology in simulating subtle movement changes” (Xie: Paragraphs 0051-0052). In other words, 3-Dimensional objects better represent the motion of vocal organs.
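To make the cited pipeline concrete, the following is a minimal, hypothetical Python sketch of the acoustic-to-articulatory flow the rejection reads onto Quatieri and Huang (waveform -> feature coefficients Y_M -> vocal tract variables TV_N -> per-articulator X-Y trajectories). The function names, the articulator list, and the linear "codex" matrix are invented placeholders, not code or data from any of the cited references or the application.

```python
# Hypothetical sketch of the claim-1 pipeline described in the rejection:
# acoustic waveform -> feature coefficients -> vocal tract variables ->
# per-articulator X-Y trajectories suitable for animation.
import numpy as np

ARTICULATORS = ["upper_lip", "lower_lip", "tongue_tip", "tongue_dorsum"]

def feature_coefficients(waveform: np.ndarray, frame: int = 400, hop: int = 160) -> np.ndarray:
    """Frame the signal and take log-magnitude spectra as stand-in features Y_M."""
    frames = np.lib.stride_tricks.sliding_window_view(waveform, frame)[::hop]
    spectra = np.abs(np.fft.rfft(frames * np.hanning(frame), axis=1))
    return np.log1p(spectra)                      # (n_frames, n_coeffs)

def vocal_tract_variables(Y: np.ndarray, codex: np.ndarray) -> np.ndarray:
    """Map feature coefficients to TV vectors; a plain linear map stands in here
    for the acoustic-to-articulatory inversion described in the references."""
    return Y @ codex                              # (n_frames, 2 * n_articulators)

def articulator_trajectories(TV: np.ndarray) -> dict:
    """Split TV vectors into per-articulator (x, y) trajectories for animation."""
    return {name: TV[:, 2 * i:2 * i + 2] for i, name in enumerate(ARTICULATORS)}

rng = np.random.default_rng(0)
audio = rng.standard_normal(16000)                # 1 s of synthetic 16 kHz audio
Y = feature_coefficients(audio)
codex = rng.standard_normal((Y.shape[1], 2 * len(ARTICULATORS))) * 0.01
paths = articulator_trajectories(vocal_tract_variables(Y, codex))
print({name: xy.shape for name, xy in paths.items()})
```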
Regarding claim 2, Quatieri in view of Huang and Xie teaches the method of claim 1. Quatieri further teaches receiving input data (Paragraph 0029 – “The system 100 may include an audio input 201 that receives the audio waveform 102”); validating whether the input data is in an acoustic data format (Paragraph 0029 – “Depending on the format of the waveform 102, the system 100 may optionally contain analog to digital converter (ADC) 202 to sample convert the waveform 102 into a digital waveform if, for example, waveform 102 is not provided in digital format or if waveform 102 needs to be resampled”); and converting the input data into the acoustic or language data format when the input data is not of an acoustic data format (Paragraph 0029 – “Depending on the format of the waveform 102, the system 100 may optionally contain analog to digital converter (ADC) 202 to sample convert the waveform 102 into a digital waveform if, for example, waveform 102 is not provided in digital format or if waveform 102 needs to be resampled”; Note: an analog waveform is not a data format, so it is converted into a digital waveform, which is a data format).

Regarding claim 4, Quatieri in view of Huang and Xie teaches the method of claim 1. Quatieri further teaches determining coordinates in a 2-Dimensional space for each 2-Dimensional biomechanical object by mapping each of the sound and language units to other units (Fig. 4, Paragraph 0036 – “the TV vectors include trajectory data (referred to as pellet trajectory) recorded for the individual articulators: e.g. Upper Lip, Lower Lip, Tongue Tip, Tongue Blade, Tongue Dorsum, Tongue Root, Lower Front Tooth (Mandible Incisor), Lower Back Tooth (Mandible Molar). These data may represent the way the articulators move during utterance as opposed to absolute position of the individual articulators. Because the physical X-Y positions of the pellets may be closely tied to the anatomy of the user 104, the pellet trajectories may provide relative measures of the articulators that reduce or remove dependence on the individual user's 104 anatomy”; Note: the articulators are the biomechanical objects, and sounds from speech/utterances are mapped to movements of the articulators. It is implied that in order to determine the movement of the articulators, their coordinate positions are also determined. The articulators are shown in Fig. 4; see screenshot of Fig. 4 above).

Quatieri does not teach obtaining and analyzing a stack of 3-Dimensional biomechanical objects to determine the movement of components in each biomechanical object that produce motion or sound from the lexical, syntactic, semantic data; and determining coordinates in a 3-Dimensional space for each 3-Dimensional biomechanical object by mapping each of the sound and language units to other units derived by the codex.

However, Xie teaches obtaining and analyzing a stack of 3-Dimensional biomechanical objects to determine the movement of components in each biomechanical object that produce motion or sound from the lexical, syntactic, semantic data (Paragraph 0022, 0046-0050, 0053-0055 – “interactive modes are designed to construct a visualized 3D virtual human head and its oral system that can produce synchronized speech animations…An embodiment of the present invention provides a method for constructing a dynamic pronunciation teaching model based on 3D modeling and oral anatomy…According to the oral anatomy model in the head and neck anatomy model, the physiological structure of each vocal organ is obtained; Based on the pronunciation properties of each phoneme, a three-dimensional description of the pronunciation process of each phoneme is obtained;…A three-dimensional animated interactive teaching model is produced by combining the physiological structure of each vocal organ, the three-dimensional description of the pronunciation process of each phoneme, and the pronunciation teaching process…The pronunciation attributes of each phoneme include the pronunciation position, pronunciation method, and whether the vocal cords vibrate. That is to say, according to the pronunciation properties, each phoneme can be described using the place of articulation, the method of articulation, and whether the vocal cords vibrate”; Note: the three-dimensional animated interactive teaching model is equivalent to the stack of 3D objects, and the individual objects correspond to the different organs. Phonemes, which are sound units, are mapped to the organs/objects that produce those sounds, and the organs are animated to move based on the sounds. The phonemes come from speech, which is lexical, syntactic, semantic data), and determining coordinates in a 3-Dimensional space for each 3-Dimensional biomechanical object by mapping each of the sound and language units to other units (Paragraph 0046-0050, 0053-0055 – “An embodiment of the present invention provides a method for constructing a dynamic pronunciation teaching model based on 3D modeling and oral anatomy…According to the oral anatomy model in the head and neck anatomy model, the physiological structure of each vocal organ is obtained; Based on the pronunciation properties of each phoneme, a three-dimensional description of the pronunciation process of each phoneme is obtained;…A three-dimensional animated interactive teaching model is produced by combining the physiological structure of each vocal organ, the three-dimensional description of the pronunciation process of each phoneme, and the pronunciation teaching process…the vocal organs in the oral anatomical model include: a tuning organ, a resonance cavity and a sound source; The modulation organs include active pronunciation organs and passive pronunciation organs. The active pronunciation organs include lips, tongue, soft palate, uvula, etc…The pronunciation attributes of each phoneme include the pronunciation position, pronunciation method, and whether the vocal cords vibrate. That is to say, according to the pronunciation properties, each phoneme can be described using the place of articulation, the method of articulation, and whether the vocal cords vibrate”; Note: the vocal organs are equivalent to the 3D biomechanical objects. Phonemes, which are sound units, are mapped to the organs/objects. It is implied that coordinates are determined in a 3D space because the objects are 3-dimensional, articulated, and animated. Coordinates are required to produce movement in an animation for the objects).

The 2D space and 2D biomechanical objects in Quatieri can be replaced by the 3D space and 3D biomechanical objects in Xie; it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to do so because having three dimensions compared to two creates a more realistic visualization, and thus would better represent the human body. It also would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified Quatieri to incorporate the teachings of Xie to analyze 3D biomechanical objects to determine their movement based on sound from speech data for the benefit of being able to visualize how speech affects the body, which may be useful for teaching purposes in biology: “Most of the existing pronunciation teaching auxiliary materials are oral side profiles, real-person pronunciation videos and two-dimensional flat animations. These simulation methods cannot see the changes in the pronunciation organs inside the mouth, especially the most important pronunciation organ - the tongue, making it difficult to achieve ideal teaching results. The rapid development of 3D technology in recent years has led to the wider application of 3D technology in simulating subtle movement changes” (Xie: Paragraphs 0051-0052).

Furthermore, Quatieri modified by Xie still does not teach mapping each of the sound and language units to other units derived by the codex. However, Huang teaches mapping each of the sound and language units to other units derived by the codex (Paragraph 1 in 1st Col. of Page 2 – “We used the articulatory synthesizer which was initially proposed by Mermelstein [9] to approximate the speech production process of average adult speakers. The 28 controlling variables, named after the muscles, in the 12 major articulatory organs are shown in Table I. The synthesizer serves to produce an overall map of the acoustic outputs coupled with the articulatory gestures”; Note: the articulatory synthesizer is equivalent to a codex that maps acoustics to movement). It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified Quatieri to incorporate the teachings of Huang to map sound to movement using a codex for the benefit of being able to consistently and quickly identify which body parts contribute to the sound being made. Without the level of organization from a codex, the system would not be able to know all the important organs related to speech, as shown in Table 1 of Huang.
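As an illustration of the phoneme-to-organ "rigging" the rejection attributes to Xie, the hypothetical sketch below maps each sound unit to 3-D pose targets for vocal-organ objects and turns an utterance into a sequence of keyframes. The phoneme table, organ names, and offsets are invented examples, not data from Xie or the application.

```python
# Hypothetical place-of-articulation "codex": phoneme -> {organ: (x, y, z) target}.
import numpy as np

CODEX_3D = {
    "p": {"lips": (0.0, 0.0, 0.0), "tongue_tip": (0.0, -0.2, 0.0)},   # bilabial closure
    "t": {"lips": (0.0, 0.3, 0.0), "tongue_tip": (0.0, 0.4, 0.1)},    # alveolar contact
    "a": {"lips": (0.0, 0.6, 0.0), "tongue_tip": (0.0, -0.3, -0.2)},  # open vowel
}

def keyframes(phonemes: list) -> dict:
    """Turn a phoneme sequence into per-organ 3-D keyframe arrays for animation."""
    organs = sorted({organ for pose in CODEX_3D.values() for organ in pose})
    return {organ: np.array([CODEX_3D[p][organ] for p in phonemes]) for organ in organs}

for organ, frames in keyframes(["p", "a", "t"]).items():
    print(organ, frames.shape)   # (3, 3): one xyz target per phoneme
```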
Regarding claim 7, Quatieri in view of Huang and Xie teaches the method of claim 1. Quatieri further teaches parsing the input data that goes into the system (Paragraph 0030 – “The system 100 may include a feature extractor 204 that receives the digital version of waveform 102 and produces feature coefficients Y.sub.M representing characteristics of at least a portion of the acoustic waveform 102”; Note: the input data is parsed using a feature extractor); analyzing the at least one of acoustic, phonetic, or language data to derive natural language data (Paragraph 0034 – “For example, assuming that the user 104 articulated the word ‘No,’ the feature extractor may produce a feature coefficient vector Y for each sampled segment 105 of the waveform. Thus, there may be a sequence of feature coefficient vectors Y.sub.N associated with the ‘N’ sound of the word ‘No,’ and another sequence of feature coefficient vectors Y.sub.O associated with the ‘O’ sound of the word ‘No.’”; Note: the input data is analyzed to derive sounds related to language/speech); and generating the stack of 3-Dimensional objects (Fig. 4, Paragraph 0051 – “The display 400 may include an animation 402 of the vocal tract that shows how elements of the vocal tract move while the audio recording 102 is played, an image 404 of the recorded waveform, one or more panels 406 displaying information about the user 104 and the analyzed speech…”; Note: the vocal tract shown in 402 of Fig. 4 is equivalent to the stack of objects. Furthermore, Quatieri was previously modified by Xie to teach that the objects are 3-dimensional; see the rejection of claim 1 above) and analyzing motion of components of the biomechanical objects to determine an underlying medical condition (Paragraph 0033, 0041 – “The vocal tract variables are numerical representations, specified in terms, for example representing a state of the user's 104 vocal tract 108 during articulation of the sound in the waveform 102 including, but not limited to, a state of the time-varying place (e.g. location along the oral cavity) and time varying manner (e.g. degree of constriction at the location) of characteristics of the position…The disorder identification module may process the eigenspectra to determine a degree of correlation between two or more of the vocal tract variables. This degree of correlation may represent correlation of phase, rise time, fall time, slope, or other time-based characteristics of the vocal tract variables, and/or may include correlation of amplitude, peak-to-peak values, or other magnitude-based characteristics of the vocal tract variables. The degree of correlation between vocal tract variables can indicate the presence of a speech irregularity that may be caused by a neuromotor disorder”; Note: the correlation between vocal tract variables represents a motion, and that correlation is analyzed to determine a potential disorder).

Regarding claim 9, Quatieri in view of Huang and Xie teaches the method of claim 1. Quatieri further teaches generating the stack of 3-Dimensional objects using a biomechanical model mapping each sound unit to motion data (Fig. 4, Paragraph 0048, 0051 – “the system 100 may generate TV vectors that represent articulation and position of elements of the speaker's vocal tract. These variables may indicate the position and/or relative position or movement of articulatory vocal elements such as the lips, teeth, vocal folds, etc.…The display 400 may include an animation 402 of the vocal tract that shows how elements of the vocal tract move while the audio recording 102 is played, an image 404 of the recorded waveform, one or more panels 406 displaying information about the user 104 and the analyzed speech…”; Note: the vocal tract shown in 402 of Fig. 4 is equivalent to the stack of objects; see screenshot of Fig. 4 above. It is based on a biomechanical model of a human’s vocal organs. The TV vectors “rig” the objects/body parts by mapping speech to the movement of the individual objects. Furthermore, Quatieri was previously modified by Xie to teach that the objects/body parts are 3-dimensional; see the rejection of claim 1 above).
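The disorder-identification idea cited for claim 7 (correlating vocal-tract-variable trajectories and treating weak coupling as a possible irregularity) can be illustrated with a toy correlation check. This is not Quatieri's eigenspectrum method; the plain Pearson correlation and the 0.2 threshold are invented placeholders with no clinical meaning.

```python
# Toy illustration of flagging weakly coupled articulator trajectories.
import numpy as np

def coupling_flags(tv: np.ndarray, names: list, threshold: float = 0.2):
    """tv: (n_frames, n_variables) trajectories; returns weakly coupled pairs."""
    corr = np.corrcoef(tv.T)                      # pairwise Pearson correlation
    flagged = []
    for i in range(len(names)):
        for j in range(i + 1, len(names)):
            if abs(corr[i, j]) < threshold:       # weak articulator coupling
                flagged.append((names[i], names[j], float(corr[i, j])))
    return flagged

rng = np.random.default_rng(1)
t = np.linspace(0, 1, 200)
tv = np.column_stack([np.sin(2 * np.pi * 3 * t),          # lip aperture
                      np.sin(2 * np.pi * 3 * t + 0.3),    # jaw, tightly coupled to lip
                      rng.standard_normal(200)])          # tongue tip, decoupled
print(coupling_flags(tv, ["lip", "jaw", "tongue_tip"]))
```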
Regarding claim 16, Quatieri teaches a system comprising: a memory device; and a processing device, operatively coupled to the memory device (Paragraph 0028 – “system 100 may comprise various software modules, libraries, and the like that may be stored on a non-transitory medium which, when executed by a processor, cause the processor and/or an associated computer system to perform the functions and implement the features described”; Note: the non-transitory medium is a memory and holds software for the processor), to perform operations comprising: generating lexical data from input data, the lexical data including multiple sound units (Paragraph 0030, 0033 – “The system 100 may include a feature extractor 204 that receives the digital version of waveform 102 and produces feature coefficients Y.sub.M representing characteristics of at least a portion of the acoustic waveform 102… The system 100 also includes a vocal tract variable generator 208 that receives the feature coefficients Y.sub.M and produces vocal tract variable TV.sub.N vectors. The vocal tract variables are numerical representations, specified in terms, for example representing a state of the user's 104 vocal tract 108 during articulation of the sound in the waveform… vocal tract variables that can be included may describe features or positions of the nasal cavity, buccal cavity, nostrils, epiglottis, trachea, hard palate, or any other element of a person's vocal tract”; Note: the waveform is input data. Data related to sound articulations, including feature coefficients and vocal tract variables, are equivalent to lexical data. There are multiple elements of the vocal tract, which are sound units); parsing the input data into at least one of acoustic, phonetic, or language data (Paragraph 0034 – “For example, assuming that the user 104 articulated the word ‘No,’ the feature extractor may produce a feature coefficient vector Y for each sampled segment 105 of the waveform. Thus, there may be a sequence of feature coefficient vectors Y.sub.N associated with the ‘N’ sound of the word ‘No,’ and another sequence of feature coefficient vectors Y.sub.O associated with the ‘O’ sound of the word ‘No.’”; Note: the input data is parsed using a feature extractor); analyzing the at least one of acoustic, phonetic, or language data to obtain data matching speech units (Paragraph 0038 – “the system may correlate the vocal tract activity with the sounds in the speech waveform 102. This is useful because human speech may contain hysteresis (for example, the way a sound is physically formed by the vocal tract can depend on the way the previous sound was physically formed)”; Note: sounds in the speech waveform, which are acoustic data, are analyzed to match to vocal tract organs, which are speech units); generating a stack of 2-Dimensional objects using a biomechanical model mapping each sound unit to lexical data, wherein the 2-Dimensional objects include components from the biomechanical model (Fig. 4, Paragraph 0048, 0051 – “the system 100 may generate TV vectors that represent articulation and position of elements of the speaker's vocal tract. These variables may indicate the position and/or relative position or movement of articulatory vocal elements such as the lips, teeth, vocal folds, etc.…The display 400 may include an animation 402 of the vocal tract that shows how elements of the vocal tract move while the audio recording 102 is played, an image 404 of the recorded waveform, one or more panels 406 displaying information about the user 104 and the analyzed speech…”; Note: the vocal tract shown in 402 of Fig. 4 is equivalent to the stack of objects; see screenshot of Fig. 4 above. It is based on a biomechanical model of a human’s vocal organs. The TV vectors “rig” the objects/body parts by mapping articulations/lexical data to the vocal elements/sound units); and analyzing motion of components near lexical data matching the part of the speech to determine an underlying medical condition (Paragraph 0033, 0041 – “The vocal tract variables are numerical representations, specified in terms, for example representing a state of the user's 104 vocal tract 108 during articulation of the sound in the waveform 102 including, but not limited to, a state of the time-varying place (e.g. location along the oral cavity) and time varying manner (e.g. degree of constriction at the location) of characteristics of the position…The disorder identification module may process the eigenspectra to determine a degree of correlation between two or more of the vocal tract variables. This degree of correlation may represent correlation of phase, rise time, fall time, slope, or other time-based characteristics of the vocal tract variables, and/or may include correlation of amplitude, peak-to-peak values, or other magnitude-based characteristics of the vocal tract variables. The degree of correlation between vocal tract variables can indicate the presence of a speech irregularity that may be caused by a neuromotor disorder”; Note: the correlation between vocal tract variables represents a motion corresponding to speech, and that correlation is analyzed to determine a potential disorder).

Quatieri does not teach generating lexical data from input data using codex mapping; generating a stack of 3-Dimensional objects using a biomechanical model mapping each sound unit to lexical data, wherein the 3-Dimensional objects include components from the biomechanical model.

However, Huang teaches generating lexical data from input data using codex mapping (Paragraph 1 in 1st Col. of Page 2 – “We used the articulatory synthesizer which was initially proposed by Mermelstein [9] to approximate the speech production process of average adult speakers. The 28 controlling variables, named after the muscles, in the 12 major articulatory organs are shown in Table I. The synthesizer serves to produce an overall map of the acoustic outputs coupled with the articulatory gestures”; Note: the articulatory synthesizer is equivalent to a codex that maps acoustics to articulatory gestures, which is equivalent to lexical data). It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified Quatieri to incorporate the teachings of Huang to map sound using a codex for the benefit of being able to consistently and quickly identify which body parts contribute to the sound being made. Without the level of organization from a codex, the system would not be able to know all the important organs related to speech, as shown in Table 1 of Huang.

Quatieri modified by Huang still does not teach generating a stack of 3-Dimensional objects using a biomechanical model mapping each sound unit to lexical data, wherein the 3-Dimensional objects include components from the biomechanical model. However, Xie teaches generating a stack of 3-Dimensional objects using a biomechanical model mapping each sound unit to lexical data, wherein the 3-Dimensional objects include components from the biomechanical model (Paragraph 0046-0050, 0053-0055 – “An embodiment of the present invention provides a method for constructing a dynamic pronunciation teaching model based on 3D modeling and oral anatomy…According to the oral anatomy model in the head and neck anatomy model, the physiological structure of each vocal organ is obtained; Based on the pronunciation properties of each phoneme, a three-dimensional description of the pronunciation process of each phoneme is obtained;…A three-dimensional animated interactive teaching model is produced by combining the physiological structure of each vocal organ, the three-dimensional description of the pronunciation process of each phoneme, and the pronunciation teaching process…the vocal organs in the oral anatomical model include: a tuning organ, a resonance cavity and a sound source; The modulation organs include active pronunciation organs and passive pronunciation organs. The active pronunciation organs include lips, tongue, soft palate, uvula, etc…The pronunciation attributes of each phoneme include the pronunciation position, pronunciation method, and whether the vocal cords vibrate. That is to say, according to the pronunciation properties, each phoneme can be described using the place of articulation, the method of articulation, and whether the vocal cords vibrate”; Note: the three-dimensional animated interactive teaching model is equivalent to the stack of 3D objects, and it is generated using the physiological structure of each vocal organ, which is equivalent to the biomechanical model. Phonemes, which are lexical data, are mapped to the organs/sound units by articulation, making them “rigged”). It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified Quatieri to incorporate the teachings of Xie to have a stack of 3-Dimensional objects instead of 2-Dimensional objects because “Most of the existing pronunciation teaching auxiliary materials are oral side profiles, real-person pronunciation videos and two-dimensional flat animations. These simulation methods cannot see the changes in the pronunciation organs inside the mouth, especially the most important pronunciation organ - the tongue, making it difficult to achieve ideal teaching results. The rapid development of 3D technology in recent years has led to the wider application of 3D technology in simulating subtle movement changes” (Xie: Paragraphs 0051-0052). In other words, 3-Dimensional objects better represent the motion of vocal organs.

Regarding claim 18, Quatieri in view of Huang and Xie teaches the system of claim 16. Quatieri further teaches generating the stack of 3-Dimensional objects using a biomechanical model mapping each sound unit to motion data (Fig. 4, Paragraph 0048, 0051 – “the system 100 may generate TV vectors that represent articulation and position of elements of the speaker's vocal tract. These variables may indicate the position and/or relative position or movement of articulatory vocal elements such as the lips, teeth, vocal folds, etc.…The display 400 may include an animation 402 of the vocal tract that shows how elements of the vocal tract move while the audio recording 102 is played, an image 404 of the recorded waveform, one or more panels 406 displaying information about the user 104 and the analyzed speech…”; Note: the vocal tract shown in 402 of Fig. 4 is equivalent to the stack of objects; see screenshot of Fig. 4 above. It is based on a biomechanical model of a human’s vocal organs. The TV vectors “rig” the objects/body parts by mapping speech to the movement of the individual objects. Furthermore, Quatieri was previously modified by Xie to teach that the objects/body parts are 3-dimensional; see the rejection of claim 1 above).

Claims 3 and 20 are rejected under 35 U.S.C. 103 as being unpatentable over Quatieri in view of Huang, Xie, and Dimitrova et al. (US 20030236663 A1), hereinafter Dimitrova.

Regarding claim 3, Quatieri in view of Huang and Xie teaches the method of claim 1. Quatieri does not teach performing acoustic processing on the input data, the processing including: denoising the input data; segmenting the input data into windows; identifying a number of speakers from the input data; classifying the segmented data into chatter and silence; and processing speech for clarity and interpretability.

However, Dimitrova teaches performing acoustic processing on the input data, the processing (Paragraph 0001 – “the present invention relates to speaker ID systems employing automatic audio signal segmentation based on mel-frequency cepstral coefficients (MFCC) extracted from the audio signals. Corresponding methods suitable for processing signals from multiple audio signal sources are also disclosed”) including: denoising the input data (Paragraph 0060 – “During the throwaway process of substep S123, a segment labeled signal with a signal strength value smaller than a predetermined threshold is relabeled as a silence segment”; Note: the audio is denoised since a certain level of noise under a threshold is filtered out); segmenting the input data into windows (Paragraph 0030 – “instantiate functions including an audio segmentation and classification function receiving general audio data (GAD) and generating segments”; Note: the segments are equivalent to windows); identifying a number of speakers from the input data (Paragraph 0030 – “a matching and labeling function assigning a speaker ID to speech signals within the GAD, and a database function for correlating the assigned speaker ID to the respective speech signals within the GAD”; Note: identification of the number of speakers in the input data is implied because the number of speakers is equivalent to the number of unique speaker IDs stored in the database after processing); classifying the segmented data into chatter and silence (Paragraph 0030 – “the audio segmentation and classification function assigns each segment to one of N audio signal classes including silence, single speaker speech, music, environmental noise, multiple speaker's speech, simultaneous speech and music, and speech and noise”); and processing speech for clarity and interpretability (Paragraph 0060 – “During the throwaway process of substep S123, a segment labeled signal with a signal strength value smaller than a predetermined threshold is relabeled as a silence segment”; Note: this step allows speech to be processed more clearly because it helps filter background or unnecessary noise).

It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified Quatieri to incorporate the teachings of Dimitrova to perform processing on the audio for the benefit of being able to hear and understand the audio better. Raw audio is not always in good condition, making it difficult to make out what is being said. In situations like those, processing would help improve the quality. It also would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified Quatieri to incorporate the teachings of Dimitrova to denoise, segment, and classify the audio for the benefit of more efficient processing. Specifically, denoising audio helps filter out unnecessary sounds, segmenting separates the audio into parts, and classifying audio organizes it into different categories so that each category can be analyzed individually. All of these processes are common for automatic speech recognition. Finally, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified Quatieri to incorporate the teachings of Dimitrova to identify a number of speakers from the input data for the benefit of being able to distinguish the different speakers in the audio. Mixing up the speakers in an audio recording may cause the audio to be misunderstood, so it is important to be able to identify how many speakers there are and who is speaking.
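A minimal sketch of the segment-and-classify processing the rejection draws from Dimitrova for claims 3 and 20 follows: split the audio into fixed windows, relabel low-energy windows as silence (the "throwaway" step), and keep the rest as chatter. A simple energy threshold stands in for Dimitrova's MFCC-based classifier; the window size and threshold are illustrative placeholders, not values from the reference.

```python
# Hypothetical window-level silence/chatter labeling via an energy threshold.
import numpy as np

def classify_windows(audio: np.ndarray, win: int = 800, threshold: float = 0.01):
    """Return (label, start_sample) pairs, label in {"silence", "chatter"}."""
    labels = []
    for k in range(len(audio) // win):
        segment = audio[k * win:(k + 1) * win]
        energy = float(np.mean(segment ** 2))      # mean signal strength of window
        labels.append(("silence" if energy < threshold else "chatter", k * win))
    return labels

rng = np.random.default_rng(2)
audio = np.concatenate([0.001 * rng.standard_normal(1600),   # quiet -> silence
                        0.5 * rng.standard_normal(1600)])    # loud  -> chatter
print(classify_windows(audio))
```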
Regarding claim 20, Quatieri in view of Huang and Xie teaches the system of claim 16. Quatieri does not teach performing acoustic processing on the input data, the processing including: denoising the input data; segmenting the input data into windows; identifying a number of speakers from the input data; classifying the segmented data into chatter and silence; and processing speech for clarity and interpretability.

However, Dimitrova teaches performing acoustic processing on the input data, the processing (Paragraph 0001 – “the present invention relates to speaker ID systems employing automatic audio signal segmentation based on mel-frequency cepstral coefficients (MFCC) extracted from the audio signals. Corresponding methods suitable for processing signals from multiple audio signal sources are also disclosed”) including: denoising the input data (Paragraph 0060 – “During the throwaway process of substep S123, a segment labeled signal with a signal strength value smaller than a predetermined threshold is relabeled as a silence segment”; Note: the audio is denoised since a certain level of noise under a threshold is filtered out); segmenting the input data into windows (Paragraph 0030 – “instantiate functions including an audio segmentation and classification function receiving general audio data (GAD) and generating segments”; Note: the segments are equivalent to windows); identifying a number of speakers from the input data (Paragraph 0030 – “a matching and labeling function assigning a speaker ID to speech signals within the GAD, and a database function for correlating the assigned speaker ID to the respective speech signals within the GAD”; Note: identification of the number of speakers in the input data is implied because the number of speakers is equivalent to the number of unique speaker IDs stored in the database after processing); classifying the segmented data into chatter and silence (Paragraph 0030 – “the audio segmentation and classification function assigns each segment to one of N audio signal classes including silence, single speaker speech, music, environmental noise, multiple speaker's speech, simultaneous speech and music, and speech and noise”); and processing speech for clarity and interpretability (Paragraph 0060 – “During the throwaway process of substep S123, a segment labeled signal with a signal strength value smaller than a predetermined threshold is relabeled as a silence segment”; Note: this step allows speech to be processed more clearly because it helps filter background or unnecessary noise).

It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified Quatieri to incorporate the teachings of Dimitrova to perform processing on the audio for the benefit of being able to hear and understand the audio better. Raw audio is not always in good condition, making it difficult to make out what is being said. In situations like those, processing would help improve the quality. It also would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified Quatieri to incorporate the teachings of Dimitrova to denoise, segment, and classify the audio for the benefit of more efficient processing. Specifically, denoising audio helps filter out unnecessary sounds, segmenting separates the audio into parts, and classifying audio organizes it into different categories so that each category can be analyzed individually. All of these processes are common for automatic speech recognition. Finally, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified Quatieri to incorporate the teachings of Dimitrova to identify a number of speakers from the input data for the benefit of being able to distinguish the different speakers in the audio. Mixing up the speakers in an audio recording may cause the audio to be misunderstood, so it is important to be able to identify how many speakers there are and who is speaking.

Claims 5-6, 8, and 17 are rejected under 35 U.S.C. 103 as being unpatentable over Quatieri in view of Huang, Xie, and Soroski et al. (Evaluating Web-Based Automatic Transcription for Alzheimer Speech Data: Transcript Comparison and Machine Learning Analysis), hereinafter Soroski.

Regarding claim 5, Quatieri in view of Huang and Xie teaches the method of claim 1. Quatieri does not teach generating a text transcript from the animated biomechanical data output. However, Soroski teaches generating a text transcript from audio (Paragraph 2 in 1st Col. of Page 4 – “participant audio was uploaded to the Google Cloud STT platform using US English and 16000 Hz settings, with word-level time stamps enabled, to output the automatic transcripts”; Note: a text transcript is generated from audio). Because audio is a part of the animated biomechanical data output in Quatieri (Paragraph 0051 – “The display 400 may include an animation 402 of the vocal tract that shows how elements of the vocal tract move while the audio recording 102 is played, an image 404 of the recorded waveform, one or more panels 406 displaying information about the user 104 and the analyzed speech”; Note: the animated biomechanical data output includes the animation of the vocal tract and the audio recording), it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified Quatieri to incorporate the teachings of Soroski to generate a transcript from the animated biomechanical data output for the benefit of providing an additional way to visualize speech. Moreover, it may be beneficial in analyzing diseases related to speech: “Analysis of speech to aid in the identification of individuals with early neurodegenerative disease can be a promising strategy, as speech recording is noninvasive, scalable, and easily repeated over time…For AD classification using speech, transcription is a key step to leverage the wealth of information contained in lexical data” (Soroski: Paragraph 2-3 in 1st Col. of Page 2).
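The transcript-validation step the rejection draws from Soroski for claims 5-6 (scoring an automatic transcript against a manually corrected "gold standard" by computing its error rate) amounts to a word error rate calculation, sketched generically below. This is a standard Levenshtein-distance-over-words routine, not code from Soroski or the application, and the example sentences are invented.

```python
# Generic word error rate (WER) against a gold-standard reference transcript.
def word_error_rate(reference: str, hypothesis: str) -> float:
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edits to turn first i reference words into first j hypothesis words
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            substitution = dp[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            dp[i][j] = min(substitution, dp[i - 1][j] + 1, dp[i][j - 1] + 1)
    return dp[-1][-1] / max(len(ref), 1)

gold = "the patient paused before naming the object"   # manually corrected
auto = "the patient paused before the object"          # automatic transcript
print(round(word_error_rate(gold, auto), 3))           # 1 deletion / 7 words = 0.143
```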
Regarding claim 6, Quatieri in view of Huang, Xie, and Soroski teaches the method of claim 5. Quatieri does not teach validating a transcript from an external source based on the generated transcripts from the system. Soroski teaches validating generated transcripts from the system based on a transcript from an external source (Paragraph 1 in 2nd Col. of Page 3 – “Using manually corrected transcripts as the gold standard, we calculated the error rate of automatic transcripts”; Note: the automatic transcripts, which are generated transcripts from the system, are validated using manual transcripts, which are transcripts from an external source).

It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified Quatieri in view of Soroski to validate a transcript from an external source based on the generated transcripts from the system, because if the transcript from the external source seems less reliable than the generated transcripts, the generated transcripts can be used as a guide to correct errors. Furthermore, when there are two types of transcripts, there is a finite number of ways to validate them; either the first transcript can
Read full office action

Prosecution Timeline

Mar 15, 2024: Application Filed
Oct 02, 2025: Non-Final Rejection, §103 (current)

Precedent Cases

Applications granted by this same examiner with similar technology

Patent 12602750: DIFFERENTIABLE EMULATION OF NON-DIFFERENTIABLE IMAGE PROCESSING FOR ADJUSTABLE AND EXPLAINABLE NON-DESTRUCTIVE IMAGE AND VIDEO EDITING. Granted Apr 14, 2026 (2y 5m to grant)
Patent 12597208: BUILDING INFORMATION MODELING SYSTEMS AND METHODS. Granted Apr 07, 2026 (2y 5m to grant)
Patent 12573217: SERVER, METHOD AND COMPUTER PROGRAM FOR GENERATING SPATIAL MODEL FROM PANORAMIC IMAGE. Granted Mar 10, 2026 (2y 5m to grant)
Patent 12561851: HIGH-RESOLUTION IMAGE GENERATION USING DIFFUSION MODELS. Granted Feb 24, 2026 (2y 5m to grant)
Patent 12536734: Dynamic Foveated Point Cloud Rendering System. Granted Jan 27, 2026 (2y 5m to grant)
Study what changed to get past this examiner. Based on 5 most recent grants.


Prosecution Projections

Expected OA Rounds: 1-2
Grant Probability: 81%
With Interview: 99% (+36.4%)
Median Time to Grant: 2y 7m
PTA Risk: Low
Based on 21 resolved cases by this examiner. Grant probability derived from career allow rate.
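For transparency, the headline figures above can be reproduced from the page's own counts. The sketch below uses the stated 17 granted / 21 resolved; the with/without-interview split is a hypothetical illustration consistent with the displayed +36.4% lift, not underlying data.

```python
# Reproducing the displayed metrics from the page's stated counts.
granted, resolved = 17, 21
grant_probability = granted / resolved                 # 0.8095 -> shown as 81%

# Hypothetical interview split consistent with the displayed lift.
with_interview_rate, without_interview_rate = 0.99, 0.626
interview_lift = with_interview_rate - without_interview_rate   # ~ +0.364

print(f"grant probability: {grant_probability:.0%}")   # 81%
print(f"interview lift: {interview_lift:+.1%}")        # +36.4%
```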
