Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
DETAILED ACTION
1. This action is responsive to the communication filed on 5/15/25. Claims 1-3, 12-14 and 17-19 have been amended. Claims 1-20 are pending.
2. Applicants' arguments filed 5/15/25 have been fully considered but they are not deemed to be persuasive. Rejections and/or objections not reiterated from previous office actions are hereby withdrawn. The following rejections and/or objections are either reiterated or newly applied. They constitute the complete set presently being applied to the instant application.
Claim Rejections - 35 USC § 103
3. In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.
4. This application currently names joint inventors. In considering patentability of the claims the examiner presumes that the subject matter of the various claims was commonly owned as of the effective filing date of the claimed invention(s) absent any evidence to the contrary. Applicant is advised of the obligation under 37 CFR 1.56 to point out the inventor and effective filing dates of each claim that was not commonly owned as of the effective filing date of the later invention in order for the examiner to consider the applicability of 35 U.S.C. 102(b)(2)(C) for any potential 35 U.S.C. 102(a)(2) prior art against the later invention.
5. The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.
6. The factual inquiries set forth in Graham v. John Deere Co., 383 U.S. 1, 148 USPQ 459 (1966), that are applied for establishing a background for determining obviousness under 35 U.S.C. 103 are summarized as follows:
1. Determining the scope and contents of the prior art.
2. Ascertaining the differences between the prior art and the claims at issue.
3. Resolving the level of ordinary skill in the pertinent art.
4. Considering objective evidence present in the application indicating obviousness or nonobviousness.
7. Claims 1-2, 4-6, 8, 12-13 and 16-18 are rejected under 35 U.S.C. 103 as being unpatentable over KEMENTSIETSIDIS in view of Borhade (US 20230113635 A1 hereinafter, “Borhade”), and further in view of HUANG, and further in view of YOUNG.
8. With respect to claim 1,
KEMENTSIETSIDIS discloses
a method, comprising:
identifying a join candidate in a data source, wherein identifying the join candidate comprises:
identifying pairs of columns within the data source, wherein each pair includes a first column in a first table of the data source and a second column of a second table of the data source; and
for the each pair, assigning a casting similarity index from a scale that includes a first casting similarity level,
wherein the casting similarity index is determined based on an extent to which data values from the first column are convertible to match a data type of the second column within the each pair, and
wherein the first casting similarity level is assigned to the each pair in a case that the first column has a Boolean type or the second column has a float type (KEMENTSIETSIDIS [0041] – [0042], [0059] – [0062], [0066] e.g. [0041] Data type technique 300C may transform data records from one schema to another by calculating a data type similarity score between fields in the first schema and fields in the second schema. For example, if the first schema had a field with an integer data type and the second schema had a field with a float data type and another field with a timestamp data type, then data type technique 300C may be more disposed to transform records from the field with an integer data type to the float data type rather than to the timestamp data type because an integer has a greater data type similarity to a float than to a timestamp. [0042] Value distribution technique 300D may transform first data records from one schema to another by calculating a value similarity score between values of the first data records and representative second data records that are structured in accordance with the second schema. For example, if the first data records had a field with values that have a mean of 102 and the second data records had a field with values that have a mean of 100 and another field with values that have a mean of 400, then value distribution technique 300D may be more disposed to transform data records from the field values that have a mean of 102 to the field with values that have a mean of 100 rather than to the field with values that have a mean of 400 because 102 has a greater value similarity score to 100 than to 400. [0059] First data records 412 are structured in accordance with first schema 414. For the purpose of brevity, the contents of first schema 414 are not shown, but first schema 414 may be a relational schema. In general, a relational schema organizes data records into a series of interconnected tables. 
A table contains columns and rows. The rows of a table correspond to individual entries, while the columns correspond to fields of those individual entries. Entries may be character strings, numbers, Boolean values, or null values. [0062] Schema matching engine 120 may receive property values 420 from parser 110 and may use property values 420 along with assignment rules 310 to select one or more schema mapping techniques to transform first data records 412 into transformed data records 422. In this example, the columns in first data records 412 have semantically similar names with the columns in second data records 418. For instance, second data records 418 use the column names “Title”, ‘Tears”, and “Occupation” and first data records 412 use the column names “Name”, “Age”, and “Job”. Assignment rules 310 may ascertain this semantic similarity and may cause schema matching engine 120 to select semantic distance technique 300B to transform first data records 412 into transformed data records 422 [as
identifying pairs of columns (e.g. columns) within the data source (e.g. data records), wherein each pair includes a first column in a first table of the data source and a second column of a second table of the data source; and
for the each pair, assigning a casting similarity (e.g. similarity score; referring to the instant applicant’s spec. [0025] … a comparison (i.e., a casting similarity) of a first type of the data of the first column and a second type of the data of the second column) index from a scale (e.g. 100, 102, 400) that includes a first casting similarity level indicative of a (e.g. 400) casting similarity level,
wherein the casting similarity index is determined based on an extent to which data values from the first column are convertible (e.g. transform) to match (e.g. match) a data type of the second column within the each pair, and
wherein the first casting similarity level is assigned to the each pair in a case that the first column has a Boolean type (e.g. boolean) or the second column has a float type (e.g. float)]).
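For illustration only, the data type similarity comparison described in KEMENTSIETSIDIS [0041] can be sketched as below. The similarity values, field names, and function names are hypothetical assumptions chosen to mirror the integer/float/timestamp example; they are not taken from the reference.

```python
# Hypothetical similarity table; the values are illustrative assumptions,
# not figures taken from KEMENTSIETSIDIS.
TYPE_SIMILARITY = {
    ("integer", "float"): 0.9,
    ("integer", "timestamp"): 0.2,
}


def type_similarity(src: str, dst: str) -> float:
    """Return the assumed similarity between two data types (0.0 if unknown)."""
    if src == dst:
        return 1.0
    # The table is treated as symmetric, so look the pair up in both orders.
    return TYPE_SIMILARITY.get((src, dst), TYPE_SIMILARITY.get((dst, src), 0.0))


def best_target_field(src_type: str, target_fields: dict[str, str]) -> str:
    """Pick the target field whose data type is most similar to src_type."""
    return max(
        target_fields,
        key=lambda name: type_similarity(src_type, target_fields[name]),
    )


# Second schema with one float field and one timestamp field (names assumed).
target = {"amount": "float", "created_at": "timestamp"}
chosen = best_target_field("integer", target)
print(chosen)
```

Under these assumed values, an integer field is mapped to the float field rather than the timestamp field, which is the preference the quoted paragraph describes.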
Although KEMENTSIETSIDIS substantially teaches the claimed invention, KEMENTSIETSIDIS does not explicitly indicate a predefined scale.
Borhade teaches the limitations by stating
identifying a join candidate in a data source, wherein identifying the join candidate comprises:
identifying pairs of columns within the data source, wherein each pair includes a first column in a first table of the data source and a second column of a second table of the data source; and
for the each pair, assigning a casting similarity index from a predefined scale that includes a first casting similarity level (Borhade [0005], [0045] – [0050], [0067], [0083], [0169], [0193], [0225] – [0238] and Fig. 1 e.g. [0005] In an embodiment, a system for identifying similarities between tables and databases stored in a storage server. The system comprises the storage server and a management server. The storage server comprises a first database configured to store a plurality of first database tables, and a second database configured to store a plurality of second database tables, wherein the first database tables comprise a source table and a target table. The management server comprises a non-transitory memory configured to store and a storage index indicating metadata describing a plurality of tables stored at the storage server, wherein the metadata indicates a column name of a column in each of the tables. The management server further comprises a processor configured to execute the instructions, which cause the processor to be configured to determine, by a table similarity scoring application of the management server, a table similarity score between the source table and the target table based on a comparison between metadata describing each column in the source table and metadata describing each column in the target table, determine, by a database similarity scoring application of the management server, a database similarity score between the first database and the second database based on either a comparison between metadata describing each column in the first database and metadata describing each column in the second database, or a plurality of table similarity scores between of each table in the first database and each table in the second database, determine, by a usage application of the management server, access data describing a frequency of accessing data stored in at least one of the first database or the source table, and transmit, using a 
server-to-storage application programming interface, an instruction to the storage server to delete data stored in at least one of the first database or the source table to eliminate redundancies at the storage server, wherein the data stored in at least one of the first database or the source table is associated with the UE. [0045] In an embodiment, the table similarity scoring application 120 may determine the table similarity score 132 using the metadata 138 in the storage index 136, as will be further described below. In an embodiment, the database similarity scoring application 122 may determine the database similarity score 134 using the metadata 138 in the storage index 136, as will be further described below. [0049] The policy may specify the type of metadata 138 that should be compared in the source table 152A and target table 152B to determine the table similarity score 132. For example, a policy may indicate that the table similarity score 132 is to be determined based on at least two items of metadata 138 in the storage index 136 (i.e., the table name 140, physical path 142, column name 144, row count 145, size 146, last access 147 of the table 152A-C, and/or a data format 149 of the databases 150A-B, tables 152A-C in the databases 150A-B, or the columns 153A-C in the tables 152A-C). [0050] For example, suppose the policy for the table similarity score 132 indicates that the table similarity score 132 is to be determined based on the column names 144 of the columns 153A-B within the source table 152A and the target table 152B. In this example, the scanner application 118 may obtain the column name 144 of each column 153A in the source table 152A, and obtain the column name 144 of each column 153B in the target table 152B. [0067] In an embodiment, the table similarity scoring application 120 may only determine that two strings of metadata 138, such as column names 144, match when the string matching score is greater than or equal to a pre-defined threshold. 
The pre-defined threshold may be a threshold value indicating that two strings having string matching scores greater than or equal to the threshold are considered to be matching or substantially matching. For example, the pre-defined threshold may be 0.95. In this way, the table similarity scoring application 120 may compare to strings of metadata 138 and determine that the strings match, even when the strings are not identical. [0083] Continuing with the example above, suppose that that first table similarity score is 0.3, the second table similarity score is 0.98, the third table similarity score is 0.75, the fourth table similarity score, is 0.25, the fifth table similarity score is 0.97, and the sixth table similarity score is 0. Further, suppose the pre-defined threshold is 0.95. In this case, the second table similarity score and the fifth table similarity score are greater than or equal to the threshold. The subset of the table similarity scores 132 greater than the threshold include the second table similarity score 132 between table 152A and the second table 152C in the target database 150B, and the fifth table similarity score 132 between table 152B and the second table 152C in the target database 150B. [0169] sim_df=source_df.crossJoin(target_df).withColumn(" SIMILARITY SCORE", matching(source_df. SOURCE)(target_df.TARGET)) [0193] RIGHT JOIN DATA_CATALOG.TD_SIMSCORE_COLUMNMATCHES [0225] LEFT JOIN [0226] (SELECT*FROM DATA_CATALOG.SIMSCORE_DF RAW NEWMETHOD [0227] WHERE SIMILARITY SCORE>=0.5) MAIN ON MAIN.database_searched [as
identifying a join (e.g. join) candidate in a data source (e.g. storage server 104 in Fig. 1), wherein identifying the join candidate comprises:
identifying pairs of columns (e.g. columns) within the data source, wherein each pair includes a first column in a first table (e.g. first table) of the data source and a second column of a second table (e.g. second table) of the data source; and
for the each pair, assigning a casting similarity (e.g. comparison/compared … similarity score; referring to the instant applicant’s spec. [0025] … a comparison (i.e., a casting similarity) of a first type of the data of the first column and a second type of the data of the second column) index (e.g. index) from a predefined scale (e.g. pre-defined threshold 0.95) that includes a first casting similarity level (e.g. 0.97, 0.95)]).
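For illustration only, the threshold comparison quoted from Borhade [0067] and [0083] can be sketched as follows. The six scores and the 0.95 threshold are the example values quoted above; the function and variable names are hypothetical.

```python
PREDEFINED_THRESHOLD = 0.95  # example value quoted from Borhade [0067]

# Table similarity scores from the example in Borhade [0083], first to sixth.
scores = [0.3, 0.98, 0.75, 0.25, 0.97, 0.0]


def matching_indices(values, threshold=PREDEFINED_THRESHOLD):
    """Return 1-based positions of scores greater than or equal to threshold."""
    return [i for i, v in enumerate(values, start=1) if v >= threshold]


matches = matching_indices(scores)
print(matches)  # [2, 5]
```

Only the second and fifth scores meet the threshold, consistent with the quoted example.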
Therefore, it would have been obvious to one of ordinary skill in the art at the time of the effective filing date of the invention, in view of the teachings of KEMENTSIETSIDIS and Borhade, to combine these teachings because, if a legacy database is replaced by a new database that structures data records in accordance with a new schema, the existing data records in the legacy database should be transformed to be structured in accordance with the new schema so that they can be properly stored within the new database (KEMENTSIETSIDIS [0002]).
Although the combination of KEMENTSIETSIDIS and Borhade substantially teaches the claimed invention, it does not explicitly indicate that the first column has a Boolean type and the second column has a float type.
HUANG teaches the limitations by stating
wherein the first casting similarity level is assigned to the each pair in a case that the first column has a boolean type and the second column has a float type (HUANG pages 4-5 and Figs. 2-3 e.g. [pages 4-5] FIG. 1 shows a flow chart of a data analysis method according to one embodiment of the present invention. The method comprises the following steps: step 1-6, the specific explanation for each step. step 1-data set preparation Firstly, the range of the data set to be analyzed is selected in the source database (i.e., data source). In an embodiment of the present invention, the data set may include one or more complete database in the source database, one or more partial database or a combination thereof (e.g., one or more complete database and one or more portion database). The database (e.g., relational database) is typically composed of one or more tables, and the minimum unit for analyzing the data is typically afield, and the field refers to a column in a table. … table definition (e.g., table name, table description, table type; field included in the table) and field definition (such as field name, field description, field data type) information, The construction and use of the analysis database will be described in detail below. step 2-field data type identification and normalization FIG. 2 shows a schematic diagram of the process of identifying and normalizing the data type of the field according to the embodiment of the present invention; the following detailed description. Table 1 lists the common data types in the art, these data types may have different naming and definition in different database tools. The invention takes the HIVE database as example to illustrate the process of identifying the field data type. after reading the metadata and the data content in the step 1, can obtain the explicit statement of the data type of the related field. 
when there is explicit declaration in a field, the data type in the explicit statement is the data type of the field, for example, when the data type of the explicit statement is BIGINT, the data type of the field is BIGINT. when the field does not exist explicit statement, converting the data content in the field according to the data type conversion matrix, and determining the data type of the field according to whether the conversion is successful. Generally, such a conversion operation is also referred to as "data identification". FIG. 3 shows a schematic diagram of an HIVE data type conversion matrix according to one embodiment of the present invention. For example, the data content of the BOOLEAN type is only true when converting the value of the BOOLEAN type, then the data type of the data content only can be BOOLEAN; and the data content of the TINYINT type is converted into TINYINT, SMALLINT, INT, BIGINT, FLOAT, DOUBLE, DECIMAL; the value of the STRING and VARCHAR type is true, then the data type of the data content can be any one of the nine types. In the embodiment of the present invention, the priority of the data type converted successfully is displayed in the first row of FIG. 3 in turn from left to right. For example, although the TINYINT type can be converted into nine types, but the data content is converted into TINYINT is successful, no subsequent conversion, and the datatype of the conversion is recorded as the data type of the field, otherwise, then the priority level lower conversion; until the conversion is successful [as wherein the first casting similarity level (e.g. FALSE between BOOLEAN and FLOAT in Fig. 3) is assigned to the each pair in a case that the first column has a boolean type (e.g. BOOLEAN column/field) and the second column has a float type (e.g. FLOAT column/field)]).
Therefore, it would have been obvious to one of ordinary skill in the art at the time of the effective filing date of the invention, in view of the teachings of KEMENTSIETSIDIS, Borhade and HUANG, to combine these teachings because, if a legacy database is replaced by a new database that structures data records in accordance with a new schema, the existing data records in the legacy database should be transformed to be structured in accordance with the new schema so that they can be properly stored within the new database (KEMENTSIETSIDIS [0002]).
Although KEMENTSIETSIDIS, Borhade and HUANG substantially teach the claimed invention, they do not explicitly indicate
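For illustration only, the field data type identification quoted from HUANG (explicit declaration first, otherwise trial conversion in priority order until one succeeds) can be sketched as below. The priority list is a shortened assumption and does not reproduce the first row of HUANG's Fig. 3; the converter functions and names are likewise hypothetical.

```python
def _to_boolean(text):
    # Conversion succeeds only for literal true/false content.
    if text.lower() not in ("true", "false"):
        raise ValueError(text)


def _to_tinyint(text):
    # TINYINT is assumed to be a signed 8-bit integer.
    if not -128 <= int(text) <= 127:
        raise ValueError(text)


def _to_int(text):
    int(text)


def _to_float(text):
    float(text)


# Assumed priority order, highest first. This is NOT the order of HUANG Fig. 3.
CONVERSIONS = [
    ("BOOLEAN", _to_boolean),
    ("TINYINT", _to_tinyint),
    ("INT", _to_int),
    ("FLOAT", _to_float),
]


def identify_type(values, declared=None):
    """Return the declared type, else the first type every value converts to."""
    if declared is not None:
        return declared
    for name, convert in CONVERSIONS:
        try:
            for value in values:
                convert(value)
        except ValueError:
            continue  # conversion failed, fall through to the next priority
        return name
    return "STRING"  # fallback when no narrower conversion succeeds


print(identify_type(["1", "2"]))
print(identify_type(["1.5", "2"]))
```

The first successful conversion is recorded as the field's data type and no later conversion is attempted, which corresponds to the priority behaviour described in the quoted passage.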
presenting the join candidate on a device of a user;
receiving, from the device of the user, a command to perform a data query of the data source, wherein the data query is based on the join candidate;
querying the data source based on the data query to obtain tabular data; and
outputting the tabular data.
YOUNG teaches the limitations by stating
identifying a join candidate in a data source, wherein identifying the join candidate comprises:
identifying pairs of columns within the data source, wherein each pair includes a first column in a first table of the data source and a second column of a second table of the data source (YOUNG [0006] – [0007], [0025] e.g. a first column in a first table to a second column in a second table; a file system or a database); and
for the each pair, assigning a casting similarity index from a predefined scale (YOUNG [0006] – [0007], [0029], [0050] – [0051], [0058], [0063] – [0064], [0076] – [0077], [0088] – [0090] e.g. similarity metric - a Euclidean, squared Euclidean or other distance metric can be used for multidimensional numerical values; a Hamming distance; joinability score - the similarity or as a factor in the joinability score; [0050] For example, an appropriate similarity (or difference) metric, given the type of the data in the fields, can be used to compare each pair of values. For example, a Euclidean, squared Euclidean or other distance metric can be used for multidimensional numerical values; a Hamming distance or other string matching metric can be used to compare strings; The individual comparison results for each value pair can be aggregated to provide a similarity measure between the two data sets. [0051] If the comparison indicates that there is insufficient similarity, as illustrated at 502, processing ends, as illustrated at 504. Otherwise, this pair of fields is identified as a potential candidate and further analysis is performed. [0063] A penalty can be applied 706 to this raw score, for example, if the data type or field names of the fields F and G do not match. Note that in the foregoing explanation, the various set operations are multiset operations. [0064] For example, the penalty can be a scaling factor. Such a scaling factor can be selected so as to penalize fields that do not match, but would permit non-matching fields to be used in the event that no matches are found. As an example, if the data types are unjoinable (e.g., money, floats, doubles, date data types), the penalty can be a scaling factor of 0.2. If the data types do not match, then the penalty can be a scaling factor of 0.5. If the names of the fields do not match, then the penalty can be a scaling factor of between 0.5 to 1.0. 
For example, if one of the field names is a prefix of the other (e.g., "comp" is a prefix of "company"), then the scaling factor can be higher (e.g., .95). A distance metric applied to the field names also can be used as part of a function to compute a scaling factor. For example, the Levenshtein edit distance between two names divided by the minimum lengths of the two names, subtracted from one but limited to a minimum value such as 0.5, can be used to compute a scaling factor. [0089] The joinability score can take into account one or more factors such as, for example, similarity between field names of the columns, the likelihood of finding a value from the first column in the second column, the similarity between the values contained in the columns, the data type of each column. [0090] In one embodiment, the likelihood of finding a value v from column C1 in column C2 is used as the similarity or as a factor in the joinability score for an edge representing the join from C1 to C2. The similarity between the field names of C1 and C2 can also be determined (e.g. by Euclidian distance or Hamming distance) and included in the joinability score as an additional factor. [0091] As one example of a determination of a joinability score, the data types of columns C1 and C2 are compared. If the data types do not match, then the joinability score is deemed to be zero and no edge is added to the join graph. Next, if the data types do match, then a likelihood of finding a value v from column C1 in the column C2 is determined with the resulting likelihood being used as the joinability score.) that includes a first casting (YOUNG [0006] – [0007], [0029], [0076] – [0077], [0088] – [0091] e.g. [0077] Booleans are not good join candidates because of the small number of distinct values and Boolean columns will have a very low uniqueness. Floating point numbers are also generally impractical for use in joins. 
[0091] the joinability score (including similarity metric) is deemed to be zero; referring to the instant applicant’s specification [0218] “… If none of the previous conditions apply (such as, for example, in a case where the first column may be of type BOOLEAN and the second column may be of type FLOAT), then a fourth casting similarity (e.g., 0, "very low" can be associated with the join candidate”),
wherein the casting similarity index is determined based on an extent to which data values from the first column are convertible (YOUNG [0029] – [0032] e.g. transformation; convert - [0029] In addition to making the data accessible in memory for access by a processor for analysis, the data accessor can perform various data transformations to allow easier comparison of data between different fields. For example, the data accessor may convert a data type of the data from a stored data type (in storage) to an analysis data type (in memory). In practice, the data type of most fields is a string, but some may be integers (signed or unsigned), floating point integers, dates and so on. [0032] Generally, the statistical results include, for each pair of fields, in addition to an identification of the fields in the pair, a set of values resulting from the statistical analyses performed between the two fields. Such statistical results can include, for example, a measure of similarity of the sets of values in the two fields, a measure of entropy with respect to an intersection of the sets of values of the identified fields, a measure of density of one or both fields or a measure of a likelihood that a value in the identified field in the first data set matches a value in the identified field in the second data set) to match (e.g. match) a data type of the second column within the each pair, and
wherein the first casting similarity level is assigned to the each pair in a case that the first column has a Boolean type or the second column has a float type (e.g. [0077] Booleans are not good join candidates because of the small number of distinct values and Boolean columns will have a very low uniqueness. Floating point numbers are also generally impractical for use in joins. [0091] the joinability score (including similarity metric) is deemed to be zero; referring to the instant applicant’s specification [0218] “… If none of the previous conditions apply (such as, for example, in a case where the first column may be of type BOOLEAN and the second column may be of type FLOAT), then a fourth casting similarity ( e.g., 0, "very low" can be associated with the join candidate”);
presenting the join candidate on a device of a user (YOUNG [0033] e.g. [0033] The statistical results 112 are input to a recommendation engine 114 which provides, as its output, one or more recommended joins 116, which can be provided to applications 118. Each recommended join is a pair of fields, one field from each data set. The recommendation engine can output a list of such recommended joins. The list of joins can be sorted or unsorted. The recommendation engine 114 or an application 118 can present such a list to a user through a display or other output device, and the user can provide one or more selected joins through an appropriate input device. The application 118 also can use the list to select one or more joins);
receiving, from the device of the user, a command to perform a data query of the data source, wherein the data query is based on the join candidate;
querying the data source based on the data query to obtain tabular data; and
outputting the tabular data (YOUNG [0006] – [0007], [0029], [0033], [0076] – [0077], [0088] – [0090] e.g. [0006] In accordance with a first embodiment, a computer system processes arbitrary data sets to identify fields of data that can be the basis of a join operation, which in turn can be used in report and query generation. [0007]. A user selects a subset of the tables and the system creates a join tree with recommended joins between the tables selected by the user. The recommended joins are used to create a structured query language statement which is executed to return a result to the user. [0033] The statistical results 112 are input to a recommendation engine 114 which provides, as its output, one or more recommended joins 116, which can be provided to applications 118. Each recommended join is a pair of fields, one field from each data set. The recommendation engine can output a list of such recommended joins. The list of joins can be sorted or unsorted. The recommendation engine 114 or an application 118 can present such a list to a user through a display or other output device, and the user can provide one or more selected joins through an appropriate input device. The application 118 also can use the list to select one or more joins).
Therefore, it would have been obvious to one of ordinary skill in the art at the time of the effective filing date of the invention, in view of the teachings of KEMENTSIETSIDIS, Borhade, HUANG and YOUNG, to overcome the complexity of joining tables if the data sets are arbitrary and generated from unstructured data (YOUNG [0004]).
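For illustration only, the penalty scaling quoted from YOUNG [0063] – [0064] can be sketched as below. The scaling factors 0.2, 0.5 and 0.95 and the Levenshtein-based formula are the example values quoted above; how the data type penalty and the field name penalty are combined (multiplied here) is an assumption, as are all function names.

```python
UNJOINABLE_TYPES = {"money", "float", "double", "date"}  # examples in YOUNG [0064]


def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def name_factor(name_f: str, name_g: str) -> float:
    """Scaling factor for field names: 1.0 if equal, 0.95 for a prefix match,
    otherwise 1 - distance / min(length), limited to a minimum of 0.5."""
    if name_f == name_g:
        return 1.0
    if name_f.startswith(name_g) or name_g.startswith(name_f):
        return 0.95
    distance = levenshtein(name_f, name_g)
    return max(0.5, 1.0 - distance / min(len(name_f), len(name_g)))


def penalized_score(raw, type_f, type_g, name_f, name_g):
    """Apply the data type penalty, then the name penalty (assumed product)."""
    score = raw
    if type_f in UNJOINABLE_TYPES or type_g in UNJOINABLE_TYPES:
        score *= 0.2  # unjoinable data types
    elif type_f != type_g:
        score *= 0.5  # data types do not match
    return score * name_factor(name_f, name_g)


print(name_factor("comp", "company"))
print(penalized_score(1.0, "float", "float", "id", "id"))
```

Because the penalty is a scaling factor rather than an exclusion, a non-matching pair keeps a reduced score and remains available when no better match is found, as the quoted passage notes.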
9. With respect to claim 2,
Borhade discloses wherein the predefined scale further includes a second casting similarity level indicative of a casting similarity level that is greater than the first casting similarity level (Borhade [0005], [0045] – [0050], [0067], [0083], [0169], [0193], [0225] – [0238] e.g. 0.97), and wherein assigning the casting similarity index further comprises:
YOUNG discloses assigning the second casting similarity level in a case that the first column has a STRING type and the second column has a DATE type (YOUNG [0029] e.g. string, dates).
10. With respect to claim 4,
YOUNG discloses wherein presenting the join candidate on the device of the user comprises:
displaying identified join candidates that include the join candidate on the device of the user in a ranked order based on respective casting similarity indexes of the identified join candidates (YOUNG [0039], [0059], [0065], [0141] – [0143], [0155] – [0156] e.g. ranked list).
11. With respect to claim 5,
YOUNG discloses wherein assigning the casting similarity index comprises: assigning the casting similarity index based on a comparison of respective metadata associated with the first column and the second column (YOUNG [0006], [0029], [0049] – [0051], [0055], [0066] – [0069], [0091], [0094] – [0095], [0122] – [0124], [0137], [0151] e.g. compares; comparison; metadata).
12. With respect to claim 6,
YOUNG discloses wherein the respective metadata include respective data types (YOUNG [0006] – [0007], [0029], [0076] – [0077], [0088] – [0090] e.g. data types), lengths, or precision attributes.
13. With respect to claim 8,
KEMENTSIETSIDIS discloses wherein the predefined scale includes multiple casting similarity levels, each casting similarity level being assigned based on specific data type combinations between the first column and the second column (KEMENTSIETSIDIS [0041] – [0042], [0059] – [0062], [0066] e.g. similarity score – e.g. 100, 102, 400).
14. Claims 12-13 and 16 are the same as claims 1-2 and 5 and are rejected for the same reasons as applied hereinabove.
15. Claims 17-18 are the same as claims 1-2 and are rejected for the same reasons as applied hereinabove.
16. Claims 3, 14 and 19 are rejected under 35 U.S.C. 103 as being unpatentable over KEMENTSIETSIDIS in view of Borhade, HUANG and YOUNG, and further in view of SASSIN.
17. With respect to claim 3,
YOUNG discloses wherein the predefined scale further includes a second casting similarity level indicative of a casting similarity level that is greater than the first casting similarity level, wherein assigning the casting similarity index further comprises:
assigning the second casting similarity level in a case that the first column has a different type and includes character string data (YOUNG [0029] e.g. string).
Although the combination of KEMENTSIETSIDIS, Borhade, HUANG and YOUNG substantially teaches the claimed invention, it does not explicitly indicate that the second column has a VARCHAR type.
SASSIN teaches the limitations by stating the second column has a VARCHAR type (SASSIN [0107] e.g. data type VARCHAR).
Therefore, it would have been obvious to one of ordinary skill in the art at the time of the effective filing date of the invention, in view of the teachings of KEMENTSIETSIDIS, Borhade, HUANG, YOUNG and SASSIN, to overcome the complexity of joining tables if the data sets are arbitrary and generated from unstructured data (YOUNG [0004]).
18. Claim 14 is the same as claim 3 and is rejected for the same reasons as applied hereinabove.
19. Claim 19 is the same as claim 3 and is rejected for the same reasons as applied hereinabove.
20. Claims 7, 10, 15 and 20 are rejected under 35 U.S.C. 103 as being unpatentable over KEMENTSIETSIDIS in view of Borhade, HUANG and YOUNG, and further in view of Gulwani.
21. With respect to claim 7,
Although the combination of KEMENTSIETSIDIS, Borhade, HUANG and YOUNG substantially teaches the claimed invention, it does not explicitly indicate comparing column names using a lexical similarity algorithm.
Gulwani teaches the limitations by stating comparing column names using a lexical similarity algorithm (Gulwani [0128], [0140] – [0142], [0153], [0159], [0162], [0175] – [0176] and Figs. 1, 9, 11, 13 e.g. 2.3.3 Matching Parses [0128] The function MatchParse takes two parse descriptors p1 and p2 as arguments (where one of the parse descriptors is from the library of parse descriptors and the other is associated with a semantic entity associated with an input or output item), and computes a weight w representing the similarity of the two format descriptors. A perfect match of fields and delimiters is given the highest weight. In one implementation, if the two format descriptors do not match perfectly, the hamming distance between the field and delimiter matches between the two format descriptors is computed. The weight representing the closeness measure is computed inversely proportional to the hamming distance. 2.4.3.1 Currency Entity [0153] For example, in the table 900 shown in FIG. 9, an end-user wants to translate the currency in column-1 902 into the currency type shown in column-2 904 using the currency conversion rate on the date shown in column-3 906, to obtain the result as shown in the output column 908. [0162] For example, consider a spreadsheet with dates in three different formats (i.e., US format: Month/Day/Year, European format: Day.Month.Year, and Chinese format: Year-Month-Day), as shown in FIG. 13. [0175] In one implementation, this is accomplished by computing a closeness measure that is inversely proportional to the hamming distance between the parse path and a parse descriptor, such that the closer the pattern of semantic entity fields and delimiters in the parse path is to the pattern of semantic entity fields and delimiters in the parse descriptor, the larger the closeness measure. [0176] With regard to the action of identifying one or more transforms from a library of transforms of a type that can produce the desired output item from the input items of an input-output example, it is noted that in one implementation, the entity class of each semantic entity identifies one or more transforms that apply to that class of entities [as comparing column names (e.g. Figs. 1, 9, 11 and/or 13) using a lexical (e.g. semantic) similarity algorithm]).
Therefore, it would have been obvious to one of ordinary skill in the art at the time of the effective filing date of the invention, in view of the teachings of KEMENTSIETSIDIS, Borhade, HUANG, YOUNG and Gulwani, to overcome the complexity of joining tables if the data sets are arbitrary and generated from unstructured data (YOUNG [0004]).
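The closeness measure quoted from Gulwani [0128] and [0175] can be illustrated with a short sketch. It assumes that each format descriptor is a sequence of field and delimiter tokens of equal length, and it uses 1 / (1 + d) as one possible weight that decreases as the hamming distance d grows and stays defined for a perfect match. Gulwani does not state a formula, and the function names and token lists here are hypothetical.

```python
from typing import Sequence


def hamming_distance(a: Sequence[str], b: Sequence[str]) -> int:
    # Counts the positions at which two equal-length token sequences differ.
    if len(a) != len(b):
        raise ValueError("hamming distance needs sequences of equal length")
    return sum(1 for x, y in zip(a, b) if x != y)


def closeness(a: Sequence[str], b: Sequence[str]) -> float:
    # Assumed weighting: a perfect match (distance 0) gets the highest weight, 1.0,
    # and the weight shrinks as the distance grows.
    return 1.0 / (1 + hamming_distance(a, b))


# Field and delimiter tokens modeled on the date formats quoted from [0162].
us_format = ["Month", "/", "Day", "/", "Year"]
european_format = ["Day", ".", "Month", ".", "Year"]

perfect = closeness(us_format, us_format)
partial = closeness(us_format, european_format)
```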
22. With respect to claim 10,
Gulwani further discloses calculating an edit distance between a first name of the first column and a second name of the second column (Gulwani [0128], [0153], [0159], [0162], [0175] – [0176] and Figs. 1, 9, 11, 13 e.g. 2.3.3 Matching Parses [0128] A perfect match of fields and delimiters is given the highest weight. In one implementation, if the two format descriptors do not match perfectly, the hamming distance between the field and delimiter matches between the two format descriptors is computed. The weight representing the closeness measure is computed inversely proportional to the hamming distance. 2.4.3.1 Currency Entity [0153] For example, in the table 900 shown in FIG. 9, an end-user wants to translate the currency in column-1 902 into the currency type shown in column-2 904 using the currency conversion rate on the date shown in column-3 906, to obtain the result as shown in the output column 908. [0162] For example, consider a spreadsheet with dates in three different formats (i.e., US format: Month/Day/Year, European format: Day.Month.Year, and Chinese format: Year-Month-Day), as shown in FIG. 13. [0175] In one implementation, this is accomplished by computing a closeness measure that is inversely proportional to the hamming distance between the parse path and a parse descriptor, such that the closer the pattern of semantic entity fields and delimiters in the parse path is to the pattern of semantic entity fields and delimiters in the parse descriptor, the larger the closeness measure. [0176] With regard to the action of identifying one or more transforms from a library of transforms of a type that can produce the desired output item from the input items of an input-output example, it is noted that in one implementation, the entity class of each semantic entity identifies one or more transforms that apply to that class of entities [as calculating an edit distance (e.g. hamming distance; referring to the instant applicant’s specification [0222]) between a first name of the first column and a second name of the second column (e.g. Figs. 1, 9, 11 and/or 13)]).
23. Claim 15 is the same as claim 10 and is rejected for the same reasons as applied hereinabove.
24. Claim 20 is the same as claim 10 and is rejected for the same reasons as applied hereinabove.
25. Claim 9 is rejected under 35 U.S.C. 103 as being unpatentable over KEMENTSIETSIDIS in view of Borhade, HUANG and YOUNG, and further in view of KANG.
26. With respect to claim 9,
Although the combination of KEMENTSIETSIDIS, Borhade, HUANG and YOUNG substantially teaches the claimed invention, it does not explicitly indicate determining compatibility between data type of the first column and the second column for assigning the casting similarity index.
KANG teaches the limitations by stating determining compatibility between data type of the first column and the second column for assigning the casting similarity index (KANG [0104] e.g. [0104] 2) create four general data type, CHAR, NUMBER, DATE and BOOLEAN, each having metainformation for data types of the data type, it comprises the characteristics of common data type, compatibility with other common data type, converting and compliant condition).
Therefore, it would have been obvious to one of ordinary skill in the art at the time of the effective filing date of the invention, in view of the teachings of KEMENTSIETSIDIS, Borhade, HUANG, YOUNG and KANG, to overcome the complexity of joining tables if the data sets are arbitrary and generated from unstructured data (YOUNG [0004]).
27. Claim 11 is rejected under 35 U.S.C. 103 as being unpatentable over KEMENTSIETSIDIS in view of Borhade, HUANG and YOUNG, and further in view of KABRA.
28. With respect to claim 11,
Although the combination of KEMENTSIETSIDIS, Borhade, HUANG and YOUNG substantially teaches the claimed invention, it does not explicitly indicate evaluating semantic relationships between column names to identify column name pairs having semantic similarity beyond character-level.
KABRA teaches the limitations by stating evaluating semantic relationships between column names to identify column name pairs having semantic similarity beyond character-level (KABRA [0035] e.g. [0035] Thus, in tracking historical (i.e. past, completed) classifications of columns of data into data classes of a collection of data classes, this can include establishing clusters of candidate data classes, of that collection of candidate data classes, that have classified related columns of data. Based on some relation(s) observed between columns that have been classified, the respective data classes can be clustered. Example relations include same or similar column name, and contextual relation (proximity, domain, order, as examples). Specifically, establishing the clusters could establish a cluster that clusters candidate data classes related by common terms of a glossary of terms directed to a specific business domain. In an example, a cluster is established for candidate data classes that have classified columns of data having lexicographically similar column names, as the relation between the classified columns of data. In selecting the candidate data classes to compare to value(s) of a target column, this can include selecting the classes of that cluster based on the column name of the target column of data matching to the lexicographically similar column name(s) that are part of the cluster. “Lexicographically similar” is defined to mean that two names are (i) the same (e.g. alphanumerically identical), (ii) similar based on having common substring(s) or low edit distance(s) compared to each other, (iii) identified as synonyms in a glossary, and/or (iv) variant ways of writing the same thing, for instance one as an abbreviation of the other (as with column 304 of FIG. 3). In a particular example, two names are lexicographically similar if the Levenshtein edit distance as between the two is below a predefined threshold).
Therefore, it would have been obvious to one of ordinary skill in the art at the time of the effective filing date of the invention, in view of the teachings of KEMENTSIETSIDIS, Borhade, HUANG, YOUNG and KABRA, to overcome the complexity of joining tables if the data sets are arbitrary and generated from unstructured data (YOUNG [0004]).
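The particular example in the quoted KABRA passage, a Levenshtein edit distance below a predefined threshold, can be sketched as follows. The threshold value of 3, the case folding, the function names, and the sample column names are assumptions made for illustration; KABRA [0035] as quoted does not specify them.

```python
def levenshtein(a: str, b: str) -> int:
    # Standard dynamic-programming edit distance (insert, delete, substitute),
    # keeping only the previous row of the table.
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def lexicographically_similar(first_name: str, second_name: str, threshold: int = 3) -> bool:
    # Assumed rule: names are similar when the distance is strictly below the threshold.
    return levenshtein(first_name.lower(), second_name.lower()) < threshold


similar = lexicographically_similar("CustID", "cust_id")
dissimilar = lexicographically_similar("order_date", "price")
```

A character-level measure of this kind would not, by itself, treat an abbreviation such as "cust_id" and "customer_id" as similar at this threshold, since their edit distance is 4.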
Response to Argument
29. On page 10, Applicant alleges Kementsietsidis does not describe joins.
Examiner disagrees because:
Amended claim 1 merely recites a join candidate, and there is no join operation being performed in the claim.
Further, both Borhade and YOUNG recite a join candidate and a join operation.
30. On pages 10-11, Applicant alleges the Office appears to have interpreted "data source" as equivalent to "data records". This conclusion is respectfully traversed as inconsistent with the understanding of a person having ordinary skill in the art in view of the present application. (See e.g., [0092] of the present application). The Office Action is unclear with respect to which data records are being cited.
Examiner disagrees because:
Data records are clearly a data source.
Further, Borhade (e.g. [0025] a file system or a database), HUANG (e.g. [pages 4-5] the source database (i.e. data source)) and YOUNG (e.g. storage server 104 in Fig. 1) recite a data source.
31. On pages 11-12, Applicant alleges the Office has failed to establish that a person having ordinary skill in the art would interpret the combination of Kementsietsidis, Huang, and Young as teaching "for the each pair, assigning a casting similarity index from a predefined scale that includes a first casting similarity level," as recited by revised independent claim 1.
Examiner disagrees because:
Applicant’s remarks and arguments presented on pages 11-12 have been fully considered but they are moot in view of the new grounds of rejection presented in this office action.
32. On pages 11-13, Applicant alleges the Office has failed to establish that a person having ordinary skill in the art would interpret the combination of Kementsietsidis, Huang, and Young as teaching "wherein the first casting similarity level is assigned to the each pair in a case that the first column has a Boolean type and the second column has a float type," as recited by revised independent claim 1 … the Office has failed to establish that a person having ordinary skill in the art would interpret "False" as a casting similarity level
Examiner disagrees because:
As described in Huang pages 4-5 and Figs. 2-3, the entry between BOOLEAN and FLOAT in Fig. 3 is FALSE. That is, the first casting similarity level is zero. Zero is clearly a casting similarity level between Boolean and Float.
The disclosure reasonably describes the argued limitation of "wherein the first casting similarity level is assigned to the each pair in a case that the first column has a Boolean type and the second column has a float type".
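The reading applied above, in which FALSE between BOOLEAN and FLOAT is treated as a casting similarity level of zero, can be pictured as a lookup on a predefined scale. In this sketch the table, the function name, and the STRING/DATE value of 0.97 (borrowed from the value cited to Borhade above purely as a placeholder) are assumptions; only the BOOLEAN/FLOAT level of zero reflects the position stated in this action.

```python
from typing import Dict, Optional, Tuple

# Hypothetical predefined scale keyed by (first column type, second column type).
PREDEFINED_SCALE: Dict[Tuple[str, str], float] = {
    ("BOOLEAN", "FLOAT"): 0.0,   # first casting similarity level (FALSE read as zero)
    ("STRING", "DATE"): 0.97,    # placeholder for a second, greater level
}


def assign_casting_similarity_index(first_type: str, second_type: str) -> Optional[float]:
    # Returns the level for the pair of data types, or None when the
    # hypothetical scale has no entry for that combination.
    return PREDEFINED_SCALE.get((first_type.upper(), second_type.upper()))


first_level = assign_casting_similarity_index("Boolean", "float")
second_level = assign_casting_similarity_index("STRING", "DATE")
```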
33. Applicant’s remarks and arguments presented on page 13 directed to claim 2 have been fully considered but they are moot in view of the new grounds of rejection presented in this office action.
34. On pages 13-15, the arguments regarding claims 3, 14, 19, 7, 10, 15, 20, 9 and 11 are similar to the arguments regarding claim 1, which have been addressed above.
Conclusion
35. Applicant's amendment necessitated the new ground(s) of rejection presented in this Office action. Accordingly, THIS ACTION IS MADE FINAL. See MPEP § 706.07(a). Applicant is reminded of the extension of time policy as set forth in 37 CFR 1.136(a).
A shortened statutory period for reply to this final action is set to expire THREE MONTHS from the mailing date of this action. In the event a first reply is filed within TWO MONTHS of the mailing date of this final action and the advisory action is not mailed until after the end of the THREE-MONTH shortened statutory period, then the shortened statutory period will expire on the date the advisory action is mailed, and any extension fee pursuant to 37 CFR 1.136(a) will be calculated from the mailing date of the advisory action. In no event, however, will the statutory period for reply expire later than SIX MONTHS from the date of this final action.
Any inquiry concerning this communication or earlier communications from the examiner should be directed to SyLing Yen whose telephone number is 571-270-1306.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Sanjiv Shah, can be reached at 571-272-4098. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Any inquiry of a general nature or relating to the status of this application or proceeding should be directed to the receptionist whose telephone number is 571-272-2100.
/SYLING YEN/Primary Examiner, Art Unit 2166
May 27, 2025