Academic literature on the topic 'Approximate record matching'

Create a spot-on reference in APA, MLA, Chicago, Harvard, and other styles


Consult the lists of relevant articles, books, theses, book chapters, conference papers, reports, and other scholarly sources on the topic 'Approximate record matching.'

Next to every source in the list of references, there is an 'Add to bibliography' button. Click it, and we will automatically generate the bibliographic reference to the chosen work in the citation style you need: APA, MLA, Harvard, Chicago, Vancouver, etc.

You can also download the full text of the academic publication as a PDF and read its abstract online whenever these are available in the metadata.

Journal articles on the topic "Approximate record matching"

1

Seleznjev, Oleg, and Bernhard Thalheim. "Random Databases with Approximate Record Matching." Methodology and Computing in Applied Probability 12, no. 1 (July 31, 2008): 63–89. http://dx.doi.org/10.1007/s11009-008-9092-4.

2

Verykios, Vassilios S., Ahmed K. Elmagarmid, and Elias N. Houstis. "Automating the approximate record-matching process." Information Sciences 126, no. 1-4 (July 2000): 83–98. http://dx.doi.org/10.1016/s0020-0255(00)00013-x.

3

Essex, Aleksander. "Secure Approximate String Matching for Privacy-Preserving Record Linkage." IEEE Transactions on Information Forensics and Security 14, no. 10 (October 2019): 2623–32. http://dx.doi.org/10.1109/tifs.2019.2903651.

4

Hanrath, Scott, and Erik Radio. "User search terms and controlled subject vocabularies in an institutional repository." Library Hi Tech 35, no. 3 (September 18, 2017): 360–67. http://dx.doi.org/10.1108/lht-11-2016-0133.

Abstract:
Purpose: The purpose of this paper is to investigate the search behavior of institutional repository (IR) users in regard to subjects as a means of estimating the potential impact of applying a controlled subject vocabulary to an IR.
Design/methodology/approach: Google Analytics data were used to record cases where users arrived at an IR item page from an external web search and subsequently downloaded content. Search queries were compared against the Faceted Application of Subject Terminology (FAST) schema to determine the topical nature of the queries. Queries were also compared against the item’s metadata values for title and subject using approximate string matching to determine the alignment of the queries with current metadata values.
Findings: A substantial portion of successful user search queries to an IR appear to be topical in nature. User search queries matched values from FAST at a higher rate than existing subject metadata. Increased attention to subject description in IR records may provide an opportunity to improve the search visibility of the content.
Research limitations/implications: The study is limited to a particular IR. Data from Google Analytics does not provide comprehensive search query data.
Originality/value: The study presents a novel method for analyzing user search behavior to assist IR managers in determining whether to invest in applying controlled subject vocabularies to IR content.
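
The query-to-metadata comparison described in this abstract can be illustrated with a small sketch: a normalized Levenshtein similarity scores a user query against candidate subject terms and keeps matches above a cutoff. This is a generic illustration, not the authors' code; the vocabulary, the query, and the 0.8 threshold are invented.

```python
# Sketch: score a search query against candidate subject terms with a
# normalized Levenshtein similarity. The vocabulary, query, and the 0.8
# threshold are illustrative assumptions, not values from the study.

def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

def similarity(a: str, b: str) -> float:
    """1.0 for identical strings, 0.0 for completely different ones."""
    a, b = a.lower().strip(), b.lower().strip()
    if not a and not b:
        return 1.0
    return 1.0 - levenshtein(a, b) / max(len(a), len(b))

vocabulary = ["record linkage", "data quality", "information retrieval"]
query = "record linkeage"   # a misspelled user query

matches = [(term, similarity(query, term)) for term in vocabulary]
matches = [(t, s) for t, s in matches if s >= 0.8]
print(matches)   # keeps ("record linkage", ~0.93)
```
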
5

Bianchi Santiago, Josie D., Héctor Colón Jordán, and Didier Valdés. "Record Linkage of Crashes with Injuries and Medical Cost in Puerto Rico." Transportation Research Record: Journal of the Transportation Research Board 2674, no. 10 (July 31, 2020): 739–48. http://dx.doi.org/10.1177/0361198120935439.

Abstract:
Cost considerations are critical in the analysis and prevention of traffic crashes. Integration of cost data into crash datasets facilitates crash-cost analyses with all their related attributes. It is, however, a challenging task because of the lack of unique identifiers across the databases and because of privacy and confidentiality regulations. This study compared deterministic and probabilistic record linkage approaches using attribute-matching techniques with numerical distances and weight patterns under the Fellegi–Sunter approach. The deterministic algorithm, developed using the exact match of the 14-digit police accident record number, had an overall matching performance of 52.38% of real matched records, while the probabilistic algorithm had an overall matching performance of 70.41% with a sensitivity of 99.99%. The deterministic approach was thus outperformed by the probabilistic approach by approximately 20% of records matched. Probabilistic matching with numerical variables appears to be a good matching strategy when supported by quality variables. Using the matched records, a multivariable regression model was developed to model medical costs and identify factors that increase the costs of treating injured claimants in Puerto Rico.
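
The Fellegi–Sunter weighting mentioned above can be sketched as a sum of per-field agreement and disagreement weights compared with decision thresholds. The m- and u-probabilities, the fields, the sample records, and the thresholds below are invented for illustration; they are not parameters from this study.

```python
import math

# Sketch of Fellegi-Sunter scoring: per-field agreement/disagreement weights
# are summed into a total match weight and compared with thresholds.
# All probabilities, fields, thresholds, and records below are illustrative
# assumptions, not values from the Puerto Rico study.

FIELDS = {
    # field: (m = P(agree | true match), u = P(agree | non-match))
    "record_number": (0.95, 0.001),
    "crash_date":    (0.90, 0.01),
    "birth_year":    (0.92, 0.05),
}

def match_weight(rec_a: dict, rec_b: dict) -> float:
    total = 0.0
    for field, (m, u) in FIELDS.items():
        if rec_a.get(field) == rec_b.get(field):
            total += math.log2(m / u)                 # agreement weight
        else:
            total += math.log2((1 - m) / (1 - u))     # disagreement weight
    return total

def classify(weight: float, upper: float = 8.0, lower: float = 0.0) -> str:
    if weight >= upper:
        return "match"
    if weight <= lower:
        return "non-match"
    return "possible match (clerical review)"

a = {"record_number": "20050114001234", "crash_date": "2005-01-14", "birth_year": 1980}
b = {"record_number": "20050114001234", "crash_date": "2005-01-14", "birth_year": 1981}
w = match_weight(a, b)
print(round(w, 2), classify(w))
```
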
6

Douglas, M. M., D. Gardner, D. Hucker, and S. W. Kendrick. "Best-Link Matching of Scottish Health Data Sets." Methods of Information in Medicine 37, no. 01 (1998): 64–68. http://dx.doi.org/10.1055/s-0038-1634494.

Abstract:
Methods are described that were used to link the Community Health Index and the National Health Service Central Register (NHSCR) in Scotland to provide a basis for a national patient index. The linkage used a combination of deterministic and probability matching techniques. A best-link principle was used, by which each Community Health Index record was allowed to link only to the NHSCR record with which it achieved the highest match weight. This strategy, applied in the context of two files which each covered virtually the entire population of Scotland, increased the accuracy of linkage approximately a thousand-fold compared with the likely results of a less structured probability matching approach. By this means, 98.8% of linkable records were linked automatically with a sufficient degree of confidence for administrative purposes.
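
A minimal sketch of the best-link principle, assuming a placeholder match-weight function: each record on one file keeps only its single highest-weight candidate on the other file, provided the weight clears a minimum. The scoring rule and the sample records are invented.

```python
# Sketch of the best-link principle: each Community Health Index record is
# allowed to link only to its single highest-weight NHSCR candidate.
# `score` is a placeholder for whatever match-weight function is in use.

def score(chi_rec: dict, nhscr_rec: dict) -> float:
    # Toy weight: count of agreeing identifier fields (placeholder only).
    return sum(chi_rec.get(f) == nhscr_rec.get(f)
               for f in ("surname", "dob", "postcode"))

def best_link(chi_records, nhscr_records, min_weight=2):
    links = {}
    for chi in chi_records:
        scored = [(score(chi, n), n) for n in nhscr_records]
        best_weight, best_rec = max(scored, key=lambda x: x[0])
        if best_weight >= min_weight:      # accept only confident best links
            links[chi["id"]] = best_rec["id"]
    return links

chi = [{"id": "C1", "surname": "SMITH", "dob": "1950-03-02", "postcode": "EH1"}]
nhscr = [{"id": "N1", "surname": "SMITH", "dob": "1950-03-02", "postcode": "EH1"},
         {"id": "N2", "surname": "SMYTH", "dob": "1950-03-02", "postcode": "G12"}]
print(best_link(chi, nhscr))   # {'C1': 'N1'} - the highest-weight candidate wins
```
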
7

Wang, Shan, Huiling Shan, Chi Zhang, Yuexing Wang, and Chunxiang Shi. "Bias Correction in Monthly Records of Satellite Soil Moisture Using Nonuniform CDFs." Advances in Meteorology 2018 (July 16, 2018): 1–11. http://dx.doi.org/10.1155/2018/1908570.

Abstract:
It is important to eliminate systematic biases in the field of soil moisture data assimilation. One simple method for bias removal is to match cumulative distribution functions (CDFs) of modeled soil moisture data to satellite soil moisture data. Traditional methods approximate numerical CDFs using 12 or 20 uniformly spaced samples. In this paper, we applied the Douglas–Peucker curve approximation algorithm to approximate the CDFs and found that three nonuniformly spaced samples can achieve the same reduction in standard deviation. Meanwhile, the matching results are always closely related to the temporal and spatial availability of soil moisture observed by automatic soil moisture station (ASM). We also applied the new nonuniformly spaced sampling method to a shorter time series. Instead of processing a whole year of data at once, we divided it into 12 datasets and used three nonuniformly spaced samples to approximate the model data’s CDF for each month. The matching results demonstrate that NU-CDF3 reduced the SD, improved R, and reduced the RMSD in over 70% of the stations, when compared with U-CDF12. Additionally, the SD and RMSD have been reduced by over 4% with R improved by more than 9%.
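
The CDF matching referred to above is essentially quantile mapping. The sketch below shows the uniform-sample baseline (here 12 quantiles, as in the traditional approach the paper improves on) with synthetic data; it is not the authors' nonuniform Douglas–Peucker sampling.

```python
import numpy as np

# Sketch of CDF matching (quantile mapping) to remove systematic bias between
# modeled and satellite soil moisture. This is the uniform-sample baseline;
# the synthetic data and the 12 sample points are illustrative only.

rng = np.random.default_rng(0)
model = rng.normal(0.30, 0.05, 365)       # modeled soil moisture (m3/m3)
satellite = rng.normal(0.25, 0.08, 365)   # satellite soil moisture (m3/m3)

# Approximate both CDFs at 12 uniformly spaced quantiles.
quantiles = np.linspace(0.0, 1.0, 12)
model_q = np.quantile(model, quantiles)
sat_q = np.quantile(satellite, quantiles)

# Map each modeled value to the satellite value at the same CDF position.
corrected = np.interp(model, model_q, sat_q)

print("bias before:", round(float(np.mean(model - satellite)), 3))
print("bias after: ", round(float(np.mean(corrected - satellite)), 3))
```
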
8

Grannis, Shaun J., Huiping Xu, Joshua R. Vest, Suranga Kasthurirathne, Na Bo, Ben Moscovitch, Rita Torkzadeh, and Josh Rising. "Evaluating the effect of data standardization and validation on patient matching accuracy." Journal of the American Medical Informatics Association 26, no. 5 (March 8, 2019): 447–56. http://dx.doi.org/10.1093/jamia/ocy191.

Abstract:
Objective: This study evaluated the degree to which recommendations for demographic data standardization improve patient matching accuracy using real-world datasets.
Materials and Methods: We used 4 manually reviewed datasets, containing a random selection of matches and nonmatches. Matching datasets included health information exchange (HIE) records, public health registry records, Social Security Death Master File records, and newborn screening records. Standardized fields included last name, telephone number, social security number, date of birth, and address. Matching performance was evaluated using 4 metrics: sensitivity, specificity, positive predictive value, and accuracy.
Results: Standardizing address was independently associated with improved matching sensitivities for both the public health and HIE datasets of approximately 0.6% and 4.5%. Overall accuracy was unchanged for both datasets due to reduced match specificity. We observed no similar impact for address standardization in the death master file dataset. Standardizing last name yielded improved matching sensitivity of 0.6% for the HIE dataset, while overall accuracy remained the same due to a decrease in match specificity. We noted no similar impact for other datasets. Standardizing other individual fields (telephone, date of birth, or social security number) showed no matching improvements. As standardizing address and last name improved matching sensitivity, we examined the combined effect of address and last name standardization, which showed that standardization improved sensitivity from 81.3% to 91.6% for the HIE dataset.
Conclusions: Data standardization can improve match rates, thus ensuring that patients and clinicians have better data on which to make decisions to enhance care quality and safety.
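
The kind of field standardization evaluated in this study can be sketched as simple normalization functions applied before comparison. The rules below (upper-cased names, digits-only phone numbers, a small street-suffix dictionary) are generic illustrations, not the specific standards tested in the paper.

```python
import re

# Sketch of demographic field standardization applied before matching.
# The rules below are generic illustrations, not the exact standards
# evaluated in the study.

SUFFIXES = {"STREET": "ST", "AVENUE": "AVE", "ROAD": "RD", "DRIVE": "DR"}

def std_last_name(name: str) -> str:
    return re.sub(r"[^A-Z\- ]", "", name.upper().strip())

def std_phone(phone: str) -> str:
    digits = re.sub(r"\D", "", phone)
    return digits[-10:]                      # keep the last 10 digits

def std_address(addr: str) -> str:
    tokens = addr.upper().replace(".", "").split()
    return " ".join(SUFFIXES.get(t, t) for t in tokens)

record = {"last_name": "O'Brien ", "phone": "(317) 555-0100", "address": "12 Main Street."}
standardized = {
    "last_name": std_last_name(record["last_name"]),
    "phone": std_phone(record["phone"]),
    "address": std_address(record["address"]),
}
print(standardized)
# {'last_name': 'OBRIEN', 'phone': '3175550100', 'address': '12 MAIN ST'}
```
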
9

Zhang, Yifan, Erin E. Holsinger, Lea Prince, Jonathan A. Rodden, Sonja A. Swanson, Matthew M. Miller, Garen J. Wintemute, and David M. Studdert. "Assembly of the LongSHOT cohort: public record linkage on a grand scale." Injury Prevention 26, no. 2 (October 29, 2019): 153–58. http://dx.doi.org/10.1136/injuryprev-2019-043385.

Abstract:
Background: Virtually all existing evidence linking access to firearms to elevated risks of mortality and morbidity comes from ecological and case–control studies. To improve understanding of the health risks and benefits of firearm ownership, we launched a cohort study: the Longitudinal Study of Handgun Ownership and Transfer (LongSHOT).
Methods: Using probabilistic matching techniques we linked three sources of individual-level, state-wide data in California: official voter registration records, an archive of lawful handgun transactions and all-cause mortality data. There were nearly 28.8 million unique voter registrants, 5.5 million handgun transfers and 3.1 million deaths during the study period (18 October 2004 to 31 December 2016). The linkage relied on several identifying variables (first, middle and last names; date of birth; sex; residential address) that were available in all three data sets, deploying them in a series of bespoke algorithms.
Results: Assembly of the LongSHOT cohort commenced in January 2016 and was completed in March 2019. Approximately three-quarters of matches identified were exact matches on all link variables. The cohort consists of 28.8 million adult residents of California followed for up to 12.2 years. A total of 1.2 million cohort members purchased at least one handgun during the study period, and 1.6 million died.
Conclusions: Three steps taken early may be particularly useful in enhancing the efficiency of large-scale data linkage: thorough data cleaning; assessment of the suitability of off-the-shelf data linkage packages relative to bespoke coding; and careful consideration of the minimum sample size and matching precision needed to support rigorous investigation of the study questions.
10

Greer, Melody Lynn. "4294 Patient Matching Errors and Associated Safety Events." Journal of Clinical and Translational Science 4, s1 (June 2020): 42. http://dx.doi.org/10.1017/cts.2020.160.

Abstract:
OBJECTIVES/GOALS: Errors in patient matching could result in serious adverse safety events. Unlike publicized mix-ups by healthcare providers, these errors are insidious, and with increased data sharing they are a growing concern in healthcare. The following project will examine patient matching errors and quantify their association with safety.
METHODS/STUDY POPULATION: EHR systems perform matching out-of-the-box with unknown quality. Using matching processes outside the EMR, the rate at which matching errors are present was quantified and the erroneous records were flagged, providing both comparative measures and the data necessary to evaluate patient safety. To understand the relationship between matching and safety, we will establish the percentage of voluntarily reported safety events in our institution where a matching error existed during an encounter. Any safety events occurring for a flagged patient will be reviewed to determine if matching errors contributed to the safety problem. Not all safety events are reported, so we will perform full chart review of a filtered list of medical records that have a higher likelihood of safety events.
RESULTS/ANTICIPATED RESULTS: We were able to quantify matching errors, and the preliminary matching error rate is approximately 1%, representing over 700 patients. The work is in progress and we are beginning to determine the association between safety events and incorrect matching. Together these results will provide an incentive to identify errors, make corrections, and develop methods to achieve these objectives. The number of matching errors impacts patient care as well as business operations and is likely to have a negative financial impact on institutions with high error rates, regardless of its relationship to safety.
DISCUSSION/SIGNIFICANCE OF IMPACT: Patient matching is bundled with EHR software and institutions have little control over error rates, yet they bear the liability for resulting clinical errors. Institutions need to be able to identify undetected matching errors and any associated safety events, and this project will provide that solution.

Dissertations / Theses on the topic "Approximate record matching"

1

Jupin, Joseph. "Temporal Graph Record Linkage and k-Safe Approximate Match." Diss., Temple University Libraries, 2016. http://cdm16002.contentdm.oclc.org/cdm/ref/collection/p245801coll10/id/412419.

Abstract:
Computer and Information Science
Ph.D.
Since the advent of electronic data processing, organizations have accrued vast amounts of data contained in multiple databases with no reliable global unique identifier. These databases were developed by different departments for different purposes at different times. Organizing and analyzing these data for human services requires linking records from all sources. RL (Record Linkage) is a process that connects records that are related to the identical or a sufficiently similar entity from multiple heterogeneous databases. RL is a data- and compute-intensive, mission-critical process. The process must be efficient enough to process big data and effective enough to provide accurate matches.

We have evaluated an RL system that is currently in use by a local health and human services department. We found that they were using the typical approach that was offered by Fellegi and Sunter with tuple-by-tuple processing, using the Soundex as the primary approximate string matching method. The Soundex has been found to be unreliable both as a phonetic and as an approximate string matching method. We found that their data, in many cases, have more than one value per field, suggesting that the data were queried from a 5NF database. Consider that if a woman has been married 3 times, she may have up to 4 last names on record. This query process produced more than one tuple per database/entity, apparently generating a Cartesian product of the data. In many cases, more than a dozen tuples were observed for a single database/entity. This approach is both ineffective and inefficient. An effective RL method should handle this multi-data without redundancy and use edit-distance for approximate string matching. However, due to high computational complexity, edit-distance will not scale well with big data problems.

We developed two methodologies for resolving the aforementioned issues: PSH and ALIM. PSH, the Probabilistic Signature Hash, is a composite method that increases the speed of Damerau-Levenshtein edit-distance. It combines signature filtering, probabilistic hashing, length filtering and prefix pruning to increase the speed of edit-distance. It is also lossless because it does not lose any true positive matches. ALIM, Aggregate Link and Iterative Match, is a graph-based record linkage methodology that uses a multi-graph to store demographic data about people. ALIM performs string matching as records are inserted into the graph. ALIM eliminates data redundancy and stores the relationships between data.

We tested PSH for string comparison and found it to be approximately 6,000 times faster than DL. We tested it against the trie-join methods and found that they are up to 6.26 times faster but lose between 10 and 20 percent of true positives. We tested ALIM against a method currently in use by a local health and human services department and found ALIM to produce significantly more matches (even with more restrictive match criteria) and that ALIM ran more than twice as fast. ALIM handles the multi-data problem and PSH allows the use of edit-distance comparison in this RL model. ALIM is more efficient and effective than a currently implemented RL system. This model can also be expanded to perform social network analysis and temporal data modeling. For human services, temporal modeling can reveal how policy changes and treatments affect clients over time, and social network analysis can determine the effects of these on whole families by facilitating family linkage.
Temple University--Theses
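
One of the filters combined in PSH, length filtering, is easy to sketch: when two strings differ in length by more than the edit-distance threshold k, the expensive Damerau–Levenshtein computation can be skipped. The code below is a generic illustration of that idea, not the thesis implementation.

```python
# Sketch of one filter combined in PSH: strings whose lengths differ by more
# than the edit-distance threshold k cannot match, so the expensive
# Damerau-Levenshtein computation is skipped. Generic illustration only.

def damerau_levenshtein(a: str, b: str) -> int:
    """Edit distance with adjacent transpositions (restricted variant)."""
    d = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(len(a) + 1):
        d[i][0] = i
    for j in range(len(b) + 1):
        d[0][j] = j
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            cost = a[i - 1] != b[j - 1]
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)  # transposition
    return d[len(a)][len(b)]

def within_k(a: str, b: str, k: int = 2) -> bool:
    if abs(len(a) - len(b)) > k:     # length filter: cheap rejection
        return False
    return damerau_levenshtein(a, b) <= k

print(within_k("JOHNSON", "JONHSON"))   # True  (one transposition)
print(within_k("JOHNSON", "JO"))        # False (length filter rejects)
```
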
2

Tam, Siu-lung. "Linear-size indexes for approximate pattern matching and dictionary matching." Thesis, The University of Hong Kong (HKUTO), 2010. http://sunzi.lib.hku.hk/hkuto/record/B44205326.

3

Тодоріко, Ольга Олексіївна. "Моделі та методи очищення та інтеграції текстових даних в інформаційних системах" [Models and methods for cleaning and integration of textual data in information systems]. Thesis, Запорізький національний університет, 2016. http://repository.kpi.kharkov.ua/handle/KhPI-Press/21856.

Abstract:
The thesis for the candidate degree in technical sciences, speciality 05.13.06 – Information Technologies. National Technical University "Kharkiv Polytechnic Institute", Kharkiv, 2016. The thesis addresses the practical problem of increasing the efficiency and quality of the technology for cleaning and integrating textual data in reference and retrieval information systems by reducing the number of errors in text information through a model of the inflectional paradigm, a method for building a lexeme index, and improved methods of tolerant (similarity) retrieval. The developed model of the inflectional paradigm represents words as ordered collections of signatures together with an approximate measure of similarity between two representations; it differs in how it handles word forms and character positions, and it provides the basis for the improved methods of tolerant retrieval, cleaning, and integration of datasets. A method for building the lexeme index based on the proposed model is developed, which allows a word and all of its forms to be mapped to a single index record. The tolerant retrieval method is improved at the preliminary filtration stage thanks to the developed model and the lexeme index. Experimental evaluation indicates high precision and a recall of 99 ± 0.5 %. The information technology for cleaning and integrating data is improved using the developed models and methods, and software that performs similarity search, cleaning, and integration of datasets on their basis has been implemented. The theoretical and practical results have been introduced into the document workflow of an admissions committee and into the educational process of the mathematical faculty of Zaporizhzhya National University.
4

Тодоріко, Ольга Олексіївна. "Моделі та методи очищення та інтеграції текстових даних в інформаційних системах" [Models and methods for cleaning and integration of textual data in information systems]. Thesis, НТУ "ХПІ", 2016. http://repository.kpi.kharkov.ua/handle/KhPI-Press/21853.

Abstract:
The thesis for the candidate degree in technical sciences, speciality 05.13.06 – Information Technologies. National Technical University "Kharkiv Polytechnic Institute", Kharkiv, 2016. The thesis addresses the practical problem of increasing the efficiency and quality of the technology for cleaning and integrating textual data in reference and retrieval information systems by reducing the number of errors in text information through a model of the inflectional paradigm, a method for building a lexeme index, and improved methods of tolerant (similarity) retrieval. The developed model of the inflectional paradigm represents words as ordered collections of signatures together with an approximate measure of similarity between two representations; it differs in how it handles word forms and character positions, and it provides the basis for the improved methods of tolerant retrieval, cleaning, and integration of datasets. A method for building the lexeme index based on the proposed model is developed, which allows a word and all of its forms to be mapped to a single index record. The tolerant retrieval method is improved at the preliminary filtration stage thanks to the developed model and the lexeme index. Experimental evaluation indicates high precision and a recall of 99 ± 0.5 %. The information technology for cleaning and integrating data is improved using the developed models and methods, and software that performs similarity search, cleaning, and integration of datasets on their basis has been implemented. The theoretical and practical results have been introduced into the document workflow of an admissions committee and into the educational process of the mathematical faculty of Zaporizhzhya National University.
5

Dobiášovský, Jan. "Přibližná shoda znakových řetězců a její aplikace na ztotožňování metadat vědeckých publikací" [Approximate string matching and its application to the matching of scientific publication metadata]. Master's thesis, 2020. http://www.nusl.cz/ntk/nusl-415121.

Abstract:
The thesis explores the application of approximate string matching in the scientific publication record linkage process. An introduction to record matching is provided, along with five commonly used string distance metrics (Levenshtein, Jaro, Jaro-Winkler, and Cosine distances and the Jaccard coefficient). These metrics are applied to publication metadata from the V3S current research information system of the Czech Technical University in Prague. Based on the findings, optimal thresholds for the F1, F2 and F3 measures are determined for each metric.
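
The threshold-selection step described here can be sketched as a sweep over cutoffs that maximizes F1 on labeled pairs. Python's difflib.SequenceMatcher stands in for the metrics compared in the thesis, and the labeled pairs are invented.

```python
from difflib import SequenceMatcher

# Sketch of threshold selection: score labeled candidate pairs with a string
# similarity metric and choose the cutoff that maximizes F1. SequenceMatcher
# stands in here for the metrics compared in the thesis; the pairs are invented.

def sim(a: str, b: str) -> float:
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

pairs = [  # (title A, title B, is_same_publication)
    ("Approximate record matching", "Approximate record matchng", True),
    ("Record linkage at scale", "Record linkage at scale.", True),
    ("Secure string matching", "Soil moisture bias correction", False),
    ("NetCube count queries", "Best-link matching of health data", False),
]

def f1_at(threshold: float) -> float:
    tp = sum(sim(a, b) >= threshold and y for a, b, y in pairs)
    fp = sum(sim(a, b) >= threshold and not y for a, b, y in pairs)
    fn = sum(sim(a, b) < threshold and y for a, b, y in pairs)
    return 2 * tp / (2 * tp + fp + fn) if tp else 0.0

best = max((t / 100 for t in range(50, 100)), key=f1_at)
print("best threshold:", best, "F1:", round(f1_at(best), 3))
```
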

Book chapters on the topic "Approximate record matching"

1

Dong, Boxiang, and Hui Wendy Wang. "Efficient Authentication of Approximate Record Matching for Outsourced Databases." In Advances in Intelligent Systems and Computing, 119–68. Cham: Springer International Publishing, 2019. http://dx.doi.org/10.1007/978-3-319-98056-0_6.

2

Margaritis, Dimitris, Christos Faloutsos, and Sebastian Thrun. "NetCube." In Database Technologies, 2011–36. IGI Global, 2009. http://dx.doi.org/10.4018/978-1-60566-058-5.ch120.

Abstract:
We present a novel method for answering count queries from a large database approximately and quickly. Our method implements an approximate DataCube of the application domain, which can be used to answer any conjunctive count query that can be formed by the user. The DataCube is a conceptual device that in principle stores the number of matching records for all possible such queries. However, because its size and generation time are inherently exponential, our approach uses one or more Bayesian networks to implement it approximately. Bayesian networks are statistical graphical models that can succinctly represent the underlying joint probability distribution of the domain, and can therefore be used to calculate approximate counts for any conjunctive query combination of attribute values and “don’t cares.” The structure and parameters of these networks are learned from the database in a preprocessing stage. By means of such a network, the proposed method, called NetCube, exploits correlations and independencies among attributes to answer a count query quickly without accessing the database. Our preprocessing algorithm scales linearly on the size of the database, and is thus scalable; it is also parallelizable with a straightforward parallel implementation. We give an algorithm for estimating the count result of arbitrary queries that is fast (constant) on the database size. Our experimental results show that NetCubes have fast generation and use, achieve excellent compression and have low reconstruction error. Moreover, they naturally allow for visualization and data mining, at no extra cost.
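
The core NetCube idea, estimating a count as N times a probability read from a Bayesian network instead of scanning the database, can be sketched with a toy two-node network. The structure, the probabilities, and N below are invented for illustration.

```python
# Sketch of the NetCube idea: a count query is answered as N * P(query),
# with P(query) computed from a small Bayesian network instead of scanning
# the database. The two-node network, its probabilities, and N are invented.

N = 1_000_000                      # number of records in the database

# Toy network: smoker -> cancer
P_SMOKER = {True: 0.2, False: 0.8}
P_CANCER_GIVEN_SMOKER = {True: {True: 0.05, False: 0.95},
                         False: {True: 0.01, False: 0.99}}

def joint(smoker: bool, cancer: bool) -> float:
    return P_SMOKER[smoker] * P_CANCER_GIVEN_SMOKER[smoker][cancer]

def estimated_count(query: dict) -> float:
    """query maps attribute name -> required value; missing attributes are don't-cares."""
    total = 0.0
    for smoker in (True, False):
        for cancer in (True, False):
            assignment = {"smoker": smoker, "cancer": cancer}
            if all(assignment[k] == v for k, v in query.items()):
                total += joint(smoker, cancer)   # marginalize over don't-cares
    return N * total

print(estimated_count({"smoker": True, "cancer": True}))  # ~10,000 records
print(estimated_count({"cancer": True}))                  # ~18,000 records
```
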
3

Margaritis, Dimitris, Christos Faloutsos, and Sebastian Thrun. "NetCube." In Bayesian Network Technologies, 54–83. IGI Global, 2007. http://dx.doi.org/10.4018/978-1-59904-141-4.ch004.

Abstract:
We present a novel method for answering count queries from a large database approximately and quickly. Our method implements an approximate DataCube of the application domain, which can be used to answer any conjunctive count query that can be formed by the user. The DataCube is a conceptual device that in principle stores the number of matching records for all possible such queries. However, because its size and generation time are inherently exponential, our approach uses one or more Bayesian networks to implement it approximately. Bayesian networks are statistical graphical models that can succinctly represent the underlying joint probability distribution of the domain, and can therefore be used to calculate approximate counts for any conjunctive query combination of attribute values and “don’t cares.” The structure and parameters of these networks are learned from the database in a preprocessing stage. By means of such a network, the proposed method, called NetCube, exploits correlations and independencies among attributes to answer a count query quickly without accessing the database. Our preprocessing algorithm scales linearly on the size of the database, and is thus scalable; it is also parallelizable with a straightforward parallel implementation. We give an algorithm for estimating the count result of arbitrary queries that is fast (constant) on the database size. Our experimental results show that NetCubes have fast generation and use, achieve excellent compression and have low reconstruction error. Moreover, they naturally allow for visualization and data mining, at no extra cost.

Conference papers on the topic "Approximate record matching"

1

Gollapalli, Mohammed, Xue Li, Ian Wood, and Guido Governatori. "Approximate Record Matching Using Hash Grams." In 2011 IEEE International Conference on Data Mining Workshops (ICDMW). IEEE, 2011. http://dx.doi.org/10.1109/icdmw.2011.33.

2

Dong, Boxiang, and Wendy Wang. "ARM: Authenticated Approximate Record Matching for Outsourced Databases." In 2016 IEEE 17th International Conference on Information Reuse and Integration (IRI). IEEE, 2016. http://dx.doi.org/10.1109/iri.2016.86.

3

Schraagen, Marijn. "Complete Coverage for Approximate String Matching in Record Linkage Using Bit Vectors." In 2011 IEEE 23rd International Conference on Tools with Artificial Intelligence (ICTAI). IEEE, 2011. http://dx.doi.org/10.1109/ictai.2011.116.

4

Jia, Dan, Yong-Yi Wang, and Steve Rapp. "Material Properties and Flaw Characteristics of Vintage Girth Welds." In 2020 13th International Pipeline Conference. American Society of Mechanical Engineers, 2020. http://dx.doi.org/10.1115/ipc2020-9658.

Abstract:
Vintage pipelines, which in the context of this paper refer to pipelines built before approximately 1970, account for a large portion of the energy pipeline systems in North America. Integrity assessment of these pipelines can sometimes present challenges due to incomplete records and lack of material property data. When material properties for the welds of interest are not available, conservative estimates based on past experience are typically used for the unknown material property values. Such estimates can be overly conservative, potentially leading to unnecessary remedial actions. This paper is a summary of PRCI-funded work aimed at characterizing material properties and flaw characteristics of vintage girth welds. The data obtained in this work can be utilized to understand and predict the behavior of vintage pipelines, which is covered in a companion paper [1]. The material property data generated in this work include (i) pipe base metal tensile properties in both the hoop (transverse) and the longitudinal (axial) directions, (ii) deposited weld metal tensile properties, (iii) macrohardness traverses, (iv) microhardness maps, and (v) Charpy impact transition curves of specimens with notches in the heat-affected zone (HAZ) and weld centerline (WCL). These data provide essential information for tensile strength, strength mismatch, and impact toughness. In addition to the basic material property data, instrumented cross-weld tensile (ICWT) tests were conducted on CWT specimens with no flaws, natural flaws, and artificially machined planar flaws. The ICWT tests provide an indication of the welds’ stress and strain capacity without and with flaws. For welds with even-matching or over-matching weld strengths, the CWT specimens usually failed outside of the weld region, even for specimens with natural flaws reported by non-destructive examination. Having over-matching weld strength can compensate for the negative impact of weld flaws. All tested girth welds were inspected using radiography and/or phased array ultrasonic testing. The inspection results are compared with the flaws exposed through destructive testing. The ability of these inspection methods to detect and size flaws in vintage girth welds is evaluated.
5

Ramakrishnan, Kishore Ranganath, Shoaib Ahmed, Benjamin Wahls, Prashant Singh, Maria A. Aleman, Kenneth Granlund, Srinath Ekkad, Federico Liberatore, and Yin-Hsiang Ho. "Gas Turbine Combustor Liner Wall Heat Load Characterization for Different Gaseous Fuels." In ASME 2019 International Mechanical Engineering Congress and Exposition. American Society of Mechanical Engineers, 2019. http://dx.doi.org/10.1115/imece2019-11283.

Abstract:
The knowledge of the detailed distribution of heat load on a swirl-stabilized combustor liner wall is imperative in the development of liner-specific cooling arrangements, aimed at maintaining uniform liner wall temperatures for reduced thermal stress levels. Heat transfer and fluid flow experiments have been conducted on a swirl-stabilized lean premixed combustor to understand the behavior of Methane-, Propane-, and Butane-based flames. These fuels were compared at different equivalence ratios chosen to match the adiabatic flame temperature of Methane at an equivalence ratio of 0.65. The experiments were carried out at a fixed Reynolds number (based on the combustor diameter) of 12000, where the pre-heated air temperature was approximately 373 K. The combustor liner in this setup was made from a 4 mm thick quartz tube. An infrared camera was used to record the inner and outer temperatures of the liner wall, and a two-dimensional heat conduction model was used to find the wall heat flux at a quasi-steady state condition. The flow field in the combustor was measured through Particle Image Velocimetry. The variation of peak heat flux on the liner wall, the position of peak heat flux and heat transfer, and the position of flame impingement on the liner are presented in this study. For all three gaseous fuels studied, the major swirl-stabilized flame features such as the corner recirculation zone, central recirculation zone and shear layers were observed to be similar. Liner wall and exhaust temperatures for Butane were the highest among the fuels tested in this study, which was expected, as the heat released from the combustion of Butane is higher than that of Methane and Propane.
6

Cummings, Scott M. "Prediction of Rolling Contact Fatigue Using Instrumented Wheelsets." In ASME 2008 Rail Transportation Division Fall Technical Conference. ASMEDC, 2008. http://dx.doi.org/10.1115/rtdf2008-74013.

Abstract:
The measured wheel/rail forces from four wheels in the leading truck of a coal hopper car during one revenue service roundtrip were used by the Wheel Defect Prevention Research Consortium (WDPRC) to predict rolling contact fatigue (RCF) damage. The data were recorded in March 2005 by TTCI for an unrelated Strategic Research Initiatives project funded by the Association of American Railroads (AAR). RCF damage was predicted in only a small portion of the approximately 4,000 km (2,500 miles) for which data were analyzed. The locations where RCF damage was predicted to occur were examined carefully by matching recorded GPS and train speed/distance data with track charts. RCF is one way in which wheels can develop tread defects. Thermal mechanical shelling (TMS) is a subset of wheel shelling in which the heat from tread braking reduces a wheel’s fatigue resistance. RCF and TMS together are estimated to account for approximately half of the total wheel tread damage problem [1]. Other types of tread damage can result from wheel slides. The work described in this paper concerns pure RCF, without regard to temperature effects or wheel slide events. It is important that the limitations of the analysis in this paper are recognized. The use of pre-existing data recorded two years prior to the analysis ruled out the possibility of determining the conditions of the track when the data were recorded (rail profile, friction, precise track geometry). Accordingly, the wheel/rail contact stress was calculated with an assumed rail crown profile radius of 356 mm (14 inches). RCF was predicted using shakedown theory, which does not account for wear and is the subject of some continuing debate regarding the exact conditions required for fatigue damage. The data set analyzed represents the wheel/rail forces from two wheelsets in a single, reasonably well maintained car. Wheelsets in other cars may produce different results. With this understanding, the following conclusions are made:
- RCF damage is predicted to accumulate only at a small percentage of the total distance traveled.
- RCF damage is predicted to accumulate on almost every curve of 4 degrees or greater.
- RCF damage is primarily predicted to accumulate while the car is loaded.
- RCF damage is predicted to accumulate more heavily on the wheelset in the leading position of the truck than on the trailing wheelset.
- No RCF damage was predicted while the test car was on mine property.
- Four unique curves (8 degrees, 7 degrees, 6 degrees, and 4 degrees) accounted for nearly half of the predicted RCF damage of the loaded trip. In each case, the RCF damage was predicted to accumulate on the low-rail wheel of the leading wheelset.
- Wayside flange lubricators are located near many of the locations where RCF damage was predicted to accumulate, indicating that simply adding wayside lubricators will not solve the RCF problem.
- The train was typically being operated below the balance speed of the curve when RCF damage was predicted to occur.
- The worst track locations for wheel RCF tend to be on curves of 4 degrees or higher. For the route analyzed in this work, the worst locations for wheel RCF tended to be bunched in urban areas, where tight curvature generally prevails.

Reports on the topic "Approximate record matching"

1

Day, Christopher M., Howell Li, Sarah M. L. Hubbard, and Darcy M. Bullock. Observations of Trip Generation, Route Choice, and Trip Chaining with Private-Sector Probe Vehicle GPS Data. Purdue University, 2022. http://dx.doi.org/10.5703/1288284317368.

Abstract:
This paper presents an exploratory study of GPS data from a private-sector data provider for analysis of trip generation, route choice, and trip chaining. The study focuses on travel to and from the Indianapolis International Airport. The GPS data consist of nearly 1 billion waypoints for 12 million trips collected over a 6-week period in the state of Indiana. Within these data, there were approximately 10,000 trip records indicating travel to facilities associated with the Indianapolis airport. The analysis is based on the matching of waypoints to geographic areas that define the extents of roadways and various destinations. A regional analysis of trip ends finds that travel demand for passenger services at the airport extends across a region spanning about 950 km. Local travel between land uses near the airport is examined by generation of an origin-destination matrix, and route choice between the airport and downtown Indianapolis is studied. Finally, the individual trips are scanned to identify trip chaining behavior. Several observations are made regarding these dynamics from the data. There is some sample bias (types of vehicles) and there are opportunities to further refine some of the land use definitions, but the study results suggest this type of data will provide a new frontier for characterizing travel demand patterns at a variety of scales.
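
The waypoint-to-area matching and origin-destination aggregation described in this abstract can be sketched with a ray-casting point-in-polygon test and a counter. The zone polygons and trips below are invented coordinates, not the study's data.

```python
from collections import Counter

# Sketch of waypoint-to-area matching and O-D aggregation: a ray-casting
# point-in-polygon test assigns a trip's first and last waypoint to zones,
# and a counter builds the origin-destination matrix. The zone polygons and
# trips are invented coordinates, not the study data.

def point_in_polygon(x: float, y: float, poly) -> bool:
    inside = False
    j = len(poly) - 1
    for i in range(len(poly)):
        xi, yi = poly[i]
        xj, yj = poly[j]
        if (yi > y) != (yj > y) and x < (xj - xi) * (y - yi) / (yj - yi) + xi:
            inside = not inside
        j = i
    return inside

ZONES = {  # zone name -> polygon vertices (lon, lat)
    "airport":  [(-86.32, 39.70), (-86.26, 39.70), (-86.26, 39.75), (-86.32, 39.75)],
    "downtown": [(-86.18, 39.75), (-86.14, 39.75), (-86.14, 39.79), (-86.18, 39.79)],
}

def zone_of(lon: float, lat: float) -> str:
    for name, poly in ZONES.items():
        if point_in_polygon(lon, lat, poly):
            return name
    return "other"

trips = [  # each trip: list of (lon, lat) waypoints
    [(-86.30, 39.72), (-86.22, 39.74), (-86.16, 39.77)],
    [(-86.15, 39.78), (-86.29, 39.71)],
]

od_matrix = Counter((zone_of(*trip[0]), zone_of(*trip[-1])) for trip in trips)
print(od_matrix)   # Counter({('airport', 'downtown'): 1, ('downtown', 'airport'): 1})
```
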