Indira Gandhi National Tribal University, Amarkantak

Prof. Ram Dayal Munda Central Library

Online Public Access Catalogue

Statistical data cleaning with applications in R / Edwin de Jonge, Mark van der Loo.

By: de Jonge, Edwin
Contributor(s): van der Loo, Mark
Material type: Text
Publisher: [Hoboken, NJ] : John Wiley & Sons Ltd, [2018]
Description: 1 online resource
Content type:
  • text
Media type:
  • computer
Carrier type:
  • online resource
ISBN:
  • 9781118897140
  • 1118897145
  • 9781118897126
  • 1118897129
Additional physical formats: Print version: Statistical data cleaning with applications in R
DDC classification:
  • 519.50285/5133 23
LOC classification:
  • QA276.45.R3 J66 2018
Contents:
Cover -- Title Page -- Copyright -- Contents -- Foreword -- About the Companion Website -- Chapter 1 Data Cleaning -- 1.1 The Statistical Value Chain -- 1.1.1 Raw Data -- 1.1.2 Input Data -- 1.1.3 Valid Data -- 1.1.4 Statistics -- 1.1.5 Output -- 1.2 Notation and Conventions Used in this Book -- Chapter 2 A Brief Introduction to R -- 2.1 R on the Command Line -- 2.1.1 Getting Help and Learning R -- 2.2 Vectors -- 2.2.1 Computing with Vectors -- 2.2.2 Arrays and Matrices -- 2.3 Data Frames -- 2.3.1 The Formula-Data Interface -- 2.3.2 Selecting Rows and Columns -- Boolean Operators -- 2.3.3 Selection with Indices -- 2.3.4 Data Frame Manipulation: The dplyr Package -- 2.4 Special Values -- 2.4.1 Missing Values -- 2.5 Getting Data into and out of R -- 2.5.1 File Paths in R -- 2.5.2 Formats Provided by Packages -- 2.5.3 Reading Data from a Database -- 2.5.4 Working with Data External to R -- 2.6 Functions -- 2.6.1 Using Functions -- 2.6.2 Writing Functions -- 2.7 Packages Used in this Book -- Chapter 3 Technical Representation of Data -- 3.1 Numeric Data -- 3.1.1 Integers -- 3.1.2 Integers in R -- 3.1.3 Real Numbers -- 3.1.4 Double Precision Numbers -- 3.1.5 The Concept of Machine Precision -- 3.1.6 Consequences of Working with Floating Point Numbers -- 3.1.7 Dealing with the Consequences -- 3.1.8 Numeric Data in R -- 3.2 Text Data -- 3.2.1 Terminology and Encodings -- 3.2.2 Unicode -- 3.2.3 Some Popular Encodings -- 3.2.4 Textual Data in R: Objects of Class Character -- 3.2.5 Encoding in R -- 3.2.6 Reading and Writing of Data with Non-Local Encoding -- 3.2.7 Detecting Encoding -- 3.2.8 Collation and Sorting -- 3.3 Times and Dates -- 3.3.1 AIT, UTC, and POSIX Seconds Since the Epoch -- 3.3.2 Time and Date Notation -- 3.3.3 Time and Date Storage in R -- 3.3.4 Time and Date Conversion in R -- 3.3.5 Leap Days, Time Zones, and Daylight Saving Times.
3.4 Notes on Locale Settings -- Chapter 4 Data Structure -- 4.1 Introduction -- 4.2 Tabular Data -- 4.2.1 data.frame -- 4.2.2 Databases -- 4.2.3 dplyr -- 4.3 Matrix Data -- 4.4 Time Series -- 4.5 Graph Data -- 4.6 Web Data -- 4.6.1 Web Scraping -- 4.6.2 Web API -- 4.7 Other Data -- 4.8 Tidying Tabular Data -- 4.8.1 Variable Per Column -- 4.8.2 Single Observation Stored in Multiple Tables -- Chapter 5 Cleaning Text Data -- 5.1 Character Normalization -- 5.1.1 Encoding Conversion and Unicode Normalization -- 5.1.2 Character Conversion and Transliteration -- 5.2 Pattern Matching with Regular Expressions -- 5.2.1 Basic Regular Expressions -- 5.2.2 Practical Regular Expressions -- 5.2.3 Generating Regular Expressions in R -- 5.3 Common String Processing Tasks in R -- 5.4 Approximate Text Matching -- 5.4.1 String Metrics -- 5.4.2 String Metrics and Approximate Text Matching in R -- Chapter 6 Data Validation -- 6.1 Introduction -- 6.2 A First Look at the validate Package -- 6.2.1 Quick Checks with check_that -- 6.2.2 The Basic Workflow: validator and confront -- 6.2.3 A Little Background on validate and DSLs -- 6.3 Defining Data Validation -- 6.3.1 Formal Definition of Data Validation -- 6.3.2 Operations on Validation Functions -- 6.3.3 Validation and Missing Values -- 6.3.4 Structure of Validation Functions -- 6.3.5 Demarcating Validation Rules in validate -- 6.4 A Formal Typology of Data Validation Functions -- 6.4.1 A Closer Look at Measurement -- 6.4.2 Classification of Validation Rules -- 6.5 Validating Data with the validate Package -- 6.5.1 Validation Rules in the Console and the validator Object -- 6.5.2 Validating in the Pipeline -- 6.5.3 Raising Errors or Warnings -- 6.5.4 Tolerance for Testing Linear Equalities -- 6.5.5 Setting and Resetting Options -- 6.5.6 Importing and Exporting Validation Rules from and to File.
6.5.7 Checking Variable Types and Metadata -- 6.5.8 Checking Value Ranges and Code Lists -- 6.5.9 Checking In-Record Consistency Rules -- 6.5.10 Checking Cross-Record Validation Rules -- 6.5.11 Checking Functional Dependencies -- 6.5.12 Cross-Dataset Validation -- 6.5.13 Macros, Variable Groups, Keys -- 6.5.14 Analyzing Output: validation Objects -- 6.5.15 Output Dimensionality and Output Selection -- 6.5.15 Exercises for Section -- Chapter 7 Localizing Errors in Data Records -- 7.1 Error Localization -- 7.2 Error Localization with R -- 7.2.1 The errorlocate Package -- 7.3 Error Localization as MIP-Problem -- 7.3.1 Error Localization and Mixed-Integer Programming -- 7.3.2 Linear Restrictions -- 7.3.3 Categorical Restrictions -- 7.3.4 Mixed-Type Restrictions -- 7.4 Numerical Stability Issues -- 7.4.1 A Short Overview of MIP Solving -- 7.4.2 Scaling Numerical Records -- 7.4.3 Setting Numerical Threshold Values -- 7.5 Practical Issues -- 7.5.1 Setting Reliability Weights -- 7.5.2 Simplifying Conditional Validation Rules -- 7.6 Conclusion -- Chapter 8 Rule Set Maintenance and Simplification -- 8.1 Quality of Validation Rules -- 8.1.1 Completeness -- 8.1.2 Superfluous Rules and Infeasibility -- 8.2 Rules in the Language of Logic -- 8.2.1 Using Logic to Rewrite Rules -- 8.3 Rule Set Issues -- 8.3.1 Infeasible Rule Set -- 8.3.2 Fixed Value -- 8.3.3 Redundant Rule -- 8.3.4 Nonrelaxing Clause -- 8.3.5 Nonconstraining Clause -- 8.4 Detection and Simplification Procedure -- 8.4.1 Mixed-Integer Programming -- 8.4.2 Detecting Feasibility -- 8.4.3 Finding Rules Causing Infeasibility -- 8.4.4 Detecting Conflicting Rules -- 8.4.5 Detect Partial Infeasibility -- 8.4.6 Detect Fixed Values -- 8.4.7 Detect Nonrelaxing Clauses -- 8.4.8 Detect Nonconstraining Clauses -- 8.4.9 Detect Redundant Rules -- 8.5 Conclusion.
Chapter 9 Methods Based on Models for Domain Knowledge -- 9.1 Correction with Data Modifying Rules -- 9.1.1 Modifying Functions -- 9.1.2 A Class of Modifying Functions on Numerical Data -- 9.1.2 Exercises for Section -- 9.2 Rule-Based Correction with dcmodify -- 9.2.1 Reading Rules from File -- 9.2.2 Modifying Rule Syntax -- 9.2.3 Missing Values -- 9.2.4 Sequential and Sequence-Independent Execution -- 9.2.5 Options Settings Management -- 9.3 Deductive Correction -- 9.3.1 Correcting Typing Errors in Numeric Data -- 9.3.1 Exercises for Section -- 9.3.2 Deductive Imputation Using Linear Restrictions -- Chapter 10 Imputation and Adjustment -- 10.1 Missing Data -- 10.1.1 Missing Data Mechanisms -- 10.1.2 Visualizing and Testing for Patterns in Missing Data Using R -- 10.2 Model-Based Imputation -- 10.3 Model-Based Imputation in R -- 10.3.1 Specifying Imputation Methods with simputation -- 10.3.2 Linear Regression-Based Imputation -- 10.3.3 M-Estimation -- 10.3.4 Lasso, Ridge, and Elasticnet Regression -- 10.3.5 Classification and Regression Trees -- 10.3.6 Random Forest -- 10.4 Donor Imputation with R -- 10.4.1 Random and Sequential Hot Deck Imputation -- 10.4.2 k Nearest Neighbors and Predictive Mean Matching -- 10.5 Other Methods in the simputation Package -- 10.6 Imputation Based on the EM Algorithm -- 10.6.1 The EM Algorithm -- 10.6.2 EM Imputation Assuming the Multivariate Normal Distribution -- 10.7 Sampling Variance under Imputation -- 10.8 Multiple Imputations -- 10.8.1 Multiple Imputation Based on the EM Algorithm -- 10.8.2 The Amelia Package -- 10.8.3 Multivariate Imputation with Chained Equations (Mice) -- 10.8.4 Imputation with the mice Package -- 10.9 Analytic Approaches to Estimate Variance of Imputation -- 10.9.1 Imputation as Part of the Estimator -- 10.10 Choosing an Imputation Method -- 10.11 Constraint Value Adjustment.
10.11.1 Formal Description -- 10.11.2 Application to Imputed Data -- 10.11.3 Adjusting Imputed Values with the rspa Package -- Chapter 11 Example: A Small Data-Cleaning System -- 11.1 Setup -- 11.1.1 Deterministic Methods -- 11.1.2 Error Localization -- 11.1.3 Imputation -- 11.1.4 Adjusting Imputed Data -- 11.2 Monitoring Changes in Data -- 11.2.1 Data Diff (daff) -- 11.2.2 Summarizing Cell Changes -- 11.2.3 Summarizing Changes in Conformance to Validation Rules -- 11.2.4 Track Changes in Data Automatically with lumberjack -- 11.3 Integration and Automation -- 11.3.1 Using Rscript -- 11.3.2 The docopt Package -- 11.3.3 Automated Data Cleaning -- References -- Index -- EULA.
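
The Chapter 6 entries above name the basic validation workflow of defining rules with validator() and applying them with confront(), both from the validate package. The following is a minimal sketch of that workflow; the toy data frame and the rules themselves are illustrative assumptions, not examples taken from the book.

    # Define a small rule set and confront a toy data frame with it.
    library(validate)

    dat <- data.frame(age = c(21, -3, 45), height = c(1.80, 1.65, NA))

    rules <- validator(
      age >= 0,        # ages cannot be negative
      height > 0,      # heights must be positive
      !is.na(height)   # heights must be recorded
    )

    cf <- confront(dat, rules)  # evaluate every rule on every record
    summary(cf)                 # per-rule counts of passes, fails, and NAs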
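
Chapter 7 above covers the errorlocate package, which treats error localization over a set of validation rules as a mixed-integer programming problem. Below is a minimal sketch of how such a call might look; the rules and data are illustrative assumptions, not taken from the book.

    # Locate the fields that most plausibly cause the rule violations,
    # then blank them out so they can be imputed in a later step.
    library(validate)
    library(errorlocate)

    rules <- validator(profit == turnover - cost, cost >= 0)
    dat   <- data.frame(turnover = 100, cost = 20, profit = 50)

    located <- locate_errors(dat, rules)   # flags the cells to change
    cleaned <- replace_errors(dat, rules)  # located cells are set to NA
    cleaned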
Summary: A comprehensive guide to automated statistical data cleaning. The production of clean data is a complex and time-consuming process that requires both technical know-how and statistical expertise. Statistical Data Cleaning brings together a wide range of techniques for cleaning textual, numeric, or categorical data. This book examines technical data cleaning methods relating to data representation and data structure. A prominent role is given to statistical data validation, data cleaning based on predefined restrictions, and data cleaning strategy. Key features:
  • Focuses on the automation of data cleaning methods, including both theory and applications written in R.
  • Enables the reader to design data cleaning processes for either one-off analytical purposes or for setting up production systems that clean data on a regular basis.
  • Explores statistical techniques for solving issues such as incompleteness, contradictions and outliers, integration of data cleaning components, and quality monitoring.
  • Supported by an accompanying website featuring data and R code.
This book enables data scientists and statistical analysts working with data to deepen their understanding of data cleaning as well as to upgrade their practical data cleaning skills. It can also be used as material for a course in data cleaning and analysis.
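
The summary lists incompleteness among the issues the book addresses, and Chapter 10 of the contents covers model-based imputation with the simputation package. The sketch below shows one such call; the data and the model formula are illustrative assumptions.

    # Fill in missing values of y using a linear model fitted on the
    # observed records; impute_lm() only touches the missing cells.
    library(simputation)

    dat <- data.frame(x = c(1, 2, 3, 4, 5),
                      y = c(2.1, NA, 6.3, NA, 10.2))

    imputed <- impute_lm(dat, y ~ x)
    imputed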
No physical items for this record

Includes bibliographical references and index.

Online resource; title from digital title page (viewed on June 13, 2018).
