Chapter 8: Validating and Cleaning Data
- Data errors occur when data values are not appropriate for the SAS statements that are specified in a program. SAS detects data errors during program execution.
- The `freq` procedure can show if any genders are not `F` or `M` and if any countries are not `AU` or `US`.
- The `means` procedure can show if any salaries are not in the range of 24000 to 500000.
- The `univariate` procedure can show if any salaries are not in the range of 24000 to 500000.

```sas
data work.nonsales;
   /* Salary needs its own numeric length; otherwise it inherits $ 25 from Job_Title */
   length Employee_ID 8 First $ 12
          Last $ 18 Gender $ 1
          Salary 8 Job_Title $ 25
          Country $ 2 Birth_Date
          Hire_Date 8;
   infile 'nonsales.csv' dlm=',';
   input Employee_ID First $ Last $
         Gender $ Salary Job_Title $
         Country $ Birth_Date :date9.
         Hire_Date :date9.;
   format Birth_Date Hire_Date ddmmyy10.;
run;

/* rows with a missing job title or a birth date after the hire date */
proc print data=work.nonsales;
   var Employee_ID Job_Title Birth_Date Hire_Date;
   where Job_Title = ' ' or Birth_Date > Hire_Date;
run;

/* unexpected Gender or Country values show up as extra rows */
proc freq data=work.nonsales;
   tables Gender Country;
run;

/* min and max reveal salaries outside the expected range */
proc means data=work.nonsales n nmiss min max;
   var Salary;
run;

/* the extreme observations table lists the lowest and highest salaries */
proc univariate data=work.nonsales;
   var Salary;
run;
```
- During the processing of every `data` step, SAS automatically creates the following temporary variables:
  - `_N_`, which counts the number of times the `data` step begins to iterate.
  - `_ERROR_`, which signals the occurrence of an error caused by the data during execution. `0` indicates that no error exists; `1` indicates that one or more errors occurred.
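A minimal sketch of how the two automatic variables behave, assuming a small made-up inline data set (the `work.demo` name and the three data lines are hypothetical, not part of the course data). The second line holds a non-numeric salary, so SAS sets `_ERROR_` to 1 on that iteration and the `putlog` statement reports which iteration it was:

```sas
data work.demo;
   input ID Salary;
   /* _ERROR_ is 1 only on iterations where the data caused an error */
   if _ERROR_ = 1 then putlog 'Invalid data found: ' _N_= ID=;
   datalines;
1 26960
2 abc
3 24015
;
run;
```

Neither `_N_` nor `_ERROR_` is written to the output data set; they exist only while the `data` step runs.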
- Which statement best describes the invalid data? Answer: b.
  - a. The data in the raw data file is bad.
  - b. The programmer incorrectly read the data.
- To write a SAS date constant, enclose a date in quotation marks in the form `ddmmmyyyy` (two-digit day, three-letter month abbreviation, four-digit year) and immediately follow the final quotation mark with the letter `d`. Example: January 1, 1974 is `'01JAN1974'd`.

```sas
proc print data=orion.nonsales;
   var Employee_ID Birth_Date Hire_Date;
   where Hire_Date < '01JAN1974'd;
run;
```
- The `freq` procedure produces one-way to n-way frequency tables.
  - The `tables` statement specifies the frequency tables to produce. Without it, `proc freq` produces a one-way frequency table for every variable in the data set.
  - The `nlevels` option displays a table that provides the number of distinct values for each variable named in the `tables` statement.

```sas
proc freq data=orion.nonsales nlevels;
   tables Gender Country Employee_ID;
run;
```
- The `means` procedure produces summary reports displaying descriptive statistics.
  - The `var` statement specifies the analysis variables and their order in the results.
  - By default, the `means` procedure creates a report with `N`, `mean`, `stddev`, `min`, and `max`.

```sas
proc means data=orion.nonsales n nmiss min max;
   var Salary;
run;
```

- The `univariate` procedure produces summary reports displaying descriptive statistics.
  - The `var` statement specifies the analysis variables and their order in the results.
  - Without the `var` statement, SAS analyzes all numeric variables.

```sas
proc univariate data=orion.nonsales;
   var Salary;
run;
```
- Interactively cleaning data: the `Viewtable` window enables you to browse, edit, or create SAS data sets interactively.
- Programmatically cleaning data: the `data` step can be used to programmatically clean the invalid data.
- The assignment statement evaluates an expression and assigns the resulting value to a variable. Its general form is `variable = expression;`, for example:

```sas
Salary = 26960;
Hire_Date = '21JAN1995'd;
Country = upcase(Country);
```
- The `if-then-else` statement executes a SAS statement for observations that meet specific conditions.

```sas
data work.clean;
   set orion.nonsales;
   Country=upcase(Country);
   if Employee_ID=120106 then Salary=26960;
   else if Employee_ID=120115 then Salary=26500;
   else if Employee_ID=120191 then Salary=24015;
   else if Employee_ID=120107 then Hire_Date='21JAN1995'd;
   else if Employee_ID=120111 then Hire_Date='01NOV1978'd;
   else if Employee_ID=121011 then Hire_Date='01JAN1998'd;
run;
```
- What are the two phases of DATA step processing? Compilation and execution.
- What is a program data vector (PDV)? A logical area in memory where SAS holds the current observation.
- What is an instruction that SAS uses to read data values into a variable? An informat.
- When would you use a `:` modifier? With nonstandard raw data that requires list input and an informat.
Chapter 9: Manipulating Data
- If an operand is missing for an arithmetic operator, the result is missing. Example: if `var1 = .` and `var2 = 10`, then after `num = var1 + var2 / 2`, `num` is `.` (missing).
- `sum`: returns the sum of all arguments, ignoring missing values.
- `year`, `qtr`, `month`, `day`, `weekday`: extract pieces from a SAS date.
- `today()`: returns the current date as a SAS date value.
- `mdy(month, day, year)`: returns a SAS date value.

```sas
AnnivBonus=mdy(month(Hire_Date),15,2008);
```
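The difference between an arithmetic operator and the `sum` function when an operand is missing can be checked with a short sketch. The variable `total` is a hypothetical name introduced only for this comparison; `data _null_` runs the step without creating a data set.

```sas
data _null_;
   var1 = .;
   var2 = 10;
   /* the + operator propagates the missing value */
   num = var1 + var2 / 2;
   /* the sum function ignores missing arguments */
   total = sum(var1, var2 / 2);
   putlog num= total=;   /* writes num=. total=5 to the log */
run;
```

This is why the `Compensation=sum(Salary,Bonus);` form is safer than `Salary + Bonus` when either value may be missing.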
- Given the following code, are the correct results produced when the `drop` statement is placed after the `set` statement?

```sas
data work.comp;
   set orion.sales;
   drop Gender Salary Job_Title Country Birth_Date Hire_Date;
   Bonus=500;
   Compensation=sum(Salary,Bonus);
   BonusMonth=month(Hire_Date);
run;
```

  - Yes. The `drop` statement only specifies the names of the variables to omit from the output data set, so `Salary` and `Hire_Date` are still available for the calculations wherever the statement is placed.
- The `drop` and `keep` statements select variables after they are brought into the program data vector.
- Alternatives to the `drop` and `keep` statements are the `drop=` and `keep=` data set options, placed in parentheses after a data set name in the `data` or `set` statement.

```sas
data work.comp(drop=Salary Hire_Date);
   set orion.sales(keep=Employee_ID First_Name Last_Name Salary Hire_Date);
   Bonus=500;
   Compensation=sum(Salary,Bonus);
   BonusMonth=month(Hire_Date);
run;
```
- Multiple executable statements are allowed in `if-then do / else do ... end` statements.

```sas
data work.bonus;
   set orion.sales;
   length Freq $ 12;
   if Country='US' then do;
      Bonus=500;
      Freq='Once a Year';
   end;
   else do;
      Bonus=300;
      Freq='Twice a Year';
   end;
run;
```
- `if-then delete`: an alternative to the subsetting `if` statement is the `delete` statement in an `if-then` statement. `if BonusMonth ne 12 then delete;` is equivalent to `if BonusMonth = 12;`.
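A minimal sketch of the `delete` form inside a complete `data` step, assuming the `orion.sales` data set used earlier in these notes; the output name `work.december` is hypothetical:

```sas
data work.december;
   set orion.sales;
   BonusMonth = month(Hire_Date);
   /* drop every observation whose hire month is not December */
   if BonusMonth ne 12 then delete;
   /* same effect with a subsetting if:  if BonusMonth = 12; */
run;
```

In both forms, SAS stops processing the current observation, does not write it to the output data set, and returns to the top of the `data` step.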
Chapter 10: Combining SAS Data Sets