cd "C:\Users\Lenovo\Documents\LUMS\Fall 2022\Econometrics\01 dta"
use "roster"
*opens the roster dataset. You can also directly click the dta file or
*write the full path after use.
tab age
set more off
*Disables the annoying "more" option displayed everytime you run a command that obtains results so long, they can't fit in the screen all at
once.
*Numeric variables are stored as byte, int, long, float or double.
*Byte, int and long can only hold integers
*String variables are stored as str9, str13 etc.
*String variables can hold non-numeric characters as well. No mathematical operation can be formed on them
tab age
summarize age
drop if age==2018
*drops observations in which age=2018
tab sb1q4
tab sb1q4, nolab
*Or
label list sb1q4
*finding the value labels of gender
drop if sb1q4==0
summarize age if sb1q4==2
*restricting the observations for which the age variable is being summarized,
*to only female respondents (compare female age stats with overall stats)
count if sb1q4==1 & age<18
*211,185 respondents are male and below 18
*TASK: How many respondents are above or equal to 18 and either separated or
*divorced?
count if age>=18 & (sb1q7==4 | sb1q7==5)
*2178 respondents above or equal to 18 are either separated or divorced.
sum age if sb1q7==2
tab sb1q7 sb1q4
*Two-way tabulation. Breakdown of respondents in each marital status category by gender, or, breakdown of male and female respondents by
the marital status category (depends if you are following values
*horizontally or vertically). Try the "column", "row", "column nofreq" and "row nofreq" options with the tab command (feel free to use the help
file to see how to use these options)
*and see how you can enhance the results obtained from this two-way tabulation command.
cd "C:\Users\Lenovo\Documents\LUMS\Fall 2022\Econometrics\01 dta"
use "roster"
set more off
drop if age==2018
*drops observations in which age=2018
preserve
drop sb1q62
restore
*restores the dataset and undo any changes you made
codebook age
*provides more detailed summary statistics
*TASK: How many districts in this dataset?
codebook district
list age in 1/5
*lists the first 5 observations
ed
*saving dataset
save "roster_new", replace
/*Data Cleaning involves:
1) Renaming Variables
2) Assigning/Changing Variable Labels
3) Defining Value Labels
4) Generating New Variables
5) Creating smaller subsets of data
6) Merging Datasets
7) Reshaping Datasets
8) Removing special characters from string variables
9) Converting variables from string to numeric or vice versa
10)Recoding Missing Values
*/
*To rename a variable:
rename sb1q4 gender
*TASK: Rename the variable sb1q11 to HHmember
rename sb1q11 HHmember
*To assign a variable label:
label variable hhcode "Household Code"
*TASK: Assign the variable label "Primary Sampling Unit" to the psu variable
label variable psu "Primary Sampling Unit"
*Changing Variable Label
label variable age
label variable age "Age"
*Recoding Missing Values
tab gender
replace gender=. if gender==0
tab gender
tab gender, missing
tab gender, nolab missing
*TASK: In the marital status variable, replace the 0 values with missing values
replace sb1q7=. if sb1q7==0
cd "C:\Users\Lenovo\Documents\LUMS\Fall 2022\Econometrics\01 dta"
use "roster"
set more off
rename sb1q4 gender
replace gender=. if gender==0
*Generating Variables and Assigning Value Labels
gen female=0
replace female=1 if gender==2
replace female=. if gender==.
/*We can now assign value labels to the female variable through the following
two-step procedure:
1) Define a value label
2) Assign that value label to our variable of interest
*/
tab female
lab define gender 0"Male" 1"Female"
lab val female gender
tab female
label list gender
*TASK: Now generate a variable named "Male". It should take the value 1 for
*male respondents and 0 for female respondents. Assign value labels to this
*variable.
*To verify that you have generated the correct binary variable for male:
tab gender male
*Changing Value Labels
tab sb1q11
tab sb1q11, nolab
lab val sb1q11
tab sb1q11
recode sb1q11 (2=0)
tab sb1q11
lab def hhmember 0"No" 1"Yes"
lab val sb1q11 hhmember
tab sb1q11
*Conversion from numeric to string
decode sb1q7, gen(marital)
tostring age, gen(age_string)
tostring sb1q4, gen(gender)
cd "C:\Users\Lenovo\Documents\LUMS\Fall 2022\Econometrics\01 dta"
*Importing from excel
import excel "C:\Users\Lenovo\Documents\Job Applications\LUMS\Spring 2023\Econometrics\roster_subset.xlsx", sheet("Sheet1") firstrow
*Generating variables from string variables
describe sb1q7
gen divorced=0
replace divorced=1 if sb1q7=="divorced"
replace divorced=. if sb1q7==""
tab divorced
*TASK: Generate a binary variable and name it "married". It should take the
*value 1 if the respondent is currently married and 0 otherwise.
*Convert a string variable to a numeric categorical variable
encode sb1q7, generate(marital)
*TASK: Convert sb1q4 from string to numeric
encode sb1q4, gen(gender)
use "[Link]", clear
*Convert province from numeric to string
decode province, gen(province_string)
*upper(), lower() and proper functions
*To make all the characters into uppercase
replace province_string = upper(province_string)
*To make all the characters into lowercase
replace province_string = lower(province_string)
*To make all just the first character into uppercase
replace province_string = proper(province_string)
*TASK: Convert the district variable from numeric to string. Transform the
*variable such that all the characters are in uppercase.
di subinstr("Ec00nometrics","00","o",.)
di subinstr("IceCCCCCCCCCCream","C","",.)
di subinstr("Iceream","er","ecr",.)
replace province_string = subinstr(province_string,"a","@",.)
***Replace @ with a
replace province_string = subinstr(province_string,"@","a",.)
di substr("Econometrics",1,3)
di substr("Econometrics",6,.)
di substr("Econometrics",-3,.)
gen prov_code = substr(province_string,1,3)
*generates the first three characters from the province_string variable