Stata Training: Data Cleaning, Management and Manipulation
Objectives
Participants should have a basic working knowledge of how to use the Stata program prior to starting this training. This includes general program use and how to create do files.
By the end of this training, participants will:
Develop the ability to independently learn how to use commands in Stata using help files
Understand the basic steps required for data cleaning, manipulation and management
Create a data cleaning, manipulation and management do file
Understand the importance of proper data cleaning, manipulation and management prior to statistical analysis
Learn how to export Stata datasets to excel
Become at least a level 2 data wizard
Introduction
This training will teach you how to clean, manipulate and manage data like a wizard. Importantly, the suggested code/commands are just one way to solve each problem, not the only way, you may find another way which gives the same results, and you prefer. This method will improve your ability to work in Stata autonomously.
The framework and principles used in this training can be broadly applied to most datasets you encounter. However, each dataset is unique and will present its own challenges. The principles outlined here can be transferred to other statistical data programs such as R.
This training does not cover any graphical outputs. Stata outputs are often not of publication quality, but are very helpful for data exploration. Final published tables and graphs are recommended to be made in other programmes such as Microsoft Excel, or R.
Throughout this document, Stata code will be formatted like this
. Within the template Stata do file, where you see *T#
at the start of a line, I've put tasks for you solve; where you see *i
, I've put information and hints.
The help
command displays help information about the specified command or topic. For each task, you'll be provided with the key Stata commands you'll want to use, but you have to figure out how to use the command and its syntax properly using the help files. The help command does not execute the command, it simply teaches you how to use that specific command. In the help file, you'll see the command name, sometimes with part of it underlined. This is the short form of the command; eg. for display the short form is di
.
If you ever want to use the command line as a calculator, or test out a string, use di
. This will print out the result of your calculation or string function.
help help
help di
*i paste the two lines below in the command line in Stata
di "this will print out in the display"
di 2+2
The dataset
This practice dataset is simulated and does not contain any actual patient or laboratory data.
These simulated data contain routine laboratory surveillance data of incident cases of bacterial bloodstream infections in England from 2015 to 2017. There is only one record per infection episode; however, an individual can have multiple infection episodes over time. Each infection episode is considered a new infection.
Table 4. The core dataset with variables names and contents.
Variable | Contents |
---|---|
healthcare_id |
a unique healthcare ID number per patient |
hospital_id |
a hospital based ID number per patient |
sex |
the sex of the individual |
soundex |
an anonymised surname field for a patient |
birth_date |
patient date of birth |
lab |
a unique ID number for the lab which supplied the data |
specimen_id |
an ID number created by the lab for the blood specimen |
specimen_date |
the date the specimen was taken |
species |
the species isolated from the blood culture |
area_name |
the region in which this infection occurred |
area_code |
the code for the region |
The required files for this training are already in the above format. Click here to download the training files project directory
Extract the contents of the zip file to your chosen project directory. The project directory is wherever you choose to save contents of the unzipped folder. Open the do file code\StataTraining_dCMM.do
in Stata (v12.0 or higher required). Complete your code in the provided template do file.
Stata basics
Coding best practice
Always code in a do file. Do files enable you to save and script a logical flow of commands to run on your data. This is how you will ensure reproducibility in your work, for yourself and others. Within a do file, you can leave comments. These will make your life easier.
Comments work in Stata by either putting a *
at the beginning of the line, or //
after code, see the example below. Alternatively, to comment a block of code over several lines use /* */
, everything between the *
will be commented. Comments allow you to annotate in your do file, and will not be read as code, even if there is actual code within the comment. This means you can use comments to "silence" code that you're still working on or testing.
- write lots of comments so you remember what and why you've done something
- leave a header comment at the top of your file containing the following:
- project title
- description outlining the overall purpose of the code
- your name
- date the code was last edited
- look at the template file for an example of what to include in your header
- within the code itself, leave comments for what you're trying to do, sometimes you'll come back to it with a new improved method
- separate out your code into sections, put a comment line to demonstrate that everything in that section relates to the same purpose eg. data import, data cleaning, data management
- leave a header comment at the top of your file containing the following:
- use tabs to indent code, indented code implies that it relates to the code above, this makes code much easier to read; for example: the replace lines are reliant on the results of the gen line
- this has no impact on how to code functions, but simply makes your code neater, and means it's easier to read and edit for both you and others
*i training program based on location
gen program="EPIET" // creating a new categorical variable
replace program="UK FETP" if country=="ENGLAND"
replace program="PAE" if country=="GERMANY"
- make your code as simple and efficient as possible, you'll always learn better ways in the future, thats okay, but never make your code more complicated than it needs to be
- use a system to help with saving and version control, if your code is sensitive and cannot be shared outside the organisation check if your organisation has an internal git page (PHE does)
- this will allow better control of versions in your code, including a history, and better facilitates sharing of code
- never delete old methodology/completed code, you may have been right the first time, another reason version control is helpful
- save your working files (
.dta
) at each stage, eg. raw import, cleaned, ready for analysis - if your file has become too long to manage, perhaps its better to split it over two or more do files
- this is often very helpful when its a complicated process and you want to focus on one stage per file or are troubleshooting why a process isn't working as expected
- remember: you can run a do file as a command (within another do file)
- this is often very helpful when its a complicated process and you want to focus on one stage per file or are troubleshooting why a process isn't working as expected
It's not just me recommending this. I (and others) say this for your own good.
Stata Data types
Stata has two data types. String and numeric. The way Stata deals with each is different.
Table 1. Differences between numeric and string data in Stata
numeric | strings | |
---|---|---|
contains | any real number (+ / -) | any characters |
example value | 3.28 |
"ALEX" |
missing value | full stop: . |
blank: "" |
functions | evaluate magnitude (>= / <=) add, subtract, multiply, divide |
string functions and regular expressions |
example function | di 3*28 |
di substr("ALEX",1,1) |
You cannot combine a string and a numeric variable. Likewise, the missing characters are specific to the data type. Attempting either will result in an error message. You can always convert a number to a string, but you cannot always convert a string to a number.
Note that all dates, even when they're formatted in Stata format are still numeric. Likewise, you can label number values. This will display a string value instead of the numeric value, but like dates, the actual data is still a number. We'll get to this later.
When working with strings, the string value must be in double quotes "string goes here"
. When using the string variable name, it does not require quotes.
+---------------+
| year colour |
|---------------|
1. | 2017 red |
2. | 2016 blue |
3. | 2015 green |
4. | 2014 yellow |
+---------------+
gen x=year+colour // you can't add a number and string
type mismatch // this is the error message
r(109);
Stata commands
As a reference, Stata commands always follow this general format. The commands have four parts:
bysort [varlist]: command [if] [, options]
Table 2. Stata command parts.
Stata code | What is it for |
---|---|
bysort [varlist]: |
You are telling the program to run the command, but grouping your data according to the listed variables |
command |
This is the main command, the functions you will use in Stata; each command has a help file which tells you how to use that specific command |
[if] |
These allow you to selectively run a command on a subset of data based on the evaluation of certain criteria or an expression |
[,options] |
Almost every command in Stata has options, which allow you to change how to main command works, this may change the functionality, or simply just display extra information |
To run code in Stata, only the command is required, the other three parts are optional, however you will use them extensively. The help file for every Stata command will tell you exactly how to use that command, including its syntax and if it allows if-statements and what its options are.
Operators
Operators are the commands which you can use to evaluate expressions. Like math, an expression is a line of code which will be evaluated for a result. Remember your order of operations from math, same order applies (Brackets, Exponents, Multiplication/Division,Addition/Subtraction). But, we also have AND and OR.
Table 3. Operators in Stata.
Stata code | Operator | Example | Purpose |
---|---|---|---|
= |
Assignment | gen x=1 |
only used when generating a new variable |
== |
Equals to | gen x=1 if 2==2 |
assess if left side equals right side |
!= |
Not equal to | gen x=1 if 2!=3 |
assess if left side does not equal right side |
>= > <= < |
Greater than/ less than (or equal to) |
gen x=1 if 2<=3 |
as it sounds only works with numbers |
& |
AND | gen x=1 if y==2 & z==3 |
to evaluate 2+ things, BOTH side of & must be TRUE |
¦ (solid line pipe character)* |
OR | gen x=1 if y==2 ¦ z==2 |
to evaluate 2+ things, only ONE side of ¦ must be TRUE |
+ - * / ^ |
Add, subtract, multiply, divide, exponent |
gen x=1*3/3+2^2-4 |
math, order is important numbers only |
() |
Brackets | gen x=1 if y==2 ¦ (2^2==4 & a=="red") |
everything in the bracket must be considered, this allows you to group expressions |
* |
is used in Stata as the OR operator
Expressions
An expression is something you want to evaluate and has a TRUE/FALSE outcome. Operators which allow you to assess two things using == != >= > <= <
are required for expressions. Expressions can involve both strings and numbers and are often used in if-statements. These can be as simple as 1+1==2
or country=="ENGLAND"
, however they can become as complex as you need.
As illustrated in the table above, AND/OR differ from equals or not equals. For example, 2+2==4 & 2+2!=5
both sides of the &
are TRUE, and therefore the full expression is TRUE. Conversely, 5<4
is FALSE. Importantly, 2+2==4 | 5<4
is also TRUE, because when using OR, if one side of the |
is TRUE, then the entire expression evaluates as TRUE.
If evaluating an expression in Stata, a TRUE will be output as 1
and FALSE as 0
.
* example expression; copy into Stata to see outputs
di 1+1==2
di 1+1==3
* complex expresssions; is each one TRUE or FALSE
di 2+2==4 & 2+2!=5
di 2+2==4 | 5<4
di 2+2==4 & 5<4 | 2+2!=5
di 2+2==4 & 2+2==5 | 5<4 & 1+1==3
di 2+2==4 | 2+2==5 | 5<4 | 1+1==3
di 1+1==2 & (2+2==4 & (1+1==3 | 2+2!=5))
Don't worry if you don't fully understand all of this yet. You will.
Import Data
Lets start small. Here we are going to learn how to bring data from spreadsheets into Stata. This is the most common way that you will receive data, and is an essential skill.
TASK 1: set the working directory to your project folder
help cd
This should always be in the first set of commands within your do file. It is how you will navigate around the different files that you will be accessing or creating. It also means that if someone wants to use your code, and they have the files, they only have to change the directory here, and not every time a file is referenced within the code.
In Stata, you can only have one dataset in memory at a time. Before a new dataset can be loaded in, you will need to clear the memory. This can be done with either the first using the command clear
or using the ,clear
option when loading a new dataset.
Importing spreadsheets
Stata imports excel files [.xls
/.xlsx
] differently than text delimited files [.csv
/.txt
]. Read about the different syntax needed. Always spend a little time looking at your data, particularly when it's fragmented like this. How you look at your data will be a personal preference, however, there are several tools within Stata available to you to aid this exploration.
When you explore your data:
- take a look at the structure of the data.
- what are the names of the variables
- are they of the same type (string vs numeric)
- sometimes it can be helpful to import your data as all string as Stata and Excel sometimes play tricks with dates.
TASK 2: import the spreadsheets from the data folder into Stata without making any modifications in excel
TASK 3: explore the tables, note any differences and modify the tables so that they have the same structure, save each of your modified tables
We are going to import and explore each of the tables separately. Remember you'll need to clear
the memory each time. By saving your imported datasets as .dta
they can be safely cleared from the memory and brought back in later with all your changes and formatting. Recall that our project directory cd
is set to our current directory, but we want to save the file into our data
folder. Use data/filename
in with save command. We will have three separate .dta
files, each containing one year of data. Look back at the variable names listed in the Table 1 and what we learned previously about data types. Make the three tables consistent. We are not changing any of the data within the tables at this stage.
help import // read the import options
help describe // look at the data types
help codebook
help browse
help rename
help tostring // destring is the opposite
help save // read about the replace option
TASK 4: append the tables
help append // this is a top:bottom join, it adds observations into the dataset
We have 3 similar tables, but we want one. Join them top to bottom. Stata will report an error if you attempt to append a table where the variable types are different (eg. In one table healthcare_id
is string and in the other its numeric). Whichever data type was there first is what Stata will assume is correct. If you specify the option ,force
, it will result in all type-mismatched data being replaced with a missing value.
TASK 5: save your new appended dataset with all three years data in your data subfolder
help save
This is our base original dataset. We've made no real changes other than to amalgamate the data. Save it and don't change it.
Data cleaning
What is data cleaning
This is when you modify your data by changing the original values within the dataset. It is important to record any and all changes in the do file, to ensure that you and others can reproduce the results in the future. Always leave comments as to why changes were made. Remember, we never change the original dataset; we will save these changes as a new version of the dataset.
The purpose of data cleaning is to make our data as consistent as possible in preparation for analysis. This may include activities such as: formatting or recoding data, removing duplicate entries and accounting for missing data. This may also be referred to as having "tidy" data. Simply put, a clean dataset is easier to work with and a tidy dataset is easier to manage. Combined, having a a clean and tidy dataset will result in fewer mistakes in your code and therefore, your results.
In a clean/tidy dataset:
- each variable has its own column
- each observation of that variable is in a different row (this is long format data; we'll get to this later)
- if you have multiple related tables, they should include a column in the table that allows them to be linked
Remember, Stata can work with both numbers and strings. However, Stata codes a missing value differently depending on what format it's in. In numeric variables, missing values are represented with a full stop .
; for string variables, a missing value is represented with back-to-back double quotes ""
. If you have a choice, I recommend that variables containing all numbers are converted to numeric, and not kept as strings. They are easier to work with.
String functions and regular expressions
String variables in Stata can be messy and infuriating. As they can contain literally any character, they have have text, numbers or symbols, sometimes all three. Luckily, its okay.
String functions are your friend. These are a set of inbuilt commands within Stata to enable you to systematically work with strings. Using these, you can search part of a string, do replacements, or more simple things like change a string to be all upper case. You can use these edit strings, or to make evaluations in if-statements. Often when working with strings, its helpful to change them all or upper or lowercase as "CASE"!="case"
.
Remember, that unless you're putting a variable name in the str
position for these commands, it needs to be in ""
; di
is used to print out results in the display. Test out the example code in your command line.
Table 5. String functions and regular expressions in Stata.
Command | What does it do | Example | Result |
---|---|---|---|
length(str) |
returns number of characters in str |
di length("ALEX") |
4 |
word(str,n) |
returns nth word from str (if n<0 starts counting from right) |
di word("ALEX SAID LEARN STATA",2) |
SAID |
reverse(str) |
returns str reversed |
di reverse("LEARN") |
NRAEL |
trim(str) |
removes spaces on beginning and end of str |
di trim(" STATA ") |
STATA |
lower(str) |
returns str without lowercase letters only |
di lower("STRING") |
string |
upper(str) |
converts all letters to uppercase | di upper("functions") |
FUNCTIONS |
proper(str) |
capitalizes all letters not preceded by letters | di proper("for real") |
For Real |
strpos(str1,str2) |
returns position of str2 in str1 |
di strpos("STRINGS","R") |
3 |
substr(str,n1,n2) |
extracts characters n1 through n2 from str ; n1<0 starts from right |
di substr("IN STATA",1,2) di substr("IN STATA",-5,5) |
IN STATA |
subinstr(str,sub1,sub2,.) |
replaces all instances of str1 with str2 in str |
di subinstr("STATA","T","7") |
S7A7A |
regexm(str,re) |
evaluates whether str matches regular expression regex ; results in a 1/0 output meaning TRUE/FALSE |
di regexm("ARE OKAY","A") |
1 |
regexr(str1, re, str2) |
replaces the first substring of str1 that matches reg. exp. re with str2 |
di regexr("STATA","T","7") |
S7ATA |
Regular expression match, or regexm
, is a powerful string tool which can be used in an evaluation as it outputs a TRUE/FALSE result as 1/0, respectively. For example, say I had a dataset where the travel_abroad
variable was completed as an open source field, and people entering data had input: "UNK", "UNKNOWN", "unk", "unknwn" and "Unknown" when they didn't know.
*i this will capture all the variations that contain "unk" anywhere, ignoring case
replace travel_abroad="Unknown" if lower(regexm(travel_abroad,"unk"))
Regular expressions are more powerful then regular string functions. They have added function which allow you to search if a string starts, or ends with a set of characters, or to use wildcards. For more detail on regular expressions, and how to use all the wildcards, click here.
gen x=1 if regexm(forename,"^A") // if the forename variable starts with "A"
gen y=1 if regexm(surname,"A$") // if the surname ends with "A"
*i just say we have Steve, Stephen, Stefen, and Steven in our dataset but we want them all.
replace forename=upper(forename)
gen steve=1 if regexm(forename,"STE*E")
Note the key difference in results between regexr and subinstr. This one is important. String functions can be combined as well. Remember, order and brackets matter. Paste these into Stata to see:
di regexr("ALEX WANTS YOU TO LEARN STATA STRING FUNCTIONS","A","4")
di subinstr("SERIOUSLY. LEARN STATA STRING FUNCTIONS","S","5",.)
di "`c(username)'"
di proper("hello "+substr("`c(username)'",1,strpos("`c(username)'",".")))
Generate and Replace
gen
andreplace
are the bread and butter of data cleaning in Stata. Sometimes we want to make a new variable (gen
) based on the contents of another, other times we want to replace
the contents of an existing variable. When you're cleaning data, create a new variable and then modify the new variable instead of the original one, this is best practice and adds a sense check element to data cleaning. You can always drop
the "dirty" variable later.
help tab // use this to list out the contents of a variable. read about option missing
help gen
help replace
help drop
*i example code of cleaning up a variable, coding the new variable as numeric for analysis
tab patient_sex,m // explore the variable contents, note values and type
gen sex=0 if sex=="F" // create a new variable instead of changing existing
replace sex=1 if sex=="M" // replace the values based on criteria
replace sex=9 if sex=="U"
replace sex=9 if sex=="" // dont forget missing ones
tab patient_sex sex // check that your cleaning changes made sense
drop patient_sex // we dont need the "dirty" one anymore
If-statements
If-statements are qualifiers for a command which allow you to selectively evaluate or change your data, command [if], [options]
. Remember your operators, here and elsewhere, we will use logical operators AND (A AND B must be true) &
OR (A OR B must be true) |
and brackets ()
. These will be the basis for your if-statements. The full if-statement must be true for the command to proceed.
For example, just say we know that unknown dates of birth were categorised as missing with age 99 or as 01/01/1900 and we wanted to recode those ages as missing.
*i the replace commands will only affect observations where the full if-statement is true
*i this way we don't accidently remove the age on someone who is legitimately 99
*i without the brackets, Stata would not group the first two items of the if-statement
* the OR would apply evaluate age==99 | dateofbirth==td(01jan1900)
replace age=. if (dateofbirth==. & age==99) | dateofbirth==td(01jan1900)
TASK 6: investigate the dataset. are there missing or inconsistent data? if so, are these random or systematic errors? do we have variables where values are given in multiple formats? use these prompts to clean your data
help if
help cond // a simplified if/else function used with gen or replace
TASK 7: format your dates in Stata format
help date
help format
help datetime_display_formats // pick what you like, its personal preference.
Sometimes your dates may be in multiple formats, even within a single variable, eg. 2017-12-25, 25dec2017, 12/25/2017. While this is frustrating, it can be dealt with. When you create your date variables from a string, you must define the format your dates (in their string variable) are in. One way to deal with this is to redefine your Stata date variable several times, or you can format your dates during your import stage. In this exercise, the other option would have been to modify the dates before appending them; this is often easier. Here is the beauty of a do file. If you want, you can go back and change that now and re-run, or just continue.
There is no correct way that you should format display your dates. Use whatever date display format you prefer. Stata default %td
will display dates in the format 01jan2018
.
*i replacing the date variables with different date formats.
gen date=date(string_date,"YMD")
replace date=date(string_date,"DMY") if date==.
replace date=date(string_date,"MDY") if date==.
format date %td_D/N/CY // after the %td_ lets you choose how you format the date
Duplicates
TASK 8: check for any duplicates records and remove them
help duplicates // read about drop, tag and force
Many datasets are plagued by issues of duplicate data. This is especially true of surveillance datasets, where data may be coming in from multiple sources. Depending on your dataset, and the specifics of the data being collected, this can be a relatively simple or very complex step. Often it will take you several stages to deduplicate your data. There are many ways to remove (or flag) duplicates, however the simplest way uses the duplicates
command.
This dataset contains laboratory surveillance data. In some cases a lab may run a blood sample more than once, and therefore we may have multiple records of the same infection episode. Alternatively, a hospital could send a blood sample to more than one lab; to address this, you'll need to deduplicate based on a subset of variables within the dataset. You will need to use two different deduplication criteria (2 lines) to remove all the records.
Never change the original dataset. Seriously. Don't do it. Save your cleaned dataset as a new file. Feel free to make any and all changes to this one. Use the cleaned dataset going forwards. If you ever make a mistake, just fix and rerun your do-file.
TASK 9: save your cleaned dataset as a new file
Data management
We've cleaned the dataset. Things will get a little more complicated as we're going to use more advanced commands. The key in data management is creating new variables. These variables will contain flags, summary data, or key identifiers which we will use for further modification. A properly managed dataset will be ready to share with others for analysis.
Derived variables
Derived variables use the existing data to create new variables which summarise your results in a more meaningful way. Or they may be flag variables. A flag variable is a variable coded 0/1 which helps to tell you something about your dataset and is used for manipulation or analysis.
For example, sometimes you don't want to drop records, but flag which ones should be omitted from analysis. So you could make a variable _drop
in which you flag records 1
which should be omitted; you will need to ensure that you account for this in your code. Or perhaps your data contains onset and exposure dates, but you are going to be doing your analysis only on those with <72 hour incubation time; create a flag variable for these individuals.
TASK 10: create a new variable for year, using specimen_date
help gen
help dates
TASK 11: calculate age in years at time of specimen
You can perform math functions with dates in Stata. Remember, Stata stores dates as the number of days since 01jan1960. This is important, as when you subtract two dates that are on the same day, the result will be 0
.
help round
TASK 12: calculate age groups [<1, 1-4, 5-14, 15-29, 30-49, 50-74, 75+], label the variable values using the groups above and check the new categorical variable against the continuous one
help egen // read about [type] cut and option ic
help label
help tab
Bysort and index numbers
bysort
is a prefix command that groups observations, allowing you to perform functions within the group. It is super powerful and has many practical uses, a few of which we will cover later in this section. First, you need to understand how it works.
*i REMEMBER:
bysort [varlist]: command [if], [options]
help bysort
Within each bysort
, each observation is temporarily assigned an index number within the group, starting with 1
. When bysort
runs, it will start with the first observation within the group (index number [1]
), perform the function, then move to the next observation within the group [2]
, repeating for each observation, until the last observation _N
.
You can perform manipulations according to these numbers. Importantly, these numbers are not actually displayed anywhere; but you can generate a new variable to display the index number if you wish. To use index numbers, put the index number within square brackets []
directly after the variable, without any spaces.
Table 6. Overview of index numbers within bysort
.
index number | which observation during the bysort does the index number relate to? |
---|---|
[1] |
first observation |
[2] |
second observation |
[_n] |
current observation |
[_N] |
last observation |
[_n+1] |
next observation / one observation after the current |
[_N-2] |
third-last observation / two observations from last |
You can also use _n
and _N
without the []
. In this case, _n
refers to the current index number, and _N
refers to the maximum number. They can be used in if-statement evaluations.
*i you will be able to run your version of the commands on your dataset, try them.
*i create a new var count which would have the total number of observations per id
*i only label the last observation per id with the count of observations
bysort id: gen count=_N
bysort id (specimen_date): replace count=. if _n!=_N
list if _n<10 & count!=. // shows the first 9 observations with counts
*i in bysort the variables in the () are not included in the group, but determine sort order
*i sorting is not mandatory, but necessary if you're working with index numbers
bysort id (specimen_date): gen index=_n
bysort id (specimen_date): gen days1=specimen_date-specimen_date[1]
bysort id (specimen_date): gen days2=specimen_date-specimen_date[_n-1]
Table 7. Results after bysort id (specimen_date)
forcount
,index
, days1
and days2
.
id | specimen_date | count | index | days1 | days2 |
---|---|---|---|---|---|
5448 | 29may2017 | 3 | 1 | 0 | . |
5448 | 23jun2017 | 3 | 2 | 25 | 25 |
5448 | 27jul2017 | 3 | 3 | 59 | 34 |
4040 | 08jun2017 | 3 | 1 | 0 | . |
4040 | 16jun2017 | 3 | 2 | 8 | 8 |
4040 | 04aug2017 | 3 | 3 | 57 | 49 |
Compare the results of the two bysort
commands for days1
and days2
. Look at the differences between the results of these two columns. Use the results to help you interpret how the function works. Why would the first value index==1
for both id
be equal to .
in days2
? Which observation is the index number pointing at during the bysort
?
In days1
, it would give you the number of days since the first observation per id
, and in days2
, the number of days between the observation and the one directly above it per id
. In both examples, the data has been sorted in ascending order by specimen_date
, with the earliest date as the first observation per id
group. In days2
, the first observation per group is missing because there is no previous record.
Practical use of bysort
patientID
We can use bysort
to group patient records. Often in a dataset, especially a surveillance dataset, some patients may have multiple records. It is important to capture these. This may be a preliminary step for further deduplication or simply to calculate a mean number of infections/person. Often a dataset will have multiple patient identifiers, but these may be incomplete.
For example, just say we had 2 variables, for forename and surname and we wanted to group these records to create a patient id, we could try the following:
gen id=_n
Table 8a. Results before bysort
.
forename | surname | id |
---|---|---|
Alex | Batman | 1 |
Alex | Batman | 2 |
Beth | Smile | 3 |
Simon | Darling | 4 |
Matt | D'Swing | 5 |
Simon | Darling | 6 |
Rebecca | Humbug | 7 |
Kazim | Bumblebee | 8 |
Rebecca | Humbug | 9 |
Kazim | Bumblebee | 10 |
Matt | D'Swing | 11 |
Beth | Smiles | 12 |
bysort forname surname (id): replace id=id[1]
Table 8b. Comparison of results after bysort
.
forename | surname | id |
---|---|---|
Alex | Batman | 1 |
Alex | Batman | 1 |
Beth | Smiles | 3 |
Beth | Smiles | 3 |
Kazim | Bumblebee | 8 |
Kazim | Bumblebee | 8 |
Matt | D'Swing | 5 |
Matt | D'Swing | 5 |
Rebecca | Humbug | 7 |
Rebecca | Humbug | 7 |
Simon | Darling | 6 |
Simon | Darling | 6 |
Compare Table 7a and 7b. Note that everyone that the id
value was replaced with each persons lowest id
value. This is because we sorted the grouping by id
and replaced with the first observation in the group [1]
. Remember, variables within the ()
brackets are not included in the group, but determine the sort order of the observations, thus the index number.
Lets try that again, however, this time, consider what would happen if some of the values were missing, as shown in Table 8a. We can avoid inappropriate groupings by using the if
command after our replace
. This is a more realistic scenario, as it is rare that you will have a fully 100% complete dataset. This is also why in practice, you will often use a sequential set of parameters to group records when you have missing data.
Table 9a. Results with missing data.
forename | surname | id |
---|---|---|
Alex | Batman | 1 |
Alex | 2 | |
Beth | Smile | 3 |
Simon | Darling | 4 |
Matt | 5 | |
Simon | Darling | 6 |
Rebecca | Humbig | 7 |
Kazim | Bumblebee | 8 |
Rebecca | Humbug | 9 |
Bumblebee | 10 | |
Matt | 11 | |
Beth | Smiles | 12 |
bysort forename surname (id): replace id=id[1] if forename!="" & surname!=""
Table 9b. Results after bysort
when there is missing data.
forename | surname | id |
---|---|---|
Bumblebee | 10 | |
Alex | 2 | |
Alex | Batman | 1 |
Beth | Smiles | 3 |
Beth | Smiles | 3 |
Kazim | Bumblebee | 8 |
Matt | 5 | |
Matt | 5 | |
Rebecca | Humbig | 7 |
Rebecca | Humbug | 9 |
Simon | Darling | 4 |
Simon | Darling | 4 |
By adding in exclusion criteria, we can prevent incorrect matches. If we run these lines afterwards, it will capture the individuals with missing data together. Producing similar results to those in in Table 7b. However, in reality, you would rarely use a single identifier as a grouping for patient data unless the quality and reliability of that specific data is known and confirmed in your data to be very high. For example, you would never use date_birth
and sex
together without a third variable to further group the observations. However, just say you have a standardised national health number, this may be a strong and accurate patient identifier, but it still requires exclusion criteria! If its not clear from the data, ask someone who might know.
bysort surname (id): replace id=id[1] if surname!=""
bysort forename (id): replace id=id[1] if forename!=""
Combine observations
Just say a patient had two records and you wished to deduplicate and retain the first observation. If you deduplicate at this stage, you would lose the information in surname and sex, as you would only keep the first record.
Once you understand bysort
and index numbers as how their used in this section, you can use it as an alternative method to flag duplicate records for removal. Helpfully, bysort
combined with [_n-1]
or [_n+1]
will allow you to capture information from an observation before or after, respectively. This is can often be helpful before deduplicating.
Table 10a. There is missing information in the primary record which has the outcome (cholera). Note that the index number [_n]
value is not actually shown in Stata
[_n] | id | forename | surname | sex | specimen_date | cholera |
---|---|---|---|---|---|---|
1 | 3 | John | 25dec2017 | 1 | ||
2 | 3 | John | Snow | Male | 25dec2017 | . |
bys id (specimen_date): replace surname=surname[_n+1] if surname==""
bys id (specimen_date): replace sex=sex[_n+1] if sex==""
bys id specimen_date: gen _drop=1 if _n!=1
Table 10b. Using the bysort
commands above replace the missing values in [1]
with the information from the next ([2]
)
_n | id | forename | surname | sex | specimen_date | cholera | _drop |
---|---|---|---|---|---|---|---|
1 | 3 | John | Snow | Male | 25dec2017 | 1 | . |
2 | 3 | John | Snow | Male | 25dec2017 | . | 1 |
There were only 2 observations within this bysort
(_N==2
) . The code captured surname and sex from the second record [2]
within the bysort
, then created a new variable, which flaged all subsequent records (matched on id
and specimen_date
), with a flag (_drop
) to help deduplicate further records.
Apply your learning!
TASK 13: create a patientID
help bysort
help _n
help replace
HINT: start with gen id=_n
and look at the output. Because it was run without bysort
what does the value represent? You'll use bysort
and replace
with index numbers for this part.
You will use multiple steps; each line will be a different set of criteria that you want to bysort
on. I suggest using different combinations of the patient identifiers.
For this exercise, let's try four combinations. When creating a patient ID using identifiers, also think of when we would not want to group that record. Look at your dataset. Think about combinations of identifiers which you can use. Always consider at least two variables to be used together in the grouping, the exclusions, and always sort the data in a consistent and logical way. Depending on what variables you choose, and the order, you may get different results from your colleagues. Another reason why it's important to document everything in your do file.
TASK 14: using your patientID, create a new variable which contains the total number of infections per person ID and a second variable which flags the first time an individual had an infection per organism each year
help bysort
help _N
HINT: you're going to want to use the _N
for one, and an if
-statement using _n
for the other.
TASK 15: using your patientID and total infections variable, generate a new variable which tags records where the individual had a concurrent infection with more than one organism on the same specimen date
help duplicates // read about tag
How do you interpret these results? What does 0 mean, what does a number ≥1 mean? Use the browse
command to look at the results in a line list to help you answer these questions. The duplicates
command is not just for dropping records.
TASK 16: save your dataset
At this stage you can replace your previous cleaned dataset, or you can create a new file which captures these changes. It's up to your personal preference.
Data manipulation
Data manipulation is when you change the structure of your dataset. Depending on what your analysis requires, you may need to summarise your data, append or merge other datasets, or reshape your data. Here we will do all of these. Remember, Stata does not have an undo function, this is why we saved our base dataset, a deduplicated, cleaned line list.
In this next part, we are going to create a summary table for the number of infections, and calculate the yearly incidence rates per 100,000 population. Then we are going to export our result table to excel. These are the tables seen in the exports\summary.xlsx
file.
Long versus wide data: reshape
This section requires that you understand data structures. What does your data table look like? How is it organised? Is it in wide or long format, and why? Recall our discussion of "tidy" data; that is long format, with each variable having a column, and each observation of that variable having a row. For example, see Table 10a. For each area and year, there is a separate observation. Wide data breaks that, and allows for multiple variables of like data per observation (Table 10b/c).
Shape your data according to your needs. There is no "one shape fits all". Feel free to reshape
multiple times throughout an analysis. A structure may work for one part but may not work for another. Think about your code, and your objective. What shape is necessary for any joins or merges (more on that coming up)? What makes your life easiest (meaning the most simple code)? Am I exporting a table, what shape is best for reading the data?
help reshape
In Stata, reshape
requires an index key i()
, and what you would like to reshape on j()
. In the example below, starting with the long data, they key is area in the top table, and we have reshaped wide on year, and in the second they key is year, and we have reshaped wide on area. You can restructure your data multiple times. Your key cannot have missing observations. You can reshape from wide to long, or long to wide. Reshaping can be done multiple times if you dataset allows it.
Table 11a. Long format data
area | year | count |
---|---|---|
A | 2001 | 17 |
A | 2002 | 82 |
A | 2003 | 6 |
B | 2001 | 654 |
B | 2002 | 798 |
B | 2003 | 312 |
reshape wide count, i(area) j(year)
Table 11b. Wide format, by year.
area | count2001 | count2002 | count2003 |
---|---|---|---|
A | 17 | 82 | 6 |
B | 654 | 798 | 312 |
*i reverts the reshape back, can be used once per reshape.
reshape long
*i string option required when its string variable
reshape wide count, i(year) j(area) string
Table 11c. Wide format, by area.
year | countA | countB |
---|---|---|
2001 | 17 | 654 |
2002 | 798 | 82 |
2003 | 6 | 312 |
Creating summary datasets: collapse
We are going to be joining the following dataset to our data. If you've setup your file structure and downloaded data files, run the code below. preserve
takes a snapshot of your current data, and sends it to memory. When you use the command restore
, it reverts back to the preserved dataset. You can only preserve one dataset to memory at a time.
help preserve
preserve
use data\area_population, clear
list
restore
Take a look at this dataset. What have you observed? Look at the variables, observations and structure of the data. How does it compare to your current dataset?
TASK 17: summarise your dataset to give counts of the number of infection episodes in each area per organism per year
help table
help collapse // read about sum and count
table
allows you to print out tables in your Stata display, while collapse
changes your dataset. For exploration table
is easier but temporary; collapse
has more functionality, and is considerably more powerful as we will see. Use the table
command first. Compare that to the results of your collapse
. Remember, you cannot undo a collapse
. Make sure your dataset is saved beforehand.
TASK 18: summarise your dataset to give the national counts of the number of infection episodes per year
After you collapse
your national dataset, what have you noticed? Are all the variables from the regional table there? When we join the two tables, will there be any missing data? Can we correct that? If we need to, when do we need to make these changes and what are they for?
TASK 19: join the two summary tables (area level and national)
help append
Recall earlier when we were creating our dataset, we used append
, joining data top to bottom making our dataset longer by adding observations. Here we are doing that again, but on a much smaller scale.
TASK 20: reshape your data in preparation to join to population data
help reshape
*i to jog your memory...
preserve
use data\area_population, clear
list
restore
Being able to reshape
data is an important and powerful set of skills. Data can be presented in many different formats, but they may be unsuitable for use in their original form. Compare the shape of your new summary data to the population table. They are different, and therefore will be unable to join with each other in a meaningful way. By giving both datasets a similar structure, we will be able to join them together for further analysis. Reshape one dataset to match the other.
Joining datasets: merge
merge
is a left-right join making our dataset wider by adding variables. When we merge
, we select a key (a variable or combination of variables) which allows both datasets to identify what we are going to join on.
master data is the base dataset currently in memory (on the left)
using data is what we are bringing in (on the right)
The advantage of merging datasets is that it allows us to enrich our data with more information. When we merge data, we're making the dataset wider, however we may also change the total number of observations within the dataset, depending on the type of merge.
Table 12. Merge types in Stata
merge type | what does it mean | example |
---|---|---|
1:1 one-to-one |
all observations between both datasets must be unique; only one possible match from master to using | 2 patient level datasets containing no duplicates |
m:1 many-to-one |
the using data only has 1 unique observation per key; each observation of the using data can merge onto the master multiple times | merging a death register (U) onto a surveillance dataset (M) |
1:m one-to-many |
the master data only has 1 unique observation per key; each observation of the using data can merge onto the master multiple times (this is the least common merge) | merging a treatment register (U) onto patient cancer registry dataset (M) |
m:m many-to-many |
both datasets contain duplicates; all matches from both side merge (these are highly discouraged as they can result in errors, but are still sometime necessary) | merging a treatment dataset (U) onto a hospital admissions dataset (M) |
Recall we want to calculate incidence rates/100,000 population. Looking back at the population data, what key do we have available to us in both datasets? What type of merge do we want to perform? Which merge results do we want to keep, and what variables do we need from our using data?
TASK 21: join the population data onto your dataset
help merge // read about merge types and keep options
Macros
Macros are Statas way of letting you use values that do not exist. In Stata, there are two types of macros, locals and globals. A local macro exists only for a short time, for example, within a loop or a chunk of code, and then disappears. Local macros are defined using local macroname
and called up using `macroname'
with the funny tick mark and a single quote around whatever the name of the macro is. A global macro exists from when you define it until you exit Stata. Global macros defined using global macroname
are called using $macroname
with a dollar sign followed by the macro name. With all macros, you set the name.
Whenever you run a command, Stata is generating hidden values storing the results as macros. Run a sum
of your count variable from your summary table, then type return list
in the console. This brings up a list of hidden values from the sum
command you just ran, which have been stored as local macros. You can use these values to undertake further analysis. If you want to store the value for later use, you can redefine it as a global macro.
help macro // read this
help levelsof
sum count
return list
A macro can be a string, a variable name, or a numerical value, in some cases a macro can be a list of things help levelsof
. How you define the macro will determine the format of the value. Note the difference in how the local macro `four'
is defined; copy/paste the example into the command lines in Stata.
local four=2+2
di "`four'"
local four 2+2
di "`four'"
In the first example, the macro contained a value for the evaluated equation, as defined using the assignment operator (=
), for 2+2
. In the second example, the assignment operator was omitted, and was therefore not evaluated and the macro was defined as a string containing the characters "2+2"
.
Looping
Looping is a way of repeating code over something using macros. foreach
loop over variables, strings, or lists. forval
loops over numbers, these numbers can be values or as part of a variable name. Loops always follow the same general syntax. If you find yourself copy/pasting code, you should have written a loop. Once understood, loops simplify code, are more concise and are easier to maintain or make changes to.
See how the local macro has been called within the loop. Run the following loop code in your command line or do file to get an idea of the output.
help foreach // looping over strings/variables/list
help forval // looping over numbers, read about ranges
*i _all = stata code for all variables in dataset
foreach v of varlist _all {
di "looping over variable: `v'"
}
forval i=1/10 {
di "looping over number: `i'"
}
Table 13. Understanding the parts of a loop
loop type | macroname | looping over | in english | command |
---|---|---|---|---|
foreach |
v |
of varlist _all |
each variable in dataset | di "looping over variable: `v'" |
forval |
i |
=1/10 |
number 1 to 10 sequentially | di "looping over number: `i'" |
You will always define all three parts: the loop type, macroname and what you are looping over. The opening curly bracket always goes on the end of the first line {
, and the closing bracket goes alone on the last line of the loop }
. Whatever commands are within the curly brackets are run on every iteration of the loop. As good practice, it's very helpful to include a di "`macroname'"
in loops as it makes the output easier to read. There is no limit to how big a loop can be or how many commands can be within (that I have ever encountered).
TASK 22: calculate annual incidence rates per 100,000 population
help forval
Don't worry about confidence intervals for this. Just calculate a simple rate. Just say your dataset had 15 years and not just 3. Depending on how your data is shaped, it's often easier to do this with a loop and it makes future changes easier to manage.
TASK 23: order and sort and label your variables in a logical way and export data to a new xlsx file in 3 sheets called inc_genusname
help order
help sort
help label
help forval // use this for the labelling
help levelsof // this creates a local macro
help foreach // read about "of local" loops
help export excel // read about export_excel_options
Ensure the table is in the shape and order you want with just the variables that you need, with the columns in the correct logical order. Just export what you need. When Stata exports data to excel, it will do is in the sorted order currently set in your viewer. For example, do you want the area_code or the area_name for a final output table? Do you want your data sorted by year or by species?
In our dataset now have a count, population and rate variable with a year after its name, eg. n2015, pop2015, rate2015. Use a forval
loop to label these appropriately.
export
each species to its own sheet called "INC_genusname"
. Bonus points if you can use figure out how to use levelsof
and a foreach
loop to do this. Use string functions to extract the genus from the species name.
Don't forget to save
your dataset so you can return to it later!
TASK 24: create a new summary table of infection episodes per year and month for each organism
help use
help collapse
We changed our data with our last collapse, so we have to go back to our saved cleaned deduplicated dataset before we can create new summary tables.
TASK 25: shape your data in a format that excel would allow you to make a graph/epicurve and export the monthly counts for each of the species in 2017 to your xlsx file in a new sheet called monthly_species
As this is surveillance data, it may be that you want to create a line graph or epicurve for comparison against other data. Using the steps above, you should be able to create summary tables and export them as necessary.
TASK 26: create a table which gives the agegroup:sex breakdown by sex proportions for each of the species per year. Export these to new sheets by species.
Remember, we made a derived variable earlier which captured the first time an individual had an infection per organism per year. Let's use that to prevent overestimations due to some individuals having repeat infections.
This one requires everything we've talked about above. You'll need to collapse
, drop
missing data, sort
and reshape
your data, generate
new derived variables and export
subsets of your results.
TASK 27: compare your final results against the file in
outputs\summary.xlsx
Compare your final xlsx
file to the reference file. You may have slightly different numbers based on how you did your deduplication and patientID, that's okay. Compare your methods against your colleagues or the reference file. It is important you document everything in your do file, this include extensive comments on why you have made certain decisions.
Achievement Unlocked: Level 2 Data Wizard
Congratulations. You can completed the training and can now clean, manage and manipulate data in Stata.
To help improve this training module and for a Certificate of Completion, please take a few minutes to complete the evaluation/feedback for this training.