Having a background in science, I got interested in the water quality data from data.austintexas.gov. At the same time, I found a need by Open Austin, a volunteer group that uses tech to fill the needs of the Austin community, to create a water quality bot. So I downloaded the dataset which is about 300MB. I then explored it using pandas. The codes for my data exploration can be found here.
The dataset has more than a million observations, with 24 columns. There are some missing values.
To understand the data, I looked at the unique values of the features (columns). I found the following (from data downloaded Sept. 8, 2017).
- There are 92 watersheds that are subject to water quality assessments. Within the watersheds there are varying number of sampling sites (‘site_name’). The following bar graph shows that more assessments were done in some watersheds than the others. Wallnut Creek, Bull Creek and Slaughter Creek are the watersheds with the most sampling sites for some reason.
- Water quality is measured through different parameter types and there are 95 of these parameter types. Parameter types may be of biological, organic or inorganic in nature. They may be measured in different ways, giving a total of more than 3000 parameters measured. The list of parameter types are shown below:
Diatoms, one of the parameter types, have the most parameters, as shown by the code below.
- Though results are available for the analyses posted on the site, assessment of data needs consideration of the indicated qualifier codes. Each result are flagged whether it is reliable of not. Majority of the values are “usable” (U), but care must be taken to see that some are “estimated” (J) and some are “unusable” (R). (The small ‘u’ might mean ‘U’, see below.)
‘S’, ‘B’, and ‘Z’ mean “not explained” (not sure why there are three of these), ‘?’ is “unknown”, and there are some that are missing (nan).
Among the parameter types, I looked at the data on E. coli, one of the most important indicators of water quality. Filtering the dataset using the masks in the following code allowed me to get a more or less dependable data for E. coli.
Exploring this subset of data, I found that there are three units for measuring E. coli, meaning that all the values in this dataset can’t be compared to each other. The data must further be filtered according to units of E. coli measurement.
To get an idea of how E. coli values vary over time for a particular sample site, I created a time series dataframe. Seeing that the unit ‘MPN/100ML’ (most probable number per 100 milliliters of sample) had more data under it, I used this subset of data. To create the time series, I resorted to converting the “SAMPLE_DATE” column from string to a datetime object and converted it to an index. An example of a plot is shown below:
According to this site, an upper limit of 200 MPN/100mL is set by the Washington State Department of Ecology Recreational Water Standards for safe E. coli levels.