class: center, middle, inverse, title-slide # Econ 414 - Urban Economics ## Exploratory Spatial Data Analysis ### Marcelino Guerra ### April 27-29, 2021 --- # Outline <style type="text/css"> .pull-left2 { float: left; width: 59%; } .pull-right2 { float: right; width: 40%; } .pull-right2 ~ p { clear: both; } </style> </br> .center[**"Everything is related to everything else, but near things are more related than distant things"**] </br> * Generally, in data analysis, the researcher looks for patterns and trends, and spatial patterns are particularly important for scientific inquiry * Exploratory Spatial Data Analysis (ESDA) focuses on visualizing the spatial distribution of variables and summarizing the spatial pattern with statistics * The goal of this lecture is to show an example of spatial analysis - the work of John Snow on the mechanisms of cholera transmission * We also define an object's neighborhood: this is a crucial step to measure the strength of the spatial relationships and how neighbors influence each other * After that, we focus on global and local measures of spatial autocorrelation: Moran's I and Local Indicators of Spatial Association (LISA) --- class: inverse,center, middle # 1854 Cholera Outbreak in London --- # John Snow (1813-1858) .pull-left2[ * John Snow was an English physician and a leader in the development of anesthesia and medical hygiene * He had long believed that cholera was spread by contaminated water, but the dominant theory was that cholera was transmitted by breathing "bad air" * Cholera is an intestinal illness that can cause death within hours after the first symptoms, and the first cases of the disease in London were recorded in 1831 * He is considered the father of modern epidemiology ] .pull-right2[![](figs/snow_pic.jpg)] --- # The Broad Street Pump .pull-left2[ * By the 1800s, most people did not have running water and basic sanitation in their houses and used communal pumps to get the water for cooking, drinking, and washing * Snow published an article in 1849 outlining his theory about the mechanism behind cholera transmission, but he could not convince other doctors and scientists * In August of 1854, Soho - a suburb of London - was hit hard by an outbreak of cholera. Living near Soho, Dr. Snow could do some fieldwork to prove his theory * [“Within 250 yards of the spot where Cambridge Street joins Broad Street there were upwards of 500 fatal attacks of cholera in 10 days,” Dr. Snow wrote “As soon as I became acquainted with the situation and extent of this irruption (sic) of cholera, I suspected some contamination of the water of the much-frequented street-pump in Broad Street.”](https://www.ph.ucla.edu/epi/snow/snowcricketarticle.html) * John snow put together information from hospitals, public records, and interviews. He also tracked hundreds of cholera cases and deaths nearby schools, restaurants, pubs, and businesses ] .pull-right2[ ![:scale 80%](figs/pump.jpg) ] --- # John Snow Mapping Cholera Deaths .pull-left[ * Snow used a geographical grid to chart deaths from the outbreak, and the dot map would illustrate the cluster of cholera cases very close to the water pump * He also investigated cases to determine whether they were related to the same pump in Broad Street * Together with his empirical findings, Dr. Snow also had anecdotal evidence coming from many different sources * For instance, a brewery located on Broad Street. Among its 70 workers, there were no fatalities. The explanation is that the men were given an allowance of free beer every day and so never drank water at all ] .pull-right[![](figs/SnowMap.png)] --- # John Snow Getting Causal Effects .pull-left2[ * The most compelling evidence that Cholera was transmitted by contaminated water came from a "natural experiment." At that time, different companies were responsible for the water distribution within London, often bottling water from the Thames * In 1849, Lambeth company moved their pipes higher up (above the main sewage discharge point) in the Thames river, while Southwark and Vauxhall did not <table class=" lightable-paper table" style='font-family: "Arial Narrow", arial, helvetica, sans-serif; width: auto !important; margin-left: auto; margin-right: auto; margin-left: auto; margin-right: auto;'> <thead> <tr> <th style="text-align:left;"> Company </th> <th style="text-align:left;"> Cholera Cases in 1849 </th> <th style="text-align:left;"> Cholera Cases in 1854 </th> </tr> </thead> <tbody> <tr> <td style="text-align:left;"> Southwark and Vauxhall </td> <td style="text-align:left;"> 135 </td> <td style="text-align:left;"> 147 </td> </tr> <tr> <td style="text-align:left;"> Lambeth </td> <td style="text-align:left;"> 85 </td> <td style="text-align:left;"> 19 </td> </tr> </tbody> <tfoot> <tr> <td style = 'padding: 0; border:0;' colspan='100%'><sup></sup> Source: Cunningham (2021). Cholera cases per 10,000 people.</td> </tr> </tfoot> </table> * Snow linked household enrollment in water supply companies with cases of Cholera and made his case for the cause of cholera transmission ] .pull-right2[ <iframe src="maps/snow.html" style="width: 800px; height: 500px; border: 5px" alt=""> ] --- class: inverse,center, middle # Spatial Weight Matrices --- # Modelling Proximity * One can impose some structure to define neighborhoods. Spatial weight matrices represent the degree of connection between two regions using some criteria to determine proximity. Also, proximity has a comprehensive meaning: it could be geographic, socio-economic, or political affiliation proximity, for instance. * It is an arbitrary choice since there is no formal test to define how someone characterizes neighbors. Also, the estimated parameters can be sensitive to the researcher's choice. One way to answer critics is to report results from the use of many different **spatial weight matrices** * Each element `\(w_{i,j}\)` of the Weigh Matrix indicates the way location `\(i\)` and `\(j\)` are connected. The matrix's diagonal is equal to zero - you are not a neighbor of yourself * Suppose you have the following spatial setting `\(1\rightarrow \leftarrow2\rightarrow \leftarrow3\rightarrow \leftarrow4\)`: **Here are the associated spatial weights:** .center[ `\begin{bmatrix} 0 & 1 & 0 & 0\\ 1 & 0 & 1 & 0\\ 0 & 1 & 0 & 1\\ 0 & 0 & 1 & 0\\ \end{bmatrix}` ] --- # W Matrix - First Order Contiguity * One way is to consider two regions as neighbors when they share a border. That means `\(w_{i,j}=1\)` if `\(i\)` and `\(j\)` are contiguous and `\(w_{i,j}=0\)` otherwise * Despite its simplicity, there are many ways to define contiguity. The two most common conventions are *rook* and *queen* .center[![](figs/rook_queen.png)] --- # W Matrix - Nearest Neighbors * Another proximity criterion relies on the geographical distance (in `\(km\)`, miles, etc.). In that sense, one W Matrix heavily adopted in the literature is the `\(k\)`-nearest neighbors .pull-left[ * Formally, `\(w_{ij}(k)=1\)` if `\(d_{ij} \leq d_{i}(k)\)` and `\(w_{ij}(k)=0\)` otherwise * `\(d_{i}(k)\)` is the cutoff distance for the region `\(i\)` to have exactly k neighbors * When you model proximity in that way, it is a guarantee you do not have "islands" (e.g., Hawaii would have no neighbors using contiguity criteria) * Other W matrices use geographical distance such as inverse distance and great circle distance ] .pull-right[![](figs/knn.png)] --- class: inverse,center, middle # Spatially Lagged Variables --- # Illustration I * Back to our former spatial setting based on the first-order contiguity: `\(1\rightarrow \leftarrow2\rightarrow \leftarrow3\rightarrow \leftarrow4\)`. .pull-left[ * The associated spatial weight matrix is: `\begin{equation*} \begin{matrix} & & & 1 & 2 & 3 & 4 \\ \end{matrix} \\ W= \begin{matrix} 1 \\ 2 \\ 3 \\ 4 \end{matrix} \begin{bmatrix} 0 & 1 & 0 & 0\\ 1 & 0 & 1 & 0\\ 0 & 1 & 0 & 1\\ 0 & 0 & 1 & 0 \end{bmatrix} \end{equation*}` * One common practice is to row-standardize the `\(W\)` matrix. To do that, divide each element `\(w_{ij}\)` by the sum of their rows. ] .pull-right[![](figs/matrix.png)] --- # Illustration II .pull-left[ * The row-standardized version of the referred `\(W\)` matrix is: `\begin{bmatrix} 0 & 1 & 0 & 0\\ 1/2 & 0 & 1/2 & 0\\ 0 & 1/2 & 0 & 1/2\\ 0 & 0 & 1 & 0 \end{bmatrix}` * Now, consider the following values for population density in each block: `\begin{equation*} x= \begin{bmatrix} 2 \\ 7 \\ 8 \\ 1 \end{bmatrix} \end{equation*}` ] .pull-right[![](figs/matrix.png)] --- # Illustration III Using those values - the neighborhood structure defined by `\(W\)` and population density -, it is straightforward to calculate `\(Wx\)`. That is, the spatial lag vector of the variable in question: population density. `\begin{equation*} Wx= \begin{bmatrix} 0 & 1 & 0 & 0\\ 1/2 & 0 & 1/2 & 0\\ 0 & 1/2 & 0 & 1/2\\ 0 & 0 & 1 & 0 \end{bmatrix} \begin{bmatrix} 2 \\ 7 \\ 8 \\ 1 \end{bmatrix} = \begin{bmatrix} 7 \\ 5 \\ 4 \\ 8 \end{bmatrix} \end{equation*}` Each value of `\(Wx\)` corresponds to the average population density in the neighborhood of each block. For instance, `\(5\)` represents the average population density in the adjacent areas of block 2. <br/> **Example**: [Crime displacement](https://guerramarcelino.github.io/gradseminar/blitzproject#1) --- # Homicide per 1,000 - Chicago-IL 2020 <iframe src="maps/homic2020.html" style="width: 1100px; height: 500px; border: 5px" alt=""> --- # Homicide rates x Spatial Lag <iframe src="maps/scatter.html" style="width: 1100px; height: 500px; border: 5px" alt=""> --- class: inverse,center, middle # Spatial Autocorrelation --- # Moran's I </br> * Moran (1948) developed a spatial correlation coefficient. This statistic is given by: `\begin{equation*} I=(\frac{n}{S_{0}})\cfrac{\sum_{i=1}^{n}\sum_{j=1}^{n}w_{ij}z_{i}z_{j}}{\sum_{i=1}^{n}z_{i}^2} = \frac{n}{S_{0}}\frac{z'Wz}{z'z} \end{equation*}` where `\(n\)` is the number of regions, `\(z\)` is the scaled values of the variable in question, `\(W\)` is the spatial matrix used, and `\(S_{0}=\sum\sum w_{ij}\)`. When row-standardized, `\(S_{0}\)` is equal to `\(n\)`, and `\(I=\frac{z'Wz}{z'z}\)`. * The null hypothesis is spatial randomness: the observed spatial pattern of a variable is equally likely as any other spatial pattern. In other words, values in one area do not depend on values at neighboring locations. * If your p-value is less than .05, you **reject the null:** the variable in question is spatially autocorrelated (in global terms) --- # Moran's Scatterplot <iframe src="maps/moran.html" style="width: 1100px; height: 500px; border: 5px" alt=""> --- # Local Indicator of Spatial Association (LISA) * The Moran's I is a global statistic that identifies the general data's pattern * To check local patterns of spatial autocorrelation, you might want to proceed with the Local Indicator of Spatial Association (LISA) analysis * Is your positive spatial autocorrelation related to clusters of similarly high or low values? * Which observations contribute the most to spatial autocorrelation? * Are there outliers that differ from the global spatial autocorrelation scheme? * Anselin (1995) formalized the local Moran’s statistics. The formula is given by: `\begin{equation*} I_{i}=(\frac{z_{i}}{m_{0}})\sum_{j}w_{ij}z_{j} \end{equation*}` where `\(m_{0}= \sum_{i}z_{i}^2\)`, `\(z\)` is the scaled values of the variable in question, and `\(W\)` is the spatial matrix used --- # LISA Map <iframe src="maps/LISA.html" style="width: 1100px; height: 500px; border: 5px" alt="">