rOpenSci | Interesting Uses of censo2017 a Year After Publishing

Interesting Uses of censo2017 a Year After Publishing

See the Spanish version of this blog post: Usos Interesantes de censo2017 un Año Después de su Publicación

🔗 Summary

This post is about the surprising uses I’ve noticed and the questions about the censo2017 R package, a tool for accessing the Chilean census 2017 data, I’ve gotten since it was peer-reviewed through rOpenSci one year ago. The original post about the package one year ago didn’t cover the different examples I present here, including a Python port of the R package.

🔗 Organizing the census data

Three years ago, I had to complete an assignment that required me to extract data from Windows-only software in DVD format, which got very complicated.

I needed to access REDATAM files and obtain a few population summaries with specific software for that format. To my surprise, the task got incredibly challenging, and I exported the data to SQL for more accessible data extraction.

What I initially organized in PostgreSQL, ended up being organized in a local (embedded) database, which is more convenient for end-users and can be installed with two lines of code.

Installation in R:

remotes::install_github("ropensci/censo2017")
censo2017::censo_descargar()

Installation in Python:

pip install git+https://github.com/pachadotdev/censo2017python.git#egg=censo2017

import censo2017
censo2017.descargar()

🔗 Unexpected use I: Substandard housing (in R)

We can use the census data to determine where substandard housing is common, leading to valuable insights for public policy. That is something I never contemplated when creating the software.

What kind of housing information can we look for in the census? To keep it simple, let’s explore the p01 variable, which we can look at in the variables_codificacion, a table that I created to organize my work.

library(censo2017)
library(dplyr)

con <- censo_conectar()

tbl(con, "variables") %>% 
  filter(variable == "p01") %>% 
  collect() %>% 
  kable()
tabla variable descripcion tipo rango
viviendas p01 Tipo de Vivienda integer 1 - 10
tbl(con, "variables_codificacion") %>% 
  filter(variable == "p01") %>% 
  select(valor, descripcion) %>% 
  collect() %>% 
  kable()
valor descripcion
1 Casa
2 Departamento en Edificio
3 Vivienda Tradicional Indígena (Ruka, Pae Pae u Otras)
4 Pieza en Casa Antigua o en Conventillo
5 Mediagua, Mejora, Rancho o Choza
6 Móvil (Carpa, Casa Rodante o Similar)
7 Otro Tipo de Vivienda Particular
8 Vivienda Colectiva
9 Operativo Personas en Tránsito (No Es Vivienda)
10 Operativo Calle (No Es Vivienda)
11 Valor Perdido
0 No Aplica

The variable refers to “Tipo de Vivienda” (Housing Type) and contains ten numeric values that mean:

Value Description
1 House
2 Apartment Building
3 Traditional Indigenous Dwelling (Ruka1 or Others)
4 Bedroom in an Old House or in a Conventillo2
5 Half Hut, Improvement, Ranch or Shack
6 Mobile (Tent, Mobile Home or Similar)
7 Other Type of Private Housing
8 Collective Housing
9 Persons in Transit Operative (Not a Dwelling)
10 Street Operative (Not Housing)
11 Missing Value
0 Not applicable

  1. traditional Mapuche house type↩︎

  2. A building designed to house as many people as possible in a small space at a low cost.↩︎

In this example, I consider dwellings coded as 5 (Half Hut, Improvement, Ranch or Shack / Mediagua, Mejora, Rancho o Choza) as substandard. Because of the census data organization, to get the share of these types of housing per region, we need to:
  1. Create the code for each political division (i.e., region) from the geographical code in the zones table.
  2. Join the zones (“zonas”) table with the households (“viviendas”) table to match each unit to its location (i.e., household ID 100 is in the “Los Ríos (14th)” region).
  3. Group by p01 and count the number of units.

For example, if we want to look at the 14th region specifically, one way to do this could be:

tbl(con, "zonas") %>% 
  mutate(region = str_sub(geocodigo, 1, 2)) %>% 
  filter(region == 14) %>% 
  
  inner_join(tbl(con, "viviendas")) %>% 
  
  group_by(p01) %>% 
  count() %>% 
  
  collect() %>% 
  kable()
p01 n
1 144081
2 4793
3 44
4 494
5 2187
6 35
7 1875
8 467
9 6
10 8

We can improve this a little bit with an optional join and a step to extract the percentages:

tbl(con, "zonas") %>% 
  mutate(region = str_sub(geocodigo, 1, 2)) %>% 
  filter(region == "14") %>% 
  
  inner_join(tbl(con, "viviendas")) %>% 
  
  group_by(p01) %>% 
  count() %>% 
  
  left_join(
    tbl(con, "variables_codificacion") %>% 
      filter(variable == "p01") %>% 
      select(p01 = valor, p01_desc = descripcion)
  ) %>% 
  select(p01, p01_desc, n) %>% 
  
  arrange(-n) %>% 
  ungroup() %>% 
  mutate(p = round(100 * n / sum(n), 2)) %>% 
  
  collect() %>% 
  kable()
p01 p01_desc n p
1 Casa 144081 93.57
2 Departamento en Edificio 4793 3.11
5 Mediagua, Mejora, Rancho o Choza 2187 1.42
7 Otro Tipo de Vivienda Particular 1875 1.22
4 Pieza en Casa Antigua o en Conventillo 494 0.32
8 Vivienda Colectiva 467 0.30
3 Vivienda Tradicional Indígena (Ruka, Pae Pae u Otras) 44 0.03
6 Móvil (Carpa, Casa Rodante o Similar) 35 0.02
10 Operativo Calle (No Es Vivienda) 8 0.01
9 Operativo Personas en Tránsito (No Es Vivienda) 6 0.00

For the 14th region, housing quality seems not to be a problem. What if we wonder for the number of people living in each of these unit types? This would need additional steps to match people (“personas”) to homes (“hogares”), and homes to households.

tbl(con, "zonas") %>% 
  mutate(region = str_sub(geocodigo, 1, 2)) %>% 
  filter(region == "14") %>% 
  
  inner_join(tbl(con, "viviendas")) %>% 
  
  inner_join(tbl(con, "hogares")) %>% 
  inner_join(tbl(con, "personas")) %>% 
  
  group_by(p01) %>% 
  count() %>% 
  
  left_join(
    tbl(con, "variables_codificacion") %>% 
      filter(variable == "p01") %>% 
      select(p01 = valor, p01_desc = descripcion)
  ) %>% 
  select(p01, p01_desc, n) %>% 
  
  arrange(-n) %>% 
  ungroup() %>% 
  mutate(p = round(100 * n / sum(n), 2)) %>% 
  
  collect() %>% 
  kable()
p01 p01_desc n p
1 Casa 359343 93.38
2 Departamento en Edificio 9139 2.37
8 Vivienda Colectiva 7905 2.05
5 Mediagua, Mejora, Rancho o Choza 3980 1.03
7 Otro Tipo de Vivienda Particular 3008 0.78
4 Pieza en Casa Antigua o en Conventillo 727 0.19
9 Operativo Personas en Tránsito (No Es Vivienda) 494 0.13
3 Vivienda Tradicional Indígena (Ruka, Pae Pae u Otras) 96 0.02
10 Operativo Calle (No Es Vivienda) 84 0.02
6 Móvil (Carpa, Casa Rodante o Similar) 61 0.02

How does the previous result compare with the capitol (13th region)?

tbl(con, "zonas") %>% 
  mutate(region = str_sub(geocodigo, 1, 2)) %>% 
  filter(region == "13") %>% 
  
  inner_join(tbl(con, "viviendas")) %>% 
  
  inner_join(tbl(con, "hogares")) %>% 
  inner_join(tbl(con, "personas")) %>% 
  
  group_by(p01) %>% 
  count() %>% 
  
  left_join(
    tbl(con, "variables_codificacion") %>% 
      filter(variable == "p01") %>% 
      select(p01 = valor, p01_desc = descripcion)
  ) %>% 
  select(p01, p01_desc, n) %>% 
  
  arrange(-n) %>% 
  ungroup() %>% 
  mutate(p = round(100 * n / sum(n), 2)) %>% 
  
  collect() %>% 
  kable()
p01 p01_desc n p
1 Casa 5265416 74.03
2 Departamento en Edificio 1613157 22.68
8 Vivienda Colectiva 78763 1.11
4 Pieza en Casa Antigua o en Conventillo 72142 1.01
5 Mediagua, Mejora, Rancho o Choza 52710 0.74
7 Otro Tipo de Vivienda Particular 17335 0.24
9 Operativo Personas en Tránsito (No Es Vivienda) 9113 0.13
10 Operativo Calle (No Es Vivienda) 2740 0.04
3 Vivienda Tradicional Indígena (Ruka, Pae Pae u Otras) 865 0.01
6 Móvil (Carpa, Casa Rodante o Similar) 567 0.01

From the previous summarised data we can see that in the capitol, 52,710 individuals are living in substandard housing (Mediagua, Mejora, Rancho o Choza).

We shouldn’t forget to close the database connection.

censo_desconectar()

🔗 Unexpected use II: Water source (in Python)

The motivation for this example is to show that the goal of organizing the census data was not to promote R, but to show some of the benefits of having the data organized in an open format, regardless of the programming language used for the analysis. For me it’s easier to complete this example in R, but I wanted to include those using this census data and Python.

The variable for the water source is p05. The analysis is very similar to the previous one, and we start by looking at its description. This particular example is relevant, as it allows to quantify the number of people that might be exposed to water contamination in a certain area.

import censo2017
import duckdb

con = duckdb.connect(database = censo2017.archivo_sql())

con.execute("SELECT * FROM variables_codificacion WHERE variable = 'p05'").df()

##        tabla variable  valor                                descripcion
## 0  viviendas      p05      1                                Red Pública
## 1  viviendas      p05      2                               Pozo o Noria
## 2  viviendas      p05      3                              Camión Aljibe
## 3  viviendas      p05      4  Río, Vertiente, Estero, Canal, Lago, Etc.
## 4  viviendas      p05     99                              Valor Perdido
## 5  viviendas      p05     98                                  No Aplica
Value Description
1 Public Network
2 Well or Noria
3 Cistern Truck
4 River, Stream, Estuary, Canal, Lake, Etc.
99 Missing Value
98 Not Applicable

By using this variable, we can know the number of people for each water source in the same way as the previous case.

con.execute("""
SELECT "p05", "p05_desc", "n",
  ROUND((100.0 * "n") / SUM("n") OVER (), CAST(ROUND(2.0, 0) AS INTEGER)) AS "p"
FROM (
  SELECT "LHS"."p05" AS "p05", "n", "p05_desc"
  FROM (
    SELECT "p05", COUNT(*) AS "n"
    FROM (
      SELECT *, "LHS"."hogar_ref_id" AS "hogar_ref_id"
      FROM (
        SELECT *, "LHS"."vivienda_ref_id" AS "vivienda_ref_id"
        FROM (
          SELECT *, "LHS"."zonaloc_ref_id" AS "zonaloc_ref_id"
          FROM (
            SELECT *
            FROM (
              SELECT *, SUBSTR("geocodigo", 1, 2) AS "region"
              FROM "zonas"
            ) "q01"
            WHERE ("region" = '14')
          ) "LHS"
          INNER JOIN "viviendas" AS "RHS"
            ON ("LHS"."zonaloc_ref_id" = "RHS"."zonaloc_ref_id")
        ) "LHS"
        INNER JOIN "hogares" AS "RHS"
          ON ("LHS"."vivienda_ref_id" = "RHS"."vivienda_ref_id")
      ) "LHS"
      INNER JOIN "personas" AS "RHS"
        ON ("LHS"."hogar_ref_id" = "RHS"."hogar_ref_id")
    ) "q02"
    GROUP BY "p05"
  ) "LHS"
  LEFT JOIN (
    SELECT "valor" AS "p05", "descripcion" AS "p05_desc"
    FROM "variables_codificacion"
    WHERE ("variable" = 'p05')
  ) "RHS"
    ON ("LHS"."p05" = "RHS"."p05")
) "q03"
ORDER BY -"n"
""").df()

##    p05                                   p05_desc       n      p
## 0    1                                Red Pública  299464  77.82
## 1    4  Río, Vertiente, Estero, Canal, Lago, Etc.   38796  10.08
## 2    2                               Pozo o Noria   32150   8.35
## 3   98                                  No Aplica    8483   2.20
## 4    3                              Camión Aljibe    3558   0.92
## 5   99                              Valor Perdido    2386   0.62

Where we see that for the 14th region, 22.18% of the population doesn’t have access to the public tap water network. The advantage of this approach is that here we used pure SQL for querying, which would work in any language where we can pass SQL queries. The disadvantage is not having a tool such as dplyr and the need to learn the SQL syntax.

We shouldn’t forget to close the database connection.

con.close()

🔗 Concluding remarks

The examples shown here were simplified. Most of the inquiries I get are about data summaries at the sub-communal level, for example how to count the occurrences of a categorical variable in a neighborhood, as we did in our example of water access (i.e., well or public network). By using censo2017 we can extract useful information for evidence-based public policy when it comes to, for example, prioritizing budget decisions.