Geocoding is the process of finding a point in space, usually represented by a pair of coordinates, from a location address. The {geocodebr} package efficiently geocodes Brazilian addresses using the National Registry of Addresses for Statistical Purposes (Cadastro Nacional de Endereços para Fins Estatísticos, CNEFE), a data set collected and published by IBGE, Brazil's official statistics and geography office, containing the addresses of more than 100 million households and establishments in Brazil.
Basic usage
Before using {geocodebr}, make sure it is installed on your computer. You can install either the latest stable version from CRAN…
install.packages("geocodebr")
… or the development version from GitHub.
# install.packages("pak")
pak::pak("ipeaGIT/geocodebr")
Then attach it to the current R session:
library(geocodebr)
The main function of the package is geocode(), which takes a data frame of addresses as input and returns the same data frame with the latitude and longitude of each matched address, as well as two columns indicating the precision level of the matches. To demonstrate its usage, the package ships with a few sample data sets. In the example below, we use a small data set that contains addresses with commonly seen issues, such as missing information and mistyped fields.
Note: Running the function for the first time may take a while, since {geocodebr} needs to download the CNEFE data, which totals about 5.5 GB. The data is stored locally, so it is downloaded only once. See the Data cache section below for details.
df <- read.csv(
  system.file("extdata/small_sample.csv", package = "geocodebr")
)
result <- geocodebr::geocode(
  addresses_table = df,
  address_fields = geocodebr::setup_address_fields(
    logradouro = "nm_logradouro",
    numero = "Numero",
    cep = "Cep",
    bairro = "Bairro",
    municipio = "nm_municipio",
    estado = "nm_uf"
  ),
  progress = FALSE
)
head(result)
#> id nm_logradouro Numero Cep Bairro
#> 1 1 RUA MARIA LUCIA PACIFICO 17 26042-730 SANTA RITA
#> 2 2 RUA LEOPOLDINA TOME 46 25030-050 CENTENARIO
#> 3 3 RUA DONA JUDITE 0 23915-700 CAPUTERA II
#> 4 4 RUA ALEXANDRE AMARAL 0 23098-120 SANTISSIMO
#> 5 5 AVENIDA E 300 23860-000 PRAIA GRANDE
#> 6 6 RUA PRINCESA ISABEL 263 69921-026 ESTACAO EXPERIMENTAL
#> nm_municipio code_muni nm_uf lon lat match_type
#> 1 NOVA IGUACU 3303500 RIO DE JANEIRO -43.47118 -22.695496 en01
#> 2 DUQUE DE CAXIAS 3301702 RIO DE JANEIRO -43.31134 -22.779173 en01
#> 3 ANGRA DOS REIS 3300100 RIO DE JANEIRO -44.20848 -22.978837 er01
#> 4 RIO DE JANEIRO 3304557 RIO DE JANEIRO -43.51150 -22.868992 er01
#> 5 MANGARATIBA 3302601 RIO DE JANEIRO -43.97214 -22.929864 en01
#> 6 RIO BRANCO 1200401 ACRE -67.83559 -9.963436 en01
#> precision
#> 1 number
#> 2 number
#> 3 street
#> 4 street
#> 5 number
#> 6 number
The output coordinates use Brazil's official geodetic reference system, SIRGAS 2000 (EPSG:4674). The results of {geocodebr} are classified into six broad precision categories, depending on how exactly each input address was matched with CNEFE data. The accuracy of the results is indicated in two columns of the output: precision and match_type. More information below.
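Because the coordinates are returned as plain lon and lat columns, they can be turned into a spatial object when needed. The sketch below shows one possible way to do that with the {sf} package, which is an assumption here and not something {geocodebr} requires. The toy data frame reuses three rows of the output printed above.

```r
# Toy copy of three geocoded rows from the output printed above
result_toy <- data.frame(
  id  = 1:3,
  lon = c(-43.47118, -43.31134, -44.20848),
  lat = c(-22.695496, -22.779173, -22.978837)
)

# Drop unmatched rows (NA coordinates) before building geometries
matched <- result_toy[!is.na(result_toy$lon) & !is.na(result_toy$lat), ]

# {sf} is an assumed extra dependency; skip quietly if it is not installed
if (requireNamespace("sf", quietly = TRUE)) {
  # 4674 is the EPSG code for SIRGAS 2000
  result_sf <- sf::st_as_sf(matched, coords = c("lon", "lat"), crs = 4674)
  print(sf::st_crs(result_sf)$epsg)
}
```

Declaring the reference system at this point avoids silently treating the coordinates as WGS 84 in later spatial operations.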
Precision categories:
The results of {geocodebr} are classified into six broad precision categories, plus NA for addresses that were not found:
- “numero”
- “numero_interpolado”
- “rua”
- “cep”
- “bairro”
- “municipio”
- NA (not found)
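One way to work with these categories is to treat them as an ordered scale. The base R sketch below runs on a made-up precision vector. The level names come from the list above, and the ordering from most to least precise is an assumption based on the order in which they are listed. If your output uses different labels (the printed example above shows "number" and "street"), adjust the levels accordingly.

```r
# Precision categories, assumed to be listed from most to least precise
precision_levels <- c("numero", "numero_interpolado", "rua",
                      "cep", "bairro", "municipio")

# Hypothetical precision column from a geocoded table (NA = not found)
precision <- c("numero", "rua", "municipio", NA, "cep", "numero")

# Rank each result: 1 = most precise, 6 = least precise, NA = not found
precision_rank <- match(precision, precision_levels)

# Flag results matched at street level or better
street_or_better <- !is.na(precision_rank) &
  precision_rank <= match("rua", precision_levels)

# Count results per category, including the unmatched ones
table(precision, useNA = "ifany")

# Positions of the rows that pass the filter
which(street_or_better)
```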
Each precision level can be disaggregated into more refined match types.
Match Type
The column match_type provides more refined information on how exactly each input address was matched with CNEFE. In every category, {geocodebr} takes the average latitude and longitude of the addresses included in CNEFE that match the input address based on combinations of different fields. In the strictest case, for example, the function finds a deterministic match for all of the fields of a given address ("estado", "municipio", "logradouro", "numero", "cep", "localidade"). Think, for example, of a building with several apartments that share the same street address and number. In that case, the coordinates of the apartments differ very slightly, and {geocodebr} takes the average of those coordinates. In a less rigorous example, in which only the fields "estado", "municipio", "logradouro" and "localidade" are matched, {geocodebr} calculates the average coordinates of all the addresses in CNEFE along that street that fall within the same neighborhood.
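The averaging step can be illustrated with a small base R sketch. This is not the package's implementation: the cnefe_toy table, its column names and its coordinates are made up for illustration, and exact string equality stands in for the deterministic match.

```r
# Made-up stand-in for CNEFE: three units in one building plus a neighbor
cnefe_toy <- data.frame(
  logradouro = c("RUA A", "RUA A", "RUA A", "RUA A"),
  numero     = c(10, 10, 10, 20),
  localidade = c("CENTRO", "CENTRO", "CENTRO", "CENTRO"),
  lon        = c(-43.2001, -43.2002, -43.2003, -43.2100),
  lat        = c(-22.9001, -22.9002, -22.9003, -22.9100)
)

# Average the coordinates of all reference rows that share the given fields
average_by <- function(reference, fields) {
  aggregate(reference[c("lon", "lat")], by = reference[fields], FUN = mean)
}

# Strict case: street and number must match, so the building is averaged alone
by_number <- average_by(cnefe_toy, c("logradouro", "numero", "localidade"))

# Looser case: every address on the street within the locality is averaged
by_street <- average_by(cnefe_toy, c("logradouro", "localidade"))

# Look up an input address against the strict table
input <- data.frame(logradouro = "RUA A", numero = 10, localidade = "CENTRO")
merge(input, by_number)
```

In this toy setup, an input without a usable number could only be looked up in by_street, where the resulting point is pulled toward wherever most addresses on the street are located.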
The complete list of precision levels, their corresponding match types and the fields considered in each match type is given below:
- precision: “numero”
  - match_type:
    - en01: logradouro, numero, cep and bairro
    - en02: logradouro, numero and cep
    - en03: logradouro, numero and bairro
    - en04: logradouro and numero
    - pn01: logradouro, numero, cep and bairro
    - pn02: logradouro, numero and cep
    - pn03: logradouro, numero and bairro
    - pn04: logradouro and numero
- precision: “numero_interpolado”
  - match_type:
    - ei01: logradouro, numero, cep and bairro
    - ei02: logradouro, numero and cep
    - ei03: logradouro, numero and bairro
    - ei04: logradouro and numero
    - pi01: logradouro, numero, cep and bairro
    - pi02: logradouro, numero and cep
    - pi03: logradouro, numero and bairro
    - pi04: logradouro and numero
- precision: “rua” (when the input number is missing, ‘S/N’)
  - match_type:
    - er01: logradouro, cep and bairro
    - er02: logradouro and cep
    - er03: logradouro and bairro
    - er04: logradouro
    - pr01: logradouro, cep and bairro
    - pr02: logradouro and cep
    - pr03: logradouro and bairro
    - pr04: logradouro
- precision: “cep”
  - match_type:
    - ec01: municipio, cep, localidade
    - ec02: municipio, cep
- precision: “bairro”
  - match_type:
    - eb01: municipio, localidade
- precision: “municipio”
  - match_type:
    - em01: municipio
Note: Match types starting with ‘p’ use probabilistic matching of the logradouro field, while types starting with ‘e’ use deterministic matching only. Match types with probabilistic matching are not implemented in {geocodebr} yet.
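Since the match type codes follow a regular pattern, they can be decoded programmatically. The helper below, decode_match_type(), is hypothetical and not part of {geocodebr}. It only encodes the naming scheme listed above: the first letter gives the matching method and the second letter gives the precision level.

```r
# Hypothetical helper (not part of {geocodebr}): decode codes such as "en01"
decode_match_type <- function(match_type) {
  # First letter: matching method
  method <- c(e = "deterministic", p = "probabilistic")
  # Second letter: precision level
  level <- c(n = "numero", i = "numero_interpolado", r = "rua",
             c = "cep", b = "bairro", m = "municipio")
  data.frame(
    match_type = match_type,
    method     = unname(method[substr(match_type, 1, 1)]),
    precision  = unname(level[substr(match_type, 2, 2)]),
    stringsAsFactors = FALSE
  )
}

# Unmatched addresses (NA) and unknown codes decode to NA
decoded <- decode_match_type(c("en01", "er04", "pi02", "ec02", NA))
decoded
```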
Data cache
The first time the user runs the geocode() function, {geocodebr} will download a few reference files and store them locally. This way, the data only needs to be downloaded once. Note that these files require approximately 4 GB of space on your local drive.
The package includes the following functions to help users manage cached files:
- get_cache_dir(): returns the path where the cached data is stored. By default, files are cached in the package directory.
- set_cache_dir(): sets a custom directory to be used. This configuration is persistent across different R sessions.
- list_cached_data(): lists all files currently cached.
- clean_cache_dir(): deletes all files in the cache directory used by {geocodebr}.