Skip to contents

Geocoding: from addresses to spatial coordinates

Once you have a table (data.frame) with addresses, geolocating the data with {geocodebr} can be done in two simple steps:

  1. The first step is to use the setup_address_fields() function to declare the names of the columns in your input data.frame that correspond to each field of the addresses.
library(geocodebr)

# read input sample data
input_df <- read.csv(system.file("extdata/small_sample.csv", package = "geocodebr"))

# 1st step: indicate the columns describing the address fields
fields <- geocodebr::setup_address_fields(
  logradouro = "nm_logradouro",
  numero = "Numero",
  cep = "Cep",
  bairro = "Bairro",
  municipio = "nm_municipio",
  estado = "nm_uf"
  )
  1. The second step is to use the geocode() function to find the geographic coordinates of input addresses.
# 2nd step: geocode
df <- geocodebr::geocode(
  addresses_table = input_df,
  address_fields = fields,
  n_cores = 1,
  progress = FALSE
  )

head(df)
#>   id            nm_logradouro Numero       Cep               Bairro
#> 1  1 RUA MARIA LUCIA PACIFICO     17 26042-730           SANTA RITA
#> 2  2      RUA LEOPOLDINA TOME     46 25030-050           CENTENARIO
#> 3  3          RUA DONA JUDITE      0 23915-700          CAPUTERA II
#> 4  4     RUA ALEXANDRE AMARAL      0 23098-120           SANTISSIMO
#> 5  5                AVENIDA E    300 23860-000         PRAIA GRANDE
#> 6  6      RUA PRINCESA ISABEL    263 69921-026 ESTACAO EXPERIMENTAL
#>      nm_municipio code_muni          nm_uf       lon        lat match_type
#> 1     NOVA IGUACU   3303500 RIO DE JANEIRO -43.47118 -22.695496       en01
#> 2 DUQUE DE CAXIAS   3301702 RIO DE JANEIRO -43.31134 -22.779173       en01
#> 3  ANGRA DOS REIS   3300100 RIO DE JANEIRO -44.20848 -22.978837       er01
#> 4  RIO DE JANEIRO   3304557 RIO DE JANEIRO -43.51150 -22.868992       er01
#> 5     MANGARATIBA   3302601 RIO DE JANEIRO -43.97214 -22.929864       en01
#> 6      RIO BRANCO   1200401           ACRE -67.83559  -9.963436       en01
#>   precision
#> 1    number
#> 2    number
#> 3    street
#> 4    street
#> 5    number
#> 6    number

obs. Note that the first time the user runs this function, {geocodebr} will download a few files and store them locally. This way, the data only needs to be downloaded once. More info about data caching below.

The output coordinates use the official geodetic reference system used in Brazil: SIRGAS2000, CRS(4674). The results of {geocodebr} are classified into six broad precision categories depending on how exactly each input address was matched with CNEFE data. The accuracy of the results are indicated in two columns of the output: precision and match_type. More information below.

Precision categories:

The results of {geocodebr} are classified into six broad precision categories:

  • “numero”
  • “numero_interpolado”
  • “rua”
  • “cep”
  • “bairro”
  • “municipio”
  • NA (not found)

Each precision level can be disaggregated into more refined match types.

Match Type

The column match_type provides more refined information on how exactly each input address was matched with CNEFE. In every category, {geocodebr} takes the average latitude and longitude of the addresses included in CNEFE that match the input address based on combinations of different fields. In the strictest case, for example, the function finds a deterministic match for all of the fields of a given address ("estado", "municipio", "logradouro", "numero", "cep", "localidade"). Think for example of a building with several apartments that match the same street address and number. In such case, the coordinates of the apartments will differ very slightly, and {geocodebr} takes the average of those coordinates. In a less rigorous example, in which only the fields ("estado", "municipio", "logradouro", "localidade") are matched, {geocodebr} calculates the average coordinates of all the addresses in CNEFE along that street and which fall within the same neighborhood.

The complete list of precision levels, their corresponding match type categories and the fields considered in each category are described below:

  • precision: “numero”
    • match_type:
      • en01: logradouro, numero, cep e bairro
      • en02: logradouro, numero e cep
      • en03: logradouro, numero e bairro
      • en04: logradouro e numero
      • pn01: logradouro, numero, cep e bairro
      • pn02: logradouro, numero e cep
      • pn03: logradouro, numero e bairro
      • pn04: logradouro e numero
  • precision: “numero_interpolado”
    • match_type:
      • ei01: logradouro, numero, cep e bairro
      • ei02: logradouro, numero e cep
      • ei03: logradouro, numero e bairro
      • ei04: logradouro e numero
      • pi01: logradouro, numero, cep e bairro
      • pi02: logradouro, numero e cep
      • pi03: logradouro, numero e bairro
      • pi04: logradouro e numero
  • precision: “rua” (when input number is missing ‘S/N’)
    • match_type:
      • er01: logradouro, cep e bairro
      • er02: logradouro e cep
      • er03: logradouro e bairro
      • er04: logradouro
      • pr01: logradouro, cep e bairro
      • pr02: logradouro e cep
      • pr03: logradouro e bairro
      • pr04: logradouro
  • precision: “cep”
    • match_type:
      • ec01: municipio, cep, localidade
      • ec02: municipio, cep
  • precision: “bairro”
    • match_type:
      • eb01: municipio, localidade
  • precision: “municipio”
    • match_type:
      • em01: municipio

Note: Match types starting with ‘p’ use probabilistic matching of the logradouro field, while types starting with ‘e’ use deterministic matching only. Match types with probabilistic matching are not implemented in {geocodebr} yet.

Data cache

The first time the user runs the geocode() function, {geocodebr} will download a few reference files and store them locally. This way, the data only needs to be downloaded once. Mind you that these files require approximately 4GB of space in your local drive.

The package includes the following functions to help users manage cached files:

  • get_cache_dir(): returns the path to where the cached data is stored. By default, files are cached in the package directory.
  • set_cache_dir(): set a custom directory to be used. This configuration is persistent across different R sessions.
  • list_cached_data(): list all files currently cached
  • clean_cache_dir(): delete all files of the cache directory used by {geocodebr}