tse2sql.readers

TSE files parsing / reading module.

Classes

class tse2sql.readers.DistrictsReader(search_dir)

Read and parse the Distelec.txt file.

The Distelec.txt file is a CSV file in the form:

101001,SAN JOSE,CENTRAL,HOSPITAL
101002,SAN JOSE,CENTRAL,ZAPOTE
101003,SAN JOSE,CENTRAL,SAN FRANCISCO DE DOS RIOS
101004,SAN JOSE,CENTRAL,URUCA
101005,SAN JOSE,CENTRAL,MATA REDONDA
101006,SAN JOSE,CENTRAL,PAVAS
  • It lists the provinces, cantons and districts of Costa Rica.

  • It is encoded in ISO-8859-15 and uses Windows CRLF line terminators.

  • It is quite stable. It will only change when Costa Rica districts change (quite uncommon, but happens from time to time).

  • It is relatively small. Costa Rica has 81 cantons, with about 6 districts per canton. As of 2016, Costa Rica has 478 districts. As of this writing, the CSV file is 172KB in size.

  • The code is structured as follows:

    <province(1 digit)><canton(2 digits)><district(3 digits)>
    

    Please note that only the province code is unique. Both canton and district codes are reused, so each one is only meaningful together with the codes that precede it.

This class looks for the file and processes it completely in main memory in order to build the provinces, cantons and districts tables at the same time. The file is processed even if some lines are malformed; any such error is logged.
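
The behaviour described above can be sketched roughly as follows. This is not the code of DistrictsReader: the function name `parse_distelec`, the use of integer ids, and the choice to collect bad line numbers in a list are all assumptions made for illustration.

```python
import csv
import io
import logging

log = logging.getLogger(__name__)


def parse_distelec(lines):
    """Build provinces, cantons and districts tables from CSV lines.

    Hypothetical helper: it only illustrates the approach described in
    the documentation and is not the actual DistrictsReader code.
    """
    provinces, cantons, districts, bad_data = {}, {}, {}, []

    for line_number, row in enumerate(csv.reader(lines), 1):
        try:
            # Unpacking raises ValueError when a row does not have 4 fields.
            code, province, canton, district = (f.strip() for f in row)
            if len(code) != 6 or not code.isdecimal():
                raise ValueError('invalid code {!r}'.format(code))
        except ValueError as error:
            # Malformed lines are logged and skipped, parsing continues.
            log.error('Line %d is malformed: %s', line_number, error)
            bad_data.append(line_number)
            continue

        # <province(1 digit)><canton(2 digits)><district(3 digits)>
        # Integer ids are an assumption of this sketch.
        province_id = int(code[0])
        canton_id = int(code[1:3])
        district_id = int(code[3:6])

        provinces[province_id] = province
        cantons[(province_id, canton_id)] = canton
        districts[(province_id, canton_id, district_id)] = district

    return provinces, cantons, districts, bad_data


# A real file would be opened with encoding='iso-8859-15'; an in-memory
# sample with CRLF terminators is used here instead.
sample = io.StringIO(
    '101001,SAN JOSE,CENTRAL,HOSPITAL\r\n'
    '101002,SAN JOSE,CENTRAL,ZAPOTE\r\n'
    'not a valid line\r\n'
)
provinces, cantons, districts, bad_data = parse_distelec(sample)
```

All three tables are filled in a single pass, and the third line of the sample ends up in `bad_data` instead of aborting the parse.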

Inheritance

Inheritance diagram of DistrictsReader

analyse()

Return a small report with some basic analysis of the tables.

Returns

A dictionary with the analysis of the data in the parsed file: the number of provinces, cantons and districts, the longest name of each, and the bad lines found.

Return type

A dict of the form:

analysis = {
    'provinces': ...,
    'provinces_extended': ...,
    'province_largest': ...,
    'cantons': ...,
    'cantons_extended': ...,
    'cantons_largest': ...,
    'districts': ...,
    'districts_extended': ...,
    'districts_largest': ...,
    'bad_data': ...
}
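
As an illustration of the kind of figures such a report contains, the following sketch computes the entry count and the longest name for a single table. The function `summarize` and its `count` and `longest` keys are hypothetical; the exact meaning of the keys listed above (for example the `*_extended` ones) is not specified here, so the sketch does not try to reproduce them.

```python
def summarize(table):
    """Return the number of entries in a table and its longest name.

    Hypothetical helper, not part of tse2sql.
    """
    # default='' keeps the function usable on an empty table.
    longest = max(table.values(), key=len, default='')
    return {'count': len(table), 'longest': longest}


# Sample table using the key layout documented for the districts table
# (integer ids are an assumption of this sketch).
districts = {
    (1, 1, 1): 'HOSPITAL',
    (1, 1, 2): 'ZAPOTE',
    (1, 1, 3): 'SAN FRANCISCO DE DOS RIOS',
}
summary = summarize(districts)
```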

parse()

Open and parse the Distelec.txt file.

After parsing, the following attributes will be available:

Variables
  • provinces – Dictionary with province id as key and name as value.

  • cantons – Dictionary with a tuple (province id, canton id) as key and name as value.

  • districts – Dictionary with a tuple (province id, canton id, district id) as key and name as value.
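
Because canton and district ids are reused, lookups need the full tuple as key. A minimal sketch of resolving a Distelec code against three such tables, assuming the ids are stored as integers (the documentation does not state their type) and using a hypothetical helper named `describe_district`:

```python
def describe_district(code, provinces, cantons, districts):
    """Resolve a 6 digit Distelec code to its three names.

    Hypothetical helper; integer ids are an assumption.
    """
    province_id = int(code[0])
    canton_id = int(code[1:3])
    district_id = int(code[3:6])
    return (
        provinces[province_id],
        cantons[(province_id, canton_id)],
        districts[(province_id, canton_id, district_id)],
    )


# Tables shaped like the attributes documented above.
provinces = {1: 'SAN JOSE'}
cantons = {(1, 1): 'CENTRAL'}
districts = {(1, 1, 6): 'PAVAS'}

location = describe_district('101006', provinces, cantons, districts)
```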

class tse2sql.readers.VotersReader(search_dir, distelec)

Read and parse the PADRON_COMPLETO.txt file.

The PADRON_COMPLETO.txt file is a CSV file in the form:

100339724,109007,1,20231119,01031,JOSE                          ,DELGADO                   ,CORRALES
100429200,109006,2,20221026,01025,PAULA                         ,QUIROS                    ,QUIROS
100697455,101023,2,20150620,00073,CARMEN                        ,FALLAS                    ,GUEVARA
100697622,101020,2,20230219,00050,ANTONIA                       ,RAMIREZ                   ,CARDENAS
100720641,108002,2,20241119,00884,SOLEDAD                       ,SEQUEIRA                  ,MORA
100752764,403004,1,20151208,03731,EZEQUIEL                      ,LEON                      ,CALVO
100753244,210012,2,20161009,02599,CONSTANCIA                    ,ARIAS                     ,RIVERA
100753335,115001,2,20180211,01362,MARGARITA                     ,ALVARADO                  ,LAHMAN
100753618,111005,2,20220109,01168,ETELVINA                      ,PARRA                     ,SALAZAR
100763791,108007,1,20190831,00971,REINALDO                      ,MENDEZ                    ,BARBOZA
  • It lists all the voters in Costa Rica: their id, voting district, official sex (as if anyone should care), id expiration date, voting site, name, first family name and second family name.

  • It is encoded in ISO-8859-15 and uses Windows CRLF line terminators.

  • It is quite unstable. Deceased voters are removed and people who turn 18 are added.

  • It is very large. As of this writing, the CSV file is 364MB in size, with 3 178 364 lines (and thus, registered voters).

  • The sex code is as follows: 1 for men, 2 for women.

  • The format of the id expiration date is %Y%m%d, as follows:

    <year(4 digits)><month(2 digits)><day(2 digits)>
    

This class interprets the file on the fly, without loading it entirely into main memory. The file is processed even if some lines are malformed; any such error is logged.
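
To make the record layout concrete, here is a minimal sketch of interpreting a single line. The function `parse_voter` and the names of the dictionary keys are hypothetical, and VotersReader may represent records differently.

```python
from datetime import datetime


def parse_voter(line):
    """Interpret one PADRON_COMPLETO.txt line as a dictionary.

    Hypothetical helper, not the VotersReader implementation.
    """
    # Names are padded with spaces, so every field is stripped.
    fields = [field.strip() for field in line.split(',')]
    if len(fields) != 8:
        raise ValueError('expected 8 fields, got {}'.format(len(fields)))

    voter_id, district, sex, expiration, site, name, first, second = fields
    if sex not in ('1', '2'):
        raise ValueError('unknown sex code {!r}'.format(sex))

    return {
        'id': int(voter_id),
        'district': district,  # Distelec code, kept as text
        'sex': int(sex),  # 1: men, 2: women
        'id_expiration': datetime.strptime(expiration, '%Y%m%d').date(),
        'site': site,  # kept as text to preserve leading zeros
        'name': name,
        'family_name_1': first,
        'family_name_2': second,
    }


voter = parse_voter(
    '100339724,109007,1,20231119,01031,'
    'JOSE                          ,'
    'DELGADO                   ,CORRALES\r\n'
)
```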

Inheritance

Inheritance diagram of VotersReader

open()

Open voters file for on-the-fly parsing.
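
On-the-fly parsing of this kind can be approximated with a generator that reads one line at a time. The sketch below is an assumption about the general technique, not the behaviour of open(): the name `iter_voter_rows` and its `expected_fields` parameter are hypothetical.

```python
import logging
import os
import tempfile

log = logging.getLogger(__name__)


def iter_voter_rows(path, expected_fields=8):
    """Lazily yield the stripped fields of each well-formed line.

    Hypothetical helper: only one line is held in memory at a time.
    """
    with open(path, encoding='iso-8859-15') as handle:
        for line_number, line in enumerate(handle, 1):
            fields = [field.strip() for field in line.split(',')]
            if len(fields) != expected_fields:
                # Malformed lines are logged and skipped.
                log.error('Line %d is malformed', line_number)
                continue
            yield fields


# Small stand-in for the real file, with CRLF line terminators.
sample = (
    b'100339724,109007,1,20231119,01031,JOSE      ,DELGADO   ,CORRALES\r\n'
    b'this line is malformed\r\n'
    b'100429200,109006,2,20221026,01025,PAULA     ,QUIROS    ,QUIROS\r\n'
)

with tempfile.TemporaryDirectory() as directory:
    path = os.path.join(directory, 'PADRON_COMPLETO.txt')
    with open(path, 'wb') as handle:
        handle.write(sample)
    rows = list(iter_voter_rows(path))
```

With the real file, the rows would be consumed one by one in a loop rather than collected into a list, so memory use stays flat regardless of the file size.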