# Aquisição de dados, web scraping e o processo de KDD: um estudo prático com os dados da COVID-19 no Brasil.

Você tem interesse em data science e está sensibilizado com a situação da COVID-19 no Brasil? Neste post, irei explorar o conceito de data acquisition e demonstrar como utilizar uma técnica simples de web scraping para obter dados oficiais sobre o avanço da doença. Vale dizer que, com alguma adaptação no código, a estratégia aqui empregada pode ser facilmente estendida a dados de outros países ou ainda a outras aplicações. Então ponha sua máscara, lave suas mãos e inicie a leitura agora mesmo!

Devo iniciar este texto, antes de tudo, oferecendo os mais sinceros sentimentos a todas as vítimas da COVID-19. De fato, é em memória destes que devemos buscar motivação para o desenvolvimento da ciência, em todas as suas áreas e expressões, sem esquecermos de que se ela existe é porque há uma humanidade que a sustenta. Em relação à ciência de dados, é através dela que podemos minimizar fake news e contribuir, de alguma forma, com uma sociedade mais informada, sensata e ponderada.

Iniciei este projeto de exploração de dados ao perceber, há algumas semanas atrás, que havia pouca informação estratificada sobre o número de casos da COVID-19 no Brasil em bases oficiais, tais como a da Organização Mundial da Saúde (WHO – World Health Organization) ou do painel mantido pela Universidade John Hopkins. Como estratificado, aliás, entende-se o conjunto de dados que pode ser observado em diferentes camadas. Ou seja, para um país de dimensões continentais, como o Brasil, saber o número total de casos no país é apenas informativo. Para que tal informação seja funcional, ela deve ser estratificada por regiões, estados, ou cidades.

De modo a proporcionar tanto uma visão geral do problema quanto direcionar o desenvolvimento da análise, estabeleci abaixo alguns objetivos para este estudo dos quais espero que o leitor também se aproprie:

• Adquirir uma visão geral do processo de KDD e compreender onde a aquisição de dados está inserida.
• Explorar técnicas de aquisição de dados de modo que possam ser facilmente empregadas em problemas e casos similares.
• Obter, a partir de bases de dados oficiais, o número de casos e óbitos por COVID-19 no Brasil, estratificado por estado.

Nas seções a seguir, exploraremos cada um desses itens buscando construir os conceitos necessários e implementá-los de maneira prática em um ambiente de desenvolvimento em Python. Prontos?

## KDD – Descobrindo conhecimento em bases de dados

A descoberta de conhecimento em bases de dados, mais conhecida por sua sigla KDD, do inglês Knowledge Discovery in Databases, é o processo pelo qual se busca padrões compreensíveis, válidos, e relevantes a partir de bases de dados. É um processo interativo e iterativo, que compreende muitas das técnicas e processos conhecidos atualmente como ciências de dados.

Antes de explorarmos as principais etapas do processo de KDD, porém, vale diferenciarmos três termos-chaves: dados, informações e conhecimento. Cada etapa do processo de KDD, aliás, ocupa-se de processar cada um desses termos, ilustrados no diagrama da Fig. 1 a partir da problemática em que estamos trabalhando.

É a partir da manipulação da informação que adquirimos conhecimento. Este, por sua vez, está íntima e etimologicamente ligado ao que chamamos de “processos cognitivos” — a percepção, a contextualização e o aprendizado — e, por conseguinte, à inteligência.

Uma vez compreendidos estes três conceitos, podemos agora avançar para as três etapas fundamentais do processo de KDD. Estas nada mais são do que conjuntos de tarefas respectivamente dedicadas à obtenção e exploração dos dados, informações e conhecimento, conforme ilustrado na Fig. 2, abaixo:

O processo de KDD, portanto, é constituído de etapas e cada qual, de tarefas. As tarefas de data mining, por exemplo, fazem parte da etapa de processamento da informação, enquanto que as tarefas relacionadas à visualização de dados, analytics, e tomada de decisão constituem a última etapa, pela qual se extrai o conhecimento.

Neste artigo, vamos nos deter somente à tarefa de aquisição de dados (data acquisition), fundamental a todo e qualquer processo de KDD. De modo geral, não existe um método pré-definido e generalista para a obtenção dos dados. Em um projeto de médio e longo prazos, normalmente desenvolve-se um modelo de dados e uma arquitetura robusta para a sua coleta e armazenamento. Já em projetos de curto prazo, ou então em análises como esta, utilizam-se as fontes imediatamente disponíveis e acessíveis que mais estejam relacionadas ao problema.

Tendo estes conceitos fundamentais em mente, estamos prontos para seguir em direção ao nosso propósito: a aquisição de dados oficiais sobre o número de casos da COVID-19.

## Definindo uma estratégia de aquisição de dados

Como vimos, uma estratégia de aquisição de dados só fará sentido se for construída a partir da perspectiva de um processo completo de KDD. Isto é, antes de sairmos coletando dados por aí, precisamos ter bem definido qual é o propósito disto tudo. Ou ainda, qual é o problema que estamos tentando resolver ou elucidar?

O modelo de referência CRISP-DM, representado na Fig.3, pode ser utilizado como uma bússula ao percorrermos um processo de KDD. Você se recorda que anteriormente definimos o KDD como um processo interativo e iterativo?

Pois bem, a interação pode ser definida como a ação mútua ou compartilhada entre dois ou mais agentes. Neste caso, a interação pode se referir tanto entre as pessoas detentoras de dados e informações, ou entre as etapas. As setas curtas, em ambos os sentidos entre os primeiros passos da Fig. 3, representam justamente a interação necessária no processo. Ouso dizer que tal interação é o maior desafio dos recém-chegados ao mundo do KDD ou data science: há uma ilusão inicial de que os passos devem ser seguidos em um fluxo contínuo, o que logo decorre em uma sensação de se estar fazendo algo errado. Muito pelo contrário, é a compreensão do negócio que possibilita a compreensão dos dados, que gera novos insights sobre o negócio e assim por diante.

A iteração, por sua vez, está associada à repetição do mesmo processo. As setas maiores, na Fig.3, formam um círculo que simboliza esta repetição. Inicia-se no primeiro passo e retorna-se, após o desenvolvimento, novamente a ele. Enquanto estiver ativo, um projeto de data science ou KDD nunca ficará estagnado.

As caixas destacadas, na Fig. 3, representam os passos relacionados à aquisição de dados, sendo nelas que iremos nos deter a partir daqui, embora de modo um tanto simplificado.

De acordo com o CRISP-DM, o primeiro passo — a compreensão do negócio — é fundamental para o sucesso de todo o projeto e compreende (i) a definição dos objetivos do negócio; (ii) a avaliação da situação atual, incluindo inventários e análises de risco, bem como quais dados e informações já estão disponíveis; (iii) a determinação dos objetivos em se aplicar data mining, assim como a definição de quais serão os critérios de sucesso; e (iv) a elaboração de um plano de projeto.

O segundo passo — o entendimento dos dados — compreende (i) a coleta inicial dos dados e (ii) sua descrição detalhada; (iii) a exploração inicial dos dados; e (iv) uma avaliação da qualidade desses dados. Lembre-se de que estes dois passos iniciais ocorrem de maneira interativa e, idealmente, envolvendo especialistas de domínio e de negócio.

O terceiro passo inicia-se com a validação do plano de projeto pelo cliente ou pelo sponsor e caracteriza-se pela preparação dos dados. Em linhas gerais, os esforços necessários nesta fase tornam-se maiores e, portanto, suas alterações devem ser mais brandas. As principais tarefas desta etapa compreendem a seleção e limpeza dos dados, a construção e a adequação do formato, e a integração de dados (merging). Todas essas tarefas são normalmente referidas como pré-processamento dos dados.

Sua estratégia de aquisição de dados começará a ser definida tão logo você tenha as primeiras interações entre a compreensão do negócio (ou do problema) e o entendimento dos dados disponíveis. Para tanto, considero essencialmente dois cenários possíveis:

1. Há controle sobre a origem dos dados: você pode, de algum modo, gerenciar e controlar as fontes geradoras dos dados (sensores, medidores, pessoas, bancos de dados, etc.). Tome, como exemplo, a atividade de um enfermeiro-chefe: ele pode supervisionar, auditar, estabelecer protocolos e realizar o registro de todos os dados de pacientes em sua enfermaria. É importante salientar que o controle é sobre a origem ou registro dos dados, e não sobre o evento ao qual os dados estão relacionados.
Este é geralmente o cenário de projetos de grande complexidade ou duração, onde se estabelece um modelo de dados e se estabelecem meios para garantir sua consistência e integridade.
2. Não há controle sobre a origem dos dados: esta é a situação mais comum em projetos de curta duração ou de interesse ocasional, assim como projetos, análises ou estudos perenes que utilizem ou demandem dados de outras fontes que não as próprias. Seja qual for a iniciativa, a estratégia de aquisição de dados deve ser ainda mais fortalecida, dado que qualquer alteração — que está além do seu controle — pode aumentar os custos de sua análise ou mesmo inviabilizar o projeto.
Este cenário é ilustrado pelo nosso próprio estudo: como obter o número estratificado de casos da COVID-19, no Brasil, se não temos controle sobre a produção e divulgação destes dados?

Espere encontrar o cenário (1) ao atuar em ambientes corporativos e instituições, ou ao liderar pesquisas longitudinais ou então processos de fabricação. De outro modo, é bem provável que você tenha de lidar com o cenário (2). Neste caso, sua estratégia poderá compor-se de subscrições (assinatura de serviços), acordos de cessão de dados, ou ainda de técnicas que busquem e compilem tais dados a partir da Internet, desde que devidamente licenciados. A atividade de web scraping é uma destas técnicas, como veremos a seguir.

### Aquisição de dados por web scraping: obtendo os dados via painel oficial do Ministério da Saúde

No início de Maio/2020, porém, tive uma desagradável surpresa: as tabelas que até então eu utilizara para extrair os dados de meu interesse (para quem tiver curiosidade, a biblioteca PDFplumber é uma excelente ferramenta) foram substituídas por figuras, nos novos boletins, o que inviabilizou o meu método de extração.

Fiz questão de discorrer em detalhes os parágrafos acima por uma simples razão: demonstrar, mais uma vez, o processo interativo inerente à etapa de aquisição de dados, no KDD. Além disso, quis evidenciar os riscos e incertezas ao assumirmos um projeto onde não temos controle sobre a origem dos dados. Nessas situações, é sempre bom contar com alternativas, estabelecer parcerias ou convênios, e buscar construir sua própria base de dados.

Ao passo em que eu buscava uma nova estratégia, o Ministério da Saúde passou a disponibilizar os dados já compilados em um único arquivo CSV — dias depois alterando o formato para XLSX e variando o nome do arquivo a cada dia. Nos parágrafos a seguir, detalharei como adequei meu processo e meu código para esta nova situação.

A propósito, a obtenção automática de dados e informações a partir de páginas e conteúdos disponibilizados na Internet é denominada web scraping. Tal técnica pode ser implementada de maneira bastante simples, utilizando bibliotecas disponíveis em Python, conforme esquematizado na Fig. 4.

Chegamos finalmente à parte prática — e talvez a mais esperada — deste texto. Seguindo as etapas definidas na Fig. 4, começamos por (1) visitar o Painel Coronavírus Brasil e (2) identificar que o nosso interesse está nos dados disponibilizados ao clicarmos no botão “Arquivo CSV”, no canto superior direito da página. Ao ativarmos a ferramenta de inspeção de páginas do Google Chrome (3), localizamos o código correspondente ao item na página HTML, conforme mostrado na Fig. 5. Lembrando que a ferramenta de inspeção é ativada pelo menu Inspecionar, ao clicarmos com o botão direito sobre algum elemento da página, ou pelo atalho ctrl+shift+I.

O passo seguinte (4) é acessar estes dados a partir de linhas de código em Python, o que pode ser feito através de cadernos do Jupyter, scripts, ou através de uma IDE. Em qualquer dos casos, a ideia é emular o acesso ao portal desejado e a interação com a página web, utilizando para isso a biblioteca Selenium.

No código abaixo, inicio criando um ambiente virtual para em seguida acessar o site de interesse com o emulador do Google Chrome:

# Declarações iniciais
from pyvirtualdisplay import Display
from selenium import webdriver

# Definição de parâmetros
url = 'https://covid.saude.gov.br'
## A linha abaixo pode ser suprimida em alguns ambientes:
chromeDriverPath = '~/anaconda3/envs/analytics3/'

# Inicializa o display (tela virtual):
display = Display(visible=0, size=(800,600))
display.start()

driver = webdriver.Chrome()

# Lê o conteúdo da página, armazenando-o em UTF-8:
driver.get(url)
page = driver.page_source.encode('utf-8')


Uma vez realizada a leitura da página, seguimos para o passo (5) da Fig. 4, onde iteramos o processo até garantir a obtenção dos dados na forma desejada. Para tanto, podemos iniciar verificando o tamanho da página carregada (se for nulo, significa que algo deu errado) e suas primeiras linhas:

# Qual é o tamanho da página carregada?
print(len(page))

# Qual é o tipo de dados?
print(type(page))

# Qual é o conteúdo das primeiras posições do fluxo de bytes?
print(page[:2000])


Na maioria das vezes, o passo seguinte seria explorar o conteúdo HTML através de ferramentas conhecidas como parsers — a biblioteca BeautifulSoup cumpre muito bem esta função em Python. No nosso caso, considerando que os dados de interesse não estão na página web, mas sim resultam de uma ação nesta página, iremos seguir utilizando apenas os métodos do Selenium para emular o clique sobre o botão, automaticamente fazendo o download do arquivo desejado para a pasta padrão do sistema:

## Caminho obtido a partir da inspeção no Chrome:
xpathElement = '/html/body/app-root/ion-app/ion-router-outlet/app-home/ion-content/div[1]/div[2]/ion-button'

## Elemento correspondente ao botão "Arquivo CSV":



import os
import glob

## Obtendo o nome do arquivo XLSX mais recente:
latest_file = max(list_of_files, key=os.path.getctime)
print(latest_file)


A partir deste ponto, podemos considerar finalizada a tarefa de web scraping, ao mesmo tempo em que adentramos na fase de preparação dos dados (se necessário, reveja a Fig. 3).

Vamos considerar que estamos interessados na quantidade de casos por estado, em relação à sua população, assim como nas taxas de letalidade (número de óbitos em relação ao número de casos) e de mortalidade (número de óbitos em relação à população). As linhas de código abaixo realizam o pré-processamento dos dados obtidos de modo a oferecer as informações desejadas.

## Leitura dos dados

## Obtendo a última data registrada:
lastDay = covidData.data.max()

## Selecionando o dataframe apenas com os dados do último dia
covidLastDay = covidData[(covidData.data == lastDay) &
(covidData.municipio.isna() == True) &
(covidData.populacaoTCU2019.isna() == False)]

## Selecionando apenas as colunas de interesse:


A etapa de pré-processamento é concluída ao gerarmos algumas features adicionais, a partir dos dados pré-existentes:

## Cópia do dataframe antes de manipulá-lo:
normalCovid = covidLastDay.copy()

## Taxa de contaminação (% de casos pela população de cada estado)
normalCovid['contamRate'] = (normalCovid['casosAcumulado'] / normalCovid['populacaoTCU2019']) * 100
## Taxa de letalidade (% de óbtiso pelo núm. casos)
normalCovid['deathRate'] = (normalCovid['obitosAcumulado'] / normalCovid['populacaoTCU2019']) * 100


A partir deste ponto, podemos fazer buscas na base de dados pré-processada:

normalCovid.query("obitosAcumulado > 1000")


Aqui encerramos a etapa de aquisição de dados. Se seguirmos a metodologia CRISP-DM, os passos seguintes poderiam ser tanto a construção de modelos, análises ou visualizações. Para ilustrar uma possível conclusão deste processo de KDD, a Fig. 6 apresenta um gráfico do tipo Four-Quadrant correlacionando o percentual de casos versus a taxa de letalidade nos diferentes estados do Brasil (o código para gerá-lo está disponível no GitHub).

O gráfico acima é o resultado de todo o processo percorrido: os dados foram obtidos via web scraping, as informações foram geradas a partir da manipulação desses dados, e o conhecimento foi adquirido ao se estruturar a informação. O Amazonas (AM) é o estado brasileiro com o maior número de infecções por habitante, assim como a maior taxa de letalidade. São Paulo (SP) é o estado com o maior número acumulado de óbitos, porém com um número de infecções relativamente baixo considerando toda a sua população.

## Considerações finais

Neste artigo busquei oferecer uma visão geral do processo de KDD e de como a aquisição de dados pode ser implementada a partir da técnica de web scraping, principalmente em se tratando de pequenos projetos ou análises.

Talvez você tenha se perguntado, ao longo da leitura, se não seria muito mais fácil apenas visitar o portal e simplesmente fazer o download dos dados lá disponíveis. Sim, seria! Contudo, o que se espera em um projeto de data science é que a obtenção dos dados seja a mais automática possível, dando espaço para que se avance com análises preditivas cujos modelos demandem atualização diária ou constante dos dados. Além disso, compreender temas complexos e ganhar confiança sobre a nossa própria técnica torna-se muito mais efetivo — e quiçá divertido — com problemas relativamente mais simples.

Por fim, devo lembrá-lo a sempre agir com ética na coleta e na manipulação de dados. Automatizar a coleta de dados de domínio público, sem ônus à sua administração, caracteriza-se como uma atividade completamente legal. Entretanto, se estiver em dúvida sobre quando é permitido ou não o uso de algoritmos deste tipo, não hesite em consultar especialistas ou mesmo os proprietários dos dados.

Se você acompanhou o artigo até aqui, gostaria muito de conhecer a sua opinião. Além da enquete abaixo, sinta-se à vontade para deixar um comentário ou escrever-me com dúvidas, críticas ou sugestões.

# Data acquisition, web scraping, and the KDD process: a practical study with COVID-19 data in Brazil

Are you interested in data science, and are you aware of the situation of COVID-19 in Brazil? In this post, I will explore the concept of data acquisition and demonstrate how to use a simple web scraping technique to obtain official data on the progress of the disease. It is worth saying that, with some adaptation in the code, the strategy used here can be easily extended to data from other countries or even to other applications. So put on your mask, wash your hands, and start reading it right now!

Clique aqui para a versão em português deste post.

I must start this text offering the most sincere sentiments to all victims of COVID-19. It is in their memory that we must seek motivation for the development of science without forgetting that if it exists, it is because there is a humanity that sustains it. Concerning data science, it is through it that we can minimize fake news and somehow contribute to a more informed, sensible, and thoughtful society.

I started this data analysis project when I realized, a few weeks ago, that there was little stratified information on the number of cases of COVID-19 in Brazil on official bases, such as that of the World Health Organization (WHO – World Health Organization) or the panel maintained by John Hopkins University. As stratified, in fact, we understand the set of data that can be observed in different layers. That is, for a country of continental dimensions, such as Brazil, knowing the total number of cases in the country is only informative. For such information to be functional, it must be stratified by regions, states, or cities.

To provide both an overview of the problem and direct the development of the analysis, I have set out below some objectives for this study that I hope the reader will appropriate them too:

• Acquire an overview of the KDD process and understand where the data acquisition places on it.
• Explore data acquisition techniques so that similar problems could easily employ them.
• Obtain, from official databases, the number of cases and deaths by COVID-19 in Brazil, stratified by state.

In the following sections, we will explore each of these items to build the necessary concepts and implement them in a Python development environment. Ready?

## KDD – Knowledge Discovering in Databases

Knowledge discovery in databases is the process by which someone seeks for comprehensive, valid, and relevant patterns from databases. It is an interactive and iterative process, which comprises many of the techniques and methods that are known today as data science.

Before exploring the main stages of the KDD process, however, it is worth differentiating three key terms: data, information, and knowledge. Each step of KDD is concerned with processing each of these terms, illustrated in the diagram of Fig. 1 from the problem we are working on.

In short, data are all chains of symbols, structured or not, related to some phenomenon or event. Data, by itself, has no meaning. See: the 4.08 million number is just a cardinal. It becomes scary only after we assign it the sense that this is the number of confirmed cases of COVID-19. In other words, we have information only when we assign meaning to data.

It is from the manipulation of information that we acquire knowledge, which is closely related to what we call “cognitive processes” — perception, contextualization, and learning — and, therefore, intelligence.

Once these three concepts are understood, we can now proceed to the three fundamental phases of the KDD process. These are nothing more than sets of tasks respectively dedicated to obtaining and exploring data, information, and knowledge, as illustrated in Fig. 2, below:

The KDD process consists of steps, which in turn are compound of tasks. Data mining tasks, for example, are part of the information processing step. In contrast, those tasks related to data visualization, analytics, and decision making are part of the last step, which aims for knowledge extraction.

In this article, we will focus only on the data acquisition task, fundamental to any KDD project. In general, there is no predefined method for obtaining data. Medium and long term projects usually demand a data model, as well as a robust architecture for its collection and storage. Short-term projects, in turn, mostly require immediately available and accessible sources.

With these fundamental concepts in mind, we are ready to move towards our purpose: the acquisition of official data on the number of COVID-19 cases in Brazil.

## Defining a data acquisition strategy

As we have seen, a data acquisition strategy will only make sense if built from the perspective of a complete KDD process. That is, before we start collecting data out there, we need to have a clear purpose definition. What is the problem that we are trying to solve or elucidate?

The CRISP-DM reference model, shown in Fig.3, can be used as a compass when going through a KDD process. Do you remember that we previously defined KDD as an interactive and iterative process?

Well, the interaction means mutual or shared action between two or more agents. It can refer either to people holding data and information or to the stages. The short arrows, in both directions between the first steps of Fig. 3, represent the necessary interaction in the process. I dare say it is the biggest challenge for newcomers to the world of KDD or data science: there is an initial illusion that these steps follow a continuous flow, which results in a feeling of doing something wrong. Quite the contrary, by understanding the business, we make it possible to understand the data, which gathers us new insights about the business, and so on.

Iteration, in turn, is associated with the repetition of the same process. The larger arrows, in Fig.3, form a circle that symbolizes this repetition. It starts in the first step and returns, after development, to it again. A KDD project will never be stagnant.

The highlighted boxes, in Fig. 3, represent the steps related to the data acquisition process. According to CRISP-DM, the first step — understanding the business — is fundamental to the success of the entire project and comprises (i) the definition of the business objectives; (ii) the assessment of the current situation, including inventories and risk analyzes, as well as what data and information are already available; (iii) the determination of the objectives in applying data mining, and also the definition of what the success criteria will be; and (iv) the elaboration of a project plan.

The second step — understanding the data — comprises (i) the first data samples and (ii) its detailed description; (iii) the initial exploration of the data; and (iv) an assessment of the quality of that data. Remember that these two initial steps take place interactively and, ideally, involving domain and business specialists.

The third step begins with the validation of the project plan by the sponsor and is characterized by the preparation of the data. In general, the efforts required in this phase become more expensive and, therefore, their changes should be milder. The main tasks of this step include the selection and cleaning of data, the construction and adaptation of the format, and the merging of data. All of these tasks are commonly referred to as data pre-processing.

Your data acquisition strategy will start being defined as soon as you have the first interactions between understanding the business (or the problem) and following the available data. For that, I consider essentially two possible scenarios:

1. There is control over the data source: you can somehow manage and control the sources that generate the data (sensors, meters, people, databases, etc.). Take, for example, the activity of a head nurse: he can supervise, audit, establish protocols, and record all patient data in his ward. It is important to note that the control is over the origin or registration of the data, and not over the event to which the data is related.
This scenario usually happens in projects of high complexity or duration, where a data model is established, and there are means to ensure its consistency and integrity.
2. There is no control over the data source: this is the most common situation in projects of short duration or of occasional interest, as well as in projects, analyzes, or perennial studies that use or demand data from sources other than their own. Whatever the initiative, the data acquisition strategy must be more reliable, given that any change beyond its control can increase the costs of its analysis or even render the project unfeasible.
Our own study illustrates this scenario: how to obtain the stratified number of COVID-19 cases in Brazil if we have no control over the production and dissemination of these data?

Expect to find the first scenario in corporations and institutions, or when leading longitudinal research or manufacturing processes. Otherwise, you will likely have to deal with the second one. In this case, your strategy may consist of subscriptions, data assignment agreements, or even techniques that search and compile such data from the Internet, once they are properly licensed. Web scraping is one of these techniques, as we will see next.

### Data acquisition via web scraping: obtaining data with the official panel of the Brazilian Ministry of Health

With the objective established — the stratified number of COVID-19 cases in Brazil — my first action was to look for possible sources. I consulted the Brazilian Ministry of Health web portal in mid-April, 2020. At the time, I had not found consolidated and easily accessible information, so I increased the granularity of possible sources: state and municipal health departments. The challenge, in this case, was the heterogeneity in the way they presented the data. When consulting a public health worker, she informed me that the data were consolidated in the national bulletins and that specific demands should be requested institutionally. My first strategy, then, was to devise a way to automatically collect such bulletins (in PDF format) and start extracting the desired data — part of the result is available in this GitHub repository.

At the beginning of May 2020, though, I had an unpleasant surprise: the tables I had used to extract the data (for those who are curious, the PDFplumber library is an excellent tool) were replaced by figures, in the news bulletins, which made my extraction method unfeasible.

I made a point of going through the above paragraphs in detail for a simple reason: to demonstrate, once again, the interactive process inherent to the data acquisition step in KDD. Besides, I wanted to highlight the risks and uncertainties when we took on a project where we have no control over the data source. In these situations, it is always good to have alternatives, establish partnerships or agreements, and seek to build your own dataset.

While I was looking for a new strategy, the Brazilian Ministry of Health started making the data available into a single CSV file — days later changing the format to XLSX and changing the file name every day. In the following paragraphs, I will detail how I adapted my process and code for this new situation.

By the way, the automatic retrieval of data and information from pages and content made available on the Internet is called web scraping. This technique can be implemented effortlessly through Python libraries, as shown in Fig. 4.

We finally come to the practical — and perhaps the most anticipated — part of this text. Following the steps defined in Fig. 4, we start by (1) visiting the Coronavirus panel and (2) identifying that our interest is in the data made available by clicking on the “Arquivo CSV” button in the upper right corner of the page. When we activate the Google Chrome page inspection tool (3), we find the code corresponding to the item on the HTML page, as shown in Fig. 5. Remembering that the inspection tool is activated by the Inspect menu, when we right-click over any element of the page, or by the shortcut ctrl+shift+I.

The next step (4) is to access this data from our code in Python, which can go through Jupyter notebooks, scripts, or through an IDE. In any case, the idea is to emulate the access to the desired portal and the interaction with the web page by using the Selenium library.

In the lines below, I start creating a virtual environment to access the portal URL within the Google Chrome emulator:

# Initial statements
from pyvirtualdisplay import Display
from selenium import webdriver

# Parameters definition
url = 'https://covid.saude.gov.br'
## A linha abaixo pode ser suprimida em alguns ambientes:
chromeDriverPath = '~/anaconda3/envs/analytics3/'

# Starts the virtual environment:
display = Display(visible=0, size=(800,600))
display.start()

# Opens the Chrome emulator in the desired URL:
driver = webdriver.Chrome()

# Reads and gets the page content encoded in UTF-8:
driver.get(url)
page = driver.page_source.encode('utf-8')


Once we read the page, we proceed to step (5) of Fig. 4, where we iterate the process until we guarantee that the data is in our desired form. To do so, we can start by checking the size of the loaded page (e.g., if it is null, it means something went wrong) and its first lines:

# What is the length of the loaded page?
print(len(page))

# What is the data type?
print(type(page))

# What is the content of the first positions of the byte stream?
print(page[:2000])


Most of the time, the next step would be to explore HTML content using tools known as parsers — the BeautifulSoup library is an excellent parser for Python. In our case, considering that the data we are interested are not on the web page but result from an action on this page, we will continue using only the Selenium methods to emulate the click on the button, automatically downloading the desired file for the default system folder:

## Path obtained from inspection in Chrome:
xpathElement = '/html/body/app-root/ion-app/ion-router-outlet/app-home/ion-content/div[1]/div[2]/ion-button'

## Element corresponding to the "Arquivo CSV" button:



The next step is to verify if the file was downloaded correctly. Since its name is not standardized, we then list the recent files through the glob library:

import os
import glob

## Getting the last XLSX file name:
latest_file = max(list_of_files, key=os.path.getctime)
print(latest_file)


From this point on, the web scraping task is completed, at the same time that we enter the data preparation phase (if necessary, review Fig. 3).

Let’s consider that we are interested in the number of cases per state, concerning its population, as well as in the lethality rates (number of deaths to the number of cases) and mortality (number of deaths to the number of people). The lines of code below perform the pre-processing tasks needed to offer the desired information.

## Reading the data

## Getting the last registered date:
lastDay = covidData.data.max()

## Selecting the dataframe with only the last day
## and whose data is consolidated by state
covidLastDay = covidData[(covidData.data == lastDay) &
(covidData.municipio.isna() == True) &
(covidData.populacaoTCU2019.isna() == False)]

## Selecting only the columns of interest:


The pre-processing step is completed by generating some additional features, from the pre-existing data:

## Copying the dataframe before handling it:
normalCovid = covidLastDay.copy()

## Contamination rate (% of cases by the population of each state)
normalCovid['contamRate'] = (normalCovid['casosAcumulado'] / normalCovid['populacaoTCU2019']) * 100
## Fatality rate (% of deaths by the number of cases)
## Mortality rate (% of deaths by the population in each state)
normalCovid['deathRate'] = (normalCovid['obitosAcumulado'] / normalCovid['populacaoTCU2019']) * 100


From this point, we can search in the pre-processed database:

normalCovid.query("obitosAcumulado > 1000")


Here we end the data acquisition phase. If we follow the CRISP-DM methodology, the next steps could be either the construction of models, analyzes, or visualizations. To illustrate a possible conclusion of this KDD process, Fig. 6 presents a Four-Quadrant graph correlating the percentage of cases versus the lethality rate in the different states of Brazil (the code to generate it is available on GitHub).

The graphic above is the result of the entire process we covered so far: the data was obtained via web scraping, the information was generated from the manipulation of that data, and the knowledge was acquired by structuring the information. Amazonas (AM) is the state population with the highest number of infections per inhabitant, as well as with the highest lethality rate. São Paulo (SP) is the state with the highest cumulative number of deaths, but with a relatively low number of infections considering its entire population.

## Final considerations

In this article, I tried to offer an overview of the KDD process and how to implement data acquisition using a web scraping technique, especially in the case of small projects or analyzes.

You may have wondered if it wouldn’t be much easier just to visit the web portal and simply download the data available there. Yes, it would! However, a data science project expect data acquisition as automatic as possible, giving space to move forward with predictive analyzes whose models require daily or constant updating. Also, understanding complex topics and gaining confidence about our technique becomes much more effective — and perhaps amusing — with relatively more straightforward problems.

Finally, I must remind you to always act ethically in the acquisition and manipulation of data. Automating the collection of data in the public domain, at no cost to its administration, is characterized as a completely legal activity. However, if you are in doubt as to when the use of such algorithms is allowed or not, do not hesitate to consult experts or even the data owners.

If you have followed the article so far, I would love to hear your opinion. In addition to the poll below, feel free to leave a comment or write to me with any questions, criticisms or suggestions.

# Finding cyclic patterns: a tutorial on how to implement STFT in Python

If you have ever heard Python and Fourier nouns, chances are you’ll find this post useful: here I will explore a simple way to implement the Short-Time Fourier Transform in Python in order to run a frequency analysis for a given signal. Let’s get to work!

A few weeks ago I was invited to join a Kaggle’s competition with my colleagues from my research group. The idea has sounded very interesting to me since it is about seismic data, i.e. time series, which is somehow related to the signal category I was used to working with — for the curious ones, I have applied wavelets for drowsiness detection from brain signals. By the way, wavelet analysis has become famous when Stéphane Mallat applied it to seismic and geological data.

But before we start dealing with wavelets (which I do intend to get back to them in the near future), the main idea here is to get acquainted with periodic signals. I mean, a time series representing some cyclic events that repeat from time to time. And that, therefore, can be often designated as a pattern.

You may find patterns represented in the most different forms, but one, in particular, is the circle: the day and the night time, the seasons of the year, the electricity generated in hydroelectric, all of them are related to a circular source (or at least an approximated one, since the Earth is an oblate spheroid which orbits the Sun in an elliptical path).

From a mathematical perspective, any cyclic pattern can be somehow represented by the trigonometric sine and cosine functions, as demonstrated by the French mathematician and physicist Joseph Fourier, who in the early nineteenth century has established what is known today as Fourier Series and Fourier Transform.

The sine and cosine functions are the projections of a circular movement over a vertical and horizontal axis, respectively.

I will assume you are already familiar with trigonometric signals and their relation with frequency. In any case, what is important to notice is that a periodic or cyclic signal will repeat itself by a given amount of time, known as period (T). The frequency (f), in turn, consists of the number of repetitions per second and is measured in hertz (Hz), in homage to the German physicist Heinrich Hertz.

In the next sections, we will explore these concepts in a very practical way (I’m also assuming you are used to Python). We will first figure out how to generate and visualize some periodic waves (or signals), moving then to a Short-Time Fourier Transform implementation. For last, we will see an example of how to apply such analysis for a non-trigonometric function but rhythmic one, as the brain signals.

## 1. Generating periodic signals

In data science — and here I’m considering all the disciplines related to it, such as pattern recognition, signal processing, machine learning and so on — it is always useful to have a deep understanding of how our data (a.k.a. signals) behave when inserted into an analysis or classification tool. The problem is that at first we usually know little about our data — which in the real world can be complex and noisy — and in fact, looking for analytical tools is an attempt to understand them.

In these cases, the best alternative is to explore the analysis tool with simple and well-known data. Paraphrasing the Nobel laureate Richard Feynman — who left in his blackboard the following quote, after his death in 1988: “What I can not create, I do not understand” –, we will first generate our own periodic signals.

A time series usually results from a sampling process effort. When this signal is generated, as is the case here, we must also establish the time where the signal supposedly occurs. The function below emulates the sampling process and has as input the initial and the final time samples. In other words, it returns the x-axis basis where our signal will be placed.

[code language="python"]
import numpy as np

def timeSeries(fs, ti, tf):
'''Returns a time series based on:
fs = the sampling rate, in hertz (e.g. 1e3 for 1kHz);
ti = the starting time (e.g. second 0);
tf = the final time (e.g. second 1).'''
T = 1./fs
timeSeries = np.arange(ti,tf,T)
return timeSeries
[/code]

Let’s remember that a sinusoidal signal is generated by the function:
$y(t) = \sin(2\pi f t)$,
where f is the frequency, in hertz, of the signal y(t), and t is the time series generated by the code above. This equation is then transcripted to Python:

[code language="python"]
def genSignal(freq, amp, timeSeries):
'''Returns the time series for a sinusoidal signal, where:
freq = frequency in Hertz (integer);
amp = signal amplitude (float);
timeSeries = the linear time series (array).'''
signal = amp*np.sin(freq*2*np.pi*timeSeries)
return signal
[/code]

Almost as important as generating the signal is to provide a means of visualizing it. Using the Matplotlib library, the following function can be adapted to view any time series:

[code language="python"]
import matplotlib.pyplot as plt

def plotSignal(signal, timeSeries, figsize=(6,3)):
'''signal = numpy.array or Pandas.Series;
timeSeries = the linear time series x-axis;
figsize = the plot size, set as (6,3) by default.'''
fig, axes = plt.subplots(1,1, figsize=figsize)
plt.plot(timeSeries, signal)
#Adjust the ylim to go 10% above and below from the signal amplitudes
axes.set_ylim(signal.min()+signal.min()*0.1, signal.max()+signal.max()*0.1)
axes.grid(True)
axes.set_ylabel('Signal amplitude')
axes.set_xlabel('Time (s)')
return
[/code]

It also sounds practical to code a wrapper function to create and visualize a sinusoidal function getting frequency and amplitude as input parameters:

[code language="python"] def sineWave(freq, amp, timeSeries):
'''Generates and plots a sine wave using genSignal() and plotSignal().
freq = frequency in Hertz (int);
amp = signal amplitude (float);
timeSeries = a linear time series accordingly to the sample rate.'''
signal = genSignal(freq, amp, timeSeries)
plotSignal(signal, timeSeries)
return signal [/code]

Finally, we will now execute these functions together in order to obtain some sinusoidal waves:

[code language='python']
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

## Sampling time for 0s to 10s, 1kHz:
time = timeSeries(1e3,0,10)

## Generating a sine wave of 1Hz and amplitude of 2:
sign1 = sineWave(1, 2, time)

## Generating a sine wave of 0.1Hz and amplitude of 1:
sign01 = sineWave(0.1, 1, time)
[/code]

Which then give us the following result:

It may look quite simpler, but it is important to have a signal which we know exactly how it was created — we hold its model, i.e. its formula. Observe in the first chart above that a frequency of 0.1Hz means the signal will repeat itself every 10 seconds. In the second chart, we duplicate the amplitude of the signal which now repeats its pattern every second.

Putting together the functions we have defined with the numpy.concatenate method, as shown below, we can generate more complex waveforms:

[code language='python']
# Generating two segmented time series:
time1 = timeSeries(fsLANL,0,1)
time2 = timeSeries(fsLANL,1,2)
# Concatenating them in order to plot:
time = np.concatenate((time1, time2), axis=None)

# Generating the signals for each time series:
signal1 = genSignal(20,50,time1)
# Here we add up two frequencies:
signal2 = genSignal(113,28,time2) + genSignal(70,53,time2) + genSignal(0.25,30,time2)
# Concatenating both signals
signal = np.concatenate((signal1,signal2), axis=None)

# Plotting the resulting signal:
plotSignal('[0.25, 20, 70, 113]', signal, time, figsize=(12,6))
[/code]

From our first charts, it becomes clear that when we are dealing with low and single frequencies we can still identify cyclic patterns only through visual inspection in the time domain signals (e.g. counting the tops during a second). However, for more complex waves (see the interval between 1s and 2s in the chart above) it becomes clear that a tool is needed to identify any cyclic pattern.

## 2. The Short-Time Fourier Transform

A complete understanding of time series representation in the frequency domain requires a deepening of linear algebra and calculus for expanding a signal in a series of functions.

Such an approach, though, will not be taken here: those who already understand this mathematical abstraction would find a practical code for implementing a frequency analysis in Python. In the same way, those who are facing periodic signals for the first time may start deepening their theoretical understanding from these simple examples.

Before we start coding, I will provide a brief definition (please don’t get scared with the symbols… If you want, you may just page down to skip this block): Fourier has proved that it is possible to expand a great number of functions into pure harmonic functions series (i.e. a sum of sines and cosines). By the way, from the Euler’s formula we know we can represent harmonic functions as a complex exponential:

$e^{i\omega t} = \cos (\omega t) + i \sin (\omega t)$.

Now consider that we can expand a given signal y(t) into a Fourier Series:

$y(t) = \frac{1}{\sqrt{2\pi}}\int_{- \infty}^{+\infty} c_\omega e^{i\omega t} d\omega$,
where $c_\omega$ are the Fourier coefficients obtained by the Fourier transform of the signal; and $e^{i\omega t}$ is the harmonic base function that represents the frequency $\omega$.

What is important to have in mind is that each coefficient — what our computer will be calculating for us — is associated with a harmonic base function, which in turn relates this coefficient to a specific frequency. Then, if we know the value of each coefficient, we could know the intensity of each frequency component present in the signal y(t).

Another important aspect one should consider is that these base functions have infinite duration and therefore are not suitable for representing signals with discontinuities or of which the location in time or space is also desired (i.e. when we seek to know in which period of the signal a given frequency or rhythm happens the most).

The Short-Time Fourier Transform (STFT) is a way to overcome this. Its strategy consists of multiplying each basis function of the transform by a window function, w(t). This last thus being limited to a given period of time, i.e. having non-null values just for a limited interval.

Considering the function space in the Fourier transform can be defined by trigonometric functions in its complex form, for the STFT we create a new basis defined both by frequency ($\omega$) and position ($\tau$):

$G_{Fourier} = \{e^{i\omega t}\}_{\omega \in \mathbb{R}}$,
$G_{STFT} = \{w(t-\tau)e^{i\omega t}\}_{\omega \in \mathbb{R}, \tau \in \mathbb{Z}}$.

The shape and length of w(t) must be chosen accordingly to the analysis interest. The shape depends on the function used to generate it and determines its capability of frequency resolution. On the other hand, the length (N) defines the window interval and therefore its temporal resolution.

Many window functions can be used and tried out (an overview of their main features and differences can be found here). For our purpose — or when in doubt about which one to pick — the Hann window can be a satisfying choice:

Let us now return to our Python code. Once we have understood the basic principles the STFT relies on, we can make use of the signal module from SciPy library to implement an spectrogram — which consist of plotting the squared magnitude of the STFT coefficients.

So, with the code below we will compute the STFT for our first signal (page up and seek for the sign1). Also, we must consider that the SciPy implementation will always set the frequency axis accordingly to half of the sampling frequency, the reason why I set the plt.ylim() command.

[code language='python']
## Calculating the STFT spectrogram
fs = 1e3
f, t, Zxx = signal.stft(sign1, fs)
plt.pcolormesh(t, f, np.abs(Zxx), cmap=plt.get_cmap('RdYlGn'))
plt.title('STFT Magnitude')
plt.ylim(0,160)
plt.ylabel('Frequency [Hz]')
plt.xlabel('Time [sec]')
plt.show()
[/code]

Okay, I do agree with you that this result is far short of what I expected. The good news is that (almost) everything is a matter of adjustment. To make things easier, let’s write the following function:

[code language='python']
def calcSTFT(inputSignal, samplingFreq, window='hann', nperseg=256, figsize=(9,5), cmap='magma', ylim_max=None, output=False):
'''Calculates the STFT for a time series:
inputSignal: numpy array for the signal (it also works for Pandas.Series);
samplingFreq: the sampling frequency;
window : str or tuple or array_like, optional
Desired window to use. If window is a string or tuple, it is
passed to get_window to generate the window values, which are
DFT-even by default. See get_window for a list of windows and
required parameters. If window is array_like it will be used
directly as the window and its length must be nperseg. Defaults
to a Hann window.
nperseg : int, optional
Length of each segment. Defaults to 256.
figsize: the plot size, set as (6,3) by default;
cmap: the color map, set as the divergence Red-Yellow-Green by default;
ylim_max: the max frequency to be shown. By default it's the half sampling frequency;
output: 'False', as default. If 'True', returns the STFT values.

Outputs (if TRUE):
f: the frequency values
t: the time values
Zxx: the STFT values"'''
##Calculating STFT
f, t, Zxx = signal.stft(inputSignal, samplingFreq, window=window, nperseg=nperseg)
##Plotting STFT
fig = plt.figure(figsize=figsize)
### Different methods can be chosen for normalization: PowerNorm; LogNorm; SymLogNorm.
### Reference: https://matplotlib.org/tutorials/colors/colormapnorms.html
spec = plt.pcolormesh(t, f, np.abs(Zxx),
norm=colors.PowerNorm(gamma=1./8.),
cmap=plt.get_cmap(cmap))
cbar = plt.colorbar(spec)
plt.title('STFT Spectrogram')
ax = fig.axes[0]
ax.grid(True)
ax.set_title('STFT Magnitude')
if ylim_max:
ax.set_ylim(0,ylim_max)
ax.set_ylabel('Frequency [Hz]')
ax.set_xlabel('Time [sec]')
fig.show
if output:
return f,t,Zxx
else:
return
[/code]

To test it, we will create a new signal whose values are easy to understand and manipulate, a 100Hz sine wave with amplitude 1000. Since we will keep it lasting 10s, with the same sampling rate of 1kHz, we can use the array time defined before.

[code language='python']
fs = 1e3  #The sampling frequency
test = genSignal(100,1000,time)  #Generating the signal

calcSTFT(test, fs)  #Calling our function in its simplest form!
[/code]

The purpose to start with a simple signal is that it allows us to check whether there is a gross error in our code or not. Fortunately, it’s not the case! The plot frequency range goes up to 500Hz, half the sampling frequency, as expected. Also, we can precisely see the 100Hz component, which values are normalized (you will find the details on the normalization strategy here).

Going further, we can specify other parameters to our STFT. To create the charts below I set different values for the nperseg parameter, which correspond to the window function size. For the left image, nperseg = 64, while for the right one nperseg = 2048. Note how a larger window results in a much more accurate frequency resolution. And vice versa.

There is no specific rule to determine the number of samples per segment. However, it’s useful to consider 1/4 of the sampling rate. E.g. for a 1kHz sampling rate, the nperseg parameter is set by default to 256. This tradeoff can become clearer if we analyze our complex signal, which we generated with components from 0.25 to 123Hz:

[code language='python']
calcSTFT(signal, fs, nperseg=2**19, ylim_max=300)
[/code]

In the plot above all the known frequency components of the signal become evident, although there is lower time resolution. For some applications, 250ms are insignificant. For others, it can be limiting. I must anticipate to you that there is no analytical tool that may offer accuracy both in time and frequency, as demonstrated by the Heisenberg’s principle — but this will be subject for a next post.

## 3. Brain signal analysis

To exemplify how the STFT analysis can be useful for “real life” situations, we will now apply the functions we built on a brain signal segment obtained from electroencephalography (EEG). If you want to try it on your own, the data is available here.

Since the signal was previously processed on Matlab, we will first have to convert the data structure to a Python dictionary using the loadmat method. Knowing that the signal was labeled as sD_W, we can move on to loading it:

[code language='python']
import scipy.io

brain_raw = mat['sD_W']
## Checking the shape of the variable:
brain_raw.shape

## Adequating the signal to an np.array
brain = brain_raw.T[0]

## Plotting the brain signal:
plt.plot(brain)

## Calculating its STFT
calcSTFT(brain, 200, nperseg=2**8, figsize=(9,4), ylim_max=100, cmap='inferno')
[/code]

Unlike when dealing with sinusoidal signals, now we can’t deduce the rhythms (i.e. frequency components) present in the EEG signal only by looking to its time domain plot. We could even say that something is happening after 20s, but we could not give any detail.

On the other hand, the spectrogram above shows us that after the event around 20s, a well established low-frequency rhythm appears from 20s to 36s and then again from 44s to 56s. As with any data science activity, the interpretation is always context-dependent. In this case, this EEG signal is from a subject submitted to polysomnography. The identified frequency range corresponds to the alpha rhythm, indicating that the patient is probably awake and relaxed, but with closed eyes. For last, the dark region around 50Hz corresponds to a pass-band filter applied during the signal acquisition in order to filter the 50Hz interference from AC power supply.

## 4. Conclusions

The goal of this post was not only to show how to implement STFT in Python, but also to bring a brief introduction to the theory behind this powerful analytical tool — which supports the more intricate ideas of mathematics.

Also, by being able to understand the main parameters that should be observed in such analysis, and drawing on the example with brain signals, I hope the reader may tailor the code and apply such a tool to any cyclic signal he/she may find: from biomedical or seismic signals going even to stock market prices or the movement of stars.