Análise de agrupamento: o problema da identificação de línguas em textos por meio de bi-gramas

Souza Júnior, Cleônidas Tavares de

Please use this identifier to cite or link to this item: http://repositoriosenaiba.fieb.org.br/handle/fieb/891

Title:	Análise de agrupamento: o problema da identificação de línguas em textos por meio de bi-gramas
Other Titles:	Cluster analysis: the problem of identifying languages in texts by means of bigrams
Authors:	Souza Júnior, Cleônidas Tavares de
Advisor:	Sampaio, Renelson Ribeiro
Co-advisor:	Pereira, Hernane Borges de Barros
Referees:	Senna, Valter de; Rosa, Marcos Grilo
Keywords:	Data mining; Frequency analysis – Letter pairs; Linguistic variation; N-grams
Issue Date:	22-Feb-2018
Publisher:	Centro Universitário SENAI CIMATEC
Citation:	SOUZA JUNIOR, Cleônidas Tavares de. Análise de agrupamento: o problema da identificação de línguas em textos por meio de bi-gramas. Orientador: Renelson Ribeiro Sampaio. 2018. 101 f. TCCP (Mestrado em Modelagem Computacional e Tecnologia Industrial) – Centro Universitário SENAI CIMATEC, Salvador, 2018.
Abstract:	Algoritmos de aprendizado supervisionado baseados em frequência de letras têm sido usados para identificar as línguas de origem de textos; no entanto, eles são imprecisos quando, por exemplo, tentam distinguir os textos nas línguas norueguesa e dinamarquesa. Os objetivos deste trabalho são: (i) identificar padrões na análise de frequência de pares de letras que possam ser utilizados para agrupar os textos que compartilham uma mesma língua; e (ii) identificar os motivos que levam alguns algoritmos baseados em análise de frequência de letras a serem imprecisos na identificação de algumas línguas. A hipótese inicial é que línguas com uma grande quantidade de palavras em comum e que são variedades/dialetos de uma língua dificilmente são diferenciadas umas das outras por meio da análise de frequência de letras. Para testar essa hipótese, foram desenvolvidos dois algoritmos: (i) um para verificar se a análise de frequência de letras gera resultados suficientes para agrupar os textos de mesma língua em um mesmo agrupamento; e (ii) o outro para verificar a quantidade de palavras compartilhadas por algumas línguas. Os resultados obtidos por meio da análise de agrupamentos revelaram que variedades de uma mesma língua permanecem em um mesmo agrupamento; isso sugere uma proximidade entre elas. Este trabalho contribui (i) para os estudos da linguagem, ao apresentar que variedades de uma mesma língua não podem ser diferenciadas por meio de análise de frequência de pares de letras (com escrita alfabética, com alfabeto latino-europeu); e (ii) para as áreas da computação interessadas em processamento de línguas naturais, com algoritmos que, a partir de um conjunto de textos, identificam e agrupam, graficamente, os textos de mesma variedade linguística ou de mesma língua.
ABSTRACT: Supervised learning algorithms based on letter frequency have been used to identify the source languages of texts; however, they are inaccurate when, for example, they try to distinguish between texts in Norwegian and Danish. The goals of this work are: (i) to identify patterns in the frequency analysis of letter pairs that can be used to group texts that share the same language; and (ii) to identify the reasons that lead some algorithms based on letter frequency analysis to be inaccurate in the identification of some languages. The initial hypothesis is that languages that have a large number of words in common and that are varieties/dialects of one language can hardly be differentiated from one another through letter frequency analysis. To test this hypothesis, two algorithms were developed: (i) one to verify whether letter frequency analysis yields results sufficient to place texts of the same language in the same cluster; and (ii) the other to check the number of words shared by some languages. The results obtained through cluster analysis revealed that varieties of the same language remain in the same cluster; this suggests a closeness between them. This work contributes (i) to language studies, by showing that varieties of the same language cannot be differentiated by means of frequency analysis of letter pairs (for alphabetic writing using the Latin-European alphabet); and (ii) to the areas of computing interested in natural language processing, with algorithms that, from a set of texts, graphically identify and group texts of the same linguistic variety or of the same language.
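The two measurements named in the abstract, letter-pair (bigram) frequencies and the number of words shared between texts, can be illustrated with a minimal sketch. This is not the dissertation's implementation: the function names, the choice of cosine distance between bigram profiles, the Jaccard ratio for shared words, and the sample sentences are all assumptions made for illustration. A clustering step would operate on the pairwise distances computed this way.

```python
from collections import Counter
from math import sqrt


def bigram_profile(text: str) -> dict:
    """Relative frequency of adjacent letter pairs, counted within words."""
    counts = Counter()
    for word in text.lower().split():
        letters = [c for c in word if c.isalpha()]  # drop digits and punctuation
        counts.update(a + b for a, b in zip(letters, letters[1:]))
    total = sum(counts.values())
    return {bg: n / total for bg, n in counts.items()} if total else {}


def cosine_distance(p: dict, q: dict) -> float:
    """1 - cosine similarity of two bigram profiles (assumed metric)."""
    dot = sum(p[k] * q[k] for k in p.keys() & q.keys())
    norm = sqrt(sum(v * v for v in p.values())) * sqrt(sum(v * v for v in q.values()))
    return 1.0 - dot / norm if norm else 1.0


def shared_word_ratio(a: str, b: str) -> float:
    """Jaccard ratio of the two vocabularies (assumed measure of shared words)."""
    wa, wb = set(a.lower().split()), set(b.lower().split())
    union = wa | wb
    return len(wa & wb) / len(union) if union else 0.0


# Hypothetical sample sentences, far too short for any real conclusion.
variety_a = "o gato está na casa"
variety_b = "o gato está em casa"
other_lang = "the cat is in the house"

print(cosine_distance(bigram_profile(variety_a), bigram_profile(variety_b)))
print(cosine_distance(bigram_profile(variety_a), bigram_profile(other_lang)))
print(shared_word_ratio(variety_a, variety_b))
```

On real corpora the texts would need to be much longer for the bigram profiles to be stable, and the distance matrix would be passed to a clustering routine of the reader's choice.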
URI:	http://repositorio.universidadesenaicimatec.edu.br/handle/fieb/891
Appears in Collections:	Dissertações de Mestrado (PPG MCTI)

Files in This Item:

File	Description	Size	Format
Cleônidas Tavares de Souza Júnior.pdf	TCCP / DISSERTAÇÃO MCTI / SENAI CIMATEC	5.7 MB	Adobe PDF
