Análise de agrupamento: o problema da identificação de línguas em textos por meio de bi-gramas

Souza Júnior, Cleônidas Tavares de

Please use this identifier to cite or link to this item: http://repositoriosenaiba.fieb.org.br/handle/fieb/891

Full metadata record

DC Field	Value	Language
dc.contributor.advisor	Sampaio, Renelson Ribeiro	-
dc.contributor.author	Souza Júnior, Cleônidas Tavares de	-
dc.date.accessioned	2019-07-15T21:21:35Z	-
dc.date.available	2019-07-15T21:21:35Z	-
dc.date.issued	2018-02-22	-
dc.identifier.citation	SOUZA JUNIOR, Cleônidas Tavares de. Análise de agrupamento: o problema da identificação de línguas em textos por meio de bi-gramas. Orientador: Renelson Ribeiro Sampaio. 2018. 101 f. TCCP (Mestrado em Modelagem Computacional e Tecnologia Industrial) – Centro Universitário SENAI CIMATEC, Salvador, 2018.	pt_BR
dc.identifier.other	http://repositorio.universidadesenaicimatec.edu.br/handle/fieb/728?mode=full	-
dc.identifier.uri	http://repositorio.universidadesenaicimatec.edu.br/handle/fieb/891	-
dc.description.abstract	Supervised learning algorithms based on letter frequency have been used to identify the source language of texts; however, they are imprecise when, for example, they try to distinguish texts in Norwegian from texts in Danish. The objectives of this work are: (i) to identify patterns in the frequency analysis of letter pairs that can be used to group texts that share the same language; and (ii) to identify the reasons that lead some algorithms based on letter frequency analysis to be imprecise in identifying some languages. The initial hypothesis is that languages that share a large number of words, and that are varieties/dialects of one language, can hardly be told apart by letter frequency analysis. To test this hypothesis, two algorithms were developed: (i) one to verify whether letter frequency analysis yields results sufficient to place texts of the same language in the same cluster; and (ii) another to measure the number of words shared by some languages. The results obtained through cluster analysis revealed that varieties of the same language remain in the same cluster, which suggests that they are close to one another. This work contributes (i) to language studies, by showing that varieties of the same language cannot be differentiated by frequency analysis of letter pairs (for alphabetic writing using the Latin-European alphabet); and (ii) to the areas of computing concerned with natural language processing, with algorithms that, given a set of texts, identify and graphically group the texts of the same linguistic variety or of the same language.	pt_BR
dc.description.sponsorship	SENAI CIMATEC	pt_BR
dc.language.iso	pt_BR	pt_BR
dc.publisher	Centro Universitário SENAI CIMATEC	pt_BR
dc.rights	open access	pt_BR
dc.subject	Data mining	pt_BR
dc.subject	Frequency analysis – Letter pairs	pt_BR
dc.subject	Linguistic variation	pt_BR
dc.subject	N-grams	pt_BR
dc.title	Análise de agrupamento: o problema da identificação de línguas em textos por meio de bi-gramas	pt_BR
dc.title.alternative	Cluster analysis: the problem of identifying languages in texts by means of bigrams	-
dc.type	Dissertation	pt_BR
dc.embargo.terms	open	pt_BR
dc.publisher.country	Brazil	pt_BR
dc.publisher.departament	MCTI Department	pt_BR
dc.publisher.program	Stricto Sensu Graduate Program of Centro Universitário SENAI CIMATEC	pt_BR
dc.publisher.initials	SENAI CIMATEC	pt_BR
dc.subject.cnpq	Engineering	pt_BR
dc.contributor.advisor-co	Pereira, Hernane Borges de Barros	-
dc.contributor.referees	Senna, Valter de	-
dc.contributor.referees	Rosa, Marcos Grilo	-
Appears in Collections:	Master's Dissertations (PPG MCTI)
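
The abstract describes grouping texts by the frequency of letter pairs (bigrams). The record does not include the dissertation's code, so the following is only a minimal sketch of that general idea, not the author's implementation. The function names, the cosine distance, the single-linkage merging rule, and the threshold value are all assumptions chosen for illustration.

```python
from collections import Counter
from itertools import combinations
import math


def bigram_profile(text):
    """Relative frequency of each letter pair found inside the words of a text."""
    counts = Counter()
    for word in text.lower().split():
        letters = [ch for ch in word if ch.isalpha()]  # drop punctuation and digits
        counts.update(a + b for a, b in zip(letters, letters[1:]))
    total = sum(counts.values())
    return {bg: n / total for bg, n in counts.items()} if total else {}


def cosine_distance(p, q):
    """1 - cosine similarity between two bigram profiles (assumed metric)."""
    dot = sum(v * q.get(bg, 0.0) for bg, v in p.items())
    norm = math.sqrt(sum(v * v for v in p.values())) * math.sqrt(
        sum(v * v for v in q.values())
    )
    return 1.0 - dot / norm if norm else 1.0


def single_linkage(profiles, threshold):
    """Merge the two closest clusters until the closest pair exceeds threshold."""
    clusters = [{name} for name in profiles]

    def dist(a, b):
        return min(cosine_distance(profiles[x], profiles[y]) for x in a for y in b)

    while len(clusters) > 1:
        a, b = min(combinations(clusters, 2), key=lambda pair: dist(*pair))
        if dist(a, b) > threshold:
            break
        clusters.remove(a)
        clusters.remove(b)
        clusters.append(a | b)
    return clusters


if __name__ == "__main__":
    # Hypothetical sample sentences; a real study would use much longer texts.
    samples = {
        "pt_1": "a língua portuguesa é falada em vários países",
        "pt_2": "as línguas faladas no país são várias",
        "en_1": "the language is spoken in several countries",
    }
    profiles = {name: bigram_profile(text) for name, text in samples.items()}
    for x, y in combinations(profiles, 2):
        print(x, y, round(cosine_distance(profiles[x], profiles[y]), 3))
```

With short samples like these the distances are noisy; the sketch only shows how a bigram profile and a pairwise distance can feed a clustering step.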

Files in This Item:

File	Description	Size	Format
Cleônidas Tavares de Souza Júnior.pdf	TCCP / MCTI DISSERTATION / SENAI CIMATEC	5.7 MB	Adobe PDF
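
The abstract also mentions a second algorithm that checks how many words some languages share. As a hedged illustration only, one simple way to measure this is the overlap between the word sets of two texts. The helper names and the Jaccard ratio below are assumptions, not taken from the dissertation.

```python
import re


def vocabulary(text):
    """Set of distinct lowercase words (runs of Unicode letters) in a text."""
    return set(re.findall(r"[^\W\d_]+", text.lower()))


def shared_vocabulary(text_a, text_b):
    """Return the words common to both texts and their Jaccard overlap ratio."""
    vocab_a, vocab_b = vocabulary(text_a), vocabulary(text_b)
    common = vocab_a & vocab_b
    union = vocab_a | vocab_b
    return common, (len(common) / len(union) if union else 0.0)


if __name__ == "__main__":
    # Hypothetical toy inputs; real comparisons would use full corpora.
    common, ratio = shared_vocabulary("jeg er her", "jeg er der")
    print(sorted(common), ratio)
```

A high overlap ratio between two corpora would be consistent with the abstract's hypothesis that languages sharing many words are hard to separate by letter frequency alone.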
