Approximate Sequence Matching with MapReduce Full text

Δανάη Κούτρα
Σχολή Ηλεκτρολόγων Μηχανικών και Μηχανικών Υπολογιστών, ΕΜΠ
2010
Diploma Thesis
Abstract. The significance of processing and analysing the organisms' genetic codes and the representation of biological data in digital form which can be processed by computers raised the need for close cooperation of biologists and computer scientists in order to solve a variety of biological problems. One of the most important problems is mapping reads -short sequences of DNA- to reference sequences to find locations where each read occurs, allowing for a small number of errors. In general, this problem is time-consuming and memory-demanding and thus biologists often use approximate algorithms, such as BLAST-like algorithms, which detect the most significant sequence similarities. However, there are some biological applications that require the best matches to be found and in these cases heuristic algorithms are not sufficient. Moreover, the next-generation high-throughput DNA sequencing technologies and projects like the Personal Genome Project which aim at sequencing the DNA of every individual at low cost, are changing the scale and scope of genomics and are expected to increase substantially the data that is used in biosciences. Therefore, it seems that the distribution of data storage and computation among numerous processors is inevitable. This diploma thesis aims to: (a) the study of existing biological applications implemented in MapReduce, a framework that supports parallel distributed execution of data intensive applications, (b) gaining knowledge on the MapReduce programming model and its most commonplace open-source version, Hadoop, and (c) the implementation and evaluation of MapReduce algorithms which solve the approximate sequence matching problem by detecting all the existing matches.