SparkSeq: fast, scalable and cloud-ready tool for the interactive genomic data analysis with nucleotide precision.

Wiewiórka, Marek S; Messina, Antonio; Pacholewska, Alicja Elzbieta; Maffioletti, Sergio; Gawrysiak, Piotr; Okoniewski, Michał J (2014). SparkSeq: fast, scalable and cloud-ready tool for the interactive genomic data analysis with nucleotide precision. Bioinformatics, 30(18), pp. 2652-2653. Oxford University Press. doi:10.1093/bioinformatics/btu343

btu343.pdf - Published Version
Available under License Publisher holds Copyright.

Summary: Many time-consuming analyses of next-generation sequencing data can be addressed with modern cloud computing. Apache Hadoop-based solutions have become popular in genomics because of their scalability in a cloud infrastructure. So far, most of these tools have been used for batch data processing rather than interactive data querying.

The SparkSeq software was created to apply a new MapReduce framework, Apache Spark, to next-generation sequencing data. SparkSeq is a general-purpose, flexible and easily extendable library for genomic cloud computing. It can be used to build genomic analysis pipelines in Scala and run them interactively. SparkSeq opens up the possibility of customized ad hoc secondary analyses and iterative machine learning algorithms. This article demonstrates its scalability and overall fast performance by running analyses of sequencing datasets. Tests of SparkSeq also show that caching and HDFS block size can be tuned for optimal performance on multiple worker nodes.

Item Type:	Journal Article (Original Article)
Division/Institute:	05 Veterinary Medicine > Department of Clinical Veterinary Medicine (DKV) > ISME Equine Clinic Bern > ISME Equine Clinic, Internal medicine
	05 Veterinary Medicine > Department of Clinical Research and Veterinary Public Health (DCR-VPH) > Institute of Genetics
	05 Veterinary Medicine > Department of Clinical Veterinary Medicine (DKV)
	05 Veterinary Medicine > Department of Clinical Research and Veterinary Public Health (DCR-VPH)
UniBE Contributor:	Pacholewska, Alicja Elzbieta
Subjects:	500 Science > 590 Animals (Zoology)
	600 Technology > 630 Agriculture
	000 Computer science, knowledge & systems > 040 Unassigned
	500 Science > 570 Life sciences; biology
ISSN:	1367-4803
Publisher:	Oxford University Press
Funders:	Swiss National Science Foundation
Language:	English
Submitter:	Alicja Elzbieta Pacholewska
Date Deposited:	04 Nov 2016 17:48
Last Modified:	05 Dec 2022 14:53
Publisher DOI:	10.1093/bioinformatics/btu343
PubMed ID:	24845651
Web of Science ID:	000342913000015
BORIS DOI:	10.7892/boris.79078
URI:	https://boris.unibe.ch/id/eprint/79078
