PRINSEQ. PReprocessing and INformation of SEQuence data.
FAQ
If you can't find the answer to your question, take a look at the manual or the Q&A site.
What is PRINSEQ?
PRINSEQ is a publicly available tool that can filter, reformat and trim your sequences and provide you with summary statistics for your sequence data. It is easily configurable and provides a user-friendly interface.
The interactive web interface facilitates visualizations of the results and export functionality for subsequent data processing, and the latest version is available at http://edwards.sdsu.edu/prinseq/ or by clicking on "Use PRINSEQ" in the menu above.
The standalone lite version (available under "Downloads") is written in Perl and does not require any non-core Perl modules. The lite version is primarily designed for data preprocessing and does not generate summary statistics in graphical form.
Why should I use PRINSEQ?
Not only sequencing, but also data analysis costs money. Analyzing poor data wastes CPU time and interpreting the results from poor data wastes people time. The quality control step can be easily performed using the summary statistics generated by PRINSEQ and can help to choose parameters for data preprocessing. The filter, trim and reformat options provided by PRINSEQ can ensure that the data used for downstream analysis is not compromised by low-quality sequences or sequence artifacts that might lead to erroneous conclusions. PRINSEQ is free, fast, and does not require you to install any software.
How can I cite PRINSEQ?
If you use PRINSEQ, please cite:
Schmieder R and Edwards R: Quality control and preprocessing of metagenomic datasets. Bioinformatics 2011, 27:863-864. [PMID: 21278185]
@article{schmieder_prinseq,
  title = {Quality control and preprocessing of metagenomic datasets},
  volume = {27},
  issn = {1367-4811},
  url = {http://www.ncbi.nlm.nih.gov/pubmed/21278185},
  doi = {10.1093/bioinformatics/btr026},
  number = {6},
  journal = {Bioinformatics {(Oxford,} England)},
  author = {Robert Schmieder and Robert Edwards},
  month = mar,
  year = {2011},
  note = {{PMID:} 21278185},
  pages = {863--864}
}
What file formats does PRINSEQ support?
You can submit files in FASTA (and QUAL) or FASTQ format using the web version and download the data in either format. The files can also be compressed in ZIP, GZIP or BZIP2 format.
What is the maximum number of sequences that I can submit through the web interface?
There is no limit on the number of sequences that you can submit. However, there is a limit for the file size that you can upload. The current web-service allows files up to 600 MB. If you compress your data, you can submit around 2 GB of sequence data.
Where can I set the filter and reformat options when using the web interface?
The PRINSEQ web interface does not require you to set any parameters or options before the data is parsed. Instead, the parameters are set after the data has been processed and the summary statistics have been generated. This allows users to choose parameters appropriate for their dataset without having to submit and process the same data several times with modified parameters. The options and parameters can be set by hand or using the parameter and option managing functionality.
What preprocessing options should I use?
The necessary data preprocessing steps depend heavily on the type of library being sequenced (whole genome, transcriptome, 16S, metagenome, ...) and on the type of sequencing technology used to generate the data. There is no "one-size-fits-all" solution and each user must make informed decisions as to the appropriate parameters used for preprocessing. Take a look at the manual ("Manual" in menu on top) for some guidelines.
How long do you keep the data uploaded to the web interface?
As the user, you can choose whether we keep the data accessible for one day (24 hours) or one week (168 hours). You can also request that the data be deleted once you are done, or that we keep it for a longer period.
How can I use compressed files with PRINSEQ?
Using the piping feature in PRINSEQ you can use compressed input files (without the need to unzip and zip your input file):
gzip -dc myinputfile.fastq.gz | perl prinseq-lite.pl -verbose -fastq stdin ...
If you want to compress the output file(s), then you can either use a semi-colon (";") and then the gzip command:
gzip -dc myinputfile.fastq.gz | perl prinseq-lite.pl -verbose -fastq stdin -min_len 100 -out_good myoutputgood -out_bad myoutputbad; gzip myoutputgood.fastq myoutputbad.fastq
or keep only the "good" (or "bad") output data from PRINSEQ and compress it through a pipe ("|"). Because this is a single pipeline, it can also be run in the background by appending an "&" sign to the end of the command:
gzip -dc myinputfile.fastq.gz | perl prinseq-lite.pl -verbose -fastq stdin -min_len 100 -out_good stdout -out_bad null | gzip > myoutputgood.fastq.gz
You can also combine the "stdout" with normal output files (that will not be compressed):
gzip -dc myinputfile.fastq.gz | perl prinseq-lite.pl -verbose -fastq stdin -min_len 100 -out_good stdout -out_bad myoutputbad | gzip > myoutputgood.fastq.gz
Why does the standalone version not generate an output file?
The standalone version does not generate output files if you use any of the statistics options (e.g. -stats_len). After removing all statistics options, PRINSEQ will generate the preprocessed output file(s).
In what order does the standalone version perform filtering and trimming?
The standalone version performs the processing of the available options in the following order: seq_num, trim_left, trim_right, trim_left_p, trim_right_p, trim_qual_left, trim_qual_right, trim_tail_left, trim_tail_right, trim_ns_left, trim_ns_right, trim_to_len, min_len, max_len, range_len, min_qual_score, max_qual_score, min_qual_mean, max_qual_mean, min_gc, max_gc, range_gc, ns_max_p, ns_max_n, noniupac, lc_method, derep, seq_id, seq_case, dna_rna, out_format
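One practical consequence of this order is that trimming runs before length filtering, so min_len is applied to the already trimmed read. The toy sketch below is not PRINSEQ code; it only mimics trim_left followed by min_len on two made-up read lengths and compares the result with the reversed order. The read names (readA, readB) and the values 30 and 100 are assumptions chosen for illustration:

```shell
#!/bin/sh
# Toy illustration only: this is NOT PRINSEQ code. It mimics two steps from
# the order listed above (trim_left, then min_len) using read lengths.
trim_left=30
min_len=100

# Two hypothetical reads, given as "id length".
reads='readA 120
readB 150'

# Documented order: trim first, then apply the length filter.
trim_then_filter=$(printf '%s\n' "$reads" | awk -v t="$trim_left" -v m="$min_len" '
    { len = $2 - t; if (len >= m) print $1, len }')

# Reversed order, for comparison only: filter on the raw length, then trim.
filter_then_trim=$(printf '%s\n' "$reads" | awk -v t="$trim_left" -v m="$min_len" '
    $2 >= m { print $1, $2 - t }')

echo "trim, then filter:"
echo "$trim_then_filter"
echo "filter, then trim:"
echo "$filter_then_trim"
```

In this sketch readA (120 bases) drops to 90 bases after trimming and is removed by the length filter, whereas the reversed order would have kept it. When choosing a min_len value, it therefore helps to think in terms of the read length after trimming.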
What is the link for generating the HTML report files from graph data files?
The standalone version can generate graph data files that can be used to create PNG files or an HTML report file using PRINSEQ-graphs (distributed with PRINSEQ-lite) or the online form at:
http://edwards.sdsu.edu/prinseq/ -> Get Report
How can I report a problem or request a new feature?
Please use the Q&A forum (https://groups.google.com/d/forum/edwardslabtools) and tag the entry with PRINSEQ.
How do I install PRINSEQ?
You can use PRINSEQ-lite without installing any third-party software or non-core Perl modules. If you want to generate the report PNG or HTML files with the prinseq-graphs.pl script, you will need to install the Cairo library (http://cairographics.org/) and some additional Perl modules (see the README file for details). If you want to install PRINSEQ to your /usr/bin/ directory in a Linux/Unix environment, you will need root permissions. The guys from Era7 Bioinformatics created a shell script that will do the installation for you (https://github.com/era7bio/prinseq.install). Before running it, open the script in a text editor and change the version numbers to the latest version of PRINSEQ.
How do I sort FASTQ files by their sequence identifier for paired-end processing in PRINSEQ?
PRINSEQ requires sorted input files for paired-end or mate-pair data processing. If your two FASTQ files of a paired-end (or mate-pair) dataset need to be sorted by their sequence identifiers, you can use the following one-liner in Linux/Unix/OSX:
paste - - - - < file_1.fastq | sort -k1,1 -t " " | tr "\t" "\n" > file_1_sorted.fastq
paste - - - - < file_2.fastq | sort -k1,1 -t " " | tr "\t" "\n" > file_2_sorted.fastq
This will first join the 4 lines (paste - - - -) of a FASTQ entry into a single line (with each of the 4 original lines separated by tabs), then sort them by their sequence identifier (-k1,1 -t " " specifies everything before the first space for the sorting, which is our sequence identifier), and write each entry again in 4 lines by replacing the tabs with line breaks. The sorted entries are then saved in a new file specified after the ">" sign.
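The following self-contained demo applies the same pipeline to two tiny, made-up FASTQ files and then checks that the sorted files list their identifiers in the same order. The file names, read IDs, and the ids helper are hypothetical, and the check assumes that both mates share the identifier text before the first space. Setting LC_ALL=C is an added assumption to make the sort order independent of the system locale:

```shell
#!/bin/sh
# Illustrative demo with made-up reads; file names and read IDs are hypothetical.
tmpdir=$(mktemp -d)

# Two tiny FASTQ files whose entries are stored in different orders.
printf '%s\n' '@read2 1:N' 'GGCC' '+' 'IIII' \
              '@read1 1:N' 'ACGT' '+' 'IIII' > "$tmpdir/demo_1.fastq"
printf '%s\n' '@read1 2:N' 'TTAA' '+' 'IIII' \
              '@read2 2:N' 'CCGG' '+' 'IIII' > "$tmpdir/demo_2.fastq"

# Same pipeline as above. LC_ALL=C is an added assumption so that both files
# are sorted byte-wise, regardless of the system locale.
for n in 1 2; do
    paste - - - - < "$tmpdir/demo_$n.fastq" | LC_ALL=C sort -k1,1 -t " " \
        | tr "\t" "\n" > "$tmpdir/demo_${n}_sorted.fastq"
done

# Print the identifier (text before the first space) of every FASTQ entry.
ids() { awk 'NR % 4 == 1 { print $1 }' "$1"; }

ids_1=$(ids "$tmpdir/demo_1_sorted.fastq")
ids_2=$(ids "$tmpdir/demo_2_sorted.fastq")

if [ "$ids_1" = "$ids_2" ]; then
    echo "identifier order matches"
else
    echo "identifier order differs"
fi

rm -rf "$tmpdir"
```

If the identifiers of your mates differ before the first space (for example, a /1 and /2 suffix), the comparison step would need to strip that suffix first.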
How are duplicates defined in PRINSEQ for paired-end data?
PRINSEQ defines duplicates slightly differently for single-end and paired-end data. For paired-end data, a duplicate is determined using both reads. A pair (read1-read2) that would be considered a duplicate based on read1 but not on read2 is not considered a paired duplicate (it is most likely not an artificial duplicate and possibly not a biological one). If a singleton is a duplicate of a read in a pair, then the singleton will be filtered out.
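The paired-end rule can be illustrated with a small conceptual sketch. This is not PRINSEQ's implementation: it only considers exact sequence matches, ignores singletons, and uses made-up reads and file names. A pair is reported only when both of its sequences have been seen together before:

```shell
#!/bin/sh
# Conceptual sketch only, NOT PRINSEQ's implementation: exact matches only,
# singletons ignored, reads and file names made up for illustration.
tmpdir=$(mktemp -d)

# p1 and p2 share read1 (ACGT) but differ in read2: not a paired duplicate.
# p3 matches p1 in both reads: a paired duplicate.
printf '%s\n' '@p1' 'ACGT' '+' 'IIII' \
              '@p2' 'ACGT' '+' 'IIII' \
              '@p3' 'ACGT' '+' 'IIII' > "$tmpdir/r1.fastq"
printf '%s\n' '@p1' 'TTTT' '+' 'IIII' \
              '@p2' 'GGGG' '+' 'IIII' \
              '@p3' 'TTTT' '+' 'IIII' > "$tmpdir/r2.fastq"

# paste joins line i of both files with a tab; the second line of each
# 4-line entry then holds "sequence1<TAB>sequence2", which serves as the key.
dups=$(paste "$tmpdir/r1.fastq" "$tmpdir/r2.fastq" | awk -F '\t' '
    NR % 4 == 1 { id = $1 }
    NR % 4 == 2 { key = $1 "\t" $2; if (seen[key]++) print id }')

echo "paired duplicates: $dups"

rm -rf "$tmpdir"
```

Here only @p3 is reported. Pair @p2 shares its first read with @p1, but because the second reads differ it is kept, which mirrors the rule described above.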
Is there a mailing list for updates and new releases?
Yes, there is a mailing list. You can sign up for updates and release emails at https://lists.sourceforge.net/lists/listinfo/prinseq-news. The mailing list archive can be accessed at https://sourceforge.net/mailarchive/forum.php?forum_name=prinseq-news.