PRINSEQ. PReprocessing and INformation of SEQuence data.
Easy and rapid quality control and
PRINSEQ can be used to filter, reformat, or trim your genomic and metagenomic sequence data. It generates summary statistics of your sequences in graphical and tabular format. It is easily configurable and provides a user-friendly interface.
Input and output formats
You can use data in FASTA (and QUAL) format or FASTQ format as input. The files can be compressed with the ZIP, GZIP, or BZIP2 algorithm to reduce upload time. A unique data identifier can be used to share and access previously uploaded data.
PRINSEQ provides summary statistics for your data including read length, GC content, sequence complexity and quality score distributions, number of read duplicates, occurrence of Ns and poly-A/T tails, assembly quality measures, tag sequences, and more.
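As a rough illustration of two of these statistics, the Python sketch below computes the GC content of a read and an N50-style value for a set of contig lengths. It is not PRINSEQ's own code; the function names are hypothetical, and it assumes the common definitions: GC content is the percentage of G and C among all bases, and N50 is the length of the shortest contig in the smallest set of longest contigs that covers at least half of all bases.

```python
def gc_content(seq):
    """Percentage of G and C bases in seq (0.0 for an empty sequence)."""
    seq = seq.upper()
    if not seq:
        return 0.0
    return 100.0 * (seq.count("G") + seq.count("C")) / len(seq)


def nxx(lengths, fraction=0.5):
    """Contig length L such that contigs of length >= L cover at least
    `fraction` of all bases (fraction=0.5 gives N50)."""
    total = sum(lengths)
    running = 0
    # Walk from the longest contig down until the target coverage is reached.
    for length in sorted(lengths, reverse=True):
        running += length
        if running >= fraction * total:
            return length
    return 0  # no contigs given
```

For example, `gc_content("ACGT")` gives 50.0, and `nxx([2, 3, 4, 5, 6])` gives 5, because the two longest contigs (6 + 5 = 11) already cover at least half of the 20 bases.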
Filter, trim and reformat data
Sequence data can be filtered to remove sequence copies, short or long sequences, sequences with N's, low-quality sequences, and much more. The user can change the sequence identifiers, can convert between FASTA+QUAL and FASTQ format, convert RNA to DNA and vice versa, ...
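The kind of filtering described above can be sketched in a few lines of Python. This is only an illustration under simple assumptions, not how PRINSEQ is implemented: the function and parameter names (`filter_reads`, `min_len`, `max_len`, `max_ns`, `rna_to_dna`) are hypothetical, reads are given as (identifier, sequence) pairs, and only exact duplicates are detected.

```python
def filter_reads(reads, min_len=1, max_len=None, max_ns=0):
    """Yield (seq_id, seq) pairs that pass length, N-count and
    exact-duplicate checks."""
    seen = set()
    for seq_id, seq in reads:
        seq = seq.upper()
        if len(seq) < min_len:
            continue  # too short
        if max_len is not None and len(seq) > max_len:
            continue  # too long
        if seq.count("N") > max_ns:
            continue  # too many ambiguous bases
        if seq in seen:
            continue  # exact duplicate of a read that was already kept
        seen.add(seq)
        yield seq_id, seq


def rna_to_dna(seq):
    """Convert RNA to DNA by replacing U with T, preserving case."""
    return seq.replace("U", "T").replace("u", "t")
```

Writing the filter as a generator keeps only the set of already-seen sequences in memory, which matters for large read files.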
News and Updates
Here you will find release notes and news related to PRINSEQ.
03 / 2013
Release of lite version 0.20.3: Fixed issue of incorrect duplicate counts when a sequence is both an exact duplicate and reverse complement exact duplicate of another sequence.
01 / 2013
Release of lite version 0.20.2: Added support for STDOUT output to paired-end processing.
Release of lite version 0.20.1: Fixed issue with FASTA inputs that caused the program to exit.
Release of web version 0.20.1: Release of web version files to run the web version on a local machine.
12 / 2012
Release of lite version 0.20: Fixed deprecated use of 'defined' on aggregates. Added options "trim_left_p" and "trim_right_p" to trim reads by a percentage value in addition to options that trim by number of nucleotides. Added option "stats_assembly" to report N50, N90, etc. contig sizes in the standalone version's summary statistics output. Added support for paired-end data (new options "fasta2" and "fastq2").
Release of graphs version 0.6: Added support for paired-end data.
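To illustrate the idea behind the percentage-based trimming options introduced in lite version 0.20, here is a minimal Python sketch. The function name and its parameters are hypothetical, and rounding the trim length down to a whole number of bases is an assumption of this sketch rather than a statement about PRINSEQ's behavior.

```python
def trim_by_percentage(seq, left_pct=0, right_pct=0):
    """Trim integer percentages of the read length from the 5' (left)
    and 3' (right) ends; trim lengths are rounded down (assumption)."""
    n = len(seq)
    left = n * left_pct // 100
    right = n * right_pct // 100
    # If the two trims meet or overlap, the slice is simply empty.
    return seq[left:n - right]
```

With a 10 bp read, `trim_by_percentage("ACGTACGTAC", 10, 20)` removes one base from the left and two from the right, leaving "CGTACGT".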
11 / 2012
There was an attack on our servers this week and we had to switch to the backup server. Data on the backup server is about 2 weeks out of date, which means that data you uploaded is most likely not available. There is not much we can do until Sunday (the servers need to be examined by the IT guys).
09 / 2012
Release of lite version 0.19.5: Fixed issue of incorrect quality trimming with arguments "min" and "max" for option -trim_qual_type.
Release of lite version 0.19.4 and graphs version 0.5.1: Fixed issues related to the use of qw() in loops for Perl version 5.14+ (thanks to Evan Staton for pointing out the issue and providing the link with details: http://search.cpan.org/~jesse/perl-5.14.0/pod/perldelta.pod#Use_of_qw(...)_as_parentheses). Fixed issue with 5'/3' duplicate removal that forced option -exact_only (thanks to Stephanie Pierson for reporting the issue). Fixed issue with missing duplicate statistics in graph data output if -derep or -graph_stats was not specified. Suppressed output of PCA module when generating PCA plots.
06 / 2012
Release of lite version 0.19.3: Added new output file option to keep track of sequence identifier renaming (option -seq_id_mappings). Fixed trim_qual_rule parameter listed twice in the log file. Fixed issue with sequences of length 3bp when calculating DUST scores. Fixed issue with exact_only parameter check.
05 / 2012
Release of lite version 0.19.2: Increased memory efficiency for graph data calculation on big input files.
Release of lite version 0.19.1: Fixed rounding issue in sequence complexity calculation.
Release of lite version 0.19 and graphs version 0.5: Added check for counts of filtered sequences to report when no sequences were filtered. Optimized dinucleotide calculation (~400% faster), sequence complexity calculation (~80% faster) and quality filtering and trimming (~90% faster when both filtering and trimming). Added option (-graph_stats) to select what statistics should be calculated and included in the graph_data file (useful, e.g., if you do not need sequence complexity information, which requires a lot of computation). Added binned base quality data to graph data output (as generated in web version up to 0.17.4). Removed annotations from length distribution graph if standard deviation is zero.
Release of lite version 0.18.3: Fixed phred64 scaling issue for graph data outputs (thanks to Komal Jain for pointing out the issue).
04 / 2012
Release of lite version 0.18.2: Fixed typo in selection of graph data elements that resulted in missing quality data (since version 0.18).
Release of lite version 0.18.1 and graphs version 0.4.1: Fixed missing zero count for Ns when generating graphs data file. Fixed duplication count table output for HTML report.
Release of lite version 0.18 and graphs version 0.4: Added options for web version processing (lite+graphs). Added custom parameter processing (same as already available in web version). Added option to input parameters saved in a file (lite). Fixed issue with output to STDOUT for "-out_format 4" option. Added counts by type for filtered sequences to verbose output (lite). Updated layout of HTML report to match web version and to use fewer colors to reduce printing costs and increase readability (graphs).
02 / 2012
Release of lite version 0.17.4: Fixed issue with MID tag output when using the -graph_data option.
01 / 2012
Release of lite version 0.17.3: Fixed issue with non-exact duplicate removal that caused incorrect out_bad files (filtered out outputs) introduced in last version.
Release of lite version 0.17.2: Fixed issue with non-exact duplicate removal when graph data and data processing is performed at the same time.
12 / 2011
Release of lite version 0.17.1 and graphs version 0.3: Added support for tag sequence check to the HTML output.
11 / 2011
Release of lite version 0.17 and graphs version 0.2: Added error message if statistics and graph data are generated at the same time. Prevented generation of graphs for missing data that might otherwise generate errors. Prevented the use of -stats outputs when generating graph data. Added example data for prinseq-graphs. Fixed issue with filenames containing a non-alphanumeric character after the period (thanks to Marmaduke for pointing out the issue). New option -no_qual_header reduces the file size of FASTQ files by omitting the header information for the quality data. New option -derep_min to specify the duplication threshold (e.g. only filter sequences that occur more than 5 times).
09 / 2011
Release of web version 0.16.1: Fixed issue with mean and max quality score rule for trimming and changed trim "until" to "while" (web only, lite version is not affected).
Release of version 0.16: Check if sequence qualities are in Phred+64 format, if specified. Added the reporting of errors during processing of data. Multiple output formats are now supported (prinseq-lite). Extended the input format from ACGTN to full nucleic acid ambiguity code (ACGTURYKMSWBDHVNX-). Allow processing of amino acid sequences (prinseq-lite). Replace option -si13 with -phred64 to specify input files in Phred+64 format. New options to generate graphs in standalone lite version (using prinseq-graphs.pl or online form).
Release of version 0.15.1: Fixed problem with dots in directory names (prinseq-lite). Fixed problem with trimming from left of reads that are shorter than the specified trim length. Fixed error in calculation of Phred quality scores for Solexa/Illumina 1.3+ data.
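For readers unfamiliar with the Phred+64 encoding mentioned in the 0.16 and 0.15.1 notes, the following Python sketch shows how quality characters map to Phred scores under the two common ASCII offsets (33 and 64, the latter used by Solexa/Illumina 1.3+ data). The function names are hypothetical and the validity check is a simplified stand-in, not PRINSEQ's actual format check.

```python
def decode_quality(qual, offset=33):
    """Convert a FASTQ quality string to a list of Phred scores."""
    scores = [ord(ch) - offset for ch in qual]
    # A negative score means the string cannot be in the assumed encoding.
    if any(score < 0 for score in scores):
        raise ValueError("quality string is not valid for offset %d" % offset)
    return scores


def phred64_to_phred33(qual):
    """Re-encode a Phred+64 quality string as Phred+33."""
    return "".join(chr(score + 33) for score in decode_quality(qual, offset=64))
```

Decoding first and re-encoding afterwards means that input containing characters below the Phred+64 range raises an error instead of being silently converted.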
06 / 2011
Release of version 0.15: New file input by URL (web version). Corrected typo in regex (missing \ before s*) and sequence id hash value (was seqi_d instead of seq_id). Added quality score scaling for Solexa/Illumina 1.3+ data. New option to trim poly-N tails. New option to read from STDIN and write to STDOUT (lite version). Adjusted graph labels for datasets with more than 1 million reads (web version).
05 / 2011
Release of version 0.14.4: Fixed warnings in tag sequence function. Corrected line break position in output format for QUAL files. Fixed warnings for quality trimming from the 3'-ends (lite version).
04 / 2011
Release of version 0.14.2: Fixed issue of file format check with non-Unix line breaks causing misidentification of FASTQ files. Fixed warning when trimming and dereplicating.
03 / 2011
Release of version 0.14: Added status report for writing data after duplicate removal and added number of bases and mean length to output summary statistics in verbose and log mode (lite version). Modified data processing to allow larger files and more highly compressed input files that previously caused callback timeout (web version). Fixed warning when out_good or out_bad is set to null, and fixed the renaming of sequence identifiers when read duplicates are also removed.
02 / 2011
Release of version 0.13: Fixed issue with leading spaces in first quality score. Added length check that ensures that the number of bases matches the number of quality scores. (This also ensures that each sequence has quality scores, if a QUAL file is provided as input.)
Release of version 0.12: Fixed issue when sequences are 3bp or shorter that caused a division by zero and incorrect DUST complexity scores. Added -log option to generate a log file with the used command and basic input/output statistics (lite version). Fixed renaming issue for duplicate removal (lite version). Fixed issue for sequences with a single base and a quality score of 0.
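The division by zero fixed in version 0.12 is easy to see in a simplified DUST-style score. The Python sketch below uses the basic form of the score (counts of overlapping trinucleotides, normalized by the number of trinucleotides minus one); PRINSEQ's own implementation may differ, for example in windowing or scaling, and the function name is hypothetical. A sequence of 3 bp has exactly one trinucleotide, so the denominator becomes zero unless short sequences are handled separately.

```python
from collections import Counter


def dust_score(seq):
    """Simplified DUST-style complexity score (higher = lower complexity)."""
    seq = seq.upper()
    n_triplets = len(seq) - 2
    # Sequences of 3 bp or shorter would give a zero denominator below.
    if n_triplets <= 1:
        return 0.0
    counts = Counter(seq[i:i + 3] for i in range(n_triplets))
    total = sum(c * (c - 1) / 2 for c in counts.values())
    return total / (n_triplets - 1)
```

Returning 0.0 for very short sequences is a choice made for this sketch; a caller could equally skip such sequences or report them separately.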
01 / 2011
Release of version 0.11: Improved tag sequence probability estimation with additional check for MID tags (454 GSMIDs and RLMIDs) and report of MID sequence if found. Visualization for odds ratios to easily identify over- and under-represented dinucleotides (web version only). Added table with minimum and maximum complexity values and the respective sequence to the web version.
Release of version 0.10: Corrected typos in option description and user interface. Fixed bug when both output options out_good and out_bad are set to "null" in standalone version. Added summary statistics calculation for basic infos (stats_info), length (stats_len), dinucleotide odds ratios (stats_dinuc), tag probabilities (stats_tag), sequence duplicates (stats_dupl), ambiguous base N (stats_ns) and all together (stats_all) to the standalone version. See "Manual" for details on the new options.
12 / 2010
Release of version 0.9: Fixed parameter loading for JSON data. Fixed ID type in sequence complexity method. Changed order of tail trimming and quality trimming. Fixed 3'-end tail trimming bug. Extended documentation and verbose print output of lite-version. Added option to prevent output generation of certain files to lite-version. Fixed issue of maximum number of sequences in combination with duplicate removal. Fixed missing quality trimming in trimming of sequence to fixed length. Fixed GC content range filtering. Changed integer to float for percentage value filtering. Removed debugging output. Forced single line output for FASTQ format.
11 / 2010
Release of version 0.8: Use JSON to manage parameters on server and user site. Added mean and standard deviations to length and GC content plot to guide choice of minimum and maximum values. Added example data.
10 / 2010
Release of version 0.7: Added dinucleotide odds ratio calculation and PCA plots including several viral and microbial metagenomes. Added sequence complexity plots and filters using DUST and entropy methods. Reorganized input counts and input info to merge them into a single table. Added tables with counts to most plots. Added percentage for sequence numbers.
Release of version 0.6: Now uses the Cairo graphics library to generate graphics. Added parameter management functionality and pre-defined parameter sets. Moved the duplication plot into a separate plot and added reverse complement counts. Added two plots to show duplication level and number of duplicate counts. Now uses box-plots for quality scores.
Release of version 0.5: Switched to ExtJS for web-interface. Changed progress bars and other functionality to JS. Now uses bi-histograms for duplicate identification in GC content and length distribution plots. Graphs are now only shown when there are values to plot to reduce load on user site. Added sequence quality scores plot and filter functionality.
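As background for the dinucleotide odds ratio calculation added in version 0.7, here is a minimal Python sketch of the standard single-strand form, in which the odds ratio of a dinucleotide XY is its observed frequency divided by the product of the frequencies of X and Y. Values above 1 indicate over-representation and values below 1 under-representation. The function name is hypothetical, and PRINSEQ's calculation may differ in detail (for example in how it treats the reverse strand or ambiguous bases).

```python
from collections import Counter


def dinucleotide_odds_ratios(seq):
    """Map each observed dinucleotide XY to f(XY) / (f(X) * f(Y))."""
    seq = seq.upper()
    n_mono = len(seq)
    n_di = len(seq) - 1
    if n_di < 1:
        return {}  # fewer than two bases: no dinucleotides
    mono = Counter(seq)
    di = Counter(seq[i:i + 2] for i in range(n_di))
    ratios = {}
    for pair, count in di.items():
        f_x = mono[pair[0]] / n_mono
        f_y = mono[pair[1]] / n_mono
        ratios[pair] = (count / n_di) / (f_x * f_y)
    return ratios
```

Only dinucleotides that actually occur in the sequence appear in the result; absent dinucleotides have an odds ratio of zero by this definition.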
09 / 2010
PRINSEQ release 0.4: Removed rarely used information shown in "Input stats". Added base frequencies at sequence ends and tag sequence probability for tag sequence check. Added line width formatting option for FASTA (and QUAL) output. Use binning for datasets with long sequences.
PRINSEQ release 0.3: Added more information to the "Reformat Options" field for renaming sequence ids, and automatic removal of spaces, ">", and quotes from sequence ids before renaming. Fixed issue with saving "0" values instead of default values into parameters file and "division by 0" when calculating 1/length for sequence fractions. PRINSEQ now automatically removes spaces and dashes from sequences when parsing the input data. Added function to convert base U to T for RNA input and DNA output files.