PRINSEQ @ SourceForge.net

PRINSEQ. PReprocessing and INformation of SEQuence data.

Manual

Web version
    Necessary resources
    Upload data

Standalone lite version
    Necessary resources
    Available options
    Examples

Graphs version
    Necessary resources
    Available options

Quality control
    Length of sequences
    Base qualities
    GC content
    Poly-A/T tails
    Ambiguous bases
    Sequence duplications
    Sequence complexity
    Tag sequences
    Contamination
    Assembly
    References

Data preprocessing
    Manage options
    Filter options
    Trim options
    Reformat options
    References

Preprocessing 454 data
01/16/2011
PDF, 92 KB

Preprocessing FASTA-only data
01/16/2011
PDF, 76 KB

Raw data processing
01/16/2011
PDF, 112 KB

Quality control
01/16/2011
PDF, 520 KB

Data preprocessing
09/01/2011
PDF, 827 KB

Introduction

PRINSEQ is a tool that generates summary statistics of sequence and quality data and that is used to filter, reformat and trim next-generation sequence data. It is particular designed for 454/Roche data, but can also be used for other types of sequence data. PRINSEQ is available through a user-friendly web interface or as standalone version. The standalone version is primarily designed for data preprocessing and does not generate summary statistics in graphical form.

PRINSEQ is distributed under the GNU Public License (GPL). All its source codes are freely available to both academic and commercial users. The latest version can be downloaded at the SourceForge download page.

Web version

TOP OF PAGE

The interactive web interface provides summary statistics for quality control that can help to choose parameters for data preprocessing. PRINSEQ provides filter, trim and reformat options for data preprocessing.

Necessary resources

Hardware
- Computer connected to the Internet

Software
- Up-to-date Web browser (Firefox, Safari, Chrome, Internet Explorer, ...)

Files
  - FASTA file with sequence data
  - QUAL file with quality scores (if available)
  - FASTQ file (as alternative format)

Upload data to the PRINSEQ web version

To upload a new dataset in FASTA and QUAL format (or FASTQ format) to PRINSEQ, follow these steps:
  1. Go to http://prinseq.sourceforge.net
  2. Click on "Use PRINSEQ" in the top menu on the right (the latest PRINSEQ web version should load)
  3. Click on "Upload new data"
  4. Select your FASTA and QUAL files or your FASTQ file and click "Submit"

After the data is parsed and processed successfully, the user interface will show a menu on the left and a message in the main panel as shown below.

Notes
After clicking the submit button, a status bar (not progress bar) will be displayed until the file upload is completed. During the data processing, several progress bars will show the progress of the data parsing and statistics calculation steps. After the data is parsed and processed successfully, the user interface will provide a menu on the left and information and guidance in the main panel.

Possible problems
1. The PRINSEQ web interface does not load / is not visible.
You only see this and nothing else happens:

Solution: Make sure that you have JavaScript activated in your browser, as this is required to load and use PRINSEQ's web interface.

2. The upload status bar does not disappear.
After clicking on the submit button you see this and it does not disappear:

Solution: The first thing to check is if the file is still uploading. The easiest way to do this is by checking the loading icon in your browser.

If you see the loading icon (right) instead of the PRINSEQ icon (left), your file is still uploading and you should give it more time. If you see the PRINSEQ icon instead of the loading icon, your file did not upload completely and this caused an error. If you have a slow connection to the Internet or try to upload large files, the connection to the web server can time out before the upload was completed. If you did not upload compressed files, try to compress your files with any of the supported compression algorithms (ZIP, GZIP, ...).
In rare cases, the issue can also be caused by certain Firefox plugins or extensions. If possible, use an alternative browser to test if this was the case. If the browser caused the problem, updating Firefox and the plugins/extension to the latest version might solve the problem.

Quality control

TOP OF PAGE

Sequencing technologies are not perfect and the quality control (QC) is an essential step to ensure that the data used for downstream analysis is not compromised of low-quality sequences, sequence artifacts, or sequence contamination that might lead to erroneous conclusions. The easiest way of QC is looking at summary statistics of the data. There are different programs that can produce those statistics. Web applications allow users to easily share and discuss the results with other people without transferring large data files. The following QC steps are implemented in and all graphics generated by PRINSEQ.

Number and length of sequences

Check those numbers to make sure it matches approximately the manufacturer estimates. If your numbers are off too much, check the raw data and filter statistics in "454BaseCallerMetrics" and "454QualityFilterMetrics".

Length distribution
The length distribution can be used as quality measure for the sequencing run. You would expect a normal distribution for the best result. However, most sequencing results show a slowly increasing and then a steep falling distribution. The plots in PRINSEQ mark the mean length (M) and the length for one and two standard deviations (1SD and 2SD), which can help to decide where to set length thresholds for the data preprocessing. If any of the sequences is longer than 100 bp, the lengths will be binned in the plots generated by PRINSEQ. The number of sequences for each bin is then shown instead of the number of sequences for a single length (values might therefore be bigger than shown in the table for non-binned lengths).

The following two datasets have approximately the same number of sequences, however the length distributions look different.

Both distributions have the highest number of sequences around 500 bp, but for the first dataset the mean of the sequence lengths is higher and the standard deviation is lower. A certain number of shorter reads might be expected, but if the sample contained mainly longer fragments, this number should be low.
Assuming that both samples contained enough fragments of at least 500 bp and all fragments were sequenced with the same number of cycles (sequencing flows), we would expect that the majority of the sequences would have approximately the same length. The higher amount of shorter reads in the second dataset suggests that those reads might have been of lower quality and were trimmed during the signal processing. If the sample contained many short fragments, the shorter reads might be from those fragments and not of lower quality.

Minimum and maximum read length
Sequences in the SFF files can be as short as 40 bp (shorter sequences are filtered during signal processing). For multiplexed samples, the MID trimmed sequences can be as short at 28 bp (assuming a 12 bp MID tag). Such short sequences can cause problems during, for example, database searches to find similar sequences. Short sequences are more likely to match at a random position by chance than longer sequences and may therefore result in false positive functional or taxonomical assignments. Furthermore, short sequences are likely to be quality trimmed during the signal-processing step and of lower quality with possible sequencing errors.
In some cases, sequences can be much longer than several standard deviations above the mean length (e.g. 1,500+ bp for a 500 bp mean length with a 100 bp standard deviation). Those sequences should be used with caution as they likely contain long stretches of homopolymer runs as in the following example below. Homopolymers are a known issue of pyrosequencing technologies such as 454/Roche [1].

aactttaaccttttaaaacccccttaaaaaaactttaaaccccgtaaaccccccgggttt
ttttttaaaaaaccgttttttacgggggtttaccccgttttaccggggttttgggggttt
taaaaaaaacgggttttaaacgggttaacccccgggttttccgggggtttaaaaagtttt
tttaaacgggggttttcccgtaaaaaaaaaaccccgtttaaaaaaaggggttaaaaaaaa
aaggggttaaccccccggggtttaaaaaaaaccttttttttttttaaaaaaaacgttttt
tttttttaaaaggggttttttttacgggggtaaacgggggggttaaaaaaaaaccccccc
cggggggttttaaaaaaaaaacccccggttttaaaaaaccccgttttaacccctttaaaa
aaaaaacgggggggttttaaaaaaaaaagggggttttttttttttaaaaacccgttttta
aaaaccccccgttttttaacccgggttaaaccccccccgggggggtaaaacccccccccc
ggggtaaccccctttttttaaaacccccccccgttttttacccgggggtttttacccccg
gggggggtaaaaaaacggggggtttttttttttttaaaaccggggttttttttttttaaa
ccccggtttttaaaaaccggtttttaccccggggggttttacccccgggggggggttttt
aaaacccccggtttaaaactttaaaaacccgggtaaccccggggttttaaaaaaaaaaaa
aaaaccccccccgttaaaaaaaaaaaacccgttttttttttaaaaaaaacccccccccgg
ttttaaaaccccccccgggggtttttaccccggggttttaaaaaaaacccgtttaaaaaa
accgggttttttaaaggggttttaaacccccccccc

Base qualities

As for Sanger sequencing, next-generation sequencers produce data with linearly degrading quality across the read. The quality scores for 454/Roche sequencers are Phred-based since version 1.1.03, ranging from 0 to 40. Phred values are log-scaled, where a quality score of 10 represents a 1 in 10 chance of an incorrect base call and a quality score of 20 represents a 1 in 100 chance of an incorrect base call.
In PRINSEQ, the quality scores are plotted across the reads using box plots. The x-axis indicates the absolute position if all reads are no longer than 100 bp and the relative position (in % of read length) if any read is longer than 100 bp. For datasets with any read longer than 100 bp, a second plot shows binned quality values to keep its absolute positions. This plot is helpful to identify quality scores at the end of longer reads, which would otherwise be grouped with the ends of the shorter reads. The following example shows the quality scores across the read length for fragments sequenced with GS FLX using the Titanium kit. The sequences with low quality scores at the ends should be trimmed during data preprocessing.

In addition to the decrease in quality across the read, regions with homopolymer stretches will tend to have lower quality scores. Huse et al. [1] found that sequences with an average score below 25 had more errors than those with higher averages. Therefore, it is helpful to take a look at the average (or mean) quality scores. PRINSEQ provides a plot that shows the distribution of sequence mean quality scores of a dataset, as shown below. The majority of the sequences should have high mean quality scores.

Low quality sequences can cause problems during downstream analysis. Most assemblers or aligners do not take into account quality scores when processing the data. The errors in the reads can complicate the assembly process and might cause misassemblies or make an assembly impossible.

GC content

The GC content distribution of most samples should follow a normal distribution. In some cases, a bi-modal distribution can be observed, especially for metagenomic data sets. The GC content plot in PRINSEQ marks the mean GC content (M) and the GC content for one and two standard deviations (1SD and 2SD). This can help to decide where to set the GC content thresholds, if a GC content filter will be applied. The plot can also be used to find the thresholds or range to select sequences from a bi-modal distribution.

Poly-A/T tails

Poly-A/T tails are considered repeats of As or Ts at the sequence ends. In PRINSEQ, the minimum length of a tail is 5 bp and sequences that contain only As or Ts are counted for both ends. A small number of tails can occur even after trimming poly-A/T tails. For example, a sequence that ends with AAAAATTTTT and that has been trimmed for the poly-T will still contain the poly-A.
Trimming poly-A/T tails can reduce the number of false positives during database searches, as long tails tend to align well to sequences with low complexity or sequences with tails (e.g. viral sequences) in the database.

Ambiguous bases

Sequences can contain the ambiguous base N for positions that could not be identified as a particular base. A high number of Ns can be a sign for a low quality sequence or even dataset. If no quality scores are available, the sequence quality can be inferred from the percent of Ns found in a sequence or dataset. Huse et al. [1] found that the presence of any ambiguous base calls was a sign for overall poor sequence quality.
Ambiguous bases can cause problems during downstream analysis. Assemblers such as Velvet and aligners such as SHAHA2 or BWA use a 2-bit encoding system to represent nucleotides, as it offers a space efficient way to store sequences. For example, the nucleotides A, C, G and T might be 2-bit encoded as 00, 01, 10 and 11. The 2-bit encoding, however, only allows to store the four nucleotides and any additional ambiguous base cannot be represented. The different programs deal with the problem in different ways. Some programs replace ambiguous bases with a random base (e.g. BWA [2]) and others with a fixed base (e.g. SHAHA2 and Velvet replace Ns with As [3]). This can result in misassemblies or false mapping of sequences to a reference sequence and therefore, sequences with a high number of Ns should be removed before downstream analysis.

Sequence duplications

Real or artificial? Assuming a random sampling of the genomic material in an environment such as in metagenomic studies, reads should not start at the same position and have the same errors (at least not in the numbers that they have been observed in most metagenomes). Gomez-Alvarez et al. [5] investigated the problem in more detail and did not find a specific pattern or location on the sequencing plate that could explain the duplications.
Duplicates can arise when there are too few fragments present at any stage prior to sequencing, especially during any PCR step. Furthermore, the theoretical idea of one micro-reactor containing one bead for 454/Roche sequencing does not always translate into practice where many beads can be found in a single micro-reactor. Unfortunately, artificial duplicates are difficult to distinguish from exactly overlapping reads that naturally occur within deep sequence samples.
The number of expected sequence duplicates highly depends on the depth of the library, the type of library being sequenced (whole genome, transcriptome, 16S, metagenome, ...), and the sequencing technology used. The sequence duplicates can be defined using different methods. Exact duplicates are identical sequence copies, whereas 5' or 3' duplicates are sequences that are identical with the 5' or 3' end of a longer sequence. Considering the double-stranded nature of DNA, duplicates could also be considered sequences that are identical with the reverse complement of another sequence.
The different plots in PRINSEQ can be helpful to investigate the degree of sequence duplications in a dataset. The following plot shows the number of sequence duplicates for different lengths. The distribution of duplicates should be similar to the length distribution of the dataset. The number of 5' duplicates is higher for shorter sequences (as observed in the example below), suggesting that exact sequence duplicates may have been trimmed during signal processing.

The number of exact duplicates is often higher than the number of 5' and 3' duplicates as in the following example.

PRINSEQ offers additional plots to investigate the sequence duplicates from different points of view. The plot showing the sequence duplication levels (with number of sequences with one duplicate, two duplicates, three duplicates, ...) can be used to identify the distribution of duplicates (e.g. do many sequences have only a few duplicates). The plot showing the highest number of duplicates for a single sequence (top 100) can help to indentify if only a few sequences have many duplicates (e.g. as a result of specific PCR amplification) and what the highest duplication numbers are.
Depending on the dataset and downstream analysis, it should be considered to filter sequence duplicates. The main purpose of removing duplicates is to mitigate the effects of PCR amplification bias introduced during library construction. In addition, removing duplicates can result in computational benefits by reducing the number of sequences that need to be processed and by lowering the memory requirements. Sequence duplicates can also impact abundance or expression measures and can result in false variant (SNP) calling. The example below shows the alignment of sequences against a reference sequence (gray). The sequence duplicates (starting at the same position) suggest a possibly false frequency of base C at the position marked in bold.

Keep in mind that the number of sequence duplicates also depends on the experiment. For short-read datasets with high coverage such as in ultra-deep sequencing or genome re-sequencing datasets, eliminating singletons can present an easy way of dramatically reducing the number of error-prone reads.

Sequence complexity

Genome sequences can exhibit intervals with low-complexity, which may be part of the sequence dataset when using random sampling techniques. Low-complexity sequences are defined as having commonly found stretches of nucleotides with limited information content (e.g. the dinucleotide repeat CACACACACA). Such sequences can produce a large number of high-scoring but biologically insignificant results in database searches. The complexity of a sequence can be estimated using many different approaches. PRINSEQ calculates the sequence complexity using the DUST and Entropy approaches as they present two commonly used examples.
The DUST approach is adapted from the algorithm used to mask low-complexity regions during BLAST search preprocessing [6]. The scores are computed based on how often different trinucleotides occur and are scaled from 0 to 100. Higher scores imply lower complexity and complexity scores above 7 can be considered low-complexity. A sequence of homopolymer repeats (e.g. TTTTTTTTT) has a score of 100, of dinucleotide repeats (e.g. TATATATATA) has a score around 49, and of trinucleotide repeats (e.g. TAGTAGTAGTAG) has a score around 32.

The Entropy approach evaluates the entropy of trinucleotides in a sequence. The entropy values are scaled from 0 to 100 and lower entropy values imply lower complexity. A sequence of homopolymer repeats (e.g. TTTTTTTTT) has an entropy value of 0, of dinucleotide repeats (e.g. TATATATATA) has a value around 16, and of trinucleotide repeats (e.g. TAGTAGTAGTAG) has a value around 26. Sequences with an entropy value below 70 can be considered low-complexity.

Tag sequences

Tag sequences are artifacts at the ends of sequence reads such as multiplex identifiers, adapters, and primer sequences that were introduced during pre-amplification with primer-based methods. The base frequencies across the reads present an easy way to check for tag sequences. If the distribution seems uneven (high frequencies for certain bases over several positions), it could indicate some residual tag sequences. The following three examples show the base frequencies of datasets with no tag sequence, multiplex identifier (MID) tag sequence, and whole transcriptome amplified (WTA) tag sequence.

Those tag sequence should be trimmed using a program such as TagCleaner (http://tagcleaner.sourceforge.net) [4]. The input to any such trimming program should be untrimmed reads (e.g. not quality trimmed), as this will allow easier and more accurate identification of tag sequences. PRINSEQ can be used after tag sequence trimming to check if the tags were removed sufficiently.
In addition to the frequency plots, PRINSEQ estimates if the dataset contains tag sequences. The probabilities for a tag sequence at the 5'- or 3'-end require a certain number of sequences (10,000 should be sufficient). A percentage below 40% does not always suggest a tag sequence, especially if it cannot be observed from the base frequencies. The estimation does not work for sequence datasets that target a single loci (e.g. 16S) and should only be used for randomly sequenced samples such as metagenomes.

Sequence contamination

Sequences obtained from impure nucleic acid preparations may contain DNA from sources other than the sample. Those sequence contaminations are a serious concern to the quality of the data used for downstream analysis, possibly causing erroneous conclusions. The dinucleotide odds ratios as calculated by PRINSEQ use the information content in the sequences of a dataset and can be used to identify possibly contamination [7]. Furthermore, dinucleotide abundances have been shown to capture the majority of variation in genome signatures and can be used to compare a metagenome to other microbial or viral metagenomes. PRINSEQ uses principal component analysis (PCA) to group metagenomes from similar environments based on dinucleotide abundances. This can help to investigate if the correct sample was sequenced, as viral and microbial metagenomes show distinct patterns. As samples might be processed using different protocols or sequenced using different techniques, this feature should be used with caution.

The PCA plots in PRINSEQ show how the user metagenome (represented by a red dot) groups with other metagenomes (blue dots). Since the plots are generated for microbial and viral metagenomes separately, they are marked with an M or V (top left corner). The percentages in parenthesis show the explained variation in the first and second principal component. The plots are generated using preprocessed data from published metagenomes that were sequenced using the 454/Roche sequencing platform. If sequences contain tag sequences or are targeted to a certain loci (e.g. 16S), this approach will not be able to group the user data to metagenomes from the same environment. The plot above shows how a microbial metagenome might be related to other microbial metagenomes. (This plot suggest that the metagenome is likely a marine metagenome sampled in a coastal region.)

The following plots show how a viral metagenome does not group with the microbial metagenomes (left) but closely with other mosquito metagenomes (right).

PRINSEQ additional lists the dinucleotide relative abundance odds ratios for the uploaded dataset. Anomalies in the odds ratios can be used to identify discrepancies in metagenomes such as human DNA contamination (depression of the CG dinucleotide frequency).

Assembly quality measures

The Nxx contig size is a weighted median that is defined as the length of the smallest contig C in the sorted list of all contigs where the cumulative length from the largest contig to contig C is at least xx% of the total length (sum of contig lengths). Replace xx by the preferred value such as 90 to get the N90 contig size. The higher the Nxx value, the higher the rate of longer contigs and the better the dataset. If the dataset does not contain contigs or scaffolds, this information can be ignored.

References

1. Huse S, Huber J, Morrison H, Sogin M, Welch D: Accuracy and quality of massively parallel DNA pyrosequencing. Genome Biology 2007, 8:R143.

2. Li H, Durbin R: Fast and accurate short read alignment with Burrows-Wheeler transform. Bioinformatics 2009, 25:1754-1760.

3. Li H, Homer N: A survey of sequence alignment algorithms for next-generation sequencing. Brief Bioinform 2010.

4. Schmieder R, Lim YW, Rohwer F, Edwards R: TagCleaner: Identification and removal of tag sequences from genomic and metagenomic datasets. BMC Bioinformatics 2010, 11:341.

5. Gomez-Alvarez V, Teal TK, Schmidt TM: Systematic artifacts in metagenomes from complex microbial communities. ISME J 2009, 3:1314-1317.

6. Morgulis A, Gertz EM, Schaffer AA, Agarwala R: A fast and symmetric DUST implementation to mask low-complexity DNA sequences. J. Comput. Biol 2006, 13:1028-1040.

7. Willner D, Thurber RV, Rohwer F: Metagenomic signatures of 86 microbial and viral metagenomes. Environ. Microbiol 2009.

Data preprocessing

TOP OF PAGE

Not only sequencing, but also data analysis costs money. Analyzing poor data wastes CPU time and interpreting the results from poor data wastes people time. The quality control step often shows that the data needs to be preprocessed before any downstream data analysis. The necessary data preprocessing steps highly depend on the type of library being sequenced (whole genome, transcriptome, 16S, metagenome, ...) and on the type of sequencing technology used to generate the data. The following guide should ensure that the data used for downstream analysis is not compromised of low-quality sequences, sequence artifacts, or sequence contamination that might lead to erroneous conclusions. However, there is no "one-size-fits-all" solution and each user must make informed decisions as to the appropriate parameters used for preprocessing.

Manage options

The DeconSeq web version allows users to manage the filter, trim and reformat options in different ways. There are features to save or reset the options currently set, to load previously saved options or to select one of the option pre-sets. Loaded options or selected pre-sets will only be applied after clicking on the "Set new options" button at the bottom. This allows users to review the options before actually setting them.

The feature to save options allows users to process different datasets with the same parameters and to record preprocessing parameters. For parameters that are likely to be used repeatedly in the future, users can request to add those parameters to the list of pre-set options.

Filter options

Length related
Sequences in the SFF files can be as short as 40 bp (shorter sequences are filtered during signal processing). For multiplexed samples, the MID trimmed sequences can be as short at 28 bp (assuming a 12 bp MID tag). Such short sequences can cause problems during, for example, database searches to find similar sequences. Short sequences are more likely to match at a random position by chance than longer sequences and may therefore result in false positive functional or taxonomical assignments. Furthermore, short sequences are likely to be quality trimmed during the signal-processing step and of lower quality with possible sequencing errors.

In some cases, sequences can be much longer than several standard deviations above the mean length (e.g. 1,500+ bp for a 500 bp mean length with a 100 bp standard deviation). Those sequences should be used with caution as they likely contain long stretches of homopolymer runs as in the following example. Homopolymers are a known issue of pyrosequencing technologies such as 454/Roche [1].

A rule of thumb for sequence length thresholds of longer-read datasets is to filter sequences shorter than 60 bp (20 amino acids) and longer than twice the mean length.

Quality score related
In addition to the decrease in quality across the read, regions with homopolymer stretches will tend to have lower quality scores. Huse et al. [1] found that sequences with an average score below 25 had more errors than those with higher averages.

Most published thresholds for the sequence mean quality score range from 15 to 25.

GC content related
The GC content distribution of most samples should follow a normal distribution. In some cases, a bi-modal distribution can be observed, especially for metagenomic datasets. This filter is rarely used, but proved useful to separate sequences in a bi-modal distribution.

Ambiguity code related
Sequences can contain the ambiguous base N for positions that could not be identified as a particular base. A high number of Ns can be a sign for a low quality sequence or even dataset. If no quality scores are available, the sequence quality can be inferred from the percent of Ns found in a sequence or dataset. Huse et al. [1] found that the presence of any ambiguous base calls was a sign for overall poor sequence quality.

Ambiguous bases can cause problems during downstream analysis. Assemblers such as Velvet and aligners such as SHAHA2 or BWA use a 2-bit encoding system to represent nucleotides, as it offers a space efficient way to store sequences. For example, the nucleotides A, C, G and T might be 2-bit encoded as 00, 01, 10 and 11. The 2-bit encoding, however, only allows to store the four nucleotides and any additional ambiguous base cannot be represented. The different programs deal with the problem in different ways. Some programs replace ambiguous bases with a random base (e.g. BWA [2]) and others with a fixed base (e.g. SHAHA2 and Velvet replace Ns with As [3]). This can result in misassemblies or false mapping of sequences to a reference sequence and therefore, sequences with a high number of Ns should be removed before downstream analysis.

Filtering out all reads containing Ns is only suggested if the loss can be afforded (e.g. high coverage datasets or low number of sequences with ambiguous bases). Filtering reads containing more than 1% of ambiguous bases is advised.

Data content related
To select a subset of all sequence in a dataset, the number of wanted sequences can be specified. The first X sequences passing all other specified filters can be selected this way.

The sequence duplicates can be defined using different methods. Exact duplicates are identical sequence copies, whereas 5' or 3' duplicates are sequences that are identical with the 5' or 3' end of a longer sequence. Considering the double-stranded nature of DNA, duplicates could also be considered sequences that are identical with the reverse complement of another sequence.

Depending on the dataset and downstream analysis, it should be considered to filter sequence duplicates. The main purpose of removing duplicates is to mitigate the effects of PCR amplification bias introduced during library construction. In addition, removing duplicates can result in computational benefits by reducing the number of sequences that need to be processed and by lowering the memory requirements. Sequence duplicates can also impact abundance or expression measures and can result in false variant (SNP) calling.

PRINSEQ filters duplicates without allowing mismatches, as artificial duplicates tend to have the same errors and error-models are computationally more expensive. Programs such as cdhit-454 [4] use clustering techniques to identify near identical duplicates. However, those methods tend to miss duplicates identified by PRINSEQ due to the used clustering methods. For best results, duplicates should initially be filtered using PRINSEQ and then optionally using clustering methods.

For metagenomic datasets, the exact and 5' duplicates should be removed. The 3' and reverse complement duplicates can be removed as they do not provide additional information in database searches, but might be useful for variant discovery or assembly.

Sequence complexity related
Low-complexity sequences are defined as having commonly found stretches of nucleotides with limited information content (e.g. the dinucleotide repeat CACACACACA). Such sequences can produce a large number of high-scoring but biologically insignificant results in database searches. PRINSEQ calculates the sequence complexity using the DUST and Entropy approaches as they present two commonly used examples.

The DUST approach is adapted from the algorithm used to mask low-complexity regions during BLAST search preprocessing [5]. The scores are computed based on how often different trinucleotides occur and are scaled from 0 to 100. Higher scores imply lower complexity. A sequence of homopolymer repeats (e.g. TTTTTTTTT) has a score of 100, of dinucleotide repeats (e.g. TATATATATA) has a score around 49, and of trinucleotide repeats (e.g. TAGTAGTAGTAG) has a score around 32.

The Entropy approach evaluates the entropy of trinucleotides in a sequence. The entropy values are scaled from 0 to 100 and lower entropy values imply lower complexity. A sequence of homopolymer repeats (e.g. TTTTTTTTT) has an entropy value of 0, of dinucleotide repeats (e.g. TATATATATA) has a value around 16, and of trinucleotide repeats (e.g. TAGTAGTAGTAG) has a value around 26.

Sequences with a DUST score above 7 or an entropy value below 70 can be considered low-complexity. An entropy value of 50 or 60 would be a more conservative choice.

Custom filter parameters
The custom filter allows the specification of user defined filter using a two value system. Each new filter should be defined on a separate line and values should be separated by space. The first value defines the filter pattern (any combination of the letters "ACGTN"). The second value defines the number of repeats or percentage of occurrence of the filter pattern. The percentage values are defined by a number followed by the %-sign (without space). If no %-sign is given, it is assumed that the value specifies the number of repeats of the filter pattern.

Examples:
   AAT 8    filters out sequences containing AATAATAATAATAATAATAATAAT anywhere in the sequence
   T 70%   filters out sequences with more than 70% T's in the sequence
   A 15      filters out sequences containing AAAAAAAAAAAAAAA anywhere in the sequence

Trim options

Trim by length/position
Sequences can be trimmed to a specific length or a fixed number of nucleotides can be trimmed from either end.

Trim tails
Poly-A/T tails can be trimmed from either end specifying a minimum tail length. All repeats of As or Ts with at least this length will be trimmed from the sequence ends. A small number of tails can occur even after trimming poly-A/T tails. For example, a sequence that ends with AAAAATTTTT and that has been trimmed for the poly-T will still contain the poly-A.

Trimming poly-A/T tails can reduce the number of false positives during database searches, as long tails tend to align well to sequences with low complexity or sequences with poly-A tails in the database.

Trim ends by quality scores
As for Sanger sequencing, next-generation sequencers produce data with linearly degrading quality across the read. The quality scores for 454/Roche sequencers are Phred-based since version 1.1.03, ranging from 0 to 40. Phred values are log-scaled, where a quality score of 10 represents a 1 in 10 chance of an incorrect base call and a quality score of 20 represents a 1 in 100 chance of an incorrect base call.

Sequences can be trimmed from either end using different rules applied to a sliding window. To stop at the first base that fails the rule defined, use a window size of 1. A bigger window size can trim sequences that might contain a high quality score in between low quality scores without stopping at the high quality score. To move the sliding window over all quality scores without missing any, the step size should be less or equal to the window size.

Note:
The quality trimming during the signal processing step (see Raw data processing PDF file) may not be sufficient. Trimmed sequences can end with low quality bases or even with ambiguous base N (approx. 1%). Reads with RLMIDs (Rapid library multiplex identifiers) may be trimmed in high quality regions as the default behavior will cause the reads to be trimmed at the first position the MID sequence matches, even if it is not the MID but a natural occurring match inside the read.

The parameters should be set to trim positions with a quality score below 20.

Reformat options

Reformat sequences
The sequence characters case can be changed between lower and upper case. This can be useful to, for example, remove soft-masking from sequences.

Sequences can be converted between RNA and DNA. This can be useful to, for example, convert RNA sequences into DNA before generating a BLAST nucleotide database.

Reformat header lines
The header line of a sequence in FASTA or FASTQ format contains the ">" symbol followed by the sequence identifier (sequence name) and an optional description separated by space. In many cases, the description in the sequence header is not used for any downstream analysis. Removing the description in the sequence header can significantly reduce the size of the FASTA or FASTQ file. The sequence header can contain information about the sequence that can be incorrect after data preprocessing (e.g. trimming) and therefore should be corrected or removed.

The sequence identifiers can be renamed to contain new information or to ensure that all sequences have a unique identifier, which is required by most programs used for downstream analysis. Sequence identifiers should not contain spaces, >, or quotes (which will be automatically removed by PRINSEQ). A counter is added to each identifier to assure its uniqueness. For example, "mySeq_10" will generate the IDs (in FASTA format):

>mySeq_101
ACGTACGTACGT
>mySeq_102
ACGTACGTACGT
>mySeq_103
...

References

1. Huse S, Huber J, Morrison H, Sogin M, Welch D: Accuracy and quality of massively parallel DNA pyrosequencing. Genome Biology 2007, 8:R143.

2. Li H, Durbin R: Fast and accurate short read alignment with Burrows-Wheeler transform. Bioinformatics 2009, 25:1754-1760.

3. Li H, Homer N: A survey of sequence alignment algorithms for next-generation sequencing. Brief Bioinform 2010.

4. Niu B, Fu L, Sun S, Li W: Artificial and natural duplicates in pyrosequencing reads of metagenomic data. BMC Bioinformatics 2010, 11:187.

5. Morgulis A, Gertz EM, Schaffer AA, Agarwala R: A fast and symmetric DUST implementation to mask low-complexity DNA sequences. J. Comput. Biol 2006, 13:1028-1040.

Standalone lite version

TOP OF PAGE

The standalone lite version does not provide any graphical outputs and is designed for preprocessing purposes only. This version is called "lite" as it does not require any non-core Perl modules for processing (no installation of additional Perl modules or third party programs is required). The readme file contains information on the usage of the standalone lite version.
The lite version can generate graph data files that can be used to generate the graphical outputs using the standalone graphs version or an online form (see below).

Necessary resources

Hardware
- Computer with a Linux/Unix or Mac OS X operating system

Software
- PRINSEQ lite (available at http://prinseq.sourceforge.net)

Files
  - FASTA file with sequence data
  - QUAL file with quality scores (if available)
  - FASTQ file (as alternative format)

Available options

Option/flag	Description	Default	Range
-help or -h	Print the help message; ignore other arguments
-man	Print the full documentation; ignore other arguments
-version	Print program version; ignore other arguments
-verbose	Prints status and info messages during processing
Input options
-fastq	Input file in FASTQ format that contains the sequence and quality data. Use stdin instead of a file name to read from STDIN (-fasta stdin). This can be useful to process compressed files using Unix pipes.		FILE, stdin
-fasta	Input file in FASTA format that contains the sequence data. Use stdin instead of a file name to read from STDIN (-fastq stdin). This can be useful to process compressed files using Unix pipes.		FILE, stdin
-qual	Input file in QUAL format that contains the quality data		FILE
-fastq2	For paired-end data only. Input file in FASTQ format that contains the sequence and quality data. The sequence identifiers for two matching paired-end sequences in separate files can be marked by /1 and /2, or _L and _R, or _left and _right, or must have the exact same identifier in both input files. The input sequences must be sorted by their sequence identifiers. Singletons are allowed in the input files.		FILE
-fasta2	or paired-end data only. Input file in FASTA format that contains the sequence data. The sequence identifiers for two matching paired-end sequences in separate files can be marked by /1 and /2, or _L and _R, or _left and _right, or must have the exact same identifier in both input files. The input sequences must be sorted by their sequence identifiers. Singletons are allowed in the input files.		FILE
-params	Input file in text format that contains PRINSEQ parameters. Each parameter should be specified on a new line and arguments should be separated by spaces or tabs. Commends can be specified on lines starting with the # sign. Can be combined with command line parameters. Parameters specified on the command line will overwrite the arguments in the file (if any).		FILE
-si13	This option was replaced by option -phred64.
-phred64	Quality data in FASTQ file is in Phred+64 format (hhttp://en.wikipedia.org/wiki/FASTQ_format#Encoding). Not required for Illumina 1.8+, Sanger, Roche/454, Ion Torrent, PacBio data.
-aa	Input is amino acid (protein) sequences instead of nucleic acid (DNA or RNA) sequences. Allowed amino acid characters: ABCDEFGHIKLMNOPQRSTUVWYZXabcdefghiklmmopqrstuvwyzx*- and allowed nucleic acid characters: ACGTURYKMSWBDHVNXacgturykmswbdhvnx-
Output options
-out_format	Output format 1 (FASTA only), 2 (FASTA and QUAL), 3 (FASTQ), 4 (FASTQ and FASTA), 5 (FASTQ, FASTA and QUAL)	same as input	[1,2,3,4,5]
-out_good	By default, the output files are created in the same directory as the input file containing the sequence data with an additional "_prinseq_good_XXXX" in their name (where XXXX is replaced by random characters to prevent overwriting previous files). To change the output filename and location, specify the filename using this option. The file extension will be added automatically (either .fasta, .qual, or .fastq). For paired-end data, filenames contain additionally "_1", "_1_singletons", "_2", and "_2_singletons" before the file extension. Use "-out_good null" to prevent the program from generating the output file(s) for data passing all filters. Use "-out_good stdout" to write data passing all filters to STDOUT (only for FASTA or FASTQ output files). Example: use "file_passed" to generate the output file file_passed.fasta in the current directory.		STRING, null, stdout
-out_bad	By default, the output files are created in the same directory as the input file containing the sequence data with an additional "_prinseq_bad_XXXX" in their name (where XXXX is replaced by random characters to prevent overwriting previous files). To change the output filename and location, specify the filename using this option. The file extension will be added automatically (either .fasta, .qual, or .fastq). For paired-end data, filenames contain additionally "_1" and "_2" before the file extension. Use "-out_bad null" to prevent the program from generating the output file(s) for data not passing any filter. Use "-out_bad stdout" to write data not passing any filter to STDOUT (only for FASTA or FASTQ output files). Example: "-out_good stdout -out_bad null" will write data passing filters to STDOUT and data not passing any filter will be ignored		STRING, null, stdout
-log	Log file to keep track of parameters, errors, etc. The log file name is optional. If no file name is given, the log file name will be "inputname.log". If the log file already exists, new content will be added to the file.	input name.log	STRING
-graph_data	File that contains the necessary information to generate the graphs similar to the ones in the web version. The file name is optional. If no file name is given, the log file name will be "inputname.gd". If the file already exists, new content will overwrite the file. Use "-out_good null -out_bad null" to prevent generating any additional outputs. (See below for more options related to the graph data.) The graph data can be used as input for the prinseq-graphs.pl file to generate the PNG graph files or an HTML report file. If you have trouble installing the required prinseq-graphs.pl modules or want to see an output example report, upload the graph data file at: http://edwards.sdsu.edu/prinseq	input name.gd	STRING
-graph_stats	Use this option to select what statistics should be calculated and included in the graph_data file. This is useful if you e.g. do not need sequence complexity information, which requires a lot of computation. Requires to have graph_data specified. Default is all selected. Allowed option are (separate multiple by comma with no spaces): ld (Length distribution), gc (GC content distribution), qd (Base quality distribution), ns (Occurence of N), pt (Poly-A/T tails), ts (Tag sequence check), aq (Assembly quality measure), de (Sequence duplication - exact only), da (Sequence duplication - exact + 5'/3'), sc (Sequence complexity), dn (Dinucleotide odds ratios, includes the PCA plots) Example use: -graph_stats ld,gc,qd,de	all	STRING
-qual_noscale	Use this option if all your sequences are shorter than 100bp and as they do not require to scale quality data to 100 data points in the graph. By default, quality scores of sequences shorter than 100bp or longer than 100bp are fit to 100 data points. (To retrieve this information and calculate the graph data would otherwise require to parse the data two times or store all the quality data in memory.)
-no_qual_header	In order to reduce the file size, this option will generate an empty header line for the quality data in FASTQ files. Instead of +header, only the + sign will be output. The header of the sequence data will be left unchanged. This option applies to FASTQ output files only.
-exact_only	Use this option to check for exact (forward and reverse) duplicates only when generating the graph data. This allows to keep the memory requirements low for large input files and is faster. This option will automatically be applied when using -derep options 1 and/or 4.
Filter options
-min_len	Filter sequence shorter than min_len		INT
-max_len	Filter sequence longer than max_len		INT
-range_len	Filter sequence by length range. Multiple range values should be separated by comma without spaces. Example: -range_len 50-100,250-300		STRING
-min_gc	Filter sequence with GC content below min_gc		INT [0..100]
-max_gc	Filter sequence with GC content above max_gc		INT [0..100]
-range_gc	Filter sequence by GC content range. Multiple range values should be separated by comma without spaces.		STRING
-min_qual_score	Filter sequence with at least one quality score below min_qual_score		INT
-max_qual_score	Filter sequence with at least one quality score above max_qual_score		INT
-min_qual_mean	Filter sequence with quality score mean below min_qual_mean		INT
-max_qual_mean	Filter sequence with quality score mean above max_qual_mean		INT
-ns_max_p	Filter sequence with more than ns_max_p percentage of Ns		INT [0..100]
-ns_max_n	Filter sequence with more than ns_max_n Ns		INT
-noniupac	Filter sequence with characters other than A, C, G, T or N
-seq_num	Only keep the first seq_num number of sequences (that pass all other filters)		INT
-derep	Type of duplicates to filter. Allowed values are 1, 2, 3, 4 and 5. Use integers for multiple selections (e.g. 124 to use type 1, 2 and 4). The order does not matter. Option 2 and 3 will set 1 and option 5 will set 4 as these are subsets of the other option. 1 (exact duplicate), 2 (5' duplicate), 3 (3' duplicate), 4 (reverse complement exact duplicate), 5 (reverse complement 5'/3' duplicate)		INT
-derep_min	This option specifies the number of allowed duplicates. If you want to remove sequence duplicates that occur more than x times, then you would specify x+1 as the -derep_min values. For examples, to remove sequences that occur more than 5 times, you would specify -derep_min 6. This option can only be used in combination with -derep 1 and/or 4 (forward and/or reverse exact duplicates).	2	INT [2..]
-lc_method	Method to filter low complexity sequences		[dust, entropy]
-lc_threshold	The threshold value used to filter sequences by sequence complexity. The dust method uses this as maximum allowed score and the entropy method as minimum allowed value.		INT [0..100]
-custom_params	Can be used to specify additional filters. The current set of possible rules is limited and has to follow the specifications below. The custom parameters have to be specified within quotes (either ' or "). Please separate parameter values with a space and separate new parameter sets with semicolon (;). Parameters are defined by two values: (1) the pattern (any combination of the letters "ACGTN"), (2) the number of repeats or percentage of occurence Percentage values are defined by a number followed by the %-sign (without space). If no %-sign is given, it is assumed that the given number specifies the number of repeats of the pattern. Examples: "AAT 10" (filters out sequences containing AATAATAATAATAATAATAATAATAATAAT anywhere in the sequence), "T 70%" (filters out sequences with more than 70% Ts in the sequence), "A 15" (filters out sequences containing AAAAAAAAAAAAAAA anywhere in the sequence), "AAT 10;T 70%;A 15" (apply all three filters)		STRING
Trim options
-trim_to_len	Trim all sequence from the 3'-end to result in sequence with this length		INT
-trim_left	Trim sequence at the 5'-end by trim_left positions		INT
-trim_right	Trim sequence at the 3'-end by trim_right positions		INT
-trim_left_p	Trim sequence at the 5'-end by trim_left_p percentage of read length. The trim length is rounded towards the lower integer (e.g. 143.6 is rounded to 143 positions). Use an integer between 1 and 100 for the percentage value.		INT
-trim_right_p	Trim sequence at the 3'-end by trim_right_p percentage of read length. The trim length is rounded towards the lower integer (e.g. 143.6 is rounded to 143 positions). Use an integer between 1 and 100 for the percentage value.		INT
-trim_tail_left	Trim poly-A/T tail with a minimum length of trim_tail_left at the 5'-end		INT
-trim_tail_right	Trim poly-A/T tail with a minimum length of trim_tail_right at the 3'-end		INT
-trim_ns_left	Trim poly-N tail with a minimum length of trim_ns_left at the 5'-end		INT
-trim_ns_right	Trim poly-N tail with a minimum length of trim_ns_right at the 3'-end		INT
-trim_qual_left	Trim sequence by quality score from the 5'-end with this threshold score		INT
-trim_qual_right	Trim sequence by quality score from the 3'-end with this threshold score		INT
-trim_qual_type	Type of quality score calculation to use	min	[min,mean,max,sum]
-trim_qual_rule	Rule to use to compare quality score to calculated value. Allowed options are lt (less than), gt (greater than) and et (equal to)	lt	[lt,gt,et]
-trim_qual_window	The sliding window size used to calculate quality score by type. To stop at the first base that fails the rule defined, use a window size of 1.	1	INT
-trim_qual_step	Step size used to move the sliding window. To move the window over all quality scores without missing any, the step size should be less or equal to the window size.	1	INT
Reformat options
-seq_case	Changes sequence character case to upper or lower case		[upper,lower]
-dna_rna	Convert sequence between DNA and RNA. Allowed options are "dna" (convert from RNA to DNA) and "rna" (convert from DNA to RNA).		[dna,rna]
-line_width	Sequence characters per line. Use 0 if you want each sequence in a single line. Use 80 for line breaks every 80 characters. Note that this option only applies to FASTA output files, since FASTQ files store sequences without additional line breaks.	60	INT
-rm_header	Remove the sequence header. This includes everything after the sequence identifier (which is kept unchanged)
-seq_id	Rename the sequence identifier. A counter is added to each identifier to assure its uniqueness		STRING
Summary statistic options *
-stats_info	Outputs basic information such as number of reads (reads) and total bases (bases).
-stats_len	Outputs minimum (min), maximum (max), range (range), mean (mean), standard deviation (stddev), mode (mode) and mode value (modeval), and median (median) for read length.
-stats_dinuc	Outputs the dinucleotide odds ratio for AA/TT (aatt), AC/GT (acgt), AG/CT (agct), AT (at), CA/TG (catg), CC/GG (ccgg), CG (cg), GA/TC (gatc), GC (gc) and TA (ta).
-stats_tag	Outputs the probability of a tag sequence at the 5'-end (prob5) and 3'-end (prob3) in percentage (0..100). Provides the number of predefined MIDs (midnum) and the MID sequences (midseq, separated by comma, only provided if midnum > 0) that occur in more than 34/100 (approx. 3%) of the reads.
-stats_dupl	Outputs the number of exact duplicates (exact), 5' duplicates (5), 3' duplicates (3), exact duplicates with reverse complements (exactrevcom) and 5'/3' duplicates with reverse complements (revcomp), and total number of duplicates (total). The maximum number of duplicates is given under the value name with an additional "maxd" (e.g. exactmaxd or 5maxd).
-stats_ns	Outputs the number of reads with ambiguous base N (seqswithn), the maximum number of Ns per read (maxn) and the maximum percentage of Ns per read (maxp). The maxn and maxp value are not necessary from the same sequence.
-stats_assembly	Outputs the N50, N90, etc contig sizes. The Nxx contig size is a weighted median that is defined as the length of the smallest contig C in the sorted list of all contigs where the cumulative length from the largest contig to contig C is at least xx% of the total length (sum of contig lengths).
-stats_all	Outputs all available summary statistics.

* The summary statistic values are written to STDOUT in the form: "parameter_name statistic_name value" (without the quotes). For example, "stats_info reads 10000" or "stats_len max 500". Only one statistic is written per line and values are separated by tabs.

Standalone graphs version

TOP OF PAGE

The standalone graphs version can be used to generate graphs as PNG files or a HTML report file containing the graphs and tables. The readme file contains information on the usage of the standalone graphs version.

Necessary resources

Hardware
- Computer with a Linux/Unix or Mac OS X operating system

Software
- PRINSEQ graphs (available at http://prinseq.sourceforge.net)

Files
- Graph data file

Available options

Option/flag	Description	Default	Range
-help or -h	Print the help message; ignore other arguments
-man	Print the full documentation; ignore other arguments
-version	Print program version; ignore other arguments
-verbose	Prints status and info messages during processing
Input options
-i	Input file containing the graph data generated by the lite version.		FILE, stdin
Output options
-o	By default, the output files are created in the same directory as the input file with an additional "_prinseq_graphs_XXXX" in their name (where XXXX is replaced by random characters to prevent overwriting previous files). To change the output filename and location, specify the filename using this option. The file extension will be added automatically.	input name.gd	STRING
-png_all	Use this option to generate PNG files with the graphs.
-html_all	Use this option to generate a HTML file with the graphs and tables.
-log	Log file to keep track of parameters, errors, etc. The log file name is optional. If no file name is given, the log file name will be "inputname.log". If the log file already exists, new content will be added to the file.	input name.log	STRING

Examples

The following examples apply to version 0.16 and higher.

Graph data (-graph_data)
To generate the graphs from the web version or the HTML report, you can use the -graph_data option:

$ perl prinseq-lite.pl -verbose -fastq test.fq -graph_data test.gd -out_good null -out_bad null
Estimate size of input data for status report (this might take a while for large files)
	done
Parse and process input data
	status: 34 %

The verbose mode shows the progress and the "-out_good null -out_bad null" prevents PRINSEQ from generating any other output files than the specified test.gd file containing the graphs data.

Generating graph PNG files (-png_all)
To generate the graph as PNG files, you can use the prinseq-graphs -png_all option:

$ perl prinseq-graphs.pl -i test.gd -png_all -o test

This should create the following PNG files:

$ ls *.png
test_cd.png	test_dl.png	test_ld.png	test_pm.png	test_td5.png
test_ce.png	test_dm.png	test_ns.png	test_pv.png
test_df.png	test_gc.png	test_or.png	test_td3.png

Generating HTML reports (-html_all)
To generate the HTML report containing all the tables and figures from the web version, you can use the prinseq-graphs -html_all option:

$ perl prinseq-graphs.pl -i test.gd -html_all -o test

This should create the following HTML file:

$ ls *.html
test.html

Note that the prinseq-graphs.pl file requires the graphics modules to be installed. If you are unable to install the graphics modules on your system, you can also use an online form to generate the HTML report file from the graph data file:
http://edwards.sdsu.edu/prinseq

Graph data selection (-graph_stats)
To generate the graphs from the web version or the HTML report for a selected number of statistics, you can use the -graph_stats option in combination with the -graph_data option:

$ perl prinseq-lite.pl -verbose -fastq test.fq -graph_data test.gd -out_good null -out_bad null -graph_stats ld,gc,qd

This feature is available in version 0.19 and higher and currently allows to select the options described here.

Remove sequences with N from FASTQ files (-ns_max_n and -ns_max_p)
To remove sequences with a certain number or percentage of ambiguous base N from a FASTQ file, you can use the PRINSEQ options -ns_max_n or -ns_max_p as follows:

$ perl prinseq-lite.pl -verbose -fastq test.fq -ns_max_n 0 -out_good test_no_ns -out_bad test_with_ns

This will separate the input FASTQ files into two files (test_no_ns.fastq containing sequences without N and test_with_ns.fastq containing all sequences with at least one N). You can replace the value for -ns_max_n with, for example, 2 to remove sequences with more than 2 Ns. Alternatively, you can use "-ns_max_p 1" to remove sequences with more than 1% of Ns.

Process paired-end FASTQ files with PRINSEQ (-fastq and -fastq2)
To process paired-end data with PRINSEQ, you can specify the two input files using -fastq and -fastq2 (or -fasta and fasta2 for FASTA files) as follows:

$ perl prinseq-lite.pl -verbose -fastq test_1.fq -fastq2 test_2.fq -ns_max_n 0 -out_good test_no_ns -out_bad test_with_ns

Singletons will automatically be filtered and stored in separate files (here test_no_ns_1_singletons.fastq and test_no_ns_2_singletons.fastq) in addition to the files test_no_ns_1.fastq, test_no_ns_2.fastq, test_with_ns_1.fastq, and test_with_ns_2.fastq.