Overview

Scope and Features

As shown in the following figure, we first use MECAT to align all the raw reads, which tells us the most likely genomic location of each read. Usually only ~10% of the reads map near gap regions (within about 2 kbp). PBJelly wastes a lot of time aligning the remaining reads with Blasr, because reads that map far from any gap are never used for local assembly or gap filling. We can therefore speed up PBJelly by removing those reads first. In addition, we found that the original PBJelly collects reads for local assembly rather indiscriminately, which usually gives poor results. With our approach, the contig N50 is sometimes better as well.

_images/pipelines.png
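
The pre-filtering idea described above can be sketched in a few lines of Python. This is a minimal illustration, not the code run_pbjelly actually runs: the function name `reads_near_gaps`, the tuple layouts, and the 2000 bp flank (taken from the "around 2kbp" figure in the text) are assumptions made for this example.

```python
from collections import defaultdict

FLANK = 2000  # assumed flank in bp, based on the "around 2kbp" figure in the text


def reads_near_gaps(alignments, gaps, flank=FLANK):
    """Return the IDs of reads whose alignment overlaps a gap +/- flank.

    alignments: iterable of (scaffold, start, end, read_id), half-open intervals
    gaps:       iterable of (scaffold, start, end), half-open intervals
    """
    windows = defaultdict(list)
    for scaffold, start, end in gaps:
        # Widen each gap by the flank on both sides, clamped at position 0.
        windows[scaffold].append((max(0, start - flank), end + flank))

    keep = set()
    for scaffold, start, end, read_id in alignments:
        for win_start, win_end in windows.get(scaffold, ()):
            # Two half-open intervals overlap if each starts before the other ends.
            if start < win_end and win_start < end:
                keep.add(read_id)
                break
    return keep


gaps = [("scaffold1", 10_000, 10_500)]
alignments = [
    ("scaffold1", 7_000, 9_000, "read_a"),    # ends 1 kbp before the gap
    ("scaffold1", 50_000, 60_000, "read_b"),  # far away from the gap
    ("scaffold2", 9_000, 11_000, "read_c"),   # scaffold without any gap
]
print(sorted(reads_near_gaps(alignments, gaps)))  # ['read_a']
```

Only the reads returned by such a filter would need to be passed on to the slower Blasr mapping step.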

run_pbjelly is intended to be easy to use, but some familiarity with command-line applications is expected. Rather than providing a flexible solution to a number of common workflows, we have designed run_pbjelly to be as fixed as possible, so that you can get results without worrying about installing the other software or setting up paths for those programs. This design saves a lot of time, but we suggest reading the published PBJelly paper and related documents in case you need to run specific tests when the results do not meet the contracted quality targets.

run_pbjelly is composed of a set of standalone tools to perform specific tasks. A brief description of each tool is shown in the table below.

Tool            Subcommand            Description
Helper scripts                        a set of short Python and Perl programs for data format conversion and processing
                filt4seq_v2.0.py      filter out reads longer than 100 kbp or shorter than 3 kbp
                find_gap_v1.0.py      output gap positions (*.bed) for the scaffolds
                m4_to_bed_v3.pl       convert MECAT alignments (*.m4) to *.bed format
                pick_reads_in_raw.pl  pick raw reads according to an ID list
PBJelly                               aligns long reads to high-confidence draft assemblies and fills the captured gaps
                setup                 tag sequence names, find gaps, and index the reference
                mapping               use blasr to map the sequences to the reference
                support               identify which reads support which gaps
                extraction            for each gap, consolidate all reads supporting it into a local-assembly folder
                assembly              build the consensus gap-filling sequence
                output                stitch the reference sequences and gap-filling sequences together
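
To make the first two helper scripts more concrete, here is a rough sketch of what gap finding and read-length filtering could look like. It is not the source of find_gap_v1.0.py or filt4seq_v2.0.py: the function names are hypothetical, and treating a gap as a run of N characters is an assumption. The 3 kbp and 100 kbp thresholds come from the table above.

```python
import re


def find_gaps(name, sequence):
    """Yield BED-style (name, start, end) tuples for every run of N or n.

    Coordinates are 0-based and half-open, as in the BED format.
    """
    for match in re.finditer(r"[Nn]+", sequence):
        yield (name, match.start(), match.end())


def keep_read(length, min_len=3_000, max_len=100_000):
    """Return True for reads that are neither shorter than min_len nor longer than max_len."""
    return min_len <= length <= max_len


for record in find_gaps("scaffold1", "ACGTNNNNACGTnnA"):
    print("\t".join(str(field) for field in record))

print([n for n in (2_999, 3_000, 100_000, 100_001) if keep_read(n)])
```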
     

Input and Output

For run_pbjelly, you have to prepare two input files. The first is the reference file, whose name ends in fasta. The second is a ‘reads.list’ file in which each line stores the path to one reads file. After that, you have to prepare a ‘config.cfg’ file containing the reference file path, the reads.list path, the output directory, and all other parameters.

For more information, refer to the ‘Config and Usage’ part of the Examples section.
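
A ‘reads.list’ file with one path per line can be generated with a short script. The sketch below is only an illustration: the function name `write_reads_list` and the `*.fasta` pattern for read files are assumptions, so adjust the pattern to match your own data.

```python
import tempfile
from pathlib import Path


def write_reads_list(reads_dir, list_path, pattern="*.fasta"):
    """Write one absolute read-file path per line and return the paths."""
    paths = sorted(str(p.resolve()) for p in Path(reads_dir).glob(pattern))
    Path(list_path).write_text("".join(path + "\n" for path in paths))
    return paths


if __name__ == "__main__":
    # Demonstrate on a throwaway directory with two read files and one unrelated file.
    with tempfile.TemporaryDirectory() as tmp:
        for name in ("cell2.fasta", "cell1.fasta", "notes.txt"):
            (Path(tmp) / name).touch()
        list_path = Path(tmp) / "reads.list"
        write_reads_list(tmp, list_path)
        print(list_path.read_text(), end="")
```

Absolute paths are written so that the list stays valid regardless of the directory the pipeline is started from.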

Parameters

There are five main steps in run_pbjelly, and all the important parameters are stored in the config.cfg file (see the Examples section for more details). The parameters are as follows:

Category  Sub-category    Description
Data                      names and directories of the input and output files
          sample          the project name
          reads_list      absolute path to the reads.list file, which lists the paths of the read sets
          output_dir      directory where all the results are written
          ref             the scaffolds file whose gaps you want to close
Program                   parameters for the individual programs
          mask_len        length used to extend 1 bp gaps
          best            number of best alignments for MECAT to output
          reads_len_max   maximum length of raw reads
          reads_len_min   minimum length of raw reads
          mecat_n         number of candidates for gap extension
          mecat_b         output the best b alignments
          mecat_mem       memory for the MECAT alignment
          mecat_nproc     number of CPUs for the MECAT alignment
          blasr_mem       memory for the blasr alignment
          blasr_nproc     number of processes for the blasr alignment
          assembly_nproc  number of CPUs for the PBJelly assembly
          qsub_q          the queue to use
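
Before launching a run, it can help to check that a config file names every parameter listed above. The sketch below assumes an INI-style layout with `[Data]` and `[Program]` sections, which is inferred from the table's categories and may not match the real file; the authoritative layout is the one shown in the Examples section. The helper `missing_keys`, the placeholder paths, and the sample values are hypothetical.

```python
import configparser

# Parameter names taken from the table above; the section names are an assumption.
EXPECTED = {
    "Data": ["sample", "reads_list", "output_dir", "ref"],
    "Program": [
        "mask_len", "best", "reads_len_max", "reads_len_min",
        "mecat_n", "mecat_b", "mecat_mem", "mecat_nproc",
        "blasr_mem", "blasr_nproc", "assembly_nproc", "qsub_q",
    ],
}

# Hypothetical, deliberately incomplete config used only for this demonstration.
EXAMPLE = """
[Data]
sample = demo
reads_list = /path/to/reads.list
output_dir = /path/to/output
ref = /path/to/scaffolds.fasta

[Program]
reads_len_max = 100000
reads_len_min = 3000
"""


def missing_keys(cfg_text):
    """Return 'Section.key' names that are expected but absent from cfg_text."""
    parser = configparser.ConfigParser()
    parser.read_string(cfg_text)
    return [
        f"{section}.{key}"
        for section, keys in EXPECTED.items()
        for key in keys
        if not parser.has_option(section, key)
    ]


print(missing_keys(EXAMPLE))
```

An empty list means every expected parameter is present; the check says nothing about whether the values themselves are sensible.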