Overview¶
Scope and Features¶
Just as shown in the following figure, we first use MECAT to do the alignment for all the raw reads and after that we can know the most possible locations of genome for those reads. Usually, only ~10% reads are mapped around gap regions(around 2kbp) and it waste a lot of time for PBJelly to do the alignment for the rest of reads with Blasr since those far away gap region reads were not used to do the local assembly and filled the gaps. So, we can speed up the PBJelly by first removing those reads. Besides, we found that the original PBJelly rudely collected reads to do the local assembly and doing this uasually gives poor results. So, with our approach, it will get better contig N50 value, sometimes.
run_pbjelly is intended to be easy to use but some familiarity with commandline applications is expected. Rather than providing a flexible solution to a number of common workflows, we have designed run_pbjelly to be as fixed as possible, which can help you easily get the results without worrying about installing the other software and setting up path for these program. This design help you save a lot of time, but we suggest you should read the published PBJelly paper and related documents in case you have to do some specific test when the results do not meet the contract indicator.
run_pbjelly is composed of a set of standalone tools to perform specific tasks. A brief description of each tool is shown in the table below.
| Tool | Subcommand | Description |
|---|---|---|
| Python script | a set of short program designed to complete various data format conversion and process | |
| filt4seq_v2.0.py | filter reads with length longer than 100kbp or less than 3kbp | |
| find_gap_v1.0.py | output gap position(*.bed) information for the scaffolds | |
| m4_to_bed_v3.pl | Convert the MECAT alignment *.m4 to the format *.bed | |
| pick_reads_in_raw.pl | picked raw reads according to the id list | |
| PBJelly | aligns long reads to high-confidence draft assembles and filled captured gaps | |
| setup | Tag sequence names, find gaps, and index the reference | |
| mapping | Use blasr to map the sequences to the reference | |
| support | Indentify which reads support with gaps | |
| extraction | For each gap, consolidate all reads supporting it into a local-assembly folder | |
| assembly | Build the consensus gap-filling sequence | |
| output | Stitch the reference sequences and gap-filling sequence together | |
Input and Output¶
For run_pbjelly, you have to prepared two files, the first one is the reference file name with fasta as end. The second file is ‘reads.list’ file where each line stores the path for the reads files independently. After these, you have to prepare a ‘config.cfg’ file, containing reference file path, reads.list, output dir and all kinds of parameters.
For more information, you can refer to the part ‘Config and Usage’ under the section Examples .
Parameters¶
There are mainly five steps in run_pbjelly, and all the important parameters were saved in the config.cfg file (see more details in the Example section). The parameters are shown as followed:
| Category | Sub-category | Description |
|---|---|---|
| Data | set up names or dir for files | |
| sample | the project name | |
| reads_list | abs path for the file reads.list containg path for reads set | |
| output_dir | dir where you want to ouput all the results | |
| ref | the scaffolds file you want to do the gapcloser with | |
| Program | set up parameters for all knids of programs | |
| mask_len | the gap length for the 1bp gap to extend | |
| best | the number for MECAT to output best alignments | |
| reads_len_max | the max length for raw reads | |
| reads_len_min | the min length for raw reads | |
| mecat_n | number of of candidates for gap extension | |
| mecat_b | output the best b alignments | |
| mecat_mem | memory set for mecat alignment | |
| mecat_nproc | cpu number set for mecat alignment | |
| blasr_mem | memory set for blasr alignment | |
| blasr_nproc | Align using N processes | |
| assembly_nproc | cpu number set for PBJelly assembly | |
| qsub_q | specify the queue to use | |