Mapreduce-Partition分析

摘要: 轉自:http://blog.oddfoo.net/2011/04/17/mapreduce-partition%E5%88%86%E6%9E%90-2/Partition所處的位置Partition位置Partition主要作用就是將map的結果發送到相應的reduce。這就對partition有兩個要求：1）均衡負載，盡量的將工作均勻的分配給不同的reduce。2）效率，分配速度一定要快。Ma... 閱讀全文

posted @ 2013-04-01 21:10 鑫龍閱讀(412) | 評論 (0) | 編輯收藏

hadoop namenode啟動過程詳細剖析及瓶頸分析

NameNode中幾個關鍵的數據結構

FSImage

Namenode會將HDFS的文件和目錄元數據存儲在一個叫fsimage的二進制文件中，每次保存fsimage之后到下次保存之間的所有hdfs操作，將會記錄在editlog文件中，當editlog達到一定的大小（bytes，由fs.checkpoint.size參數定義）或從上次保存過后一定時間段過后（sec，由fs.checkpoint.period參數定義），namenode會重新將內存中對整個HDFS的目錄樹和文件元數據刷到fsimage文件中。Namenode就是通過這種方式來保證HDFS中元數據信息的安全性。

Fsimage是一個二進制文件，當中記錄了HDFS中所有文件和目錄的元數據信息，在我的hadoop的HDFS版中，該文件的中保存文件和目錄的格式如下：

當namenode重啟加載fsimage時，就是按照如下格式協議從文件流中加載元數據信息。從fsimag的存儲格式可以看出，fsimage保存有如下信息：

1. 首先是一個image head，其中包含：

a) imgVersion(int)：當前image的版本信息

b) namespaceID(int)：用來確保別的HDFS instance中的datanode不會誤連上當前NN。

c) numFiles(long)：整個文件系統中包含有多少文件和目錄

d) genStamp(long)：生成該image時的時間戳信息。

2. 接下來便是對每個文件或目錄的源數據信息，如果是目錄，則包含以下信息：

a) path(String)：該目錄的路徑，如”/user/build/build-index”

b) replications(short)：副本數（目錄雖然沒有副本，但這里記錄的目錄副本數也為3）

c) mtime(long)：該目錄的修改時間的時間戳信息

d) atime(long)：該目錄的訪問時間的時間戳信息

e) blocksize(long)：目錄的blocksize都為0

f) numBlocks(int)：實際有多少個文件塊，目錄的該值都為-1，表示該item為目錄

g) nsQuota(long)：namespace Quota值，若沒加Quota限制則為-1

h) dsQuota(long)：disk Quota值，若沒加限制則也為-1

i) username(String)：該目錄的所屬用戶名

j) group(String)：該目錄的所屬組

k) permission(short)：該目錄的permission信息，如644等，有一個short來記錄。

3. 若從fsimage中讀到的item是一個文件，則還會額外包含如下信息：

a) blockid(long)：屬于該文件的block的blockid，

b) numBytes(long)：該block的大小

c) genStamp(long)：該block的時間戳

當該文件對應的numBlocks數不為1，而是大于1時，表示該文件對應有多個block信息，此時緊接在該fsimage之后的就會有多個blockid，numBytes和genStamp信息。

因此，在namenode啟動時，就需要對fsimage按照如下格式進行順序的加載，以將fsimage中記錄的HDFS元數據信息加載到內存中。

BlockMap

從以上fsimage中加載如namenode內存中的信息中可以很明顯的看出，在fsimage中，并沒有記錄每一個block對應到哪幾個datanodes的對應表信息，而只是存儲了所有的關于namespace的相關信息。而真正每個block對應到datanodes列表的信息在hadoop中并沒有進行持久化存儲，而是在所有datanode啟動時，每個datanode對本地磁盤進行掃描，將本datanode上保存的block信息匯報給namenode，namenode在接收到每個datanode的塊信息匯報后，將接收到的塊信息，以及其所在的datanode信息等保存在內存中。HDFS就是通過這種塊信息匯報的方式來完成 block -> datanodes list的對應表構建。Datanode向namenode匯報塊信息的過程叫做blockReport，而namenode將block -> datanodes list的對應表信息保存在一個叫BlocksMap的數據結構中。

BlocksMap的內部數據結構如下：

如上圖顯示，BlocksMap實際上就是一個Block對象對BlockInfo對象的一個Map表，其中Block對象中只記錄了blockid，block大小以及時間戳信息，這些信息在fsimage中都有記錄。而BlockInfo是從Block對象繼承而來，因此除了Block對象中保存的信息外，還包括代表該block所屬的HDFS文件的INodeFile對象引用以及該block所屬datanodes列表的信息（即上圖中的DN1，DN2，DN3，該數據結構會在下文詳述）。

因此在namenode啟動并加載fsimage完成之后，實際上BlocksMap中的key，也就是Block對象都已經加載到BlocksMap中，每個key對應的value(BlockInfo)中，除了表示其所屬的datanodes列表的數組為空外，其他信息也都已經成功加載。所以可以說：fsimage加載完畢后，BlocksMap中僅缺少每個塊對應到其所屬的datanodes list的對應關系信息。所缺這些信息，就是通過上文提到的從各datanode接收blockReport來構建。當所有的datanode匯報給namenode的blockReport處理完畢后，BlocksMap整個結構也就構建完成。

BlockMap中datanode列表數據結構

在BlockInfo中，將該block所屬的datanodes列表保存在一個Object[]數組中，但該數組不僅僅保存了datanodes列表，還包含了額外的信息。實際上該數組保存了如下信息：

上圖表示一個block包含有三個副本，分別放置在DN1，DN2和DN3三個datanode上，每個datanode對應一個三元組，該三元組中的第二個元素，即上圖中prev block所指的是該block在該datanode上的前一個BlockInfo引用。第三個元素，也就是上圖中next Block所指的是該block在該datanode上的下一個BlockInfo引用。每個block有多少個副本，其對應的BlockInfo對象中就會有多少個這種三元組。

Namenode采用這種結構來保存block->datanode list的目的在于節約namenode內存。由于namenode將block->datanodes的對應關系保存在了內存當中，隨著HDFS中文件數的增加，block數也會相應的增加，namenode為了保存block->datanodes的信息已經耗費了相當多的內存，如果還像這種方式一樣的保存datanode->block list的對應表，勢必耗費更多的內存，而且在實際應用中，要查一個datanode上保存的block list的應用實際上非常的少，大部分情況下是要根據block來查datanode列表，所以namenode中通過上圖的方式來保存block->datanode list的對應關系，當需要查詢datanode->block list的對應關系時，只需要沿著該數據結構中next Block的指向關系，就能得出結果，而又無需保存datanode->block list在內存中。

NameNode啟動過程

fsimage加載過程

Fsimage加載過程完成的操作主要是為了：

1. 從fsimage中讀取該HDFS中保存的每一個目錄和每一個文件

2. 初始化每個目錄和文件的元數據信息

3. 根據目錄和文件的路徑，構造出整個namespace在內存中的鏡像

4. 如果是文件，則讀取出該文件包含的所有blockid，并插入到BlocksMap中。

整個加載流程如下圖所示：

如上圖所示，namenode在加載fsimage過程其實非常簡單，就是從fsimage中不停的順序讀取文件和目錄的元數據信息，并在內存中構建整個namespace，同時將每個文件對應的blockid保存入BlocksMap中，此時BlocksMap中每個block對應的datanodes列表暫時為空。當fsimage加載完畢后，整個HDFS的目錄結構在內存中就已經初始化完畢，所缺的就是每個文件對應的block對應的datanode列表信息。這些信息需要從datanode的blockReport中獲取，所以加載fsimage完畢后，namenode進程進入rpc等待狀態，等待所有的datanodes發送blockReports。

blockReport階段

每個datanode在啟動時都會掃描其機器上對應保存hdfs block的目錄下(dfs.data.dir)所保存的所有文件塊，然后通過namenode的rpc調用將這些block信息以一個long數組的方式發送給namenode，namenode在接收到一個datanode的blockReport rpc調用后，從rpc中解析出block數組，并將這些接收到的blocks插入到BlocksMap表中，由于此時BlocksMap缺少的僅僅是每個block對應的datanode信息，而namenoe能從report中獲知當前report上來的是哪個datanode的塊信息，所以，blockReport過程實際上就是namenode在接收到塊信息匯報后，填充BlocksMap中每個block對應的datanodes列表的三元組信息的過程。其流程如下圖所示:

當所有的datanode匯報完block，namenode針對每個datanode的匯報進行過處理后，namenode的啟動過程到此結束。此時BlocksMap中block->datanodes的對應關系已經初始化完畢。如果此時已經達到安全模式的推出閾值，則hdfs主動退出安全模式，開始提供服務。

啟動過程數據采集和瓶頸分析

對namenode的整個啟動過程有了詳細了解之后，就可以對其啟動過程中各階段各函數的調用耗時進行profiling的采集，數據的profiling仍然分為兩個階段，即fsimage加載階段和blockReport階段。

fsimage加載階段性能數據采集和瓶頸分析

以下是對建庫集群真實的fsimage加載過程的的性能采集數據：

從上圖可以看出，fsimage的加載過程那個中，主要耗時的操作分別分布在FSDirectory.addToParent，FSImage.readString，以及PermissionStatus.read三個操作，這三個操作分別占用了加載過程的73%，15%以及8%，加起來總共消耗了整個加載過程的96%。而其中FSImage.readString和PermissionStatus.read操作都是從fsimage的文件流中讀取數據（分別是讀取String和short）的操作，這種操作優化的空間不大，但是通過調整該文件流的Buffer大小來提高少許性能。而FSDirectory.addToParent的調用卻占用了整個加載過程的73%，所以該調用中的優化空間比較大。

以下是addToParent調用中的profiling數據：

從以上數據可以看出addToParent調用占用的73%的耗時中，有66%都耗在了INode.getPathComponents調用上，而這66%分別有36%消耗在INode.getPathNames調用，30%消耗在INode.getPathComponents調用。這兩個耗時操作的具體分布如以下數據所示：

可以看出，消耗了36%的處理時間的INode.getPathNames操作，全部都是在通過String.split函數調用來對文件或目錄路徑進行切分。另外消耗了30%左右的處理時間在INode.getPathComponents中，該函數中最終耗時都耗在獲取字符串的byte數組的java原生操作中。

blockReport階段性能數據采集和瓶頸分析

由于blockReport的調用是通過datanode調用namenode的rpc調用，所以在namenode進入到等待blockreport階段后，會分別開啟rpc調用的監聽線程和rpc調用的處理線程。其中rpc處理和rpc鑒定的調用耗時分布如下圖所示：

而其中rpc的監聽線程的優化是另外一個話題，在其他的issue中再詳細討論，且由于blockReport的操作實際上是觸發的rpc處理線程，所以這里只關心rpc處理線程的性能數據。

在namenode處理blockReport過程中的調用耗時性能數據如下：

可以看出，在namenode啟動階段，處理從各個datanode匯報上來的blockReport耗費了整個rpc處理過程中的絕大部分時間(48/49)，blockReport處理邏輯中的耗時分布如下圖：

從上圖數據中可以發現，blockReport階段中耗時分布主要耗時在FSNamesystem.addStoredBlock調用以及DatanodeDescriptor.reportDiff過程中，分別耗時37/48和10/48，其中FSNamesystem.addStoredBlock所進行的操作時對每一個匯報上來的block，將其于匯報上來的datanode的對應關系初始化到namenode內存中的BlocksMap表中。所以對于每一個block就會調用一次該方法。所以可以看到該方法在整個過程中調用了774819次，而另一個耗時的操作，即DatanodeDescriptor.reportDiff，該操作的過程在上文中有詳細介紹，主要是為了將該datanode匯報上來的blocks跟namenode內存中的BlocksMap中進行對比，以決定那個哪些是需要添加到BlocksMap中的block，哪些是需要添加到toRemove隊列中的block，以及哪些是添加到toValidate隊列中的block。由于這個操作需要針對每一個匯報上來的block去查詢BlocksMap，以及namenode中的其他幾個map，所以該過程也非常的耗時。而且從調用次數上可以看出，reportDiff調用在啟動過程中僅調用了14次(有14個datanode進行塊匯報)，卻耗費了10/48的時間。所以reportDiff也是整個blockReport過程中非常耗時的瓶頸所在。

同時可以看到，出了reportDiff，addStoredBlock的調用耗費了37%的時間，也就是耗費了整個blockReport時間的37/48，該方法的調用目的是為了將從datanode匯報上來的每一個block插入到BlocksMap中的操作。從該方法調用的運行數據如下圖所示：

從上圖可以看出，addStoredBlock中，主要耗時的兩個階段分別是FSNamesystem.countNode和DatanodeDescriptor.addBlock，后者是java中的插表操作，而FSNamesystem.countNode調用的目的是為了統計在BlocksMap中，每一個block對應的各副本中，有幾個是live狀態，幾個是decommission狀態，幾個是Corrupt狀態。而在namenode的啟動初始化階段，用來保存corrput狀態和decommission狀態的block的map都還是空狀態，并且程序邏輯中要得到的僅僅是出于live狀態的block數，所以，這里的countNoes調用在namenode啟動初始化階段并無需統計每個block對應的副本中的corrrput數和decommission數，而僅僅需要統計live狀態的block副本數即可，這樣countNodes能夠在namenode啟動階段變得更輕量，以節省啟動時間。

瓶頸分析總結

從profiling數據和瓶頸分歧情況來看，fsimage加載階段的瓶頸除了在分切路徑的過程中不夠優以外，其他耗時的地方幾乎都是在java原生接口的調用中，如從字節流讀數據，以及從String對象中獲取byte[]數組的操作。

而blockReport階段的耗時其實很大的原因是跟當前的namenode設計以及內存結構有關，比較明顯的不優之處就是在namenode啟動階段的countNode和reportDiff的必要性，這兩處在namenode初始化時的blockReport階段有一些不必要的操作浪費了時間。可以針對namenode啟動階段將必要的操作抽取出來，定制成namenode啟動階段才調用的方式，以優化namenode啟動性能。

Ref: http://blog.csdn.net/ae86_fc/article/details/5842020

posted @ 2013-03-28 18:52 鑫龍閱讀(470) | 評論 (0) | 編輯收藏

hadoop二次排序 (Map/Reduce中分區和分組的問題)

1.二次排序概念：

首先按照第一字段排序，然后再對第一字段相同的行按照第二字段排序，注意不能破壞第一次排序的結果。

如：輸入文件：

20 21
50 51
50 52
50 53
50 54
60 51
60 53
60 52
60 56
60 57
70 58
60 61
70 54
70 55
70 56
70 57
70 58
1 2
3 4
5 6
7 82
203 21
50 512
50 522
50 53
530 54
40 511
20 53
20 522
60 56
60 57
740 58
63 61
730 54
71 55
71 56
73 57
74 58
12 211
31 42
50 62
7 8

輸出（需要分割線）：

------------------------------------------------
1       2
------------------------------------------------
3       4
------------------------------------------------
5       6
------------------------------------------------
7       8
7       82
------------------------------------------------
12      211
------------------------------------------------
20      21
20      53
20      522
------------------------------------------------
31      42
------------------------------------------------
40      511
------------------------------------------------
50      51
50      52
50      53
50      53
50      54
50      62
50      512
50      522
------------------------------------------------
60      51
60      52
60      53
60      56
60      56
60      57
60      57
60      61
------------------------------------------------
63      61
------------------------------------------------
70      54
70      55
70      56
70      57
70      58
70      58
------------------------------------------------
71      55
71      56
------------------------------------------------
73      57
------------------------------------------------
74      58
------------------------------------------------
203     21
------------------------------------------------
530     54
------------------------------------------------
730     54
------------------------------------------------
740     58

2.工作原理

使用如下map和reduce：（特別注意輸入輸出類型，其中IntPair為自定義類型）

public static class Map extends Mapper<LongWritable, Text, IntPair, IntWritable>
public static class Reduce extends Reducer<IntPair, NullWritable, IntWritable, IntWritable>

在map階段，使用job.setInputFormatClass(TextInputFormat)做為輸入格式。注意輸出應該符合自定義Map中定義的輸出<IntPair, IntWritable>。最終是生成一個List<IntPair, IntWritable>。在map階段的最后，會先調用job.setPartitionerClass對這個List進行分區，每個分區映射到一個reducer。每個分區內又調用job.setSortComparatorClass設置的key比較函數類排序。可以看到，這本身就是一個二次排序。如果沒有通過job.setSortComparatorClass設置key比較函數類，則使用key的實現的compareTo方法。在隨后的例子中，第一個例子中，使用了IntPair實現的compareTo方法，而在下一個例子中，專門定義了key比較函數類。

在reduce階段，reducer接收到所有映射到這個reducer的map輸出后，也是會調用job.setSortComparatorClass設置的key比較函數類對所有數據對排序。然后開始構造一個key對應的value迭代器。這時就要用到分組，使用jobjob.setGroupingComparatorClass設置的分組函數類。只要這個比較器比較的兩個key相同，他們就屬于同一個組，它們的value放在一個value迭代器，而這個迭代器的key使用屬于同一個組的所有key的第一個key。最后就是進入Reducer的reduce方法，reduce方法的輸入是所有的（key和它的value迭代器）。同樣注意輸入與輸出的類型必須與自定義的Reducer中聲明的一致。

3，具體步驟

（1）自定義key

在mr中，所有的key是需要被比較和排序的，并且是二次，先根據partitione，再根據大小。而本例中也是要比較兩次。先按照第一字段排序，然后再對第一字段相同的按照第二字段排序。根據這一點，我們可以構造一個復合類IntPair，他有兩個字段，先利用分區對第一字段排序，再利用分區內的比較對第二字段排序。
所有自定義的key應該實現接口WritableComparable，因為是可序列的并且可比較的。并重載方法：

//反序列化，從流中的二進制轉換成IntPair
public void readFields(DataInput in) throws IOException
//序列化，將IntPair轉化成使用流傳送的二進制
public void write(DataOutput out)
//key的比較
public int compareTo(IntPair o)
//另外新定義的類應該重寫的兩個方法
//The hashCode() method is used by the HashPartitioner (the default partitioner in MapReduce)
public int hashCode()
public boolean equals(Object right)

（2）由于key是自定義的，所以還需要自定義一下類：
（2.1）分區函數類。這是key的第一次比較。

public static class FirstPartitioner extends Partitioner<IntPair,IntWritable>

在job中使用setPartitionerClasss設置Partitioner。
（2.2）key比較函數類。這是key的第二次比較。這是一個比較器，需要繼承WritableComparator（也就是實現RawComprator接口）。

（這個就是前面說的第二種方法，但是在第三部分的代碼中并沒有實現此函數，而是直接使用compareTo方法進行比較，所以也就不許下面一行的設置）
在job中使用setSortComparatorClass設置key比較函數類。

public static class KeyComparator extends WritableComparator

2.3）分組函數類。在reduce階段，構造一個key對應的value迭代器的時候，只要first相同就屬于同一個組，放在一個value迭代器。這是一個比較器，需要繼承WritableComparator。

public static class GroupingComparator extends WritableComparator

分組函數類也必須有一個構造函數，并且重載 public int compare(WritableComparable w1, WritableComparable w2)
分組函數類的另一種方法是實現接口RawComparator。
在job中使用setGroupingComparatorClass設置分組函數類。
另外注意的是，如果reduce的輸入與輸出不是同一種類型，則不要定義Combiner也使用reduce，因為Combiner的輸出是reduce的輸入。除非重新定義一個Combiner。

轉自：http://www.cnblogs.com/dandingyy/archive/2013/03/08/2950703.html

posted @ 2013-03-25 19:38 鑫龍閱讀(2276) | 評論 (0) | 編輯收藏

hadoop面試時可能遇到的問題

面試hadoop可能被問到的問題，你能回答出幾個 ?

1、hadoop運行的原理?

2、mapreduce的原理?

3、HDFS存儲的機制?

4、舉一個簡單的例子說明mapreduce是怎么來運行的 ?

5、面試的人給你出一些問題,讓你用mapreduce來實現？

比如:現在有10個文件夾,每個文件夾都有1000000個url.現在讓你找出top1000000url。

6、hadoop中Combiner的作用?

Src： http://p-x1984.javaeye.com/blog/859843

Q1. Name the most common InputFormats defined in Hadoop? Which one is default ?
Following 2 are most common InputFormats defined in Hadoop
- TextInputFormat
- KeyValueInputFormat
- SequenceFileInputFormat

Q2. What is the difference between TextInputFormatand KeyValueInputFormat class
TextInputFormat: It reads lines of text files and provides the offset of the line as key to the Mapper and actual line as Value to the mapper
KeyValueInputFormat: Reads text file and parses lines into key, val pairs. Everything up to the first tab character is sent as key to the Mapper and the remainder of the line is sent as value to the mapper.

Q3. What is InputSplit in Hadoop

When a hadoop job is run, it splits input files into chunks and assign each split to a mapper to process. This is called Input Split

Q4. How is the splitting of file invoked in Hadoop Framework

It is invoked by the Hadoop framework by running getInputSplit()method of the Input format class (like FileInputFormat) defined by the user

Q5. Consider case scenario: In M/R system,

- HDFS block size is 64 MB

- Input format is FileInputFormat

- We have 3 files of size 64K, 65Mb and 127Mb

then how many input splits will be made by Hadoop framework?

Hadoop will make 5 splits as follows

- 1 split for 64K files

- 2 splits for 65Mb files

- 2 splits for 127Mb file

Q6. What is the purpose of RecordReader in Hadoop

The InputSplithas defined a slice of work, but does not describe how to access it. The RecordReaderclass actually loads the data from its source and converts it into (key, value) pairs suitable for reading by the Mapper. The RecordReader instance is defined by the InputFormat

Q7. After the Map phase finishes, the hadoop framework does "Partitioning, Shuffle and sort". Explain what happens in this phase?

- Partitioning

Partitioning is the process of determining which reducer instance will receive which intermediate keys and values. Each mapper must determine for all of its output (key, value) pairs which reducer will receive them. It is necessary that for any key, regardless of which mapper instance generated it, the destination partition is the same

- Shuffle

After the first map tasks have completed, the nodes may still be performing several more map tasks each. But they also begin exchanging the intermediate outputs from the map tasks to where they are required by the reducers. This process of moving map outputs to the reducers is known as shuffling.

- Sort

Each reduce task is responsible for reducing the values associated with several intermediate keys. The set of intermediate keys on a single node is automatically sorted by Hadoop before they are presented to the Reducer

Q9. If no custom partitioner is defined in the hadoop then how is data partitioned before its sent to the reducer

The default partitioner computes a hash value for the key and assigns the partition based on this result

Q10. What is a Combiner

The Combiner is a "mini-reduce" process which operates only on data generated by a mapper. The Combiner will receive as input all data emitted by the Mapper instances on a given node. The output from the Combiner is then sent to the Reducers, instead of the output from the Mappers.

Q11. Give an example scenario where a cobiner can be used and where it cannot be used

There can be several examples following are the most common ones

- Scenario where you can use combiner

Getting list of distinct words in a file

- Scenario where you cannot use a combiner

Calculating mean of a list of numbers

Q12. What is job tracker

Job Tracker is the service within Hadoop that runs Map Reduce jobs on the cluster

Q13. What are some typical functions of Job Tracker

The following are some typical tasks of Job Tracker

- Accepts jobs from clients

- It talks to the NameNode to determine the location of the data

- It locates TaskTracker nodes with available slots at or near the data

- It submits the work to the chosen Task Tracker nodes and monitors progress of each task by receiving heartbeat signals from Task tracker

Q14. What is task tracker

Task Tracker is a node in the cluster that accepts tasks like Map, Reduce and Shuffle operations - from a JobTracker

Q15. Whats the relationship between Jobs and Tasks in Hadoop

One job is broken down into one or many tasks in Hadoop.

Q16. Suppose Hadoop spawned 100 tasks for a job and one of the task failed. What willhadoop do ?

It will restart the task again on some other task tracker and only if the task fails more than 4 (default setting and can be changed) times will it kill the job

Q17. Hadoop achieves parallelism by dividing the tasks across many nodes, it is possible for a few slow nodes to rate-limit the rest of the program and slow down the program. What mechanism Hadoop provides to combat this

Speculative Execution

Q18. How does speculative execution works in Hadoop

Job tracker makes different task trackers process same input. When tasks complete, they announce this fact to the Job Tracker. Whichever copy of a task finishes first becomes the definitive copy. If other copies were executing speculatively, Hadoop tells the Task Trackers to abandon the tasks and discard their outputs. The Reducers then receive their inputs from whichever Mapper completed successfully, first.

Q19. Using command line in Linux, how will you

- see all jobs running in the hadoop cluster

- kill a job

- hadoop job -list

- hadoop job -kill jobid

Q20. What is Hadoop Streaming

Streaming is a generic API that allows programs written in virtually any language to be used asHadoop Mapper and Reducer implementations

Q21. What is the characteristic of streaming API that makes it flexible run map reduce jobs in languages like perl, ruby, awk etc.

Hadoop Streaming allows to use arbitrary programs for the Mapper and Reducer phases of a Map Reduce job by having both Mappers and Reducers receive their input on stdin and emit output (key, value) pairs on stdout.

Q22. Whats is Distributed Cache in Hadoop
Distributed Cache is a facility provided by the Map/Reduce framework to cache files (text, archives, jars and so on) needed by applications during execution of the job. The framework will copy the necessary files to the slave node before any tasks for the job are executed on that node.

Q23. What is the benifit of Distributed cache, why can we just have the file in HDFS and have the application read it
This is because distributed cache is much faster. It copies the file to all trackers at the start of the job. Now if the task tracker runs 10 or 100 mappers or reducer, it will use the same copy of distributed cache. On the other hand, if you put code in file to read it from HDFS in the MR job then every mapper will try to access it from HDFS hence if a task tracker run 100 map jobs then it will try to read this file 100 times from HDFS. Also HDFS is not very efficient when used like this.

Q.24 What mechanism does Hadoop framework provides to synchronize changes made in Distribution Cache during runtime of the application
This is a trick questions. There is no such mechanism. Distributed Cache by design is read only during the time of Job execution

Q25. Have you ever used Counters in Hadoop. Give us an example scenario
Anybody who claims to have worked on a Hadoop project is expected to use counters

Q26. Is it possible to provide multiple input to Hadoop? If yes then how can you give multiple directories as input to the Hadoop job
Yes, The input format class provides methods to add multiple directories as input to a Hadoop job

Q27. Is it possible to have Hadoop job output in multiple directories. If yes then how
Yes, by using Multiple Outputs class

Q28. What will a hadoop job do if you try to run it with an output directory that is already present? Will it
- overwrite it
- warn you and continue
- throw an exception and exit
The hadoop job will throw an exception and exit.

Q29. How can you set an arbitary number of mappers to be created for a job in Hadoop
This is a trick question. You cannot set it

Q30. How can you set an arbitary number of reducers to be created for a job in Hadoop
You can either do it progamatically by using method setNumReduceTasksin the JobConfclass or set it up as a configuration setting

Src:http://xsh8637.blog.163.com/blog/#m=0&t=1&c=fks_084065087084081065083083087095086082081074093080080069

posted @ 2013-03-18 13:03 鑫龍閱讀(1488) | 評論 (0) | 編輯收藏

基于Hadoop Sequencefile的小文件解決方案

基于Hadoop Sequencefile的小文件解決方案

一、概述

小文件是指文件size小于HDFS上block大小的文件。這樣的文件會給hadoop的擴展性和性能帶來嚴重問題。首先，在HDFS中，任何block，文件或者目錄在內存中均以對象的形式存儲，每個對象約占150byte，如果有1000 0000個小文件，每個文件占用一個block，則namenode大約需要2G空間。如果存儲1億個文件，則namenode需要20G空間。這樣namenode內存容量嚴重制約了集群的擴展。其次，訪問大量小文件速度遠遠小于訪問幾個大文件。HDFS最初是為流式訪問大文件開發的，如果訪問大量小文件，需要不斷的從一個datanode跳到另一個datanode，嚴重影響性能。最后，處理大量小文件速度遠遠小于處理同等大小的大文件的速度。每一個小文件要占用一個slot，而task啟動將耗費大量時間甚至大部分時間都耗費在啟動task和釋放task上。

二、Hadoop自帶的解決方案

對于小文件問題，Hadoop本身也提供了幾個解決方案，分別為：Hadoop Archive，Sequence file和CombineFileInputFormat。

（1） Hadoop Archive

Hadoop Archive或者HAR，是一個高效地將小文件放入HDFS塊中的文件存檔工具，它能夠將多個小文件打包成一個HAR文件，這樣在減少namenode內存使用的同時，仍然允許對文件進行透明的訪問。

使用HAR時需要兩點，第一，對小文件進行存檔后，原文件并不會自動被刪除，需要用戶自己刪除；第二，創建HAR文件的過程實際上是在運行一個mapreduce作業，因而需要有一個hadoop集群運行此命令。

該方案需人工進行維護，適用管理人員的操作，而且har文件一旦創建，Archives便不可改變，不能應用于多用戶的互聯網操作。

（2） Sequence file

sequence file由一系列的二進制key/value組成，如果為key小文件名，value為文件內容，則可以將大批小文件合并成一個大文件。

Hadoop-0.21.0中提供了SequenceFile，包括Writer，Reader和SequenceFileSorter類進行寫，讀和排序操作。如果hadoop版本低于0.21.0的版本，實現方法可參見[3]。

該方案對于小文件的存取都比較自由，不限制用戶和文件的多少，但是SequenceFile文件不能追加寫入，適用于一次性寫入大量小文件的操作。

（3）CombineFileInputFormat

CombineFileInputFormat是一種新的inputformat，用于將多個文件合并成一個單獨的split，另外，它會考慮數據的存儲位置。

該方案版本比較老，網上資料甚少，從資料來看應該沒有第二種方案好。

三、小文件問題解決方案

在原有HDFS基礎上添加一個小文件處理模塊，具體操作流程如下:

1. 當用戶上傳文件時，判斷該文件是否屬于小文件，如果是，則交給小文件處理模塊處理，否則，交給通用文件處理模塊處理。

2. 在小文件模塊中開啟一定時任務，其主要功能是當模塊中文件總size大于HDFS上block大小的文件時，則通過SequenceFile組件以文件名做key，相應的文件內容為value將這些小文件一次性寫入hdfs模塊。

3. 同時刪除已處理的文件，并將結果寫入數據庫。

4. 當用戶進行讀取操作時，可根據數據庫中的結果標志來讀取文件。

轉自:http://lxm63972012.iteye.com/blog/1429011

posted @ 2013-03-04 19:28 鑫龍閱讀(869) | 評論 (0) | 編輯收藏

hadoop jar xxxx.jar的流程

jar -cvf xxx.jar .
hadopp jar xxx.jar clalss-name [input] [output]
----------------------------------------------------------------------
hadoop jar hadoop-0.20.2-examples.jar [class name]的實質是:

1.利用hadoop這個腳本啟動一個jvm進程;

2.jvm進程去運行org.apache.hadoop.util.RunJar這個java類;

3.org.apache.hadoop.util.RunJar解壓hadoop-0.20.2-examples.jar到hadoop.tmp.dir/hadoop-unjar*/目錄下;

4.org.apache.hadoop.util.RunJar動態的加載并運行Main-Class或指定的Class;

5.Main-Class或指定的Class中設定Job的各項屬性

6.提交job到JobTracker上并監視運行情況。

注意：以上都是在jobClient上執行的。

運行jar文件的時候，jar會被解壓到hadoop.tmp.dir/hadoop-unjar*/目錄下（如：/home/hadoop/hadoop-fs/dfs/temp/hadoop-unjar693919842639653083, 注意：這個目錄是JobClient的目錄，不是JobTracker的目錄）。解壓后的文件為：

drwxr-xr-x 2 hadoop hadoop 4096 Jul 30 15:40 META-INF

drwxr-xr-x 3 hadoop hadoop 4096 Jul 30 15:40 org

有圖有真相：

提交job的實質是：

生成${job-id}/job.xml文件到hdfs://${mapred.system.dir}/（比如hdfs://bcn152:9990/home/hadoop/hadoop-fs/dfs/temp/mapred/system/job_201007301137_0012/job.xml），job的描述包括jar文件的路徑，map|reduce類路徑等等.

上傳${job-id}/job.jar文件到hdfs://${mapred.system.dir}/（比如hdfs://bcn152:9990/home/hadoop/hadoop-fs/dfs/temp/mapred/system/job_201007301137_0012/job.jar）

有圖有真相：

生成job之后，通過static JobClient.runJob()就會向jobTracker提交job:

JobClient jc = new JobClient(job);

RunningJob rj = jc.submitJob(job);

之后JobTracker就會調度此job，

提交job之后，使用下面的代碼獲取job的進度：

try {

if (!jc.monitorAndPrintJob(job, rj)) {

throw new IOException("Job failed!");

}

} catch (InterruptedException ie) {

Thread.currentThread().interrupt();

}

posted @ 2013-03-02 17:28 鑫龍閱讀(4023) | 評論 (0) | 編輯收藏

僅列出標題

mysileng

導航

常用鏈接

留言簿

隨筆分類

隨筆檔案

搜索

最新評論

閱讀排行榜

評論排行榜