MongoDB and Hadoop: A Step-by-Step Tutorial Using

Published: 2016-06-07 16:29:20

The following is a guest post from Jeremy Karn. This article is excerpted from “MongoDB + Hadoop: A Step-by-Step Tutorial”. Jeremy is a cofounder at Mortar Data, a Hadoop-as-a-service provider, and creator of mortar, an open source framework for data processing.

People who are worried about scalability often find themselves looking at two tools: MongoDB for storing large amounts of data easily and Hadoop for processing that data. But a common question is: “How do I combine these two to really get the most out of my data?”

Here’s a step-by-step tutorial that will get you up and running with MongoDB and Hadoop in a matter of minutes. And the best part about this tutorial is that at the end you’ll be ready to jump right into using your own MongoDB data with Hadoop.

For this tutorial you’ll be using Apache Pig, a high-level data flow language that compiles down into Hadoop MapReduce jobs. It was designed to be easy to learn and simple to write. If you’ve written SQL, Pig will feel familiar: it is like a procedural version of SQL.
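To make the “procedural SQL” comparison concrete, here’s a minimal sketch in plain Python (not Pig) of the kind of step-by-step data flow a Pig script expresses: load, filter, group, count, order. The sample records and field names are made up for illustration and are not taken from the tutorial’s dataset.

```python
from collections import Counter

# Hypothetical sample records; in Pig these would come from a LOAD statement.
tweets = [
    {"lang": "en", "text": "good morning"},
    {"lang": "es", "text": "buenos dias"},
    {"lang": "en", "text": "need coffee"},
    {"lang": "en", "text": ""},
]

# FILTER step: keep only records with non-empty text.
non_empty = [t for t in tweets if t["text"]]

# GROUP + COUNT step: each stage is a named intermediate result,
# which is the "procedural" part of the comparison.
counts_by_lang = Counter(t["lang"] for t in non_empty)

# ORDER step: most frequent language first.
ranked = counts_by_lang.most_common()
print(ranked)
```

Where SQL would fold all of this into one declarative statement, a data flow language names each intermediate step, which tends to make long pipelines easier to read and debug.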

To run your Hadoop jobs, you’re going to use a free Mortar account. Mortar provides Hadoop as a service, which means you can run your jobs without worrying about how to set up and manage a multi-node Hadoop cluster.

To get started, we’ve already set up a small MongoDB instance on MongoLab, populated it with a random sampling of Twitter data from a single day (around 120,000 tweets), and created a read-only user for you.

We’ve also set up a public GitHub repo with a Mortar project that has three Pig scripts ready to run. Here’s what you need to do:

If you don’t already have a free GitHub account, create one. You’ll need a GitHub username in step 4.

  1. Sign in to (or create) your free Mortar account.
  2. After you receive the confirmation email, log into Mortar at https://app.mortardata.com.
  3. Install the Mortar Development Framework:
    gem install mortar
  4. Clone the example git project and register it as a Mortar project:
    git clone git@github.com:mortardata/mongo-pig-examples.git
    cd mongo-pig-examples
    mortar register mongo-pig-examples

Script 1 - Characterize Collection

If you’re like most MongoDB users, you may not have a great sense of the different fields, data types, or values in your collection. We built characterize_collection.pig to deeply inspect your collection to extract that information.

From the base directory of the mongo-pig-examples project you just cloned, take a look at pigscripts/characterize_collection.pig. It loads all the data in the collection as a map, sends the map to Python (udfs/python/mongo_util.py) to gather a bunch of metadata, calculates some basic information about the collection, and then writes the results out to an S3 bucket.
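To give a feel for what “gathering metadata” from a nested document can involve, here’s a rough Python sketch. It is not the code in udfs/python/mongo_util.py; the function names `flatten` and `characterize` and the sample documents are hypothetical, and the real script may work quite differently.

```python
from collections import Counter

def flatten(doc, prefix=""):
    """Yield (dotted_field_path, value) pairs for every leaf of a nested dict."""
    for key, value in doc.items():
        path = f"{prefix}{key}"
        if isinstance(value, dict):
            # Recurse into sub-documents, extending the dotted path.
            yield from flatten(value, prefix=f"{path}.")
        else:
            # Lists and scalars are treated as leaves in this sketch.
            yield path, value

def characterize(docs):
    """Count occurrences of each (field, type name, value) triple."""
    counts = Counter()
    for doc in docs:
        for path, value in flatten(doc):
            counts[(path, type(value).__name__, str(value))] += 1
    return counts

# Hypothetical documents shaped loosely like tweets.
docs = [
    {"user": {"lang": "en", "listed_count": 0}},
    {"user": {"lang": "en", "listed_count": 1}},
    {"user": {"lang": "es", "listed_count": 0}},
]
summary = characterize(docs)
for (path, type_name, value), n in sorted(summary.items()):
    print(path, type_name, value, n, sep="\t")
```

Flattening to dotted paths such as `user.lang` is what lets a deeply nested collection be summarized one field at a time.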

To see this script in action, let’s run it on a 4-node Hadoop cluster. In your terminal (from the base directory of your mongo-pig-examples project) run:

mortar run characterize_collection --clustersize 4

This job will take about 10 minutes to finish. You can monitor the job’s status on the command line or by going to https://app.mortardata.com/jobs.

Once the job has finished, you’ll receive an email with a link to your job results. Clicking this link takes you to the Mortar web app, where you can download the results from S3. The output format is described at the top of the characterize_collection script; as an example, you can scroll down the output and find:

…
user.is_translator	2	false	unicode	118806
user.is_translator	2	true	unicode	31
user.lang	26	en	unicode	114108
user.lang	26	es	unicode	3462
user.lang	26	fr	unicode	532
user.lang	26	pt	unicode	281
user.lang	26	ja	unicode	79
user.listed_count	398	0	int	73757
user.listed_count	398	1	int	18518

Looking at the values for user.lang, we see that there are 26 unique values for the field in our dataset. The most common was “en” with 114108 occurrences, the next most common was “es” with 3462 occurrences, and so on. To see the full results without running the job you can view the output file here.
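If you download the results, a few lines of Python are enough to pull out the top values for a field. This is a sketch that assumes the tab-separated column order read off the sample above (field, number of unique values, value, type, occurrences); the helper name `top_values` is hypothetical, and the header of the characterize_collection script remains the authoritative description of the format.

```python
import csv
import io

# Sample rows copied from the output shown above (tab-separated).
sample = (
    "user.lang\t26\ten\tunicode\t114108\n"
    "user.lang\t26\tes\tunicode\t3462\n"
    "user.lang\t26\tfr\tunicode\t532\n"
    "user.listed_count\t398\t0\tint\t73757\n"
)

def top_values(tsv_text, field, limit=2):
    """Return the most frequent (value, occurrences) pairs for one field."""
    rows = csv.reader(io.StringIO(tsv_text), delimiter="\t")
    # Column 0 is the field path, column 2 the value, column 4 the count.
    matches = [(r[2], int(r[4])) for r in rows if r and r[0] == field]
    return sorted(matches, key=lambda pair: pair[1], reverse=True)[:limit]

print(top_values(sample, "user.lang"))
```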

Script 2 - MongoDB Schema Generator

It can be tricky to properly declare MongoDB’s highly nested schemas in Pig. Pig is forgiving: it can run without a schema, or with an inconsistent or incorrect one. But your Pig code is easier to read and write if you have a schema, because it lets you (and the Pig optimizer) focus on just the relevant data.

So this next script automatically generates a Pig schema by examining your MongoDB collection. If you don’t need the whole schema, you can easily edit it to keep just the fields you want.
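As a rough idea of how a schema can be derived from a document, here’s a Python sketch that walks one nested dict and emits a Pig-style schema string. It is not the tutorial’s generator: the type mapping is an illustrative assumption, the function names are hypothetical, and lists or mixed-type fields (which a real generator has to handle across many documents) simply fall back to `bytearray`, Pig’s default type.

```python
def pig_type(value):
    """Map a Python value to a Pig type name (illustrative mapping only)."""
    # bool must be tested before int because bool is a subclass of int.
    if isinstance(value, bool):
        return "boolean"
    if isinstance(value, int):
        return "int"
    if isinstance(value, float):
        return "double"
    if isinstance(value, str):
        return "chararray"
    if isinstance(value, dict):
        # Nested documents become nested tuples.
        return f"tuple({pig_schema(value)})"
    # Anything else (lists, None, ...) is left untyped in this sketch.
    return "bytearray"

def pig_schema(doc):
    """Build a comma-separated 'name:type' schema string for one document."""
    return ", ".join(f"{key}:{pig_type(value)}" for key, value in doc.items())

# Hypothetical document using field names seen in the sample output.
doc = {
    "text": "hello",
    "user": {"lang": "en", "listed_count": 3, "is_translator": False},
}
print(pig_schema(doc))
```

Trimming the generated string down to the handful of fields a job needs is then just a matter of deleting entries.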

Running this script is similar to running the previous one. If you ran the Characterize Collection script in the past hour, the same cluster you used for that job should still be running. In that case, you can just run:

mortar run mongo_schema_generator

If you don’t have a cluster that’s still running, just run the job on a new 4-node cluster like this:

mortar run mongo_schema_generator --clustersize 4

Script 3 - Twitter Hourly Coffee Tweets

Using a Twitter coffee tweets script (pigscripts/hourly_coffee_tweets.pig), we’re going to demonstrate how we can use a small subset of the fields in our MongoDB collection. For our example, we’ll look at how often the word “coffee” is tweeted throughout the day. As with the Mongo Schema Generator script, you can run this job on an existing cluster or start up a new one.
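The idea behind the hourly count can be sketched in a few lines of Python. This is not a translation of hourly_coffee_tweets.pig: it assumes each tweet has a `text` field and a `created_at` field in the classic Twitter API timestamp layout, the function name is hypothetical, and the sample tweets are invented.

```python
from collections import Counter
from datetime import datetime

# Classic Twitter API timestamp layout, e.g. "Wed Aug 27 13:08:45 +0000 2008".
TWITTER_TIME_FORMAT = "%a %b %d %H:%M:%S %z %Y"

def hourly_keyword_counts(tweets, keyword="coffee"):
    """Count tweets whose text contains `keyword`, bucketed by hour of day."""
    counts = Counter()
    for tweet in tweets:
        # Simple case-insensitive substring match; this also matches
        # longer words that contain the keyword.
        if keyword in tweet["text"].lower():
            created = datetime.strptime(tweet["created_at"], TWITTER_TIME_FORMAT)
            counts[created.hour] += 1
    return counts

# Invented sample tweets that use only the two fields the count needs.
tweets = [
    {"text": "Coffee time!", "created_at": "Wed Aug 27 08:15:00 +0000 2008"},
    {"text": "More coffee please", "created_at": "Wed Aug 27 08:45:10 +0000 2008"},
    {"text": "Lunch break", "created_at": "Wed Aug 27 12:01:00 +0000 2008"},
    {"text": "afternoon COFFEE", "created_at": "Wed Aug 27 15:30:00 +0000 2008"},
]
counts = hourly_keyword_counts(tweets)
print(sorted(counts.items()))
```

Only two fields out of the whole tweet document are touched, which is the point of declaring a narrow schema: the job never has to carry the rest of each document through the pipeline.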

Next Steps

If you already have a MongoDB instance or cluster based in US-East EC2, the first two example scripts should run on one of your collections with only minor modifications. You’ll just need to:

  1. Update the MongoLoader connection strings in the Pig scripts to connect to your MongoDB collections with one of your own users. If your MongoDB instance is on a non-standard port (any port other than 27017), just email us at support@mortardata.com to allow your Mortar account to access that port.
  2. If you’d like your jobs to write to one of your own S3 buckets, you can update the AWS keys associated with your Mortar account by following these instructions to enable S3 access.
  3. If you run out of free cluster hours with Mortar, you can upgrade your account to get additional free hours each month.
  4. You can find more resources for learning Pig here.
  5. If you have any questions or feedback, please contact us at support@mortardata.com or ping us on in-app chat at app.mortardata.com.
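Before editing the connection strings, it can help to check whether your instance uses the standard port. Here’s a small Python sketch that does this with the standard library; the URI below is a made-up example (hypothetical user, host, database and collection), and the exact connection string layout the Pig scripts expect should be taken from the scripts themselves.

```python
from urllib.parse import urlsplit

DEFAULT_MONGODB_PORT = 27017

def uses_standard_port(mongo_uri):
    """Return True if the URI targets port 27017, explicitly or by omission."""
    parts = urlsplit(mongo_uri)
    # urlsplit reports None when the URI does not name a port.
    port = parts.port or DEFAULT_MONGODB_PORT
    return port == DEFAULT_MONGODB_PORT

# Hypothetical connection string; replace with your own host and user.
uri = "mongodb://readonly_user:secret@db.example.com:31001/twitter.tweets"
print(urlsplit(uri).hostname, urlsplit(uri).port, uses_standard_port(uri))
```

If the check returns False, that’s the case where Mortar support needs to open the port for your account.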
