Methods for Web Content Analysis and Context Detection

This project was part of Portland State University’s senior capstone program. It is the work of seven students over the course of six months. For the duration of the project we worked with a Mozilla adviser, Dietrich Ayala, to keep on track with the project’s original requirements. The team was composed of the following students:

Overview and Goals of the Project

This project was a research-intensive proof of concept for a feature that would expand the reader mode to content beyond articles, for one of many Mozilla projects. We set out to solve the problem of how to “put the internet back in the hands of the user”, as web pages are often bloated with unnecessary content that degrades the user experience.

In developing nations with low-powered smartphones and slow internet connections, this bloat makes browsing computationally expensive and drains battery life. In our research we divided the problem into four main areas: the quality of the user’s internet connection, the user’s target device, what content is important to the user, and whether the data is accessible to people with disabilities.

For example, the graph below shows that, for one of the web pages tested, the difference in data usage with and without reader mode was nearly 6 MB.

Data Usage

By understanding which parts of a web page are content and which are not, we can limit data usage by downloading only the relevant content. In addition, if we can grab only what’s necessary from a site, the user’s device can optimize how that data is presented.

This transformation of the data for contextual presentation can be used to improve accessibility, or enable alternate browser models. We outline several possible efficient methods of content analysis. Ultimately, we found that currently available tools solve only a subset of the problems identified. However, by utilizing several of these tools and the concepts explored in our research paper, we believe it is possible to implement such a feature.

What does this mean for an everyday web developer? Imagine smarter tooling for content analysis, detection, and optimization that could be built as advanced features of the browser in the near future. Imagine developer tools that would make building website accessibility and platform-specific features far easier and less costly than it is today.

Read on to learn more about our findings and the research we designed to test our ideas.

Installation & Usage

The process outlined in our paper is referred to as “Minimum Contextualization”, or contextualization for short. This process is split into three main phases: Content analysis, content filtration and content transformation. Each of these phases has several steps.

Phoenix-node is a command line application written in Node.js that we developed to analyze HTML document structure. It relies on Node.js 4.0+, the npm package manager, and the jsdom npm package and its dependencies.

  1. Install Node.js 4.0+ following the instructions for your environment:
  2. Clone the Phoenix-node repository from
  3. Install jsdom into the source directory with ‘npm install jsdom’. A node_modules folder will be made.
  4. Run phoenix-node parsing with ‘node alt.js’. This will print the DOM structure to the terminal.

Phoenix Output


Research Findings

Our research identified three major phases in the contextualization process: content analysis, content filtration and content transformation. Our findings focus on content analysis. Content filtration and content transformation are not covered in our research.



For content analysis, we recommend two distinct steps: The first step should identify which “Structure Group” a site falls into by utilizing cluster analysis of document structures. In the second step, one of several methods can be used to parse through the site to determine which content is essential for the user to understand its meaning. For example, if a site is placed into a cluster which is text-heavy and has little to no other content, then basic reader mode features are sufficient for this, such as shallow-text methods. Otherwise a more advanced method must be used, such as semantic segment detection (discussed further in our paper).
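The two steps above can be sketched in a few lines of JavaScript. This is only an illustration under stated assumptions, not the method implemented in Phoenix-node or the paper: the tag list, the two "Structure Group" centroids, and all function names are hypothetical, and the regex tag counter stands in for a real DOM parser such as jsdom. It shows the assignment step of a cluster analysis (nearest centroid), after which the group decides which parsing method to apply.

```javascript
// Hypothetical sketch: names, tag list, and centroids are illustrative assumptions.
const TAGS = ['p', 'img', 'a', 'div', 'table', 'video'];

// Build a "structure vector": the relative frequency of each tracked tag.
function structureVector(html) {
  const counts = new Map(TAGS.map((tag) => [tag, 0]));
  const openTag = /<([a-zA-Z][a-zA-Z0-9]*)\b/g; // matches opening tags only
  let total = 0;
  let match;
  while ((match = openTag.exec(html)) !== null) {
    const tag = match[1].toLowerCase();
    if (counts.has(tag)) {
      counts.set(tag, counts.get(tag) + 1);
      total += 1;
    }
  }
  return TAGS.map((tag) => (total === 0 ? 0 : counts.get(tag) / total));
}

// Euclidean distance between two vectors of equal length.
function distance(a, b) {
  return Math.sqrt(a.reduce((sum, x, i) => sum + (x - b[i]) ** 2, 0));
}

// Assumed centroids, as if learned earlier by clustering many pages.
const GROUPS = [
  { name: 'text-heavy', centroid: [0.8, 0.05, 0.1, 0.05, 0, 0], method: 'shallow-text' },
  { name: 'media-heavy', centroid: [0.1, 0.5, 0.1, 0.2, 0, 0.1], method: 'semantic-segment-detection' },
];

// Step 1: place the page in the nearest structure group.
// Step 2: the group's "method" field names the analysis to run next.
function classifyPage(html, groups = GROUPS) {
  const vector = structureVector(html);
  let best = groups[0];
  for (const group of groups) {
    if (distance(vector, group.centroid) < distance(vector, best.centroid)) {
      best = group;
    }
  }
  return best;
}
```

A text-heavy page would then be routed to a shallow-text method, while other groups would fall through to something more advanced.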

Through our research we learned about the limitations inherent in modern reader mode techniques and the status of similar research. Our team’s recommended method for content analysis and context detection is to use cluster analysis to group sites with similar structures, and then to learn the archetypal structure of each cluster.

Read the full paper here:

Methods for Web Content Analysis and Context Detection

View full post on Mozilla Hacks – the Web developer blog

HTML5 games: 3D collision detection

Last week we took a look at Tilemaps, and I shared some new articles that I’d written on MDN. This week I’m back to introduce 3D collision detection, an essential technique for almost any kind of 3D game. I’ll also point you to some more new articles about game development on MDN! Hope they inspire you to stretch your skills.

In 3D game development, bounding volumes provide one of the most widely used techniques for determining whether two virtual objects will collide (i.e., intersect with each other) during game play. The technique of bounding volumes consists of wrapping game objects with some virtual volumes, and applying intersection algorithms to describe the movement and interaction of these volumes. You can think of this approach as a shortcut: it is easier and faster than detecting intersections between arbitrary, complex shapes.

In terms of bounding volumes, the use of axis-aligned bounding boxes (AABB) is a popular option. Depending on the game, sometimes spheres are used as well. Here’s an image of some 3D objects wrapped with AABB:

[Image: 3D objects wrapped in axis-aligned bounding boxes]

The new MDN article on 3D collision detection describes how to use generic algorithms to perform 3D collision detection with AABB and spheres. This article should be useful regardless of the game engine or programming language you are using to develop your game.
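As a rough illustration of such generic algorithms, the sketch below shows three standard intersection tests written in plain JavaScript. The object shapes (`min`/`max` corners for a box, `center`/`radius` for a sphere) and the function names are assumptions made for this example, not the API of three.js or any other engine.

```javascript
// Assumed shapes: box = { min: {x, y, z}, max: {x, y, z} },
//                 sphere = { center: {x, y, z}, radius }.

// Two AABBs intersect when their extents overlap on all three axes.
function aabbIntersectsAabb(a, b) {
  return (
    a.min.x <= b.max.x && a.max.x >= b.min.x &&
    a.min.y <= b.max.y && a.max.y >= b.min.y &&
    a.min.z <= b.max.z && a.max.z >= b.min.z
  );
}

// Squared distance between two points (avoids a square root).
function distanceSquared(p, q) {
  const dx = p.x - q.x;
  const dy = p.y - q.y;
  const dz = p.z - q.z;
  return dx * dx + dy * dy + dz * dz;
}

// Two spheres intersect when the distance between their centers
// is no greater than the sum of their radii.
function sphereIntersectsSphere(a, b) {
  const reach = a.radius + b.radius;
  return distanceSquared(a.center, b.center) <= reach * reach;
}

// A sphere intersects an AABB when the point of the box closest to the
// sphere's center lies within the sphere's radius.
function sphereIntersectsAabb(sphere, box) {
  const clamp = (value, lo, hi) => Math.max(lo, Math.min(value, hi));
  const closest = {
    x: clamp(sphere.center.x, box.min.x, box.max.x),
    y: clamp(sphere.center.y, box.min.y, box.max.y),
    z: clamp(sphere.center.z, box.min.z, box.max.z),
  };
  return distanceSquared(closest, sphere.center) <= sphere.radius * sphere.radius;
}
```

These tests treat touching volumes as colliding; a game that wants strict overlap could switch the comparisons to strict inequalities.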

We also published an article about doing collision detection with bounding volumes using three.js, a popular 3D library for JavaScript. (Learn more about three.js.)

Check out the live demos and peek at their source code. One of the demos uses a physics engine (in this case, Cannon.js) to perform collision detection. Embedded below you can find another demo that shows how to use Three.js to detect collisions:

Hope you enjoy the demos and find them useful. If there’s a particular topic in HTML5 game development you’d like to learn more about, please drop a comment here and let us know! We’ll try to get it covered for you.

View full post on Mozilla Hacks – the Web developer blog

User-Agent detection, history and checklist


User-Agent: <something> is a string of characters sent by HTTP clients (browsers, bots, calendar applications, etc.) with each individual HTTP request to a server. The HTTP protocol as defined in 1991 didn’t have this field, but the next version, defined in 1992, added User-Agent to the HTTP request headers. Its syntax was defined as “the software product name, with an optional slash and version designator”. The prose already invited people to use it for analytics and to identify products with implementation issues.

This line if present gives the software program used by the original client. This is for statistical purposes and the tracing of protocol violations. It should be included.

Fast forward to August 2013, the HTTP/1.1 specification is being revised and also defines User-Agent.

A user agent SHOULD NOT generate a User-Agent field containing
needlessly fine-grained detail and SHOULD limit the addition of
subproducts by third parties. Overly long and detailed User-Agent
field values increase request latency and the risk of a user being
identified against their wishes (“fingerprinting”).

Likewise, implementations are encouraged not to use the product
tokens of other implementations in order to declare compatibility
with them, as this circumvents the purpose of the field. If a user
agent masquerades as a different user agent, recipients can assume
that the user intentionally desires to see responses tailored for
that identified user agent, even if they might not work as well for
the actual user agent being used.

Basically, the HTTP specification has, since its inception, discouraged detecting the User-Agent string to tailor the user experience. Today, user agent strings have become overly long and are abused in every possible way. They include detailed information, they lie about what they really are, and they are used for branding and for advertising the devices they run on.

User-Agent Detection

User agent detection (or sniffing) is the mechanism used for parsing the User-Agent string and inferring physical and applicative properties of the device and its browser. But let’s set the record straight: User-Agent sniffing is a future-fail strategy. By design, you will detect only what is known, not what will come. The space of small devices (smartphones, feature phones, tablets, watches, Arduino, etc.) is evolving at a very fast pace, and the diversity of physical characteristics will only increase. Keeping databases and algorithms up to date so that they identify devices correctly is a very high-maintenance task, and one that is doomed to fail at some point in the future. Sites get abandoned, libraries go unmaintained, and Web sites break simply because they were not planned for future devices. All of this has costs in resources and branding.

New solutions are being developed to help people adjust the user experience depending on the capabilities of a product, not its name. Responsive design helps to create Web sites that adjust to different screen sizes. Each time you detect a product or a feature, it is important to thoroughly understand why you are trying to detect it. You could fall into the same traps as the ones that exist with user agent detection algorithms.

We have to deal on a daily basis with abusive user agent detection blocking Firefox OS and/or Firefox on Android. This does not only affect Mozilla products: every product and brand has to deal at some point with being excluded because it didn’t have the right token to pass an ill-coded algorithm. User agent detection leads to a situation where a new player can hardly enter the market, even if it has the right set of technologies. Remember that there are huge benefits to creating a system that is resilient to many situations.

Some companies use the User-Agent string as an identifier for bypassing a paywall or for offering specific content to a group of users during a marketing campaign. It seems to be an easy solution at first, but it creates an environment that is easy to bypass by spoofing the user agent.

Firefox and Mobile

Firefox OS and Firefox on Android have very simple documented User-Agent strings.

Firefox OS

Mozilla/5.0 (Mobile; rv:18.0) Gecko/18.0 Firefox/18.0

Firefox on Android

Mozilla/5.0 (Android; Mobile; rv:18.0) Gecko/18.0 Firefox/18.0

The most common case of user agent detection is to find out whether the device is a mobile device, in order to redirect the browser to a dedicated Web site tailored with mobile content. We recommend limiting your detection to the simplest possible string by matching the substring mobi in lowercase.


If you are detecting on the client side with JavaScript, one possibility among many would be to do:

// Put the User-Agent string in lowercase
var ua = navigator.userAgent.toLowerCase();
// Better to test on mobi than mobile (Firefox, Opera, IE)
if (/mobi/i.test(ua)) {
    // do something here
} else {
    // if not identified, still do something useful
}

You might want to add more than one token in the if statement.
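A minimal sketch of a multi-token test might look like the following. The token list (`mobi` and `tablet`) and the function names are assumptions chosen for illustration; which tokens you actually need depends on the browsers you want to reach. The function takes the string as a parameter so it can be used with `navigator.userAgent` on the client or with a request header on the server.

```javascript
// Hypothetical helper: the token list is an illustrative assumption.
function looksMobile(userAgent) {
  // Case-insensitive match on a short list of generic tokens.
  return /mobi|tablet/i.test(String(userAgent || ''));
}

// Always return something usable, even when nothing matches.
function chooseSite(userAgent) {
  return looksMobile(userAgent) ? 'mobile' : 'default';
}
```

Called with the Firefox OS string shown earlier, `chooseSite` selects the mobile site; an unknown or missing string falls back to the default site rather than being blocked.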


Remember that however many tokens you put there, you will fail at some point in the future. Some devices will not have JavaScript, or will not have the right token. The pattern or the length of the token will not be what you had initially planned. The stones on the path are plenty; choose the way of simplicity.

Summary: UA detection Checklist Zen

  1. Do not detect user agent strings
  2. Use responsive design for your new mobile sites (media queries)
  3. If you are using a specific feature, use feature detection to enhance, not block
  4. And if you do end up using UA detection, just go with the simplest and most generic strings.
  5. Always provide a working fallback, whatever solution you choose.
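Items 3 and 5 of the checklist can be illustrated with a small feature-detection sketch. The helper names and the shape of the returned object are hypothetical; the only platform fact relied on is that browsers exposing the Geolocation API provide a `geolocation` property on `navigator`. The host object is passed in as a parameter so the same code can be exercised outside a browser.

```javascript
// Hypothetical helper: true when `name` exists on `host`, false otherwise.
function hasFeature(host, name) {
  const isObjectLike =
    host !== null && (typeof host === 'object' || typeof host === 'function');
  return isObjectLike && name in host;
}

// Start from a baseline that always works, then enhance when possible.
function planExperience(nav) {
  return {
    layout: 'basic', // the working fallback every client receives
    locationAware: hasFeature(nav, 'geolocation'), // optional enhancement
  };
}
```

In a page this would be called as `planExperience(navigator)`; a client without the feature still gets the basic layout instead of being turned away.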

Practice. Learn. Imagine. Modify. And start again. There will be many road blocks on the way depending on the context, the business requirements, the social infrastructure of your own company. Keep this checklist close to you and give the Web to more people.

View full post on Mozilla Hacks – the Web developer blog

Ambient Light Events and JavaScript detection

I think that one of the most interesting things about all the WebAPIs we’re working on is being able to interact directly with the hardware through JavaScript, and also, as an extension of that, with the environment around us. Enter Ambient Light Events.

The idea with an API for ambient light is to be able to detect the light level around the device – especially since there’s a vast difference between being outside in sunlight and sitting in a dim living room – and adapt the user experience based on that.

One use case could be to change the CSS file/values for the page, offering a nicer reading experience under low light conditions, reducing the strong white of a background, and then something with more/better contrast for bright ambient light. Another could be to play certain music depending on the light available.

Accessing device light

Working with ambient light is quite simple. What you need to do is apply a listener for a devicelight event, and then read out the brightness value.

The value is returned in lux units. It can range from very low to very high, but a good point of reference is that dim values are under 30 lux, whereas really bright ones are 10,000 lux and over.
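The reference points above can be turned into a small helper that a devicelight listener could call. This is a sketch only: the function name, the three category labels, and the idea of storing the result in a data attribute are assumptions for illustration, while the 30 and 10,000 lux thresholds come from the text.

```javascript
// Map a lux reading to a coarse category using the reference points
// mentioned in the text: under 30 lux is dim, 10,000 and over is bright.
function lightLevel(lux) {
  if (lux < 30) {
    return 'dim';
  }
  if (lux >= 10000) {
    return 'bright';
  }
  return 'normal';
}

// Possible use inside a browser that fires devicelight events:
// window.addEventListener('devicelight', function (event) {
//     document.body.dataset.light = lightLevel(event.value);
// });
```

A stylesheet could then key off the attribute, for example softening a white background when the level is dim and raising contrast when it is bright.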

window.addEventListener("devicelight", function (event) {
    // Read out the lux value
    var lux = event.value;
});

Web browser support

Ambient Light Events are currently supported in Firefox on Android, meaning both mobile phones and tablets, and in Firefox OS. On Android devices (the ones I’ve tested), the sensor is located just to the right of the camera facing the user.

It is also a W3C Working Draft, following the pattern of other similar events, such as devicemotion, so we hope to see more implementations of this soon!


Dmitry Dragilev and Tim Wright recently wrote a blog post about the Ambient Light API, with this nice demo video:

You can also access the demo example directly, and if you test in low light conditions, you’ll get a little music. Remember to try it out on a supported device/web browser.

View full post on Mozilla Hacks – the Web developer blog

Cyber Intrusion Detection Analyst

Job description: …PCAP analysis, snort signature development, and familiarity with the following or similar security tools:   McAfee HBSS, McAfee IPS/IDS, Web content filtering, Juniper firewalls, Niksun Net Detector, and ArcSight. … View full post on – web security
