Cross Posting

Page history last edited by Rob Dolin 13 years, 11 months ago

Notes from Brainstorming at Stream Camp (2010-04-17)

 

Scenarios:

  • User publishes an activity from a client to multiple services simultaneously
    • TweetDeck --> Facebook and Twitter
    • Ping.fm --> ~30 services
  • User syncs activity among services
    • Foursquare --> Facebook, Twitter
    • Twitter <--> MySpace, LinkedIn, Facebook

 

Potential Solutions

  • Check publishing <source>, <app_id>, <generator>, etc.
  • Need to canonicalize status text (ex: remove "robdolin: " or "Rob Dolin" or "via foo")
  • Time window (ex: same update within a short period of time)
  • Check similar properties (ex: the <title> of a blog entry appears in a status, or the status contains a link (possibly a short URL) to the blog <link>)
  • For multi-site publishers (ex: Ping.fm) could define/include a property/hash that was unique to the author/publishing client (i.e. JohnSmith123PingFm)
  • Globally unique ID / source - See Martin's spec linked below.  Could potentially start with <entry><id> but would likely need to also check <entry><activity:object><id> and <entry><crosspost:source>.
  • Fuzzy hashing on potentially matching / similar strings 
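A minimal sketch of how three of the ideas above (canonicalizing status text, a time window, and fuzzy matching) might combine. The function names, the 10-minute window, and the 0.9 threshold are assumptions made for illustration, and `difflib.SequenceMatcher` is used as a simple similarity measure in place of a true fuzzy hash.

```python
import re
from datetime import datetime, timedelta
from difflib import SequenceMatcher


def canonicalize(text, author_names):
    """Strip author prefixes and a trailing 'via <client>' tag, then normalize."""
    cleaned = text.strip()
    for name in author_names:
        # Remove a leading "name: " or "name " attribution, case-insensitively.
        cleaned = re.sub(r"^" + re.escape(name) + r":?\s+", "", cleaned,
                         flags=re.IGNORECASE)
    # Remove a trailing "via foo" or "(via foo)" marker (heuristic only).
    cleaned = re.sub(r"\s*\(?via\s+\S+\)?\s*$", "", cleaned, flags=re.IGNORECASE)
    # Collapse whitespace and lowercase so trivial differences do not matter.
    return re.sub(r"\s+", " ", cleaned).strip().lower()


def likely_duplicates(a, b, author_names,
                      window=timedelta(minutes=10), threshold=0.9):
    """Return True when two (text, timestamp) updates look like the same post."""
    text_a, time_a = a
    text_b, time_b = b
    # Time window: updates far apart are treated as distinct actions.
    if abs(time_a - time_b) > window:
        return False
    # Similarity ratio on the canonicalized strings, between 0.0 and 1.0.
    ratio = SequenceMatcher(None,
                            canonicalize(text_a, author_names),
                            canonicalize(text_b, author_names)).ratio()
    return ratio >= threshold
```

The trailing "via" rule would also strip legitimate text such as "traveled via Paris", which is one reason heuristics like this are fragile compared with an explicit source ID.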

 

(Older content below)

 

In early October, a few folks from Facebook and Google (others were invited as well) got together to brainstorm on cross posting and reducing duplicated content.  We spent a lot of time talking about Mart's Atom Cross-posting Extension and wanted to share our notes.

 

Brainstorming Notes on De-Duping and Cross Posting

 

Proposal Summary

- Facebook, Twitter, and similar APIs should allow clients to specify a crosspost:source ID when publishing content.

- Services that crawl and re-publish feeds should propagate the crosspost:source element or create it based on the crawled atom:id or RSS guid.

 

Background

It's become clear over the past year that the rate of content being automatically re-shared across social websites is increasing and has caused duplication as aggregators are unable to determine the original source of a piece of content. One example is YouTube's auto-share functionality which allows a user to automatically re-share videos they upload (or favorite) to Facebook, Twitter, and Google Reader. This has resulted in a given video being posted to YouTube, shared to Facebook, Twitter, and Reader, and then re-duplicated on FriendFeed as it's aggregating from both Facebook and Twitter as well. While aggregators like FriendFeed have written custom algorithms to detect this sort of duplication and coalesce content, it would be desirable to reduce this echo effect from the start.

 

So far the crosspost:source element from http://74.125.155.132/search?q=cache:F4JZo0244ZkJ:martin.atkins.me.uk/specs/atomcrosspost (Mart's site seems to be down) seems to solve the main use case.

 

<entry>
  <id>tag:jibber.example.org,2005:4523452</id>
  <title>geraldine: Photos from my Weekend http://sillyurl.example.net/abc123</title>
  <link href="http://jibber.example.net/statuses/4523452" />
  <!-- (other standard Atom elements elided for brevity) -->
  <crosspost:source>
    <id>tag:blogtastic.example.com,2009:5a12451543</id>
  </crosspost:source>
</entry>
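As a rough illustration of how a consumer might read both IDs from an entry like the one above, here is a sketch using Python's standard `xml.etree.ElementTree`. The crosspost namespace URI is a placeholder assumed for this example (the sample does not show its namespace declarations), and `read_ids` is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

ATOM_NS = "http://www.w3.org/2005/Atom"
# Placeholder namespace URI: an assumption for this sketch, not taken from the spec.
CROSSPOST_NS = "http://example.org/ns/crosspost"

ENTRY_XML = f"""
<entry xmlns="{ATOM_NS}" xmlns:crosspost="{CROSSPOST_NS}">
  <id>tag:jibber.example.org,2005:4523452</id>
  <title>geraldine: Photos from my Weekend http://sillyurl.example.net/abc123</title>
  <link href="http://jibber.example.net/statuses/4523452" />
  <crosspost:source>
    <id>tag:blogtastic.example.com,2009:5a12451543</id>
  </crosspost:source>
</entry>
"""


def read_ids(entry_xml):
    """Return (entry id, crosspost source id or None) for one Atom entry."""
    entry = ET.fromstring(entry_xml.strip())
    ns = {"atom": ATOM_NS, "crosspost": CROSSPOST_NS}
    entry_id = entry.findtext("atom:id", namespaces=ns)
    # The nested <id> inherits the default (Atom) namespace in this sample.
    source_id = entry.findtext("crosspost:source/atom:id", namespaces=ns)
    return entry_id, source_id
```

An entry without a crosspost:source element yields `None` for the second value, which a consumer could treat as "this is the original".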

 

 

Where do we want the complexity?

- Options are either at the publisher or at the aggregator.

- It seems more resilient for the publisher to specify that an item is a cross post than for the aggregator to have to figure it out.

 

So when do you include crosspost:source?

- Only use crosspost:source when it really is a cross post, i.e. really constitutes the same original user action, merely being re-broadcast through multiple services.

- We're not solving the general coalescing problem; aggregators still need to make their own decisions on how to coalesce items that originate from distinct but related user actions.

- For example, if the user posted a YouTube video and then manually pasted the link into their Facebook status, crosspost:source would not be included.

- To an aggregator, crosspost:source means, "if you already have another instance of the thing I'm linking to, feel free to completely drop one of the two".

- Items with the same crosspost:source may not be identical -- e.g. on Twitter, the text may be truncated to 140 chars or links may be shortened. Aggregators are free to select among multiple copies however they like -- e.g. try to identify the highest fidelity.
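One way an aggregator could act on these rules, sketched under assumptions: items are plain dicts with hypothetical keys `id`, `text`, and an optional `crosspost_source`, and "highest fidelity" is approximated by the longest text. A real aggregator would likely use richer signals.

```python
def dedupe(items):
    """Keep one item per original user action, preferring the longest text."""
    best = {}
    for item in items:
        # An original has no crosspost source, so its own id is the key;
        # a cross post is keyed by the id of the item it was copied from.
        key = item.get("crosspost_source") or item["id"]
        current = best.get(key)
        if current is None or len(item["text"]) > len(current["text"]):
            best[key] = item
    # Dicts preserve insertion order, so results follow first appearance.
    return list(best.values())
```

Items that share no source ID are never merged here, which matches the point above that general coalescing is left to the aggregator's own judgment.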

 

A simple scenario of user activity and how the crosspost:source travels through the system:

- User uploads a video on YouTube. The YouTube user feed should not have a crosspost:source, but will have a unique atom:id.

- YouTube auto-shares that video upload action to Twitter. The same ID that appears in YouTube's atom feed should be included as the crosspost:source when posting to Twitter (see below re: extension to posting API).

- If the user has set up FriendFeed to crawl their YouTube feed, then FriendFeed finds the item in the YouTube feed, with no crosspost:source. When re-publishing the user's aggregated FriendFeed feed, they should take the YouTube atom:id and make that the crosspost:source in the re-published feed.

- If the user has set up FriendFeed to crawl their Twitter feed (but not their YouTube feed), then FriendFeed finds the item in the user's Twitter feed, including a crosspost:source. They should propagate that crosspost:source when re-publishing the item in the user's aggregated feed.

- If the user has set up FriendFeed to crawl both their YouTube and Twitter feeds, then FriendFeed finds both, recognizes them as the same, and re-publishes in the user's aggregated feed a single item with the crosspost:source ID.
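The FriendFeed scenarios above reduce to one rule: propagate an existing crosspost:source, otherwise promote the crawled atom:id. A minimal sketch, assuming crawled entries are dicts with hypothetical keys `id`, `text`, and optional `crosspost_source`, and that `new_id` is a caller-supplied function that mints the aggregator's own entry IDs:

```python
def source_for_republish(crawled):
    """Return the crosspost:source ID to emit when re-publishing an entry."""
    # Propagate an existing source; otherwise the crawled atom:id becomes it.
    return crawled.get("crosspost_source") or crawled["id"]


def republish(crawled_entries, new_id):
    """Build the aggregated feed, emitting one item per source ID."""
    seen = {}
    for entry in crawled_entries:
        source = source_for_republish(entry)
        if source not in seen:  # later copies of the same action are dropped
            seen[source] = {
                "id": new_id(entry),
                "crosspost_source": source,
                "text": entry["text"],
            }
    return list(seen.values())
```

This sketch keeps the first copy it encounters; choosing the highest-fidelity copy instead is a separate policy decision.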

 

What about one activity which contains multiple things?

- The only use case we can think of where this would occur is where an aggregator automatically coalesces two distinct activities and then publishes the combined activity. (Remember that the crosspost:source element is only applied to automatically re-shared activities, not activities that are the result of explicit user actions.) We have decided not to address the coalesced activity use case now, because we consider coalescing to be a separate problem that is out of scope. One possible solution is to include multiple crosspost:source elements in the activity, but this is not currently permitted by the crosspost extension spec. For the time being, coalesced activities will be considered new activities, and they will not include any crosspost:source element.

 

Issues around who gets linked to when stuff gets coalesced.

- What happens when similar activities (like favoriting and rating a video) happen at nearby times for the same video? They would get different crosspost:source IDs, and it would be up to the aggregator to coalesce them.

 

What if I don't want to use Atom? Aren't more APIs becoming built on JSON instead?

- Twitter specifically raised this question as they see most of their API usage via JSON and not Atom.

- The Activity Streams working group is currently developing native (not some crazy programmatic transform) representations for their specifications in JSON. We expect crosspost:source to work in JSON in addition to Atom and RSS.

 

Why not use link rel=via?

- Could it be combined with the link rel 'alternate'? i.e. <link rel="alternate via" href="http://www.youtube.com/watch?v=bBBw9E2Q_aY" />

- Doesn't using a URL mean that we then need to understand canonical and duplicate URLs? An atom:id is not dependent on the URL(s) of the content.

 

Why not use rel=canonical (http://googlewebmastercentral.blogspot.com/2009/02/specify-your-canonical.html)?

- It is designed to say that two URLs are equivalent versus that a piece of content is being re-posted somewhere else.

 

Why not use atom:source (http://tools.ietf.org/html/rfc4287#section-4.2.11)?

- It seems like atom:source is only for cases where the entire entry is preserved identically.

 

For APIs like Facebook's stream.Publish (and similar on Twitter, etc), how do clients provide a crosspost:source ID value?

- Sites like Facebook, Twitter, and Digg could extend their APIs to expose a new mechanism for clients to provide a crosspost:source ID.

- Open question: How can Facebook (etc) verify the atom:id that is being passed in? Or do they even need to? What are the malicious or just "dumb user" or "dumb developer" issues if the wrong id is passed in? Just broken coalescing?

 

When calling such an API, how do clients indicate whether or not crosspost:source should be included in that service's outbound feed?

- Probably just need to always include crosspost:source if the client provided one.

- YouTube currently provides a message like "David favorited this awesome video" so checking for a blank message won't work. If they pushed in Activity Streams then these sorts of default messages should become less needed.
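A sketch of the "always include it if the client provided one" rule for a service's outbound feed, using Python's standard `xml.etree.ElementTree`. The function name and the crosspost namespace URI are assumptions made for illustration, not part of any real publishing API.

```python
import xml.etree.ElementTree as ET

ATOM_NS = "http://www.w3.org/2005/Atom"
# Placeholder namespace URI: an assumption for this sketch, not taken from the spec.
CROSSPOST_NS = "http://example.org/ns/crosspost"


def build_outbound_entry(entry_id, title, crosspost_source=None):
    """Build an Atom entry, adding crosspost:source only when one was given."""
    entry = ET.Element(f"{{{ATOM_NS}}}entry")
    ET.SubElement(entry, f"{{{ATOM_NS}}}id").text = entry_id
    ET.SubElement(entry, f"{{{ATOM_NS}}}title").text = title
    if crosspost_source is not None:
        # The client supplied a source ID when publishing, so pass it through.
        source = ET.SubElement(entry, f"{{{CROSSPOST_NS}}}source")
        ET.SubElement(source, f"{{{ATOM_NS}}}id").text = crosspost_source
    return entry
```

The service does not validate the supplied ID in this sketch, which leaves the verification question raised above open.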

Comments (9)

Kevin Marks said

at 12:37 am on Oct 12, 2009

"Doesn't using a URL mean that we then need to understand canonical and duplicate URLs? An atom:id is not dependant on the URL(s) of the content." - equivalent URLs are at least discoverable via 301 redirection; equivalent ids are necessarily opaque.
atom:source is for the aggregation case, and for replicating source attribution downstream; not just for exact duplication. rel="via" can be part of it - your friendfeed scenarios sound like the atom:source scenario

David Recordon said

at 12:40 am on Oct 12, 2009

But you shouldn't (and the services we're looking at don't) have a piece of content with multiple ids. Unlike how you can have multiple URLs for the same piece of content.

Kevin Marks said

at 1:47 am on Oct 12, 2009

if 2 systems republish the content it has multiple IDs. rel="canonical" is how to resolve multiple URLs within a domain if you don't do 301's right (hi Typepad); trickier to use across domains.

John Panzer said

at 10:15 am on Oct 12, 2009

(FYI, apparently there is a move afoot to be able to use rel="canonical" across domains.) That said, I think that crosspost: is a fine mechanism and I'm +1 on it; the only major issue is that it's not any type of 'official' spec yet that other 'official' specs can build on. OWF anyone?

Peter Reiser said

at 10:30 am on Oct 20, 2009

hmm and how do you know the author of the original object type ? As example if I favorite a Tweet on Facebook - the activity belongs to me but the original object belongs to the author of the tweet. .. or can we just add the Atom author tag within the crosspost tag ?

John Panzer said

at 10:48 am on Oct 20, 2009

The original intent of Atom entry ids was that they could be retained as entries were cross-syndicated to various places and used for de-duping and disambiguation, with atom:source used just to let the entry carry metadata about its "home feed". The crossposting use case fits completely within this use case, but as a practical matter, a lot of systems want/need to generate their own entry ids, so we need another way to indicate the 'original' id. The crosspost:source element fits this fine, but IMHO should only be required when the atom:id is different than the original -- otherwise it's completely redundant. This viewpoint also means that for cross-posts the author is just the original author.

The case of favoriting a tweet is different than a single author cross-posting something to multiple places; you're the author of the favoriting action, and the author of the object should be in the object of the activity.

Peter Reiser said

at 11:57 am on Oct 20, 2009

@John Panzer - make sense - thanks

Peter Reiser said

at 9:30 am on Nov 6, 2009

What element should be used for the timestamp when the cross-posted activity has been received by the aggregator? E.g., according to the Activity Streams spec, the <updated> date should be the date that the activity was updated, but currently some activity stream providers use it for the time they received the item and became aware of the activity.


John Panzer said

at 10:08 am on Nov 6, 2009

Re: Timestamps - The current AS spec is correct (in terms of compatibility with Atom and "the right thing to do") in saying that <updated> should be the timestamp the activity was created/updated, not the time it was received by some downstream store. The latter interpretation simplifies some implementations but is a major headache especially when dealing with cross-posted activities (which would each get different timestamps, be therefore sorted differently in different UIs, and essentially muck with the historical record of what was said when).
